From Articles to Events, Part III

Categorizing texts with an LLM
Categories: OpenAI, Articles to Events
Published: March 11, 2024

This is one post in a series where I’m working to expand the working paper “Extracting protest events from newspaper articles with ChatGPT” I wrote with Andy Andrews and Rashawn Ray. In that paper, we tested whether ChatGPT could replace my undergraduate RAs in extracting details about Black Lives Matter protests from media accounts. This time, I want to expand it to include more articles, movements, and variables.

Earlier Installments

* Part 1: From Articles to Events
* Part 2: Extracting text from media HTML files

In this part, I want to check whether each file I downloaded actually contains a media account of a protest that has already happened. I'm hoping to filter out a few types of bad texts: files where I downloaded "You can't see this page without paying." instead of the article; articles about future events; and pages that are organizational listings of events rather than media accounts. The plan is to use ChatGPT to categorize the articles.

import json
from concurrent.futures import ThreadPoolExecutor, as_completed

from pydantic import BaseModel, Field
from openai import OpenAI
import pandas as pd

Load the article dataset made in Part 2.

df = pd.read_json('https://github.com/nealcaren/notes/raw/main/posts/from-articles-to-events/article_texts.json')
df.sample(3)
Row 2127
  title: New Braunfels plans annual march, announces cl...
  text: State Alabama Alaska Arizona Arkansas Californ...
  url: https://herald-zeitung.com/community_alert/new...
  authors: [Hannah Thompson The Herald-Zeitung, Hannah Th...
  date: None
  description: In observance of Martin Luther King Jr. Day on...
  site: New Braunfels Herald-Zeitung
  publisher: {}
  file_location: _HTML/https-herald-zeitung-com-community-alert...

Row 1651
  title: Trump Supporters Plan Protest Outside Fulton C...
  text: Laura Loomer, a staunch supporter of Donald Tr...
  url: https://www.newsweek.com/trump-supporters-plan...
  authors: [Nick Mordowanec, Aron Solomon, Dan Perry, Pau...
  date: 2023-08-22T21:53:41.000Z
  description: "The American people recognize that this is a ...
  site: Newsweek
  publisher: {}
  file_location: _HTML/https-www-newsweek-com-trump-supporters-...

Row 621
  title: Protest held to terminate auditor accused of r...
  text: HOWARD COUNTY, Md. — Members from different or...
  url: https://www.wmar2news.com/local/protest-held-t...
  authors: [Ashley Mcdowell]
  date: 2023-03-07T03:19:29.618
  description: Members from different organizations in Howard...
  site: WMAR 2 News Baltimore
  publisher: {}
  file_location: _HTML/https-www-wmar2news-com-local-protest-he...

Rather than ask a single question, "Is this a media article that describes a protest event that has already happened or is ongoing?", I decided to ask three separate questions. I think breaking the question down into its component parts will lead to more accurate answers.

To tell the model what the questions are and how to format the response (True/False), I use the pydantic library to define the structure of the output.

class ArticleReview(BaseModel):
    discusses_political_protest: bool = Field(
        ...,
        description="Indicates whether the article discusses a political protest. True if it does, False otherwise."
    )
    from_media_source: bool = Field(
        ...,
        description="Determines if the article is from a media source. Respond False if it is a press release or event listing."
    )
    protest_event_future: bool = Field(
        ...,
        description="Is the event planned for the future? Respond False if the event occurred or is currently happening."
    )
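
For reference, ArticleReview.model_json_schema() converts this class into the JSON Schema that gets passed to the API below. With pydantic v2, the output looks roughly like this (abbreviated; exact key order and generated titles may differ):

ArticleReview.model_json_schema()
{'properties': {'discusses_political_protest': {'description': 'Indicates whether the article discusses a political protest. True if it does, False otherwise.',
   'title': 'Discusses Political Protest',
   'type': 'boolean'},
  ...},
 'required': ['discusses_political_protest',
  'from_media_source',
  'protest_event_future'],
 'title': 'ArticleReview',
 'type': 'object'}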

Next, a Python function to call the OpenAI ChatGPT model.

def is_protest(article_info, client):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant that extracts summaries of newspaper articles about political protests as JSON for a database. ",
        },
        {
            "role": "user",
            "content": f"""Extract information about the details about a protest from the following article.
      Only use information from the article.

      {article_info}
      
      """,
        },
    ]
    
    # Use function calling so the response conforms to the ArticleReview schema
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        functions=[
            {
                "name": "protest_details",
                "description": "Extract insights from media article about protest.",
                "parameters": ArticleReview.model_json_schema(),
            }
        ],
        n=1,
        messages=messages,
    )
    
    # Parse the function-call arguments (a JSON string) into a dict
    r = json.loads(completion.choices[0].message.function_call.arguments)

    return r
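
A side note: OpenAI has since deprecated the functions argument in favor of tools. With a newer client, the equivalent call would look roughly like this (same schema, different wrapping):

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "protest_details",
                "description": "Extract insights from media article about protest.",
                "parameters": ArticleReview.model_json_schema(),
            },
        }
    ],
    # Force the model to call this particular function
    tool_choice={"type": "function", "function": {"name": "protest_details"}},
    messages=messages,
)
r = json.loads(completion.choices[0].message.tool_calls[0].function.arguments)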

Normally, I create the client inside the function, but this time I wanted to pass it in so it is only created once. That should speed things up a tiny bit. Plus, down the road I was thinking about using non-OpenAI models from Anyscale, and this might make that easier to add on.

client = OpenAI(
    max_retries=3,
    timeout=20.0,
)
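
Since the OpenAI client accepts a base_url argument, swapping in an OpenAI-compatible provider later should only require constructing a different client. A sketch of what that might look like; the Anyscale endpoint URL and environment variable name here are assumptions, not tested values:

import os

# Hypothetical client for Anyscale Endpoints (OpenAI-compatible API).
# The base_url and ANYSCALE_API_KEY variable name are assumptions.
anyscale_client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",
    api_key=os.environ["ANYSCALE_API_KEY"],
    max_retries=3,
    timeout=20.0,
)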

Grab a sample article, including only the fields I want to process. I also truncate the text length to save $.

useful = ['title', 'text', 'date', 'site']
a = df[useful].sample().to_dict(orient='records')[0]
a['text'] = a['text'][:2000]
a
{'title': 'Faces of protest: Thursday at the Indiana Statehouse',
 'text': "We are The Statehouse File\n\nFrom an office in the Press Corps of the Indiana Statehouse, the journalism majors of Franklin College's Pulliam School of Journalism work alongside the pros, digging into the behind-the-scenes stories of Indiana politics. We're a student newsroom, but our work doesn't sit on a professor's desk. We create daily content for this website and professional media outlets around the state.\n\nUSE OUR CONTENT FOR FREE: Thanks to a $180,000 grant from Lumina, TheStatehouseFile.com has taken down its reader paywall and is offering its year-round coverage of the Indiana Statehouse to professional media outlets to republish for free. Just retain the author's and The Statehouse File's name, helping our young journalists on their way.",
 'date': None,
 'site': 'The Statehouse File'}

Trying it out! I ran these surrounding cells a couple of times to make sure it works for different kinds of articles. Results seem good. A prior version asked about protest_event_past, but that seemed to miss a few, so I flipped the question in this version to ask only whether the event is in the future.

is_protest(a, client)
{'discusses_political_protest': False,
 'from_media_source': True,
 'protest_event_future': True}

Now, I want to apply the function to the whole dataframe. I had ChatGPT help me out here because I wanted a function that would (1) make multiple calls to the API at the same time to speed things up, and (2) be able to pick up where I left off in case I closed my laptop during the process. After some negotiation, the solution we agreed on creates a new column, api_called, that starts as False and is switched to True once the API call for that row has succeeded.

# Function to be executed by each thread
def process_row(index, row):
    useful = ['title', 'date', 'site', 'text_truncated']
    if not row['api_called']:
        # Classify using only the fields the model needs
        result = is_protest(row[useful], client)
        return (index, result)
    return (index, None)

# Function to execute API calls in parallel and update DataFrame
def update_dataframe(df):
    # Select rows where API call hasn't been made
    rows_to_process = df[~df['api_called']].copy()
    
    # Use ThreadPoolExecutor to parallelize API calls
    with ThreadPoolExecutor(max_workers=5) as executor:
        # Submit tasks
        futures = {executor.submit(process_row, index, row): index for index, row in rows_to_process.iterrows()}
        
        for future in as_completed(futures):
            index, api_result = future.result()
            if api_result is not None:
                # Update DataFrame with the result
                for key, value in api_result.items():
                    if key not in df.columns:
                        df[key] = pd.NA  # Initialize new column with missing values
                    df.at[index, key] = value

                # Mark the row as processed
                df.at[index, 'api_called'] = True

# Truncate 'text' to 2000 characters and initialize 'api_called' column
df['text_truncated'] = df['text'].str[:2000]
df['api_called'] = False

# Update the DataFrame with API call results
update_dataframe(df)
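
One caveat the code above leaves implicit: the api_called flags only live in memory, so to actually pick up where I left off after closing the laptop, the dataframe has to be written to disk along the way. A minimal sketch, with a checkpoint filename of my own choosing:

# Persist progress; on restart, reload this file instead of
# re-initializing api_called, and update_dataframe will skip
# the rows already processed.
# 'classification_checkpoint.json' is a hypothetical filename.
df.to_json('classification_checkpoint.json', orient='records')
# df = pd.read_json('classification_checkpoint.json')  # after a restart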

This took 9 minutes to process, which works out to about 0.2 seconds per record (540 seconds spread across the roughly 2,700 rows in the dataframe). Using ThreadPoolExecutor allowed me to make 5 API calls at a time, so if I hadn't used it, the process would have taken about 45 minutes.

Finally, create a new variable for the articles that match all the conditions that I want, and save those as a new JSON.

df['all_conditions_met'] = (
    (df['discusses_political_protest'] == True) & 
    (df['from_media_source'] == True) & 
    (df['protest_event_future'] == False)
)
df['all_conditions_met'].mean()
0.4548115623856568
df['all_conditions_met'].sum()
1243
screen = df['all_conditions_met'] == True
df[screen].to_json('protest_articles.json', orient='records')
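
As a quick sanity check, reloading the file should give back the same 1,243 rows counted above:

len(pd.read_json('protest_articles.json'))
1243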

The data prep is done! I have the full text of 1,243 articles that are already coded by CCC for use in testing the accuracy of LLMs for my use case.