From Articles to Events, Part II

Extracting text from media HTML files
newspaper3k
Articles to Events
Published February 29, 2024

This is one post in a series where I’m working to expand the working paper “Extracting protest events from newspaper articles with ChatGPT” I wrote with Andy Andrews and Rashawn Ray. In that paper, we tested whether ChatGPT could replace my undergraduate RAs in extracting details about Black Lives Matter protests from media accounts. This time, I want to expand it to include more articles, movements, and variables.

Earlier Installments
* Part 1: From Articles to Events

In this part, I’m taking the downloaded HTML files and extracting the useful information, such as the article headline and text. Rather than build custom parsers for each site, I’m going to use the wonderful Newspaper3k library to extract the relevant information from each article. It works on almost every media site, which makes it a great tool for turning raw HTML into structured data.

pip install newspaper3k
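Before the custom function, it’s worth seeing Newspaper3k’s standard workflow, where the library fetches the page itself. A minimal sketch (the URL is just a placeholder):

from newspaper import Article

article = Article('https://www.example.com/some-article')  # placeholder URL
article.download()  # fetch the HTML over the network
article.parse()     # extract the title, text, authors, date, and so on
print(article.title)

Since I’ve already downloaded every page, I’ll skip the download() step and feed the saved HTML in directly.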
from newspaper import Article
import pandas as pd
import os
from slugify import slugify
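(The slugify function comes from the python-slugify package, a separate pip install python-slugify; it turns URLs into file-name-safe strings.)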

I’ve used a variant of the function below for more than five years. The newspaper documentation is a little vague about using the library when you’ve already downloaded the HTML, but this version works. There can be additional relevant information in the meta_data field, but it varies by paper and over time, since these are primarily fields and values meant to improve the page’s Google ranking.

def get_article_info(file_location):
    """Parse a saved HTML file with Newspaper3k and return the article details."""
    with open(file_location, 'rb') as fh:
        html = fh.read()

    # Newspaper3k expects a URL, but handing it the saved HTML with
    # set_html() lets us skip the download step entirely.
    article = Article(url=file_location)
    article.set_html(html)
    article.parse()

    article_details = {'title'       : article.title,
                       'text'        : article.text,
                       'url'         : article.meta_data['og'].get('url', article.url),
                       'authors'     : article.authors,
                       'date'        : article.publish_date,
                       'description' : article.meta_description,
                       'site'        : article.meta_data['og'].get('site_name', ''),
                       'publisher'   : article.meta_data['publisher']}

    return article_details
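
Here’s what calling it on a single file looks like (the file name is a made-up example of the slugified names in the _HTML folder):

info = get_article_info('_HTML/https-www-example-com-some-article.html')
print(info['title'])
print(info['date'])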

Applying the get_article_info function to a single file is pretty straightforward, but it takes a few seconds, so, ideally, you only do it once per file. Things get more complicated when you’ve already processed some articles and then add new HTML files to the folder. The set of functions below creates a dataframe to store the results, or loads one if it already exists, and then processes only the files we want that haven’t been processed yet.


def load_existing_data(json_file):
    """Load existing JSON data into a DataFrame."""
    try:
        return pd.read_json(json_file)
    except (ValueError, FileNotFoundError):
        return pd.DataFrame()

def is_file_processed(df, file_path):
    """Check if a file has been processed."""
    if 'file_location' in df.columns:
        return df['file_location'].isin([file_path]).any()
    else:
        return False

def update_dataframe(df, file_path):
    """Update the DataFrame with new article information."""
    article_info = get_article_info(file_path)  # Extract the article details with the function defined above
    article_info['file_location'] = file_path
    
    # Check if the DataFrame is empty and initialize columns if necessary
    if df.empty:
        for key in article_info.keys():
            df[key] = pd.Series(dtype='object')
    
    # Add the new article information as a new row
    new_row_index = len(df)
    df.loc[new_row_index] = article_info
    
    return df
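
Appending rows one at a time with .loc works fine at this scale, but pandas copies data as the frame grows. For a much larger corpus, collecting the results in a list and building the dataframe once at the end would be faster. A minimal sketch of that alternative, where file_paths is a hypothetical list of HTML files:

rows = []
for file_path in file_paths:  # hypothetical list of files to process
    row = get_article_info(file_path)
    row['file_location'] = file_path
    rows.append(row)
df = pd.DataFrame(rows)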

def save_to_json(df, json_file):
    """Save the DataFrame to a JSON file."""
    df.to_json(json_file, orient="records", date_format="iso")
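
Using orient="records" writes one JSON object per article, and date_format="iso" stores the publication dates as readable ISO strings, so the file round-trips cleanly through pd.read_json.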

def process_files(folder_path, json_file, sources_json):
    df = load_existing_data(json_file)
    
    # Filter the list of files to process based on 'source_1' column in sources_json
    sources_df = pd.read_json(sources_json)
    urls_to_process = sources_df['source_1'].tolist()
    files_to_process = [slugify(url) + ".html" for url in urls_to_process]
            
    for file in os.listdir(folder_path):
        if file in files_to_process:  # Check if the file should be processed
            file_path = os.path.join(folder_path, file)
            
            if not is_file_processed(df, file_path):
                try:
                    df = update_dataframe(df, file_path)
                except Exception as e:  # Keep going if a single file fails to parse
                    print(f"Error processing file {file_path}: {e}")
    
    save_to_json(df, json_file)
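
The matching works because each HTML file was saved under a slugified version of its source URL. For example, with a made-up URL:

from slugify import slugify

url = 'https://www.example.com/news/some-protest-story'
print(slugify(url) + '.html')
# https-www-example-com-news-some-protest-story.html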
process_files('_HTML', 
              'article_texts.json', 
              'ccc_sample.json')
text_df = pd.read_json('article_texts.json')
print(len(text_df))
text_df.sample(3)
2733
|      | title | text | url | authors | date | description | site | publisher | file_location |
|------|-------|------|-----|---------|------|-------------|------|-----------|---------------|
| 1338 | Rally urges Montana leaders to extend meal ass... | Earlier this year, Montana leaders announced t... | https://www.ktvh.com/news/rally-urges-montana-... | [Jonathon Ambarian] | 2023-07-11T01:48:51.124 | Earlier this year, Montana leaders announced t... | KTVH | {} | _HTML/https-www-ktvh-com-news-rally-urges-mont... |
| 306 | NYC students protest gun violence after incide... | While the nation reels from tragedy this week ... | https://www.nydailynews.com/2023/03/30/nyc-stu... | [Cayla Bamberger] | 2023-03-30T19:33:59.000Z | While the nation reels from tragedy this week ... | New York Daily News | {} | _HTML/https-www-nydailynews-com-new-york-educa... |
| 717 | Mission Hospital nurses to hold rally today fo... | Press release from National Nurses United:\n\n... | http://mountainx.com/blogwire/mission-hospital... | [] | None | Asheville and Western North Carolina News \| Lo... | Mountain Xpress | {} | _HTML/https-mountainx-com-blogwire-mission-hos... |

Pretty good. Next up: Are they actually media accounts?