This is one post in a series where I’m working to expand the working paper “Extracting protest events from newspaper articles with ChatGPT” I wrote with Andy Andrews and Rashawn Ray. In that paper, we tested whether ChatGPT could replace my undergraduate RAs in extracting details about Black Lives Matter protests from media accounts. This time, I want to expand it to include more articles, movements, and variables.
In this part, I’m taking the downloaded HTML files and extracting the useful information, such as the article headline and text. Rather than build custom parsers for each webpage, I’m going to use the wonderful Newspaper3k library to extract the relevant information from each article. It works on almost every media site, which makes it an easy way to turn raw HTML into structured, usable fields.
pip install newspaper3k
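For reference, this is newspaper’s standard workflow when you point it at a live URL (the address below is just a placeholder). Since I already downloaded all the HTML in the previous step, I’ll skip the network request and feed the library the saved files instead.

from newspaper import Article

url = "https://example.com/some-news-story"  # placeholder URL
article = Article(url)
article.download()  # fetch the page over the network
article.parse()     # extract headline, byline, date, and body text

print(article.title)
print(article.publish_date)
print(article.text[:300])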
from newspaper import Article
import pandas as pd
import os
from slugify import slugify
I’ve used a variant of the function below for more than five years. The documentation for newspaper is a little vague on how to use the library when you’ve already downloaded the HTML, but this version works. There can be more relevant information in the meta_data field, but it varies by paper and over time, since those are primarily fields and values meant to improve the page’s Google ranking.
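Here is a minimal sketch of the idea: read the saved HTML from disk and hand it to newspaper through download(input_html=...), so the library parses the local file instead of fetching anything. The fields returned here are illustrative; trim or extend them as needed.

from newspaper import Article

def get_article_info(file_path):
    """Parse a saved HTML file with newspaper and return the key article fields."""
    with open(file_path, encoding="utf-8") as infile:
        html = infile.read()

    article = Article(url="")          # no URL needed; we supply the HTML ourselves
    article.download(input_html=html)  # skips the network request entirely
    article.parse()

    return {
        "title": article.title,
        "authors": article.authors,
        "date": article.publish_date,
        "text": article.text,
        "meta_data": dict(article.meta_data),  # site-specific SEO fields; contents vary
    }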
Applying the get_article_info function to a single file is pretty straightforward, but it takes a few seconds, so, ideally, you only do it once per file. Things get more complicated when you’ve already processed a few articles and then add new HTML files to the folder. The set of functions below creates a dataframe to store the results, or loads one if it already exists, and then processes each of the files that we want and that we haven’t already processed.
import os
import pandas as pd


def load_existing_data(json_file):
    """Load existing JSON data into a DataFrame."""
    try:
        return pd.read_json(json_file)
    except (ValueError, FileNotFoundError):
        return pd.DataFrame()


def is_file_processed(df, file_path):
    """Check if a file has been processed."""
    if 'file_location' in df.columns:
        return df['file_location'].isin([file_path]).any()
    else:
        return False


def update_dataframe(df, file_path):
    """Update the DataFrame with new article information."""
    article_info = get_article_info(file_path)  # Assuming this function exists and works as expected
    article_info['file_location'] = file_path

    # Check if the DataFrame is empty and initialize columns if necessary
    if df.empty:
        for key in article_info.keys():
            df[key] = pd.Series(dtype='object')

    # Add the new article information as a new row
    new_row_index = len(df)
    df.loc[new_row_index] = article_info
    return df


def save_to_json(df, json_file):
    """Save the DataFrame to a JSON file."""
    df.to_json(json_file, orient="records", date_format="iso")


def process_files(folder_path, json_file, sources_json):
    df = load_existing_data(json_file)

    # Filter the list of files to process based on 'source_1' column in sources_json
    sources_df = pd.read_json(sources_json)
    urls_to_process = sources_df['source_1'].tolist()
    files_to_process = [slugify(url) + ".html" for url in urls_to_process]

    for file in os.listdir(folder_path):
        if file in files_to_process:  # Check if the file should be processed
            file_path = os.path.join(folder_path, file)
            if not is_file_processed(df, file_path):
                try:
                    df = update_dataframe(df, file_path)
                except Exception as e:  # It's a good practice to catch specific exceptions
                    print(f"Error processing file {file_path}: {e}")

    save_to_json(df, json_file)
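With those helpers defined, a single call runs the whole thing. The folder and file names below are placeholders; swap in whatever paths you used when downloading the articles.

process_files(
    folder_path="html",           # folder of saved article HTML files (placeholder)
    json_file="articles.json",    # where the parsed results accumulate (placeholder)
    sources_json="sources.json",  # file with a 'source_1' column of article URLs (placeholder)
)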