I’ve always been intrigued by the Common Crawl News dataset. Since 2016, the non-profit organization Common Crawl has been collecting a specialized dataset of media articles, with new releases every day. While this dataset is much smaller than the full Common Crawl, which currently stands at 454 terabytes and contains around 3 billion web pages, it is still a potentially useful and underutilized resource, perhaps because of both its size and the work involved in extracting the news stories.
Below, you’ll find my code for downloading one day’s worth of data from the Common Crawl News dataset, which amounts to approximately 20 gigabytes. After running the code, I obtained 59,866 English-language articles for April 20th, 2024. The process was time-consuming, taking about eight hours, with most of that time spent extracting the text from the WARC (Web ARChive) files.
I attempted to optimize the code and speed up the extraction process, but most of those efforts were unsuccessful, likely because the parts I was trying to parallelize were not the primary bottlenecks. Note that a significant portion of the articles in the dataset are likely spam or press releases. Still, I believe there is enough valuable content to make it a worthwhile resource for those willing to invest the time and effort to process it.
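Before the full script, it helps to see how a day’s files are discovered: Common Crawl publishes one gzipped `warc.paths.gz` listing per month under `crawl-data/CC-NEWS/<year>/<month>/`, and a single day’s WARC files are found by filtering that listing on the `YYYYMMDD` date string. A minimal sketch of the listing URL construction (the same one the download code below uses):

```python
from datetime import datetime


def warc_listing_url(date):
    """Build the URL of the monthly warc.paths.gz listing for a given date.

    Common Crawl publishes one listing per month; individual WARC files
    inside it are named by date, so a day's files are selected later by
    filtering the listing on the YYYYMMDD string.
    """
    return (
        "https://data.commoncrawl.org/crawl-data/CC-NEWS/"
        f"{date.year}/{date.strftime('%m')}/warc.paths.gz"
    )


print(warc_listing_url(datetime(2024, 4, 20)))
# https://data.commoncrawl.org/crawl-data/CC-NEWS/2024/04/warc.paths.gz
```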
pip install warcio news-please
import gzip
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from urllib.parse import urlparse

import pandas as pd
import requests
from langdetect import detect
from lxml.etree import ParserError
from newsplease import EmptyResponseError, NewsPlease
from warcio.archiveiterator import ArchiveIterator


def download_file(url, destination_path):
    """Download a file from a URL to a local destination path using the requests library.

    Args:
    - url (str): URL of the file to download.
    - destination_path (str): Local path to save the downloaded file.
    """
    response = requests.get(url, stream=True)
    with open(destination_path, "wb") as f:
        shutil.copyfileobj(response.raw, f)
    print(f"Downloaded {os.path.basename(destination_path)}")


def download_warc_files(date, base_dir):
    """Downloads all WARC files for a specific date into a designated directory, ensuring no duplicates."""
    COMMON_CRAWL_CC_NEWS_PREFIX = "crawl-data/CC-NEWS"
    CC_DATA_ROOT = "https://data.commoncrawl.org/"

    date_folder = date.strftime("%Y-%m-%d")
    day_dir = os.path.join(base_dir, date_folder)
    os.makedirs(day_dir, exist_ok=True)

    warc_listing_url = (
        f"{CC_DATA_ROOT}{COMMON_CRAWL_CC_NEWS_PREFIX}/"
        f"{date.year}/{date.strftime('%m')}/warc.paths.gz"
    )
    with urllib.request.urlopen(warc_listing_url) as response:
        with gzip.open(response, "rb") as decompressed_file:
            return process_warc_listing(
                decompressed_file, CC_DATA_ROOT, day_dir, date.strftime("%Y%m%d")
            )


def process_warc_listing(decompressed_file, data_root, day_dir, date_str):
    """Process each line in the decompressed WARC listing to download files."""
    files_downloaded = []
    for line in decompressed_file:
        if date_str in str(line):
            file_url = data_root + str(line.strip(), "utf-8")
            destination_path = os.path.join(day_dir, os.path.basename(file_url))
            if not os.path.exists(destination_path):
                download_file(file_url, destination_path)
            else:
                print(f"File already exists: {os.path.basename(destination_path)}")
            files_downloaded.append(destination_path)
    return files_downloaded


def is_english(text):
    """Check if the text is English."""
    try:
        return detect(text) == "en"
    except Exception:
        return False


def file_exists(file_path):
    """Check if the file exists."""
    return os.path.exists(file_path)


def is_relevant_domain(url):
    """Check if the domain is relevant based on the specified criteria."""
    domain = urlparse(url).netloc
    return (
        (domain.endswith(".org") or domain.endswith(".com"))
        and "hindustan" not in url
        and "minga." not in url
    )


def is_english_and_has_text(article):
    """Check if the article is in English and has text."""
    return article and article.language == "en" and article.maintext


def process_record(record):
    """Process a single record, extracting the article if conditions are met."""
    if record.rec_type == "response" and "html" in record.http_headers.get_header(
        "Content-Type", ""
    ):
        url = record.rec_headers.get_header("WARC-Target-URI")
        if is_relevant_domain(url):
            try:
                article = NewsPlease.from_warc(record)
                if is_english_and_has_text(article):
                    print(article.title)
                    return article.get_serializable_dict()
            except (EmptyResponseError, ParserError):
                print("Blank!!!")
                return None
    return None


def extract_articles_from_warc(file_path):
    """Extracts articles from a WARC file, filtering for English language and .org/.com domains."""
    articles = []
    if file_exists(file_path):
        with open(file_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                article = process_record(record)
                if article:
                    articles.append(article)
    return articles


def save_articles_to_json(articles, output_file):
    """Saves a list of articles to a JSON file using pandas, which automatically handles datetime serialization."""
    # Convert the list of dictionaries (articles) to a DataFrame
    df = pd.DataFrame(articles)
    # Save the DataFrame to a JSON file
    df.to_json(output_file, orient="records", lines=True, date_format="iso")


def process_warc_file(warc_file):
    """Processes a single WARC file, extracts articles, and saves them to a JSON file.

    This function only runs if the corresponding JSON file does not already exist.
    """
    json_filename = os.path.splitext(warc_file)[0] + ".json"

    # Check if the JSON file already exists
    if os.path.exists(json_filename):
        return f"JSON file already exists. Skipped processing for: {json_filename}"

    # Extract articles from the WARC file if JSON does not exist
    articles = extract_articles_from_warc(warc_file)
    if articles:  # Ensure there are articles to write
        save_articles_to_json(articles, json_filename)
        return f"Saved articles to {json_filename}"
    return f"No articles found in {warc_file}. Nothing was saved."


def process_date(date_str, base_dir, num_workers=None):
    """Processes all WARC files for a given date using threads to speed up IO-bound tasks.

    Args:
    - date_str (str): The date string in 'YYYY-MM-DD' format.
    - base_dir (str): The base directory where WARC files are downloaded and processed.
    - num_workers (int): The number of worker threads to use; defaults to the number of CPUs.
    """
    date = datetime.strptime(date_str, "%Y-%m-%d")
    warc_files = download_warc_files(date, base_dir)
    num_workers = num_workers or os.cpu_count()

    # Use ThreadPoolExecutor to process files in parallel
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Map the process_warc_file function to each WARC file
        futures = {
            executor.submit(process_warc_file, warc_file): warc_file
            for warc_file in warc_files
        }
        # Report each file's status message as it finishes
        for future in as_completed(futures):
            print(future.result())
# Example usage
process_date("2024-04-20", "commoncrawl-data/", num_workers=2)
from glob import glob

df_names = glob("commoncrawl-data/2024-04-20/*.json")
dfs = [pd.read_json(df_name, lines=True) for df_name in df_names]
df = pd.concat(dfs)
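With everything concatenated into one DataFrame, a cheap first cleaning pass is deduplicating on the article URL, since the same story is often crawled more than once. This is a sketch, not part of the pipeline above; it assumes a `url` column, which news-please includes in its serialized output, and you may prefer to dedupe on `title` or `maintext` instead.

```python
import pandas as pd


def dedupe_articles(df):
    """Drop duplicate articles, keeping the first copy of each URL."""
    return df.drop_duplicates(subset="url").reset_index(drop=True)


# Tiny illustration with hypothetical data
sample = pd.DataFrame(
    {
        "url": ["https://a.com/1", "https://a.com/1", "https://b.com/2"],
        "title": ["First story", "First story", "Second story"],
    }
)
print(len(dedupe_articles(sample)))  # 2
```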