Web Scraping in Bulk

Downloading websites in parallel with pyppeteer
Web Scraping
Published February 19, 2024

This script is one way to download multiple web pages at the same time. It’s useful when you have many URLs from different websites you want to save: instead of visiting each one individually, the script visits several simultaneously and saves what it finds. Rather than requesting each page through a Python HTTP library, it uses a “headless” web browser, which is much more likely to get you the actual content you want.
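To see what the headless browser is doing before any parallelism gets added, here is a minimal single-page sketch of the same idea (the URL is just a placeholder):

import asyncio
from pyppeteer import launch

async def grab_one(url):
    # Launch headless Chrome, render the page, and return its HTML
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url, {"waitUntil": "networkidle0"})
    html = await page.content()
    await browser.close()
    return html

html = asyncio.run(grab_one("https://example.com"))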

This approach works best when the URLs come from different servers. You will eventually get locked out if you hit the same web server several times in the same second, but there’s no reason not to visit five different websites at once. I don’t know anything about parallel processing with the asyncio library, which is the only way to parallelize pyppeteer, so the script was mostly written by ChatGPT, but I’ve used it successfully a few times.

pip install pyppeteer python-slugify nest_asyncio chromedriver-py
import os
import asyncio
import nest_asyncio
from random import shuffle

from slugify import slugify
from pyppeteer import launch
from pyppeteer.errors import NetworkError

import pandas as pd


# Jupyter already runs an event loop; nest_asyncio lets asyncio.run()
# work inside it.
nest_asyncio.apply()

The section below loads the wonderful protest event data set created by the Crowd Counting Consortium. Each protest event is linked to one or more media accounts, with the URLs stored in the source_ fields. Using just the 2024 events, I combine the URL fields, remove the social media pages and duplicates, and finally draw a random sample of 100 articles.

df = pd.read_csv(
    "https://github.com/nonviolent-action-lab/crowd-counting-consortium/raw/master/ccc_compiled_2021-present.csv",
    encoding="latin",
    low_memory=False,
)

# Limit to just 2024
df = df[df["date"].str.contains("2024")]

# grab the sources
urls = (
    list(df["source_1"].astype(str).values)
    + list(df["source_2"].astype(str).values)
    + list(df["source_3"].astype(str).values)
    + list(df["source_4"].astype(str).values)
)

# keep only real URLs and eliminate social media links
# (note: Bluesky links live on bsky.app, not bsky.com)
social_domains = ["twitter.com", "youtube.com", "facebook.com", "instagram.com", "tiktok.com", "bsky.app"]
urls = [u for u in urls if "http" in u and not any(d in u for d in social_domains)]

urls = list(set(urls))
print(len(urls))
shuffle(urls)
urls = urls[:100]

The function below uses asynchronous programming to download and save the HTML content of web pages from a list of URLs. It uses a headless Chrome browser, controlled via the pyppeteer library, to render each page just as it would appear to a human visitor. This approach is particularly useful for capturing dynamically generated content, which traditional HTTP requests might miss.

Key components of the script include:

- File naming: each URL is slugified into a safe filename, and pages that have already been saved are skipped, so the script can be re-run without re-downloading anything.
- User agent: every request identifies itself with a desktop Safari user agent string.
- Timeout: each page load is cancelled after 30 seconds via asyncio.wait_for.
- Error tracking: URLs that fail or time out are added to bad_urls and reported at the end.

# Ensure the HTML directory exists
html_dir = "HTML"
os.makedirs(html_dir, exist_ok=True)

# User agent to be used for all requests
ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15"
bad_urls = []


async def fetch(page, url, timeout=30):
    # Slugify the URL to create a valid filename
    filename = slugify(url) + ".html"
    file_path = os.path.join(html_dir, filename)

    if os.path.isfile(file_path):
        # print(f"File {file_path} already exists, skipping download.")
        return

    if url in bad_urls:
        print(f"Skipping bad URL: {url}")
        return

    try:
        # Set the user agent for the page
        await page.setUserAgent(ua)

        # Navigate to the page with a timeout
        response = await asyncio.wait_for(
            page.goto(url, {"waitUntil": "networkidle0"}), timeout
        )

        # Check if the page was successfully retrieved
        if response and response.ok:
            content = await page.content()
            # Save the content to a file in the 'HTML' directory
            with open(file_path, "w", encoding="utf-8") as file:
                file.write(content)
            print(f"Content from {url} has been saved to {file_path}")
        else:
            print(f"Failed to retrieve {url}")
            bad_urls.append(url)
    except asyncio.TimeoutError:
        print(f"Fetching {url} took too long and was cancelled.")
        bad_urls.append(url)
    except Exception as e:
        print(f"An error occurred while fetching {url}: {e}")
        bad_urls.append(url)

This next section actually does the downloading. It uses an asynchronous, queue-based approach to distribute the URLs across multiple browser pages for parallel processing. This significantly improves efficiency by keeping each browser page continuously busy, with no idle time spent waiting for other pages to finish their tasks.

Key components and functionalities:

- A shared asyncio.Queue holds the URLs; five browser tabs are opened once and reused for the whole run.
- Each tab runs process_url, pulling the next URL from the queue as soon as it finishes the previous one.
- Once the queue is empty, the tabs and the browser are closed, and any failed URLs are printed.

async def process_url(page, url_queue):
    while not url_queue.empty():
        url = await url_queue.get()
        await fetch(page, url)  # The fetch function defined above
        url_queue.task_done()

async def main():
    browser = await launch()
    pages = [await browser.newPage() for _ in range(5)]  # Initialize pages once

    # Create a queue of URLs
    url_queue = asyncio.Queue()
    for url in urls:
        await url_queue.put(url)

    # Create a task for each page to process URLs from the queue
    tasks = [asyncio.create_task(process_url(page, url_queue)) for page in pages]

    # Wait for all tasks to complete
    await asyncio.gather(*tasks)

    # Close pages and browser after all operations are complete
    for page in pages:
        await page.close()
    await browser.close()

    if bad_urls:
        print("The following URLs had issues and were not downloaded:")
        print("\n".join(bad_urls))

asyncio.run(main())
Failed to retrieve https://www.wgmd.com/pro-palestinian-protesters-deface-veterans-cemetery-in-los-angeles-spray-paint-free-gaza/
Content from https://www.fox61.com/article/news/local/hartford-county/west-hartford/west-hartford-vandalism-under-investigation-police/520-7b65ab7b-d93b-42f9-8ca2-0ed9d5b7689a has been saved to HTML/https-www-fox61-com-article-news-local-hartford-county-west-hartford-west-hartford-vandalism-under-investigation-police-520-7b65ab7b-d93b-42f9-8ca2-0ed9d5b7689a.html
Fetching https://www.nbcnews.com/politics/donald-trump/trump-confuses-nikki-haley-pelosi-talking-jan-6-rcna134863 took too long and was cancelled.
Fetching https://www.purdueexponent.org/campus/article_78be7d6e-c2bb-11ee-a25c-a3e2dff21694.html took too long and was cancelled.
Fetching https://13wham.com/news/local/local-advocates-rally-in-downtown-rochester-on-51st-anniversary-of-roe-v-wade took too long and was cancelled.
Fetching https://www.wvtm13.com/article/protests-kenneth-smith-execution-untied-nations-montgomery/46496998 took too long and was cancelled.
Content from https://www.wwnytv.com/2024/01/20/congresswoman-stefanik-speaks-new-hampshire-support-trump/ has been saved to HTML/https-www-wwnytv-com-2024-01-20-congresswoman-stefanik-speaks-new-hampshire-support-trump.html
Fetching https://nypost.com/2024/01/21/news/scream-actress-melissa-barrera-joins-disruptive-anti-israel-rally-at-sundance/ took too long and was cancelled.
Fetching https://www.wlky.com/article/nonprofits-rally-frankfort-legislation-kentucky/46676604 took too long and was cancelled.
Fetching https://www.courier-journal.com/story/news/politics/2024/02/08/kentucky-employees-retirement-system-participants-rally-for-13th-check/72527656007/ took too long and was cancelled.
Fetching https://www.northjersey.com/story/news/2024/02/06/israel-hamas-war-day-of-action-for-palestine-nj-students-march/72478632007/ took too long and was cancelled.
Fetching https://www.washingtonpost.com/dc-md-va/2024/01/15/virginia-assembly-gun-rights-rally/ took too long and was cancelled.
Fetching https://dailybruin.com/2024/01/19/uc-divest-coalition-at-ucla-leads-hands-off-yemen-protest-on-campus took too long and was cancelled.
Fetching https://www.washingtonpost.com/dc-md-va/2024/01/18/dc-march-for-life-rally-abortion/ took too long and was cancelled.
Fetching https://www.fox5dc.com/news/dc-activists-plan-protest-against-capitals-wizards-move-to-virginia took too long and was cancelled.
Fetching https://www.latimes.com/entertainment-arts/movies/story/2024-01-21/pro-palestinian-protestors-vie-for-hollywoods-attention-at-2024-sundance-film-festival took too long and was cancelled.
Failed to retrieve https://secure.everyaction.com/RKr139EpKUCZg8_TIWA18A2
Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Target.detachFromTarget): No session with given id')>
pyppeteer.errors.NetworkError: Protocol error (Target.detachFromTarget): No session with given id
Fetching https://www.thetimestribune.com/news/dozens-rally-in-support-of-school-choice-amendment/article_6da6faa2-bbbb-11ee-9f2a-8354d0bdfbfd.html took too long and was cancelled.
Fetching https://www.nbcnews.com/news/latino/convoy-rally-texas-mexico-border-attracts-trump-fans-decry-illegal-imm-rcna136967 took too long and was cancelled.
Fetching https://nyunews.com/news/2024/01/26/pro-palestinian-bobst-poetry/ took too long and was cancelled.
Fetching https://www.wjhl.com/news/local/kyle-rittenhouse-event-draws-supporters-protesters-at-etsu/ took too long and was cancelled.
Fetching https://newjersey.news12.com/group-gathers-ahead-of-toms-river-council-meeting-to-protest-policeemt-funding-decision took too long and was cancelled.
Fetching https://www.denverpost.com/2024/01/02/alamo-drafthouse-employees-union-drive-rally-denver/ took too long and was cancelled.
The following URLs had issues and were not downloaded:
https://www.wgmd.com/pro-palestinian-protesters-deface-veterans-cemetery-in-los-angeles-spray-paint-free-gaza/
https://www.nbcnews.com/politics/donald-trump/trump-confuses-nikki-haley-pelosi-talking-jan-6-rcna134863
https://www.purdueexponent.org/campus/article_78be7d6e-c2bb-11ee-a25c-a3e2dff21694.html
https://13wham.com/news/local/local-advocates-rally-in-downtown-rochester-on-51st-anniversary-of-roe-v-wade
https://www.wvtm13.com/article/protests-kenneth-smith-execution-untied-nations-montgomery/46496998
https://nypost.com/2024/01/21/news/scream-actress-melissa-barrera-joins-disruptive-anti-israel-rally-at-sundance/
https://www.wlky.com/article/nonprofits-rally-frankfort-legislation-kentucky/46676604
https://www.courier-journal.com/story/news/politics/2024/02/08/kentucky-employees-retirement-system-participants-rally-for-13th-check/72527656007/
https://www.northjersey.com/story/news/2024/02/06/israel-hamas-war-day-of-action-for-palestine-nj-students-march/72478632007/
https://www.washingtonpost.com/dc-md-va/2024/01/15/virginia-assembly-gun-rights-rally/
https://dailybruin.com/2024/01/19/uc-divest-coalition-at-ucla-leads-hands-off-yemen-protest-on-campus
https://www.washingtonpost.com/dc-md-va/2024/01/18/dc-march-for-life-rally-abortion/
https://www.fox5dc.com/news/dc-activists-plan-protest-against-capitals-wizards-move-to-virginia
https://www.latimes.com/entertainment-arts/movies/story/2024-01-21/pro-palestinian-protestors-vie-for-hollywoods-attention-at-2024-sundance-film-festival
https://secure.everyaction.com/RKr139EpKUCZg8_TIWA18A2
https://www.thetimestribune.com/news/dozens-rally-in-support-of-school-choice-amendment/article_6da6faa2-bbbb-11ee-9f2a-8354d0bdfbfd.html
https://www.nbcnews.com/news/latino/convoy-rally-texas-mexico-border-attracts-trump-fans-decry-illegal-imm-rcna136967
https://nyunews.com/news/2024/01/26/pro-palestinian-bobst-poetry/
https://www.wjhl.com/news/local/kyle-rittenhouse-event-draws-supporters-protesters-at-etsu/
https://newjersey.news12.com/group-gathers-ahead-of-toms-river-council-meeting-to-protest-policeemt-funding-decision
https://www.denverpost.com/2024/01/02/alamo-drafthouse-employees-union-drive-rally-denver/

Using this approach, it took me five minutes to get through the list of 100 URLs. I didn’t get every webpage, so I usually run it twice on the same list to catch URLs that were missed, either because of errors on my end or in the cloud.
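Since fetch() skips any file that already exists on disk, a second pass is cheap. A minimal sketch of retrying just the failures, using the globals defined above:

# Retry only the URLs that failed on the first pass.
# fetch() already skips files that exist, so nothing is downloaded twice.
retry_urls = list(bad_urls)
bad_urls.clear()  # otherwise fetch() would skip these as known-bad

urls = retry_urls  # main() reads the module-level urls list
asyncio.run(main())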

The main delay is slow-loading pages. I have the timeout arbitrarily set to 30 seconds. Setting it longer might load one or two more pages, but it would also slow down the whole run, since some pages will never load.
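If you want to experiment with a longer cutoff, one option is to thread the timeout through process_url instead of editing fetch() itself. A hypothetical variant, not what I ran:

# Variant of process_url that forwards a custom timeout to fetch()
async def process_url(page, url_queue, timeout=60):
    while not url_queue.empty():
        url = await url_queue.get()
        await fetch(page, url, timeout=timeout)
        url_queue.task_done()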