Web Scraping in Bulk

Downloading websites in parallel with pyppeteer
Web Scraping
Published February 19, 2024

This script is one way to download multiple web pages at the same time. It’s useful when you have many URLs from different websites you want to save: instead of visiting each one individually, the script visits several simultaneously and saves what it finds. Rather than requesting each page through a Python HTTP library, it uses a “headless” web browser, which is much more likely to get you the actual content you want.
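To see what the headless browser is doing before any parallelism gets added, here is a minimal single-page sketch of the same idea (the URL is just a placeholder):

import asyncio
from pyppeteer import launch

async def grab_one(url):
    # Launch headless Chrome, render the page, and return its HTML
    browser = await launch()
    page = await browser.newPage()
    await page.goto(url, {"waitUntil": "networkidle0"})
    html = await page.content()
    await browser.close()
    return html

html = asyncio.run(grab_one("https://example.com"))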

This approach works best when the URLs come from different servers. You will eventually get locked out if you hit the same web server several times in the same second, but there’s no reason not to visit five different websites at once. I don’t know anything about parallel processing with the asyncio library, which is the only way to parallelize pyppeteer, so the script was mostly written by ChatGPT, but I’ve used it successfully a few times.

pip install pyppeteer python-slugify nest_asyncio chromedriver-py
import os
import asyncio
import nest_asyncio
from random import shuffle

from slugify import slugify
from pyppeteer import launch
from pyppeteer.errors import NetworkError

import pandas as pd


# Jupyter already runs an event loop; nest_asyncio lets asyncio.run()
# work inside it.
nest_asyncio.apply()

The section below loads the wonderful protest event data set created by the Crowd Counting Consortium. Each protest event is linked to one or more media accounts, with the URLs stored in the source_ fields. Using just the 2024 events, I combine the URL fields, remove the social media pages and duplicates, and finally draw a random sample of 100 articles.

df = pd.read_csv(
    "https://github.com/nonviolent-action-lab/crowd-counting-consortium/raw/master/ccc_compiled_2021-present.csv",
    encoding="latin",
    low_memory=False,
)

# Limit to just 2024
df = df[df["date"].str.contains("2024")]

# grab the sources
urls = (
    list(df["source_1"].astype(str).values)
    + list(df["source_2"].astype(str).values)
    + list(df["source_3"].astype(str).values)
    + list(df["source_4"].astype(str).values)
)

# keep only real URLs and eliminate social media links
# (note: Bluesky links live on bsky.app, not bsky.com)
social_domains = ["twitter.com", "youtube.com", "facebook.com", "instagram.com", "tiktok.com", "bsky.app"]
urls = [u for u in urls if "http" in u and not any(d in u for d in social_domains)]

urls = list(set(urls))
print(len(urls))
shuffle(urls)
urls = urls[:100]

The function below uses asynchronous programming to download and save the HTML content of web pages from a list of URLs. It uses a headless Chrome browser, controlled via the pyppeteer library, to render each page just as it would appear to a human visitor. This approach is particularly useful for capturing dynamically generated content, which traditional HTTP requests might miss.

Key components of the script include:

- File naming: each URL is slugified into a safe filename, and pages that have already been saved are skipped, so the script can be re-run without re-downloading anything.
- User agent: every request identifies itself with a desktop Safari user agent string.
- Timeout: each page load is cancelled after 30 seconds via asyncio.wait_for.
- Error tracking: URLs that fail or time out are added to bad_urls and reported at the end.

# Ensure the HTML directory exists
html_dir = "HTML"
os.makedirs(html_dir, exist_ok=True)

# User agent to be used for all requests
ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15"
bad_urls = []


async def fetch(page, url, timeout=30):
    # Slugify the URL to create a valid filename
    filename = slugify(url) + ".html"
    file_path = os.path.join(html_dir, filename)

    if os.path.isfile(file_path):
        # print(f"File {file_path} already exists, skipping download.")
        return

    if url in bad_urls:
        print(f"Skipping bad URL: {url}")
        return

    try:
        # Set the user agent for the page
        await page.setUserAgent(ua)

        # Navigate to the page with a timeout
        response = await asyncio.wait_for(
            page.goto(url, {"waitUntil": "networkidle0"}), timeout
        )

        # Check if the page was successfully retrieved
        if response and response.ok:
            content = await page.content()
            # Save the content to a file in the 'HTML' directory
            with open(file_path, "w", encoding="utf-8") as file:
                file.write(content)
            print(f"Content from {url} has been saved to {file_path}")
        else:
            print(f"Failed to retrieve {url}")
            bad_urls.append(url)
    except asyncio.TimeoutError:
        print(f"Fetching {url} took too long and was cancelled.")
        bad_urls.append(url)
    except Exception as e:
        print(f"An error occurred while fetching {url}: {e}")
        bad_urls.append(url)

This next section actually does the downloading. It uses an asynchronous, queue-based approach to distribute the URLs across multiple browser pages for parallel processing. This significantly improves efficiency by keeping each browser page continuously busy, with no idle time spent waiting for other pages to finish their tasks.

Key components and functionalities:

- A shared asyncio.Queue holds the URLs; five browser tabs are opened once and reused for the whole run.
- Each tab runs process_url, pulling the next URL from the queue as soon as it finishes the previous one.
- Once the queue is empty, the tabs and the browser are closed, and any failed URLs are printed.

async def process_url(page, url_queue):
    while not url_queue.empty():
        url = await url_queue.get()
        await fetch(page, url)  # The fetch function defined above
        url_queue.task_done()

async def main():
    browser = await launch()
    pages = [await browser.newPage() for _ in range(5)]  # Initialize pages once

    # Create a queue of URLs
    url_queue = asyncio.Queue()
    for url in urls:
        await url_queue.put(url)

    # Create a task for each page to process URLs from the queue
    tasks = [asyncio.create_task(process_url(page, url_queue)) for page in pages]

    # Wait for all tasks to complete
    await asyncio.gather(*tasks)

    # Close pages and browser after all operations are complete
    for page in pages:
        await page.close()
    await browser.close()

    if bad_urls:
        print("The following URLs had issues and were not downloaded:")
        print("\n".join(bad_urls))

asyncio.run(main())
Failed to retrieve https://www.wgmd.com/pro-palestinian-protesters-deface-veterans-cemetery-in-los-angeles-spray-paint-free-gaza/
Content from https://www.fox61.com/article/news/local/hartford-county/west-hartford/west-hartford-vandalism-under-investigation-police/520-7b65ab7b-d93b-42f9-8ca2-0ed9d5b7689a has been saved to HTML/https-www-fox61-com-article-news-local-hartford-county-west-hartford-west-hartford-vandalism-under-investigation-police-520-7b65ab7b-d93b-42f9-8ca2-0ed9d5b7689a.html
Fetching https://www.nbcnews.com/politics/donald-trump/trump-confuses-nikki-haley-pelosi-talking-jan-6-rcna134863 took too long and was cancelled.
Fetching https://www.purdueexponent.org/campus/article_78be7d6e-c2bb-11ee-a25c-a3e2dff21694.html took too long and was cancelled.
Fetching https://13wham.com/news/local/local-advocates-rally-in-downtown-rochester-on-51st-anniversary-of-roe-v-wade took too long and was cancelled.
Fetching https://www.wvtm13.com/article/protests-kenneth-smith-execution-untied-nations-montgomery/46496998 took too long and was cancelled.
Content from https://www.wwnytv.com/2024/01/20/congresswoman-stefanik-speaks-new-hampshire-support-trump/ has been saved to HTML/https-www-wwnytv-com-2024-01-20-congresswoman-stefanik-speaks-new-hampshire-support-trump.html
Fetching https://nypost.com/2024/01/21/news/scream-actress-melissa-barrera-joins-disruptive-anti-israel-rally-at-sundance/ took too long and was cancelled.
Fetching https://www.wlky.com/article/nonprofits-rally-frankfort-legislation-kentucky/46676604 took too long and was cancelled.
Fetching https://www.courier-journal.com/story/news/politics/2024/02/08/kentucky-employees-retirement-system-participants-rally-for-13th-check/72527656007/ took too long and was cancelled.
Fetching https://www.northjersey.com/story/news/2024/02/06/israel-hamas-war-day-of-action-for-palestine-nj-students-march/72478632007/ took too long and was cancelled.
Fetching https://www.washingtonpost.com/dc-md-va/2024/01/15/virginia-assembly-gun-rights-rally/ took too long and was cancelled.
Fetching https://dailybruin.com/2024/01/19/uc-divest-coalition-at-ucla-leads-hands-off-yemen-protest-on-campus took too long and was cancelled.
Fetching https://www.washingtonpost.com/dc-md-va/2024/01/18/dc-march-for-life-rally-abortion/ took too long and was cancelled.
Fetching https://www.fox5dc.com/news/dc-activists-plan-protest-against-capitals-wizards-move-to-virginia took too long and was cancelled.
Fetching https://www.latimes.com/entertainment-arts/movies/story/2024-01-21/pro-palestinian-protestors-vie-for-hollywoods-attention-at-2024-sundance-film-festival took too long and was cancelled.
Failed to retrieve https://secure.everyaction.com/RKr139EpKUCZg8_TIWA18A2
Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Target.detachFromTarget): No session with given id')>
pyppeteer.errors.NetworkError: Protocol error (Target.detachFromTarget): No session with given id
Fetching https://www.thetimestribune.com/news/dozens-rally-in-support-of-school-choice-amendment/article_6da6faa2-bbbb-11ee-9f2a-8354d0bdfbfd.html took too long and was cancelled.
Fetching https://www.nbcnews.com/news/latino/convoy-rally-texas-mexico-border-attracts-trump-fans-decry-illegal-imm-rcna136967 took too long and was cancelled.
Fetching https://nyunews.com/news/2024/01/26/pro-palestinian-bobst-poetry/ took too long and was cancelled.
Fetching https://www.wjhl.com/news/local/kyle-rittenhouse-event-draws-supporters-protesters-at-etsu/ took too long and was cancelled.
Fetching https://newjersey.news12.com/group-gathers-ahead-of-toms-river-council-meeting-to-protest-policeemt-funding-decision took too long and was cancelled.
Fetching https://www.denverpost.com/2024/01/02/alamo-drafthouse-employees-union-drive-rally-denver/ took too long and was cancelled.
The following URLs had issues and were not downloaded:
https://www.wgmd.com/pro-palestinian-protesters-deface-veterans-cemetery-in-los-angeles-spray-paint-free-gaza/
https://www.nbcnews.com/politics/donald-trump/trump-confuses-nikki-haley-pelosi-talking-jan-6-rcna134863
https://www.purdueexponent.org/campus/article_78be7d6e-c2bb-11ee-a25c-a3e2dff21694.html
https://13wham.com/news/local/local-advocates-rally-in-downtown-rochester-on-51st-anniversary-of-roe-v-wade
https://www.wvtm13.com/article/protests-kenneth-smith-execution-untied-nations-montgomery/46496998
https://nypost.com/2024/01/21/news/scream-actress-melissa-barrera-joins-disruptive-anti-israel-rally-at-sundance/
https://www.wlky.com/article/nonprofits-rally-frankfort-legislation-kentucky/46676604
https://www.courier-journal.com/story/news/politics/2024/02/08/kentucky-employees-retirement-system-participants-rally-for-13th-check/72527656007/
https://www.northjersey.com/story/news/2024/02/06/israel-hamas-war-day-of-action-for-palestine-nj-students-march/72478632007/
https://www.washingtonpost.com/dc-md-va/2024/01/15/virginia-assembly-gun-rights-rally/
https://dailybruin.com/2024/01/19/uc-divest-coalition-at-ucla-leads-hands-off-yemen-protest-on-campus
https://www.washingtonpost.com/dc-md-va/2024/01/18/dc-march-for-life-rally-abortion/
https://www.fox5dc.com/news/dc-activists-plan-protest-against-capitals-wizards-move-to-virginia
https://www.latimes.com/entertainment-arts/movies/story/2024-01-21/pro-palestinian-protestors-vie-for-hollywoods-attention-at-2024-sundance-film-festival
https://secure.everyaction.com/RKr139EpKUCZg8_TIWA18A2
https://www.thetimestribune.com/news/dozens-rally-in-support-of-school-choice-amendment/article_6da6faa2-bbbb-11ee-9f2a-8354d0bdfbfd.html
https://www.nbcnews.com/news/latino/convoy-rally-texas-mexico-border-attracts-trump-fans-decry-illegal-imm-rcna136967
https://nyunews.com/news/2024/01/26/pro-palestinian-bobst-poetry/
https://www.wjhl.com/news/local/kyle-rittenhouse-event-draws-supporters-protesters-at-etsu/
https://newjersey.news12.com/group-gathers-ahead-of-toms-river-council-meeting-to-protest-policeemt-funding-decision
https://www.denverpost.com/2024/01/02/alamo-drafthouse-employees-union-drive-rally-denver/

Using this approach, it took me five minutes to get through the list of 100 URLs. I didn’t get every webpage, so I usually run it twice on the same list to catch URLs that were missed, either because of errors on my end or in the cloud.
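Since fetch() skips any file that already exists on disk, a second pass is cheap. A minimal sketch of retrying just the failures, using the globals defined above:

# Retry only the URLs that failed on the first pass.
# fetch() already skips files that exist, so nothing is downloaded twice.
retry_urls = list(bad_urls)
bad_urls.clear()  # otherwise fetch() would skip these as known-bad

urls = retry_urls  # main() reads the module-level urls list
asyncio.run(main())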

The main delay is slow-loading pages. I have the timeout arbitrarily set to 30 seconds. Setting it longer might load one or two more pages, but it would also slow down the whole run, since some pages will never load.
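If you want to experiment with a longer cutoff, one option is to thread the timeout through process_url instead of editing fetch() itself. A hypothetical variant, not what I ran:

# Variant of process_url that forwards a custom timeout to fetch()
async def process_url(page, url_queue, timeout=60):
    while not url_queue.empty():
        url = await url_queue.get()
        await fetch(page, url, timeout=timeout)
        url_queue.task_done()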