I’ve always been intrigued by the Common Crawl News dataset. Since 2016, the non-profit organization Common Crawl has been collecting a specialized dataset of media articles, with new releases every day. While this dataset is much smaller than the full Common Crawl, which currently stands at 454 terabytes and contains around 3 billion web pages, it is still a potentially useful and underutilized resource, perhaps because of both its size and the work involved in extracting the news stories.
Below, you’ll find my code for downloading one day’s worth of data from the Common Crawl News dataset, which amounts to approximately 20 gigabytes. After running the code, I obtained 59,866 English-language articles for April 20th, 2024. The process was time-consuming, taking about eight hours, with most of that time spent extracting the text from the WARC (Web ARChive) files.
I attempted to optimize the code and speed up the extraction process, but most of those efforts were unsuccessful, likely because the parts I was trying to parallelize were not the primary bottlenecks. Note that a significant portion of the articles in the dataset are likely spam or press releases. Still, I believe there is enough valuable content to make it a worthwhile resource for those willing to invest the time and effort to process it.
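Before the full script, it helps to see how a day’s files are discovered: Common Crawl publishes one gzipped `warc.paths.gz` listing per month under `crawl-data/CC-NEWS/<year>/<month>/`, and a single day’s WARC files are found by filtering that listing on the `YYYYMMDD` date string. A minimal sketch of the listing URL construction (the same one the download code below uses):

```python
from datetime import datetime


def warc_listing_url(date):
    """Build the URL of the monthly warc.paths.gz listing for a given date.

    Common Crawl publishes one listing per month; individual WARC files
    inside it are named by date, so a day's files are selected later by
    filtering the listing on the YYYYMMDD string.
    """
    return (
        "https://data.commoncrawl.org/crawl-data/CC-NEWS/"
        f"{date.year}/{date.strftime('%m')}/warc.paths.gz"
    )


print(warc_listing_url(datetime(2024, 4, 20)))
# https://data.commoncrawl.org/crawl-data/CC-NEWS/2024/04/warc.paths.gz
```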
pip install warcio news-please
import gzip
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from urllib.parse import urlparse

import pandas as pd
import requests
from langdetect import detect
from lxml.etree import ParserError
from newsplease import EmptyResponseError, NewsPlease
from warcio.archiveiterator import ArchiveIterator


def download_file(url, destination_path):
    """Download a file from a URL to a local destination path using the requests library.

    Args:
    - url (str): URL of the file to download.
    - destination_path (str): Local path to save the downloaded file.
    """
    response = requests.get(url, stream=True)
    with open(destination_path, "wb") as f:
        shutil.copyfileobj(response.raw, f)
    print(f"Downloaded {os.path.basename(destination_path)}")


def download_warc_files(date, base_dir):
    """Downloads all WARC files for a specific date into a designated directory, ensuring no duplicates."""
    COMMON_CRAWL_CC_NEWS_PREFIX = "crawl-data/CC-NEWS"
    CC_DATA_ROOT = "https://data.commoncrawl.org/"

    date_folder = date.strftime("%Y-%m-%d")
    day_dir = os.path.join(base_dir, date_folder)
    os.makedirs(day_dir, exist_ok=True)

    warc_listing_url = (
        f"{CC_DATA_ROOT}{COMMON_CRAWL_CC_NEWS_PREFIX}/"
        f"{date.year}/{date.strftime('%m')}/warc.paths.gz"
    )
    with urllib.request.urlopen(warc_listing_url) as response:
        with gzip.open(response, "rb") as decompressed_file:
            return process_warc_listing(
                decompressed_file, CC_DATA_ROOT, day_dir, date.strftime("%Y%m%d")
            )


def process_warc_listing(decompressed_file, data_root, day_dir, date_str):
    """Process each line in the decompressed WARC listing to download files."""
    files_downloaded = []
    for line in decompressed_file:
        if date_str in str(line):
            file_url = data_root + str(line.strip(), "utf-8")
            destination_path = os.path.join(day_dir, os.path.basename(file_url))
            if not os.path.exists(destination_path):
                download_file(file_url, destination_path)
            else:
                print(f"File already exists: {os.path.basename(destination_path)}")
            files_downloaded.append(destination_path)
    return files_downloaded


def is_english(text):
    """Check if the text is English."""
    try:
        return detect(text) == "en"
    except Exception:
        return False


def file_exists(file_path):
    """Check if the file exists."""
    return os.path.exists(file_path)


def is_relevant_domain(url):
    """Check if the domain is relevant based on the specified criteria."""
    domain = urlparse(url).netloc
    return (
        (domain.endswith(".org") or domain.endswith(".com"))
        and "hindustan" not in url
        and "minga." not in url
    )


def is_english_and_has_text(article):
    """Check if the article is in English and has text."""
    return article and article.language == "en" and article.maintext


def process_record(record):
    """Process a single record, extracting the article if conditions are met."""
    if record.rec_type == "response" and "html" in record.http_headers.get_header(
        "Content-Type", ""
    ):
        url = record.rec_headers.get_header("WARC-Target-URI")
        if is_relevant_domain(url):
            try:
                article = NewsPlease.from_warc(record)
                if is_english_and_has_text(article):
                    print(article.title)
                    return article.get_serializable_dict()
            except (EmptyResponseError, ParserError):
                print("Blank!!!")
                return None
    return None


def extract_articles_from_warc(file_path):
    """Extracts articles from a WARC file, filtering for English language and .org/.com domains."""
    articles = []
    if file_exists(file_path):
        with open(file_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                article = process_record(record)
                if article:
                    articles.append(article)
    return articles


def save_articles_to_json(articles, output_file):
    """Saves a list of articles to a JSON file using pandas, which automatically handles datetime serialization."""
    # Convert the list of dictionaries (articles) to a DataFrame
    df = pd.DataFrame(articles)
    # Save the DataFrame to a JSON file
    df.to_json(output_file, orient="records", lines=True, date_format="iso")


def process_warc_file(warc_file):
    """Processes a single WARC file, extracts articles, and saves them to a JSON file.

    This function only runs if the corresponding JSON file does not already exist.
    """
    json_filename = os.path.splitext(warc_file)[0] + ".json"

    # Check if the JSON file already exists
    if os.path.exists(json_filename):
        return f"JSON file already exists. Skipped processing for: {json_filename}"

    # Extract articles from the WARC file if JSON does not exist
    articles = extract_articles_from_warc(warc_file)
    if articles:  # Ensure there are articles to write
        save_articles_to_json(articles, json_filename)
        return f"Saved articles to {json_filename}"
    return f"No articles found in {warc_file}. Nothing was saved."


def process_date(date_str, base_dir, num_workers=None):
    """Processes all WARC files for a given date using threads to speed up IO-bound tasks.

    Args:
    - date_str (str): The date string in 'YYYY-MM-DD' format.
    - base_dir (str): The base directory where WARC files are downloaded and processed.
    - num_workers (int): The number of worker threads to use; defaults to the number of CPUs.
    """
    date = datetime.strptime(date_str, "%Y-%m-%d")
    warc_files = download_warc_files(date, base_dir)
    num_workers = num_workers or os.cpu_count()

    # Use ThreadPoolExecutor to process files in parallel
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Map the process_warc_file function to each WARC file
        futures = {
            executor.submit(process_warc_file, warc_file): warc_file
            for warc_file in warc_files
        }
        # Report each file's status message as it finishes
        for future in as_completed(futures):
            print(future.result())
# Example usage
process_date("2024-04-20", "commoncrawl-data/", num_workers=2)
from glob import glob

df_names = glob("commoncrawl-data/2024-04-20/*.json")
dfs = [pd.read_json(df_name, lines=True) for df_name in df_names]
df = pd.concat(dfs)
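With everything concatenated into one DataFrame, a cheap first cleaning pass is deduplicating on the article URL, since the same story is often crawled more than once. This is a sketch, not part of the pipeline above; it assumes a `url` column, which news-please includes in its serialized output, and you may prefer to dedupe on `title` or `maintext` instead.

```python
import pandas as pd


def dedupe_articles(df):
    """Drop duplicate articles, keeping the first copy of each URL."""
    return df.drop_duplicates(subset="url").reset_index(drop=True)


# Tiny illustration with hypothetical data
sample = pd.DataFrame(
    {
        "url": ["https://a.com/1", "https://a.com/1", "https://b.com/2"],
        "title": ["First story", "First story", "Second story"],
    }
)
print(len(dedupe_articles(sample)))  # 2
```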