General News Scrapers

About

Welcome to the General News Scrapers page.
This page provides an overview of useful tools for researching news sites such as the NYT, SPON, the Guardian, and others. If you run into problems with one of the tools in the list, feel free to open an issue on our repo. It helps us maintain this list.

Useful Scrapers

General Scrapers	Headlines	Lead Paragraph	Article	Main Image	Login	Author	Date	Language
News Please	√	√	√	√	x	√	√	√
Scrapy	√	√	√	√	-	√	√	-
Newspaper3k	√	√	√	√	x	√	√	-
Scrape Bot	√	√	√	√	x	√	√	-
Media Cloud	√	-		-	√		√	√
trafilatura	√	-	√	-	x	√	√	√
paperboy	√	-	√	-	x	√	√	-

Keys

Headlines: scrapes the headline of an article
Lead Paragraph: fetches the lead paragraph
Article: scrapes the complete article
Main Image: downloads the main image of the article
Login: logs in to the members' page
Author: scrapes the name of the author
Date: gets the date (and time)
Language: tries to detect the language the article is written in
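
The fields listed above map naturally onto a small record type. The following is a minimal sketch, assuming the page exposes Open Graph style `<meta>` tags; the names `ArticleRecord`, `MetaTagParser`, and `record_from_html` are hypothetical and do not belong to any of the tools in the table. It uses only the Python standard library and leaves the full article text empty, since body extraction is the harder part that the listed tools handle.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


@dataclass
class ArticleRecord:
    """Hypothetical container with one field per table column (except Login)."""
    headline: Optional[str] = None
    lead_paragraph: Optional[str] = None
    article: Optional[str] = None
    main_image: Optional[str] = None
    author: Optional[str] = None
    date: Optional[str] = None
    language: Optional[str] = None


class MetaTagParser(HTMLParser):
    """Collects <meta> key/content pairs and the lang attribute of <html>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.lang = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.lang = attrs.get("lang")
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta[key] = attrs["content"]


def record_from_html(html: str) -> ArticleRecord:
    """Fill an ArticleRecord from metadata tags only (assumption: tags exist)."""
    parser = MetaTagParser()
    parser.feed(html)
    meta = parser.meta
    return ArticleRecord(
        headline=meta.get("og:title"),
        lead_paragraph=meta.get("og:description"),
        main_image=meta.get("og:image"),
        author=meta.get("author"),
        date=meta.get("article:published_time"),
        language=parser.lang,
    )


# Made-up sample page used only for illustration.
SAMPLE = """<html lang="en"><head>
<meta property="og:title" content="Example headline">
<meta property="og:description" content="Example lead paragraph.">
<meta property="og:image" content="https://example.org/lead.jpg">
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2021-05-01T08:00:00Z">
</head><body></body></html>"""

record = record_from_html(SAMPLE)
print(record)
```

Pages that lack these tags would leave the corresponding fields as `None`, which is one reason the table marks some columns as unsupported for some tools.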

Description

News Please

news-please is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website to crawl it completely.

Features:

headline
lead paragraph
main text
main image
name(s) of author(s)
publication date
language
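
To illustrate what following internal hyperlinks recursively from a root URL involves, here is a minimal sketch using only the Python standard library. It is not news-please's implementation; `LinkCollector`, `internal_links`, `crawl`, and the `news.example` site are hypothetical. The `fetch` argument is assumed to be any callable that maps a URL to an HTML string, which keeps the sketch runnable without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(html, page_url):
    """Return absolute links on the page that stay on the page's own host."""
    host = urlparse(page_url).netloc
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def crawl(root_url, fetch, max_pages=50):
    """Breadth-first crawl from root_url; returns visited URLs in order."""
    seen = []
    queue = deque([root_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.append(url)
        for link in internal_links(fetch(url), url):
            if link not in seen:
                queue.append(link)
    return seen


# Made-up in-memory site standing in for real HTTP responses.
FAKE_SITE = {
    "https://news.example/": '<a href="/a">A</a> <a href="https://other.example/x">ext</a>',
    "https://news.example/a": '<a href="/">home</a> <a href="/b">B</a>',
    "https://news.example/b": "",
}

pages = crawl("https://news.example/", lambda url: FAKE_SITE.get(url, ""))
print(pages)
```

The `max_pages` cap is a design choice for the sketch: without some limit, a recursive crawl of a large news archive may run for a very long time.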

Scrapy

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features:

For the complete documentation of Scrapy's features, please visit the official Scrapy homepage.

Newspaper3k

Newspaper3k is a Python library inspired by requests for its simplicity and powered by lxml for its speed.

Features:

Multi-threaded article download framework
News URL identification
Text extraction from HTML
Top image extraction from HTML
All image extraction from HTML
Keyword extraction from text
Summary extraction from text
Author extraction from text
Google trending terms extraction
Works in 10+ languages
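
The idea behind a multi-threaded article download can be sketched with the standard library's thread pool. This is not Newspaper3k's API or implementation; `download_all` and `fake_fetch` are hypothetical names, and `fetch` is assumed to be any callable that returns the HTML for a URL.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Fetch every URL concurrently; maps each URL to its HTML or None on failure."""

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # One failed article should not abort the whole batch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(safe_fetch, urls)))


def fake_fetch(url):
    """Stand-in for a real HTTP request, used only for illustration."""
    if url.endswith("/broken"):
        raise IOError("simulated network error")
    return f"<html>{url}</html>"


results = download_all(
    ["https://news.example/1", "https://news.example/broken"],
    fake_fetch,
)
print(results)
```

Threads suit this task because downloading is dominated by waiting on the network rather than by computation.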

Scrape Bot

ScrapeBot is a tool for so-called “agent-based testing” to automatically visit, modify, and scrape a defined set of webpages regularly. It was built to automate various web-based tasks and keep track of them in a controllable way for academic research, primarily in the realm of computational social science.

Media Cloud

Media Cloud is an open-source content analysis tool that aims to map news media coverage of current events. The Media Cloud platform offers three tools: Explorer, Topic Mapper, and Source Manager.

Video intro can be found here

Features:

Map geographic coverage
Track attention over time
Slice and dice the subtopics
Imports stories from many resources daily

trafilatura

trafilatura is a command line tool and Python package which gathers text and corresponding metadata swiftly and efficiently.

Features:

web crawling
downloading
scraping
extraction
exports to TXT, CSV, JSON, XML
comprehensive documentation (see link above)
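
As a rough illustration of what exporting scraped records to JSON and CSV looks like, the sketch below uses only Python's standard library. It does not call trafilatura and is not its exporter; `to_json`, `to_csv`, and the record fields are hypothetical.

```python
import csv
import io
import json

# Made-up records standing in for extracted articles.
RECORDS = [
    {"title": "Example headline", "date": "2021-05-01", "text": "Body text."},
    {"title": "Second headline", "date": "2021-05-02", "text": "More text."},
]


def to_json(records):
    """Serialize the records as a JSON array, keeping non-ASCII characters."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records, fields):
    """Serialize the chosen fields as CSV; other keys are ignored."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_json(RECORDS))
print(to_csv(RECORDS, ["title", "date"]))
```

CSV works well for flat metadata such as title and date, while JSON or XML is usually the better fit once records carry long text or nested fields.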

paperboy

paperboy is an R package that collects web scraping scripts for news media sites to ensure consistently formatted news media data across a variety of websites.

Features:

overview of scrapable domains
variety of scrapers
reads shortened URLs
export of raw HTML code

RISJbot

Scrapy-based tool for news text and metadata extraction.

Features:

pre-defined spiders for various UK & US websites
convertible into R-based ecosystems
exports JSON Lines
functionality can be expanded with further middleware and extensions, outlined in readme
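
JSON Lines output stores one JSON object per line, which makes it easy to process article by article. The following is a minimal sketch of reading such a file in Python; `read_jsonlines` is a hypothetical helper, and the field names in the sample are assumptions, not RISJbot's actual output schema.

```python
import io
import json


def read_jsonlines(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample; a real export would be opened with open(path, encoding="utf-8").
SAMPLE = io.StringIO(
    '{"headline": "First story", "url": "https://news.example/1"}\n'
    "\n"
    '{"headline": "Second story", "url": "https://news.example/2"}\n'
)

articles = list(read_jsonlines(SAMPLE))
print(len(articles), articles[0]["headline"])
```

Because each line is parsed independently, the generator can stream through large exports without loading the whole file into memory.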
