General News Scrapers
About
Welcome to the General News Scrapers page.
This site aims to provide an overview of useful tools for researching different news sites, e.g. NYT, SPON, The Guardian, and others. If you run into problems or issues with one of the tools in this list, feel free to open an issue on our repo. It helps us maintain this list.
Useful Scrapers
Keys
- Headline: scrapes the headline of an article
- Lead Paragraph: fetches the lead paragraph
- Article: scrapes the complete article text
- Main Image: downloads the main image of the article
- Login: logs into a members-only page
- Author: scrapes the name of the author
- Date: gets the publication date (and time)
- Language: tries to detect the language the article is written in
Description
News Please
news-please is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website to crawl it completely.
Features:
- headline
- lead paragraph
- main text
- main image
- name(s) of author(s)
- publication date
- language
Scrapy
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Features:
For the complete documentation of Scrapy's features, please visit the official Scrapy homepage.
Newspaper3k
Inspired by requests for its simplicity and powered by lxml for its speed, Newspaper3k works as a Python library.
Features:
- Multi-threaded article download framework
- News URL identification
- Text extraction from HTML
- Top image extraction from HTML
- All image extraction from HTML
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google trending terms extraction
- Works in 10+ languages
Scrape Bot
ScrapeBot is a tool for so-called “agent-based testing” to automatically visit, modify, and scrape a defined set of webpages regularly. It was built to automate various web-based tasks and keep track of them in a controllable way for academic research, primarily in the realm of computational social science.
Media Cloud
Media Cloud is an open-source content analysis tool that aims to map news media coverage of current events. The Media Cloud platform offers three tools: Explorer, Topic Mapper, and Source Manager.
A video intro can be found here.
Features:
- Map geographic coverage
- Track attention over time
- Slice and dice the subtopics
- Imports stories from many sources daily
trafilatura
trafilatura is a command line tool and Python package which gathers text and corresponding metadata swiftly and efficiently.
Features:
- web crawling
- downloading
- scraping
- extraction
- exports to TXT, CSV, JSON, XML
- comprehensive documentation (see link above)
paperboy
An R package collecting web scraping scripts for news media sites to ensure consistently formatted news media data across a variety of websites.
Features:
- overview of scrapable domains
- variety of scrapers
- reads shortened URLs
- export of raw HTML code
RISJbot
Scrapy-based tool for news text and metadata extraction.
Features:
- pre-defined spiders for various UK & US news sites
- output convertible for use in R-based ecosystems
- exports JSONLines
- functionality can be expanded with further middleware and extensions, as outlined in the README