General News Scrapers
Welcome to the General News Scrapers page.
This page gives an overview of useful tools for researching news sites such as NYT, SPON, the Guardian, and others. If you run into problems with one of the tools in the list, feel free to open an issue on our repo. This helps us maintain the list.
Useful News Scrapers
- Headline: scrapes the headline of an article
- Lead Paragraph: fetches the lead paragraph
- Articles: scrapes the complete article
- Main Image: downloads the main image of the article
- Login: logs into the members' page
- Author: scrapes the name of the author
- Date: gets the date (and time)
- Language: tries to detect which language the article is written in
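The fields above can be thought of as one record per article. The following is only a sketch of such a record: the class name `ScrapedArticle` and all field names are hypothetical and do not come from any of the tools listed here. Login is an action rather than a field, so it is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ScrapedArticle:
    """Hypothetical container for the fields a news scraper may return."""

    headline: Optional[str] = None
    lead_paragraph: Optional[str] = None
    text: Optional[str] = None
    main_image_url: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    published: Optional[datetime] = None
    language: Optional[str] = None

    def missing_fields(self):
        """Return the names of the fields that are still empty."""
        return [name for name, value in vars(self).items() if value in (None, [])]


# Made-up values, only to show how the record is filled.
example = ScrapedArticle(
    headline="Example headline",
    authors=["A. Writer"],
    language="en",
)
```

Calling `example.missing_fields()` lists what a given scraper did not deliver, which makes it easier to compare the tools below on the same article.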
1. News Please
news-please is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of a news website to crawl it completely.
- lead paragraph
- main text
- main image
- name(s) of author(s)
- publication date
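As a rough sketch of how these fields could be read in Python: the example below assumes the third-party `news-please` package is installed and uses its `NewsPlease.from_url` call. The helper names `fetch_article` and `to_record` are hypothetical, and treating `description` as the lead paragraph is an assumption. The sample object is a made-up stand-in so the mapping can be tried offline.

```python
from types import SimpleNamespace


def fetch_article(url):
    """Download and parse a single article (needs network access)."""
    # Lazy import: assumes the third-party news-please package is installed.
    from newsplease import NewsPlease

    return NewsPlease.from_url(url)


def to_record(article):
    """Map a parsed article onto the fields listed above."""
    return {
        "lead_paragraph": article.description,  # assumed to hold the lead paragraph
        "main_text": article.maintext,
        "main_image": article.image_url,
        "authors": list(article.authors or []),
        "publication_date": article.date_publish,
    }


# Offline stand-in with the same attribute names, so to_record can be tried
# without downloading anything. All values are made up.
sample = SimpleNamespace(
    description="A short lead.",
    maintext="The full article body.",
    image_url=None,
    authors=["A. Writer"],
    date_publish=None,
)
record = to_record(sample)
```

With network access, `to_record(fetch_article(url))` would apply the same mapping to a live article, where `url` is an article address of your choice.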
2. Scrapy
Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
For the complete documentation of Scrapy’s features, please visit the official Scrapy homepage.
3. Newspaper3k
Newspaper3k is inspired by requests for its simplicity and powered by lxml for its speed. It works as a Python library.
- Multi-threaded article download framework
- News URL identification
- Text extraction from HTML
- Top image extraction from HTML
- All image extraction from HTML
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google trending terms extraction
- Works in 10+ languages
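Assuming the library described above is Newspaper3k (the `newspaper` package), several of these features could be used roughly as follows. This is a sketch, not a reference: the helper names `download_article` and `describe` are hypothetical, the package and its NLTK tokenizer data are assumed to be installed, and the sample object is a made-up stand-in so the code can be tried offline.

```python
from types import SimpleNamespace


def download_article(url, language="en"):
    """Download, parse and analyse one article (needs network access)."""
    # Lazy import: assumes the third-party newspaper3k package is installed.
    from newspaper import Article

    article = Article(url, language=language)
    article.download()
    article.parse()  # fills text, top_image, images, authors
    article.nlp()    # fills keywords and summary; needs NLTK tokenizer data
    return article


def describe(article):
    """Collect the extracted features into a plain dictionary."""
    return {
        "text": article.text,
        "top_image": article.top_image,
        "images": sorted(article.images),
        "authors": list(article.authors),
        "keywords": list(article.keywords),
        "summary": article.summary,
    }


# Offline stand-in with the same attribute names; all values are made up.
sample = SimpleNamespace(
    text="The full article body.",
    top_image="https://example.com/top.jpg",
    images={"https://example.com/top.jpg", "https://example.com/b.jpg"},
    authors=["A. Writer"],
    keywords=["news", "scraping"],
    summary="A short summary.",
)
features = describe(sample)
```

With network access, `describe(download_article(url))` would return the same structure for a live article.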
4. Scrape Bot
ScrapeBot is a tool for so-called “agent-based testing” to automatically visit, modify, and scrape a defined set of webpages regularly. It was built to automate various web-based tasks and keep track of them in a controllable way for academic research, primarily in the realm of computational social science.
5. Media Cloud
Media Cloud is an open-source content analysis tool that aims to map news media coverage of current events. The Media Cloud platform offers three tools: Explorer, Topic Mapper, and Source Manager. A video intro is also available.
- Map geographic coverage
- Track attention over time
- Slice and dice the subtopics
- Imports stories from many resources daily