SMO Wiki

A pretty snapshot of the Wiki brought to you by the Social Media Observatory at HBI

General News Scrapers

Welcome to the General News Scrapers page.
This page aims to provide an overview of useful tools for researching news sites such as NYT, SPON, the Guardian, and others. If you run into problems with one of the apps on the list, feel free to open an issue on our repo. This helps us maintain the list.

Useful News Scrapers

General Scrapers | Headlines | Lead Paragraph | Article | Main Image | Login | Author | Date | Language
News Please x
Scrapy - -
Newspaper3k x -
Scrape Bot x -
Media Cloud -   -  

Keys

Description

1. News Please

news-please is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both the most recent and older, archived articles. You only need to provide the root URL of the news website to crawl it completely.
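
The two discovery strategies described above (following internal hyperlinks recursively and reading RSS feeds) can be sketched with the Python standard library alone. This is not news-please's own code or API; it is a minimal illustration under stated assumptions. The names `LinkCollector`, `internal_links`, `crawl`, `rss_article_links`, the `fetch` callback, and the in-memory `SITE` and `FEED` data are all hypothetical and stand in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url, html):
    """Return absolute links on the page that stay on the page's own host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host:
            links.append(absolute)
    return links


def crawl(root_url, fetch):
    """Breadth-first crawl from root_url; fetch(url) returns HTML or None."""
    seen = {root_url}
    queue = deque([root_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved, skip it
        visited.append(url)
        for link in internal_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def rss_article_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link") for item in root.iter("item")]


# Hypothetical in-memory site standing in for real HTTP responses.
SITE = {
    "https://news.example/": '<a href="/a1">A1</a> <a href="https://other.example/x">ext</a>',
    "https://news.example/a1": '<a href="/a2">A2</a> <a href="/">home</a>',
    "https://news.example/a2": "<p>No links here.</p>",
}

# Hypothetical RSS 2.0 feed listing the same articles.
FEED = """<rss version="2.0"><channel>
<item><title>A1</title><link>https://news.example/a1</link></item>
<item><title>A2</title><link>https://news.example/a2</link></item>
</channel></rss>"""

if __name__ == "__main__":
    print(crawl("https://news.example/", SITE.get))
    print(rss_article_links(FEED))
```

In this sketch the crawl starts from a single root URL and never leaves the root's host, which mirrors the idea of supplying only the root URL of a news website. A real crawler would also need politeness delays, robots.txt handling, and error handling, all of which are omitted here.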

Features:

2. Scrapy

Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features:

For the complete documentation of Scrapy’s features, please visit the official Scrapy homepage.
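
To give a feel for what "extracting structured data" from a news page can mean, here is a minimal sketch that uses only the Python standard library. It does not use Scrapy or its API. It assumes the page exposes Open Graph and similar `<meta>` tags, which many news sites do but not all, and the output field names (`headline`, `lead_paragraph`, `main_image`, `author`, `date`, `language`) are hypothetical labels chosen to match the columns of the overview table.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collects a few common metadata fields from an article page."""

    # Maps <meta> keys to the hypothetical output field names used here.
    META_FIELDS = {
        "og:title": "headline",
        "og:description": "lead_paragraph",
        "og:image": "main_image",
        "author": "author",
        "article:published_time": "date",
    }

    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            # The lang attribute of <html> declares the page language.
            self.record["language"] = attrs["lang"]
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            field = self.META_FIELDS.get(key)
            content = attrs.get("content")
            if field and content:
                # Keep the first value seen for each field.
                self.record.setdefault(field, content)


def extract_article_meta(html):
    """Return a dict with whichever metadata fields the page provides."""
    parser = ArticleMetaParser()
    parser.feed(html)
    return parser.record


# Hypothetical article page used only for illustration.
SAMPLE_PAGE = """<html lang="en"><head>
<meta property="og:title" content="Example headline">
<meta property="og:description" content="Example lead paragraph.">
<meta property="og:image" content="https://news.example/img.jpg">
<meta name="author" content="Jane Doe">
<meta property="article:published_time" content="2021-03-01T08:00:00Z">
</head><body><p>Body text.</p></body></html>"""

if __name__ == "__main__":
    print(extract_article_meta(SAMPLE_PAGE))
```

Fields missing from a page are simply absent from the returned dict, so downstream code should not assume every key exists. Frameworks such as Scrapy typically rely on CSS or XPath selectors for this step; the meta-tag approach here is only one simple way to illustrate the idea.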

3. Newspaper3k

Newspaper3k is a Python library for article scraping and curation, inspired by requests for its simplicity and powered by lxml for its speed.

Features:

4. Scrape Bot

ScrapeBot is a tool for so-called “agent-based testing” to automatically visit, modify, and scrape a defined set of webpages regularly. It was built to automate various web-based tasks and keep track of them in a controllable way for academic research, primarily in the realm of computational social science.



5. Media Cloud

Media Cloud is an open-source content analysis tool that aims to map news media coverage of current events. The Media Cloud platform offers three tools: Explorer, Topic Mapper, and Source Manager. A video intro can be found here.

Features: