SMO Wiki

A pretty snapshot of the Wiki brought to you by the Social Media Observatory at HBI

General News Scrapers

About

Welcome to the General News Scrapers page.
This page provides an overview of useful tools for researching news sites such as NYT, SPON, the Guardian, and others. If you run into problems with one of the tools in the list, feel free to open an issue on our repo. This helps us maintain the list.

Useful Scrapers

General Scrapers | Headlines | Lead Paragraph | Article | Main Image | Login | Author | Date | Language
News Please x
Scrapy - -
Newspaper3k x -
Scrape Bot x -
Media Cloud -   -  
trafilatura - - x
paperboy - - x -

Keys

Description

News Please

news-please is an open-source, easy-to-use news crawler that extracts structured information from almost any news website. It can recursively follow internal hyperlinks and read RSS feeds to fetch both recent and older, archived articles. You only need to provide the root URL of a news website to crawl it completely.

Features:

  • headline
  • lead paragraph
  • main text
  • main image
  • name(s) of author(s)
  • publication date
  • language
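
The features listed above correspond to attributes of the article object that news-please returns. The following is a minimal sketch, assuming the `news-please` package is installed and that `NewsPlease.from_url` and the attribute names behave as described in the project's README. `article_to_record` and `scrape_article` are hypothetical helper names, and treating `description` as the lead paragraph is an assumption.

```python
from typing import Any, Dict

# Attribute names assumed from the news-please README; missing ones become None.
FIELDS = (
    "title",         # headline
    "description",   # lead paragraph (assumed mapping)
    "maintext",      # main text
    "image_url",     # main image
    "authors",       # name(s) of author(s)
    "date_publish",  # publication date
    "language",      # language
)


def article_to_record(article: Any) -> Dict[str, Any]:
    """Copy the listed attributes of an article object into a plain dict."""
    return {field: getattr(article, field, None) for field in FIELDS}


def scrape_article(url: str) -> Dict[str, Any]:
    """Download one article and return its extracted fields."""
    # Imported lazily so the helper above works without the package installed.
    from newsplease import NewsPlease  # pip install news-please

    return article_to_record(NewsPlease.from_url(url))
```

Calling `scrape_article("https://example.org/some-article")` (a placeholder URL) requires network access; `article_to_record` can be tried offline with any object that carries the same attributes.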

Scrapy

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features:

For complete documentation of Scrapy’s features, please visit the official Scrapy homepage.

Newspaper3k

Newspaper3k is a Python library for article scraping, inspired by requests for its simplicity and powered by lxml for its speed.

Features:

  • Multi-threaded article download framework
  • News URL identification
  • Text extraction from HTML
  • Top image extraction from HTML
  • All image extraction from HTML
  • Keyword extraction from text
  • Summary extraction from text
  • Author extraction from text
  • Google trending terms extraction
  • Works in 10+ languages
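
A typical Newspaper3k workflow is to download an article and then parse it. The sketch below assumes the `newspaper3k` package is installed and uses its `Article` class; `scrape_article` and its `article_cls` parameter are hypothetical and exist only so the function can be tried offline with a stand-in class.

```python
from typing import Any, Callable, Dict, Optional


def scrape_article(
    url: str, article_cls: Optional[Callable[[str], Any]] = None
) -> Dict[str, Any]:
    """Download and parse one article, returning a few extracted fields."""
    if article_cls is None:
        # Imported lazily; requires `pip install newspaper3k`.
        from newspaper import Article

        article_cls = Article

    article = article_cls(url)
    article.download()  # fetch the HTML
    article.parse()     # extract text, top image, authors, ...
    return {
        "title": article.title,
        "text": article.text,
        "top_image": article.top_image,
        "authors": article.authors,
        "publish_date": article.publish_date,
    }
```

Keyword and summary extraction additionally require a call to `article.nlp()` after parsing, which is left out here to keep the sketch short.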

Scrape Bot

ScrapeBot is a tool for so-called “agent-based testing” to automatically visit, modify, and scrape a defined set of webpages regularly. It was built to automate various web-based tasks and keep track of them in a controllable way for academic research, primarily in the realm of computational social science.

Media Cloud

Media Cloud is an open-source content analysis tool that aims to map news media coverage of current events. The Media Cloud platform offers three tools: Explorer, Topic Mapper, and Source Manager.

Video intro can be found here

Features:

  • Map geographic coverage
  • Track attention over time
  • Slice and dice the subtopics
  • Imports stories from many resources daily

trafilatura

trafilatura is a command line tool and Python package which gathers text and corresponding metadata swiftly and efficiently.

Features:

  • web crawling
  • downloading
  • scraping
  • extraction
  • exports to TXT, CSV, JSON, XML
  • comprehensive documentation (see link above)
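
As an illustration of the download, extraction, and JSON export steps listed above, here is a hedged sketch using the Python package. It assumes `trafilatura` is installed and relies on its `fetch_url` and `extract` functions; `fetch_and_extract` and `summarize` are hypothetical helpers, and the JSON keys read in `summarize` are assumptions about the output, so missing keys are returned as `None`.

```python
import json
from typing import Any, Dict, Optional


def fetch_and_extract(url: str) -> Optional[str]:
    """Download a page and return its extracted content as a JSON string."""
    # Imported lazily; requires `pip install trafilatura`.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # download failed
        return None
    return trafilatura.extract(
        downloaded, output_format="json", with_metadata=True
    )


def summarize(json_result: str) -> Dict[str, Any]:
    """Pick a few fields out of the JSON export (key names are assumed)."""
    data = json.loads(json_result)
    return {key: data.get(key) for key in ("title", "author", "date", "text")}
```

A call such as `summarize(fetch_and_extract(url))` only makes sense when the download succeeded, so check the result of `fetch_and_extract` for `None` first.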

paperboy

paperboy is an R package that collects web scraping scripts for news media sites, with the aim of providing consistently formatted news media data across a variety of websites.

Features:

  • overview of scrapable domains
  • variety of scrapers
  • reads shortened URLs
  • export of raw HTML code

RISJbot

RISJbot is a Scrapy-based tool for extracting news text and metadata.

Features:

  • pre-defined spiders for various UK & US websites
  • convertible into R-based ecosystems
  • exports JSONLines
  • functionality can be expanded with further middleware and extensions, outlined in readme
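
Since the tool exports JSONLines, each line of the output file is one JSON object. A minimal sketch for reading such a file with the Python standard library follows; `read_jsonlines` is a hypothetical helper, and no particular field names in RISJbot's output are assumed.

```python
import json
from typing import Any, Dict, Iterator


def read_jsonlines(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

The records can then be collected with `list(read_jsonlines("articles.jl"))`, where `articles.jl` is a placeholder filename.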