SMO Wiki

A pretty snapshot of the Wiki brought to you by the Social Media Observatory at HBI

Web Scraping

Web scraping

Web scraping is the process of extracting data from websites and converting it into a format suitable for further analysis. A typical workflow first fetches the target webpage and then parses the desired information from that page. The information is then brought into a structured format and stored in an archivable file format, database, or server for further analysis.
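The workflow described above (fetch, parse, format, store) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function names (`fetch`, `parse`, `store`), the `HeadingParser` class, and the sample HTML are assumptions made for this example, and real scrapers often rely on dedicated parsing libraries instead.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadingParser(HTMLParser):
    """Collects the page title and the text of all <h2> headings."""

    def __init__(self):
        super().__init__()
        self._current = None  # "title" or "h2" while inside such a tag
        self.title = ""
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h2"):
            self._current = tag
            if tag == "h2":
                self.headings.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h2":
            self.headings[-1] += data


def fetch(url):
    """Step 1: fetch the raw HTML of the target page."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse(html):
    """Steps 2 and 3: parse the page and build a structured record."""
    parser = HeadingParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "headings": [h.strip() for h in parser.headings],
    }


def store(record, path):
    """Step 4: store the record in an archivable file format (JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)


# Invented sample page, so the sketch runs without network access.
SAMPLE_HTML = """<html><head><title>Example news page</title></head>
<body><h2>First headline</h2><p>Text</p><h2>Second headline</h2></body></html>"""

record = parse(SAMPLE_HTML)
print(json.dumps(record))
```

For a live page, `parse(fetch(url))` would replace `parse(SAMPLE_HTML)`, and `store(record, "page.json")` would write the result to disk.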

Web scraper

A web scraper is a computer program that can be used for web scraping. It often combines crawler and scraper functionality. A crawler is an algorithm or AI built to discover websites with desirable data; the scraper is the tool that extracts this data from a website. Usually, the URLs of the website are first provided to the scraper (e.g., by a crawler). The scraper then loads the HTML code (which mostly contains the content and its overall structure), sometimes alongside CSS code (which determines much of the design) and JavaScript elements (which usually make a website interactive), depending on the ability of the scraper. Next, the scraper extracts the desired data (e.g., links, or names of politicians from online articles) and saves it in a useful format. Most scrapers save the data in CSV-like formats or JSON.
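The division of labour between crawler and scraper can be illustrated with a small Python sketch. It is a simplified example under stated assumptions: the `PAGES` dictionary is an invented in-memory stand-in for a real website (so no network access is needed), and `PageParser`, `crawl_and_scrape`, and `to_csv` are hypothetical names chosen for this illustration. The crawler part discovers URLs by following links, the scraper part extracts each page title, and the result is saved in CSV format.

```python
import csv
import io
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Invented in-memory "website": URL -> HTML (stand-in for real HTTP requests).
PAGES = {
    "https://example.org/": '<title>Home</title><a href="/a.html">A</a> <a href="b.html">B</a>',
    "https://example.org/a.html": '<title>Article A</title><a href="/">Home</a>',
    "https://example.org/b.html": '<title>Article B</title><a href="/a.html">A</a>',
}


class PageParser(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def crawl_and_scrape(start_url, fetch=PAGES.get, max_pages=50):
    """Breadth-first crawl from start_url, scraping one row per page."""
    queue = deque([start_url])
    seen = {start_url}
    rows = []
    while queue and len(rows) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # returns None for unknown pages
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        # Scraper: extract the desired data from the loaded page.
        rows.append({"url": url, "title": parser.title.strip()})
        # Crawler: resolve relative links and queue unseen URLs.
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return rows


def to_csv(rows):
    """Serialize the scraped rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = crawl_and_scrape("https://example.org/")
print(to_csv(rows))
```

Passing a different `fetch` function (for example, one that performs real HTTP requests) would turn the same loop into a live crawler; in that case it would usually also make sense to restrict the crawl to the target domain.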

Advantages of web scraping

Disadvantages of web scraping

Useful open source scrapers

This page contains a handful of useful news scrapers which are open source and already documented on our website.

For non-programmers

The following list is sorted by the ease of access (open-source status and required programming knowledge).

Scrapy

Scrapy is a powerful web crawler and scraper that can be used to scrape data from a website and store it in a structured way. However, Scrapy requires a little Python programming knowledge.

Heritrix

Heritrix is a Java-based open-source scraper that provides a user interface, accessed through a web browser, for operating the crawler. Heritrix requires a strong programming background, so it is not for beginners.