Web scraping is the process of extracting data or information from websites and turning it into a useful format for further analysis. A typical web-scraping workflow first fetches the target webpage and then parses the relevant information from that page. The parsed information is then brought into a useful format and stored in an archivable file format, a database, or a server for further analysis.
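As a minimal sketch of the fetch-and-parse steps, the snippet below parses two fields out of a page using only Python's standard library. The page content, element classes, and field names are assumptions for illustration; in a real scraper the HTML would be fetched over the network first.

```python
from html.parser import HTMLParser

# Illustrative page; a real scraper would fetch this first, e.g. with
# urllib.request.urlopen(url).read(). The classes are hypothetical.
PAGE = """
<html><body>
  <h1 class="headline">Example headline</h1>
  <p class="byline">Jane Doe</p>
</body></html>
"""

class ArticleParser(HTMLParser):
    """Collects the text of elements whose class we are interested in."""
    def __init__(self):
        super().__init__()
        self.current = None   # class of the element we are currently inside
        self.fields = {}      # extracted field -> text (the "useful format")

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("headline", "byline"):
            self.current = cls

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] = data.strip()
            self.current = None

parser = ArticleParser()
parser.feed(PAGE)
print(parser.fields)  # structured record, ready to be stored
```

The resulting dictionary is the "useful format" step: a structured record that can be written to a file or database.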
Advantages of web scraping
Unique, rich, and independent datasets can be acquired by using a scraper. The researcher does not depend on any third party to get the data.
Instead of copying and pasting data from the internet or buying data from a third party, we can choose exactly what data we want to collect.
Data collection can be automated and repeated. For example, we can run the scraper daily and collect data for every day.
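A hedged sketch of what such a repeatable daily collection step could look like: each run writes its records to a date-stamped file, so re-running the scraper every day yields one snapshot per day. The field names and output directory are assumptions, not part of any particular scraper.

```python
import csv
from datetime import date
from pathlib import Path

def save_daily_snapshot(records, out_dir="data"):
    """Write today's scraped records to a date-stamped CSV file."""
    Path(out_dir).mkdir(exist_ok=True)
    path = Path(out_dir) / f"scrape-{date.today().isoformat()}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(records)
    return path

# Example with two hypothetical records collected today
path = save_daily_snapshot([
    {"url": "https://example.com/a", "title": "First article"},
    {"url": "https://example.com/b", "title": "Second article"},
])
```

Running this under a scheduler such as cron automates the daily collection.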
Disadvantages of web scraping
Building a scraper might require substantial programming knowledge. Alternatively, ready-made scraping software can be used, but it might be costly, and third-party software can limit how much the collected data can be customized.
Websites change their structure regularly, which can require a great deal of maintenance for long-term collections.
Scraping a website also consumes its resources, so best practices include being respectful, avoiding plagiarism, respecting privacy expectations, and setting a gentle request rate limit. In addition, scraping often carries a higher risk of violating ethical guidelines or legal restrictions.
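One common way to honour these practices in code is to check the site's robots.txt before fetching and to pause between requests. A minimal standard-library sketch follows; the robots.txt content is inlined here for illustration, and the delay value is an assumption (sites may specify their own via Crawl-delay).

```python
import time
from urllib.robotparser import RobotFileParser

# Inlined robots.txt for illustration; normally you would call
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def polite_fetch(url, delay=2.0):
    """Fetch url only if robots.txt allows it, then wait before the next request."""
    if not rp.can_fetch("*", url):
        return None  # respect the site's wishes
    # ... fetch the page here, e.g. with urllib.request.urlopen(url) ...
    time.sleep(delay)  # gentle request rate limit
    return url
```

Pages under the disallowed path are skipped, and the sleep between requests keeps the request rate gentle.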
Useful open source scrapers
This page lists a handful of useful news scrapers that are open source and already documented on our website.
The following list is sorted by the ease of access (open-source status and required programming knowledge).
Scrapy is a powerful web crawler and scraper that can be used to scrape data from a website and store it in a structured way. However, using Scrapy requires a little Python programming knowledge.
Heritrix is a Java-based open-source crawler that provides a web-browser user interface for operating it. Heritrix requires a strong programming background, so it is not for beginners.