Web scraping is the process of extracting data or information from websites and turning it into a useful format for further analysis. A typical scraping workflow first fetches the target webpage, then parses the relevant information from that page. Finally, the information is transformed into a structured format and stored in an archivable file, database, or server for further analysis.
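To illustrate the parse-and-store steps, here is a minimal sketch using only the Python standard library. The HTML snippet, the `headline` class, and the CSV field name are invented for the example; in practice the page would first be fetched over the network.

```python
import csv
import io
from html.parser import HTMLParser

# Invented example HTML, standing in for a fetched page.
PAGE = """
<html><body>
  <h2 class="headline">Scraper collects election data</h2>
  <h2 class="headline">New archive format announced</h2>
</body></html>
"""

class HeadlineParser(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline and data.strip():
            self.headlines.append(data.strip())

# Parse the page...
parser = HeadlineParser()
parser.feed(PAGE)

# ...then store the result in an archivable format (CSV here).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["headline"])
writer.writerows([h] for h in parser.headlines)
print(parser.headlines)
```

The same structure carries over to real pages: only the fetching step and the tag/class names change.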
Advantages of web scraping
Unique, rich, and independent datasets can be acquired by using a scraper. A researcher does not depend on any third party to get the data.
Instead of copying and pasting data from the internet or buying it from a third party, we can choose exactly which data to collect.
Data collection can be automated and repeated, e.g., the scraper can run on a daily basis and collect fresh data for every day.
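A common way to make such repeated runs archivable is to write each day's results to a date-stamped file. A small sketch, where the function name, the `headline` field, and the filename pattern are all assumptions for illustration:

```python
import csv
import datetime
import os
import tempfile

def save_daily_snapshot(records, directory):
    """Write one CSV file per run, named after the collection date."""
    date_stamp = datetime.date.today().isoformat()
    path = os.path.join(directory, f"scrape-{date_stamp}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["headline"])  # invented field for the example
        writer.writerows([r] for r in records)
    return path

# A scheduler (e.g. cron) would call this once per day.
with tempfile.TemporaryDirectory() as d:
    path = save_daily_snapshot(["example headline"], d)
    filename = os.path.basename(path)
print(filename)
```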
Disadvantages of web scraping
Building a scraper might require substantial programming knowledge. Alternatively, ready-made scraping software can be used, but it may be costly, and relying on third-party software can limit how the collected data can be customized.
Websites change their structure regularly, which can require a great deal of maintenance for long-term collections.
Scraping a website also consumes its resources, so best practices include being respectful, avoiding plagiarism, respecting privacy expectations, and setting a gentle request rate limit. In addition, scraping carries a higher risk of violating ethical guidelines or legal restrictions.
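These best practices can be expressed with the Python standard library's `urllib.robotparser`. A minimal sketch, where the robots.txt content, the URLs, the bot name, and the one-second fallback delay are illustrative assumptions (in practice the robots.txt would be fetched from the target site):

```python
import time
from urllib.robotparser import RobotFileParser

# Invented example robots.txt content; normally fetched from
# the site's /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

urls = [
    "https://example.com/news/article-1",
    "https://example.com/private/internal",
]

# Only request pages the site allows for our (invented) bot name.
allowed = [u for u in urls if rp.can_fetch("MyResearchBot", u)]

# Respect the site's crawl delay between requests; fall back to 1s.
delay = rp.crawl_delay("MyResearchBot") or 1
for url in allowed:
    # fetch(url) would go here
    time.sleep(delay)
print(allowed)
```

Honoring robots.txt and pacing requests keeps the load on the target server low and documents that the collection was done in good faith.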
Useful open source scrapers
This page contains a handful of useful news scrapers which are open source and already documented on our website.
The following list is sorted by the ease of access (open-source status and required programming knowledge).
Scrapy is a powerful web crawler and scraper that can extract data from a website and store it in a structured way. However, Scrapy requires a little Python programming knowledge.
Heritrix is a Java-based open-source scraper that provides a web-browser user interface for operating the crawler. Heritrix requires a strong programming background, so it is not for beginners.