As the SMO is built up, we will present a growing collection of useful (social) media datasets. These will comprise Open Access Datasets and Access on Request Datasets, including both data we collected ourselves and collections from other projects.
Datenbank öffentlicher Sprecher (DBÖS) / Database of Public Speakers (DOPS) [planned]
The Leibniz Institute for Media Research is currently working on a collection of public speakers, identified by pre-defined criteria such as being a member of a parliamentary body, press organisation, professional sports team, or another institution of public interest.
This collection will comprise general, publicly available information about these public figures or entities, depending on the category they fall into (e.g. journalists, politicians, celebrities), along with their public social media accounts.
We hope to make this dataset open access and to open it up to contributions via a data repository.
[Text and Link missing]
[Text and Link missing]
Twitter data (and possibly, at a later stage, data from other social media) on the invasion of Ukraine in February 2022. The data reaches back to February 1st and will be updated daily. (https://github.com/Leibniz-HBI/ukraine_data)
Access on Request
We plan to make certain datasets available on request or in the form of collaborations with the SMO. This is mostly necessary due to ethical considerations, legal constraints, or the Terms of Service of the APIs with which the data was collected.
While we are working on a formal request procedure, please contact us via email in the meantime if you are interested in one of the datasets.
This dataset comprises a sparsified sample of the German-speaking Twitter follow network. It was collected via an adaptation of a graph sampling method that aims to prioritise the most central nodes in a network (https://github.com/flxvctr/RADICES).
The resulting sample contains approximately 200,000 accounts, which make up an estimated 40% of the accounts followed by an average German-language Twitter account. The estimate is based on data from 2016 and 2019; we have reason to assume, however, that the user base has changed only slowly since 2016. A detailed description of the collection method and the dataset can be found in this journal paper: https://journals.sagepub.com/doi/full/10.1177/2056305120984475
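The general idea of prioritising central nodes while sampling a follow network can be illustrated with a toy sketch. This is not the RADICES implementation and not the method from the paper: the function name `central_walk`, the toy graph, and the use of in-degree (number of followers inside the toy graph) as the centrality measure are all assumptions made purely for illustration.

```python
from collections import Counter


def central_walk(follows, seed, steps):
    """Walk a follow graph, always moving to the most-followed unvisited followee.

    follows: dict mapping an account to the set of accounts it follows.
    Returns the accounts in the order they were sampled.
    """
    # In-degree = number of followers inside the toy graph
    # (an assumed, simplified proxy for centrality).
    followers = Counter(f for friends in follows.values() for f in friends)
    sampled = [seed]
    current = seed
    for _ in range(steps):
        candidates = [a for a in follows.get(current, ()) if a not in sampled]
        if not candidates:
            break  # dead end: nothing new to follow from here
        # Sorting first means ties are broken alphabetically, so the
        # result is deterministic.
        current = max(sorted(candidates), key=lambda a: followers[a])
        sampled.append(current)
    return sampled


# Hypothetical follow graph: "a" follows "b" and "c", and so on.
toy_graph = {
    "a": {"b", "c"},
    "b": {"c", "d"},
    "c": {"d"},
    "d": {"a"},
    "e": {"c", "d"},
}
print(central_walk(toy_graph, seed="a", steps=3))
```

Because the walk always prefers the followee with the most followers, it drifts towards the well-connected core of the graph and ignores peripheral accounts, which is the intuition behind a sparsified sample of central accounts.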
The useNews data set (Puschmann & Haim, 2020) combines three innovative data sources and links their content for the years 2018-2020: the Reuters Digital News Report (user preferences and rankings of news brands), MediaCloud (news content) and CrowdTangle (Facebook engagement metrics). The dataset can be found at https://osf.io/uzca3/ and contains:
- 3 million news items from 81 sources and 12 countries
- 530 million words
- 4 million Facebook posts from 400,000 Facebook users that mention these news items.
- Overall, these posts received a cumulative
  - 468 million likes,
  - 216 million shares,
  - 177 million comments.
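To put the engagement figures in the list above into proportion, the following sketch adds them up and relates them to the number of posts. The function name `engagement_summary` is hypothetical, and because the source figures are rounded to millions, the results are rough approximations only.

```python
def engagement_summary(likes, shares, comments, posts):
    """Return total interactions, each type's share, and interactions per post.

    All arguments are in millions, so the per-post ratio needs no rescaling.
    """
    total = likes + shares + comments
    return {
        "total_millions": total,
        "share": {
            "likes": likes / total,
            "shares": shares / total,
            "comments": comments / total,
        },
        "per_post": total / posts,
    }


# Rounded figures from the list above, in millions.
summary = engagement_summary(likes=468, shares=216, comments=177, posts=4)
print(f"total interactions: {summary['total_millions']} million")
for kind, fraction in summary["share"].items():
    print(f"{kind}: {fraction:.1%}")
print(f"interactions per post: about {summary['per_post']:.0f}")
```

Likes, shares and comments add up to 861 million interactions, with likes accounting for slightly more than half of them.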
Event-centric data sets
In cooperation with NDR Data, data sets focusing on widely covered media topics in Germany in 2021 have been collected. Details and data sets are available upon request.
- #LukeMockridge, 36,919 tweet IDs
- #nemielhassam, 70,716 tweet IDs
- #gilofarim, 163,450 tweet IDs
Roadmap for SMO-provided datasets
Based on DOPS and GETCORE, we plan to collect social media activity on all platforms that we will cover over the next four years (i.e. Twitter, online news media, YouTube, Facebook, Instagram, and Wikipedia).
DOPS forms the basis for various social media tracking efforts.
GETCORE provides an alternative to DOPS for the identification of relevant public speakers.
Based on both, we will start a continuous long-term collection of activity
- based on social media accounts of public actors (ACTORS)
- by media (MEDIA)
Additionally, we will start a long-term collection by topics of interest (TOPICS) and short-term, event-related tracking on a case-by-case or on-demand basis (EVENTS).
We plan to implement the collections in half-year cycles, building up the collection platform by platform.
For legal or ethical reasons, most activity datasets will be accessible only on request.
Datasets provided by external resources
The following table contains a list of datasets provided by several external resources.
| Dataset | Description | Source | Authors | Sizes/Number of observations | Topics |
|---|---|---|---|---|---|
| One Million Posts: A Data Set of German Online Discussions | The Austrian daily broadsheet Der Standard has a dedicated discussion section for its users. This data set contains a selection of user posts from the 12-month time span from 2015-06-01 to 2016-05-31. It comprises 11,773 labeled and 1,000,000 unlabeled posts. The labeled posts were annotated by newspaper staff. The annotated labels are mainly sentiment (positive, negative, neutral), off-topic (yes, no), discriminating (yes, no), inappropriate (yes, no), etc. | Austrian newspaper websites | Dietmar Schabus, Marcin Skowron, Martin Trapp | | Development, deployment and usability testing of a Natural Language Processing (NLP) and Information Retrieval system that supports the moderation of user comments on a large newspaper website |
| The Polly Corpus: online political debate | POLLY is a free multimodal corpus with 125,000 German tweets posted before, during and after the 2017 German federal elections. It includes tweets about politicians, by politicians, by fans of politicians, and by far-right supporters. | | Tom De Smedt, Sylvia Jaki | | Political discourse, hate speech |
| | This repository contains an annotated German corpus of tweets regarding refugees in Germany. The tweets are annotated with hate speech ratings. | | User-Centred Social Media (UCSM) | | Hate speech, tweets about refugees |
| | GermEval is a dataset for offensive language identification. It was built from Twitter data. | | Josef Ruppenhofer, Melanie Siegel, Michael Wiegand | Test sets, train sets | Offensive language, text classification |
| HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages | The HASOC track intends to stimulate development in hate speech identification for Hindi, German and English. Three datasets were developed from Twitter and Facebook and made available. | Twitter and Facebook | | | Hate speech and offensive content identification in Indo-European languages (HASOC) |
| | Topics and subtopics, as well as premises and conclusions, scraped from the debate platform debatepedia.org. It contains data on 465 topics and has identified 1,623 subtopics as frames for these topics. | Debate platform debatepedia.org (currently unavailable; 10.11.2021) | Yamen Ajjour, Milad Alshomary, Henning Wachsmuth, and Benno Stein | 7.4 MB / 12,326 | Argument mining, aspect/frame detection |