This page explains how to download the Twitter datasets and associated labels released from past TREC-IS editions. Participants normally use this data for training when participating in subsequent editions. The data is downloaded in three main parts:
The TREC-IS data is provided for a set of (currently) 118 crisis events (note that four events, 7/23/58/67, have no positive labels and hence can be ignored, so the numbering spans 1-122). In 2020A/B we separated out the pandemic event type into its own task; for 2021, however, pandemics have been folded into the main task and are not treated differently. The metadata for each event is provided in an XML topic file. For 2021 participants, the 2021 training event descriptions can be downloaded directly below:
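The event numbering can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming the 118 figure is the 1-122 range minus the four events without positive labels; the variable names are illustrative and not part of any TREC-IS tooling.

```python
# Event numbers run from 1 to 122 inclusive, as stated in the text.
ALL_EVENT_NUMBERS = range(1, 123)

# Events listed in the text as having no positive labels.
EVENTS_WITHOUT_POSITIVE_LABELS = {7, 23, 58, 67}

# Keep every event number that is not in the ignorable set.
usable_event_numbers = [
    n for n in ALL_EVENT_NUMBERS if n not in EVENTS_WITHOUT_POSITIVE_LABELS
]

print(len(usable_event_numbers))  # 118
```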
Meanwhile, the 2021 test event descriptions are available below:
For each event, we sample a set of tweets and have human annotators label those tweets based on an ontology of 25 information types, as well as assign a priority label. We have currently labeled around 90,000 tweets and assigned over 185,000 labels. You can download the labels for the 71 training events in JSON format directly below:
If you are planning to submit your runs to our online leaderboard, then you should use the above labels for training. We have also labeled around 60,000 further tweets that are used to test participant systems. We provide these labels below for groups wishing to perform event analysis, but you must not submit any runs that use these labels.
The aforementioned labels were manually annotated based on an ontology of information types. You can find out more about these types in the overview papers:
The organisers maintain a server from which you can download the tweets for each of the events. Follow the instructions at the bottom of this page to download the tweets.
In 2021, the organisers also collected an image dataset aligned to the TREC-IS events, containing 312,546 images. If you want access to this dataset, contact Cody Buntain.
For each event, we provide a stream of tweets collected during that event for categorization. The streams can be downloaded as described below:
Stream download via TREC-IS-DatasetClient-4.1.jar: Twitter allows the hosting of small datasets (fewer than 50k tweets), and the track organizers maintain a server with the event tweets from which you can download a copy directly. The client jar first attempts to connect to the central server and upload your institution information (see below for why we collect this), and then downloads a copy of the tweets for each event in a particular dataset. It writes one file per event (gzipped JSON format, one line per tweet) in the current directory. The jar was compiled with Java (OpenJDK) 14. We do not guarantee that this service will always be available; if the service is down, you can email me.
First, download the TREC-IS-DatasetClient-4.1.jar file and the info.json file using the two buttons below and put them in a folder together. Second, open the info.json file and edit the information in it for your particular institution. Then run the client from that folder:
java -jar TREC-IS-DatasetClient-4.1.jar info.json
Note: The set of tweets that this downloads is significantly larger than the assessed set defined in the label files. There are two reasons for this discrepancy: some tweets were not assessed due to limited assessor time, and from 2020-A onward system pooling was used to determine which tweets were assessed (which required giving participants a much larger set of tweets than could be assessed).
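Once the client has finished, the per-event files can be read line by line. The following is a minimal Python sketch, based only on the description above (one gzipped file per event, one JSON object per line). The function names and the `*.json.gz` glob pattern are assumptions for illustration; adjust the pattern to match the file names the client actually writes.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Iterator


def read_event_tweets(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_tweets_per_file(directory: Path, pattern: str = "*.json.gz") -> Dict[str, int]:
    """Map each matching file name to the number of tweets it contains.

    The default pattern is a guess at the naming scheme, not a documented one.
    """
    return {
        path.name: sum(1 for _ in read_event_tweets(path))
        for path in sorted(directory.glob(pattern))
    }
```

For example, `count_tweets_per_file(Path("."))` run in the download folder gives a quick per-event tweet count, which can be compared against the number of assessed tweets in the label files.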
The tweet streams that we use here were collected from a variety of sources, both internal and external. Tweets were subject to pre-filtering by the organisers. The source of each tweet stream and the appropriate citation are listed below:
Events: 2013 Bohol Earthquake
A. Olteanu, S. Vieweg, C. Castillo. 2015. What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the ACM 2015 Conference on Computer Supported Cooperative Work and Social Computing (CSCW '15). ACM, Vancouver, BC, Canada.
Events: 2014 California Earthquake
Muhammad Imran, Prasenjit Mitra, and Carlos Castillo: Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), pp. 1638-1643. May 2016, Portorož, Slovenia.
Events: 2017 Dallas Shooting
German Aerospace Center (DLR) Dataset
Events: 2018 Florence Hurricane
Donated by: Anna Kruspe, Jens Kersten and Friederike Klan
Dataverse Scholar Portal Web Archive
Events: 2016 Fort McMurray Wildfire
Crawled by the Organizers
Everything else
By downloading the Twitter datasets using the tool provided, you agree to the University of Glasgow processing your personal data, as defined by the EU General Data Protection Regulation (GDPR) - your name and email in this case. Queries about data processing and access/deletion requests should be sent to me via email. We will store your data for as long as the track is ongoing and for up to 2 years beyond that. I may contact you using the details provided to notify you about changes in the datasets or track, to provide information or ask you questions about your participation, or otherwise to contact you about topics relevant to emergency management. We may collate statistics from the provided information that will be published, but we will not release individual names or email addresses.