TREC-IS Data and Labels

This page explains how to download the Twitter datasets and associated labels released in past TREC-IS editions. Participants normally use this data for training when participating in subsequent editions. The data is downloaded in three main parts:

Events / Topics

The TREC-IS data is provided for a set of (currently) 118 crisis events. Event numbering spans 1-122, but four events (7, 23, 58 and 67) have no positive labels and can be ignored. In 2020-A/B we separated the pandemic event type into its own task; for 2021, however, pandemics have been folded into the main task and are not treated differently. The metadata for each event is provided in an XML topic file. For 2021 participants, the 2021 training event descriptions can be downloaded directly below:

Meanwhile, the 2021 test event descriptions are available below:
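
The event numbering described in this section can be checked with a short Python sketch. It only restates the figures given above (events numbered 1-122, with events 7, 23, 58 and 67 ignored). The split at event 75 is an assumption based on the `past` dataset identifier used by the downloader, not something the topic files are known to encode.

```python
# Event numbers span 1-122; four events have no positive labels.
ALL_EVENTS = range(1, 123)
NO_POSITIVE_LABELS = {7, 23, 58, 67}

# Events that are worth using for training or analysis.
usable_events = [n for n in ALL_EVENTS if n not in NO_POSITIVE_LABELS]

# Assumption: events 1-75 are the pre-2021 (training) events.
training_events = [n for n in usable_events if n <= 75]

print(len(usable_events))    # 118
print(len(training_events))  # 71
```

A list like `usable_events` can be used to filter topic or label files by event number once you know how the event number is stored in them.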

Human Annotations / Labels

For each event, we sample a set of tweets and have human annotators label those tweets based on an ontology of 25 information types, as well as assign a priority label. We have currently labeled around 90,000 tweets and assigned over 185,000 labels. You can download the labels for the 71 training events in JSON format directly below:

If you are planning to submit your runs to our online leaderboard, then you should use the above labels for training. We have also labeled approximately 60,000 further tweets, which are used to test participant systems. We provide these labels below for groups wishing to perform event analysis, but you must not submit any runs that use these labels.

Information Type Ontology

The aforementioned labels were manually annotated based on an ontology of information types. You can find out more about these types in the overview papers:

Tweets and Images

The organisers maintain a server from which you can download the tweets for each of the events. Follow the instructions at the bottom of this page to download the tweets.

In 2021, the organisers also collected an image dataset aligned to the TREC-IS events, containing 312,546 images. If you want access to this dataset, contact Cody Buntain.

Downloading Tweets

Getting the Tweets (JSON, updated 06/04/2021)

For each event, we provide a stream of tweets collected during that event for categorization. These tweets can be downloaded as described below:

Stream download via TREC-IS-DatasetClient-4.1.jar: Twitter allows the hosting of small datasets (fewer than 50k tweets), and the track organizers maintain a server with the event tweets from which you can download a copy directly. The client jar will first attempt to connect to the central server and upload your institution information (see below for why we collect this), and will then download a copy of the tweets for each event in a particular dataset. It will write one file per event (gzipped JSON format) in the current directory, with one tweet per line. The jar was compiled with Java (OpenJDK) 14. We do not guarantee that this service will always be available; if the service is down, you can email me.

First, download the TREC-IS-DatasetClient-4.1.jar file and the info.json file using the two buttons below and put them in the same folder. Second, open the info.json file and edit the information in it for your institution:

  • institution should be the name of your company or university.
  • contactname should be the name of the person downloading the dataset.
  • email should be the email address of that person.
  • type should be either 'academic', 'public sector', or 'industry'.
  • request should be the dataset identifier. The server can provide different sets of events dependent on the value of this field (see below).
The downloader currently supports the following dataset identifiers:
  • trecis2018-A: Events 1-6, 3,771 tweets
  • trecis2018-B: Events 7-21, 22,200 tweets
  • trecis2019-A: Events 22-28, 9,497 tweets
  • trecis2019-B: Events 29-34, 14,988 tweets
  • trecis2020-A: Events 35-49, 7,515 tweets
  • trecis2020-A-covid: Events 50-52, 148,247 tweets
  • trecis2020-B: Events 53-66, 274,663 tweets
  • trecis2020-B-covid: Events 67-75, 329,394 tweets
  • trecis2021-A: Events 76-122, 1,532,359 tweets
  • past: Events 1-75
  • everything: Events 1-122
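
If you would rather generate info.json than edit it by hand, the following Python sketch shows one way to do so. It assumes the file is a flat JSON object containing exactly the five fields listed above; the info.json template you downloaded is authoritative if its layout differs. The institution, name and email values are placeholders, and `build_info` is a hypothetical helper, not part of the client.

```python
import json

# Allowed values, copied from the lists above.
ALLOWED_TYPES = {"academic", "public sector", "industry"}
DATASET_IDS = {
    "trecis2018-A", "trecis2018-B", "trecis2019-A", "trecis2019-B",
    "trecis2020-A", "trecis2020-A-covid", "trecis2020-B",
    "trecis2020-B-covid", "trecis2021-A", "past", "everything",
}


def build_info(institution, contactname, email, type_, request):
    """Return the info.json fields as a dict after basic validation."""
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"type must be one of {sorted(ALLOWED_TYPES)}")
    if request not in DATASET_IDS:
        raise ValueError(f"unknown dataset identifier: {request!r}")
    return {
        "institution": institution,
        "contactname": contactname,
        "email": email,
        "type": type_,
        "request": request,
    }


info = build_info(
    institution="Example University",  # placeholder
    contactname="Jane Doe",            # placeholder
    email="jane.doe@example.org",      # placeholder
    type_="academic",
    request="trecis2021-A",
)

# Note: this overwrites any info.json already in the current directory.
with open("info.json", "w", encoding="utf-8") as f:
    json.dump(info, f, indent=2)
```

Validating the type and request values locally catches typos before the client contacts the server.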
Once you have finished editing the info.json file, save it and then run the following command from a shell or terminal:

java -jar TREC-IS-DatasetClient-4.1.jar info.json

Note: The set of tweets that this downloads is significantly larger than the assessed set defined in the label files. There are two reasons for this discrepancy: some tweets were not assessed because assessor time was limited, and from 2020-A onward system pooling was used to determine which tweets were assessed, which required giving participants a much larger set of tweets than could be assessed.
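
Once the download has finished, the per-event files can be read with a few lines of Python. The sketch below relies on the format described above (gzipped files with one JSON tweet per line). The file name pattern `*.json.gz` is an assumption, so adjust it to the names the client actually writes; `iter_tweets` and `count_tweets` are hypothetical helper names.

```python
import glob
import gzip
import json


def iter_tweets(path):
    """Yield one parsed tweet (a dict) per non-empty line of a gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_tweets(pattern="*.json.gz"):
    """Map each matching file name to the number of tweets it contains."""
    return {
        path: sum(1 for _ in iter_tweets(path))
        for path in sorted(glob.glob(pattern))
    }


# Report how many tweets were downloaded for each event file found
# in the current directory (prints nothing if no file matches).
for path, n in count_tweets().items():
    print(f"{path}: {n} tweets")
```

Reading line by line keeps memory use low, which matters for the larger datasets such as trecis2021-A.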

Citations

The tweet streams that we use here were collected from a variety of sources, both internal and external. Tweets were subject to pre-filtering by the organisers. The source of each tweet stream and the appropriate citation are listed below:

CrisisLex T26

Events: 2013 Bohol Earthquake

A. Olteanu, S. Vieweg, C. Castillo. 2015. What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the ACM 2015 Conference on Computer Supported Cooperative Work and Social Computing (CSCW '15). ACM, Vancouver, BC, Canada.

CrisisNLP Resource #1

Events: 2014 California Earthquake

Muhammad Imran, Prasenjit Mitra, and Carlos Castillo: Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), pp. 1638-1643. May 2016, Portorož, Slovenia.

UNT Digital Library Dataset

Events: 2017 Dallas Shooting

German Aerospace Center (DLR) Dataset

Events: 2018 Florence Hurricane

Donated by: Anna Kruspe, Jens Kersten and Friederike Klan

Dataverse Scholar Portal Web Archive

Events: 2016 Fort McMurray Wildfire

Crawled by the Organizers

Everything else

Personal Data Processing Policy

By downloading the Twitter datasets using the tool provided, you agree to the University of Glasgow processing your personal data, as defined by the EU General Data Protection Regulation (GDPR), namely your name and email in this case. Queries about data processing and access/deletion requests should be sent to me via email. We will store your data for as long as the track is ongoing and for up to 2 years beyond that. I may contact you using the details provided to notify you about changes in the datasets or track, to provide information or ask you questions about your participation, or otherwise to contact you about topics relevant to emergency management. We may collate statistics from the provided information that will be published, but we will not release individual names or email addresses.
