This page provides download information for the 2019-A and 2018 Editions of TREC-IS. A dataset is comprised of three components: the test topics (events); the ontology of high-level information types; and the tweets for each event to be categorized.
For this track we select a number of events/incidents of different types, e.g. earthquakes, hurricanes, public or shootings to evaluate participating systems for. Participants need to categorize tweets for each event. Topics are provided in a standard TREC topic format as shown below:
The contents of the num tag is the unique identifier for the event. Meanwhile, the content of the dataset tag is the reference to the corpus of tweets used during evaluation (if you use the tweet downloader tool you need to know this).
When developing your solution, you may use the type of the event (type tag), the event query/title (title tag), and the contents of the narrative (narr tag) as input to your system.
We also provide the link to a wikipedia page providing context for the event. We provide this to help participants understand what each event is about (we have found this is useful when perfroming failure analysis for instance). However, your system should not use this URL or linked page, as the information on that page would not exist at the time the event was ongoing.
To represent the information types that an emergency management officier might be interested in, we developed an ontology for the track based on analysis of other emergency information ontolotgies, response procedure documentation and talking with experts. The full ontology has three levels (Root/High/Low). For 2019-B, we are continuing to focus on the high-level categories. An example ontology entry is shown below:
Participant systems should assign as many categories (high-level information types) to each tweet in each event stream as they think are relevant. Note that this is different to the 2018 edition, see this page for more information. Systems may use any information provided in the ontology.
It is worth noting that what might classify as belonging to a category is often event-dependent. For example, people reporting an explosion is only likely to occur during bombing events. For this reason, we also provide participants with Event Type Profiles for the different broad event types used this year. Participants may use these profiles as they wish when building their systems.
For each event, we provide a stream of tweets collected during that event to categorize that can be downloaded as described below:
Stream download via TREC-IS-Client-v3.jar: Twitter allows the hosting of small datasets (less than 50k tweets) and the track organizers maintain a server with the event tweets which you can use to download a copy directly. The client jar will first attempt to connect to the central server and upload your institution information (see below for why we collect this) and then will download a copy of the tweets for each event, for a particular dataset. There are three currently available datasets 'trecis2018-test', 'trecis2018-train' and 'trecis2019-A-test'. It will write one file per event (GZIPed JSON format) in the current directory, one line per tweet. The jar was compiled with Java 1.8, it may not be compatible with later java versions. We do not guarantee that this service will always be available, if the service is down, you can email me.
First, you need to download the TREC-IS-Client-v3.jar file and info.json file using the two buttons below and put them in a folder together. Second, you need to open the info.json file and edit the information in here for your particular institution:
java -jar TREC-IS-Client-v3.jar info.json
The tweet streams that we use here were collected from a variety of sources, both internal and external. Tweets were subject to pre-filtering by the organisers. Below are where each tweet stream was sourced and the appropriate citation:
Events: 2013 Bohol Earthquake
A. Olteanu, S. Vieweg, C. Castillo. 2015. What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the ACM 2015 Conference on Computer Supported Cooperative Work and Social Computing (CSCW '15). ACM, Vancouver, BC, Canada.
Events:2014 California Earthquake
Muhammad Imran, Prasenjit Mitra, and Carlos Castillo: Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), pp. 1638-1643. May 2016, Portorož, Slovenia.
Events:2017 Dallas Shooting
German Aerospace Center (DLR) Dataset
Events:2018 Florence Hurricane
Donated by: Anna Kruspe, Jens Kersten and Friederike Klan
Dataverse Scholar Portal Web Archive
Events:2016 Fort McMurray Wildfire
Crawled by the Organizers
Events: 2019 Andover Fire, 2019 Choco Flood
By downloading the twitter datasets using the tool provided, you agree to the University of Glasgow processing your personal data, as defined by the EU General Data Protection Regulation (GDPR) - your name and email in this case. Queries about data processing and access/deletion requests should be sent to me via email. We will store your data for as long as the track is on-going and up-to 2 years beyond that. I may contact you using the details provided to notify you about changes in the datasets or track, to provide information or ask you questions about your participation or otherwise contact you about topics relevant to emergency management. We may collate statistics from the provided information that will be published, but we will not release individual names or email addresses.