2019-A TREC-IS Instructions

TREC-IS Data Download

This page provides download information for the 2019-A and 2018 Editions of TREC-IS. A dataset is comprised of three components: the test topics (events); the ontology of high-level information types; and the tweets for each event to be categorized.

Downloading the Tweets, Topics and Ontology

Topic Information (Events)

For this track we select a number of events/incidents of different types, e.g. earthquakes, hurricanes, public or shootings to evaluate participating systems for. Participants need to categorize tweets for each event. Topics are provided in a standard TREC topic format as shown below:

The contents of the num tag is the unique identifier for the event. Meanwhile, the content of the dataset tag is the reference to the corpus of tweets used during evaluation (if you use the tweet downloader tool you need to know this).

When developing your solution, you may use the type of the event (type tag), the event query/title (title tag), and the contents of the narrative (narr tag) as input to your system.

We also provide the link to a wikipedia page providing context for the event. We provide this to help participants understand what each event is about (we have found this is useful when perfroming failure analysis for instance). However, your system should not use this URL or linked page, as the information on that page would not exist at the time the event was ongoing.

Ontology

To represent the information types that an emergency management officier might be interested in, we developed an ontology for the track based on analysis of other emergency information ontolotgies, response procedure documentation and talking with experts. The full ontology has three levels (Root/High/Low). For 2019-B, we are continuing to focus on the high-level categories. An example ontology entry is shown below:

Usage Guidelines

Participant systems should assign as many categories (high-level information types) to each tweet in each event stream as they think are relevant. Note that this is different to the 2018 edition, see this page for more information. Systems may use any information provided in the ontology.

It is worth noting that what might classify as belonging to a category is often event-dependent. For example, people reporting an explosion is only likely to occur during bombing events. For this reason, we also provide participants with Event Type Profiles for the different broad event types used this year. Participants may use these profiles as they wish when building their systems.

Event Type Profiles

Getting the Tweets (JSON)

For each event, we provide a stream of tweets collected during that event to categorize that can be downloaded as described below:

Stream download via TREC-IS-Client-v3.jar: Twitter allows the hosting of small datasets (less than 50k tweets) and the track organizers maintain a server with the event tweets which you can use to download a copy directly. The client jar will first attempt to connect to the central server and upload your institution information (see below for why we collect this) and then will download a copy of the tweets for each event, for a particular dataset. There are three currently available datasets 'trecis2018-test', 'trecis2018-train' and 'trecis2019-A-test'. It will write one file per event (GZIPed JSON format) in the current directory, one line per tweet. The jar was compiled with Java 1.8, it may not be compatible with later java versions. We do not guarantee that this service will always be available, if the service is down, you can email me.

First, you need to download the TREC-IS-Client-v3.jar file and info.json file using the two buttons below and put them in a folder together. Second, you need to open the info.json file and edit the information in here for your particular institution:

institution should be the name of your company or university.
contactname should be the name of the person downloading the dataset.
email should be the email address of that person.
type should be either 'academic', 'public sector', or 'industry'.
request should be the dataset identifier, i.e. either 'trecis2018-test', 'trecis2018-train' or 'trecis2019-A-test'. If you are reading this page, you probably want 'trecis2019-A-test', although you can also download the older events to use as training data (if you want to download all of them, you will need to call the client jar three times, with a different 'request' line in the info.json file each time).

Once you have finished editing the info.json file, save it and then run the following command from a shell or terminal:

java -jar TREC-IS-Client-v3.jar info.json

Citations

The tweet streams that we use here were collected from a variety of sources, both internal and external. Tweets were subject to pre-filtering by the organisers. Below are where each tweet stream was sourced and the appropriate citation:

CrisisLex T26

Events: 2013 Bohol Earthquake

A. Olteanu, S. Vieweg, C. Castillo. 2015. What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the ACM 2015 Conference on Computer Supported Cooperative Work and Social Computing (CSCW '15). ACM, Vancouver, BC, Canada.

CrisisNLP Resource #1

Events:2014 California Earthquake

Muhammad Imran, Prasenjit Mitra, and Carlos Castillo: Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), pp. 1638-1643. May 2016, Portorož, Slovenia.

UNT Digital Libruary Dataset

Events:2017 Dallas Shooting

German Aerospace Center (DLR) Dataset

Events:2018 Florence Hurricane

Donated by: Anna Kruspe, Jens Kersten and Friederike Klan

Dataverse Scholar Portal Web Archive

Events:2016 Fort McMurray Wildfire

Crawled by the Organizers

Events: 2019 Andover Fire, 2019 Choco Flood

Personal Data Processing Policy

By downloading the twitter datasets using the tool provided, you agree to the University of Glasgow processing your personal data, as defined by the EU General Data Protection Regulation (GDPR) - your name and email in this case. Queries about data processing and access/deletion requests should be sent to me via email. We will store your data for as long as the track is on-going and up-to 2 years beyond that. I may contact you using the details provided to notify you about changes in the datasets or track, to provide information or ask you questions about your participation or otherwise contact you about topics relevant to emergency management. We may collate statistics from the provided information that will be published, but we will not release individual names or email addresses.