Annotation data released under Project EPIC. For annotation related to the CHIME grant, see https://github.com/Project-EPIC/chime-annotation. We are still working on the collected data; here is an overview of what is and what will be available:
| Dataset | Size |
| --- | --- |
| Part-of-speech tagging for a variety of events | 32,626 tweets |
| Named Entity Annotation for 10 different events | 18,081 tweets |
| Behavioral Annotation (from Verma et al. (ICWSM 2011)) for 3 events | 1,500 tweets |
| Semantic role labeling (Gustav, Red River) | 32,912 lines |
| Territory of Information/Evidentiality/Speech act annotation | 500 tweets for 4 events |
For the Named Entity and Behavioral Annotation, we provide only the annotations, not the original tweets, in order to honor privacy concerns around potentially sensitive information. The original tweets can be accessed through Twitter, and we've included tools to facilitate this; please see the EPIC Tweet Documentation.
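As an illustration only (this is not the project's own tooling), a minimal sketch of recovering tweets by ID with tweepy and the Twitter API v2, assuming you have a bearer token and a plain-text file of tweet IDs, one per line:

```python
# Illustrative sketch: hydrate annotated tweet IDs via the Twitter API v2.
# Assumes a valid bearer token and a file of tweet IDs, one per line.
import tweepy

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder, not part of the release

def hydrate(id_file):
    client = tweepy.Client(bearer_token=BEARER_TOKEN)
    with open(id_file) as f:
        ids = [line.strip() for line in f if line.strip()]
    tweets = {}
    for i in range(0, len(ids), 100):        # get_tweets accepts at most 100 IDs per call
        resp = client.get_tweets(ids=ids[i:i + 100])
        for t in resp.data or []:            # deleted/protected tweets are simply skipped
            tweets[t.id] = t.text
    return tweets
```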
This annotation consists of simple part-of-speech tags for collections of tweets surrounding multiple events. The tweets were run through an automatic POS tagger, and the output was then hand-corrected. The datasets we include, and the number of tweets in each, are as follows:
- Dallas Tornado (2012) : 850
- Haiti Earthquake : 487
- Hurricane Gustav : 1,000
- Highland Park Fire : 700
- New Zealand Earthquake : 14,800
- Oklahoma Fires : 449
- Red River Floods (2009 and 2010): 14,340
Each event has a single file, with each line containing a word and its corrected part-of-speech tag. Tweets are separated by blank lines.
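As a rough sketch (the exact whitespace conventions in the released files are an assumption), one of these files can be read back into per-tweet lists of (word, tag) pairs like this:

```python
# Minimal sketch: read a POS-annotated file into a list of tweets,
# where each tweet is a list of (word, tag) pairs.
# Assumes whitespace-separated "word tag" lines and blank lines between tweets.
def read_pos_file(path):
    tweets, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line ends the current tweet
                if current:
                    tweets.append(current)
                    current = []
                continue
            word, tag = line.rsplit(None, 1)  # the tag is the last field on the line
            current.append((word, tag))
    if current:                               # last tweet if the file has no trailing blank line
        tweets.append(current)
    return tweets
```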
This annotation is based on the paper Foundations of a Multilayer Annotation Framework for Twitter, which describes collecting tweets for five events by searching on hand-curated keywords. These collections were then filtered down into usable datasets. For a full description of the data collection process, see Anderson and Schram, 2009.
Based on these methods, named entities were tagged over the following events, with the number of tweets for each:
- Colorado Wildfires (2012) : 741
- Dallas Tornado (2012) : 475
- Haiti Earthquake : 480
- Highland Park Fire : 344
- Hurricane Sandy : 716
- Lower North Fork Fire : 239
- New Mexico Fire : 122
- New Zealand Earthquake : 1,227
- Red River Flood (2009) : 12,885
- Red River Flood (2010) : 450
- Winter Storm Nemo : 402
Total : 18,081
Some of these datasets may not have been collected with accurate tweet IDs, and thus they may not be recoverable from the Twitter API. We are looking into possibilities for restoring accurate tweet IDs, or for releasing the data with raw text.
These tweets are annotated with named entity tags based on the Automatic Content Extraction guidelines for entities. The tags annotated are:
- PERSON
- ARTIFACT
- ORGANIZATION
- LOCATION
- FACILITY
Semantic role labeling involves annotating the important semantic entities within a sentence and the syntactic relations between them; more generally, we aim to identify who did what to whom. The SRL data annotated for Project EPIC covers two events: Hurricane Gustav and the Red River floods. The annotation follows PropBank conventions and is presented in an Excel-style format. Each line contains a word, along with the word's index in the tweet, part of speech, dependency relation, and semantic role. The semantic roles are in the final column: they indicate the verb that the word is an argument of (via its index), as well as the argument type. These types are:
- A0: ARG0
- A1: ARG1
- A2: ARG2
- AM: Modifier - can be temporal (TMP), directional (DIR), and many others
For example, consider the tweet "Thinking of Gustav. May it bring minimal damage.":
| Index | Word | Lemma | POS | - | Head | Dep. Relation | PB Verb | Semantic Role |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Thinking | think | VBG | _ | 5 | DEP | think.XX | _ |
| 2 | of | of | IN | _ | 1 | ADV | _ | 1:A1 |
| 3 | Gustav | gustav | NNP | _ | 2 | PMOD | _ | _ |
| 4 | . | . | . | _ | 1 | P | _ | _ |
| 5 | May | may | MD | _ | 0 | ROOT | _ | 7:AM-MOD |
| 6 | it | it | PRP | _ | 5 | SBJ | _ | 7:A0 |
| 7 | bring | bring | VB | _ | 5 | VC | bring.XX | _ |
| 8 | minimal | minimal | JJ | _ | 9 | NMOD | _ | _ |
| 9 | damage | damage | NN | _ | 7 | OBJ | _ | 7:A1 |
| 10 | . | . | . | _ | 5 | P | _ | _ |
Here, the verbs are "think", indexed 1, and "bring", indexed 7. The phrase "of Gustav" is the ARG1 of "think", marked by the index of the verb on "of": 1:A1. "May" is a modal (MOD) modifier of "bring", marked 7:AM-MOD. The pronoun "it" is the ARG0 of bring (7:A0), and the phrase "minimal damage" is the ARG1 of bring (7:A1 on "damage").
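As a minimal sketch (assuming tab-separated columns in the order shown above, with "_" for empty fields), the predicate-argument pairs for one tweet can be collected like this:

```python
# Minimal sketch: collect (predicate, role label, argument word) triples from
# one SRL-annotated tweet. Assumes tab-separated columns in the order shown
# above, "_" for empty fields, and roles of the form "7:A1" or "7:AM-MOD".
def extract_roles(rows):
    words = {}   # word index -> surface form
    roles = []   # (predicate index, role label, argument word)
    for row in rows:
        cols = row.rstrip("\n").split("\t")
        idx, word, role = int(cols[0]), cols[1], cols[8]
        words[idx] = word
        if role != "_":
            pred_idx, label = role.split(":", 1)
            roles.append((int(pred_idx), label, word))
    # Resolve each predicate index to its word, e.g. ("bring", "A0", "it")
    return [(words.get(p, "?"), label, arg) for p, label, arg in roles]
```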
This data is based on the paper Natural Language Processing to the Rescue? Extracting “Situational Awareness” Tweets During Mass Emergency. They collected four datasets of 500 tweets each. These datasets overlap with the named entity annotation, and include the two Red River Floods (2009, 2010), the Oklahoma wildfire, and the Haiti Earthquake. These tweets were annotated with 'behavioral' categories:
- Situational Awareness: whether the tweet contributes to users' awareness of the event
- Subjectivity: whether the tweet is objective or subjective
- Linguistic Register: whether the tweet is in a formal or informal register
- Personal/Impersonal: whether the tweet is expressed from a personal standpoint or not
These categories are annotated at the tweet level: each tweet has one annotation for each of the four categories above. Like the named entity data, we include only tweet IDs and annotations. Unfortunately, the original IDs for the Oklahoma data were not maintained, and this data is currently unavailable. We are looking into ways of releasing it publicly in a consistent and ethical fashion.
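As a hedged sketch of how these tweet-level annotations could be joined back to tweet text, assuming a CSV layout with a tweet_id column plus one column per category (the actual released layout may differ), and reusing the hydrate() sketch above:

```python
# Minimal sketch: attach hydrated tweet text to tweet-level annotations.
# The CSV layout (a tweet_id column plus one column per category) is an
# assumption for illustration, not the released file format.
import csv

def join_annotations(annotation_csv, tweets):
    """tweets: dict mapping tweet ID -> text, e.g. from the hydrate() sketch above."""
    rows = []
    with open(annotation_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["text"] = tweets.get(int(row["tweet_id"]), "")  # empty if no longer retrievable
            rows.append(row)
    return rows
```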
This data was collected for Will Corvey's dissertation. It contains territory of information, evidentiality, and speech act annotations for four different events: the Oklahoma fires, the Haiti earthquake, and the Red River floods of 2009 and 2010.
For any questions, please contact
Kevin Stowe
[email protected]