We have defined a set of lightweight, task-specific schemas to simplify programmatic access to common nusantara-nlp
datasets. For each dataset, the matching task schema should be implemented in addition to a schema that preserves the original dataset format.
- Knowledge Base (KB)
  - Named entity recognition (NER)
  - Named entity disambiguation/normalization/linking (NED)
  - Event extraction (EE)
  - Relation extraction (RE)
  - Coreference resolution (COREF)
- Question Answering (QA)
  - Question answering (QA)
- Textual Entailment (TE)
  - Textual entailment (TE)
- Text Pairs (PAIRS)
  - Semantic Similarity (STS)
- Text to Text (T2T)
  - Paraphrasing (PARA)
  - Translation (TRANSL)
  - Summarization (SUM)
- Text (TEXT)
  - Text classification (TXTCLASS)
This is a simple container format with minimal nesting that supports a range of common knowledge base construction / information extraction tasks.
- Named entity recognition (NER)
- Named entity disambiguation/normalization/linking (NED)
- Event extraction (EE)
- Relation extraction (RE)
- Coreference resolution (COREF)
{
    "id": "ABCDEFG",
    "document_id": "XXXXXX",
    "passages": [...],
    "entities": [...],
    "events": [...],
    "coreferences": [...],
    "relations": [...]
}
Schema Notes
- `id` fields appear at the top (i.e. document) level and in every sub-component (`passages`, `entities`, `events`, `coreferences`, `relations`). They can be set in any fashion that makes every `id` field in a dataset unique (including `id` fields in different splits like train/validation/test).
- `document_id` should be a dataset-provided document id. If the dataset does not provide one, it can be set equal to the top-level `id`.
- `offsets` contain character offsets into the string that would be created from `" ".join([passage["text"] for passage in passages])`.
- `offsets` and `text` are always lists to support discontinuous spans. For continuous spans, they will have the form `offsets=[(lo,hi)], text=["text span"]`. For discontinuous spans, they will have the form `offsets=[(lo1,hi1), (lo2,hi2), ...], text=["text span 1", "text span 2", ...]`.
- The `normalized` sub-component may contain one or more normalized links to database entity identifiers.
- `passages` captures document structure such as named sections.
- `entities`, `events`, `coreferences`, `relations` may be empty fields depending on the dataset and specific task.
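The notes above can be turned into a small consistency check. The following is a minimal sketch, not part of the library: the function names and the toy document are hypothetical, and it assumes each passage stores `text` as a list of strings that is flattened before joining with single spaces. It checks uniqueness within one document only, whereas the notes ask for uniqueness across the whole dataset.

```python
from collections import Counter

SUBCOMPONENTS = ("passages", "entities", "events", "coreferences", "relations")


def joined_text(doc):
    """Rebuild the string that all offsets index into."""
    # Assumption: passage["text"] is a list of strings, so it is flattened
    # before the pieces are joined with single spaces.
    return " ".join(t for p in doc["passages"] for t in p["text"])


def mismatched_spans(doc):
    """Return ids of passages/entities whose text differs from their offsets."""
    full = joined_text(doc)
    bad = []
    for key in ("passages", "entities"):
        for item in doc.get(key, []):
            spans = [full[lo:hi] for lo, hi in item["offsets"]]
            if spans != list(item["text"]):
                bad.append(item["id"])
    return bad


def duplicate_ids(doc):
    """Return ids used more than once inside a single document."""
    ids = [doc["id"]]
    for key in SUBCOMPONENTS:
        ids.extend(item["id"] for item in doc.get(key, []))
    return sorted(i for i, n in Counter(ids).items() if n > 1)


# Toy document, made up for illustration only.
doc = {
    "id": "0",
    "document_id": "0",
    "passages": [
        {"id": "1", "type": "title",
         "text": ["Aspirin relieves pain."], "offsets": [[0, 22]]},
        {"id": "2", "type": "abstract",
         "text": ["It may upset the stomach."], "offsets": [[23, 48]]},
    ],
    "entities": [
        # Continuous span: one offset pair, one text piece.
        {"id": "3", "text": ["Aspirin"], "offsets": [[0, 7]]},
        # Discontinuous span: two offset pairs, two text pieces.
        {"id": "4", "text": ["upset", "stomach"], "offsets": [[30, 35], [40, 47]]},
    ],
}

print(mismatched_spans(doc), duplicate_ids(doc))
```

Running the same two checks over every split of a dataset, with the `id` values pooled, would cover the cross-split uniqueness requirement.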
Passages capture document structure, such as the title and abstract sections of a PubMed abstract.
{
    "id": "0",
    "document_id": "227508",
    "passages": [
        {
            "id": "1",
            "type": "title",
            "text": ["Naloxone reverses the antihypertensive effect of clonidine."],
            "offsets": [[0, 59]]
        },
        {
            "id": "2",
            "type": "abstract",
            "text": ["In unanesthetized, spontaneously hypertensive rats the decrease in blood pressure and heart rate produced by intravenous clonidine, 5 to 20 micrograms/kg, was inhibited or reversed by nalozone, 0.2 to 2 mg/kg. The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone. Naloxone alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence clonidine-suppressible binding of [3H]-dihydroergocryptine (1 nM). These findings indicate that in spontaneously hypertensive rats the effects of central alpha-adrenoceptor stimulation involve activation of opiate receptors. As naloxone and clonidine do not appear to interact with the same receptor site, the observed functional antagonism suggests the release of an endogenous opiate by clonidine or alpha-methyldopa and the possible role of the opiate in the central control of sympathetic tone."],
            "offsets": [[60, 1075]]
        }
    ]
}
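When a source dataset provides only section texts, the passage offsets have to be computed. The sketch below shows one way to do that, assuming passages are joined with a single space as described in the schema notes. `build_passages` is a hypothetical helper, not an existing function of the library.

```python
def build_passages(sections, first_id=1):
    """Turn (type, text) pairs into passages with running character offsets."""
    passages = []
    cursor = 0
    for i, (ptype, text) in enumerate(sections):
        passages.append({
            "id": str(first_id + i),
            "type": ptype,
            "text": [text],
            "offsets": [[cursor, cursor + len(text)]],
        })
        # Skip the single space that separates passages in the joined string.
        cursor += len(text) + 1
    return passages


passages = build_passages([
    ("title", "Naloxone reverses the antihypertensive effect of clonidine."),
    ("abstract", "Naloxone alone did not affect either blood pressure or heart rate."),
])
print([p["offsets"] for p in passages])
```

The title spans `[0, 59]` and the next passage starts at 60, which matches the offsets in the example above.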
- Examples: BC5CDR
"entities": [
{
"id": "3",
"offsets": [[0, 8]],
"text": ["Naloxone"],
"type": "Chemical",
"normalized": [{"db_name": "MESH", "db_id": "D009270"}]
},
...
],
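The `normalized` links make it possible to group mentions by database identifier. This is a minimal sketch under the field names shown above. `mentions_by_db_id` is a hypothetical helper, and the second entity (an unlinked mention with an empty `normalized` list) is made up for illustration.

```python
from collections import defaultdict


def mentions_by_db_id(entities):
    """Map (db_name, db_id) to the list of mention strings linked to it."""
    index = defaultdict(list)
    for ent in entities:
        for link in ent.get("normalized", []):
            # Discontinuous mentions are joined with a space for display.
            index[(link["db_name"], link["db_id"])].append(" ".join(ent["text"]))
    return dict(index)


entities = [
    {"id": "3", "offsets": [[0, 8]], "text": ["Naloxone"], "type": "Chemical",
     "normalized": [{"db_name": "MESH", "db_id": "D009270"}]},
    # Hypothetical unlinked mention: it has no normalized identifiers.
    {"id": "4", "offsets": [[49, 58]], "text": ["clonidine"], "type": "Chemical",
     "normalized": []},
]

print(mentions_by_db_id(entities))
```

Entities without links simply do not appear in the index, so downstream code can tell linked and unlinked mentions apart.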
- Examples: MLEE
"events": [
{
"id": "3",
"type": "Reaction",
"trigger": {
"offsets": [[0,6]],
"text": ["reacts"]
},
"arguments": [
{
"role": "theme",
"ref_id": "5",
}
...
],
}
...
],
- Examples: n2c2 2011: Coreference Challenge
"coreferences": [
{
"id": "32",
"entity_ids": ["1", "10", "23"],
},
...
]
- Examples: BC5CDR
"relations": [
{
"id": "100",
"type": "chemical-induced disease",
"arg1_id": "10",
"arg2_id": "32",
"normalized": []
}
]
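Relations refer to entities only through `arg1_id` and `arg2_id`, so reading them usually starts with an id lookup. The sketch below resolves each relation to a readable triple. `relation_triples` is a hypothetical helper and the two entities are invented for illustration, not taken from BC5CDR.

```python
def relation_triples(doc):
    """Resolve relation argument ids to (arg1 text, relation type, arg2 text)."""
    by_id = {ent["id"]: ent for ent in doc["entities"]}
    triples = []
    for rel in doc["relations"]:
        arg1 = by_id[rel["arg1_id"]]
        arg2 = by_id[rel["arg2_id"]]
        triples.append((" ".join(arg1["text"]), rel["type"], " ".join(arg2["text"])))
    return triples


# Illustrative document fragment; the mentions are made up.
doc = {
    "entities": [
        {"id": "10", "text": ["clonidine"], "type": "Chemical"},
        {"id": "32", "text": ["hypotension"], "type": "Disease"},
    ],
    "relations": [
        {"id": "100", "type": "chemical-induced disease",
         "arg1_id": "10", "arg2_id": "32", "normalized": []},
    ],
}

print(relation_triples(doc))
```

A `KeyError` from the lookup indicates a relation that points at an id missing from `entities`, which is worth surfacing rather than silencing.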
- Schema Template
- Examples: BioASQ Task B
{
    "id": "0",
    "document_id": "24267510",
    "question_id": "55031181e9bde69634000014",
    "question": "Is RANKL secreted from the cells?",
    "type": "yesno",
    "choices": [],
    "context": "Osteoprotegerin (OPG) is a soluble secreted factor that acts as a decoy receptor for receptor activator of NF-\u03baB ligand (RANKL)",
    "answer": ["yes"]
}
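Because `answer` is a list, scoring code has to compare a prediction against every gold answer. The following is a minimal sketch of a case-insensitive exact-match check. `is_correct` is a hypothetical helper, and the normalization (strip and lowercase) is an assumption rather than a rule defined by the schema.

```python
def is_correct(example, prediction):
    """Return True if the prediction matches any gold answer, ignoring case."""
    gold = {answer.strip().lower() for answer in example["answer"]}
    return prediction.strip().lower() in gold


example = {
    "id": "0",
    "question": "Is RANKL secreted from the cells?",
    "type": "yesno",
    "choices": [],
    "answer": ["yes"],
}

print(is_correct(example, "Yes"), is_correct(example, "no"))
```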
- Schema Template
- Examples: BaPOS
{
"id": "0",
"tokens": [
"Seorang",
"penduduk",
"yang",
"tinggal",
"dekat",
"tempat",
"kejadian",
"mengatakan",
",",
"dia",
"mendengar",
"suara",
"tabrakan",
"yang",
"keras",
"dan",
"melihat",
"mobil",
"ambulan",
"membawa",
"orang-orang",
"yang",
"berlumuran",
"darah",
"."
],
"labels": [
"B-NND",
"B-NN",
"B-SC",
"B-VB",
"B-JJ",
"B-NN",
"B-NN",
"B-VB",
"B-Z",
"B-PRP",
"B-VB",
"B-NN",
"B-NN",
"B-SC",
"B-JJ",
"B-CC",
"B-VB",
"B-NN",
"B-NN",
"B-VB",
"B-NN",
"B-SC",
"B-VB",
"B-NN",
"B-Z"
]
}
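In this schema `tokens` and `labels` are parallel lists, so the i-th label belongs to the i-th token. A minimal sketch of a length check that also pairs the two lists follows. `token_label_pairs` is a hypothetical helper, and the example uses only the first three tokens shown above.

```python
def token_label_pairs(example):
    """Pair each token with its label, failing loudly on a length mismatch."""
    tokens, labels = example["tokens"], example["labels"]
    if len(tokens) != len(labels):
        raise ValueError(
            f"example {example['id']}: {len(tokens)} tokens but {len(labels)} labels"
        )
    return list(zip(tokens, labels))


example = {
    "id": "0",
    "tokens": ["Seorang", "penduduk", "yang"],
    "labels": ["B-NND", "B-NN", "B-SC"],
}

print(token_label_pairs(example))
```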
- Examples: SciTail
{
    "id": "0",
    "document_id": "NULL",
    "premise": "Pluto rotates once on its axis every 6.39 Earth days;",
    "hypothesis": "Earth rotates on its axis once times in one day.",
    "label": "neutral"
}
- Schema Template
- Examples: MQP
{
    "id": "0",
    "document_id": "NULL",
    "text_1": "Am I over weight (192.9) for my age (39)?",
    "text_2": "I am a 39 y/o male currently weighing about 193 lbs. Do you think I am overweight?",
    "label": 1
}
- Schema Template
- Examples: ParaMed
{
    "id": "0",
    "text_1": "Pleasing God doesn't mean that we must busy ourselves with a new set of \"spiritual\" activities\n",
    "text_2": "Menyenangkan Allah tidaklah berarti bahwa kita harus menyibukkan diri sendiri dengan berbagai aktivitas rohani\n",
    "text_1_name": "eng",
    "text_2_name": "ind"
}
- Schema Template
- Examples: SmSA
{
"id": "0",
"text": "meski masa kampanye sudah selesai , bukan berati habis pula upaya mengerek tingkat kedipilihan elektabilitas .",
"labels": [
"neutral"
]
}
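Since `labels` is a list, the same schema can hold single-label and multi-label examples. The sketch below builds a label vocabulary and a multi-hot vector per example. Both helpers are hypothetical, and the three toy examples (including the "positive" and "negative" labels) are assumptions made for illustration.

```python
def build_label_index(examples):
    """Assign a stable integer index to every label name seen in the data."""
    names = sorted({label for ex in examples for label in ex["labels"]})
    return {name: i for i, name in enumerate(names)}


def multi_hot(example, label_index):
    """Encode an example's labels as a 0/1 vector over the label vocabulary."""
    vector = [0] * len(label_index)
    for label in example["labels"]:
        vector[label_index[label]] = 1
    return vector


# Toy examples; only the "labels" field matters here.
examples = [
    {"id": "0", "text": "...", "labels": ["neutral"]},
    {"id": "1", "text": "...", "labels": ["positive"]},
    {"id": "2", "text": "...", "labels": ["negative"]},
]

label_index = build_label_index(examples)
print(label_index, multi_hot(examples[0], label_index))
```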
- Schema Template
- Examples: CC100
{
"id": "0",
"text": "Placeholder text. Will change to a real example soon."
}
- Examples: Coming soon
{
    "id": "01-001",
    "path": ".cache/huggingface/datasets/downloads/extracted/ecbf4ad46b3db9b85aa9108272c39dc75a268b4c0b92f2827866ef17dea97585/01/01-001.wav",
    "audio": {
        "path": ".cache/huggingface/datasets/downloads/extracted/ecbf4ad46b3db9b85aa9108272c39dc75a268b4c0b92f2827866ef17dea97585/01/01-001.wav",
        "array": array([-0.0005188 , -0.00018311, -0.00021362, ..., -0.00018311, -0.00033569, -0.00015259], dtype=float32),
        "sampling_rate": 16000
    },
    "text": "hai selamat pagi apa kabar",
    "speaker": "01",
    "metadata": {"speaker_age": 25, "speaker_gender": "female"}
}
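The decoded `array` and the `sampling_rate` together give the clip length. A minimal sketch follows, assuming a one-dimensional (mono) sample array. `duration_seconds` is a hypothetical helper, and a plain list of zeros stands in for the NumPy array shown above.

```python
def duration_seconds(example):
    """Compute clip duration from the number of samples and the sampling rate."""
    audio = example["audio"]
    # len() works for a list as well as for a 1-D NumPy array.
    return len(audio["array"]) / audio["sampling_rate"]


# Stand-in example: 32000 silent samples at 16 kHz.
example = {
    "id": "01-001",
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "hai selamat pagi apa kabar",
}

print(duration_seconds(example))
```

Durations computed this way can be used to filter out very short or very long clips before training.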