Overview of Corpora
B1 (Focus in Gur and Kwa Languages)
|||||
| contact: | sfb632.b1@rz.hu-berlin.de | ||||
| Gur and Kwa Corpus | |||||
| modality: | spoken, monologue and dialogue, quasi-parallel (QUIS tasks alignable via task numbers), extra experiments specific to each language | ||||
| formats: | wav (audio), toolbox, EXMARaLDA-XML (annotation) | ||||
| source data: | focus data elicited with (parts of) QUIS, language-specific elicitation tasks, and quasi-spontaneous speech |
||||
| languages: | Aja (AJA), Akan (AKA), Awutu-Efutu (AFU), Buli (BWU), Byali (BEH), Dagbani (DAG), Ditammari (TBZ), Ewe (EWE), Fon (FON), Foodo (FOD), Gurene (GUR), Konkomba (XON), Konni (KMA), Lelemi (LEF), Nateni (NTM), Waamma (WWA), Yom (PIL) | ||||
| metadata: | IMDI-oriented metadata (elicited in .doc tables) included in XML files during processing. Pseudonymisation provided. | ||||
| subcorpora / versions: | all 17 languages: part of the QUIS experiments for all languages listed, plus large collections of extra language-specific experimental data, latest version of June 2007.
selected ANNIS set:
- Aja: small sample for ANNIS (QUIS experiments), status: final, latest version of July 2004, manually transcribed, glossed, translated into English and annotated with parts of speech, morphological analysis, phonological tones, definiteness of NPs, syntax and semantic roles, countability of nouns, and animacy.
- Dagbani: small sample for ANNIS (QUIS experiments), status: final, latest version of July 2004, manually transcribed, glossed, translated into English and annotated with information status, topic, new information focus, contrastive focus, and proposition status.
- further languages: small samples of 9 further languages (Buli, Byali, Ditammari, Fon, Foodo, Konni, Nateni, Waamma and Yom) for ANNIS (QUIS experiments), status: in preparation, latest version of July 2007, manually transcribed, glossed, translated into English and annotated with parts of speech, morphological analysis, phonological tones, syntax, definiteness, information status, topic, new information focus, contrastive focus and animacy. |
||||
B2 (Focusing in the Chadic Languages)
|||||
| contact: | Peggy Jacop | ||||
| Chadic | |||||
| modality: | spoken, dialogue, quasi-parallel (alignment possible via QUIS task numbers) | ||||
| formats: | wav (audio), EXMARaLDA-XML (annotation) | ||||
| source data: | focus data elicited with (parts of) QUIS |
||||
| languages: | Guruntum (GRD), Tangale (TAN) | ||||
| subcorpora / versions: | Guruntum sample: sample, status: final, manually transcribed, glossed and translated into English, annotated for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ) in EXMARaLDA.
Tangale sample: sample, status: final, manually transcribed, glossed and translated into English, annotated for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ) in EXMARaLDA.
full set: all focus-related experiments, status: work in progress, large parts elicited, most of the data transcribed, partly annotated. |
||||
| Hausa | |||||
| modality: | spoken, dialogue, not parallel | ||||
| formats: | wav (audio), EXMARaLDA-XML (annotation) | ||||
| source data: | spontaneous dialogues | ||||
| languages: | Hausa (HAU) | ||||
| subcorpora / versions: | Hausa: complete set, status: final, manually transcribed, glossed and translated into English, annotated for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ) in EXMARaLDA. |
||||
B4 (The Role of Information Structure in the Development of Word Order Regularities in Germanic)
|||||
| contact: | Svetlana Petrova | ||||
| Heliand | |||||
| modality: | written, monologue, not parallel | ||||
| formats: | EXMARaLDA-XML (annotation) | ||||
| source data: | historic manuscript (9th century epic poem) | ||||
| languages: | Old Saxon (OS) | ||||
| subcorpora / versions: | Heliand 1, 4 and 5: complete text, status: final, digitized, translated into Modern German, manually annotated with parts of speech, syntactic categories, grammatical functions, clause status, number of syllables (per constituent), alliteration, information status, topic/comment, position of phrase in sentence, definiteness, focus/background, focus marker, comments on context, source (bibliography). |
||||
| Muspilli | |||||
| modality: | written, monologue, not parallel | ||||
| formats: | EXMARaLDA-XML (annotation) | ||||
| source data: | historic manuscript (epic poem) | ||||
| languages: | Old High German (OHG) | ||||
| subcorpora / versions: | Muspilli: complete text, status: work in progress, digitized, translated into English, manually annotated with parts of speech, syntactic category, grammatical function, clause status, number of syllables (per constituent), information status, topic/comment, position of constituent in sentence, definiteness, focus/background, focus marker, comments, source (bibliography). |
||||
| Otfrid | |||||
| modality: | written, monologue, not parallel | ||||
| formats: | EXMARaLDA-XML (annotation) | ||||
| source data: | historic manuscript | ||||
| languages: | Old High German (OHG) | ||||
| subcorpora / versions: | Otfrid: complete text, status: work in progress, digitized, translated into English, manually annotated with parts of speech, syntactic category, grammatical function, clause status, number of syllables (per constituent), information status, topic/comment, position of constituent in sentence, definiteness, focus/background, focus marker, comments, source (bibliography). |
||||
| Tatian | |||||
| modality: | written, monologue, not parallel | ||||
| formats: | EXMARaLDA-XML (annotation) | ||||
| source data: | historic manuscript (9th-century translation of the Latin version of Tatian's 2nd-century gospel harmony) | ||||
| languages: | Old High German (OHG) | ||||
| subcorpora / versions: | Tatian: Old High German lines whose word order differs from that of the original Latin lines, status: final, digitized (Latin original and Old High German translation), manually annotated with parts of speech, syntactic constituents, grammatical functions, sentence type, number of syllables per constituent, definiteness of noun phrases, information status, topic/comment, focus/background, translations aligned at token level. |
||||
D1 (Linguistic Database for Information Structure: Annotation and Retrieval)
|||||
| contact: | Manfred Stede (licensing issues); Christian Chiarcos (maintenance and compilation) |
||||
| PCC | |||||
| modality: | written, monologue, not parallel, argumentative texts | ||||
| formats: | TIGER-XML, EXMARaLDA-XML, MMAX, rs3, conano... (annotation) | ||||
| source data: | newspaper commentaries from the Märkische Allgemeine Zeitung, a German regional daily | ||||
| languages: | German (DEU) | ||||
| subcorpora / versions: | PCC-11: 11 of 176 commentaries, status: final, manual annotation of syntax (NEGRA/TIGER-XML), coreference (MMAX), rhetorical relations (RST tool/URML), and information structure, i.e. information status, topic and focus (EXMARaLDA).
PCC-176: 176 commentaries, status: work in progress, manual annotation of syntax (NEGRA/TIGER-XML), coreference (MMAX), rhetorical relations (RST tool/URML) and information structure. |
||||
D2 (Typology of Information Structure)
|||||
| contact: | Stavros Skopeteas | ||||
| QUIS-Typologie | |||||
| modality: | spoken, monologue and dialogue, quasi-parallel (alignment possible via QUIS task numbers, some task types contain dialogues) | ||||
| formats: | wav (audio), EXMARaLDA-XML (annotation) | ||||
| source data: | data elicited with QUIS |
||||
| languages: | Dutch (NLD), French (Quebec) (QFR), Georgian (KAT), German (DEU), Hungarian (HUN), Japanese (JPN), Konkani (GOM), Mawng (MAU), Niue (NIU), Prinmi (PMI), Teribe (TFR) | ||||
| metadata: | IMDI-oriented metadata (elicited in .doc tables) included in XML files during processing. Pseudonymisation provided. | ||||
| subcorpora / versions: | 20 samples: 20 parallel samples (same experiments in all languages), status: final, elicited, transcribed, and manually annotated at the following levels: translation into English, phonetic transcription (SAMPA), stress, accent, syntactic phrases, phonetic tones, intonational tones, morphological analysis, glosses, parts of speech, grammatical function, semantic roles, information status, topic, focus. Available in PAULA.
100 samples: 100 parallel samples (Georgian, German, Prinmi, Teribe), status: final, elicited, transcribed, and manually annotated at the following levels: translation into English, phonetic transcription (SAMPA), stress, accent, syntactic phrases, phonetic tones, intonational tones, morphological analysis, glosses, parts of speech, grammatical function, semantic roles, information status, topic, focus. Available in PAULA.
full set: all experiments in every language, status: work in progress, large parts elicited, most of the data transcribed, partly annotated. |
||||