Overview of Corpora

B1 (Focus in Gur and Kwa languages)   http://www.sfb632.uni-potsdam.de/projects_b1ger.html [internal link]
contact: sfb632.b1@rz.hu-berlin.de
Gur and Kwa Corpus
modality: spoken, monologue and dialogue, quasi-parallel (QUIS tasks alignable via task numbers), extra experiments specific to each language
formats: wav (audio), toolbox, EXMARaLDA-XML (annotation)
source data: focus data elicited with (parts of) QUIS  http://www.sfb632.uni-potsdam.de/homes/d2/ [internal link], language-specific elicitation tasks as well as quasi-spontaneous speech
languages: Aja (AJA), Akan (AKA), Awutu-Efutu (AFU), Buli (BWU), Byali (BEH), Dagbani (DAG), Ditammari (TBZ), Ewe (EWE), Fon (FON), Foodo (FOD), Gurene (GUR), Konkomba (XON), Konni (KMA), Lelemi (LEF), Nateni (NTM), Waamma (WWA), Yom (PIL)
metadata: IMDI-oriented metadata (elicited in .doc tables) included in XML files during processing. Pseudonymisation provided.
subcorpora / versions: all 17 languages:  part of the QUIS experiments for all languages listed, large collections of extra experimental data (language-specific), latest version of June 2007.
selected ANNIS set:
Aja:  small sample for ANNIS (QUIS experiments), status: final, latest version of July 2004, manually transcribed, glossed, translated to English and annotated with parts of speech, morphological analysis, phonological tones, definiteness of NPs, syntax and semantic roles, countability of nouns, and animacy.
Dagbani:  small sample for ANNIS (QUIS experiments), status: final, latest version of July 2004, manually transcribed, glossed, translated to English and annotated with information status, topic, new information focus, contrastive focus, proposition status.
further languages:  small samples of 9 further languages (Buli, Byali, Ditammari, Fon, Foodo, Konni, Nateni, Waamma and Yom) for ANNIS (QUIS experiments), status: in preparation, latest version of July 2007, manually transcribed, glossed, translated to English and annotated with parts of speech, morphological analysis, phonological tones, syntax, definiteness, information status, topic, new information focus, contrastive focus and animacy.

B2 (Focusing in the Chadic languages)   http://www.sfb632.uni-potsdam.de/projects_b2ger.html [internal link]
contact: Peggy Jacop
Chadic
modality: spoken, dialogue, quasi-parallel (alignment possible via QUIS task numbers)
formats: wav (audio), EXMARaLDA-XML (annotation)
source data: focus data elicited with (parts of) QUIS  http://www.sfb632.uni-potsdam.de/homes/d2/ [internal link]
languages: Guruntum (GRD), Tangale (TAN)
subcorpora / versions: Guruntum sample:  sample, status: final, manually transcribed, glossed and translated to English, annotated for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ) in EXMARaLDA.
Tangale sample:  sample, status: final, manually transcribed, glossed and translated to English, annotated for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ) in EXMARaLDA.
full set:  all focus-related experiments, status: work in progress, large parts elicited, most of the data transcribed, partly annotated.

Hausa
modality: spoken, dialogue, not parallel
formats: wav (audio), EXMARaLDA-XML (annotation)
source data: spontaneous dialogues
languages: Hausa (HAU)
subcorpora / versions: Hausa:  complete set, status: final, manually transcribed, glossed and translated to English, annotated for morphology, parts of speech, syntax, grammatical function, semantic roles, focus and focus position (e.g. ex situ) in EXMARaLDA.

B4 (The role of information structure in the development of word order regularities in Germanic)   http://www.sfb632.uni-potsdam.de/projects_b4ger.html [internal link]
contact: Svetlana Petrova
Heliand
modality: written, monologue, not parallel
formats: EXMARaLDA-XML (annotation)
source data: historic manuscript (9th century epic poem)
languages: Old Saxon (OS)
subcorpora / versions: Heliand 1, 4 and 5:  complete text, status: final, digitization, translation to Modern German, manually annotated with parts of speech, syntactic categories, grammatical functions, clause status, number of syllables (per constituent), alliteration, information status, topic/comment, position of phrase in sentence, definiteness, focus/background, focus marker, comments on context, source (bibliography).

Muspilli
modality: written, monologue, not parallel
formats: EXMARaLDA-XML (annotation)
source data: historic manuscript (epic poem)
languages: Old High German (OHG)
subcorpora / versions: Muspilli:  complete text, status: work in progress, digitization, translation to English, manually annotated with parts of speech, syntactic category, grammatical function, clause status, number of syllables (per constituent), information status, topic/comment, position of constituent in sentence, definiteness, focus/background, focus marker, comments, source (bibliography).

Otfrid
modality: written, monologue, not parallel
formats: EXMARaLDA-XML (annotation)
source data: historic manuscript
languages: Old High German (OHG)
subcorpora / versions: Otfrid:  complete text, status: work in progress, digitization, translation to English, manually annotated with parts of speech, syntactic category, grammatical function, clause status, number of syllables (per constituent), information status, topic/comment, position of constituent in sentence, definiteness, focus/background, focus marker, comments, source (bibliography).

Tatian
modality: written, monologue, not parallel
formats: EXMARaLDA-XML (annotation)
source data: historic manuscript (9th century Old High German translation of a Latin version of Tatian's 2nd century gospel harmony)
languages: Old High German (OHG)
subcorpora / versions: Tatian:  Old High German lines differing from original (Latin) lines in word order, status: final, digitization (Latin original and Old High German translation), manual annotation of parts of speech, syntactic constituents, grammatical functions, sentence type, number of syllables per constituent, definiteness of noun phrases, information status, topic/comment, focus/background, translations aligned at token level.

D1 (Linguistic database for information structure: annotation and retrieval)   http://www.sfb632.uni-potsdam.de/projects_d1ger.html [internal link]
contact: Manfred Stede, licensing issues
Christian Chiarcos, maintenance and compilation
PCC
modality: written, monologue, not parallel, argumentative texts
formats: TIGER-XML, EXMARaLDA-XML, MMAX, rs3, conano... (annotation)
source data: newspaper commentaries from the Märkische Allgemeine Zeitung, a German regional daily
languages: German (DEU)
subcorpora / versions: PCC-11:  11 of 176 commentaries, status: final, manual annotation of syntax (NEGRA/TIGER-XML), coreference (MMAX), rhetorical relations (RST tool/URML), information structure, i.e. information status, topic and focus (EXMARaLDA).
PCC-176:  176 commentaries, status: work in progress, manual annotation of syntax (NEGRA/TIGER-XML), coreference (MMAX), rhetorical relations (RST tool/URML) and information structure.

D2 (Typology of information structure)   http://www.sfb632.uni-potsdam.de/projects_d2ger.html [internal link]
contact: Stavros Skopeteas
QUIS-Typologie
modality: spoken, monologue and dialogue, quasi-parallel (alignment possible via QUIS task numbers, some task types contain dialogues)
formats: wav (audio), EXMARaLDA-XML (annotation)
source data: data elicited with QUIS  http://www.sfb632.uni-potsdam.de/homes/d2/ [internal link]
languages: Dutch (NLD), French (Quebec) (QFR), Georgian (KAT), German (DEU), Hungarian (HUN), Japanese (JPN), Konkani (GOM), Mawng (MAU), Niue (NIU), Prinmi (PMI), Teribe (TFR)
metadata: IMDI-oriented metadata (elicited in .doc tables) included in XML files during processing. Pseudonymisation provided.
subcorpora / versions: 20 samples:  20 parallel samples (same experiments in all languages), status: final, elicited, transcribed, and manually annotated at the following levels: translation to English, phonetic transcription (SAMPA), stress, accent, syntactic phrases, phonetic tones, intonational tones, morphological analysis, glosses, parts of speech, grammatical function, semantic roles, information status, topic, focus. Available in PAULA.
100 samples:  100 parallel samples (Georgian, German, Prinmi, Teribe), status: final, elicited, transcribed, and manually annotated at the following levels: translation to English, phonetic transcription (SAMPA), stress, accent, syntactic phrases, phonetic tones, intonational tones, morphological analysis, glosses, parts of speech, grammatical function, semantic roles, information status, topic, focus. Available in PAULA.
full set:  all experiments in every language, status: work in progress, large parts elicited, most of the data transcribed, partly annotated.
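
Several corpora above are described as quasi-parallel, with alignment possible via QUIS task numbers. The following is a minimal sketch of what such an alignment could look like, assuming each utterance is available as a (language code, task number, text) record. The record layout, the function name align_by_task, the task numbers and the placeholder texts are all hypothetical and are not taken from the corpus files themselves.

```python
from collections import defaultdict


def align_by_task(samples):
    """Group (language, task_number, text) records by task number.

    Returns a dict mapping each task number to a dict that maps
    language codes to the list of texts elicited for that task.
    """
    aligned = defaultdict(dict)
    for lang, task, text in samples:
        aligned[task].setdefault(lang, []).append(text)
    return dict(aligned)


# Hypothetical records: the language codes follow the lists above,
# but the task numbers and placeholder texts are invented.
samples = [
    ("KAT", 12, "<Georgian utterance>"),
    ("DEU", 12, "<German utterance>"),
    ("DEU", 13, "<German utterance>"),
]

aligned = align_by_task(samples)

# Keep only tasks that were recorded in more than one language.
parallel = {task: langs for task, langs in aligned.items() if len(langs) > 1}
```

Keying on the task number rather than on utterance position reflects the "quasi-parallel" description: speakers of different languages respond to the same task, but their responses need not match sentence by sentence.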