2nd Phase - Methods for interactive linguistic corpus analysis of information structure
In the newly proposed project D4, computational-linguistic tools and resources, machine learning techniques and statistical methods are to be combined into a flexible, interactive method for investigating information structure (IS) in large corpora - both for individual languages and for contrastive work on parallel corpora. The approach, machine learning based on interactive linguistic annotation (LAILA), supplements the corpus infrastructure and the annotation and evaluation methodology already developed in the SFB, which focus on relatively small, carefully elicited and hand-annotated data collections. With the LAILA method, unannotated corpus data can be explored rapidly or subjected to a phenomenon-oriented, controlled frequency analysis. The aim is to use the unavoidable manual effort for annotating and checking training or control data as effectively as possible for the linguistic research objectives.
Linguistic IS research can benefit from LAILA in two ways: when searching very large corpora for individual instances of rare realization forms, and when determining the frequency distribution of alternative IS realizations or of IS-relevant parameters of the lexical, structural or discourse context. For frequency analyses, controlled random samples are drawn and checked manually, and statistical generalizations are made from them. Frequency data from large corpora complement the elicited special-purpose IS corpora developed in the SFB so far. The latter carefully control the contexts of use for IS, which guarantees qualitative comparability for typological research; frequency data make it possible to check quantitatively (for the portion of the language documented by the available corpora) how the elicited realization alternatives, and any additional variants, are distributed in freely produced language.
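The sampling-based frequency estimate described above can be illustrated with a minimal sketch. It assumes a simple random sample of automatically found candidate hits and a Wilson score interval for the share of hits confirmed by manual checking; the project description specifies neither choice, and the function names (`draw_sample`, `wilson_interval`, `estimate_total`) are hypothetical.

```python
import math
import random


def draw_sample(hit_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample of candidate hit IDs."""
    rng = random.Random(seed)
    hits = list(hit_ids)
    return rng.sample(hits, min(sample_size, len(hits)))


def wilson_interval(confirmed, checked, z=1.96):
    """Wilson score interval for the proportion of confirmed hits.

    confirmed: number of sampled hits a human annotator accepted
    checked:   number of sampled hits inspected in total
    z:         normal quantile (1.96 gives roughly 95 % coverage)
    """
    if checked == 0:
        raise ValueError("at least one checked item is required")
    p = confirmed / checked
    denom = 1 + z**2 / checked
    centre = (p + z**2 / (2 * checked)) / denom
    half = z * math.sqrt(p * (1 - p) / checked + z**2 / (4 * checked**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def estimate_total(candidate_count, confirmed, checked):
    """Scale the interval to the full set of automatically found candidates."""
    low, high = wilson_interval(confirmed, checked)
    return candidate_count * low, candidate_count * high
```

For example, if 100 sampled hits are checked and 50 are confirmed, `wilson_interval(50, 100)` returns bounds around 0.5, and `estimate_total` scales them to the number of candidates found in the whole corpus. The estimate only covers candidates the automatic search actually retrieved; instances it missed would need a separate recall check.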
Aligned multilingual parallel corpus data are suitable for the LAILA method in two respects: as a direct data source for contrastive studies on IS, and as a means of improving the training basis for monolingual tools. Analysis information available for one language can be transferred with the annotation projection technique (Yarowsky et al. 2001) and used as an auxiliary resource for other languages.
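As a rough illustration of annotation projection, the sketch below copies token-level labels from a source sentence to a target sentence along word-alignment links. The data layout (alignment as index pairs), the function name `project_labels` and the example labels are assumptions made for illustration; they are not taken from the project or from Yarowsky et al. (2001).

```python
from collections import Counter


def project_labels(source_labels, alignment, target_length, default="O"):
    """Project token labels from source to target via word alignment.

    source_labels: list of labels, one per source token
    alignment:     iterable of (source_index, target_index) pairs
    target_length: number of tokens in the target sentence
    default:       label for target tokens without any alignment link
    """
    votes = [Counter() for _ in range(target_length)]
    for src, tgt in alignment:
        # Ignore links that point outside either sentence.
        if 0 <= src < len(source_labels) and 0 <= tgt < target_length:
            votes[tgt][source_labels[src]] += 1
    # Majority vote per target token; unaligned tokens get the default label.
    return [v.most_common(1)[0][0] if v else default for v in votes]


# Hypothetical example: labels on a three-token source sentence are
# projected onto a three-token target sentence.
source_labels = ["TOPIC", "O", "FOCUS"]
links = [(0, 0), (2, 1)]  # target token 2 has no alignment link
projected = project_labels(source_labels, links, target_length=3)
```

Projected labels of this kind are noisy, because alignments are imperfect and the languages may realize IS differently, so in practice they would serve as auxiliary training material that still needs manual checking rather than as finished annotation.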
In the coming SFB phase, D4 focuses on three exemplary application scenarios. (1) For German, the search for individual instances and the frequency analysis of the most important grammatical means of realizing IS are to be supported. In cooperation with A1, corpus-based IS factors that influence the placement of relative clauses in German (in the middle field vs. extraposed) are investigated. Project C1 aims to validate the technology on the C1 newspaper corpus with regard to the occupation of the prefield (Vorfeld) by objects. (2) On the basis of the Europarl corpus (Koehn 2002), which contains translations of the EU parliamentary debates in 11 (or 20) languages, tool support for contrastive IS analysis is to be provided; in cooperation with D2, it can then be used for investigations into micro-variation, particularly in topicalization and cleft constructions. (3) Hindi serves as an example of a language for which few analysis tools are available. Together with C5, and using an English-Hindi parallel corpus and tools for English, a frequency analysis of IS-relevant categories and context parameters for Hindi is to be carried out.
- Prof. Dr. Jonas Kuhn
Former Staff Members
- Dr. Gerlof Bouma
- Dr. Lilja Øvrelid
- Dr. Bettina Schrader
- Dr. Kathrin Spreyer