In the last 15 years, the heterogeneity of linguistic annotations has been identified as a key problem in NLP and corpus linguistics. Data produced by different tools for automatic or manual annotation comes in different, often conceptually incompatible formats, and even if these formats can be aligned, the different annotation schemes applied by different tools limit the interoperability and reusability of NLP tools and linguistic data collections.
I've worked on both problems in two projects funded by the German Research Foundation (DFG):
To deal with heterogeneous annotations, I developed an integrative framework for handling linguistic resources whose annotations use heterogeneous terminology. This framework is based on an OWL/DL formalization of annotation schemes and terminology repositories in the OLiA architecture (Ontologies of Linguistic Annotation). From 2006 to 2008, this research was conducted in the context of the project C2 “Sustainability of Linguistic Data”, Collaborative Research Center (SFB) 441 “Linguistic Data Structures” (U Tübingen), and since 2007 it has been continued in the project D1 “Linguistic Database”, Collaborative Research Center (SFB) 632 “Information Structure” (U Potsdam).
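The idea of linking scheme-specific tags to a shared terminology can be illustrated with a small Python sketch. It is not the OLiA implementation, which uses OWL/DL ontologies; the dictionaries, the concept names and the helper functions `ancestors`, `is_a` and `same_concept` are assumptions made for illustration only. The tags themselves come from the Penn Treebank and STTS part-of-speech tagsets.

```python
# Illustrative stand-in for a reference terminology: concept -> superconcept.
# Concept names are assumptions chosen for this sketch.
SUPERCLASS = {"CommonNoun": "Noun", "ProperNoun": "Noun", "Noun": None}

# Illustrative linking of scheme-specific tags to reference concepts.
TAG_TO_CONCEPT = {
    ("penn", "NN"): "CommonNoun",
    ("penn", "NNP"): "ProperNoun",
    ("stts", "NN"): "CommonNoun",
    ("stts", "NE"): "ProperNoun",
}


def ancestors(concept):
    """Return the concept followed by all of its superconcepts."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = SUPERCLASS[concept]
    return chain


def is_a(scheme, tag, concept):
    """True if the tag of the given scheme is linked to the concept or a subconcept of it."""
    return concept in ancestors(TAG_TO_CONCEPT[(scheme, tag)])


def same_concept(tag_a, tag_b):
    """True if two (scheme, tag) pairs are linked to the same reference concept."""
    return TAG_TO_CONCEPT[tag_a] == TAG_TO_CONCEPT[tag_b]


# A query for "Noun" retrieves tags from both schemes without naming either tagset.
nouns = sorted(key for key in TAG_TO_CONCEPT if is_a(key[0], key[1], "Noun"))
```

With such a linking, a query can be phrased in terms of the shared concepts and evaluated against resources annotated with either tagset.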
In the context of the project D1 “Linguistic Database”, Collaborative Research Center (SFB) 632 “Information Structure” (U Potsdam), I develop and maintain the generic data model PAULA and its XML linearization PAULA XML. PAULA XML is a LAF-inspired standoff format that allows for the lossless representation of any text-oriented linguistic annotation. Like GrAF, it is based on labelled directed acyclic graphs (LDAGs), but it is more specific than GrAF in that it imposes additional constraints on linguistic data structures by distinguishing dominance relations from pointing relations. These constraints form the basis for the data model of the linguistic database ANNIS, for the generic visualizations implemented in ANNIS, and for the ANNIS query language.
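The distinction between dominance and pointing relations in a standoff graph can be sketched in Python as follows. This is a simplified in-memory illustration, not the PAULA XML format or the ANNIS data model: the class `AnnotationGraph`, its method names, and the decision to check only dominance edges for cycles are assumptions of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationGraph:
    """Hypothetical standoff graph: nodes refer to the text by character offsets."""
    text: str
    spans: dict = field(default_factory=dict)      # token id -> (start, end)
    labels: dict = field(default_factory=dict)     # node id -> annotation labels
    dominance: dict = field(default_factory=dict)  # parent id -> list of child ids
    pointing: list = field(default_factory=list)   # (source id, target id, label)

    def add_token(self, node_id, start, end, **labels):
        self.spans[node_id] = (start, end)
        self.labels[node_id] = labels

    def add_dominance(self, parent, child):
        """Add a dominance edge, rejecting any edge that would create a cycle."""
        if parent == child or self._dominates(child, parent):
            raise ValueError("dominance edges must stay acyclic")
        self.dominance.setdefault(parent, []).append(child)

    def add_pointing(self, source, target, label):
        """Add a pointing edge, e.g. an anaphoric link; it implies no text coverage."""
        self.pointing.append((source, target, label))

    def _dominates(self, ancestor, node):
        stack, seen = list(self.dominance.get(ancestor, [])), set()
        while stack:
            current = stack.pop()
            if current == node:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(self.dominance.get(current, []))
        return False

    def covered_text(self, node_id):
        """Text covered by a node, derived from the tokens it dominates."""
        stack, found = [node_id], []
        while stack:
            current = stack.pop()
            if current in self.spans:
                found.append(self.spans[current])
            stack.extend(self.dominance.get(current, []))
        if not found:
            return ""
        return self.text[min(s for s, _ in found):max(e for _, e in found)]


g = AnnotationGraph("Mary saw her")
g.add_token("t1", 0, 4, pos="NNP")
g.add_token("t2", 5, 8, pos="VBD")
g.add_token("t3", 9, 12, pos="PRP")
for parent, child in [("np1", "t1"), ("np2", "t3"), ("vp", "t2"), ("vp", "np2"),
                      ("s", "np1"), ("s", "vp")]:
    g.add_dominance(parent, child)
g.add_pointing("np2", "np1", "anaphoric")
print(g.covered_text("vp"))  # prints "saw her"
```

In this sketch, text coverage is inherited along dominance edges only, so a visualization or query can treat the two edge types differently: dominance edges yield tree-like structures over the text, while pointing edges connect nodes without affecting what text they cover.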
My interest in heterogeneous annotations originally arose from my specific interest in discourse phenomena, e.g., anaphora, discourse structure and information structure. Both the empirical study and the computational modelling of these phenomena require multiple, heterogeneous annotations to be considered together.
I am particularly interested in potential overlaps between functional/cognitive linguistics and artificial intelligence/computational linguistics, whose synergies have, in my view, not yet been explored in depth and therefore represent a potential source of novel approaches to discourse and NLP.
In my PhD project, I have developed a theoretical and computational model of salience and applied it to natural language generation (NLG). In particular, I focus on the salience-based modelling of coding preferences for the assignment of grammatical roles, word order and the form of referring expressions. A crucial finding is the identification of two independent dimensions of salience, which are associated with different perspectives on the text and have different temporal scope. Each dimension is formalized as a weighted sum of contextual factors, and the resulting scores are then associated with certain coding preferences. This parameterized framework allows for the reconstruction of several existing theories, such as different instantiations of Centering Theory, Givón’s notion of topicality and Eva Hajičová’s model of salience, and is thus adequate with respect to them. Additionally, means for the data-driven adjustment of parameter weights are developed using established learning algorithms (backpropagation).
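The weighted-sum formulation can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the model itself: the dimension names `dim_a` and `dim_b`, the factor names, the weights and the threshold rule in `referring_form` are all hypothetical placeholders.

```python
def salience(factors, weights):
    """Weighted sum of the contextual factor values that the weights mention."""
    return sum(weight * factors[name] for name, weight in weights.items())


# Hypothetical weights for two hypothetical salience dimensions.
WEIGHTS = {
    "dim_a": {"recency": 0.6, "subjecthood": 0.4},
    "dim_b": {"frequency": 0.7, "first_mention": 0.3},
}


def score(referent_factors):
    """Compute one score per dimension for a single discourse referent."""
    return {dim: salience(referent_factors, w) for dim, w in WEIGHTS.items()}


def referring_form(dim_a_score, threshold=0.5):
    """Illustrative coding preference: highly salient referents are pronominalized."""
    return "pronoun" if dim_a_score >= threshold else "full NP"


# Factor values for two referents; the numbers are made up for the example.
mary = {"recency": 1.0, "subjecthood": 1.0, "frequency": 0.5, "first_mention": 1.0}
book = {"recency": 0.5, "subjecthood": 0.0, "frequency": 0.25, "first_mention": 0.0}

for name, factors in [("mary", mary), ("book", book)]:
    scores = score(factors)
    print(name, scores, referring_form(scores["dim_a"]))
```

Because each dimension is a plain weighted sum, changing the weights is enough to move from one parameterization to another, and the weights can in principle be fitted to annotated data instead of being set by hand.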
I am affiliated with the DFG-funded Collaborative Research Center (SFB) 632 “Information Structure” and am therefore involved in a number of cooperations between SFB projects, including the anaphoric, information-structural and discourse-structural annotation of linguistic corpora (e.g., with C3) and the creation of corpora of languages with specific information-structural features (e.g., a corpus of Hausa, in cooperation with the SFB projects A5, B2 and D4).