Project D1

3rd Phase - Linguistic Database for Information Structure: Annotation and Retrieval

The goals of Project D1 are:

Creating linguistic corpora with information-structural annotations, making them available to researchers, and evaluating them.
Further developing the software infrastructure for corpus search and data sustainability.
Advising empirically oriented projects within Collaborative Research Centre 632 on Information Structure and integrating their data.

In this phase, the project will concentrate on implementing the remaining search and visualization features, processing larger sets of (semi-)automatically annotated data, and creating automatic annotation tools relevant to studies of information structure. These tools should enable both the acquisition of new data and the further annotation of existing data from other projects.

The annotated data made available through our database will be analyzed quantitatively to determine how the underlying features interact in distinguishing language-specific information-structural categories. The goal is to describe empirically the correlations between annotation levels, which may be based on gradient or discrete categories.
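As an illustration of what a correlation between two discrete annotation levels can look like in practice, the following minimal sketch computes Cramér's V from paired labels. It is not taken from the project's tools: the function name `cramers_v`, the level names, and the toy labels are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def cramers_v(pairs):
    """Return Cramér's V for a list of (label_a, label_b) pairs.

    Each pair holds the labels that two annotation levels assign to the
    same unit. 0.0 means no association, 1.0 a perfect association.
    """
    n = len(pairs)
    joint = Counter(pairs)                 # observed joint counts
    row = Counter(a for a, _ in pairs)     # marginal counts, level A
    col = Counter(b for _, b in pairs)     # marginal counts, level B

    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Hypothetical toy data: definiteness paired with information status.
toy = [("def", "given")] * 5 + [("indef", "new")] * 5
print(cramers_v(toy))
```

For gradient categories a rank correlation would be the more natural choice; the sketch covers only the discrete case.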

To complement the direct annotation of information structure categories practiced in phase 2, we intend to make use of more superficial features that influence the assignment of IS categories. We will concentrate on reliably operationalizable categories such as definiteness, discourse-newness, coreference, animacy, and topological fields (for German). As part of this effort, we will build a robust parser for topological fields in different varieties of German and part-of-speech taggers for various languages, which should substantially improve the data available to several research projects within the Centre.

Description Linguistic Database: ANNIS

2nd Phase - Linguistic Database for Information Structure: Annotation and Retrieval

In the first phase of the SFB, Project D1 designed and implemented the linguistic database ANNIS and actively supported data annotation in various sub-projects, both by helping to draft annotation guidelines and by evaluating annotation software and training staff. In addition, in the last year (after an additional half-time position was granted), work began on developing methods for the statistical analysis of multi-level annotations, applying them to the data annotated in the SFB, and integrating corresponding tools into ANNIS. There are four priorities for the continuation of the project in the second phase:

The architecture of multi-level annotations (MEA), as it is also implemented in ANNIS, has developed into a highly topical issue in corpus-oriented computational linguistics and corpus linguistics in recent years. Here it is important to participate in international developments in order to ensure that the valuable data resources created in the SFB are available in formats that enable them to be used worldwide. The research goals concern the further development of our representation format, the theoretical and application-related investigation of search query languages for MEA and the special requirements of quality assurance that arise with MEA.

The further development of the ANNIS software should focus on supporting users in data preparation, improving the search query language (see above), further integrating statistical evaluation modules, improving the visualization of MEA data, and finally the technical aspects of corpus management (XML versus relational database, user groups and rights, etc.).

Using methods of qualitative and quantitative data evaluation, we will, on the one hand, analyze the special conditions for acquiring knowledge from MEA data (including the integration of competing analyses and its consequences for research and statistical evaluation). On the other hand, further analysis tools are to be integrated into ANNIS. The focus here is on "annotation mining", i.e. the automatic search for patterns in the MEA data, the support of the annotation process through semi-automatic procedures, and finally the projection of information-structural annotation between corpora in different languages.
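To make the idea of "annotation mining" concrete, here is a minimal sketch that counts which label combinations from different annotation levels co-occur on the same token and keeps those above a support threshold. It is an illustrative assumption rather than the method implemented in ANNIS: the function `frequent_patterns`, the level names `pos` and `infstat`, and the toy tokens are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(tokens, min_support=2):
    """Return cross-level label pairs that occur on at least min_support tokens.

    tokens is a list of dicts mapping an annotation level name to the
    label that level assigns to the token.
    """
    counts = Counter()
    for annotations in tokens:
        # Sort so that each pair has a stable order regardless of dict order.
        items = sorted(annotations.items())
        for pair in combinations(items, 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Hypothetical toy tokens with two annotation levels each.
toy_tokens = [
    {"pos": "NN", "infstat": "new"},
    {"pos": "NN", "infstat": "new"},
    {"pos": "PRON", "infstat": "given"},
]
for pattern, count in frequent_patterns(toy_tokens).items():
    print(pattern, count)
```

A realistic setting would also need patterns that span several tokens or hierarchical annotations, which this token-level sketch does not cover.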

Supporting the sub-projects that work with empirical data will remain the central service of D1 (support of the annotation processes, data preparation, data evaluation, etc.), with special attention paid to quality assurance and the evaluation of the annotations.


1st Phase - Linguistic Database for Information Structure: Annotation and Retrieval

Project D1 provides the technical infrastructure for building, maintaining and retrieving the linguistic data collected by the SFB. D1 supports the individual projects in choosing suitable software and hardware for their task of data annotation. Moreover, D1 will develop a user-friendly query language and provide world-wide access to the database via the internet.

In addition, in close collaboration with project D2, D1 will develop a consistent annotation format, based on existing annotation standards such as CES or TUSNELDA. The focus will be on the systematic integration of information-structural features in existing annotation schemata.

The database will represent the first collection of data from typologically diverse languages that are annotated with information-structural features according to standardized criteria. A sophisticated query tool will make this resource available to the international research community.


Principal Investigators

Prof. Dr. Manfred Stede
Prof. Dr. Anke Lüdeling

Former Staff Members

Dr. Christian Chiarcos
Dr. Stefanie Dipper
Dr. Amir Zeldes
Arne Neumann
Florian Zipser
Michael Götze
Julia Ritz

Student Assistants

Jakob Schmolling
Martin Klotz
Melanie Tosik

Activities

June 2006	Poster	Dipper, S., Donhauser, K., Götze, M., Hinterhölzl, R., Petrova, S., Ritz, J., Solf, M. & Stede, M.: Information Structure and Word Order in the Early Germanic Languages and its Analysis in a Linguistic Database.	Poster presentation at the SFB-Conference „Information Structure between Linguistic Theory and Empirical Methods“, Potsdam.
June 2006	Lecture	Dipper, S.: From Data to Insights: Exploiting Linguistic Data.	SFB-Conference „Information Structure between Linguistic Theory and Empirical Methods“, Potsdam.
February 2006	Lecture	Dipper, S., Götze, M. & Skopeteas, S.: A Typological Language Archive for Researching Information Structure.	Bielefeld University: 28. Jahrestagung der DGfS.
Winter 2005	Seminar	Dipper, S. & Götze, M.: Korpora und Sprachressourcen in der (Computer-)linguistik.	University of Potsdam: Proseminar/Hauptseminar held at the Institut für Linguistik, University of Potsdam.
October 2005	Invited Talk	Poesio, M.: Discourse Structure and Anaphora: an Empirical Analysis of Two Theories of the Global Focus.	University of Potsdam.
September 2005	Lecture	Dipper, S.: XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation.	Conference - Berlin: Berliner XML Tage 2005 (BXML 2005).
September 2005	Lecture	Götze, M., Roloff, T., Skopeteas, S. & Stoel, R.: Exploring a cross-linguistic production data corpus.	Conference - Batumi, Georgia: 6th International Tbilisi Symposium on Language, Logic and Computation.
September 2005	Lecture	Götze, M., Roloff, T., Skopeteas, S. & Stoel, R.: Towards an infrastructure for exploring a cross-linguistic production data corpus.	Conference - Batumi, Georgia: 6th International Tbilisi Symposium on Language, Logic and Computation.
September 2005	Invited Talk	Trippel, T.: Standardisierung von Sprachressourcen: der aktuelle Stand.	Lecture - University of Potsdam.
June 2005	Invited Talk	Frank, A.: Frame Semantics in SALSA.	Lecture - University of Potsdam.
April 2005	Lecture	Dipper, S. & Götze, M.: Accessing Heterogeneous Linguistic Data -- Generic XML-based Representation and Flexible Visualization.	Conference - Poznan: 2nd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics.
April 2005	Poster	Dipper, S., Götze, M. & Stede, M.: ANNIS - a Linguistic Database for Complex Multilevel Annotation.	Conference - Poznan: 2nd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics.
February 2005	Poster	Dipper, S., Götze, M. & Stede, M.: ANNIS - a Linguistic Database for Complex Multilevel Annotation.	Conference - Köln: 27. Jahrestagung der DGfS, Universität zu Köln.
January 2005	Invited Talk	Reitter, D.: FASiL: Ein multimodales Dialogsystem für E-Mails.	Lecture - University of Potsdam.
November 2004	Invited Talk	Strube, M.: Von symbolischen zu empirischen Ansätzen zur Anaphernresolution.	Lecture - University of Potsdam.
November 2004	Invited Talk	Milde, J-T.: Aufbau und Nutzung multimodaler Sprachkorpora.	Lecture - University of Potsdam.
Winter 2004	Seminar	Dipper, S. & Hanneforth, T.: XML in der Computerlinguistik.	Lecture - University of Potsdam: Proseminar held at the Institut für Linguistik, University of Potsdam.
October 2004	Seminar	Dipper, S. & Götze, M.: SFB Tutorial for ANNIS and ANNOTATE.	Lecture - University of Potsdam.
August 2004	Lecture	Dipper, S., Götze, M. & Skopeteas, S.: Towards User-Adaptive Annotation Guidelines.	Workshop - Geneva: COLING Workshop on Linguistically Interpreted Corpora LINC-2004.
July 2004	Lecture	Dipper, S. & Götze, M.: ANNIS – eine Linguistische Datenbank für Informationsstruktur.	Workshop - Potsdam: D1-Workshop Heterogeneity in Linguistic Databases.
July 2004	Workshop	Dipper, S.: Heterogeneity in Linguistic Databases.	Workshop - Potsdam.
May 2004	Lecture	Dipper, S., Götze, M. & Stede, M.: Simple Annotation Tools for Complex Annotation Tasks: an Evaluation.	Workshop - Lisbon: LREC Workshop on XML-based Richly Annotated Corpora.
March 2004	Seminar	Dipper, S.: SFB Tutorial for Annotation Tools.	Lecture - University of Potsdam.
December 2003	Invited Talk	Kruijff-Korbayová, I.: The MULI Project – Multilingual Information Structure.	Colloquium - University of Potsdam.
November 2003	Invited Talk	Dipper, S.: Das TIGER Korpus: Annotation und Exploration.	Research Colloquium Korpuslinguistik, Humboldt-University.