Project D1

3rd Phase - Linguistic Database for Information Structure: Annotation and Retrieval

The goals of Project D1 are:

  • Creating linguistic corpora with information-structural annotations, making them available to researchers, and evaluating them.
  • Further development of the software infrastructure for corpus search and data sustainability.
  • Consulting for, and integrating data from, the empirically oriented projects within Collaborative Research Centre 632 on Information Structure.

In this phase, the project will concentrate on implementing several remaining search and visualization options, on processing larger (semi-)automatically annotated data sets, and on creating automatic annotation tools relevant to studies of information structure. These tools should enable both the acquisition of new data and the further annotation of existing data from other projects.

The annotated data made available through our database will be analyzed quantitatively with regard to the interaction between the underlying features that are responsible for distinguishing language-specific information-structural categories. The goal is to describe empirically the correlations between annotation levels, which can be based on gradual or discrete categories.
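
As an illustration of what a correlation between two discrete annotation levels could look like in practice, the following minimal sketch computes Cramér's V from paired labels. It is not taken from the project's tools: the function name `cramers_v` and the toy labels (definiteness and information status) are assumptions made for this example, and gradual categories would need a different measure.

```python
from collections import Counter
from math import sqrt


def cramers_v(pairs):
    """Cramér's V for a list of (label_a, label_b) pairs.

    Hypothetical helper: each pair holds the labels that two annotation
    levels assign to the same unit (e.g. the same noun phrase).
    Returns a value between 0 (no association) and 1 (perfect association).
    """
    n = len(pairs)
    joint = Counter(pairs)                 # observed counts per label pair
    row = Counter(a for a, _ in pairs)     # marginal counts, level A
    col = Counter(b for _, b in pairs)     # marginal counts, level B

    # Pearson chi-square statistic over the full contingency table.
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy data (invented): definiteness paired with information status.
toy = [("def", "given")] * 5 + [("indef", "new")] * 5
print(cramers_v(toy))
```

In the toy data the two levels always agree, so the measure reaches its maximum; real annotations would typically yield values somewhere in between.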

In a complementary effort to the direct annotation of information structure categories practiced in phase 2, we intend to make use of more superficial features which influence the assignment of IS categories. Our target is to concentrate on reliably operationalizable categories such as definiteness, discourse-newness, coreference, animacy, and topological fields (for German). As part of this effort, we will build a robust parser for topological fields in different varieties of German and part-of-speech taggers for various languages, which should dramatically improve the state of the data available to several research projects within the Centre.

Description Linguistic Database: ANNIS

2nd Phase - Linguistic Database for Information Structure: Annotation and Retrieval

In the first phase of the SFB, Project D1 designed and implemented the linguistic database ANNIS and actively supported data annotation in various sub-projects, both by helping to draft annotation guidelines and by evaluating annotation software and training staff. In addition, in the last year (after an additional half-time position had been granted), work began on developing methods for the statistical analysis of multi-level annotations, applying them to the data annotated in the SFB, and integrating corresponding tools into ANNIS. There are four priorities for the continuation of the project in the second phase:

The architecture of multi-level annotations (MEA), as implemented in ANNIS, has become a highly topical issue in corpus linguistics and corpus-oriented computational linguistics in recent years. It is important to participate in international developments here, in order to ensure that the valuable data resources created in the SFB are available in formats that allow them to be used worldwide. The research goals concern the further development of our representation format, the theoretical and application-related investigation of search query languages for MEA, and the special quality assurance requirements that arise with MEA.

The further development of the ANNIS software should focus on supporting users in data preparation, improving the search query language (see above), integrating further statistical evaluation modules, improving the visualization of MEA data, and finally the technical aspects of corpus management (XML versus relational database, user groups and rights, etc.).

Using methods of qualitative and quantitative data evaluation, we will, on the one hand, analyze the special conditions for acquiring knowledge from MEA data (including the integration of competing analyses and their consequences for research and statistical evaluation). On the other hand, further analysis tools are to be integrated into ANNIS. The focus here is on "annotation mining", i.e. the automatic search for patterns in the MEA data, the support of the annotation process through semi-automatic procedures, and finally the projection of information-structural annotation between corpora in different languages.
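
To make the idea of an automatic search for patterns in multi-level annotation data more concrete, here is a deliberately simple sketch that counts which labels from different annotation levels co-occur on the same token. It is not the method used in ANNIS: the function name `frequent_label_pairs`, the level names (`pos`, `infstat`) and the labels are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_label_pairs(tokens, min_count=2):
    """Count co-occurring (level, label) pairs across tokens.

    `tokens` is a list of dicts mapping an annotation level name to the
    label that level assigns to the token. Only pairs seen at least
    `min_count` times are returned.
    """
    counts = Counter()
    for token in tokens:
        # Sort so that each pair is always stored in the same order.
        items = sorted(token.items())
        for pair in combinations(items, 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Toy data (invented): part of speech and information status per token.
tokens = [
    {"pos": "NN", "infstat": "new"},
    {"pos": "NN", "infstat": "new"},
    {"pos": "PPER", "infstat": "giv"},
]
for pair, count in frequent_label_pairs(tokens).items():
    print(pair, count)
```

A frequency threshold such as `min_count` is only one possible filter; the association measures mentioned for the quantitative evaluation could be applied to the same counts.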

Supporting the sub-projects that work with empirical data (annotation processes, data preparation, data evaluation, etc.) will remain the central service of D1, with special attention paid to the quality assurance and evaluation of the annotations.

1st Phase - Linguistic Database for Information Structure: Annotation and Retrieval

Project D1 provides the technical infrastructure for building, maintaining and retrieving the linguistic data collected by the SFB. D1 supports the individual projects in choosing suitable software and hardware for their task of data annotation. Moreover, D1 will develop a user-friendly query language and provide world-wide access to the database via the internet.

In addition, in close collaboration with project D2, D1 will develop a consistent annotation format, based on existing annotation standards such as CES or TUSNELDA. The focus will be on the systematic integration of information-structural features in existing annotation schemata.

The database will be the first collection of data from typologically diverse languages annotated with information-structural features according to standardized criteria. A sophisticated query tool will make this resource available to the international research community.

Principal Investigators

  • Prof. Dr. Manfred Stede
  • Prof. Dr. Anke Lüdeling

Former Staff Members

  • Dr. Christian Chiarcos
  • Dr. Stefanie Dipper
  • Dr. Amir Zeldes
  • Arne Neumann
  • Florian Zipser
  • Michael Götze
  • Julia Ritz

Student Assistants

  • Jakob Schmolling
  • Martin Klotz
  • Melanie Tosik

Activities

June 2006 Poster Dipper, S., Donhauser, K., Götze, M., Hinterhölzl, R., Petrova, S., Ritz, J., Solf, M. & Stede, M.: Information Structure and Word Order in the Early Germanic Languages and its Analysis in a Linguistic Database. Poster presentation at the SFB Conference "Information Structure between Linguistic Theory and Empirical Methods", Potsdam.
June 2006 Lecture Dipper, S.: From Data to Insights: Exploiting Linguistic Data. SFB Conference "Information Structure between Linguistic Theory and Empirical Methods", Potsdam.
February 2006 Lecture Dipper, S., Götze, M. & Skopeteas, S.: A Typological Language Archive for Researching Information Structure. Bielefeld University: 28. Jahrestagung der DGfS.
Winter 2005 Seminar Dipper, S. & Götze, M.: Korpora und Sprachressourcen in der (Computer-)linguistik. Proseminar/advanced seminar held at the Institute of Linguistics, University of Potsdam.
October 2005 Invited Talk Poesio, M.: Discourse Structure and Anaphora: an Empirical Analysis of Two Theories of the Global Focus. University of Potsdam.
September 2005 Lecture Dipper, S.: XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation. Conference - Berlin: Berliner XML Tage 2005 (BXML 2005).
September 2005 Lecture Götze, M., Roloff, T., Skopeteas, S. & Stoel, R.: Exploring a cross-linguistic production data corpus. Conference - Batumi, Georgia: 6th International Tbilisi Symposium on Language, Logic and Computation.
September 2005 Lecture Götze, M., Roloff, T., Skopeteas, S. & Stoel, R.: Towards an infrastructure for exploring a cross-linguistic production data corpus. Conference - Batumi, Georgia: 6th International Tbilisi Symposium on Language, Logic and Computation.
September 2005 Invited Talk Trippel, T.: Standardisierung von Sprachressourcen: der aktuelle Stand. Lecture - University of Potsdam.
June 2005 Invited Talk Frank, A.: Frame Semantics in SALSA. Lecture - University of Potsdam.
April 2005 Lecture Dipper, S. & Götze, M.: Accessing Heterogeneous Linguistic Data -- Generic XML-based Representation and Flexible Visualization. Conference - Poznan: 2nd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics.
April 2005 Poster Dipper, S., Götze, M. & Stede, M.: ANNIS - a Linguistic Database for Complex Multilevel Annotation. Conference - Poznan: 2nd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics.
February 2005 Poster Dipper, S., Götze, M. & Stede, M.: ANNIS - a Linguistic Database for Complex Multilevel Annotation. Conference - Cologne: 27. Jahrestagung der DGfS, University of Cologne.
January 2005 Invited Talk Reitter, D.: FASiL: Ein multimodales Dialogsystem für E-Mails. Lecture - University of Potsdam.
November 2004 Invited Talk Strube, M.: Von symbolischen zu empirischen Ansätzen zur Anaphernresolution. Lecture - University of Potsdam.
November 2004 Invited Talk Milde, J-T.: Aufbau und Nutzung multimodaler Sprachkorpora. Lecture - University of Potsdam.
Winter 2004 Seminar Dipper, S. & Hanneforth, T.: XML in der Computerlinguistik. Proseminar held at the Institute of Linguistics, University of Potsdam.
October 2004 Seminar Dipper, S. & Götze, M.: SFB Tutorial for ANNIS and ANNOTATE. Lecture - University of Potsdam.
August 2004 Lecture Dipper, S., Götze, M. & Skopeteas, S.: Towards User-Adaptive Annotation Guidelines. Workshop - Geneva: COLING Workshop on Linguistically Interpreted Corpora LINC-2004.
July 2004 Lecture Dipper, S. & Götze, M.: ANNIS – eine Linguistische Datenbank für Informationsstruktur. Workshop - Potsdam: D1 Workshop "Heterogeneity in Linguistic Databases".
July 2004 Workshop Dipper, S.: Heterogeneity in Linguistic Databases. Workshop - Potsdam.
May 2004 Lecture Dipper, S., Götze, M. & Stede, M.: Simple Annotation Tools for Complex Annotation Tasks: an Evaluation. Workshop - Lisbon: LREC Workshop on XML-based Richly Annotated Corpora.
March 2004 Seminar Dipper, S.: SFB Tutorial for Annotation Tools. Lecture - University of Potsdam.
December 2003 Invited Talk Kruijff-Korbayová, I.: The MULI Project – Multilingual Information Structure. Colloquium - University of Potsdam.
November 2003 Invited Talk Dipper, S.: Das TIGER Korpus: Annotation und Exploration. Research Colloquium Korpuslinguistik, Humboldt University.