D4 (Methods for interactive linguistic corpus analysis of information structure)

Large Parallel Corpus of Cleft Constructions

  • modality: Written, partly translated. Parallel, sentence-aligned.
  • formats: German-Dutch accessible through the CQP interface. Data and queries accessible in Prolog format.
  • source data: Retokenization of Europarl v3. Cleft(-like) constructions automatically identified as described in Bouma, Gerlof, Lilja Øvrelid & Jonas Kuhn. 2010. Towards a Large Parallel Corpus of Cleft Constructions. Proceedings of LREC 2010.
  • languages: Dutch (nl), German (de), English (en), Swedish (sv).
  • subcorpora / versions:
    • Four language strata: up to 1.5M sentences in each, spread over 11 years of European Parliament minutes (1996-2006).