corpus2008

Klyueva N., Bojar O.

UMC 0.1: Czech-Russian-English Multilingual Corpus

In this paper we present the initial stage of building a Czech-Russian-English multilingual corpus (UMC). All texts are downloaded automatically from a single source, Project Syndicate, a collection of newspaper articles and commentaries. In the next stage of our research, texts from other sources will be included.

We describe our new tokenizer, a tool that can be easily configured and trained to perform tokenization and sentence segmentation for various languages and tokenization schemes. Finally, we apply automatic sentence alignment techniques to the Czech-Russian and Russian-English language pairs. The corpus now contains over 1.7 million running words in each of the three languages, which is sufficient for preliminary experiments with statistical phrase-based machine translation; it can also be useful in other NLP applications and in linguistic research.
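
To make the idea of a configurable tokenizer and sentence segmenter concrete, here is a minimal sketch. It is not the UMC tokenizer: it is purely rule-based rather than trained, and the `CONFIG` table, its abbreviation lists, and the function names `tokenize` and `split_sentences` are all hypothetical, chosen only for illustration.

```python
import re

# Hypothetical per-language configuration (an assumption, not taken from UMC):
# abbreviations whose trailing period should not end a sentence.
CONFIG = {
    "en": {"abbreviations": {"mr.", "dr.", "prof."}},
    "cs": {"abbreviations": {"např.", "tzv.", "př."}},
    "ru": {"abbreviations": {"г.", "им.", "ул."}},
}

# A token is either a run of word characters, optionally joined by internal
# hyphens or apostrophes, or a single punctuation/symbol character.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(text, lang):
    """Split text into sentences at '.', '!' or '?', skipping known abbreviations."""
    abbreviations = CONFIG[lang]["abbreviations"]
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        # End the sentence when the chunk ends in terminal punctuation
        # and is not a configured abbreviation for this language.
        if chunk[-1] in ".!?" and chunk.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that lacks terminal punctuation.
        sentences.append(" ".join(current))
    return sentences
```

Supporting a new language in this sketch only requires a new `CONFIG` entry. A trained segmenter, as described in the abstract, would instead learn from annotated data where sentence boundaries fall.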


Corpus Linguistics 2008
