Klyueva N., Bojar O.
UMC 0.1: Czech-Russian-English Multilingual Corpus
In this paper we present the initial stage of building a Czech-Russian-English multilingual corpus (UMC). All texts are downloaded automatically from a single source (Project Syndicate, a collection of newspaper articles and commentaries). In the next stage of our research, texts from other sources will be included.
We describe our new tokenizer, a tool that can be easily configured and trained to perform tokenization and sentence segmentation for various languages and tokenization schemes. Finally, we apply automatic sentence alignment techniques to the Czech-Russian and Russian-English language pairs. The corpus now contains over 1.7 million running words in each of the three languages, which is sufficient for preliminary experiments with statistical phrase-based machine translation; it can also be useful in other NLP applications and in linguistic research.
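To illustrate what configurable tokenization and sentence segmentation can look like, here is a minimal rule-based sketch in Python. It is not the tokenizer described in the paper and involves no training; the names `TOKEN_RE`, `ABBREVIATIONS`, `tokenize`, and `split_sentences` are hypothetical, and the abbreviation list stands in for the kind of per-language configuration such a tool might accept.

```python
import re

# Hypothetical per-language configuration: tokens ending in a period
# that should NOT be treated as the end of a sentence.
ABBREVIATIONS = {"e.g.", "i.e.", "Dr.", "Mr."}

# A word (optionally with internal hyphens or apostrophes) or a single
# punctuation character. In Python 3, \w matches Unicode letters, so
# Czech and Russian text is handled as well.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences at '.', '!' or '?' unless the
    whitespace-delimited chunk is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences
```

Swapping in a different abbreviation set or token pattern is the simplest way to adapt such a sketch to another language or tokenization scheme; a trainable tool would learn these decisions from data instead.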