CORPUS LINGUISTICS - 2008

 
Krizhanovsky A.

Experiments on Corpus Index Generated from Wikipedia

With the rapid growth of Internet usage, searching for information in "wiki pages", documents written in a simple markup language, has become an important problem. Software for indexing wiki texts in three languages (Russian, English, and German) was developed. Two index databases, one of Russian Wikipedia (RW) and one of Simple English Wikipedia (SEW), were built and compared. RW is an order of magnitude larger than SEW in both number of words and number of lexemes. However, over a period of five months (from September 2007 to February 2008), the number of pages grew 12% faster in SEW than in RW, and the SEW lexicon acquired new words at a rate 6% higher. The entire source code of the indexing software and the generated index databases are freely available under the GNU General Public License (GPL).
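
As an illustration only, the general idea of indexing wiki text can be sketched as follows. This is not the software described in the abstract: the function names `strip_wiki_markup` and `build_index` are hypothetical, only a few markup constructs (links, bold/italic quotes, headings) are handled, and plain lowercased word forms stand in for lexemes because no lemmatization is performed.

```python
import re
from collections import defaultdict


def strip_wiki_markup(text):
    """Remove a small subset of wiki markup (assumption: links, quotes, headings only)."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Runs of two or more apostrophes mark bold/italic text; drop them.
    text = re.sub(r"'{2,}", "", text)
    # == Heading == -> Heading
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    return text


def build_index(pages):
    """Map each lowercased word form to the set of page titles containing it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for word in re.findall(r"\w+", strip_wiki_markup(text).lower()):
            index[word].add(title)
    return index
```

Calling `build_index` on a dictionary that maps page titles to raw wiki text returns a word-to-pages mapping. The number of keys in that mapping gives a rough vocabulary size, which could be compared between two snapshots of a wiki to estimate how quickly new words appear.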
