Methods

General Strategy
Collecting Base Concepts
Definition Analysis
Derivational Analysis
Context Analysis

General Strategy

In our work we adopted the merge approach to building the Russian wordnet: we start from the language-internal structure of RussNet, then coordinate it with the EuroWordNet Top Ontology and link it to the Inter-Lingual-Index (ILI).

RussNet is by no means a clone or a translation of Princeton WordNet or any other similar resource, although most of the RussNet methodology follows the established tradition of wordnet construction and draws on the experience of WordNet and EuroWordNet.

Collecting Base Concepts

The usual starting point for building a wordnet is a list of Base Concepts (BCs), i.e. general word meanings on which more specific meanings depend and which are used most frequently.

Within EuroWordNet, the following formal criteria for identifying BCs were postulated:

frequency in texts;
number of relations;
position in existing hierarchies (ontologies, thesauri etc.).

The criteria and procedures we used in RussNet differ slightly from those specified in EuroWordNet:

In selecting Russian BCs, we started with the most frequent words. Words with a relative frequency of at least 120 ipm were picked out from Frequency Lists for Russian and Text Corpora.
In addition, words belonging to the so-called “core of the national mental lexicon” (ядро языкового сознания) were extracted from the Russian Word Association Thesaurus and added. The resulting list of words included:
- 460 nouns;
- 226 verbs;
- 170 adjectives;
- 100 adverbs.
We had to take into account that the more frequent a word is, the more senses it tends to have. Therefore, at the next stage we examined the set of senses of each word and selected the most frequent ones. For that purpose we used Text Corpora and the data presented in Word Association Norms, relying on the fact that about 90% of the occurrences of a word in a corpus (or of the responses elicited by a word in the Word Association Thesaurus, WAT) are associated with one or two of its senses [Hanks 2000; Ovchinnikova, Stern 1989]. These most frequent senses of the most frequently used words constituted the Preliminary List of Russian BCs.
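
The selection procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the data layout (a word-to-ipm mapping and a per-sense occurrence count) and the cap of two senses per word are ours and hypothetical; only the 120 ipm threshold and the 90% figure come from the text.

```python
from typing import Dict, Iterable, List, Set

IPM_THRESHOLD = 120  # relative frequency threshold stated in the text


def candidate_words(ipm: Dict[str, float], core_lexicon: Iterable[str],
                    threshold: float = IPM_THRESHOLD) -> Set[str]:
    """Words at or above the threshold, plus the association-core words."""
    frequent = {word for word, freq in ipm.items() if freq >= threshold}
    return frequent | set(core_lexicon)


def dominant_senses(sense_counts: Dict[str, int], coverage: float = 0.9,
                    max_senses: int = 2) -> List[str]:
    """Take the most frequent senses until `coverage` of all occurrences
    is reached or `max_senses` senses have been taken (both assumptions)."""
    total = sum(sense_counts.values())
    chosen: List[str] = []
    covered = 0
    ranked = sorted(sense_counts.items(), key=lambda kv: kv[1], reverse=True)
    for sense, count in ranked:
        if len(chosen) >= max_senses or (total and covered / total >= coverage):
            break
        chosen.append(sense)
        covered += count
    return chosen


# Toy data with made-up frequencies and sense counts, for illustration only.
words = candidate_words({"идти": 800.0, "рукав": 45.0}, core_lexicon=["рукав"])
senses = dominant_senses({"идти1": 70, "идти2": 25, "идти3": 5})
```

In practice the sense counts would come from manually sense-tagged corpus samples or association responses, which this sketch takes as given.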

To define relations between words within and across semantic fields, we applied several methods of linguistic analysis: definition analysis, derivational analysis and context analysis.

Definition Analysis

The following guidelines allow (semi-)automatic processing of explanatory dictionaries in order to determine the semantic relations between literals and synsets.

If a word B appears as the genus in an ISA (genus proximum + differentia specifica) definition of a word A, we treat the synset {A,…} as a potential hyponym of the synset {B,…}.

Characteristic patterns of ISA definitions are “A - B + distinguishers” and “A - разновидность (вид, тип, ...) B” ('A is a variety (kind, type, ...) of B').

e.g., Идти1 - двигаться в определенном направлении, переступая ногами ('to walk: to move in a certain direction, stepping with one's feet'); therefore {идти1} may be a hyponym of {двигаться}.

If a word B appears in a HASA definition of a word A, we establish a holonymy/meronymy link between the synsets {A,…} and {B,…}.

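Characteristic patterns of HASA definitions are “A - часть (единица, компонент, раздел, частица, элемент, …) B” ('A is a part (unit, component, section, particle, element, …) of B') vs. “A - совокупность (множество, объединение, …) В-ов” ('A is a collection (set, union, …) of Bs').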

e.g., Рукав1 - часть одежды, покрывающая руку ('sleeve: the part of a garment that covers the arm'); therefore {рукав1} is a potential meronym of {одежда, платье2}.

e.g., Сеть4 - совокупность расположенных где-н. однородных учреждений, организаций ('network: a set of similar institutions or organisations located somewhere') [Ожегов, Шведова 1992]; therefore {сеть4} is a potential holonym of {организация3, учреждение, объединение2}.
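
The ISA and HASA patterns above lend themselves to a simple marker lookup. The sketch below is a hypothetical illustration, not the procedure actually used in RussNet: it inspects only the first word of a definition, knows only the nominative marker forms quoted above, and does no lemmatisation, so the word that follows a marker (usually in the genitive) is not extracted. All names are our own.

```python
import re
from typing import NamedTuple, Optional

# Marker words copied from the patterns quoted above.
ISA_MARKERS = ("разновидность", "вид", "тип")
MERONYM_MARKERS = ("часть", "единица", "компонент", "раздел", "частица", "элемент")
HOLONYM_MARKERS = ("совокупность", "множество", "объединение")


class Hypothesis(NamedTuple):
    relation: str            # relation of the headword to the word(s) in the definition
    marker: Optional[str]    # explicit marker word, if one was found
    genus: Optional[str]     # genus candidate for the plain "A - B + distinguishers" pattern


def classify_definition(definition: str) -> Hypothesis:
    """Guess the relation a dictionary definition suggests for its headword."""
    words = re.findall(r"\w+", definition.lower())
    if not words:
        return Hypothesis("unknown", None, None)
    first = words[0]
    if first in MERONYM_MARKERS:
        return Hypothesis("meronym", first, None)
    if first in HOLONYM_MARKERS:
        return Hypothesis("holonym", first, None)
    if first in ISA_MARKERS:
        return Hypothesis("hyponym", first, None)
    # Default: treat the first word of the definition as the genus.
    return Hypothesis("hyponym", None, first)


print(classify_definition("часть одежды, покрывающая руку"))
print(classify_definition("двигаться в определенном направлении, переступая ногами"))
```

Every result is a hypothesis to be checked by hand: marker words such as вид are themselves ambiguous, and the first word of a definition is not always its genus.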

If the definition of a sense of word A is identical to that of a sense of word B, we treat A as a potential synonym of B.

e.g., Тайна3 - скрытая причина чего-либо ('mystery: the hidden cause of something').

Секрет3 - скрытая причина чего-либо ('secret: the hidden cause of something'). [Ожегов, Шведова 1992]

In many cases a definition by a set of synonyms is similar to a “genus proximum” one: both may have the structure “A - это B” ('A is B') with no distinguisher specified. To tell them apart, we add two requirements to the definition analysis:
- If a sense of word A is defined by a set of similar words [B, C], B is defined by the set [A, C] and C by the set [A, B], we treat A, B and C as potential synonyms.
- If a sense of word A is defined by a set of similar words [B, C] and neither B nor C is defined by a set that includes A, we regard the synset {A,…} as a potential hyperonym of the synsets {B,…} and {C,…}.

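e.g., Грусть - чувство уныния, печали, горя (обычно из-за отсутствия кого-л., чего-л. родного, близкого, необходимого или из-за упущенной возможности сделать что-л.) ('sadness: a feeling of despondency, sorrow, grief, usually because of the absence of someone or something dear, close or needed, or because of a missed opportunity to do something')

Уныние - чувство безнадежности, печали, гнетущей тоски, возникающее вследствие несчастья, беды, обиды и т.п. ('despondency: a feeling of hopelessness, sorrow, oppressive anguish arising from misfortune, trouble, offence, etc.')

Печаль - чувство скорби, душевной горечи из-за чего-л., по поводу чего-л. ('sorrow: a feeling of grief, of inner bitterness because of or about something')

Горе - чувство глубокой печали, скорби о ком-л., чем-л., по кому-л., чему-л. ('grief: a feeling of deep sorrow, of mourning for someone or something') [Evgenjeva, 1985-88]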

Therefore {грусть} is a potential hyperonym of {уныние}, {печаль}, {горе}.
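
The two requirements for telling synonyms from hyperonyms can be expressed as a check on back-references between definitions. The following is a minimal sketch under our own assumptions: definitions are given as hand-lemmatised sets of the words they list, and the synonym test is relaxed to "every listed word mentions A in its own definition" rather than requiring exactly matching sets. The function name and the "undecided" outcome are hypothetical.

```python
from typing import Dict, Set


def relation_by_mutual_definition(word: str, defined_by: Dict[str, Set[str]]) -> str:
    """Classify `word` relative to the words listed in its definition."""
    definers = defined_by.get(word, set())
    if not definers:
        return "unknown"
    # For each listed word, does its own definition mention `word`?
    back_refs = [word in defined_by.get(d, set()) for d in definers]
    if all(back_refs):
        return "synonym"      # mutual definition: potential synonyms
    if not any(back_refs):
        return "hyperonym"    # no back-reference: potential hyperonym
    return "undecided"        # mixed evidence, left for manual inspection


# Lemmas extracted by hand from the four definitions quoted above.
DEFINED_BY = {
    "грусть": {"уныние", "печаль", "горе"},
    "уныние": {"безнадежность", "печаль", "тоска"},
    "печаль": {"скорбь", "горечь"},
    "горе": {"печаль", "скорбь"},
}

print(relation_by_mutual_definition("грусть", DEFINED_BY))
```

For the quoted definitions none of уныние, печаль or горе refers back to грусть, so the sketch yields the hyperonym hypothesis stated in the text.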

Negation is a characteristic feature of antonymous definitions.

e.g., Уродливый - некрасивый ('ugly: not beautiful'); thus {уродливый, безобразный} and {красивый} are considered potential antonyms.

In fact, however, explicit negation is very rare; more often we find words whose “negative” meaning is not declared so obviously:

e.g., Тьма - отсутствие света ('darkness: absence of light').

Although definition analysis is a very useful method, it has its limitations. It supplies the researcher with a number of hypotheses that should be verified by other means of linguistic analysis, such as derivational analysis and context analysis.

Derivational Analysis

This method of analysis is necessary when there is a wide range of derivational relations between words that belong to the same semantic field. In such cases the semantic nature of a word can be predicted from its morphological structure: some semantic components may receive their own formal representation and appear as separate morphemes. The senses of morphemes may help us to define the meanings of words, to clarify the differences between cognate words and, finally, to define the relations between them.

E.g., the prefixes при- and под- both have the sense of “adding to, putting to” in verbs such as присоединить - подсоединить ('to attach, connect'), примешать - подмешать ('to mix in'), приколоть - подколоть ('to pin on'); thus they regularly point to a relation of synonymy between the corresponding words.

Another regular means of derivation is the prefixes без-/бес- and не-, which link antonymous pairs such as платный - бесплатный ('paid' - 'free of charge') and внимание - невнимание ('attention' - 'inattention').
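
Both prefix regularities can be turned into a generator of candidate pairs over a word list. The sketch below is an assumption-laden illustration: it works on plain string prefixes rather than on a real morphological segmentation, so it will overgenerate on words that merely begin with the same letters, and every pair it returns is only a hypothesis. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

ANTONYM_PREFIXES = ("без", "бес", "не")


def synonym_candidates(words: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each при- word with its под- counterpart when both are attested."""
    vocab = set(words)
    pairs = []
    for word in sorted(vocab):
        if word.startswith("при"):
            twin = "под" + word[len("при"):]
            if twin in vocab:
                pairs.append((word, twin))
    return pairs


def antonym_candidates(words: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each без-/бес-/не- word with its unprefixed base when attested."""
    vocab = set(words)
    pairs = []
    for word in sorted(vocab):
        for prefix in ANTONYM_PREFIXES:
            base = word[len(prefix):]
            if word.startswith(prefix) and base in vocab:
                pairs.append((base, word))
    return pairs


# The word list is taken from the examples in the text.
WORDS = ["присоединить", "подсоединить", "примешать", "подмешать",
         "приколоть", "подколоть", "платный", "бесплатный",
         "внимание", "невнимание"]

print(synonym_candidates(WORDS))
print(antonym_candidates(WORDS))
```

Requiring both members of a pair to be attested in the vocabulary filters out many accidental matches, but the remaining pairs still need the semantic checks described in the other sections.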

Context Analysis

Pure substitution tests help us to identify relations of synonymy and hyponymy/hyperonymy when examining real contexts:
- If there is a context in which two words A and B can be interchanged without affecting its truth value, they are considered potentially synonymous.
- If there is a context in which word A can be replaced by word B without affecting the truth value, but not vice versa, {A,…} is treated as a potential hyponym of {B,…}.

Pure substitution is only one possible way of applying tests to verify a relation. A more general approach involves building test sentences for each relation (for more details see Relations).
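
The decision rule behind the two substitution tests can be written down compactly. In this sketch the judgments themselves are assumed to come from a human annotator: for each context we record whether replacing A by B preserves the truth value and whether replacing B by A does. The function name and the labels it returns are hypothetical.

```python
from typing import Iterable, Set, Tuple


def substitution_hypotheses(judgments: Iterable[Tuple[bool, bool]]) -> Set[str]:
    """Collect relation hypotheses from per-context substitution judgments.

    Each judgment is (a_to_b_ok, b_to_a_ok) for one context.
    """
    hypotheses: Set[str] = set()
    for a_to_b_ok, b_to_a_ok in judgments:
        if a_to_b_ok and b_to_a_ok:
            hypotheses.add("synonym")          # interchangeable in this context
        elif a_to_b_ok:
            hypotheses.add("A hyponym of B")   # one-way substitution A -> B
        elif b_to_a_ok:
            hypotheses.add("B hyponym of A")   # one-way substitution B -> A
    return hypotheses


# Two contexts: interchangeable in the first, one-way in the second.
print(substitution_hypotheses([(True, True), (True, False)]))
```

Because the tests are existential ("if there is a context"), different contexts may yield different hypotheses for the same pair; the sketch returns all of them and leaves the resolution to further analysis.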
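
Analysing contextual markers, collocations

‘Contextual markers’ may detail several related aspects of the word’s environment in a text, including: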
1. Lexical markers: What lexical items is the word associated with?
2. Semantic markers: What semantic class of lexical items is the word associated with?
3. Domain markers: What domain do these lexical items belong to?
4. Grammatical markers: What form(s) does the word appear in?
5. Syntactic markers: What syntactic structure(s) does the word occur in?
6. Textual markers: Is the word associated with any (position in any) textual organisation, i.e. does it have any textual colligations?

Mostly we deal with lexical, semantic and syntactic markers. As for lexical markers, according to the semantic amalgamation rules formulated by V.G. Gak, there is a specific type of syntagmatic relation between the lexical items in a collocation (‘semantic concord’), which implies that at least one semantic component is repeated in the meanings of all the collocates.

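e.g.: Он уже успокоился, только немного сердился на учителя за этот спектакль. ('He had already calmed down; he was only a little angry with the teacher for this spectacle.')

… советник Арфарра сильно рассердился на меня за соглядатайство (Латынина Ю.) ('… counsellor Arfarra got very angry with me for spying.')

Татьяна немного разозлилась, и, разозлившись, тут же поняла, что это уже московская злость (Аксенов В.) ('Tatyana got a little angry and, having got angry, realised at once that this was already Moscow anger.')

Раскольников ужасно разозлился; ему вдруг захотелось как-нибудь оскорбить этого жирного франта. (Достоевский Ф.М.) ('Raskolnikov got terribly angry; he suddenly felt like insulting this fat dandy somehow.')

vs.

Это было равносильно измене, и царь Иван сильно разгневался. (Федоров Е.) ('This was tantamount to treason, and Tsar Ivan became greatly enraged.')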

Рассердиться and разозлиться ('to get angry') co-occur with various adverbs of degree (немного 'a little', сильно 'strongly', ужасно 'terribly'), while разгневаться ('to become enraged') is associated with adverbs of high degree only. This is evidence of the extreme intensity of the emotion in the case of разгневаться and of unspecified intensity in the case of рассердиться and разозлиться; consequently, {разгневаться} is treated as a hyponym of {рассердиться, разозлиться}.
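
The reasoning from degree adverbs to a hyponymy hypothesis can be illustrated as follows. This is a sketch under explicit assumptions: the assignment of adverbs to "low" and "high" degree classes is ours, the observations are limited to the five example sentences above (far too few for real conclusions), and the imperfective form сердился is counted together with рассердиться. All names are hypothetical.

```python
from typing import Dict, Set

# Assumed degree classes for the adverbs occurring in the examples.
DEGREE = {"немного": "low", "сильно": "high", "ужасно": "high"}

# Degree adverbs observed with each verb in the example sentences.
OBSERVED: Dict[str, Set[str]] = {
    "рассердиться": {"немного", "сильно"},
    "разозлиться": {"немного", "ужасно"},
    "разгневаться": {"сильно"},
}


def intensity_profile(adverbs: Set[str]) -> Set[str]:
    """Degree classes attested for one verb."""
    return {DEGREE[adverb] for adverb in adverbs if adverb in DEGREE}


def hyponym_by_intensity(observed: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each verb restricted to high-degree adverbs to the verbs that
    combine with more than one degree class (its potential hyperonyms)."""
    profiles = {verb: intensity_profile(advs) for verb, advs in observed.items()}
    specific = [verb for verb, profile in profiles.items() if profile == {"high"}]
    general = {verb for verb, profile in profiles.items() if len(profile) > 1}
    return {verb: general for verb in specific if general}


print(hyponym_by_intensity(OBSERVED))
```

With a realistic corpus the profiles would be built from co-occurrence counts rather than from single attestations, and a verb would count as restricted only if low-degree adverbs are systematically absent.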
