WO2015080559A2

WO2015080559A2 - A method and system for automated word sense disambiguation

Info

Publication number: WO2015080559A2
Application number: PCT/MY2014/000154
Authority: WO
Inventors: Chu Min Xian Benjamin; Qiang Liu; Lukose Dickson
Original assignee: Mimos Berhad
Priority date: 2013-11-27
Filing date: 2014-05-29
Publication date: 2015-06-04
Also published as: MY182881A

Description

A Method and System for Automated Word Sense Disambiguation

Field of the Invention

[0001] The present invention relates to information processing. More specifically, the present invention relates to a system and method for automated word sense disambiguation.

Background

[0002] Word Sense Disambiguation (WSD) is known to be a challenging subject in the field of Natural Language Processing (NLP). The challenge arises from the lack of means to address the properties of context that characterize the use of a word in a given sense. Further, there is also a lack of a standard and exhaustive inventory of word senses. It is also noted that the accuracy of the current means of processing disambiguation results is often questionable.

[0003] Given any text, the main problem considered here is how to identify which sense of an ambiguous word is used in a given sentence when the word has a number of distinct senses. Handling this typically requires substantial amounts of training examples or tagged datasets, i.e. supervised machine learning. For example, the word "plant" may have various meanings in different contexts when it is looked up in WordNet. It is however noted that these training examples and tagged datasets are still not sufficient for extracting the word sense. Further, the accuracy for fine-grained sense distinctions is still rather lacking in existing systems. Thus far, for example, the highest accuracies of state-of-the-art approaches range from about 59.1% to 69.0%. The challenge lies in determining the context length and the context content. The context length refers to the size of the window of text that should be taken to determine context. It can be difficult, if not impossible, to determine whether the context should contain only a few words or a larger portion of the string. Similarly, it is a challenge to decide whether all context words or only selected words, such as words of a certain part of speech or in a certain grammatical relation to the target word, are to be considered for context content. There is also the question of whether the selected words should be weighted based on their distance from the target word or be treated as a "bag of words".

Summary

[0004] In accordance with one aspect of the present invention, there is a system for disambiguating word sense from a text-containing document having sentences. The system comprises an entity recognition module adapted for extracting possible entities from the sentence using Linked Data; a text preprocessor adapted for tokenizing the sentence into lemmatized words, the text preprocessor including a word recognizer adapted to identify verbs and nouns in the sentence, a lemmatizer for lemmatizing the words of the sentence, and a polysemy checker for counting the number of possible senses of the words to determine if the words are ambiguous; an index builder (134) adapted for creating an index of schema graphs for each identified verb and for extracting all possible sense descriptions for nouns; and a disambiguator adapted for disambiguating word senses, wherein the disambiguator extracts all the schemas for the identified verb and places all the identified nouns into the schemas to determine the most suitable word sense, and disambiguation rules are utilized for disambiguating word sense.

[0005] In one embodiment, the disambiguator is operable to determine if the entity is a verb by referring to a linguistic resource and subsequently retrieve all possible schemas related to the verb.

[0006] In another embodiment, the text preprocessor is operable to determine if a word is a verb or a noun through a linguistic resource. The index builder may be adapted for extracting schemas through the use of linguistic resources and building an index reference for each word entry with all the related sense descriptions.

[0007] In yet another embodiment, the disambiguator may create a context word vector from the nouns extracted from the sense description, wherein the words in the context word vector are checked for semantic constraints with reference to a concept hierarchy from the linguistic resources, and the disambiguation rules are utilised when the ambiguous word cannot be resolved.

[0008] In another aspect, the present invention has a method of disambiguating word sense from a text-containing document having sentences. The method comprises extracting possible entities from the sentence using Linked Data; tokenizing the sentence into lemmatized words; identifying verbs and nouns in the sentence; lemmatizing the words of the sentence; counting the number of possible senses of the words through a polysemy checker to determine if the words are ambiguous; creating an index of schema graphs for each identified verb and extracting all possible sense descriptions for nouns; disambiguating word senses by extracting all the schemas for the identified verb and placing all the identified nouns into the schemas to determine the most suitable word sense; and utilizing disambiguation rules for disambiguating word sense.

[0009] In one embodiment, the disambiguating word sense includes determining if the entity is a verb through referring to a linguistic resource and retrieving all possible schemas related to the verb.

[0010] In another embodiment, identifying the verb and nouns includes matching the verb and nouns against a linguistic resource.

[0011] Further, the index builder may be adapted for extracting schemas through the use of linguistic resources and building an index reference for each word entry with all the related sense descriptions.

[0012] In yet another embodiment, disambiguating the word sense includes creating a context word vector from the nouns extracted from the sense description, wherein the words in the context word vector are checked for semantic constraints with reference to a concept hierarchy from the linguistic resources, and the disambiguation rules are utilised when the ambiguous word cannot be resolved.

Brief Description of the Drawings

[0013] Preferred embodiments according to the present invention will now be described with reference to the figures accompanied herein, in which like reference numerals denote like elements;

[0014] FIG. 1 illustrates a block diagram of a word sense disambiguation system in accordance with one embodiment of the present invention;

[0015] FIG. 2 illustrates a process carried out by the disambiguation module of FIG. 1 in accordance with one embodiment of the present invention; and

[0016] FIGs. 3A-3D illustrate an example of a sentence being processed to resolve ambiguity.

Detailed Description

[0017] Embodiments of the present invention shall now be described in detail, with reference to the attached drawings. It is to be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated device, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.

[0018] FIG. 1 illustrates a block diagram of a word sense disambiguation system 100 in accordance with one embodiment of the present invention. The system 100 is adapted for automatically identifying which sense (i.e. meaning) of a word is used in the context of a sentence. It is particularly useful for words that are polysemous, i.e. that have multiple meanings. Briefly, the system comprises an entity recognition module 101, a text preprocessor 102, and a disambiguation module 103.

[0019] The entity recognition module 101 preprocesses the sentence or target string to identify the possible entities based on Linked Data 112. Any commercially available entity recognition engine is suitable for identifying the relevant entities.

[0020] The entity recognition module 101 is configured to recognize entities from the content of a document. Many systems and methods for recognizing entities are well known in the art and can be adapted for the present invention. In another embodiment, the entity recognition engine or module disclosed in the Malaysia patent application entitled "SYSTEM AND METHOD FOR AUTOMATED ENTITY RECOGNITION", filed on the same day as the present application, can also be adapted herein.

[0021] The text preprocessor 102 comprises a word recognizer 122, a lemmatizer 124, and a polysemy checker 126. The word recognizer 122 is adapted to work with the entity recognition module 101 to determine whether a word in the sentence is a noun or a verb. The word recognizer 122 also refers to the linguistic resources 138 to perform its recognition. The lemmatizer 124 is adapted to tokenize the sentence into lemmatized word forms so that ambiguous words can be identified. The polysemy checker 126 is utilized to identify which ambiguous word is to be disambiguated.

[0022] An index builder 134 is used to create an index of schema graphs/maps for each verb. If the word is a noun, it extracts all possible sense descriptions from the linguistic resources 138; for the targeted verb, it extracts the possible related schemas. The index is referenced by the disambiguator for disambiguating the word sense.

[0023] It can be seen that using the polysemy checker 126 to identify ambiguous words, the word recognizer 122 to distinguish verbs and nouns in the sentence, and the index builder 134 to extract sense descriptions (for nouns) and schemas (for verbs) addresses the need for substantial amounts of training examples/tagged datasets.
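The roles of the polysemy checker 126 and the index builder 134 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `SENSES` inventory, its sense descriptions, and the function names `build_index`, `sense_count` and `is_ambiguous` are hypothetical stand-ins for the linguistic resources 138 and are not taken from the specification. The sketch indexes sense descriptions only; schema graphs for verbs are not modelled here.

```python
from collections import defaultdict

# Hypothetical sense inventory standing in for the linguistic resources 138.
# Each entry is (lemma, part of speech, sense description).
SENSES = [
    ("bass", "noun", "the lowest part of the musical range"),
    ("bass", "noun", "a lean-fleshed fish caught in fresh or salt water"),
    ("river", "noun", "a large natural stream of water"),
    ("fish", "verb", "catch or try to catch fish"),
    ("fish", "noun", "a cold-blooded aquatic vertebrate with gills"),
]


def build_index(senses):
    """Map each (lemma, pos) pair to the list of its sense descriptions."""
    index = defaultdict(list)
    for lemma, pos, description in senses:
        index[(lemma, pos)].append(description)
    return dict(index)


def sense_count(index, lemma):
    """Total number of senses recorded for a lemma across parts of speech."""
    return sum(len(descs) for (entry, _), descs in index.items() if entry == lemma)


def is_ambiguous(index, lemma):
    """Treat a word as ambiguous when it has more than one possible sense."""
    return sense_count(index, lemma) > 1


index = build_index(SENSES)
for word in ("boy", "fish", "bass", "river"):
    print(word, sense_count(index, word), is_ambiguous(index, word))
```

In this toy inventory, "fish" and "bass" each have two senses and are flagged as ambiguous, while "river" has one sense and "boy" has none recorded.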

[0024] The disambiguation module 103 receives the word and disambiguates its word sense based on disambiguation rules 132. By using semantic information (i.e. schemas) together with the context vectors created from the sense descriptions to check the selectional constraints of the semantic role, a high accuracy for fine-grained sense distinctions can be achieved. The disambiguation rules 132 are well known in the art.

[0025] FIG. 2 illustrates a process carried out by the disambiguation module 103 of FIG. 1 in accordance with one embodiment of the present invention. The process starts at step 202 with selecting a sentence, or target sentence, to be processed to extract the word sense of the words it contains. At step 204, each word of the target sentence is checked to determine if it is a verb through the use of the linguistic resources 138. This is done through the text preprocessor 102. When a word is determined to be a verb at step 206, all possible schemas related to the target verb are retrieved at step 208. A word that is not identified as a verb is, in general, a noun, nouns being salient and meaningful, and it proceeds to step 212. At step 212, it is further determined whether that word is ambiguous by using the polysemy checker 126. The polysemy checker 126 counts the different possible senses of the word. When the word has more than one possible sense, the higher the polysemy count, the higher the likelihood that the word is ambiguous. If the word is determined to be ambiguous at step 214, all related sense descriptions for the word (the potentially ambiguous word) are retrieved at step 216 through the index rendered by the index builder 134, and subsequently all nouns are extracted from the sense descriptions to create context vectors of the word at step 218.

[0026] Returning to step 214, if the word is determined not to be ambiguous, at step 222 the word is matched with the schemas retrieved in step 208. A best schema, being the one with the maximum number of concepts matched, is selected at step 224.

[0027] At step 226, each of the context vectors of the word is checked to determine if it satisfies the selectional constraints of the semantic role of the best schema identified earlier. The selectional constraints check is done with reference to a concept hierarchy from the linguistic resources 138 at step 228.
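Selecting the best schema as the one with the maximum number of concepts matched (steps 222 to 224) can be sketched as follows. This is a minimal illustration, not the specification's implementation: the schema contents, the role and concept names, the `NOUN_CONCEPTS` mapping and the function names are all hypothetical.

```python
# Hypothetical schemas for the verb "fish": each maps a semantic role to the
# concept expected in that role. Role and concept names are illustrative only.
SCHEMAS = {
    "S1": {"agent": "person", "instrument": "tool"},
    "S2": {"agent": "person", "theme": "animal", "source": "body_of_water"},
}

# Hypothetical concept assigned to each unambiguous noun of the sentence.
NOUN_CONCEPTS = {"boy": "person", "river": "body_of_water"}


def count_matches(schema, noun_concepts):
    """Count the schema roles whose expected concept is supplied by some noun."""
    available = set(noun_concepts.values())
    return sum(1 for concept in schema.values() if concept in available)


def best_schema(schemas, noun_concepts):
    """Return the name of the schema with the maximum number of matched concepts."""
    return max(schemas, key=lambda name: count_matches(schemas[name], noun_concepts))


print(best_schema(SCHEMAS, NOUN_CONCEPTS))
```

With these assumed schemas, S1 matches one concept and S2 matches two, so S2 is selected. Ties are resolved in favour of the schema listed first, which is a choice of this sketch rather than something the specification states.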

[0028] When the selectional constraints above are satisfied, at step 232, a best sense for the ambiguous word is selected and assigned to that word. If the sense can be resolved at step 234, the sense of that word is identified. If the sense of that word cannot be resolved at step 234, i.e. the selectional constraints check is not satisfied, the disambiguation module 103 applies the disambiguation rules 132 to give the word a word sense.
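The selectional constraints check against a concept hierarchy, with a fallback to disambiguation rules when no sense can be resolved, can be sketched as below. The hierarchy, the sense identifiers, and the placeholder rule `first_listed_sense` are assumptions made for illustration; the specification does not define the content of the disambiguation rules 132.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
HIERARCHY = {
    "fish": "animal",
    "animal": "animate",
    "pitch": "sound",
    "sound": "abstraction",
}


def is_a(concept, ancestor, hierarchy):
    """True if `concept` equals `ancestor` or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = hierarchy.get(concept)
    return False


def select_sense(context_vectors, constraint, hierarchy, fallback):
    """Pick the first sense whose context vector satisfies the constraint.

    `context_vectors` maps a sense identifier to the nouns taken from its
    description. If no sense satisfies the constraint, `fallback` decides.
    """
    for sense, nouns in context_vectors.items():
        if any(is_a(noun, constraint, hierarchy) for noun in nouns):
            return sense
    return fallback(context_vectors)


def first_listed_sense(context_vectors):
    """Placeholder rule (an assumption): take the first sense in the inventory."""
    return next(iter(context_vectors))


vectors = {
    "bass#music": ["pitch", "range"],
    "bass#fish": ["fish", "water"],
}
print(select_sense(vectors, "animate", HIERARCHY, first_listed_sense))
```

Here the semantic role is assumed to require an animate concept, so the sense whose context vector contains "fish" is chosen; a constraint that no sense satisfies falls through to the placeholder rule.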

[0029] FIGs. 3A-3D illustrate an example of a sentence that is being processed to disambiguate its word senses. The example sentence is "The boy fishes the bass from the river." and is referred to herein as the target sentence. As shown in FIG. 3A, the target sentence is scanned by the present system 100 to identify entities/noun phrases through the entity recognition module 101 with reference to the Linked Data 112. In this case, the words "boy", "bass" and "river" are identified. Subsequently, verbs are identified in the target sentence using the linguistic resources 138. In this case, "fishes" is identified and is reduced to its lemmatized form, "fish".

[0030] Accordingly, as shown in FIG. 3B, all the schemas relating to "fish" are extracted, shown as S1, S2, Sn. The ambiguous words are identified through the polysemy checker 126. In this case, "fish" and "bass" are polysemous, and since "fish" is identified as a verb, it can be removed from consideration for disambiguation. Subsequently, the words from the sentence are placed into all the schemas extracted earlier. In this case, S2 can be identified as the closest match for the sentence.

[0031] Referring now to FIG. 3C, all the possible sense descriptions are retrieved through the index builder 134. Following that, all the nouns from each sense description are extracted to create a context vector.

[0032] As shown in FIG. 3D, each context word is checked to determine if the semantic constraints are satisfied. In the illustrated sentence, it can be identified from the concept hierarchy that the context word "fish" is animate. Therefore, the word "bass" is assigned the sense whose sense description contains the word "fish", as shown in FIG. 3C.

[0033] While specific embodiments have been described and illustrated, it is understood that many changes, modifications, variations, and combinations thereof could be made to the present invention without departing from the scope of the invention.
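The creation of context vectors from sense descriptions, as in the "bass" example of FIG. 3C, can be sketched in Python as follows. This is a sketch under stated assumptions: the sense descriptions are illustrative wording rather than text from the figures, and the `KNOWN_NOUNS` set is a hypothetical stand-in for a part-of-speech lookup in the linguistic resources 138.

```python
import re

# Hypothetical noun list standing in for a part-of-speech lookup in the
# linguistic resources 138; a real system would tag the description instead.
KNOWN_NOUNS = {"fish", "water", "flesh", "part", "range", "pitch", "voice"}

# Illustrative sense descriptions for "bass"; the wording is not from the source.
BASS_SENSES = {
    "bass#1": "the lowest part of the musical range",
    "bass#2": "a fish with lean flesh found in fresh water",
}


def context_vector(description, known_nouns):
    """Return the nouns of a sense description, in order, without repeats."""
    vector = []
    for token in re.findall(r"[a-z]+", description.lower()):
        if token in known_nouns and token not in vector:
            vector.append(token)
    return vector


def senses_containing(senses, word, known_nouns):
    """List the senses whose context vector contains the given context word."""
    return [
        sense_id
        for sense_id, description in senses.items()
        if word in context_vector(description, known_nouns)
    ]


for sense_id, description in BASS_SENSES.items():
    print(sense_id, context_vector(description, KNOWN_NOUNS))
print(senses_containing(BASS_SENSES, "fish", KNOWN_NOUNS))
```

Only the second assumed sense yields a context vector containing "fish", so it is the candidate that would go forward to the semantic constraints check.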

Claims

1. A system for disambiguating word sense from a text containing document having sentences, the system comprising:

an entity recognition module (101) adapted for extracting possible entities from the sentence using a Linked Data (112);

a text preprocessor (102) adapted for tokenizing the sentence into lemmatized words, the text preprocessor (102) including a word recognizer (122) adapted to identify a verb and nouns from the sentence, a lemmatizer (124) for lemmatizing the words of the sentence, and a polysemy checker (126) for counting a number of possible senses of the words to determine if the words are ambiguous;

an index builder (134) adapted for creating an index of schema graphs for each identified verb and to extract all possible sense description for nouns;

a disambiguator (103) adapted for disambiguating word senses, wherein the disambiguator (103) extracts all the schemas for the identified verb and placing all the identified nouns into the schemas to determine the most suitable word sense, and a disambiguation rules (132) is utilized for disambiguating word sense.

2. The system according to claim 1, wherein the disambiguator (103) is operable to determine if the entity is a verb through referring to a linguistic resource (138) and retrieve all possible schemas related to the verb.

3. The system according to claim 1, wherein the text processor (102) is operable to determine if a word is a verb or nouns through a linguistic resource (138).

4. The system according to claim 2, wherein the index builder (134) is adapted for extracting schemas through the use of linguistic resources (138) and building an index reference for each word entry with all the related sense descriptions.

5. The system according to claim 1, wherein the disambiguator (103) creates a context word vector from the nouns extracted from the sense description, wherein the words in the context word vector are based on a concept hierarchy from the linguistic resources (138), and the disambiguation rules (132) are utilised when the ambiguous word cannot be resolved.

6. A method of disambiguating word sense from a text containing document having sentences, the method comprising:

extracting possible entities from the sentence using a Linked Data (112);

tokenizing the sentence into lemmatized words;

lemmatizing the words of the sentence;

counting a number of possible sense of the words to determine if the words are ambiguous through a polysemy checker;

identifying verb and nouns from the sentence,

lemmatizing the words of the sentence, counting a number of possible sense of the words to determine if the words are ambiguous through a polysemy checker (126);

creating an index of schema graphs for each identified verb and to extract all possible sense description for nouns;

disambiguating word senses through extracting all the schemas for the identified verb and placing all the identified nouns into the schemas to determine the most suitable word sense;

utilizing disambiguation rules (132) for disambiguating word sense.

7. The method according to claim 6, wherein disambiguating word sense includes determining if the entity is a verb through referring to a linguistic resource (138) and retrieving all possible schemas related to the verb.

8. The method according to claim 6, wherein identifying verb and nouns includes matching verb and nouns with a linguistic resource (138).

9. The method according to claim 6, wherein the index builder (134) is adapted for extracting schemas through the use of linguistic resources and building an index reference for each word entry with all the related sense descriptions.

10. The method according to claim 6, wherein disambiguating the word sense includes creating a context word vector from the nouns extracted from the sense description, wherein the words in the context word vector are based on a concept hierarchy from the linguistic resources (138), and the disambiguation rules (132) are utilised when the ambiguous word cannot be resolved.