WO2015080559A2 - Method and system for automated word sense disambiguation - Google Patents

Method and system for automated word sense disambiguation

Info

Publication number: WO2015080559A2
Authority: WO; WIPO (PCT)
Prior art keywords: word; sense; words; verb; sentence
Prior art date: 2013-11-27

Application number

PCT/MY2014/000154

Other languages

English (en)

Inventor

Chu Min Xian Benjamin

Qiang Liu

Lukose Dickson

Original Assignee

Mimos Berhad

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

2013-11-27

Filing date

2014-05-29

Publication date

2015-06-04

2014-05-29 Application filed by Mimos Berhad

2015-06-04 Publication of WO2015080559A2

Classifications

- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition

Definitions

The present invention relates to information processing. More specifically, the present invention relates to a system and method for automated word sense disambiguation.
Word sense disambiguation is known to be a challenging subject in the field of Natural Language Processing (NLP).
The challenge arises from the lack of means to address the properties of context that characterize the use of a word in a given sense. Further, there is no standard and exhaustive inventory of word senses. It is also noted that the accuracy of current means of processing disambiguation results at the final stage is often questionable.
The challenge lies in determining the context length and the context content.
The context length refers to the size of the window of text that should be taken to determine context. It can be difficult, if not impossible, to determine whether the context should contain only a few words or a larger portion of the string.
A system for disambiguating word sense from a text-containing document having sentences comprises: an entity recognition module adapted for extracting possible entities from the sentence using Linked Data; a text preprocessor adapted for tokenizing the sentence into lemmatized words, the text preprocessor including a word recognizer adapted to identify verbs and nouns in the sentence, a lemmatizer for lemmatizing the words of the sentence, and a polysemy checker for counting the number of possible senses of each word to determine whether the word is ambiguous; an index builder (134) adapted for creating an index of schema graphs for each identified verb and for extracting all possible sense descriptions for the nouns; and a disambiguator adapted for disambiguating word senses, wherein the disambiguator extracts all the schemas for the identified verb and places all the identified nouns into the schemas to determine the most suitable word sense, and disambiguation rules are utilized for disambiguating the word sense.
The text preprocessor is operable to determine whether a word is a verb or a noun through a linguistic resource.
The index builder may be adapted for extracting schemas through the use of linguistic resources and building an index reference for each word entry with all the related sense descriptions.
The disambiguator may create a context word vector from the nouns extracted from the sense description, wherein the words in the context word vector are checked for semantic constraints with reference to a concept hierarchy from the linguistic resources, and the disambiguation rules are utilised when the ambiguous word cannot be resolved.
The present invention also provides a method of disambiguating word sense from a text-containing document having sentences.
The method comprises: extracting possible entities from the sentence using Linked Data; tokenizing the sentence into lemmatized words; identifying the verb and nouns in the sentence; lemmatizing the words of the sentence; counting the number of possible senses of each word through a polysemy checker to determine whether the word is ambiguous; creating an index of schema graphs for each identified verb and extracting all possible sense descriptions for the nouns; disambiguating word senses by extracting all the schemas for the identified verb and placing all the identified nouns into the schemas to determine the most suitable word sense; and utilizing disambiguation rules for disambiguating the word sense.
Disambiguating the word sense includes determining whether the entity is a verb by referring to a linguistic resource and retrieving all possible schemas related to the verb.
Identifying the verb and nouns includes matching the verb and nouns with a linguistic resource.
Disambiguating the word sense includes creating a context word vector from the nouns extracted from the sense description, wherein the words in the context word vector are checked for semantic constraints with reference to a concept hierarchy from the linguistic resources, and the disambiguation rules are utilised when the ambiguous word cannot be resolved.
FIG. 1 illustrates a block diagram of a word sense disambiguation system in accordance with one embodiment of the present invention.
FIG. 2 illustrates a process carried out by the disambiguation module of FIG. 1 in accordance with one embodiment of the present invention.
FIGs. 3A-D exemplify a sentence that is being processed to resolve ambiguity.
FIG. 1 illustrates a block diagram of a word sense disambiguation system 100 in accordance with one embodiment of the present invention.
The system 100 is adapted for automatically identifying which sense (i.e. meaning) of a word is used in the sentence context. It is particularly useful for words that are polysemous, i.e. have multiple meanings.
The system comprises an entity recognition module 101, a text preprocessor 102, and a disambiguation module 103.
The entity recognition module 101 preprocesses the sentence or target string to identify the possible entities based on Linked Data 112. Any entity recognition engine available in the market is suitable for identifying the relevant entities.
The entity recognition module 101 is configured to recognize entities from the content of a document. Many systems and methods for recognizing entities are well known in the art and can be adapted for the present invention. In another embodiment, the entity recognition engine or module disclosed in the Malaysian patent application entitled "SYSTEM AND METHOD FOR AUTOMATED ENTITY RECOGNITION", filed on the same day as the present application, can also be adapted herein.
The text preprocessor 102 comprises a word recognizer 122, a lemmatizer 124, and a polysemy checker 126.
The word recognizer 122 is adapted to work with the entity recognition module 101 to determine whether a word in the sentence is a noun or a verb.
The word recognizer 122 also refers to the linguistic resources 138 to perform its recognition.
The lemmatizer 124 is adapted to tokenize the sentence into lemmatized word forms to identify ambiguous words.
The polysemy checker 126 is utilized to identify which ambiguous word is to be disambiguated.
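As a rough illustration of the polysemy checker described above, the following sketch counts the senses listed for a lemma and flags the lemma as ambiguous when more than one sense exists. The `SENSES` inventory, its glosses, and the function names are hypothetical stand-ins for the linguistic resources 138; the source does not specify any of them.

```python
# Hypothetical toy sense inventory standing in for the linguistic resources 138.
SENSES = {
    "bass": [
        "an edible fish found in fresh water",
        "the lowest voice of a male singer",
    ],
    "boy": ["a young male person"],
    "river": ["a large natural stream of water"],
}


def polysemy_count(lemma):
    """Return the number of senses listed for the lemma (0 if unknown)."""
    return len(SENSES.get(lemma, []))


def is_ambiguous(lemma):
    """A lemma is treated as ambiguous when it has more than one sense."""
    return polysemy_count(lemma) > 1


if __name__ == "__main__":
    for word in ("boy", "bass", "river"):
        print(word, polysemy_count(word), is_ambiguous(word))
```

With this toy inventory only "bass" is flagged, so it is the only word that would be passed on for disambiguation.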
An index builder 134 is used to create an index of schema graphs/maps for each verb.
The disambiguation module 103 receives the word and disambiguates its word sense based on disambiguation rules 132.
Disambiguation rules 132 are well known in the art.
FIG. 2 illustrates a process carried out by the disambiguation module 103 of FIG. 1 in accordance with one embodiment of the present invention.
The process starts at step 202 with selecting a sentence, or target sentence, to be processed to extract the word senses of the words it contains. At step 204, each word of the target sentence is checked, through the use of the linguistic resources 138, to determine whether it is a verb. This is done through the text preprocessor 102. When a word is determined to be a verb at step 206, all possible schemas related to the target verb are retrieved at step 208. A word that is not identified as a verb is, in general, a noun, being salient and meaningful; such a word proceeds to step 212.
The polysemy checker 126 calculates the total count of different possible senses for the word. When the word has more than one possible sense, it may be ambiguous; the higher the polysemy count, the higher the likelihood that the word is ambiguous. If the word is determined to be ambiguous at step 214, all related sense descriptions for the word (the potentially ambiguous word) are retrieved at step 216 through the index rendered by the index builder 134, and subsequently all nouns are extracted from the sense descriptions at step 218 to create context vectors of the word.
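Steps 216 and 218 can be sketched as follows: for each sense of an ambiguous word, the sense description is looked up and its nouns are collected into a context vector. This is a minimal sketch under stated assumptions. The sense identifiers, the glosses in `SENSE_DESCRIPTIONS`, and the `NOUNS` lookup set are hypothetical; a real system would draw them from the index built by the index builder 134 and from the linguistic resources 138.

```python
# Hypothetical index of sense descriptions, keyed by lemma and then sense id.
SENSE_DESCRIPTIONS = {
    "bass": {
        "bass#fish": "an edible fish found in fresh water",
        "bass#voice": "the lowest voice of a male singer",
    },
}

# Hypothetical noun list used in place of a real part-of-speech lookup.
NOUNS = {"fish", "water", "voice", "singer"}


def context_vectors(lemma):
    """Map each sense id of the lemma to the nouns found in its description."""
    vectors = {}
    for sense_id, description in SENSE_DESCRIPTIONS.get(lemma, {}).items():
        tokens = [tok.strip(".,;").lower() for tok in description.split()]
        vectors[sense_id] = [tok for tok in tokens if tok in NOUNS]
    return vectors


if __name__ == "__main__":
    for sense, words in context_vectors("bass").items():
        print(sense, words)
```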
If the word is determined to be not ambiguous, at step 222 the word is matched with the schemas retrieved in step 208. A best schema, being the one with the maximum number of concepts matched, is selected at step 224. At step 226, each of the context vectors of the word is checked to see whether it satisfies the selectional constraints of the semantic role of the best schema identified earlier. The selectional constraints check is done at step 228 with reference to a concept hierarchy from the linguistic resources 138.
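One possible reading of steps 222 to 228 is sketched below. The unambiguous nouns are scored against each schema of the verb, the schema with the most matched concepts is kept, and each context vector is then tested against the constraints of the roles that are still unfilled. The `HYPERNYM` hierarchy, the `SCHEMAS` entries, the role names, and the rule of checking only unfilled roles are all assumptions made for illustration; the source does not give the schema format or the hierarchy.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
HYPERNYM = {
    "boy": "person", "person": "entity",
    "river": "body_of_water", "body_of_water": "entity",
    "fish": "animal", "animal": "entity",
    "voice": "sound", "sound": "entity",
}

# Hypothetical schemas for the verb "fish": semantic role -> required concept.
SCHEMAS = {
    "fish": [
        {"agent": "person", "theme": "animal", "source": "body_of_water"},
        {"agent": "person", "theme": "sound"},
    ],
}


def is_a(concept, ancestor):
    """Walk up the hierarchy to test whether concept falls under ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = HYPERNYM.get(concept)
    return False


def best_schema(verb, nouns):
    """Pick the schema whose role constraints match the most known nouns."""
    def score(schema):
        return sum(any(is_a(n, c) for n in nouns) for c in schema.values())
    return max(SCHEMAS[verb], key=score)


def select_sense(vectors, schema, nouns):
    """Return the single sense whose context vector fits an unfilled role."""
    open_constraints = [c for c in schema.values()
                        if not any(is_a(n, c) for n in nouns)]
    matches = [sense for sense, words in vectors.items()
               if any(is_a(w, c) for w in words for c in open_constraints)]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    known = ["boy", "river"]
    vectors = {"bass#fish": ["fish", "water"], "bass#voice": ["voice", "singer"]}
    schema = best_schema("fish", known)
    print(schema, select_sense(vectors, schema, known))
```

Returning `None` when zero or several senses fit leaves room for a later rule-based step to decide.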
When the selectional constraints above are satisfied, at step 232 a best sense for the ambiguous word is selected and assigned to that word. If this can be resolved at step 234, the sense of that word is identified. If the sense of that word cannot be resolved at step 234, i.e. the selectional constraints check is not satisfied, the disambiguation module 103 applies the disambiguation rules to give the word a word sense.
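The decision at steps 232 and 234 amounts to a two-stage resolution: use the sense chosen by the selectional constraints when there is one, otherwise fall back to the disambiguation rules. The source does not say what the disambiguation rules 132 are, so the fallback below (take the first sense in a pre-ranked list) is purely a placeholder assumption, as are the function name and return format.

```python
def assign_sense(constraint_result, ranked_senses):
    """Return (sense, how_it_was_resolved) for an ambiguous word.

    constraint_result: sense id chosen by the selectional constraints check,
        or None when that check could not resolve the word.
    ranked_senses: candidate sense ids in a preference order; using the first
        one is only a stand-in for the unspecified disambiguation rules 132.
    """
    if constraint_result is not None:
        return constraint_result, "selectional constraints"
    if ranked_senses:
        return ranked_senses[0], "fallback rule"
    return None, "unresolved"


if __name__ == "__main__":
    print(assign_sense("bass#fish", ["bass#voice", "bass#fish"]))
    print(assign_sense(None, ["bass#voice", "bass#fish"]))
```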
FIGs. 3A-3C exemplify a sentence that is being processed to disambiguate the word senses thereof.
The exemplified sentence is "The boy fishes the bass from the river."
The exemplified sentence is also referred to herein as the target sentence.
The target sentence is scanned by the present system 100 to identify entities/noun phrases through the entity recognition module 101 with reference to the Linked Data 112.
The words “boy”, “bass” and “river” shall be identified.
Verb(s) are identified from the target sentence using the linguistic resources 138.
The verb “fish” may be identified.
The verb will be tokenized into its lemmatized form, “fish”.
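The preprocessing of the target sentence can be sketched as below: the sentence is tokenized, each token is looked up to obtain its lemma and part of speech, and the lemmas are split into verbs and nouns. The `LEXICON` table is a hypothetical substitute for the Linked Data 112 and linguistic resources 138, and words missing from it (such as "the" and "from") are simply skipped.

```python
import re

# Hypothetical lexicon: surface form -> (lemma, part of speech).
LEXICON = {
    "boy": ("boy", "noun"),
    "fishes": ("fish", "verb"),
    "bass": ("bass", "noun"),
    "river": ("river", "noun"),
}


def preprocess(sentence):
    """Return (verbs, nouns) as lists of lemmas, in sentence order."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    verbs, nouns = [], []
    for token in tokens:
        entry = LEXICON.get(token)
        if entry is None:
            continue  # not a verb or noun known to the toy lexicon
        lemma, pos = entry
        (verbs if pos == "verb" else nouns).append(lemma)
    return verbs, nouns


if __name__ == "__main__":
    print(preprocess("The boy fishes the bass from the river."))
```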

Landscapes

Engineering & Computer Science (AREA)
Theoretical Computer Science (AREA)
Health & Medical Sciences (AREA)
Artificial Intelligence (AREA)
Audiology, Speech & Language Pathology (AREA)
Computational Linguistics (AREA)
General Health & Medical Sciences (AREA)
Physics & Mathematics (AREA)
General Engineering & Computer Science (AREA)
General Physics & Mathematics (AREA)
Machine Translation (AREA)
Character Discrimination (AREA)

PCT/MY2014/000154 2013-11-27 2014-05-29 Method and system for automated word sense disambiguation WO2015080559A2 (fr)

Applications Claiming Priority (2)

Application Number	Priority Date	Filing Date	Title
MYPI2013004280A MY182881A (en)	2013-11-27	2013-11-27	A method and system for automated entity recognition
MYPI2013004280		2013-11-27

Publications (1)

Publication Number	Publication Date
WO2015080559A2 (fr)	2015-06-04

Family

ID=51690418

Family Applications (1)

Application Number	Title	Priority Date	Filing Date
PCT/MY2014/000154 (WO2015080559A2)	Method and system for automated word sense disambiguation	2013-11-27	2014-05-29

Country Status (2)

Country	Link
MY (1)	MY182881A (fr)
WO (1)	WO2015080559A2 (fr)

2013
- 2013-11-27 MY MYPI2013004280A patent/MY182881A/en unknown
2014
- 2014-05-29 WO PCT/MY2014/000154 patent/WO2015080559A2/fr active Application Filing

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number	Priority date	Publication date	Assignee	Title
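CN108509449A (zh) *	2017-02-24	2018-09-07	Tencent Technology (Shenzhen) Co., Ltd.	Information processing method and server
CN108509449B (zh) *	2017-02-24	2022-07-08	Tencent Technology (Shenzhen) Co., Ltd.	Information processing method and server
CN109492214A (zh) *	2017-09-11	2019-03-19	Soochow University	Attribute word recognition and hierarchy construction method, apparatus, device and storage medium
CN109492214B (zh) *	2017-09-11	2023-09-19	Soochow University	Attribute word recognition and hierarchy construction method, apparatus, device and storage medium
CN111199149A (zh) *	2019-12-17	2020-05-26	Aisino Corporation	Intelligent sentence clarification method for dialogue * and *
CN111199149B (zh) *	2019-12-17	2023-10-20	Aisino Corporation	Intelligent sentence clarification method for dialogue * and *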

Also Published As

Publication number	Publication date
MY182881A (en)	2021-02-05

Legal Events

Date	Code	Title	Description
2015-07-15	121	EP: the EPO has been informed by WIPO that EP was designated in this application	Ref document number: 14783674; Country of ref document: EP; Kind code of ref document: A2
2016-05-27	NENP	Non-entry into the national phase	Ref country code: DE
2016-12-21	122	EP: PCT application non-entry in European phase	