WO2015080559A2 - Method and system for automated word sense disambiguation - Google Patents

Method and system for automated word sense disambiguation

Info

Publication number
WO2015080559A2
Authority
WO
WIPO (PCT)
Prior art keywords
word
sense
words
verb
sentence
Prior art date
Application number
PCT/MY2014/000154
Other languages
English (en)
Inventor
Chu Min Xian Benjamin
Qiang Liu
Lukose Dickson
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2015080559A2

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • The present invention relates to information processing. More specifically, the present invention relates to a system and method for automated word sense disambiguation.
  • Word Sense Disambiguation is known to be a challenging subject in the field of Natural Language Processing (NLP).
  • The challenge arises from the lack of means to address the properties of context that characterize the use of words in a given sense. Further, there is also a lack of a standard and exhaustive inventory of word senses. The accuracy of current means of processing disambiguation results at the final stage is also often questionable.
  • The challenge lies in determining the context length and the context content.
  • The context length refers to the size of the window of text that should be taken to determine context. It can be difficult, if not impossible, to determine whether the context should contain only a few words or a larger portion of the string.
  • A system for disambiguating word sense from a text-containing document having sentences comprises: an entity recognition module adapted for extracting possible entities from the sentence using Linked Data; a text preprocessor adapted for tokenizing the sentence into lemmatized words, the text preprocessor including a word recognizer adapted to identify a verb and nouns in the sentence, a lemmatizer for lemmatizing the words of the sentence, and a polysemy checker for counting the number of possible senses of the words to determine whether the words are ambiguous; an index builder (134) adapted for creating an index of schema graphs for each identified verb and for extracting all possible sense descriptions for the nouns; and a disambiguator adapted for disambiguating word senses, wherein the disambiguator extracts all the schemas for the identified verb and places all the identified nouns into the schemas to determine the most suitable word sense, and disambiguation rules are utilized for disambiguating the word sense.
  • The text preprocessor is operable to determine whether a word is a verb or a noun through a linguistic resource.
  • The index builder may be adapted for extracting schemas through the use of linguistic resources and building an index reference for each word entry with all the related sense descriptions.
  • The disambiguator may create a context word vector from the nouns extracted from the sense description, wherein the words in the context word vector are checked for semantic constraints with reference to a concept hierarchy from the linguistic resources, and the disambiguation rules are utilized when the ambiguous word cannot be resolved.
  • The present invention also provides a method of disambiguating word sense from a text-containing document having sentences.
  • The method comprises: extracting possible entities from the sentence using Linked Data; tokenizing the sentence into lemmatized words; identifying a verb and nouns in the sentence; lemmatizing the words of the sentence; counting the number of possible senses of the words, through a polysemy checker, to determine whether the words are ambiguous; creating an index of schema graphs for each identified verb and extracting all possible sense descriptions for the nouns; disambiguating word senses by extracting all the schemas for the identified verb and placing all the identified nouns into the schemas to determine the most suitable word sense; and utilizing disambiguation rules for disambiguating the word sense.
  • Disambiguating the word sense includes determining whether the entity is a verb by referring to a linguistic resource and retrieving all possible schemas related to the verb.
  • Identifying the verb and nouns includes matching the verb and nouns with a linguistic resource.
  • The index builder may be adapted for extracting schemas through the use of linguistic resources and building an index reference for each word entry with all the related sense descriptions.
  • Disambiguating the word sense includes creating a context word vector from the nouns extracted from the sense description, wherein the words in the context word vector are checked for semantic constraints with reference to a concept hierarchy from the linguistic resources, and the disambiguation rules are utilized when the ambiguous word cannot be resolved.
  • FIG. 1 illustrates a block diagram of a word sense disambiguation system in accordance with one embodiment of the present invention.
  • FIG. 2 illustrates a process carried out by the disambiguation module of FIG. 1 in accordance with one embodiment of the present invention.
  • FIGs. 3A-D show an example of a sentence being processed to resolve ambiguity.
  • FIG. 1 illustrates a block diagram of a word sense disambiguation system 100 in accordance with one embodiment of the present invention.
  • The system 100 is adapted for automatically identifying which sense (i.e. meaning) of a word is used in the sentence context. It is particularly useful for words that are polysemous, that is, words that have multiple meanings.
  • The system comprises an entity recognition module 101, a text preprocessor 102, and a disambiguation module 103.
  • The entity recognition module 101 preprocesses the sentence or target string to identify possible entities based on Linked Data 112. Any entity recognition engine available on the market is suitable for identifying the relevant entities.
  • The entity recognition module 101 is configured to recognize entities from the content of a document. Many systems and methods for recognizing entities are well known in the art and can be adapted for the present invention. In another embodiment, the entity recognition engine or module disclosed in the Malaysia patent application entitled "SYSTEM AND METHOD FOR AUTOMATED ENTITY RECOGNITION", filed on the same day as the present application, can also be adapted herein.
  • The text preprocessor 102 comprises a word recognizer 122, a lemmatizer 124 and a polysemy checker 126.
  • The word recognizer 122 is adapted to work with the entity recognition module 101 to determine whether a word in the sentence is a noun or a verb.
  • The word recognizer 122 also refers to the linguistic resources 138 to perform its recognition.
  • The lemmatizer 124 is adapted to tokenize the sentence into lemmatized word forms to identify ambiguous words.
  • The polysemy checker 126 is utilized to identify which ambiguous word is to be disambiguated.
  • An index builder 134 is used to create an index of schema graphs/maps for each verb.
  • The disambiguation module 103 receives the word and disambiguates its word sense based on disambiguation rules 132.
  • Disambiguation rules 132 are well known in the art.
  • FIG. 2 illustrates a process carried out by the disambiguation module 103 of FIG. 1 in accordance with one embodiment of the present invention.
  • The process starts at step 202 with selecting a sentence, or target sentence, to be processed to extract the word sense of the words it contains. At step 204, each word of the target sentence is checked, through the use of the linguistic resources 138, to determine whether it is a verb. This is done through the text preprocessor 102. When a word is determined to be a verb at step 206, all possible schemas related to the target verb are retrieved at step 208. A word that is not identified as a verb is, in general, a noun, being salient and meaningful; such a word proceeds to step 212.
  • The polysemy checker 126 calculates a total count of the different possible senses of the word. When the word has more than one possible sense, the higher the polysemy count, the higher the likelihood that the word is ambiguous. If the word is determined to be ambiguous at step 214, all related sense descriptions for the word (the potentially ambiguous word) are retrieved at step 216 through the index rendered by the index builder 134, and subsequently all nouns are extracted from the sense descriptions to create context vectors of the word at step 218.
  • If the word is determined to be not ambiguous, the word is matched at step 222 with the schemas retrieved in step 208. A best schema, being the one with the maximum number of concepts matched, is selected at step 224. At step 226, each of the context vectors of the word is checked to see whether it satisfies the selectional constraints of the semantic role of the best schema identified earlier. The selectional constraints check is done with reference to a concept hierarchy from the linguistic resources 138 at step 228.
  • When the selectional constraints above are satisfied, a best sense for the ambiguous word is selected and assigned to that word at step 232. If this can be resolved in step 234, the sense of that word is identified. If the sense of that word cannot be resolved at step 234, i.e. the selectional constraints check is not satisfied, the disambiguation module 103 applies the disambiguation rules to give the word a word sense.
  • FIGs. 3A-3C show an example of a sentence being processed to disambiguate its word senses.
  • The example sentence is "The boy fishes the bass from the river.".
  • The example sentence is also referred to herein as the target sentence.
  • The target sentence is scanned by the present system 100 to identify entities/noun phrases through the entity recognition module 101 with reference to the Linked Data 112.
  • The words “boy”, “bass” and “river” are identified.
  • Verb(s) are identified from the target sentence using the linguistic resources 138.
  • The verb “fish” may be identified.
  • The word “fish” is tokenized into its lemmatized form.
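
The flow of FIG. 2 described above (polysemy count, context vectors built from sense descriptions, best-schema selection, selectional constraint check, rule fallback) can be sketched on the example sentence as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the sense inventory `SENSES`, the concept hierarchy `HYPERNYMS`, the two schemas for "fish", and the first-listed-sense fallback rule are all hypothetical stand-ins for the linguistic resources 138 and the disambiguation rules 132, which the text does not specify. The sketch also tries every semantic role for a word instead of parsing which role the word fills.

```python
from dataclasses import dataclass

# Hypothetical toy sense inventory: word -> {sense id: sense description}.
SENSES = {
    "bass": {
        "bass.fish": "an edible fish found in a river or the sea",
        "bass.music": "the lowest voice or tone in music",
    },
    "boy": {"boy.person": "a young male person"},
    "river": {"river.water": "a large natural stream of water"},
}

# Hypothetical concept hierarchy: concept -> parent concept.
HYPERNYMS = {
    "fish": "animal", "animal": "entity", "person": "entity",
    "voice": "sound", "tone": "sound", "music": "sound", "sound": "entity",
    "river": "water", "sea": "water", "stream": "water", "water": "entity",
}


@dataclass
class Schema:
    name: str
    roles: dict  # semantic role -> concept required by the selectional constraint


# Hypothetical schemas for the verb "fish".
SCHEMAS = {
    "fish": [
        Schema("catch", {"agent": "person", "patient": "animal", "source": "water"}),
        Schema("seek", {"agent": "person", "theme": "sound"}),
    ],
}


def is_a(concept, ancestor):
    """Walk up the concept hierarchy to test whether concept is under ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = HYPERNYMS.get(concept)
    return False


def polysemy_count(word):
    """Number of possible senses; more than one means the word is ambiguous."""
    return len(SENSES.get(word, {}))


def context_vector(description):
    """Nouns of a sense description (here: the words known to the hierarchy)."""
    return [w for w in description.split() if w in HYPERNYMS]


def satisfying_senses(word, concept):
    """Senses whose context vector contains the concept or one of its hyponyms."""
    return [sid for sid, desc in SENSES.get(word, {}).items()
            if any(is_a(c, concept) for c in context_vector(desc))]


def best_schema(verb, nouns):
    """Pick the schema with the maximum number of roles matched by the nouns."""
    def score(schema):
        return sum(1 for concept in schema.roles.values()
                   if any(satisfying_senses(n, concept) for n in nouns))
    return max(SCHEMAS[verb], key=score)


def fallback_rule(word):
    """Assumed disambiguation rule: take the first listed sense."""
    return next(iter(SENSES[word]))


def disambiguate(word, schema):
    senses = SENSES.get(word, {})
    if not senses:
        return None
    if polysemy_count(word) == 1:  # not ambiguous
        return next(iter(senses))
    for concept in schema.roles.values():
        matches = satisfying_senses(word, concept)
        if len(matches) == 1:  # constraint singles out exactly one sense
            return matches[0]
    return fallback_rule(word)  # constraints did not resolve the word


nouns = ["boy", "bass", "river"]
schema = best_schema("fish", nouns)
result = {noun: disambiguate(noun, schema) for noun in nouns}
```

With this toy data, the "catch" schema matches all three nouns and is selected, and the "patient: animal" constraint singles out the fish sense of "bass". Swapping in a real lexical resource would change only `SENSES`, `HYPERNYMS` and `SCHEMAS`; the control flow would stay the same.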

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)
PCT/MY2014/000154 2013-11-27 2014-05-29 Method and system for automated word sense disambiguation WO2015080559A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2013004280A MY182881A (en) 2013-11-27 2013-11-27 A method and system for automated entity recognition
MYPI2013004280 2013-11-27

Publications (1)

Publication Number Publication Date
WO2015080559A2 (fr) 2015-06-04

Family

ID=51690418

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2014/000154 WO2015080559A2 (fr) 2013-11-27 2014-05-29 Method and system for automated word sense disambiguation

Country Status (2)

Country Link
MY (1) MY182881A (fr)
WO (1) WO2015080559A2 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
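CN108509449A (zh) * 2017-02-24 2018-09-07 腾讯科技(深圳)有限公司 Information processing method and server
CN108509449B (zh) * 2017-02-24 2022-07-08 腾讯科技(深圳)有限公司 Information processing method and server
CN109492214A (zh) * 2017-09-11 2019-03-19 苏州大学 Attribute word recognition and hierarchy construction method, apparatus, device and storage medium
CN109492214B (zh) * 2017-09-11 2023-09-19 苏州大学 Attribute word recognition and hierarchy construction method, apparatus, device and storage medium
CN111199149A (zh) * 2019-12-17 2020-05-26 航天信息股份有限公司 Intelligent sentence clarification method and *** for a dialogue ***
CN111199149B (zh) * 2019-12-17 2023-10-20 航天信息股份有限公司 Intelligent sentence clarification method and *** for a dialogue ***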

Also Published As

Publication number Publication date
MY182881A (en) 2021-02-05

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 14783674

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 14783674

Country of ref document: EP

Kind code of ref document: A2