CN115293158A - Disambiguation method and device based on label assistance - Google Patents

Disambiguation method and device based on label assistance

Info

Publication number
CN115293158A
Authority
CN
China
Prior art keywords
words
word
vocabulary
entity
disambiguated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210758371.0A
Other languages
Chinese (zh)
Other versions
CN115293158B (en)
Inventor
夏煜
龙非池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rocking Digital Chongqing Technology Co ltd
Original Assignee
Rocking Digital Chongqing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rocking Digital Chongqing Technology Co ltd filed Critical Rocking Digital Chongqing Technology Co ltd
Priority to CN202210758371.0A priority Critical patent/CN115293158B/en
Publication of CN115293158A publication Critical patent/CN115293158A/en
Application granted granted Critical
Publication of CN115293158B publication Critical patent/CN115293158B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and provides a label-assisted disambiguation method and a label-assisted disambiguation device. The method comprises the following steps: acquiring a document to be processed, and extracting the entity words to be disambiguated and the word segmentation set in the document to be processed by using a word segmentation technique; determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library; calculating the similarity between each of the vocabulary labels and the word segmentation set, and determining the target similarity; and taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated. Compared with the prior art, the label-assisted disambiguation method and device provided by the invention obtain an accurate disambiguation result, so that the semantics of the entity obtained by the user are clear and the accuracy of the entity information is improved.

Description

Disambiguation method and device based on label assistance
Technical Field
The invention relates to the technical field of natural language processing, in particular to a label-assisted disambiguation method and device.
Background
Natural language processing frequently encounters polysemy, that is, one word carrying several meanings. This affects applications such as machine translation, automatic summarization, question answering systems, public opinion analysis, machine writing, information retrieval, and text classification, all of which depend on discourse comprehension. To achieve better accuracy, or results closer to what the application expects, words with multiple senses need to be disambiguated.
An entity refers to something that exists objectively and can be distinguished from other things, including specific people, things, abstract concepts, or relations; a knowledge base contains entities of various types. Entity disambiguation (also known as semantic disambiguation) is a technique specifically used to resolve the ambiguity that arises from entities sharing the same name. In a practical language environment, one entity name often corresponds to multiple named entity objects.
When the semantics of the entity obtained by a user are ambiguous, the accuracy of the entity information obtained by the user is low.
Disclosure of Invention
The invention aims to provide a label-assisted disambiguation method and a label-assisted disambiguation device, so as to solve the problem that in the prior art, the obtained entity semantics are ambiguous, so that the accuracy of the entity information obtained by a user is low.
In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:
in a first aspect, an embodiment of the present invention provides a label-assisted disambiguation method, where the method includes: acquiring a document to be processed, and extracting entity words to be disambiguated and word segmentation sets in the document to be processed by utilizing a word segmentation technology; determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library; calculating the similarity between the vocabulary labels and the word segmentation set respectively, and determining the target similarity; and taking the vocabulary labels corresponding to the target similarity as disambiguation results of the entity words to be disambiguated.
In a second aspect, an embodiment of the present invention provides a label-assisted disambiguation apparatus, which includes: a word segmentation extraction module, used for acquiring a document to be processed and extracting the entity words to be disambiguated and the word segmentation set in the document to be processed by using a word segmentation technique; a vocabulary label determining module, used for determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library; a similarity calculation module, used for calculating the similarity between each of the vocabulary labels and the word segmentation set and determining the target similarity; and an entity word disambiguation module, used for taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
the label-assisted disambiguation method and device provided by the embodiment of the invention extract the entity words to be disambiguated and the word segmentation set from the document to be processed by using a word segmentation technique, determine a plurality of vocabulary labels corresponding to the entity words to be disambiguated from the preset entity word label library, calculate the similarity between each of the vocabulary labels and the word segmentation set, determine the target similarity, and finally take the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated. An accurate disambiguation result is thereby obtained, the semantics of the entity obtained by the user are clear, and the accuracy of the entity information is improved.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; a person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
FIG. 1 is a block diagram of an electronic device provided by an embodiment of the invention;
FIG. 2 is a flow chart illustrating a label-based assisted disambiguation method provided by an embodiment of the present invention;
FIG. 3 is a flow chart of sub-steps of step S2 shown in FIG. 2;
FIG. 4 is a flow chart of sub-steps of step S3 shown in FIG. 2;
FIG. 5 is a flow chart of sub-steps of step S4 shown in FIG. 2;
FIG. 6 is a schematic diagram illustrating a tag-based assisted disambiguation apparatus according to an embodiment of the present invention;
reference numerals: 100-an electronic device; 101-a processor; 102-a memory; 103-a bus; 104-a communication interface; 105-a display screen; 200-label-based assisted disambiguation means; 201-interference item processing module; 202-word segmentation extraction module; 203-vocabulary label determination module; 204-a similarity calculation module; 205-entity word disambiguation module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention without making creative efforts, fall within the protection scope of the invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
The label-assisted disambiguation method provided by the embodiment of the invention is applied to electronic equipment, and the electronic equipment can be, but is not limited to, a smart phone, a tablet computer, a personal computer, a vehicle-mounted computer, a Personal Digital Assistant (PDA) and the like. Referring to fig. 1, fig. 1 is a block diagram illustrating an electronic device according to an embodiment of the present invention, where the electronic device 100 includes a processor 101, a memory 102, a bus 103, a communication interface 104, and a display screen 105. The processor 101, the memory 102, the communication interface 104 and the display screen 105 are connected by a bus 103, and the processor 101 is configured to execute executable modules, such as computer programs, stored in the memory 102.
The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the label-assisted disambiguation method may be performed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 102 may comprise a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The memory 102 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The bus 103 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Only one bi-directional arrow is shown in fig. 1, but this does not indicate only one bus 103 or one type of bus 103.
The electronic device 100 implements a communication connection between the electronic device 100 and an external device through at least one communication interface 104 (which may be wired or wireless). The memory 102 is used to store a program, such as a tag-assisted based disambiguation apparatus 200. The tag-based assisted disambiguation apparatus 200 includes at least one software functional module that may be stored in the memory 102 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device 100. The processor 101, upon receiving the execution instruction, executes the program to implement a tag-assisted based disambiguation method.
The display screen 105 is used for displaying, and the displayed content may be some processing result of the processor 101. The display screen 105 may be a touch display screen, a display screen without interactive functionality, or the like. The display screen 105 may display the engineering information pieces, the documents to be processed, and the disambiguation result.
It should be understood that the configuration shown in fig. 1 is merely a schematic application of the configuration of the electronic device 100, and that the electronic device 100 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
First embodiment
Referring to fig. 2, fig. 2 is a flowchart illustrating a tag-based assisted disambiguation method according to an embodiment of the present invention. The label-assisted disambiguation method comprises the following steps:
S1, acquiring an engineering information segment, and removing interference items from the engineering information segment to obtain a document to be processed.
In the embodiment of the present invention, the engineering information segment may be a network segment containing web addresses, letters, characters, symbols, numbers, pictures, spaces, and the like. The document to be processed may be the textual content of the engineering information segment. Step S1 can be understood as filtering the web addresses, letters, symbols, numbers, pictures, spaces, and similar information out of the engineering information segment, so that the resulting document to be processed contains only the text characters. The engineering information segment may be stored in the memory 102 of the electronic device 100, or may be sent by another electronic device and received through the communication interface 104.
The specific code is as follows:
    import re
    # Read one item of the list and convert it to str
    csv_text = str(csv_data_list[i])
    # Delete runs of digits (\d is equivalent to [0-9]; + matches one or more times),
    # then convert letters to lower case
    csv_text = re.sub(r'\d+', '', csv_text).lower()
    # Delete miscellaneous characters: Latin letters, digits, punctuation, markup symbols, CR and LF
    # (same character set as the original pattern, written as a raw string without redundant escapes)
    csv_text = re.sub(r"[A-Za-z0-9!%,。.+_#?【】'<>=:/&\"\-\r\n]", "", csv_text)
    # Replace [ and ] with commas
    csv_text = re.sub(r'[\[\]]', ',', csv_text)
    # Delete backslashes
    csv_text = re.sub(r'\\', '', csv_text)
For example:
the acquired engineering information segment given as input is:
"<!--jrj_final_title_start--><p>May I ask, does Apple have any related new energy projects?</p>"
The output document to be processed is:
May I ask does Apple have any related new energy projects
Because interference items are removed from the engineering information segment, the resulting document to be processed contains only text characters, which reduces the workload of the subsequent disambiguation and effectively improves disambiguation efficiency.
S2, obtaining the document to be processed, and extracting the entity words to be disambiguated and the word segmentation set in the document to be processed by utilizing a word segmentation technology.
In the embodiment of the present invention, the entity word to be disambiguated may be a homonymous entity noun in the document to be processed, that is, a noun with one name but several meanings, such as "apple", "millet", "metaverse", "bean paste", "Himalaya", and the like. "Apple" can refer to both the company of that name and the fruit; "millet" can refer to both the company of that name and the grain; "metaverse" can refer to both a company of that name and a virtual digital living space; "bean paste" can refer to the company of that name, the seasoning, and the website of that name; "Himalaya" can refer to the company of that name, the mountain range, and the platform of that name. The word segmentation set consists of all words in the document to be processed other than the entity word to be disambiguated. For example, when the document to be processed is "May I ask does Apple have any related new energy projects", word segmentation yields "may I ask", "apple", "have", "new energy", "related", "project", and the sentence-final question particle; "apple" is determined to be the entity word to be disambiguated, and the word segmentation set is "may I ask", "have", "new energy", "related", "project", and the question particle.
In the embodiment of the invention, the step of acquiring the document to be processed and extracting the entity words to be disambiguated and the word segmentation set by using a word segmentation technique can be understood as follows. An entity word library storing a plurality of homonymous entity words (same name, different meanings) is stored in advance. The acquired document to be processed is segmented to obtain a plurality of segmented words, and these segmented words are compared with the homonymous entity words in the pre-stored entity word library. The segmented words that match a pre-stored homonymous entity word are taken as the entity words to be disambiguated, and the remaining segmented words are taken as the word segmentation set.
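The dictionary comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `split_entity_and_context`, the contents of `ENTITY_LEXICON`, and the pre-segmented token list are all hypothetical and chosen only to mirror the example sentence.

```python
# Hypothetical pre-stored entity word library of homonymous entity words.
ENTITY_LEXICON = {"apple", "millet", "metaverse"}


def split_entity_and_context(tokens, lexicon=ENTITY_LEXICON):
    """Split segmented words into entity words to be disambiguated
    and the remaining word segmentation set, preserving order."""
    entity_words = [t for t in tokens if t in lexicon]
    segmentation_set = [t for t in tokens if t not in lexicon]
    return entity_words, segmentation_set


# Segmented words as a word segmenter might return them (assumed input).
tokens = ["may I ask", "apple", "have", "new energy", "related", "project"]
entities, context = split_entity_and_context(tokens)
print(entities)  # ['apple']
print(context)   # ['may I ask', 'have', 'new energy', 'related', 'project']
```

A set is used for the lexicon so that each membership check is constant time, which matters when the entity word library is large.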
Referring to fig. 3, step S2 may further include the following sub-steps:
Sub-step S21, acquiring the document to be processed, and performing word segmentation and part-of-speech tagging on the document to be processed by using a HanLP word segmentation algorithm to obtain a plurality of segmented words and the part of speech corresponding to each segmented word.
In the embodiment of the invention, the HanLP word segmentation algorithms include standard segmentation, NLP segmentation, index segmentation, N-shortest-path segmentation, CRF segmentation, high-speed dictionary segmentation, and the like. The part of speech may be an adjective, an adverbial adjective, an adjectival morpheme, an adjectival idiom, a nominal adjective, a distinguishing word, an interjection, a conjunction, a coordinating conjunction, and the like. A part-of-speech correspondence table is stored in the HanLP word segmentation model in advance; Table 1 shows part of the HanLP part-of-speech correspondence table.
TABLE 1
Symbol    Description
a         Adjective
ad        Adverbial adjective
ag        Adjectival morpheme
al        Adjectival idiom
an        Nominal adjective
b         Distinguishing word
begin     Used only for the start marker
bg        Distinguishing morpheme
bl        Distinguishing-word idiom
c         Conjunction
cc        Coordinating conjunction
d         Adverb
dg        Adverbs of the "often", "all", "again" kind
dl        Linking phrase
e         Interjection
end       Used only for the end marker
That is, after the document to be processed is acquired, the HanLP word segmentation algorithm segments it and tags parts of speech, yielding a plurality of segmented words and the part of speech corresponding to each segmented word.
S22, determining a target part of speech from the plurality of parts of speech, taking the segmented words corresponding to the target part of speech as the entity words to be disambiguated, and taking the remaining segmented words as the word segmentation set.
In the embodiment of the invention, any part of speech among those obtained that matches a preset part of speech is taken as the target part of speech; the preset part of speech may be a dedicated part of speech that marks homonymous entity words. The segmented words corresponding to the target part of speech are taken as the entity words to be disambiguated, and the remaining segmented words are taken as the word segmentation set.
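Sub-step S22 can be illustrated with a short sketch that operates on (word, part-of-speech) pairs such as a tagger might return. The tag name `TARGET_POS`, the function `select_by_pos`, and the tagged input are assumptions made for illustration; the source does not state which tag marks homonymous entity words, and no real HanLP call is shown here.

```python
# Hypothetical dedicated part-of-speech tag for homonymous entity words.
# The source does not name the actual tag; this value is illustrative only.
TARGET_POS = "entity_ambiguous"


def select_by_pos(tagged_tokens, target_pos=TARGET_POS):
    """Return (entity words to be disambiguated, word segmentation set)
    from a list of (word, part_of_speech) pairs."""
    entity_words = [word for word, pos in tagged_tokens if pos == target_pos]
    segmentation_set = [word for word, pos in tagged_tokens if pos != target_pos]
    return entity_words, segmentation_set


# Assumed tagger output for the example sentence ("v", "n", "a" are illustrative tags).
tagged = [
    ("may I ask", "v"),
    ("apple", TARGET_POS),
    ("have", "v"),
    ("new energy", "n"),
    ("related", "a"),
    ("project", "n"),
]
entities, context = select_by_pos(tagged)
print(entities)  # ['apple']
```

Compared with the dictionary lookup, this variant relies on the tagger having been configured with a custom dictionary that assigns the dedicated tag, so the selection itself reduces to a single pass over the tagged words.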
And S3, determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library.
In the embodiment of the present invention, the preset entity word label library may include a plurality of homonymous entity words and a plurality of vocabulary labels corresponding to each homonymous entity word. The vocabulary labels can represent related information, industry information, and the like of the homonymous entity words. The step of determining a plurality of vocabulary labels corresponding to the entity word to be disambiguated from the preset entity word label library can be understood as comparing the entity word to be disambiguated with the homonymous entity words stored in the preset entity word label library, and obtaining the vocabulary labels of the homonymous entity word that matches the entity word to be disambiguated.
The preset entity word tag library may further include a preset vocabulary association library and a preset vocabulary classification library, the preset vocabulary association library includes a plurality of first vocabularies and a plurality of associated vocabularies corresponding to each first vocabulary, and the preset vocabulary classification library includes a plurality of second vocabularies and at least one industry category corresponding to each second vocabulary. Referring to fig. 4, step S3 may further include the following sub-steps:
S31, comparing the plurality of first vocabularies in the preset vocabulary association library with the entity word to be disambiguated to obtain the target first vocabulary that matches the entity word to be disambiguated.
In the embodiment of the invention, a first vocabulary represents a homonymous entity word (same name, different meanings) in the preset vocabulary association library, and the target first vocabulary is the first vocabulary in the preset vocabulary association library that matches the entity word to be disambiguated.
And S32, comparing the plurality of second words in the preset word classification library with the entity words to be disambiguated to obtain target second words consistent with the entity words to be disambiguated.
In the embodiment of the invention, the second vocabulary represents the entity words with the same name and different meaning in the preset vocabulary classification library, and the target second vocabulary is the second vocabulary which is consistent with the entity words to be disambiguated in the preset vocabulary classification library.
S33, taking a plurality of related vocabularies corresponding to the target first vocabulary and at least one industry category corresponding to the target second vocabulary as a plurality of vocabulary labels corresponding to the entity words to be disambiguated.
In the embodiment of the present invention, the related vocabularies represent information related to the first vocabulary. For example, when the first vocabulary is "apple", the related vocabularies may be, but are not limited to, Honor, Huawei, company, mobile phone, watch, banana, pear, grape, fruit tree, research and development, Jobs, and so on. The industry category characterizes the industry classification of the second vocabulary; for example, when the second vocabulary is "apple", the industry category can be the science and technology industry, the food industry, and so on. It should be noted that the target first vocabulary and the target second vocabulary refer to the same homonymous entity word. Taking the related vocabularies corresponding to the target first vocabulary and the at least one industry category corresponding to the target second vocabulary as the vocabulary labels of the entity word to be disambiguated can be understood as combining the two groups into a single set of vocabulary labels. Preferably, the resulting vocabulary labels are deduplicated so that each label is kept only once, which reduces repeated data processing later and improves disambiguation efficiency.
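Sub-steps S31 to S33, including the optional deduplication, can be sketched with two dictionaries standing in for the preset libraries. The names `ASSOCIATION_LIBRARY`, `CLASSIFICATION_LIBRARY`, and `vocabulary_labels` are hypothetical, and the library contents are a small subset of the "apple" example chosen for illustration.

```python
# Hypothetical stand-ins for the preset vocabulary association library
# (first vocabulary -> related vocabularies) and the preset vocabulary
# classification library (second vocabulary -> industry categories).
ASSOCIATION_LIBRARY = {
    "apple": ["company", "mobile phone", "watch", "banana", "pear",
              "grape", "fruit tree", "research and development"],
}
CLASSIFICATION_LIBRARY = {
    "apple": ["science and technology industry", "food industry"],
}


def vocabulary_labels(entity_word,
                      association=ASSOCIATION_LIBRARY,
                      classification=CLASSIFICATION_LIBRARY):
    """Combine related vocabularies and industry categories for the
    matching entry and drop duplicate labels, keeping first-seen order."""
    related = association.get(entity_word, [])        # S31: target first vocabulary
    industries = classification.get(entity_word, [])  # S32: target second vocabulary
    # S33 plus deduplication: dict.fromkeys keeps the first occurrence of each label.
    return list(dict.fromkeys(related + industries))


labels = vocabulary_labels("apple")
print(len(labels))  # 10
```

An entity word that appears in neither library yields an empty list, which a caller would need to treat as "no labels available" before computing similarities.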
And S4, calculating the similarity between the plurality of vocabulary labels and the participle set respectively, and determining the target similarity.
In the embodiment of the invention, the target similarity characterizes the maximum similarity among the similarity of the vocabulary labels and the word segmentation sets.
Referring to fig. 5, step S4 may include the following sub-steps:
S41, calculating the similarity between each vocabulary label and the word segmentation set to obtain the similarity of each vocabulary label.
The string comparison function compare is a weighted sum of 0.4 times the cosine similarity, 0.3 times the edit-distance similarity, and 0.3 times the sequence-matching ratio. The specific code is as follows:
    import difflib

    def compare(str1, str2):
        # str1 and str2 are the two strings being compared:
        # str1 is the word segmentation set, str2 is a vocabulary label
        if str1 == str2:
            return 1.0
        # Sequence-matching ratio from the standard library
        diff_result = difflib.SequenceMatcher(None, str1, str2).ratio()
        # cos_sim and edit_similar are helper functions not shown in the source
        cos_result = cos_sim(str1, str2)
        edit_result = edit_similar(str1, str2)
        return cos_result * 0.4 + edit_result * 0.3 + diff_result * 0.3
The returned value is the similarity of each vocabulary label.
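The source does not show the helpers `cos_sim` and `edit_similar` that `compare` calls. The following self-contained sketch fills them in under explicit assumptions: cosine similarity is taken over character-frequency vectors, and edit-distance similarity is taken as one minus the Levenshtein distance divided by the longer string's length. Both definitions are illustrative choices, not the patent's, so the scores they produce will not necessarily reproduce the figures quoted in the text.

```python
import difflib
import math
from collections import Counter


def cos_sim(str1, str2):
    """Cosine similarity of character-frequency vectors (assumed definition)."""
    c1, c2 = Counter(str1), Counter(str2)
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0


def edit_similar(str1, str2):
    """1 - Levenshtein distance / longer length (assumed normalization)."""
    if not str1 and not str2:
        return 1.0
    prev = list(range(len(str2) + 1))
    for i, ch1 in enumerate(str1, start=1):
        curr = [i]
        for j, ch2 in enumerate(str2, start=1):
            cost = 0 if ch1 == ch2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(str1), len(str2))


def compare(str1, str2):
    """Weighted similarity: 0.4 cosine + 0.3 edit distance + 0.3 sequence matching."""
    if str1 == str2:
        return 1.0
    diff_result = difflib.SequenceMatcher(None, str1, str2).ratio()
    return (cos_sim(str1, str2) * 0.4
            + edit_similar(str1, str2) * 0.3
            + diff_result * 0.3)


print(compare("abc", "abc"))  # 1.0
print(compare("abc", "xyz"))  # 0.0
```

Because the three components each lie in the range 0 to 1 and the weights sum to 1, the combined score also lies in that range, which keeps scores for different labels directly comparable.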
And S42, comparing all the similarity degrees, and taking the maximum similarity degree as the target similarity degree.
For example, when the entity word to be disambiguated is "apple", the vocabulary labels are "Honor, Huawei, company, mobile phone, watch, banana, pear, grape, fruit tree, research and development, Jobs, science and technology industry, food industry", and the word segmentation set is "may I ask, have, new energy, related, project, and the question particle", the similarity between each vocabulary label and the word segmentation set is as follows:
Vocabulary label                   Similarity
Honor                              0.444434
Huawei                             0.476431
company                            0.730766
mobile phone                       0.286301
watch                              0.283275
banana                             0.186331
pear                               0.156289
grape                              0.169347
fruit tree                         0.489634
research and development           0.605594
Jobs                               0.487695
science and technology industry    0.444434
food industry                      0.320620
The maximum similarity, i.e., the target similarity, is 0.730766.
And S5, taking the vocabulary labels corresponding to the target similarity as disambiguation results of the entity words to be disambiguated.
In the above example, the vocabulary label corresponding to the target similarity 0.730766 is "company", and the disambiguation result of the entity word "apple" to be disambiguated is "company", i.e., apple company.
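Steps S42 and S5 amount to selecting the label with the highest similarity. A minimal sketch follows, using a subset of the similarity figures quoted in the example; the function name `pick_label` and the dictionary layout are assumptions for illustration.

```python
# Subset of the similarities quoted in the example (label -> similarity).
similarities = {
    "company": 0.730766,
    "research and development": 0.605594,
    "fruit tree": 0.489634,
    "mobile phone": 0.286301,
    "banana": 0.186331,
}


def pick_label(scores):
    """Return (label, similarity) for the highest-scoring label.
    Raises ValueError if scores is empty, since no label can be chosen."""
    label = max(scores, key=scores.get)  # S42: maximum similarity is the target similarity
    return label, scores[label]          # S5: its label is the disambiguation result


label, target_similarity = pick_label(similarities)
print(label, target_similarity)  # company 0.730766
```

If two labels tie for the maximum, `max` returns the one that appears first in the dictionary; the source does not say how ties should be resolved.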
Compared with the prior art, the embodiment of the invention has the following advantages:
First, interference items are removed from the engineering information segment, so the resulting document to be processed contains only text characters, which reduces the workload of the subsequent disambiguation and effectively improves disambiguation efficiency.
Second, the entity words to be disambiguated and the word segmentation set are extracted from the document to be processed by using a word segmentation technique, a plurality of vocabulary labels corresponding to the entity words to be disambiguated are determined from the preset entity word label library, the similarity between each of the vocabulary labels and the word segmentation set is calculated, the target similarity is determined, and the vocabulary label corresponding to the target similarity is taken as the disambiguation result of the entity word to be disambiguated. An accurate disambiguation result is thereby obtained, the semantics of the entity obtained by the user are clear, and the accuracy of the entity information is improved.
Second embodiment
Referring to fig. 6, fig. 6 is a block diagram of a label-assisted disambiguation apparatus according to an embodiment of the present invention. The label-assisted disambiguation apparatus 200 includes an interference item processing module 201, a word segmentation extraction module 202, a vocabulary label determination module 203, a similarity calculation module 204, and an entity word disambiguation module 205.
The interference item processing module 201 is configured to obtain an engineering information fragment and remove interference items from the engineering information fragment to obtain the document to be processed.
It is understood that the interference item processing module 201 may perform the above step S1.
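As a rough sketch of what the interference item processing module 201 might do, the following function reduces a fragment to plain text. Which items count as interference is an assumption here (HTML-like tags, URLs, punctuation and other symbols), and the function name remove_interference is hypothetical rather than taken from the embodiment.

```python
import re


def remove_interference(fragment: str) -> str:
    """Hypothetical step S1: keep only the textual characters of a fragment."""
    text = re.sub(r"<[^>]+>", " ", fragment)   # drop HTML-like tags (assumed interference)
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs (assumed interference)
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse the remaining whitespace


document = remove_interference("<p>Apple  new-energy project!</p> http://example.com/x")
print(document)
```

In Python 3 the pattern `\w` also matches Chinese characters, so the same sketch would keep Chinese text while discarding symbols.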
The word segmentation extraction module 202 is configured to obtain the document to be processed, and extract the entity words to be disambiguated and the participle set from the document to be processed by using a word segmentation technology.
It is understood that the word segmentation extraction module 202 may perform the above step S2.
In the embodiment of the present invention, the word segmentation extraction module 202 is specifically configured to: obtain the document to be processed, and perform word segmentation and part-of-speech tagging on the document to be processed by using the HanLP word segmentation algorithm to obtain a plurality of segmented words and the part of speech corresponding to each segmented word; and determine a target part of speech from the plurality of parts of speech, take the segmented words corresponding to the target part of speech as the entity words to be disambiguated, and take the remaining segmented words as the participle set.
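The split by part of speech can be sketched as follows. The sketch does not call HanLP; it assumes the tagger output is already available as (word, part of speech) pairs, and both the tag strings and the function name split_by_pos are hypothetical placeholders.

```python
from typing import List, Tuple


def split_by_pos(
    tagged: List[Tuple[str, str]], target_pos: str
) -> Tuple[List[str], List[str]]:
    """Split (word, part-of-speech) pairs into entity words and the participle set."""
    entity_words = [word for word, pos in tagged if pos == target_pos]
    participle_set = [word for word, pos in tagged if pos != target_pos]
    return entity_words, participle_set


# Hypothetical tagger output; the tags are placeholders, not real HanLP labels.
tagged_words = [
    ("ask for questions", "v"),
    ("apple", "ENTITY"),
    ("available", "v"),
    ("new energy", "n"),
    ("related", "a"),
    ("project", "n"),
]
entity_words, participle_set = split_by_pos(tagged_words, target_pos="ENTITY")
print(entity_words, participle_set)
```

Words tagged with the target part of speech become the entity words to be disambiguated, and every other segmented word goes into the participle set.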
The vocabulary label determining module 203 is configured to determine a plurality of vocabulary labels corresponding to the entity words to be disambiguated from the preset entity word label library.
It is understood that the vocabulary tag determination module 203 may perform the above step S3.
In an embodiment of the present invention, the preset entity word label library includes a preset vocabulary association library and a preset vocabulary classification library, the preset vocabulary association library includes a plurality of first words and a plurality of associated words corresponding to each of the first words, and the preset vocabulary classification library includes a plurality of second words and at least one industry category corresponding to each of the second words. The vocabulary label determination module 203 is specifically configured to: compare the plurality of first words in the preset vocabulary association library with the entity words to be disambiguated to obtain a target first word consistent with the entity words to be disambiguated; compare the plurality of second words in the preset vocabulary classification library with the entity words to be disambiguated to obtain a target second word consistent with the entity words to be disambiguated; and take the plurality of associated words corresponding to the target first word and the at least one industry category corresponding to the target second word as the plurality of vocabulary labels corresponding to the entity words to be disambiguated.
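A minimal sketch of this lookup, assuming both libraries are held in memory as dictionaries and that "consistent with" means an exact string match. The library contents, the split of the example labels into associated words and industry categories, and the function name determine_labels are illustrative assumptions.

```python
from typing import Dict, List

# Hypothetical preset vocabulary association library: first word -> associated words.
association_library: Dict[str, List[str]] = {
    "apple": [
        "Honor", "Huawei", "company", "mobile phone", "watch", "banana",
        "pear", "grape", "fruit tree", "research and development", "Jobs",
    ],
}

# Hypothetical preset vocabulary classification library: second word -> industry categories.
classification_library: Dict[str, List[str]] = {
    "apple": ["science and technology industry", "food industry"],
}


def determine_labels(
    entity_word: str,
    associations: Dict[str, List[str]],
    classifications: Dict[str, List[str]],
) -> List[str]:
    """Collect associated words and industry categories for an exactly matching word."""
    labels = list(associations.get(entity_word, []))
    labels.extend(classifications.get(entity_word, []))
    return labels


labels = determine_labels("apple", association_library, classification_library)
print(labels)
```

An entity word that appears in neither library yields an empty label list in this sketch; how the embodiment treats that case is not stated here.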
The similarity calculation module 204 is configured to calculate the similarity between each of the plurality of vocabulary labels and the participle set, and determine a target similarity.
It is understood that the similarity calculation module 204 may perform the above step S4.
In this embodiment of the present invention, the similarity calculation module 204 is specifically configured to: calculate the similarity between each vocabulary label and the participle set to obtain the similarity of each vocabulary label; and compare all the similarities and take the maximum similarity as the target similarity.
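One way such a similarity could be computed is sketched below. The similarity measure is an assumption made for illustration: each word is mapped to a toy vector, the participle set is represented by the mean of its word vectors, and the score is the cosine similarity. The vectors and the names cosine, mean_vector and find_target_similarity are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def find_target_similarity(
    labels: List[str], participles: List[str], embed: Dict[str, List[float]]
) -> Tuple[str, float, Dict[str, float]]:
    """Score every label against the participle set and keep the maximum."""
    context = mean_vector([embed[word] for word in participles])
    scores = {label: cosine(embed[label], context) for label in labels}
    best = max(scores, key=scores.get)
    return best, scores[best], scores


# Toy two-dimensional vectors, invented for this sketch only.
toy_embed = {
    "company": [0.9, 0.1],
    "banana": [0.1, 0.9],
    "project": [1.0, 0.0],
    "new energy": [0.8, 0.2],
}
best, score, scores = find_target_similarity(
    ["company", "banana"], ["project", "new energy"], toy_embed
)
print(best, round(score, 6))
```

Any other similarity function could be substituted for the cosine without changing the final step, which only takes the maximum over the per-label scores.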
The entity word disambiguation module 205 is configured to take the vocabulary label corresponding to the target similarity as the disambiguation result of the entity words to be disambiguated.
It is understood that the entity word disambiguation module 205 may perform step S5 described above.
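The way modules 201 to 205 chain steps S1 to S5 can be sketched as a small pipeline. The class name DisambiguationApparatus, its field names and the stub functions in the demonstration are hypothetical stand-ins and not the modules of the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DisambiguationApparatus:
    remove_interference: Callable[[str], str]        # module 201, step S1
    extract: Callable[[str], Tuple[str, List[str]]]  # module 202, step S2
    determine_labels: Callable[[str], List[str]]     # module 203, step S3
    score: Callable[[str, List[str]], float]         # module 204, step S4

    def run(self, fragment: str) -> Tuple[str, str]:
        """Return (entity word, disambiguation result) for one fragment."""
        document = self.remove_interference(fragment)
        entity_word, participles = self.extract(document)
        labels = self.determine_labels(entity_word)
        scores = {label: self.score(label, participles) for label in labels}
        # Module 205, step S5: the label with the maximum similarity is the result.
        return entity_word, max(scores, key=scores.get)


# Stub modules, for demonstration only.
demo = DisambiguationApparatus(
    remove_interference=str.strip,
    extract=lambda doc: (doc.split()[0], doc.split()[1:]),
    determine_labels=lambda word: ["company", "fruit"],
    score=lambda label, context: 0.7 if label == "company" else 0.2,
)
result = demo.run("  apple new energy project ")
print(result)
```

In this sketch an entity word with no vocabulary labels makes `max` raise a ValueError, so a real implementation would need to decide how to handle that case.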
In summary, an embodiment of the present invention provides a label-assisted disambiguation method and apparatus, where the method includes: acquiring a document to be processed, and extracting entity words to be disambiguated and a participle set from the document to be processed by using a word segmentation technology; determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library; calculating the similarity between each of the plurality of vocabulary labels and the participle set, and determining the target similarity; and taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity words to be disambiguated. Compared with the prior art, the label-assisted disambiguation method provided by the embodiment of the invention has the following advantages: first, interference items are removed from the engineering information fragment, so that the resulting document to be processed contains only text, which reduces the workload of the subsequent disambiguation and effectively improves disambiguation efficiency. Second, an accurate disambiguation result is obtained, so that the user can clearly understand the semantics of the acquired entity, and the accuracy of entity information is improved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Claims (10)

1. A label-assisted disambiguation method, the method comprising:
acquiring a document to be processed, and extracting entity words to be disambiguated and word segmentation sets in the document to be processed by utilizing a word segmentation technology;
determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library;
calculating the similarity between the vocabulary labels and the word segmentation set respectively, and determining the target similarity;
and taking the vocabulary label corresponding to the target similarity as a disambiguation result of the entity word to be disambiguated.
2. The method of claim 1, wherein the step of obtaining the document to be processed and extracting the entity words to be disambiguated and the participle set in the document to be processed by using the participle technology comprises:
acquiring the document to be processed, and performing word segmentation and part-of-speech tagging on the document to be processed by using a HanLP word segmentation algorithm to obtain a plurality of segmented words and a part of speech corresponding to each segmented word;
determining a target part of speech from the plurality of parts of speech, taking the segmented words corresponding to the target part of speech as the entity words to be disambiguated, and taking the remaining segmented words as the participle set.
3. The method of claim 1, wherein the preset entity word label library includes a preset vocabulary association library and a preset vocabulary classification library, the preset vocabulary association library including a plurality of first words and a plurality of associated words corresponding to each of the first words, the preset vocabulary classification library including a plurality of second words and at least one industry category corresponding to each of the second words, and the step of determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from the preset entity word label library comprises:
comparing the plurality of first words in the preset vocabulary association library with the entity words to be disambiguated to obtain a target first word consistent with the entity words to be disambiguated;
comparing the plurality of second words in the preset vocabulary classification library with the entity words to be disambiguated to obtain a target second word consistent with the entity words to be disambiguated;
and taking the plurality of associated words corresponding to the target first word and the at least one industry category corresponding to the target second word as the plurality of vocabulary labels corresponding to the entity words to be disambiguated.
4. The method of claim 1, wherein the step of calculating the similarity of each of the plurality of vocabulary labels to the set of word segments and determining the target similarity comprises:
calculating the similarity between each vocabulary label and the participle set to obtain the similarity of each vocabulary label;
all the similarities are compared, and the maximum similarity is taken as the target similarity.
5. The method of any one of claims 1-4, wherein prior to the steps of obtaining a document to be processed and extracting entity words to be disambiguated and a set of word segments in the document to be processed using word segmentation techniques, the method further comprises:
and acquiring an engineering information fragment, and performing interference item removing processing on the engineering information fragment to obtain a document to be processed.
6. A tag-assisted disambiguation apparatus, comprising:
the word segmentation extraction module is used for acquiring a document to be processed and extracting entity words to be disambiguated and word segmentation sets in the document to be processed by utilizing a word segmentation technology;
the vocabulary label determining module is used for determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library;
the similarity calculation module is used for calculating the similarity between the vocabulary labels and the participle set respectively and determining the target similarity;
and the entity word disambiguation module is used for taking the vocabulary labels corresponding to the target similarity as the disambiguation result of the entity words to be disambiguated.
7. The apparatus of claim 6, wherein the segmentation extraction module is specifically configured to:
acquiring the document to be processed, and performing word segmentation and part-of-speech tagging on the document to be processed by using a HanLP word segmentation algorithm to obtain a plurality of segmented words and a part of speech corresponding to each segmented word;
determining a target part of speech from the plurality of parts of speech, taking the segmented words corresponding to the target part of speech as the entity words to be disambiguated, and taking the remaining segmented words as the participle set.
8. The apparatus of claim 6, wherein the preset entity word label library includes a preset vocabulary association library and a preset vocabulary classification library, the preset vocabulary association library includes a plurality of first words and a plurality of associated words corresponding to each of the first words, the preset vocabulary classification library includes a plurality of second words and at least one industry category corresponding to each of the second words, and the vocabulary label determining module is specifically configured to:
comparing the plurality of first words in the preset vocabulary association library with the entity words to be disambiguated to obtain a target first word consistent with the entity words to be disambiguated;
comparing the plurality of second words in the preset vocabulary classification library with the entity words to be disambiguated to obtain a target second word consistent with the entity words to be disambiguated;
and taking the plurality of associated words corresponding to the target first word and the at least one industry category corresponding to the target second word as the plurality of vocabulary labels corresponding to the entity words to be disambiguated.
9. The apparatus of claim 6, wherein the similarity calculation module is specifically configured to:
calculating the similarity between each vocabulary label and the participle set to obtain the similarity of each vocabulary label;
all the similarities are compared, and the maximum similarity is taken as the target similarity.
10. The apparatus according to any of claims 6-9, wherein the apparatus further comprises an interference term processing module,
the interference item processing module is used for acquiring the engineering information fragment and performing interference item removing processing on the engineering information fragment to obtain the document to be processed.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210758371.0A CN115293158B (en) 2022-06-30 2022-06-30 Label-assisted disambiguation method and device

Publications (2)

Publication Number Publication Date
CN115293158A true CN115293158A (en) 2022-11-04
CN115293158B CN115293158B (en) 2024-02-02

Family

ID=83823162

Country Status (1)

Country Link
CN (1) CN115293158B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030177000A1 (en) * 2002-03-12 2003-09-18 Verity, Inc. Method and system for naming a cluster of words and phrases
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN108491382A (en) * 2018-03-14 2018-09-04 四川大学 A kind of semi-supervised biomedical text semantic disambiguation method
CN109635297A (en) * 2018-12-11 2019-04-16 湖南星汉数智科技有限公司 A kind of entity disambiguation method, device, computer installation and computer storage medium
CN109376309A (en) * 2018-12-28 2019-02-22 北京百度网讯科技有限公司 Document recommendation method and device based on semantic label
CN111738009A (en) * 2019-03-19 2020-10-02 百度在线网络技术(北京)有限公司 Method and device for generating entity word label, computer equipment and readable storage medium
CN112464669A (en) * 2020-12-07 2021-03-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device and storage medium
CN112395421A (en) * 2021-01-21 2021-02-23 平安科技(深圳)有限公司 Course label generation method and device, computer equipment and medium
CN112966054A (en) * 2021-02-07 2021-06-15 撼地数智(重庆)科技有限公司 Enterprise graph node relation-based ethnic group division method and computer equipment
CN114547338A (en) * 2022-02-22 2022-05-27 撼地数智(重庆)科技有限公司 Method for identifying uniqueness of industrial and commercial main body

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
向宇; 郭云龙; 徐潇; 曾维刚; 李莉: "Multi-strategy entity word disambiguation and entity linking for Chinese microblogs", Computer Applications and Software, no. 08, pages 18 - 23 *

Also Published As

Publication number Publication date
CN115293158B (en) 2024-02-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant