CN115293158A - Disambiguation method and device based on label assistance

Info

Publication number: CN115293158A
Application number: CN202210758371.0A
Authority: CN
Inventors: 夏煜; 龙非池
Original assignee: Rocking Digital Chongqing Technology Co ltd
Current assignee: Rocking Digital Chongqing Technology Co ltd
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2022-11-04
Anticipated expiration: 2042-06-30
Also published as: CN115293158B

Abstract

The invention relates to the technical field of natural language processing and provides a label-assisted disambiguation method and device. The method comprises the following steps: acquiring a document to be processed, and extracting an entity word to be disambiguated and a word segmentation set from the document by using a word segmentation technique; determining a plurality of vocabulary labels corresponding to the entity word to be disambiguated from a preset entity word label library; calculating the similarity between each of the vocabulary labels and the word segmentation set, and determining a target similarity; and taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated. Compared with the prior art, the method and device obtain an accurate disambiguation result, so that the semantics of the entity obtained by the user are clear and the accuracy of the entity information is improved.

Description

Disambiguation method and device based on label assistance

Technical Field

The invention relates to the technical field of natural language processing, in particular to a label-assisted disambiguation method and device.

Background

Natural language processing often has to deal with polysemy, where one word carries several meanings. Polysemy affects applications such as machine translation, automatic summarization, question answering systems, public opinion analysis, machine writing, information retrieval, text classification and discourse understanding. To achieve better accuracy, or results closer to what the application expects, words with multiple meanings need to be disambiguated.

An entity refers to a thing that exists objectively and can be distinguished from other things, including specific people and things as well as abstract concepts or relations; a knowledge base contains entities of various types. Entity disambiguation (also known as semantic disambiguation) is a technique specifically used to resolve the ambiguity arising from entities that share the same name. In a practical language environment, one entity name often corresponds to multiple named entity objects.

When the semantics of the entity obtained by the user are ambiguous, the accuracy of the entity information obtained by the user is low.

Disclosure of Invention

The invention aims to provide a label-assisted disambiguation method and a label-assisted disambiguation device, so as to solve the problem that in the prior art, the obtained entity semantics are ambiguous, so that the accuracy of the entity information obtained by a user is low.

In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:

In a first aspect, an embodiment of the present invention provides a label-assisted disambiguation method, where the method includes: acquiring a document to be processed, and extracting an entity word to be disambiguated and a word segmentation set from the document by using a word segmentation technique; determining a plurality of vocabulary labels corresponding to the entity word to be disambiguated from a preset entity word label library; calculating the similarity between each of the vocabulary labels and the word segmentation set, and determining a target similarity; and taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated.

In a second aspect, an embodiment of the present invention provides a label-assisted disambiguation apparatus, where the apparatus includes: a word segmentation extraction module, used for acquiring a document to be processed and extracting an entity word to be disambiguated and a word segmentation set from the document by using a word segmentation technique; a vocabulary label determining module, used for determining a plurality of vocabulary labels corresponding to the entity word to be disambiguated from a preset entity word label library; a similarity calculation module, used for calculating the similarity between each of the vocabulary labels and the word segmentation set and determining a target similarity; and an entity word disambiguation module, used for taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated.

Compared with the prior art, the embodiment of the invention has the following beneficial effects:

The label-assisted disambiguation method and device provided by the embodiment of the invention extract the entity word to be disambiguated and the word segmentation set from the document to be processed by using a word segmentation technique, determine a plurality of vocabulary labels corresponding to the entity word to be disambiguated from the preset entity word label library, calculate the similarity between each of the vocabulary labels and the word segmentation set, determine the target similarity, and finally take the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated. An accurate disambiguation result is thereby obtained, so that the semantics of the entity obtained by the user are clear and the accuracy of the entity information is improved.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be regarded as limiting its scope; a person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

FIG. 1 is a block diagram of an electronic device provided by an embodiment of the invention;

FIG. 2 is a flow chart illustrating a label-based assisted disambiguation method provided by an embodiment of the present invention;

FIG. 3 is a flow chart of sub-steps of step S2 shown in FIG. 2;

FIG. 4 is a flow chart of sub-steps of step S3 shown in FIG. 2;

FIG. 5 is a flow chart of sub-steps of step S4 shown in FIG. 2;

FIG. 6 is a schematic diagram illustrating a tag-based assisted disambiguation apparatus according to an embodiment of the present invention;

Reference numerals: 100-electronic device; 101-processor; 102-memory; 103-bus; 104-communication interface; 105-display screen; 200-label-assisted disambiguation apparatus; 201-interference item processing module; 202-word segmentation extraction module; 203-vocabulary label determination module; 204-similarity calculation module; 205-entity word disambiguation module.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention without making creative efforts, fall within the protection scope of the invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

The label-assisted disambiguation method provided by the embodiment of the invention is applied to electronic equipment, and the electronic equipment can be, but is not limited to, a smart phone, a tablet computer, a personal computer, a vehicle-mounted computer, a Personal Digital Assistant (PDA) and the like. Referring to fig. 1, fig. 1 is a block diagram illustrating an electronic device according to an embodiment of the present invention, where the electronic device 100 includes a processor 101, a memory 102, a bus 103, a communication interface 104, and a display screen 105. The processor 101, the memory 102, the communication interface 104 and the display screen 105 are connected by a bus 103, and the processor 101 is configured to execute executable modules, such as computer programs, stored in the memory 102.

The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the label-assisted disambiguation method may be performed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The Memory 102 may comprise a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The Memory 102 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Read-Only Memory (EPROM), an electrically Erasable Read-Only Memory (EEPROM), and the like.

The bus 103 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Only one bi-directional arrow is shown in fig. 1, but this does not indicate only one bus 103 or one type of bus 103.

The electronic device 100 establishes a communication connection with an external device through at least one communication interface 104 (which may be wired or wireless). The memory 102 is used to store a program, such as the label-assisted disambiguation apparatus 200. The label-assisted disambiguation apparatus 200 includes at least one software functional module that may be stored in the memory 102 in the form of software or firmware, or solidified in the Operating System (OS) of the electronic device 100. Upon receiving an execution instruction, the processor 101 executes the program to implement the label-assisted disambiguation method.

The display screen 105 is used for displaying, and the displayed content may be some processing result of the processor 101. The display screen 105 may be a touch display screen, a display screen without interactive functionality, or the like. The display screen 105 may display the engineering information pieces, the documents to be processed, and the disambiguation result.

It should be understood that the configuration shown in fig. 1 is merely a schematic application of the configuration of the electronic device 100, and that the electronic device 100 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.

First embodiment

Referring to fig. 2, fig. 2 is a flowchart illustrating a label-assisted disambiguation method according to an embodiment of the present invention. The label-assisted disambiguation method comprises the following steps:

S1: acquiring an engineering information fragment, and performing interference item removal on the engineering information fragment to obtain a document to be processed.

In the embodiment of the present invention, the engineering information fragment may be a fragment of web content containing URLs, letters, text, symbols, digits, pictures, spaces, and the like. The document to be processed may be the textual content of the engineering information fragment. Step S1 can be understood as follows: interference item removal is performed on an engineering information fragment containing URLs, letters, text, symbols, digits, pictures, spaces and the like, and the URLs, letters, symbols, digits, pictures and spaces are filtered out, so that the resulting document to be processed contains only text. The engineering information fragment may be stored in the memory 102 of the electronic device 100, or may be sent by another electronic device and received through the communication interface 104.

The specific code is as follows:

import re

# Read one item of the list and convert it to str
csv_text = str(csv_data_list[i])

# Delete runs of digits (\d is equivalent to [0-9]; + matches one or more times), then convert letters to lower case
csv_text = re.sub(r'\d+', '', csv_text).lower()

# Delete letters, digits and miscellaneous symbols (ASCII and full-width forms of ! % ? = are both listed)
csv_text = re.sub(r"[A-Za-z0-9!！%％,。.+_#?？【】'<>=＝:/&\"\-\r\n]", "", csv_text)

# Replace [ and ] with a full-width comma
csv_text = re.sub(r'[\[\]]', '，', csv_text)

# Delete backslashes
csv_text = re.sub(r'\\', '', csv_text)

For example:

The acquired engineering information fragment given as input is:

"<!--jrj_final_title_start--><p>May I ask whether Apple has any new-energy-related projects?[p]"

The output document to be processed is:

May I ask whether Apple has any new-energy-related projects

Because interference item removal leaves a document to be processed that contains only text, the workload of the subsequent disambiguation is reduced and the disambiguation efficiency is effectively improved.

S2: acquiring the document to be processed, and extracting the entity word to be disambiguated and the word segmentation set from the document to be processed by using a word segmentation technique.

In the embodiment of the present invention, the entity word to be disambiguated may be a homonymous entity noun in the document to be processed, that is, one name shared by several different entities, such as "apple", "millet" (the Chinese name of Xiaomi), "metaverse", "Douban", "Himalaya", and the like. "Apple" can refer to both Apple Inc. and the fruit; "millet" can refer to both the Xiaomi company and millet as a grain; "metaverse" can refer to both a Metaverse company and a virtual digital living space; "Douban" can refer to the Douban company, to bean paste as a seasoning, and to the Douban website; "Himalaya" can refer to the Himalaya company, to the Himalaya mountains, and to the Himalaya platform. The word segmentation set may be all words in the document to be processed other than the entity word to be disambiguated. For example, when the document to be processed is "May I ask whether Apple has any new-energy-related projects", word segmentation yields "may I ask", "apple", "have", "new energy", "related", "project" and a sentence-final question particle; "apple" is determined to be the entity word to be disambiguated, and the word segmentation set is "may I ask", "have", "new energy", "related", "project" and the question particle.

In the embodiment of the invention, step S2 can be understood as follows: an entity lexicon storing a plurality of homonymous entity words is stored in advance; the acquired document to be processed is segmented to obtain a plurality of segmented words; the segmented words are compared with the homonymous entity words in the pre-stored entity lexicon; a segmented word that matches a pre-stored homonymous entity word is taken as the entity word to be disambiguated, and the remaining segmented words are taken as the word segmentation set.
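The lexicon comparison described above can be sketched as follows. This is a minimal illustration, assuming the document has already been segmented into a list of words; the function name split_entities and the contents of ENTITY_LEXICON are hypothetical and are not taken from this document.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical pre-stored entity lexicon of homonymous entity words.
ENTITY_LEXICON: Set[str] = {"apple", "millet", "metaverse"}


def split_entities(tokens: Iterable[str], lexicon: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (entity words to be disambiguated, word segmentation set)."""
    entities: List[str] = []
    context: List[str] = []
    for token in tokens:
        # A segmented word found in the lexicon is an entity word to be
        # disambiguated; every other word goes to the word segmentation set.
        if token in lexicon:
            entities.append(token)
        else:
            context.append(token)
    return entities, context


tokens = ["may I ask", "apple", "have", "new energy", "related", "project"]
entities, context = split_entities(tokens, ENTITY_LEXICON)
print(entities)  # ['apple']
print(context)   # ['may I ask', 'have', 'new energy', 'related', 'project']
```

A set is used for the lexicon so that each membership test is a constant-time lookup, which matters when the lexicon is large.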

Referring to fig. 3, step S2 may further include the following sub-steps:

Sub-step S21: acquiring the document to be processed, and performing word segmentation and part-of-speech tagging on it with a HanLP word segmentation algorithm to obtain a plurality of segmented words and the part of speech corresponding to each segmented word.

In the embodiment of the invention, the HanLP word segmentation algorithms include standard segmentation, NLP segmentation, index segmentation, N-shortest-path segmentation, CRF segmentation, high-speed dictionary segmentation, and the like. The part of speech may be an adjective, an adverbial adjective, an adjectival morpheme, an adjectival idiom, a nominal adjective, a distinguishing word, an interjection, a conjunction, a coordinating conjunction, and the like. A part-of-speech correspondence table is stored in the HanLP word segmentation model in advance; Table 1 shows part of the HanLP part-of-speech correspondence table.

TABLE 1

Symbol	Description
a	Adjective
ad	Adverbial adjective
ag	Adjectival morpheme
al	Adjectival idiom
an	Nominal adjective
b	Distinguishing word
begin	Used only for the start marker (start##start)
bg	Distinguishing morpheme
bl	Distinguishing-word idiom
c	Conjunction
cc	Coordinating conjunction
d	Adverb
dg	Adverbs of the "often", "all", "again" type
dl	Linked phrase
e	Interjection
end	Used only for the end marker (end##end)

That is, the document to be processed is acquired, and a HanLP word segmentation algorithm performs word segmentation and part-of-speech tagging on it, yielding a plurality of segmented words and the part of speech corresponding to each segmented word.

Sub-step S22: determining a target part of speech from the parts of speech of the segmented words, taking the segmented word corresponding to the target part of speech as the entity word to be disambiguated, and taking the remaining segmented words as the word segmentation set.

In the embodiment of the invention, among the parts of speech of the segmented words, the one that matches a preset part of speech is taken as the target part of speech; the preset part of speech may be a special part of speech used to mark homonymous entity words. The segmented word corresponding to the target part of speech is taken as the entity word to be disambiguated, and the remaining segmented words are taken as the word segmentation set.
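Sub-step S22 can be sketched as a filter over (word, part of speech) pairs. The sketch below assumes the tagger output is already available as a list of tuples and does not call HanLP itself; the tag name "amb_entity" is a hypothetical placeholder for the preset part of speech and is not a real HanLP tag.

```python
from typing import List, Tuple

# Hypothetical preset part of speech that marks homonymous entity words.
TARGET_POS = "amb_entity"


def split_by_pos(
    tagged: List[Tuple[str, str]], target_pos: str = TARGET_POS
) -> Tuple[List[str], List[str]]:
    """Split tagged words into (entity words to be disambiguated, word segmentation set)."""
    entities = [word for word, pos in tagged if pos == target_pos]
    context = [word for word, pos in tagged if pos != target_pos]
    return entities, context


# Example tagger output; the tags other than TARGET_POS follow Table 1 style symbols.
tagged_words = [("may I ask", "v"), ("apple", "amb_entity"), ("new energy", "n"), ("related", "a")]
print(split_by_pos(tagged_words))
```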

S3: determining a plurality of vocabulary labels corresponding to the entity word to be disambiguated from a preset entity word label library.

In the embodiment of the present invention, the preset entity word label library may include a plurality of homonymous entity words and a plurality of vocabulary labels corresponding to each homonymous entity word. The vocabulary labels can represent related information, industry information and the like of the homonymous entity words. Step S3 can be understood as comparing the entity word to be disambiguated with the homonymous entity words stored in the preset entity word label library, and obtaining the vocabulary labels of the homonymous entity word that matches the entity word to be disambiguated.

The preset entity word label library may further include a preset vocabulary association library and a preset vocabulary classification library. The preset vocabulary association library includes a plurality of first words and a plurality of related words corresponding to each first word, and the preset vocabulary classification library includes a plurality of second words and at least one industry category corresponding to each second word. Referring to fig. 4, step S3 may further include the following sub-steps:

Sub-step S31: comparing the first words in the preset vocabulary association library with the entity word to be disambiguated to obtain a target first word that matches the entity word to be disambiguated.

In the embodiment of the invention, a first word is a homonymous entity word in the preset vocabulary association library, and the target first word is the first word in that library that matches the entity word to be disambiguated.

Sub-step S32: comparing the second words in the preset vocabulary classification library with the entity word to be disambiguated to obtain a target second word that matches the entity word to be disambiguated.

In the embodiment of the invention, a second word is a homonymous entity word in the preset vocabulary classification library, and the target second word is the second word in that library that matches the entity word to be disambiguated.

Sub-step S33: taking the related words corresponding to the target first word and the at least one industry category corresponding to the target second word as the vocabulary labels corresponding to the entity word to be disambiguated.

In the embodiment of the present invention, the related words represent information related to the first word. For example, when the first word is "apple", the related words may be, but are not limited to, Honor, Huawei, company, mobile phone, watch, banana, pear, grape, fruit tree, research and development, Jobs, and so on. The industry category represents the industry classification of the second word; for example, when the second word is "apple", the industry category can be the science and technology industry, the food industry, and so on. It should be noted that the target first word and the target second word refer to the same homonymous entity word. Sub-step S33 can be understood as combining the related words corresponding to the target first word with the at least one industry category corresponding to the target second word to obtain the vocabulary labels corresponding to the entity word to be disambiguated. Preferably, the vocabulary labels obtained for the entity word to be disambiguated are deduplicated, so that each repeated label is kept only once; this avoids processing the same data repeatedly later on and improves disambiguation efficiency.
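Sub-steps S31 to S33, including the optional deduplication, can be sketched with two dictionaries standing in for the two libraries. This is only an illustration under that assumption; the library contents and the function name vocabulary_labels are hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-ins for the preset vocabulary association library and
# the preset vocabulary classification library.
ASSOCIATION_LIBRARY: Dict[str, List[str]] = {
    "apple": ["company", "mobile phone", "banana", "fruit tree", "company"],
}
CLASSIFICATION_LIBRARY: Dict[str, List[str]] = {
    "apple": ["science and technology industry", "food industry"],
}


def vocabulary_labels(
    entity: str,
    association: Dict[str, List[str]],
    classification: Dict[str, List[str]],
) -> List[str]:
    """Combine related words and industry categories into deduplicated labels."""
    related = association.get(entity, [])       # S31: look up the target first word
    categories = classification.get(entity, [])  # S32: look up the target second word
    # S33: merge both lists; dict.fromkeys drops repeats and keeps first-seen order.
    return list(dict.fromkeys(related + categories))


print(vocabulary_labels("apple", ASSOCIATION_LIBRARY, CLASSIFICATION_LIBRARY))
```

Keeping first-seen order makes the output deterministic, which a plain set would not guarantee.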

S4: calculating the similarity between each of the vocabulary labels and the word segmentation set, and determining the target similarity.

In the embodiment of the invention, the target similarity is the maximum among the similarities between the vocabulary labels and the word segmentation set.

Referring to fig. 5, step S4 may include the following sub-steps:

Sub-step S41: calculating the similarity between each vocabulary label and the word segmentation set to obtain the similarity of each vocabulary label.

The string comparison function compare is a weighted sum of 0.4 times the cosine similarity, 0.3 times the edit-distance similarity and 0.3 times the sequence-matching ratio. The specific code is as follows:

import difflib

def compare(str1, str2):
    # str1 is the word segmentation set and str2 is a vocabulary label, both given as strings
    if str1 == str2:
        return 1.0
    # Sequence-matching ratio from the standard library
    diff_result = difflib.SequenceMatcher(None, str1, str2).ratio()
    # cos_sim and edit_similar are helper functions whose definitions are not listed here
    cos_result = cos_sim(str1, str2)
    edit_result = edit_similar(str1, str2)
    return cos_result * 0.4 + edit_result * 0.3 + diff_result * 0.3

The returned value is the similarity of each vocabulary label.
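The listing above calls cos_sim and edit_similar without defining them. A minimal runnable sketch follows, assuming a cosine similarity over character counts and a similarity derived from the Levenshtein edit distance; both definitions are assumptions made for illustration and may differ from the helpers actually used, so the scores it produces need not match the figures quoted in this document.

```python
import difflib
import math
from collections import Counter


def cos_sim(str1: str, str2: str) -> float:
    """Cosine similarity of character-count vectors (assumed definition)."""
    a, b = Counter(str1), Counter(str2)
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def edit_similar(str1: str, str2: str) -> float:
    """1 - Levenshtein distance / length of the longer string (assumed definition)."""
    if not str1 and not str2:
        return 1.0
    prev = list(range(len(str2) + 1))
    for i, c1 in enumerate(str1, start=1):
        curr = [i]
        for j, c2 in enumerate(str2, start=1):
            cost = 0 if c1 == c2 else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(str1), len(str2))


def compare(str1: str, str2: str) -> float:
    """Weighted sum: 0.4 cosine + 0.3 edit-distance similarity + 0.3 sequence ratio."""
    if str1 == str2:
        return 1.0
    diff_result = difflib.SequenceMatcher(None, str1, str2).ratio()
    return 0.4 * cos_sim(str1, str2) + 0.3 * edit_similar(str1, str2) + 0.3 * diff_result


print(compare("new energy project", "company"))
```

Because each of the three components lies between 0 and 1 and the weights sum to 1, the combined score also lies between 0 and 1.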

Sub-step S42: comparing all the similarities, and taking the maximum similarity as the target similarity.

For example, suppose the entity word to be disambiguated is "apple", the vocabulary labels are "Honor, Huawei, company, mobile phone, watch, banana, pear, grape, fruit tree, research and development, Jobs, science and technology industry, food industry", and the word segmentation set is "may I ask", "have", "new energy", "related", "project" and the question particle. The similarity between each vocabulary label and the word segmentation set is then: "Honor", 0.444434; "Huawei", 0.476431; "company", 0.730766; "mobile phone", 0.286301; "watch", 0.283275; "banana", 0.186331; "pear", 0.156289; "grape", 0.169347; "fruit tree", 0.489634; "research and development", 0.605594; "Jobs", 0.487695; "science and technology industry", 0.444434; and "food industry", 0.320620. The maximum similarity, i.e., the target similarity, is 0.730766.

S5: taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated.

In the above example, the vocabulary label corresponding to the target similarity 0.730766 is "company", so the disambiguation result of the entity word "apple" is "company", i.e., Apple Inc.
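Sub-step S42 and step S5 amount to picking the label with the highest score. The sketch below uses a few of the similarity values quoted in the example; the function name select_target is hypothetical.

```python
from typing import Dict, Tuple


def select_target(similarities: Dict[str, float]) -> Tuple[str, float]:
    """Return the vocabulary label with the maximum similarity and that similarity."""
    if not similarities:
        raise ValueError("no vocabulary labels to compare")
    label = max(similarities, key=similarities.get)
    return label, similarities[label]


# A subset of the similarities listed in the example above.
scores = {
    "company": 0.730766,
    "mobile phone": 0.286301,
    "banana": 0.186331,
    "research and development": 0.605594,
}
print(select_target(scores))  # ('company', 0.730766)
```

If two labels tie for the maximum, max returns the one that appears first in the dictionary; the document does not say how ties should be resolved, so this is only one possible choice.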

Compared with the prior art, the embodiment of the invention has the following advantages:

Firstly, interference item removal is performed on the engineering information fragment, and the resulting document to be processed contains only text, which reduces the workload of the subsequent disambiguation and effectively improves the disambiguation efficiency.

Secondly, the entity word to be disambiguated and the word segmentation set are extracted from the document to be processed by using a word segmentation technique, a plurality of vocabulary labels corresponding to the entity word to be disambiguated are determined from the preset entity word label library, the similarity between each of the vocabulary labels and the word segmentation set is calculated, the target similarity is determined, and finally the vocabulary label corresponding to the target similarity is taken as the disambiguation result of the entity word to be disambiguated. An accurate disambiguation result is thereby obtained, so that the semantics of the entity obtained by the user are clear and the accuracy of the entity information is improved.

Second embodiment

Referring to fig. 6, fig. 6 is a block diagram illustrating a tag-based assisted disambiguation apparatus according to an embodiment of the present invention. The label-based assisted disambiguation apparatus 200 includes an interfering item processing module 201, a participle extraction module 202, a lexical label determination module 203, a similarity calculation module 204, and an entity word disambiguation module 205.

And the interference item processing module 201 is configured to obtain the engineering information segment, and perform interference item removing processing on the engineering information segment to obtain the to-be-processed document.

It is understood that the interference term processing module 201 may perform the above step S1.

The word segmentation extraction module 202 is configured to obtain a to-be-processed document, and extract a to-be-disambiguated entity word and a word segmentation set in the to-be-processed document by using a word segmentation technology.

It is understood that the segmentation extracting module 202 may perform the step S2.

In the embodiment of the present invention, the word segmentation extracting module 202 is specifically configured to: the method comprises the steps of obtaining a document to be processed, and performing word segmentation and part-of-speech tagging on the document to be processed by utilizing a Hanlp word segmentation algorithm to obtain a plurality of word segmentation vocabularies and a part-of-speech corresponding to each word segmentation vocabulary; determining a target word segmentation part of speech from a plurality of word segmentation parts of speech, taking the word segmentation words corresponding to the target word segmentation part of speech as entity words to be disambiguated, and taking the rest word segmentation words as a word segmentation set.

The vocabulary label determining module 203 is configured to determine a plurality of vocabulary labels corresponding to the entity words to be disambiguated from the preset entity word label library.

It is understood that the vocabulary tag determination module 203 may perform the above step S3.

In an embodiment of the present invention, the preset entity word tag library includes a preset vocabulary association library and a preset vocabulary classification library, the preset vocabulary association library includes a plurality of first vocabularies and a plurality of associated vocabularies corresponding to each of the first vocabularies, and the preset vocabulary classification library includes a plurality of second vocabularies and at least one industry category corresponding to each of the second vocabularies. The vocabulary tag determination module 203 is specifically configured to: comparing a plurality of first words in a preset word correlation library with the entity words to be disambiguated to obtain target first words consistent with the entity words to be disambiguated; comparing a plurality of second words in a preset word classification library with the entity words to be disambiguated to obtain target second words consistent with the entity words to be disambiguated; and taking a plurality of related vocabularies corresponding to the target first vocabulary and at least one industry category corresponding to the target second vocabulary as a plurality of vocabulary labels corresponding to the entity words to be disambiguated.

The similarity calculation module 204 is configured to calculate the similarity between each of the plurality of vocabulary labels and the word segmentation set, and to determine a target similarity.

It is understood that the similarity calculation module 204 may perform step S4 described above.

In this embodiment of the present invention, the similarity calculation module 204 is specifically configured to: calculate the similarity between each vocabulary label and the word segmentation set to obtain a similarity for each vocabulary label; and compare all the similarities and take the maximum similarity as the target similarity.
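The selection of the target similarity can be illustrated as follows. This passage does not fix the similarity measure, so the sketch substitutes a simple stand-in (Jaccard overlap of character sets, averaged over the word segmentation set); both the measure and the averaging are assumptions, and all function names are hypothetical. Any other word-level similarity, such as one based on word vectors, could be passed in through `word_sim`.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def char_overlap(a: str, b: str) -> float:
    """Stand-in word similarity: Jaccard overlap of the two words' character sets."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def label_similarity(
    label: str,
    segmentation_set: Iterable[str],
    word_sim: Callable[[str, str], float] = char_overlap,
) -> float:
    """Average similarity between one label and every word in the set (assumed aggregation)."""
    words: List[str] = list(segmentation_set)
    if not words:
        return 0.0
    return sum(word_sim(label, word) for word in words) / len(words)


def select_target(
    labels: Iterable[str],
    segmentation_set: Iterable[str],
    word_sim: Callable[[str, str], float] = char_overlap,
) -> Tuple[str, float]:
    """Score every label, then return the label with the maximum similarity and that score."""
    words = list(segmentation_set)
    scores: Dict[str, float] = {
        label: label_similarity(label, words, word_sim) for label in labels
    }
    if not scores:
        raise ValueError("at least one vocabulary label is required")
    return max(scores.items(), key=lambda item: item[1])


best_label, best_score = select_target(["fruit", "phone"], ["phone", "release"])
```

In this toy run the label "phone" wins because it matches one context word exactly; the returned label is what module 205 would report as the disambiguation result.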

The entity word disambiguation module 205 is configured to take the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated.

It is understood that the entity word disambiguation module 205 may perform step S5 described above.

In summary, an embodiment of the present invention provides a label-assisted disambiguation method and apparatus, where the method includes: acquiring a document to be processed, and extracting the entity words to be disambiguated and the word segmentation set in the document to be processed by using a word segmentation technology; determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library; calculating the similarity between each of the plurality of vocabulary labels and the word segmentation set, and determining a target similarity; and taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated. Compared with the prior art, the label-assisted disambiguation method provided by the embodiment of the invention has the following advantages. First, interference item removal is performed on the engineering information fragment, so the resulting document to be processed contains only text, which reduces the later disambiguation workload and effectively improves disambiguation efficiency. Second, an accurate disambiguation result is obtained, so that the user can clearly understand the semantics of the acquired entity, and the accuracy of entity information is improved.
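As an illustration of the interference item removal mentioned above, the sketch below reduces a fragment to plain text. What counts as an interference item is not enumerated in this passage; treating markup tags and punctuation or symbol characters as interference is purely an assumption of this sketch, and the function name `remove_interference_items` is hypothetical.

```python
import re

# Assumed interference items: markup tags, then any character that is
# neither a word character (letters, digits, underscore, CJK) nor whitespace.
_TAG_RE = re.compile(r"<[^>]+>")
_NON_TEXT_RE = re.compile(r"[^\w\s]")


def remove_interference_items(fragment: str) -> str:
    """Return the fragment with tags and symbols replaced by spaces, whitespace collapsed."""
    text = _TAG_RE.sub(" ", fragment)
    text = _NON_TEXT_RE.sub(" ", text)
    return " ".join(text.split())


document = remove_interference_items("<p>Bridge  project: phase#2!</p>")
```

The cleaned string would then be passed to the word segmentation step as the document to be processed.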

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises the element.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Claims

1. A label-assisted disambiguation method, the method comprising:

acquiring a document to be processed, and extracting entity words to be disambiguated and word segmentation sets in the document to be processed by utilizing a word segmentation technology;

determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library;

calculating the similarity between each of the plurality of vocabulary labels and the word segmentation set, and determining a target similarity;

and taking the vocabulary label corresponding to the target similarity as a disambiguation result of the entity word to be disambiguated.

2. The method of claim 1, wherein the step of acquiring a document to be processed and extracting entity words to be disambiguated and a word segmentation set in the document to be processed by using a word segmentation technology comprises:

acquiring the document to be processed, and performing word segmentation and part-of-speech tagging on the document to be processed by using the HanLP word segmentation algorithm to obtain a plurality of segmented words and a part of speech corresponding to each segmented word;

determining a target part of speech from the plurality of parts of speech, taking the segmented words corresponding to the target part of speech as the entity words to be disambiguated, and taking the remaining segmented words as the word segmentation set.

3. The method of claim 1, wherein the preset entity word label library includes a preset vocabulary association library and a preset vocabulary classification library, the preset vocabulary association library including a plurality of first words and a plurality of associated words corresponding to each of the first words, the preset vocabulary classification library including a plurality of second words and at least one industry category corresponding to each of the second words, and the step of determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from the preset entity word label library comprises:

comparing the plurality of first words in the preset vocabulary association library with the entity word to be disambiguated to obtain a target first word consistent with the entity word to be disambiguated;

comparing the plurality of second words in the preset vocabulary classification library with the entity word to be disambiguated to obtain a target second word consistent with the entity word to be disambiguated;

and taking the plurality of associated words corresponding to the target first word and the at least one industry category corresponding to the target second word as the plurality of vocabulary labels corresponding to the entity word to be disambiguated.

4. The method of claim 1, wherein the step of calculating the similarity between each of the plurality of vocabulary labels and the word segmentation set and determining the target similarity comprises:

calculating the similarity between each vocabulary label and the word segmentation set to obtain a similarity for each vocabulary label;

comparing all the similarities, and taking the maximum similarity as the target similarity.

5. The method of any one of claims 1-4, wherein prior to the steps of obtaining a document to be processed and extracting entity words to be disambiguated and a set of word segments in the document to be processed using word segmentation techniques, the method further comprises:

and acquiring an engineering information fragment, and performing interference item removing processing on the engineering information fragment to obtain a document to be processed.

6. A tag-assisted disambiguation apparatus, comprising:

the word segmentation extraction module is used for acquiring a document to be processed and extracting entity words to be disambiguated and word segmentation sets in the document to be processed by utilizing a word segmentation technology;

the vocabulary label determining module is used for determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library;

the similarity calculation module is used for calculating the similarity between each of the plurality of vocabulary labels and the word segmentation set, and determining the target similarity;

and the entity word disambiguation module is used for taking the vocabulary labels corresponding to the target similarity as the disambiguation result of the entity words to be disambiguated.

7. The apparatus of claim 6, wherein the segmentation extraction module is specifically configured to:

determining a target word segmentation part of speech from a plurality of word segmentation parts of speech, taking the word segmentation vocabulary corresponding to the target word segmentation part of speech as the entity word to be disambiguated, and taking the rest word segmentation vocabulary as a word segmentation set.

8. The apparatus of claim 6, wherein the preset entity word label library includes a preset vocabulary association library and a preset vocabulary classification library, the preset vocabulary association library includes a plurality of first words and a plurality of associated words corresponding to each of the first words, the preset vocabulary classification library includes a plurality of second words and at least one industry category corresponding to each of the second words, and the vocabulary label determining module is specifically configured to:

9. The apparatus of claim 6, wherein the similarity calculation module is specifically configured to:

10. The apparatus according to any of claims 6-9, wherein the apparatus further comprises an interference term processing module,

the interference item processing module is used for acquiring the engineering information fragment and performing interference item removing processing on the engineering information fragment to obtain the document to be processed.