CN114997169A - Entity word recognition method and device, electronic equipment and readable storage medium


Info

Publication number: CN114997169A (application CN202210602459.3A; granted publication CN114997169B)
Authority: CN (China)
Prior art keywords: text, vector, segment, word, compressed
Legal status: Granted; currently Active
Other languages: Chinese (zh)
Inventors: 单海军, 覃祥坤, 王枭
Current and original assignee: Zhongdian Jinxin Software Co Ltd
Events: application filed by Zhongdian Jinxin Software Co Ltd, with priority to CN202210602459.3A; publication of CN114997169A; grant and publication of CN114997169B

Classifications

    • G06F 40/295: Named entity recognition (under G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING; G06F 40/00 Handling natural language data; G06F 40/20 Natural language analysis; G06F 40/279 Recognition of textual entities; G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F 40/242: Dictionaries (under G06F 40/237 Lexical tools)
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking (under G06F 40/279 Recognition of textual entities)
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management (under Y02 Technologies or applications for mitigation or adaptation against climate change; Y02D Climate change mitigation technologies in information and communication technologies)


Abstract

The application provides an entity word recognition method and apparatus, an electronic device, and a readable storage medium. Target word segments in a risk event text are replaced to obtain a reference recognition text for reference; a target encoding vector is determined for each text word segment in the risk event text, and a reference encoding vector for each reference word segment in the reference recognition text; the target encoding vectors of the text word segments are concatenated to obtain a segment encoding vector for each recognition text segment in the risk event text, and the reference encoding vectors of the reference word segments are concatenated to obtain a reference segment vector for each reference text segment in the reference recognition text; compression yields a compressed representation vector for each recognition text segment and a compressed reference vector for the corresponding reference text segment; the compressed representation vector is adjusted using a noise loss and a generalization loss, and the entity word category of each recognition text segment is then determined accurately from the adjusted compressed representation vector, improving the accuracy of the entity word recognition result.

Description

Entity word recognition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of entity recognition technologies, and in particular, to a method and an apparatus for recognizing entity words, an electronic device, and a readable storage medium.
Background
Named Entity Recognition (NER) aims to identify entity words with specific semantics from unstructured text and to classify the identified entity words. Benefiting from the development of deep learning, neural-network-based NER models have been widely applied to entity recognition and achieve good recognition results.
At present, traditional NER models are trained by means of feature engineering and machine learning, and recognize entity words essentially by memorizing entity names. Because the scale of the training data set is inevitably limited, a traditional NER model is constrained by the samples in that data set: its recognition of entity words that never appeared in the training set (i.e., out-of-vocabulary entity words) is poor, and recognition errors easily occur.
Disclosure of Invention
In view of the above, an object of the present application is to provide an entity word recognition method and apparatus, an electronic device, and a readable storage medium, which, when recognizing a risk event text, encode the recognition text segments of that text with reference to the existing reference entity words in an entity word category table, so that an accurate recognition result can be obtained even when the recognition text segments contain "unknown words" that were never involved in training.
An embodiment of the present application provides an entity word recognition method, which includes the following steps:
replacing each target word segment in an acquired risk event text, one by one, according to a preset entity word category table, to obtain a reference recognition text;
determining, through a pre-trained language model, a target encoding vector for each text word segment in the risk event text and a reference encoding vector for each reference word segment in the reference recognition text;
concatenating the target encoding vectors of the text word segments to obtain a segment encoding vector for each recognition text segment present in the risk event text, and concatenating the reference encoding vectors of the reference word segments to obtain a reference segment vector for each reference text segment in the reference recognition text, wherein each recognition text segment includes at least one of the text word segments;
inputting the segment encoding vector of each recognition text segment and the reference segment vector of each reference text segment into an information bottleneck layer, and compressing them to obtain a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference recognition text;
adjusting the compressed representation vector according to the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector, to obtain an adjusted compressed representation vector;
determining, through a pre-trained entity word classifier and based on the adjusted compressed representation vector, the entity category probability that each recognition text segment included in the risk event text belongs to each preset entity category;
and determining, based on the entity category probabilities of each recognition text segment, the entity word category to which each recognition text segment in the risk event text belongs.
In a possible implementation, the replacing each target word segment in the acquired risk event text one by one according to a preset entity word category table to obtain a reference recognition text includes:
determining target word segments from the risk event text according to the attribute information of each text word segment in the risk event text;
for each target word segment, determining a replacement entity word for replacing the target word segment according to the edit distance between the target word segment and each candidate entity word in the preset entity word category table;
and replacing the corresponding target word segments in the risk event text one by one with their replacement entity words, to obtain the reference recognition text.
In a possible implementation, the determining, through a pre-trained language model, a target encoding vector for each text word segment in the risk event text and a reference encoding vector for each reference word segment in the reference recognition text includes:
segmenting the risk event text into a plurality of text word segments and the reference recognition text into a plurality of reference word segments according to a preset word segmentation rule;
mapping each text word segment to a recognition word-segment vector, and each reference word segment to a reference word-segment vector;
for each text word segment, fusing the context information of the text word segment into its recognition word-segment vector through the pre-trained language model, to obtain the target encoding vector of the text word segment;
and for each reference word segment, fusing the context information of the reference word segment into its reference word-segment vector through the pre-trained language model, to obtain the reference encoding vector of the reference word segment.
In a possible implementation, the concatenating to obtain a segment encoding vector for each recognition text segment present in the risk event text based on the target encoding vector of each text word segment includes:
for each recognition text segment present in the risk event text, determining the at least one text word segment constituting the recognition text segment;
and concatenating the target encoding vectors of the text word segments sequentially, in the order of the text word segments within the recognition text segment, to obtain the segment encoding vector of the recognition text segment.
In a possible implementation, the adjusting the compressed representation vector according to the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector to obtain an adjusted compressed representation vector includes:
for each recognition text segment covered by the compressed representation vector, adjusting the noise information in the segment compressed vector of the recognition text segment by calculating the noise loss between that segment compressed vector and the segment reference vector of the corresponding reference text segment in the compressed reference vector, to obtain a noise compressed vector;
and adjusting, with an InfoNCE loss function, the generalization information in the noise compressed vector by calculating the generalization loss between the segment noise vector of each recognition text segment in the noise compressed vector and the segment reference vector of the corresponding reference text segment in the compressed reference vector, to obtain the adjusted compressed representation vector.
In a possible implementation, the determining, based on the entity category probabilities of each recognition text segment, the entity word determination result and the category of each recognition text segment in the risk event text includes:
for each recognition text segment, determining the maximum entity category probability of the recognition text segment;
and determining the entity word determination result and the category of the recognition text segment according to the maximum entity category probability.
An embodiment of the present application further provides an entity word recognition apparatus, which includes:
a word-segment replacement module, configured to replace each target word segment in an acquired risk event text one by one according to a preset entity word category table, to obtain a reference recognition text;
a vector encoding module, configured to determine, through a pre-trained language model, a target encoding vector for each text word segment in the risk event text and a reference encoding vector for each reference word segment in the reference recognition text;
a vector concatenation module, configured to concatenate the target encoding vectors of the text word segments to obtain a segment encoding vector for each recognition text segment present in the risk event text, and to concatenate the reference encoding vectors of the reference word segments to obtain a reference segment vector for each reference text segment in the reference recognition text, wherein each recognition text segment includes at least one of the text word segments;
a vector compression module, configured to input the segment encoding vector of each recognition text segment and the reference segment vector of each reference text segment into an information bottleneck layer, and compress them to obtain a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference recognition text;
a vector adjustment module, configured to adjust the compressed representation vector according to the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector, to obtain an adjusted compressed representation vector;
a probability determination module, configured to determine, through a pre-trained entity word classifier and based on the adjusted compressed representation vector, the entity category probability that each recognition text segment included in the risk event text belongs to each preset entity category;
and a result determination module, configured to determine, based on the entity category probabilities of each recognition text segment, the entity word category to which each recognition text segment in the risk event text belongs.
In a possible implementation, when the word-segment replacement module is configured to replace each target word segment in the acquired risk event text one by one according to a preset entity word category table to obtain a reference recognition text, the word-segment replacement module is configured to:
determine target word segments from the risk event text according to the attribute information of each text word segment in the risk event text;
for each target word segment, determine a replacement entity word for replacing the target word segment according to the edit distance between the target word segment and each candidate entity word in the preset entity word category table;
and replace the corresponding target word segments in the risk event text one by one with their replacement entity words, to obtain the reference recognition text.
In a possible implementation, when the vector encoding module is configured to determine, through a pre-trained language model, a target encoding vector for each text word segment in the risk event text and a reference encoding vector for each reference word segment in the reference recognition text, the vector encoding module is configured to:
segment the risk event text into a plurality of text word segments and the reference recognition text into a plurality of reference word segments according to a preset word segmentation rule;
map each text word segment to a recognition word-segment vector, and each reference word segment to a reference word-segment vector;
for each text word segment, fuse the context information of the text word segment into its recognition word-segment vector through the pre-trained language model, to obtain the target encoding vector of the text word segment;
and for each reference word segment, fuse the context information of the reference word segment into its reference word-segment vector through the pre-trained language model, to obtain the reference encoding vector of the reference word segment.
In a possible implementation, when the vector concatenation module is configured to concatenate segment encoding vectors for each recognition text segment present in the risk event text based on the target encoding vector of each text word segment, the vector concatenation module is configured to:
for each recognition text segment present in the risk event text, determine the at least one text word segment constituting the recognition text segment;
and concatenate the target encoding vectors of the text word segments sequentially, in the order of the text word segments within the recognition text segment, to obtain the segment encoding vector of the recognition text segment.
In a possible implementation, when the vector adjustment module is configured to adjust the compressed representation vector by calculating the noise loss and generalization loss between the compressed representation vector and the compressed reference vector to obtain an adjusted compressed representation vector, the vector adjustment module is configured to:
for each recognition text segment covered by the compressed representation vector, adjust the noise information in the segment compressed vector of the recognition text segment by calculating the noise loss between that segment compressed vector and the segment reference vector of the corresponding reference text segment in the compressed reference vector, to obtain a noise compressed vector;
and adjust, with an InfoNCE loss function, the generalization information in the noise compressed vector by calculating the generalization loss between the segment noise vector of each recognition text segment in the noise compressed vector and the segment reference vector of the corresponding reference text segment in the compressed reference vector, to obtain the adjusted compressed representation vector.
In a possible implementation, when the result determination module is configured to determine the entity word determination result and the category of each recognition text segment in the risk event text based on the entity category probabilities of each recognition text segment, the result determination module is configured to:
for each recognition text segment, determine the maximum entity category probability of the recognition text segment;
and determine the entity word determination result and the category of the recognition text segment according to the maximum entity category probability.
An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor; when the electronic device operates, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the entity word recognition method described above.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the entity word recognition method described above.
According to the entity word recognition method and apparatus, electronic device, and readable storage medium provided by the embodiments of the present application, each target word segment in an acquired risk event text is replaced one by one according to a preset entity word category table to obtain a reference recognition text; a target encoding vector for each text word segment in the risk event text and a reference encoding vector for each reference word segment in the reference recognition text are determined through a pre-trained language model; the target encoding vectors are concatenated to obtain a segment encoding vector for each recognition text segment present in the risk event text, and the reference encoding vectors are concatenated to obtain a reference segment vector for each reference text segment in the reference recognition text; the segment encoding vector of each recognition text segment and the reference segment vector of each reference text segment are input into an information bottleneck layer and compressed to obtain a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference recognition text; the compressed representation vector is adjusted according to the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector, to obtain an adjusted compressed representation vector; based on the adjusted compressed representation vector, the entity category probability that each recognition text segment in the risk event text belongs to each preset entity category is determined through a pre-trained entity word classifier; and the entity word category of each recognition text segment in the risk event text is determined based on its entity category probabilities. In this way, when the risk event text is recognized, its recognition text segments can be vector-encoded with reference to the existing reference entity words in the entity word category table, so that the information in each recognition text segment that helps the entity word classifier is retained and optimized to the greatest extent; an accurate recognition result can therefore be obtained even for unknown words never involved in training, reducing recognition errors on such words.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; for those skilled in the art, other related drawings can be derived from these drawings without inventive effort.
Fig. 1 is a flowchart of an entity word recognition method according to an embodiment of the present application;
fig. 2 is a schematic diagram of an encoding process of text word segmentation provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of an entity word recognition apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part, not all, of the embodiments of the present application. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Accordingly, the following detailed description of the embodiments, as presented in the figures, is not intended to limit the scope of the claimed application but merely represents selected embodiments of the application. Every other embodiment obtained by a person skilled in the art without creative effort based on the embodiments of the present application falls within the protection scope of the present application.
Research shows that traditional NER models are currently trained by means of feature engineering and machine learning, and recognize entity words essentially by memorizing entity names. Because the scale of the training data set is inevitably limited, a traditional NER model is constrained by the samples in that data set: its recognition of entity words that never appeared in the training set (i.e., out-of-vocabulary entity words) is poor, and recognition errors easily occur.
In view of this, the embodiments of the present application provide an entity word recognition method that can improve the accuracy of entity word recognition results and helps avoid recognition errors when the recognition text segments to be recognized include entity words that never appeared in the training data.
Referring to fig. 1, fig. 1 is a flowchart of an entity word recognition method according to an embodiment of the present application. As shown in fig. 1, the entity word recognition method provided by the embodiment of the present application includes:
s101, replacing each target word segmentation in the acquired risk event text one by one according to a preset entity word category table to obtain a reference recognition text.
S102, determining a target coding vector of each text word in the risk event text and a reference coding vector of each reference word in the reference recognition text through a pre-training language model.
S103, splicing to obtain a segment coding vector of each identification text segment existing in the risk event text based on the target coding vector of each text segment, and splicing to obtain a reference segment vector of each reference text segment in the reference identification text based on the reference coding vector of each reference segment.
And S104, inputting the segment coding vector of each identification text segment and the reference segment vector of each reference text segment into an information bottleneck layer, and compressing to obtain a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference identification text.
And S105, adjusting the compressed expression vector through the noise loss and the generalization loss between the compressed expression vector and the compressed reference vector obtained through calculation to obtain the adjusted compressed expression vector.
And S106, determining the entity class probability of each recognition text segment included in the risk event text belonging to each preset entity class through a pre-trained entity word classifier based on the adjusted compressed expression vector.
S107, determining the entity word class to which each recognition text fragment belongs in the risk event text based on the probability of each entity class of each recognition text fragment.
The embodiment of the application provides an entity word identification method, when target participles in a risk event are identified, the existing entity words in an entity word category table are utilized to replace the target participles in the risk event, and a reference identification text which can be referred to is obtained; respectively determining a target coding vector of a text participle with context information fused in a risk event text and a reference coding vector of a reference participle with the context information fused in a reference recognition text by utilizing a pre-training language model; respectively utilizing the target coding vector of each text participle to splice to obtain a segment coding vector of each identification text segment in the risk event text, and utilizing the reference coding vector of each reference participle to splice to obtain a reference segment vector of each reference text segment in the reference identification text; obtaining a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference identification text through information bottleneck layer compression; adjusting the compressed expression vector through the noise loss and the generalization loss between the compressed expression vector and the compressed reference vector obtained through calculation to obtain an adjusted compressed expression vector; determining entity category probability of each recognition text segment included in the risk event text belonging to each preset entity category through a pre-trained entity word classifier based on the adjusted compressed expression vector; furthermore, the entity word class to which each recognition text fragment belongs in the risk event text can be determined based on the entity class probability of each recognition text fragment belonging to each preset entity class. Thus, when the risk event text is identified, vector coding can be performed on the identification text segment of the risk event text by referring to the existing reference real words in the entity word class table, so that the information which is beneficial to the identification of the entity word classifier in the identification text segment can be optimized to the greatest extent, and therefore, even if the unknown word which is not involved in the training process needs to be identified, an accurate identification result can be obtained, and the error of the identification result when the unknown word is identified is reduced.
Here, the preset entity word category table is composed of predetermined entity words, and all the entity words in the entity word category table are used for training the pre-trained language model and the entity word classifier.
In fact, although a risk event text contains many text word segments, not every text word segment is an entity word, and not every one requires attention. Therefore, when performing word replacement, only the target word segments in the risk event text need to be replaced: this reduces the number of replaced words, retains the necessary context information of the risk event text, and, on the premise that the basic event information of the risk event text is preserved, determines to the greatest possible extent a reference recognition text that can be compared against the risk event text.
In step S101, a replacement entity word for each target word segment in the risk event text is determined from the preset entity word category table; the corresponding target word segments in the risk event text are replaced one by one with these replacement entity words, and the replaced risk event text is taken as the reference recognition text.
Here, in order to provide valuable reference information for each target word segment while preserving semantic correctness, the selected replacement entity word should be the entity word closest to the target word segment. Moreover, not every text word segment needs replacing; to reduce the amount of replacement and the amount of data processed during recognition, the target word segments related to the risk event are first extracted from the risk event text.
In one embodiment, step S101 includes: determining target word segments from the risk event text according to the attribute information of each text word segment in the risk event text; for each target word segment, determining a replacement entity word according to the edit distance between the target word segment and each candidate entity word in the preset entity word category table; and replacing the corresponding target word segments in the risk event text one by one with their replacement entity words, to obtain the reference recognition text.
In this step, attribute information of each text word segment in the risk event text is determined, for example the part of speech and word sense of the text word segment; the target word segments to be recognized are then determined from the text word segments of the risk event text according to this attribute information.
The edit distance between words reflects how similar two words are, and a candidate entity word more similar to the target word segment can provide more reference value for it.
Therefore, in order to use information of higher reference value during entity word recognition, the edit distance between each target word segment and every candidate entity word needs to be determined; the candidate entity word most similar to the target word segment can then be determined from the edit distances and taken as the replacement entity word for that target word segment, i.e., the candidate entity word with the smallest edit distance is regarded as the most similar one.
Here, the edit distance may also be taken as the Euclidean distance between the target word segment and the candidate entity word, calculated from the encoding vector of the target word segment and the encoding vector of the candidate entity word.
The determined replacement entity words are then used to replace the corresponding target word segments in the risk event text, and the reference recognition text is obtained once all target word segments have been replaced.
As an example, suppose the acquired risk event text is "potential safety hazards are found in the Shanghai part during inspection", and "place name" is determined, according to the part of speech of each word, as the type to be recognized; "Shanghai" is then the target word segment to be replaced in the risk event text. By calculating the edit distance between "Shanghai" and each candidate entity word, "Beijing" is determined from the entity word category table as the candidate entity word most similar to "Shanghai", and "Shanghai" is replaced with "Beijing", yielding the reference recognition text "potential safety hazards are found in the Beijing part during inspection".
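For illustration only, the following minimal Python sketch (not the patent's implementation; the category table, the word list and the choice of Levenshtein distance are hypothetical assumptions) selects the candidate entity word with the smallest edit distance as the replacement entity word:

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance via dynamic programming over one rolling row."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                         dp[j - 1] + 1,      # insertion
                                         prev + (ca != cb))  # substitution
        return dp[-1]

    def replace_targets(text: str, targets: list, category_table: dict) -> str:
        """Replace each target word segment with its closest candidate entity word."""
        for target in targets:
            candidates = [w for words in category_table.values() for w in words]
            best = min(candidates, key=lambda w: edit_distance(target, w))
            text = text.replace(target, best, 1)  # replace one occurrence
        return text

    # Hypothetical table mirroring the patent's "Shanghai" -> "Beijing" illustration:
    table = {"place": ["北京", "天津"], "person": ["张三"]}
    print(replace_targets("检查中发现上海部分存在安全隐患", ["上海"], table))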
To determine the entity word determination result and the category of each target word segment more accurately in the subsequent recognition process, context information needs to be fused into each text word segment of the risk event text, so that each text word segment can represent richer semantic information.
In step S102, context information can be fused into each text word segment of the risk event text through a pre-trained language model, to obtain the target encoding vector of each text word segment and the reference encoding vector of each reference word segment.
Because the pre-trained language model works on the word vector of each text word segment when fusing in the context information of the risk event text to which the segment belongs, each text word segment first needs to be converted into word-vector form.
In one implementation, please refer to fig. 2, which is a schematic diagram of the encoding process of text word segments according to an embodiment of the present application. As shown in fig. 2, step S102 includes: S1021, segmenting the risk event text into a plurality of text word segments and the reference recognition text into a plurality of reference word segments according to a preset word segmentation rule.
In this step, the risk event text is segmented into a plurality of text word segments according to the word segmentation rule of the pre-trained language model, and the reference recognition text is segmented into a plurality of reference word segments according to the same rule.
Here, the word segmentation rule may be character-by-character splitting or word-by-word splitting.
Continuing the above example with the risk event text "potential safety hazards are found in the Shanghai part during inspection": under character-by-character splitting, the text is split into individual characters, so that, for instance, "Shang" and "hai" become separate segments; under word-by-word splitting, the text is split into "inspection", "during", "found", "Shanghai", "part", "presence", "safety", "hazard".
Step S1022, mapping each text word segment to a recognition word-segment vector, and each reference word segment to a reference word-segment vector.
In this step, each split text word segment can be mapped into a recognition word-segment vector, and each split reference word segment into a reference word-segment vector, through a preset lookup word table.
Step S1023, for each text word segment, fusing the context information of the text word segment into its recognition word-segment vector through the pre-trained language model, to obtain the target encoding vector of the text word segment.
In this step, the recognition word-segment vectors of all text word segments of the risk event text are input into the pre-trained language model; each text word segment is encoded by the model, context information is fused in during encoding, and a target encoding vector that represents the context information of the text word segment is obtained.
Here, the pre-trained language model includes a BERT model.
Step S1024, for each reference word segment, fusing the context information of the reference word segment into its reference word-segment vector through the pre-trained language model, to obtain the reference encoding vector of the reference word segment.
In this step, the reference word-segment vectors of all reference word segments of the reference recognition text are input into the pre-trained language model; each reference word segment is encoded by the model, context information is fused in during encoding, and a reference encoding vector that represents the context information of the reference word segment is obtained.
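A minimal sketch of this encoding step, assuming a HuggingFace BERT encoder (the model name "bert-base-chinese" and the library choice are assumptions; the patent states only that the pre-trained language model includes a BERT model):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    encoder = AutoModel.from_pretrained("bert-base-chinese")

    def encode(text: str) -> torch.Tensor:
        """Map each word segment to an encoding vector with context information fused in."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = encoder(**inputs)
        # One contextual encoding vector per token (character-level for Chinese BERT).
        return outputs.last_hidden_state.squeeze(0)

    h = encode("检查中发现上海部分存在安全隐患")  # risk event text
    d = encode("检查中发现北京部分存在安全隐患")  # reference recognition text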
To further enhance the recognition of entity words and facilitate the subsequent information optimization, the target encoding vectors and reference encoding vectors are not used directly; instead, the corresponding text segments are obtained by concatenating these encoding vectors, so that the entity words in the risk event text can be recognized more accurately.
In step S103, for the risk event text and the reference recognition text respectively, the segment encoding vector of each recognition text segment present in the risk event text and the reference segment vector of each reference text segment in the reference recognition text are obtained by concatenating the target encoding vectors of the text word segments and the reference encoding vectors of the reference word segments.
Here, each recognition text segment is composed of different text word segments, so each recognition text segment includes at least one text word segment; that is, the recognition text segments are obtained by combining text word segments in different ways.
In one embodiment, the concatenating to obtain a segment encoding vector for each recognition text segment present in the risk event text based on the target encoding vector of each text word segment includes: for each recognition text segment present in the risk event text, determining the at least one text word segment constituting the recognition text segment; and concatenating the target encoding vectors of those text word segments in the order in which they appear in the recognition text segment, to obtain the segment encoding vector of the recognition text segment.
In this step, the segment encoding vector of each recognition text segment present in the risk event text is obtained by sequentially concatenating the target encoding vectors of the text word segments in the recognition text segment.
Specifically, the segment encoding vector of each recognition text segment can be obtained by concatenation according to the following formula:

t_i^h = [h_{b_i} : h_{e_i}]

where t_i^h is the segment encoding vector of the i-th recognition text segment in the risk event text, h_{b_i} is the target encoding vector of the first text word segment in the i-th recognition text segment, and h_{e_i} is the target encoding vector of the last text word segment in the i-th recognition text segment.
Accordingly, the reference segment vector of each reference text segment present in the reference recognition text is obtained by sequentially concatenating the reference encoding vectors of the reference word segments in the reference text segment, according to the following formula:

t_i^d = [d_{b_i} : d_{e_i}]

where t_i^d is the reference segment vector of the i-th reference text segment in the reference recognition text, d_{b_i} is the reference encoding vector of the first reference word segment in the i-th reference text segment, and d_{e_i} is the reference encoding vector of the last reference word segment in the i-th reference text segment.
Continuing the above example, the recognition text segments present in the risk event text "potential safety hazards are found in the Shanghai part during inspection" include "inspection", "during inspection", "found during inspection", and so on; the segment encoding vector of the recognition text segment "found during inspection" is obtained by sequentially concatenating the target encoding vectors of the three text word segments "inspection", "during" and "found".
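Following the formulas above, a sketch of enumerating recognition text segments and building their segment encoding vectors by concatenating the encoding vectors of the first and last word segments (the maximum span length is a hypothetical parameter):

    import torch

    def segment_vectors(h: torch.Tensor, max_len: int = 4):
        """Enumerate spans (b, e) over the token encodings h and concatenate
        the first and last encoding vectors: t_i = [h_b : h_e]."""
        n = h.size(0)
        spans, vecs = [], []
        for b in range(n):
            for e in range(b, min(n, b + max_len)):
                spans.append((b, e))
                vecs.append(torch.cat([h[b], h[e]], dim=-1))
        return spans, torch.stack(vecs)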
In step S104, the segment encoding vectors of all recognition text segments in the risk event text and the reference segment vectors of all reference text segments in the reference recognition text are compressed through an information bottleneck layer, so as to squeeze out the large amount of information in the segment encoding vectors and reference segment vectors that does not need attention, yielding the compressed representation vector of the risk event text and the compressed reference vector corresponding to the reference recognition text.
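The patent does not specify the architecture of the information bottleneck layer; a common realization, sketched here purely as an assumption, is a learned low-dimensional projection (all dimensions are hypothetical, with 1536 matching two concatenated 768-dimensional BERT vectors):

    import torch
    import torch.nn as nn

    class BottleneckLayer(nn.Module):
        """Compress segment vectors into a lower-dimensional representation,
        discarding information that does not need attention."""
        def __init__(self, in_dim: int = 1536, z_dim: int = 128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, z_dim))
        def forward(self, t: torch.Tensor) -> torch.Tensor:
            return self.net(t)

    bottleneck = BottleneckLayer()
    # z_a: compressed representation vectors; z_b: compressed reference vectors
    # z_a, z_b = bottleneck(segment_vecs), bottleneck(reference_vecs)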
In step S105, the resulting compressed representation vector is further optimized by means of the noise loss and the generalization loss. Specifically, the noise information in the compressed representation vector can be optimized via the noise loss, and the useful information that helps the subsequent entity word classifier can be optimized via the generalization loss; that is, the compressed representation vector is adjusted using the noise loss and the generalization loss, giving the adjusted compressed representation vector.
In one embodiment, step S105 includes: for each recognition text segment covered by the compressed representation vector, adjusting the noise information in the segment compressed vector of the recognition text segment by calculating the noise loss between that segment compressed vector and the segment reference vector of the corresponding reference text segment in the compressed reference vector, to obtain a noise compressed vector; and adjusting, with an InfoNCE loss function, the generalization information in the noise compressed vector by calculating the generalization loss between the segment noise vector of each recognition text segment in the noise compressed vector and the segment reference vector of the corresponding reference text segment in the compressed reference vector, to obtain the adjusted compressed representation vector.
In this step, the risk event text includes a plurality of recognition text segments. For each recognition text segment, the noise information in its segment compressed vector can be adjusted continuously by calculating the noise loss between the segment compressed vector of the recognition text segment (within the compressed representation vector corresponding to the risk event text) and the segment reference vector of the corresponding reference text segment in the reference recognition text, until the calculated noise loss value falls below a loss threshold, giving the adjusted segment compressed vector of the recognition text segment; the adjusted segment compressed vectors of the recognition text segments are then concatenated into the noise compressed vector of the risk event text.
The noise loss between a recognition text segment and its reference text segment is determined from the similarity between them, which can be measured by calculating the JS divergence or the KL divergence between the two.
Specifically, taking JS divergence as an example, the similarity between the recognition text segment and the reference text segment can be calculated by the following formula:

D_i = JS(P_{z_{ai}} || P_{z_{bi}}) = (1/2) KL(P_{z_{ai}} || M) + (1/2) KL(P_{z_{bi}} || M), with M = (P_{z_{ai}} + P_{z_{bi}}) / 2

where D_i is the JS divergence (similarity) between the i-th recognition text segment and its corresponding reference text segment, P_{z_{ai}} is the distribution of the compressed representation vector of the i-th recognition text segment, and P_{z_{bi}} is the distribution of the compressed reference vector of the reference text segment corresponding to the i-th recognition text segment.
The noise loss between the recognition text segment and the reference text segment is calculated as a two-layer weighted average over D_i:

[formula available only as an image in the source: L_{s_i} is obtained by a two-layer weighted average calculation over D_i]

where L_{s_i} is the noise loss between the i-th recognition text segment and the reference text segment, and D_i is the JS divergence (similarity) between the i-th recognition text segment and its corresponding reference text segment.
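Since the noise-loss formula survives only as an image, the following hedged sketch implements what the surrounding text describes: a JS divergence between the compressed vectors of corresponding segments. Treating each vector as a distribution via softmax, and replacing the two-layer weighted average with a plain mean, are both assumptions:

    import torch
    import torch.nn.functional as F

    def js_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        """D = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
        m = 0.5 * (p + q)
        kl = lambda a, b: (a * (a / b).log()).sum(-1)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def noise_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
        # Treat each compressed vector as a distribution via softmax (assumption),
        # then average D_i over all segments (simplification of the weighted average).
        p, q = F.softmax(z_a, dim=-1), F.softmax(z_b, dim=-1)
        return js_divergence(p, q).mean()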
After the optimization of the noise information is completed, the useful information in the noise compressed vector that helps the subsequent entity word classifier can be further optimized, i.e., the useful information that helps improve the accuracy of the entity recognition result is amplified. Specifically, the generalization information in the noise compressed vector can be adjusted, using an InfoNCE loss function, by calculating the generalization loss between the segment noise vector of each recognition text segment in the noise compressed vector and the segment reference vector of the corresponding reference text segment in the compressed reference vector, to obtain the adjusted compressed representation vector; the generalization information shared between each segment noise vector and the corresponding segment reference vector is thereby maximized, improving the accuracy of subsequent recognition results.
Specifically, the generalization loss between the recognition text segment and the reference text segment is calculated by the following formula:

[formula available only as an image in the source: an InfoNCE objective over the matching scores g_w(z_{ai}', z_{bi})]

where L_{g_i} is the generalization loss between the i-th recognition text segment and the reference text segment, z_{ai}' is the segment noise vector of the i-th recognition text segment, z_{bi} is the segment reference vector of the reference text segment corresponding to the i-th recognition text segment, g_w denotes the matching-score function, which is implemented with a neural network, and E_p and E_{p'} denote expected values.
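A minimal InfoNCE-style sketch of the generalization loss (the bilinear form of the matching score g_w and the use of in-batch negatives are assumptions; the patent states only that g_w is a neural-network matching score):

    import torch
    import torch.nn as nn

    class InfoNCELoss(nn.Module):
        def __init__(self, z_dim: int = 128):
            super().__init__()
            # Bilinear matching score: g_w(x, y) = x W y^T (assumption).
            self.W = nn.Parameter(torch.randn(z_dim, z_dim) * 0.01)

        def forward(self, z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
            scores = z_a @ self.W @ z_b.t()       # (n, n) matching scores
            labels = torch.arange(z_a.size(0))    # positives on the diagonal
            # Cross-entropy over each row is the standard InfoNCE objective:
            # maximize the matched pair's score against in-batch negatives.
            return nn.functional.cross_entropy(scores, labels)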
In step S106, the adjusted compressed representation vector can be input into the pre-trained entity word classifier, which classifies all recognition text segments in the risk event text and determines the entity category probability that each recognition text segment of the risk event text belongs to each preset entity category.
Specifically, the entity category probability of each recognition text segment belonging to each preset entity category can be calculated as a softmax over the classifier scores, with the corresponding classification term

L_{base_i} = -log( exp(score(z_i, y_i)) / Σ_{y' ∈ Y} exp(score(z_i, y')) )

where L_{base_i} is the classification term (negative log probability) of the i-th recognition text segment for its entity category y_i, score(z_i, y_i) is the classifier score that the i-th recognition text segment, with adjusted compressed representation z_i, belongs to the preset entity category y_i, Y is the category set formed by the preset entity categories, and y' is any preset entity category in the category set Y.
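A sketch of the classification step implied by the formula above, assuming a linear scoring layer over the adjusted compressed representation (the layer shape and category names are hypothetical):

    import torch
    import torch.nn as nn

    categories = ["non-entity", "place name entity", "person name entity"]
    classifier = nn.Linear(128, len(categories))  # produces score(z, y) per category y

    def category_probs(z: torch.Tensor) -> torch.Tensor:
        """Softmax over classifier scores: the entity category probabilities
        of each recognition text segment (rows of z)."""
        return torch.softmax(classifier(z), dim=-1)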
In step S107, the entity word determination result of each recognition text segment can be determined based on the determined entity category probabilities of each recognition text segment, i.e., whether each recognition text segment is an entity word; at the same time, the category of the text segment is determined from the entity category probabilities.
In one embodiment, step S107 includes: for each recognition text segment, determining the maximum entity category probability of the recognition text segment; and determining the entity word determination result and the category of the recognition text segment according to the maximum entity category probability.
In this step, for each recognition text segment in the risk event text, the maximum entity category probability is determined among the probabilities of the recognition text segment belonging to the preset entity categories. Whether the recognition text segment is an entity word is determined from the preset entity category corresponding to that maximum probability, i.e., the entity word determination result of the recognition text segment follows from the maximum entity category probability; at the same time, the preset entity category corresponding to the maximum probability is taken as the category of the recognition text segment. For example, suppose the preset entity categories include "non-entity", "place name entity" and "person name entity", and the recognition text segment "Shanghai" is determined to have probability 0.1 of being "non-entity", 0.9 of being a "place name entity", and 0 of being a "person name entity"; the maximum entity category probability of "Shanghai" is then 0.9, whose corresponding preset entity category is "place name entity", so it can be determined that "Shanghai" is an entity word of category "place name entity".
Illustratively, taking the text "I am in Shanghai" as an example, the recognition text segments are all the contiguous spans of its word segments, for instance "I", "in", "Shang", "hai", "I in", "in Shang", "Shanghai", "I in Shang", "in Shanghai" and "I in Shanghai", ten segments in all.
Here, obviously only "Shanghai" is an entity word and the other nine recognition text segments are not. During recognition, the entity category probabilities of all ten recognition text segments for each preset entity category are predicted by the entity word classifier. With the preset entity categories "non-entity", "place name entity" and "person name entity", the entity category probabilities calculated by the classifier for "Shanghai" are 0.1, 0.9 and 0 respectively, indicating that "Shanghai" most probably belongs to the "place name entity" category; the recognition text segment "Shanghai" is therefore regarded as an entity word of category "place name entity".
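The decision rule reduces to an argmax over the category probabilities, as in this small sketch using the numbers from the "Shanghai" example:

    import torch

    categories = ["non-entity", "place name entity", "person name entity"]
    probs = torch.tensor([0.1, 0.9, 0.0])   # probabilities for the segment "Shanghai"
    best = int(probs.argmax())              # index of the maximum entity category probability
    is_entity_word = categories[best] != "non-entity"
    print(is_entity_word, categories[best])  # True  place name entity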
After the entity words among the recognition text segments of the risk event text are recognized, if entity words characterizing the enterprise and/or individual involved in the risk event are found, the risk event text, or a corresponding risk solution, can be sent to the relevant enterprise and/or individual according to those entity words; this reminds or warns the relevant enterprise and/or individual and improves the efficiency of early warning for potential risks.
As an example, take the risk event text "potential safety hazards are found in the Shanghai part of enterprise A during inspection". The recognition text segments obtained by concatenating text word segments include "inspection", "Shanghai part", "presence", "potential safety hazards", and so on. After each recognition text segment is recognized, "enterprise A" is found to be an enterprise entity word, "Shanghai part" a department entity word, and "potential safety hazard" a risk entity word; by recognizing the entity word categories of the risk event text, it can thus be determined that the text means that potential safety hazards exist in the Shanghai part of enterprise A. Finally, the enterprise involved in the risk event is determined to be enterprise A and the specific department the Shanghai part, so the risk event text or a corresponding solution can be sent to the Shanghai part of enterprise A, providing risk early warning in advance.
The entity word classifier is obtained by training a constructed neural network with a number of sample text segments and the entity category labels corresponding to those sample text segments. An entity category label contains the entity category probability of the sample text segment for a preset entity category, and can indicate whether the sample text segment is an entity word and which preset entity category it belongs to. Specifically, the segment noise vectors corresponding to the sample text segments are used as input features and the entity category labels as output features, and the constructed neural network is trained to obtain the trained entity word classifier.
In the training process, the hyper-parameters of the entity word classifier can be adjusted according to a loss value, which can be calculated by the following formula:

L = L_base,n + γ · L_g,n + β · L_s,n

where L_base,n is the base loss of the nth sample text segment, computed from its entity category probabilities of belonging to each preset entity category; L_g,n is the generalization loss of the nth sample text segment; L_s,n is the noise loss of the nth sample text segment; and γ and β are hyper-parameters of the entity word classifier.

L_base,n, L_g,n and L_s,n are calculated in the same way as L_base,i, L_g,i and L_s,i described above, which is not repeated here.
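As a rough illustration of this loss (a sketch only: it assumes PyTorch, and it assumes that L_base,n is a cross-entropy term over the predicted entity category probabilities, which the formula above does not spell out), the combination could be computed as:

import torch
import torch.nn.functional as F

def total_loss(logits: torch.Tensor, label: torch.Tensor,
               l_g: torch.Tensor, l_s: torch.Tensor,
               gamma: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    # L = L_base,n + γ·L_g,n + β·L_s,n for one sample text segment; the
    # generalization loss l_g and noise loss l_s are assumed to be computed
    # elsewhere, and γ, β are the hyper-parameters to be tuned.
    l_base = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
    return l_base + gamma * l_g + beta * l_s

loss = total_loss(torch.randn(3), torch.tensor(1),
                  l_g=torch.tensor(0.2), l_s=torch.tensor(0.1))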
In conclusion, this scheme is suitable for knowledge-graph-based enterprise risk early warning: when a risk event occurs, the related entity vocabularies representing the event information are extracted from the risk event text of the currently occurring risk event, and enterprise risk early warning is performed for the enterprise involved, in combination with the enterprise knowledge graph. Generally, the recognition of entity vocabularies relies on a trained machine learning model, such as an NER model. For a trained model, however, the recognition result is limited by the training data set: "unknown vocabularies" that do not appear in the training data set cannot be recognized accurately, so the accuracy of the recognition result is low, the related enterprise may not be located accurately, and early warning for that enterprise may not be issued in time. To address this problem, when the risk event text is recognized, the recognition text segments of the risk event text can be vector-coded with reference to the existing reference real words in the entity word category table, so that the information in the recognition text segments that benefits recognition by the entity word classifier is preserved to the greatest extent. Even when unknown vocabularies not involved in the training process need to be recognized, an accurate recognition result can still be obtained, which indirectly improves the early warning efficiency.
According to the entity word identification method provided by the embodiment of the application, each target participle in the acquired risk event text is replaced one by one according to a preset entity word category table to obtain a reference recognition text; a target coding vector of each text participle in the risk event text and a reference coding vector of each reference participle in the reference recognition text are determined through a pre-trained language model; a segment coding vector of each recognition text segment in the risk event text is obtained by splicing the target coding vectors of its text participles, and a reference segment vector of each reference text segment in the reference recognition text is obtained by splicing the reference coding vectors of its reference participles; the segment coding vector of each recognition text segment and the reference segment vector of each reference text segment are input into an information bottleneck layer and compressed to obtain a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference recognition text; the compressed representation vector is adjusted through the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector to obtain an adjusted compressed representation vector; based on the adjusted compressed representation vector, the entity category probability of each recognition text segment in the risk event text belonging to each preset entity category is determined through a pre-trained entity word classifier; and the entity word category to which each recognition text segment belongs is determined based on the entity category probabilities of that segment. In this way, the recognition text segments of the risk event text can be vector-coded with reference to the existing reference real words in the entity word category table, so that the information in the recognition text segments that benefits recognition by the entity word classifier is preserved to the greatest extent. Even when unknown vocabularies not involved in the training process need to be recognized, an accurate recognition result can be obtained, reducing the error of the recognition result when unknown vocabularies are recognized.
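To make the data flow of these steps easier to follow, the orchestration can be sketched as below; every stage is a deliberately trivial placeholder (identity, slicing or random weights) standing in for the component the embodiment actually describes, so only the order and shapes of the steps are meaningful:

import torch

def replace_target_words(text: str) -> str:               # step 1: table-based replacement
    return text                                           # placeholder

def encode(text: str) -> torch.Tensor:                    # step 2: pre-trained LM coding vectors
    return torch.randn(len(text), 8)                      # placeholder: one vector per participle

def splice_segments(vecs: torch.Tensor) -> torch.Tensor:  # step 3: segment splicing
    return torch.cat([vecs[:-1], vecs[1:]], dim=-1)       # placeholder: adjacent-pair segments

def bottleneck(vecs: torch.Tensor) -> torch.Tensor:       # step 4: information bottleneck layer
    return vecs[:, :4]                                    # placeholder compression

def recognize(risk_text: str) -> torch.Tensor:
    ref_text = replace_target_words(risk_text)
    z = bottleneck(splice_segments(encode(risk_text)))    # compressed representation vector
    z_ref = bottleneck(splice_segments(encode(ref_text))) # compressed reference vector
    z = z - 0.1 * (z - z_ref)                             # step 5 placeholder: adjustment
    logits = z @ torch.randn(4, 3)                        # step 6 placeholder classifier
    return logits.softmax(-1).argmax(-1)                  # step 7: category per segment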
Referring to fig. 3, fig. 3 is a schematic structural diagram of an entity word recognition device according to an embodiment of the present application. As shown in fig. 3, the entity word recognition apparatus 300 includes:
the word segmentation replacement module 310 is configured to replace each target word segmentation in the acquired risk event text one by one according to a preset entity word category table to obtain a reference identification text;
a vector encoding module 320, configured to determine, through a pre-training language model, a target encoding vector of each text participle in the risk event text and a reference encoding vector of each reference participle in the reference recognition text;
the vector splicing module 330 is configured to splice to obtain a segment coding vector of each identification text segment existing in the risk event text based on a target coding vector of each text segment, and splice to obtain a reference segment vector of each reference text segment in the reference identification text based on a reference coding vector of each reference segment; wherein each recognized text segment includes at least one of the text participles;
the vector compression module 340 is configured to input the segment coding vector of each identification text segment and the reference segment vector of each reference text segment into an information bottleneck layer, and compress the segment coding vector of each identification text segment and the reference segment vector of each reference text segment to obtain a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference identification text;
a vector adjusting module 350, configured to adjust the compressed representation vector according to the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector, so as to obtain an adjusted compressed representation vector;
a probability determination module 360, configured to determine, based on the adjusted compressed representation vector, an entity category probability that each recognition text segment included in the risk event text belongs to each preset entity category through a pre-trained entity word classifier;
and the result judging module 370 is configured to determine, based on the probability of each entity category of each recognized text segment, an entity word category to which each recognized text segment belongs in the risk event text.
Further, when the word segmentation replacement module 310 is configured to replace each target word segmentation in the acquired risk event text one by one according to a preset entity word category table to obtain a reference recognition text, the word segmentation replacement module 310 is configured to:
determining a target word segmentation from the risk event text according to the attribute information of each text word segmentation in the risk event text;
aiming at each target participle, determining a replacement real word for replacing the target participle according to the edit distance between the target participle and each candidate real word in a preset entity word category table;
and replacing corresponding target participles in the risk event text one by using the replaced real word of each target participle to obtain a reference recognition text.
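As an illustration of the replacement steps above, the following sketch selects, for one target participle, the candidate real word at the smallest edit distance (the table contents are illustrative; the embodiment does not fix how ties or distance thresholds are handled):

def edit_distance(a: str, b: str) -> int:
    # Classic single-row Levenshtein dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def pick_replacement(target_participle: str, entity_word_table: dict) -> str:
    # entity_word_table maps candidate real words to preset entity categories,
    # e.g. {"上海": "place name entity", "北京": "place name entity"} (illustrative).
    return min(entity_word_table, key=lambda cand: edit_distance(target_participle, cand))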
Further, when the vector encoding module 320 is configured to determine, through a pre-trained language model, a target encoding vector for each text word in the risk event text and a reference encoding vector for each reference word in the reference recognition text, the vector encoding module 320 is configured to:
according to a preset word segmentation rule, segmenting the risk event text and the reference identification text into a plurality of text word segments and a plurality of reference word segments respectively;
mapping each text participle into an identification participle vector and mapping each reference participle into a reference participle vector;
for each text participle, merging the context information of the text participle into the recognition participle vector of the text participle through a pre-training language model to obtain a target coding vector of the text participle;
and aiming at each reference participle, merging the context information of the reference participle into the reference participle vector of the reference participle through a pre-training language model to obtain the reference coding vector of the reference participle.
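As an illustration of the encoding steps above, context-aware coding vectors can be obtained from an off-the-shelf pre-trained language model; the library and checkpoint below are assumptions, since the embodiment only requires some pre-trained language model:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("检查时发现甲企业上海分部存在安全隐患", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# One context-aware vector per token: self-attention has already merged each
# token's context into its vector, which plays the role of the coding vector.
target_coding_vectors = outputs.last_hidden_state[0]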
Further, when the vector stitching module 330 is configured to, based on the target coding vector of each text word segmentation, stitch to obtain a segment coding vector of each identified text segment existing in the risk event text, the vector stitching module 330 is configured to:
for each recognition text segment present in the risk event text, determining at least one text segmentation constituting the recognition text segment;
and sequentially splicing the target coding vector of each text word according to the sequence of each text word in the identification text segment to obtain the segment coding vector of the identification text segment.
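A minimal sketch of this order-preserving splicing (assuming the per-participle coding vectors are rows of one tensor):

import torch

def segment_coding_vector(coding_vectors: torch.Tensor, start: int, end: int) -> torch.Tensor:
    # Concatenate the coding vectors of the participles [start, end) that make
    # up one recognition text segment, in their original order.
    return torch.cat([coding_vectors[i] for i in range(start, end)], dim=-1)

# e.g. a segment spanning the 3rd and 4th participles of the encoded text:
# seg_vec = segment_coding_vector(target_coding_vectors, 2, 4)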
Further, when the vector adjusting module 350 is configured to adjust the compressed representation vector by calculating a noise loss and a generalization loss between the compressed representation vector and the compressed reference vector to obtain an adjusted compressed representation vector, the vector adjusting module 350 is configured to:
aiming at each identification text segment included in the compressed representation vector, noise information included in the segment compressed vector is adjusted by calculating noise loss between the segment compressed vector of the identification text segment and a segment reference vector of a corresponding reference text segment in the compressed reference vectors to obtain a noise compressed vector;
and utilizing an InfoNCE loss function to adjust generalization information included in the noise compressed vector by calculating generalization loss between the segment noise vector of each identification text segment in the noise compressed vector and the segment reference vector of the corresponding reference text segment in the compressed reference vector, so as to obtain an adjusted compressed representation vector.
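The two losses used in this adjustment could be sketched as follows; the mean-squared-error form of the noise loss is an assumption (the embodiment does not give its exact formula here), while the InfoNCE term follows the standard contrastive form with each segment's own reference vector as the positive:

import torch
import torch.nn.functional as F

def noise_loss(z_seg: torch.Tensor, z_ref: torch.Tensor) -> torch.Tensor:
    # Assumed form: pull each segment's compressed vector toward its
    # reference counterpart.
    return F.mse_loss(z_seg, z_ref)

def info_nce(z: torch.Tensor, z_ref: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # Standard InfoNCE: each segment noise vector should match its own
    # reference segment vector (diagonal positives) better than the other
    # segments' reference vectors (negatives).
    z = F.normalize(z, dim=-1)
    z_ref = F.normalize(z_ref, dim=-1)
    logits = z @ z_ref.T / tau              # (num_segments, num_segments)
    targets = torch.arange(z.size(0))
    return F.cross_entropy(logits, targets)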
Further, when the result judging module 370 is configured to determine, based on the entity category probabilities of each recognized text segment, the entity word category to which each recognized text segment belongs in the risk event text, the result judging module 370 is configured to:
for each recognition text segment, determining the maximum entity class probability of the recognition text segment;
and determining the entity word judgment result and the category of the recognized text segment according to the maximum entity category probability.
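The decision rule itself is a simple argmax over the predicted probabilities, as the sketch below shows (category names are illustrative):

import torch

def decide_category(probs: torch.Tensor, categories: list[str]) -> str:
    # Take the preset entity category with the maximum probability; if it is
    # "non-entity", the segment is judged not to be an entity vocabulary.
    return categories[int(probs.argmax())]

decide_category(torch.tensor([0.1, 0.9, 0.0]),
                ["non-entity", "place name entity", "person name entity"])
# -> "place name entity"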
According to the entity word recognition device provided by the embodiment of the application, each target participle in the acquired risk event text is replaced one by one according to a preset entity word category table to obtain a reference recognition text; a target coding vector of each text participle in the risk event text and a reference coding vector of each reference participle in the reference recognition text are determined through a pre-trained language model; a segment coding vector of each recognition text segment in the risk event text is obtained by splicing the target coding vectors of its text participles, and a reference segment vector of each reference text segment in the reference recognition text is obtained by splicing the reference coding vectors of its reference participles; the segment coding vector of each recognition text segment and the reference segment vector of each reference text segment are input into an information bottleneck layer and compressed to obtain a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference recognition text; the compressed representation vector is adjusted through the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector to obtain an adjusted compressed representation vector; based on the adjusted compressed representation vector, the entity category probability of each recognition text segment in the risk event text belonging to each preset entity category is determined through a pre-trained entity word classifier; and the entity word category to which each recognition text segment belongs is determined based on the entity category probabilities of that segment. In this way, the recognition text segments of the risk event text can be vector-coded with reference to the existing reference real words in the entity word category table, so that the information in the recognition text segments that benefits recognition by the entity word classifier is preserved to the greatest extent. Even when unknown vocabularies not involved in the training process need to be recognized, an accurate recognition result can be obtained, reducing the error of the recognition result when unknown vocabularies are recognized.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 4, the electronic device 400 includes a processor 410, a memory 420, and a bus 430.
The memory 420 stores machine-readable instructions executable by the processor 410. When the electronic device 400 runs, the processor 410 communicates with the memory 420 through the bus 430, and when the machine-readable instructions are executed by the processor 410, the steps of the entity word identification method in the method embodiment shown in fig. 1 may be performed.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the entity word identification method in the method embodiment shown in fig. 1 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a logical division, and there may be other divisions in actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings or communication connections shown or discussed may be implemented through some communication interfaces; the indirect couplings or communication connections between devices or units may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solutions of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features, within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present application and shall all be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An entity word recognition method, characterized in that the entity word recognition method comprises:
replacing each target word segmentation in the acquired risk event text one by one according to a preset entity word category table to obtain a reference identification text;
determining a target coding vector of each text word in the risk event text and a reference coding vector of each reference word in the reference recognition text through a pre-training language model;
splicing to obtain a segment coding vector of each identification text segment existing in the risk event text based on the target coding vector of each text segment, and splicing to obtain a reference segment vector of each reference text segment in the reference identification text based on the reference coding vector of each reference segment; wherein each recognition text segment comprises at least one of the text participles;
inputting the segment coding vector of each identification text segment and the reference segment vector of each reference text segment into an information bottleneck layer, and compressing to obtain a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference identification text;
adjusting the compressed representation vector through the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector to obtain an adjusted compressed representation vector;
determining the entity category probability that each recognition text segment included in the risk event text belongs to each preset entity category through a pre-trained entity word classifier based on the adjusted compressed representation vector;
and determining the entity word class to which each recognition text fragment belongs in the risk event text based on the probability of each entity class of each recognition text fragment.
2. The method for recognizing entity words according to claim 1, wherein the step of replacing each target word segmentation in the acquired risk event text one by one according to a preset entity word category table to obtain a reference recognition text comprises:
determining a target word segmentation from the risk event text according to the attribute information of each text word segmentation in the risk event text;
aiming at each target participle, determining a replacement real word for replacing the target participle according to the edit distance between the target participle and each candidate real word in a preset entity word category table;
and replacing corresponding target participles in the risk event text one by using the replaced real word of each target participle to obtain a reference recognition text.
3. The entity word recognition method according to claim 1, wherein the determining a target code vector of each text segment in the risk event text and a reference code vector of each reference segment in the reference recognition text through a pre-trained language model comprises:
according to a preset word segmentation rule, the risk event text and the reference identification text are segmented into a plurality of text word segments and a plurality of reference word segments;
mapping each text participle into an identification participle vector and mapping each reference participle into a reference participle vector;
for each text participle, merging the context information of the text participle into the recognition participle vector of the text participle through a pre-training language model to obtain a target coding vector of the text participle;
and for each reference participle, merging the context information of the reference participle into the reference participle vector of the reference participle through a pre-training language model to obtain the reference coding vector of the reference participle.
4. The entity word recognition method according to claim 1, wherein the splicing the target coding vector based on each text segment to obtain a segment coding vector of each recognition text segment existing in the risk event text comprises:
for each recognition text segment present in the risk event text, determining at least one text segmentation constituting the recognition text segment;
and sequentially splicing the target coding vector of each text word according to the sequence of each text word in the identification text segment to obtain the segment coding vector of the identification text segment.
5. The method for recognizing entity words according to claim 1, wherein said adjusting the compressed representation vector by calculating noise loss and generalization loss between the compressed representation vector and the compressed reference vector to obtain an adjusted compressed representation vector comprises:
for each identification text segment included in the compressed representation vector, calculating noise loss between a segment compressed vector of the identification text segment and a segment reference vector of a corresponding reference text segment in the compressed reference vectors, and adjusting noise information included in the segment compressed vector to obtain a noise compressed vector;
and utilizing an InfoNCE loss function to adjust generalization information included in the noise compressed vector by calculating generalization loss between the segment noise vector of each identification text segment in the noise compressed vector and the segment reference vector of the corresponding reference text segment in the compressed reference vector, so as to obtain an adjusted compressed representation vector.
6. An entity word recognition apparatus, characterized in that the entity word recognition apparatus comprises:
the word segmentation replacement module is used for replacing each target word segmentation in the acquired risk event text one by one according to a preset entity word category table to obtain a reference identification text;
the vector coding module is used for determining a target coding vector of each text word in the risk event text and a reference coding vector of each reference word in the reference recognition text through a pre-training language model;
the vector splicing module is used for splicing to obtain a segment coding vector of each identification text segment existing in the risk event text based on a target coding vector of each text segment, and splicing to obtain a reference segment vector of each reference text segment in the reference identification text based on a reference coding vector of each reference segment; wherein each recognized text segment includes at least one of the text participles;
the vector compression module is used for inputting the segment coding vector of each identification text segment and the reference segment vector of each reference text segment into an information bottleneck layer, and compressing to obtain a compressed representation vector corresponding to the risk event text and a compressed reference vector corresponding to the reference identification text;
a vector adjusting module, configured to adjust the compressed representation vector according to the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector, so as to obtain an adjusted compressed representation vector;
a probability determination module, configured to determine, based on the adjusted compressed representation vector, an entity category probability that each recognition text segment included in the risk event text belongs to each preset entity category through a pre-trained entity word classifier;
and the result judging module is used for determining the entity word class to which each recognition text fragment belongs in the risk event text based on the probability of each entity class of each recognition text fragment.
7. The apparatus for recognizing entity words according to claim 6, wherein the word segmentation replacement module, when being configured to replace each target word segmentation in the acquired risk event text one by one according to a preset entity word category table to obtain a reference recognition text, is configured to:
determining a target word segmentation from the risk event text according to the attribute information of each text word segmentation in the risk event text;
aiming at each target participle, determining a replacement real word for replacing the target participle according to the edit distance between the target participle and each candidate real word in a preset entity word category table;
and replacing corresponding target participles in the risk event text one by using the replaced real word of each target participle to obtain a reference recognition text.
8. The apparatus according to claim 6, wherein the vector adjustment module, when being configured to adjust the compressed representation vector by the calculated noise loss and generalization loss between the compressed representation vector and the compressed reference vector to obtain an adjusted compressed representation vector, is configured to:
aiming at each identification text segment included in the compressed representation vector, noise information included in the segment compressed vector is adjusted by calculating noise loss between the segment compressed vector of the identification text segment and a segment reference vector of a corresponding reference text segment in the compressed reference vectors to obtain a noise compressed vector;
and utilizing an InfoNCE loss function, calculating the generalization loss between the fragment noise vector of each identification text fragment in the noise compressed vector and the fragment reference vector of the corresponding reference text fragment in the compressed reference vector, and adjusting the generalization information included in the noise compressed vector to obtain an adjusted compressed representation vector.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operated, the machine-readable instructions being executed by the processor to perform the steps of the entity word recognition method according to any one of claims 1 to 5.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, performs the steps of the entity word recognition method according to any one of claims 1 to 5.
CN202210602459.3A 2022-05-30 2022-05-30 Entity word recognition method and device, electronic equipment and readable storage medium Active CN114997169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210602459.3A CN114997169B (en) 2022-05-30 2022-05-30 Entity word recognition method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210602459.3A CN114997169B (en) 2022-05-30 2022-05-30 Entity word recognition method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN114997169A true CN114997169A (en) 2022-09-02
CN114997169B CN114997169B (en) 2023-06-13

Family

ID=83030712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210602459.3A Active CN114997169B (en) 2022-05-30 2022-05-30 Entity word recognition method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114997169B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502738A (en) * 2018-05-18 2019-11-26 阿里巴巴集团控股有限公司 Chinese name entity recognition method, device, equipment and inquiry system
US20200184309A1 (en) * 2018-12-06 2020-06-11 MIPS Tech, LLC Neural network processing using specialized data representation
CN110991187A (en) * 2019-12-05 2020-04-10 北京奇艺世纪科技有限公司 Entity linking method, device, electronic equipment and medium
US20210200951A1 (en) * 2019-12-27 2021-07-01 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for outputting information
WO2021143020A1 (en) * 2020-01-14 2021-07-22 平安科技(深圳)有限公司 Bad term recognition method and device, electronic device, and storage medium
CN112818691A (en) * 2021-02-01 2021-05-18 北京金山数字娱乐科技有限公司 Named entity recognition model training method and device
CN112926327A (en) * 2021-03-02 2021-06-08 首都师范大学 Entity identification method, device, equipment and storage medium
CN112989807A (en) * 2021-03-11 2021-06-18 重庆理工大学 Long digital entity extraction method based on continuous digital compression coding
CN114218948A (en) * 2021-12-15 2022-03-22 广州华多网络科技有限公司 Keyword recognition method and device, equipment, medium and product thereof
CN114357186A (en) * 2021-12-23 2022-04-15 华南理工大学 Entity extraction method, device, medium and equipment based on interactive probability coding
CN114358188A (en) * 2022-01-05 2022-04-15 腾讯科技(深圳)有限公司 Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Pengda Qin et al., "Robust Distant Supervision Relation Extraction via Deep Reinforcement Learning", pages 1-11 *
Yong Li et al., "Medical Text Entity Recognition Based on CRF and Joint Entity", pages 155-161 *
Wang Xiao et al., "Several Typical Electronic Countermeasure Actions in Joint Operations" (联合作战中若干种典型的电子对抗行动), pages 65-69 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115510193A (en) * 2022-10-10 2022-12-23 北京百度网讯科技有限公司 Query result vectorization method, query result determination method and related device
CN115510193B (en) * 2022-10-10 2024-04-16 北京百度网讯科技有限公司 Query result vectorization method, query result determination method and related devices
CN116992253A (en) * 2023-07-24 2023-11-03 中电金信软件有限公司 Method for determining value of super-parameter in target prediction model associated with target service
CN117973386A (en) * 2024-02-23 2024-05-03 北京大学深圳研究生院 Real word recognition method for material and chemical industry science and technology information

Also Published As

Publication number Publication date
CN114997169B (en) 2023-06-13

Similar Documents

Publication Publication Date Title
CN110597964B (en) Double-recording quality inspection semantic analysis method and device and double-recording quality inspection system
CN108920467B (en) Method and device for learning word meaning of polysemous word and search result display method
CN109034368B (en) DNN-based complex equipment multiple fault diagnosis method
CN114997169B (en) Entity word recognition method and device, electronic equipment and readable storage medium
CN110377744B (en) Public opinion classification method and device, storage medium and electronic equipment
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN116416480B (en) Visual classification method and device based on multi-template prompt learning
CN110750978A (en) Emotional tendency analysis method and device, electronic equipment and storage medium
CN112151014A (en) Method, device and equipment for evaluating voice recognition result and storage medium
CN111984780A (en) Multi-intention recognition model training method, multi-intention recognition method and related device
CN111177367A (en) Case classification method, classification model training method and related products
CN113987174A (en) Core statement extraction method, system, equipment and storage medium for classification label
CN115186094A (en) Multi-intention classification model training method and device, electronic equipment and storage medium
CN115359799A (en) Speech recognition method, training method, device, electronic equipment and storage medium
CN117115581A (en) Intelligent misoperation early warning method and system based on multi-mode deep learning
CN112685374B (en) Log classification method and device and electronic equipment
CN115017879A (en) Text comparison method, computer device and computer storage medium
CN116150367A (en) Emotion analysis method and system based on aspects
CN114428860A (en) Pre-hospital emergency case text recognition method and device, terminal and storage medium
CN113157918A (en) Commodity name short text classification method and system based on attention mechanism
CN117725211A (en) Text classification method and system based on self-constructed prompt template
CN116186562A (en) Encoder-based long text matching method
CN115357718B (en) Method, system, device and storage medium for discovering repeated materials of theme integration service
CN116361442A (en) Business hall data analysis method and system based on artificial intelligence
CN115965030A (en) Regional data monitoring method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant