CN110688853B - Sequence labeling method and device, computer equipment and storage medium

Info

Publication number
CN110688853B
Authority
CN
China
Prior art keywords: sequence, candidate, word, character, labeling
Prior art date
Legal status
Active
Application number
CN201910740751.XA
Other languages
Chinese (zh)
Other versions
CN110688853A (en)
Inventor
孙超
于凤英
王健宗
韩茂琨
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910740751.XA priority Critical patent/CN110688853B/en
Priority to PCT/CN2019/116969 priority patent/WO2021027125A1/en
Publication of CN110688853A publication Critical patent/CN110688853A/en
Application granted granted Critical
Publication of CN110688853B publication Critical patent/CN110688853B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition


Abstract

The application relates to a sequence labeling method, apparatus, computer device and storage medium based on a neural network. The method comprises the following steps: performing vector conversion on each character in a sequence to be labeled to obtain the corresponding feature word vector; inputting the feature word vectors into a preset sequence labeling neural network to segment the sequence to be labeled and obtain candidate words and the word labels corresponding to the candidate words; combining the word label with the position of each character in the candidate word to obtain the character label of each character in the candidate word; calculating a first pairing index of each candidate word based on the weight vectors of the character labels of its characters; calculating a second pairing index of each candidate labeling sequence based on the first pairing indexes corresponding to each group of candidate words; and identifying the candidate labeling sequence corresponding to the second pairing index with the largest value as the first labeling sequence. The method can improve labeling accuracy.

Description

Sequence labeling method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a sequence labeling method and apparatus, a computer device, and a storage medium.
Background
Sequence labeling is a basic task of natural language processing: the input text is segmented, and a label sequence corresponding to the text is output. Sequence labeling is widely applied in scenarios such as part-of-speech tagging and named entity recognition. Conventionally, sequence labeling is solved with semi-Markov conditional random fields (SCRFs). However, SCRFs extract features from phrases in the natural language rather than from single characters, resulting in low labeling accuracy when the text contains a large number of single-character words.
Disclosure of Invention
In view of the above, it is necessary to provide a sequence labeling method, apparatus, computer device and storage medium capable of improving accuracy.
A method of sequence annotation, the method comprising:
when a sequence labeling request carrying a sequence to be labeled is received, carrying out vector conversion on each character in the sequence to be labeled to obtain a feature word vector corresponding to the character;
inputting the characteristic word vector into a preset sequence labeling neural network to perform word segmentation on the sequence to be labeled to obtain candidate words and word labels corresponding to the candidate words;
combining the word label with the position of each character in the candidate word respectively to obtain the character label of the character in the candidate word;
calculating a first pairing index of the candidate word based on the weight vector of the character label of each character in the candidate word; the weight vector is obtained when the sequence labeling neural network is trained;
calculating a second pairing index of the candidate annotation sequence based on the first pairing index corresponding to each group of candidate words; the candidate tagging sequence is obtained by arranging and combining at least two groups of candidate words;
and identifying the candidate annotation sequence corresponding to the second pairing index with the largest numerical value as a first annotation sequence.
In one embodiment, the performing vector conversion on each character in the sequence to be labeled to obtain a feature word vector corresponding to the character includes:
acquiring word vector representations corresponding to all characters in the sequence to be marked from a preset word vector table;
and converting the word vector representation corresponding to each character by using a neural network to obtain corresponding feature word vectors.
In one embodiment, the calculating a first pairing index of the candidate word based on the weight vector of the character tag of each character in the candidate word includes:
determining characters forming the candidate words, and acquiring characteristic vectors corresponding to the characters forming the candidate words;
obtaining a weight vector of a character label corresponding to the character forming the candidate word;
and calculating to obtain a first pairing index of the candidate word according to the feature vector and the weight vector of each character.
In one embodiment, the calculating of a second pairing index of a candidate annotation sequence based on the first pairing index corresponding to each group of the candidate words comprises:
determining candidate words forming the candidate annotation sequence;
obtaining transfer parameters corresponding to the candidate words;
and calculating to obtain a second pairing index of the candidate annotation sequence based on the first pairing index and the transfer parameter corresponding to the candidate word.
In one embodiment, after the identifying the candidate annotation sequence corresponding to the second pairing index with the largest value as the first annotation sequence, the method further includes:
inputting the feature word vector into a preset conditional random field model to label the sequence to be labeled to obtain a second labeling sequence;
and calculating loss values of the first labeling sequence and the second labeling sequence by using a preset decoding algorithm, and determining a labeling sequence with the minimum loss value from the first labeling sequence and the second labeling sequence as a final labeling result.
In one embodiment, the step of calculating the loss values of the first annotation sequence and the second annotation sequence by using a preset decoding algorithm, and determining the annotation sequence with the minimum loss value from the first annotation sequence and the second annotation sequence as the final annotation result includes:
calculating a first loss value and a second loss value corresponding to the first annotation sequence and the second annotation sequence based on a log-likelihood function;
and determining a final labeling result from the first labeling sequence and the second labeling sequence according to the first loss value and the second loss value.
A sequence annotation apparatus, said apparatus comprising:
the conversion module is used for performing vector conversion on each character in the sequence to be labeled when receiving a sequence labeling request carrying the sequence to be labeled to obtain a feature word vector corresponding to the character;
the word segmentation module is used for inputting the characteristic word vector into a preset sequence tagging neural network so as to segment the sequence to be tagged to obtain candidate words and word labels corresponding to the candidate words;
the combination module is used for combining the word label with the position of each character in the candidate word respectively to obtain the character label of the character in the candidate word;
the calculation module is used for calculating a first pairing index of the candidate word based on the weight vector of the character label of each character in the candidate word; the weight vector is obtained when training the sequence labeling neural network;
the calculation module is further used for calculating a second pairing index of the candidate labeling sequence based on the first pairing index corresponding to each group of candidate words; the candidate labeling sequence is obtained by arranging and combining at least two groups of candidate words;
and the identification module is used for identifying the candidate annotation sequence corresponding to the second pairing index with the largest numerical value as the first annotation sequence.
In one embodiment, the conversion module is further configured to obtain, from a preset word vector table, a word vector representation corresponding to each character in the sequence to be labeled;
and converting the word vector representation corresponding to each character by using a preset neural network to obtain a corresponding characteristic word vector.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the sequence tagging method of any one of the above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements any of the sequence tagging methods described above.
According to the sequence labeling method, apparatus, computer device and storage medium, after a sequence labeling request carrying a sequence to be labeled is received, vector conversion is performed on each character in the sequence to be labeled to obtain the feature word vector corresponding to the character, ensuring that character-level features, i.e. the feature vector of each single character, are obtained. The feature word vectors are input into a preset sequence labeling neural network to segment the sequence to be labeled and obtain candidate words and corresponding word labels. The word label is combined with the position of each character in the candidate word to obtain the character label of each character in the candidate word, ensuring that every character receives a character label. A first pairing index of each candidate word is calculated based on the weight vectors of the character labels of its characters, so that the probability of the candidate word being correct is obtained from its characters. A second pairing index of each candidate labeling sequence, obtained by arranging and combining at least two groups of candidate words, is calculated based on the first pairing indexes corresponding to each group of candidate words, so that the labeling sequence is determined according to the value of the second pairing index. Sequence labeling is thus performed using character-level features, improving labeling accuracy.
Drawings
FIG. 1 is a diagram illustrating an exemplary sequence tagging method;
FIG. 2 is a flowchart illustrating a sequence tagging method according to an embodiment;
FIG. 3 is a flowchart illustrating the steps of calculating the score of a candidate word in one embodiment;
FIG. 4 is a diagram illustrating a structure of a sequence labeling neural network in one embodiment;
FIG. 5 is a flowchart illustrating a sequence tagging method in another embodiment;
FIG. 6 is a block diagram showing the structure of a sequence labeling apparatus according to an embodiment;
FIG. 7 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The sequence labeling method provided by the application can be applied in the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. When the server 104 receives a sequence labeling request carrying a sequence to be labeled from the terminal 102, it performs vector conversion on each character in the sequence to be labeled to obtain the feature word vector corresponding to the character. The server 104 inputs the feature word vectors into a preset sequence labeling neural network to segment the sequence to be labeled, obtaining candidate words and the word labels corresponding to the candidate words. The server 104 combines the word label with the position of each character in the candidate word to obtain the character label to which the character belongs in the candidate word. The server 104 calculates a first pairing index of each candidate word based on the weight vector of the character label of each character in the candidate word; the weight vector is obtained when training the sequence labeling neural network. The server 104 calculates a second pairing index of each candidate labeling sequence based on the first pairing indexes corresponding to each group of candidate words; the candidate labeling sequence is obtained by arranging and combining at least two groups of candidate words. The server 104 identifies the candidate labeling sequence corresponding to the second pairing index with the largest value as the first labeling sequence. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by multiple servers.
In one embodiment, as shown in fig. 2, a sequence annotation method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:
step S202, when a sequence marking request carrying a sequence to be marked is received, vector conversion is carried out on each character in the sequence to be marked to obtain a feature word vector corresponding to the character. .
The sequence to be labeled is the natural language text that needs to be labeled, including but not limited to long and short sentences. The sequence labeling request is an instruction requesting a computing device to perform a sequence labeling task; for example, the terminal instructs the server to perform the sequence labeling task on the sequence to be labeled by sending the sequence labeling request. The sequence labeling task segments natural language and attaches a corresponding label to each segmented word; sequence labeling generally comprises part-of-speech tagging and named entity recognition. Part-of-speech tagging, also known as grammatical tagging or part-of-speech disambiguation, labels each word according to its part of speech. Named entity recognition, also known as proper name recognition, refers to recognizing entities in a sequence that have a particular meaning, such as a person's name, a place name, and the like.
Specifically, a user may input the sequence to be labeled to a terminal through a voice input device or a text input device of the terminal. After receiving the sequence to be labeled, the terminal generates a sequence labeling request and sends it to the corresponding server, the sequence to be labeled being sent to the server along with the request. If the sequence to be labeled is input by the user through the voice input device, the sequence to be labeled is in voice form. When the sequence to be labeled is recognized as voice, speech processing is first performed on the voice-form sequence to obtain the corresponding text-form sequence, and this text-form sequence is used as the sequence to be labeled.
In practical application, if the default language of the sequence to be labeled is Chinese and the input sequence is identified as being in a non-Chinese language, a translation tool is called to translate the sequence into Chinese to obtain the sequence to be labeled, thereby preventing subsequent labeling errors.
Further, vector conversion refers to converting the characters in the sequence to be labeled into corresponding word vectors. A word vector is the representation of a word: characters are mapped to real-valued vectors, which facilitates processing by the subsequent sequence labeling neural network.
Specifically, after the sequence to be labeled is received, the word vector representation corresponding to each character in the sequence is obtained, and vector conversion is performed on these word vector representations to obtain the corresponding feature word vectors.
In one embodiment, performing vector conversion on each character in the sequence to be labeled to obtain a feature word vector corresponding to the character specifically includes: acquiring word vector representations corresponding to all characters in a sequence to be marked from a preset word vector table; and converting the word vector representation corresponding to each character by using a preset neural network to obtain a corresponding characteristic word vector.
The word vector table is the collection of all word vectors. With the word vector table pre-configured, the embedding layer looks up the word vector representation corresponding to each character in the word vector table and maps the character to it. After the word vector representation of each character is determined, a neural network converts it into the corresponding feature word vector. The neural network is a BLSTM (bidirectional long short-term memory) network model. For example, assuming the sequence to be labeled is "Wang Xiaoer and Xiaoming" (王小二和小明), the characters 王, 小, 二, 和, 小 and 明 are mapped through the embedding layer and the preset word vector table to the word vector representations e1, e2, e3, e4, e5 and e6, and the BLSTM network model converts e1, e2, e3, e4, e5 and e6 into the corresponding feature word vectors w1, w2, w3, w4, w5 and w6.
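As a concrete illustration of this lookup-and-convert step, the following is a minimal sketch in PyTorch; the vocabulary, dimensions, and variable names are illustrative assumptions, not values taken from the patent:

```python
# Embedding lookup (word vector table) followed by BLSTM conversion.
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "王": 1, "小": 2, "二": 3, "和": 4, "明": 5}
emb_dim, hidden_dim = 50, 100

embedding = nn.Embedding(len(vocab), emb_dim)    # the preset word vector table
blstm = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True, batch_first=True)

chars = torch.tensor([[1, 2, 3, 4, 2, 5]])       # 王 小 二 和 小 明
e = embedding(chars)                             # word vector representations e1..e6
w, _ = blstm(e)                                  # feature word vectors w1..w6
print(w.shape)                                   # torch.Size([1, 6, 100])
```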
Step S204, inputting the characteristic word vectors into a preset sequence labeling neural network to perform word segmentation on the sequence to be labeled to obtain candidate words and word labels corresponding to the candidate words.
Specifically, a preset sequence labeling neural network is used to determine candidate words and corresponding word labels from the feature word vectors. The preset sequence labeling neural network is an HSCRF network, i.e. a hybrid semi-Markov conditional random field (hybrid SCRF) network. That is, the sequence labeling neural network segments the sequence to be labeled according to the feature word vector corresponding to each character in the sequence, thereby obtaining the corresponding candidate words and word labels.
The candidate words are the set of words that may be formed by adjacent characters in the sequence to be labeled, that is, all words the characters could possibly form. Since words consist of connected characters, all possible words can be derived from the positions of the single characters. For example, assuming the sequence to be labeled is "Wang Xiaoer and Xiaoming" (王小二和小明), possible words include 王, 王小, 王小二, 小二, 和, 小明 and so on, which are not enumerated here. The word label is the label of a word, that is, the category corresponding to the word. For example, labels include PER, B-PER, I-PER, E-PER, O, etc., where PER refers to Person, B-PER marks the character at the beginning of a name, I-PER a character in the middle of a name, E-PER the character at the end of a name, and O any character other than a name. Further, each candidate word may be denoted as s_i = (b_i, e_i, l_i), where s_i represents candidate word i, b_i the position where candidate word i starts, e_i the position where it ends, and l_i its word label. For example, in "Wang Xiaoer and Xiaoming", 王小二 (Wang Xiaoer) may be represented as (1, 3, PER): the candidate word starts at position 1, ends at position 3, and is a person's name.
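A short sketch of the candidate-word enumeration this implies is given below; the maximum segment length is an assumption commonly used with semi-CRF models, not a parameter stated here:

```python
# Enumerate every span of adjacent characters up to max_len as a candidate
# (b_i, e_i) pair, using 1-based positions as in the patent's notation.
def enumerate_candidates(sequence, max_len=4):
    spans = []
    n = len(sequence)
    for b in range(1, n + 1):
        for e in range(b, min(b + max_len, n + 1)):
            spans.append((b, e, sequence[b - 1:e]))
    return spans

print(enumerate_candidates("王小二和小明"))
# [(1, 1, '王'), (1, 2, '王小'), (1, 3, '王小二'), ..., (6, 6, '明')]
```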
Step S206, combining the word label with the position of each character in the candidate word respectively to obtain the character label of the character in the candidate word.
A character label is the label of an individual character; for example, in the candidate word 王小二 the label of 王 is B-PER, the label of 小 is I-PER, and the label of 二 is E-PER. The character label is thus the label a character carries within its candidate word.
Specifically, the word label of the candidate word is combined with the position of each character in the candidate word to obtain that character's label. For example, the word label of the candidate word 王小二 is PER, meaning that every character in 王小二 belongs to PER. By position, 王 occupies the 1st position, 小 the 2nd, and 二 the 3rd. Under the BIOES labeling rule, the first character of a word is denoted B, a middle character I, and the last character E. After combination, the character label of 王 is B-PER, that of 小 is I-PER, and that of 二 is E-PER.
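The combination step can be sketched as follows, assuming the BIOES convention described above (where S marks a single-character word); the helper name is illustrative:

```python
# Combine a candidate word's label with each character's position to produce
# the per-character BIOES labels.
def char_labels(b, e, word_label):
    if word_label == "O":
        return ["O"] * (e - b + 1)
    if b == e:
        return ["S-" + word_label]              # single-character word
    return (["B-" + word_label]
            + ["I-" + word_label] * (e - b - 1)
            + ["E-" + word_label])

print(char_labels(1, 3, "PER"))   # ['B-PER', 'I-PER', 'E-PER'] for 王小二
print(char_labels(4, 4, "O"))     # ['O'] for 和
```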
Step S208, calculating a first pairing index of the candidate word based on the weight vector corresponding to the character label of each character in the candidate word; the weight vector is obtained when training the sequence labeling neural network.
The first pairing index is an indicator of whether the candidate word is reasonable. It can be understood as a score of whether the characters compose a common or correct word; the higher the score, the more likely the candidate word is correct and reasonable.
Specifically, to measure the first pairing index of a candidate word, the characters forming the candidate word are determined, and the index is calculated from the feature word vector corresponding to each such character and the weight vector corresponding to the character label to which the character belongs. The weight vector of a character label is obtained by training the sequence labeling neural network and represents the probability weight of a character belonging to that character label. After the preset sequence labeling neural network obtains the candidate words and the corresponding character labels from the feature word vectors, the score of each candidate word, i.e. its first pairing index, is calculated from the feature word vectors of the characters and the weight vectors, determined during training, of the character labels to which the characters belong.
Step S210, calculating a second pairing index of the candidate annotation sequence based on the first pairing index corresponding to each group of candidate words; the candidate tagging sequence is obtained by arranging and combining at least two groups of candidate words.
A labeling sequence is a sequence that has been labeled after its words and word labels are determined. The candidate labeling sequences are the multiple sequences formed by the candidate words together with their corresponding word labels; that is, all possible candidate labeling sequences are obtained by combining all possible words. For example, if the candidate words determined in "Wang Xiaoer and Xiaoming" (王小二和小明) are 王小二, 和 and 小明, or alternatively a different segmentation such as 王小, 二和 and 小明, then the candidate labeling sequences determined from the candidate words are 王小二-和-小明 and 王小-二和-小明, and one candidate labeling sequence is then determined as the final labeling sequence through the second pairing index obtained by calculation. The second pairing index is an indicator of whether a candidate labeling sequence is reasonable. It can be understood as a score of whether the sequence composed of the candidate words is a common or correct sequence; the higher the score, the more likely the candidate labeling sequence is correct and reasonable. Because a candidate labeling sequence consists of its candidate words, its second pairing index is measured from the first pairing indexes of all the candidate words that form it.
Specifically, to measure the second pairing index of each candidate labeling sequence, the candidate words forming the candidate labeling sequence are first determined. The second pairing index of the candidate labeling sequence is then calculated based on the first pairing index of each candidate word and the corresponding transfer parameter. The transfer parameter is the probability of converting from one word label to another, and is likewise determined when training the sequence labeling neural network. Since the determination of a candidate word is related to the preceding candidate word, the transfer parameter is typically the probability associated with the transition between the word label of the preceding candidate word and the current word label.
In step S212, the candidate annotation sequence corresponding to the second pairing index with the largest value is identified as the first annotation sequence.
The first labeling sequence is the labeling sequence determined from the multiple candidate labeling sequences. Because multiple candidate labeling sequences are formed from different candidate words, the final labeling sequence is determined according to the magnitude of the second pairing index corresponding to each candidate labeling sequence. Specifically, the first labeling sequence is determined according to the magnitude of the second pairing index: the candidate labeling sequence corresponding to the highest-valued second pairing index among the multiple candidate labeling sequences is the first labeling sequence.
According to the sequence labeling method, after a sequence labeling request carrying a sequence to be labeled is received, vector conversion is performed on each character in the sequence to be labeled to obtain the feature word vector corresponding to the character, ensuring that character-level features, i.e. the feature vector of each single character, are obtained. The feature word vectors are input into a preset sequence labeling neural network to segment the sequence to be labeled and obtain candidate words and corresponding word labels. The word label is combined with the position of each character in the candidate word to obtain the character label of each character in the candidate word, ensuring that every character receives a character label. A first pairing index of each candidate word is calculated based on the weight vectors of the character labels of its characters, so that the probability of the candidate word being correct is obtained from its characters. A second pairing index of each candidate labeling sequence, obtained by arranging and combining at least two groups of candidate words, is calculated based on the first pairing indexes of the corresponding candidate words, so that the labeling sequence is determined according to the value of the second pairing index. Sequence labeling is thus performed using character-level features, improving labeling accuracy.
In one embodiment, as shown in fig. 3, the calculating and obtaining a first pairing index of the candidate word based on the weight vector corresponding to the character tag to which each character in the candidate word belongs includes the following steps:
step S302, determining the characters forming the candidate words, and acquiring the feature vectors corresponding to the characters forming the candidate words.
Step S304, acquiring the weight vector of the character label corresponding to the character forming the candidate word.
Step S306, calculating and obtaining a first pairing index of the candidate word according to the feature vector and the weight vector of each character.
Specifically, when the first pairing index of a candidate word is evaluated, the characters that make up the candidate word are determined first. For example, the characters that constitute the candidate word 王小二 (Wang Xiaoer) are 王, 小 and 二. The first pairing index of the candidate word is then measured from the feature vector of each character in the candidate word and the weight vector of the character label to which that character belongs, using the following formula:

$$m_i = \sum_{k=b_i}^{e_i} W_{y_k}^{\top} w'_k$$

where $m_i$ is the first pairing index of the $i$-th candidate word, $w'_k$ is the feature vector of the $k$-th character in the candidate word, $y_k$ is the character label of the $k$-th character, $W_{y_k}^{\top} w'_k$ is the score of the $k$-th character under category $y_k$, and $W_{y_k}$ is the weight vector corresponding to that character label. For each character, the feature vector $w'_k$ is composed of three parts. The first part is the feature word vector $w_k$. The second part is $w_{e_i} - w_{b_i}$, where $w_{e_i}$ is the feature vector of the end-position character of candidate word $s_i$ and $w_{b_i}$ is the feature vector of its start-position character. The third part is the embedded vector $\Phi(k - b_i + 1)$ of the character's position within the candidate word. That is, the feature vector of the $k$-th character is:

$$w'_k = \bigl[\, w_k \,;\; w_{e_i} - w_{b_i} \,;\; \Phi(k - b_i + 1) \,\bigr]$$
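A sketch of this computation, reusing the feature word vectors w from the earlier BLSTM sketch, is shown below; the label set, dimensions, and randomly initialized parameters stand in for values that would be learned during training:

```python
# First pairing index m_i of a candidate word from per-character scores.
import torch

LABELS = {"B-PER": 0, "I-PER": 1, "E-PER": 2, "S-PER": 3, "O": 4}
FEAT = 2 * 100 + 25                  # [w_k ; w_e - w_b ; position embedding]
W = torch.randn(len(LABELS), FEAT)   # weight vectors; learned during training
POS = torch.randn(10, 25)            # position embeddings Phi(1..10); learned

def first_pairing_index(w, b, e, char_tags):
    """m_i for candidate word (b, e); w: [seq_len, 100] BLSTM features,
    char_tags: BIOES character labels of the word, positions 1-based."""
    score = torch.tensor(0.0)
    for k in range(b, e + 1):
        w_prime = torch.cat([w[k - 1],             # part 1: feature word vector w_k
                             w[e - 1] - w[b - 1],  # part 2: end minus start features
                             POS[k - b]])          # part 3: Phi(k - b_i + 1)
        score = score + W[LABELS[char_tags[k - b]]] @ w_prime
    return score

m1 = first_pairing_index(w[0], 1, 3, ["B-PER", "I-PER", "E-PER"])  # score of 王小二
```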
in an embodiment, calculating the second pairing index of the candidate annotation sequence based on the first pairing index corresponding to each group of candidate words specifically includes: determining candidate words forming the candidate tagging sequence, obtaining transfer parameters corresponding to the candidate words, and measuring and calculating to obtain a second pairing index of the candidate tagging sequence based on the first pairing index and the transfer parameters corresponding to the candidate words.
Specifically, the candidate words forming the candidate labeling sequence are determined, and the transfer parameter and the first pairing index corresponding to each candidate word are obtained. The sum of each candidate word's first pairing index and transfer parameter is computed, and the product of the exponentials of these sums is the second pairing index of the candidate labeling sequence:

$$\mathrm{Score}(S, w) = \prod_{i} \Psi(l_{i-1}, l_i, w, b_i, e_i), \qquad \Psi(l_{i-1}, l_i, w, b_i, e_i) = \exp\bigl(m_i + b_{l_{i-1}, l_i}\bigr)$$

where $S$ denotes the candidate labeling sequence, $\mathrm{Score}(S, w)$ is its second pairing index, $m_i$ is the first pairing index of the $i$-th candidate word in the sequence, $\Psi(l_{i-1}, l_i, w, b_i, e_i)$ is the exponential of the sum of the candidate word's score and its transfer parameter, $\exp$ is the exponential function, and $b_{l_{i-1}, l_i}$ is the transfer parameter for transitioning from word label $l_{i-1}$ to word label $l_i$.
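The following sketch mirrors this formula directly; the transfer-parameter table contents are illustrative placeholders for values learned during training:

```python
# Second pairing index Score(S, w) of a candidate labeling sequence.
import math

def second_pairing_index(segments, m, transfer):
    """segments: list of (b, e, label); m: first pairing indexes of the
    candidate words; transfer[(prev_label, label)]: transfer parameter."""
    score = 1.0
    prev = "<start>"
    for (b, e, label), m_i in zip(segments, m):
        score *= math.exp(m_i + transfer[(prev, label)])
        prev = label
    return score

segments = [(1, 3, "PER"), (4, 4, "O"), (5, 6, "PER")]   # 王小二 / 和 / 小明
transfer = {("<start>", "PER"): 0.4, ("PER", "O"): 0.8, ("O", "PER"): 0.7}
print(second_pairing_index(segments, [2.1, 1.3, 1.9], transfer))
```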
In one embodiment, as shown in fig. 4, a block diagram of a sequence labeling neural network is provided, and a detailed description is made based on the block diagram.
Specifically, the sequence labeling neural network comprises an embedding layer, a BLSTM network model layer and an HSCRF model layer. After the sequence to be labeled is obtained, it is mapped through the embedding layer to obtain the word vector representation of each character. The word vector representations serve as the input of the BLSTM network model layer, which converts them into the feature word vector corresponding to each character of the sequence to be labeled. The HSCRF model layer segments the sequence to be labeled based on the feature word vectors and determines the candidate words and the word labels corresponding to the candidate words. Based on each candidate word and its word label, the character label of each character is determined, and all possible candidate labeling sequences are determined. The first pairing index of each candidate word is measured using the weight vectors of the character labels determined when training the sequence labeling neural network. Similarly, the second pairing index of each candidate labeling sequence is calculated from the first pairing indexes of its constituent candidate words and the transfer parameters of the word labels, also determined when training the sequence labeling neural network. The final labeling sequence is determined according to the magnitude of the second pairing index corresponding to each candidate labeling sequence: the candidate labeling sequence corresponding to the highest-valued second pairing index is selected, and the sequence labeling itself is performed through the SCRF (semi-Markov conditional random field) model layer in the HSCRF model to obtain the labeling sequence.
That is to say, after the character label of each character is obtained, each character's feature vector is multiplied by the weight vector of its character label, and the products are summed; the result is the first pairing index of the candidate word. Because parts of speech differ, the parts of speech of the characters within a candidate word must conform to normal usage, and the character label of the same character differs across candidate words according to the part of speech. Therefore, by calculating with the weight vector, which indicates how reasonable it is for a character to carry its character label within the corresponding candidate word, a first pairing index that evaluates whether the candidate word is reasonable is obtained. Similarly, the candidate labeling sequences are composed of candidate words, so calculating with the first pairing indexes that evaluate each candidate word yields a second pairing index that evaluates whether each candidate labeling sequence is reasonable. Finally, the first labeling sequence is determined from the candidate labeling sequences according to the magnitude of each second pairing index.
Further, referring to fig. 4, assuming the sequence to be labeled is "Wang Xiaoer and Xiaoming" (王小二和小明), the word vector representations e1, e2, e3, e4, e5 and e6 corresponding to the characters are obtained through embedding-layer mapping, and the BLSTM network model converts them into the corresponding feature word vectors w1, w2, w3, w4, w5 and w6. The feature word vectors w1 to w6 serve as the input of the HSCRF model layer, which performs word segmentation based on them to determine all possible candidate words and the candidate labeling sequences those candidate words may form. The final candidate labeling sequence is then determined by calculating the candidate word scores and the candidate labeling sequence scores. In fig. 4, m1, m2 and m3 are the scores of the candidate words in the selected sequence, i.e. the sequence of candidate words s1, s2 and s3 corresponding to these scores is determined to be the sequence with the highest final score. This candidate labeling sequence is then labeled, with the label of candidate word s1 being (1, 3, PER), that of candidate word s2 being (4, 4, O), and that of candidate word s3 being (5, 6, PER).
Further, the sequence labeling neural network is a pre-trained network, and its training process includes: obtaining a large number of training sequences, i.e. corpus samples for training, and preprocessing them by manual word segmentation, labeling and the like. The labeled corpus samples are input into the sequence labeling neural network for training. After model training is finished, unlabeled corpus samples for testing are input into the model for sequence labeling to obtain test labeling sequences, while the same test samples are manually labeled to obtain manual labeling sequences. The model parameters are tuned according to the comparison between the test labeling sequences and the manual labeling sequences, and iterative training yields the finally applicable sequence labeling neural network. The parameters include the weight vectors, the transfer parameters and the like, and can be used directly when the network is subsequently applied.
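A high-level sketch of this training loop is shown below; the model object and its neg_log_likelihood method are assumed interfaces, and the optimizer choice is an illustrative assumption:

```python
# Iterative training of the sequence labeling network on labeled samples.
import torch

def train(model, labeled_corpus, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for chars, gold_segments in labeled_corpus:   # manually segmented, labeled samples
            opt.zero_grad()
            loss = model.neg_log_likelihood(chars, gold_segments)
            loss.backward()          # tunes weight vectors and transfer parameters
            opt.step()
```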
In one embodiment, as shown in fig. 5, another sequence labeling method is provided, which further includes the following steps after step S212:
step S214, inputting the feature word vector into a preset conditional random field model to label the sequence to be labeled to obtain a second labeling sequence.
A conditional random field (CRF) is an undirected graph model that performs well in sequence labeling tasks such as word segmentation, part-of-speech tagging and named entity recognition. In this embodiment, to make full use of the character labels, the output layer of the CRF model and the output layer of the designed HSCRF model are integrated and co-trained when training the sequence labeling neural network. The CRF output layer and the HSCRF output layer share the feature word vectors of the sequence to be labeled, and the training objective becomes the sum of the CRF output layer loss function and the HSCRF output layer loss function.
Specifically, after the feature word vectors of the sequence to be labeled are obtained, the HSCRF model layer performs the sequence labeling task based on them, and the same feature word vectors are used as the input of the CRF model to perform the sequence labeling task in parallel. That is, the labeling sequence finally produced by the HSCRF model is the first labeling sequence, and the labeling sequence finally produced by the CRF model is the second labeling sequence.
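A minimal sketch of the shared-feature co-training objective follows; the two layer objects and their neg_log_likelihood methods are assumed interfaces, not APIs defined by the patent:

```python
# Joint objective: CRF and HSCRF output layers share the BLSTM feature word
# vectors, and the optimized loss is the sum of the two layers' losses.
def joint_loss(features, gold_char_tags, gold_segments, crf_layer, hscrf_layer):
    loss_crf = crf_layer.neg_log_likelihood(features, gold_char_tags)
    loss_hscrf = hscrf_layer.neg_log_likelihood(features, gold_segments)
    return loss_crf + loss_hscrf     # co-training objective
```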
Step S216, calculating loss values of the first annotation sequence and the second annotation sequence by using a preset decoding algorithm, and determining an annotation sequence with the minimum loss value from the first annotation sequence and the second annotation sequence as a final annotation result.
Specifically, the final labeling result is determined by calculating the loss values of the first labeling sequence and the second labeling sequence. Namely, the sequence with smaller loss value in the first annotation sequence and the second annotation sequence is used as the final annotation result.
In one embodiment, the method includes calculating loss values of a first labeled sequence and a second labeled sequence by using a preset decoding algorithm, and determining a labeled sequence with the minimum loss value from the first labeled sequence and the second labeled sequence as a final labeling result, which specifically includes:
the preset decoding algorithm is to calculate loss values of corresponding labeled sequences through an HSFCRS model and a CRF model respectively, then exchange the models to calculate loss values of labeled sequences of the other model respectively, and obtain a labeled sequence with the minimum loss value as a final labeled result by accumulating the two loss values.
Specifically, first and second loss values for the first and second annotation sequences are calculated based on a log-likelihood function. And determining a final labeling result from the first labeling sequence and the second labeling sequence according to the first loss value and the second loss value.
The log-likelihood function is the loss function used by both the HSCRF model and the CRF model. The loss values corresponding to the first labeling sequence and the second labeling sequence are calculated with the log-likelihood function: the loss value corresponding to the first labeling sequence is the first loss value, and the loss value corresponding to the second labeling sequence is the second loss value.
Specifically, assuming that the first labeling sequence and the second labeling sequence are Sh and Sc respectively, the loss values NNLh and NNLc of Sh and Sc are first calculated with the log-likelihood function under the model corresponding to each labeling sequence: the loss value NNLh of the first labeling sequence Sh is calculated through the HSCRF model, and the loss value NNLc of the second labeling sequence Sc is calculated through the CRF model. The models are then exchanged, and the loss values NNLh_by_c and NNLc_by_h of Sh and Sc are calculated again using the log-likelihood function: that is, the loss value NNLc_by_h of the second labeling sequence Sc is calculated through the HSCRF model, and the loss value NNLh_by_c of the first labeling sequence Sh is calculated through the CRF model. The log-likelihood function is calculated as:

$$\log p(s_i \mid w) = \mathrm{score}(s_i, w) - \log \sum_{s'} \exp\bigl(\mathrm{score}(s', w)\bigr)$$

where $p(s_i \mid w)$ is the probability of $s_i$, $\mathrm{score}(s_i, w)$ is the score of $s_i$, and the log-sum-exp term $\log \sum_{s'} \exp(\mathrm{score}(s', w))$ is the total score of all the candidates that can make up a labeling sequence. Further, after the loss values NNLh and NNLc, and the loss values NNLh_by_c and NNLc_by_h, are obtained through the log-likelihood function, NNLh and NNLh_by_c corresponding to the first labeling sequence are added to obtain the first loss value, and NNLc and NNLc_by_h corresponding to the second labeling sequence are added to obtain the second loss value. The final labeling result is determined by comparing the first loss value with the second loss value: if the first loss value is smaller than the second loss value, the first labeling sequence corresponding to the first loss value is taken as the final labeling result; if the second loss value is smaller than the first loss value, the second labeling sequence corresponding to the second loss value is taken as the final labeling result.
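The comparison can be sketched as follows; nll is an assumed helper that returns the log-likelihood loss of a labeling sequence under a given model:

```python
# Cross-model decoding: score each labeling sequence under both models,
# accumulate its two losses, and keep the sequence with the smaller total.
def choose_final(seq_h, seq_c, hscrf_model, crf_model, nll):
    first_loss = nll(hscrf_model, seq_h) + nll(crf_model, seq_h)    # NNLh + NNLh_by_c
    second_loss = nll(crf_model, seq_c) + nll(hscrf_model, seq_c)   # NNLc + NNLc_by_h
    return seq_h if first_loss < second_loss else seq_c
```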
It should be understood that although the steps in the flowcharts of fig. 2, 3 and 5 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2, 3 and 5 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a sequence annotation apparatus, including: a conversion module 602, a word segmentation module 604, a combination module 606, a calculation module 608, and a recognition module 610, wherein:
the conversion module 602 is configured to, when receiving a sequence tagging request carrying a sequence to be tagged, perform vector conversion on each character in the sequence to be tagged to obtain a feature word vector corresponding to the character.
The word segmentation module 604 is configured to input the feature word vector into a preset sequence tagging neural network, so as to perform word segmentation on the sequence to be tagged, and obtain candidate words and word labels corresponding to the candidate words.
And the combining module 606 is configured to combine the word label with the position of each character in the candidate word, so as to obtain a character label to which the character belongs in the candidate word.
The calculation module 608 is configured to calculate a first pairing index of the candidate word based on the weight vector of the character label to which each character in the candidate word belongs; the weight vector is obtained when training the sequence labeling neural network.
The calculation module 608 is further configured to calculate a second pairing index of the candidate labeling sequence based on the first pairing index corresponding to each group of candidate words; the candidate labeling sequence is obtained by arranging and combining at least two groups of candidate words.
The identifying module 610 is configured to identify the candidate annotation sequence corresponding to the second pairing index with the largest value as the first annotation sequence.
In an embodiment, the conversion module 602 is further configured to obtain, from a preset word vector table, a word vector representation corresponding to each character in the sequence to be labeled; and converting the word vector representation corresponding to each character by using a neural network to obtain corresponding characteristic word vectors.
In one embodiment, the calculation module 608 is further configured to determine characters forming the candidate word, and obtain a feature vector corresponding to the characters forming the candidate word; acquiring a weight vector of a character label corresponding to a character forming a candidate word; and calculating to obtain a first pairing index of the candidate word according to the feature vector and the weight vector of each character.
In an embodiment, the calculating module 608 is further configured to determine candidate words forming the candidate tagging sequence, obtain a transfer parameter corresponding to the candidate word, and calculate to obtain a second pairing index of the candidate tagging sequence based on the first pairing index and the transfer parameter corresponding to the candidate word.
In one embodiment, the sequence labeling device further comprises a comparison module, configured to input the feature word vector into a preset conditional random field model, so as to label the sequence to be labeled, so as to obtain a second labeled sequence; and calculating loss values of the first labeling sequence and the second labeling sequence by using a preset decoding algorithm, and determining the labeling sequence with the minimum loss value from the first labeling sequence and the second labeling sequence as a final labeling result.
In one embodiment, the comparison module is further configured to calculate a first loss value and a second loss value for the first annotation sequence and the second annotation sequence based on a log-likelihood function. And determining a final labeling result from the first labeling sequence and the second labeling sequence according to the first loss value and the second loss value.
For the specific definition of the sequence labeling apparatus, reference may be made to the definition of the sequence labeling method above, and details are not described herein again. The modules in the sequence labeling apparatus can be implemented in whole or in part by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a sequence annotation method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program:
when a sequence marking request carrying a sequence to be marked is received, carrying out vector conversion on each character in the sequence to be marked to obtain a feature word vector corresponding to the character;
inputting the characteristic word vector into a preset sequence tagging neural network to perform word segmentation on a sequence to be tagged to obtain candidate words and word labels corresponding to the candidate words;
combining the word labels with the positions of all characters in the candidate words respectively to obtain the character labels of the characters in the candidate words;
calculating a first pairing index of the candidate word based on the weight vector of the character label of each character in the candidate word; the weight vector is obtained when training the sequence labeling neural network;
measuring and calculating a second pairing index of the candidate annotation sequence based on the first pairing index corresponding to each group of candidate words; the candidate tagging sequence is obtained by arranging and combining at least two groups of candidate words;
and identifying the candidate annotation sequence corresponding to the second pairing index with the largest numerical value as the first annotation sequence.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring word vector representations corresponding to all characters in a sequence to be marked from a preset word vector table; and converting the word vector representation corresponding to each character by using a neural network to obtain corresponding characteristic word vectors.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
determining characters forming the candidate words, and acquiring characteristic vectors corresponding to the characters forming the candidate words; acquiring a weight vector of a character label corresponding to a character forming a candidate word; and calculating to obtain a first pairing index of the candidate word according to the feature vector and the weight vector of each character.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
determining candidate words forming the candidate tagging sequence, obtaining transfer parameters corresponding to the candidate words, and measuring and calculating to obtain a second pairing index of the candidate tagging sequence based on the first pairing index and the transfer parameters corresponding to the candidate words.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
inputting the feature word vector into a preset conditional random field model to label the sequence to be labeled to obtain a second labeling sequence; and determining the annotation sequence with the minimum loss value from the first annotation sequence and the second annotation sequence as a final annotation result.
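A sketch of how the two decoders' outputs might be compared; the label sequences and loss values below are stand-ins, and the loss itself would come from the log-likelihood computation described in the next embodiment:

    # Hypothetical outputs of the two decoders for the same input sequence.
    first_sequence = ["B-ORG", "I-ORG", "O"]   # from the sequence labeling network
    second_sequence = ["B-ORG", "O", "O"]      # from the conditional random field model

    # stand-in loss values; lower is better
    losses = {tuple(first_sequence): 0.42, tuple(second_sequence): 0.57}
    final = min(losses, key=losses.get)        # keep the sequence with the minimum loss
    print(list(final))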
In one embodiment, the processor, when executing the computer program, further performs the steps of:
calculating a first loss value and a second loss value for the first annotation sequence and the second annotation sequence based on a log-likelihood function; and determining a final labeling result from the first annotation sequence and the second annotation sequence according to the first loss value and the second loss value.
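One standard way to realize such a log-likelihood loss is the negative log-likelihood of the decoded sequence under a softmax over the scores of all candidate sequences; the concrete scores below are assumptions for illustration:

    import numpy as np

    # Hypothetical second pairing indices: the decoded sequence vs. all candidates.
    score_decoded = 4.3
    scores_all = np.array([4.3, 3.1, 2.8, 1.9])

    # negative log-likelihood: -score + log(sum of exponentiated candidate scores)
    loss = -score_decoded + np.log(np.exp(scores_all).sum())
    print(loss)  # smaller when the decoded sequence dominates the candidates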
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, performs the following steps:
when a sequence marking request carrying a sequence to be marked is received, carrying out vector conversion on each character in the sequence to be marked to obtain a characteristic word vector corresponding to the character;
inputting the feature word vector into a preset sequence labeling neural network to perform word segmentation on a sequence to be labeled to obtain candidate words and word labels corresponding to the candidate words;
combining the word labels with the positions of all characters in the candidate words respectively to obtain the character labels of the characters in the candidate words;
calculating a first pairing index of the candidate word based on the weight vector of the character label of each character in the candidate word; the weight vector is obtained when training the sequence labeling neural network;
measuring and calculating a second pairing index of the candidate annotation sequence based on the first pairing index corresponding to each group of candidate words; the candidate tagging sequence is obtained by arranging and combining at least two groups of candidate words;
and identifying the candidate annotation sequence corresponding to the second pairing index with the largest numerical value as the first annotation sequence.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring word vector representations corresponding to all characters in a sequence to be marked from a preset word vector table; and converting the word vector representation corresponding to each character by using a neural network to obtain corresponding characteristic word vectors.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining characters forming the candidate words, and acquiring feature vectors corresponding to the characters forming the candidate words; acquiring a weight vector of a character label corresponding to a character forming a candidate word; and calculating to obtain a first pairing index of the candidate word according to the feature vector and the weight vector of each character.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining candidate words forming the candidate annotation sequence, obtaining transfer parameters corresponding to the candidate words, and calculating to obtain a second pairing index of the candidate annotation sequence based on the first pairing index and the transfer parameters corresponding to the candidate words.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the feature word vector into a preset conditional random field model to label the sequence to be labeled to obtain a second labeling sequence; and determining the annotation sequence with the minimum loss value from the first annotation sequence and the second annotation sequence as a final annotation result.
In one embodiment, the computer program when executed by the processor further performs the steps of:
calculating a first loss value and a second loss value for the first annotation sequence and the second annotation sequence based on a log-likelihood function; and determining a final labeling result from the first annotation sequence and the second annotation sequence according to the first loss value and the second loss value.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of sequence annotation, the method comprising:
when a sequence marking request carrying a sequence to be marked is received, carrying out vector conversion on each character in the sequence to be marked to obtain a feature word vector corresponding to the character;
inputting the characteristic word vector into a preset sequence labeling neural network to perform word segmentation on the sequence to be labeled to obtain a plurality of groups of candidate words and word labels corresponding to each group of candidate words, wherein the candidate words are a set of words formed between adjacent characters in the sequence to be labeled;
combining the word label with the position of each character in the candidate word respectively to obtain the character label of the character in the candidate word;
determining characters forming the candidate words, and acquiring characteristic vectors corresponding to the characters forming the candidate words;
acquiring a weight vector of a character label corresponding to the character forming the candidate word;
calculating to obtain a first pairing index of the candidate word according to the feature vector and the weight vector of each character; the first pairing index is an index that measures whether the candidate word is reasonable; the weight vector is obtained when training the sequence labeling neural network;
calculating a second pairing index of a candidate annotation sequence based on the first pairing index corresponding to each group of candidate words; the candidate tagging sequence is obtained by arranging and combining at least two groups of candidate words; the second pairing index is a score used for identifying whether the corresponding candidate tagging sequence formed by the candidate words is a common or correct sequence, and the probability that the candidate tagging sequence is correct and reasonable is determined according to the score value;
and identifying the candidate annotation sequence corresponding to the second pairing index with the largest numerical value as a first annotation sequence.
2. The method according to claim 1, wherein the performing vector conversion on each character in the sequence to be labeled to obtain a feature word vector corresponding to the character comprises:
acquiring word vector representations corresponding to all characters in the sequence to be marked from a preset word vector table;
and converting the word vector representation corresponding to each character by using a preset neural network to obtain a corresponding characteristic word vector.
3. The method of claim 1, wherein measuring and calculating a second pairing index of a candidate annotation sequence based on the first pairing index corresponding to each group of the candidate words comprises:
determining candidate words forming the candidate tagging sequence;
obtaining transfer parameters corresponding to the candidate words;
and calculating to obtain a second pairing index of the candidate annotation sequence based on the first pairing index and the transfer parameter corresponding to the candidate word.
4. The method of claim 1, wherein after identifying the candidate annotation sequence corresponding to the second pairing index with the largest value as the first annotation sequence, the method further comprises:
inputting the feature word vector into a preset conditional random field model to label the sequence to be labeled to obtain a second labeling sequence;
and calculating loss values of the first labeling sequence and the second labeling sequence by using a preset decoding algorithm, and determining a labeling sequence with the minimum loss value from the first labeling sequence and the second labeling sequence as a final labeling result.
5. The method of claim 4, wherein the step of calculating the loss values of the first labeling sequence and the second labeling sequence by using a predetermined decoding algorithm, and determining the labeling sequence with the minimum loss value from the first labeling sequence and the second labeling sequence as the final labeling result comprises:
calculating a first loss value and a second loss value corresponding to the first annotation sequence and the second annotation sequence based on a log-likelihood function;
and determining a final labeling result from the first labeling sequence and the second labeling sequence according to the first loss value and the second loss value.
6. A sequence annotation apparatus, characterized in that the apparatus comprises:
the conversion module is used for performing vector conversion on each character in the sequence to be labeled when receiving a sequence labeling request carrying the sequence to be labeled to obtain a feature word vector corresponding to the character;
the word segmentation module is used for inputting the characteristic word vector into a preset sequence tagging neural network so as to segment the sequence to be tagged to obtain candidate words and word labels corresponding to the candidate words, wherein the candidate words are a set of words formed between adjacent characters in the sequence to be tagged;
the combination module is used for combining the word label with the position of each character in the candidate word respectively to obtain the character label of the character in the candidate word;
the measuring and calculating module is used for determining the characters forming the candidate words and acquiring the characteristic vectors corresponding to the characters forming the candidate words; acquiring a weight vector of a character label corresponding to the character forming the candidate word; calculating to obtain a first pairing index of the candidate word according to the feature vector and the weight vector of each character; the first pairing index is an index that measures whether the candidate word is reasonable; the weight vector is obtained when training the sequence labeling neural network;
the measuring and calculating module is further used for measuring and calculating a second pairing index of the candidate annotation sequence based on the first pairing index corresponding to each group of candidate words; the candidate tagging sequence is obtained by arranging and combining at least two groups of candidate words; the second pairing index is a score used for identifying whether the corresponding candidate tagging sequence formed by the candidate words is a common or correct sequence, and the probability that the candidate tagging sequence is correct and reasonable is determined according to the score value;
and the identification module is used for identifying the candidate annotation sequence corresponding to the second pairing index with the largest numerical value as the first annotation sequence.
7. The apparatus according to claim 6, wherein the conversion module is further configured to obtain, from a preset word vector table, a word vector representation corresponding to each character in the sequence to be labeled;
and converting the word vector representation corresponding to each character by using a preset neural network to obtain a corresponding characteristic word vector.
8. The apparatus of claim 6, wherein the calculation module is further configured to determine candidate words that constitute the candidate annotation sequence; obtaining transfer parameters corresponding to the candidate words; and calculating to obtain a second pairing index of the candidate annotation sequence based on the first pairing index and the transfer parameter corresponding to the candidate word.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN201910740751.XA 2019-08-12 2019-08-12 Sequence labeling method and device, computer equipment and storage medium Active CN110688853B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910740751.XA CN110688853B (en) 2019-08-12 2019-08-12 Sequence labeling method and device, computer equipment and storage medium
PCT/CN2019/116969 WO2021027125A1 (en) 2019-08-12 2019-11-11 Sequence labeling method and apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910740751.XA CN110688853B (en) 2019-08-12 2019-08-12 Sequence labeling method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110688853A CN110688853A (en) 2020-01-14
CN110688853B true CN110688853B (en) 2022-09-30

Family

ID=69108172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910740751.XA Active CN110688853B (en) 2019-08-12 2019-08-12 Sequence labeling method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110688853B (en)
WO (1) WO2021027125A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743117B (en) * 2020-05-29 2024-04-09 华为技术有限公司 Method and device for entity labeling
CN111885000B (en) * 2020-06-22 2022-06-21 网宿科技股份有限公司 Network attack detection method, system and device based on graph neural network
EP3958530A4 (en) * 2020-06-22 2022-02-23 Wangsu Science & Technology Co., Ltd. Graph neural network-based method, system, and apparatus for detecting network attack
CN113220836B (en) * 2021-05-08 2024-04-09 北京百度网讯科技有限公司 Training method and device for sequence annotation model, electronic equipment and storage medium
CN113609850B (en) * 2021-07-02 2024-05-17 北京达佳互联信息技术有限公司 Word segmentation processing method and device, electronic equipment and storage medium
CN113849647B (en) * 2021-09-28 2024-05-31 平安科技(深圳)有限公司 Dialogue identity recognition method, device, equipment and storage medium
CN116089618B (en) * 2023-04-04 2023-06-27 江西师范大学 Drawing meaning network text classification model integrating ternary loss and label embedding

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017100298A1 (en) * 2015-12-07 2017-06-15 The University Of Florida Research Foundation, Inc. Pulse-based automatic speech recognition
KR20180001889A (en) * 2016-06-28 2018-01-05 삼성전자주식회사 Language processing method and apparatus
US10402752B2 (en) * 2016-11-18 2019-09-03 Facebook, Inc. Training sequence natural language processing engines
CN106778887B (en) * 2016-12-27 2020-05-19 瑞安市辉煌网络科技有限公司 Terminal and method for determining sentence mark sequence based on conditional random field
CN108038103B (en) * 2017-12-18 2021-08-10 沈阳智能大数据科技有限公司 Method and device for segmenting text sequence and electronic equipment
CN107977364B (en) * 2017-12-30 2022-02-25 科大讯飞股份有限公司 Method and device for segmenting dimension language sub-words
CN108460013B (en) * 2018-01-30 2021-08-20 大连理工大学 Sequence labeling model and method based on fine-grained word representation model
CN108717409A (en) * 2018-05-16 2018-10-30 联动优势科技有限公司 A kind of sequence labelling method and device
CN108829681B (en) * 2018-06-28 2022-11-11 鼎富智能科技有限公司 Named entity extraction method and device
CN109753653B (en) * 2018-12-25 2023-07-11 金蝶软件(中国)有限公司 Entity name recognition method, entity name recognition device, computer equipment and storage medium
CN109829162B (en) * 2019-01-30 2022-04-08 新华三大数据技术有限公司 Text word segmentation method and device
CN110008475A (en) * 2019-04-10 2019-07-12 出门问问信息科技有限公司 Participle processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110688853A (en) 2020-01-14
WO2021027125A1 (en) 2021-02-18

Similar Documents

Publication Publication Date Title
CN110688853B (en) Sequence labeling method and device, computer equipment and storage medium
CN110765763B (en) Error correction method and device for voice recognition text, computer equipment and storage medium
CN111460807B (en) Sequence labeling method, device, computer equipment and storage medium
CN108427707B (en) Man-machine question and answer method, device, computer equipment and storage medium
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
CN111062215B (en) Named entity recognition method and device based on semi-supervised learning training
CN110334179B (en) Question-answer processing method, device, computer equipment and storage medium
CN110765785B (en) Chinese-English translation method based on neural network and related equipment thereof
CN109446885B (en) Text-based component identification method, system, device and storage medium
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN110472049B (en) Disease screening text classification method, computer device and readable storage medium
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
CN113449489B (en) Punctuation mark labeling method, punctuation mark labeling device, computer equipment and storage medium
CN110795938A (en) Text sequence word segmentation method, device and storage medium
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN111931490A (en) Text error correction method, device and storage medium
CN113158687B (en) Semantic disambiguation method and device, storage medium and electronic device
CN110598210A (en) Entity recognition model training method, entity recognition device, entity recognition equipment and medium
CN110569486A (en) sequence labeling method and device based on double architectures and computer equipment
CN110874536A (en) Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN113255343A (en) Semantic identification method and device for label data, computer equipment and storage medium
CN113343711A (en) Work order generation method, device, equipment and storage medium
CN111898339A (en) Ancient poetry generation method, device, equipment and medium based on constraint decoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant