CN111222317A - Sequence labeling method, system and computer equipment - Google Patents

Sequence labeling method, system and computer equipment

Info

Publication number
CN111222317A
CN111222317A (application CN201910983279.2A)
Authority
CN
China
Prior art keywords
sequence
word
target text
vector
labeling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910983279.2A
Other languages
Chinese (zh)
Other versions
CN111222317B (en)
Inventor
金戈
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910983279.2A (granted as CN111222317B)
Priority to PCT/CN2019/117403 (published as WO2021072852A1)
Publication of CN111222317A
Application granted
Publication of CN111222317B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a sequence labeling method, which comprises the following steps: receiving a target text sequence, and converting the target text sequence into a corresponding sentence vector, word vectors of the words, and position vectors of the words; inputting the sentence vector, the word vectors, and the position vectors of the target text sequence into a trained BERT model, and outputting a first labeling sequence corresponding to the target text sequence through the BERT model; inputting the first labeling sequence into a fully connected layer, and outputting a second labeling sequence through the fully connected layer; using the second labeling sequence as the input sequence of a conditional random field (CRF) model to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model; and generating a named entity sequence according to the tag sequence and outputting it. The embodiment of the invention solves the problem that existing models cannot take long-term contextual relationships into account, allows the model to extract named entities directly from text, and achieves the technical effect of improving the accuracy of entity recognition.

Description

Sequence labeling method, system and computer equipment
Technical Field
The embodiments of the present invention relate to the field of sequence annotation, and in particular, to a sequence annotation method, system, computer device, and computer-readable storage medium.
Background
Named entity recognition is one of the most fundamental and widely used natural language processing tasks. It identifies entities with special meaning in text, mainly including person names, place names, organization names, proper nouns, and the like. Named entity recognition is an important basic tool for downstream application fields such as information extraction, question answering systems, syntactic analysis, machine translation, and semantic-web-oriented metadata annotation. With it, a natural language model can be built that understands, analyzes, and answers natural language like a human. However, existing models often cannot take long-term contextual information into account, so their recognition accuracy is limited.
Therefore, how to overcome the inability of existing models to consider long-term contextual relationships, and thereby further improve the recognition accuracy of sequence labeling, has become one of the technical problems to be solved.
Disclosure of Invention
In view of the above, it is necessary to provide a sequence labeling method, a sequence labeling system, a computer device, and a computer-readable storage medium, so as to solve the technical problem that existing models cannot consider long-term contextual relationships, which limits the recognition accuracy of sequence labeling.
In order to achieve the above object, an embodiment of the present invention provides a sequence tagging method, where the method includes:
receiving a target text sequence, and converting the target text sequence into corresponding sentence vectors, word vectors of all words and position vectors of all words;
inputting sentence vectors of the target text sequence, word vectors of all words and position vectors of all words into a trained BERT model, and outputting a first labeling sequence corresponding to the target text sequence through the BERT model, wherein the first labeling sequence comprises a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vectors represent first probabilities of corresponding words belonging to each first label in n first labels;
inputting the first labeling sequence into a full-connection layer, and outputting a second labeling sequence through the full-connection layer, wherein the second labeling sequence comprises a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vectors represent second probabilities of the corresponding words belonging to each of n second tags;
using the second labeling sequence as the input sequence of a conditional random field CRF model to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model; and
and generating a named entity sequence according to the label sequence, and outputting the named entity sequence.
Illustratively, the step of converting the target text sequence into corresponding sentence vectors, word vectors for respective words, and position vectors for respective words includes:
inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, wherein the word vectors comprise at least one punctuation vector;
inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, wherein the n word vector sets correspond to n segmentation codes;
performing coding operation on each division code through position coding, and determining the position information coding of each division code to obtain the position vector of each word in the target text sequence; and
and generating a sentence vector of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
Illustratively, the step of outputting, by the BERT model, a first annotation sequence corresponding to the target text sequence includes:
performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence;
and generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
Illustratively, the step of inputting the first annotation sequence into a fully-connected layer and outputting a second annotation sequence through the fully-connected layer comprises:
inputting the first labeling sequence into the neural network structure of a fully connected layer and performing additional feature extraction to obtain a second probability of each second label of each word in the target text sequence, wherein the operation formula for the additional feature extraction of the i-th word in the target text sequence is B_i = wX_i + b, where X_i is the vector of first probabilities of the first labels of the i-th word in the first labeling sequence, and w and b are learnable model parameters;
and generating a second labeling sequence according to the second probability of each label of each word in the target text sequence.
Illustratively, the step of using the second labeling sequence as the input sequence of a conditional random field CRF model to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model includes:
inputting the second annotation sequence into a CRF model;
performing Viterbi decoding on the second labeling sequence through the Viterbi algorithm to obtain the optimal path in the second labeling sequence, wherein the optimal path corresponds to the tag sequence with the highest probability over the whole target text sequence; and
generating the tag sequence according to the optimal path.
In order to achieve the above object, an embodiment of the present invention further provides a sequence annotation system, including:
the text receiving module is used for receiving a target text sequence and converting the target text sequence into corresponding sentence vectors, word vectors of all the words and position vectors of all the words;
a first labeling module, configured to input a sentence vector of the target text sequence, a word vector of each word, and a position vector of each word into a trained BERT model, and output a first labeling sequence corresponding to the target text sequence through the BERT model, where the first labeling sequence includes a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vector represents a first probability that a corresponding word belongs to each first tag of the n first tags;
a second labeling module, configured to input the first labeling sequence into a full-connected layer, and output a second labeling sequence through the full-connected layer, where the second labeling sequence includes a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vector represents a second probability that a corresponding word belongs to each of n second tags;
an output tag module, configured to use the second labeling sequence as the input sequence of a conditional random field CRF model, so as to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model; and
and the output entity module is used for generating a named entity sequence according to the label sequence and outputting the named entity sequence.
Illustratively, the receive text module is further configured to:
inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, wherein the word vectors comprise at least one punctuation vector;
inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, wherein the n word vector sets correspond to n segmentation codes; and
performing coding operation on each division code through position coding, and determining the position information coding of each division code to obtain the position vector of each word in the target text sequence;
and generating a sentence vector of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
Illustratively, the first annotation module is further configured to:
performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence;
and generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
To achieve the above object, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when executed by the processor, the computer program implements the steps of the sequence tagging method as described above.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the sequence tagging method as described above.
The sequence labeling method, system, computer device, and computer-readable storage medium provided by the embodiments of the present invention offer an effective sequence labeling approach for text sequences. The embodiments of the present invention solve the technical problem that prior-art models cannot consider long-term contextual information, which limits recognition accuracy; they achieve the technical effect that named entities in a sentence can be extracted by directly inputting the original sentence into the model; and they provide strong adaptability, a wide application range, and improved entity-recognition accuracy for sequence labeling.
Drawings
Fig. 1 is a flowchart illustrating a sequence tagging method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of program modules of a second embodiment of the sequence annotation system.
Fig. 3 is a schematic diagram of a hardware structure of a third embodiment of the computer device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions of "first", "second", etc. in the present invention are for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, provided that the combination can be realized by a person skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, such a combination should be deemed not to exist and not within the protection scope of the present invention.
In the following embodiments, the computer device 2 will be exemplarily described as an execution subject.
Example one
Referring to fig. 1, a flowchart of steps of a sequence labeling method according to an embodiment of the present invention is shown. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The following description is made by way of example with the computer device 2 as the execution subject. The details are as follows.
Step S100, receiving a target text sequence, and converting the target text sequence into a corresponding sentence vector, a word vector of each word, and a position vector of each word.
Specifically, the step S100 may further include:
step S100a, inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, where the plurality of word vectors includes at least one punctuation vector.
Illustratively, when the received target text sequence is [Curie was born in Poland, and settled in the United States], each word and special symbol needs to be converted into a word-embedding vector, because the neural network can only perform numerical calculations.
Step S100b, inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, where the n word vector sets correspond to n segmentation codes.
Illustratively, the target text sequence [Curie was born in Poland, and settled in the United States] is divided into sentence A [Curie was born in Poland] and sentence B [settled in the United States]; segmentation code A is added to the first half sentence and segmentation code B to the second half sentence.
Step S100c, performing encoding operation on each division code through position encoding, and determining position information encoding of each division code to obtain a position vector of each word in the target text sequence.
Illustratively, the position information encoding may be used to determine the position of each word in the target text sequence.
Step S100d, generating a sentence vector of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
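As an aid to understanding steps S100a to S100d, the following is a minimal Python sketch of how the word vectors, segmentation codes, and position vectors can be combined into a single input representation. The vocabulary size, embedding dimension, and token ids are hypothetical, and the real embedding layer, segmentation layer, and position encoding are part of the trained model described in this embodiment.

import torch
import torch.nn as nn

vocab_size, max_len, num_segments, dim = 1000, 128, 2, 32

word_emb = nn.Embedding(vocab_size, dim)   # word (token) vectors
seg_emb = nn.Embedding(num_segments, dim)  # segmentation codes A/B
pos_emb = nn.Embedding(max_len, dim)       # position vectors

# Hypothetical token ids for a target text sequence split at a punctuation
# vector into two halves (segmentation code 0 for the first, 1 for the second).
token_ids = torch.tensor([[5, 23, 7, 99, 42, 8, 16]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

# The input representation is the element-wise sum of the three embeddings,
# matching how BERT combines token, segment, and position vectors.
inputs = word_emb(token_ids) + seg_emb(segment_ids) + pos_emb(position_ids)
print(inputs.shape)  # torch.Size([1, 7, 32])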
Step S102, inputting the sentence vector of the target text sequence, the word vector of each word and the position vector of each word into a trained BERT model, and outputting a first labeling sequence corresponding to the target text sequence through the BERT model, wherein the first labeling sequence comprises a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vectors represent first probabilities of the corresponding words belonging to each first label in n first labels.
For example, the n first tags may be a plurality of location tags and a plurality of semantic tags, and the n first tags may also be a plurality of location tags and a plurality of part-of-speech tags.
Illustratively, BERT is an existing pre-trained model whose full name is Bidirectional Encoder Representations from Transformers, i.e., the encoder of a bidirectional Transformer, where the Transformer is a method that relies entirely on self-attention to compute representations of its input and output. BERT aims to pre-train deep bidirectional representations by jointly conditioning on context in all layers. Therefore, the pre-trained BERT representation can be fine-tuned with an additional output layer, making it suitable for building state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architectural modifications.
Illustratively, the BERT model may be obtained by capturing word-level representations with a masked language model (MLM) and sentence-level representations with "next sentence prediction". In the masked language model, some tokens in the input are masked at random, and the goal is to predict the original vocabulary id of each masked word based only on its context; unlike left-to-right language-model pre-training, this training objective allows the representation to fuse the left and right context, thereby pre-training a deep bidirectional Transformer. "Next sentence prediction" means that when pre-training the language model, sentence pairs are selected in two ways: either two sentences that genuinely appear consecutively in the selected corpus, or a first sentence followed by a second sentence sampled at random from the corpus. In addition to the masked-language-model task, the model must additionally perform sentence-relation prediction to judge whether the second sentence is the successor of the first.
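The masking procedure described above can be sketched as follows; the 15% masking rate and the 80/10/10 replacement split follow the published BERT recipe rather than anything this embodiment prescribes, and the token ids below are hypothetical.

import random

MASK_ID = 103  # [MASK] id in the standard BERT vocabulary

def mask_tokens(token_ids, mask_prob=0.15, vocab_size=30522, seed=None):
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignore
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok            # target: the original vocabulary id
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID    # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

masked, targets = mask_tokens([2023, 2003, 1037, 7953], seed=0)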
Illustratively, training the pre-trained BERT model may include: obtaining a plurality of training text sequences, using the training text sequences as the training set of the BERT model, inputting the training set into the pre-trained BERT model, and training the pre-trained BERT model with the training set to obtain the trained BERT model.
Specifically, the step S102 may further include:
step S102a, performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence.
Step S102b, generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
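A sketch of steps S102a and S102b using the open-source Hugging Face transformers library as a stand-in for the trained BERT model; this embodiment does not prescribe that library, and the model name and tag count below are assumptions.

import torch
from transformers import BertTokenizer, BertForTokenClassification

n_labels = 9  # hypothetical: BIOES tags for person/place names plus O
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=n_labels)

enc = tokenizer("居里生于波兰", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits             # shape (1, seq_len, n_labels)
first_probs = torch.softmax(logits, dim=-1)  # the first n-dimensional vectors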
Step S104, inputting the first labeling sequence into a full-connection layer, and outputting a second labeling sequence through the full-connection layer, wherein the second labeling sequence comprises a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vectors represent second probabilities of the corresponding words belonging to each second label of the n second labels.
For example, the n second tags may be a plurality of location tags and a plurality of semantic tags, and the n second tags may also be a plurality of location tags and a plurality of part-of-speech tags.
Specifically, the step S104 may further include:
step S104a, inputting the first labeling sequence into a neural network structure of a full connection layer, performing additional feature extraction to obtain a second probability of each label of each word in the target text sequence, wherein an operation formula for the additional feature extraction of the ith word in the target text sequence is Bi=wXi+ b, wherein, XiIs the first probability of each first label of the ith word in the first annotation sequence, and w and b are BERT model learning parameters;
for example, the fully-connected layer neural network structure of this embodiment may be a multi-layer transform structure, where the multi-layer transform structure further includes an attention mechanism, and the first annotation sequence is processed by the attention mechanism and then input to the feed-forward fully-connected neural network structure for additional feature extraction, so as to obtain a second probability of each second label of each word in the target text sequence; and obtaining a second probability of each second label of each word in the target text sequence through the operation of wx + b, wherein x is the sequence, and w and b are model learning parameters.
Step S104b, generating a second labeling sequence according to the second probability of each second label of each word in the target text sequence.
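A minimal sketch of the B_i = wX_i + b operation in steps S104a and S104b; the dimensions are hypothetical, and nn.Linear holds the learnable parameters w and b.

import torch
import torch.nn as nn

n = 9                                 # number of tags (hypothetical)
fc = nn.Linear(n, n)                  # weight w and bias b are learnable

first_sequence = torch.rand(1, 7, n)  # stand-in first labeling sequence
second_logits = fc(first_sequence)    # B = wX + b for every word
second_probs = torch.softmax(second_logits, dim=-1)  # second probabilities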
Step S106, using the second labeling sequence as the input sequence of a conditional random field CRF model, and outputting a tag sequence Y = (y1, y2, ..., ym) through the CRF model.
Specifically, the step S106 may further include:
step S106a, inputting the second annotation sequence into a CRF model;
step S106b, performing Viterbi solution on the second labeling sequence through a Viterbi algorithm to obtain an optimal solution path in the second labeling sequence, wherein the optimal solution path is a highest probability sequence of the label sequence being a whole target text sequence;
exemplarily, in this step, the output object corresponding to the target text sequence is determined according to the probability value of the second probability of each second tag of each word in the target text sequence; here, the method is implemented by a viterbi algorithm, which does not output the highest label probability among the second probabilities of the respective second labels of each word in the target text sequence, but outputs the highest probability labeled sequence of the entire target text sequence.
Illustratively, the Viterbi algorithm relies on the following property: if the path with the highest probability over the target text sequence passes through a certain point of the trellis, then the sub-path from the start to that point must also be the highest-probability path from the start to that point. When there are k states at time i, there are k shortest paths from the start to those k states, and the final shortest path must pass through one of them.
Step S106c, generating the tag sequence according to the optimal path.
For example, when the Viterbi algorithm computes the highest-probability labeling sequence of the whole target text sequence, to obtain the shortest path to the (i+1)-th state it only needs to consider the shortest paths from the start to the current k state values and the transitions from those state values to the (i+1)-th state.
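The dynamic-programming recurrence described above can be sketched as a self-contained Viterbi decoder; the emission scores stand in for the (log) second probabilities, and the transition matrix, which a trained CRF layer would learn, is hypothetical here.

import numpy as np

def viterbi(emissions, transitions):
    """emissions: (m, n) per-word tag scores; transitions: (n, n)."""
    m, n = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag
    back = np.zeros((m, n), dtype=int)   # backpointers along the best paths
    for i in range(1, m):
        # best previous tag for each current tag (dynamic-programming step)
        cand = score[:, None] + transitions + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow the backpointers to recover the highest-probability tag sequence
    path = [int(score.argmax())]
    for i in range(m - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

tags = viterbi(np.log(np.random.rand(5, 4)), np.zeros((4, 4)))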
And S108, generating a named entity sequence according to the label sequence, and outputting the named entity sequence.
Illustratively, a named entity sequence can be generated according to the tag sequence; the named entity sequence is the prediction of the labeling system for the target text sequence. The named entities include place names, person names, and the like. Sequence labeling adopts the BIOES scheme, where B marks the beginning of an entity, I the middle of an entity, O a non-entity, E the end of an entity, and S a single-character entity. Each named-entity label also carries an entity category, for example B-place-name marks the beginning of a place-name entity, and so on. Here, place names and person names are taken as examples. Consider the sentence "Curie was born in Poland" (居里生于波兰), which is split into a character sequence: 居 is labeled B-person-name, 里 is labeled E-person-name, 生 and 于 are labeled O, 波 is labeled B-place-name, and 兰 is labeled E-place-name.
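A sketch of step S108, converting a BIOES tag sequence back into named entities; the characters follow the "Curie was born in Poland" illustration above, and the tag names (PER, LOC) are hypothetical shorthand for the person-name and place-name categories.

def bioes_to_entities(chars, tags):
    entities, start = [], None
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        kind = tag.split("-", 1)[-1]
        if tag.startswith("B-"):
            start = i                    # entity beginning
        elif tag.startswith("E-") and start is not None:
            entities.append(("".join(chars[start:i + 1]), kind))  # entity end
            start = None
        elif tag.startswith("S-"):
            entities.append((ch, kind))  # single-character entity
        elif tag == "O":
            start = None                 # non-entity
    return entities

chars = ["居", "里", "生", "于", "波", "兰"]
tags = ["B-PER", "E-PER", "O", "O", "B-LOC", "E-LOC"]
print(bioes_to_entities(chars, tags))  # [('居里', 'PER'), ('波兰', 'LOC')]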
Example two
FIG. 2 is a schematic diagram of program modules of a second embodiment of the sequence annotation system. The sequence annotation system 20 can include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the present invention and implement the above-described sequence annotation methods. The program module referred to in the embodiments of the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable than the program itself for describing the execution process of the sequence labeling system 20 in the storage medium. The following description will specifically describe the functions of the program modules of the present embodiment:
a text receiving module 200, configured to receive a target text sequence and convert the target text sequence into corresponding sentence vectors, word vectors of each word, and position vectors of each word.
Illustratively, the receive text module 200 is further configured to: inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, wherein the word vectors comprise at least one punctuation vector; inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, wherein the n word vector sets correspond to n segmentation codes; performing coding operation on each division code through position coding, and determining the position information coding of each division code to obtain the position vector of each word in the target text sequence; and generating sentence vectors of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
A first labeling module 202, configured to input a sentence vector of the target text sequence, a word vector of each word, and a position vector of each word into a trained BERT model, and output a first labeling sequence corresponding to the target text sequence through the BERT model, where the first labeling sequence includes a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vector represents a first probability that the corresponding word belongs to each first tag in n first tags.
Illustratively, the first annotation module 202 is further configured to: performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence; and generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
A second labeling module 204, configured to input the first labeling sequence into a full-connected layer, and output a second labeling sequence through the full-connected layer, where the second labeling sequence includes a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vector represents a second probability that the corresponding word belongs to each second tag of the n second tags.
Illustratively, the second annotation module 204 is further configured to: input the first labeling sequence into the neural network structure of a fully connected layer and perform additional feature extraction to obtain a second probability of each second label of each word in the target text sequence, where the operation formula for the additional feature extraction of the i-th word in the target text sequence is B_i = wX_i + b, X_i is the vector of first probabilities of the first labels of the i-th word in the first labeling sequence, and w and b are learnable model parameters; and generate a second labeling sequence according to the second probability of each second label of each word in the target text sequence.
An output tag module 206, configured to use the second labeling sequence as the input sequence of a conditional random field CRF model, so as to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model.
Illustratively, the output tag module 206 is further configured to: input the second labeling sequence into the CRF model; perform Viterbi decoding on the second labeling sequence through the Viterbi algorithm to obtain the optimal path in the second labeling sequence, where the optimal path corresponds to the tag sequence with the highest probability over the whole target text sequence; and generate the tag sequence according to the optimal path.
And the output entity module 208 is configured to generate a named entity sequence according to the tag sequence, and output the named entity sequence.
EXAMPLE III
Fig. 3 is a schematic diagram of a hardware architecture of a computer device according to a third embodiment of the present invention. In this embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers), or the like. As shown, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and a sequence annotation system 20, communicatively coupled to each other via a system bus.
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the storage 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like provided on the computer device 2. Of course, the memory 21 may also comprise both internal and external memory units of the computer device 2. In this embodiment, the memory 21 is generally used for storing an operating system installed on the computer device 2 and various application software, such as the program code of the sequence annotation system 20 of the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to execute the program code stored in the memory 21 or process data, for example, execute the sequence annotation system 20, so as to implement the sequence annotation method of the first embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, and is generally used for establishing a communication connection between the computer device 2 and other electronic apparatuses. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be a wireless or wired network such as an Intranet, the Internet, a Global System for Mobile Communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like.
It is noted that fig. 3 only shows the computer device 2 with components 20-23, but it is to be understood that not all shown components are required to be implemented, and that more or less components may be implemented instead.
In this embodiment, the sequence annotation system 20 stored in the memory 21 can be further divided into one or more program modules, and the one or more program modules are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the present invention.
For example, fig. 2 is a schematic diagram of program modules for implementing the sequence annotation system 20 according to a second embodiment of the present invention, in which the sequence annotation system 20 can be divided into a text receiving module 200, a first annotation module 202, a second annotation module 204, an output tag module 206, and an output entity module 208. The program modules referred to herein are a series of computer program instruction segments that can perform specific functions, and are more suitable than programs for describing the execution process of the sequence labeling system 20 in the computer device 2. The specific functions of the program modules 200-208 have been described in detail in the second embodiment, and are not described herein again.
Example four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc., on which a computer program is stored, which when executed by a processor implements corresponding functions. The computer-readable storage medium of this embodiment is used for the sequence annotation system 20, and when executed by a processor, the method for sequence annotation of the first embodiment is implemented.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for labeling sequences, the method comprising:
receiving a target text sequence, and converting the target text sequence into corresponding sentence vectors, word vectors of all words and position vectors of all words;
inputting sentence vectors of the target text sequence, word vectors of all words and position vectors of all words into a trained BERT model, and outputting a first labeling sequence corresponding to the target text sequence through the BERT model, wherein the first labeling sequence comprises a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vectors represent first probabilities that corresponding words belong to each first label in n first labels;
inputting the first labeling sequence into a full-connection layer, and outputting a second labeling sequence through the full-connection layer, wherein the second labeling sequence comprises a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vectors represent second probabilities of the corresponding words belonging to each of n second tags;
using the second labeling sequence as the input sequence of a conditional random field CRF model to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model; and
and generating a named entity sequence according to the label sequence, and outputting the named entity sequence.
2. The sequence annotation method of claim 1, wherein the step of converting the target text sequence into a corresponding sentence vector, a word vector for each word, and a position vector for each word comprises:
inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, wherein the word vectors comprise at least one punctuation vector;
inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, wherein the n word vector sets correspond to n segmentation codes;
performing coding operation on each division code through position coding, and determining the position information coding of each division code to obtain the position vector of each word in the target text sequence; and
and generating a sentence vector of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
3. The sequence annotation method of claim 2, wherein the step of outputting, by the BERT model, a first annotated sequence corresponding to the target text sequence comprises:
performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence;
and generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
4. The sequence annotation method of claim 3, wherein said step of inputting said first annotation sequence into a fully-connected layer and outputting a second annotation sequence through said fully-connected layer comprises:
inputting the first labeling sequence into the neural network structure of a fully connected layer and performing additional feature extraction to obtain a second probability of each second label of each word in the target text sequence, wherein the operation formula for the additional feature extraction of the i-th word in the target text sequence is B_i = wX_i + b, where X_i is the vector of first probabilities of the first labels of the i-th word in the first labeling sequence, and w and b are learnable model parameters;
and generating a second labeling sequence according to the second probability of each second label of each word in the target text sequence.
5. The sequence labeling method of claim 1, wherein the step of using the second labeling sequence as the input sequence of a conditional random field (CRF) model to output a tag sequence Y = (y1, y2, ..., ym) through the CRF model comprises:
inputting the second annotation sequence into a CRF model;
performing Viterbi decoding on the second labeling sequence through the Viterbi algorithm to obtain the optimal path in the second labeling sequence, wherein the optimal path corresponds to the tag sequence with the highest probability over the whole target text sequence; and
generating the tag sequence according to the optimal path.
6. A sequence annotation system, comprising:
the text receiving module is used for receiving a target text sequence and converting the target text sequence into corresponding sentence vectors, word vectors of all the words and position vectors of all the words;
a first labeling module, configured to input a sentence vector of the target text sequence, a word vector of each word, and a position vector of each word into a trained BERT model, and output a first labeling sequence corresponding to the target text sequence through the BERT model, where the first labeling sequence includes a plurality of first n-dimensional vectors, each first n-dimensional vector corresponds to one word in the target text sequence, and the first n-dimensional vector represents a first probability that a corresponding word belongs to each first tag of the n first tags;
a second labeling module, configured to input the first labeling sequence into a full-connected layer, and output a second labeling sequence through the full-connected layer, where the second labeling sequence includes a plurality of second n-dimensional vectors, each second n-dimensional vector corresponds to one word in the target text sequence, and the second n-dimensional vector represents a second probability that a corresponding word belongs to each of n second tags;
an output tag module, configured to use the second tagging sequence as an input sequence of a conditional random field CRF model, so as to output a tag sequence Y ═ (Y) through the CRF model1,y2,...,ym) (ii) a And
and the output entity module is used for generating a named entity sequence according to the label sequence and outputting the named entity sequence.
7. The sequence annotation system of claim 6, wherein the receive text module is further to:
inputting the target text sequence into an embedding layer, and outputting a plurality of word vectors corresponding to the target text sequence through the embedding layer, wherein the word vectors comprise at least one punctuation vector;
inputting the word vectors into a segmentation layer, segmenting the word vectors according to the at least one punctuation vector to obtain n word vector sets, wherein the n word vector sets correspond to n segmentation codes; and
performing coding operation on each division code through position coding, and determining the position information coding of each division code to obtain the position vector of each word in the target text sequence;
and generating a sentence vector of the target text sequence according to the word vector of each word in the target text sequence and the position vector of each word.
8. The sequence annotation system of claim 7, wherein the first annotation module is further to:
performing feature extraction on the sentence vector of the target text sequence, the word vector of each word and the position vector of each word through the BERT model to obtain a first probability of each first label of each word in the target text sequence;
and generating a first labeling sequence according to the first probability of each first label of each word in the target text sequence.
9. A computer arrangement comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the sequence annotation method according to any one of claims 1 to 5.
10. A computer-readable storage medium, in which a computer program is stored which is executable by at least one processor for causing the at least one processor to carry out the steps of the sequence annotation method according to any one of claims 1 to 5.
CN201910983279.2A 2019-10-16 2019-10-16 Sequence labeling method, system and computer equipment Active CN111222317B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910983279.2A CN111222317B (en) 2019-10-16 2019-10-16 Sequence labeling method, system and computer equipment
PCT/CN2019/117403 WO2021072852A1 (en) 2019-10-16 2019-11-12 Sequence labeling method and system, and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910983279.2A CN111222317B (en) 2019-10-16 2019-10-16 Sequence labeling method, system and computer equipment

Publications (2)

Publication Number Publication Date
CN111222317A true CN111222317A (en) 2020-06-02
CN111222317B CN111222317B (en) 2022-04-29

Family

ID=70827510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910983279.2A Active CN111222317B (en) 2019-10-16 2019-10-16 Sequence labeling method, system and computer equipment

Country Status (2)

Country Link
CN (1) CN111222317B (en)
WO (1) WO2021072852A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859858A (en) * 2020-07-22 2020-10-30 智者四海(北京)技术有限公司 Method and device for extracting relationship from text
CN112148856A (en) * 2020-09-22 2020-12-29 北京百度网讯科技有限公司 Method and device for establishing punctuation prediction model
CN112541341A (en) * 2020-12-18 2021-03-23 广东电网有限责任公司 Text event element extraction method
CN113064992A (en) * 2021-03-22 2021-07-02 平安银行股份有限公司 Complaint work order structured processing method, device, equipment and storage medium
CN113157883A (en) * 2021-04-07 2021-07-23 浙江工贸职业技术学院 Chinese opinion target boundary prediction method based on dual-model structure
CN113553824A (en) * 2021-07-07 2021-10-26 临沂中科好孕智能技术有限公司 Sentence vector model training method
CN114386419A (en) * 2022-01-11 2022-04-22 平安科技(深圳)有限公司 Entity recognition model training method, device, equipment and storage medium

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673247A (en) * 2021-05-13 2021-11-19 江苏曼荼罗软件股份有限公司 Entity identification method, device, medium and electronic equipment based on deep learning
CN113392659A (en) * 2021-06-25 2021-09-14 携程旅游信息技术(上海)有限公司 Machine translation method, device, electronic equipment and storage medium
CN113537346B (en) * 2021-07-15 2023-08-15 思必驰科技股份有限公司 Medical field data labeling model training method and medical field data labeling method
CN113569574B (en) * 2021-07-16 2024-02-09 阳光电源股份有限公司 Method, terminal and storage medium for identifying work order type
CN113516196B (en) * 2021-07-20 2024-04-12 云知声智能科技股份有限公司 Named entity recognition data enhancement method, named entity recognition data enhancement device, electronic equipment and named entity recognition data enhancement medium
CN113626608B (en) * 2021-10-12 2022-02-15 深圳前海环融联易信息科技服务有限公司 Semantic-enhancement relationship extraction method and device, computer equipment and storage medium
CN114048288A (en) * 2021-11-10 2022-02-15 北京明略软件***有限公司 Fine-grained emotion analysis method and system, computer equipment and storage medium
CN114580424B (en) * 2022-04-24 2022-08-05 之江实验室 Labeling method and device for named entity identification of legal document
CN115879473B (en) * 2022-12-26 2023-12-01 淮阴工学院 Chinese medical named entity recognition method based on improved graph attention network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089152A1 (en) * 2016-09-02 2018-03-29 Digital Genius Limited Message text labelling
CN109635279A (en) * 2018-11-22 2019-04-16 桂林电子科技大学 A kind of Chinese name entity recognition method neural network based
CN109815952A (en) * 2019-01-24 2019-05-28 珠海市筑巢科技有限公司 Brand name recognition methods, computer installation and computer readable storage medium
CN110223742A (en) * 2019-06-14 2019-09-10 中南大学 The clinical manifestation information extraction method and equipment of Chinese electronic health record data

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980624B (en) * 2016-01-18 2021-03-26 阿里巴巴集团控股有限公司 Text data processing method and device
CN109994201B (en) * 2019-03-18 2021-06-11 浙江大学 Diabetes and hypertension probability calculation system based on deep learning
CN110083831B (en) * 2019-04-16 2023-04-18 武汉大学 Chinese named entity identification method based on BERT-BiGRU-CRF
CN110147452B (en) * 2019-05-17 2022-03-01 北京理工大学 Coarse grain emotion analysis method based on hierarchy BERT neural network
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089152A1 (en) * 2016-09-02 2018-03-29 Digital Genius Limited Message text labelling
CN109635279A (en) * 2018-11-22 2019-04-16 桂林电子科技大学 A kind of Chinese name entity recognition method neural network based
CN109815952A (en) * 2019-01-24 2019-05-28 珠海市筑巢科技有限公司 Brand name recognition methods, computer installation and computer readable storage medium
CN110223742A (en) * 2019-06-14 2019-09-10 中南大学 The clinical manifestation information extraction method and equipment of Chinese electronic health record data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SIYUAN XUE ET AL.: "Dialogue Act Recognition for Open-Domain Based on Word-level Sequence Annotation with CRF", 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS) *
LI Minghao et al.: "Symptom term recognition in traditional Chinese medicine medical records based on LSTM-CRF" (基于LSTM-CRF的中医医案症状术语识别), Journal of Computer Applications (计算机应用) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859858A (en) * 2020-07-22 2020-10-30 智者四海(北京)技术有限公司 Method and device for extracting relationship from text
CN111859858B (en) * 2020-07-22 2024-03-01 智者四海(北京)技术有限公司 Method and device for extracting relation from text
CN112148856A (en) * 2020-09-22 2020-12-29 北京百度网讯科技有限公司 Method and device for establishing punctuation prediction model
CN112148856B (en) * 2020-09-22 2024-01-23 北京百度网讯科技有限公司 Method and device for establishing punctuation prediction model
CN112541341A (en) * 2020-12-18 2021-03-23 广东电网有限责任公司 Text event element extraction method
CN113064992A (en) * 2021-03-22 2021-07-02 平安银行股份有限公司 Complaint work order structured processing method, device, equipment and storage medium
CN113157883A (en) * 2021-04-07 2021-07-23 浙江工贸职业技术学院 Chinese opinion target boundary prediction method based on dual-model structure
CN113553824A (en) * 2021-07-07 2021-10-26 临沂中科好孕智能技术有限公司 Sentence vector model training method
CN114386419A (en) * 2022-01-11 2022-04-22 平安科技(深圳)有限公司 Entity recognition model training method, device, equipment and storage medium
CN114386419B (en) * 2022-01-11 2023-07-25 平安科技(深圳)有限公司 Entity recognition model training method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111222317B (en) 2022-04-29
WO2021072852A1 (en) 2021-04-22

Similar Documents

Publication Publication Date Title
CN111222317B (en) Sequence labeling method, system and computer equipment
CN111460807B (en) Sequence labeling method, device, computer equipment and storage medium
CN110287480B (en) Named entity identification method, device, storage medium and terminal equipment
CN110750965B (en) English text sequence labeling method, english text sequence labeling system and computer equipment
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
WO2019075967A1 (en) Enterprise name recognition method, electronic device, and computer-readable storage medium
CN112766319B (en) Dialogue intention recognition model training method, device, computer equipment and medium
CN112188311B (en) Method and apparatus for determining video material of news
CN110348012B (en) Method, device, storage medium and electronic device for determining target character
CN112446211A (en) Text processing device, method, apparatus, and computer-readable storage medium
CN112860919A (en) Data labeling method, device and equipment based on generative model and storage medium
CN114757171A (en) Training method of pre-training language model, and training method and device of language model
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN113918031A (en) System and method for Chinese punctuation recovery using sub-character information
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN112232052A (en) Text splicing method and device, computer equipment and storage medium
CN111241843B (en) Semantic relation inference system and method based on composite neural network
CN112364659B (en) Automatic identification method and device for unsupervised semantic representation
CN111368531B (en) Translation text processing method and device, computer equipment and storage medium
CN116306663B (en) Semantic role labeling method, device, equipment and medium
CN111967253A (en) Entity disambiguation method and device, computer equipment and storage medium
CN116702765A (en) Event extraction method and device and electronic equipment
CN115860002A (en) Combat task generation method and system based on event extraction
CN112131879A (en) Relationship extraction system, method and device
CN115600596A (en) Named entity recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030926

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant