CN111414757B - Text recognition method and device - Google Patents


Info

Publication number
CN111414757B
CN111414757B (application CN201910008861.7A)
Authority
CN
China
Prior art keywords
word vector
text
recognized
word
characters
Prior art date
Legal status
Active
Application number
CN201910008861.7A
Other languages
Chinese (zh)
Other versions
CN111414757A (en)
Inventor
龙定坤
徐光伟
李辰
包祖贻
刘恒友
李林琳
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910008861.7A priority Critical patent/CN111414757B/en
Publication of CN111414757A publication Critical patent/CN111414757A/en
Application granted granted Critical
Publication of CN111414757B publication Critical patent/CN111414757B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a text recognition method and apparatus in the technical field of data processing, with the main aim of improving the accuracy of named entity recognition. The main technical scheme comprises the following steps: performing language model training by using a corpus to obtain a language model; determining a word vector list corresponding to the corpus; determining a first word vector of the characters of a text to be recognized based on the language model, and determining a second word vector of the characters based on the word vector list; and performing named entity recognition on the text to be recognized through a named entity recognition model based at least on the first word vector and the second word vector of the characters.

Description

Text recognition method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a text recognition method and apparatus.
Background
Named entity recognition refers to identifying entities with specific meaning in text, mainly including personal names, place names, organization names, and proper nouns. It is particularly important in applications such as e-commerce, information retrieval, and machine translation, because the user's intention can be obtained through named entity recognition, allowing business processing such as search to be completed quickly and accurately.
Currently, named entity recognition is typically performed with rule-based or vocabulary-based methods. When recognition is rule-based, the fixed and variable parts of the text to be recognized are identified according to the rules, but because the content of the text is uncertain, the accuracy of recognition is not high. When recognition is vocabulary-based, a large number of known named entities are stored in a vocabulary, and the text to be recognized is matched against them; a named entity is recognized only when a match is found. If a named entity in the text is not recorded in the vocabulary, it cannot be recognized, so the accuracy of recognition is again not high.
Disclosure of Invention
In view of this, the present invention provides a text recognition method and apparatus, and mainly aims to improve accuracy of recognizing named entities.
In a first aspect, the present invention provides a text recognition method, the method comprising:
performing language model training by using a corpus to obtain a language model;
determining a word vector list corresponding to the corpus;
determining a first word vector of characters of a text to be recognized based on the language model, and determining a second word vector of characters of the text to be recognized based on the word vector list;
and carrying out named entity recognition on the text to be recognized through a named entity recognition model at least based on the first word vector and the second word vector of the characters of the text to be recognized.
In a second aspect, the present invention provides a text recognition apparatus comprising:
the training unit is used for performing language model training by using the corpus to obtain a language model;
the first determining unit is used for determining a word vector list corresponding to the corpus;
a second determining unit configured to determine a first word vector of characters of a text to be recognized based on the language model, and determine a second word vector of characters of the text to be recognized based on the word vector list;
and the first recognition unit is used for carrying out named entity recognition on the text to be recognized through a preset named entity recognition model at least based on the first word vector and the second word vector of the characters of the text to be recognized.
In a third aspect, the present invention provides an electronic device, including: a storage medium and a processor;
the processor is suitable for realizing each instruction;
the storage medium is suitable for storing a plurality of instructions;
the instructions are adapted to be loaded by the processor and to perform a text recognition method as claimed in any one of the preceding claims.
By means of the above technical scheme, the text recognition method and device provided by the invention first perform language model training by using the corpus to obtain a language model, and determine a first word vector of the characters of the text to be recognized based on the language model. A word vector list corresponding to the corpus is then determined, and a second word vector of the characters of the text to be recognized is determined based on the word vector list. Finally, named entity recognition is performed on the text to be recognized through a named entity recognition model based on the first word vector and the second word vector of the characters. Because the first word vector of a character reflects its context in the text to be recognized, while the second word vector is a static representation reflecting the character's literal features, performing named entity recognition based on both vectors improves the accuracy of recognizing named entities.
The foregoing is only an overview of the technical scheme of the present invention. To allow the technical means of the invention to be understood more clearly and implemented according to the specification, and to make the above and other objects, features, and advantages of the invention more apparent, specific embodiments are set forth below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a flow chart of a text recognition method provided by one embodiment of the present invention;
FIG. 2 is a flow chart of a text recognition method according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a text recognition device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a text recognition device according to another embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The embodiment of the invention provides a text recognition method, which mainly comprises the following steps as shown in fig. 1:
101. Perform language model training by using the corpus to obtain a language model.
The text recognition method can be used in any application scenario, so the corpus involved in this step is the corpus of the application scenario in which the method is used. For example: when the application scenario of the text recognition method is e-commerce, the corpus is a large amount of text involved in e-commerce operations, which may include, but is not limited to, named entities such as brand names, place-of-origin names, manufacturer names, and product specifications. The text may be, but is not limited to, Chinese text. Because the corpus is tied to the application scenario of the method, the method can recognize text to be recognized in any application scenario and has high business applicability.
In this step, language model training can be performed by using the n-gram features of the corpus, which are extracted from the corpus. The process of extracting n-gram features from the corpus may include, but is not limited to: splitting the corpus into a plurality of short sentences based on the punctuation marks in the corpus; then, based on a preset value of n (n is an integer greater than 0), extracting from the corpus every byte sequence of length n, where each such sequence is one n-gram feature.
Illustrating: taking a short sentence 'beautiful air conditioner' in the corpus as an example for explanation, and the value of n is 1, the extracted 1-gram features are divided into: beautiful, air-conditioner.
Illustrating: taking a short sentence 'beautiful air conditioner' in the corpus as an example for explanation, and n is 2, the extracted 2-gram features are divided into: beautiful, hollow, air-conditioning and regulating.
Illustrating: taking a short sentence 'beautiful air conditioner' in the corpus as an example for explanation, and the value of n is 3, the extracted 3-gram features are divided into: air conditioning, air conditioning.
As can be seen from the above examples, any character in an n-gram feature is related only to the n-1 characters preceding it, and not to any other characters. In addition, the n-gram features of the corpus may all use the same value of n, or may combine features for at least two different values of n.
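The extraction procedure described above can be sketched in Python. This is a minimal illustration: the function name and the punctuation set used for sentence splitting are assumptions for the example, not specified by the text.

```python
import re

def extract_ngrams(corpus, n):
    """Split the corpus into short sentences at punctuation marks, then
    slide a window of length n over each sentence to collect the n-gram
    features (each contiguous run of n characters is one feature)."""
    # Assumed delimiter set: common Western and Chinese punctuation plus whitespace.
    sentences = [s for s in re.split(r"[,.!?;，。！？；\s]+", corpus) if s]
    ngrams = []
    for sent in sentences:
        ngrams.extend(sent[i:i + n] for i in range(len(sent) - n + 1))
    return ngrams
```

Sentences shorter than n simply contribute no features, which matches the windowing rule above.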
In this step, after the n-gram features of the corpus are extracted, a language model is trained from them by using a preset model algorithm; the resulting language model embodies the context information among the n-gram features. The preset model algorithm may be determined based on business requirements and may include, but is not limited to, a bidirectional multi-layer LSTM (Long Short-Term Memory) network.
102. Determine the word vector list corresponding to the corpus.
Specifically, the corpus is input into a preset word vector generation model, which generates the word vector list corresponding to the corpus. The word vector list includes the correspondence between at least one word segment and at least one word vector, where different word segments correspond to different word vectors and the word segments come from the corpus. Each word segment in the word vector list may consist of at least one character.
It should be noted that the specific type of the word vector generation model may be determined according to the service requirement. Alternatively, the word vector generation model may include, but is not limited to, word2vec.
103. A first word vector of characters of the text to be recognized is determined based on the language model, and a second word vector of characters of the text to be recognized is determined based on the word vector list.
The text to be recognized in this step may be specified by business personnel, or may be obtained from a set website (e.g., an e-commerce website) or a set data storage area (e.g., an e-commerce database) in real time or at a set period. After the text to be recognized is acquired, it needs to be cut into characters. The cut can be made at a set number of characters; optionally, the number of characters may be 1. For example: if the text to be recognized is a short sentence such as "淘宝网真好" ("Taobao is really good"), cutting it character by character yields one character per position: 淘, 宝, 网, 真, 好. When the text to be recognized is cut into single characters, the determined first word vector of the characters is a first word vector for each character.
Specifically, because the language model is obtained from the n-gram features of the preset corpus, the first word vector of each character obtained based on the language model can accurately represent the context of that character in the text to be recognized.
Specifically, the word vector of each word segment in the word vector list is a fixed vector relative to all the segments in the corpus, so the second word vector of each character obtained based on the word vector list is only a static representation of that character and reflects its literal features.
104. Perform named entity recognition on the text to be recognized through a preset named entity recognition model, based at least on the first word vector and the second word vector of the characters of the text to be recognized.
The first word vector and the second word vector of the characters of the text to be recognized may be the first and second word vectors of every n adjacent characters of the text, where n is a positive integer greater than or equal to 1. For example: named entity recognition is performed on the text to be recognized through the preset named entity recognition model based on the first word vector and the second word vector of each individual character.
Because the first word vector of a character accurately represents its context in the text to be recognized, and the second word vector is a fixed representation of the character relative to all the word segments in the corpus, a named entity present in the text to be recognized can be accurately recognized through the preset named entity recognition model based on the first and second word vectors of the characters. The identified named entity may be one already present in the named entity library, or a newly mined named entity that does not exist in the library.
It should be noted that when the named entity recognition model recognizes no named entity from the first and second word vectors of the characters, this indicates that no named entity is present in the text to be recognized. In that case, a prompt that no named entity was recognized needs to be issued, so that the recognition result can be known in time according to the prompt.
According to the text recognition method provided by the invention, language model training is performed by using the corpus to obtain a language model, and a first word vector of the characters of the text to be recognized is determined based on the language model. A word vector list corresponding to the corpus is then determined, and a second word vector of the characters is determined based on the word vector list. Finally, named entity recognition is performed on the text to be recognized through a named entity recognition model based on the first and second word vectors of the characters. Because the first word vector reflects the context of a character in the text to be recognized, and the second word vector is a static representation reflecting the character's literal features, the accuracy of recognizing named entities is improved.
In one embodiment of the present invention, step 101 of the flowchart shown in fig. 1, performing language model training by using the corpus to obtain a language model, may include:
determining a word vector for each of the n-gram features;
inputting the word vector of each n-gram feature into a bidirectional LSTM for training to obtain the language model.
The method for extracting the n-gram features from the preset corpus in this embodiment is the same as that described in step 101 above and is not repeated here.
In this embodiment, the process of determining the word vector of each n-gram feature may include, but is not limited to: determining the word vector of each character in the n-gram feature by using a preset vector generation model; when the n-gram feature includes only one character, determining that character's word vector as the word vector of the n-gram feature; when the n-gram feature includes at least two characters, calculating the average or weighted average of the word vectors of the characters and determining it as the word vector of the n-gram feature. It should be noted that the vector generation model may be determined according to business requirements; optionally, it may include, but is not limited to, word2vec.
In this embodiment, the word vector of each n-gram feature is input into the bidirectional LSTM for training to obtain a language model that can accurately reflect the context of each character.
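The averaging rule above for multi-character n-gram features can be sketched as follows. This is a minimal illustration in which the dictionary of per-character vectors stands in for the output of a vector generation model such as word2vec; the function name is an assumption for the example.

```python
def ngram_feature_vector(ngram, char_vectors):
    """Return the word vector of an n-gram feature: the single character's
    vector when the feature has one character, otherwise the element-wise
    average of the characters' vectors (a weighted average is the variant
    mentioned in the text)."""
    vecs = [char_vectors[ch] for ch in ngram]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
```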
In one embodiment of the present invention, step 102 in the flowchart shown in fig. 1, determining the word vector list corresponding to the corpus may include:
word segmentation processing is carried out on the corpus;
training the corpus after word segmentation processing by using a preset word vector generation algorithm to obtain a word vector list; the word vector list includes the correspondence between at least one word segment and at least one word vector; the word segments are obtained by performing word segmentation processing on the corpus.
In this embodiment, the process of word segmentation on the corpus may include, but is not limited to: dividing the corpus into a plurality of short sentences according to punctuation marks, and then dividing those short sentences into a plurality of word segments according to semantics, each word segment including at least one character. After the corpus is segmented into a plurality of word segments, they are trained with a preset word vector generation algorithm to obtain a word vector list, which includes the correspondence between at least one word segment and at least one word vector, where different word segments correspond to different word vectors. The word vector generation algorithm may be determined based on business requirements; optionally, it may include, but is not limited to, word2vec.
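Assembling the word vector list from the segmented corpus can be sketched as follows. The `train_vector` callback is a hypothetical stand-in for a real word vector generation algorithm such as word2vec; the sketch only shows the data structure, a mapping from each distinct segment to its vector.

```python
from collections import OrderedDict

def build_word_vector_list(segmented_sentences, train_vector):
    """Collect the distinct word segments of the corpus (in first-seen
    order) and map each one to a vector produced by `train_vector`,
    a stand-in for an embedding trainer such as word2vec."""
    word_vector_list = OrderedDict()
    for sentence in segmented_sentences:
        for segment in sentence:
            if segment not in word_vector_list:
                word_vector_list[segment] = train_vector(segment)
    return word_vector_list
```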
In one embodiment of the present invention, determining the first word vector of the characters of the text to be recognized based on the language model, referred to in step 103 of the flowchart shown in fig. 1, may include:
inputting the text to be recognized into the language model;
determining, by using the language model, the candidate word vectors corresponding to each character in the text to be recognized and the probability of each candidate word vector;
for each character, respectively: determining the candidate word vector with the highest probability as the first word vector of that character.
In the present embodiment, when the text to be recognized is input into the language model, the 1-gram features of the text to be recognized are input into the language model.
In this embodiment, the process by which the language model determines the first word vector depends on the training method used to obtain the model. For example: when the language model is obtained through bidirectional LSTM training, it derives the first word vector of each character via the bidirectional LSTM, and the resulting vector reflects the character's context in the text to be recognized. It should be noted that because the candidate word vector with the highest probability is determined as the first word vector of the character, the determined first word vector can accurately represent the character's context in the text to be recognized.
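Selecting the most probable candidate can be sketched as follows, under the assumption (hypothetical interface) that the language model yields a list of (vector, probability) pairs for each character.

```python
def first_word_vector(candidates):
    """Given the candidate (word_vector, probability) pairs produced by
    the language model for one character, return the vector with the
    highest probability as the character's first word vector."""
    best_vector, _ = max(candidates, key=lambda pair: pair[1])
    return best_vector
```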
In one embodiment of the present invention, the word vector list includes the correspondence between at least one word segment and at least one word vector; determining the second word vector of the characters of the text to be recognized based on the word vector list, referred to in step 103 of the flowchart shown in fig. 1, includes:
for each character, respectively: querying the word vector list to determine the target word segment corresponding to the character; and determining the word vector corresponding to the target word segment as the second word vector of the character.
In this embodiment, the word vector list includes the correspondence between at least one word segment and at least one word vector, and different word segments correspond to different word vectors. Each word segment consists of at least one character.
In this embodiment, the procedure for determining the second word vector is the same for every character, so it is described below for a single character: query whether the word vector list contains a word segment identical to the character; if so, determine the word vector corresponding to the queried segment as the second word vector of the character; if not, the word vector list is not comprehensive enough, and the second word vector of the character can be determined in at least the following ways. First, a backup word vector list may be used to determine the character's word vector, where the segments in the backup list do not exist in the previously used list. Second, a reminder that no second word vector was found for the character may be issued, so that business personnel can upgrade the word vector list according to the reminder.
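The lookup-with-fallback procedure can be sketched as follows. The function name and the `on_missing` callback interface are illustrative assumptions; the callback stands in for issuing the reminder to business personnel.

```python
def second_word_vector(char, word_vector_list, backup_list, on_missing=print):
    """Look the character up in the primary word vector list; fall back
    to the backup list whose segments are absent from the primary one;
    otherwise report that the word vector list needs upgrading."""
    if char in word_vector_list:
        return word_vector_list[char]
    if char in backup_list:
        return backup_list[char]
    on_missing(f"no word vector found for {char!r}; please upgrade the list")
    return None
```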
In one embodiment of the present invention, step 104 of the flowchart shown in fig. 1, performing named entity recognition on the text to be recognized through a preset named entity recognition model based at least on the first word vector and the second word vector of its characters, may include:
for each of the characters, performing: splicing the first word vector and the second word vector of the character to obtain a third word vector of the character;
inputting a third word vector of each character into the named entity recognition model;
and carrying out named entity recognition on the text to be recognized through the third word vector of each character by using the named entity recognition model.
In this embodiment, because the third word vector of a character is formed by splicing its first and second word vectors, it can reflect both the context of the character in the text to be recognized and, through the static representation, the literal features of the character. For example: if the first word vector of character 1 is 50-dimensional and its second word vector is 50-dimensional, its third word vector is 100-dimensional.
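The splicing step can be sketched as follows (a minimal illustration; the test uses tiny vectors in place of the 50 + 50 → 100-dimensional example).

```python
def third_word_vectors(first_vectors, second_vectors):
    """Splice each character's first (contextual) and second (static) word
    vectors into its third word vector by concatenation; the two input
    lists are aligned per character position."""
    assert len(first_vectors) == len(second_vectors)
    return [list(f) + list(s) for f, s in zip(first_vectors, second_vectors)]
```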
In this embodiment, the specific type of the named entity recognition model may be determined according to business requirements. Optionally, named entity recognition models may include, but are not limited to, convolutional neural network models, LSTM+CRF sequence labeling models, and recurrent neural network models.
In this embodiment, using the named entity recognition model to recognize named entities in the text to be recognized through the third word vector of each character may include the following steps: inputting the third word vector of each character into the bidirectional LSTM in the named entity recognition model to obtain semantic information features of the text to be recognized; inputting the semantic information features into the conditional random field (CRF) in the named entity recognition model, which labels the text to be recognized to obtain a labeling result; and identifying the named entities in the text to be recognized based on the labeling result.
Specifically, the bidirectional LSTM includes one input layer, two hidden layers, and one softmax layer, where one hidden layer represents a forward LSTM neural network and the other a backward LSTM neural network. When the third word vectors of the characters are input to the input layer, the hidden layers and the softmax layer are trained through a back-propagation algorithm to obtain the semantic information features output by the LSTM. These features are input into the conditional random field (CRF), which labels each character according to its semantics to obtain a labeling result, from which the named entities in the text to be recognized are identified. The identification has two possible outcomes: either a named entity is recognized from the text, which may be one already present in the named entity library or a newly mined entity absent from the library, or no named entity exists in the text to be recognized. In addition, it should be noted that the labeling categories may be determined according to business requirements. Optionally, they may include, but are not limited to: "B", "I", and "E", which indicate the start, intermediate, and end positions of an entity; "S", which indicates an entity consisting of a single character; and "O", which indicates a character that does not belong to any entity.
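Recovering the named entities from a BIOES labeling result, as described above, can be sketched with a minimal decoder. The function is an illustrative assumption: it simply discards malformed label sequences rather than repairing them, and it stands downstream of the CRF, which produces the labels.

```python
def decode_entities(chars, labels):
    """Recover entity strings from per-character BIOES labels: B/I/E mark
    the beginning, inside, and end of a multi-character entity, S marks a
    single-character entity, and O marks a character outside any entity."""
    entities, buffer = [], []
    for ch, label in zip(chars, labels):
        if label == "S":
            entities.append(ch)
            buffer = []
        elif label == "B":
            buffer = [ch]
        elif label == "I" and buffer:
            buffer.append(ch)
        elif label == "E" and buffer:
            buffer.append(ch)
            entities.append("".join(buffer))
            buffer = []
        else:  # "O", or a malformed continuation with no open entity
            buffer = []
    return entities
```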
It should be noted that because named entity recognition is performed based on the third word vector of each character, which reflects both the character's context in the text to be recognized and, through its static component, the character's literal features, the accuracy of named entity recognition on the text to be recognized is high.
Based on the description in the above embodiments, the embodiments of the method described above can be freely combined according to the service requirements to form new embodiments. A text recognition method will be described below by taking the flowchart shown in fig. 2 as an example, the method including:
201. an n-gram feature of the corpus is determined.
202. A word vector is determined for each n-gram feature.
203. And inputting word vectors of each n-gram characteristic into a bidirectional LSTM for training to obtain a language model.
204. Perform word segmentation processing on the corpus.
205. Train the corpus after word segmentation processing by using a preset word vector generation algorithm to obtain a word vector list; the word vector list includes the correspondence between at least one word segment and at least one word vector; the word segments are obtained by performing word segmentation processing on the corpus.
206. Text to be recognized is input to the language model.
207. Determine, by using the language model, the candidate word vectors corresponding to each character in the text to be recognized and the probability of each candidate word vector.
208. For each character, respectively: determine the candidate word vector with the highest probability as the first word vector of that character.
209. For each character, respectively: query the word vector list to determine the target word segment corresponding to the character; and determine the word vector corresponding to the target word segment as the second word vector of the character.
210. For each character in the text to be recognized, respectively: splice the first word vector and the second word vector of the character to obtain the third word vector of the character.
211. The third word vector for each character is input to the named entity recognition model.
212. And inputting the third word vector of each character into a bidirectional LSTM in a preset named entity recognition model for training to obtain semantic information features of the text to be recognized.
213. Inputting semantic information features into a Conditional Random Field (CRF) in a named entity recognition model, and marking a text to be recognized by using the Conditional Random Field (CRF) in the named entity recognition model to obtain a marking result.
214. And identifying the named entity in the text to be identified based on the labeling result.
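Steps 206 through 214 can be tied together in a compact sketch. The three model arguments are hypothetical stand-ins for the trained components: a language model returning a first word vector per character, a word vector list mapping characters to second word vectors, and a BiLSTM+CRF recognizer returning BIOES labels.

```python
def recognize(text, language_model, word_vector_list, ner_model):
    """End-to-end sketch of steps 206-214: cut the text into characters,
    obtain contextual first and static second word vectors, splice them
    into third word vectors, and label them with the NER model."""
    chars = list(text)                                   # cut into characters
    firsts = [language_model(c, chars) for c in chars]   # first word vectors
    seconds = [word_vector_list[c] for c in chars]       # second word vectors
    thirds = [f + s for f, s in zip(firsts, seconds)]    # spliced third vectors
    return ner_model(thirds)                             # BiLSTM+CRF stand-in
```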
Further, according to the above method embodiment, another embodiment of the present invention further provides a text recognition device, as shown in fig. 3, where the device includes:
a training unit 31, configured to perform language model training by using a corpus, so as to obtain a language model;
a first determining unit 32, configured to determine a word vector list corresponding to the corpus;
a second determining unit 33 for determining a first word vector of characters of the text to be recognized based on the language model, and determining a second word vector of characters of the text to be recognized based on the word vector list;
the first recognition unit 34 is configured to perform named entity recognition on the text to be recognized through a preset named entity recognition model based on at least the first word vector and the second word vector of the character of the text to be recognized.
According to the text recognition device provided by the invention, language model training is performed by using the corpus to obtain the language model, and the first word vector of the characters of the text to be recognized is determined based on the language model. Then, the word vector list corresponding to the corpus is determined, and the second word vector of the characters of the text to be recognized is determined based on the word vector list. Finally, named entity recognition is performed on the text to be recognized through the named entity recognition model based on the first word vector and the second word vector of the characters of the text to be recognized. Because the first word vector of a character reflects the context of the character in the text to be recognized, while the second word vector is a static representation of the character that reflects its literal features, performing named entity recognition based on both vectors can improve the accuracy of recognizing named entities.
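The contrast drawn above between a contextual first vector and a static second vector can be illustrated with entirely made-up values. The `first_vector` function below is a hypothetical stand-in for the language model, not its actual behavior; the point is only that the contextual vector of the same character differs across texts while the static lookup does not.

```python
# Made-up vectors: the first (contextual) vector of the same character
# differs across texts, the second (static) vector does not.
def first_vector(char, context):
    # Hypothetical stand-in for the language model: depends on context.
    return (len(context) % 5 / 10.0, context.count(char) / 10.0)

word_vector_list = {"b": (0.7, 0.1)}  # static lookup table

def second_vector(char):
    return word_vector_list[char]

v1 = first_vector("b", "river bank erosion")
v2 = first_vector("b", "bank account")
assert v1 != v2                                  # contextual: differs
assert second_vector("b") == second_vector("b")  # static: fixed
```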
Optionally, as shown in fig. 4, the apparatus further includes:
a splicing unit 341, configured to splice the first word vector and the second word vector of the character of the text to be recognized to obtain a third word vector of the character of the text to be recognized;
a first input unit 342, configured to input third word vectors of characters of the text to be recognized into the named entity recognition model;
the second recognition unit 343 is configured to perform named entity recognition on the text to be recognized by using the named entity recognition model through a third word vector of the characters of the text to be recognized.
The positions of the splicing unit 341, the first input unit 342, and the second recognition unit 343 in the text recognition device may be determined according to service requirements. Providing the splicing unit 341, the first input unit 342, and the second recognition unit 343 in the first recognition unit 34, as in fig. 4, is only one example.
Optionally, as shown in fig. 4, the apparatus further includes:
the first training unit 3431 is configured to input a third word vector of the character of the text to be recognized into the bidirectional LSTM in the named entity recognition model for training, so as to obtain semantic information features of the text to be recognized;
the labeling unit 3432 is configured to input the semantic information features into a Conditional Random Field (CRF) in the named entity recognition model, label the text to be recognized by using the Conditional Random Field (CRF) in the named entity recognition model, and obtain a labeling result;
and a third identifying unit 3433, configured to identify a named entity in the text to be identified based on the labeling result.
The positions of the first training unit 3431, the labeling unit 3432, and the third recognition unit 3433 in the text recognition device may be determined according to service requirements. Providing the first training unit 3431, the labeling unit 3432, and the third recognition unit 3433 in the second recognition unit 343 of the first recognition unit 34, as in fig. 4, is only one example.
Optionally, as shown in fig. 4, the apparatus may further include:
a second input unit 331 for inputting the text to be recognized to the language model;
a third determining unit 332, configured to determine, by using the language model, the word vector to be selected corresponding to each character in the text to be recognized and the probability of the word vector to be selected;
a fourth determining unit 333, configured to perform, for each of the characters: determining the word vector with the highest probability among the word vectors to be selected corresponding to the character as the first word vector of the character.
The positions of the second input unit 331, the third determining unit 332, and the fourth determining unit 333 in the text recognition device may be determined according to service requirements. Providing the second input unit 331, the third determining unit 332, and the fourth determining unit 333 in the second determining unit 33, as in fig. 4, is only one example.
Optionally, as shown in fig. 4, the word vector list includes at least one word segment and a corresponding relationship of at least one word vector; the apparatus further comprises:
a fifth determining unit 334 for performing, for each of the characters: querying the word vector list to determine target word segmentation corresponding to the character; and determining a word vector corresponding to the target word segmentation as a second word vector of the character.
The position of the fifth determining unit 334 in the text recognition device may be determined according to service requirements. Providing the fifth determining unit 334 in the second determining unit 33, as in fig. 4, is only one example.
Optionally, as shown in fig. 4, the apparatus may further include:
a sixth determining unit 311, configured to determine the n-gram features of the corpus and a word vector of each of the n-gram features;
a second training module 312, configured to input the word vector of each of the n-gram features into a bidirectional LSTM for training, so as to obtain the language model.
The positions of the sixth determining unit 311 and the second training module 312 in the text recognition device may be determined according to service requirements. Providing the sixth determining unit 311 and the second training module 312 in the training unit 31, as in fig. 4, is only one example.
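The n-gram feature extraction performed by the sixth determining unit 311 can be sketched as a sliding window over the corpus text. The choice of n values is an assumption, since the patent does not fix them:

```python
# Extract character n-grams from a text; the default n range (1-3)
# is an assumed illustrative choice.
def ngram_features(text, n_values=(1, 2, 3)):
    feats = []
    for n in n_values:
        feats.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return feats

print(ngram_features("abcd", (2,)))  # ['ab', 'bc', 'cd']
```

Each extracted n-gram would then be assigned a word vector and fed to the bidirectional LSTM for language model training.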
Optionally, as shown in fig. 4, the apparatus may further include:
the word segmentation unit 321 is configured to perform word segmentation processing on the corpus;
a seventh determining unit 322, configured to train the corpus after word segmentation to obtain the word vector list by using a preset word vector generation algorithm; the word vector list comprises at least one word segmentation and a corresponding relation of at least one word vector; the at least one word segmentation is obtained after word segmentation processing is carried out on the corpus.
The positions of the word segmentation unit 321 and the seventh determining unit 322 in the text recognition device may be determined according to service requirements. Providing the word segmentation unit 321 and the seventh determining unit 322 in the first determining unit 32, as in fig. 4, is only one example.
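A minimal sketch of the word segmentation unit 321 and the seventh determining unit 322: segment the corpus, then build the word-vector list. Both pieces are stand-ins — the whitespace segmenter replaces a real word segmentation algorithm (necessary for Chinese text), and the hash-derived vectors replace an actual word2vec-style word vector generation algorithm:

```python
import hashlib

def segment(corpus):
    # Stand-in segmenter: split on whitespace; a real system would use a
    # proper word segmentation algorithm.
    return corpus.split()

def build_word_vector_list(words, dim=4):
    # Stand-in "training": derive a fixed vector per unique word from a
    # hash, in place of a word2vec-style generation algorithm.
    table = {}
    for w in set(words):
        digest = hashlib.md5(w.encode("utf-8")).digest()
        table[w] = tuple(b / 255.0 for b in digest[:dim])
    return table

corpus = "text recognition improves entity recognition"
word_vector_list = build_word_vector_list(segment(corpus))
print(sorted(word_vector_list))  # ['entity', 'improves', 'recognition', 'text']
```

The resulting mapping from each word segmentation to its word vector is the correspondence the seventh determining unit stores.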
In the text recognition device provided by the embodiment of the present invention, for a detailed description of the methods adopted during the operation of each functional module, reference may be made to the corresponding methods of the above method embodiments, which are not repeated herein.
Based on the same inventive concept, according to the above embodiment, another embodiment of the present invention further provides an electronic device, including: a storage medium and a processor;
the processor is adapted to implement the instructions;
the storage medium is adapted to store a plurality of instructions;
the instructions are adapted to be loaded by the processor and to perform the text recognition method described above.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the methods and apparatus described above may be referenced to one another. In addition, the terms "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not represent the merits and demerits of the embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language; it will be appreciated that the teachings of the invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided to disclose enablement and the best mode of the invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components of the text recognition method and apparatus according to embodiments of the present invention may be implemented in practice using a microprocessor or a digital signal processor (DSP). The present invention may also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing part or all of the methods described herein. Such a program embodying the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names.

Claims (13)

1. A method of text recognition, comprising:
performing language model training by using a corpus to obtain a language model;
determining a word vector list corresponding to the corpus;
determining a first word vector of characters of a text to be recognized based on the language model, and determining a second word vector of characters of the text to be recognized based on the word vector list;
performing named entity recognition on the text to be recognized through a named entity recognition model at least based on a first word vector and a second word vector of characters of the text to be recognized;
the determining a first word vector of characters of the text to be recognized based on the language model comprises:
inputting the text to be recognized into the language model;
determining a word vector to be selected corresponding to each character in the text to be recognized and the probability of the word vector to be selected by using the language model;
for each of the characters, performing: and determining the word vector with the highest probability in the word vector to be selected corresponding to the character as the first word vector of the character.
2. The method of claim 1, further comprising:
splicing the first word vector and the second word vector of the characters of the text to be recognized to obtain a third word vector of the characters of the text to be recognized;
inputting the third word vectors of the characters of the text to be recognized into the named entity recognition model;
and carrying out named entity recognition on the text to be recognized through a third word vector of the characters of the text to be recognized by using the named entity recognition model.
3. The method of claim 2, further comprising:
inputting a third word vector of the character of the text to be recognized into the bidirectional LSTM in the named entity recognition model for training to obtain semantic information characteristics of the text to be recognized;
inputting the semantic information features into a Conditional Random Field (CRF) in the named entity recognition model, and labeling the text to be recognized by using the CRF in the named entity recognition model to obtain a labeling result;
and identifying the named entity in the text to be recognized based on the labeling result.
4. A method according to any one of claims 1-3, wherein the list of word vectors comprises at least one word segment and a correspondence of at least one word vector; further comprises:
for each of the characters, performing: querying the word vector list to determine target word segmentation corresponding to the character; and determining a word vector corresponding to the target word segmentation as a second word vector of the character.
5. A method according to any one of claims 1-3, further comprising:
determining n-gram features of the corpus;
determining a word vector for each of the n-gram features;
and inputting word vectors of each n-gram characteristic into a bidirectional LSTM for training to obtain the language model.
6. A method according to any one of claims 1-3, further comprising:
word segmentation processing is carried out on the corpus;
training the corpus subjected to word segmentation processing by using a preset word vector generation algorithm to obtain a word vector list; the word vector list comprises at least one word segmentation and a corresponding relation of at least one word vector; the at least one word segmentation is obtained after word segmentation processing is carried out on the corpus.
7. A text recognition device, comprising:
the training unit is used for training the language model by utilizing the corpus to obtain the language model;
the first determining unit is used for determining a word vector list corresponding to the corpus;
a second determining unit configured to determine a first word vector of characters of a text to be recognized based on the language model, and determine a second word vector of characters of the text to be recognized based on the word vector list;
the first recognition unit is used for carrying out named entity recognition on the text to be recognized through a preset named entity recognition model at least based on a first word vector and a second word vector of characters of the text to be recognized;
the second determining unit is further configured to input the text to be recognized into the language model; determining a word vector to be selected corresponding to each character in the text to be recognized and the probability of the word vector to be selected by using the language model; for each of the characters, performing: and determining the word vector with the highest probability in the word vector to be selected corresponding to the character as the first word vector of the character.
8. The apparatus as recited in claim 7, further comprising:
the splicing unit is used for splicing the first word vector and the second word vector of the characters of the text to be recognized to obtain a third word vector of the characters of the text to be recognized;
the first input unit is used for inputting third word vectors of characters of the text to be recognized into the named entity recognition model;
and the second recognition unit is used for carrying out named entity recognition on the text to be recognized through a third word vector of the characters of the text to be recognized by utilizing the named entity recognition model.
9. The apparatus as recited in claim 8, further comprising:
the first training unit is used for inputting a third word vector of the character of the text to be recognized into the bidirectional LSTM in the named entity recognition model for training to obtain semantic information characteristics of the text to be recognized;
the labeling unit is used for inputting the semantic information characteristics into a Conditional Random Field (CRF) in the named entity recognition model, labeling the text to be recognized by using the Conditional Random Field (CRF) in the named entity recognition model, and obtaining a labeling result;
and the third recognition unit is used for recognizing the named entity in the text to be recognized based on the labeling result.
10. The apparatus according to any one of claims 7-9, wherein the list of word vectors includes at least one word segment and a correspondence of at least one word vector; further comprises:
a fifth determining unit configured to perform, for each of the characters, respectively: querying the word vector list to determine target word segmentation corresponding to the character; and determining a word vector corresponding to the target word segmentation as a second word vector of the character.
11. The apparatus according to any one of claims 7-9, further comprising:
a sixth determining unit, configured to determine an n-gram feature of the corpus; determining a word vector for each of the n-gram features;
and the second training module is used for inputting the word vector of each n-gram characteristic into a bidirectional LSTM for training to obtain the language model.
12. The apparatus according to any one of claims 7-9, further comprising:
the word segmentation unit is used for carrying out word segmentation on the corpus;
a seventh determining unit, configured to train the corpus after word segmentation processing by using a preset word vector generation algorithm to obtain the word vector list; the word vector list comprises at least one word segmentation and a corresponding relation of at least one word vector; the at least one word segmentation is obtained after word segmentation processing is carried out on the corpus.
13. An electronic device, the electronic device comprising: a storage medium and a processor;
the processor is adapted to implement the instructions;
the storage medium is adapted to store a plurality of instructions;
the instructions are adapted to be loaded by the processor and to perform the text recognition method of any one of claims 1 to 6.
CN201910008861.7A 2019-01-04 2019-01-04 Text recognition method and device Active CN111414757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910008861.7A CN111414757B (en) 2019-01-04 2019-01-04 Text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910008861.7A CN111414757B (en) 2019-01-04 2019-01-04 Text recognition method and device

Publications (2)

Publication Number Publication Date
CN111414757A CN111414757A (en) 2020-07-14
CN111414757B true CN111414757B (en) 2023-06-20

Family

ID=71490649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910008861.7A Active CN111414757B (en) 2019-01-04 2019-01-04 Text recognition method and device

Country Status (1)

Country Link
CN (1) CN111414757B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738007B (en) * 2020-07-03 2021-04-13 北京邮电大学 Chinese named entity recognition data augmentation algorithm based on a sequence generative adversarial network
CN112687328B (en) * 2021-03-12 2021-08-31 北京贝瑞和康生物技术有限公司 Method, apparatus and medium for determining phenotypic information of clinical descriptive information
CN112687332B (en) * 2021-03-12 2021-07-30 北京贝瑞和康生物技术有限公司 Method, apparatus and storage medium for determining sites of variation at risk of disease
CN113095085B (en) * 2021-03-30 2024-04-19 北京达佳互联信息技术有限公司 Emotion recognition method and device for text, electronic equipment and storage medium
CN113343692B (en) * 2021-07-15 2023-09-12 杭州网易云音乐科技有限公司 Search intention recognition method, model training method, device, medium and equipment
CN116052648B (en) * 2022-08-03 2023-10-20 荣耀终端有限公司 Training method, using method and training system of voice recognition model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202054A (en) * 2016-07-25 2016-12-07 哈尔滨工业大学 Named entity recognition method for the medical field based on deep learning
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Named entity recognition method in the geography field
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 Named entity recognition method based on bidirectional LSTM and CRF
CN107885721A (en) * 2017-10-12 2018-04-06 北京知道未来信息技术有限公司 Named entity recognition method based on LSTM
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 Named entity recognition method based on Bi-LSTM
CN108205524A (en) * 2016-12-20 2018-06-26 北京京东尚科信息技术有限公司 Text data processing method and device
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Named entity recognition method, device, equipment and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10019438B2 (en) * 2016-03-18 2018-07-10 International Business Machines Corporation External word embedding neural network language models


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Guillaume Lample et al. "Neural Architectures for Named Entity Recognition". arXiv, 2016, full text. *
买买提阿依甫; 吾守尔・斯拉木; 帕丽旦・木合塔尔; 杨文忠. Uyghur named entity recognition based on the BiLSTM-CNN-CRF model. Computer Engineering, 2018(08), full text. *
李丽双; 郭元凯. Biomedical named entity recognition based on the CNN-BLSTM-CRF model. Journal of Chinese Information Processing, 2018(01), full text. *

Also Published As

Publication number Publication date
CN111414757A (en) 2020-07-14

Similar Documents

Publication Publication Date Title
CN111414757B (en) Text recognition method and device
CN109145153B (en) Intention category identification method and device
CN108121700B (en) Keyword extraction method and device and electronic equipment
CN108091328B (en) Speech recognition error correction method and device based on artificial intelligence and readable medium
CN108287858B (en) Semantic extraction method and device for natural language
CN111310440B (en) Text error correction method, device and system
CN108549656B (en) Statement analysis method and device, computer equipment and readable medium
US9645988B1 (en) System and method for identifying passages in electronic documents
CN108170859A (en) Method, apparatus, storage medium and the terminal device of speech polling
EP1619620A1 (en) Adaptation of Exponential Models
CN108776901B (en) Advertisement recommendation method and system based on search terms
Xu et al. Exploiting shared information for multi-intent natural language sentence classification.
US11164210B2 (en) Method, device and computer storage medium for promotion displaying
CN109766550B (en) Text brand recognition method, recognition device and storage medium
US11023685B2 (en) Affect-enriched vector representation of words for use in machine-learning models
CN105206274A (en) Voice recognition post-processing method and device as well as voice recognition system
CN112256845A (en) Intention recognition method, device, electronic equipment and computer readable storage medium
CN114154487A (en) Text automatic error correction method and device, electronic equipment and storage medium
CN114995903B (en) Class label identification method and device based on pre-training language model
CN113988057A (en) Title generation method, device, equipment and medium based on concept extraction
CN107844531B (en) Answer output method and device and computer equipment
CN116522905B (en) Text error correction method, apparatus, device, readable storage medium, and program product
CN112632956A (en) Text matching method, device, terminal and storage medium
CN112559725A (en) Text matching method, device, terminal and storage medium
CN111858860B (en) Search information processing method and system, server and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant