CN109522553B - Named entity identification method and device - Google Patents

Info

Publication number
CN109522553B
CN109522553B (application CN201811332914.2A)
Authority
CN
China
Prior art keywords
text
vector
named entity
character
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811332914.2A
Other languages
Chinese (zh)
Other versions
CN109522553A (en)
Inventor
聂镭
徐泓洋
郑权
张峰
聂颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Original Assignee
Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd filed Critical Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Priority to CN201811332914.2A priority Critical patent/CN109522553B/en
Publication of CN109522553A publication Critical patent/CN109522553A/en
Application granted granted Critical
Publication of CN109522553B publication Critical patent/CN109522553B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a named entity identification method and device. The method comprises: extracting information from a text image by using a convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image; splicing the font vector with the character vector corresponding to the character, and obtaining a feature vector according to the spliced vector; obtaining a named entity set according to the feature vector, wherein the named entity set comprises a plurality of named entities; and constructing a question corresponding to the text image and locating, based on the question, the named entity to be acquired, wherein the named entity to be acquired belongs to the named entity set. The invention solves the technical problem in the related art that information obtained by applying traditional information extraction to certain documents is recognized but unusable.

Description

Named entity identification method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a named entity identification method and device.
Background
Traditional nationally certified certificates, such as CET-4 and CET-6 certificates, graduation certificates, and degree certificates, all follow fixed templates with fixed field positions and specific contents. For such certificates, recognition only needs to extract the characters at the relevant positions and match them directly to the corresponding information; to recognize is to acquire.
As restrictions on the form and content of certificates have been relaxed, colleges, universities, and research institutions have begun to design certificates with their own characteristics, especially graduation certificates and degree certificates. Different schools use different layouts and contents, and even different certificates from the same school may differ in content and form. This poses a challenge for traditional certificate recognition: even if the text in a certificate is extracted, it still cannot be matched to the desired information, i.e. the information is recognized but not usable.
In view of the above problem in the related art, that information obtained by applying traditional information extraction to certain documents is unusable, no effective solution has been proposed so far.
Disclosure of Invention
Embodiments of the invention provide a named entity identification method and device, which at least solve the technical problem in the related art that information obtained by applying traditional information extraction to certain documents is unusable.
According to one aspect of the embodiments of the invention, a named entity identification method is provided, comprising: extracting information from a text image by using a convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image; splicing the font vector with the character vector corresponding to the character, and obtaining a feature vector according to the spliced vector; obtaining a named entity set according to the feature vector, wherein the named entity set comprises a plurality of named entities; and constructing a question corresponding to the text image and locating, based on the question, the named entity to be acquired, wherein the named entity to be acquired belongs to the named entity set.
Optionally, the font vector is a vector of dimension N × 1, and the text vector is a vector of dimension M × 1, where N represents the number of font attributes of the text corresponding to the font vector, and M represents the number of text attributes of the text in the text vector.
Optionally, splicing the font vector with the character vector corresponding to the character and obtaining the feature vector according to the spliced vector comprises: splicing the N × 1-dimensional font vector with the M × 1-dimensional character vector to obtain an (N+M) × 1-dimensional spliced vector; taking the (N+M) × 1-dimensional spliced vector as the input of a bidirectional long short-term memory network model Bi-LSTM; acquiring the output of the bidirectional long short-term memory network model Bi-LSTM; and obtaining the feature vector according to the output, wherein the feature vector is a 2(N+M) × 1-dimensional vector.
Optionally, obtaining a named entity set according to the feature vector includes: taking the feature vector as an input of a conditional random field model CRF; acquiring the output of the conditional random field model CRF; and obtaining the named entity set according to the output of the conditional random field model CRF.
Optionally, constructing the question corresponding to the text image comprises: extracting key information from the text corresponding to the text image, wherein the key information is a feature word associated with the named entity; and using the key information to construct the question.
Optionally, locating the named entity to be acquired based on the question includes: determining, through a matching neural network model, the identifier of the text segment corresponding to the question, wherein the matching neural network model is obtained by machine learning training on multiple groups of data, and each group of data comprises a question and the identifier of the text segment corresponding to that question; and extracting the named entity to be acquired according to the identifier of the text segment.
Optionally, before locating the named entity to be acquired based on the question, the named entity identification method further includes: recognizing the text corresponding to the text image to obtain a plurality of text segments; and adding identifiers to the plurality of text segments based on a predetermined rule. Recognizing the text corresponding to the text image to obtain the plurality of text segments includes: recognizing predetermined punctuation marks in the text; and segmenting the text corresponding to the text image according to the predetermined punctuation marks to obtain the plurality of text segments.
According to another aspect of the embodiments of the present invention, a named entity identification device is also provided, comprising: an extraction unit, configured to extract information from a text image by using a convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image; a first acquisition unit, configured to splice the font vector with the character vector corresponding to the character and obtain a feature vector according to the spliced vector; a second acquisition unit, configured to obtain a named entity set according to the feature vector, wherein the named entity set comprises a plurality of named entities; and a third acquisition unit, configured to construct a question corresponding to the text image and locate, based on the question, the named entity to be acquired, wherein the named entity to be acquired belongs to the named entity set.
Optionally, the font vector is a vector of dimension N × 1, and the text vector is a vector of dimension M × 1, where N represents the number of font attributes of the text corresponding to the font vector, and M represents the number of text attributes of the text in the text vector.
Optionally, the first acquisition unit includes: a splicing module, configured to splice the N × 1-dimensional font vector with the M × 1-dimensional character vector to obtain an (N+M) × 1-dimensional spliced vector; a first determining module, configured to take the (N+M) × 1-dimensional spliced vector as the input of a bidirectional long short-term memory network model Bi-LSTM; a first acquisition module, configured to acquire the output of the bidirectional long short-term memory network model Bi-LSTM; and a second acquisition module, configured to obtain the feature vector according to the output, wherein the feature vector is a 2(N+M) × 1-dimensional vector.
Optionally, the second acquisition unit includes: a second determining module, configured to take the feature vector as the input of a conditional random field model CRF; a third acquisition module, configured to acquire the output of the conditional random field model CRF; and a fourth acquisition module, configured to obtain the named entity set according to the output of the conditional random field model CRF.
Optionally, the third acquisition unit includes: an extraction module, configured to extract key information from the text corresponding to the text image, wherein the key information is a feature word associated with the named entity; and a third determining module, configured to use the key information to construct the question.
Optionally, the third acquisition unit includes: a fourth determining module, configured to determine, through a matching neural network model, the identifier of the text segment corresponding to the question, wherein the matching neural network model is obtained by machine learning training on multiple groups of data, and each group of data comprises a question and the identifier of the text segment corresponding to that question; and an extraction module, configured to extract the named entity to be acquired according to the identifier of the text segment.
Optionally, the named entity identification device further includes: a fourth acquisition unit, configured to recognize the text corresponding to the text image to obtain a plurality of text segments before the named entity to be acquired is located based on the question; and an adding unit, configured to add identifiers to the plurality of text segments based on a predetermined rule; wherein the fourth acquisition unit includes: a recognition module, configured to recognize predetermined punctuation marks in the text; and a fifth acquisition module, configured to segment the text corresponding to the text image according to the predetermined punctuation marks to obtain the plurality of text segments.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein the program executes the method for identifying a named entity according to any one of the above.
According to another aspect of the embodiment of the present invention, there is further provided a processor, configured to execute a program, where the program executes to perform the method for identifying a named entity according to any one of the above items.
In embodiments of the invention, a convolutional neural network model CNN is used to extract information from a text image and obtain a font vector corresponding to a character in the text image; the font vector is spliced with the character vector corresponding to the character, and a feature vector is obtained according to the spliced vector; a named entity set is then obtained according to the feature vector, wherein the named entity set comprises a plurality of named entities. By splicing the font vector carrying the extracted font information with the corresponding character vector and deriving the named entity set from the spliced vector, the method considers both the spatial information of the characters and the context information of the text, improves the efficiency of extracting effective information, and thereby solves the technical problem in the related art that information obtained by applying traditional information extraction to certain documents is unusable.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a method of identifying a named entity according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a named entity identification device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, there is provided a method embodiment of a method for named entity recognition, it being noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer-executable instructions and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
Fig. 1 is a flowchart of a method for identifying a named entity according to an embodiment of the present invention, and as shown in fig. 1, the method for identifying a named entity includes the following steps:
Step S102: extract information from the text image by using the convolutional neural network model CNN to obtain a font vector corresponding to each character in the text image.
A Convolutional Neural Network (CNN) is a deep feedforward artificial neural network in which artificial neurons respond to surrounding units, and it is widely used for large-scale image processing. It typically comprises convolution layers, pooling layers, activation layers, dropout layers, and the like. CNNs include one-dimensional, two-dimensional, and three-dimensional convolutional neural networks: one-dimensional CNNs are commonly used for sequence data; two-dimensional CNNs are often applied to image and text recognition; three-dimensional CNNs are mainly applied to medical images and video data.
In embodiments of the invention, the convolutional neural network model CNN can be used to extract font information from the text image and output a font vector for each character in the image. Taking CET-4, CET-6, graduation, and degree certificates as examples, because the contents of different certificates differ, the characters in a certificate may appear in different fonts; for example, the name, date, and issuing unit often differ from the surrounding text in font type, font size, and stroke weight. Such characters usually carry part, or even all, of the key information in the certificate, so their font information needs to be extracted first. The CNN is commonly used to extract spatial information from images, and in practice CNNs of different complexity can be chosen according to the application scenario and requirements.
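As an illustration only, the following is a minimal sketch in PyTorch of a small CNN that maps a single-character glyph image to a font vector; the layer sizes, the 32 × 32 input resolution, and N = 32 font attributes are illustrative assumptions and are not values fixed by the patent.

import torch
import torch.nn as nn

class FontCNN(nn.Module):
    def __init__(self, n_font_attrs: int = 32):  # N font attributes (assumption)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 32x32 glyph -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),               # -> (batch, 32, 1, 1)
        )
        self.proj = nn.Linear(32, n_font_attrs)    # N-dimensional font vector

    def forward(self, glyph):                      # glyph: (batch, 1, 32, 32)
        h = self.features(glyph).flatten(1)
        return self.proj(h)                        # (batch, N)

font_vec = FontCNN()(torch.rand(1, 1, 32, 32))     # one character image -> N x 1 font vector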
Step S104: splice the font vector with the character vector corresponding to the character, and obtain a feature vector according to the spliced vector.
In step S104, the font information extracted in step S102 can be input, as part of the character vector, into a Bi-LSTM + CRF model for named entity recognition.
The Bi-LSTM (bidirectional LSTM) model is a variant of the Recurrent Neural Network (RNN). The LSTM modifies the memory cell of the basic RNN by introducing input gates, forget gates, and output gates, which allows more effective learning of sequential information. The Bi-LSTM adds a reverse-order pass to the original forward LSTM, and the forward and backward output vectors are usually spliced to obtain the final output vector.
The input of the Bi-LSTM is a vector for each word or character, which may be a simple one-hot vector or a pre-trained word vector (Word2vec, GloVe). In the embodiment of the invention, the font information of each character is added: the pre-trained M × 1-dimensional word/character vector is spliced with the font vector to obtain an (N+M) × 1-dimensional input vector. After the Bi-LSTM, the output vector has dimension 2(N+M) × 1; this is the feature vector obtained from the spliced vector.
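A minimal sketch of this splicing and encoding step, again in PyTorch: the M-dimensional character vector is concatenated with the N-dimensional font vector and passed through a Bi-LSTM whose hidden size equals N+M, so each output step is 2(N+M)-dimensional as described. The values of M, N, and the random sequence are placeholders, not values taken from the patent.

import torch
import torch.nn as nn

M, N, seq_len = 100, 32, 20
char_vecs = torch.rand(1, seq_len, M)                  # pre-trained character vectors
font_vecs = torch.rand(1, seq_len, N)                  # font vectors from the CNN above

spliced = torch.cat([char_vecs, font_vecs], dim=-1)    # (1, seq_len, N+M) spliced vectors
bilstm = nn.LSTM(input_size=N + M, hidden_size=N + M,
                 bidirectional=True, batch_first=True)
features, _ = bilstm(spliced)                          # (1, seq_len, 2*(N+M)) feature vectors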
Preferably, the font vector is an N × 1-dimensional vector and the character vector is an M × 1-dimensional vector, where N denotes the number of font attributes of the character corresponding to the font vector, and M denotes the number of character attributes of the character in the character vector. A font attribute is a property describing the appearance of the character, such as its font type or font size. A character attribute indicates, for example, whether the character belongs to a verb, a noun, a predicate, a subject, a person name, or a place name.
As an optional embodiment, splicing the font vector with the character vector corresponding to the character and obtaining the feature vector according to the spliced vector comprises: splicing the N × 1-dimensional font vector with the M × 1-dimensional character vector to obtain an (N+M) × 1-dimensional spliced vector; taking the (N+M) × 1-dimensional spliced vector as the input of the bidirectional long short-term memory network model Bi-LSTM; acquiring the output of the bidirectional long short-term memory network model Bi-LSTM; and obtaining the feature vector according to the output, wherein the feature vector is a 2(N+M) × 1-dimensional vector.
Step S106: obtain a named entity set according to the feature vector, wherein the named entity set comprises a plurality of named entities.
Here, a Conditional Random Field (CRF) is a probabilistic undirected graph model. A conditional random field is the conditional probability distribution of one set of output random variables given another set of input random variables, under the assumption that the output random variables form a Markov random field. Like the HMM, it is used to predict hidden state sequences from observation sequences, but the CRF is a discriminative model; it is commonly used in syntactic analysis, named entity recognition, part-of-speech tagging, and similar scenarios. Here, the CRF is used as the layer following the Bi-LSTM: each 2(N+M) × 1-dimensional feature vector output by the Bi-LSTM is input to the CRF, and the output is the corresponding tag sequence, i.e. the various named entities.
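As a minimal sketch of this Bi-LSTM + CRF tagging layer, assuming the third-party pytorch-crf package (torchcrf.CRF): the 2(N+M)-dimensional feature vectors are projected to per-tag emission scores, and the CRF decodes the best BIO-style tag sequence. The tag inventory and sizes below are assumptions for illustration.

import torch
import torch.nn as nn
from torchcrf import CRF                           # third-party package pytorch-crf (assumption)

num_tags, feat_dim, seq_len = 7, 2 * (100 + 32), 20  # e.g. O, B/I-PER, B/I-ORG, B/I-TIME
emission_proj = nn.Linear(feat_dim, num_tags)
crf = CRF(num_tags, batch_first=True)

features = torch.rand(1, seq_len, feat_dim)        # Bi-LSTM output from the previous step
emissions = emission_proj(features)                # (1, seq_len, num_tags) emission scores
best_tags = crf.decode(emissions)                  # list of best tag-id sequences
# Training would maximise crf(emissions, gold_tags), i.e. the CRF log-likelihood.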
In step S106, obtaining the named entity set according to the feature vector may include: taking the feature vector as the input of the conditional random field model CRF; acquiring the output of the conditional random field model CRF; and obtaining the named entity set according to the output of the conditional random field model CRF.
Step S108: construct a question corresponding to the text image, and locate, based on the question, the named entity to be acquired, wherein the named entity to be acquired belongs to the named entity set.
In this embodiment, information can be extracted from the text image by using the convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image; the font vector is spliced with the character vector corresponding to the character, and a feature vector is obtained according to the spliced vector; a named entity set comprising a plurality of named entities is obtained according to the feature vector; and a question corresponding to the text image is constructed, and the named entity to be acquired, which belongs to the named entity set, is located based on the question. In the related art, because certificates are highly varied, certificates issued by different units differ in form and content, and even certificates issued by the same unit at different times or by different departments are not the same. The named entity identification method provided by this embodiment splices the font vector carrying the extracted font information with the corresponding character vector and obtains the named entity set from the spliced vector, so that both the spatial information of the characters and the context information of the text are taken into account. This improves the efficiency of extracting effective information and solves the technical problem in the related art that information obtained by applying traditional information extraction to certain documents is unusable.
In step S108, constructing the question corresponding to the text image may include: extracting key information from the text corresponding to the text image, wherein the key information is a feature word associated with the named entity; and using the key information to construct the question. The purpose of this step is to treat information extraction like a reading-comprehension task: the question is used to find the part of the original text related to it, so as to locate the position of the answer.
Taking a graduation certificate as an example, the key information to be extracted should include: name, graduation time, graduating institution, educational background, date of birth, length of schooling, and the like. Corresponding questions can then be asked, for example:
A. What is the name of the student?
B. What is the graduating institution of the student?
……
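A minimal sketch in plain Python of this key-information-to-question mapping, in the spirit of the examples above; the field names and the English question wordings are hypothetical, not terms defined by the patent.

QUESTION_TEMPLATES = {
    "name": "What is the name of the student?",
    "graduation_unit": "What is the graduating institution of the student?",
    "graduation_time": "When did the student graduate?",
    "date_of_birth": "What is the student's date of birth?",
}

def build_questions(key_fields):
    # One question per extracted key-information field.
    return [QUESTION_TEMPLATES[f] for f in key_fields if f in QUESTION_TEMPLATES]

questions = build_questions(["name", "graduation_unit"])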
In addition, in step S108, locating the named entity to be acquired based on the question may include: determining, through a matching neural network model, the identifier of the text segment corresponding to the question, wherein the matching neural network model is obtained by machine learning training on multiple groups of data, and each group of data comprises a question and the identifier of the text segment corresponding to that question; and extracting the named entity to be acquired according to the identifier of the text segment.
For example, a model similar to Match-LSTM can be used to understand the text and locate the segments related to the question. Certificate text is typically very concise: each piece of content occupies its own short segment, separated by commas. For this situation, the text segments are numbered according to their order in the text, and the model finally outputs the numbers of the segments related to the question.
The training process of the matching neural network model is similar to that of Match-LSTM and proceeds in several steps. First, the question and the original text are passed through an Embedding layer to generate word vectors. Then a bidirectional LSTM encodes the question and the original text. Third, the attention distribution of each word of the original text with respect to the question is calculated, the question representation is summarized using this attention distribution, and the word representation of the original text together with its corresponding question representation is fed into another LSTM layer that encodes a query-aware representation of the word. Fourth, an Attention layer is added to obtain a vector representation of the text. Finally, a Softmax layer yields the probability Pi of each word, and the optimization target is to maximize the probability of the words of the target segment, that is,
l = −Σ_{i∈k} log(P_i)
where l represents the loss function, k represents the number of the text segment, and i represents the i-th word in the segment. The loss function here is mainly used to optimize the parameters of the matching functions in the network layers of the matching neural network model. It should be noted that, because the text of a certificate is relatively short and the named entities are obvious, the starting position of the answer does not need to be located. That is, the training process of the matching neural network model is similar to that of Match-LSTM, but the final output differs: it is enough to find the segment corresponding to the question, and there is no need to locate a starting position within it.
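A minimal PyTorch sketch of this final scoring step, under one reading of the description: a score per punctuation-delimited, numbered text segment, a softmax over the segments, and a negative log-likelihood loss that pushes the probability of the target segment k towards 1. The segment scores would come from the question-aware attention encoder described above; here they are random placeholders, and the number of segments is an assumption.

import torch
import torch.nn.functional as F

segment_scores = torch.rand(1, 8)                   # scores for 8 numbered segments (assumption)
target_segment = torch.tensor([3])                  # gold segment number k
probs = F.softmax(segment_scores, dim=-1)           # P_i for every segment
loss = F.nll_loss(torch.log(probs), target_segment) # l = -log P_k (equivalent to cross-entropy)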
Here, Embedding refers to an embedding layer in the network structure, which mainly converts positive-integer indices into vectors of fixed size. Reasons for using an embedding layer: 1. vectors produced by one-hot encoding are very high-dimensional and very sparse; assuming a dictionary containing 2,000 words is encountered in natural language processing, each word would, under one-hot encoding, be represented by a vector of 2,000 integers, 1,999 of which are zeros. 2. Each embedding vector is updated while the neural network is trained.
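The contrast made above can be illustrated with a short PyTorch sketch; the embedding size of 64 is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 2000, 64
word_id = torch.tensor([42])                                     # index of one word in the dictionary

one_hot = F.one_hot(word_id, num_classes=vocab_size).float()     # shape (1, 2000), 1,999 zeros
embedding = nn.Embedding(vocab_size, embed_dim)
dense_vec = embedding(word_id)                                   # shape (1, 64), updated during training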
The Softmax function, also known as the normalized exponential function, is, in mathematics and in particular in probability theory and related fields, the gradient-log-normalizer of a finite discrete (categorical) probability distribution. It can "compress" a K-dimensional vector Z of arbitrary real numbers into another K-dimensional real vector in which each element lies in the range (0, 1) and all elements sum to 1.
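For reference, the standard definition of the Softmax function described above is:

σ(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k),  for j = 1, …, K,

so each σ(z)_j lies in (0, 1) and the K outputs sum to 1.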
As an optional embodiment, before locating the named entity to be acquired based on the question, the named entity identification method may further include: recognizing the text corresponding to the text image to obtain a plurality of text segments; and adding identifiers to the plurality of text segments based on a predetermined rule. Recognizing the text corresponding to the text image to obtain the plurality of text segments includes: recognizing predetermined punctuation marks in the text; and segmenting the text corresponding to the text image according to the predetermined punctuation marks to obtain the plurality of text segments.
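A minimal sketch in plain Python of this pre-processing step: the recognized certificate text is split on a set of punctuation marks and each resulting segment is numbered. The particular punctuation set is an assumption, not one specified by the patent.

import re

PUNCTUATION = r"[，。；、,.;]"            # predetermined punctuation marks (assumption)

def segment_and_number(text: str):
    # Split on the predetermined punctuation and number the segments in text order.
    segments = [s.strip() for s in re.split(PUNCTUATION, text) if s.strip()]
    return {i: seg for i, seg in enumerate(segments, start=1)}   # identifier -> text segment

numbered = segment_and_number("张三，男，2010年毕业于某大学，授予工学学士学位。")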
In addition, because certificate text is concise, the extracted named entity is the target content itself, and the core answer to the corresponding question can be found by locating the relevant text segment in the text content based on the question. That is, the position of the answer to the constructed question is located first, and then the named entity at that position is extracted.
The named entity identification method provided by the embodiment of the invention extracts the font information of the text image and uses a Bi-LSTM + CRF model, combined with the font information, to recognize named entities such as times, person names, organization names, and place names in the text; constructs "questions" whose answers are the key information; understands the text with a Bi-LSTM + Attention model and predicts the sentences related to each question; and matches the named entities in the related sentences to obtain the answers. For the problem of extracting information from the recognized text of certificates whose content varies widely, the method combines the font information of the characters with currently popular deep learning methods to recognize named entities, so that both the spatial information of the characters and the context information of the text are taken into account. Text extraction is then converted into the simple reading-comprehension question of "what is the answer", and a model construction similar to Match-LSTM is provided that no longer predicts the starting point or the answer word but locates the answer segment delimited by punctuation marks. Information is finally extracted by combining the text position with named entity recognition.
Example 2
The embodiment of the present invention further provides a device for identifying a named entity, and it should be noted that the device for identifying a named entity according to the embodiment of the present invention may be used to execute the method for identifying a named entity according to the embodiment of the present invention. The following describes an apparatus for identifying a named entity according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a named entity identification device according to an embodiment of the present invention. As shown in FIG. 2, the device may include: an extraction unit 21, a first acquisition unit 23, a second acquisition unit 25, and a third acquisition unit 27. The device is described in detail below.
The extraction unit 21 is configured to extract information from the text image by using the convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image.
The first acquisition unit 23 is connected to the extraction unit 21 and is configured to splice the font vector with the character vector corresponding to the character and obtain a feature vector according to the spliced vector.
The second acquisition unit 25 is connected to the first acquisition unit 23 and is configured to obtain a named entity set according to the feature vector, wherein the named entity set comprises a plurality of named entities.
The third acquisition unit 27 is connected to the second acquisition unit 25 and is configured to construct a question corresponding to the text image and locate, based on the question, the named entity to be acquired, wherein the named entity to be acquired belongs to the named entity set.
It should be noted that the extraction unit 21 in this embodiment may be configured to execute step S102 of the above embodiment, the first acquisition unit 23 may be configured to execute step S104, the second acquisition unit 25 may be configured to execute step S106, and the third acquisition unit 27 may be configured to execute step S108. These modules implement the same examples and application scenarios as the corresponding steps, but are not limited to the disclosure of the above embodiment.
In this embodiment, the extraction unit 21 can extract information from the text image by using the convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image; the first acquisition unit 23 then splices the font vector with the character vector corresponding to the character and obtains a feature vector according to the spliced vector; the second acquisition unit 25 obtains a named entity set, comprising a plurality of named entities, according to the feature vector; and the third acquisition unit 27 constructs a question corresponding to the text image and locates, based on the question, the named entity to be acquired, which belongs to the named entity set. In the related art, because certificates are highly varied, certificates issued by different units differ in form and content, and even certificates issued by the same unit at different times or by different departments are not the same. The named entity identification device provided by this embodiment splices the font vector carrying the extracted font information with the corresponding character vector and obtains the named entity set from the spliced vector, so that both the spatial information of the characters and the context information of the text are taken into account. This improves the efficiency of extracting effective information and solves the technical problem in the related art that information obtained by applying traditional information extraction to certain documents is unusable.
As an alternative embodiment, the font vector is a vector with dimensions N × 1, and the text vector is a vector with dimensions M × 1, where N represents the number of font attributes of the text corresponding to the font vector, and M represents the number of text attributes of the text in the text vector.
As an alternative embodiment, the first acquisition unit includes: a splicing module, configured to splice the N × 1-dimensional font vector with the M × 1-dimensional character vector to obtain an (N+M) × 1-dimensional spliced vector; a first determining module, configured to take the (N+M) × 1-dimensional spliced vector as the input of a bidirectional long short-term memory network model Bi-LSTM; a first acquisition module, configured to acquire the output of the bidirectional long short-term memory network model Bi-LSTM; and a second acquisition module, configured to obtain the feature vector according to the output, wherein the feature vector is a 2(N+M) × 1-dimensional vector.
As an alternative embodiment, the second acquisition unit includes: a second determining module, configured to take the feature vector as the input of a conditional random field model CRF; a third acquisition module, configured to acquire the output of the conditional random field model CRF; and a fourth acquisition module, configured to obtain the named entity set according to the output of the conditional random field model CRF.
As an alternative embodiment, the third acquisition unit includes: an extraction module, configured to extract key information from the text corresponding to the text image, wherein the key information is a feature word associated with the named entity; and a third determining module, configured to use the key information to construct the question.
As an alternative embodiment, the third acquisition unit includes: a fourth determining module, configured to determine, through a matching neural network model, the identifier of the text segment corresponding to the question, wherein the matching neural network model is obtained by machine learning training on multiple groups of data, and each group of data comprises a question and the identifier of the text segment corresponding to that question; and an extraction module, configured to extract the named entity to be acquired according to the identifier of the text segment.
As an alternative embodiment, the named entity identification device further includes: a fourth acquisition unit, configured to recognize the text corresponding to the text image to obtain a plurality of text segments before the named entity to be acquired is located based on the question; and an adding unit, configured to add identifiers to the plurality of text segments based on a predetermined rule; wherein the fourth acquisition unit includes: a recognition module, configured to recognize predetermined punctuation marks in the text; and a fifth acquisition module, configured to segment the text corresponding to the text image according to the predetermined punctuation marks to obtain the plurality of text segments.
The named entity identification device comprises a processor and a memory; the extraction unit 21, the first acquisition unit 23, the second acquisition unit 25, the third acquisition unit 27, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor comprises one or more kernels, and a kernel calls the corresponding program unit from the memory. By adjusting kernel parameters, a question corresponding to the text image is constructed, and the named entity to be acquired, which belongs to the named entity set, is located based on the question.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein the program performs the method for identifying a named entity of any one of the above.
According to another aspect of the embodiments of the present invention, there is provided a processor, configured to execute a program, where the program executes the method for identifying a named entity according to any one of the above.
An embodiment of the present invention further provides an apparatus, which includes a processor, a memory, and a program stored in the memory and executable on the processor. When the processor executes the program, the following steps are implemented: extracting information from the text image by using a convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image; splicing the font vector with the character vector corresponding to the character, and obtaining a feature vector according to the spliced vector; obtaining a named entity set according to the feature vector, wherein the named entity set comprises a plurality of named entities; and constructing a question corresponding to the text image and locating, based on the question, the named entity to be acquired, wherein the named entity to be acquired belongs to the named entity set.
An embodiment of the invention also provides a computer program product adapted, when executed on a data processing device, to perform a program that initializes the following method steps: extracting information from the text image by using a convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image; splicing the font vector with the character vector corresponding to the character, and obtaining a feature vector according to the spliced vector; obtaining a named entity set according to the feature vector, wherein the named entity set comprises a plurality of named entities; and constructing a question corresponding to the text image and locating, based on the question, the named entity to be acquired, wherein the named entity to be acquired belongs to the named entity set.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A method for identifying a named entity, comprising:
extracting information from a text image by using a convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image;
splicing the font vector with the character vector corresponding to the character, and obtaining a feature vector according to the spliced vector;
obtaining a named entity set according to the feature vector, wherein the named entity set comprises a plurality of named entities;
constructing a question corresponding to the text image, and locating, based on the question, the named entity to be acquired, wherein the named entity to be acquired belongs to the named entity set;
wherein constructing the question corresponding to the text image comprises: extracting key information from the text corresponding to the text image, wherein the key information is a feature word associated with the named entity; and using the key information to construct the question;
locating the named entity to be acquired based on the question comprises: determining, through a matching neural network model, the identifier of the text segment corresponding to the question, wherein the matching neural network model is obtained by machine learning training on multiple groups of data, and each group of data comprises a question and the identifier of the text segment corresponding to that question; and extracting the named entity to be acquired according to the identifier of the text segment;
before locating the named entity to be acquired based on the question, the method further comprises:
recognizing the text corresponding to the text image to obtain a plurality of text segments;
adding identifiers to the plurality of text segments based on a predetermined rule;
wherein recognizing the text corresponding to the text image to obtain the plurality of text segments comprises:
recognizing predetermined punctuation marks in the text;
and segmenting the text corresponding to the text image according to the predetermined punctuation marks to obtain the plurality of text segments.
2. The method of claim 1, wherein the font vector is a vector with dimensions N x 1, and the text vector is a vector with dimensions M x 1, where N represents the number of font attributes of the text corresponding to the font vector, and M represents the number of text attributes of the text in the text vector.
3. The method of claim 2, wherein splicing the font vector with the character vector corresponding to the character and obtaining the feature vector according to the spliced vector comprises:
splicing the N × 1-dimensional font vector with the M × 1-dimensional character vector to obtain an (N+M) × 1-dimensional spliced vector;
taking the (N+M) × 1-dimensional spliced vector as the input of a bidirectional long short-term memory network model Bi-LSTM;
acquiring the output of the bidirectional long short-term memory network model Bi-LSTM;
and obtaining the feature vector according to the output, wherein the feature vector is a 2(N+M) × 1-dimensional vector.
4. The method of claim 1, wherein deriving a set of named entities from the feature vector comprises:
taking the feature vector as an input of a conditional random field model CRF;
acquiring the output of the conditional random field model CRF;
and obtaining the named entity set according to the output of the conditional random field model CRF.
5. An apparatus for identifying named entities, comprising:
an extraction unit, configured to extract information from a text image by using a convolutional neural network model CNN to obtain a font vector corresponding to a character in the text image;
the first obtaining unit is used for splicing the font vector and the character vector corresponding to the character, and obtaining a characteristic vector according to the spliced vector obtained by splicing;
the second obtaining unit is used for obtaining a named entity set according to the feature vector, wherein the named entity set comprises a plurality of named entities;
a third acquisition unit, configured to construct a question corresponding to the text image and locate, based on the question, the named entity to be acquired, wherein the named entity to be acquired belongs to the named entity set;
the third acquisition unit is further configured to extract key information from the text corresponding to the text image, wherein the key information is a feature word associated with the named entity; use the key information to construct the question; determine, through a matching neural network model, the identifier of the text segment corresponding to the question, wherein the matching neural network model is obtained by machine learning training on multiple groups of data, and each group of data comprises a question and the identifier of the text segment corresponding to that question; and extract the named entity to be acquired according to the identifier of the text segment;
wherein the named entity identification device further comprises: a fourth acquisition unit, configured to recognize the text corresponding to the text image to obtain a plurality of text segments before the named entity to be acquired is located based on the question; and an adding unit, configured to add identifiers to the plurality of text segments based on a predetermined rule; wherein the fourth acquisition unit comprises: a recognition module, configured to recognize predetermined punctuation marks in the text; and a fifth acquisition module, configured to segment the text corresponding to the text image according to the predetermined punctuation marks to obtain the plurality of text segments.
6. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program performs the method of identifying a named entity of any one of claims 1 to 4.
7. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of identifying a named entity of any of claims 1 to 4.
CN201811332914.2A 2018-11-09 2018-11-09 Named entity identification method and device Active CN109522553B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811332914.2A CN109522553B (en) 2018-11-09 2018-11-09 Named entity identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811332914.2A CN109522553B (en) 2018-11-09 2018-11-09 Named entity identification method and device

Publications (2)

Publication Number Publication Date
CN109522553A CN109522553A (en) 2019-03-26
CN109522553B true CN109522553B (en) 2020-02-11

Family

ID=65776277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811332914.2A Active CN109522553B (en) 2018-11-09 2018-11-09 Named entity identification method and device

Country Status (1)

Country Link
CN (1) CN109522553B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119694B (en) * 2019-04-24 2021-03-12 北京百炼智能科技有限公司 Picture processing method and device and computer readable storage medium
CN110222168B (en) * 2019-05-20 2023-08-18 平安科技(深圳)有限公司 Data processing method and related device
CN110209721A (en) * 2019-06-04 2019-09-06 南方科技大学 Judgement document transfers method, apparatus, server and storage medium
CN110348023A (en) * 2019-07-18 2019-10-18 北京香侬慧语科技有限责任公司 A kind of method, apparatus, storage medium and the electronic equipment of Chinese text participle
CN110348025A (en) * 2019-07-18 2019-10-18 北京香侬慧语科技有限责任公司 A kind of interpretation method based on font, device, storage medium and electronic equipment
CN110334357A (en) * 2019-07-18 2019-10-15 北京香侬慧语科技有限责任公司 A kind of method, apparatus, storage medium and electronic equipment for naming Entity recognition
CN110348022A (en) * 2019-07-18 2019-10-18 北京香侬慧语科技有限责任公司 A kind of method, apparatus of similarity analysis, storage medium and electronic equipment
CN110705272A (en) * 2019-08-28 2020-01-17 昆明理工大学 Named entity identification method for automobile engine fault diagnosis
CN110569846A (en) 2019-09-16 2019-12-13 北京百度网讯科技有限公司 Image character recognition method, device, equipment and storage medium
CN110619124B (en) * 2019-09-19 2023-06-16 成都数之联科技股份有限公司 Named entity identification method and system combining attention mechanism and bidirectional LSTM
CN110781646B (en) * 2019-10-15 2023-08-22 泰康保险集团股份有限公司 Name standardization method, device, medium and electronic equipment
CN111126069B (en) * 2019-12-30 2022-03-29 华南理工大学 Social media short text named entity identification method based on visual object guidance
CN111241839B (en) * 2020-01-16 2022-04-05 腾讯科技(深圳)有限公司 Entity identification method, entity identification device, computer readable storage medium and computer equipment
CN113283241B (en) * 2020-02-20 2022-04-29 阿里巴巴集团控股有限公司 Text recognition method and device, electronic equipment and computer readable storage medium
CN111488739B (en) * 2020-03-17 2023-07-18 天津大学 Implicit chapter relation identification method for generating image enhancement representation based on multiple granularities
CN111767732B (en) * 2020-06-09 2024-01-26 上海交通大学 Document content understanding method and system based on graph attention model
CN114021572B (en) * 2022-01-05 2022-03-22 苏州浪潮智能科技有限公司 Natural language processing method, device, equipment and readable storage medium
CN117252202B (en) * 2023-11-20 2024-03-19 江西风向标智能科技有限公司 Construction method, identification method and system for named entities in high school mathematics topics

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246550A (en) * 2008-03-11 2008-08-20 深圳华为通信技术有限公司 Image character recognition method and device
CN106228157A (en) * 2016-07-26 2016-12-14 江苏鸿信***集成有限公司 Coloured image word paragraph segmentation based on image recognition technology and recognition methods

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314417A (en) * 2011-09-22 2012-01-11 西安电子科技大学 Method for identifying Web named entity based on statistical model
CN107644014A (en) * 2017-09-25 2018-01-30 南京安链数据科技有限公司 A kind of name entity recognition method based on two-way LSTM and CRF

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246550A (en) * 2008-03-11 2008-08-20 深圳华为通信技术有限公司 Image character recognition method and device
CN106228157A (en) * 2016-07-26 2016-12-14 江苏鸿信***集成有限公司 Coloured image word paragraph segmentation based on image recognition technology and recognition methods

Also Published As

Publication number Publication date
CN109522553A (en) 2019-03-26

Similar Documents

Publication Publication Date Title
CN109522553B (en) Named entity identification method and device
CN110795543A (en) Unstructured data extraction method and device based on deep learning and storage medium
Tahsin Mayeesha et al. Deep learning based question answering system in Bengali
CN110347802B (en) Text analysis method and device
CN110909549B (en) Method, device and storage medium for punctuating ancient Chinese
CN108763535A (en) Information acquisition method and device
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN113342958B (en) Question-answer matching method, text matching model training method and related equipment
CN114528418B (en) Text processing method, system and storage medium
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN112507095A (en) Information identification method based on weak supervised learning and related equipment
CN114281931A (en) Text matching method, device, equipment, medium and computer program product
CN111460808B (en) Synonymous text recognition and content recommendation method and device and electronic equipment
CN116561272A (en) Open domain visual language question-answering method and device, electronic equipment and storage medium
CN115795007A (en) Intelligent question-answering method, intelligent question-answering device, electronic equipment and storage medium
CN115203388A (en) Machine reading understanding method and device, computer equipment and storage medium
CN112052680B (en) Question generation method, device, equipment and storage medium
CN115270746A (en) Question sample generation method and device, electronic equipment and storage medium
CN114936274A (en) Model training method, dialogue generating device, dialogue training equipment and storage medium
CN114398903A (en) Intention recognition method and device, electronic equipment and storage medium
CN114297353A (en) Data processing method, device, storage medium and equipment
CN114611529A (en) Intention recognition method and device, electronic equipment and storage medium
CN114510561A (en) Answer selection method, device, equipment and storage medium
CN114281934A (en) Text recognition method, device, equipment and storage medium
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 519031 office 1316, No. 1, lianao Road, Hengqin new area, Zhuhai, Guangdong

Patentee after: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.

Address before: 519000 room 417, building 20, creative Valley, Hengqin new area, Xiangzhou, Zhuhai, Guangdong

Patentee before: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.