CN112380867A - Text processing method, text processing device, knowledge base construction method, knowledge base construction device and storage medium - Google Patents


Info

Publication number
CN112380867A
Authority
CN
China
Prior art keywords
text
entity
network
training
coding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011403298.2A
Other languages
Chinese (zh)
Inventor
刘港安
文瑞
陈曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011403298.2A priority Critical patent/CN112380867A/en
Publication of CN112380867A publication Critical patent/CN112380867A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a text processing method and apparatus, a knowledge base construction method and apparatus, and a storage medium. The text processing method comprises the following steps: encoding a training text through a coding network to obtain a training coding sequence; obtaining an entity recognition result according to the training coding sequence and an entity recognition network; obtaining a relation recognition result according to the entity recognition result and a relation classification network; jointly training the coding network, the entity recognition network and the relation classification network based on the entity recognition result, the relation recognition result and label data; and updating the training coding sequence according to the updated coding network and the relation recognition result, returning to the step of recognizing entities of the training text according to the training coding sequence and the entity recognition network for iterative training, and obtaining a target coding network, a target entity recognition network and a target relation classification network when an iteration stop condition is met. By adopting the method, the accuracy of the entity recognition network and the relation classification network can be improved.

Description

Text processing method, text processing device, knowledge base construction method, knowledge base construction device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text processing method, an apparatus, a computer device, and a storage medium, and a method, an apparatus, a computer device, and a storage medium for constructing a knowledge base.
Background
With the rapid development of artificial intelligence technology, natural language processing has been widely applied in many scenarios, and entity relationship recognition is often required. Entity relationship recognition, also called entity relationship extraction, aims to extract structured information from large-scale unstructured or semi-structured natural language sentences in order to determine the semantic relationships among the entities in those sentences. It can solve the problem of classifying the relationships among entities in natural language sentences and is also an important basis for constructing complex knowledge-based systems such as text summarization, automatic question answering, machine translation, search engines and knowledge graphs.
In the conventional technology, when an entity relationship is extracted, a named entity recognition model is usually adopted to firstly perform entity recognition, and then a relationship classification model is adopted to perform relationship classification on the recognized entity pair.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a text processing method, apparatus, computer device and storage medium capable of improving model accuracy.
A text processing method, comprising:
acquiring a training text and label data corresponding to the training text;
coding the training text through a coding network to obtain a training code sequence corresponding to the training text;
identifying the entity of the training text according to the training coding sequence and the entity identification network to obtain an entity identification result;
identifying the entity relationship of the training text according to the entity identification result and a relationship classification network to obtain a relationship identification result;
jointly training the coding network, the entity recognition network and the relationship classification network based on the entity recognition result, the relationship recognition result and the label data to update the coding network, the entity recognition network and the relationship classification network;
updating the training coding sequence according to the updated coding network and the relationship recognition result, returning to the step of recognizing the entity of the training text according to the training coding sequence and the entity recognition network for iterative training, and obtaining a trained target coding network, a trained target entity recognition network and a trained target relationship classification network when an iteration stop condition is met; the target coding network, the target entity identification network and the target relation classification network are used for cooperatively identifying the entity relation of the text to be processed.
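The iterative joint-training flow above can be sketched as a control loop. This is a minimal illustrative skeleton, not the patent's implementation: the three networks are replaced by toy stub functions (the patent's embodiments use, for example, a BLSTM encoder, a CRF-style entity recognizer and a CNN relation classifier), and the parameter updates are simple counters standing in for gradient steps.

```python
def encode(text, params, relation_result):
    # Toy encoder: pair each character with the parameter version and a
    # flag marking whether a previous relation result was fed back in.
    return [(ch, params, relation_result is not None) for ch in text]

def recognize_entities(code_sequence, params):
    # Toy entity recognizer: treats every element as an "entity".
    return [item[0] for item in code_sequence]

def classify_relations(entity_result, params):
    # Toy relation classifier: pairs up adjacent entities.
    return list(zip(entity_result, entity_result[1:]))

def joint_loss(entity_result, relation_result, label_data):
    return 0.0  # placeholder for the joint training objective

def train_jointly(training_text, label_data, max_iterations=3):
    coding_params, entity_params, relation_params = 0, 0, 0
    relation_result = None
    for step in range(max_iterations):
        # 1. Encode (from the second round on, the previous relation
        #    result is folded back into the coding sequence).
        code_sequence = encode(training_text, coding_params, relation_result)
        # 2. Entity recognition on the coding sequence.
        entity_result = recognize_entities(code_sequence, entity_params)
        # 3. Relation classification on the recognized entities.
        relation_result = classify_relations(entity_result, relation_params)
        # 4. Joint update of all three modules against the labels
        #    (a real implementation would backpropagate this loss).
        loss = joint_loss(entity_result, relation_result, label_data)
        coding_params += 1
        entity_params += 1
        relation_params += 1
    return coding_params, entity_params, relation_params
```

The loop mirrors the claimed steps: encode, recognize entities, classify relations, update jointly, then re-encode with the relation result and repeat until the stop condition (here a fixed iteration count) is met.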
In one embodiment, the relationship classification network comprises a hidden layer, a convolutional layer and a pooling layer. Inputting the training code segment into the relationship classification network for relationship recognition to obtain a relationship recognition result corresponding to the training entity pair comprises: inputting the training code segment into the hidden layer, and processing the training code segment through the hidden layer to obtain a corresponding first intermediate feature; inputting the first intermediate feature into the convolutional layer, and performing convolution processing on the first intermediate feature through the convolutional layer to obtain a corresponding second intermediate feature; inputting the second intermediate feature into the pooling layer, and performing pooling processing on the second intermediate feature through the pooling layer; and classifying the entity relationship according to the pooling result to obtain a relationship recognition result.
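The hidden-layer / convolution / pooling pipeline of this embodiment can be illustrated with a toy one-dimensional version. This is a hand-rolled sketch with made-up weights; the relation labels "no_relation", "relation_A" and "relation_B" are illustrative, not from the patent, and a real relationship classification network would learn all of these parameters.

```python
def hidden_layer(segment, weight=0.5, bias=0.1):
    # Toy dense layer: elementwise affine transform of the code segment
    # (the "first intermediate feature").
    return [weight * x + bias for x in segment]

def conv1d(features, kernel=(0.2, 0.5, 0.2)):
    # 1-D convolution producing the "second intermediate feature".
    k = len(kernel)
    return [sum(kernel[j] * features[i + j] for j in range(k))
            for i in range(len(features) - k + 1)]

def max_pool(features):
    # Max pooling collapses the feature map to a single score.
    return max(features)

def classify_relation(segment, thresholds=(0.5, 1.0)):
    pooled = max_pool(conv1d(hidden_layer(segment)))
    # Map the pooled score onto one of three toy relation labels.
    if pooled < thresholds[0]:
        return "no_relation"
    if pooled < thresholds[1]:
        return "relation_A"
    return "relation_B"
```

The point of the sketch is the data flow: code segment → hidden layer → convolution → pooling → classification, exactly the order the embodiment describes.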
A text processing apparatus, the apparatus comprising:
the data acquisition module is used for acquiring a training text and label data corresponding to the training text;
the coding module is used for coding the training text through a coding network to obtain a training coding sequence corresponding to the training text;
the entity recognition module is used for recognizing the entity of the training text according to the training code sequence and the entity recognition network to obtain an entity recognition result;
the relation recognition module is used for recognizing the entity relation of the training text according to the entity recognition result and the relation classification network to obtain a relation recognition result;
a joint training module, configured to jointly train the coding network, the entity recognition network, and the relationship classification network based on the entity recognition result, the relationship recognition result, and the label data, so as to update the coding network, the entity recognition network, and the relationship classification network;
the iterative training module is used for updating the training code sequence according to the updated coding network and the relationship recognition result, returning to the step of recognizing the entity of the training text according to the training code sequence and the entity recognition network for iterative training, and obtaining a trained target coding network, a trained target entity recognition network and a trained target relationship classification network when an iteration stop condition is met; the target coding network, the target entity recognition network and the target relation classification network are used for cooperatively recognizing the entity relation of the text to be processed.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the text processing method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text processing method described above.
In the text processing method, apparatus, computer device and storage medium, the training text is encoded through the coding network to obtain the training coding sequence corresponding to the training text; the entity of the training text is recognized according to the training coding sequence and the entity recognition network to obtain the entity recognition result; the entity relationship of the training text is recognized according to the entity recognition result and the relationship classification network to obtain the relationship recognition result; and the coding network, the entity recognition network and the relationship classification network are jointly trained based on the entity recognition result, the relationship recognition result and the label data to obtain the updated networks. Because of the joint training, the entity recognition network and the relationship classification network share the network parameters of the coding network, so the dependence between the two networks is captured during training and the resulting networks are more accurate. Furthermore, the coding network, the entity recognition network and the relationship classification network undergo multiple rounds of iterative training; in each iteration, the training coding sequence is updated according to the updated coding network and the relationship recognition result, and when the iteration stop condition is met, the trained target coding network, target entity recognition network and target relationship classification network are obtained. The correlation between the entity recognition network and the relationship classification network is thus fully considered, and their accuracy is further improved.
A method for constructing a knowledge base, the method comprising:
acquiring a text to be processed, and coding the text to be processed through a trained target coding network to obtain a coding sequence corresponding to the text to be processed;
identifying the entity of the text to be processed according to the coding sequence and the trained target entity identification network to obtain an entity identification result;
identifying the entity relationship of the text to be processed according to the entity identification result and the trained target relationship classification network to obtain a relationship identification result;
the target coding network, the target entity recognition network and the target relationship classification network are obtained by performing iterative joint training based on a training text, wherein in each iteration of training, the training coding sequence input into the entity recognition network obtained by the previous training is updated according to the relationship recognition result output by the relationship classification network obtained by the previous training;
and constructing a knowledge base according to the entity recognition result and the relation recognition result.
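The construction step above can be pictured as collecting recognized (entity, relation, entity) triples into an indexed store. A minimal sketch under the assumption that each recognized entity pair is matched with one relation label; real results would carry richer structure.

```python
def build_knowledge_base(entity_pairs, relation_results):
    """Collect (head, relation, tail) triples into a simple knowledge
    base indexed by head entity, so facts can be looked up per entity."""
    kb = {}
    for (head, tail), relation in zip(entity_pairs, relation_results):
        kb.setdefault(head, []).append((relation, tail))
    return kb
```

For example, an entity pair ("aspirin", "headache") classified with the relation "treats" becomes the stored fact ("treats", "headache") under the key "aspirin".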
In one embodiment, the method for constructing the knowledge base further comprises: acquiring a prescription to be audited, wherein the prescription to be audited comprises a medicine name; inquiring a corresponding target entity and a target relation from the medicine knowledge base according to the medicine name; and auditing the prescription to be audited according to the inquired target entity and the target relation to obtain an auditing result.
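The prescription-audit embodiment reduces to looking up each drug name in the drug knowledge base and reporting the associated entities and relations. A minimal sketch assuming the dict-of-triples shape from the construction step; the field name `drug_names` and the report wording are illustrative, not from the patent.

```python
def audit_prescription(prescription, knowledge_base):
    """Check each drug in a prescription against the drug knowledge base.

    Returns a per-drug report: the (relation, entity) facts found, or a
    marker asking for manual review when the drug is unknown.
    """
    report = {}
    for drug in prescription["drug_names"]:
        facts = knowledge_base.get(drug)
        report[drug] = facts if facts else "unknown drug: manual review needed"
    return report
```

A downstream auditor would then apply rules over the returned facts (e.g. flagging contraindicated drug pairs) to produce the final audit result.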
An apparatus for building a knowledge base, the apparatus comprising:
the text acquisition module is used for acquiring a text to be processed, and encoding the text to be processed through a trained target encoding network to obtain an encoding sequence corresponding to the text to be processed;
the entity recognition module is used for recognizing the entity of the text to be processed according to the coding sequence and the trained target entity recognition network to obtain an entity recognition result;
the relationship recognition module is used for recognizing the entity relationship of the text to be processed according to the entity recognition result and the trained target relationship classification network to obtain a relationship recognition result; the target coding network, the target entity recognition network and the target relationship classification network are obtained by performing iterative joint training based on a training text, wherein in each iteration of training, the training coding sequence input into the entity recognition network obtained by the previous training is updated according to the relationship recognition result output by the relationship classification network obtained by the previous training;
and the knowledge base construction module is used for constructing a knowledge base according to the entity recognition result and the relation recognition result.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method of constructing a knowledge base when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of building a knowledge base as described above.
After the text to be processed is obtained, it is encoded through the trained target coding network to obtain the corresponding coding sequence; the entity of the text to be processed is recognized according to the coding sequence and the trained target entity recognition network to obtain the entity recognition result; the entity relationship is recognized according to the entity recognition result and the trained target relationship classification network to obtain the relationship recognition result; and a knowledge base is constructed according to the entity recognition result and the relationship recognition result. Because the target coding network, the target entity recognition network and the target relationship classification network are obtained by iterative joint training based on the training text, and in each iteration the training coding sequence input into the entity recognition network obtained by the previous training is updated according to the relationship recognition result output by the relationship classification network obtained by the previous training, the dependence and correlation between the entity recognition network and the relationship classification network are fully considered during training. The resulting networks therefore generalize better and can accurately recognize the entities and relationships in the text to be processed, so the accuracy of the knowledge base constructed from the entity recognition result and the relationship recognition result is markedly improved compared with a knowledge base constructed by manual labeling in the related art.
Drawings
FIG. 1 is a diagram of an application environment for a method of text processing and a method of knowledge base construction in one embodiment;
FIG. 2 is a flow diagram that illustrates a method for text processing in one embodiment;
FIG. 3 is a flowchart illustrating steps prior to obtaining a training text in one embodiment;
FIG. 4 is a flowchart illustrating a method of processing text in one embodiment;
FIG. 5 is a diagram illustrating the structure of a text processing network in one embodiment;
FIG. 6 is a schematic diagram of a BLSTM network in one embodiment;
FIG. 7 is a block diagram of a CNN network in one embodiment;
FIG. 8 is a diagram illustrating the effect of a text processing method in one embodiment;
FIG. 9 is a flow diagram that illustrates the steps of a method for building a knowledge base in one embodiment;
FIG. 10A is a process diagram of a method for processing text and a method for constructing a knowledge base in an application scenario;
FIG. 10B is a diagram illustrating the relationship between the number of working days required and the magnitude of the knowledge base in one embodiment;
FIG. 11 is a block diagram showing a configuration of a text processing apparatus according to an embodiment;
FIG. 12 is a block diagram showing the construction of a knowledge base constructing apparatus according to an embodiment;
FIG. 13 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operating/interactive systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering robots, knowledge graphs and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching-based learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and researched in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical services, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to technologies such as artificial intelligence natural language processing and machine learning, and is specifically explained by the following embodiment.
The text processing method provided by the application can be applied to the application environment shown in fig. 1. The server and the terminal can both independently execute the text processing method, and the server and the terminal can also cooperatively execute the text processing method.
In one embodiment, taking the case where the server and the terminal cooperatively execute the text processing method as an example, the server may first obtain a training text and label data corresponding to the training text; encode the training text through a coding network to obtain a training coding sequence corresponding to the training text; recognize the entity of the training text according to the training coding sequence and the entity recognition network to obtain an entity recognition result; recognize the entity relationship of the training text according to the entity recognition result and the relationship classification network to obtain a relationship recognition result; jointly train the coding network, the entity recognition network and the relationship classification network based on the entity recognition result, the relationship recognition result and the label data so as to update the three networks; update the training coding sequence according to the updated coding network and the updated relationship recognition result; and return to the step of recognizing the entity of the training text according to the training coding sequence and the entity recognition network for iterative training. When the iteration stop condition is met, the trained target coding network, target entity recognition network and target relationship classification network are obtained. When entity relationship recognition is performed, the server may send the obtained target coding network, target entity recognition network and target relationship classification network to the terminal, and the terminal uses the three networks cooperatively to recognize the entity relationships of the text to be processed; alternatively, the terminal may send the text to be recognized to the server, and the server recognizes the entity relationships of the text to be processed through the cooperation of the target coding network, the target entity recognition network and the target relationship classification network and returns the recognition result to the terminal.
The terminal 102 may be, but is not limited to, any of various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a text processing method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step 202, obtaining a training text and label data corresponding to the training text.
Wherein the training text refers to sample text for training the text processing network. The language of the training text can be determined according to actual needs; for example, the training text can be a Chinese sentence or an English sentence. The label data corresponding to the training text refers to training labels for performing supervised training on the text processing network. The text processing network refers to a machine learning model for performing entity recognition and relationship recognition on the text to be processed, and includes a coding network, an entity recognition network and a relationship classification network. During training, the text processing network takes training samples as input data and label data as expected output data and learns continuously; when training is finished, the obtained text processing network can perform entity recognition and relationship extraction on an input text to be processed. Therefore, in the supervised training process, the adopted label data can comprise entity labels and relationship labels corresponding to the training samples.
Specifically, the server may obtain the training samples and the label data corresponding to the training samples from the training data set. The training data set includes a plurality of training samples and label data corresponding to each training sample. The training data set can be a manually labeled data set, namely label data corresponding to a training sample is obtained by manually labeling entities and relations in a text; the training data set can also be a data set labeled automatically by a machine, that is, the label data corresponding to the training sample is obtained by automatically labeling the entities and relations in the text by the computer.
In one embodiment, the training data set may be stored in a database, and the server may obtain the training text and the label data corresponding to the training text from the database. In another embodiment, the training data set may be stored on other computer devices, and the server may obtain the training text and the label data corresponding to the training text from other computer devices through a network or the like.
And 204, coding the training text through a coding network to obtain a training code sequence corresponding to the training text.
Here, the coding network refers to a machine learning module used for encoding, where encoding refers to the process of converting information from one form or format to another. The training coding sequence refers to the coding result obtained by the server coding the text through the coding network during training. Text elements refer to the units that make up a text, such as the words and phrases in the text.
Specifically, after the server obtains the training text, the training text is input into a coding network, each text element in the training text is coded through the coding network, in the coding process, each text element needs to be coded one by one, codes corresponding to each text element are obtained, and the codes corresponding to each text element are sequenced according to the position sequence of the text element in the text to form a training coding sequence corresponding to the training text.
In one embodiment, when the coding network performs coding, the server first performs feature mapping on each text element in the training text, mapping each text element to a feature space to obtain a corresponding feature representation, and then codes the obtained feature representations to obtain element coding features; for example, the feature representation of each text element may be coded by a BLSTM (Bi-directional Long Short-Term Memory) network, an LSTM (Long Short-Term Memory) network or a CNN (Convolutional Neural Network) network to obtain the element coding features. Finally, the training coding sequence corresponding to the training text is obtained according to the element coding features.
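To see why a bidirectional encoder is useful here, consider this toy stand-in for the BLSTM: each position's code concatenates a left-to-right running summary with a right-to-left one, so every element carries context from both directions. The decaying-sum recurrence is an illustrative simplification of the patent's encoder, not an actual LSTM cell.

```python
def bidirectional_encode(embeddings):
    """Toy stand-in for a BLSTM encoder over scalar embeddings.

    Each position's code is a (forward, backward) pair: a decaying
    running sum from the left and one from the right, so every
    element sees context from both directions.
    """
    n = len(embeddings)
    forward, backward = [0.0] * n, [0.0] * n
    acc = 0.0
    for i in range(n):                  # left-to-right pass
        acc = 0.5 * acc + embeddings[i]
        forward[i] = acc
    acc = 0.0
    for i in range(n - 1, -1, -1):      # right-to-left pass
        acc = 0.5 * acc + embeddings[i]
        backward[i] = acc
    return list(zip(forward, backward))
```

Note how identical inputs get different codes depending on position, which is exactly the property entity recognition needs.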
In a specific embodiment, the server may map each word in the training text into the feature space through an embedding algorithm to obtain a corresponding feature representation. In another specific embodiment, the server may first perform word segmentation on the text to obtain a plurality of words corresponding to the text, and then map each word into the feature space through an embedding algorithm. The word segmentation method may be, for example, the forward maximum matching method, the reverse maximum matching method, the shortest-path word segmentation method, the bidirectional maximum matching method, the word-meaning word segmentation method, or the statistical word segmentation method. For example, assuming the training text is "today is sunday", the segmentation result may be "today/is/sunday".
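The forward maximum matching method mentioned above can be sketched as a greedy longest-match scan. A minimal illustration, assuming a toy dictionary and window size (both hypothetical, not part of the method's specification):

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, greedily take the
    longest dictionary word that starts there; fall back to a single
    character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        matched = None
        # try the longest window first, shrinking toward length 1
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if candidate in dictionary:
                matched = candidate
                break
        if matched is None:
            matched = text[i]  # unknown single character
        words.append(matched)
        i += len(matched)
    return words

# toy dictionary; real Chinese segmentation scans characters the same way
vocab = {"today", "is", "sunday", "sun"}
print(forward_max_match("todayissunday", vocab, max_len=6))  # → ['today', 'is', 'sunday']
```

The reverse maximum matching method is the same scan started from the end of the string; the bidirectional method runs both and picks the better result.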
And step 206, identifying the entity of the training text according to the training code sequence and the entity identification network to obtain an entity identification result.
The entity identification network refers to a machine learning module for Named Entity Recognition (NER). The entity recognition network may employ a CRF (Conditional Random Field) network, an LSTM (Long Short-Term Memory) network, an HMM (Hidden Markov Model), or the like.
An entity is a text element, usually a word or phrase, that has a characteristic meaning in the training text. For example, in the medical-domain text "the patient has had paroxysmal cough and expectoration for more than 50 years; the sputum is white and thick, easy to expectorate and small in amount, with chest distress and shortness of breath after activity", the elements "paroxysmal cough", "expectoration", "white sputum", "thick", "easy to expectorate", "small amount", "chest distress" and "shortness of breath" are all entities.
Specifically, the server inputs the training code sequence into an entity recognition network, entity recognition is performed through the entity recognition network to obtain an entity recognition result, and according to the entity recognition result, the server can determine a plurality of prediction entities corresponding to the training text.
In one embodiment, the entity recognition result may be a category identifier characterizing the category to which each text element in the training text belongs. For example, assume the entity types include two categories, Person and Organization; the label set then includes five categories: B-Person (beginning of a person name), I-Person (inside of a person name), B-Organization (beginning of an organization name), I-Organization (inside of an organization name) and O (non-entity information). If the training text X is composed of text elements w1, w2 and w3, the entity recognition result for w1 may be represented by the vector (1,0,0,0,0), from which it can be determined that w1 belongs to the category B-Person.
In another embodiment, the entity recognition result may be a probability or score characterizing the entity category of each text element: the greater the probability or score for a category, the more likely the text element belongs to it. Continuing the example above, for text element w1 the entity recognition result may be 1.5 (B-Person), 0.9 (I-Person), 0.1 (B-Organization), 0.08 (I-Organization) and 0.05 (O), from which it can be determined that w1 belongs to the category B-Person.
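The decision step in both variants — taking the category with the highest score or probability for each element — can be sketched as follows (the category list and the scores come from the example; the helper name is illustrative):

```python
def decode_scores(score_rows, categories):
    """For each text element, predict the category with the highest
    score or probability in its entity recognition result row."""
    return [categories[max(range(len(categories)), key=row.__getitem__)]
            for row in score_rows]

categories = ["B-Person", "I-Person", "B-Organization", "I-Organization", "O"]
scores_w1 = [1.5, 0.9, 0.1, 0.08, 0.05]  # scores for w1 from the example
print(decode_scores([scores_w1], categories))  # → ['B-Person']
```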
And step 208, identifying the entity relationship of the training text according to the entity identification result and the relationship classification network to obtain a relationship identification result.
A relationship classification network refers to a machine learning module for classifying relationships between entities to identify relationships between entities. The relational classification network may be, for example, CNN (Convolutional Neural Networks).
In one embodiment, identifying the entity relationship of the training text according to the entity identification result and the relationship classification network, and obtaining the relationship identification result includes: determining at least two training entities corresponding to the training texts according to the entity recognition result; determining a corresponding training entity pair according to at least two training entities; determining a training code segment corresponding to a training entity pair from a training code sequence; and inputting the training code segment into a relation classification network for relation identification to obtain a corresponding relation identification result of the training entity pair.
It can be understood that the entity relationship refers to a relationship between two entities, and the server obtains an entity identification result, determines at least two predicted entities corresponding to the training text according to the entity identification result, pairs the predicted entities, and then performs classification and identification on the relationship between the entity pairs to obtain a relationship identification result.
Because the relationship between two entities is usually contained in the text between them, the server can determine, from the training code sequence, the codes corresponding to the entity pair and its intermediate text to obtain the code segment corresponding to the entity pair, and classify the relationship between the pair through this code segment to obtain the relationship identification result.
In one embodiment, the server may pair the at least two predicted entities two by two to obtain a plurality of entity pairs. In another embodiment, since not all entities have relationships with each other, the server may first determine one or more main entities and then pair each main entity with the other entities to obtain a plurality of entity pairs. A main entity here is an entity whose relationships with other entities in the training text need to be recognized. The server determines different main entities for different application scenarios. For example, when the training text is a drug instruction sheet, the entity relationships are the relationships between the drug and other entities, and the main entity is the drug name. For another example, when the training text is a disease diagnosis text, the entity relationships are the relationships between a disease and other entities, and the main entity may be a disease name.
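The two pairing strategies can be sketched as follows (the entity strings are hypothetical medical examples, not from the patent's data):

```python
def build_entity_pairs(entities, main_entities=None):
    """Form entity pairs for relation classification. With no main
    entities given, pair every two entities; otherwise pair each main
    entity with every other (non-main) entity."""
    if main_entities is None:
        return [(entities[i], entities[j])
                for i in range(len(entities))
                for j in range(i + 1, len(entities))]
    return [(m, e) for m in main_entities
            for e in entities if e not in main_entities]

ents = ["aspirin", "headache", "fever"]
print(build_entity_pairs(ents))                         # all pairwise combinations
print(build_entity_pairs(ents, main_entities=["aspirin"]))  # drug paired with the rest
```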
Step 210, jointly training the coding network, the entity recognition network and the relationship classification network based on the entity recognition result, the relationship recognition result and the label data to update the coding network, the entity recognition network and the relationship classification network.
Specifically, the label data includes an entity label and a relationship label. The server may determine a first loss function according to the difference between the entity identification result and the entity label, and a second loss function according to the difference between the relationship identification result and the relationship label; in each iteration it determines, according to a preset deep learning optimization algorithm, the first gradient of the first loss function and the second gradient of the second loss function, and superimposes the two to obtain a comprehensive gradient. The deep learning optimization algorithm may be Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent (MBGD), or an adaptive algorithm such as RMSProp (Root Mean Square Propagation) or Adam (Adaptive Moment Estimation). The server back-propagates the comprehensive gradient to the coding network, the entity recognition network and the relationship classification network to update their network parameters, realizing joint training of the three networks; the current round of training ends when the preset training stop condition is met, yielding the updated coding network, entity recognition network and relationship classification network.
Using the stochastic gradient descent method as an example, assume L1 and L2 are the first and second loss functions respectively; f1(x, Θ_adapt1) denotes the actual output of the entity recognition network for input x, where Θ_adapt1 are the network parameters of the entity recognition network and y1 is its entity label value; f2(x, Θ_adapt2) denotes the actual output of the relationship classification network for input x, where Θ_adapt2 are the network parameters of the relationship classification network and y2 is its relationship label value; and Θ_adapt3 are the network parameters of the coding network. The training data set contains n training samples {x^(1), …, x^(n)}, where sample x^(i) has output target value y1^(i) for the entity recognition network and y2^(i) for the relationship classification network. The comprehensive descent gradient of each iteration is then

g = ∇_Θ [ L1(f1(x^(i), Θ_adapt1), y1^(i)) + L2(f2(x^(i), Θ_adapt2), y2^(i)) ]

Assuming the learning rate of the stochastic gradient descent algorithm is η, after back propagation the network parameters of the entity recognition network are updated to Θ_adapt1 − ηg, those of the relationship classification network to Θ_adapt2 − ηg, and those of the coding network to Θ_adapt3 − ηg. The updated network parameters are taken as the current network parameters and iteration continues until the preset training stop condition is reached, at which point the current training ends. The training stop condition may be that the comprehensive loss value reaches a preset minimum, or that the performance of the text processing network shows no significant improvement over a preset number of consecutive iterations.
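A minimal sketch of one joint update step, assuming toy scalar "networks" (a shared encoder parameter standing in for Θ_adapt3, one parameter per head, squared-error losses) rather than the real architecture; it only illustrates how the superimposed gradient flows back into the shared encoder:

```python
def joint_sgd_step(theta1, theta2, theta3, x, y1, y2, eta):
    """One SGD step on toy scalar networks: encoder h = theta3*x,
    entity head f1 = theta1*h, relation head f2 = theta2*h, squared
    losses; the encoder receives the sum (superposition) of both
    heads' gradients."""
    h = theta3 * x
    e1 = theta1 * h - y1                               # entity-head error
    e2 = theta2 * h - y2                               # relation-head error
    g1 = 2 * e1 * h                                    # dL1/dtheta1
    g2 = 2 * e2 * h                                    # dL2/dtheta2
    g3 = 2 * e1 * theta1 * x + 2 * e2 * theta2 * x     # superimposed encoder gradient
    return theta1 - eta * g1, theta2 - eta * g2, theta3 - eta * g3

t1, t2, t3 = joint_sgd_step(0.5, 0.5, 0.5, x=1.0, y1=1.0, y2=1.0, eta=0.05)
```

Both heads' losses pull on the encoder parameter, which is what lets the entity recognition network and the relationship classification network share the coding network's parameters during joint training.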
And 212, updating the training code sequence according to the updated code network and the relationship recognition result, returning to the step of recognizing the entity of the training text according to the training code sequence and the entity recognition network for iterative training, and obtaining a trained target code network, a target entity recognition network and a target relationship classification network when an iteration stop condition is met.
Specifically, the server inputs the training samples into the updated coding network, codes the training samples through the updated coding network to obtain a coding result, fuses the relationship identification result with the obtained coding result to update the training coding sequence, returns to step 206 to start a second iteration training, and obtains a trained target coding network, a trained target entity identification network and a trained target relationship classification network when an iteration stop condition is met. The iteration stop condition may be that the number of iterations reaches a preset number, and the preset number may be set as needed, for example, may be set to 3 times. The obtained target coding network, the target entity recognition network and the target relation classification network can be used for cooperatively recognizing the entity relation of the text to be processed.
In an embodiment, the step of fusing the relationship identification result with the obtained encoding result may specifically be to horizontally splice the relationship identification result with the encoding result. For example, assume the training text Y includes five text elements w1, w2, w3, w4 and w5, where the codes corresponding to w1 and w3 are k-dimensional vectors, denoted (a1, a2, …, ak) and (b1, b2, …, bk) respectively, and the relationship identification result vector corresponding to w1 and w3 is a j-dimensional vector (c1, c2, …, cj). Horizontally splicing (c1, c2, …, cj) onto (a1, a2, …, ak) and onto (b1, b2, …, bk) then yields two vectors of dimension k + j, denoted (a1, a2, …, ak, c1, c2, …, cj) and (b1, b2, …, bk, c1, c2, …, cj) respectively.
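The horizontal splice is plain vector concatenation; a minimal sketch with hypothetical values:

```python
def splice(code_vec, relation_vec):
    """Horizontally splice a k-dimensional element code with a
    j-dimensional relation identification vector into one (k+j)-dim
    vector."""
    return code_vec + relation_vec  # list concatenation

a = [0.1, 0.2, 0.3]   # code of w1, k = 3
c = [0.9, 0.8]        # relation identification result, j = 2
print(splice(a, c))   # → [0.1, 0.2, 0.3, 0.9, 0.8]
```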
In the text processing method, the training text is encoded through the coding network to obtain a training coding sequence corresponding to the training text; the entities of the training text are identified according to the training coding sequence and the entity recognition network to obtain an entity recognition result; the entity relationships of the training text are identified according to the entity recognition result and the relationship classification network to obtain a relationship recognition result; and the coding network, the entity recognition network and the relationship classification network are jointly trained based on the entity recognition result, the relationship recognition result and the label data to obtain the updated networks. Because of the joint training, the entity recognition network and the relationship classification network share the network parameters of the coding network, so the dependence between the two networks is captured during training and the resulting networks are more accurate. Further, the three networks undergo multiple rounds of iterative training, with the training coding sequence updated in each round according to the updated coding network and the relationship recognition result; when the iteration stop condition is met, the trained target coding network, target entity recognition network and target relationship classification network are obtained. The correlation between the entity recognition network and the relationship classification network can thus be fully considered, further improving the accuracy of both.
In one embodiment, before obtaining the training text and the label data corresponding to the training text, the method further includes: performing word segmentation processing on the training text to obtain a corresponding text element set; matching the text element set with an entity dictionary, and determining at least two first entity labels corresponding to the training text according to a matching result; the entity dictionary is obtained according to a pre-constructed knowledge graph; and determining entity relationships corresponding to the at least two first entity labels according to the knowledge graph to obtain first relationship labels corresponding to the training text.
Specifically, the server first performs word segmentation on the training text and forms the resulting words into a text element set. The word segmentation method may be, for example, the forward maximum matching method, the reverse maximum matching method, the shortest-path word segmentation method, the bidirectional maximum matching method, the word-meaning word segmentation method, or the statistical word segmentation method. For example, assuming the training text is "today is sunday", the segmentation result may be "today/is/sunday".
The server acquires a preset entity dictionary, matches each text element in the text element set with each entity in the entity dictionary respectively, and determines the text element as a first entity label when any word in the text element is successfully matched with the entity in the entity dictionary. The entity dictionary is obtained according to a pre-constructed knowledge graph, and the entities of all nodes in the knowledge graph are extracted to obtain the entity dictionary.
In one embodiment, in the matching process, the similarity between the words in the text element and the entities in the entity dictionary may be calculated, and when the similarity is greater than a preset threshold, it is determined that the text element and the entities are successfully matched. The similarity calculation may adopt string similarity or cosine similarity. In another embodiment, the server may first normalize the text elements in the text element set, i.e., unify the words in the original element set into a standard word expression, e.g., normalize "diarrhea" and "belly-pulling" into "diarrhea".
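A sketch of the threshold-based dictionary matching described above, with Python's stdlib `difflib` ratio standing in for whichever string similarity the implementation actually uses (words and threshold are illustrative):

```python
from difflib import SequenceMatcher

def match_entities(elements, entity_dict, threshold=0.8):
    """Match each text element against the entity dictionary; keep an
    element as a first entity label when its string similarity to some
    dictionary entry reaches the threshold."""
    labels = []
    for el in elements:
        for ent in entity_dict:
            if SequenceMatcher(None, el, ent).ratio() >= threshold:
                labels.append(el)
                break
    return labels

# "diarrhoea" is close enough to "diarrhea" to match; "sunday" is not
print(match_entities(["diarrhea", "sunday", "diarrhoea"], {"diarrhea", "cough"}))
```

In practice the elements would be normalized first (e.g. mapping colloquial variants to a standard term), which reduces the reliance on fuzzy similarity.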
Further, after the first entity labels are obtained, the entity relationships corresponding to them need to be determined. Since the first entity labels are determined by matching against the entity dictionary, which is derived from the knowledge graph, and the knowledge graph records the entity relationships for the entities in the dictionary, the server can match the first entity labels to nodes in the knowledge graph. The entity relationships corresponding to the first entity labels can then be determined from the entity relationships in the knowledge graph, yielding the first relationship labels corresponding to the training text.
In the embodiment, the first entity label is determined by matching the text element set with the entity dictionary, the entity and the entity relationship in the training text can be extracted by using the pre-constructed knowledge graph, and the extracted entity and entity relationship are used as the label data of the training text, so that a small amount of high-quality labeling data can be obtained, and the time cost and the labor cost of manual labeling are reduced.
In one embodiment, as shown in fig. 3, before obtaining the training text, the method further includes:
step 302, performing word segmentation processing on the training text to obtain a corresponding text element set.
Step 304, matching the text element set with an entity dictionary, and determining at least two first entity labels corresponding to the training text according to a matching result; the entity dictionary is obtained according to a pre-constructed knowledge graph.
And step 306, determining the entity relationship corresponding to the at least two first entity labels according to the knowledge graph to obtain the first relationship labels corresponding to the training text.
And 308, determining candidate text elements from the text element set, wherein the candidate text elements are text elements except for the at least two first entity labels.
Step 310, calculating the word solidity of the candidate text element.
The candidate text elements are the text elements other than the at least two first entity labels. Because the entity dictionary is constructed from the pre-constructed knowledge graph, the number of entities it contains is limited and some new words may not be included; a new-word discovery algorithm can therefore be used to extract entities according to the word solidification degree. Taking a candidate text element consisting of the three characters A, B and C as an example, the word solidification degree is calculated as in formula (1):

solidity(ABC) = min( P(ABC) / (P(A)·P(BC)), P(ABC) / (P(AB)·P(C)) )   (1)

where P(ABC) is the probability that the characters A, B and C co-occur, P(A) is the probability that character A occurs alone, P(BC) is the probability that characters B and C co-occur, P(AB) is the probability that characters A and B co-occur, and P(C) is the probability that character C occurs alone. The larger either ratio is, the less independent the three characters are — that is, the more strongly A, B and C are correlated and the more likely they are to appear together, meaning the internal solidity of the word is high. The minimum of the two ratios is taken as the word solidification degree of the candidate text element.
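Formula (1) can be computed from raw n-gram counts as follows (the counts, corpus size, and helper name are hypothetical):

```python
def solidity(counts, total, s):
    """Word solidification degree of a three-character candidate s = ABC:
    the minimum over both split points of P(ABC) / (P(left) * P(right)),
    with each probability estimated as count / total."""
    p = lambda t: counts.get(t, 0) / total
    if p(s) == 0:
        return 0.0
    a, b, c = s[0], s[1], s[2]
    return min(p(s) / (p(a) * p(b + c)),
               p(s) / (p(a + b) * p(c)))

counts = {"ABC": 4, "A": 5, "BC": 4, "AB": 4, "C": 6}  # toy corpus counts
print(solidity(counts, total=100, s="ABC"))  # min(20.0, ~16.67) → ~16.67
```

A candidate whose solidity exceeds the preset threshold would then be kept as a second entity label.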
And step 312, when the calculated word freezing degree exceeds a preset threshold, determining the candidate text element as a second entity label corresponding to the training text.
Specifically, the server judges whether the word freezing degree of the candidate text element exceeds a preset threshold, and if so, determines that the candidate text element is a second entity label corresponding to the training text, so that all entities in the training text can be extracted as comprehensively as possible.
And step 314, acquiring the entity relationship corresponding to the second entity label to obtain a second relationship label corresponding to the training text.
The entity relationships corresponding to the second entity labels include two types: the entity relationships among the second entity labels themselves, and the entity relationships between the second entity labels and the first entity labels.
Specifically, the server may obtain a pre-trained relationship extraction model, determine an entity label pair according to the first entity label and the second entity label, input the intermediate text corresponding to the entity label pair and the entity label pair into the pre-trained relationship extraction model, and obtain a relationship corresponding to the entity label as a second relationship label corresponding to the training text.
In the embodiment, by combining the ways of entity dictionary matching and word solidification calculation, the entities in the training text can be extracted as comprehensively as possible, so that the obtained training text can be better used in the training process, and the training accuracy is improved.
In one embodiment, as shown in fig. 4, the method further includes:
and 402, acquiring a candidate text, and coding the candidate text through a coding network to obtain a coding sequence corresponding to the candidate text.
And step 404, identifying the entity of the candidate text according to the coding sequence corresponding to the candidate text and the entity identification network to obtain an entity identification result corresponding to the candidate text.
And step 406, identifying the entity relationship of the candidate text according to the entity identification result corresponding to the candidate text and the relationship classification network to obtain a relationship identification result corresponding to the candidate text.
The candidate texts are unlabeled texts. After the server obtains a candidate text, it encodes the text through the coding network to obtain the corresponding coding sequence, and inputs the coding sequence into the entity recognition network to obtain an entity recognition result for the candidate text. Then, for each entity pair determined from the entity recognition result, the corresponding code segment is determined from the coding sequence and input into the relationship classification network for relation recognition, yielding a relationship recognition result.
And step 408, when the candidate text is judged to be the uncertainty sample according to at least one of the entity identification result corresponding to the candidate text and the relation identification result corresponding to the candidate text, sending the candidate text to a preset terminal.
Uncertainty samples are the samples about whose classification the model is least certain. Selecting the most uncertain samples as training samples works because low-certainty data is typically data that is hard to classify — data near the decision boundary — and by observing such data the active learning algorithm in this embodiment of the application can learn more about the boundary.
In the active learning process, when the server judges that a candidate text is an uncertainty sample according to at least one of its entity recognition result and its relationship recognition result, the server sends the candidate text to a preset terminal, i.e., a terminal through which manual annotation can be provided. Specifically, the server may use uncertainty measures such as smallest-margin uncertainty (SMU), least confidence uncertainty (LCU) or largest-margin uncertainty (LMU) as the basis for actively selecting samples to be labeled.
Taking the smallest-margin uncertainty as an example, it reflects the gap between the best and second-best probabilities: the probability of the most likely category minus the probability of the second most likely category. The meaning behind this value is: if the probability of the most likely category is significantly greater than that of the second most likely category, the classifier is very certain which category the data belongs to; conversely, if the two probabilities are close, the classifier is much less certain.
Specifically, the calculation formula of the minimum margin uncertainty is as follows (2):
φ_SM(x) = P_θ(y1 | x) − P_θ(y2 | x)   (2)

where P_θ(y1 | x) is the probability of the most likely category and P_θ(y2 | x) is the probability of the second most likely category. Referring to formula (2), the server can calculate the smallest-margin uncertainty over the probabilities that each text element in the entity recognition result belongs to each entity category, and over the probabilities that each entity pair's relationship in the relationship recognition result belongs to each relationship category; when any calculated margin is smaller than a preset threshold, the candidate text is determined to be an uncertainty sample.
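A sketch of the smallest-margin computation in formula (2) and the uncertainty-sample decision (the threshold value is illustrative):

```python
def smallest_margin_uncertainty(probs):
    """Formula (2): probability of the most likely category minus the
    probability of the second most likely category; a small margin
    means the classifier is uncertain about this prediction."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def is_uncertain(prob_rows, threshold=0.2):
    """A candidate text counts as an uncertainty sample when any of its
    per-element (or per-pair) margins falls below the threshold."""
    return any(smallest_margin_uncertainty(row) < threshold for row in prob_rows)

confident = [0.9, 0.05, 0.03, 0.02]    # large margin → certain
borderline = [0.45, 0.40, 0.10, 0.05]  # tiny margin → send for annotation
print(is_uncertain([confident]), is_uncertain([borderline]))
```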
And step 410, when receiving entity labels and relationship labels corresponding to the candidate texts returned by the preset terminal, continuing to train the target coding network, the target entity recognition network and the target relationship classification network according to the candidate texts.
Specifically, a user of the preset terminal (for example, an expert or other authority) performs entity labeling and relationship labeling on the candidate text. The preset terminal takes the resulting annotations as the entity labels and relationship labels of the candidate text and sends them to the server. Because the candidate text now has corresponding label data and can be used for supervised training of the entity recognition network and the relationship classification network, the server can continue training the target coding network, target entity recognition network and target relationship classification network with the candidate text as a training sample, updating their network parameters; the training ends when the training stop condition is met. It can be appreciated that as training samples continue to accumulate, the accuracy of the text processing network gradually converges.
In the embodiment, through an active learning mode, only uncertain samples need to be selected for labeling, all samples do not need to be labeled, the labor cost of manual labeling is reduced, and because the uncertain samples are data close to a classification boundary generally and the information amount is large, the uncertain samples are manually labeled, high-quality training samples can be obtained, and the generalization performance of the model can be improved by performing model training through the high-quality training samples.
In one embodiment, an encoding network includes a feature mapping layer and an encoding layer; the training text is composed of a plurality of ordered text elements; the training text is coded through a coding network, and the training code sequence corresponding to the training text is obtained by the following steps: performing feature mapping on each text element through a feature mapping layer to obtain feature representations corresponding to the text elements; inputting each feature representation into a coding layer to obtain element coding features corresponding to each text element; and obtaining a training coding sequence corresponding to the training text according to the element coding characteristics corresponding to each text element.
Feature mapping refers to mapping text elements into a feature space of fixed dimension to obtain fixed-length feature vectors, which are the feature representations corresponding to the text elements. The text elements of the training text may be the single characters composing the text, or the segmented words obtained by performing word segmentation on the text.
Specifically, the server inputs each text element into the feature mapping layer of the coding network, which maps the text element into a fixed-dimension feature space using a preset feature mapping algorithm to obtain the corresponding feature representation; when performing feature mapping, the server may use an embedding method. The main purpose of embedding is to reduce the dimensionality of (sparse) features. The dimension reduction can be thought of as a fully connected layer without an activation function: the weight matrix of the embedding layer reduces the dimension, converting a large sparse vector into a low-dimensional space in which semantic relationships are preserved. The server can thus map text elements to fixed-dimension feature vectors, realizing feature extraction for the text elements.
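A minimal sketch of embedding as table lookup — equivalent to multiplying a one-hot vector by a weight matrix with no activation (the vocabulary, dimension, and random initialization are all illustrative):

```python
import random

random.seed(0)
vocab = {"today": 0, "is": 1, "sunday": 2}
dim = 4
# |V| x dim embedding table, i.e. the weight matrix of the embedding layer
table = [[random.gauss(0, 1) for _ in range(dim)] for _ in vocab]

def embed(elements):
    """Look up each text element's row of the table: this is the
    one-hot-times-weight-matrix product done directly by index."""
    return [table[vocab[e]] for e in elements]

reps = embed(["today", "is", "sunday"])
print(len(reps), len(reps[0]))  # → 3 4
```

In a trained network the table entries are learned parameters rather than random draws; the lookup itself is unchanged.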
After the feature representation corresponding to each text element is obtained, the server inputs the feature representations into the coding layer, which encodes each feature representation to obtain the element coding feature corresponding to each text element. The coding layer may specifically adopt a network structure such as a BLSTM (Bi-directional Long Short-Term Memory) network, an LSTM (Long Short-Term Memory) network, or a CNN (Convolutional Neural Network).
And after the element coding features corresponding to the text elements are obtained, forming a coding sequence by the element coding features according to the sequence of the corresponding text elements in the training text, and obtaining the training coding sequence corresponding to the training text.
In one embodiment, inputting each feature representation into the coding layer, and obtaining the element coding features corresponding to each text element includes: coding each feature representation according to the sequence of the text elements corresponding to each feature representation and the forward direction to obtain the forward coding feature corresponding to each text element; coding each feature representation according to the sequence of the text elements corresponding to each feature representation and the backward direction to obtain backward coding features corresponding to each text element; and respectively fusing the forward coding features and the backward coding features corresponding to the text elements to obtain the element coding features corresponding to the text elements.
Specifically, when encoding at the coding layer, the server may encode each feature representation in the forward direction according to the sequence of the corresponding text elements to obtain the forward coding feature corresponding to each text element; the obtained forward coding feature contains the forward information of the training text. The server further encodes each feature representation in the backward direction according to the sequence of the corresponding text elements to obtain the backward coding feature corresponding to each text element; the obtained backward coding feature contains the backward information of the training text. The server then fuses the forward and backward coding features corresponding to each text element to obtain the element coding feature corresponding to that text element; since the obtained element coding feature contains both forward and backward information, it expresses the features of the text element better. Fusion here refers to expressing multiple features with one feature, and may be implemented by splicing, combining, and the like.
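The forward/backward encoding and splicing fusion described above can be sketched as follows; the simple tanh recurrence stands in for the actual recurrent cells and is only illustrative.

```python
import numpy as np

def bidirectional_encode(features, step):
    """Encode a feature sequence in both directions and fuse each
    position's forward and backward states by concatenation (splicing)."""
    def run(seq):
        h, states = np.zeros_like(seq[0]), []
        for x in seq:
            h = step(x, h)              # recurrent update carries context
            states.append(h)
        return states
    fwd = run(features)                 # left-to-right: forward information
    bwd = run(features[::-1])[::-1]     # right-to-left: backward information
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

step = lambda x, h: np.tanh(x + 0.5 * h)   # toy recurrence, not an LSTM
encoded = bidirectional_encode([np.ones(4), np.zeros(4), -np.ones(4)], step)
```

Each fused element coding feature is twice the width of a single-direction state, since the two directional features are simply spliced.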
In one embodiment, the entity identification network includes a decoding layer and a classification layer; the method for recognizing the entity of the training text according to the training code sequence and the entity recognition network to obtain the entity recognition result comprises the following steps: decoding the element coding features according to the sequence of the text elements corresponding to the element coding features at a decoding layer to obtain the element decoding features corresponding to the text elements; and in the classification layer, performing entity classification processing on each element decoding characteristic to obtain an entity identification result.
Specifically, the server inputs each element coding feature into a decoding layer, and in the decoding layer, the element coding features are decoded according to the sequence of the text elements corresponding to each element coding feature to obtain the element decoding features corresponding to each text element, wherein the decoding is the inverse process of the coding. In one particular embodiment, the decoding layer may employ an LSTM network.
After the element decoding features are obtained, the server further inputs them into the classification layer, where entity classification processing is performed on each element decoding feature to obtain the entity recognition result; at least two prediction entities corresponding to the training text can be determined according to the entity recognition result. In one particular embodiment, the classification layer may employ the softmax algorithm for classification.
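A minimal sketch of softmax classification over element decoding features; the tag names here are placeholders, not the label set used in this application.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_elements(decoded, tags):
    """Map each element decoding feature to its highest-scoring entity tag."""
    probs = softmax(decoded)
    return [tags[i] for i in probs.argmax(axis=-1)]

tags = ["B-ENT", "I-ENT", "O"]             # hypothetical tag set
decoded = np.array([[2.0, 0.1, 0.3],
                    [0.2, 1.5, 0.1],
                    [0.0, 0.0, 3.0]])
predicted = classify_elements(decoded, tags)
```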
In one embodiment, the relationship classification network includes a hidden layer, a convolutional layer and a pooling layer. Inputting the training code segment into the relationship classification network for relationship recognition to obtain the relationship recognition result corresponding to the training entity pair includes: inputting the training code segment into the hidden layer, and processing it through the hidden layer to obtain the corresponding first intermediate feature; inputting the first intermediate feature into the convolutional layer, and performing convolution processing on it through the convolutional layer to obtain the corresponding second intermediate feature; inputting the second intermediate feature into the pooling layer, and performing pooling processing on it through the pooling layer; and performing entity relationship classification according to the pooling result to obtain the relationship recognition result.
Specifically, the server inputs the training code segment into the hidden layer, which abstracts it into another dimensional space to obtain the first intermediate feature. Compared with the original training code segment, the first intermediate feature expresses more abstract features, making the data easier to separate linearly and facilitating the subsequent classification processing.
The server then inputs the first intermediate feature into the convolutional layer, where convolution processing is performed on it; the convolution process can be understood as a feature extraction process, and the second intermediate feature is obtained after the convolution processing.
The server then inputs the second intermediate features into the pooling layer, where they are pooled to reduce the amount of computation during classification by reducing the dimensionality of the features. A pooling function, such as max pooling or average pooling, further processes the feature maps produced by the convolution operation. Pooling statistically summarizes the feature values of a position and its adjacent positions within a plane, and uses the summarized result as the value of that position. Max pooling takes the maximum value within a position and its adjacent matrix area as the value of that position, while average pooling takes the average value of that area. Pooling does not change the depth of the data matrix; it only reduces the height and the width, thereby achieving dimension reduction.
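The max and average pooling described above can be sketched as non-overlapping window reductions; note that the channel (depth) dimension is untouched, and only height and width shrink.

```python
import numpy as np

def pool2d(x, k, mode="max"):
    """Pool non-overlapping k x k windows of an (H, W, C) feature map.
    Depth C is unchanged; height and width are divided by k."""
    h, w, c = x.shape
    x = x[: h - h % k, : w - w % k]            # drop ragged edges, if any
    x = x.reshape(h // k, k, w // k, k, c)
    return x.max(axis=(1, 3)) if mode == "max" else x.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4, 1)  # 4x4 map, 1 channel
```

For the 2x2 window containing 0, 1, 4, 5, max pooling keeps 5 while average pooling keeps 2.5; either way the 4x4x1 map becomes 2x2x1.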
And finally, the server can classify the entity relation according to the pooling result to obtain an entity identification result. In one embodiment, the server may input the pooling result into the full connection layer, and implement entity relationship classification through the full connection layer to obtain the relationship identification result.
In a specific embodiment, referring to FIG. 5, a schematic diagram of a text processing network in a specific embodiment is shown. The following describes the text processing method provided by the present application in detail with reference to fig. 5, where the text processing method provided by the embodiment of the present application includes three rounds of iterative training.
1) First round of training:
referring to FIG. 5, the training text includes five text elements w1, w2, w3, w4, w5. The server first inputs the training text into the coding network, which comprises an input layer, an embedding layer and a coding layer. At the input layer, the server inputs w1, w2, w3, w4, w5 into the embedding layer, and the embedding layer performs feature mapping on w1, w2, w3, w4, w5 through an embedding method to obtain the corresponding feature representations e1, e2, e3, e4, e5. The server then inputs the feature representations e1, e2, e3, e4, e5 into the coding layer for coding, obtaining the element coding features h1, h2, h3, h4, h5 corresponding to the text elements; h1, h2, h3, h4, h5 form the element coding sequence. The coding layer adopts a BLSTM network for bidirectional coding; a schematic diagram of the BLSTM network is shown in fig. 6.
As shown in FIG. 6, the BLSTM network includes a forward coding layer and a backward coding layer. For the input e(i-1), e(i), e(i+1), the forward coding layer performs forward coding to obtain the forward hidden states →h(t-1), →h(t), →h(t+1), and the backward coding layer performs backward coding to obtain the backward hidden states ←h(t-1), ←h(t), ←h(t+1). Splicing →h(t-1) and ←h(t-1) yields the coding feature h(t-1) corresponding to e(i-1); splicing →h(t) and ←h(t) yields the coding feature h(t) corresponding to e(i); and splicing →h(t+1) and ←h(t+1) yields the coding feature h(t+1) corresponding to e(i+1).
The forward coding layer and the backward coding layer of the BLSTM network are both LSTM networks. The LSTM network is specifically implemented by the following equations (3) to (6):
ft=σ(Wfxt+Ufht-1+bf) (3)
it=σ(Wixt+Uiht-1+bi) (4)
ot=σ(Woxt+Uoht-1+bo) (5)
ct=ft·ct-1+it·σ(Wcxt+Ucht-1+bc) (6)
wherein [Wf, Uf, bf, Wi, Ui, bi, Wo, Uo, bo] are the corresponding weights and are the network model parameters of the LSTM; ft is the output of the forget gate at time t in the LSTM network, σ is the sigmoid function, xt is the input at the current moment, and ht-1 is the output of the LSTM network at time t-1; it is the output of the input gate in the LSTM network; ot is the output of the output gate in the LSTM network; ct is the cell memory at time t, and ct-1 is the cell memory at time t-1.
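Equations (3) to (6) can be implemented directly as one LSTM time step. The hidden-state update h_t = o_t · tanh(c_t) is the standard LSTM output and is added here for completeness as an assumption, since it is not listed among equations (3)-(6).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following equations (3)-(6); p maps parameter
    names (Wf, Uf, bf, ...) to the corresponding weight arrays."""
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # (3) forget gate
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # (4) input gate
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # (5) output gate
    c_t = f_t * c_prev + i_t * sigmoid(
        p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])             # (6) cell memory
    h_t = o_t * np.tanh(c_t)  # standard hidden-state update (assumption, see lead-in)
    return h_t, c_t

d = 3
rng = np.random.default_rng(1)
p = {name: rng.normal(scale=0.1, size=(d, d)) for name in
     ("Wf", "Uf", "Wi", "Ui", "Wo", "Uo", "Wc", "Uc")}
p.update({name: np.zeros(d) for name in ("bf", "bi", "bo", "bc")})
h, c = lstm_step(np.ones(d), np.zeros(d), np.zeros(d), p)
```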
With continued reference to FIG. 5, the server then takes the element coding sequence as the input of the entity recognition network, i.e., x1, x2, x3, x4, x5. The entity recognition network comprises a decoding layer and a classification layer. The server inputs x1, x2, x3, x4, x5 into the decoding layer, which adopts an LSTM network and decodes x1, x2, x3, x4, x5 to obtain the element decoding features y1, y2, y3, y4, y5; the element decoding features are then input into the softmax layer for entity classification processing to obtain the entity recognition result, where FB, FI, FE, O and LU are the entity categories represented by the entity recognition result.
Continuing to refer to fig. 5, the server determines the training code segment corresponding to the entity pair from the element coding sequence output by the coding network according to the entity recognition result, and performs relationship classification through the CNN network to obtain the relationship recognition result, where r1, r2 and r3 are the scores of the entity relationship of the entity pair belonging to each relationship class. The server jointly trains the coding network, the entity recognition network and the relationship classification network based on the difference between the entity recognition result and the entity label and the difference between the relationship recognition result and the relationship label until the networks converge, ending this round of training. The specific structure of the CNN network is shown in fig. 7.
Referring to fig. 7, the CNN network includes a hidden layer, a convolutional layer, and a pooling layer, where the training code segment is input into the hidden layer, the training code segment is processed by hiding to obtain a corresponding first intermediate feature, the first intermediate feature is input into the convolutional layer, the first intermediate feature is convolved by the convolutional layer to obtain a corresponding second intermediate feature, the second intermediate feature is input into the pooling layer, the second intermediate feature is pooled by the pooling layer, and entity relationship classification is performed according to the pooling result to obtain a relationship recognition result.
2) Second round of training:
the server takes the coding network, the entity recognition network and the relationship classification network obtained by the first round of training as the current coding network, current entity recognition network and current relationship classification network, inputs the training text into the current coding network, and repeats the first round of training to obtain the training coding sequence. It then inputs r1, r2 and r3 into the activation function layer for normalization, and splices the normalized result with the entities corresponding to r1, r2 and r3 in the training coding sequence to obtain the current input x1, x2, x3, x4, x5 of the entity recognition network. The decoding layer decodes the current input to obtain the element decoding features y1, y2, y3, y4, y5, which are further input into the softmax layer for entity classification processing to obtain the entity recognition result. The server determines the training code segment corresponding to the entity pair from the element coding sequence output by the coding network according to the entity recognition result, and performs relationship classification through the CNN network to obtain the relationship recognition result, where r1, r2 and r3 are the scores of the entity relationship of the entity pair belonging to each relationship class. The server jointly trains the coding network, the entity recognition network and the relationship classification network based on the difference between the entity recognition result and the entity label and the difference between the relationship recognition result and the relationship label until the networks converge, ending this round of training.
3) Third round of training:
and the server takes the coding network, the entity recognition network and the relation classification network obtained by the second round of training as the current coding network, the current entity recognition network and the current relation classification network, repeats the steps of the second round of training until the networks are converged, and ends the whole training process to obtain the target coding network, the target entity recognition network and the target relation classification network.
Fig. 8 shows an effect diagram of the text processing method of the present application. It compares the accuracy, in entity recognition and relationship extraction, of the trained target entity recognition network and target relationship classification network with that of the separately trained entity recognition model and the separately trained relationship classification model in the related art. As can be seen from fig. 8, the accuracy of the target entity recognition network and target relationship classification network obtained by the text processing method of the present application is significantly improved when recognizing disease entities, surgery entities, labeling entities, drug entities, symptom entities, and the entity relationships related to these types of entities.
In one embodiment, the text processing method further includes: acquiring a text to be processed, and coding the text to be processed through a target coding network to obtain a coding sequence corresponding to the text to be processed; identifying the entity of the text to be processed according to the coding sequence corresponding to the text to be processed and the target entity identification network to obtain an entity identification result corresponding to the text to be processed; identifying the entity relationship of the text to be processed according to the entity identification result and the target relationship classification network to obtain a relationship identification result corresponding to the text to be processed; and establishing a knowledge base according to the entity recognition result and the relation recognition result corresponding to the text to be processed.
The text to be processed refers to the text on which entity recognition and relationship extraction are to be performed. After obtaining the text to be processed, the server encodes it through the target coding network to obtain the corresponding coding sequence, identifies the entities of the text to be processed according to this coding sequence and the target entity recognition network to obtain the corresponding entity recognition result, and identifies the entity relationships of the text to be processed according to the entity recognition result and the target relationship classification network to obtain the corresponding relationship recognition result. The server can thus determine the entities in the text to be processed according to the entity recognition result and the entity relationships according to the relationship recognition result. The server further stores the identified entities and entity relationships in a knowledge base, thereby constructing the knowledge base.
In one embodiment, as shown in fig. 9, a method for constructing a knowledge base is provided, which can be applied to the application environment shown in fig. 1. The server and the terminal can independently execute the construction method of the knowledge base, and the server and the terminal can also cooperatively execute the construction method of the knowledge base. The construction method of the knowledge base specifically comprises the following steps:
and step 902, acquiring a text to be processed, and coding the text to be processed through the trained target coding network to obtain a coding sequence corresponding to the text to be processed.
And 904, identifying the entity of the text to be processed according to the coding sequence and the trained target entity identification network to obtain an entity identification result.
Step 906, the entity relationship of the text to be processed is identified according to the entity identification result and the trained target relationship classification network, and a relationship identification result is obtained.
The target coding network, the target entity recognition network and the target relation classification network are obtained by performing iterative joint training based on a training text; and updating the training code sequence input into the entity recognition network obtained by the previous training according to the relationship recognition result output by the relationship classification network obtained by the previous training each time of iterative training.
Step 908, building a knowledge base according to the entity recognition result and the relationship recognition result.
It can be understood that details of implementation of the present embodiment may refer to the description in the above embodiments, and are not repeated herein.
In the above embodiment, after the text to be processed is obtained, it is encoded through the trained target coding network to obtain the corresponding coding sequence; the entities of the text to be processed are identified according to the coding sequence and the trained target entity recognition network to obtain the entity recognition result; the entity relationships of the text to be processed are identified according to the entity recognition result and the trained target relationship classification network to obtain the relationship recognition result; and the knowledge base is constructed according to the entity recognition result and the relationship recognition result. Because the target coding network, the target entity recognition network and the target relationship classification network are obtained by iterative joint training based on the training text, and each iteration updates the training coding sequence input into the previously trained entity recognition network according to the relationship recognition result output by the previously trained relationship classification network, the dependence and relevance between the entity recognition network and the relationship classification network are fully considered in the training process. The obtained target entity recognition network and target relationship classification network therefore have better generalization performance and can accurately recognize the entities and relationships in the text to be processed, so that the accuracy of the knowledge base constructed from the entity recognition result and the relationship recognition result is significantly improved compared with a knowledge base constructed by manual labeling in the related art.
It can be understood that, in the method for constructing a knowledge base provided in the embodiment of the present application, the target encoding network, the target entity identifying network, and the target relationship classification network may be obtained by training using a text processing method provided in any one of the above embodiments, which is not described herein again.
In one embodiment, the text to be processed is a drug specification text, and the knowledge base is a drug knowledge base. Identifying the entity relationships of the text to be processed according to the entity recognition result and the trained target relationship classification network to obtain the relationship recognition result includes: determining at least two prediction entities corresponding to the text to be processed according to the entity recognition result; determining the target drug name from the at least two prediction entities, and forming prediction entity pairs from the target drug name and each of the other prediction entities; determining the coding segment corresponding to each prediction entity pair from the coding sequence; and inputting each coding segment into the target relationship classification network for relationship recognition to obtain the relationship recognition result corresponding to each prediction entity pair.
The target drug name is the name of the drug to which the drug manual applies. For example, if a certain pharmaceutical specification is an amoxicillin capsule specification, the target drug name is amoxicillin capsule.
In this embodiment, the text to be processed is a drug specification, and the entity types obtained by performing entity recognition on it include drugs, diseases, symptoms, and the like. The relationship recognition mainly identifies the relationships between the drug to which the specification applies and the other entities. Therefore, the target drug name among the at least two prediction entities determined according to the entity recognition result can be taken as the main entity, and the main entity forms entity pairs with the other prediction entities respectively. The coding segments corresponding to the respective prediction entity pairs are then determined from the coding sequence and input into the target relationship classification network for relationship recognition, obtaining the relationship recognition result corresponding to each prediction entity pair. The relationship type corresponding to an entity pair can be determined according to the relationship recognition result; the relationship types include indications, contraindications, applicable population, and the like.
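The pairing step can be sketched as follows, using the amoxicillin example above; the entity list is illustrative, not output from an actual model.

```python
def build_prediction_pairs(entities, target_drug):
    """Pair the drug the specification applies to (the main entity)
    with every other predicted entity."""
    return [(target_drug, e) for e in entities if e != target_drug]

# Illustrative prediction entities from a specification text
entities = ["amoxicillin capsule", "pneumonia", "nausea"]
pairs = build_prediction_pairs(entities, "amoxicillin capsule")
```

Each resulting pair is then mapped to its coding segment and scored by the relationship classification network.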
Table 1 below shows an example of the knowledge extracted from the specification of one drug. The entities bacillary dysentery, chronic enteritis, intestinal infection and diarrhea can be extracted from the knowledge corresponding to applicable symptoms; the entities nausea, vomiting, dizziness, headache, blurred vision and chapped skin from the knowledge corresponding to adverse reactions; the entity powerful muscle and bone fragments from the knowledge corresponding to contraindicated medicines; the entities receptor stimulant and hypokalemia from the knowledge corresponding to complications; and the entities protein and hydrolyzed protein from the knowledge corresponding to medicinal components.
TABLE 1
Applicable symptoms: bacillary dysentery, chronic enteritis, intestinal infection, diarrhea
Adverse reactions: nausea, vomiting, dizziness, headache, blurred vision, chapped skin
Contraindicated medicines: powerful muscle and bone fragments
Complications: receptor stimulant, hypokalemia
Medicinal components: protein, hydrolyzed protein
In one embodiment, the method for constructing the knowledge base further includes: acquiring a prescription to be audited, wherein the prescription to be audited comprises a medicine name; inquiring a corresponding target entity and a target relation from a medicine knowledge base according to the medicine name; and auditing the prescription to be audited according to the inquired target entity and the target relation to obtain an auditing result.
The prescription to be audited refers to a medical prescription whose accuracy is to be reviewed; it includes information such as the drug name, drug dosage and disease diagnosis result, where the disease diagnosis result specifically includes the disease name and symptoms. After the prescription to be audited is obtained, the corresponding target entities and target relationships can be queried from the drug knowledge base according to the drug name in the prescription. Because the target entities and relationships corresponding to a drug are obtained from its specification, they cover the content of the specification, so the prescription can be audited against the queried target entities and relationships to obtain the audit result. For example, the disease in the prescription can be compared with the queried indication entities to check whether the prescribed drug is suitable for treating the current disease, and the drug dosage in the prescription can be compared with the queried dosage entities to check whether the dosage is appropriate for the current disease.
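A toy sketch of this review logic follows; the in-memory knowledge-base layout, field names and example entities are all hypothetical, and a real system would also check dosage entities as described above.

```python
# Hypothetical in-memory slice of the drug knowledge base:
# (drug, relationship type) -> set of related entities.
knowledge = {
    ("amoxicillin capsule", "indication"): {"bacterial pneumonia"},
    ("amoxicillin capsule", "contraindication"): {"penicillin allergy"},
}

def audit(prescription, kb):
    """Compare the prescription's diagnosis against the drug's
    indication and contraindication entities in the knowledge base."""
    issues = []
    drug = prescription["drug"]
    if prescription["diagnosis"] not in kb.get((drug, "indication"), set()):
        issues.append("diagnosis not among the drug's indications")
    if prescription["diagnosis"] in kb.get((drug, "contraindication"), set()):
        issues.append("diagnosis is a contraindication of the drug")
    return issues

ok = audit({"drug": "amoxicillin capsule", "diagnosis": "bacterial pneumonia"}, knowledge)
bad = audit({"drug": "amoxicillin capsule", "diagnosis": "penicillin allergy"}, knowledge)
```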
The application also provides an application scene, and the application scene applies the text processing method and the knowledge base construction method. In the application scenario, a medicine knowledge base needs to be constructed according to a medicine list of a hospital, and prescription verification is performed according to the medicine knowledge base. In the application scenario, the steps of the text processing method and the method for constructing the knowledge base can be illustrated by fig. 10A, and the application of the text processing method and the method for constructing the knowledge base in the application scenario will be described with reference to fig. 10A.
1. And acquiring the unlabeled medical short sentence text, and automatically labeling the unlabeled medical short sentence text through new word discovery and a medical dictionary to generate a training data set.
Specifically, word segmentation processing is performed on the medical phrase text to obtain a corresponding text element set, the text element set is matched with a medical dictionary, and at least two first entity labels corresponding to the medical phrase text are determined according to a matching result, wherein the medical dictionary is obtained according to a pre-constructed knowledge graph, so that an entity relationship corresponding to the at least two first entity labels can be determined according to the knowledge graph to obtain first relationship labels corresponding to the medical phrase text.
Further, candidate text elements are determined from the text element set, the candidate text elements are text elements except for the at least two first entity labels, word solidification degrees of the candidate text elements are calculated, when the calculated word solidification degrees exceed a preset threshold value, the candidate text elements are determined to be second entity labels corresponding to the medical phrase texts, entity relations corresponding to the second entity labels are obtained, and second relation labels corresponding to the medical phrase texts are obtained.
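One common formulation of the word solidification degree (internal cohesion) divides the candidate's probability by the probability of its most likely split into two parts; the application does not fix the exact formula, so this is only an illustrative sketch, and the frequency counts below are made up.

```python
def solidification_degree(candidate, freq, total):
    """Cohesion score for a candidate word of length >= 2: the ratio of the
    candidate's own probability to the best product of probabilities over
    all two-part splits. Higher scores suggest the parts 'solidify' into
    one word, so the candidate can be kept as a second entity label."""
    p_whole = freq.get(candidate, 0) / total
    p_split = max(
        (freq.get(candidate[:i], 0) / total) * (freq.get(candidate[i:], 0) / total)
        for i in range(1, len(candidate))
    )
    return p_whole / p_split if p_split else float("inf")

freq = {"ab": 20, "a": 50, "b": 30}          # made-up corpus counts
score = solidification_degree("ab", freq, total=1000)
```

A candidate whose score exceeds the preset threshold would then be accepted as a new entity label.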
2. The coding network, the entity recognition network and the relation classification network are jointly trained through the training data set, and the coding network, the entity recognition network and the relation classification network form a text processing network.
The coding network comprises an embedding layer and a BLSTM network layer; the entity identification network comprises an LSTM network layer and a softmax classification layer; the relational classification network includes a hidden layer, a convolutional layer, and a pooling layer.
The server acquires a medical phrase text and its corresponding entity labels and relationship labels from the training data set. At the embedding layer, feature mapping is performed on each text element of the medical phrase text to obtain the feature representation corresponding to each text element. At the BLSTM network layer, each feature representation is encoded in the forward direction according to the sequence of the corresponding text elements to obtain the forward coding features, and in the backward direction to obtain the backward coding features; the forward and backward coding features corresponding to each text element are fused to obtain the element coding features, from which the training coding sequence corresponding to the medical phrase text is obtained. At the LSTM network layer, the element coding features are decoded according to the sequence of the corresponding text elements to obtain the element decoding features. At the softmax classification layer, entity classification processing is performed on the element decoding features to obtain the entity recognition result; at least two training entities corresponding to the medical phrase text are determined according to the entity recognition result, the corresponding training entity pairs are determined from these training entities, and the training code segments corresponding to the training entity pairs are determined from the training coding sequence. The training code segments are input into the relationship classification network for relationship recognition to obtain the relationship recognition results corresponding to the training entity pairs. The coding network, the entity recognition network and the relationship classification network are then jointly trained based on the difference between the entity recognition result and the entity labels and the difference between the relationship recognition result and the relationship labels; the training coding sequence is updated according to the updated coding network and the relationship recognition result, and the process returns to the embedding-layer step for iterative training. When the iteration stop condition is met, the trained target coding network, target entity recognition network and target relationship classification network are obtained.
3. The server acquires the drug specification of each drug in the drug list, preprocesses the text of each drug specification, and inputs the preprocessed specification texts into the text processing network for entity recognition and relationship recognition, obtaining the entities and relationships corresponding to each specification; the drug knowledge base is then constructed from these entities and relationships.
The preprocessing of the drug specification text may specifically include subject substitution, sentence segmentation, word segmentation, and the like. Subject substitution means replacing expressions in the specification that refer to the drug, such as "this product", with the drug name. Sentence segmentation may specifically split the specification text according to its punctuation marks. After preprocessing, the obtained specification text is input into the target text processing network: it is encoded through the target coding network to obtain the corresponding coding sequence, the entities of the specification text are identified according to the coding sequence and the target entity recognition network to obtain the entity recognition result, and the entity relationships of the specification text are identified according to the entity recognition result and the target relationship classification network to obtain the relationship recognition result. The entities and relationships corresponding to the specification are determined according to the entity recognition result and the relationship recognition result, and the drug knowledge base is constructed from these entities and relationships. The constructed drug knowledge base can be used for prescription review.
4. Uncertain samples are selected through an active learning algorithm for manual labeling, and the target coding network, the target entity recognition network and the target relationship classification network are further trained on the labeled samples.
Whether a medicine specification text is an uncertain sample is judged according to its entity recognition result and relationship recognition result. If it is, the specification text is sent to the terminal of a pharmacist for entity and relationship labeling. After the pharmacist returns the labeling result, the entity and relationship data corresponding to that specification in the medicine knowledge base are updated according to the labeling result; meanwhile, the specification text is used as a training sample to continue training the target coding network, the target entity recognition network and the target relationship classification network so as to update their network parameters, and training ends when the convergence condition is met. The generalization performance of the updated target coding network, target entity recognition network and target relationship classification network is thereby further improved.
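The uncertainty test used to route samples to manual labeling can be sketched as follows; the confidence threshold and the max-probability criterion are assumptions, as the application does not specify the exact uncertainty measure:

```python
def is_uncertain(entity_probs, relation_probs, threshold=0.7):
    """A sample is uncertain when the model's most confident prediction
    for any entity or relation falls below the threshold (assumed rule)."""
    all_probs = list(entity_probs) + list(relation_probs)
    return any(max(dist) < threshold for dist in all_probs)

def select_for_labeling(samples, threshold=0.7):
    # Route uncertain samples to the pharmacist terminal for manual labeling.
    return [s for s in samples
            if is_uncertain(s["entity_probs"], s["relation_probs"], threshold)]

samples = [
    {"id": 1, "entity_probs": [[0.95, 0.05]], "relation_probs": [[0.9, 0.1]]},
    {"id": 2, "entity_probs": [[0.55, 0.45]], "relation_probs": [[0.8, 0.2]]},
]
picked = select_for_labeling(samples)
```

Other uncertainty measures (entropy, margin sampling) would slot into `is_uncertain` without changing the selection loop.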
Compared with the related art, in which the drug knowledge base is constructed by manually labeling entities and relationships in medicine specification texts, the joint learning and active learning used in this application scenario can greatly improve the construction accuracy of the knowledge base at a small manual labeling cost.
Fig. 10B is a diagram illustrating, in one embodiment, the relationship between the working days required for manually labeling medicine specifications and the scale of the knowledge base. It can be seen from fig. 10B that as the medicine knowledge base grows through continued iteration, the accuracy of the text processing network gradually converges and the time required for labeling gradually decreases.
It should be understood that although the various steps in the flowcharts of figs. 1-10 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and the steps may be performed in other orders. Moreover, at least some of the steps in figs. 1-10 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 11, there is provided a text processing apparatus 1100, which may be a part of a computer device using software modules or hardware modules, or a combination of the two, and specifically includes:
a data obtaining module 1102, configured to obtain a training text and tag data corresponding to the training text;
the coding module 1104 is configured to code the training text through a coding network to obtain a training coding sequence corresponding to the training text;
an entity identification module 1106, configured to identify an entity of the training text according to the training code sequence and the entity identification network, to obtain an entity identification result;
a relationship recognition module 1108, configured to recognize an entity relationship of the training text according to the entity recognition result and the relationship classification network, to obtain a relationship recognition result;
a joint training module 1110, configured to jointly train the coding network, the entity recognition network, and the relationship classification network based on the entity recognition result, the relationship recognition result, and the label data, so as to update the coding network, the entity recognition network, and the relationship classification network;
an iterative training module 1112, configured to update the training code sequence according to the updated coding network and the relationship recognition result, and return to the step of recognizing the entity of the training text according to the training code sequence and the entity recognition network for iterative training, and when an iteration stop condition is met, obtain a trained target coding network, a trained target entity recognition network, and a trained target relationship classification network; the target coding network, the target entity identification network and the target relation classification network are used for cooperatively identifying the entity relation of the text to be processed.
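One way to picture the work of the joint training module is a single objective whose gradients from both tasks flow into the shared coding parameters; the toy quadratic losses and the equal weighting below are illustrative assumptions, not the actual networks:

```python
def joint_training_step(w, alpha=0.5, lr=0.1):
    """One joint step on a shared scalar parameter w (toy example):
    the entity task prefers w = 1, the relation task prefers w = 2,
    and the shared coding parameter is pulled toward a compromise."""
    ner_grad = 2 * (w - 1.0)   # gradient of entity-task loss (w - 1)^2
    re_grad = 2 * (w - 2.0)    # gradient of relation-task loss (w - 2)^2
    # Both task gradients update the shared parameter in one step,
    # which is what lets the two networks constrain each other.
    return w - lr * (alpha * ner_grad + (1 - alpha) * re_grad)

w = 0.0
for _ in range(200):
    w = joint_training_step(w)
# w converges to 1.5, the equally weighted compromise of both tasks.
```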
In one embodiment, the above apparatus further comprises: the marking module is used for performing word segmentation processing on the training text to obtain a corresponding text element set; matching the text element set with an entity dictionary, and determining at least two first entity labels corresponding to the training text according to a matching result; the entity dictionary is obtained according to a pre-constructed knowledge graph; and determining entity relations corresponding to the at least two first entity labels according to the knowledge graph to obtain first relation labels corresponding to the training text.
In one embodiment, the labeling module is further configured to determine candidate text elements from the text element set, where the candidate text elements are text elements other than the at least two first entity labels; calculate the word solidity of the candidate text elements; when the calculated word solidity exceeds a preset threshold, determine the candidate text element as a second entity label corresponding to the training text; and acquire the entity relationship corresponding to the second entity label to obtain a second relationship label corresponding to the training text.
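The word solidity computation can be sketched with a common PMI-style cohesion measure; the exact formula is not given in the text, so the minimum-over-splits variant below is an assumption:

```python
def word_solidity(word, freq, total):
    """Word solidity (internal cohesion) of a candidate text element:
    the minimum, over all binary splits, of p(word) / (p(left) * p(right)).
    A high value means the parts co-occur far more often than chance,
    suggesting the candidate is a genuine term (e.g. a drug name)."""
    p = lambda w: freq[w] / total
    return min(
        p(word) / (p(word[:i]) * p(word[i:]))
        for i in range(1, len(word))
    )

# Toy corpus statistics: "ab" occurs 50 times, "a" and "b" 100 times each.
freq = {"ab": 50, "a": 100, "b": 100}
s = word_solidity("ab", freq, total=1000)
```

A candidate whose solidity exceeds the preset threshold would then be promoted to a second entity label.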
In one embodiment, the above apparatus further comprises: the active learning module is used for acquiring a candidate text and coding the candidate text through a coding network to obtain a coding sequence corresponding to the candidate text; identifying the entity of the candidate text according to the coding sequence corresponding to the candidate text and the entity identification network to obtain an entity identification result corresponding to the candidate text; identifying the entity relationship of the candidate text according to the entity identification result corresponding to the candidate text and the relationship classification network to obtain a relationship identification result corresponding to the candidate text; when the candidate text is judged to be an uncertain sample according to at least one of the entity identification result corresponding to the candidate text and the relation identification result corresponding to the candidate text, the candidate text is sent to a preset terminal; and when receiving the entity labels and the relation labels corresponding to the candidate texts returned by the preset terminal, updating the training data set according to the candidate texts.
In one embodiment, an encoding network includes a feature mapping layer and an encoding layer; the coding module is also used for carrying out feature mapping on each text element through the feature mapping layer to obtain feature representation corresponding to each text element; inputting each feature representation into a coding layer to obtain element coding features corresponding to each text element; and obtaining a training coding sequence corresponding to the training text according to the element coding characteristics corresponding to each text element.
In one embodiment, the encoding module is further configured to encode each feature representation according to a forward direction according to a sequence of the text element corresponding to each feature representation, so as to obtain a forward encoding feature corresponding to each text element; coding each feature representation according to the sequence of the text elements corresponding to each feature representation and the backward direction to obtain backward coding features corresponding to each text element; and respectively fusing the forward coding features and the backward coding features corresponding to the text elements to obtain the element coding features corresponding to the text elements.
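The forward pass, backward pass and per-element fusion can be sketched as follows; the cumulative-sum recurrence stands in for the actual coding layer and is purely illustrative:

```python
def bidirectional_encode(features, combine=lambda acc, x: acc + x):
    """Toy bidirectional coding layer: each element's forward feature
    summarizes the prefix, the backward feature summarizes the suffix,
    and the two are fused per element (here, paired together)."""
    fwd, acc = [], 0.0
    for x in features:            # left-to-right pass
        acc = combine(acc, x)
        fwd.append(acc)
    bwd, acc = [], 0.0
    for x in reversed(features):  # right-to-left pass
        acc = combine(acc, x)
        bwd.append(acc)
    bwd.reverse()
    # Fuse forward and backward features element-wise.
    return list(zip(fwd, bwd))

encoded = bidirectional_encode([1.0, 2.0, 3.0])
```

In practice the recurrence would be a learned cell (e.g. an LSTM unit) and fusion would be vector concatenation, but the data flow is the same.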
In one embodiment, the entity identification network includes a decoding layer and a classification layer; the entity identification module is also used for decoding the element coding features according to the sequence of the text elements corresponding to the element coding features at the decoding layer to obtain the element decoding features corresponding to the text elements; and in the classification layer, performing entity classification processing on each element decoding characteristic to obtain an entity identification result.
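Turning the classification layer's per-element output into entities can be sketched as follows; the BIO tagging scheme is an assumption, since the text does not name one:

```python
def decode_entities(tokens, tags):
    """Assemble (type, text) entity spans from per-element BIO tags:
    B-X opens an entity of type X, I-X continues it, anything else closes it."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = [tag[2:], tok]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1] += tok
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, text) for etype, text in entities]

ents = decode_entities(
    ["aspirin", "treats", "head", "ache"],
    ["B-DRUG", "O", "B-DISEASE", "I-DISEASE"],
)
```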
In one embodiment, the entity recognition module is further configured to determine at least two training entities corresponding to the training texts according to the entity recognition result; determining a corresponding training entity pair according to at least two training entities; determining a training code segment corresponding to a training entity pair from a training code sequence; and inputting the training code segment into a relation classification network for relation recognition to obtain a corresponding relation recognition result of the training entity pair.
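Selecting the training code segment for an entity pair can be sketched as follows; the inclusive-span convention is an assumption:

```python
def segment_for_pair(code_sequence, span_a, span_b):
    """Take the sub-sequence of element coding features that covers both
    entities of the pair, inclusive of both endpoint positions."""
    start = min(span_a[0], span_b[0])
    end = max(span_a[1], span_b[1])
    return code_sequence[start:end + 1]

codes = ["c0", "c1", "c2", "c3", "c4", "c5"]
# Entity A occupies position 0, entity B positions 3-4.
seg = segment_for_pair(codes, (0, 0), (3, 4))
```

The segment, rather than the whole sequence, is what the relation classification network receives, so the context between the two entities dominates the relation decision.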
In one embodiment, the relationship classification network includes a hidden layer, a convolutional layer, and a pooling layer; the relationship recognition module is further configured to input the training code segment into the hidden layer and process it through the hidden layer to obtain a corresponding first intermediate feature; input the first intermediate feature into the convolutional layer and perform convolution processing on it through the convolutional layer to obtain a corresponding second intermediate feature; input the second intermediate feature into the pooling layer and perform pooling processing on it through the pooling layer; and classify the entity relationship according to the pooling result to obtain a relationship recognition result.
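The convolution and pooling stages can be sketched in miniature as follows; the hidden layer and the final classifier are omitted, and all values are illustrative:

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution (cross-correlation) over the code segment."""
    k = len(kernel)
    return [
        sum(seq[i + j] * kernel[j] for j in range(k))
        for i in range(len(seq) - k + 1)
    ]

def max_pool(seq):
    """Max pooling collapses the variable-length segment to a fixed size,
    keeping the strongest local response for the relation classifier."""
    return max(seq)

features = conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
pooled = max_pool(features)
```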
In one embodiment, the above apparatus further comprises a knowledge base construction module, configured to acquire a text to be processed and encode it through the target coding network to obtain a coding sequence corresponding to the text to be processed; identify the entity of the text to be processed according to the coding sequence corresponding to the text to be processed and the target entity recognition network, to obtain an entity recognition result corresponding to the text to be processed; identify the entity relationship of the text to be processed according to the entity recognition result and the target relationship classification network, to obtain a relationship recognition result corresponding to the text to be processed; and construct a knowledge base according to the entity recognition result and the relationship recognition result corresponding to the text to be processed.
In the above text processing apparatus, the training text is encoded by the coding network to obtain a corresponding training coding sequence; the entity of the training text is identified according to the training coding sequence and the entity recognition network to obtain an entity recognition result; the entity relationship of the training text is identified according to the entity recognition result and the relationship classification network to obtain a relationship recognition result; and the coding network, the entity recognition network and the relationship classification network are jointly trained based on the entity recognition result, the relationship recognition result and the label data to obtain the updated coding network, entity recognition network and relationship classification network. Because of the joint training, the entity recognition network and the relationship classification network share the network parameters of the coding network, so the dependence between the two networks is captured during training and the resulting networks are more accurate. Furthermore, the coding network, the entity recognition network and the relationship classification network undergo multiple rounds of iterative training, and in each iteration the training coding sequence is updated according to the updated coding network and the relationship recognition result; when the iteration stop condition is met, the trained target coding network, target entity recognition network and target relationship classification network are obtained. The correlation between the entity recognition network and the relationship classification network is thus fully considered, and the accuracy of both networks is further improved.
In one embodiment, as shown in fig. 12, there is provided an apparatus 1200 for building a knowledge base, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes:
the text acquisition module 1202 is configured to acquire a text to be processed, and encode the text to be processed through the trained target coding network to obtain a coding sequence corresponding to the text to be processed;
the entity recognition module 1204 is used for recognizing the entity of the text to be processed according to the coding sequence and the trained target entity recognition network to obtain an entity recognition result;
the relationship identification module 1206 is used for identifying the entity relationship of the text to be processed according to the entity identification result and the trained target relationship classification network to obtain a relationship identification result; the target coding network, the target entity recognition network and the target relation classification network are obtained by performing iterative joint training based on a training text; updating a training coding sequence input into the entity recognition network obtained by the previous training according to a relationship recognition result output by the relationship classification network obtained by the previous training each time of iterative training;
and a knowledge base construction module 1208, configured to construct a knowledge base according to the entity identification result and the relationship identification result.
In one embodiment, the text to be processed is a drug instruction text and the knowledge base is a drug knowledge base; the relationship identification module is further configured to determine at least two prediction entities corresponding to the text to be processed according to the entity recognition result; determine a target drug name from the at least two prediction entities, and form a prediction entity pair from the target drug name and each of the other prediction entities; determine the coding segment corresponding to each prediction entity pair from the coding sequence; and input each coding segment into the target relationship classification network for relationship recognition, to obtain a relationship recognition result corresponding to each prediction entity pair.
In one embodiment, the above apparatus further comprises: the prescription auditing module is used for acquiring a prescription to be audited, and the prescription to be audited comprises a medicine name; inquiring a corresponding target entity and a target relation from a medicine knowledge base according to the medicine name; and auditing the prescription to be audited according to the inquired target entity and the target relation to obtain an auditing result.
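The prescription auditing step can be sketched as a lookup against the drug knowledge base; the relation name "contraindication" and the data layout below are assumptions made for illustration:

```python
def audit_prescription(prescription, knowledge_base):
    """For each drug in the prescription, query its target entities and
    relations from the knowledge base and flag conflicting combinations."""
    warnings = []
    for item in prescription:
        facts = knowledge_base.get(item["drug"], {})
        for other in prescription:
            if other["drug"] in facts.get("contraindication", ()):
                warnings.append(
                    f'{item["drug"]} contraindicated with {other["drug"]}'
                )
    return warnings

kb = {"warfarin": {"contraindication": {"aspirin"}}}
rx = [{"drug": "warfarin"}, {"drug": "aspirin"}]
result = audit_prescription(rx, kb)
```

An empty result would mean the prescription passes this audit rule; a non-empty list is returned to the pharmacist for review.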
In the above apparatus for building a knowledge base, the text to be processed is encoded by the trained target coding network to obtain a corresponding coding sequence; the entity of the text to be processed is identified according to the coding sequence and the trained target entity recognition network to obtain an entity recognition result; the entity relationship of the text to be processed is identified according to the entity recognition result and the trained target relationship classification network to obtain a relationship recognition result; and the knowledge base is constructed according to the entity recognition result and the relationship recognition result. Because the target coding network, the target entity recognition network and the target relationship classification network are obtained by iterative joint training based on training text, and in each iteration the training coding sequence input into the previously trained entity recognition network is updated according to the relationship recognition result output by the previously trained relationship classification network, the dependence and relevance between the entity recognition network and the relationship classification network are fully considered during training. The resulting target entity recognition network and target relationship classification network therefore have better generalization performance and can accurately identify the entities and relationships in the text to be processed, so the accuracy of the knowledge base constructed from the entity recognition result and the relationship recognition result is significantly improved compared with a knowledge base constructed by manual labeling in the related art.
For the specific limitations of the text processing apparatus and the knowledge base constructing apparatus, reference may be made to the limitations of the text processing method and the knowledge base constructing method above, which are not repeated here. Each module in the text processing apparatus and the knowledge base constructing apparatus can be implemented wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text processing method or a knowledge base construction method.
Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but as long as a combination contains no contradiction, it should be considered within the scope of this specification.
The above embodiments represent only a few implementations of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the patent. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A method of text processing, the method comprising:
acquiring a training text and label data corresponding to the training text;
coding the training text through a coding network to obtain a training coding sequence corresponding to the training text;
identifying the entity of the training text according to the training coding sequence and the entity identification network to obtain an entity identification result;
identifying the entity relationship of the training text according to the entity identification result and a relationship classification network to obtain a relationship identification result;
jointly training the coding network, the entity recognition network and the relationship classification network based on the entity recognition result, the relationship recognition result and the label data to update the coding network, the entity recognition network and the relationship classification network;
updating the training coding sequence according to the updated coding network and the relationship recognition result, returning to the step of recognizing the entity of the training text according to the training coding sequence and the entity recognition network for iterative training, and obtaining a trained target coding network, a trained target entity recognition network and a trained target relationship classification network when an iteration stop condition is met; the target coding network, the target entity recognition network and the target relation classification network are used for cooperatively recognizing the entity relation of the text to be processed.
2. The method of claim 1, wherein prior to the obtaining of the training text and the label data corresponding to the training text, the method further comprises:
performing word segmentation processing on the training text to obtain a corresponding text element set;
matching the text element set with an entity dictionary, and determining at least two first entity labels corresponding to the training text according to a matching result; the entity dictionary is obtained according to a pre-constructed knowledge graph;
and determining the entity relationship corresponding to the at least two first entity labels according to the knowledge graph to obtain the first relationship labels corresponding to the training text.
3. The method of claim 2, wherein prior to the obtaining training text, the method further comprises:
determining candidate text elements from the text element set, wherein the candidate text elements are text elements except the at least two first entity labels;
calculating word solidity of the candidate text elements;
when the calculated word solidity exceeds a preset threshold value, determining the candidate text element as a second entity label corresponding to the training text;
and acquiring the entity relation corresponding to the second entity label to obtain a second relation label corresponding to the training text.
4. The method of claim 1, further comprising:
acquiring a candidate text, and coding the candidate text through a coding network to obtain a coding sequence corresponding to the candidate text;
identifying the entity of the candidate text according to the coding sequence corresponding to the candidate text and an entity identification network to obtain an entity identification result corresponding to the candidate text;
identifying the entity relationship of the candidate text according to the entity identification result corresponding to the candidate text and a relationship classification network to obtain a relationship identification result corresponding to the candidate text;
when the candidate text is judged to be an uncertainty sample according to at least one of the entity identification result corresponding to the candidate text and the relation identification result corresponding to the candidate text, the candidate text is sent to a preset terminal;
and when receiving the entity labels and the relation labels corresponding to the candidate texts returned by the preset terminal, continuing training the target coding network, the target entity recognition network and the target relation classification network according to the candidate texts.
5. The method of claim 1, wherein the coding network comprises a feature mapping layer and a coding layer; the training text is composed of a plurality of ordered text elements; the encoding the training text through the encoding network to obtain the training encoding sequence corresponding to the training text comprises:
performing feature mapping on each text element through a feature mapping layer to obtain feature representations corresponding to the text elements;
inputting each feature representation into the coding layer to obtain element coding features corresponding to each text element;
and obtaining a training coding sequence corresponding to the training text according to the element coding characteristics corresponding to each text element.
6. The method of claim 5, wherein inputting each of the feature representations into the coding layer to obtain the element coding feature corresponding to each of the text elements comprises:
coding each feature representation according to the sequence of the text elements corresponding to each feature representation and the forward direction to obtain the forward coding feature corresponding to each text element;
coding each feature representation according to the sequence of the text elements corresponding to each feature representation and the backward direction to obtain backward coding features corresponding to each text element;
and respectively fusing the forward coding features and the backward coding features corresponding to the text elements to obtain the element coding features corresponding to the text elements.
7. The method of claim 5, wherein the entity identification network comprises a decoding layer and a classification layer; the step of identifying the entity of the training text according to the training code sequence and the entity identification network to obtain an entity identification result comprises the following steps:
decoding the element coding features according to the sequence of the text elements corresponding to the element coding features at the decoding layer to obtain the element decoding features corresponding to the text elements;
and carrying out entity classification processing on each element decoding characteristic in the classification layer to obtain an entity identification result.
8. The method of claim 1, wherein the recognizing the entity relationship of the training text according to the entity recognition result and the relationship classification network to obtain the relationship recognition result comprises:
determining at least two training entities corresponding to the training texts according to the entity recognition result;
determining a corresponding training entity pair according to the at least two training entities;
determining a training code segment corresponding to the training entity pair from the training code sequence;
and inputting the training code segment into a relation classification network for relation recognition to obtain a relation recognition result corresponding to the training entity pair.
9. The method according to any one of claims 1 to 8, further comprising:
acquiring a text to be processed, and coding the text to be processed through the target coding network to obtain a coding sequence corresponding to the text to be processed;
identifying the entity of the text to be processed according to the coding sequence corresponding to the text to be processed and the target entity identification network to obtain an entity identification result corresponding to the text to be processed;
identifying the entity relationship of the text to be processed according to the entity identification result and the target relationship classification network to obtain a relationship identification result corresponding to the text to be processed;
and constructing a knowledge base according to the entity recognition result and the relation recognition result corresponding to the text to be processed.
10. A method for constructing a knowledge base, the method comprising:
acquiring a text to be processed, and coding the text to be processed through a trained target coding network to obtain a coding sequence corresponding to the text to be processed;
identifying the entity of the text to be processed according to the coding sequence and the trained target entity identification network to obtain an entity identification result;
identifying the entity relationship of the text to be processed according to the entity identification result and the trained target relationship classification network to obtain a relationship identification result;
the target coding network, the target entity recognition network and the target relation classification network are obtained by performing iterative joint training based on a training text; each iteration training, updating a training coding sequence input into the entity recognition network obtained by the previous training according to a relationship recognition result output by the relationship classification network obtained by the previous training;
and constructing a knowledge base according to the entity recognition result and the relation recognition result.
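The pipeline of claim 10 can be illustrated with a minimal, runnable sketch. The three "networks" below are stand-in functions (the entity spans, relation label `"treats"`, and the sample sentence are all illustrative assumptions, not the patent's trained models):

```python
# Hypothetical sketch of the claim-10 pipeline:
# encode -> entity recognition -> relation classification -> knowledge base.
# All three "networks" are toy stand-ins for the trained models.

def encode(text):
    # Stand-in for the target coding network: one code per character.
    return [float(ord(ch)) for ch in text]

def recognize_entities(text, codes):
    # Stand-in for the target entity recognition network:
    # returns (entity_text, entity_type, start, end) spans.
    entities = []
    for name, etype in [("aspirin", "drug"), ("headache", "disease")]:
        pos = text.find(name)
        if pos >= 0:
            entities.append((name, etype, pos, pos + len(name)))
    return entities

def classify_relations(codes, entities):
    # Stand-in for the target relation classification network:
    # labels each entity pair with a relation type.
    results = []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            results.append((entities[i][0], "treats", entities[j][0]))
    return results

def build_knowledge_base(entities, relations):
    # A knowledge base here is simply an entity-type table plus a set
    # of (head, relation, tail) triples.
    return {
        "entities": {(e[0], e[1]) for e in entities},
        "triples": set(relations),
    }

text = "aspirin relieves headache"
codes = encode(text)
entities = recognize_entities(text, codes)
relations = classify_relations(codes, entities)
kb = build_knowledge_base(entities, relations)
```

In a real system each stand-in would be replaced by the corresponding trained network, but the data flow (coding sequence in, entity spans and relation triples out, triples stored in the knowledge base) follows the claim.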
11. The method of claim 10, wherein the text to be processed is a drug instruction text; the knowledge base is a medicine knowledge base; the step of identifying the entity relationship of the text to be processed according to the entity identification result and the trained target relationship classification network to obtain a relationship identification result comprises the following steps:
determining at least two prediction entities corresponding to the text to be processed according to the entity recognition result;
determining a target drug name from the at least two prediction entities, and pairing the target drug name with each of the other prediction entities to form prediction entity pairs;
determining a coding segment corresponding to each prediction entity pair from the coding sequence;
and inputting each coding segment into the target relation classification network for relation identification to obtain a relation identification result corresponding to each prediction entity pair.
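The pairing and segment-extraction steps of claim 11 can be sketched as follows. The entity spans, types, and the rule that one "drug"-typed entity is the target drug name are illustrative assumptions:

```python
# Hypothetical sketch of claim 11: pair the target drug name with every
# other predicted entity, then cut the coding segment spanning each pair
# out of the coding sequence.

def form_drug_entity_pairs(entities):
    # entities: list of (text, type, start, end); the target drug name
    # is assumed to be the (single) entity whose type is "drug".
    drugs = [e for e in entities if e[1] == "drug"]
    others = [e for e in entities if e[1] != "drug"]
    target = drugs[0]
    return [(target, other) for other in others]

def coding_segment(codes, pair):
    # The segment covers both entities: from the earlier start position
    # to the later end position in the coding sequence.
    (_, _, s1, e1), (_, _, s2, e2) = pair
    return codes[min(s1, s2):max(e1, e2)]

codes = list(range(30))  # toy coding sequence, one code per character
entities = [
    ("ibuprofen", "drug", 0, 9),
    ("fever", "symptom", 18, 23),
    ("adult", "population", 24, 29),
]
pairs = form_drug_entity_pairs(entities)
segments = [coding_segment(codes, p) for p in pairs]
```

Each resulting segment is what would be fed to the target relation classification network to label the relation between the drug and the paired entity.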
12. A text processing apparatus, characterized in that the apparatus comprises:
the data acquisition module is used for acquiring a training text and label data corresponding to the training text;
the coding module is used for coding the training text through a coding network to obtain a training coding sequence corresponding to the training text;
the entity recognition module is used for recognizing the entity of the training text according to the training coding sequence and the entity recognition network to obtain an entity recognition result;
the relation recognition module is used for recognizing the entity relation of the training text according to the entity recognition result and the relation classification network to obtain a relation recognition result;
a joint training module, configured to jointly train the coding network, the entity recognition network, and the relationship classification network based on the entity recognition result, the relationship recognition result, and the label data, so as to update the coding network, the entity recognition network, and the relationship classification network;
the iterative training module is used for updating the training code sequence according to the updated coding network and the relationship recognition result, returning to the step of recognizing the entity of the training text according to the training code sequence and the entity recognition network for iterative training, and obtaining a trained target coding network, a trained target entity recognition network and a trained target relationship classification network when an iteration stop condition is met; the target coding network, the target entity recognition network and the target relation classification network are used for cooperatively recognizing the entity relation of the text to be processed.
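The iterative joint training described in claim 12 can be sketched as a control-flow skeleton. The network classes, the zero loss, and the update rule are placeholders; the point is the loop structure, in which each iteration re-derives the training coding sequence from the updated coding network and the relation recognition result just produced:

```python
# Structural sketch of the iterative joint training loop in claim 12.
# StubNet is a placeholder for the coding / entity recognition /
# relation classification networks.

class StubNet:
    """Placeholder network: identity forward pass, counted updates."""
    def __init__(self, name):
        self.name = name
        self.updates = 0

    def forward(self, x):
        return x            # stand-in for a real forward pass

    def update(self, loss):
        self.updates += 1   # stand-in for a joint gradient step

def iterative_joint_training(text, labels, iterations=3):
    coder, ner, rel = StubNet("coder"), StubNet("ner"), StubNet("rel")
    codes = coder.forward(text)      # initial training coding sequence
    history = []
    for _ in range(iterations):
        entity_result = ner.forward(codes)
        relation_result = rel.forward(entity_result)
        loss = 0.0                   # joint loss vs. label data (placeholder)
        for net in (coder, ner, rel):
            net.update(loss)         # jointly update all three networks
        # Feedback step: refresh the training coding sequence using the
        # updated coding network and the relation recognition result.
        codes = (coder.forward(text), relation_result)
        history.append(codes)
    return coder, ner, rel, history

coder, ner, rel, history = iterative_joint_training("abc", [], iterations=3)
```

When the iteration stop condition is met, the three networks as last updated would be the trained target coding, entity recognition, and relation classification networks.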
13. An apparatus for building a knowledge base, the apparatus comprising:
the text acquisition module is used for acquiring a text to be processed, and coding the text to be processed through a trained target coding network to obtain a coding sequence corresponding to the text to be processed;
the entity recognition module is used for recognizing the entity of the text to be processed according to the coding sequence and the trained target entity recognition network to obtain an entity recognition result;
the relation recognition module is used for recognizing the entity relationship of the text to be processed according to the entity recognition result and the trained target relation classification network to obtain a relation recognition result; the target coding network, the target entity recognition network and the target relation classification network are obtained by performing iterative joint training based on a training text; in each training iteration, the training coding sequence input into the entity recognition network obtained in the previous iteration is updated according to the relationship recognition result output by the relationship classification network obtained in the previous iteration;
and the knowledge base construction module is used for constructing a knowledge base according to the entity recognition result and the relation recognition result.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
CN202011403298.2A 2020-12-04 2020-12-04 Text processing method, text processing device, knowledge base construction method, knowledge base construction device and storage medium Pending CN112380867A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011403298.2A CN112380867A (en) 2020-12-04 2020-12-04 Text processing method, text processing device, knowledge base construction method, knowledge base construction device and storage medium

Publications (1)

Publication Number Publication Date
CN112380867A true CN112380867A (en) 2021-02-19

Family

ID=74589422

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011403298.2A Pending CN112380867A (en) 2020-12-04 2020-12-04 Text processing method, text processing device, knowledge base construction method, knowledge base construction device and storage medium

Country Status (1)

Country Link
CN (1) CN112380867A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926665A (en) * 2021-03-02 2021-06-08 安徽七天教育科技有限公司 Text line recognition system based on domain self-adaptation and use method
CN112949307A (en) * 2021-02-25 2021-06-11 平安科技(深圳)有限公司 Method and device for predicting statement entity and computer equipment
CN114298043A (en) * 2021-12-24 2022-04-08 厦门快商通科技股份有限公司 Entity standardization method, device and equipment based on joint learning and readable medium
US20220351067A1 (en) * 2021-04-29 2022-11-03 International Business Machines Corporation Predictive performance on slices via active learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858010A (en) * 2018-11-26 2019-06-07 平安科技(深圳)有限公司 Field new word identification method, device, computer equipment and storage medium
CN110019839A (en) * 2018-01-03 2019-07-16 中国科学院计算技术研究所 Medical knowledge map construction method and system based on neural network and remote supervisory
US20200065374A1 (en) * 2018-08-23 2020-02-27 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network
CN111159407A (en) * 2019-12-30 2020-05-15 北京明朝万达科技股份有限公司 Method, apparatus, device and medium for training entity recognition and relation classification model
CN111324696A (en) * 2020-02-19 2020-06-23 腾讯科技(深圳)有限公司 Entity extraction method, entity extraction model training method, device and equipment
WO2020140386A1 (en) * 2019-01-02 2020-07-09 平安科技(深圳)有限公司 Textcnn-based knowledge extraction method and apparatus, and computer device and storage medium

Similar Documents

Publication Publication Date Title
CN108733742B (en) Global normalized reader system and method
CN108733792B (en) Entity relation extraction method
CN109471895B (en) Electronic medical record phenotype extraction and phenotype name normalization method and system
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN112380867A (en) Text processing method, text processing device, knowledge base construction method, knowledge base construction device and storage medium
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN111444715B (en) Entity relationship identification method and device, computer equipment and storage medium
CN112131883B (en) Language model training method, device, computer equipment and storage medium
CN114565104A (en) Language model pre-training method, result recommendation method and related device
WO2023029506A1 (en) Illness state analysis method and apparatus, electronic device, and storage medium
CN111930942A (en) Text classification method, language model training method, device and equipment
JP7315065B2 (en) QUESTION GENERATION DEVICE, QUESTION GENERATION METHOD AND PROGRAM
CN112765370B (en) Entity alignment method and device of knowledge graph, computer equipment and storage medium
CN111881292B (en) Text classification method and device
CN114281931A (en) Text matching method, device, equipment, medium and computer program product
CN113657105A (en) Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114781382A (en) Medical named entity recognition system and method based on RWLSTM model fusion
CN113191150B (en) Multi-feature fusion Chinese medical text named entity identification method
CN114358020A (en) Disease part identification method and device, electronic device and storage medium
CN114266905A (en) Image description generation model method and device based on Transformer structure and computer equipment
CN112035627B (en) Automatic question and answer method, device, equipment and storage medium
CN113761151A (en) Synonym mining method, synonym mining device, synonym question answering method, synonym question answering device, computer equipment and storage medium
CN113761124A (en) Training method of text coding model, information retrieval method and equipment
CN113536784A (en) Text processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40038818

Country of ref document: HK

SE01 Entry into force of request for substantive examination