CN113553840A

CN113553840A - Text information processing method, device, equipment and storage medium

Info

Publication number: CN113553840A
Application number: CN202110923275.2A
Authority: CN
Inventors: 姜逸文; 陈旭; 宋晓霞; 刘鸣谦; 洪平; 高玉杰; 黄智勇; 王琪; 孙嘉明
Original assignee: Winning Health Technology Group Co Ltd
Current assignee: Winning Health Technology Group Co Ltd
Priority date: 2021-08-12
Filing date: 2021-08-12
Publication date: 2021-10-26

Abstract

The application provides a text information processing method, a text information processing device, text information processing equipment and a storage medium, and relates to the technical field of data processing. The initial knowledge extraction model comprises an initial entity type identification model and an initial entity relationship extraction model, and the method comprises the following steps: according to the corresponding relation between the entity type and the entity name, marking original training text information to obtain a first training sample, wherein the entity type comprises: a standard entity type, an attribute entity type, and a value entity type; inputting the first training sample into the initial entity type recognition model, and training to obtain an entity type recognition model; labeling the first training sample according to the corresponding relation between the entity types to obtain a second training sample; and inputting the second training sample into the initial entity relationship extraction model, and training to obtain an entity relationship extraction model. By applying the embodiment of the application, the accuracy of the knowledge extraction model comprising the entity type recognition model can be improved.

Description

Text information processing method, device, equipment and storage medium

Technical Field

The present application relates to the field of medical text technologies, and in particular, to a text information processing method, apparatus, device, and storage medium.

Background

With the rapid development of medical informatization, medical staff generally use electronic medical records to record important information during the diagnosis and treatment of patients. Information in an electronic medical record (which may be referred to as medical text information) is mostly stored in unstructured form and is difficult to use directly in scenarios such as scientific research.

Currently, unstructured medical text information may be processed through a pre-trained knowledge extraction model to obtain structured information, where the structured information includes entities and relationships between the entities in the medical text information.

The comprehensiveness of the training samples used to train the knowledge extraction model directly affects the accuracy of that model. How to construct comprehensive training samples is therefore a technical problem that urgently needs to be solved.

Disclosure of Invention

An object of the present application is to provide a text information processing method, apparatus, device and storage medium that can improve the accuracy of a knowledge extraction model by training it on comprehensively constructed training samples.

In order to achieve the above purpose, the technical solutions adopted in the embodiments of the present application are as follows:

in a first aspect, an embodiment of the present application provides a text information processing method, where an initial knowledge extraction model includes an initial entity type identification model and an initial entity relationship extraction model, and the method includes:

labeling original training text information according to the correspondence between entity types and entity names to obtain a first training sample, wherein the entity types comprise: a standard entity type, an attribute entity type and a value entity type, the attribute entity type and the value entity type each being used to characterize features of the standard entity type, and the first training sample comprises: the original training text information and the entity type corresponding to each entity name in the original training text information;

inputting the first training sample into the initial entity type recognition model, and training to obtain an entity type recognition model;

labeling the first training sample according to the correspondence between the entity types to obtain a second training sample, wherein the correspondence between the entity types comprises: a directional relationship between a subject entity type and an object entity type, and the second training sample comprises: the original training text information and the correspondence between the entity names corresponding to the entity types in the original training text information;

and inputting the second training sample into the initial entity relationship extraction model, and training to obtain an entity relationship extraction model.

Optionally, the labeling the first training sample according to the correspondence between the entity types to obtain a second training sample includes:

and labeling the first training sample according to the corresponding relation between the entity types and the strength degree information between the entity names corresponding to the entity types in the first training sample to obtain a second training sample.

Optionally, the labeling the original training text information according to the correspondence between the entity type and the entity name to obtain a first training sample includes:

marking original training text information according to the corresponding relation between the entity type and the entity name to obtain an initial first training sample;

and if one entity name in the original training text information comprises a plurality of entity names, deleting the entity type corresponding to each entity name in the initial first training sample to obtain the first training sample.

Optionally, the method further comprises:

inputting target text information into the entity type recognition model, and outputting an entity set, wherein the entity set comprises: the entity name contained in the target text information and the entity type corresponding to the entity name, wherein the entity type comprises a standard entity type, an attribute entity type and a value entity type;

and inputting the target text information and the entity set into the entity relationship extraction model, and outputting an entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, the inputting the target text information and the entity set into the entity relationship extraction model, and outputting an entity name pair includes:

inputting the target text information and the entity set into the entity relationship extraction model, and outputting the entity name pair and the strength degree information between the entity names contained in the entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, after the target text information and the entity set are input into the entity relationship extraction model and an entity name pair is output, the method further includes:

and constructing a knowledge graph according to the entity name pair, taking the subject entity name and the object entity name in the entity name pair as nodes in the knowledge graph respectively, and taking the relationship between the subject entity name and the object entity name as an edge in the knowledge graph.
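As a rough illustration of this graph construction step, the following Python sketch collects nodes and directed edges from entity name pairs. The `EntityPair` type, the `build_graph` function, and the relation labels `has_attribute` and `has_value` are hypothetical names chosen for illustration; the application does not prescribe a data structure or a set of edge labels.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityPair:
    """A directed pair: the subject entity name points to the object entity name."""
    subject: str
    obj: str
    relation: str  # hypothetical edge label; not specified by the source


def build_graph(pairs):
    """Return (nodes, edges); edges maps a subject name to a list of (relation, object name)."""
    nodes = set()
    edges = defaultdict(list)
    for pair in pairs:
        nodes.add(pair.subject)
        nodes.add(pair.obj)
        edges[pair.subject].append((pair.relation, pair.obj))
    return nodes, dict(edges)


# Illustrative pairs only; the direction attribute -> value is an assumption.
pairs = [
    EntityPair("patient", "heat peak", "has_attribute"),
    EntityPair("heat peak", "38 ℃", "has_value"),
]
nodes, edges = build_graph(pairs)
```

Storing edges keyed by subject name keeps the subject-to-object direction explicit, which matches the requirement that the subject entity name points to the object entity name.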

Optionally, after constructing the knowledge graph according to the entity name pair, the method further includes:

acquiring a corresponding entity name from a database storing graph data corresponding to the knowledge graph according to a knowledge acquisition instruction input by a user;

and displaying the entity name in the knowledge graph according to the display state corresponding to the entity name.

Optionally, after the target text information is input into the entity type recognition model and an entity set is output, the method further includes:

performing statistical operation on the entity set to obtain a statistical result, wherein the statistical result comprises: the frequency of occurrence of each entity name and/or the frequency of occurrence of each entity type in the entity set;

and sorting, separately, the contents that belong to the same dimension in the statistical result to obtain a sorting result.
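The statistics and sorting described above could be sketched as follows. The entity set contents, the variable names, and the tie-breaking rule (alphabetical order) are assumptions made for the example, not details given in the application.

```python
from collections import Counter

# Hypothetical entity set: (entity name, entity type) tuples output by the recognition model.
entity_set = [
    ("right upper limb", "body part"),
    ("numbness", "clinical manifestation"),
    ("pain", "clinical manifestation"),
    ("numbness", "clinical manifestation"),
]

# Count occurrences along each dimension separately.
name_counts = Counter(name for name, _ in entity_set)
type_counts = Counter(etype for _, etype in entity_set)

# Sort each dimension on its own: most frequent first, ties broken alphabetically.
sorted_names = sorted(name_counts.items(), key=lambda kv: (-kv[1], kv[0]))
sorted_types = sorted(type_counts.items(), key=lambda kv: (-kv[1], kv[0]))
```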

In a second aspect, an embodiment of the present application further provides a text information processing apparatus, where an initial knowledge extraction model includes an initial entity type identification model and an initial entity relationship extraction model, the apparatus includes:

the first labeling module is used for labeling original training text information according to the correspondence between entity types and entity names to obtain a first training sample, wherein the entity types comprise: a standard entity type, an attribute entity type and a value entity type, the attribute entity type and the value entity type each being used to characterize features of the standard entity type, and the first training sample comprises: the original training text information and the entity type corresponding to each entity name in the original training text information;

the first training module is used for inputting the first training sample into the initial entity type recognition model and training to obtain an entity type recognition model;

a second labeling module, configured to label the first training sample according to the correspondence between entity types to obtain a second training sample, where the correspondence between entity types includes: a directional relationship between a subject entity type and an object entity type, and the second training sample includes: the original training text information and the correspondence between the entity names corresponding to the entity types in the original training text information;

and the second training module is used for inputting the second training sample into the initial entity relationship extraction model and training to obtain an entity relationship extraction model.

Optionally, the second labeling module is specifically configured to label the first training sample according to the correspondence between the entity types and the strength information between the entity names corresponding to the entity types in the first training sample, so as to obtain a second training sample.

Optionally, the first labeling module is specifically configured to label the original training text information according to a correspondence between an entity type and an entity name, so as to obtain an initial first training sample; and if one entity name in the original training text information comprises a plurality of entity names, deleting the entity type corresponding to each entity name in the initial first training sample to obtain the first training sample.

Optionally, the apparatus further comprises:

a first output module, configured to input target text information into the entity type recognition model, and output an entity set, where the entity set includes: the entity name contained in the target text information and the entity type corresponding to the entity name, wherein the entity type comprises a standard entity type, an attribute entity type and a value entity type;

and the second output module is used for inputting the target text information and the entity set into the entity relationship extraction model and outputting an entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, the second output module is specifically configured to input the target text information and the entity set into the entity relationship extraction model, and output the entity name pair and the strength information between the entity names included in the entity name pair, where the entity name pair includes a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, the apparatus further comprises:

and the construction module is used for constructing a knowledge graph according to the entity name pair, taking the subject entity name and the object entity name in the entity name pair as nodes in the knowledge graph respectively, and taking the relationship between the subject entity name and the object entity name as an edge in the knowledge graph.

Optionally, the apparatus further comprises:

the acquisition module is used for acquiring a corresponding entity name from a database storing graph data corresponding to the knowledge graph according to a knowledge acquisition instruction input by a user;

and the display module is used for displaying the entity name in the knowledge graph according to the display state corresponding to the entity name.

Optionally, the apparatus further comprises:

a statistics module, configured to perform a statistical operation on the entity set to obtain a statistical result, where the statistical result includes: the frequency of occurrence of each entity name and/or the frequency of occurrence of each entity type in the entity set; and to sort, separately, the contents that belong to the same dimension in the statistical result to obtain a sorting result.

In a third aspect, an embodiment of the present application provides an electronic device, including: the electronic device comprises a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, when the electronic device runs, the processor and the storage medium communicate through the bus, and the processor executes the machine-readable instructions to execute the steps of the text information processing method of the first aspect.

In a fourth aspect, the present application provides a storage medium, where a computer program is stored, and the computer program is executed by a processor to perform the steps of the text information processing method according to the first aspect.

The beneficial effect of this application is:

the embodiments of the application provide a text information processing method, apparatus, device and storage medium, wherein an initial knowledge extraction model comprises an initial entity type identification model and an initial entity relationship extraction model, and the method comprises the following steps: labeling original training text information according to the correspondence between entity types and entity names to obtain a first training sample, wherein the entity types comprise: a standard entity type, an attribute entity type and a value entity type, the attribute entity type and the value entity type each being used to characterize features of the standard entity type, and the first training sample comprises: the original training text information and the entity type corresponding to each entity name in the original training text information; inputting the first training sample into the initial entity type recognition model, and training to obtain an entity type recognition model; labeling the first training sample according to the correspondence between the entity types to obtain a second training sample, wherein the correspondence between the entity types comprises: a directional relationship between the subject entity type and the object entity type, and the second training sample comprises: the original training text information and the correspondence between the entity names corresponding to the entity types in the original training text information; and inputting the second training sample into the initial entity relationship extraction model, and training to obtain an entity relationship extraction model.

By adopting the text information processing method provided by the embodiment of the application, the original training text information is labeled by adding the attribute entity type and the value entity type, so that not only can the entity name corresponding to the standard entity type be identified from the original training text information, but also the entity name corresponding to the attribute entity type and the entity name corresponding to the value entity type can be identified from the original training text information.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

FIG. 1 is a schematic structural diagram of an initial knowledge extraction model provided in an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a knowledge extraction model provided in an embodiment of the present application;

FIG. 3 is a schematic flowchart of a text information processing method according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of another text information processing method according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of another text information processing method according to an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating a conversion of unstructured target text information into structured graph data according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a knowledge-graph structure provided by an embodiment of the present application;

FIG. 8 is a schematic flowchart of another text information processing method according to an embodiment of the present application;

FIG. 9 is a schematic flowchart of another text information processing method according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a text information processing apparatus according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.

Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Before explaining the embodiments of the present application in detail, some terms appearing in the embodiments of the present application will be explained first.

Entity type: entity types cover the main concepts involved in medical texts. The following mainly introduces 11 entity types, which can be divided into standard entity types, an attribute entity type, and a value entity type. The standard entity types mainly include:

human body part (the organs of the nine major systems of the human body, broadly referring to biological subjects of interest, including human tissues and cells);

patient subject (a subject with a disease, broadly referring to the subjects described in medical text);

clinical manifestation (symptoms and subjective abnormal sensations of the patient, or manifestations that appear objectively on the patient's body or body parts);

examination (the name of a diagnostic item intended to help the doctor make a judgment on the patient's disease, including imaging examinations, physicochemical examinations conducted in the laboratory, etc.);

disease diagnosis (a medically defined disease and the doctor's judgment on the cause, physiology, staging, etc. in clinical work);

treatment (intervention in the patient's health by therapeutic means, intended to alleviate and eliminate the patient's disease and abnormal symptoms, including surgery, medical instruments used for therapeutic purposes, etc.);

drug name (broadly referring to all generic drug names used to prevent, treat and diagnose the patient's disease);

orientation (specific positional information that may be associated with other entities);

time (the time and period of occurrence of an event).

The attribute entity type broadly refers to all potential attributes that describe a standard entity type. The value entity type is a specific experimental or observed result for an attribute of a standard entity type, and may be a numerical value, a textual description, or the like.
It should be noted that the present application does not limit the specific content included in the standard entity type.
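To make the grouping concrete, the 11 entity types listed above can be written as a small lookup table. This is only an illustrative sketch: the `Category` enum, the `ENTITY_TYPES` mapping, the `types_in` helper, and the shortened English labels are assumptions and do not appear in the application.

```python
from enum import Enum


class Category(Enum):
    STANDARD = "standard"
    ATTRIBUTE = "attribute"
    VALUE = "value"


# The 11 entity types described above, mapped to the category each belongs to.
ENTITY_TYPES = {
    "body part": Category.STANDARD,
    "patient subject": Category.STANDARD,
    "clinical manifestation": Category.STANDARD,
    "examination": Category.STANDARD,
    "disease diagnosis": Category.STANDARD,
    "treatment": Category.STANDARD,
    "drug name": Category.STANDARD,
    "orientation": Category.STANDARD,
    "time": Category.STANDARD,
    "attribute": Category.ATTRIBUTE,
    "value": Category.VALUE,
}


def types_in(category):
    """Return the entity types that belong to the given category, sorted by name."""
    return sorted(t for t, c in ENTITY_TYPES.items() if c is category)
```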

Entity name: an entity name may be understood as a specific, lower-level instance of an entity type. For example, entity names corresponding to the entity type "body part" may include the names of organs of the nine major systems of the human body, such as the right upper limb and the shoulder and neck; entity names corresponding to the entity type "clinical manifestation" may include numbness, pain, and the like; entity names corresponding to the entity type "attribute entity type" may include body temperature, boundary, and the like; and entity names corresponding to the entity type "value entity type" may include a specific numerical value (e.g., 0.8 × 0.6 cm), "clear", and the like.

Next, an application scenario of the present application will be described. The application scenario may be a scenario in which index information required for clinical scientific research is extracted from an electronic medical record, where the information in the electronic medical record is medical text information stored in an unstructured form, and such medical text information is difficult to be directly used for clinical scientific research, so that the unstructured medical text information needs to be converted into structured graph data.

Specifically, the initial knowledge extraction model may be trained on constructed training samples to obtain a knowledge extraction model. FIG. 1 is a schematic structural diagram of the initial knowledge extraction model provided in the embodiment of the present application. As shown in FIG. 1, the initial knowledge extraction model 100 may include an initial entity type identification model 101 and an initial entity relationship extraction model 102. Optionally, the initial entity type identification model 101 and the initial entity relationship extraction model 102 may be trained as a whole, in which case the entity type identification model and the entity relationship extraction model are obtained when a training stop condition is met; alternatively, the two initial models may be trained separately to obtain the entity type identification model and the entity relationship extraction model. This is not limited in this application.

FIG. 1 shows the structure in which the initial entity type recognition model 101 and the initial entity relationship extraction model 102 are trained separately. The initial entity type recognition model 101 is a deep neural network model based on a BERT (Bidirectional Encoder Representations from Transformers) encoder and a CRF (Conditional Random Field) decoder. The first training sample 1011 used to train the initial entity type recognition model 101 is constructed as described in the following embodiments, and the initial entity type recognition model 101 is trained on the first training sample 1011 to obtain the entity type recognition model. The initial entity relationship extraction model 102 is a BERT-based deep neural network. As described in the following embodiments, the first training sample 1011 may be labeled to obtain the second training sample 1021, and the initial entity relationship extraction model 102 is trained on the second training sample 1021 to obtain the entity relationship extraction model.

Next, the process of applying the trained knowledge extraction model is described. FIG. 2 is a schematic structural diagram of a knowledge extraction model according to an embodiment of the present application. As shown in FIG. 2, the knowledge extraction model 200 may include an entity type identification model 201 and an entity relationship extraction model 202. The target text information 2011 is input into the entity type identification model 201, which may output an entity set consisting of the entity names contained in the target text information 2011 and the entity type corresponding to each entity name. The entity names may include entity names corresponding to the "attribute entity type" (attribute entity names) and entity names corresponding to the "value entity type" (value entity names). The entity set output by the entity type identification model 201 and the target text information 2011 are then input into the entity relationship extraction model 202, which may output entity name pairs 2021. An entity name pair 2021 includes standard entity names, attribute entity names and/or value entity names, together with the directional relationship between them. In this way, the knowledge extraction model can convert unstructured medical text information into structured graph data (entity name pairs).
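The two-stage flow of FIG. 2 can be sketched as a simple function composition. The sketch below is not the trained models themselves: `extract_knowledge`, `fake_recognize`, and `fake_relate` are hypothetical names, and the two stub functions merely stand in for the entity type recognition model and the entity relationship extraction model so that the data flow is visible.

```python
from typing import Callable, List, Tuple

Entity = Tuple[str, str]    # (entity name, entity type)
NamePair = Tuple[str, str]  # (subject entity name, object entity name)


def extract_knowledge(
    text: str,
    recognize: Callable[[str], List[Entity]],
    relate: Callable[[str, List[Entity]], List[NamePair]],
) -> List[NamePair]:
    """Run entity type recognition first, then relation extraction on its output."""
    entity_set = recognize(text)
    return relate(text, entity_set)


# Stand-in stubs for the trained models (assumptions, for demonstration only).
def fake_recognize(text: str) -> List[Entity]:
    return [("patient", "patient subject"), ("heat peak", "attribute"), ("38 ℃", "value")]


def fake_relate(text: str, entities: List[Entity]) -> List[NamePair]:
    names = [name for name, _ in entities]
    # Chain each name to the next one; a real model would predict these links.
    return list(zip(names, names[1:]))


result = extract_knowledge(
    "the patient has a fever with a heat peak of 38 ℃", fake_recognize, fake_relate
)
```

Note that the relation stage receives both the original text and the entity set, as in the description above, rather than the entity set alone.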

The text information processing method mentioned in the present application is explained below with reference to the drawings. Fig. 3 is a schematic flowchart of a text information processing method according to an embodiment of the present application, and as shown in fig. 3, the method may include:

S301, labeling the original training text information according to the correspondence between entity types and entity names to obtain a first training sample, wherein the entity types comprise: a standard entity type, an attribute entity type, and a value entity type.

The attribute entity type and the value entity type are respectively used for characterizing the characteristics of the standard entity type.

Specifically, medical text information is content described in natural language and is therefore unstructured data. Because unstructured data is difficult to use directly in clinical research tasks or statistical analysis tasks, a model that can convert unstructured data into structured data must first be trained. Such a model may be called a knowledge extraction model, and how to obtain it is mainly described below.

The original training text information is medical text information, which can be extracted from a medical corpus; the present application does not limit the number of medical texts extracted. An entity type frame can be preset. The entity type frame contains a plurality of entity types, each entity type has a correspondence with entity names, and the entity types in the frame can be updated according to the actual clinical research task. For example, if a clinical research task involves the entity type "microorganism" and this entity type is not present in the entity type frame, "microorganism" may be added to the frame; if clinical research tasks do not usually involve the entity type "drug name" and this entity type is present in the entity type frame, "drug name" may be deleted from the frame. That is, the entity type frame is extensible and can be dynamically adjusted according to the requirements of the actual clinical research task.
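A minimal sketch of such an adjustable frame is shown below, assuming the frame is held in memory as a mapping from entity type to known entity names. The `EntityTypeFrame` class and its method names are hypothetical; the application does not specify how the frame is stored.

```python
class EntityTypeFrame:
    """Hypothetical container mapping each entity type to its known entity names."""

    def __init__(self, types=None):
        self._types = {t: set(names) for t, names in (types or {}).items()}

    def add_type(self, entity_type, names=()):
        # Adding an existing type only merges in any new names.
        self._types.setdefault(entity_type, set()).update(names)

    def remove_type(self, entity_type):
        # Removing a type that is absent is a no-op.
        self._types.pop(entity_type, None)

    def types(self):
        return sorted(self._types)


frame = EntityTypeFrame({
    "drug name": set(),
    "clinical manifestation": {"numbness", "pain"},
})
frame.add_type("microorganism")  # needed by the current research task
frame.remove_type("drug name")   # not involved in the current research task
```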

The entity types in the entity type frame include the standard entity type, the attribute entity type, and the value entity type, and the three are related. The attribute entity type is used to characterize the standard entity type: an entity name corresponding to the attribute entity type (an attribute entity name) is an attribute of an entity name corresponding to the standard entity type (a standard entity name). The value entity type is also used to characterize the standard entity type and is a concrete representation of the attribute entity type: an entity name corresponding to the value entity type (a value entity name) is a concrete representation of an attribute entity name, and may also be understood as an attribute of a standard entity name.

For example, if the original training text information includes the content "the patient has a fever with a heat peak of 38 ℃", then, according to the correspondence between entity types and entity names, the entity names "patient", "heat peak", and "38 ℃" may be labeled with the entity types "patient subject", "attribute entity type", and "value entity type" respectively. The labels may also be abbreviated, for example as "patient", "attribute", and "value". Other content in the original training text information is labeled in the same way, finally yielding a first training sample that includes the original training text information and the entity type corresponding to each entity name in it.
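The labeling in this example could be sketched with a plain lookup from entity name to entity type. The `label_text` function, the span tuple layout, and the dictionary-shaped sample are assumptions for illustration; the application does not state how labels are stored.

```python
def label_text(text, name_to_type):
    """Return sorted (start, end, name, type) spans for every known name found in text."""
    spans = []
    for name, etype in name_to_type.items():
        start = text.find(name)
        while start != -1:
            spans.append((start, start + len(name), name, etype))
            start = text.find(name, start + 1)
    return sorted(spans)


# Correspondence between entity names and (abbreviated) entity types.
name_to_type = {
    "patient": "patient subject",
    "heat peak": "attribute",
    "38 ℃": "value",
}
text = "the patient has a fever with a heat peak of 38 ℃"

# First training sample: the original text plus the type of each entity name in it.
first_training_sample = {"text": text, "entities": label_text(text, name_to_type)}
```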

In one possible embodiment, the original training text information may first be segmented into words by a dictionary matching module to obtain a plurality of entity names, and each entity name is then labeled as described above to obtain the first training sample. The dictionary matching module includes an entity dictionary containing the entity names of a plurality of normative terms. Segmenting the original training text information with the dictionary matching module improves the accuracy of word segmentation and, in turn, the accuracy of the first training sample.
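The application does not say which matching algorithm the dictionary matching module uses. One common choice for dictionary-based segmentation of Chinese text is forward maximum matching, sketched below purely as an assumption; the function name and the tiny dictionary are illustrative.

```python
def forward_max_match(text, dictionary):
    """Greedy longest-match segmentation; unmatched characters become single-character tokens."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Illustrative entity dictionary: 右上肢 (right upper limb), 上肢 (upper limb),
# 麻木 (numbness), 疼痛 (pain).
entity_dictionary = {"右上肢", "上肢", "麻木", "疼痛"}
tokens = forward_max_match("右上肢麻木疼痛", entity_dictionary)
```

Preferring the longest match keeps "右上肢" intact instead of splitting off the shorter dictionary entry "上肢", which is the kind of segmentation error a dictionary of normative terms is meant to reduce.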

S302, inputting the first training sample into the initial entity type recognition model, and training to obtain the entity type recognition model.

The initial entity type recognition model and the initial entity relationship extraction model are trained separately. The original training text information in the first training sample is used as the input of the initial entity type recognition model, and the entity type corresponding to each entity name in that text is used as its expected output. The initial entity type recognition model is trained in this way, and when the training stop condition is met, the entity type recognition model is obtained.

S303, labeling the first training sample according to the corresponding relation between the entity types to obtain a second training sample, wherein the corresponding relation between the entity types comprises: a directional relationship between the subject entity type and the object entity type.

The correspondence between the entity types can be pre-stored in an entity relationship framework, which mainly comprises a constraint table of relationships and a direction table of relationships. The constraint table stores the entity types that have a correspondence. For example, if there is a correspondence between "patient subject" and "attribute entity type", the two may be stored in association in the constraint table; entity types that are not stored in association in the constraint table are considered to have no correspondence. The direction table stores the directional relationship between entity types. For example, if the relationship between "patient subject" and "attribute entity type" starts from "patient subject" and points to "attribute entity type", then "patient subject" is the subject entity type and "attribute entity type" is the object entity type; if the direction is reversed, "patient subject" is the object entity type and "attribute entity type" is the subject entity type. The directional relationship between two entity types is unique, and the directional relationships among all entity types must not form a loop. For example, if a patient subject takes a certain drug, the drug relieves a certain clinical manifestation, and the clinical manifestation occurs in a patient subject, then the directional relationships among these entity types form a loop; the directional relationships preset in the entity relationship framework must not contain such a loop.
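As a rough sketch of the two tables and the no-loop requirement, the direction table can be held as a list of (subject entity type, object entity type) pairs and checked with a depth-first search. The table contents and the names `is_allowed` and `has_loop` are hypothetical; the application does not prescribe a storage format or a checking algorithm.

```python
# Hypothetical direction table: (subject entity type, object entity type).
DIRECTION_TABLE = [
    ("patient subject", "clinical manifestation"),
    ("clinical manifestation", "attribute entity type"),
    ("attribute entity type", "value entity type"),
]

# Constraint table: unordered pairs of entity types that may be related.
CONSTRAINT_TABLE = {frozenset(pair) for pair in DIRECTION_TABLE}


def is_allowed(type_a, type_b):
    """True if the two entity types are stored in association."""
    return frozenset((type_a, type_b)) in CONSTRAINT_TABLE


def has_loop(direction_table):
    """Detect a cycle among the directed relationships with depth-first search."""
    graph = {}
    for subject, obj in direction_table:
        graph.setdefault(subject, []).append(obj)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # reached a node on the current path again: a loop
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))
```

Running `has_loop` whenever the direction table is edited would reject a configuration such as adding a relationship from "value entity type" back to "patient subject".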

The correspondence between entity names can be derived from the correspondence between the entity types together with the correspondence between entity types and entity names. The first training sample can then be labeled based on the correspondence between entity names; that is, entity names in the original training text information of the first training sample that have a correspondence are associated with each other. Because the correspondence between entity types includes the correspondences of the attribute entity type and the value entity type with the standard entity type, as well as the correspondence between the attribute entity type and the value entity type, the attribute entity names, value entity names, and standard entity names in the original training text information can be associated. The second training sample is finally obtained; it may include the original training text information and the correspondence between the entity names in it, where the entity names may include standard entity names, attribute entity names, and value entity names.
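One way to picture how type-level correspondences yield name-level correspondences is the sketch below, which proposes every directed pair of labeled entity names whose types appear in a direction table. This is an assumed simplification: the table contents and the name `candidate_pairs` are hypothetical, and in practice the proposed pairs would still need to be confirmed against the text.

```python
# Hypothetical direction table of (subject entity type, object entity type).
DIRECTION_TABLE = {
    ("patient subject", "attribute entity type"),
    ("attribute entity type", "value entity type"),
}


def candidate_pairs(labels, direction_table):
    """Pair entity names whose types have a directed correspondence."""
    pairs = []
    for subject_name, subject_type in labels:
        for object_name, object_type in labels:
            if (subject_type, object_type) in direction_table:
                pairs.append((subject_name, object_name))
    return pairs


# Entity names and types as they might appear in a first training sample.
labels = [
    ("patient", "patient subject"),
    ("heat peak", "attribute entity type"),
    ("38 ℃", "value entity type"),
]
second_training_sample = {
    "text": "patient has a fever with a heat peak of 38 ℃",
    "labels": labels,
    "relations": candidate_pairs(labels, DIRECTION_TABLE),
}
```

Each relation is stored as (subject entity name, object entity name), so the direction of the relationship is carried by the order of the tuple.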

S304, inputting the second training sample into the initial entity relationship extraction model, and training to obtain an entity relationship extraction model.

The original training text information in the second training sample and the entity types corresponding to the entity names in it are used as the input of the initial entity relationship extraction model, and the correspondence between the entity names in the original training text information of the second training sample is used as its expected output. The initial entity relationship extraction model is trained in this way, and when the training stop condition is met, the entity relationship extraction model is obtained.

In summary, in the text information processing method provided by the application, the attribute entity type and the value entity type are added, and the original training text information is labeled, so that not only the entity name corresponding to the standard entity type can be identified from the original training text information, but also the entity name corresponding to the attribute entity type and the entity name corresponding to the value entity type can be identified from the original training text information.

Optionally, labeling the first training sample according to the correspondence between the entity types to obtain a second training sample includes: labeling the first training sample according to the correspondence between the entity types and the strength information between the entity names corresponding to the entity types in the first training sample, to obtain the second training sample.

In addition to the constraint table and the direction table of relationships, the entity relationship framework mentioned above also includes the type of occurrence of an entity relationship, which is equivalent to the strength information between the entity names corresponding to the entity types. The correspondence between different entity names has specific semantics. For example, assuming there is a correspondence between "patient subject" and "clinical manifestation", the semantic relationship between their corresponding entity names can be expressed as: the patient presents a certain clinical manifestation. The type of occurrence of the relationship then refers to the degree to which the manifestation is present, which may include different degrees such as not present, mildly present, moderately present, and severely present. As another example, there is a correspondence between "treatment method" and "disease diagnosis", and the semantic relationship between the corresponding entity names can be uniformly expressed as: to what degree a certain treatment method can relieve or cure a certain diagnosed disease, where "to what degree" can correspond to multiple levels of strength information.

After the entity names that have a correspondence in the original training text information are determined according to the correspondence between entity types, the semantic strength information between those entity names is identified. Specifically, the strength information may be divided into positive semantics, negative semantics, and uncertain semantics. For example, assume that the relationship between the entity names corresponding to "patient subject" and "clinical manifestation" in the original training text is that the patient presents a certain clinical manifestation; the semantics are affirmative, so the strength information between "patient subject" and "clinical manifestation" is positive. Mild manifestation, moderate manifestation, and the like can likewise be classified as positive semantics. Although these cases are all mapped to a positive semantic relationship, the underlying meaning is the same: the patient presents the clinical manifestation. Representing relationships by strength information in this way makes the semantics between the entity names corresponding to each entity type explicit. It should be noted that the present application does not limit how the degrees are classified; that is, the entity relationship framework is extensible, and the types of entity relationship occurrence can be dynamically adjusted according to the requirements of the actual clinical research task.

In one possible embodiment, the strength information between the entity names corresponding to the entity types in the first training sample can be revised based on regular expressions, which can improve the accuracy of the second training sample.
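The application does not give the regular expressions used for this revision. The sketch below shows one plausible shape, assuming English cue words: a label is overridden when a negation or uncertainty cue appears in the surrounding context. The cue lists, the degree names in `POLARITY`, and the function name `revise_strength` are all hypothetical.

```python
import re

# Assumed mapping from degree of occurrence to the three semantic classes.
POLARITY = {
    "none": "negative",
    "mild": "positive",
    "moderate": "positive",
    "severe": "positive",
    "unknown": "uncertain",
}

# Hypothetical cue patterns; a real system would use clinically curated rules.
NEGATION = re.compile(r"\b(no|denies|without)\b", re.IGNORECASE)
UNCERTAIN = re.compile(r"\b(possible|suspected|may have)\b", re.IGNORECASE)


def revise_strength(context, strength):
    """Override a strength label when the context contains a clear cue."""
    if NEGATION.search(context):
        return "negative"
    if UNCERTAIN.search(context):
        return "uncertain"
    return strength
```

For instance, `revise_strength("patient denies cough", "positive")` returns `"negative"`, while a context without any cue keeps its original label.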

Therefore, representing the type of entity relationship occurrence by strength information greatly simplifies the work of constructing the second training sample while preserving the associated semantic relationships among entity names, and can improve the efficiency of training the initial entity relationship extraction model.

Fig. 4 is a flowchart illustrating another text information processing method according to an embodiment of the present application. As shown in fig. 4, optionally, the labeling the original training text information according to the correspondence between the entity type and the entity name to obtain a first training sample includes:

S401, labeling the original training text information according to the correspondence between the entity type and the entity name to obtain an initial first training sample.

S402, if one entity name in the original training text information contains a plurality of entity names, deleting the entity type corresponding to each of the contained entity names in the initial first training sample to obtain the first training sample.

The original training text information may be segmented according to the entity dictionary in the dictionary matching module mentioned above to obtain the plurality of entity names it contains, and the original training text information is then labeled according to the correspondence between the entity type and the entity name to obtain the initial first training sample. The specific labeling manner is described in the corresponding part above and is not repeated here.

In one possible embodiment, entity names of all granularities in the original training text information are labeled. If entity names overlap, the entity name at each granularity may be labeled with its corresponding entity type; this labeling manner may be referred to as nested labeling. For example, if the entity name "prostate hyperplasia" exists in the original training text information, it contains a plurality of entity names, such as "prostate" and "hyperplasia". The entity type corresponding to "prostate hyperplasia" is "disease diagnosis", the entity type corresponding to "prostate" is "human body part", and the entity type corresponding to "hyperplasia" is "clinical manifestation". The entity type "disease diagnosis" corresponding to "prostate hyperplasia" thus contains two entity types, namely "human body part" and "clinical manifestation".

In another possible embodiment, based on the longest matching rule, the entity type labeled on the longest entity name may be retained, and the entity type labeled on each sub-entity name contained in that entity name may be deleted. Continuing the above example, the entity type "disease diagnosis" corresponding to the entity name "prostate hyperplasia" may be retained, while the entity type "human body part" corresponding to "prostate" and the entity type "clinical manifestation" corresponding to "hyperplasia" are deleted, so as to obtain the first training sample. Alternatively, after the original training text information is segmented according to the entity dictionary in the dictionary matching module to obtain the plurality of entity names it contains, it is first detected whether the entity names overlap; if so, only the longest entity name is kept, and the first training sample is then obtained according to the correspondence between each entity name and the entity type.
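The longest matching rule can be sketched as a filter over labeled spans. The span format (start, end, name, entity type) and the function name `keep_longest` are assumptions for illustration. Note that this sketch drops any span that overlaps a longer kept span, which covers the nested case described above.

```python
def keep_longest(spans):
    """Keep the longest spans and drop shorter spans that overlap them.

    Each span is a (start, end, name, entity_type) tuple with end exclusive.
    """
    kept = []
    # Visit spans from longest to shortest; break ties by start position.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end = span[0], span[1]
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


# Nested labels for the text "prostate hyperplasia".
nested = [
    (0, 20, "prostate hyperplasia", "disease diagnosis"),
    (0, 8, "prostate", "human body part"),
    (9, 20, "hyperplasia", "clinical manifestation"),
]
flat = keep_longest(nested)
```

After filtering, `flat` contains only the "disease diagnosis" span, matching the example in the text.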

The following mainly explains the process of applying the entity type identification model and the entity relationship extraction model.

Fig. 5 is a schematic flowchart of another text information processing method according to an embodiment of the present application.

As shown in fig. 5, the method may further include:

S501, inputting the target text information into the entity type recognition model, and outputting an entity set, wherein the entity set comprises: the entity names contained in the target text information and the entity type corresponding to each entity name, and the entity type comprises a standard entity type, an attribute entity type and a value entity type.

Referring to fig. 2, the target text information may be input into the entity type identification model 201 in the knowledge extraction model 200. The entity type identification model 201 can identify attribute entity names and value entity names in the target text information; that is, the entity set output by the model may include the entity names contained in the target text information, including entity names corresponding to the "standard entity type", such as "patient"; entity names corresponding to the "attribute entity type", such as "heat peak" and "size"; and entity names corresponding to the "value entity type", such as "38 ℃" and "0.8 × 0.6 cm".

S502, inputting the target text information and the entity set into the entity relationship extraction model, and outputting an entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

The target text information and the entity set output by the entity type recognition model can be input together into the entity relationship extraction model in the knowledge extraction model. Based on the target text information, the entity names in the entity set, and the entity type corresponding to each entity name, the entity relationship extraction model can first identify the entity names that act as subjects, then identify the entity names that act as objects, and form each pair of entity names that have a correspondence into an entity name pair. In an entity name pair, the entity name acting as subject is called the subject entity name, the entity name acting as object is called the object entity name, and the subject entity name points to the object entity name. Both subject entity names and object entity names may be standard entity names, attribute entity names, or value entity names.

It can be seen that the entity type identification model trained on the first training sample can identify the various entity names contained in the target text information, such as standard entity names, attribute entity names, and value entity names, and the entity relationship extraction model trained on the second training sample can then extract the entity name pairs that have relationships; that is, the directional relationships among the standard entity names, attribute entity names, and value entity names can be extracted. The entity names contained in the target text information are therefore extracted more comprehensively, and the resulting structured graph data matches the unstructured target text information more closely, or in other words reflects its content more completely. Fig. 6 is a schematic diagram of converting unstructured target text information into structured graph data according to an embodiment of the present application. As shown in fig. 6, in the structured graph data, the content in a circular box is a standard entity name in the unstructured target text information, such as fever or cough; the content in a diamond box is an attribute entity name in the unstructured target text information, such as heat peak or vomit; and the content in a diamond box is a value entity name in the unstructured target text information, such as 38.2 ℃ or clearness.

Optionally, the inputting the target text information and the entity set into the entity relationship extraction model and outputting an entity name pair includes:

and inputting the target text information and the entity set into the entity relationship extraction model, and outputting an entity name pair and strength information between entity names contained in the entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

After the entity type identification model outputs the entity set, the entity set and the target text information can be input into an entity relationship extraction model obtained by training a second training sample marked with strength degree information between entity names corresponding to the entity types, the entity relationship extraction model can output entity name pairs with corresponding relations contained in the target text information and strength degree information between the entity name pairs, and the strength degree information can indicate semantics between the entity name pairs.

Optionally, after the target text information and the entity set are input into the entity relationship extraction model and an entity name pair is output, the method further includes: constructing a knowledge graph according to the entity name pair, taking the subject entity name and the object entity name in the entity name pair as nodes in the knowledge graph respectively, and taking the relationship between the subject entity name and the object entity name as an edge in the knowledge graph.

The entity name pairs may be referred to as graph data; that is, the graph data includes entity names and the correspondences between them, and the graph data may be stored in a related database. The knowledge graph is composed of nodes and edges. The nodes are the entity names identified by the entity type identification model, namely the entity names in the graph data, and each node can also be associated with the entity type corresponding to the entity name and the position of the entity name in the target text information. The edges are the entity name pairs with correspondences extracted by the entity relationship extraction model, namely the correspondences between entity names in the graph data, and each edge points from the subject entity name to the object entity name in the pair. Alternatively, after the entity relationship extraction model outputs a plurality of entity name pairs, the pairs can be stored in a database in the form of graph data; the subject entity names and object entity names contained in the pairs are then extracted from the database and used as nodes of the knowledge graph, and the relationships between them are used as edges of the knowledge graph.

In another possible embodiment, the strength information between the entity names contained in an entity name pair can be added to the edges of the knowledge graph. Fig. 7 is a schematic structural diagram of a knowledge graph provided in an embodiment of the present application. As shown in fig. 7, the strength information on the edge between "patient" and "fever" indicates that the patient has a fever, and the heat peak is 38.2 ℃.
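A minimal sketch of assembling such a graph from the entity set and the entity name pairs is given below, using plain dictionaries rather than any particular graph database. The data layout and the function name `build_knowledge_graph` are assumptions; the strength value is simply carried on each edge.

```python
def build_knowledge_graph(entity_set, entity_name_pairs):
    """Build nodes from entity names and directed edges from entity name pairs.

    entity_set: iterable of (entity name, entity type).
    entity_name_pairs: iterable of (subject name, object name, strength).
    """
    nodes = {name: {"type": entity_type} for name, entity_type in entity_set}
    edges = []
    for subject, obj, strength in entity_name_pairs:
        # Only connect names that the entity type model actually identified.
        if subject in nodes and obj in nodes:
            edges.append({"from": subject, "to": obj, "strength": strength})
    return {"nodes": nodes, "edges": edges}


entity_set = [
    ("patient", "patient subject"),
    ("fever", "clinical manifestation"),
    ("heat peak", "attribute entity type"),
    ("38.2 ℃", "value entity type"),
]
pairs = [
    ("patient", "fever", "positive"),
    ("fever", "heat peak", None),
    ("heat peak", "38.2 ℃", None),
]
graph = build_knowledge_graph(entity_set, pairs)
```

Each edge keeps its direction through the `from` and `to` keys, so the subject entity name always points to the object entity name.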

Optionally, the knowledge graph may be displayed according to a preset display state. For example, the display size of each node in the knowledge graph may be determined by the number of times the corresponding entity name appears in entity name pairs: the more often an entity name appears, the larger its node is displayed. As shown in fig. 7, the node corresponding to the entity name "patient" is displayed largest. Entity names belonging to the same entity type can also be displayed in the same color. It should be noted that the present application does not limit the specific display state of the knowledge graph.
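The size rule can be sketched by counting how often each entity name occurs across the entity name pairs. The linear formula and the parameters `base` and `step` are illustrative assumptions; the application only states that more frequent names are drawn larger.

```python
from collections import Counter


def node_sizes(entity_name_pairs, base=10, step=4):
    """Map each entity name to a display size that grows with its pair count."""
    counts = Counter()
    for subject, obj in entity_name_pairs:
        counts[subject] += 1
        counts[obj] += 1
    return {name: base + step * n for name, n in counts.items()}


pairs = [
    ("patient", "fever"),
    ("patient", "cough"),
    ("patient", "vomiting"),
    ("fever", "heat peak"),
]
sizes = node_sizes(pairs)
largest = max(sizes, key=sizes.get)
```

With these pairs, "patient" appears three times and receives the largest size, consistent with the figure described above.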

Fig. 8 is a flowchart illustrating another text information processing method according to an embodiment of the present application. Optionally, as shown in fig. 8, after the above-mentioned building a knowledge graph according to the entity name pairs, the method further includes:

S801, acquiring corresponding entity names from a database storing graph data corresponding to the knowledge graph according to a knowledge acquisition instruction input by a user.

The graph data corresponding to the knowledge graph is pre-stored in a related database. After a knowledge acquisition instruction input by a user is received, the graph data matching the instruction, which includes entity names, can be acquired from the database. For example, the user may search the database for the information that meets the requirement. Assuming that the knowledge acquisition instruction input by the user is "what clinical manifestations appear in the patient", the entity names corresponding to clinical manifestations may be acquired from the knowledge graph.
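For a query such as "what clinical manifestations appear in the patient", retrieval could be reduced to filtering edges by subject entity name and object entity type, as in the sketch below. The in-memory `NODES` and `EDGES` stand in for the database, and the function name `find_related` is hypothetical. Parsing the user's instruction into a subject and an entity type is outside the scope of this sketch.

```python
# Stand-in for stored graph data: entity name -> entity type, plus directed pairs.
NODES = {
    "patient": "patient subject",
    "fever": "clinical manifestation",
    "cough": "clinical manifestation",
    "heat peak": "attribute entity type",
}
EDGES = [("patient", "fever"), ("patient", "cough"), ("fever", "heat peak")]


def find_related(subject_name, wanted_type, nodes, edges):
    """Return object entity names of the wanted type that the subject points to."""
    return [
        obj
        for subj, obj in edges
        if subj == subject_name and nodes.get(obj) == wanted_type
    ]


manifestations = find_related("patient", "clinical manifestation", NODES, EDGES)
```

Here `manifestations` holds the entity names that would then be highlighted in the displayed graph.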

S802, displaying the entity name in the knowledge graph according to the display state corresponding to the entity name.

The knowledge graph may be visually displayed on the interface, as shown in fig. 7, the display state of the entity name may be set in advance by using the entity type as a dimension, for example, the entity name corresponding to the "attribute entity type" may be represented by a yellow display parameter, that is, the related node is displayed in yellow; the entity name corresponding to the "value entity type" may be represented by a blue display parameter, i.e., the associated node is displayed in blue. Therefore, the user can conveniently and intuitively know the information to be viewed.

Fig. 9 is a flowchart illustrating a further text information processing method according to an embodiment of the present application. Optionally, as shown in fig. 9, after the target text information is input into the entity type recognition model and an entity set is output, the method further includes:

S901, performing a statistical operation on the entity set to obtain a statistical result, wherein the statistical result comprises: the frequency of occurrence of each entity name and/or the frequency of occurrence of each entity type in the entity set.

S902, sorting the contents belonging to the same dimension in the statistical result respectively to obtain a sorting result.

After the entity type identification model outputs the entity set, the entity set can be input into the entity consistency detection module, which performs statistical analysis on the information in the entity set. Specifically, the number of occurrences of each entity name and of each entity type in the entity set can be counted. For example, suppose the entity set includes 4 entity types (human body part, clinical manifestation, attribute entity type, and value entity type), where human body part and clinical manifestation belong to the same dimension, namely the standard entity type. The numbers of occurrences of human body part and clinical manifestation can then be sorted to obtain a sorting result, so that the user can see the information in the entity set in real time.
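The counting and sorting steps can be sketched with `collections.Counter`. The sample entity set, the `STANDARD_TYPES` grouping, and the function name `rank_standard_types` are assumptions made for illustration.

```python
from collections import Counter

# Entity types assumed to belong to the same dimension (standard entity types).
STANDARD_TYPES = {"human body part", "clinical manifestation"}

# Hypothetical entity set output: (entity name, entity type), one item per mention.
entity_set = [
    ("head", "human body part"),
    ("fever", "clinical manifestation"),
    ("cough", "clinical manifestation"),
    ("fever", "clinical manifestation"),
    ("heat peak", "attribute entity type"),
    ("38.2 ℃", "value entity type"),
]


def rank_standard_types(entities):
    """Count entity types, then sort the standard ones by descending frequency."""
    type_counts = Counter(entity_type for _, entity_type in entities)
    standard = [(t, n) for t, n in type_counts.items() if t in STANDARD_TYPES]
    return sorted(standard, key=lambda item: (-item[1], item[0]))


name_counts = Counter(name for name, _ in entity_set)
ranking = rank_standard_types(entity_set)
```

The same pattern applies to entity names: `name_counts.most_common()` would give the names ordered by frequency.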

Fig. 10 is a schematic structural diagram of a text information processing apparatus according to an embodiment of the present application. As shown in fig. 10, the apparatus includes:

a first labeling module 1001, configured to label, according to a correspondence between an entity type and an entity name, original training text information to obtain a first training sample, where the entity type includes: a standard entity type, an attribute entity type, and a value entity type;

the first training module 1002 is configured to input a first training sample into an initial entity type identification model, and train to obtain an entity type identification model;

a second labeling module 1003, configured to label the first training sample according to a corresponding relationship between entity types to obtain a second training sample, where the corresponding relationship between the entity types includes: a directional relationship between the subject entity type and the object entity type;

and a second training module 1004, configured to input the second training sample into the initial entity relationship extraction model, and train to obtain the entity relationship extraction model.

Optionally, the second labeling module 1003 is specifically configured to label the first training sample according to the correspondence between the entity types and the strength information between the entity names corresponding to the entity types in the first training sample, so as to obtain a second training sample.

Optionally, the first labeling module 1001 is specifically configured to label the original training text information according to a correspondence between an entity type and an entity name, so as to obtain an initial first training sample; and if one entity name in the original training text information comprises a plurality of entity names, deleting the entity types corresponding to the entity names in the initial first training sample to obtain the first training sample.

Optionally, the apparatus further comprises:

the first output module is used for inputting the target text information into the entity type recognition model and outputting an entity set, wherein the entity set comprises: the entity names contained in the target text information and the entity type corresponding to each entity name, and the entity type comprises a standard entity type, an attribute entity type and a value entity type;

and the second output module is used for inputting the target text information and the entity set into the entity relationship extraction model, outputting an entity name pair, wherein the entity name pair comprises a subject entity name and an object entity name, and the subject entity name points to the object entity name.

Optionally, the apparatus further comprises:

and the construction module is used for constructing the knowledge graph according to the entity name pair, respectively taking the subject entity name and the object entity name in the entity name pair as nodes in the knowledge graph, and taking the relationship between the subject entity name and the object entity name as edges in the knowledge graph.

Optionally, the apparatus further comprises:

the acquisition module is used for acquiring a corresponding entity name from a database storing graph data corresponding to the knowledge graph according to a knowledge acquisition instruction input by a user.

Optionally, the apparatus further comprises:

the statistic module is used for carrying out statistic operation on the entity set to obtain a statistic result, and the statistic result comprises: the frequency of occurrence of each entity name and/or the frequency of occurrence of each entity type in the entity set; and respectively sequencing the contents belonging to the same dimensionality in the statistical result to obtain a sequencing result.

The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.

The above modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), or one or more microprocessors, or one or more Field Programmable Gate Arrays (FPGAs), etc. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or other processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 11, the electronic device may include: the electronic device comprises a processor 1101, a storage medium 1102 and a bus 1103, wherein the storage medium 1102 stores machine-readable instructions executable by the processor 1101, when the electronic device runs, the processor 1101 communicates with the storage medium 1102 through the bus 1103, and the processor 1101 executes the machine-readable instructions to execute the steps of the above-mentioned method embodiments. The specific implementation and technical effects are similar, and are not described herein again.

Optionally, the present application further provides a storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program performs the steps of the above method embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. Alternatively, the indirect coupling or communication connection of devices or units may be electrical, mechanical or other.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to perform some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Claims

1. A text information processing method is characterized in that an initial knowledge extraction model comprises an initial entity type identification model and an initial entity relationship extraction model, and the method comprises the following steps:

2. The method according to claim 1, wherein the labeling the first training sample according to the corresponding relationship between entity types to obtain a second training sample comprises:

3. The method according to claim 1, wherein the labeling the original training text information according to the corresponding relationship between the entity type and the entity name to obtain a first training sample comprises:

4. The method of claim 2, further comprising:

5. The method of claim 4, wherein the inputting the target text information and the entity set into the entity relationship extraction model and outputting entity name pairs comprises:

6. The method of claim 5, wherein after inputting the target text information and the entity set into the entity relationship extraction model and outputting entity name pairs, the method further comprises:

7. The method of claim 6, wherein after constructing a knowledge graph from the entity name pairs, the method further comprises:

8. The method of claim 5, wherein after inputting the target text information into the entity type recognition model and outputting a set of entities, the method further comprises:

9. A text information processing apparatus characterized in that an initial knowledge extraction model includes an initial entity type identification model and an initial entity relationship extraction model, the apparatus comprising:

10. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the text information processing method according to any one of claims 1 to 8.

11. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the text information processing method according to any one of claims 1 to 8.
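
For orientation only, the inference flow named in the claim preambles (target text information, then an entity set, then entity name pairs, then a knowledge graph) can be sketched as below. This is a minimal illustration and not the claimed implementation: the dictionary lookup and the rule-based pairing stand in for the trained entity type recognition model and entity relationship extraction model, and every name, dictionary entry, and allowed type pair (`ENTITY_TYPES`, `ALLOWED_TYPE_PAIRS`, `recognize_entities`, `extract_pairs`, `build_graph`) is a hypothetical assumption that does not come from the source.

```python
from collections import defaultdict

# Hypothetical dictionary mapping entity names to the three entity types
# named in the abstract (standard, attribute, value). Entries are invented.
ENTITY_TYPES = {
    "hypertension": "standard",
    "systolic pressure": "attribute",
    "140 mmHg": "value",
}

# Assumed correspondence between entity types; the pairs are illustrative.
ALLOWED_TYPE_PAIRS = {("standard", "attribute"), ("attribute", "value")}


def recognize_entities(text):
    """Stand-in for the entity type recognition model: plain dictionary lookup.

    Returns (entity name, entity type) tuples ordered by position in the text.
    """
    found = [
        (text.find(name), name, etype)
        for name, etype in ENTITY_TYPES.items()
        if name in text
    ]
    return [(name, etype) for _, name, etype in sorted(found)]


def extract_pairs(entities):
    """Stand-in for the entity relationship extraction model.

    Keeps a (head, tail) name pair when their types form an allowed pair.
    """
    pairs = []
    for head, head_type in entities:
        for tail, tail_type in entities:
            if (head_type, tail_type) in ALLOWED_TYPE_PAIRS:
                pairs.append((head, tail))
    return pairs


def build_graph(pairs):
    """Build a simple adjacency-list knowledge graph from entity name pairs."""
    graph = defaultdict(list)
    for head, tail in pairs:
        graph[head].append(tail)
    return dict(graph)


text = "Patient with hypertension, systolic pressure 140 mmHg."
entities = recognize_entities(text)
pairs = extract_pairs(entities)
graph = build_graph(pairs)
```

In a real system the two stand-in functions would be replaced by the trained models; the adjacency list is used here only because it is the simplest structure that holds name pairs.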