CN112100228A - Method and device for constructing hierarchical pattern for information extraction - Google Patents

Info

Publication number
CN112100228A
Authority
CN
China
Prior art keywords
granularity
node
information
nodes
labels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011019692.6A
Other languages
Chinese (zh)
Inventor
刘辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zero Krypton Information Technology Beijing Co ltd
Linkdoc Technology Beijing Co ltd
Original Assignee
Zero Krypton Information Technology Beijing Co ltd
Linkdoc Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zero Krypton Information Technology Beijing Co ltd and Linkdoc Technology Beijing Co ltd
Priority to CN202011019692.6A
Publication of CN112100228A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2457: Query processing with adaptation to user needs
    • G06F16/24573: Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for constructing hierarchical patterns for information extraction. The method comprises: obtaining a sample set with labeled information, wherein the sample set is free text and the labeled information is labels with different hierarchical granularities; and automatically constructing patterns of different levels according to the sample set with the labeled information, wherein different levels of granularity correspond to different levels of patterns. The application aims to provide a more effective information extraction approach suitable for free text.

Description

Method and device for constructing hierarchical pattern for information extraction
Technical Field
The application relates to the technical field of data processing, in particular to a method and a device for constructing a hierarchical pattern for information extraction.
Background
In recent years, with the rapid development of information technology, applications based on big data have become increasingly widespread. Because data representations are rich and complex, free text is difficult to use directly in practice, particularly in statistical analysis. The text must first be structured: key information points are extracted and arranged into formatted items. This process can be viewed as information extraction and structuring.
Information extraction is relevant to many fields, such as the extraction of medical data in the medical field, of case information in the legal field, and of public opinion information in the field of internet communication. Generally, there are two main approaches. The first is rule-based extraction, in which key information is matched against dictionaries or against regular expressions summarized from experience. The second is model-based extraction, which uses methods such as named entity recognition. Rule-based extraction has a high maintenance cost, but it is simple, easy to understand, and highly interpretable. Model-based extraction has a high computational cost, a long iteration cycle, and poor interpretability.
In summary, existing information extraction methods have various drawbacks, and a more effective information extraction method is highly desirable.
Disclosure of Invention
The present application mainly aims to provide a method and an apparatus for constructing a hierarchical pattern for information extraction, so as to solve the above-mentioned various defects existing in the existing information extraction method.
To achieve the above object, according to a first aspect of the present application, a method for hierarchical pattern construction for information extraction is provided.
The method for constructing the hierarchical pattern for information extraction comprises the following steps:
acquiring a sample set with labeled information, wherein the sample set is a free text, and the labeled information is labels with different hierarchical granularities;
and automatically constructing different levels of patterns according to the sample set with the labeled information, wherein different levels of granularity correspond to different levels of patterns.
Optionally, before acquiring the labeled information sample set, the method further includes:
setting different levels of granularity nodes and labels contained in the different levels of granularity nodes according to the structural information to be extracted in the sample set;
and marking the free text in the sample set according to the labels contained in the granularity nodes of different levels.
Optionally, the automatically constructing different hierarchical patterns according to the labeled information sample set includes:
determining the values corresponding to the labels of the minimum level granularity node as the minimum level granularity pattern;
starting from the minimum level granularity node, the construction of each level granularity pattern comprises:
matching the values corresponding to the labels of the current level granularity node against the node library of each level granularity node finer than the current level, wherein a node library consists of all the labels of a level granularity node and the values corresponding to those labels;
replacing each successfully matched value with the corresponding label of the finer level granularity node;
and combining the replacement labels with the values that were not matched to determine the granularity pattern of the current level.
Optionally, the sample set is an electronic medical record EMR set, and the setting of different levels of granularity nodes and labels included in the different levels of granularity nodes according to the structured information to be extracted in the sample set includes:
setting granularity nodes of three levels, namely leaf nodes, intermediate nodes and event nodes, according to the structured information to be extracted from the EMR (electronic medical record); and
setting the labels contained in each level of granularity nodes, wherein the leaf node comprises at least one of the labels pathological typing, TNM staging and tissue site; the intermediate node comprises at least one of the labels pathological findings and pathological diagnosis; and the event node comprises at least one of the labels pathological event, radiotherapy event, chemotherapy event and CT imaging examination event.
Optionally, the automatically constructing different hierarchical patterns according to the labeled information sample set includes:
determining the value corresponding to the label corresponding to the leaf node as a leaf node pattern;
matching the values corresponding to the labels of the intermediate node against the leaf node library, replacing each matched value with the label of the corresponding leaf node, and combining the replacement labels with the unmatched values to determine the pattern of the intermediate node;
and matching the values corresponding to the labels of the event node against the leaf node library and the intermediate node library respectively, replacing each matched value with the label of the corresponding leaf node and/or intermediate node, and combining the replacement labels with the unmatched values to determine the pattern of the event node.
Optionally, after automatically constructing different hierarchical patterns according to the labeled information sample set, the method further includes:
and correspondingly storing the pattern of each level and the labels contained in the granularity of each level.
Optionally, after automatically constructing different hierarchical patterns according to the labeled information sample set, the method further includes:
acquiring a free text of information to be extracted;
and matching step by step according to the patterns of different levels, and extracting the information of different levels of granularity in the free text.
To achieve the above object, according to a second aspect of the present application, there is provided an apparatus for hierarchical pattern construction for information extraction.
The device for constructing the hierarchical pattern for information extraction according to the present application comprises:
a first acquisition unit, used for acquiring a sample set with labeled information, wherein the sample set is free text and the labeled information is labels with different hierarchical granularities;
and the construction unit is used for automatically constructing different levels of patterns according to the sample set with the labeled information, and the different levels of granularity correspond to the different levels of patterns.
Optionally, the apparatus further comprises:
the setting unit is used for setting different-level granularity nodes and labels contained in the different-level granularity nodes according to the structural information to be extracted in the sample set before the sample set with the labeled information is obtained;
and the marking unit is used for marking the information of the free text in the sample set according to the labels contained in the granularity nodes of different levels.
Optionally, the building unit includes:
the first building module is used for determining a value corresponding to a label corresponding to the minimum level granularity node as a minimum level granularity pattern;
a second construction module, configured to construct, starting from the minimum level granularity node, each level granularity pattern by: matching the values corresponding to the labels of the current level granularity node against the node library of each level granularity node finer than the current level, wherein a node library consists of all the labels of a level granularity node and the values corresponding to those labels; replacing each successfully matched value with the corresponding label of the finer level granularity node; and combining the replacement labels with the values that were not matched to determine the granularity pattern of the current level.
Optionally, the sample set is an electronic medical record EMR set, and the setting unit includes:
the extraction module is used for setting granularity nodes of three levels of leaf nodes, intermediate nodes and event nodes according to structured information in EMR (electronic medical record) to be extracted;
the setting module is used for setting the labels contained in each level of granularity nodes, wherein the leaf node comprises at least one of the labels pathological typing, TNM staging and tissue site; the intermediate node comprises at least one of the labels pathological findings and pathological diagnosis; and the event node comprises at least one of the labels pathological event, radiotherapy event, chemotherapy event and CT imaging examination event.
Optionally, the building unit includes:
the first building module is further configured to determine a value corresponding to a label corresponding to a leaf node as a leaf node pattern;
the second building module is further configured to perform node value matching on a value corresponding to the label of the intermediate node and the leaf node library, replace the matched value with the label of the leaf node, and combine the replaced label and the unmatched value to determine a pattern of the intermediate node;
the second building module is further configured to match the values corresponding to the tags of the event nodes with the leaf node library and the intermediate node library respectively, replace the matched values with the tags of the leaf nodes and/or the tags of the intermediate nodes, and combine the replaced tags with the unmatched values to determine patterns of the event nodes.
Optionally, the apparatus further comprises:
the storage unit is used for correspondingly storing the pattern of each level and the labels contained in each level of granularity after the patterns of different levels have been automatically constructed according to the sample set with the labeled information.
Optionally, the apparatus further comprises:
the second acquisition unit is used for acquiring a free text of the information to be extracted after different levels of patterns are automatically constructed according to the sample set with the marked information;
and the extraction unit is used for matching step by step according to the patterns of different levels and extracting the information of different levels of granularity in the free text.
To achieve the above object, according to a third aspect of the present application, there is provided a non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions for causing the computer to execute the method for hierarchical pattern construction for information extraction of any one of the above first aspects.
In the embodiments of the application, the method and the device for constructing hierarchical patterns for information extraction obtain a sample set with labeled information, wherein the sample set is free text and the labeled information is labels with different hierarchical granularities, and automatically construct patterns of different levels according to that sample set, wherein different levels of granularity correspond to different levels of patterns. Because the patterns of different levels are constructed automatically, the manual writing of regular-expression rules that conventional information extraction requires is no longer needed, and extraction with the automatically constructed patterns is more fault-tolerant than traditional direct matching. In addition, because patterns are constructed automatically at different granularities, the precision and coverage of information extraction can be improved in complex text scenarios.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a flowchart of a method for hierarchical pattern construction for information extraction according to an embodiment provided herein;
FIG. 2 is a flowchart of a method for hierarchical pattern construction for information extraction according to another embodiment provided herein;
FIG. 3 is a block diagram illustrating an apparatus for hierarchical pattern construction for information extraction according to an embodiment provided herein;
FIG. 4 is a block diagram of an apparatus for hierarchical pattern construction for information extraction according to another embodiment provided herein.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
According to an embodiment of the present application, a method for hierarchical pattern construction for information extraction is provided. As shown in FIG. 1, the method includes the following steps S101 to S102.
First, it should be noted that the present application improves on conventional regular-expression matching: it provides an information extraction method that constructs hierarchical patterns, where a pattern is the expression template obtained by compiling a regular expression. Extraction with hierarchical patterns can still be classed as rule-based information extraction, but the patterns are constructed very differently from the traditional approach: they are constructed automatically at different granularities, so that the precision and coverage of information extraction can be improved in complex text scenarios.
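As a rough illustration of a pattern being obtained by compiling a regular expression, the sketch below turns a template containing `<label>` placeholders into a compiled Python regex. The placeholder syntax follows the examples given later in this description; the function names `compile_template` and `placeholder_labels`, and the choice of one lazy capture group per placeholder, are assumptions made for illustration and are not specified by the application.

```python
import re

# Hypothetical placeholder syntax: a label enclosed in angle brackets.
PLACEHOLDER = re.compile(r"<([^<>]+)>")


def compile_template(template):
    """Compile a template into a regex with one capture group per placeholder."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))  # literal text between placeholders
        parts.append("(.+?)")                             # placeholder becomes a lazy capture group
        pos = m.end()
    parts.append(re.escape(template[pos:]))               # trailing literal text
    return re.compile("".join(parts))


def placeholder_labels(template):
    """Return the labels of the placeholders in order of appearance."""
    return PLACEHOLDER.findall(template)


template = "obtained <specimen site> + <specimen site> resection specimen"
regex = compile_template(template)
match = regex.fullmatch("obtained right middle lung + lymph node resection specimen")
pairs = list(zip(placeholder_labels(template), match.groups()))
print(pairs)
```

Positional groups are used rather than named groups because the same label (here `specimen site`) may occur more than once in one template.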
S101, obtaining a sample set with marking information, wherein the sample set is a free text, and the marking information is labels with different hierarchical granularities.
The embodiments of the application can be applied in different fields wherever information is extracted from free text, such as the extraction of medical data in the medical field, of case information in the legal field, and of public opinion information in the field of internet communication. Sample sets differ between fields. A sample set contains a large number of free text samples; the more samples there are, the more comprehensive and accurate the resulting patterns, and the more accurate the information extraction based on them. The sample set is free text, i.e. unstructured text.
The labeled information is labels with different hierarchical granularities, which are set according to the structured information to be extracted from the samples in the sample set. When the hierarchical granularity is set, the coarser the granularity, the higher the level, and the finer the granularity, the lower the level; the levels are related to one another. For example, the granularity can be divided into events and entities, or into event nodes, intermediate nodes, leaf nodes, and so on. The setting of the hierarchical granularity can be adapted to the actual application field or to the actual requirements of the user. After the hierarchical granularity is set, the labels corresponding to each level of granularity are set accordingly.
The sample set with the labeled information is obtained by annotating the sample set with the labels corresponding to the different hierarchical granularities. The annotation may be manual or automatic; the present application does not limit the annotation mode, as long as it achieves annotation of the sample set.
S102, automatically constructing different levels of patterns according to the sample set with the labeled information, wherein different levels of granularity correspond to different levels of patterns.
Specifically, the step of automatically constructing patterns of different hierarchies according to the sample set with the labeled information comprises the following steps:
For the minimum level granularity node:
determining the values corresponding to the labels of the minimum level granularity node as the minimum level granularity pattern.
For nodes other than the minimum level granularity node:
starting from the minimum level granularity node, the construction of each level granularity pattern comprises: matching the values corresponding to the labels of the current level granularity node against the node library of each level granularity node finer than the current level, wherein a node library consists of all the labels of a level granularity node and the values corresponding to those labels; replacing each successfully matched value with the corresponding label of the finer level granularity node; and combining the replacement labels with the values that were not matched to determine the granularity pattern of the current level.
After the pattern of each level is obtained, the pattern of each level is stored in correspondence with the labels contained in that level of granularity. Subsequent matching is carried out according to the patterns of the different levels.
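The construction step described above can be sketched as follows. This is a minimal sketch under stated assumptions: "matching" is read as exact substring matching, longer values are tried before shorter ones, and replacement is done in a single pass so that inserted labels are never rescanned. None of these choices is fixed by the application, and the function name `build_pattern` is hypothetical. The sample values are adapted from the EMR example given later in this description.

```python
import re


def build_pattern(value, finer_libraries):
    """Replace every substring of `value` found in a finer-level node library with its label.

    finer_libraries: list of dicts mapping label -> list of values, one dict per finer level.
    Text that matches nothing is kept verbatim, so the result combines labels and unmatched text.
    """
    value_to_label = {}
    for library in finer_libraries:
        for label, values in library.items():
            for v in values:
                value_to_label.setdefault(v, label)
    if not value_to_label:
        return value  # minimum level: the pattern is the value itself
    # Longest alternatives first, so a whole intermediate value wins over a leaf inside it.
    alternatives = sorted(value_to_label, key=len, reverse=True)
    matcher = re.compile("|".join(re.escape(v) for v in alternatives))
    return matcher.sub(lambda m: "<" + value_to_label[m.group(0)] + ">", value)


leaf_library = {
    "pathological typing": ["central-type"],
    "lesion size": ["3 x 2cm"],
    "lesion texture": ["grey-white and hard"],
    "invasion site": ["not reaching the pleura"],
    "specimen site": ["right middle lung", "lymph node"],
}
findings = ("central-type mass 3 x 2cm, grey-white and hard, "
            "border unclear, not reaching the pleura.")
intermediate_library = {"pathological findings": [findings]}
event = "obtained right middle lung + lymph node resection specimen: " + findings

leaf_pattern = build_pattern("3 x 2cm", [])
intermediate_pattern = build_pattern(findings, [leaf_library])
event_pattern = build_pattern(event, [intermediate_library, leaf_library])
print(intermediate_pattern)
print(event_pattern)
```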
From the above description, it can be seen that the method for constructing hierarchical patterns for information extraction in the embodiments of the present application obtains a sample set with labeled information, where the sample set is free text and the labeled information is labels with different hierarchical granularities, and automatically constructs patterns of different levels according to that sample set, where different levels of granularity correspond to different levels of patterns. Because the patterns of different levels are constructed automatically, the manual writing of regular-expression rules that conventional information extraction requires is no longer needed, and extraction with the automatically constructed patterns is more fault-tolerant than traditional direct matching. In addition, because patterns are constructed automatically at different granularities, the precision and coverage of information extraction can be improved in complex text scenarios.
Further, this embodiment presents a flowchart of another method for hierarchical pattern construction for information extraction. As shown in FIG. 2, the method includes the following steps:
S201, setting granularity nodes of three levels, namely leaf nodes, intermediate nodes and event nodes, according to the structured information to be extracted from the EMR, and setting the labels contained in the granularity nodes of each level.
The embodiment of the present application is described taking information extraction from EMRs as an example. EMRs are highly domain-specific, and physicians usually write them in a relatively fixed format, so pattern-based information extraction is well suited to them.
Specifically, before the patterns are constructed, multi-level granularity nodes must first be set, as follows: granularity nodes of three levels, namely leaf nodes, intermediate nodes and event nodes, are set according to the structured information to be extracted from the EMR. More levels of granularity nodes can be set in practical applications. After the level granularity nodes are set, the labels contained in each level granularity node must be set. A specific example with three layers of nodes is given for explanation:
Leaf nodes are the minimum granularity nodes; their labels may be pathological typing, TNM staging, tissue site, and the like. An event is a descriptive text segment at the paragraph level that identifies a diagnosis or treatment event in the EMR; event nodes are the maximum granularity nodes, and their labels may be pathological event, radiotherapy event, chemotherapy event, CT imaging examination event, and the like. Intermediate nodes lie between the leaf nodes and the event nodes; their labels may be pathological findings, pathological diagnosis, and the like, and this layer can subsequently be used to capture the relationships between leaf nodes.
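One possible way to write down such a three-level setting is sketched below. The class name `GranularityLevel`, the numeric `rank` field and the helper functions are hypothetical and serve only to illustrate the ordering from fine to coarse; the label names are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GranularityLevel:
    name: str
    rank: int      # 0 is the finest level; larger means coarser
    labels: tuple  # labels that belong to this level


EMR_SCHEMA = (
    GranularityLevel("leaf", 0, ("pathological typing", "TNM staging", "tissue site")),
    GranularityLevel("intermediate", 1, ("pathological findings", "pathological diagnosis")),
    GranularityLevel("event", 2, ("pathological event", "radiotherapy event",
                                  "chemotherapy event", "CT imaging examination event")),
)


def level_of(label, schema=EMR_SCHEMA):
    """Return the level that a label belongs to."""
    for level in schema:
        if label in level.labels:
            return level
    raise KeyError(label)


def finer_levels(label, schema=EMR_SCHEMA):
    """Names of all levels finer than the level of `label`, finest first."""
    rank = level_of(label, schema).rank
    return [lvl.name for lvl in sorted(schema, key=lambda l: l.rank) if lvl.rank < rank]


print(finer_levels("pathological event"))
```

Adding a fourth level would only require another entry with a larger `rank`, which matches the remark that more levels can be set in practice.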
S202, labeling information of EMRs in the sample set according to labels contained in different levels of granularity nodes.
The annotation of one EMR text, using the labels from the above example, is as follows:
Original EMR text:
Radical surgery after admission; obtained right middle lung + lymph node resection specimen: central-type mass 3 x 2cm, grey-white and hard, border unclear, not reaching the pleura.
# Annotation
ann:
Layer 3: leaf nodes (consisting of original words)
L4_pathological typing: central-type
L5_lesion size: 3 x 2cm
L6_invasion site: not reaching the pleura
L7_lesion texture: grey-white and hard
L8_specimen site: L81 right middle lung, L82 lymph node
It should be noted that L4_pathological typing, L5_lesion size, L6_invasion site, L7_lesion texture and L8_specimen site are leaf node labels.
Layer 2: intermediate nodes (consisting of leaf nodes)
T2_pathological findings: central-type mass 3 x 2cm, grey-white and hard, border unclear, not reaching the pleura.
It should be noted that T2_pathological findings is the label of an intermediate node.
Layer 1: event nodes (may consist of intermediate nodes and leaf nodes)
T1_pathological event: obtained right middle lung + lymph node resection specimen: central-type mass 3 x 2cm, grey-white and hard, border unclear, not reaching the pleura.
It should be noted that T1_pathological event is the label of an event node.
Each EMR in the sample set is annotated in the manner of the above example, yielding a sample set with labeled information.
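An annotated sample of this kind could be held in memory as sketched below. The `Annotation` record and the `locate` helper are hypothetical. The check that every labeled value occurs verbatim in the text is an assumption, added because the later matching steps rely on substring matches. The sample text and values are adapted from the example above.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    level: str   # "leaf", "intermediate" or "event"
    label: str
    value: str   # the annotated text span, copied verbatim from the EMR


def locate(text, annotations):
    """Return (label, start, end) for each annotation, or raise if a value is not in the text."""
    spans = []
    for ann in annotations:
        start = text.find(ann.value)
        if start == -1:
            raise ValueError("value not found in text: " + repr(ann.value))
        spans.append((ann.label, start, start + len(ann.value)))
    return spans


text = ("Radical surgery after admission; obtained right middle lung + lymph node "
        "resection specimen: central-type mass 3 x 2cm, grey-white and hard, "
        "border unclear, not reaching the pleura.")
findings = ("central-type mass 3 x 2cm, grey-white and hard, "
            "border unclear, not reaching the pleura.")
annotations = [
    Annotation("leaf", "pathological typing", "central-type"),
    Annotation("leaf", "lesion size", "3 x 2cm"),
    Annotation("leaf", "invasion site", "not reaching the pleura"),
    Annotation("leaf", "lesion texture", "grey-white and hard"),
    Annotation("leaf", "specimen site", "right middle lung"),
    Annotation("leaf", "specimen site", "lymph node"),
    Annotation("intermediate", "pathological findings", findings),
    Annotation("event", "pathological event",
               "obtained right middle lung + lymph node resection specimen: " + findings),
]
spans = locate(text, annotations)
for label, start, end in spans:
    print(label, start, end)
```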
S203, automatically constructing the patterns corresponding to the leaf nodes, the intermediate nodes and the event nodes.
The patterns corresponding to the leaf nodes, intermediate nodes and event nodes are constructed automatically from the sample set with the labeled information. For each layer of nodes, the construction is as follows:
For leaf nodes: the values corresponding to the labels of the leaf nodes are determined as the leaf node patterns.
For intermediate nodes: the values corresponding to the labels of the intermediate node are matched against the leaf node library, each matched value is replaced with the label of the corresponding leaf node, and the replacement labels are combined with the unmatched values to determine the pattern of the intermediate node.
For event nodes: the values corresponding to the labels of the event node are matched against the leaf node library and the intermediate node library respectively, each matched value is replaced with the label of the corresponding leaf node and/or intermediate node, and the replacement labels are combined with the unmatched values to determine the pattern of the event node.
The automatic pattern construction in this step is illustrated with the example from the above step, as follows:
For leaf nodes: composed of words; the pattern is the value corresponding to the label
Pathological typing: central-type
Lesion size: 3 x 2cm
Invasion site: not reaching the pleura
Lesion texture: grey-white and hard
Specimen site: right middle lung
Specimen site: lymph node
For intermediate nodes: composed of words and leaf nodes
Pathological findings: central-type mass 3 x 2cm, grey-white and hard, border unclear, not reaching the pleura.
Corresponding pattern: <pathological typing> mass <lesion size>, <lesion texture>, border unclear, <invasion site>.
For event nodes: composed of words, leaf nodes and intermediate nodes
Pathological event: obtained right middle lung + lymph node resection specimen: central-type mass 3 x 2cm, grey-white and hard, border unclear, not reaching the pleura.
Corresponding pattern: obtained <specimen site> + <specimen site> resection specimen: <pathological findings>
Patterns are constructed in the manner of the above example for all annotated EMRs in the sample set.
Finally, it should be further noted that after the pattern of each level is obtained, the pattern of each level is stored in correspondence with the labels contained in that level of granularity. For example, for the intermediate nodes, the label "pathological findings" is stored together with the pattern "<pathological typing> mass <lesion size>, <lesion texture>, border unclear, <invasion site>".
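The corresponding storage could look like the sketch below, which keeps the patterns in a nested mapping from level to label to a list of patterns and serializes it as JSON. The layout, the JSON format and the function name `add_pattern` are assumptions; the application only states that patterns and labels are stored in correspondence.

```python
import json


def add_pattern(library, level, label, pattern):
    """Store `pattern` under level -> label, skipping exact duplicates."""
    patterns = library.setdefault(level, {}).setdefault(label, [])
    if pattern not in patterns:
        patterns.append(pattern)


FINDINGS_PATTERN = ("<pathological typing> mass <lesion size>, "
                    "<lesion texture>, border unclear, <invasion site>.")

library = {}
add_pattern(library, "leaf", "lesion size", "3 x 2cm")
add_pattern(library, "leaf", "specimen site", "right middle lung")
add_pattern(library, "leaf", "specimen site", "lymph node")
add_pattern(library, "intermediate", "pathological findings", FINDINGS_PATTERN)
add_pattern(library, "intermediate", "pathological findings", FINDINGS_PATTERN)  # duplicate, ignored

# ensure_ascii=False keeps non-ASCII text (for example Chinese EMR text) readable in the file.
serialized = json.dumps(library, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
print(serialized)
```

Deduplicating on insertion keeps the library small when many annotated samples yield the same pattern.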
S204: obtain the EMR from which information is to be extracted, and extract the information based on the patterns corresponding to the leaf nodes, intermediate nodes and event nodes.
The EMR from which information is to be extracted is an EMR text that requires information extraction. Extraction is performed with the patterns corresponding to the leaf nodes, intermediate nodes and event nodes. Specifically, matching proceeds level by level in the pattern library: leaf nodes are matched first, and each matched value is replaced with its leaf label; the intermediate nodes and then the event nodes are matched in the same way. This process yields, in order, the values of the entities (leaf nodes), the intermediate nodes and the event nodes, so that multi-level, multi-granularity information is extracted.
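The level-by-level matching of S204 might look like the following Python sketch. It is an illustration under stated assumptions rather than the method as implemented by the applicant: `match_step`, `extract` and the dictionary layout of the pattern libraries are hypothetical names, the input text is assumed to contain no literal `<...>` sequences, and known values are assumed not to overlap with label names. The example text swaps the two specimen sites of the earlier example to show that the same event pattern still matches.

```python
import re

PLACEHOLDER = re.compile(r"<[^<>]+>")

def match_step(text, nodes, library):
    """Match one level: replace known values/patterns in `text` with <label>.

    `nodes` holds one (label, surface text) pair per placeholder already in
    `text`, in order. Returns the rewritten text, the updated node list and
    the (label, surface text) hits found at this level.
    """
    lookup = {v: label for label, values in library.items() for v in values}
    if not lookup:
        return text, nodes, []
    alternation = re.compile(
        "|".join(re.escape(v) for v in sorted(lookup, key=len, reverse=True)))
    pieces, new_nodes, hits = [], [], []
    pos = used = 0
    for m in alternation.finditer(text):
        # Placeholders between the previous match and this one are kept as they are.
        skipped = len(PLACEHOLDER.findall(text[pos:m.start()]))
        new_nodes.extend(nodes[used:used + skipped])
        used += skipped
        # Placeholders inside the match are filled back in to recover the surface text.
        inside = len(PLACEHOLDER.findall(m.group(0)))
        children = iter(nodes[used:used + inside])
        used += inside
        surface = PLACEHOLDER.sub(lambda _: next(children)[1], m.group(0))
        label = lookup[m.group(0)]
        hits.append((label, surface))
        new_nodes.append((label, surface))
        pieces += [text[pos:m.start()], f"<{label}>"]
        pos = m.end()
    pieces.append(text[pos:])
    new_nodes.extend(nodes[used:])
    return "".join(pieces), new_nodes, hits

def extract(text, leaf_library, intermediate_patterns, event_patterns):
    """Match leaf nodes first, then intermediate nodes, then event nodes."""
    nodes, result = [], {}
    for level, library in (("leaf", leaf_library),
                           ("intermediate", intermediate_patterns),
                           ("event", event_patterns)):
        text, nodes, result[level] = match_step(text, nodes, library)
    return result

leaf_library = {
    "pathological typing": ["central type"],
    "lesion size": ["3 x 2cm"],
    "lesion texture": ["grey-white and hard"],
    "invasion site": ["not reaching the pleura"],
    "specimen site": ["right middle lung", "lymph node"],
}
intermediate_patterns = {"pathological observation": [
    "<pathological typing> mass <lesion size>, <lesion texture> "
    "with unclear boundaries, <invasion site>"]}
event_patterns = {"pathological event": [
    "obtained <specimen site> + <specimen site> resection specimen: "
    "<pathological observation>"]}

emr_text = ("obtained lymph node + right middle lung resection specimen: "
            "central type mass 3 x 2cm, grey-white and hard "
            "with unclear boundaries, not reaching the pleura")
result = extract(emr_text, leaf_library, intermediate_patterns, event_patterns)
```

Here `result["leaf"]`, `result["intermediate"]` and `result["event"]` each hold the labels and recovered text for one level of granularity.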
In the present application, multi-level patterns are constructed automatically, so the manual writing of regular-expression rules is no longer needed, and at coarse granularities, such as extraction at the intermediate layer (intermediate nodes) and of events (event nodes), fault tolerance is better than with traditional direct matching. In addition, matching by lookup is superior in performance to model-based methods. Because EMR text in the medical field is relatively standardized and formulaic and the set of medical terms is limited, information extraction with multi-level patterns is particularly suitable there.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one presented here.
According to an embodiment of the present application, there is also provided an apparatus for implementing the method of hierarchical pattern construction for information extraction of fig. 1 to fig. 2. As shown in fig. 3, the apparatus includes:
a first obtaining unit 31, configured to obtain a sample set with labeled information, where the sample set is free text and the labeled information consists of labels of different hierarchical granularities;
a constructing unit 32, configured to automatically construct patterns of different levels according to the sample set with the labeled information, where each level of granularity corresponds to a pattern of its own level.
Specifically, for the process by which each module of the apparatus in this embodiment implements its function, refer to the related description in the method embodiment; it is not repeated here.
From the above description, it can be seen that the device for constructing the hierarchical pattern for information extraction in the embodiment of the present application obtains a sample set with labeled information, where the sample set is free text and the labeled information consists of labels of different hierarchical granularities, and automatically constructs patterns of different levels from that sample set, with each level of granularity corresponding to a pattern of its own level. Because the patterns of different levels are constructed automatically, the manual writing of regular-expression rules for information extraction is no longer needed, and extracting information with the automatically constructed multi-level patterns gives better fault tolerance than traditional direct matching. In addition, constructing patterns automatically at different granularities can improve the precision and coverage of information extraction in complex text scenarios.
Further, as shown in fig. 4, the apparatus further includes:
a setting unit 33, configured to set, before the sample set with labeled information is obtained, nodes of different hierarchical granularities and the labels contained in those nodes according to the structured information to be extracted from the sample set;
and a labeling unit 34, configured to label the free text in the sample set according to the labels contained in the nodes of different hierarchical granularities.
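As an illustration of what the labeling unit might hand to the constructing unit, the following Python sketch groups labeled spans into one node library per level. The record layout `(level, label, value)` and the function name `build_node_libraries` are assumptions made for this sketch; the application does not specify a storage format.

```python
from collections import defaultdict

def build_node_libraries(annotations):
    """Group (level, label, value) annotations into {level: {label: [values]}}."""
    libraries = defaultdict(lambda: defaultdict(list))
    for level, label, value in annotations:
        values = libraries[level][label]
        if value not in values:  # keep each distinct value once, in first-seen order
            values.append(value)
    return {level: dict(by_label) for level, by_label in libraries.items()}

# Hypothetical annotation records for one EMR.
annotations = [
    ("leaf", "specimen site", "right middle lung"),
    ("leaf", "specimen site", "lymph node"),
    ("leaf", "specimen site", "lymph node"),
    ("leaf", "lesion size", "3 x 2cm"),
]
libraries = build_node_libraries(annotations)
```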
Further, as shown in fig. 4, the building unit 32 includes:
a first constructing module 321, configured to determine a value corresponding to a tag corresponding to the minimum hierarchical granularity node as a minimum hierarchical granularity pattern;
a second constructing module 322, configured to construct the pattern of each level granularity, starting from the minimum level granularity node: matching the value corresponding to each label of the current level granularity node against the node libraries of every level granularity smaller than the current one, where a node library contains all the labels of the nodes at that level granularity and the values corresponding to those labels; replacing each successfully matched value with the corresponding label of the smaller level granularity node; and combining the replaced labels with the values that were not matched to determine the pattern of the current level granularity.
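For an arbitrary number of levels, the behaviour of the two constructing modules could be sketched as a single loop that accumulates the node libraries of all smaller granularities. The names `substitute` and `build_hierarchy`, the list-of-dicts input format and the toy labels `word`, `phrase` and `sentence` are hypothetical and serve only to illustrate the loop.

```python
import re

def substitute(value, lower):
    """Replace values of smaller-granularity nodes found in `value` with <label>."""
    if not lower:
        return value  # minimum granularity: the pattern is the value itself
    alternation = "|".join(re.escape(v) for v in sorted(lower, key=len, reverse=True))
    return re.sub(alternation, lambda m: f"<{lower[m.group(0)]}>", value)

def build_hierarchy(levels):
    """levels: list of {label: [annotated values]}, smallest granularity first.

    Returns a list of {label: [patterns]} in the same order.
    """
    patterns, lower = [], {}
    for annotated in levels:
        patterns.append({label: [substitute(v, lower) for v in values]
                         for label, values in annotated.items()})
        # After a level is processed, its raw values join the lookup used by coarser levels.
        for label, values in annotated.items():
            for v in values:
                lower.setdefault(v, label)
    return patterns

toy_levels = [
    {"word": ["alpha"]},
    {"phrase": ["alpha beta"]},
    {"sentence": ["alpha beta gamma alpha"]},
]
toy_patterns = build_hierarchy(toy_levels)
```

Sorting candidate values by length is a design choice of this sketch: it makes the coarsest available node win when a larger value contains a smaller one.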
Further, the sample set is a set of electronic medical records (EMRs). As shown in fig. 4, the setting unit 33 includes:
an extraction module 331, configured to set granularity nodes of three levels, namely leaf nodes, intermediate nodes and event nodes, according to the structured information to be extracted from the EMR;
a setting module 332, configured to set the labels contained in the granularity nodes of each level, wherein the leaf node comprises at least one of the labels pathological type, TNM stage and tissue site; the intermediate node comprises at least one of the labels pathological observation and pathological diagnosis; and the event node comprises at least one of the labels pathological event, radiotherapy event, chemotherapy event and CT image examination event.
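The three levels and their labels could be held in a small configuration structure such as the Python sketch below. The class name `GranularityLevel`, the constant `EMR_SCHEMA` and the helper `level_of` are hypothetical; only the level names and labels come from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GranularityLevel:
    name: str      # level name, e.g. "leaf"
    labels: tuple  # labels that nodes of this level may carry

# Levels listed from the smallest granularity to the largest.
EMR_SCHEMA = (
    GranularityLevel("leaf", ("pathological type", "TNM stage", "tissue site")),
    GranularityLevel("intermediate", ("pathological observation", "pathological diagnosis")),
    GranularityLevel("event", ("pathological event", "radiotherapy event",
                               "chemotherapy event", "CT image examination event")),
)

def level_of(label):
    """Return the name of the level that contains `label`, or None if unknown."""
    for level in EMR_SCHEMA:
        if label in level.labels:
            return level.name
    return None
```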
Further, as shown in fig. 4, the building unit 32 includes:
the first constructing module 321 is further configured to determine a value corresponding to a label corresponding to a leaf node as a leaf node pattern;
the second building module 322 is further configured to perform node value matching on a value corresponding to the tag of the intermediate node and the leaf node library, replace the matched value with the tag of the leaf node, and combine the replaced tag and the unmatched value to determine a pattern of the intermediate node;
the second building module 322 is further configured to match the value corresponding to the label of the event node with the leaf node library and the intermediate node library respectively, replace the matched value with the label of the leaf node and/or the label of the intermediate node, and combine the replaced label and the unmatched value to determine the pattern of the event node.
Further, as shown in fig. 4, the apparatus further includes:
the storage unit 35 is configured to, after different levels of patterns are automatically constructed according to the sample set with the labeled information, correspondingly store the patterns of each level and the labels included in each level of granularity.
Further, as shown in fig. 4, the apparatus further includes:
the second obtaining unit 36 is configured to obtain a free text of the information to be extracted after different levels of patterns are automatically constructed according to the sample set with the labeled information;
and the extracting unit 37 is configured to perform matching step by step according to different hierarchical patterns, and extract information of different hierarchical granularities in the free text.
Specifically, for the process by which each module of the apparatus in this embodiment implements its function, refer to the related description in the method embodiment; it is not repeated here.
There is also provided, in accordance with an embodiment of the present application, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of hierarchical pattern construction for information extraction of any one of fig. 1 and 2.
It will be apparent to those skilled in the art that the modules or steps of the present application described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for hierarchical pattern construction for information extraction, the method comprising:
acquiring a sample set with labeled information, wherein the sample set is a free text, and the labeled information is labels with different hierarchical granularities;
and automatically constructing different levels of patterns according to the sample set with the labeled information, wherein different levels of granularity correspond to different levels of patterns.
2. The method for hierarchical pattern construction for information extraction as set forth in claim 1, wherein prior to obtaining a sample set with labeled information, the method further comprises:
setting different levels of granularity nodes and labels contained in the different levels of granularity nodes according to the structured information to be extracted in the sample set;
and marking the free text in the sample set according to the labels contained in the granularity nodes of different levels.
3. The method for hierarchical pattern construction for information extraction as set forth in claim 2, wherein the automatically constructing different hierarchical patterns from labeled information sample sets comprises:
determining a value corresponding to the label corresponding to the minimum level granularity node as a minimum level granularity pattern;
from the minimum level granularity node, the construction of each level granularity pattern comprises:
matching values corresponding to the labels of the current level granularity node with a node library corresponding to the level granularity node of each layer smaller than the current level granularity node, wherein the node library is all the labels in each layer of granularity node and the values corresponding to the labels;
replacing the successfully matched value with the corresponding label of the node whose level granularity is smaller than the current level granularity;
and combining the replaced labels and the values which are not matched successfully to determine the granularity pattern of the current level.
4. The method for hierarchical pattern construction for information extraction according to claim 2, wherein the sample set is an electronic medical record EMR set, and the setting of the different hierarchical granularity nodes and the labels included in the different hierarchical granularity nodes according to the structured information to be extracted in the sample set includes:
setting granularity nodes of three levels, namely leaf nodes, intermediate nodes and event nodes, according to the structured information to be extracted from the EMR; and
setting labels contained in each level of granularity nodes; wherein the leaf node comprises at least one label in a pathotype, a TNM stage, a tissue site; the intermediate node comprises at least one label of pathological observation and pathological diagnosis, and the event node comprises at least one label of pathological event, radiotherapy event, chemotherapy event and CT image examination event.
5. The method for hierarchical pattern construction for information extraction as set forth in claim 4, wherein the automatically constructing different hierarchical patterns from labeled information sample sets comprises:
determining the value corresponding to the label corresponding to the leaf node as a leaf node pattern;
matching the value corresponding to the label of the middle node with the leaf node library to obtain a node value, replacing the matched value with the label of the leaf node, and combining the replaced label with the unmatched value to determine the pattern of the middle node;
and respectively matching the values corresponding to the labels of the event nodes with the leaf node library and the intermediate node library to obtain node values, replacing the matched values with the labels of the leaf nodes and/or the labels of the intermediate nodes, and combining the replaced labels with the unmatched values to determine the pattern of the event nodes.
6. The method for hierarchical pattern construction for information extraction as set forth in claim 1, wherein after automatically constructing different hierarchical patterns from annotated sample sets, the method further comprises:
and correspondingly storing the pattern of each level and the labels contained in the granularity of each level.
7. The method for hierarchical pattern construction for information extraction as set forth in claim 1, wherein after automatically constructing different hierarchical patterns from annotated sample sets, the method further comprises:
acquiring a free text of information to be extracted;
and matching step by step according to the patterns of different levels, and extracting the information of different levels of granularity in the free text.
8. An apparatus for hierarchical pattern construction for information extraction, the apparatus comprising:
a first acquisition unit, configured to acquire a sample set with labeled information, wherein the sample set is a free text, and the labeled information is labels with different hierarchical granularities;
and the construction unit is used for automatically constructing different levels of patterns according to the sample set with the labeled information, and the different levels of granularity correspond to the different levels of patterns.
9. The apparatus for hierarchical pattern building for information extraction as recited in claim 8, further comprising:
the setting unit is used for setting different-level granularity nodes and labels contained in the different-level granularity nodes according to the structured information to be extracted in the sample set before the sample set with the labeled information is obtained;
and the marking unit is used for marking the information of the free text in the sample set according to the labels contained in the granularity nodes of different levels.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for hierarchical pattern construction for information extraction of any one of claims 1 to 7.
CN202011019692.6A 2020-09-24 2020-09-24 Method and device for constructing hierarchical pattern for information extraction Pending CN112100228A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011019692.6A CN112100228A (en) 2020-09-24 2020-09-24 Method and device for constructing hierarchical pattern for information extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011019692.6A CN112100228A (en) 2020-09-24 2020-09-24 Method and device for constructing hierarchical pattern for information extraction

Publications (1)

Publication Number Publication Date
CN112100228A true CN112100228A (en) 2020-12-18

Family

ID=73756130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011019692.6A Pending CN112100228A (en) 2020-09-24 2020-09-24 Method and device for constructing hierarchical pattern for information extraction

Country Status (1)

Country Link
CN (1) CN112100228A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111180076A (en) * 2018-11-13 2020-05-19 零氪科技(北京)有限公司 Medical information extraction method based on multilayer semantic analysis
CN111460141A (en) * 2020-03-05 2020-07-28 支付宝(杭州)信息技术有限公司 Text processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination