CN111737951B - Method and device for labeling text language association relations - Google Patents
Method and device for labeling text language association relations
- Publication number
- CN111737951B (application number CN201910212664.7A)
- Authority
- CN
- China
- Prior art keywords
- training
- information extraction
- subtasks
- entity
- text language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a method and a device for labeling association relations in text language. By exploiting the close relevance among the information extraction subtasks of text language, a composite labeling method independent of any specific model is designed. It naturally fuses multiple text language information extraction tasks and realizes joint learning and integrated training of the associated tasks, for example joint learning of named entity recognition and named entity normalization, of named entity recognition and entity relation extraction, or of named entity recognition and entity disambiguation. The proposed composite labeling method fully exploits the close association among the subtasks of text language information extraction and realizes complete joint learning, so that information shared among the associated tasks is mutually reinforcing, improving the precision and recall of text language information extraction as a whole.
Description
Technical Field
The invention belongs to the field of information technology and relates to computer-assisted information extraction from text language. It specifically relates to a composite labeling method that exploits the close relevance among the information extraction subtasks of text language to naturally fuse multiple information extraction tasks, realizing joint learning and integrated training of the associated tasks, so that information shared among them is mutually reinforcing and the precision and recall of text language information extraction are improved.
Background
Text language, as the main expression form of natural language, is an important carrier of information. In the current era of information explosion, the key to data intelligence is how to extract useful structured information from massive unstructured text. Information extraction from text language comprises several subtasks, such as named entity recognition, named entity normalization, and entity relation extraction. These subtasks are closely correlated, but traditional methods treat them as independent tasks and perform them separately (Peng Z, Sun L, Han X. SIR-ReDeeM: a Chinese name recognition and normalization system using a two-stage method [C]//Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing. 2012: 115-120.), so that the tasks cannot share and supplement each other's information.
Recently, a small number of researchers have begun to pay attention to the relevance between the text language information extraction subtasks. Liu X et al (Liu X, Zhou M, Wei F, et al. Joint inference of named entity recognition and normalization for tweets [C]//Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics, 2012: 526-535.) proposed a probabilistic-graph-based method for jointly learning named entity recognition and normalization that links the two tasks by introducing binary variable factors. This joint learning method is not a neural network architecture; it depends on feature engineering, which is tedious, time-consuming, and hard to adapt to different corpora. Zheng S et al (Zheng S, Hao Y, Lu D, et al. Joint entity and relation extraction based on a hybrid neural network [J]. Neurocomputing, 2017, 257: 59-66.) proposed a hybrid neural network for joint entity and relation extraction. In its training stage, the parameters related to named entity recognition are optimized first, and entity relation extraction is trained afterwards; such a two-stage training approach cannot be globally optimized. How to realize joint learning that does not depend on a specific machine learning or deep learning method, while supporting integrated training, remains a very challenging problem.
Disclosure of Invention
In view of the above problems, the present invention aims to provide a model-independent, general joint learning strategy that supports integrated training: it does not depend on a specific model and simultaneously supports multi-task integrated training.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for labeling incidence relation of text language includes the following steps:
1) Determining at least two related information extraction subtasks of the text language according to the requirements of the text language related tasks;
2) Analyzing the text corpus, and defining a tag set of each information extraction subtask;
3) Combining the tag sets of all the information extraction subtasks to form a composite labeling system;
4) And labeling the text corpus according to the composite labeling system.
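For illustration, steps 3) and 4) can be sketched in code. This is a minimal sketch under assumed tag names; the function and variable names are illustrative only, not part of the invention:

```python
# A minimal sketch of steps 3)-4): merging the tags of two subtasks
# (named entity recognition and named entity normalization) into composite
# tags of the form "position-type-symbol". All names here are illustrative.

def make_composite_tag(ner_tag: str, norm_symbol: str) -> str:
    """Combine an NER tag like 'B-ORG' with a normalization symbol like 'F'.

    The position part (B/I/E/S) is shared by both subtasks, so it appears
    only once in the composite tag -- this is the "common part" that the
    method merges.
    """
    if ner_tag == "O":           # characters outside any entity keep a bare "O"
        return "O"
    return f"{ner_tag}-{norm_symbol}"

# Hypothetical per-character annotations for a 4-character entity name:
ner_tags = ["B-ORG", "I-ORG", "I-ORG", "E-ORG", "O"]
norm_symbols = ["F", "F", "F", "F", ""]   # "F" = standard (full) name

composite = [make_composite_tag(t, s) for t, s in zip(ner_tags, norm_symbols)]
print(composite)   # ['B-ORG-F', 'I-ORG-F', 'I-ORG-F', 'E-ORG-F', 'O']
```

Labeling the corpus in step 4) then amounts to assigning one such composite tag to every character of the text.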
Further, the information extraction subtask in step 1) may include, but is not limited to, a named entity identification subtask, a named entity normalization subtask, and a named entity relationship extraction subtask.
Further, in step 2), a separate labeling scheme is defined on the corpus for each text language information extraction subtask; each subtask corresponds to a tag set that encodes the position of a character within an entity and the entity type.
Further, in step 3), for information extraction subtasks that have an association relation, their tag sets are combined and the common part of their tags is merged and optimized to form the composite labeling system, realizing a natural fusion of the multiple tasks.
A text language association labeling apparatus, comprising:
the subtask determining module is responsible for determining at least two related information extraction subtasks of the text language according to the requirements of the text language related tasks;
the label set definition module is responsible for analyzing the text corpus and defining the label sets of all the information extraction subtasks;
the label combination module is responsible for combining label sets of all the information extraction subtasks to form a composite labeling system;
and the marking module is responsible for marking the text corpus according to the composite marking system.
A machine learning model integrated training method supporting multiple tasks comprises the following steps:
(1) Labeling the text corpus according to the composite labeling system by adopting the method to obtain a training data set and a test data set;
(2) Selecting a specific machine learning (including deep learning) model;
(3) In the prediction stage, a label sequence obtained by predicting the machine learning model according to an input sequence is decoded according to a composite labeling system to obtain a final label prediction result;
(4) And in the training iterative process of the machine learning model, optimizing on a training data set, simultaneously testing on a testing data set, and stopping training when the result on the testing data set is reduced.
Furthermore, a plurality of tasks are completely fused together through the composite labeling system, so that integrated training is realized, and the separate training of each task in multiple stages is not needed.
Further, the machine learning model is a traditional machine learning model or a deep learning model based on a deep neural network, and the traditional machine learning model comprises a conditional random field, a hidden markov model or other models based on probability maps.
Further, the decoding extracts entity relationships according to a proximity principle.
A multitasking enabled machine learning model integrated training device, comprising:
the data preparation module is responsible for labeling the text corpus according to the composite labeling system by adopting the method to obtain a training data set and a test data set;
the model selection module is responsible for selecting a specific machine learning model;
the decoding module is responsible for decoding a mark sequence obtained by predicting the machine learning model according to an input sequence according to a composite labeling system in a prediction stage to obtain a final label prediction result;
and the training module is responsible for optimizing the training data set and testing the testing data set in the training iterative process of the machine learning model, and stops training when the result on the testing data set is reduced.
The invention provides a model-independent, general joint learning strategy that supports integrated training: it does not depend on a specific model, supports both traditional statistics-based machine learning and deep learning based on deep neural networks, and supports multi-task integrated training, for example joint learning of named entity recognition and named entity normalization, of named entity recognition and entity relation extraction, or of named entity recognition and entity disambiguation. The invention naturally fuses multiple text language information extraction tasks and realizes joint learning and integrated training of the associated tasks, so that information shared among them is mutually reinforcing, improving the precision and recall of text language information extraction.
The invention is an innovation in the labeling method itself; it does not involve a specific model and is applicable to both traditional machine learning and neural-network-based deep learning. By combining the tag sets of multiple subtasks into a composite labeling system, a natural fusion of the tasks is achieved: the tasks are completely merged by the composite labeling system, integrated training becomes possible, and separate multi-stage training of each task is no longer required.
Drawings
FIG. 1 is a schematic diagram of the text language association relation labeling method according to an embodiment of the present invention, where (a) shows composite labels for named entity recognition and normalization, and (b) shows composite labels for named entity recognition and relation extraction.
FIG. 2 is a flow chart of steps for an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
Fig. 1 is a schematic diagram of the text language association relation labeling method according to an embodiment of the present invention, where (a) shows the composite labeling system of a joint learning framework applying the method to named entity recognition and entity normalization, and (b) shows the composite labeling system of a joint learning framework applying it to named entity recognition and relation extraction. The labeling method is applicable to joint learning of all kinds of associated text language subtasks; it is described here using only these two examples.
The composite label in Fig. 1(a) is composed of a named entity recognition label and a named entity normalization label, in the style [position-entity type-normalization symbol]. Here B, I, E, S, O denote the position of a character (a character in Chinese or a word in English) within an entity: B stands for begin, the starting position of an entity name; I stands for inside, a middle position of the entity name; E stands for end, the final position of the entity name; S stands for single, meaning the entity consists of only one character; O stands for out, meaning the character is not part of any entity name. "ORG" means the entity type is an organization; entity types can be freely defined according to the task requirements, common ones being "PER" (person name) and "LOC" (place name). For normalization, an entity name with only one surface form in a document is marked S; when an entity has several surface forms, its standard name is marked F and its non-standard names (abbreviations, alternative names, etc.) are marked A, with the convention that F is longer than A. In the example of Fig. 1(a), "transportation bank" is a standard name, while "delivery bank" is its abbreviation and thus a non-standard name of the same entity concept. "Transportation bank" is an organization, so its first character is labeled "B-ORG-F", representing the first character of the standard name of an organization-class entity.
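For illustration only, a composite tag in this style can be split back into its three components with a small helper (the function name is hypothetical; the tag values follow the Fig. 1(a) example):

```python
# A small helper, sketched from the tag style described above, that splits a
# composite tag "[position-entity type-normalization symbol]" back into its
# three components; tag names follow the example in Fig. 1(a).

from typing import Optional, Tuple

def parse_composite_tag(tag: str) -> Optional[Tuple[str, str, str]]:
    """Return (position, entity_type, symbol), or None for the 'O' tag."""
    if tag == "O":
        return None
    position, entity_type, symbol = tag.split("-")
    assert position in {"B", "I", "E", "S"}
    return position, entity_type, symbol

print(parse_composite_tag("B-ORG-F"))   # ('B', 'ORG', 'F')
print(parse_composite_tag("O"))         # None
```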
The composite label in Fig. 1(b) is composed of a named entity recognition label and an entity relation extraction label, in the style [position-entity type-entity relation]. B, I, E, S, O again denote the position of the character within the entity. The sets of entity types and entity relations must be defined in advance according to the task requirements. Here "ORG" denotes an organization entity, "PER" a person-name entity, "LOC" a place-name entity, and "CF" the "Company-Founder" relation. In Fig. 1(b), "Li Ming" is the founder of "Sunshine Company" (both the person name and the company name are fictitious, for example only), so the two are in a "CF" relation. The first character of "Sunshine Company" is labeled "B-ORG-CF", representing the first character of an organization entity in a "Company-Founder" relation; similarly, the last character of "Li Ming" is labeled "E-PER-CF", representing the last character of a person-name entity in that relation. "Beijing" is a place-name entity with no defined relation to any other entity in the example text, so its first character is labeled "B-LOC-S", where "S" denotes a single entity with no entity relation.
FIG. 2 is a flow chart of steps of an embodiment of the present invention, including the steps of:
Step 1: define the requirements of the text-language-related task, and determine at least two associated information extraction subtasks of the text language for the specific data set and application scenario, for example the named entity recognition subtask and the entity relation extraction subtask shown in Fig. 1(b).
Step 2: analyze the specific text corpus and, for each text language information extraction subtask, define a corresponding independent labeling scheme on the corpus; each task corresponds to a tag set that encodes the position of characters within an entity, the entity type, and so on.
For example, for the named entity recognition subtask, taking entity types of organization and person name as an example, the defined tag set is {B-ORG, I-ORG, E-ORG, S-ORG, B-PER, I-PER, E-PER, S-PER, O}; for the entity relation extraction subtask, taking the relations Country-President (CP), Company-Founder (CF), and Part-Whole (PW) as an example, the defined tag set is {e1-CP, e2-CP, e1-CF, e2-CF, e1-PW, e2-PW}, where e1 and e2 denote the role positions in a pair of related entities; e1-CP, for instance, denotes the country role in the Country-President relation.
Step 3: for subtasks that have an association relation, combine their tags and merge the common part of each subtask's tags to form the composite labeling system.
For example, the tags of the named entity recognition and named entity normalization subtasks both contain the position information of characters within entities; when the two tag sets are combined, this common part can be merged so that the two subtasks share the entity position information. In addition, when the tags of the subtasks are combined, the application scenario of the specific problem should be considered for further optimization, reducing the number of tags in the composite labeling system as far as possible. For example, for named entity recognition and entity relation extraction, the resulting composite labeling system is shown in Fig. 1(b); it contains both the named entity recognition labels and the entity relation extraction labels.
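To illustrate the size of the merged tag space, the following sketch builds the full composite tag set for NER plus relation extraction, assuming the entity types and relation symbols used in the examples above (all variable names are illustrative):

```python
# A sketch of step 3 for NER + relation extraction: the position/type part
# comes from the NER tag set and the relation part from the relation tag set,
# with the per-character position information shared rather than duplicated.

from itertools import product

positions = ["B", "I", "E", "S"]
entity_types = ["ORG", "PER", "LOC"]
relations = ["CF", "CP", "PW", "S"]   # "S" marks an entity with no relation

composite_tagset = {"O"} | {
    f"{pos}-{etype}-{rel}"
    for pos, etype, rel in product(positions, entity_types, relations)
}

# 4 positions x 3 entity types x 4 relation symbols + "O" = 49 composite tags
print(len(composite_tagset))   # 49
```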
Step 4.1: label the corpus with the composite labeling system defined in step 3. The labeled result is shown in Fig. 1(b).
Step 4.2: split the labeled corpus into a training data set and a test data set.
Step 5.1: select a specific machine learning (including deep learning) model. This may be a traditional machine learning model, such as a conditional random field, a hidden Markov model, or another probabilistic graphical model, or a deep learning model based on a deep neural network.
Step 5.2: define a cost function for the machine learning model. A commonly used cost function in sequence labeling problems is the cross-entropy loss:

J(θ) = -(1/m) Σ_{i=1}^{m} y^{(i)} · log h_θ(x^{(i)})

where J(θ) denotes the cross-entropy loss, θ the parameters of the model, m the number of training samples, y^{(i)} the true probability value of the i-th sample, x^{(i)} the i-th sample input, h_θ the mapping function of the model, and h_θ(x^{(i)}) the predicted output probability for the i-th input under the mapping of the model.
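As a numerical illustration of this cost function, the following sketch evaluates the cross-entropy on toy values (the batch, label space, and probabilities are invented for the example):

```python
# A numerical sketch of the cross-entropy loss J(theta) described above, for a
# toy batch of m samples where the model outputs a probability distribution
# over the tag set for each sample.

import math

def cross_entropy(y_true, y_pred):
    """J = -(1/m) * sum_i sum_k y_true[i][k] * log(y_pred[i][k])."""
    m = len(y_true)
    total = 0.0
    for true_dist, pred_dist in zip(y_true, y_pred):
        total += sum(t * math.log(p) for t, p in zip(true_dist, pred_dist) if t > 0)
    return -total / m

# Two samples over a 3-tag toy label space; rows of y_true are one-hot gold labels.
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(round(cross_entropy(y_true, y_pred), 4))   # 0.2899
```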
Step 6: decode the tag sequence that the machine learning model predicts for an input sequence according to the composite labeling system, combining the predicted tags of adjacent characters and translating them into a readable entity extraction result. That is, the decoding stage of the composite labeling system extracts entity relations according to the proximity principle.
For example, if "Li" is labeled "B-PER-CF" and "Ming" is labeled "E-PER-CF", then the span from position "B" to position "E" is the range of an entity name; "PER" indicates a person name, so the person-name entity "Li Ming" is extracted, and "CF" indicates that this entity is a person-name entity in a "Company-Founder" relation. Similarly, "Sunshine Company" is an organization entity in a "CF" relation, so the relation triple (Sunshine Company, Company-Founder, Li Ming) is obtained. The other tag sequences are decoded in the same way to obtain the final entity labeling result output after the model predicts on the input text sequence.
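The decoding procedure described here can be sketched as follows; the characters, tags, and helper names are illustrative stand-ins for the "Sunshine Company ... Li Ming" example, not the invention's actual implementation:

```python
# A sketch of the decoding step described above: walk the predicted composite
# tag sequence, collect entity spans from position marks B..E (or S), then
# pair up entities that carry the same relation symbol by proximity.

def decode(chars, tags):
    """Return (entities, relation_pairs) from a composite tag sequence."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        pos, etype, rel = tag.split("-")
        if pos in ("B", "S"):
            start = i
        if pos in ("E", "S") and start is not None:
            entities.append(("".join(chars[start:i + 1]), etype, rel))
            start = None
    # Proximity principle: link each entity to the nearest later entity
    # sharing its relation symbol ("S" means no relation).
    pairs = []
    for j, (name, etype, rel) in enumerate(entities):
        if rel == "S":
            continue
        for name2, etype2, rel2 in entities[j + 1:]:
            if rel2 == rel:
                pairs.append((name, rel, name2))
                break
    return entities, pairs

chars = list("SunCo") + ["_"] + list("Li")     # toy 8-character sequence
tags = ["B-ORG-CF", "I-ORG-CF", "I-ORG-CF", "I-ORG-CF", "E-ORG-CF",
        "O", "B-PER-CF", "E-PER-CF"]
ents, rels = decode(chars, tags)
print(ents)   # [('SunCo', 'ORG', 'CF'), ('Li', 'PER', 'CF')]
print(rels)   # [('SunCo', 'CF', 'Li')]
```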
Step 7: during the training iterations, optimize on the training data set, typically with an adaptive-learning-rate gradient descent algorithm such as Adam (Kingma D P, Ba J. Adam: A method for stochastic optimization [J]. arXiv preprint arXiv:1412.6980, 2014.). Meanwhile, test on the test data set and stop training when the result on the test data set decreases. This ensures both the fitting ability and the generalization ability of the model.
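The early-stopping scheme of step 7 can be sketched as follows, assuming a model object with hypothetical train_epoch/evaluate methods; the toy model below exists only to exercise the stopping logic:

```python
# A schematic training loop for step 7: optimize on the training set,
# monitor the test set each iteration, and stop when the score drops.

def train_with_early_stopping(model, train_data, test_data, max_epochs=100):
    best_score, best_state = float("-inf"), None
    for epoch in range(max_epochs):
        model.train_epoch(train_data)       # e.g. one pass of Adam updates
        score = model.evaluate(test_data)   # e.g. F1 on the held-out set
        if score <= best_score:             # result decreased: stop training
            break
        best_score, best_state = score, model.state()
    return best_score, best_state

class ToyModel:
    """Stand-in whose test score rises then falls, to exercise the loop."""
    def __init__(self, scores):
        self._scores, self._step = scores, -1
    def train_epoch(self, data):
        self._step += 1
    def evaluate(self, data):
        return self._scores[self._step]
    def state(self):
        return self._step

best, state = train_with_early_stopping(ToyModel([0.6, 0.7, 0.65]), None, None)
print(best)   # 0.7 -- training stops when the score drops to 0.65
```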
Based on the same inventive concept, another embodiment of the present invention provides a device for labeling a text language association relationship, including:
the subtask determining module is responsible for determining at least two related information extraction subtasks of the text language according to the requirements of the text language related tasks;
the label set definition module is responsible for analyzing the text corpus and defining the label sets of all the information extraction subtasks;
the label combination module is responsible for combining label sets of all the information extraction subtasks to form a composite labeling system;
and the marking module is responsible for marking the text corpus according to the composite marking system.
Based on the same inventive concept, another embodiment of the present invention provides a machine learning model integrated training device supporting multiple tasks, comprising:
the data preparation module is responsible for labeling the text corpus according to the composite labeling system by adopting the method to obtain a training data set and a test data set;
the model selection module is responsible for selecting a specific machine learning model;
the decoding module is responsible for decoding a mark sequence obtained by predicting the machine learning model according to an input sequence according to a composite labeling system in a prediction stage to obtain a final label prediction result;
and the training module is responsible for optimizing on a training data set and testing on a testing data set in the training iterative process of the machine learning model, and stopping training when the result on the testing data set is reduced.
The specific implementation of the modules is described in the foregoing description of the method of the present invention.
The above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the same, and those skilled in the art can make modifications or equivalent substitutions to the technical solutions of the present invention without departing from the principle and scope of the present invention, and the scope of the present invention should be determined by the claims.
Claims (7)
1. A method for labeling association relations of text language, characterized by comprising the following steps:
1) Determining at least two related information extraction subtasks of the text language according to the requirements of the text language related tasks;
2) Analyzing the text corpus, and defining a tag set of each information extraction subtask;
3) Combining the tag sets of all the information extraction subtasks to form a composite labeling system;
4) Labeling the text corpus according to the composite labeling system;
wherein, the information extraction subtask of step 1) includes: a named entity identification subtask, a named entity standardization subtask and a named entity relation extraction subtask;
step 2) for each text language information extraction subtask, defining a corresponding independent labeling system on the corpus, wherein each information extraction subtask corresponds to a label set and comprises the position of characters in an entity and the entity type;
step 3) extracting subtasks from the information with the correlation, combining the label sets of the information extraction subtasks, optimizing the public part in the labels of the information extraction subtasks to form a composite labeling system, and realizing the natural fusion of multiple tasks;
the optimizing common parts in the labels of each information extraction subtask comprises the following steps:
for the condition that the labels of the two subtasks of named entity identification and named entity standardization both contain the position information of characters in the entities, the common part is optimized when the labels of the two subtasks are combined, so that the two subtasks share the entity position information.
2. A device for labeling association relation of text language using the method of claim 1, comprising:
the subtask determining module is responsible for determining at least two related information extraction subtasks of the text language according to the requirements of the text language related tasks;
the tag set definition module is responsible for analyzing the text corpus and defining the tag set of each information extraction subtask;
the label combining module is responsible for combining label sets of all the information extraction subtasks to form a composite labeling system;
and the marking module is responsible for marking the text corpus according to the composite marking system.
3. A machine learning model integrated training method supporting multiple tasks is characterized by comprising the following steps:
(1) Labeling the text corpus according to a composite labeling system by the method of claim 1 to obtain a training data set and a test data set;
(2) Selecting a specific machine learning model;
(3) In the prediction stage, a label sequence obtained by predicting the machine learning model according to an input sequence is decoded according to a composite labeling system to obtain a final label prediction result;
(4) And in the training iterative process of the machine learning model, optimizing on a training data set, simultaneously testing on a testing data set, and stopping training when the result on the testing data set is reduced.
4. The method of claim 3, wherein multiple tasks are completely merged together by the composite labeling system to achieve integrated training, without separate multi-stage training of each task.
5. The method of claim 3, in which the machine learning model is a traditional machine learning model comprising a conditional random field, a hidden Markov model, or another probabilistic graphical model, or is a deep learning model based on a deep neural network.
6. The method of claim 3, wherein the decoding extracts entity relationships on a proximity basis.
7. A machine learning model integrated training device supporting multiple tasks, comprising:
the data preparation module is responsible for labeling the text corpora according to the composite labeling system by adopting the method of claim 1 to obtain a training data set and a test data set;
the model selection module is responsible for selecting a specific machine learning model;
the decoding module is responsible for decoding a mark sequence obtained by predicting the machine learning model according to an input sequence according to a composite labeling system in a prediction stage to obtain a final label prediction result;
and the training module is responsible for optimizing the training data set and testing the testing data set in the training iterative process of the machine learning model, and stops training when the result on the testing data set is reduced.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910212664.7A CN111737951B (en) | 2019-03-20 | 2019-03-20 | Text language incidence relation labeling method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111737951A CN111737951A (en) | 2020-10-02 |
CN111737951B true CN111737951B (en) | 2022-10-14 |
Family
ID=72645595
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910212664.7A Active CN111737951B (en) | 2019-03-20 | 2019-03-20 | Text language incidence relation labeling method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111737951B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112149423B (en) * | 2020-10-16 | 2024-01-26 | 中国农业科学院农业信息研究所 | Corpus labeling method and system for domain entity relation joint extraction |
CN115081453B (en) * | 2022-08-23 | 2022-11-04 | 北京睿企信息科技有限公司 | Named entity identification method and system |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104603773A (en) * | 2012-06-14 | 2015-05-06 | 诺基亚公司 | Method and apparatus for associating interest tags with media items based on social diffusions among users |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9069343B2 (en) * | 2011-11-14 | 2015-06-30 | Rockwell Automation Technologies, Inc. | Generation and publication of shared tagsets |
CN103853706B (en) * | 2012-12-06 | 2017-04-12 | 富士通株式会社 | Method and equipment for converting simplified Chinese sentence into traditional Chinese sentence |
US20140163951A1 (en) * | 2012-12-07 | 2014-06-12 | Xerox Corporation | Hybrid adaptation of named entity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||