CN113326700A

CN113326700A - ALBert-based complex heavy equipment entity extraction method

Info

Publication number: CN113326700A
Application number: CN202110217185.1A
Authority: CN
Inventors: 李军怀; 陈苗苗; 王怀军; 曹霆; 于蕾
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2021-08-31
Anticipated expiration: 2041-02-26
Also published as: CN113326700B

Abstract

The invention discloses an ALBert-based complex heavy equipment entity extraction method, which is implemented according to the following steps: step 1, collecting texts in the field of complex heavy equipment and constructing a corpus; step 2, pre-training an ALBert model on the corpus obtained in step 1 to obtain a pre-trained word representation model ALBert; step 3, labeling the entity names in the corpus obtained in step 1 and converting the text into the format read by the algorithm, to obtain a training set and a verification set; step 4, training the model, namely feeding the labeled data into an ALBert-BGRU-Attention-CRF algorithm to obtain a trained model; step 5, creating a dictionary Dict; step 6, inputting the text to be extracted into the model obtained in step 4 and combining it with the dictionary Dict constructed in step 5 to obtain the entity extraction result. The invention can complete the entity extraction task in the field of complex heavy equipment.

Description

ALBert-based complex heavy equipment entity extraction method

Technical Field

The invention belongs to the technical field of knowledge graphs, and particularly relates to an ALBert-based complex heavy equipment entity extraction method.

Background

Complex heavy equipment is among the important basic equipment of the manufacturing industry and an important guarantee for social and economic development and for the national defense industry; as strategic national equipment, it is particularly important. As high-end equipment, heavy equipment is widely used in key industries and fields such as energy, transportation, shipbuilding, engineering machinery, metallurgy, aerospace and the military industry. Heavy equipment has a long development cycle with complex stages, including preliminary investigation, design, manufacturing, procurement, matching, installation, commissioning, delivery, quality control and after-sales service. These processes generate a large amount of knowledge, and this knowledge is largely stored in text form.

With the development of new internet technology, effectively managing and reusing knowledge in the equipment manufacturing industry can better support the whole process of design, production, operation and maintenance. A knowledge graph is an efficient way to organize and manage knowledge. Entity extraction is one of the key steps in constructing a knowledge graph, and its accuracy determines the accuracy of the knowledge graph to a certain extent. Entity extraction from complex heavy equipment texts therefore lays a foundation for subsequent knowledge graph construction, effective knowledge management and knowledge reuse.

Disclosure of Invention

The invention aims to provide an ALBert-based entity extraction method for complex heavy equipment, which can complete an entity extraction task in the field of complex heavy equipment.

The technical solution adopted by the invention is an ALBert-based complex heavy equipment entity extraction method, implemented according to the following steps:

step 1, collecting texts in the field of complex heavy equipment, and constructing a corpus;

step 2, pre-training an ALBert model by using the corpus obtained in the step 1 to obtain a pre-trained word representation model ALBert;

step 3, labeling the entity names in the corpus obtained in the step 1, and converting the text into the format read by the algorithm to obtain a training set and a verification set;

step 4, training the model, namely feeding the labeled data into an ALBert-BGRU-Attention-CRF algorithm to obtain a trained model;

step 5, creating a dictionary Dict;

step 6, inputting the text to be extracted into the model obtained in the step 4, and combining the dictionary Dict constructed in the step 5 to obtain an entity extraction result.

The present invention is also characterized in that,

in the step 1, the web crawler framework Scrapy is used to capture information about complex heavy equipment from web pages and store it as text files, and the stored text is combined with manually collected existing documents from the complex heavy equipment field to serve as the data source; the data source is then processed to remove special symbols, formulas and measurement units; the processed data is stored as text files and serves as the corpus.
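The cleaning part of step 1 can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how special symbols, formulas or measurement units are detected, so the regular expressions, the unit list and the function name `clean_line` are hypothetical choices made for the example.

```python
import re

# Hypothetical patterns: the source does not define what counts as a
# special symbol, a formula or a measurement unit.
FORMULA_PATTERN = re.compile(r"\$[^$]*\$")  # inline formulas written as $...$
UNIT_PATTERN = re.compile(
    r"\d+(?:\.\d+)?\s*(?:mm|cm|kg|MPa|kN|MN|kW|t|m)(?![A-Za-z])"
)  # a number followed by an assumed unit abbreviation
SPECIAL_PATTERN = re.compile(r"[#*@&^~|<>{}\[\]\\]")  # assumed special symbols


def clean_line(line: str) -> str:
    """Remove formulas, number+unit expressions and special symbols."""
    line = FORMULA_PATTERN.sub("", line)
    line = UNIT_PATTERN.sub("", line)
    line = SPECIAL_PATTERN.sub("", line)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", line).strip()
```

The look-ahead `(?![A-Za-z])` is used instead of `\b` because a unit in Chinese text is often followed directly by a Chinese character, which Python's `re` treats as a word character.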

In the step 2, the ALBert model takes single Chinese characters as input; a start mark [CLS] is added before the first character of each sentence and an end mark [SEP] is added at the end of each sentence. The ALBert output is, for each input character, a representation vector that fuses the semantic information of the text. On the basis of the pre-trained ALBert model, the parameters of the subsequent connected layers are fine-tuned on the corpus from the data source, while the internal parameters of ALBert do not participate in training, yielding the fine-tuned ALBert model.
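The input format described in step 2 can be illustrated with a few lines of Python. The function name `to_albert_tokens` is hypothetical, and the sketch covers only the character splitting and the [CLS]/[SEP] marks; mapping tokens to vocabulary ids and the model itself are out of scope.

```python
from typing import List


def to_albert_tokens(sentence: str) -> List[str]:
    """Split a sentence into single characters and add the start/end marks."""
    # Each Chinese character is one input token; whitespace is dropped.
    chars = [ch for ch in sentence if not ch.isspace()]
    return ["[CLS]", *chars, "[SEP]"]
```

For instance, `to_albert_tokens("金属挤压机")` returns `['[CLS]', '金', '属', '挤', '压', '机', '[SEP]']`.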

In the step 3, entity labeling is completed manually. Entities are labeled in the BIO scheme: the first character of an entity is labeled B-Type, the non-first characters of an entity are labeled I-Type, and non-entity characters and punctuation marks are labeled O, where Type represents the entity type.

The training model in step 4 is specifically as follows:

step 4.1, inputting the training set and the verification set obtained in the step 3 into the ALBert model fine-tuned in the step 2 to generate word vectors;

step 4.2, inputting the word vectors generated in the step 4.1 into a bidirectional gated recurrent unit (BGRU) to obtain the score of each word on all the labels;

step 4.3, weighting the result of the step 4.2 by using an Attention mechanism to obtain a weighted score of each word on all the labels;

step 4.4, using a conditional random field (CRF) to constrain the tag sequence, reducing the probability of invalid sequences;

step 4.5, obtaining the trained entity extraction model.
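The weighting in step 4.3 can be sketched in plain Python. The text does not specify which form of attention is used, so this example assumes scaled dot-product self-attention applied directly to the per-character label scores; the function names are hypothetical and a real model would operate on learned tensors.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weighted_scores(scores: List[Vector]) -> List[Vector]:
    """Re-weight each character's label scores by attention over the sentence.

    scores[i][k] is the (assumed) BGRU score of character i on label k.
    """
    dim = len(scores[0])
    weighted = []
    for query in scores:
        # Similarity of this character to every character in the sentence.
        sims = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in scores
        ]
        weights = softmax(sims)
        # Attention-weighted sum of all characters' score vectors.
        weighted.append(
            [sum(w * vec[j] for w, vec in zip(weights, scores)) for j in range(dim)]
        )
    return weighted
```

Each output row is a convex combination of the input rows, so characters that score similarly reinforce one another before the CRF layer is applied.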

The step 5 is as follows:

relevant names are extracted from the complex heavy equipment detailed information table to form the dictionary Dict, wherein the names include but are not limited to part names, combination names and product names.
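Building the dictionary in step 5 can be sketched as below, assuming the detailed information table is available as CSV text. The column names `part`, `combination` and `product` and the function name `build_dict` are hypothetical; the source does not describe the table layout.

```python
import csv
import io
from typing import Iterable, Set


def build_dict(
    table_csv: str,
    name_columns: Iterable[str] = ("part", "combination", "product"),
) -> Set[str]:
    """Collect the non-empty names found in the chosen columns of a CSV table."""
    names: Set[str] = set()
    for row in csv.DictReader(io.StringIO(table_csv)):
        for column in name_columns:
            value = (row.get(column) or "").strip()
            if value:
                names.add(value)
    return names
```

Using a set removes duplicate names, which is convenient because the same product name typically appears on many rows of such a table.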

The step 6 is as follows:

step 6.1, for a large volume of text to be extracted, all the texts are fed into the entity extraction model trained in the step 4 to obtain a preliminary recognition result, and the dictionary Dict constructed in the step 5 is then applied on this basis for a secondary extraction to obtain the final entity extraction result;

step 6.2, for entity extraction from a single sentence, online recognition is used: the sentence to be extracted is pasted into the online recognition window, and the model obtained in the step 4 is called in combination with the dictionary Dict to give the extraction result.
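One possible way to combine the model output with the dictionary in step 6 is sketched below. The source does not state how the secondary extraction is performed, so the greedy longest-match search and the rule "keep model spans, add non-overlapping dictionary spans" are assumptions, and all names are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, text)


def dict_matches(text: str, dictionary: Set[str]) -> List[Span]:
    """Find dictionary terms by greedy longest match, scanning left to right."""
    max_len = max((len(term) for term in dictionary), default=0)
    spans: List[Span] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                spans.append((i, i + length, text[i:i + length]))
                i += length
                break
        else:
            i += 1  # no term starts here
    return spans


def merge_with_dict(model_spans: List[Span], text: str, dictionary: Set[str]) -> List[Span]:
    """Keep the model's spans and add dictionary spans that do not overlap them."""
    merged = list(model_spans)
    for start, end, term in dict_matches(text, dictionary):
        if all(end <= s or start >= e for s, e, _ in model_spans):
            merged.append((start, end, term))
    return sorted(merged)
```

Giving priority to the model's spans is a design choice for the sketch; a system could equally let dictionary matches override the model where they conflict.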

The beneficial effect of the invention is as follows: the ALBert-based complex heavy equipment entity extraction method labels the characters of domain-related information in existing texts and web pages to form the corpus, uses the fine-tuned ALBert for word embedding, and trains the deep learning algorithm BGRU-Attention-CRF to obtain the entity extraction model. To improve entity extraction accuracy and to account for the specialized terms of the complex heavy equipment industry, a domain dictionary is added. When a new corpus is input, the trained model identifies the entities in it, and the final entity extraction result is given in combination with the dictionary.

Drawings

FIG. 1 is a general flowchart of the ALBert-based complex heavy equipment entity extraction method of the present invention;

FIG. 2 is a flowchart of the deep learning algorithm ALBert-BGRU-Attention-CRF used to build the complex heavy equipment entity extraction model in the ALBert-based complex heavy equipment entity extraction method.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The invention discloses an ALBert-based complex heavy equipment entity extraction method, whose flowchart is shown in FIG. 1. On the basis of data collection and processing, an entity extraction model is trained with the deep-learning-based ALBert-BGRU-Attention-CRF algorithm; after initial entity extraction is performed on the corpus to be extracted, the final extraction result is obtained in combination with a dictionary (Dict). The method is implemented according to the following steps:

Step 3, a web page system for manual labeling and automatic format adjustment is developed. Entity labeling is completed manually using the developed data labeling web page.

The pseudo-code of the entity labeling and text format adjustment algorithm is as follows:

Input: text data to be labeled;

Output: labeled data;

1. text preprocessing:

1.1. remove line feeds and blank spaces from the text, and display the formatted text;

1.2. create a label array, and initialize the label of every character in the text to O;

2. entity labeling:

2.1. click a tag type and select the corresponding entity in the text; the labels of the characters of the selected entity are set to that tag type;

2.2. if full-text labeling is enabled, search the full text and set the labels of all entities with the same name to the selected tag type;

3. generate the labeled data in the standard format: output the text character by character, adding the label corresponding to the character and a line feed after each character;

Return: labeled data in the standard format.
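The pseudo-code above can be approximated by the following Python sketch. It is simplified to a single selected entity per call, with the interactive click replaced by function arguments; the function name `label_text` and the "character, space, label" line format are assumptions.

```python
from typing import List


def label_text(text: str, entity: str, tag_type: str, full_text: bool = True) -> List[str]:
    """Return one 'character label' line per character of the cleaned text."""
    # 1.1 remove line feeds and blank spaces
    chars = [ch for ch in text if not ch.isspace()]
    clean = "".join(chars)
    # 1.2 initialise every label to O
    labels = ["O"] * len(chars)
    # 2.1 / 2.2 label the first occurrence, or every occurrence if full_text
    start = clean.find(entity) if entity else -1
    while start != -1:
        labels[start] = f"B-{tag_type}"
        for i in range(start + 1, start + len(entity)):
            labels[i] = f"I-{tag_type}"
        if not full_text:
            break
        start = clean.find(entity, start + len(entity))
    # 3. output each character followed by its label, one per line
    return [f"{ch} {lab}" for ch, lab in zip(chars, labels)]
```

Joining the returned list with line feeds gives the character-per-line file that sequence labeling code usually reads.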

Entities are labeled in the BIO scheme: the first character of an entity is labeled B-Type, the non-first characters of an entity are labeled I-Type, and non-entity characters and punctuation marks are all labeled O, where Type represents the entity type.

For example, given the corpus sentence "The metal extrusion press is the most important equipment for realizing metal extrusion processing." (labeling is performed on the characters of the Chinese text), the entities are labeled as follows: the five characters of "metal extrusion press" (金属挤压机) are labeled B-Product, I-Product, I-Product, I-Product, I-Product; the six characters of "metal extrusion processing" (金属挤压加工) are labeled B-Way, I-Way, I-Way, I-Way, I-Way, I-Way; all remaining characters and the final full stop are labeled O.

Here O marks non-entity information; B-Product marks the first character of an entity whose type is Product, and I-Product marks the non-first characters of such an entity; B-Way marks the first character of an entity whose type is processing mode, and I-Way marks the non-first characters of such an entity.
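To show how BIO labels map back to entities, the following sketch groups labeled characters into (entity text, entity type) pairs. The function name `bio_to_entities` is hypothetical, and an I- label that does not continue an open entity of the same type is simply treated as O, which is one of several possible conventions.

```python
from typing import List, Optional, Tuple


def bio_to_entities(chars: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- labeled characters into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for ch, label in zip(chars, labels):
        if label.startswith("B-"):
            # A new entity starts; close any entity that was open.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [ch], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            current.append(ch)
        else:
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities
```

Applied to the labels of the example sentence, this yields one Product entity and one Way entity.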

Step 4, training the model, namely feeding the labeled data into the ALBert-BGRU-Attention-CRF algorithm to obtain a trained model; the flowchart is shown in FIG. 2.

The model training in step 4 is specifically as follows:

step 4.2, inputting the word vectors generated in the step 4.1 into a bidirectional gated recurrent unit (BGRU) to obtain the score of each word on all the labels;

step 4.4, using a conditional random field (CRF) to constrain the tag sequence, reducing the probability of invalid sequences;

step 4.5, obtaining the trained entity extraction model.
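The effect of constraining the tag sequence can be illustrated with a small Viterbi decoder. This is not a trained CRF: the learned transition scores are replaced here by a hard BIO rule (an I-X tag may only follow B-X or I-X, and may not start a sentence), and the function names are hypothetical.

```python
from typing import Dict, List

NEG_INF = float("-inf")


def allowed(prev: str, cur: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X."""
    if cur.startswith("I-"):
        return prev in (f"B-{cur[2:]}", f"I-{cur[2:]}")
    return True


def constrained_viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the best-scoring tag sequence that respects the BIO rule.

    emissions[i][tag] is the score of character i on that tag.
    """
    tags = list(emissions[0])
    # A sequence may not start with an I- tag.
    best = {t: (NEG_INF if t.startswith("I-") else emissions[0][t]) for t in tags}
    back: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            score, prev = max((best[p], p) for p in tags if allowed(p, cur))
            new_best[cur] = score + scores[cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(best, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Taking the highest-scoring tag independently for each character could produce a sequence such as I-Product, I-Product; the constrained decoder returns B-Product, I-Product for the same scores.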

The pseudo-code for training the entity extraction model is as follows:

Input: training set and verification set;

Output: entity extraction model;

1. import the training set and the verification set;

2. import the fine-tuned ALBert model;

3. feed the word vectors into BGRU-Attention-CRF;

4. specify the model parameters;

5. input the training set and the verification set and start training;

Return: the entity extraction model.

Step 5, creating a dictionary Dict;

the step 5 is as follows:

The step 6 is as follows:

Claims

1. The ALBert-based complex heavy equipment entity extraction method is characterized by comprising the following steps:

step 3, labeling the entity names in the corpus obtained in the step 1, and converting the text into the format read by the algorithm to obtain a training set and a verification set;

step 5, creating a dictionary Dict;

2. The ALBert-based complex heavy equipment entity extraction method as claimed in claim 1, wherein in the step 1, the web crawler framework Scrapy is used to capture information about complex heavy equipment from web pages and store it as text files, and the stored text is combined with manually collected existing documents from the complex heavy equipment field to serve as the data source; the data source is then processed to remove special symbols, formulas and measurement units; the processed data is stored as text files and serves as the corpus.

3. The method as claimed in claim 2, wherein in the step 2, the ALBert model takes single Chinese characters as input; a start mark [CLS] is added before the first character of each sentence and an end mark [SEP] is added at the end of each sentence; the ALBert output is, for each input character, a representation vector that fuses the semantic information of the text; on the basis of the pre-trained ALBert model, the parameters of the subsequent connected layers are fine-tuned on the corpus from the data source, while the internal parameters of ALBert do not participate in training, so as to obtain the fine-tuned ALBert model.

4. The ALBert-based complex heavy equipment entity extraction method according to claim 3, wherein in the step 3, entity labeling is completed manually and entities are labeled in the BIO scheme: the first character of an entity is labeled B-Type, the non-first characters of an entity are labeled I-Type, and non-entity characters and punctuation marks are labeled O, where Type represents the entity type.

5. The ALBert-based complex heavy equipment entity extraction method according to claim 4, wherein the training model in the step 4 is specifically as follows:

step 4.3, weighting the result of the step 4.2 by using an Attention mechanism to obtain the weighted score of each word on all the labels;

step 4.5, obtaining the trained entity extraction model.

6. The ALBert-based complex heavy equipment entity extraction method according to claim 5, wherein the step 5 is as follows:

7. The ALBert-based complex heavy equipment entity extraction method according to claim 6, wherein the step 6 is as follows:

step 6.1, for a large volume of text to be extracted, all the texts are fed into the entity extraction model trained in the step 4 to obtain a preliminary recognition result, and the dictionary Dict constructed in the step 5 is then applied on this basis for a secondary extraction to obtain the final entity extraction result;