CN115292490A - Analysis algorithm for policy interpretation semantics - Google Patents

Analysis algorithm for policy interpretation semantics

Info

Publication number
CN115292490A
Authority
CN
China
Prior art keywords
model, word, analysis, policy, analysis algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210921753.0A
Other languages
Chinese (zh)
Inventor
黄明明
施东晓
廖晓洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Kelifang Technology Co ltd
Original Assignee
Fujian Kelifang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Kelifang Technology Co ltd
Priority to CN202210921753.0A
Publication of CN115292490A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G06F40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an analysis algorithm for policy interpretation semantics. The algorithm comprises an analysis model, and the analysis model comprises a BERT language model, a TCN temporal sequence model, and a CRF probability model. The analysis algorithm comprises the following steps: 1. input the policy document to be recognized into the analysis model; 2. the BERT language model converts the document input in step 1 into word vectors containing context information; 3. the TCN temporal sequence model classifies the word vectors obtained in step 2; 4. the CRF probability model adjusts the sentence order of the word vectors classified in step 3; 5. clean the results output by the model using regular-expression matching; 6. extract and display the recognized entities, completing the analysis of the policy interpretation semantics. The method can analyze and study policy documents using named entity recognition technology, automatically recognize and classify valuable information in policies, solve the cleaning and warehousing of recognition results, and flag and record incorrectly recognized fields.

Description

Analysis algorithm for policy interpretation semantics
Technical Field
The invention belongs to the technical field of computers, and particularly relates to an analysis algorithm for policy interpretation semantics.
Background
In various regions, a large number of policy documents for supporting and administering enterprises have been issued. These documents contain much information that is very important to enterprises, such as subsidy conditions, loan policies, and project declaration requirements, and with the development of society and technology, more and more enterprises need to read policy documents. Since policy documents are mostly semi-structured or unstructured, their analysis, processing, and data mining are severely constrained.
In recent years, deep learning has made significant progress in NLP, image recognition, and other fields, and many researchers have applied it to named entity recognition. Deep-learning-based named entity recognition requires converting text into serialized vectors through a word embedding method. Existing word embedding methods such as Word2Vec cannot handle the polysemy of Chinese characters: the same word may mean "disease" in one context and serve as an adjective meaning "fast" in another. To address this problem, many scholars have proposed context-dependent word embedding methods, such as ELMo (Embeddings from Language Models) and OpenAI GPT (Generative Pre-Training). However, the language representations of these context-aware word embedding methods are unidirectional and cannot capture the preceding and following semantics simultaneously.
The present invention therefore analyzes and studies policy documents using named entity recognition technology, automatically recognizes and classifies valuable information in policies, solves the cleaning and warehousing of recognition results, and flags and records incorrectly recognized fields.
Disclosure of Invention
The invention discloses an analysis algorithm for policy interpretation semantics, the main aim of which is to overcome the defects and shortcomings of the prior art.
The technical scheme adopted by the invention is as follows:
An analysis algorithm for policy interpretation semantics comprises an analysis model, the analysis model comprising a BERT language model, a TCN temporal sequence model, and a CRF probability model. The analysis algorithm comprises the following specific analysis steps:
Step one: input the policy document to be recognized into the analysis model;
Step two: the BERT language model converts the document input in step one into word vectors containing context information;
Step three: the TCN temporal sequence model classifies the word vectors obtained in step two;
Step four: the CRF probability model adjusts the sentence order of the word vectors classified in step three;
Step five: clean the results output by the model using regular-expression matching;
Step six: extract and display the recognized entities, completing the analysis of the policy interpretation semantics.
Further, the conversion process in step two comprises:
(1) Label the data using the BIOES labeling method, where B marks the beginning of an entity, I marks a position inside an entity, O marks irrelevant content, E marks the end of an entity, and S marks an entity consisting of a single character;
(2) Train the BERT language model with the data labeled in step (1). The training process is as follows: the labeled data first passes through the BERT network, which converts the input data into embedded word vectors containing contextual semantics.
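For illustration, a minimal Python sketch of BIOES-labeled training data; the sentence and the entity types (SUBSIDY, REGION) are hypothetical examples, not labels defined by the invention:

    # One (character, tag) pair per token; "五十万元" is tagged as a SUBSIDY amount.
    sentence = ["补", "贴", "金", "额", "为", "五", "十", "万", "元"]
    tags = ["O", "O", "O", "O", "O", "B-SUBSIDY", "I-SUBSIDY", "I-SUBSIDY", "E-SUBSIDY"]

    # S- marks a single-character entity, e.g. a one-character region abbreviation.
    single = [("闽", "S-REGION")]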
Furthermore, the overall framework of the BERT network in step (2) is formed by stacking multiple layers of Transformer encoders. Each encoder layer consists of one multi-head attention layer and one feed-forward layer, and each attention head re-encodes the target word according to its relevance to all words in the sentence, yielding a new encoding for each word.
Further, the attention computation comprises the following three steps:
Step one: compute the relevance between words. The input sequence vectors (512 × 768) are linearly transformed by three weight matrices to generate three new sequence vectors, query, key, and value; the query vector of each word is multiplied with the key vectors of all words in the sequence to obtain the relevance between words;
Step two: normalize the relevance. The relevance obtained in step one is normalized by softmax;
Step three: compute a weighted sum over the encodings of all words. The normalized weights from step two are used in a weighted sum with the value vectors, yielding a new encoding for each word.
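For illustration, a minimal NumPy sketch of this three-step computation for a single attention head; the 1/sqrt(d) scaling is the standard Transformer convention and is assumed here, as are the randomly initialized weight matrices:

    import numpy as np

    def attention(X, Wq, Wk, Wv):
        # Step one: linear transforms produce query, key, value sequences,
        # then each word's query is matched against all keys for relevance.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        # Step two: softmax-normalize the relevance scores.
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        # Step three: weighted sum over the value vectors gives new encodings.
        return weights @ V

    rng = np.random.default_rng(0)
    X = rng.normal(size=(512, 768))                        # input sequence vectors
    Wq, Wk, Wv = (rng.normal(size=(768, 64)) * 0.02 for _ in range(3))
    new_encodings = attention(X, Wq, Wk, Wv)               # shape (512, 64)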
Still further, the BERT network comprises 24 Transformer encoder layers, each with 16 attention heads.
Further, the entities recognized in step six are entities belonging to class I.
Furthermore, the specific process by which the TCN temporal sequence model classifies word vectors in step three comprises:
(1) First, input the word vectors produced in step two into the TCN network;
(2) Classify the word vectors from step (1) using the TCN temporal convolutional network.
Furthermore, the sentence order in step four is adjusted as follows: input the classified word vectors into the CRF conditional random field, which then smooths and adjusts them to satisfy the sentence-order requirements, completing the sentence-order adjustment.
Furthermore, the CRF conditional random field is a discriminative probability distribution model: a Markov random field over a set of output random variables Y conditioned on a set of input random variables X.
As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:
according to the invention, the BERT-TCN-CRF model is used for realizing named entity recognition of the policy document, the BERT network pre-training model is used, and the static word vector generated by the traditional method is replaced by the dynamic word vector obtained by training in the large-scale corpus, so that the problem of word ambiguity existing in the traditional word embedding method is effectively solved, and the semantic representation is more accurate. The F1 value of the BERT network pre-training model in the policy document corpus marked by the BERT network pre-training model reaches 94.72%, and compared with other models, the BERT network pre-training model has a better recognition effect, can well complete the task of recognizing the policy document named entities, and can meet the requirements of enterprises on the aspect of recognizing the policy text named entities. Meanwhile, the invention provides complete data cleaning and warehousing work, and can clean the identification result with finer granularity.
A TCN network is used: traditional named entity recognition models usually adopt an LSTM, but experiments show that the TCN network retains a longer effective memory, and its performance in the recognition model exceeds that of the LSTM model.
A CRF conditional random field is used, considering that in the sequence labeling task adjacent words or labels must follow certain rules, for example an I label must be preceded by a B label and cannot follow an O label. The CRF model properly accounts for the dependencies between labels and models the tag sequence to obtain the optimal sequence.
Meanwhile, the invention also solves the cleaning and warehousing of recognition results. For example, for a growth rate recognized by the model, regular-expression matching determines whether it is the growth rate over the previous year or over the previous two years, the judgment result is stored in the corresponding database field, and incorrectly recognized fields are flagged and recorded.
Drawings
Fig. 1 is a schematic diagram of the architecture of the BERT network of the present invention.
Fig. 2 is a schematic diagram of the TCN network of the present invention.
Fig. 3 is a schematic diagram of the structure of the CRF conditional random field of the present invention.
Detailed Description
Embodiments of the present invention will be further described with reference to the accompanying drawings.
As shown in Fig. 1, Fig. 2, and Fig. 3, an analysis algorithm for policy interpretation semantics comprises an analysis model, the analysis model comprising a BERT language model, a TCN temporal sequence model, and a CRF probability model. The analysis algorithm comprises the following specific analysis steps:
Step one: input the policy document to be recognized into the analysis model;
Step two: the BERT language model converts the document input in step one into word vectors containing context information;
Step three: the TCN temporal sequence model classifies the word vectors obtained in step two;
Step four: the CRF probability model adjusts the sentence order of the word vectors classified in step three;
Step five: clean the results output by the model using regular-expression matching;
Step six: extract and display the recognized entities, completing the analysis of the policy interpretation semantics.
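Before each component is described in detail, the following is a minimal end-to-end sketch of the BERT-TCN-CRF analysis model in PyTorch. It is an illustrative reconstruction rather than the patented implementation: the HuggingFace transformers and pytorch-crf packages are assumed, bert-base-chinese stands in for the 24-layer network described below, and the TCN is reduced to a single causal convolution layer.

    import torch
    import torch.nn as nn
    from transformers import BertModel   # assumed: HuggingFace transformers
    from torchcrf import CRF             # assumed: pytorch-crf package

    class BertTcnCrf(nn.Module):
        def __init__(self, num_tags, hidden=768, channels=256):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-chinese")
            # Stand-in TCN: one causal conv layer (the full model stacks residual blocks).
            self.pad = nn.ConstantPad1d((2, 0), 0.0)    # left-pad for causality (k=3)
            self.tcn = nn.Conv1d(hidden, channels, kernel_size=3)
            self.proj = nn.Linear(channels, num_tags)   # per-token tag scores
            self.crf = CRF(num_tags, batch_first=True)  # enforces tag-order rules

        def forward(self, input_ids, attention_mask, tags=None):
            h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
            h = self.tcn(self.pad(h.transpose(1, 2))).transpose(1, 2)
            emissions = self.proj(torch.relu(h))
            mask = attention_mask.bool()
            if tags is not None:                           # training: NLL loss
                return -self.crf(emissions, tags, mask=mask)
            return self.crf.decode(emissions, mask=mask)   # inference: best tag path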
Further, the conversion process in step two comprises:
(1) Label the data using the BIOES labeling method, where B marks the beginning of an entity, I marks a position inside an entity, O marks irrelevant content, E marks the end of an entity, and S marks an entity consisting of a single character;
(2) Train the BERT language model with the data labeled in step (1). The training process is as follows: the labeled data first passes through the BERT network, which converts the input data into embedded word vectors containing contextual semantics.
Furthermore, the overall framework of the BERT network in step (2) is formed by stacking multiple layers of Transformer encoders. Each encoder layer consists of one multi-head attention layer and one feed-forward layer, and each attention head re-encodes the target word according to its relevance to all words in the sentence, yielding a new encoding for each word.
Further, the attention computation comprises the following three steps:
Step one: compute the relevance between words. The input sequence vectors (512 × 768) are linearly transformed by three weight matrices to generate three new sequence vectors, query, key, and value; the query vector of each word is multiplied with the key vectors of all words in the sequence to obtain the relevance between words;
Step two: normalize the relevance. The relevance obtained in step one is normalized by softmax;
Step three: compute a weighted sum over the encodings of all words. The normalized weights from step two are used in a weighted sum with the value vectors, yielding a new encoding for each word.
Still further, the BERT network comprises 24 Transformer encoder layers, each with 16 attention heads.
Further, the entities recognized in step six are entities belonging to class I.
Furthermore, the specific process by which the TCN temporal sequence model classifies word vectors in step three comprises:
(1) First, input the word vectors produced in step two into the TCN network;
(2) Classify the word vectors from step (1) using the TCN temporal convolutional network.
Furthermore, the sentence order in step four is adjusted as follows: input the classified word vectors into the CRF conditional random field, which then smooths and adjusts them to satisfy the sentence-order requirements, completing the sentence-order adjustment.
Furthermore, the CRF conditional random field is a discriminative probability distribution model: a Markov random field over a set of output random variables Y conditioned on a set of input random variables X.
The following is a detailed description of each model of the present embodiment:
1. high quality data sets labeled using BIOS labeling
In this embodiment, a BIOS text labeling method is used to perform entity labeling on a large number of policy documents for training and testing of models.
2. BERT network
The first part of the model in this embodiment uses the BERT network for word embedding. BERT (Bidirectional Encoder Representations from Transformers) is a pre-training model whose two training tasks are predicting words masked out of sentences and judging whether two input sentences are adjacent. For a specific task, a corresponding network is added after the pre-trained BERT model to complete NLP downstream tasks such as text classification and machine translation.
Although BERT is based on the Transformer, it uses only the Transformer's encoder part, and its overall framework is formed by stacking multiple layers of Transformer encoders. Each encoder layer consists of one multi-head attention layer and one feed-forward layer; our model uses the larger BERT network with 24 layers and 16 attention heads per layer. The main role of each attention head is to re-encode the target word according to its relevance to all words in the sentence. The computation of each attention head therefore includes three steps: compute the relevance between words, normalize the relevance, and take a weighted sum over the encodings of all words to obtain the encoding of the target word. To compute the relevance between words, the input sequence vectors (512 × 768) are first linearly transformed by three weight matrices to generate three new sequence vectors, query, key, and value; the query vector of each word is multiplied with the key vectors of all words in the sequence to obtain the relevance between words; the relevance is then normalized by softmax, and the normalized weights are used in a weighted sum with the value vectors to obtain a new encoding for each word.
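As a sketch of this word-embedding stage, contextual vectors can be obtained from a pretrained BERT via the HuggingFace transformers library; bert-base-chinese is an assumed stand-in (12 layers of 12 heads, rather than the 24-layer, 16-head network described above):

    import torch
    from transformers import BertTokenizerFast, BertModel

    tok = BertTokenizerFast.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")

    text = "企业申报项目可获得补贴"        # hypothetical policy sentence
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)

    # One contextual vector per character: the same character receives a different
    # vector in a different sentence, unlike static Word2Vec embeddings.
    embeddings = out.last_hidden_state     # shape (1, sequence_length, 768)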
3. TCN network
The TCN, short for Temporal Convolutional Network, is a model proposed in 2018 for processing time-series data. Convolutional networks have proven good at extracting high-level features from structured data. The temporal convolutional network is a neural network model that uses causal convolution and dilated convolution; it respects the temporal ordering of sequence data and provides a large receptive field for sequence modeling.
(1) Causal convolution
Causal convolution means that the value at time t in a given layer depends only on the values at time t and earlier in the layer below. Unlike a conventional convolutional neural network, causal convolution cannot see future data: it is a unidirectional rather than bidirectional structure. In other words, causes precede effects; it is a model with strict temporal constraints, hence the name causal convolution.
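A minimal PyTorch sketch of this constraint: left-padding a 1-D convolution by (k - 1) * d zeros so that the output at time t depends only on inputs at or before t:

    import torch
    import torch.nn as nn

    k, d = 3, 1                                # kernel size, dilation
    conv = nn.Conv1d(1, 1, kernel_size=k, dilation=d)
    x = torch.randn(1, 1, 10)                  # (batch, channels, time)

    # Pad zeros on the left only, so position t never sees t+1, t+2, ...
    y = conv(nn.functional.pad(x, ((k - 1) * d, 0)))
    assert y.shape[-1] == x.shape[-1]          # same length, strictly causal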
(2) Dilated convolution
Unlike conventional convolution, dilated convolution samples the input at intervals during convolution, controlled by the sampling-rate parameter d. At the bottom layer d = 1, meaning every point is sampled; at a middle layer d = 2, meaning every second point is sampled as input. In general, the higher the layer, the larger the value of d. Dilated convolution thus makes the effective window size grow exponentially with the number of layers, so the convolutional network obtains a large receptive field with fewer layers.
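As a quick check of this exponential growth, the receptive field of a stack of such layers can be computed directly; the convention d = 2^level is an assumption (a common TCN choice), not fixed by the text:

    # Receptive field of stacked dilated causal convolutions with kernel size k.
    def receptive_field(k: int, levels: int) -> int:
        return 1 + sum((k - 1) * 2 ** level for level in range(levels))

    print(receptive_field(3, 1))   # 3
    print(receptive_field(3, 4))   # 31 -- grows exponentially with depth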
(3) Residual connection
This scheme uses residual connections in the TCN network, constructing residual blocks to replace plain convolutional layers. A residual block contains two convolutional layers and nonlinear mappings, with weight normalization and Dropout added to each layer to regularize the network.
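A sketch of one such residual block under the description above; the channel sizes, kernel size, and the 1x1 convolution on the skip path are conventional TCN choices assumed for illustration:

    import torch.nn as nn
    from torch.nn.utils import weight_norm

    class ResidualBlock(nn.Module):
        def __init__(self, c_in, c_out, k=3, d=1, p_drop=0.1):
            super().__init__()
            pad = (k - 1) * d                  # causal left padding
            self.net = nn.Sequential(
                nn.ConstantPad1d((pad, 0), 0.0),
                weight_norm(nn.Conv1d(c_in, c_out, k, dilation=d)),
                nn.ReLU(), nn.Dropout(p_drop),
                nn.ConstantPad1d((pad, 0), 0.0),
                weight_norm(nn.Conv1d(c_out, c_out, k, dilation=d)),
                nn.ReLU(), nn.Dropout(p_drop),
            )
            # 1x1 conv so the skip connection matches the output channel count.
            self.skip = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

        def forward(self, x):                  # x: (batch, channels, time)
            return nn.functional.relu(self.net(x) + self.skip(x))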
4. CRF conditional random field
The conditional random field is a discriminative probability model, a type of random field commonly used to label or analyze sequence data such as natural language text or biological sequences. A conditional random field is a conditional probability distribution model P(Y | X) representing a Markov random field over a set of output random variables Y given a set of input random variables X; that is, the CRF is characterized by the assumption that the output random variables form a Markov random field. Conditional random fields can be viewed as a generalization of the maximum-entropy Markov model to the labeling problem.
Like a Markov random field, a conditional random field is an undirected graphical model in which the distribution of the random variables Y is a conditional probability given the observed random variables X. In principle the graph layout of a conditional random field can be arbitrary; a common layout is the chain structure, which admits more efficient algorithms for training, inference, and decoding. The conditional random field is a typical discriminative model, and its joint probability can be written as a product of several potential functions, the most common form being the linear-chain conditional random field. The model of the invention uses the conditional random field to ensure that the classification data output by the model follows the same ordering rules as the BIOES labeling method.
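A sketch of how the chain structure enforces label-order rules at decoding time: Viterbi search over per-token emission scores and pairwise transition scores, where an illegal move such as O to I carries a strongly negative score (toy scores, NumPy assumed):

    import numpy as np

    def viterbi(emissions, transitions):
        # emissions: (T, K) per-token tag scores; transitions: (K, K) pairwise scores.
        T, K = emissions.shape
        score = emissions[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            total = score[:, None] + transitions + emissions[t]   # (K, K)
            back[t] = total.argmax(axis=0)
            score = total.max(axis=0)
        path = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]                      # best-scoring tag sequence

    # Toy tags [B, I, O]: with the -1e4 score, O -> I is effectively never chosen.
    trans = np.array([[0.0, 1.0, 0.0],         # from B
                      [0.0, 0.5, 0.0],         # from I
                      [0.5, -1e4, 0.0]])       # from O
    emis = np.random.default_rng(1).normal(size=(6, 3))
    print(viterbi(emis, trans))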
5. Regular-expression matching
The invention uses regular-expression matching to clean the model's output. For example, when an entity is judged to be of the growth-rate category, the system locates the paragraph containing the sentence and then extracts the specific period by regular matching, such as the growth rate over the previous year or over the previous two years.
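A sketch of this cleaning step; the patterns, field names, and example sentence are hypothetical, for illustration only:

    import re

    def clean_growth_rate(sentence):
        # Extract the percentage value; if absent, flag the span as a
        # recognition error to be recorded.
        value = re.search(r"(\d+(?:\.\d+)?)\s*%", sentence)
        if value is None:
            return None
        # Decide which database field the entity belongs to.
        if re.search(r"近两年|前两年", sentence):        # "previous two years"
            return ("growth_rate_two_years", value.group(1))
        if re.search(r"上一年|去年", sentence):          # "previous year"
            return ("growth_rate_last_year", value.group(1))
        return None

    print(clean_growth_rate("企业近两年营收增长率不低于20%"))
    # -> ('growth_rate_two_years', '20')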
As can be seen from the above description of the present invention, compared with the prior art, the advantages of the present invention are:
the method realizes named entity recognition of the policy document by using the BERT-TCN-CRF model, uses the BERT network pre-training model, and uses the dynamic word vectors obtained by training in the large-scale corpus to replace the static word vectors generated by the traditional method, thereby effectively solving the problem of word ambiguity existing in the traditional word embedding method and ensuring that the semantic representation is more accurate. The F1 value of the BERT network pre-training model in the policy document corpus marked by the BERT network pre-training model reaches 94.72%, and compared with other models, the BERT network pre-training model has a better recognition effect, can well complete the task of recognizing the policy document named entities, and can meet the requirements of enterprises on the aspect of recognizing the policy text named entities. Meanwhile, the invention provides complete data cleaning and warehousing work, and can clean the identification result with finer granularity.
A TCN network is used: traditional named entity recognition models usually adopt an LSTM, but experiments show that the TCN network retains a longer effective memory, and its performance in the recognition model exceeds that of the LSTM model.
A CRF conditional random field is used, considering that in the sequence labeling task adjacent words or labels must follow certain rules, for example an I label must be preceded by a B label and cannot follow an O label. The CRF model properly accounts for the dependencies between labels and models the tag sequence to obtain the optimal sequence.
Meanwhile, the invention also solves the cleaning and warehousing of recognition results. For example, for a growth rate recognized by the model, regular-expression matching determines whether it is the growth rate over the previous year or over the previous two years, the judgment result is stored in the corresponding database field, and incorrectly recognized fields are flagged and recorded.
The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification of the present invention made using this concept shall fall within the scope of protection of the present invention.

Claims (9)

1. An analysis algorithm for policy interpretation semantics, characterized in that: it comprises an analysis model, the analysis model comprising a BERT language model, a TCN temporal sequence model, and a CRF probability model, and the analysis algorithm comprises the following specific analysis steps:
Step one: input the policy document to be recognized into the analysis model;
Step two: the BERT language model converts the document input in step one into word vectors containing context information;
Step three: the TCN temporal sequence model classifies the word vectors obtained in step two;
Step four: the CRF probability model adjusts the sentence order of the word vectors classified in step three;
Step five: clean the results output by the model using regular-expression matching;
Step six: extract and display the recognized entities, completing the analysis of the policy interpretation semantics.
2. The analysis algorithm for policy interpretation semantics according to claim 1, wherein the conversion process in step two comprises:
(1) Label the data using the BIOES labeling method, where B marks the beginning of an entity, I marks a position inside an entity, O marks irrelevant content, E marks the end of an entity, and S marks an entity consisting of a single character;
(2) Train the BERT language model with the data labeled in step (1). The training process is as follows: the labeled data first passes through the BERT network, which converts the input data into embedded word vectors containing contextual semantics.
3. The analysis algorithm for policy interpretation semantics according to claim 2, wherein the overall framework of the BERT network in step (2) is formed by stacking multiple layers of Transformer encoders; each encoder layer consists of one multi-head attention layer and one feed-forward layer, and each attention head re-encodes the target word according to its relevance to all words in the sentence, yielding a new encoding for each word.
4. The analysis algorithm for policy interpretation semantics according to claim 3, wherein the attention computation comprises the following three steps:
Step one: compute the relevance between words. The input sequence vectors (512 × 768) are linearly transformed by three weight matrices to generate three new sequence vectors, query, key, and value; the query vector of each word is multiplied with the key vectors of all words in the sequence to obtain the relevance between words;
Step two: normalize the relevance. The relevance obtained in step one is normalized by softmax;
Step three: compute a weighted sum over the encodings of all words. The normalized weights from step two are used in a weighted sum with the value vectors, yielding a new encoding for each word.
5. The analysis algorithm for policy interpretation semantics according to claim 3, wherein the BERT network comprises 24 Transformer encoder layers, each with 16 attention heads.
6. The analysis algorithm for policy interpretation semantics according to claim 2, wherein the entities recognized in step six are entities belonging to class I.
7. The analysis algorithm for policy interpretation semantics according to claim 1, wherein the specific process by which the TCN temporal sequence model classifies word vectors in step three comprises:
(1) First, input the word vectors produced in step two into the TCN network;
(2) Classify the word vectors from step (1) using the TCN temporal convolutional network.
8. The analysis algorithm for policy interpretation semantics according to claim 1, wherein the sentence order in step four is adjusted as follows: input the classified word vectors into the CRF conditional random field, which then smooths and adjusts them to satisfy the sentence-order requirements, completing the sentence-order adjustment.
9. The analysis algorithm for policy interpretation semantics according to claim 5, wherein the CRF conditional random field is a discriminative probability distribution model: a Markov random field over a set of output random variables Y conditioned on a set of input random variables X.
CN202210921753.0A 2022-08-02 2022-08-02 Analysis algorithm for policy interpretation semantics Pending CN115292490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210921753.0A CN115292490A (en) 2022-08-02 2022-08-02 Analysis algorithm for policy interpretation semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210921753.0A CN115292490A (en) 2022-08-02 2022-08-02 Analysis algorithm for policy interpretation semantics

Publications (1)

Publication Number Publication Date
CN115292490A true CN115292490A (en) 2022-11-04

Family

ID=83827068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210921753.0A Pending CN115292490A (en) 2022-08-02 2022-08-02 Analysis algorithm for policy interpretation semantics

Country Status (1)

Country Link
CN (1) CN115292490A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117077682A (en) * 2023-05-06 2023-11-17 西安公路研究院南京院 Document analysis method and system based on semantic recognition
CN117077682B (en) * 2023-05-06 2024-06-07 西安公路研究院南京院 Document analysis method and system based on semantic recognition
CN116562265A (en) * 2023-07-04 2023-08-08 南京航空航天大学 Information intelligent analysis method, system and storage medium
CN116562265B (en) * 2023-07-04 2023-12-01 南京航空航天大学 Information intelligent analysis method, system and storage medium


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination