CN113297845B - Resume block classification method based on multi-level bidirectional recurrent neural network - Google Patents

Resume block classification method based on multi-level bidirectional recurrent neural network

Info

Publication number
CN113297845B
CN113297845B (application CN202110685320.5A)
Authority
CN
China
Prior art keywords
resume
text
block
model
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110685320.5A
Other languages
Chinese (zh)
Other versions
CN113297845A (en)
Inventor
许启强
张吉
李嘉木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202110685320.5A
Publication of CN113297845A
Application granted
Publication of CN113297845B
Legal status: Expired - Fee Related (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a resume block classification method based on a multi-level bidirectional recurrent neural network, which comprises the following steps: 1. resume segmentation: acquire training data for the RS model, convert it into vector representations, perform forward calculation through the Bi-LSTM layer and the linear layer, update the model parameters by back-propagation, and use the resulting resume segmentation model to predict a segmented sequence of resume blocks; 2. resume block classification: take the segmented resume block sequence as training data for the RC model, obtain resume block feature vectors with the Bi-LSTM and max pooling layers, perform forward calculation with the Bi-GRU and softmax layers to obtain the block category probability distributions, update the RC model parameters with a gradient descent algorithm, and use the resulting resume block classification model for prediction. The method improves the accuracy of resume segmentation and resume block classification, and addresses the low accuracy of schemes based on keyword and format feature matching as well as the heavy workload of building keyword libraries.

Description

Resume block classification method based on multi-level bidirectional recurrent neural network
Technical Field
The invention belongs to the sub-fields of text segmentation, text classification and information extraction within natural language processing, a direction of computer science and technology, and in particular relates to a resume block classification method based on a multi-level bidirectional recurrent neural network.
Background
Resume information extraction is a technique that uses a computer program to extract the content of semi-structured resume documents into structured content. Through resume information extraction, structured resume information can be obtained and stored in a structured form, which in turn allows subsequent automatic analysis tools to perform meaningful analysis on the structured resume data, such as automatic job recommendation, resume screening, resume querying and candidate recommendation.
The resume information extraction process comprises the following steps: resume segmentation, resume block classification and block information extraction. When a resume document is input, the text content of the resume is first divided according to certain boundary features to obtain a list of resume blocks; this is called resume segmentation. Each segmented resume block is then classified, and according to the classification result the block information extraction algorithm of the corresponding category is called to extract the entity information of the block; for example, for a "basic information" block the entity information includes "name", "contact information", "address" and so on. Finally the extracted entity information is stored in a database, as illustrated by the sketch below.
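For illustration only, the following is a minimal sketch of this three-stage pipeline; the function names, the category label and the stub bodies are hypothetical placeholders and are not part of the disclosed method.

```python
# Sketch of: resume segmentation -> resume block classification -> per-category
# block information extraction -> structured storage. All names are assumptions.
from typing import Callable, Dict, List

def segment_resume(lines: List[str]) -> List[List[str]]:
    """Stand-in for the RS model: group lines into resume blocks."""
    return [lines]                                  # trivial stub: one block

def classify_blocks(blocks: List[List[str]]) -> List[str]:
    """Stand-in for the RC model: assign a category to each block."""
    return ["basic information"] * len(blocks)

def extract_basic_info(block: List[str]) -> Dict[str, str]:
    return {"name": block[0] if block else ""}      # toy entity extraction

EXTRACTORS: Dict[str, Callable[[List[str]], Dict[str, str]]] = {
    "basic information": extract_basic_info,
}

def extract_resume(lines: List[str]) -> List[Dict[str, str]]:
    blocks = segment_resume(lines)
    labels = classify_blocks(blocks)
    # dispatch each block to the extractor of its predicted category;
    # the resulting structured records would then be stored in a database
    return [EXTRACTORS[label](block) for block, label in zip(blocks, labels)]

print(extract_resume(["Zhang San", "Phone: 123"]))
```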
The existing resume segmentation methods mainly match block titles based on keywords and format features. The specific rule is to build a large keyword library; when the segmentation algorithm runs, it checks whether the resume text content appears in the keyword library and whether the format features of the text match the characteristics of a block title (block start mark), and if both conditions are met the position is used as a segmentation boundary. This approach has several drawbacks, and its accuracy is too low to meet actual business requirements. When the keyword library is not complete enough, not all title keywords can be detected, and because keywords take many different forms, the workload of building a complete keyword library is extremely high. Moreover, some text content that is not a boundary mark may also contain keywords, which scatters the information. In addition, because resumes are written in many different styles, some resumes have no obvious block titles, or the format features of the block titles differ little from those of the block contents; in such cases the segmentation produced by this method is far from satisfactory.
The existing resume block classification methods mainly use either traditional keyword matching or general-purpose text classification. The first, keyword-matching method identifies the category of a resume block's text content by means of the boundary marks (block titles) recognized during resume segmentation. For example, if one piece of text is matched as "basic information" and the next matched text is "education background", the content between the two keywords is directly assigned the category "basic information". This method also suffers from low accuracy: first, the error propagation caused by realizing resume segmentation through keyword matching directly lowers its effectiveness, and second, the block titles and the corresponding block contents do not always belong to the same category. The second method applies conventional, general-purpose text classification algorithms, such as the Support Vector Machine (SVM) and Random Forest (RF), to the text content. It classifies each block in isolation and does not consider the special format characteristics of resumes: the arrangement of the resume blocks in a resume document is typically ordered, e.g. the "basic information" block is usually placed first or second, and the "self-evaluation" block is usually placed last. Because this method fails to incorporate the sequence features of the resume blocks, its accuracy remains low, and it needs more samples for model training, which means a large workload.
Disclosure of Invention
The invention aims to overcome the shortcomings of the background art and provides a resume block classification method based on a multi-level bidirectional recurrent neural network, so that the format characteristics of the resume can be fully fused, the precision of resume segmentation and the accuracy of resume block classification are improved, and the problems of low accuracy and heavy workload of matching based on keywords and format features are solved.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The invention relates to a resume block classification method based on a multi-level bidirectional recurrent neural network, which is characterized by comprising the following steps:
step 1, resume segmentation:
step 1.1, obtaining data of an RS model:
step 1.1.1, acquire a training set of resume samples in Word format; parse the xml tree of each resume sample with regular expressions to obtain the text of each line and its format features, and arrange the line texts of one resume sample in their original order into a line text sequence {s_1, s_2, …, s_i, …, s_n}, where s_i denotes the i-th line of text, f(s_i) denotes the format features of s_i, and n is the number of line texts;
step 1.1.2, assign the i-th line of text s_i a corresponding true label y_i, with y_i ∈ {B, I}, where B marks the start of a block and I marks a line inside a block;
step 1.1.3, apply word segmentation to the i-th line of text s_i to obtain the i-th word sequence W_i = {w_i^1, …, w_i^j, …, w_i^{m_i}}, where w_i^j denotes the j-th word of the i-th line of text and m_i is the number of words in s_i;
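As an illustration of step 1.1.1, the following is a minimal sketch, assuming the .docx file is opened as a zip archive and its word/document.xml is scanned with regular expressions; the three format features shown (bold, size, centering) are illustrative assumptions standing in for the format features f(s_i).

```python
import re
import zipfile

def parse_docx_lines(path: str):
    with zipfile.ZipFile(path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    lines = []
    for para in re.findall(r"<w:p[ >].*?</w:p>", xml, flags=re.S):
        # concatenate the text runs of one paragraph into a line of text
        text = "".join(re.findall(r"<w:t[^>]*>(.*?)</w:t>", para, flags=re.S))
        if not text.strip():
            continue
        size_match = re.search(r'<w:sz w:val="(\d+)"', para)
        fmt = {
            "bold": 1 if "<w:b/>" in para else 0,                   # assumed feature
            "size": int(size_match.group(1)) if size_match else 0,  # assumed feature
            "centered": 1 if '<w:jc w:val="center"' in para else 0, # assumed feature
        }
        lines.append((text, fmt))     # (line text s_i, format features f(s_i))
    return lines
```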
step 1.2, text representation:
step 1.2.1, use a pre-trained Word2vec word embedding model to convert the j-th word w_i^j of the i-th line of text s_i into the corresponding numeric vector v_i^j, thereby obtaining the vector sequence V_i = {v_i^1, …, v_i^j, …, v_i^{m_i}};
step 1.2.2, compute the importance score score_i^j of the j-th word w_i^j using equation (1), thereby obtaining the i-th importance score sequence {score_i^1, …, score_i^j, …, score_i^{m_i}}:

score_i^j = tf(w_i^j) · log( C(RS) / C(w_i^j) )    (1)

In equation (1), tf(w_i^j) denotes the number of occurrences of the j-th word w_i^j in the i-th line of text s_i, C(RS) denotes the total number of resume samples in the training set, and C(w_i^j) denotes the number of resume samples containing the j-th word w_i^j;
step 1.2.3, compute the weight distribution a_i^j = softmax(score_i^j) of the j-th word w_i^j of the i-th line of text s_i, where softmax(·) is the normalization function, thereby obtaining the i-th weight distribution sequence {a_i^1, …, a_i^j, …, a_i^{m_i}};
step 1.2.4, compute the line vector representation of the i-th line of text s_i as the weighted sum E(s_i) = Σ_{j=1}^{m_i} a_i^j · v_i^j;
step 1.2.5, use a one-layer feedforward neural network of the RS model to train the format features f(s_i) of the i-th line of text s_i according to equation (2), obtaining the embedded format feature encoding E(f(s_i)) of the i-th line of text s_i:

E(f(s_i)) = W_1 · relu(W_0 · f(s_i) + b_0) + b_1    (2)

In equation (2), relu(·) is the activation function, W_0 and W_1 are two weight matrices, and b_0 and b_1 are two bias terms;
step 1.2.6, obtain the text vector x_i of the i-th line of text s_i using equation (3), thereby obtaining the text vector sequence input = {x_1, x_2, …, x_i, …, x_n}:

x_i = [E(s_i); E(f(s_i))]    (3)
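A minimal sketch of the text representation of step 1.2 follows, assuming PyTorch and pre-trained word vectors held in a dictionary; the vector dimensions, the hidden sizes and the reading of equation (1) as tf-idf weighting are assumptions.

```python
import math
import torch
import torch.nn as nn

def line_vector(words, word_vecs, doc_count, doc_freq):
    # step 1.2.1: look up the Word2vec vector v_i^j for each word
    vecs = torch.stack([word_vecs[w] for w in words])              # (m_i, d_w)
    # step 1.2.2: importance score = term frequency * log(C(RS) / C(w))
    tf = torch.tensor([words.count(w) for w in words], dtype=torch.float)
    idf = torch.tensor([math.log(doc_count / doc_freq[w]) for w in words])
    score = tf * idf
    # step 1.2.3: weight distribution via softmax over the scores
    a = torch.softmax(score, dim=0)
    # step 1.2.4: weighted sum gives the line vector representation E(s_i)
    return a @ vecs                                                 # (d_w,)

class FormatEncoder(nn.Module):
    """Step 1.2.5: one-layer feedforward encoding of the format features."""
    def __init__(self, n_feat=8, hidden=16, out=16):
        super().__init__()
        self.w0 = nn.Linear(n_feat, hidden)   # W_0, b_0
        self.w1 = nn.Linear(hidden, out)      # W_1, b_1
    def forward(self, f):
        return self.w1(torch.relu(self.w0(f)))

def text_vector(words, fmt, word_vecs, doc_count, doc_freq, fmt_enc):
    # step 1.2.6: x_i = [E(s_i); E(f(s_i))]
    return torch.cat([line_vector(words, word_vecs, doc_count, doc_freq),
                      fmt_enc(fmt)])
```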
Step 1.3, Bi-LSTM network forward calculation of the RS model:
representing a text vector by x i Inputting into Bi-LSTM network to obtain forward output from Bi-LSTM network
Figure GDA00036435803700000310
And reverse output
Figure GDA00036435803700000311
Spliced output
Figure GDA00036435803700000312
Thereby obtaining an output sequence o 1 ,o 2 ,…,o i ,…,o n };
Step 1.4, obtaining output of a linear layer of the RS model by using the formula (4):
Figure GDA00036435803700000313
in the formula (4), the reaction mixture is,
Figure GDA00036435803700000314
representing the ith line of text s i The predictive tag of (a);
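A minimal sketch of the Bi-LSTM and linear layers of steps 1.3 and 1.4 follows, assuming PyTorch; the hidden size and input dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RSModel(nn.Module):
    def __init__(self, input_dim, hidden_dim=128, n_tags=2):
        super().__init__()
        # step 1.3: Bi-LSTM over the line-level text vectors x_1 .. x_n
        self.bilstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True,
                              batch_first=True)
        # step 1.4: linear layer mapping o_i to a tag distribution over {B, I}
        self.linear = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, x):                      # x: (batch, n_lines, input_dim)
        o, _ = self.bilstm(x)                  # o_i = [h_i_forward; h_i_backward]
        return self.linear(o)                  # per-line logits
    def predict(self, x):
        return torch.softmax(self(x), dim=-1)  # y_hat_i as in equation (4)

rs = RSModel(input_dim=316)                    # 316 = assumed text + format dims
pred = rs.predict(torch.randn(1, 40, 316))     # 40 lines of one resume sample
print(pred.shape)                              # torch.Size([1, 40, 2])
```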
step 1.5, learning of the RS model:
step 1.5.1, define the loss function L(θ_RS) of the RS model using equation (5):

L(θ_RS) = -(1/N) · Σ_{j'} Σ_i [ w_B · D(y_i^(j') = B) + w_I · D(y_i^(j') = I) ] · CE(y_i^(j'), ŷ_i^(j'))    (5)

In equation (5), θ_RS denotes the model parameters, n_1 is the number of true labels B in the training set, n_2 is the number of true labels I in the training set, N is the number of samples in a batch, N_1 is the number of true labels I in a batch, N_2 is the number of true labels B in a batch, D(y_i^(j') = C) is an indicator function that equals 1 when y_i^(j') = C and 0 otherwise, w_B is the loss weight for the true label B, w_I is the loss weight for the true label I, y_i^(j') and ŷ_i^(j') denote the true label and predicted label of the i-th line of text in the j'-th resume sample, and CE(·, ·) is the binary cross-entropy loss function;
step 1.5.2, apply the Adam optimization algorithm to the loss function L(θ_RS) for gradient back-propagation and update the weights of the RS model until the RS model converges, thereby obtaining the resume segmentation model;
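A minimal sketch of the training of step 1.5 follows, assuming PyTorch and the RSModel sketched above; the class weights derived from the label counts are an assumption, since equation (5) specifies weighted binary cross-entropy but not the exact weight values.

```python
import torch
import torch.nn as nn

def train_rs(model, batches, n_b, n_i, epochs=10, lr=1e-3):
    # assumed weighting: the rarer B lines receive the larger weight
    weights = torch.tensor([n_i / (n_b + n_i), n_b / (n_b + n_i)])  # [w_B, w_I]
    criterion = nn.CrossEntropyLoss(weight=weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in batches:          # x: (1, n_lines, d); y: (1, n_lines), 0 = B, 1 = I
            logits = model(x)         # (1, n_lines, 2)
            loss = criterion(logits.view(-1, 2), y.view(-1))
            optimizer.zero_grad()
            loss.backward()           # gradient back-propagation
            optimizer.step()          # Adam update of the RS model weights
    return model
```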
step 1.6, prediction of a resume segmentation model;
process the resume sample in Word format to be predicted according to the procedures of step 1.1 and step 1.2 to obtain the text vector sequence to be predicted, and input it into the resume segmentation model to obtain the output prediction sequence;
step 2, resume block classification:
step 2.1, data preparation of the RC model:
divide the prediction sequence into a resume block sequence {block_1, …, block_t, …, block_z} according to the block start marks B, and label each resume block block_t with a one-hot category label y_t, where block_t denotes the t-th resume block and block_t = {x_t^1, …, x_t^r, …, x_t^{p_t}}, in which x_t^r denotes the r-th line text vector of the t-th resume block, z denotes the total number of resume blocks in one resume sample, and p_t denotes the total number of line text vectors contained in the t-th resume block;
step 2.2, resume block feature vector representation:
construct an encoder with the same structure as the Bi-LSTM network to encode the t-th resume block block_t, obtaining the encoder output Out_t = {out_t^1, …, out_t^r, …, out_t^{p_t}}, where out_t^r denotes the r-th output vector of fixed dimension;
process the encoder output Out_t with a max pooling layer to obtain the feature vector f_t of the t-th resume block block_t, thereby obtaining the feature vector sequence F = {f_1, …, f_t, …, f_z};
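A minimal sketch of the block encoder of step 2.2 follows, assuming PyTorch; the hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class BlockEncoder(nn.Module):
    """Encode one resume block (its p_t line vectors) into a fixed vector f_t."""
    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True,
                              batch_first=True)
    def forward(self, block):          # block: (1, p_t, input_dim)
        out, _ = self.bilstm(block)    # Out_t: (1, p_t, 2*hidden_dim)
        f_t, _ = out.max(dim=1)        # element-wise max pooling over the p_t lines
        return f_t                     # (1, 2*hidden_dim)

enc = BlockEncoder(input_dim=316)
f = enc(torch.randn(1, 7, 316))        # a block with 7 lines
print(f.shape)                         # torch.Size([1, 256])
```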
Step 2.3, forward calculation of the RC model:
constructing a Bi-GRU encoder for encoding the feature vector f t Encoding to obtain forward output of Bi-GRU encoder
Figure GDA0003643580370000051
And reverse output
Figure GDA0003643580370000052
Spliced output
Figure GDA0003643580370000053
Thereby obtaining the output of the Bi-GRU encoder
Figure GDA0003643580370000054
2.4, obtaining the probability distribution of z resume blocks after the output out of the Bi-GRU encoder is activated by a full connection layer dense and a softmax;
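A minimal sketch of the Bi-GRU classifier of steps 2.3 and 2.4 follows, assuming PyTorch; the number of block categories and the hidden size are assumptions.

```python
import torch
import torch.nn as nn

class RCModel(nn.Module):
    def __init__(self, feat_dim, hidden_dim=64, n_classes=8):
        super().__init__()
        # step 2.3: Bi-GRU over the block feature sequence f_1 .. f_z
        self.bigru = nn.GRU(feat_dim, hidden_dim, bidirectional=True,
                            batch_first=True)
        # step 2.4: fully connected layer "dense" followed by softmax
        self.dense = nn.Linear(2 * hidden_dim, n_classes)
    def forward(self, feats):                  # feats: (1, z, feat_dim)
        out, _ = self.bigru(feats)             # out_t = [g_t_forward; g_t_backward]
        return self.dense(out)                 # logits per block
    def predict(self, feats):
        return torch.softmax(self(feats), dim=-1)

rc = RCModel(feat_dim=256)
probs = rc.predict(torch.randn(1, 5, 256))     # 5 resume blocks of one sample
print(probs.shape)                             # torch.Size([1, 5, 8])
```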
step 2.5, learning of the RC model:
construct the loss function L_RC(θ_RC) of the RC model using equation (6):

L_RC(θ_RC) = -Σ_l Σ_{t=1}^{z} Σ_{c=1}^{k} y_{t,c}^(l) · log ŷ_{t,c}^(l)    (6)

In equation (6), k is the total number of label categories of the resume blocks in the resume block sequence {block_1, …, block_t, …, block_z}, y_{t,c}^(l) is the value of the c-th element of the true one-hot label of the t-th resume block in the l-th resume sample, and ŷ_{t,c}^(l) is the value of the c-th element of the predicted probability distribution of the t-th resume block in the l-th resume sample;
step 2.6, use the Mini-batch gradient descent optimization algorithm to back-propagate and update the parameters in the loss function L_RC(θ_RC) until the model converges, thereby obtaining the resume block classification model.
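A minimal sketch of the RC training of steps 2.5 and 2.6 follows, assuming PyTorch and the BlockEncoder and RCModel sketched above; plain SGD stands in for the mini-batch gradient descent algorithm, and the batching scheme is an assumption.

```python
import torch
import torch.nn as nn

def train_rc(rc_model, block_encoder, samples, epochs=20, lr=0.05):
    params = list(rc_model.parameters()) + list(block_encoder.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    criterion = nn.CrossEntropyLoss()          # categorical cross-entropy, cf. equation (6)
    for _ in range(epochs):
        for blocks, labels in samples:         # blocks: list of (1, p_t, d) tensors; labels: (z,)
            feats = torch.cat([block_encoder(b) for b in blocks], dim=0)  # (z, feat_dim)
            logits = rc_model(feats.unsqueeze(0)).squeeze(0)              # (z, n_classes)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                    # back-propagate and update the RC parameters
            optimizer.step()
    return rc_model
```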
Compared with the prior art, the invention has the beneficial effects that:
1. The resume segmentation and block classification method provided by the invention uses recurrent neural network structures of different levels and different granularities to perform resume segmentation and resume block classification. It addresses the inaccurate segmentation of keyword-matching approaches and the low classification accuracy of general-purpose text classification models in the prior art, improves the accuracy of resume block classification, and reduces the amount of training data required.
2. For resume segmentation, the invention adopts a sequence labeling approach and proposes the RS (Resume Segmentation) model, which fully fuses the text features and format features of the resume and uses Bi-LSTM to find segmentation boundaries while fully considering contextual sequence features. This avoids the problems of the traditional keyword and format feature matching approach, such as incomplete keyword libraries, difficult matching, the heavy workload of building a huge keyword library, keywords appearing in lines that are not segmentation boundaries, and small differences between the format features of block titles and block contents, thereby improving the accuracy of resume segmentation.
3. For resume block classification, the invention proposes the RC (Resume Classification) model, which extracts the features of each resume block with a sentence-granularity Bi-LSTM and classifies each resume block with a block-granularity Bi-GRU. This overcomes the inability of existing general-purpose text classification models to fuse the contextual sequence features of resume blocks, improves the accuracy of resume block classification, and effectively reduces the amount of training data required.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of a resume segmentation model according to the present invention;
FIG. 3 is a schematic diagram of resume segmentation according to the present invention;
FIG. 4 is a diagram of the resume block feature extraction architecture according to the present invention;
FIG. 5 is a diagram of the resume block classification model according to the present invention.
Detailed Description
In this embodiment, a resume block classification method based on a multi-level bidirectional recurrent neural network handles two tasks. For resume segmentation, a sequence labeling approach is adopted and a Resume Segmentation model (RS) based on a bidirectional long short-term memory recurrent neural network (Bi-LSTM) is proposed: each line of text in the resume is taken as the basic granularity, a format feature encoding is proposed and fused into the feature representation of each line of text, all line texts form a text sequence that is input into the RS segmentation model, and a label is generated for each line; the labels are of two types, block start (B) and intra-block (I). For the resume block classification task, a Resume block Classification model (RC) based on a bidirectional gated recurrent unit neural network (Bi-GRU) is proposed: each resume block is taken as the basic granularity, all resume blocks are arranged into a resume block sequence in their original order, each resume block is passed through a Bi-LSTM encoder and a max pooling layer to obtain its feature vector representation, and the feature vectors of all resume blocks form a sequence that is input into the Bi-GRU to obtain the corresponding block category output. Specifically, as shown in FIG. 1, the classification method comprises the following steps:
step 1, resume segmentation:
step 1.1, obtaining data of an RS model:
step 1.1.1, acquire a training set of resume samples in Word format; parse the xml tree of each resume sample with regular expressions to obtain the text of each line and its format features, and arrange the line texts of one resume sample in their original order into a line text sequence {s_1, s_2, …, s_i, …, s_n}, where s_i denotes the i-th line of text, f(s_i) denotes the format features of s_i, and n is the number of line texts;
step 1.1.2, assign the i-th line of text s_i a corresponding true label y_i, with y_i ∈ {B, I}, where B marks the start of a block and I marks a line inside a block;
step 1.1.3, apply word segmentation to the i-th line of text s_i to obtain the i-th word sequence W_i = {w_i^1, …, w_i^j, …, w_i^{m_i}}, where w_i^j denotes the j-th word of the i-th line of text and m_i is the number of words in s_i;
step 1.2, text representation:
step 1.2.1, use the pre-trained Word2vec word embedding model to convert the j-th word w_i^j of the i-th line of text s_i into the corresponding numeric vector v_i^j, thereby obtaining the vector sequence V_i = {v_i^1, …, v_i^j, …, v_i^{m_i}};
step 1.2.2, compute the importance score score_i^j of the j-th word w_i^j using equation (1), thereby obtaining the i-th importance score sequence {score_i^1, …, score_i^j, …, score_i^{m_i}}:

score_i^j = tf(w_i^j) · log( C(RS) / C(w_i^j) )    (1)

In equation (1), tf(w_i^j) denotes the number of occurrences of the j-th word w_i^j in the i-th line of text s_i, C(RS) denotes the total number of resume samples in the training set, and C(w_i^j) denotes the number of resume samples containing the j-th word w_i^j;
step 1.2.3, compute the weight distribution a_i^j = softmax(score_i^j) of the j-th word w_i^j of the i-th line of text s_i, where softmax(·) is the normalization function, thereby obtaining the i-th weight distribution sequence {a_i^1, …, a_i^j, …, a_i^{m_i}};
step 1.2.4, compute the line vector representation of the i-th line of text s_i as the weighted sum E(s_i) = Σ_{j=1}^{m_i} a_i^j · v_i^j;
step 1.2.5, for a line of text s_i, the format features are f(s_i) = {Bd, Sz, Cr, Ic, Ft, Pm, Sp, Len}, as shown in Table 1:
TABLE 1 Format feature information
(Table 1 enumerates the eight format features Bd, Sz, Cr, Ic, Ft, Pm, Sp and Len; its body is an image in the original and is not reproduced here.)
Use the one-layer feedforward neural network dense of the RS model, as shown in FIG. 2, to train f(s_i) according to equation (2), obtaining the embedded format feature encoding E(f(s_i)) of the i-th line of text s_i:

E(f(s_i)) = W_1 · relu(W_0 · f(s_i) + b_0) + b_1    (2)

In equation (2), relu(·) is the activation function, W_0 and W_1 are two weight matrices, and b_0 and b_1 are two bias terms;
step 1.2.6, obtain the text vector x_i of the i-th line of text s_i using equation (3), thereby obtaining the text vector sequence input = {x_1, x_2, …, x_i, …, x_n}:

x_i = [E(s_i); E(f(s_i))]    (3)
Step 1.3, Bi-LSTM network forward calculation of the RS model:
as shown in FIG. 3, the text vector is represented by x i Inputting into Bi-LSTM network, passing through LSTM circulation unit
Figure GDA0003643580370000081
To obtain a forward output
Figure GDA0003643580370000082
And pass through
Figure GDA0003643580370000083
To obtain a reverse output
Figure GDA0003643580370000084
Wherein
Figure GDA0003643580370000085
The calculation process corresponds to the formula (4) and the formula (5), and the output after splicing
Figure GDA0003643580370000086
Thereby obtaining an output sequence o 1 ,o 2 ,…,o i ,…,o n };
Figure GDA0003643580370000087
Figure GDA0003643580370000088
Step 1.4, obtaining output of a linear layer of the RS model by using the formula (6):
Figure GDA0003643580370000089
in the formula (6), the reaction mixture is,
Figure GDA00036435803700000810
representing the ith line of text s i The predictive tag of (a);
step 1.5, learning of the RS model:
step 1.5.1, define the loss function L(θ_RS) of the RS model using equation (7):

L(θ_RS) = -(1/N) · Σ_{j'} Σ_i [ w_B · D(y_i^(j') = B) + w_I · D(y_i^(j') = I) ] · CE(y_i^(j'), ŷ_i^(j'))    (7)

In equation (7), θ_RS denotes the model parameters, n_1 is the number of true labels B in the training set, n_2 is the number of true labels I in the training set, N is the number of samples in a batch, N_1 is the number of true labels I in a batch, N_2 is the number of true labels B in a batch, D(y_i^(j') = C) is an indicator function that equals 1 when y_i^(j') = C and 0 otherwise, w_B is the loss weight for the true label B, w_I is the loss weight for the true label I, y_i^(j') and ŷ_i^(j') denote the true label and predicted label of the i-th line of text in the j'-th resume sample, and CE(·, ·) is the binary cross-entropy loss function;
step 1.5.2, apply the Adam optimization algorithm to the loss function L(θ_RS) for gradient back-propagation and update the weights of the RS model until the RS model converges, thereby obtaining the resume segmentation model;
step 1.6, prediction of a resume segmentation model;
process the resume samples in Word format to be predicted according to the procedures of step 1.1 and step 1.2 to obtain the text vector sequence to be predicted, and input it into the resume segmentation model to obtain the output prediction sequence; FIG. 3 is a schematic diagram of resume segmentation in this embodiment: after the segmentation model runs, a predicted marker sequence is obtained; this marker sequence is traversed from the beginning, and each line of text marked B starts a new resume block, so that all line texts from one B mark up to (but not including) the next B mark form a single resume block, as sketched below.
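A minimal sketch of this traversal follows; the example lines and tags are illustrative only, using the patent's own B/I labels.

```python
from typing import List

def split_into_blocks(lines: List[str], tags: List[str]) -> List[List[str]]:
    blocks: List[List[str]] = []
    for line, tag in zip(lines, tags):
        if tag == "B" or not blocks:   # a line marked B starts a new resume block
            blocks.append([])
        blocks[-1].append(line)        # lines marked I stay in the current block
    return blocks

lines = ["Basic Information", "Name: Zhang San", "Education", "2015-2019 ..."]
print(split_into_blocks(lines, ["B", "I", "B", "I"]))
# [['Basic Information', 'Name: Zhang San'], ['Education', '2015-2019 ...']]
```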
Step 2, resume block classification:
step 2.1, data preparation of the RC model:
according to the division of step 1.6 into the resume block sequence {block_1, …, block_t, …, block_z}, label each resume block block_t with a one-hot category label y_t, where block_t denotes the t-th resume block and block_t = {x_t^1, …, x_t^r, …, x_t^{p_t}}, in which x_t^r denotes the r-th line text vector of the t-th resume block, z denotes the total number of resume blocks in one resume sample, and p_t denotes the total number of line text vectors contained in the t-th resume block;
step 2.2, resume block feature vector representation:
as shown in FIG. 4, construct an encoder with the same structure as the Bi-LSTM network to encode the t-th resume block block_t, obtaining the encoder output Out_t = {out_t^1, …, out_t^r, …, out_t^{p_t}}, where out_t^r denotes the r-th output vector of length d;
process the encoder output Out_t with a max pooling layer according to equation (8) to obtain the feature vector f_t of the t-th resume block block_t:

f_t[i'] = max_{r=1,…,p_t} out_t^r[i'],  i' = 1, …, d    (8)

In equation (8), out_t^r[i'] is the value of the i'-th element of out_t^r; the feature vector sequence F = {f_1, …, f_t, …, f_z} is thus obtained;
Step 2.3, forward calculation of the RC model:
as shown in FIG. 5, a Bi-GRU encoder is constructed for the feature vector f t And (6) coding is carried out. Through GRU circulation unit
Figure GDA00036435803700001017
Obtaining a forward output from a Bi-GRU encoder
Figure GDA0003643580370000104
Through a process
Figure GDA0003643580370000105
To obtain a reverse output
Figure GDA0003643580370000106
Wherein
Figure GDA0003643580370000107
Calculating corresponding to formula (9) and formula (10), and splicing to obtain output
Figure GDA0003643580370000108
Thereby obtaining the output of the Bi-GRU encoder
Figure GDA0003643580370000109
Figure GDA00036435803700001010
Figure GDA00036435803700001011
step 2.4, pass the output out of the Bi-GRU encoder through a fully connected layer dense and a softmax activation to obtain the category probability distributions of the z resume blocks;
step 2.5, learning of the RC model:
construct the loss function L_RC(θ_RC) of the RC model using equation (11):

L_RC(θ_RC) = -Σ_l Σ_{t=1}^{z} Σ_{c=1}^{k} y_{t,c}^(l) · log ŷ_{t,c}^(l)    (11)

In equation (11), k is the total number of label categories of the resume blocks in the resume block sequence {block_1, …, block_t, …, block_z}, y_{t,c}^(l) is the value of the c-th element of the true one-hot label of the t-th resume block in the l-th resume sample, and ŷ_{t,c}^(l) is the value of the c-th element of the predicted probability distribution of the t-th resume block in the l-th resume sample;
step 2.6, use the Mini-batch gradient descent optimization algorithm to back-propagate and update the parameters in the loss function L_RC(θ_RC) until the model converges, thereby obtaining the resume block classification model.

Claims (1)

1. A resume block classification method based on a multi-level bidirectional recurrent neural network, characterized by comprising the following steps:
step 1, resume segmentation:
step 1.1, obtaining data of an RS model:
step 1.1.1, acquire a training set of resume samples in Word format; parse the xml tree of each resume sample with regular expressions to obtain the text of each line and its format features, and arrange the line texts of one resume sample in their original order into a line text sequence {s_1, s_2, …, s_i, …, s_n}, where s_i denotes the i-th line of text, f(s_i) denotes the format features of s_i, and n is the number of line texts;
step 1.1.2, assign the i-th line of text s_i a corresponding true label y_i, with y_i ∈ {B, I}, where B marks the start of a block and I marks a line inside a block;
step 1.1.3, apply word segmentation to the i-th line of text s_i to obtain the i-th word sequence W_i = {w_i^1, …, w_i^j, …, w_i^{m_i}}, where w_i^j denotes the j-th word of the i-th line of text and m_i is the number of words in s_i;
step 1.2, text representation:
step 1.2.1, use a pre-trained Word2vec word embedding model to convert the j-th word w_i^j of the i-th line of text s_i into the corresponding numeric vector v_i^j, thereby obtaining the vector sequence V_i = {v_i^1, …, v_i^j, …, v_i^{m_i}};
step 1.2.2, compute the importance score score_i^j of the j-th word w_i^j using equation (1), thereby obtaining the i-th importance score sequence {score_i^1, …, score_i^j, …, score_i^{m_i}}:

score_i^j = tf(w_i^j) · log( C(RS) / C(w_i^j) )    (1)

In equation (1), tf(w_i^j) denotes the number of occurrences of the j-th word w_i^j in the i-th line of text s_i, C(RS) denotes the total number of resume samples in the training set, and C(w_i^j) denotes the number of resume samples containing the j-th word w_i^j;
step 1.2.3, compute the weight distribution a_i^j = softmax(score_i^j) of the j-th word w_i^j of the i-th line of text s_i, where softmax(·) is the normalization function, thereby obtaining the i-th weight distribution sequence {a_i^1, …, a_i^j, …, a_i^{m_i}};
step 1.2.4, compute the line vector representation of the i-th line of text s_i as the weighted sum E(s_i) = Σ_{j=1}^{m_i} a_i^j · v_i^j;
step 1.2.5, use a one-layer feedforward neural network of the RS model to train the format features f(s_i) of the i-th line of text s_i according to equation (2), obtaining the embedded format feature encoding E(f(s_i)) of the i-th line of text s_i:

E(f(s_i)) = W_1 · relu(W_0 · f(s_i) + b_0) + b_1    (2)

In equation (2), relu(·) is the activation function, W_0 and W_1 are two weight matrices, and b_0 and b_1 are two bias terms;
step 1.2.6, obtain the text vector x_i of the i-th line of text s_i using equation (3), thereby obtaining the text vector sequence input = {x_1, x_2, …, x_i, …, x_n}:

x_i = [E(s_i); E(f(s_i))]    (3)
Step 1.3, forward calculation of the Bi-LSTM network of the RS model:
representing a text vector by x i Inputting into Bi-LSTM network to obtain forward output from Bi-LSTM network
Figure FDA0003643580360000021
And reverse output
Figure FDA0003643580360000022
Spliced output
Figure FDA0003643580360000023
Thereby obtaining an output sequence o 1 ,o 2 ,…,o i ,…,o n };
Step 1.4, obtaining an output of a linear layer of the RS model by using the formula (4):
Figure FDA0003643580360000024
in the formula (4), the reaction mixture is,
Figure FDA0003643580360000025
representing the i-th line of text s i The predictive tag of (a);
step 1.5, learning of the RS model:
step 1.5.1, define the loss function L(θ_RS) of the RS model using equation (5):

L(θ_RS) = -(1/N) · Σ_{j'} Σ_i [ w_B · D(y_i^(j') = B) + w_I · D(y_i^(j') = I) ] · CE(y_i^(j'), ŷ_i^(j'))    (5)

In equation (5), θ_RS denotes the model parameters, n_1 is the number of true labels B in the training set, n_2 is the number of true labels I in the training set, N is the number of samples in a batch, N_1 is the number of true labels I in a batch, N_2 is the number of true labels B in a batch, D(y_i^(j') = C) is an indicator function that equals 1 when y_i^(j') = C and 0 otherwise, w_B is the loss weight for the true label B, w_I is the loss weight for the true label I, y_i^(j') and ŷ_i^(j') denote the true label and predicted label of the i-th line of text in the j'-th resume sample, and CE(·, ·) is the binary cross-entropy loss function;
step 1.5.2, apply the Adam optimization algorithm to the loss function L(θ_RS) for gradient back-propagation and update the weights of the RS model until the RS model converges, thereby obtaining the resume segmentation model;
step 1.6, prediction of the resume segmentation model:
process the resume samples in Word format to be predicted according to the procedures of step 1.1 and step 1.2 to obtain the text vector sequence to be predicted, and input it into the resume segmentation model to obtain the output prediction sequence;
step 2, resume block classification:
step 2.1, data preparation of the RC model:
divide the prediction sequence into a resume block sequence {block_1, …, block_t, …, block_z} according to the block start marks B, and label each resume block block_t with a one-hot category label y_t, where block_t denotes the t-th resume block and block_t = {x_t^1, …, x_t^r, …, x_t^{p_t}}, in which x_t^r denotes the r-th line text vector of the t-th resume block, z denotes the total number of resume blocks in one resume sample, and p_t denotes the total number of line text vectors contained in the t-th resume block;
step 2.2, resume block feature vector representation:
construct an encoder with the same structure as the Bi-LSTM network to encode the t-th resume block block_t, obtaining the encoder output Out_t = {out_t^1, …, out_t^r, …, out_t^{p_t}}, where out_t^r denotes the r-th output vector of fixed dimension;
process the encoder output Out_t with a max pooling layer to obtain the feature vector f_t of the t-th resume block block_t, thereby obtaining the feature vector sequence F = {f_1, …, f_t, …, f_z};
step 2.3, forward calculation of the RC model:
construct a Bi-GRU encoder to encode the feature vector f_t, obtaining the forward output g_t^→ and the backward output g_t^← of the Bi-GRU encoder, which are concatenated into the output out_t = [g_t^→; g_t^←], thereby obtaining the output of the Bi-GRU encoder out = {out_1, …, out_t, …, out_z};
step 2.4, pass the output out of the Bi-GRU encoder through a fully connected layer dense and a softmax activation to obtain the category probability distributions of the z resume blocks;
step 2.5, learning of the RC model:
construct the loss function L_RC(θ_RC) of the RC model using equation (6):

L_RC(θ_RC) = -Σ_l Σ_{t=1}^{z} Σ_{c=1}^{k} y_{t,c}^(l) · log ŷ_{t,c}^(l)    (6)

In equation (6), k is the total number of label categories of the resume blocks in the resume block sequence {block_1, …, block_t, …, block_z}, y_{t,c}^(l) is the value of the c-th element of the true one-hot label of the t-th resume block in the l-th resume sample, and ŷ_{t,c}^(l) is the value of the c-th element of the predicted probability distribution of the t-th resume block in the l-th resume sample;
step 2.6, use the Mini-batch gradient descent optimization algorithm to back-propagate and update the parameters in the loss function L_RC(θ_RC) until the model converges, thereby obtaining the resume block classification model.
CN202110685320.5A 2021-06-21 2021-06-21 Resume block classification method based on multi-level bidirectional circulation neural network Expired - Fee Related CN113297845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110685320.5A CN113297845B (en) 2021-06-21 2021-06-21 Resume block classification method based on multi-level bidirectional circulation neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110685320.5A CN113297845B (en) 2021-06-21 2021-06-21 Resume block classification method based on multi-level bidirectional circulation neural network

Publications (2)

Publication Number Publication Date
CN113297845A CN113297845A (en) 2021-08-24
CN113297845B true CN113297845B (en) 2022-07-26

Family

ID=77328902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110685320.5A Expired - Fee Related CN113297845B (en) 2021-06-21 2021-06-21 Resume block classification method based on multi-level bidirectional circulation neural network

Country Status (1)

Country Link
CN (1) CN113297845B (en)

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334605B (en) * 2018-02-01 2020-06-16 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN109376242B (en) * 2018-10-18 2020-11-17 西安工程大学 Text classification method based on cyclic neural network variant and convolutional neural network
CN109543084B (en) * 2018-11-09 2021-01-19 西安交通大学 Method for establishing detection model of hidden sensitive text facing network social media
CN109635288B (en) * 2018-11-29 2023-05-23 东莞理工学院 Resume extraction method based on deep neural network
CN109753909B (en) * 2018-12-27 2021-08-10 广东人啊人网络技术开发有限公司 Resume analysis method based on content blocking and BilSTM model
US11194962B2 (en) * 2019-06-05 2021-12-07 Fmr Llc Automated identification and classification of complaint-specific user interactions using a multilayer neural network
CN110442841B (en) * 2019-06-20 2024-02-02 平安科技(深圳)有限公司 Resume identification method and device, computer equipment and storage medium
CN110888927B (en) * 2019-11-14 2023-04-18 东莞理工学院 Resume information extraction method and system
CN111026845B (en) * 2019-12-06 2021-09-21 北京理工大学 Text classification method for acquiring multilevel context semantics
CN111428488A (en) * 2020-03-06 2020-07-17 平安科技(深圳)有限公司 Resume data information analyzing and matching method and device, electronic equipment and medium
CN112149389A (en) * 2020-09-27 2020-12-29 南方电网数字电网研究院有限公司 Resume information structured processing method and device, computer equipment and storage medium
CN112416956B (en) * 2020-11-19 2023-04-07 重庆邮电大学 Question classification method based on BERT and independent cyclic neural network

Also Published As

Publication number Publication date
CN113297845A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN111694924A (en) Event extraction method and system
CN111639171A (en) Knowledge graph question-answering method and device
CN109359291A (en) A kind of name entity recognition method
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN112541355A (en) Few-sample named entity identification method and system with entity boundary class decoupling
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN113836891A (en) Method and device for extracting structured information based on multi-element labeling strategy
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN114139522A (en) Key information identification method based on level attention and label guided learning
CN115203338A (en) Label and label example recommendation method
CN115952791A (en) Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN114722204A (en) Multi-label text classification method and device
CN113961666A (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN117725211A (en) Text classification method and system based on self-constructed prompt template
CN117390189A (en) Neutral text generation method based on pre-classifier
CN112883726A (en) Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning
CN113297845B (en) Resume block classification method based on multi-level bidirectional circulation neural network
CN116362247A (en) Entity extraction method based on MRC framework
CN113705222B (en) Training method and device for slot identification model and slot filling method and device
CN116258204A (en) Industrial safety production violation punishment management method and system based on knowledge graph
CN114510943A (en) Incremental named entity identification method based on pseudo sample playback
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220726