CN113297845B - Resume block classification method based on multi-level bidirectional recurrent neural network - Google Patents

Resume block classification method based on multi-level bidirectional recurrent neural network

Info

Publication number
CN113297845B
CN113297845B (application CN202110685320.5A)
Authority
CN
China
Prior art keywords
resume
text
block
model
line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110685320.5A
Other languages
Chinese (zh)
Other versions
CN113297845A (en)
Inventor
许启强
张吉
李嘉木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202110685320.5A
Publication of CN113297845A
Application granted
Publication of CN113297845B
Legal status: Expired - Fee Related (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a resume block classification method based on a multi-level bidirectional recurrent neural network, which comprises the following steps: 1. resume segmentation: acquire training data for the RS model, convert it into vector representations, perform forward calculation through the Bi-LSTM layer and the linear layer, update the model parameters by back-propagation, and use the resulting resume segmentation model to predict a segmented sequence of resume blocks; 2. resume block classification: take the segmented resume block sequence as training data for the RC model, obtain resume block feature vectors with the Bi-LSTM and max pooling layers, perform forward calculation with the Bi-GRU and softmax layers to obtain the block category probability distributions, update the RC model parameters with a gradient descent algorithm, and use the resulting resume block classification model for prediction. The method improves the accuracy of resume segmentation and resume block classification, and addresses the low accuracy of schemes based on keyword and format feature matching as well as the heavy workload of building keyword libraries.

Description

Resume block classification method based on multi-level bidirectional recurrent neural network
Technical Field
The invention belongs to the sub-fields of text segmentation, text classification and information extraction within natural language processing, a direction of computer science and technology, and in particular relates to a resume block classification method based on a multi-level bidirectional recurrent neural network.
Background
Resume information extraction is a technique that uses a computer program to extract the content of semi-structured resume documents into structured content. Through resume information extraction, structured resume information can be obtained and stored in a structured form, which in turn allows subsequent automatic analysis tools to perform meaningful analysis on the structured resume data, such as automatic job recommendation, resume screening, resume querying and candidate recommendation.
The resume information extraction process comprises the following steps: resume segmentation, resume block classification and block information extraction. When a resume document is input, the text content of the resume is first divided according to certain boundary features to obtain a list of resume blocks; this is called resume segmentation. Each segmented resume block is then classified, and according to the classification result the block information extraction algorithm of the corresponding category is called to extract the entity information of the block; for example, for a "basic information" block the entity information includes "name", "contact information", "address" and so on. Finally the extracted entity information is stored in a database, as illustrated by the sketch below.
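For illustration only, the following is a minimal sketch of this three-stage pipeline; the function names, the category label and the stub bodies are hypothetical placeholders and are not part of the disclosed method.

```python
# Sketch of: resume segmentation -> resume block classification -> per-category
# block information extraction -> structured storage. All names are assumptions.
from typing import Callable, Dict, List

def segment_resume(lines: List[str]) -> List[List[str]]:
    """Stand-in for the RS model: group lines into resume blocks."""
    return [lines]                                  # trivial stub: one block

def classify_blocks(blocks: List[List[str]]) -> List[str]:
    """Stand-in for the RC model: assign a category to each block."""
    return ["basic information"] * len(blocks)

def extract_basic_info(block: List[str]) -> Dict[str, str]:
    return {"name": block[0] if block else ""}      # toy entity extraction

EXTRACTORS: Dict[str, Callable[[List[str]], Dict[str, str]]] = {
    "basic information": extract_basic_info,
}

def extract_resume(lines: List[str]) -> List[Dict[str, str]]:
    blocks = segment_resume(lines)
    labels = classify_blocks(blocks)
    # dispatch each block to the extractor of its predicted category;
    # the resulting structured records would then be stored in a database
    return [EXTRACTORS[label](block) for block, label in zip(blocks, labels)]

print(extract_resume(["Zhang San", "Phone: 123"]))
```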
The existing resume segmentation methods mainly match block titles based on keywords and format features. The specific rule is to build a large keyword library; when the segmentation algorithm runs, it checks whether the resume text content appears in the keyword library and whether the format features of the text match the characteristics of a block title (block start mark), and if both conditions are met the position is used as a segmentation boundary. This approach has several drawbacks, and its accuracy is too low to meet actual business requirements. When the keyword library is not complete enough, not all title keywords can be detected, and because keywords take many different forms, the workload of building a complete keyword library is extremely high. Moreover, some text content that is not a boundary mark may also contain keywords, which scatters the information. In addition, because resumes are written in many different styles, some resumes have no obvious block titles, or the format features of the block titles differ little from those of the block contents; in such cases the segmentation produced by this method is far from satisfactory.
The existing resume block classification methods mainly use either traditional keyword matching or general-purpose text classification. The first, keyword-matching method identifies the category of a resume block's text content by means of the boundary marks (block titles) recognized during resume segmentation. For example, if one piece of text is matched as "basic information" and the next matched text is "education background", the content between the two keywords is directly assigned the category "basic information". This method also suffers from low accuracy: first, the error propagation caused by realizing resume segmentation through keyword matching directly lowers its effectiveness, and second, the block titles and the corresponding block contents do not always belong to the same category. The second method applies conventional, general-purpose text classification algorithms, such as the Support Vector Machine (SVM) and Random Forest (RF), to the text content. It classifies each block in isolation and does not consider the special format characteristics of resumes: the arrangement of the resume blocks in a resume document is typically ordered, e.g. the "basic information" block is usually placed first or second, and the "self-evaluation" block is usually placed last. Because this method fails to incorporate the sequence features of the resume blocks, its accuracy remains low, and it needs more samples for model training, which means a large workload.
Disclosure of Invention
The invention aims to overcome the shortcomings of the background art and provides a resume block classification method based on a multi-level bidirectional recurrent neural network, so that the format characteristics of the resume can be fully fused, the precision of resume segmentation and the accuracy of resume block classification are improved, and the problems of low accuracy and heavy workload of matching based on keywords and format features are solved.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The invention relates to a resume block classification method based on a multi-level bidirectional recurrent neural network, which is characterized by comprising the following steps:
step 1, resume segmentation:
step 1.1, obtaining data of an RS model:
step 1.1.1, acquire a training set of resume samples in Word format; parse the xml tree of each resume sample with regular expressions to obtain the text of each line and its format features, and arrange the line texts of one resume sample in their original order into a line text sequence {s_1, s_2, …, s_i, …, s_n}, where s_i denotes the i-th line of text, f(s_i) denotes the format features of s_i, and n is the number of line texts;
step 1.1.2, assign the i-th line of text s_i a corresponding true label y_i, with y_i ∈ {B, I}, where B marks the start of a block and I marks a line inside a block;
step 1.1.3, apply word segmentation to the i-th line of text s_i to obtain the i-th word sequence W_i = {w_i^1, …, w_i^j, …, w_i^{m_i}}, where w_i^j denotes the j-th word of the i-th line of text and m_i is the number of words in s_i;
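As an illustration of step 1.1.1, the following is a minimal sketch, assuming the .docx file is opened as a zip archive and its word/document.xml is scanned with regular expressions; the three format features shown (bold, size, centering) are illustrative assumptions standing in for the format features f(s_i).

```python
import re
import zipfile

def parse_docx_lines(path: str):
    with zipfile.ZipFile(path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    lines = []
    for para in re.findall(r"<w:p[ >].*?</w:p>", xml, flags=re.S):
        # concatenate the text runs of one paragraph into a line of text
        text = "".join(re.findall(r"<w:t[^>]*>(.*?)</w:t>", para, flags=re.S))
        if not text.strip():
            continue
        size_match = re.search(r'<w:sz w:val="(\d+)"', para)
        fmt = {
            "bold": 1 if "<w:b/>" in para else 0,                   # assumed feature
            "size": int(size_match.group(1)) if size_match else 0,  # assumed feature
            "centered": 1 if '<w:jc w:val="center"' in para else 0, # assumed feature
        }
        lines.append((text, fmt))     # (line text s_i, format features f(s_i))
    return lines
```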
step 1.2, text representation:
step 1.2.1, use a pre-trained Word2vec word embedding model to convert the j-th word w_i^j of the i-th line of text s_i into the corresponding numeric vector v_i^j, thereby obtaining the vector sequence V_i = {v_i^1, …, v_i^j, …, v_i^{m_i}};
step 1.2.2, compute the importance score score_i^j of the j-th word w_i^j using equation (1), thereby obtaining the i-th importance score sequence {score_i^1, …, score_i^j, …, score_i^{m_i}}:

score_i^j = tf(w_i^j) · log( C(RS) / C(w_i^j) )    (1)

In equation (1), tf(w_i^j) denotes the number of occurrences of the j-th word w_i^j in the i-th line of text s_i, C(RS) denotes the total number of resume samples in the training set, and C(w_i^j) denotes the number of resume samples containing the j-th word w_i^j;
step 1.2.3, compute the weight distribution a_i^j = softmax(score_i^j) of the j-th word w_i^j of the i-th line of text s_i, where softmax(·) is the normalization function, thereby obtaining the i-th weight distribution sequence {a_i^1, …, a_i^j, …, a_i^{m_i}};
step 1.2.4, compute the line vector representation of the i-th line of text s_i as the weighted sum E(s_i) = Σ_{j=1}^{m_i} a_i^j · v_i^j;
step 1.2.5, use a one-layer feedforward neural network of the RS model to train the format features f(s_i) of the i-th line of text s_i according to equation (2), obtaining the embedded format feature encoding E(f(s_i)) of the i-th line of text s_i:

E(f(s_i)) = W_1 · relu(W_0 · f(s_i) + b_0) + b_1    (2)

In equation (2), relu(·) is the activation function, W_0 and W_1 are two weight matrices, and b_0 and b_1 are two bias terms;
step 1.2.6, obtain the text vector x_i of the i-th line of text s_i using equation (3), thereby obtaining the text vector sequence input = {x_1, x_2, …, x_i, …, x_n}:

x_i = [E(s_i); E(f(s_i))]    (3)
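A minimal sketch of the text representation of step 1.2 follows, assuming PyTorch and pre-trained word vectors held in a dictionary; the vector dimensions, the hidden sizes and the reading of equation (1) as tf-idf weighting are assumptions.

```python
import math
import torch
import torch.nn as nn

def line_vector(words, word_vecs, doc_count, doc_freq):
    # step 1.2.1: look up the Word2vec vector v_i^j for each word
    vecs = torch.stack([word_vecs[w] for w in words])              # (m_i, d_w)
    # step 1.2.2: importance score = term frequency * log(C(RS) / C(w))
    tf = torch.tensor([words.count(w) for w in words], dtype=torch.float)
    idf = torch.tensor([math.log(doc_count / doc_freq[w]) for w in words])
    score = tf * idf
    # step 1.2.3: weight distribution via softmax over the scores
    a = torch.softmax(score, dim=0)
    # step 1.2.4: weighted sum gives the line vector representation E(s_i)
    return a @ vecs                                                 # (d_w,)

class FormatEncoder(nn.Module):
    """Step 1.2.5: one-layer feedforward encoding of the format features."""
    def __init__(self, n_feat=8, hidden=16, out=16):
        super().__init__()
        self.w0 = nn.Linear(n_feat, hidden)   # W_0, b_0
        self.w1 = nn.Linear(hidden, out)      # W_1, b_1
    def forward(self, f):
        return self.w1(torch.relu(self.w0(f)))

def text_vector(words, fmt, word_vecs, doc_count, doc_freq, fmt_enc):
    # step 1.2.6: x_i = [E(s_i); E(f(s_i))]
    return torch.cat([line_vector(words, word_vecs, doc_count, doc_freq),
                      fmt_enc(fmt)])
```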
Step 1.3, Bi-LSTM network forward calculation of the RS model:
representing a text vector by x i Inputting into Bi-LSTM network to obtain forward output from Bi-LSTM network
Figure GDA00036435803700000310
And reverse output
Figure GDA00036435803700000311
Spliced output
Figure GDA00036435803700000312
Thereby obtaining an output sequence o 1 ,o 2 ,…,o i ,…,o n };
Step 1.4, obtaining output of a linear layer of the RS model by using the formula (4):
Figure GDA00036435803700000313
in the formula (4), the reaction mixture is,
Figure GDA00036435803700000314
representing the ith line of text s i The predictive tag of (a);
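A minimal sketch of the Bi-LSTM and linear layers of steps 1.3 and 1.4 follows, assuming PyTorch; the hidden size and input dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RSModel(nn.Module):
    def __init__(self, input_dim, hidden_dim=128, n_tags=2):
        super().__init__()
        # step 1.3: Bi-LSTM over the line-level text vectors x_1 .. x_n
        self.bilstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True,
                              batch_first=True)
        # step 1.4: linear layer mapping o_i to a tag distribution over {B, I}
        self.linear = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, x):                      # x: (batch, n_lines, input_dim)
        o, _ = self.bilstm(x)                  # o_i = [h_i_forward; h_i_backward]
        return self.linear(o)                  # per-line logits
    def predict(self, x):
        return torch.softmax(self(x), dim=-1)  # y_hat_i as in equation (4)

rs = RSModel(input_dim=316)                    # 316 = assumed text + format dims
pred = rs.predict(torch.randn(1, 40, 316))     # 40 lines of one resume sample
print(pred.shape)                              # torch.Size([1, 40, 2])
```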
step 1.5, learning of the RS model:
step 1.5.1, define the loss function L(θ_RS) of the RS model using equation (5):

L(θ_RS) = -(1/N) · Σ_{j'} Σ_i [ w_B · D(y_i^(j') = B) + w_I · D(y_i^(j') = I) ] · CE(y_i^(j'), ŷ_i^(j'))    (5)

In equation (5), θ_RS denotes the model parameters, n_1 is the number of true labels B in the training set, n_2 is the number of true labels I in the training set, N is the number of samples in a batch, N_1 is the number of true labels I in a batch, N_2 is the number of true labels B in a batch, D(y_i^(j') = C) is an indicator function that equals 1 when y_i^(j') = C and 0 otherwise, w_B is the loss weight for the true label B, w_I is the loss weight for the true label I, y_i^(j') and ŷ_i^(j') denote the true label and predicted label of the i-th line of text in the j'-th resume sample, and CE(·, ·) is the binary cross-entropy loss function;
step 1.5.2, apply the Adam optimization algorithm to the loss function L(θ_RS) for gradient back-propagation and update the weights of the RS model until the RS model converges, thereby obtaining the resume segmentation model;
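A minimal sketch of the training of step 1.5 follows, assuming PyTorch and the RSModel sketched above; the class weights derived from the label counts are an assumption, since equation (5) specifies weighted binary cross-entropy but not the exact weight values.

```python
import torch
import torch.nn as nn

def train_rs(model, batches, n_b, n_i, epochs=10, lr=1e-3):
    # assumed weighting: the rarer B lines receive the larger weight
    weights = torch.tensor([n_i / (n_b + n_i), n_b / (n_b + n_i)])  # [w_B, w_I]
    criterion = nn.CrossEntropyLoss(weight=weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in batches:          # x: (1, n_lines, d); y: (1, n_lines), 0 = B, 1 = I
            logits = model(x)         # (1, n_lines, 2)
            loss = criterion(logits.view(-1, 2), y.view(-1))
            optimizer.zero_grad()
            loss.backward()           # gradient back-propagation
            optimizer.step()          # Adam update of the RS model weights
    return model
```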
step 1.6, prediction of a resume segmentation model;
process the resume sample in Word format to be predicted according to the procedures of step 1.1 and step 1.2 to obtain the text vector sequence to be predicted, and input it into the resume segmentation model to obtain the output prediction sequence;
step 2, resume block classification:
step 2.1, data preparation of the RC model:
divide the prediction sequence into a resume block sequence {block_1, …, block_t, …, block_z} according to the block start marks B, and label each resume block block_t with a one-hot category label y_t, where block_t denotes the t-th resume block and block_t = {x_t^1, …, x_t^r, …, x_t^{p_t}}, in which x_t^r denotes the r-th line text vector of the t-th resume block, z denotes the total number of resume blocks in one resume sample, and p_t denotes the total number of line text vectors contained in the t-th resume block;
step 2.2, resume block feature vector representation:
construct an encoder with the same structure as the Bi-LSTM network to encode the t-th resume block block_t, obtaining the encoder output Out_t = {out_t^1, …, out_t^r, …, out_t^{p_t}}, where out_t^r denotes the r-th output vector of fixed dimension;
process the encoder output Out_t with a max pooling layer to obtain the feature vector f_t of the t-th resume block block_t, thereby obtaining the feature vector sequence F = {f_1, …, f_t, …, f_z};
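A minimal sketch of the block encoder of step 2.2 follows, assuming PyTorch; the hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class BlockEncoder(nn.Module):
    """Encode one resume block (its p_t line vectors) into a fixed vector f_t."""
    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True,
                              batch_first=True)
    def forward(self, block):          # block: (1, p_t, input_dim)
        out, _ = self.bilstm(block)    # Out_t: (1, p_t, 2*hidden_dim)
        f_t, _ = out.max(dim=1)        # element-wise max pooling over the p_t lines
        return f_t                     # (1, 2*hidden_dim)

enc = BlockEncoder(input_dim=316)
f = enc(torch.randn(1, 7, 316))        # a block with 7 lines
print(f.shape)                         # torch.Size([1, 256])
```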
Step 2.3, forward calculation of the RC model:
constructing a Bi-GRU encoder for encoding the feature vector f t Encoding to obtain forward output of Bi-GRU encoder
Figure GDA0003643580370000051
And reverse output
Figure GDA0003643580370000052
Spliced output
Figure GDA0003643580370000053
Thereby obtaining the output of the Bi-GRU encoder
Figure GDA0003643580370000054
2.4, obtaining the probability distribution of z resume blocks after the output out of the Bi-GRU encoder is activated by a full connection layer dense and a softmax;
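A minimal sketch of the Bi-GRU classifier of steps 2.3 and 2.4 follows, assuming PyTorch; the number of block categories and the hidden size are assumptions.

```python
import torch
import torch.nn as nn

class RCModel(nn.Module):
    def __init__(self, feat_dim, hidden_dim=64, n_classes=8):
        super().__init__()
        # step 2.3: Bi-GRU over the block feature sequence f_1 .. f_z
        self.bigru = nn.GRU(feat_dim, hidden_dim, bidirectional=True,
                            batch_first=True)
        # step 2.4: fully connected layer "dense" followed by softmax
        self.dense = nn.Linear(2 * hidden_dim, n_classes)
    def forward(self, feats):                  # feats: (1, z, feat_dim)
        out, _ = self.bigru(feats)             # out_t = [g_t_forward; g_t_backward]
        return self.dense(out)                 # logits per block
    def predict(self, feats):
        return torch.softmax(self(feats), dim=-1)

rc = RCModel(feat_dim=256)
probs = rc.predict(torch.randn(1, 5, 256))     # 5 resume blocks of one sample
print(probs.shape)                             # torch.Size([1, 5, 8])
```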
step 2.5, learning of the RC model:
construct the loss function L_RC(θ_RC) of the RC model using equation (6):

L_RC(θ_RC) = -Σ_l Σ_{t=1}^{z} Σ_{c=1}^{k} y_{t,c}^(l) · log ŷ_{t,c}^(l)    (6)

In equation (6), k is the total number of label categories of the resume blocks in the resume block sequence {block_1, …, block_t, …, block_z}, y_{t,c}^(l) is the value of the c-th element of the true one-hot label of the t-th resume block in the l-th resume sample, and ŷ_{t,c}^(l) is the value of the c-th element of the predicted probability distribution of the t-th resume block in the l-th resume sample;
step 2.6, use the Mini-batch gradient descent optimization algorithm to back-propagate and update the parameters in the loss function L_RC(θ_RC) until the model converges, thereby obtaining the resume block classification model.
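A minimal sketch of the RC training of steps 2.5 and 2.6 follows, assuming PyTorch and the BlockEncoder and RCModel sketched above; plain SGD stands in for the mini-batch gradient descent algorithm, and the batching scheme is an assumption.

```python
import torch
import torch.nn as nn

def train_rc(rc_model, block_encoder, samples, epochs=20, lr=0.05):
    params = list(rc_model.parameters()) + list(block_encoder.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    criterion = nn.CrossEntropyLoss()          # categorical cross-entropy, cf. equation (6)
    for _ in range(epochs):
        for blocks, labels in samples:         # blocks: list of (1, p_t, d) tensors; labels: (z,)
            feats = torch.cat([block_encoder(b) for b in blocks], dim=0)  # (z, feat_dim)
            logits = rc_model(feats.unsqueeze(0)).squeeze(0)              # (z, n_classes)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                    # back-propagate and update the RC parameters
            optimizer.step()
    return rc_model
```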
Compared with the prior art, the invention has the beneficial effects that:
1. The resume segmentation and block classification method provided by the invention uses recurrent neural network structures of different levels and different granularities to perform resume segmentation and resume block classification. It addresses the inaccurate segmentation of keyword-matching approaches and the low classification accuracy of general-purpose text classification models in the prior art, improves the accuracy of resume block classification, and reduces the amount of training data required.
2. For resume segmentation, the invention adopts a sequence labeling approach and proposes the RS (Resume Segmentation) model, which fully fuses the text features and format features of the resume and uses Bi-LSTM to find segmentation boundaries while fully considering contextual sequence features. This avoids the problems of the traditional keyword and format feature matching approach, such as incomplete keyword libraries, difficult matching, the heavy workload of building a huge keyword library, keywords appearing in lines that are not segmentation boundaries, and small differences between the format features of block titles and block contents, thereby improving the accuracy of resume segmentation.
3. For resume block classification, the invention proposes the RC (Resume Classification) model, which extracts the features of each resume block with a sentence-granularity Bi-LSTM and classifies each resume block with a block-granularity Bi-GRU. This overcomes the inability of existing general-purpose text classification models to fuse the contextual sequence features of resume blocks, improves the accuracy of resume block classification, and effectively reduces the amount of training data required.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of a resume segmentation model according to the present invention;
FIG. 3 is a schematic diagram of resume segmentation according to the present invention;
FIG. 4 is a diagram of the resume block feature extraction architecture according to the present invention;
FIG. 5 is a diagram of the resume block classification model according to the present invention.
Detailed Description
In this embodiment, a resume block classification method based on a multi-level bidirectional recurrent neural network handles two tasks. For resume segmentation, a sequence labeling approach is adopted and a Resume Segmentation model (RS) based on a bidirectional long short-term memory recurrent neural network (Bi-LSTM) is proposed: each line of text in the resume is taken as the basic granularity, a format feature encoding is proposed and fused into the feature representation of each line of text, all line texts form a text sequence that is input into the RS segmentation model, and a label is generated for each line; the labels are of two types, block start (B) and intra-block (I). For the resume block classification task, a Resume block Classification model (RC) based on a bidirectional gated recurrent unit neural network (Bi-GRU) is proposed: each resume block is taken as the basic granularity, all resume blocks are arranged into a resume block sequence in their original order, each resume block is passed through a Bi-LSTM encoder and a max pooling layer to obtain its feature vector representation, and the feature vectors of all resume blocks form a sequence that is input into the Bi-GRU to obtain the corresponding block category output. Specifically, as shown in FIG. 1, the classification method comprises the following steps:
step 1, resume segmentation:
step 1.1, obtaining data of an RS model:
step 1.1.1, acquire a training set of resume samples in Word format; parse the xml tree of each resume sample with regular expressions to obtain the text of each line and its format features, and arrange the line texts of one resume sample in their original order into a line text sequence {s_1, s_2, …, s_i, …, s_n}, where s_i denotes the i-th line of text, f(s_i) denotes the format features of s_i, and n is the number of line texts;
step 1.1.2, assign the i-th line of text s_i a corresponding true label y_i, with y_i ∈ {B, I}, where B marks the start of a block and I marks a line inside a block;
step 1.1.3, apply word segmentation to the i-th line of text s_i to obtain the i-th word sequence W_i = {w_i^1, …, w_i^j, …, w_i^{m_i}}, where w_i^j denotes the j-th word of the i-th line of text and m_i is the number of words in s_i;
step 1.2, text representation:
step 1.2.1, use the pre-trained Word2vec word embedding model to convert the j-th word w_i^j of the i-th line of text s_i into the corresponding numeric vector v_i^j, thereby obtaining the vector sequence V_i = {v_i^1, …, v_i^j, …, v_i^{m_i}};
step 1.2.2, compute the importance score score_i^j of the j-th word w_i^j using equation (1), thereby obtaining the i-th importance score sequence {score_i^1, …, score_i^j, …, score_i^{m_i}}:

score_i^j = tf(w_i^j) · log( C(RS) / C(w_i^j) )    (1)

In equation (1), tf(w_i^j) denotes the number of occurrences of the j-th word w_i^j in the i-th line of text s_i, C(RS) denotes the total number of resume samples in the training set, and C(w_i^j) denotes the number of resume samples containing the j-th word w_i^j;
step 1.2.3, compute the weight distribution a_i^j = softmax(score_i^j) of the j-th word w_i^j of the i-th line of text s_i, where softmax(·) is the normalization function, thereby obtaining the i-th weight distribution sequence {a_i^1, …, a_i^j, …, a_i^{m_i}};
step 1.2.4, compute the line vector representation of the i-th line of text s_i as the weighted sum E(s_i) = Σ_{j=1}^{m_i} a_i^j · v_i^j;
step 1.2.5, for a line of text s_i, the format features are f(s_i) = {Bd, Sz, Cr, Ic, Ft, Pm, Sp, Len}, as shown in Table 1:
TABLE 1 Format feature information
(Table 1 enumerates the eight format features Bd, Sz, Cr, Ic, Ft, Pm, Sp and Len; its body is an image in the original and is not reproduced here.)
Use the one-layer feedforward neural network dense of the RS model, as shown in FIG. 2, to train f(s_i) according to equation (2), obtaining the embedded format feature encoding E(f(s_i)) of the i-th line of text s_i:

E(f(s_i)) = W_1 · relu(W_0 · f(s_i) + b_0) + b_1    (2)

In equation (2), relu(·) is the activation function, W_0 and W_1 are two weight matrices, and b_0 and b_1 are two bias terms;
step 1.2.6, obtain the text vector x_i of the i-th line of text s_i using equation (3), thereby obtaining the text vector sequence input = {x_1, x_2, …, x_i, …, x_n}:

x_i = [E(s_i); E(f(s_i))]    (3)
Step 1.3, Bi-LSTM network forward calculation of the RS model:
as shown in FIG. 3, the text vector is represented by x i Inputting into Bi-LSTM network, passing through LSTM circulation unit
Figure GDA0003643580370000081
To obtain a forward output
Figure GDA0003643580370000082
And pass through
Figure GDA0003643580370000083
To obtain a reverse output
Figure GDA0003643580370000084
Wherein
Figure GDA0003643580370000085
The calculation process corresponds to the formula (4) and the formula (5), and the output after splicing
Figure GDA0003643580370000086
Thereby obtaining an output sequence o 1 ,o 2 ,…,o i ,…,o n };
Figure GDA0003643580370000087
Figure GDA0003643580370000088
Step 1.4, obtaining output of a linear layer of the RS model by using the formula (6):
Figure GDA0003643580370000089
in the formula (6), the reaction mixture is,
Figure GDA00036435803700000810
representing the ith line of text s i The predictive tag of (a);
step 1.5, learning of the RS model:
step 1.5.1, define the loss function L(θ_RS) of the RS model using equation (7):

L(θ_RS) = -(1/N) · Σ_{j'} Σ_i [ w_B · D(y_i^(j') = B) + w_I · D(y_i^(j') = I) ] · CE(y_i^(j'), ŷ_i^(j'))    (7)

In equation (7), θ_RS denotes the model parameters, n_1 is the number of true labels B in the training set, n_2 is the number of true labels I in the training set, N is the number of samples in a batch, N_1 is the number of true labels I in a batch, N_2 is the number of true labels B in a batch, D(y_i^(j') = C) is an indicator function that equals 1 when y_i^(j') = C and 0 otherwise, w_B is the loss weight for the true label B, w_I is the loss weight for the true label I, y_i^(j') and ŷ_i^(j') denote the true label and predicted label of the i-th line of text in the j'-th resume sample, and CE(·, ·) is the binary cross-entropy loss function;
step 1.5.2, apply the Adam optimization algorithm to the loss function L(θ_RS) for gradient back-propagation and update the weights of the RS model until the RS model converges, thereby obtaining the resume segmentation model;
step 1.6, prediction of a resume segmentation model;
process the resume samples in Word format to be predicted according to the procedures of step 1.1 and step 1.2 to obtain the text vector sequence to be predicted, and input it into the resume segmentation model to obtain the output prediction sequence; FIG. 3 is a schematic diagram of resume segmentation in this embodiment: after the segmentation model runs, a predicted marker sequence is obtained; this marker sequence is traversed from the beginning, and each line of text marked B starts a new resume block, so that all line texts from one B mark up to (but not including) the next B mark form a single resume block, as sketched below.
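A minimal sketch of this traversal follows; the example lines and tags are illustrative only, using the patent's own B/I labels.

```python
from typing import List

def split_into_blocks(lines: List[str], tags: List[str]) -> List[List[str]]:
    blocks: List[List[str]] = []
    for line, tag in zip(lines, tags):
        if tag == "B" or not blocks:   # a line marked B starts a new resume block
            blocks.append([])
        blocks[-1].append(line)        # lines marked I stay in the current block
    return blocks

lines = ["Basic Information", "Name: Zhang San", "Education", "2015-2019 ..."]
print(split_into_blocks(lines, ["B", "I", "B", "I"]))
# [['Basic Information', 'Name: Zhang San'], ['Education', '2015-2019 ...']]
```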
Step 2, resume block classification:
step 2.1, data preparation of the RC model:
according to the division of step 1.6 into the resume block sequence {block_1, …, block_t, …, block_z}, label each resume block block_t with a one-hot category label y_t, where block_t denotes the t-th resume block and block_t = {x_t^1, …, x_t^r, …, x_t^{p_t}}, in which x_t^r denotes the r-th line text vector of the t-th resume block, z denotes the total number of resume blocks in one resume sample, and p_t denotes the total number of line text vectors contained in the t-th resume block;
step 2.2, resume block feature vector representation:
as shown in FIG. 4, construct an encoder with the same structure as the Bi-LSTM network to encode the t-th resume block block_t, obtaining the encoder output Out_t = {out_t^1, …, out_t^r, …, out_t^{p_t}}, where out_t^r denotes the r-th output vector of length d;
process the encoder output Out_t with a max pooling layer according to equation (8) to obtain the feature vector f_t of the t-th resume block block_t:

f_t[i'] = max_{r=1,…,p_t} out_t^r[i'],  i' = 1, …, d    (8)

In equation (8), out_t^r[i'] is the value of the i'-th element of out_t^r; the feature vector sequence F = {f_1, …, f_t, …, f_z} is thus obtained;
Step 2.3, forward calculation of the RC model:
as shown in FIG. 5, a Bi-GRU encoder is constructed for the feature vector f t And (6) coding is carried out. Through GRU circulation unit
Figure GDA00036435803700001017
Obtaining a forward output from a Bi-GRU encoder
Figure GDA0003643580370000104
Through a process
Figure GDA0003643580370000105
To obtain a reverse output
Figure GDA0003643580370000106
Wherein
Figure GDA0003643580370000107
Calculating corresponding to formula (9) and formula (10), and splicing to obtain output
Figure GDA0003643580370000108
Thereby obtaining the output of the Bi-GRU encoder
Figure GDA0003643580370000109
Figure GDA00036435803700001010
Figure GDA00036435803700001011
step 2.4, pass the output out of the Bi-GRU encoder through a fully connected layer dense and a softmax activation to obtain the category probability distributions of the z resume blocks;
step 2.5, learning of the RC model:
construct the loss function L_RC(θ_RC) of the RC model using equation (11):

L_RC(θ_RC) = -Σ_l Σ_{t=1}^{z} Σ_{c=1}^{k} y_{t,c}^(l) · log ŷ_{t,c}^(l)    (11)

In equation (11), k is the total number of label categories of the resume blocks in the resume block sequence {block_1, …, block_t, …, block_z}, y_{t,c}^(l) is the value of the c-th element of the true one-hot label of the t-th resume block in the l-th resume sample, and ŷ_{t,c}^(l) is the value of the c-th element of the predicted probability distribution of the t-th resume block in the l-th resume sample;
step 2.6, use the Mini-batch gradient descent optimization algorithm to back-propagate and update the parameters in the loss function L_RC(θ_RC) until the model converges, thereby obtaining the resume block classification model.

Claims (1)

1. A resume block classification method based on a multi-level bidirectional recurrent neural network, characterized by comprising the following steps:
step 1, resume segmentation:
step 1.1, obtaining data of an RS model:
step 1.1.1, acquire a training set of resume samples in Word format; parse the xml tree of each resume sample with regular expressions to obtain the text of each line and its format features, and arrange the line texts of one resume sample in their original order into a line text sequence {s_1, s_2, …, s_i, …, s_n}, where s_i denotes the i-th line of text, f(s_i) denotes the format features of s_i, and n is the number of line texts;
step 1.1.2, assign the i-th line of text s_i a corresponding true label y_i, with y_i ∈ {B, I}, where B marks the start of a block and I marks a line inside a block;
step 1.1.3, apply word segmentation to the i-th line of text s_i to obtain the i-th word sequence W_i = {w_i^1, …, w_i^j, …, w_i^{m_i}}, where w_i^j denotes the j-th word of the i-th line of text and m_i is the number of words in s_i;
step 1.2, text representation:
step 1.2.1, use a pre-trained Word2vec word embedding model to convert the j-th word w_i^j of the i-th line of text s_i into the corresponding numeric vector v_i^j, thereby obtaining the vector sequence V_i = {v_i^1, …, v_i^j, …, v_i^{m_i}};
step 1.2.2, compute the importance score score_i^j of the j-th word w_i^j using equation (1), thereby obtaining the i-th importance score sequence {score_i^1, …, score_i^j, …, score_i^{m_i}}:

score_i^j = tf(w_i^j) · log( C(RS) / C(w_i^j) )    (1)

In equation (1), tf(w_i^j) denotes the number of occurrences of the j-th word w_i^j in the i-th line of text s_i, C(RS) denotes the total number of resume samples in the training set, and C(w_i^j) denotes the number of resume samples containing the j-th word w_i^j;
step 1.2.3, compute the weight distribution a_i^j = softmax(score_i^j) of the j-th word w_i^j of the i-th line of text s_i, where softmax(·) is the normalization function, thereby obtaining the i-th weight distribution sequence {a_i^1, …, a_i^j, …, a_i^{m_i}};
step 1.2.4, compute the line vector representation of the i-th line of text s_i as the weighted sum E(s_i) = Σ_{j=1}^{m_i} a_i^j · v_i^j;
step 1.2.5, use a one-layer feedforward neural network of the RS model to train the format features f(s_i) of the i-th line of text s_i according to equation (2), obtaining the embedded format feature encoding E(f(s_i)) of the i-th line of text s_i:

E(f(s_i)) = W_1 · relu(W_0 · f(s_i) + b_0) + b_1    (2)

In equation (2), relu(·) is the activation function, W_0 and W_1 are two weight matrices, and b_0 and b_1 are two bias terms;
step 1.2.6, obtain the text vector x_i of the i-th line of text s_i using equation (3), thereby obtaining the text vector sequence input = {x_1, x_2, …, x_i, …, x_n}:

x_i = [E(s_i); E(f(s_i))]    (3)
Step 1.3, forward calculation of the Bi-LSTM network of the RS model:
representing a text vector by x i Inputting into Bi-LSTM network to obtain forward output from Bi-LSTM network
Figure FDA0003643580360000021
And reverse output
Figure FDA0003643580360000022
Spliced output
Figure FDA0003643580360000023
Thereby obtaining an output sequence o 1 ,o 2 ,…,o i ,…,o n };
Step 1.4, obtaining an output of a linear layer of the RS model by using the formula (4):
Figure FDA0003643580360000024
in the formula (4), the reaction mixture is,
Figure FDA0003643580360000025
representing the i-th line of text s i The predictive tag of (a);
step 1.5, learning of the RS model:
step 1.5.1, define the loss function L(θ_RS) of the RS model using equation (5):

L(θ_RS) = -(1/N) · Σ_{j'} Σ_i [ w_B · D(y_i^(j') = B) + w_I · D(y_i^(j') = I) ] · CE(y_i^(j'), ŷ_i^(j'))    (5)

In equation (5), θ_RS denotes the model parameters, n_1 is the number of true labels B in the training set, n_2 is the number of true labels I in the training set, N is the number of samples in a batch, N_1 is the number of true labels I in a batch, N_2 is the number of true labels B in a batch, D(y_i^(j') = C) is an indicator function that equals 1 when y_i^(j') = C and 0 otherwise, w_B is the loss weight for the true label B, w_I is the loss weight for the true label I, y_i^(j') and ŷ_i^(j') denote the true label and predicted label of the i-th line of text in the j'-th resume sample, and CE(·, ·) is the binary cross-entropy loss function;
step 1.5.2, apply the Adam optimization algorithm to the loss function L(θ_RS) for gradient back-propagation and update the weights of the RS model until the RS model converges, thereby obtaining the resume segmentation model;
step 1.6, prediction of the resume segmentation model:
process the resume samples in Word format to be predicted according to the procedures of step 1.1 and step 1.2 to obtain the text vector sequence to be predicted, and input it into the resume segmentation model to obtain the output prediction sequence;
step 2, resume block classification:
step 2.1, data preparation of the RC model:
divide the prediction sequence into a resume block sequence {block_1, …, block_t, …, block_z} according to the block start marks B, and label each resume block block_t with a one-hot category label y_t, where block_t denotes the t-th resume block and block_t = {x_t^1, …, x_t^r, …, x_t^{p_t}}, in which x_t^r denotes the r-th line text vector of the t-th resume block, z denotes the total number of resume blocks in one resume sample, and p_t denotes the total number of line text vectors contained in the t-th resume block;
step 2.2, resume block feature vector representation:
construct an encoder with the same structure as the Bi-LSTM network to encode the t-th resume block block_t, obtaining the encoder output Out_t = {out_t^1, …, out_t^r, …, out_t^{p_t}}, where out_t^r denotes the r-th output vector of fixed dimension;
process the encoder output Out_t with a max pooling layer to obtain the feature vector f_t of the t-th resume block block_t, thereby obtaining the feature vector sequence F = {f_1, …, f_t, …, f_z};
step 2.3, forward calculation of the RC model:
construct a Bi-GRU encoder to encode the feature vector f_t, obtaining the forward output g_t^→ and the backward output g_t^← of the Bi-GRU encoder, which are concatenated into the output out_t = [g_t^→; g_t^←], thereby obtaining the output of the Bi-GRU encoder out = {out_1, …, out_t, …, out_z};
step 2.4, pass the output out of the Bi-GRU encoder through a fully connected layer dense and a softmax activation to obtain the category probability distributions of the z resume blocks;
step 2.5, learning of the RC model:
construct the loss function L_RC(θ_RC) of the RC model using equation (6):

L_RC(θ_RC) = -Σ_l Σ_{t=1}^{z} Σ_{c=1}^{k} y_{t,c}^(l) · log ŷ_{t,c}^(l)    (6)

In equation (6), k is the total number of label categories of the resume blocks in the resume block sequence {block_1, …, block_t, …, block_z}, y_{t,c}^(l) is the value of the c-th element of the true one-hot label of the t-th resume block in the l-th resume sample, and ŷ_{t,c}^(l) is the value of the c-th element of the predicted probability distribution of the t-th resume block in the l-th resume sample;
step 2.6, use the Mini-batch gradient descent optimization algorithm to back-propagate and update the parameters in the loss function L_RC(θ_RC) until the model converges, thereby obtaining the resume block classification model.
CN202110685320.5A 2021-06-21 2021-06-21 Resume block classification method based on multi-level bidirectional circulation neural network Expired - Fee Related CN113297845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110685320.5A CN113297845B (en) 2021-06-21 2021-06-21 Resume block classification method based on multi-level bidirectional circulation neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110685320.5A CN113297845B (en) 2021-06-21 2021-06-21 Resume block classification method based on multi-level bidirectional circulation neural network

Publications (2)

Publication Number Publication Date
CN113297845A CN113297845A (en) 2021-08-24
CN113297845B true CN113297845B (en) 2022-07-26

Family

ID=77328902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110685320.5A Expired - Fee Related CN113297845B (en) 2021-06-21 2021-06-21 Resume block classification method based on multi-level bidirectional circulation neural network

Country Status (1)

Country Link
CN (1) CN113297845B (en)

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334605B (en) * 2018-02-01 2020-06-16 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN109376242B (en) * 2018-10-18 2020-11-17 西安工程大学 Text classification method based on cyclic neural network variant and convolutional neural network
CN109543084B (en) * 2018-11-09 2021-01-19 西安交通大学 Method for establishing detection model of hidden sensitive text facing network social media
CN109635288B (en) * 2018-11-29 2023-05-23 东莞理工学院 Resume extraction method based on deep neural network
CN109753909B (en) * 2018-12-27 2021-08-10 广东人啊人网络技术开发有限公司 Resume analysis method based on content blocking and BilSTM model
US11194962B2 (en) * 2019-06-05 2021-12-07 Fmr Llc Automated identification and classification of complaint-specific user interactions using a multilayer neural network
CN110442841B (en) * 2019-06-20 2024-02-02 平安科技(深圳)有限公司 Resume identification method and device, computer equipment and storage medium
CN110888927B (en) * 2019-11-14 2023-04-18 东莞理工学院 Resume information extraction method and system
CN111026845B (en) * 2019-12-06 2021-09-21 北京理工大学 Text classification method for acquiring multilevel context semantics
CN111428488A (en) * 2020-03-06 2020-07-17 平安科技(深圳)有限公司 Resume data information analyzing and matching method and device, electronic equipment and medium
CN112149389A (en) * 2020-09-27 2020-12-29 南方电网数字电网研究院有限公司 Resume information structured processing method and device, computer equipment and storage medium
CN112416956B (en) * 2020-11-19 2023-04-07 重庆邮电大学 Question classification method based on BERT and independent cyclic neural network

Also Published As

Publication number Publication date
CN113297845A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN111694924A (en) Event extraction method and system
CN111639171A (en) Knowledge graph question-answering method and device
CN109359291A (en) A kind of name entity recognition method
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN112541355A (en) Few-sample named entity identification method and system with entity boundary class decoupling
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN113836891A (en) Method and device for extracting structured information based on multi-element labeling strategy
CN113515632A (en) Text classification method based on graph path knowledge extraction
CN114139522A (en) Key information identification method based on level attention and label guided learning
CN115203338A (en) Label and label example recommendation method
CN115952791A (en) Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN114722204A (en) Multi-label text classification method and device
CN113961666A (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN117725211A (en) Text classification method and system based on self-constructed prompt template
CN117390189A (en) Neutral text generation method based on pre-classifier
CN112883726A (en) Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning
CN113297845B (en) Resume block classification method based on multi-level bidirectional circulation neural network
CN116362247A (en) Entity extraction method based on MRC framework
CN113705222B (en) Training method and device for slot identification model and slot filling method and device
CN116258204A (en) Industrial safety production violation punishment management method and system based on knowledge graph
CN114510943A (en) Incremental named entity identification method based on pseudo sample playback
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220726