CN113297845B - Resume block classification method based on multi-level bidirectional circulation neural network - Google Patents
Resume block classification method based on multi-level bidirectional circulation neural network
- Publication number
- CN113297845B CN113297845B CN202110685320.5A CN202110685320A CN113297845B CN 113297845 B CN113297845 B CN 113297845B CN 202110685320 A CN202110685320 A CN 202110685320A CN 113297845 B CN113297845 B CN 113297845B
- Authority
- CN
- China
- Prior art keywords
- resume
- text
- block
- model
- line
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000000034 method Methods 0.000 title claims abstract description 33
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 14
- 230000002457 bidirectional effect Effects 0.000 title claims abstract description 10
- 239000013598 vector Substances 0.000 claims abstract description 49
- 230000011218 segmentation Effects 0.000 claims abstract description 38
- 238000009826 distribution Methods 0.000 claims abstract description 10
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 9
- 238000004364 calculation method Methods 0.000 claims abstract description 9
- 238000013145 classification model Methods 0.000 claims abstract description 9
- 238000011176 pooling Methods 0.000 claims abstract description 5
- 230000006870 function Effects 0.000 claims description 24
- 238000005457 optimization Methods 0.000 claims description 9
- 238000002372 labelling Methods 0.000 claims description 4
- 230000004913 activation Effects 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 238000002360 preparation method Methods 0.000 claims description 3
- 238000013507 mapping Methods 0.000 claims 1
- 238000000605 extraction Methods 0.000 description 7
- 238000010586 diagram Methods 0.000 description 5
- 235000019580 granularity Nutrition 0.000 description 5
- 230000000306 recurrent effect Effects 0.000 description 5
- 230000007547 defect Effects 0.000 description 3
- 238000005192 partition Methods 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 238000007635 classification algorithm Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 239000003550 marker Substances 0.000 description 2
- 238000007637 random forest analysis Methods 0.000 description 2
- 238000012706 support-vector machine Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004590 computer program Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 238000003860 storage Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a resume block classification method based on a multi-level bidirectional recurrent neural network, which comprises the following steps: 1. resume segmentation: acquire training data for the RS model, convert it into vector representations, perform forward computation through the Bi-LSTM layer and the linear layer, update the model parameters by back propagation, and use the trained resume segmentation model to predict the segmented resume block sequence; 2. resume block classification: take the segmented resume block sequence as RC model training data, obtain resume block feature vectors with Bi-LSTM and a max pooling layer, perform forward computation with Bi-GRU and a softmax layer to obtain the block class probability distributions, update the RC model parameters with a gradient descent algorithm, and predict with the trained resume block classification model. The method can improve the accuracy of resume segmentation and resume block classification, and overcomes the low accuracy of schemes based on keyword and format feature matching as well as the heavy workload of building a keyword library.
Description
Technical Field
The invention belongs to the branch fields of text segmentation, text classification and information extraction within natural language processing, a direction of computer science and technology, and particularly relates to a resume block classification method based on a multi-level bidirectional recurrent neural network.
Background
Resume information extraction is a technique that uses a computer program to extract the contents of semi-structured resume documents into structured contents. Through resume information extraction, structured resume information can be obtained and stored in a structured storage format, so that subsequent automatic analysis tools can conveniently perform meaningful analysis on the structured resume data, such as automatic post recommendation, resume screening, resume query and person recommendation.
The resume information extraction process comprises the following steps: resume segmentation, resume block classification and block information extraction. When a resume document is input, the text content of the resume is first divided according to certain boundary features to obtain a resume block list, which is called resume segmentation. Each segmented resume block is then classified, and according to the classification result the block information extraction algorithm of the corresponding category is called to extract the entity information of the block; for example, for the "basic information" block, the entity information comprises "name", "contact information", "address" and the like. Finally, the extracted entity information is stored in a database.
The existing resume segmentation methods mainly match block titles based on keywords and format features. The specific rule is to establish a huge keyword library; when the segmentation algorithm runs, it checks whether a piece of resume text appears in the keyword library and whether the format features of the text conform to block title (block start mark) characteristics, and if both conditions are met the position is used as a segmentation boundary. This approach has low accuracy and cannot meet actual business requirements: when the keyword library is not complete enough, not all title keywords can be detected; keyword expressions vary widely, so the workload of building a complete keyword library is extremely high; and some text that is not a boundary mark may also contain the keywords, causing blocks to be split apart incorrectly. In addition, because resumes are written in many forms, part of them have no obvious block titles, or the format features of the block titles differ little from those of the block content; in such cases this method segments resumes extremely poorly.
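For concreteness, a minimal sketch of this keyword-and-format baseline follows; the keyword list, the font-size threshold and the (text, bold, size) line representation are illustrative assumptions, not taken from the patent:

```python
# Minimal sketch of the keyword-and-format baseline. The keyword library and
# the heading heuristic below are illustrative placeholders.
TITLE_KEYWORDS = {"basic information", "education background",
                  "work experience", "self-evaluation"}

def is_block_title(text: str, is_bold: bool, font_size: float) -> bool:
    """A line is treated as a segmentation boundary only if it matches a
    known block-title keyword AND its format looks like a heading."""
    return text.strip().lower() in TITLE_KEYWORDS and is_bold and font_size >= 14

def segment_by_keywords(lines):
    """lines: iterable of (text, is_bold, font_size); returns list of blocks."""
    blocks, current = [], []
    for text, bold, size in lines:
        if is_block_title(text, bold, size) and current:
            blocks.append(current)        # close the previous block
            current = []
        current.append(text)
    if current:
        blocks.append(current)
    return blocks
```

As the surrounding text notes, this baseline fails exactly when the keyword library misses a title variant or when a non-title line happens to contain a keyword.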
The existing resume block classification methods mainly use traditional keyword matching or general text classification. The first, keyword-matching method identifies the category of a resume block's text content by means of the boundary marks (block titles) recognized during resume segmentation. For example, if a piece of text is matched as "basic information" and the next matched text is "education background", the content between the two keywords is directly assigned the category "basic information". This method also has low accuracy: first, errors propagated from keyword-based resume segmentation directly reduce its effectiveness; second, a block title and its block content do not always belong to the same category. The second method applies conventional, general text classification algorithms, such as the Support Vector Machine (SVM) and Random Forest (RF), to the text content. It classifies each block in isolation and ignores a distinctive format feature of resumes: the arrangement of the resume blocks in a resume document is ordered, e.g., the "basic information" block is usually placed first or second, and the "self-evaluation" block is usually placed last. Because this method fails to integrate the sequence features of the resume blocks, its accuracy is still low, and it needs more samples for model training, which is a large workload.
Disclosure of Invention
The invention aims to overcome the defects of the background art and provides a resume block classification method based on a multi-level bidirectional recurrent neural network, so that the format features of the resume can be fully fused, the accuracy of resume segmentation and resume block classification is improved, and the problems of low accuracy and heavy workload of keyword and format feature matching are solved.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention relates to a resume block classification method based on a multi-level bidirectional recurrent neural network, which is characterized by comprising the following steps:
step 1.1, obtaining data of an RS model:
step 1.1.1, acquire a training set of resume samples in word format; parse the xml tree of each resume sample with regular expressions to obtain the text of each line in each resume sample and its format features, and arrange the line texts of one resume sample in their original order into a line text sequence $\{s_1, s_2, \dots, s_i, \dots, s_n\}$, where $s_i$ denotes the i-th line text, the format feature of the i-th line text $s_i$ is $f(s_i)$, and $n$ is the number of line texts;
step 1.1.2, set a corresponding true label $y_i$ for the i-th line text $s_i$, with $y_i \in \{B, I\}$, where B denotes the start of a block and I denotes the inside of a block;
step 1.1.3, apply word segmentation to the i-th line text $s_i$ to obtain the i-th word sequence $\{w^i_1, \dots, w^i_j, \dots, w^i_{m_i}\}$, where $w^i_j$ denotes the j-th word of the i-th line text and $m_i$ is the number of words of the i-th line text $s_i$;
step 1.2, text representation:
step 1.2.1, use a pre-trained Word2vec word embedding model to convert the j-th word $w^i_j$ of the i-th line text $s_i$ into the corresponding numeric vector $e^i_j$, thereby obtaining the numeric vector sequence $\{e^i_1, \dots, e^i_{m_i}\}$;
Step 1.2.2, compute the importance score $score^i_j$ of the j-th word $w^i_j$ with formula (1), thereby obtaining the i-th importance score sequence $\{score^i_1, \dots, score^i_{m_i}\}$:

$score^i_j = n(w^i_j) \cdot \log\big(C(RS) / C(w^i_j)\big)$ (1)

In formula (1), $n(w^i_j)$ denotes the number of occurrences of the j-th word $w^i_j$ in the i-th line text $s_i$, $C(RS)$ denotes the total number of resume samples in the training set, and $C(w^i_j)$ denotes the number of resume samples containing the j-th word $w^i_j$;
step 1.2.3, compute the weight distribution $a^i_j = \mathrm{softmax}(score^i)_j$ of the j-th word $w^i_j$ of the i-th line text $s_i$, where $\mathrm{softmax}(\cdot)$ is a normalization function, thereby obtaining the i-th weight distribution sequence $\{a^i_1, \dots, a^i_{m_i}\}$; the line text embedding $E(s_i)$ used in formula (3) is then the weighted sum of the word vectors, $E(s_i) = \sum_{j=1}^{m_i} a^i_j e^i_j$;
Step 1.2.5, utilizing a layer of feedforward neural network of the RS model to process the ith line text s according to the formula (2) i Format feature f(s) i ) Training to obtain the ith line text s i E(s) of the embedded format feature code i )):
E(f(s i ))=W 1 ·relu(W 0 ·f(s i )+b 0 )+b 1 (2)
In formula (2), relu (. cndot.) is an activation function, W 0 ,W 1 Is two weight matrices, b 0 ,b 1 Two bias terms;
step 1.2.6, obtain the text vector $x_i$ of the i-th line text $s_i$ with formula (3), thereby obtaining the text vector sequence $input = \{x_1, x_2, \dots, x_i, \dots, x_n\}$:

$x_i = [E(s_i); E(f(s_i))]$ (3)
Step 1.3, Bi-LSTM network forward calculation of the RS model:
input the text vector $x_i$ into the Bi-LSTM network to obtain the forward output $\overrightarrow{o_i}$ and the backward output $\overleftarrow{o_i}$ of the Bi-LSTM network, whose concatenation is $o_i = [\overrightarrow{o_i}; \overleftarrow{o_i}]$, thereby obtaining the output sequence $\{o_1, o_2, \dots, o_i, \dots, o_n\}$;
Step 1.4, obtaining output of a linear layer of the RS model by using the formula (4):
in the formula (4), the reaction mixture is,representing the ith line of text s i The predictive tag of (a);
step 1.5, learning of RS model:
step 1.5.1, define the loss function $L(\theta_{RS})$ of the RS model with formula (5) as a class-weighted binary cross entropy:

$L(\theta_{RS}) = \frac{1}{N} \sum_{j'=1}^{N} \sum_{i=1}^{n} \big[ w_B \, D(y_i^{(j')}{=}B) + w_I \, D(y_i^{(j')}{=}I) \big] \, \ell\big(y_i^{(j')}, \hat{y}_i^{(j')}\big)$ (5)

In formula (5), $\theta_{RS}$ denotes the model parameters; $n_1$ is the number of true labels B in the training set and $n_2$ the number of true labels I; $N$ is the number of samples in a batch, $N_1$ the number of true labels I in a batch and $N_2$ the number of true labels B in a batch; $D(y_i^{(j')}{=}C)$ is an indicator function that equals 1 when $y_i^{(j')} = C$ and 0 otherwise; $w_B$ denotes the loss weight of true label B and $w_I$ the loss weight of true label I, computed from the label counts $n_1$, $n_2$, $N_1$, $N_2$ so as to balance the two classes; $y_i^{(j')}$ and $\hat{y}_i^{(j')}$ denote the true label and the predicted label of the i-th line text in the j'-th resume sample; and $\ell(\cdot,\cdot)$ is the binary cross entropy loss function;
step 1.5.2, apply the Adam optimization algorithm to the loss function $L(\theta_{RS})$ for gradient back propagation, updating the RS model weights until the RS model converges, thereby obtaining the resume segmentation model;
step 1.6, prediction with the resume segmentation model:
process the word-format resume samples to be predicted according to step 1.1 and step 1.2 to obtain the text vector sequences to be predicted, and input them into the resume segmentation model to obtain the output prediction sequences;
step 2, resume block classification:
step 2.1, data preparation of the RC model:
divide the prediction sequence into the resume block sequence $\{block_1, \dots, block_t, \dots, block_z\}$ according to the block start marks B, and label each resume block $block_t$ with a one-hot category label $y_t$; $block_t$ denotes the t-th resume block, with $block_t = \{x^t_1, \dots, x^t_r, \dots, x^t_{p_t}\}$, where $x^t_r$ denotes the r-th line text vector of the t-th resume block, $z$ denotes the total number of resume blocks in one resume sample, and $p_t$ denotes the total number of line text vectors contained in the t-th resume block;
step 2.2, resume block feature vector representation:
construct an encoder with the same structure as the Bi-LSTM network to encode the t-th resume block $block_t$, obtaining the encoder output $Out_t = \{out^t_1, \dots, out^t_r, \dots, out^t_{p_t}\}$, where $out^t_r$ denotes the r-th output vector of fixed dimension;
pass the encoder output $Out_t$ through the max pooling layer to obtain the feature vector $f_t$ of the t-th resume block $block_t$, thereby obtaining the feature vector sequence $F = \{f_1, \dots, f_t, \dots, f_z\}$;
Step 2.3, forward calculation of the RC model:
construct a Bi-GRU encoder to encode the feature vector $f_t$, obtaining the forward output $\overrightarrow{h_t}$ and the backward output $\overleftarrow{h_t}$ of the Bi-GRU encoder, whose concatenation is $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, thereby obtaining the Bi-GRU encoder output $out = \{h_1, \dots, h_t, \dots, h_z\}$;
2.4, pass the output $out$ of the Bi-GRU encoder through a fully connected layer (dense) and a softmax activation to obtain the class probability distributions of the z resume blocks;
step 2.5, learning of an RC model:
construct the loss function $L_{RC}(\theta_{RC})$ of the RC model with formula (6), a categorical cross entropy:

$L_{RC}(\theta_{RC}) = -\frac{1}{z} \sum_{t=1}^{z} \sum_{c=1}^{k} y_t^{l,(c)} \log \hat{y}_t^{l,(c)}$ (6)

In formula (6), $k$ is the total number of label categories of the resume blocks in the sequence $\{block_1, \dots, block_t, \dots, block_z\}$; $y_t^{l,(c)}$ is the value of the c-th element of the true one-hot label of the t-th resume block in the l-th resume sample, and $\hat{y}_t^{l,(c)}$ is the value of the c-th element of the corresponding predicted probability distribution;
step 2.6, apply the Mini-batch gradient descent optimization algorithm to the loss function $L_{RC}(\theta_{RC})$, updating its parameters by back propagation until the model converges, thereby obtaining the resume block classification model.
Compared with the prior art, the invention has the beneficial effects that:
1. The resume segmentation and block classification method provided by the invention realizes resume segmentation and resume block classification with recurrent neural network structures of different levels and granularities, solves the problems that keyword-matching-based segmentation is inaccurate and that general text classification models yield low accuracy on resume block classification, improves the accuracy of resume block classification, and reduces the scale of model training data.
2. The invention adopts the sequence labeling idea for resume segmentation and proposes the RS (Resume Segmentation) model, which fully fuses the text features and format features of the resume and uses Bi-LSTM to find segmentation boundaries while fully considering context sequence features. This avoids the problems of the traditional keyword and format feature matching approach, namely incomplete keyword coverage, difficult matching, the heavy workload of building a huge keyword library, keywords appearing in text that is not a segmentation boundary, and block titles whose format features differ little from those of the block content, thereby improving the accuracy of resume segmentation.
3. For resume block classification the invention proposes the RC (Resume Classification) model, which extracts the features of each resume block with a sentence-granularity Bi-LSTM and classifies the blocks with a block-granularity Bi-GRU, solving the problem that existing general text classification models cannot fuse the context sequence features of resume blocks, improving the accuracy of resume block classification, and effectively reducing the required scale of training data.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of a resume segmentation model according to the present invention;
FIG. 3 is a schematic diagram of resume segmentation according to the present invention;
FIG. 4 is a diagram of a profile block feature extraction architecture in accordance with the present invention;
FIG. 5 is a diagram of a simplified chunk classification model according to the present invention.
Detailed Description
In this embodiment, a resume block classification method based on a multi-level bidirectional recurrent neural network works as follows. For resume segmentation, the sequence labeling idea is used and a Resume Segmentation model (RS) based on a bidirectional long short-term memory recurrent neural network (Bi-LSTM) is proposed: each line of text in the resume is the basic granularity, a format feature code is proposed and fused into the feature representation of each line text, all line texts form a text sequence that is input into the RS segmentation model, and a label is generated for each line. The labels fall into two types: block start (B) and intra-block (I). For the resume block classification task, a Resume block Classification model (RC) based on a bidirectional gated recurrent unit network (Bi-GRU) is proposed: each resume block is the basic granularity, all resume blocks are arranged into a resume block sequence in their original order, each resume block passes through a Bi-LSTM encoder and a max pooling layer to obtain its feature vector representation, and the feature vectors of all resume blocks form a sequence that is input into the Bi-GRU to obtain the corresponding block category outputs. Specifically, as shown in fig. 1, the classification method includes the following steps:
step 1.1, obtaining data of an RS model:
step 1.1.1, acquire a training set of resume samples in word format; parse the xml tree of each resume sample with regular expressions to obtain the text of each line in each resume sample and its format features, and arrange the line texts of one resume sample in their original order into a line text sequence $\{s_1, s_2, \dots, s_i, \dots, s_n\}$, where $s_i$ denotes the i-th line text, the format feature of the i-th line text $s_i$ is $f(s_i)$, and $n$ is the number of line texts;
step 1.1.2, set a corresponding true label $y_i$ for the i-th line text $s_i$, with $y_i \in \{B, I\}$, where B denotes the start of a block and I denotes the inside of a block;
step 1.1.3, apply word segmentation to the i-th line text $s_i$ to obtain the i-th word sequence $\{w^i_1, \dots, w^i_j, \dots, w^i_{m_i}\}$, where $w^i_j$ denotes the j-th word of the i-th line text and $m_i$ is the number of words of the i-th line text $s_i$;
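As a concrete illustration of steps 1.1.1 and 1.1.3, the following hedged sketch loads the per-line texts, a few of the format features and the word sequences from a .docx file. The patent parses the document XML with regular expressions; python-docx and the jieba tokenizer are substituted here for brevity, and the three features filled in are assumptions about the full feature set of Table 1:

```python
# Hedged sketch of data acquisition; python-docx and jieba stand in for the
# patent's regex-based XML parsing and word segmentation.
import jieba
from docx import Document

def load_resume_lines(path):
    doc = Document(path)
    texts, feats, words = [], [], []
    for para in doc.paragraphs:
        if not para.text.strip():
            continue                                  # skip empty lines
        run = para.runs[0] if para.runs else None
        texts.append(para.text)
        feats.append({
            "Bd": bool(run and run.bold),             # bold flag
            "Sz": run.font.size.pt if run and run.font.size else 0.0,
            "Len": len(para.text),                    # line length
        })
        words.append(list(jieba.cut(para.text)))      # step 1.1.3: word segmentation
    return texts, feats, words
```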
step 1.2, text representation:
step 1.2.1, use a pre-trained Word2vec word embedding model to convert the j-th word $w^i_j$ of the i-th line text $s_i$ into the corresponding numeric vector $e^i_j$, thereby obtaining the numeric vector sequence $\{e^i_1, \dots, e^i_{m_i}\}$;
Step 1.2.2, compute the importance score $score^i_j$ of the j-th word $w^i_j$ with formula (1), thereby obtaining the i-th importance score sequence $\{score^i_1, \dots, score^i_{m_i}\}$:

$score^i_j = n(w^i_j) \cdot \log\big(C(RS) / C(w^i_j)\big)$ (1)

In formula (1), $n(w^i_j)$ denotes the number of occurrences of the j-th word $w^i_j$ in the i-th line text $s_i$, $C(RS)$ denotes the total number of resume samples in the training set, and $C(w^i_j)$ denotes the number of resume samples containing the j-th word $w^i_j$;
step 1.2.3, compute the weight distribution $a^i_j = \mathrm{softmax}(score^i)_j$ of the j-th word $w^i_j$ of the i-th line text $s_i$, where $\mathrm{softmax}(\cdot)$ is a normalization function, thereby obtaining the i-th weight distribution sequence $\{a^i_1, \dots, a^i_{m_i}\}$; the line text embedding $E(s_i)$ used in formula (3) is then the weighted sum of the word vectors, $E(s_i) = \sum_{j=1}^{m_i} a^i_j e^i_j$;
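The text-representation steps 1.2.1 to 1.2.4 can be sketched as follows, under stated assumptions: the Word2vec lookup is a plain dict, and the line embedding $E(s_i)$ is taken to be the softmax-weighted sum of the word vectors of the line:

```python
# Sketch of the TF-IDF-weighted line embedding; the weighted-sum form of
# E(s_i) is an assumption consistent with formula (3).
import math
import numpy as np

def line_embedding(tokens, w2v, total_resumes, doc_freq, dim=300):
    """tokens: words of line s_i; w2v: word -> np.ndarray of size dim;
    total_resumes: C(RS); doc_freq[w]: number of resumes containing w."""
    vecs, scores = [], []
    for w in tokens:
        if w not in w2v:
            continue                                   # skip OOV words
        tf = tokens.count(w)                           # n(w) within the line
        vecs.append(w2v[w])
        scores.append(tf * math.log(total_resumes / max(doc_freq.get(w, 1), 1)))
    if not vecs:
        return np.zeros(dim)
    scores = np.asarray(scores)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax -> weights a^i_j
    return (weights[:, None] * np.stack(vecs)).sum(axis=0)   # E(s_i)
```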
Step 1.2.5, for a line of text s i Format feature f(s) i ) { Bd, Sz, Cr, Ic, Ft, Pm, Sp, Len }, as shown in table 1:
TABLE 1 Format characteristic information
One layer of feedforward neural network dense using RS model, as shown in FIG. 2, according to equation (2) for f(s) i ) Training to obtain the ith line text s i E(s) of the embedded format feature code i )):
$E(f(s_i)) = W_1 \cdot \mathrm{relu}(W_0 \cdot f(s_i) + b_0) + b_1$ (2)

In formula (2), $\mathrm{relu}(\cdot)$ is an activation function, $W_0$ and $W_1$ are two weight matrices, and $b_0$ and $b_1$ are two bias terms;
step 1.2.6, obtain the text vector $x_i$ of the i-th line text $s_i$ with formula (3), thereby obtaining the text vector sequence $input = \{x_1, x_2, \dots, x_i, \dots, x_n\}$:

$x_i = [E(s_i); E(f(s_i))]$ (3)
Step 1.3, Bi-LSTM network forward calculation of the RS model:
as shown in FIG. 3, the text vector $x_i$ is input into the Bi-LSTM network; the LSTM recurrent unit produces the forward output $\overrightarrow{o_i} = \mathrm{LSTM}(x_i, \overrightarrow{o_{i-1}})$ (4) and the backward output $\overleftarrow{o_i} = \mathrm{LSTM}(x_i, \overleftarrow{o_{i+1}})$ (5), the computations corresponding to formula (4) and formula (5); the concatenated output is $o_i = [\overrightarrow{o_i}; \overleftarrow{o_i}]$, thereby obtaining the output sequence $\{o_1, o_2, \dots, o_i, \dots, o_n\}$;
Step 1.4, obtain the output of the linear layer of the RS model with formula (6):

$\hat{y}_i = \mathrm{sigmoid}(W_2 \cdot o_i + b_2)$ (6)

In formula (6), $\hat{y}_i$ denotes the predicted label of the i-th line text $s_i$, and $W_2$ and $b_2$ are the weight matrix and bias term of the linear layer;
step 1.5, learning of an RS model:
step 1.5.1, define the loss function $L(\theta_{RS})$ of the RS model with formula (7) as a class-weighted binary cross entropy:

$L(\theta_{RS}) = \frac{1}{N} \sum_{j'=1}^{N} \sum_{i=1}^{n} \big[ w_B \, D(y_i^{(j')}{=}B) + w_I \, D(y_i^{(j')}{=}I) \big] \, \ell\big(y_i^{(j')}, \hat{y}_i^{(j')}\big)$ (7)

In formula (7), $\theta_{RS}$ denotes the model parameters; $n_1$ is the number of true labels B in the training set and $n_2$ the number of true labels I; $N$ is the number of samples in a batch, $N_1$ the number of true labels I in a batch and $N_2$ the number of true labels B in a batch; $D(y_i^{(j')}{=}C)$ is an indicator function that equals 1 when $y_i^{(j')} = C$ and 0 otherwise; $w_B$ denotes the loss weight of true label B and $w_I$ the loss weight of true label I, computed from the label counts $n_1$, $n_2$, $N_1$, $N_2$ so as to balance the two classes; $y_i^{(j')}$ and $\hat{y}_i^{(j')}$ denote the true label and the predicted label of the i-th line text in the j'-th resume sample; and $\ell(\cdot,\cdot)$ is the binary cross entropy loss function;
step 1.5.2, apply the Adam optimization algorithm to the loss function $L(\theta_{RS})$ for gradient back propagation, updating the RS model weights until the RS model converges, thereby obtaining the resume segmentation model;
step 1.6, prediction with the resume segmentation model:
process the word-format resume samples to be predicted according to step 1.1 and step 1.2 to obtain the text vector sequences to be predicted, and input them into the resume segmentation model to obtain the output prediction sequences. FIG. 3 is a schematic diagram of resume segmentation in this embodiment: after the segmentation model runs, a predicted tag sequence is obtained; the tag sequence is traversed from the beginning, and each line text tagged B opens a resume block that collects all following line texts until the next line text tagged B is encountered.
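The post-processing that turns a predicted tag sequence into resume blocks is small enough to show in full; this sketch mirrors the traversal described above:

```python
# Sketch: cut the line sequence into resume blocks before every B-tagged line.
def split_into_blocks(lines, tags):
    """lines: per-line texts or vectors; tags: predicted 'B'/'I' per line."""
    blocks = []
    for line, tag in zip(lines, tags):
        if tag == "B" or not blocks:   # a B tag (or a leading I) opens a block
            blocks.append([])
        blocks[-1].append(line)
    return blocks

# split_into_blocks(["Zhang San", "tel: ...", "Education", "2015-2019 ..."],
#                   ["B", "I", "B", "I"])  ->  two resume blocks
```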
Step 2, resume block classification:
step 2.1, data preparation of the RC model:
after the prediction sequence has been divided into the resume block sequence $\{block_1, \dots, block_t, \dots, block_z\}$ according to step 1.6, label each resume block $block_t$ with a one-hot category label $y_t$; $block_t$ denotes the t-th resume block, with $block_t = \{x^t_1, \dots, x^t_r, \dots, x^t_{p_t}\}$, where $x^t_r$ denotes the r-th line text vector of the t-th resume block, $z$ denotes the total number of resume blocks in one resume sample, and $p_t$ denotes the total number of line text vectors contained in the t-th resume block;
step 2.2, resume block feature vector representation:
as shown in fig. 4, an encoder with the same structure as the Bi-LSTM network is constructed to encode the t-th resume block $block_t$, obtaining the encoder output $Out_t = \{out^t_1, \dots, out^t_r, \dots, out^t_{p_t}\}$, where $out^t_r$ denotes the r-th output vector of length d;
the encoder output $Out_t$ is processed by the max pooling layer according to formula (8) to obtain the feature vector $f_t$ of the t-th resume block $block_t$:

$f_t^{(i')} = \max_{1 \le r \le p_t} (out^t_r)^{(i')}$ (8)

In formula (8), $(out^t_r)^{(i')}$ is the value of the i'-th element of $out^t_r$ and $f_t^{(i')}$ is the i'-th element of $f_t$; the feature vector sequence $F = \{f_1, \dots, f_t, \dots, f_z\}$ is thereby obtained;
Step 2.3, forward calculation of the RC model:
as shown in FIG. 5, a Bi-GRU encoder is constructed to encode the feature vector $f_t$. The GRU recurrent unit produces the forward output $\overrightarrow{h_t} = \mathrm{GRU}(f_t, \overrightarrow{h_{t-1}})$ (9) and the backward output $\overleftarrow{h_t} = \mathrm{GRU}(f_t, \overleftarrow{h_{t+1}})$ (10), the computations corresponding to formula (9) and formula (10); the concatenated output is $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, thereby obtaining the Bi-GRU encoder output $out = \{h_1, \dots, h_t, \dots, h_z\}$;
2.4, pass the output $out$ of the Bi-GRU encoder through a fully connected layer (dense) and a softmax activation to obtain the class probability distributions of the z resume blocks;
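A PyTorch sketch of the Bi-GRU classifier of steps 2.3 and 2.4; the number of block categories is an assumption:

```python
# Sketch: Bi-GRU over the block feature sequence, then dense + softmax.
import torch.nn as nn

class RCModel(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, n_classes=8):
        super().__init__()
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.dense = nn.Linear(2 * hidden, n_classes)

    def forward(self, feats):                 # feats: (1, z, feat_dim)
        h, _ = self.bigru(feats)              # h_t = [forward; backward]
        return self.dense(h).softmax(dim=-1)  # (1, z, n_classes)
```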
step 2.5, learning of the RC model:
construct the loss function $L_{RC}(\theta_{RC})$ of the RC model with formula (11), a categorical cross entropy:

$L_{RC}(\theta_{RC}) = -\frac{1}{z} \sum_{t=1}^{z} \sum_{c=1}^{k} y_t^{l,(c)} \log \hat{y}_t^{l,(c)}$ (11)

In formula (11), $k$ is the total number of label categories of the resume blocks in the sequence $\{block_1, \dots, block_t, \dots, block_z\}$; $y_t^{l,(c)}$ is the value of the c-th element of the true one-hot label of the t-th resume block in the l-th resume sample, and $\hat{y}_t^{l,(c)}$ is the value of the c-th element of the corresponding predicted probability distribution;
step 2.6, apply the Mini-batch gradient descent optimization algorithm to the loss function $L_{RC}(\theta_{RC})$, updating its parameters by back propagation until the model converges, thereby obtaining the resume block classification model.
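Finally, a training sketch for the RC stage, reusing the BlockEncoder and RCModel sketches above; plain SGD stands in for mini-batch gradient descent, and the learning rate and data layout are guesses:

```python
# End-to-end training sketch for steps 2.5-2.6 (assumes BlockEncoder and
# RCModel from the sketches above are in scope).
import torch

encoder, rc = BlockEncoder(), RCModel()
opt = torch.optim.SGD(list(encoder.parameters()) + list(rc.parameters()),
                      lr=0.01)

def train_epoch(resumes):
    """resumes: iterable of (blocks, labels); blocks is a list of
    (1, p_t, 316) tensors, labels a (1, z, k) one-hot float tensor."""
    for blocks, labels in resumes:
        feats = torch.cat([encoder(b) for b in blocks]).unsqueeze(0)  # (1, z, 256)
        probs = rc(feats)
        loss = -(labels * torch.log(probs + 1e-8)).sum(-1).mean()  # formula (11)
        opt.zero_grad()
        loss.backward()
        opt.step()
```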
Claims (1)
1. A resume block classification method based on a multi-level bidirectional recurrent neural network, characterized by comprising the following steps:
step 1, resume segmentation:
step 1.1, obtaining data of an RS model:
step 1.1.1, acquiring a training set of resume samples in word format; parsing the xml tree of each resume sample with regular expressions to obtain the text of each line in each resume sample and its format features, and forming, from the line texts of one resume sample in their original order, a line text sequence $\{s_1, s_2, \dots, s_i, \dots, s_n\}$, where $s_i$ denotes the i-th line text, the format feature of the i-th line text $s_i$ is $f(s_i)$, and $n$ is the number of line texts;
step 1.1.2, setting a corresponding true label $y_i$ for the i-th line text $s_i$, with $y_i \in \{B, I\}$, where B denotes the start of a block and I denotes the inside of a block;
step 1.1.3, performing word segmentation on the i-th line text $s_i$ to obtain the i-th word sequence $\{w^i_1, \dots, w^i_j, \dots, w^i_{m_i}\}$, where $w^i_j$ denotes the j-th word of the i-th line text and $m_i$ is the number of words of the i-th line text $s_i$;
step 1.2, text representation:
step 1.2.1, converting the j-th word $w^i_j$ of the i-th line text $s_i$ into the corresponding numeric vector $e^i_j$ with a pre-trained Word2vec word embedding model, thereby obtaining the numeric vector sequence $\{e^i_1, \dots, e^i_{m_i}\}$;
Step 1.2.2, computing the importance score $score^i_j$ of the j-th word $w^i_j$ with formula (1), thereby obtaining the i-th importance score sequence $\{score^i_1, \dots, score^i_{m_i}\}$:

$score^i_j = n(w^i_j) \cdot \log\big(C(RS) / C(w^i_j)\big)$ (1)

In formula (1), $n(w^i_j)$ denotes the number of occurrences of the j-th word $w^i_j$ in the i-th line text $s_i$, $C(RS)$ denotes the total number of resume samples in the training set, and $C(w^i_j)$ denotes the number of resume samples containing the j-th word $w^i_j$;
step 1.2.3, computing the weight distribution $a^i_j = \mathrm{softmax}(score^i)_j$ of the j-th word $w^i_j$ of the i-th line text $s_i$, where $\mathrm{softmax}(\cdot)$ is a normalization function, thereby obtaining the i-th weight distribution sequence $\{a^i_1, \dots, a^i_{m_i}\}$, the line text embedding $E(s_i)$ used in formula (3) being the weighted sum $E(s_i) = \sum_{j=1}^{m_i} a^i_j e^i_j$;
Step 1.2.5, utilizing a layer of feedforward neural network of the RS model to process the ith line text s according to the formula (2) i Format feature f(s) i ) Training to obtain the ith line of text s i E(s) of embedded format feature encoding i )):
E(f(s i ))=W 1 ·relu(W 0 ·f(s i )+b 0 )+b 1 (2)
In formula (2), relu (. cndot.) is an activation function, W 0 ,W 1 Is two weight matrices, b 0 ,b 1 Two bias terms;
step 1.2.6, obtaining the text vector $x_i$ of the i-th line text $s_i$ with formula (3), thereby obtaining the text vector sequence $input = \{x_1, x_2, \dots, x_i, \dots, x_n\}$:

$x_i = [E(s_i); E(f(s_i))]$ (3)
Step 1.3, forward calculation of the Bi-LSTM network of the RS model:
inputting the text vector $x_i$ into the Bi-LSTM network to obtain the forward output $\overrightarrow{o_i}$ and the backward output $\overleftarrow{o_i}$ of the Bi-LSTM network, whose concatenation is $o_i = [\overrightarrow{o_i}; \overleftarrow{o_i}]$, thereby obtaining the output sequence $\{o_1, o_2, \dots, o_i, \dots, o_n\}$;
Step 1.4, obtaining an output of a linear layer of the RS model by using the formula (4):
in the formula (4), the reaction mixture is,representing the i-th line of text s i The predictive tag of (a);
step 1.5, learning of RS model:
step 1.5.1, defining the loss function $L(\theta_{RS})$ of the RS model with formula (5) as a class-weighted binary cross entropy:

$L(\theta_{RS}) = \frac{1}{N} \sum_{j'=1}^{N} \sum_{i=1}^{n} \big[ w_B \, D(y_i^{(j')}{=}B) + w_I \, D(y_i^{(j')}{=}I) \big] \, \ell\big(y_i^{(j')}, \hat{y}_i^{(j')}\big)$ (5)

In formula (5), $\theta_{RS}$ denotes the model parameters; $n_1$ is the number of true labels B in the training set and $n_2$ the number of true labels I; $N$ is the number of samples in a batch, $N_1$ the number of true labels I in a batch and $N_2$ the number of true labels B in a batch; $D(y_i^{(j')}{=}C)$ is an indicator function that equals 1 when $y_i^{(j')} = C$ and 0 otherwise; $w_B$ denotes the loss weight of true label B and $w_I$ the loss weight of true label I, computed from the label counts $n_1$, $n_2$, $N_1$, $N_2$ so as to balance the two classes; $y_i^{(j')}$ and $\hat{y}_i^{(j')}$ denote the true label and the predicted label of the i-th line text in the j'-th resume sample; and $\ell(\cdot,\cdot)$ is the binary cross entropy loss function;
step 1.5.2, applying the Adam optimization algorithm to the loss function $L(\theta_{RS})$ for gradient back propagation, updating the RS model weights until the RS model converges, thereby obtaining the resume segmentation model;
step 1.6, prediction with the resume segmentation model:
processing the word-format resume samples to be predicted according to step 1.1 and step 1.2 to obtain the text vector sequences to be predicted, and inputting them into the resume segmentation model to obtain the output prediction sequences;
step 2, resume block classification:
step 2.1, data preparation of the RC model:
dividing the prediction sequence into the resume block sequence $\{block_1, \dots, block_t, \dots, block_z\}$ according to the block start marks B, and labeling each resume block $block_t$ with a one-hot category label $y_t$; $block_t$ denotes the t-th resume block, with $block_t = \{x^t_1, \dots, x^t_r, \dots, x^t_{p_t}\}$, where $x^t_r$ denotes the r-th line text vector of the t-th resume block, $z$ denotes the total number of resume blocks in one resume sample, and $p_t$ denotes the total number of line text vectors contained in the t-th resume block;
step 2.2, resume block feature vector representation:
constructing an encoder with the same structure as the Bi-LSTM network and using it to encode the t-th resume block $block_t$, obtaining the encoder output $Out_t = \{out^t_1, \dots, out^t_r, \dots, out^t_{p_t}\}$, where $out^t_r$ denotes the r-th output vector of fixed dimension;
processing the encoder output $Out_t$ with the max pooling layer to obtain the feature vector $f_t$ of the t-th resume block $block_t$, thereby obtaining the feature vector sequence $F = \{f_1, \dots, f_t, \dots, f_z\}$;
Step 2.3, forward calculation of the RC model:
constructing a Bi-GRU encoder to encode the feature vector $f_t$, obtaining the forward output $\overrightarrow{h_t}$ and the backward output $\overleftarrow{h_t}$ of the Bi-GRU encoder, whose concatenation is $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, thereby obtaining the Bi-GRU encoder output $out = \{h_1, \dots, h_t, \dots, h_z\}$;
2.4, passing the output $out$ of the Bi-GRU encoder through a fully connected layer (dense) and a softmax activation to obtain the class probability distributions of the z resume blocks;
step 2.5, learning of the RC model:
constructing the loss function $L_{RC}(\theta_{RC})$ of the RC model with formula (6), a categorical cross entropy:

$L_{RC}(\theta_{RC}) = -\frac{1}{z} \sum_{t=1}^{z} \sum_{c=1}^{k} y_t^{l,(c)} \log \hat{y}_t^{l,(c)}$ (6)

In formula (6), $k$ is the total number of label categories of the resume blocks in the sequence $\{block_1, \dots, block_t, \dots, block_z\}$; $y_t^{l,(c)}$ is the value of the c-th element of the true one-hot label of the t-th resume block in the l-th resume sample, and $\hat{y}_t^{l,(c)}$ is the value of the c-th element of the corresponding predicted probability distribution;
step 2.6, applying the Mini-batch gradient descent optimization algorithm to the loss function $L_{RC}(\theta_{RC})$, updating its parameters by back propagation until the model converges, thereby obtaining the resume block classification model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110685320.5A CN113297845B (en) | 2021-06-21 | 2021-06-21 | Resume block classification method based on multi-level bidirectional circulation neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110685320.5A CN113297845B (en) | 2021-06-21 | 2021-06-21 | Resume block classification method based on multi-level bidirectional circulation neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113297845A CN113297845A (en) | 2021-08-24 |
CN113297845B true CN113297845B (en) | 2022-07-26 |
Family
ID=77328902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110685320.5A Expired - Fee Related CN113297845B (en) | 2021-06-21 | 2021-06-21 | Resume block classification method based on multi-level bidirectional circulation neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113297845B (en) |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108334605B (en) * | 2018-02-01 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Text classification method and device, computer equipment and storage medium |
CN109376242B (en) * | 2018-10-18 | 2020-11-17 | 西安工程大学 | Text classification method based on cyclic neural network variant and convolutional neural network |
CN109543084B (en) * | 2018-11-09 | 2021-01-19 | 西安交通大学 | Method for establishing detection model of hidden sensitive text facing network social media |
CN109635288B (en) * | 2018-11-29 | 2023-05-23 | 东莞理工学院 | Resume extraction method based on deep neural network |
CN109753909B (en) * | 2018-12-27 | 2021-08-10 | 广东人啊人网络技术开发有限公司 | Resume analysis method based on content blocking and BilSTM model |
US11194962B2 (en) * | 2019-06-05 | 2021-12-07 | Fmr Llc | Automated identification and classification of complaint-specific user interactions using a multilayer neural network |
CN110442841B (en) * | 2019-06-20 | 2024-02-02 | 平安科技(深圳)有限公司 | Resume identification method and device, computer equipment and storage medium |
CN110888927B (en) * | 2019-11-14 | 2023-04-18 | 东莞理工学院 | Resume information extraction method and system |
CN111026845B (en) * | 2019-12-06 | 2021-09-21 | 北京理工大学 | Text classification method for acquiring multilevel context semantics |
CN111428488A (en) * | 2020-03-06 | 2020-07-17 | 平安科技(深圳)有限公司 | Resume data information analyzing and matching method and device, electronic equipment and medium |
CN112149389A (en) * | 2020-09-27 | 2020-12-29 | 南方电网数字电网研究院有限公司 | Resume information structured processing method and device, computer equipment and storage medium |
CN112416956B (en) * | 2020-11-19 | 2023-04-07 | 重庆邮电大学 | Question classification method based on BERT and independent cyclic neural network |
- 2021-06-21: CN CN202110685320.5A patent/CN113297845B/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
CN113297845A (en) | 2021-08-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110110054B (en) | Method for acquiring question-answer pairs from unstructured text based on deep learning | |
CN111985239B (en) | Entity identification method, entity identification device, electronic equipment and storage medium | |
CN111694924A (en) | Event extraction method and system | |
CN111639171A (en) | Knowledge graph question-answering method and device | |
CN109359291A (en) | A kind of name entity recognition method | |
CN111783394A (en) | Training method of event extraction model, event extraction method, system and equipment | |
CN112541355A (en) | Few-sample named entity identification method and system with entity boundary class decoupling | |
CN112052684A (en) | Named entity identification method, device, equipment and storage medium for power metering | |
CN113836891A (en) | Method and device for extracting structured information based on multi-element labeling strategy | |
CN113515632A (en) | Text classification method based on graph path knowledge extraction | |
CN114139522A (en) | Key information identification method based on level attention and label guided learning | |
CN115203338A (en) | Label and label example recommendation method | |
CN115952791A (en) | Chapter-level event extraction method, device and equipment based on machine reading understanding and storage medium | |
CN112417132A (en) | New intention recognition method for screening negative samples by utilizing predicate guest information | |
CN114722204A (en) | Multi-label text classification method and device | |
CN113961666A (en) | Keyword recognition method, apparatus, device, medium, and computer program product | |
CN117725211A (en) | Text classification method and system based on self-constructed prompt template | |
CN117390189A (en) | Neutral text generation method based on pre-classifier | |
CN112883726A (en) | Multi-task Thai word segmentation method based on syllable segmentation and word segmentation joint learning | |
CN113297845B (en) | Resume block classification method based on multi-level bidirectional circulation neural network | |
CN116362247A (en) | Entity extraction method based on MRC framework | |
CN113705222B (en) | Training method and device for slot identification model and slot filling method and device | |
CN116258204A (en) | Industrial safety production violation punishment management method and system based on knowledge graph | |
CN114510943A (en) | Incremental named entity identification method based on pseudo sample playback | |
CN114611489A (en) | Text logic condition extraction AI model construction method, extraction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220726 |