CN115203415A - Resume document information extraction method and related device - Google Patents


Info

Publication number
CN115203415A
CN115203415A
Authority
CN
China
Prior art keywords
information extraction
model
embedding layer
feature embedding
training
Prior art date
Legal status
Pending
Application number
CN202210826700.0A
Other languages
Chinese (zh)
Inventor
吕杨苗
张翼飞
张雪飞
廖艺
郭腾飞
胡光辉
Current Assignee
Henan Zhongyuan Consumption Finance Co ltd
Original Assignee
Henan Zhongyuan Consumption Finance Co ltd
Priority date
Filing date
Publication date
Application filed by Henan Zhongyuan Consumption Finance Co ltd filed Critical Henan Zhongyuan Consumption Finance Co ltd
Priority to CN202210826700.0A priority Critical patent/CN115203415A/en
Publication of CN115203415A publication Critical patent/CN115203415A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a resume document information extraction method, which comprises the following steps: performing training corpus construction processing on original resume document data to obtain training data; constructing a Transformer model according to a multi-feature Embedding layer and taking the Transformer model as an initial information extraction model, wherein the multi-feature Embedding layer is constructed from position features, layout features, image features and page features; training the initial information extraction model according to the training data to obtain an information extraction model; and processing the resume document to be processed with the information extraction model to obtain an information extraction result, wherein the information extraction result comprises a plurality of entities, the information of each entity, and the classification results of the plurality of entities. The method improves the accuracy and precision of information extraction from resume documents. The application also discloses a resume document information extraction device, a terminal device and a computer-readable storage medium, which have the same beneficial effects.

Description

Resume document information extraction method and related device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method for extracting resume document information, a device for extracting resume document information, a terminal device, and a computer-readable storage medium.
Background
With the continuous development of information technology, resume documents have become increasingly complex, and existing information extraction techniques cannot extract the information in a resume accurately and quickly.
In the related art, information is generally extracted in a rule-based manner. For example, a work experience (or an educational experience, or a project experience) often contains multiple sub-experiences, and each sub-experience contains information such as a start time, an end time, an employer and a position, so in the information structuring process the several pieces of information belonging to one sub-experience often need to be output together as a group; judging this with a proximity rule often leads to errors, a proliferation of rules, difficulty of extension, and similar problems. That is to say, as resume documents grow ever more complex, existing resume information extraction suffers from reduced parsing accuracy and precision and from inaccurate extraction.
Therefore, how to improve the accuracy and precision of information extraction from resume documents is a key issue of concern to those skilled in the art.
Disclosure of Invention
The application aims to provide a resume document information extraction method, a resume document information extraction device, a terminal device and a computer-readable storage medium, so as to improve the accuracy and precision of information extraction from resume documents and to improve the extraction effect.
In order to solve the technical problem, the present application provides a resume document information extraction method, including:
performing training corpus construction processing according to original resume document data to obtain training data;
constructing a Transformer model according to the multi-feature Embedding layer, and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
training the initial information extraction model according to the training data to obtain an information extraction model;
processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
Optionally, the training corpus construction processing is performed according to the original resume document data to obtain training data, including:
extracting text boxes from the original resume document data to obtain a plurality of text boxes;
and constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain the training data.
Optionally, constructing a Transformer model according to the multi-feature Embedding layer, and using the Transformer model as an initial information extraction model, including:
respectively constructing a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer and a page feature embedding layer;
and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain the initial information extraction model.
Optionally, training the initial information extraction model according to the training data to obtain an information extraction model, including:
respectively constructing a loss function of region classification, a loss function of entity extraction and a loss function of entity relation classification;
and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
The present application further provides a resume document information extraction apparatus, including:
the training data acquisition module is used for constructing and processing training corpora according to original resume document data to obtain training data;
the model building module is used for building a Transformer model according to the multi-feature Embedding layer and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
the model training module is used for training the initial information extraction model according to the training data to obtain an information extraction model;
the document information extraction module is used for processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
Optionally, the training data obtaining module is specifically configured to perform text box extraction on the original resume document data to obtain a plurality of text boxes, and to construct a training corpus based on the information of each text box and the corresponding entity classification to obtain the training data.
Optionally, the model building module is specifically configured to respectively build a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer, and a page feature embedding layer; and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain the initial information extraction model.
Optionally, the model training module is specifically configured to respectively construct a loss function of the region classification, a loss function of the entity extraction, and a loss function of the entity relationship classification; and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
The present application further provides a terminal device, including:
a memory for storing a computer program;
a processor for implementing the steps of the resume document information extraction method when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the resume document information extraction method as described above.
The application provides a resume document information extraction method, which comprises the following steps: performing training corpus construction processing according to original resume document data to obtain training data; constructing a Transformer model according to the multi-feature Embedding layer, and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features; training the initial information extraction model according to the training data to obtain an information extraction model; processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
The training data are extracted from the original data, then an initial information extraction model of a multi-feature Embedding layer is constructed, finally the training data are adopted to train the initial information extraction model to obtain the information extraction model, and finally extraction processing is carried out, so that a plurality of entities in the resume document, information of each entity and classification results of the entities can be obtained, the effect of information extraction processing on the high-complexity document is improved, and the accuracy and precision of extraction are improved.
The application further provides a resume document information extraction device, a terminal device and a computer readable storage medium, which have the above beneficial effects, and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for extracting resume document information according to an embodiment of the present application;
fig. 2 is a schematic diagram of a model structure of a resume document information extraction method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the page-number encoding of a method for extracting resume document information according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating feature fusion of a resume document information extraction method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a resume document information extraction device according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a resume document information extraction method, a resume document information extraction device, a terminal device and a computer-readable storage medium, so as to improve the accuracy and precision of information extraction from resume documents and improve the extraction effect.
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. It is obvious that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
In the related art, information is generally extracted in a rule-based manner. For example, a work experience (or an educational experience, or a project experience) often contains multiple sub-experiences, and each sub-experience contains information such as a start time, an end time, an employer and a position; in the information structuring process the several pieces of information belonging to one sub-experience often need to be output together as a group, and judging this with a proximity rule often leads to errors, a proliferation of rules, difficulty of extension, and similar problems. That is to say, as resume documents grow ever more complex, existing resume information extraction suffers from reduced parsing accuracy and precision and from inaccurate extraction.
Therefore, the resume document information extraction method provided by the application extracts training data from the original data, constructs an initial information extraction model with a multi-feature Embedding layer, trains the initial information extraction model with the training data to obtain the information extraction model, and then performs extraction, so that the plurality of entities in the resume document, the information of each entity and the classification results of the entities can be obtained; this improves the effect of information extraction on highly complex documents and improves the accuracy and precision of extraction.
The following describes a method for extracting resume document information according to an embodiment.
Referring to fig. 1, fig. 1 is a flowchart of a method for extracting resume document information according to an embodiment of the present disclosure.
In this embodiment, the method may include:
s101, performing training corpus construction processing according to original resume document data to obtain training data;
the step aims to carry out construction processing on the training corpus according to the original resume document data to obtain training data. That is, the training data suitable for the application in the embodiment is constructed so as to achieve better information extraction on the resume document.
Further, the step may include:
step 1, extracting text boxes from the original resume document data to obtain a plurality of text boxes;
and 2, constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain training data.
It can be seen that the present alternative is primarily illustrative of how training data is constructed. In the alternative scheme, the text box extraction is carried out on the original resume document data to obtain a plurality of text boxes, and training corpus construction is carried out based on the information of each text box and the corresponding entity classification to obtain training data. That is, data is extracted therefrom and training data is constructed based on the extracted entity classification.
S102, constructing a Transformer model according to the multi-feature Embedding layer, and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
On the basis of S101, this step aims to construct a Transformer model according to a multi-feature Embedding layer and take the Transformer model as an initial information extraction model, where the multi-feature Embedding layer is constructed from position features, layout features, image features and page features. In this embodiment, in order to improve the accuracy and effect of document recognition in the information extraction process, additional feature information is fused into the model, thereby improving the accuracy with which the model recognizes features.
The Transformer model is a model which utilizes an attention mechanism to improve the training speed of the model.
The purpose of the Embedding layer is to project high-dimensional data, which is relatively sparse in each dimension, into relatively low-dimensional data in which each dimension takes real values and can be operated on. In essence, a continuous space is used instead of a (quasi-)discrete space, improving space utilization.
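As a minimal illustration of this idea (not taken from the application), the sketch below uses PyTorch's nn.Embedding to map sparse integer identifiers into a dense, low-dimensional real-valued space; the vocabulary size and dimension are example values.

import torch
import torch.nn as nn

# A sparse, high-cardinality ID space (e.g. 30,000 tokens) is mapped to
# dense 768-dimensional real-valued vectors that the model can operate on.
embedding = nn.Embedding(num_embeddings=30000, embedding_dim=768)
token_ids = torch.tensor([[101, 2769, 3221, 102]])   # a toy ID sequence
dense = embedding(token_ids)                         # shape: (1, 4, 768)
print(dense.shape)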
Further, the step may include:
step 1, respectively constructing a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer and a page feature embedding layer;
and step 2, fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain an initial information extraction model.
It can be seen that this alternative mainly illustrates how the information extraction model is constructed. In this alternative, a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer and a page feature embedding layer are constructed respectively, and then fused into a Transformer model to obtain the initial information extraction model.
S103, training the initial information extraction model according to the training data to obtain an information extraction model;
on the basis of S102, this step aims to train the initial information extraction model according to the training data, resulting in an information extraction model. In order to implement simultaneous performance of multiple tasks, multiple loss functions may be used to train the model, so that the trained information extraction model may simultaneously infer multiple tasks. Wherein the loss function may include: loss functions of region classification, loss functions of entity extraction, and loss functions of entity relationship classification.
Further, the step may include:
step 1, respectively constructing a loss function of region classification, a loss function of entity extraction and a loss function of entity relation classification;
and 2, training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
It can be seen that the present alternative is primarily illustrative of how the model may be trained. The alternative scheme mainly comprises the steps of respectively constructing a loss function of region classification, a loss function of entity extraction and a loss function of entity relationship classification, and training an initial information extraction model based on the loss function of region classification, the loss function of entity extraction, the loss function of entity relationship classification and training data to obtain an information extraction model. Obviously, in order to implement simultaneous determination of the tasks of region classification, entity extraction, and entity relationship classification in the alternative, the loss function of region classification, the loss function of entity extraction, and the loss function of entity relationship classification are used in this embodiment to train the model together. The specific process of training may refer to any training mode provided in the prior art, and is not specifically limited herein.
S104, processing the resume document to be processed by adopting an information extraction model to obtain an information extraction result; the information extraction result comprises a plurality of entities, information of each entity and classification results of the entities.
On the basis of S103, processing the resume document to be processed by adopting an information extraction model to obtain an information extraction result; the information extraction result comprises a plurality of entities, information of each entity and classification results of the entities.
In summary, in this embodiment, training data are extracted from the original data, an initial information extraction model with a multi-feature Embedding layer is constructed, the initial information extraction model is trained with the training data to obtain the information extraction model, and finally extraction is performed, so that the plurality of entities in the resume document, the information of each entity and the classification results of the plurality of entities can be obtained; this improves the effect of information extraction on highly complex documents and improves the accuracy and precision of extraction.
The following further describes a method for extracting resume document information provided by the present application by using another specific embodiment.
Referring to fig. 2, fig. 2 is a schematic diagram of a model structure of a resume document information extraction method according to an embodiment of the present application.
the document information extraction model in the embodiment is modified on the basis of an Encoder end of a Transformer, and the model training comprises 5 steps: data preparation, multi-modal Embedding layer construction, model construction, multi-task loss function calculation and model optimization.
In fig. 2, T-0 to T-511 are each character of the text sequence, V-0 to V-48 are the image sequence obtained by transforming the image, and P is the page number where the group of text boxes is located.
In this embodiment, the method may include:
step 10, data preparation.
Most of the resume documents handled in this embodiment are PDF (Portable Document Format) files, Word documents and pictures.
Step 11, obtaining the text boxes of a document: the text of each text box, the position coordinates of each text box, and a picture of the current page.
When the document is a PDF, the text is obtained with a PDF parsing tool (a Word document is first converted to PDF), and each page is converted into a picture; when the document is a picture, the text is obtained using OCR (Optical Character Recognition).
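A minimal sketch of this extraction step, assuming pdfplumber as the PDF parsing tool (the application does not name a specific tool): it returns word-level boxes and a rendered page picture; grouping the words into the larger text boxes actually used by the application is an additional step left out here.

import pdfplumber

# Hypothetical helper: for each page, collect text plus coordinates and render
# the page to an image, matching the (text, box, page picture) inputs of step 11.
def extract_text_boxes(pdf_path):
    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages):
            image = page.to_image(resolution=150)    # picture of the current page
            image.save(f"page_{page_no}.png")
            for word in page.extract_words():        # one entry per word-level box
                records.append({
                    "page": page_no,
                    "text": word["text"],
                    "box": (word["x0"], word["top"], word["x1"], word["bottom"]),
                })
    return records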
Step 12, preparing a corpus: (ID, P, T, S, V, L_area, L_entity, L_relation).
Wherein ID refers to an index, P represents the page number, T represents the text information of a group of text boxes, S represents the position coordinate information of each text box on its page, and V represents the picture information of the page where the group of text boxes is located.
L_area represents the area labels corresponding to the group of text boxes: basic information, job-seeking intention, work experience, educational experience, project experience, professional skill certificates, family members. L_entity represents the entities extracted from the group of text boxes; there are 56 entity types, corresponding to the different areas respectively, and the specific correspondence is shown in the following table. L_relation indicates whether there is a relationship between text boxes of the group and other text boxes in the batch, 0 representing no relationship and 1 representing a relationship.
The set of area labels is denoted S_area, and the set of entity labels is denoted S_entity.
For example:
{ ID: 0,
  P: 0,
  T: [
    "Zhang San", "July 21, 1987",
    "Work experience", "2017.10 - present  Calf Technology",
    "Position: risk control model manager",
    "2016.05 - 2017.10  Qianglong Finance", "Position: data scientist"
  ],
  S: [
    [12,22,28,26], [12,27,35,31], [13,53,48,62],
    [24,66,49,72], [45,74,59,79], [24,81,51,88], [46,92,68,100]
  ],
  V: numpy.array,
  L_area: [
    "basic information", "work experience"
  ],
  L_entity: [
    "Zhang San", "July 21, 1987",
    "2017.10", "present", "Calf Technology", "risk control model manager",
    "2016.05", "2017.10", "Qianglong Finance", "data scientist"
  ],
  L_relation: [
    [{"id": 0, "index": 3}, {"id": 0, "index": 4}],
    [{"id": 0, "index": 5}, {"id": 0, "index": 6}],
    [{"id": 0, "index": 5}, {"id": 1, "index": 0}],
    [{"id": 0, "index": 6}, {"id": 1, "index": 0}]
    # here id refers to the index of a sample in the batch, and index denotes the text box index
  ]
}
In the real data, L_area is a list of label indices, e.g. [0, 3], where 0 and 3 respectively represent the index values corresponding to the area labels; L_entity is a BIO label sequence, e.g. <O O B I I O B O O O O O B I I I I I O>, in which each B, I or O corresponds to one character.
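A minimal sketch of how a character-level BIO label sequence of this kind can be produced from the text of a box and known entity spans; the helper name and the (start, end) character-offset span format are illustrative, not taken from the application.

def to_bio(text, spans):
    # spans: list of (start, end) character offsets with end exclusive; labels are
    # plain B/I/O as in the example above (typed labels such as B-NAME work the same way)
    labels = ["O"] * len(text)
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

# Example: a name span and a birth-date span inside one text box (hypothetical offsets)
print(to_bio("Zhang San 1987.07.21", [(0, 9), (10, 20)]))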
Step 20, constructing the multi-modal Embedding layer.
The Embedding of the document information extraction model of this embodiment is composed of Word-Embedding, Position-Embedding, Segment-Embedding, Spatial-Embedding, Visual-Embedding and Page-Embedding, which are finally fused into Fusion-Embedding.
Wherein, word-Embedding and Position-Embedding are owned by the transducer, and Segment-Embedding, spatial-Embedding, visual-Embedding, page-Embedding and Fusion-Embedding are mainly introduced here.
Step 21, segment embedding Segment-Embedding.
A segment-type embedding dictionary segment_type_embedding with three entries is initialized, representing text, image and page number respectively; when a token is text the first entry is taken, when the token is an image the second entry is taken, and when the token is a page number the third entry is taken.
Step 22, layout embedding Spatial-Embedding.
The layout embedding is formed from the coordinates of the text box. The coordinates of a text box are composed of (x1, y1, x2, y2, w, h), where x1 and y1 are the coordinates of the upper-left corner of the text box, x2 and y2 are the coordinates of the lower-right corner, and w and h are the width and height of the text box.
An embedding dictionary is initialized for each of x, y, w and h, corresponding to the abscissa, the ordinate, the width and the height respectively. The embeddings corresponding to the six values of the text box coordinates are looked up and then concatenated to form the Spatial-Embedding, as sketched below.
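A minimal sketch of such a layout embedding: one embedding dictionary each for x, y, w and h, looked up for the six box values and concatenated. The coordinate range and the per-dictionary dimension are assumptions (6 x 128 = 768 is chosen only to match the model dimension).

import torch
import torch.nn as nn

class SpatialEmbedding(nn.Module):
    def __init__(self, max_coord=1000, dim=128):
        super().__init__()
        self.x_emb = nn.Embedding(max_coord, dim)   # abscissa dictionary
        self.y_emb = nn.Embedding(max_coord, dim)   # ordinate dictionary
        self.w_emb = nn.Embedding(max_coord, dim)   # width dictionary
        self.h_emb = nn.Embedding(max_coord, dim)   # height dictionary

    def forward(self, box):
        # box: (..., 6) integer tensor holding (x1, y1, x2, y2, w, h)
        x1, y1, x2, y2, w, h = box.unbind(-1)
        parts = [self.x_emb(x1), self.y_emb(y1),
                 self.x_emb(x2), self.y_emb(y2),
                 self.w_emb(w), self.h_emb(h)]
        return torch.cat(parts, dim=-1)             # Spatial-Embedding, size 6 * 128

# usage: SpatialEmbedding()(torch.tensor([12, 22, 28, 26, 16, 4]))  ->  shape (768,)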
Step 23, image embedding Visual-Embedding.
The image embedding is formed from the picture into which the page containing the text boxes is converted. The picture is encoded by a ResNeXt-FPN network, the resulting feature map is flattened row by row and average-pooled, and the feature sequence corresponding to the picture is then obtained through a linear projection.
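A minimal sketch of this image branch with a stand-in backbone: the application uses a ResNeXt-FPN encoder, while the sketch below uses a plain torchvision ResNeXt only to illustrate the encode, pool, flatten and project flow that yields the 49-token image sequence V-0 to V-48; the fixed 7 x 7 grid and the projection size are assumptions.

import torch
import torch.nn as nn
import torchvision

class VisualEmbedding(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        backbone = torchvision.models.resnext50_32x4d(weights=None)
        # drop the global pooling and classification head, keep the feature map
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d((7, 7))    # fix the grid to 7 x 7 = 49 tokens
        self.proj = nn.Linear(2048, dim)            # linear projection to the model dimension

    def forward(self, page_image):
        # page_image: (batch, 3, H, W) picture of the page containing the text boxes
        feat = self.pool(self.encoder(page_image))  # (batch, 2048, 7, 7)
        feat = feat.flatten(2).transpose(1, 2)      # (batch, 49, 2048), flattened row by row
        return self.proj(feat)                      # (batch, 49, dim)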
Step 24, page-number embedding Page-Embedding.
For the Page-Embedding of the page number P, in order to avoid page-number sparsity and insufficient page-number coverage, a page-number encoding method is designed:
First, an embedding dictionary for the digits 0 to 9 is initialized; second, the page number is encoded as a 4-digit code, for example 0001 for page 1 and 0012 for page 12 (page numbers above 9999 are truncated, although page numbers generally do not exceed 9999).
Referring to fig. 3, fig. 3 is a schematic page encoding diagram of a resume document information extraction method according to an embodiment of the present application.
As can be seen, a one-dimensional convolution (CNN, Convolutional Neural Network) with a kernel width of 2 is applied to the page-code sequence and followed by average pooling, finally giving the Page-Embedding of the page.
In fig. 3, the CNN includes a one-dimensional convolution conv1d and an average pooling layer average-pool1d; the vectors for the digits 0 to 9 are the representations of the page-number digits, and the Page-Embedding of the page number is obtained through the CNN.
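A minimal sketch of this page-number encoding: a 10-entry digit dictionary, a zero-padded 4-digit code, a one-dimensional convolution of width 2 and average pooling; the embedding dimension and module layout are assumptions.

import torch
import torch.nn as nn

class PageEmbedding(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.digit_emb = nn.Embedding(10, dim)           # dictionary for digits 0 to 9
        self.conv = nn.Conv1d(dim, dim, kernel_size=2)   # one-dimensional convolution, width 2
        self.pool = nn.AdaptiveAvgPool1d(1)              # average pooling over the code

    def forward(self, page_number):
        # page_number: int, truncated to at most 9999 and zero-padded to 4 digits
        code = f"{min(page_number, 9999):04d}"           # e.g. 1 -> "0001", 12 -> "0012"
        digits = torch.tensor([[int(c) for c in code]])  # (1, 4)
        emb = self.digit_emb(digits).transpose(1, 2)     # (1, dim, 4)
        out = self.pool(self.conv(emb))                  # (1, dim, 1)
        return out.squeeze(-1)                           # (1, dim) Page-Embedding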
Step 25, fusion embedding Fusion-Embedding.
After Segment-Embedding, Spatial-Embedding, Visual-Embedding and Page-Embedding are obtained, they are fused with Word-Embedding and Position-Embedding.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating feature fusion of a resume document information extraction method according to an embodiment of the present application.
In fig. 4, the text feature sequence length is 512, the image feature sequence length is 49, and the page feature sequence length is 1.
As shown in fig. 4, for the Segment segmentation feature, the segments of the text are all 0, the segments of the image are all 1, and the segment of the page number is 2;
for the Token text/image/page-number feature, the text corresponds to word-embedding, the image corresponds to visual-embedding, and the page number corresponds to page-embedding;
for the Position feature, the text occupies positions 0 to 511, the image occupies positions 512 to 560, and the page number occupies position 561;
for the Layout feature, the text corresponds to the spatial-embedding of its text box, while the image and the page number correspond to the spatial-embedding of (0, 0).
As can be seen, the processing of the data may include:
first, for each of the Segment segmentation feature, the Token text/image/page-number feature, the Position feature and the Layout feature, the embeddings of the text, the image and the page number are concatenated respectively;
second, the four resulting Embeddings are added directly to obtain the Fusion-Embedding.
The text here is composed of a plurality of text boxes.
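A minimal sketch of these two fusion steps, assuming the per-modality embeddings have already been computed; the tensor shapes follow fig. 4 (text 512 tokens, image 49 tokens, page number 1 token) and the function name is illustrative.

import torch

def fuse_embeddings(text_parts, image_parts, page_parts):
    # Each argument is a dict with keys "segment", "token", "position", "layout",
    # holding tensors of shape (512, dim), (49, dim) and (1, dim) respectively.
    fused = 0
    for key in ("segment", "token", "position", "layout"):
        # step 1: concatenate the text, image and page-number embeddings of this feature
        seq = torch.cat([text_parts[key], image_parts[key], page_parts[key]], dim=0)
        # step 2: add the four concatenated embeddings together
        fused = fused + seq
    return fused  # Fusion-Embedding of shape (562, dim)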
Step 30, constructing the model.
The document information extraction model is constructed on the Encoder side of the Transformer model and comprises 12 Transformer units; the Embedding layer vector is 768-dimensional, the model has about 5 million parameters, and the Embedding layer is constructed from four features: text, layout, image and page number.
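A minimal sketch of an Encoder stack of this size using PyTorch's built-in Transformer encoder; the head count and feed-forward width are assumptions, since the application only fixes 12 units and a 768-dimensional Embedding vector.

import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,          # dimension of the fused Embedding layer
    nhead=12,             # assumed head count (not specified in the application)
    dim_feedforward=3072, # assumed feed-forward width
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # 12 Transformer units

# fusion_embedding: (batch, 562, 768) from the multi-modal Embedding layer
# hidden = encoder(fusion_embedding)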
Step 40, calculating the multi-task loss function.
The document information extraction model of the embodiment has 3 training tasks: regional multi-label classification, entity extraction and relationship classification.
The area multi-label classification determines which areas the group of text boxes belongs to, for example the areas <basic information> and <job-seeking intention>; here the first character T-0 is given a multi-label classification, which is a label classification task.
Entity extraction judges the entity type, such as a <name> entity, at the character level of the text sequence, and is a sequence labeling task.
Relationship classification judges whether a relationship exists between each pair of text boxes in the text sequence, and is a binary classification task.
Step 41, the loss function of the regional multi-label classification is:
Sentence area loss = -Σ_{tag ∈ S_area} [ I(tag)·log(P_tag) + (1 - I(tag))·log(P'_tag) ],  with P_tag + P'_tag = 1,
where I is an indicator function, tag is a label in the area label set S_area, P_tag is the probability that the T-0 character is predicted as tag, and P'_tag is the probability that the T-0 character is predicted as not tag.
Step 42, the loss function of entity extraction is:
Sentence entity loss = -Σ_{token ∈ T} Σ_{tag ∈ S_entity} I(y_token = tag)·log(P_tag(token)),
where I is an indicator function, token is a character of the T sequence (the text sequence), tag is a label in the entity label set S_entity, P_tag(token) is the probability that token is predicted as tag, and y_token is the real label of token in T.
Step 43, the loss function of the relationship classification is:
Sentence relation loss = -Σ_{(i,j)} [ I(y_ij = 0)·log(P_0-ij) + I(y_ij = 1)·log(P_1-ij) ],  with P_0-ij + P_1-ij = 1,
where I is an indicator function, P_0-ij is the probability that the pair (i, j) is predicted as class 0 (no relationship), and P_1-ij is the probability that it is predicted as class 1 (related).
Here i and j are two different entities belonging to one batch; they may come from one sample or from different samples.
Step 44, adding the above 3 loss functions gives the total loss function for training the document information extraction model of this embodiment:
Total loss = Sentence area loss + Sentence entity loss + Sentence relation loss.
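A minimal sketch of the three loss terms and their sum, using standard cross-entropy losses consistent with the formulas above; the head names and tensor shapes are illustrative, not taken from the application.

import torch.nn.functional as F

def total_loss(area_logits, area_labels, entity_logits, entity_labels,
               relation_logits, relation_labels):
    # Region multi-label classification on the T-0 character: one independent
    # binary decision per area label in S_area (multi-hot area_labels).
    area_loss = F.binary_cross_entropy_with_logits(area_logits, area_labels.float())
    # Entity extraction: character-level classification over the entity/BIO label set.
    entity_loss = F.cross_entropy(entity_logits.view(-1, entity_logits.size(-1)),
                                  entity_labels.view(-1))
    # Relation classification: binary decision (related / not related) per text-box pair.
    relation_loss = F.cross_entropy(relation_logits.view(-1, 2),
                                    relation_labels.view(-1))
    return area_loss + entity_loss + relation_loss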
Step 50, optimizing the model.
The Transformer is trained with a deep-learning model optimization framework using the prepared training corpus, the multi-modal Embedding layer and the multi-task loss function, finally yielding the resume document information extraction model. Compared with a model relying only on plain-text features, the accuracy of entity extraction is improved by 20%, the accuracy of relation extraction is improved by 15%, and the overall accuracy reaches 95%, fully meeting current requirements and expectations.
Step 60, document information extraction model inference.
The inference of the resume document information extraction model in this embodiment mainly comprises two steps: entity extraction and relationship classification.
Step 61, entity extraction.
The text boxes and picture information of the resume document (that is, data prepared in the same way as in step 10, but without the label information) are fed into the extraction model to obtain the entity extraction result; if the entity extraction result is empty, the procedure ends, otherwise it proceeds to step 62.
Step 62, relationship classification.
The relationship results between each pair of text boxes containing the extracted entities are taken out and summarized, and the related entities are divided into groups.
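A minimal sketch of this grouping step: pairwise relation predictions are merged so that entities connected by a relation fall into one group; the union-find implementation and data layout are illustrative, not taken from the application.

def group_entities(num_entities, related_pairs):
    # related_pairs: list of (i, j) index pairs predicted as class 1 (related)
    parent = list(range(num_entities))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for i, j in related_pairs:
        parent[find(i)] = find(j)           # merge the two groups

    groups = {}
    for idx in range(num_entities):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())

# e.g. boxes 3-4 and 5-6 related: -> [[0], [1], [2], [3, 4], [5, 6]]
print(group_entities(7, [(3, 4), (5, 6)]))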
During training, the data of one batch belong to one document (the text of one document is taken out at random), and in each epoch the text boxes of the document are randomly shuffled before being put into training (so one batch may span pages, which supports learning cross-page hierarchical relationships).
Therefore, this embodiment extracts training data from the original data, constructs an initial information extraction model with a multi-feature Embedding layer, trains the initial information extraction model with the training data to obtain the information extraction model, and finally performs extraction, so that the plurality of entities in the resume document, the information of each entity and the classification results of the entities can be obtained; this improves the effect of information extraction on highly complex documents and improves the accuracy and precision of extraction.
In the following, the resume document information extraction device provided in the embodiment of the present application is introduced, and the resume document information extraction device described below and the resume document information extraction method described above may be referred to in correspondence with each other.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a resume document information extraction device according to an embodiment of the present application.
In this embodiment, the apparatus may include:
a training data acquisition module 100, configured to perform training corpus construction processing according to original resume document data to obtain training data;
the model building module 200 is used for building a Transformer model according to the multi-feature Embedding layer and taking the model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
the model training module 300 is configured to train the initial information extraction model according to the training data to obtain an information extraction model;
the document information extraction module 400 is configured to process the resume document to be processed by using an information extraction model to obtain an information extraction result; the information extraction result comprises a plurality of entities, information of each entity and classification results of the entities.
Optionally, the training data obtaining module 100 is specifically configured to perform text box extraction on the original resume document data to obtain a plurality of text boxes; and constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain training data.
Optionally, the model building module 200 is specifically configured to respectively build a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer, and a page feature embedding layer; and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain an initial information extraction model.
Optionally, the model training module 300 is specifically configured to respectively construct a loss function of region classification, a loss function of entity extraction, and a loss function of entity relationship classification; and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
An embodiment of the present application further provides a terminal device, including:
a memory for storing a computer program;
a processor for implementing the steps of the resume document information extraction method as described in the above embodiments when executing the computer program.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the resume document information extraction method according to the above embodiment are implemented.
The embodiments are described in a progressive mode in the specification, the emphasis of each embodiment is on the difference from the other embodiments, and the same and similar parts among the embodiments can be referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above provides a method for extracting resume document information, a device for extracting resume document information, a terminal device and a computer readable storage medium. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A resume document information extraction method is characterized by comprising the following steps:
performing corpus construction processing according to original resume document data to obtain training data;
constructing a Transformer model according to the multi-feature Embedding layer, and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
training the initial information extraction model according to the training data to obtain an information extraction model;
processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
2. The method for extracting resume document information according to claim 1, wherein performing corpus construction processing on original resume document data to obtain training data comprises:
extracting text boxes from the original resume document data to obtain a plurality of text boxes;
and constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain the training data.
3. The resume document information extraction method according to claim 1, wherein constructing a Transformer model according to a multi-feature Embedding layer and serving as an initial information extraction model comprises:
respectively constructing a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer and a page feature embedding layer;
and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain the initial information extraction model.
4. The resume document information extraction method of claim 1, wherein training the initial information extraction model according to the training data to obtain an information extraction model comprises:
respectively constructing a loss function of region classification, a loss function of entity extraction and a loss function of entity relation classification;
and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
5. A resume document information extraction device, characterized by comprising:
the training data acquisition module is used for constructing and processing training corpora according to original resume document data to obtain training data;
the model building module is used for building a Transformer model according to the multi-feature Embedding layer and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
the model training module is used for training the initial information extraction model according to the training data to obtain an information extraction model;
the document information extraction module is used for processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
6. The resume document information extraction device according to claim 5, wherein the training data acquisition module is specifically configured to perform text box extraction on the original resume document data to obtain a plurality of text boxes; and constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain the training data.
7. The resume document information extraction device of claim 5, wherein the model construction module is specifically configured to respectively construct a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer, and a page feature embedding layer; and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain the initial information extraction model.
8. The apparatus according to claim 5, wherein the model training module is configured to respectively construct a loss function for region classification, a loss function for entity extraction, and a loss function for entity relationship classification; and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
9. A terminal device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the resume document information extraction method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, implements the steps of the resume document information extraction method of any one of claims 1 to 4.
CN202210826700.0A 2022-07-14 2022-07-14 Resume document information extraction method and related device Pending CN115203415A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210826700.0A CN115203415A (en) 2022-07-14 2022-07-14 Resume document information extraction method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210826700.0A CN115203415A (en) 2022-07-14 2022-07-14 Resume document information extraction method and related device

Publications (1)

Publication Number Publication Date
CN115203415A true CN115203415A (en) 2022-10-18

Family

ID=83580504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210826700.0A Pending CN115203415A (en) 2022-07-14 2022-07-14 Resume document information extraction method and related device

Country Status (1)

Country Link
CN (1) CN115203415A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311320A (en) * 2023-05-22 2023-06-23 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device
CN116311320B (en) * 2023-05-22 2023-08-22 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device

Similar Documents

Publication Publication Date Title
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN114821622B (en) Text extraction method, text extraction model training method, device and equipment
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN115146488B (en) Variable business process intelligent modeling system and method based on big data
CN114580424B (en) Labeling method and device for named entity identification of legal document
CN110781672A (en) Question bank production method and system based on machine intelligence
CN115526259A (en) Training method and device for multi-mode pre-training model
CN114419642A (en) Method, device and system for extracting key value pair information in document image
CN112269872B (en) Resume analysis method and device, electronic equipment and computer storage medium
CN113204615A (en) Entity extraction method, device, equipment and storage medium
CN110705459A (en) Automatic identification method and device for mathematical and chemical formulas and model training method and device
CN114596566A (en) Text recognition method and related device
CN113672731A (en) Emotion analysis method, device and equipment based on domain information and storage medium
CN113935314A (en) Abstract extraction method, device, terminal equipment and medium based on heteromorphic graph network
CN115203415A (en) Resume document information extraction method and related device
CN114639109A (en) Image processing method and device, electronic equipment and storage medium
Al Ghamdi A novel approach to printed Arabic optical character recognition
CN115810215A (en) Face image generation method, device, equipment and storage medium
CN115455955A (en) Chinese named entity recognition method based on local and global character representation enhancement
CN115270792A (en) Medical entity identification method and device
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system
CN114092931A (en) Scene character recognition method and device, electronic equipment and storage medium
CN110852102B (en) Chinese part-of-speech tagging method and device, storage medium and electronic equipment
CN114298032A (en) Text punctuation detection method, computer device and storage medium
WO2021137942A1 (en) Pattern generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination