CN115203415A - Resume document information extraction method and related device - Google Patents


Info

Publication number
CN115203415A
CN115203415A
Authority
CN
China
Prior art keywords
information extraction
model
embedding layer
feature embedding
training
Prior art date
Legal status
Pending
Application number
CN202210826700.0A
Other languages
Chinese (zh)
Inventor
吕杨苗
张翼飞
张雪飞
廖艺
郭腾飞
胡光辉
Current Assignee
Henan Zhongyuan Consumption Finance Co ltd
Original Assignee
Henan Zhongyuan Consumption Finance Co ltd
Priority date
Filing date
Publication date
Application filed by Henan Zhongyuan Consumption Finance Co ltd filed Critical Henan Zhongyuan Consumption Finance Co ltd
Priority to CN202210826700.0A priority Critical patent/CN115203415A/en
Publication of CN115203415A publication Critical patent/CN115203415A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/355 - Class or cluster creation or modification
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a resume document information extraction method, which comprises the following steps: performing training corpus construction processing on original resume document data to obtain training data; constructing a Transformer model according to a multi-feature Embedding layer and taking the Transformer model as an initial information extraction model, wherein the multi-feature Embedding layer is constructed from position features, layout features, image features and page features; training the initial information extraction model according to the training data to obtain an information extraction model; and processing the resume document to be processed with the information extraction model to obtain an information extraction result, wherein the information extraction result comprises a plurality of entities, the information of each entity, and the classification results of the plurality of entities. The method improves the accuracy and precision of information extraction from resume documents. The application also discloses a resume document information extraction device, a terminal device and a computer-readable storage medium, which have the same beneficial effects.

Description

Resume document information extraction method and related device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method for extracting resume document information, a device for extracting resume document information, a terminal device, and a computer-readable storage medium.
Background
With the continuous development of information technology, resume documents have become increasingly complex, and existing information extraction techniques cannot extract the information in a resume accurately and quickly.
In the related art, information is generally extracted in a rule-based manner. For example, a work experience (or an educational experience, or a project experience) often contains multiple sub-experiences, and each sub-experience contains information such as a start time, an end time, an employer and a position, so in the information structuring process the several pieces of information belonging to one sub-experience often need to be output together as a group; judging this with a proximity rule often leads to errors, a proliferation of rules, difficulty of extension, and similar problems. That is to say, as resume documents grow ever more complex, existing resume information extraction suffers from reduced parsing accuracy and precision and from inaccurate extraction.
Therefore, how to improve the accuracy and precision of information extraction from resume documents is a key issue of concern to those skilled in the art.
Disclosure of Invention
The application aims to provide a resume document information extraction method, a resume document information extraction device, a terminal device and a computer-readable storage medium, so as to improve the accuracy and precision of information extraction from resume documents and to improve the extraction effect.
In order to solve the technical problem, the present application provides a resume document information extraction method, including:
performing training corpus construction processing according to original resume document data to obtain training data;
constructing a Transformer model according to the multi-feature Embedding layer, and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
training the initial information extraction model according to the training data to obtain an information extraction model;
processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
Optionally, the training corpus construction processing is performed according to the original resume document data to obtain training data, including:
extracting text boxes from the original resume document data to obtain a plurality of text boxes;
and constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain the training data.
Optionally, constructing a Transformer model according to the multi-feature Embedding layer, and using the Transformer model as an initial information extraction model, including:
respectively constructing a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer and a page feature embedding layer;
and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain the initial information extraction model.
Optionally, training the initial information extraction model according to the training data to obtain an information extraction model, including:
respectively constructing a loss function of region classification, a loss function of entity extraction and a loss function of entity relation classification;
and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
The present application further provides a resume document information extraction apparatus, including:
the training data acquisition module is used for constructing and processing training corpora according to original resume document data to obtain training data;
the model building module is used for building a Transformer model according to the multi-feature Embedding layer and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
the model training module is used for training the initial information extraction model according to the training data to obtain an information extraction model;
the document information extraction module is used for processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
Optionally, the training data obtaining module is specifically configured to perform text box extraction on the original resume document data to obtain a plurality of text boxes, and to construct a training corpus based on the information of each text box and the corresponding entity classification to obtain the training data.
Optionally, the model building module is specifically configured to respectively build a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer, and a page feature embedding layer; and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain the initial information extraction model.
Optionally, the model training module is specifically configured to respectively construct a loss function of the region classification, a loss function of the entity extraction, and a loss function of the entity relationship classification; and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
The present application further provides a terminal device, including:
a memory for storing a computer program;
a processor for implementing the steps of the resume document information extraction method when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the resume document information extraction method as described above.
The application provides a resume document information extraction method, which comprises the following steps: performing training corpus construction processing according to original resume document data to obtain training data; constructing a Transformer model according to the multi-feature Embedding layer, and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features; training the initial information extraction model according to the training data to obtain an information extraction model; processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
The training data are extracted from the original data, then an initial information extraction model of a multi-feature Embedding layer is constructed, finally the training data are adopted to train the initial information extraction model to obtain the information extraction model, and finally extraction processing is carried out, so that a plurality of entities in the resume document, information of each entity and classification results of the entities can be obtained, the effect of information extraction processing on the high-complexity document is improved, and the accuracy and precision of extraction are improved.
The application further provides a resume document information extraction device, a terminal device and a computer readable storage medium, which have the above beneficial effects, and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a method for extracting resume document information according to an embodiment of the present application;
fig. 2 is a schematic diagram of a model structure of a resume document information extraction method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the page-number encoding of a method for extracting resume document information according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating feature fusion of a resume document information extraction method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a resume document information extraction device according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a resume document information extraction method, a resume document information extraction device, a terminal device and a computer-readable storage medium, so as to improve the accuracy and precision of information extraction from resume documents and improve the extraction effect.
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. It is obvious that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
In the related art, information is generally extracted in a rule-based manner. For example, a work experience (or an educational experience, or a project experience) often contains multiple sub-experiences, and each sub-experience contains information such as a start time, an end time, an employer and a position; in the information structuring process the several pieces of information belonging to one sub-experience often need to be output together as a group, and judging this with a proximity rule often leads to errors, a proliferation of rules, difficulty of extension, and similar problems. That is to say, as resume documents grow ever more complex, existing resume information extraction suffers from reduced parsing accuracy and precision and from inaccurate extraction.
Therefore, the resume document information extraction method provided by the application extracts training data from the original data, constructs an initial information extraction model with a multi-feature Embedding layer, trains the initial information extraction model with the training data to obtain the information extraction model, and then performs extraction, so that the plurality of entities in the resume document, the information of each entity and the classification results of the entities can be obtained; this improves the effect of information extraction on highly complex documents and improves the accuracy and precision of extraction.
The following describes a method for extracting resume document information according to an embodiment.
Referring to fig. 1, fig. 1 is a flowchart of a method for extracting resume document information according to an embodiment of the present disclosure.
In this embodiment, the method may include:
s101, performing training corpus construction processing according to original resume document data to obtain training data;
the step aims to carry out construction processing on the training corpus according to the original resume document data to obtain training data. That is, the training data suitable for the application in the embodiment is constructed so as to achieve better information extraction on the resume document.
Further, the step may include:
step 1, extracting text boxes from the original resume document data to obtain a plurality of text boxes;
and 2, constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain training data.
It can be seen that the present alternative is primarily illustrative of how training data is constructed. In the alternative scheme, the text box extraction is carried out on the original resume document data to obtain a plurality of text boxes, and training corpus construction is carried out based on the information of each text box and the corresponding entity classification to obtain training data. That is, data is extracted therefrom and training data is constructed based on the extracted entity classification.
S102, constructing a Transformer model according to the multi-feature Embedding layer, and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
On the basis of S101, this step aims to construct a Transformer model according to a multi-feature Embedding layer and take the Transformer model as an initial information extraction model, where the multi-feature Embedding layer is constructed from position features, layout features, image features and page features. In this embodiment, in order to improve the accuracy and effect of document recognition in the information extraction process, additional feature information is fused into the model, thereby improving the accuracy with which the model recognizes features.
The Transformer model is a model which utilizes an attention mechanism to improve the training speed of the model.
The purpose of the Embedding layer is to project high-dimensional data, which is relatively sparse in each dimension, into relatively low-dimensional data in which each dimension takes real values and can be operated on. In essence, a continuous space is used instead of a (quasi-)discrete space, improving space utilization.
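As a minimal illustration of this idea (not taken from the application), the sketch below uses PyTorch's nn.Embedding to map sparse integer identifiers into a dense, low-dimensional real-valued space; the vocabulary size and dimension are example values.

import torch
import torch.nn as nn

# A sparse, high-cardinality ID space (e.g. 30,000 tokens) is mapped to
# dense 768-dimensional real-valued vectors that the model can operate on.
embedding = nn.Embedding(num_embeddings=30000, embedding_dim=768)
token_ids = torch.tensor([[101, 2769, 3221, 102]])   # a toy ID sequence
dense = embedding(token_ids)                         # shape: (1, 4, 768)
print(dense.shape)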
Further, the step may include:
step 1, respectively constructing a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer and a page feature embedding layer;
and step 2, fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain an initial information extraction model.
It can be seen that this alternative mainly illustrates how the information extraction model is constructed. In this alternative, a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer and a page feature embedding layer are constructed respectively, and then fused into a Transformer model to obtain the initial information extraction model.
S103, training the initial information extraction model according to the training data to obtain an information extraction model;
on the basis of S102, this step aims to train the initial information extraction model according to the training data, resulting in an information extraction model. In order to implement simultaneous performance of multiple tasks, multiple loss functions may be used to train the model, so that the trained information extraction model may simultaneously infer multiple tasks. Wherein the loss function may include: loss functions of region classification, loss functions of entity extraction, and loss functions of entity relationship classification.
Further, the step may include:
step 1, respectively constructing a loss function of region classification, a loss function of entity extraction and a loss function of entity relation classification;
and 2, training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
It can be seen that the present alternative is primarily illustrative of how the model may be trained. The alternative scheme mainly comprises the steps of respectively constructing a loss function of region classification, a loss function of entity extraction and a loss function of entity relationship classification, and training an initial information extraction model based on the loss function of region classification, the loss function of entity extraction, the loss function of entity relationship classification and training data to obtain an information extraction model. Obviously, in order to implement simultaneous determination of the tasks of region classification, entity extraction, and entity relationship classification in the alternative, the loss function of region classification, the loss function of entity extraction, and the loss function of entity relationship classification are used in this embodiment to train the model together. The specific process of training may refer to any training mode provided in the prior art, and is not specifically limited herein.
S104, processing the resume document to be processed by adopting an information extraction model to obtain an information extraction result; the information extraction result comprises a plurality of entities, information of each entity and classification results of the entities.
On the basis of S103, processing the resume document to be processed by adopting an information extraction model to obtain an information extraction result; the information extraction result comprises a plurality of entities, information of each entity and classification results of the entities.
In summary, in this embodiment, training data are extracted from the original data, an initial information extraction model with a multi-feature Embedding layer is constructed, the initial information extraction model is trained with the training data to obtain the information extraction model, and finally extraction is performed, so that the plurality of entities in the resume document, the information of each entity and the classification results of the plurality of entities can be obtained; this improves the effect of information extraction on highly complex documents and improves the accuracy and precision of extraction.
The following further describes a method for extracting resume document information provided by the present application by using another specific embodiment.
Referring to fig. 2, fig. 2 is a schematic diagram of a model structure of a resume document information extraction method according to an embodiment of the present application.
the document information extraction model in the embodiment is modified on the basis of an Encoder end of a Transformer, and the model training comprises 5 steps: data preparation, multi-modal Embedding layer construction, model construction, multi-task loss function calculation and model optimization.
In fig. 2, T-0 to T-511 are each character of the text sequence, V-0 to V-48 are the image sequence obtained by transforming the image, and P is the page number where the group of text boxes is located.
In this embodiment, the method may include:
step 10, data preparation.
Most of the resume documents handled in this embodiment are PDF (Portable Document Format) files, Word documents and pictures.
Step 11, obtaining the text boxes of a document: the text of each text box, the position coordinates of each text box, and a picture of the current page.
When the document is a PDF, the text is obtained with a PDF parsing tool (a Word document is first converted to PDF), and each page is converted into a picture; when the document is a picture, the text is obtained using OCR (Optical Character Recognition).
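A minimal sketch of this extraction step, assuming pdfplumber as the PDF parsing tool (the application does not name a specific tool): it returns word-level boxes and a rendered page picture; grouping the words into the larger text boxes actually used by the application is an additional step left out here.

import pdfplumber

# Hypothetical helper: for each page, collect text plus coordinates and render
# the page to an image, matching the (text, box, page picture) inputs of step 11.
def extract_text_boxes(pdf_path):
    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages):
            image = page.to_image(resolution=150)    # picture of the current page
            image.save(f"page_{page_no}.png")
            for word in page.extract_words():        # one entry per word-level box
                records.append({
                    "page": page_no,
                    "text": word["text"],
                    "box": (word["x0"], word["top"], word["x1"], word["bottom"]),
                })
    return records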
Step 12, preparing a corpus: (ID, P, T, S, V, L_area, L_entity, L_relation).
Wherein ID refers to an index, P represents the page number, T represents the text information of a group of text boxes, S represents the position coordinate information of each text box on its page, and V represents the picture information of the page where the group of text boxes is located.
L_area represents the area labels corresponding to the group of text boxes: basic information, job-seeking intention, work experience, educational experience, project experience, professional skill certificates, family members. L_entity represents the entities extracted from the group of text boxes; there are 56 entity types, corresponding to the different areas respectively, and the specific correspondence is shown in the following table. L_relation indicates whether there is a relationship between text boxes of the group and other text boxes in the batch, 0 representing no relationship and 1 representing a relationship.
The set of area labels is denoted S_area, and the set of entity labels is denoted S_entity.
For example:
{ ID: 0,
  P: 0,
  T: [
    "Zhang San", "July 21, 1987",
    "Work experience", "2017.10 - present  Calf Technology",
    "Position: risk control model manager",
    "2016.05 - 2017.10  Qianglong Finance", "Position: data scientist"
  ],
  S: [
    [12,22,28,26], [12,27,35,31], [13,53,48,62],
    [24,66,49,72], [45,74,59,79], [24,81,51,88], [46,92,68,100]
  ],
  V: numpy.array,
  L_area: [
    "basic information", "work experience"
  ],
  L_entity: [
    "Zhang San", "July 21, 1987",
    "2017.10", "present", "Calf Technology", "risk control model manager",
    "2016.05", "2017.10", "Qianglong Finance", "data scientist"
  ],
  L_relation: [
    [{"id": 0, "index": 3}, {"id": 0, "index": 4}],
    [{"id": 0, "index": 5}, {"id": 0, "index": 6}],
    [{"id": 0, "index": 5}, {"id": 1, "index": 0}],
    [{"id": 0, "index": 6}, {"id": 1, "index": 0}]
    # here id refers to the index of a sample in the batch, and index denotes the text box index
  ]
}
In the real data, L_area is a list of label indices, e.g. [0, 3], where 0 and 3 respectively represent the index values corresponding to the area labels; L_entity is a BIO label sequence, e.g. <O O B I I O B O O O O O B I I I I I O>, in which each B, I or O corresponds to one character.
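A minimal sketch of how a character-level BIO label sequence of this kind can be produced from the text of a box and known entity spans; the helper name and the (start, end) character-offset span format are illustrative, not taken from the application.

def to_bio(text, spans):
    # spans: list of (start, end) character offsets with end exclusive; labels are
    # plain B/I/O as in the example above (typed labels such as B-NAME work the same way)
    labels = ["O"] * len(text)
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

# Example: a name span and a birth-date span inside one text box (hypothetical offsets)
print(to_bio("Zhang San 1987.07.21", [(0, 9), (10, 20)]))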
Step 20, constructing the multi-modal Embedding layer.
The Embedding of the document information extraction model of this embodiment is composed of Word-Embedding, Position-Embedding, Segment-Embedding, Spatial-Embedding, Visual-Embedding and Page-Embedding, which are finally fused into Fusion-Embedding.
Wherein, word-Embedding and Position-Embedding are owned by the transducer, and Segment-Embedding, spatial-Embedding, visual-Embedding, page-Embedding and Fusion-Embedding are mainly introduced here.
Step 21, segment embedding Segment-Embedding.
A segment-type embedding dictionary segment_type_embedding with three entries is initialized, representing text, image and page number respectively; when a token is text the first entry is taken, when the token is an image the second entry is taken, and when the token is a page number the third entry is taken.
Step 22, layout embedding Spatial-Embedding.
The layout embedding is formed from the coordinates of the text box. The coordinates of a text box are composed of (x1, y1, x2, y2, w, h), where x1 and y1 are the coordinates of the upper-left corner of the text box, x2 and y2 are the coordinates of the lower-right corner, and w and h are the width and height of the text box.
An embedding dictionary is initialized for each of x, y, w and h, corresponding to the abscissa, the ordinate, the width and the height respectively. The embeddings corresponding to the six values of the text box coordinates are looked up and then concatenated to form the Spatial-Embedding, as sketched below.
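A minimal sketch of such a layout embedding: one embedding dictionary each for x, y, w and h, looked up for the six box values and concatenated. The coordinate range and the per-dictionary dimension are assumptions (6 x 128 = 768 is chosen only to match the model dimension).

import torch
import torch.nn as nn

class SpatialEmbedding(nn.Module):
    def __init__(self, max_coord=1000, dim=128):
        super().__init__()
        self.x_emb = nn.Embedding(max_coord, dim)   # abscissa dictionary
        self.y_emb = nn.Embedding(max_coord, dim)   # ordinate dictionary
        self.w_emb = nn.Embedding(max_coord, dim)   # width dictionary
        self.h_emb = nn.Embedding(max_coord, dim)   # height dictionary

    def forward(self, box):
        # box: (..., 6) integer tensor holding (x1, y1, x2, y2, w, h)
        x1, y1, x2, y2, w, h = box.unbind(-1)
        parts = [self.x_emb(x1), self.y_emb(y1),
                 self.x_emb(x2), self.y_emb(y2),
                 self.w_emb(w), self.h_emb(h)]
        return torch.cat(parts, dim=-1)             # Spatial-Embedding, size 6 * 128

# usage: SpatialEmbedding()(torch.tensor([12, 22, 28, 26, 16, 4]))  ->  shape (768,)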
Step 23, image embedding Visual-Embedding.
The image embedding is formed from the picture into which the page containing the text boxes is converted. The picture is encoded by a ResNeXt-FPN network, the resulting feature map is flattened row by row and average-pooled, and the feature sequence corresponding to the picture is then obtained through a linear projection.
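A minimal sketch of this image branch with a stand-in backbone: the application uses a ResNeXt-FPN encoder, while the sketch below uses a plain torchvision ResNeXt only to illustrate the encode, pool, flatten and project flow that yields the 49-token image sequence V-0 to V-48; the fixed 7 x 7 grid and the projection size are assumptions.

import torch
import torch.nn as nn
import torchvision

class VisualEmbedding(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        backbone = torchvision.models.resnext50_32x4d(weights=None)
        # drop the global pooling and classification head, keep the feature map
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d((7, 7))    # fix the grid to 7 x 7 = 49 tokens
        self.proj = nn.Linear(2048, dim)            # linear projection to the model dimension

    def forward(self, page_image):
        # page_image: (batch, 3, H, W) picture of the page containing the text boxes
        feat = self.pool(self.encoder(page_image))  # (batch, 2048, 7, 7)
        feat = feat.flatten(2).transpose(1, 2)      # (batch, 49, 2048), flattened row by row
        return self.proj(feat)                      # (batch, 49, dim)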
Step 24, page-number embedding Page-Embedding.
For the Page-Embedding of the page number P, in order to avoid page-number sparsity and insufficient page-number coverage, a page-number encoding method is designed:
First, an embedding dictionary for the digits 0 to 9 is initialized; second, the page number is encoded as a 4-digit code, for example 0001 for page 1 and 0012 for page 12 (page numbers above 9999 are truncated, although page numbers generally do not exceed 9999).
Referring to fig. 3, fig. 3 is a schematic page encoding diagram of a resume document information extraction method according to an embodiment of the present application.
As can be seen, a one-dimensional convolution (CNN, Convolutional Neural Network) with a kernel width of 2 is applied to the page-code sequence and followed by average pooling, finally giving the Page-Embedding of the page.
In fig. 3, the CNN includes a one-dimensional convolution conv1d and an average pooling layer average-pool1d; the vectors for the digits 0 to 9 are the representations of the page-number digits, and the Page-Embedding of the page number is obtained through the CNN.
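A minimal sketch of this page-number encoding: a 10-entry digit dictionary, a zero-padded 4-digit code, a one-dimensional convolution of width 2 and average pooling; the embedding dimension and module layout are assumptions.

import torch
import torch.nn as nn

class PageEmbedding(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.digit_emb = nn.Embedding(10, dim)           # dictionary for digits 0 to 9
        self.conv = nn.Conv1d(dim, dim, kernel_size=2)   # one-dimensional convolution, width 2
        self.pool = nn.AdaptiveAvgPool1d(1)              # average pooling over the code

    def forward(self, page_number):
        # page_number: int, truncated to at most 9999 and zero-padded to 4 digits
        code = f"{min(page_number, 9999):04d}"           # e.g. 1 -> "0001", 12 -> "0012"
        digits = torch.tensor([[int(c) for c in code]])  # (1, 4)
        emb = self.digit_emb(digits).transpose(1, 2)     # (1, dim, 4)
        out = self.pool(self.conv(emb))                  # (1, dim, 1)
        return out.squeeze(-1)                           # (1, dim) Page-Embedding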
Step 25, fusion embedding Fusion-Embedding.
After Segment-Embedding, Spatial-Embedding, Visual-Embedding and Page-Embedding are obtained, they are fused with Word-Embedding and Position-Embedding.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating feature fusion of a resume document information extraction method according to an embodiment of the present application.
In fig. 4, the text feature sequence length is 512, the image feature sequence length is 49, and the page feature sequence length is 1.
As shown in fig. 4, for the Segment segmentation feature, the segments of the text are all 0, the segments of the image are all 1, and the segment of the page number is 2;
for the Token text/image/page-number feature, the text corresponds to word-embedding, the image corresponds to visual-embedding, and the page number corresponds to page-embedding;
for the Position feature, the text occupies positions 0 to 511, the image occupies positions 512 to 560, and the page number occupies position 561;
for the Layout feature, the text corresponds to the spatial-embedding of its text box, while the image and the page number correspond to the spatial-embedding of (0, 0).
As can be seen, the processing of the data may include:
first, for each of the Segment segmentation feature, the Token text/image/page-number feature, the Position feature and the Layout feature, the embeddings of the text, the image and the page number are concatenated respectively;
second, the four resulting Embeddings are added directly to obtain the Fusion-Embedding.
The text here is composed of a plurality of text boxes.
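A minimal sketch of these two fusion steps, assuming the per-modality embeddings have already been computed; the tensor shapes follow fig. 4 (text 512 tokens, image 49 tokens, page number 1 token) and the function name is illustrative.

import torch

def fuse_embeddings(text_parts, image_parts, page_parts):
    # Each argument is a dict with keys "segment", "token", "position", "layout",
    # holding tensors of shape (512, dim), (49, dim) and (1, dim) respectively.
    fused = 0
    for key in ("segment", "token", "position", "layout"):
        # step 1: concatenate the text, image and page-number embeddings of this feature
        seq = torch.cat([text_parts[key], image_parts[key], page_parts[key]], dim=0)
        # step 2: add the four concatenated embeddings together
        fused = fused + seq
    return fused  # Fusion-Embedding of shape (562, dim)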
Step 30, constructing the model.
The document information extraction model is constructed on the Encoder side of the Transformer model and comprises 12 Transformer units; the Embedding layer vector is 768-dimensional, the model has about 5 million parameters, and the Embedding layer is constructed from four features: text, layout, image and page number.
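A minimal sketch of an Encoder stack of this size using PyTorch's built-in Transformer encoder; the head count and feed-forward width are assumptions, since the application only fixes 12 units and a 768-dimensional Embedding vector.

import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,          # dimension of the fused Embedding layer
    nhead=12,             # assumed head count (not specified in the application)
    dim_feedforward=3072, # assumed feed-forward width
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # 12 Transformer units

# fusion_embedding: (batch, 562, 768) from the multi-modal Embedding layer
# hidden = encoder(fusion_embedding)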
Step 40, calculating the multi-task loss function.
The document information extraction model of the embodiment has 3 training tasks: regional multi-label classification, entity extraction and relationship classification.
The area multi-label classification determines which areas the group of text boxes belongs to, for example the areas <basic information> and <job-seeking intention>; here the first character T-0 is given a multi-label classification, which is a label classification task.
Entity extraction judges the entity type, such as a <name> entity, at the character level of the text sequence, and is a sequence labeling task.
Relationship classification judges whether a relationship exists between each pair of text boxes in the text sequence, and is a binary classification task.
Step 41, the loss function of the regional multi-label classification is:
Sentence area loss = -Σ_{tag ∈ S_area} [ I(tag)·log(P_tag) + (1 - I(tag))·log(P'_tag) ],  with P_tag + P'_tag = 1,
where I is an indicator function, tag is a label in the area label set S_area, P_tag is the probability that the T-0 character is predicted as tag, and P'_tag is the probability that the T-0 character is predicted as not tag.
Step 42, the loss function of entity extraction is:
Sentence entity loss = -Σ_{token ∈ T} Σ_{tag ∈ S_entity} I(y_token = tag)·log(P_tag(token)),
where I is an indicator function, token is a character of the T sequence (the text sequence), tag is a label in the entity label set S_entity, P_tag(token) is the probability that token is predicted as tag, and y_token is the real label of token in T.
Step 43, the loss function of the relationship classification is:
Sentence relation loss = -Σ_{(i,j)} [ I(y_ij = 0)·log(P_0-ij) + I(y_ij = 1)·log(P_1-ij) ],  with P_0-ij + P_1-ij = 1,
where I is an indicator function, P_0-ij is the probability that the pair (i, j) is predicted as class 0 (no relationship), and P_1-ij is the probability that it is predicted as class 1 (related).
Here i and j are two different entities belonging to one batch; they may come from one sample or from different samples.
Step 44, adding the above 3 loss functions gives the total loss function for training the document information extraction model of this embodiment:
Total loss = Sentence area loss + Sentence entity loss + Sentence relation loss.
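A minimal sketch of the three loss terms and their sum, using standard cross-entropy losses consistent with the formulas above; the head names and tensor shapes are illustrative, not taken from the application.

import torch.nn.functional as F

def total_loss(area_logits, area_labels, entity_logits, entity_labels,
               relation_logits, relation_labels):
    # Region multi-label classification on the T-0 character: one independent
    # binary decision per area label in S_area (multi-hot area_labels).
    area_loss = F.binary_cross_entropy_with_logits(area_logits, area_labels.float())
    # Entity extraction: character-level classification over the entity/BIO label set.
    entity_loss = F.cross_entropy(entity_logits.view(-1, entity_logits.size(-1)),
                                  entity_labels.view(-1))
    # Relation classification: binary decision (related / not related) per text-box pair.
    relation_loss = F.cross_entropy(relation_logits.view(-1, 2),
                                    relation_labels.view(-1))
    return area_loss + entity_loss + relation_loss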
Step 50, optimizing the model.
The Transformer is trained with a deep-learning model optimization framework using the prepared training corpus, the multi-modal Embedding layer and the multi-task loss function, finally yielding the resume document information extraction model. Compared with a model relying only on plain-text features, the accuracy of entity extraction is improved by 20%, the accuracy of relation extraction is improved by 15%, and the overall accuracy reaches 95%, fully meeting current requirements and expectations.
Step 60, document information extraction model inference.
The inference of the resume document information extraction model in this embodiment mainly comprises two steps: entity extraction and relationship classification.
Step 61, entity extraction.
The text boxes and picture information of the resume document (that is, data prepared in the same way as in step 10, but without the label information) are fed into the extraction model to obtain the entity extraction result; if the entity extraction result is empty, the procedure ends, otherwise it proceeds to step 62.
Step 62, relationship classification.
The relationship results between each pair of text boxes containing the extracted entities are taken out and summarized, and the related entities are divided into groups.
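A minimal sketch of this grouping step: pairwise relation predictions are merged so that entities connected by a relation fall into one group; the union-find implementation and data layout are illustrative, not taken from the application.

def group_entities(num_entities, related_pairs):
    # related_pairs: list of (i, j) index pairs predicted as class 1 (related)
    parent = list(range(num_entities))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for i, j in related_pairs:
        parent[find(i)] = find(j)           # merge the two groups

    groups = {}
    for idx in range(num_entities):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())

# e.g. boxes 3-4 and 5-6 related: -> [[0], [1], [2], [3, 4], [5, 6]]
print(group_entities(7, [(3, 4), (5, 6)]))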
During training, the data of one batch belong to one document (the text of one document is taken out at random), and in each epoch the text boxes of the document are randomly shuffled before being put into training (so one batch may span pages, which supports learning cross-page hierarchical relationships).
Therefore, this embodiment extracts training data from the original data, constructs an initial information extraction model with a multi-feature Embedding layer, trains the initial information extraction model with the training data to obtain the information extraction model, and finally performs extraction, so that the plurality of entities in the resume document, the information of each entity and the classification results of the entities can be obtained; this improves the effect of information extraction on highly complex documents and improves the accuracy and precision of extraction.
In the following, the resume document information extraction device provided in the embodiment of the present application is introduced, and the resume document information extraction device described below and the resume document information extraction method described above may be referred to in correspondence with each other.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a resume document information extraction device according to an embodiment of the present application.
In this embodiment, the apparatus may include:
a training data acquisition module 100, configured to perform training corpus construction processing according to original resume document data to obtain training data;
the model building module 200 is used for building a Transformer model according to the multi-feature Embedding layer and taking the model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
the model training module 300 is configured to train the initial information extraction model according to the training data to obtain an information extraction model;
the document information extraction module 400 is configured to process the resume document to be processed by using an information extraction model to obtain an information extraction result; the information extraction result comprises a plurality of entities, information of each entity and classification results of the entities.
Optionally, the training data obtaining module 100 is specifically configured to perform text box extraction on the original resume document data to obtain a plurality of text boxes; and constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain training data.
Optionally, the model building module 200 is specifically configured to respectively build a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer, and a page feature embedding layer; and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain an initial information extraction model.
Optionally, the model training module 300 is specifically configured to respectively construct a loss function of region classification, a loss function of entity extraction, and a loss function of entity relationship classification; and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
An embodiment of the present application further provides a terminal device, including:
a memory for storing a computer program;
a processor for implementing the steps of the resume document information extraction method as described in the above embodiments when executing the computer program.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the resume document information extraction method according to the above embodiment are implemented.
The embodiments are described in a progressive mode in the specification, the emphasis of each embodiment is on the difference from the other embodiments, and the same and similar parts among the embodiments can be referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above provides a method for extracting resume document information, a device for extracting resume document information, a terminal device and a computer readable storage medium. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A resume document information extraction method is characterized by comprising the following steps:
performing corpus construction processing according to original resume document data to obtain training data;
constructing a Transformer model according to the multi-feature Embedding layer, and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
training the initial information extraction model according to the training data to obtain an information extraction model;
processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
2. The method for extracting resume document information according to claim 1, wherein performing corpus construction processing on original resume document data to obtain training data comprises:
extracting text boxes from the original resume document data to obtain a plurality of text boxes;
and constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain the training data.
3. The resume document information extraction method according to claim 1, wherein constructing a Transformer model according to a multi-feature Embedding layer and serving as an initial information extraction model comprises:
respectively constructing a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer and a page feature embedding layer;
and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain the initial information extraction model.
4. The resume document information extraction method of claim 1, wherein training the initial information extraction model according to the training data to obtain an information extraction model comprises:
respectively constructing a loss function of region classification, a loss function of entity extraction and a loss function of entity relation classification;
and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
5. A resume document information extraction device, characterized by comprising:
the training data acquisition module is used for constructing and processing training corpora according to original resume document data to obtain training data;
the model building module is used for building a Transformer model according to the multi-feature Embedding layer and taking the Transformer model as an initial information extraction model; the multi-feature Embedding layer is constructed by position features, layout features, image features and page features;
the model training module is used for training the initial information extraction model according to the training data to obtain an information extraction model;
the document information extraction module is used for processing the resume document to be processed by adopting the information extraction model to obtain an information extraction result; wherein the information extraction result includes a plurality of entities, information of each entity, and a classification result of the plurality of entities.
6. The resume document information extraction device according to claim 5, wherein the training data acquisition module is specifically configured to perform text box extraction on the original resume document data to obtain a plurality of text boxes; and constructing a training corpus based on the information of each text box and the corresponding entity classification to obtain the training data.
7. The resume document information extraction device of claim 5, wherein the model construction module is specifically configured to respectively construct a position feature embedding layer, a layout feature embedding layer, an image feature embedding layer, and a page feature embedding layer; and fusing the position feature embedding layer, the layout feature embedding layer, the image feature embedding layer and the page feature embedding layer into a Transformer model to obtain the initial information extraction model.
8. The apparatus according to claim 5, wherein the model training module is configured to respectively construct a loss function for region classification, a loss function for entity extraction, and a loss function for entity relationship classification; and training the initial information extraction model based on the loss function of the region classification, the loss function of the entity extraction, the loss function of the entity relation classification and the training data to obtain the information extraction model.
9. A terminal device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the resume document information extraction method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, implements the steps of the resume document information extraction method of any one of claims 1 to 4.
CN202210826700.0A 2022-07-14 2022-07-14 Resume document information extraction method and related device Pending CN115203415A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210826700.0A CN115203415A (en) 2022-07-14 2022-07-14 Resume document information extraction method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210826700.0A CN115203415A (en) 2022-07-14 2022-07-14 Resume document information extraction method and related device

Publications (1)

Publication Number Publication Date
CN115203415A true CN115203415A (en) 2022-10-18

Family

ID=83580504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210826700.0A Pending CN115203415A (en) 2022-07-14 2022-07-14 Resume document information extraction method and related device

Country Status (1)

Country Link
CN (1) CN115203415A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311320A (en) * 2023-05-22 2023-06-23 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device
CN116311320B (en) * 2023-05-22 2023-08-22 建信金融科技有限责任公司 Training method of text image fusion layer, text image recognition method and device

Similar Documents

Publication Publication Date Title
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN114821622B (en) Text extraction method, text extraction model training method, device and equipment
CN110490081B (en) Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network
CN115146488B (en) Variable business process intelligent modeling system and method based on big data
CN114580424B (en) Labeling method and device for named entity identification of legal document
CN110781672A (en) Question bank production method and system based on machine intelligence
CN115526259A (en) Training method and device for multi-mode pre-training model
CN114419642A (en) Method, device and system for extracting key value pair information in document image
CN112269872B (en) Resume analysis method and device, electronic equipment and computer storage medium
CN113204615A (en) Entity extraction method, device, equipment and storage medium
CN110705459A (en) Automatic identification method and device for mathematical and chemical formulas and model training method and device
CN114596566A (en) Text recognition method and related device
CN113672731A (en) Emotion analysis method, device and equipment based on domain information and storage medium
CN113935314A (en) Abstract extraction method, device, terminal equipment and medium based on heteromorphic graph network
CN115203415A (en) Resume document information extraction method and related device
CN114639109A (en) Image processing method and device, electronic equipment and storage medium
Al Ghamdi A novel approach to printed Arabic optical character recognition
CN115810215A (en) Face image generation method, device, equipment and storage medium
CN115455955A (en) Chinese named entity recognition method based on local and global character representation enhancement
CN115270792A (en) Medical entity identification method and device
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system
CN114092931A (en) Scene character recognition method and device, electronic equipment and storage medium
CN110852102B (en) Chinese part-of-speech tagging method and device, storage medium and electronic equipment
CN114298032A (en) Text punctuation detection method, computer device and storage medium
WO2021137942A1 (en) Pattern generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination