CN115690795A - Resume information extraction method and device, electronic equipment and storage medium - Google Patents

Resume information extraction method and device, electronic equipment and storage medium

Info

Publication number
CN115690795A
Authority
CN
China
Prior art keywords
text
text line
image
line
resume
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211364710.3A
Other languages
Chinese (zh)
Inventor
王根
刘驰
蒋磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202211364710.3A priority Critical patent/CN115690795A/en
Publication of CN115690795A publication Critical patent/CN115690795A/en
Pending legal-status Critical Current

Landscapes

  • Character Input (AREA)

Abstract

The invention provides a resume information extraction method and device, electronic equipment, and a storage medium, belonging to the technical field of image processing. The resume information extraction method comprises the following steps: performing text line detection and text recognition on a target resume image to obtain position information of each text line in the target resume image and a text recognition result corresponding to each text line; sorting the text lines in the target resume image based on the position information and the text recognition results to obtain sorted text line content; and performing encoding and decoding on the sorted text line content to obtain text line structured information of the target resume image. By obtaining the text line structured information of the resume image, the method is not restricted to a particular resume source, format, or language, so that useful information can be extracted accurately from resumes of many styles and types, resume screening efficiency can be improved, and the screening error rate can be reduced.

Description

Resume information extraction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to a resume information extraction method and device, electronic equipment and a storage medium.
Background
When looking for a job, a job seeker generally uploads a personal resume to a recruiting company's system or web page by scanning or photographing it. Because the resume forms the interviewer's first impression of the job seeker, most job seekers design their resumes individually to highlight their own value or strengths, so uploaded resumes follow no fixed format. During recruitment, the relevant staff must extract useful information from a large number of resumes of many styles and types and screen out those that meet the interview requirements; this workload is heavy, inefficient, and error-prone. How to extract useful information from resumes of many styles and types is therefore a problem to be solved.
Disclosure of Invention
The invention provides a resume information extraction method, a resume information extraction device, electronic equipment and a storage medium, which are used for solving the problem of how to extract useful information from resumes with various characteristics and various types.
The invention provides a resume information extraction method, which comprises the following steps:
performing text line detection and text recognition on a target resume image to obtain position information of each text line in the target resume image and a text recognition result corresponding to each text line;
sequencing all text lines in the target resume image based on the position information of all the text lines and the text recognition result to obtain the content of the sequenced text lines;
and carrying out coding and decoding processing on the content of the sequenced text lines to obtain the text line structured information of the target resume image.
According to the resume information extraction method provided by the invention, the text line detection and the text recognition are carried out on the target resume image to obtain the position information of each text line in the target resume image and the text recognition result corresponding to each text line, and the method comprises the following steps:
performing text line detection on a target resume image to obtain position information of each text line in the target resume image;
correcting each text line in the target resume image according to the position information of each text line to obtain a text line correction image;
and performing text recognition on the text line corrected image to obtain a text line recognition result corresponding to each text line.
According to the resume information extraction method provided by the invention, the correcting is performed on each text line in the target resume image according to the position information of each text line to obtain a text line corrected image, and the method comprises the following steps:
obtaining line images corresponding to the text lines in the target resume image by adopting a circumscribed rectangle method according to the position information of the text lines;
according to the line image corresponding to each text line, obtaining a rotation image corresponding to each text line by adopting a minimum external rectangle method;
calculating the rotation angle of each text line based on the rotation image corresponding to each text line;
performing affine transformation on each text line based on the rotation angle of each text line to obtain a line image corresponding to each text line subjected to rotation correction;
and removing the interference information in the line image corresponding to each text line subjected to the rotation correction by adopting a mask image to obtain a text line correction image.
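The geometric part of the correction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes each detected text line is already given as a four-point box (standing in for the minimum bounding rectangle), estimates the rotation angle from the longest edge of that box, and applies the corresponding rotation, which is one kind of affine transform, to the points. The function names are hypothetical, and the mask-based removal of interference information is not covered.

```python
import math

def bounding_rect(points):
    """Axis-aligned circumscribed rectangle as (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def rotation_angle(quad):
    """Angle in degrees of the longest edge of a 4-point box, folded into (-90, 90]."""
    best_len, best_angle = -1.0, 0.0
    for i in range(4):
        (x0, y0), (x1, y1) = quad[i], quad[(i + 1) % 4]
        length = math.hypot(x1 - x0, y1 - y0)
        if length > best_len:
            best_len = length
            best_angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
    # An edge and its reverse describe the same text direction, so fold the angle.
    if best_angle > 90:
        best_angle -= 180
    elif best_angle <= -90:
        best_angle += 180
    return best_angle

def rotate_points(points, angle_deg, center):
    """Rotate points by -angle_deg about center so a line tilted by angle_deg becomes horizontal."""
    theta = math.radians(-angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return rotated
```

In practice the same rotation would be applied to the pixels of the line image rather than only to its corner points, for example with an image library's affine warp.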
According to the resume information extraction method provided by the invention, the step of sequencing each text line in the target resume image based on the position information of each text line and the text recognition result to obtain the content of the sequenced text lines comprises the following steps:
determining the height direction coincidence degree between semantic blocks in the target resume image based on the position information of each text line;
sorting the semantic blocks based on the height direction coincidence degree between the semantic blocks in the target resume image;
and sequencing the text lines in each semantic block after sequencing to obtain the content of the sequenced text lines.
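One plausible way to use the height-direction coincidence degree for sorting is sketched below. The text does not give the exact rule, so the following are assumptions made for illustration: semantic blocks are axis-aligned boxes `(x0, y0, x1, y1)`, the coincidence degree is the vertical overlap divided by the height of the shorter block, blocks whose overlap reaches a hypothetical threshold of 0.5 are treated as lying in the same row, and rows are read top to bottom and left to right.

```python
def vertical_overlap(a, b):
    """Height-direction overlap of two boxes, relative to the shorter box."""
    top = max(a[1], b[1])
    bottom = min(a[3], b[3])
    intersection = max(0.0, bottom - top)
    shorter = min(a[3] - a[1], b[3] - b[1])
    return intersection / shorter if shorter > 0 else 0.0

def sort_blocks(blocks, threshold=0.5):
    """Group blocks into rows by vertical overlap, then order rows and columns."""
    rows = []
    # Visiting blocks from top to bottom keeps the rows in top-to-bottom order.
    for block in sorted(blocks, key=lambda b: b[1]):
        for row in rows:
            if vertical_overlap(row[0], block) >= threshold:
                row.append(block)
                break
        else:
            rows.append([block])
    ordered = []
    for row in rows:
        # Within a row, read from left to right.
        ordered.extend(sorted(row, key=lambda b: b[0]))
    return ordered
```

The same comparison can then be applied to the text lines inside each sorted block.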
According to the resume information extraction method provided by the invention, the encoding and decoding processing is performed on the content of the sequenced text lines to obtain the text line structured information of the target resume image, and the method comprises the following steps:
obtaining text features based on the sequenced text line contents, obtaining layout features corresponding to the text lines based on the position information of the text lines and the target resume image, and obtaining image features based on the target resume image;
performing feature fusion on the text features, the layout features and the image features to obtain fused text line features;
and carrying out hierarchical relation reasoning on the fused text line characteristics to obtain the text line structural relation of the target resume image.
According to the resume information extraction method provided by the invention, the hierarchical relationship reasoning is carried out on the fused text line characteristics to obtain the text line structural relationship of the target resume image, and the method comprises the following steps:
determining, based on the fused text line features, a title and a parent node corresponding to each text line and the relation between each text line and its parent node;
for each text line, determining the hierarchical information of the text line according to its corresponding title and its relation to the parent node;
and obtaining the text line structured relation of the target resume image based on the hierarchical information of each text line.
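Once a parent node has been predicted for each text line, the hierarchy level can be derived by walking up the parent links. The sketch below is a simplified illustration, not the inference model itself: it assumes the parents are given as a list in which entry `i` is the index of the parent of text line `i` (or `None` for a top-level line), and it ignores the relation type, treating every parent link as one level of nesting. The function name is hypothetical.

```python
def compute_levels(parents):
    """Return the 1-based hierarchy level of each text line from its parent index."""
    levels = {}

    def level_of(index, seen=()):
        if index in levels:
            return levels[index]
        if index in seen:
            raise ValueError("cycle in parent links at text line %d" % index)
        parent = parents[index]
        if parent is None:
            levels[index] = 1  # top-level line, e.g. a first-level title
        else:
            levels[index] = level_of(parent, seen + (index,)) + 1
        return levels[index]

    return [level_of(i) for i in range(len(parents))]
```

For example, with parents `[None, 0, 1, 1, None]` the first and last lines are level 1, the second is level 2, and the third and fourth are level 3.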
According to the resume information extraction method provided by the invention, the method further comprises the following steps:
according to the text line structural relation of the target resume image, checking semantic blocks to which the text lines belong to obtain a checking result;
and processing the semantic block according to the checking result.
The invention also provides a resume information extraction device, comprising:
the text detection and identification module is used for carrying out text line detection and text identification on the target resume image to obtain the position information of each text line in the target resume image and a text identification result corresponding to each text line;
the text sorting module is used for sorting the text lines in the target resume image based on the position information of the text lines and the text recognition result to obtain the contents of the sorted text lines;
and the coding and decoding processing module is used for coding and decoding the content of the sequenced text lines to obtain the text line structured information of the target resume image.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the resume information extraction method.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the resume information extraction method as described in any one of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the resume information extraction method as described in any of the above.
According to the resume information extraction method, the resume information extraction device, the electronic equipment and the storage medium, the text line detection and the text recognition are carried out on the target resume image, and the position information and the text recognition result of each text line are obtained; sequencing each text line in the target resume image based on the position information of each text line and the text recognition result to obtain the content of the sequenced text lines; the content of the sequenced text lines is coded and decoded to obtain the text line structured information of the target resume image, so that various limits on the resumes can be broken through, useful information can be extracted from various resumes with various characteristics, the resume screening efficiency can be effectively improved, and the screening error rate can be reduced.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a diagram of an application environment in which the resume information extraction method provided by the embodiment of the present invention can operate;
FIG. 2 is a schematic flow chart of the resume information extraction method provided by the present invention;
FIG. 3 is an exemplary diagram of a target resume image provided by an embodiment of the present invention;
FIG. 4 is an exemplary diagram of a text line provided by an embodiment of the present invention;
FIG. 5 is an exemplary diagram of text line structured information provided by an embodiment of the present invention;
FIG. 6 is a schematic flow chart of performing text line detection and text recognition on a target resume image according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of correcting text lines in a target resume image according to an embodiment of the present invention;
FIG. 8 is a schematic flow chart of encoding and decoding the sorted text line content according to an embodiment of the present invention;
FIG. 9 is a schematic flow chart of hierarchical relationship inference according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of the resume information extraction device provided by the present invention;
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In existing resume information extraction schemes, text is mostly extracted directly from an electronic resume and then encoded, and resume information is extracted by a model such as a convolutional neural network or a recurrent neural network. Alternatively, attributes such as education experience, project experience, and work experience are preset, and specific fields such as place names, times, and duties are extracted by different rules. Alternatively, a uniform format such as HTML is defined, and simple structured information such as name, salary, and work location is extracted. These existing schemes restrict the source of the resume, its format, or its language (for example, to Chinese). In complex scenarios such as complex sentences, multi-level nested content, skewed photographing angles, cluttered backgrounds, and English resumes, they cannot extract resume information well, and key information may be omitted.
Therefore, the invention provides a resume information extraction method, a resume information extraction device, electronic equipment and a storage medium, wherein the position information and the text recognition result of each text line are obtained by performing text line detection and text recognition on a target resume image; sequencing all text lines in the target resume image based on the position information of all the text lines and the text recognition result to obtain the content of the sequenced text lines; and carrying out coding and decoding processing on the contents of the sequenced text lines to obtain the text line structured information of the target resume image. According to the method and the device, through obtaining the text line structured information of the resume image, various limits on the resume can be broken through, useful information can be extracted from various types of resumes with various characteristics, the resume screening efficiency can be effectively improved, and the screening error rate can be reduced.
The resume information extraction method provided by the invention can be applied to the application environment shown in FIG. 1. FIG. 1 is a diagram of an application environment in which the resume information extraction method provided by the embodiment of the present invention can operate. As shown in FIG. 1, the application environment includes a terminal 110 and a server 120, which communicate with each other through a network. The network may be a wireless communication network or a wired communication network, and the numbers of terminals and servers are not limited. The wireless communication network may include, but is not limited to, at least one of the following: WIFI (Wireless Fidelity) and Bluetooth. The wired communication network may include, but is not limited to, at least one of the following: a wide area network, a metropolitan area network, and a local area network.
The terminal 110 includes various handheld devices, vehicle-mounted devices, wearable devices, computing devices, or other processing devices connected to a wireless modem that have wireless or wired communication capabilities, such as mobile phones, tablets, desktop computers, notebook computers, and smart devices that can run applications, including the center console of a smart car.
The server 120 may be implemented as a stand-alone server or a server cluster comprising a plurality of servers.
It should be noted that the resume information extraction method of the present invention may be executed directly on the terminal 110 or directly on the server 120; alternatively, it may be executed on the server 120, which then sends the extracted resume information to the terminal 110.
The terminal 110 or the server 120 performs text line detection and text recognition on the target resume image to obtain the position information of each text line and a text recognition result; sorts the text lines in the target resume image based on the position information and the text recognition results to obtain the sorted text line content; and performs encoding and decoding on the sorted text line content to obtain the text line structured information of the target resume image. In this way, restrictions on the resume format are removed, useful information can be extracted from resumes of many styles and types, resume screening efficiency can be improved, and the screening error rate can be reduced. The following description takes execution of the resume information extraction method by the terminal as an example.
Fig. 2 is a schematic flow chart of the resume information extraction method provided by the present invention. As shown in fig. 2, a resume information extraction method is provided, which is described by taking the application of the method to the terminal in fig. 1 as an example, and includes the following steps: step 210, step 220, step 230.
Step 210, performing text line detection and text recognition on a target resume image to obtain position information of each text line in the target resume image and a text recognition result corresponding to each text line;
The target resume image may be obtained as follows: the job seeker uploads a personal resume document, by scanning or photographing, to the recruitment system or back-end server of the recruiting company, and the resume image is then read from the terminal database or server database in which it is stored.
The target resume image may also be obtained as follows: the job seeker's personal resume document is scanned or photographed to produce a resume image, the resume image is sent to a recruitment mailbox of the recruiting company, and the resume image is read from that mailbox.
The target resume image may also be obtained as follows: the job seeker's personal resume document is sent directly to a recruitment mailbox of the recruiting company, and the resume documents of various types in the mailbox are read and converted into images.
FIG. 3 is an exemplary diagram of a target resume image according to an embodiment of the present invention. As shown in FIG. 3, a target resume image may be divided into a plurality of text blocks; A, B, C, D, and E in FIG. 3 each represent one text block.
A text block may include a plurality of text lines having semantic relationships. Fig. 4 is an exemplary diagram of a text line provided by an embodiment of the invention. As shown in fig. 4, the text block of personal information includes a plurality of text lines, wherein each text line corresponds to a number.
The method for detecting the text line and recognizing the text of the target resume image comprises the following two steps:
First, text line detection is performed to obtain position information of the contour points of each text line in the target resume image. The exact position of each text line can be determined from the position information of its contour points.
And secondly, performing text recognition to obtain a text recognition result corresponding to each text line in the target resume image. And the content expressed by each text line can be obtained through the text recognition result.
By performing text line detection and text recognition on the target resume image, the completeness of resume information extraction can be ensured, and information loss possibly existing in text line extraction in a complex scene is fundamentally avoided.
Step 220, sorting the text lines in the target resume image based on the position information of the text lines and the text recognition result to obtain the contents of the sorted text lines;
in this embodiment, in order to improve the readability of the text recognition result and overcome the ambiguity problem caused by the reading order, after the position information of each text line and the text recognition result are obtained, it is necessary to sort the text lines in the target resume image based on the position information of each text line and the text recognition result, so as to obtain the contents of the sorted text lines.
In some embodiments, the position information of each text line and the text recognition result may be input into a semantic block generation model, and a semantic block output by the semantic block generation model is obtained.
It should be noted that the semantic block output by the semantic block generation model includes a plurality of text lines in an ordered manner, and the semantic block may be understood as a paragraph composed of a plurality of text lines. Therefore, the content of the text line after sequencing can be obtained by acquiring the semantic block output by the semantic block generation model.
The semantic block generation model is obtained by training according to the position information samples and the text recognition result samples of all the text lines and the corresponding semantic block samples.
According to the method and the device, the text lines in the target resume image are sequenced based on the position information of the text lines and the text recognition result, the content of the sequenced text lines is obtained, the readability of the text recognition result can be improved, and effective information can be extracted from the resume under the condition that the resume has multiple columns or is nested.
Step 230, performing encoding and decoding on the sorted text line content to obtain the text line structured information of the target resume image.
The text line structured information comprises the hierarchical information of each text line in the target resume image, the category corresponding to the hierarchical information and the hierarchical relation among the text lines.
The hierarchy information and the category corresponding to the hierarchy information may be predefined. For example, level 1 is defined to represent a first-level title, and level 2 is defined to represent a text line content, that is, the category corresponding to level 1 is the first-level title, and the category corresponding to level 2 is the text line content.
In some embodiments, the hierarchical information and the category corresponding to the hierarchical information may be determined or defined according to the reading rules or reading habits for resumes. For example, following general reading habits, text lines that have a semantic relationship are grouped into one text block. The reading order between text blocks is from top to bottom and from left to right. Each text block is then divided into text lines in reading order, and the hierarchical information of each text line and the category corresponding to that hierarchical information are determined. For example, referring to FIG. 4, the most recent job, the highest education/degree, and the personal information in FIG. 4 may be determined to be the first level, and their corresponding category may be defined as first-level title. The position, company, and industry under the most recent job may be defined as the second level, and their corresponding category may be defined as second-level title.
In one embodiment, according to a conventional resume format, the defined hierarchy information and the corresponding category of the hierarchy information are shown in table 1.
TABLE 1 hierarchy information and categories to which the hierarchy information corresponds
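[Table 1 appears as images in the original publication (Figure BDA0003923492420000091 and Figure BDA0003923492420000101); its contents are not reproduced here.]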
It should be noted that the hierarchy information and the category corresponding to the hierarchy information in table 1 are only an example, and are not intended to limit the hierarchy and the category in the embodiment of the present invention.
The hierarchical relationship between the text lines is uncertain and needs to be inferred according to the specific content of the resume.
The hierarchical relationships among text lines include inclusion, parallel, and connection. A connection relationship means that two text lines are combined to form a single whole.
FIG. 5 is an exemplary diagram of text line structured information according to an embodiment of the present invention. As shown in FIG. 5, the work experience heading is level 1, each work experience entry is level 2, and level 3 contains the specific position and job description.
In the embodiment of the present invention, the encoding and decoding processing of the sorted text line content includes: and coding the contents of the sequenced text lines to obtain coded characteristics, and then decoding the coded characteristics to obtain text line structured information. Through the decoding process, reasoning on the hierarchical relationship between the text lines is realized.
The method and the device perform coding and decoding processing on the sequenced text line content to obtain the text line structured information of the target resume image, can be used for subsequently retrieving key information, can also be used for recovering the resume document, or can be used for a resume recommendation system, and the method and the device are not particularly limited in this respect.
In some embodiments, after obtaining the text line structured information of the target resume image, the resume information extraction method further includes:
outputting resume information corresponding to the target resume image, wherein the resume information comprises at least one of the following items: the position information of each text line, the text recognition result corresponding to each text line, the content of the sequenced text lines and the text line structured information.
It can be understood that the position information of each text line, the text recognition result corresponding to each text line, the content of the sorted text lines, the text line structural information, and the like are finally serialized and output for subsequent processing.
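The serialized output described above could look like the following minimal sketch. The record layout is an assumption made for illustration (the text does not specify an output format): each text line is stored with its contour points, recognized text, reading order, hierarchy level, category, and parent index, and the list is written out as JSON. The `TextLine` class and its field names are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple

@dataclass
class TextLine:
    box: List[Tuple[float, float]]  # contour points of the text line
    text: str                       # text recognition result
    order: int                      # position in the sorted reading order
    level: int                      # hierarchy level (1 = first-level title)
    category: str                   # category corresponding to the level
    parent: Optional[int]           # order of the parent line, None at top level

def serialize(lines: List[TextLine]) -> str:
    """Serialize the extracted resume information as a JSON string."""
    return json.dumps([asdict(line) for line in lines],
                      ensure_ascii=False, indent=2)

if __name__ == "__main__":
    demo = [
        TextLine([(0, 0), (200, 0), (200, 30), (0, 30)],
                 "Work experience", 0, 1, "first-level title", None),
        TextLine([(0, 40), (300, 40), (300, 70), (0, 70)],
                 "Software engineer", 1, 2, "second-level title", 0),
    ]
    print(serialize(demo))
```

Setting `ensure_ascii=False` keeps Chinese resume text readable in the output instead of escaping it.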
According to the method, after the sequenced text line contents are obtained, encoding and decoding processing is carried out on the sequenced text line contents to obtain the text line structured information of the target resume image, the problems of multi-column and multi-layer nesting of the resume can be solved, and different resume extraction rules or requirements can be set according to different recruitment requirements of different recruitment companies based on the text line structured information subsequently, so that different resume information can be extracted, and various limits on the resume can be broken through.
The resume information extraction method provided by the invention comprises the steps of detecting text lines and identifying texts of target resume images to obtain the position information of each text line and a text identification result; sequencing each text line in the target resume image based on the position information of each text line and the text recognition result to obtain the content of the sequenced text lines; the contents of the sequenced text lines are coded and decoded to obtain the text line structured information of the target resume image, so that various limits on the resumes can be broken through, useful information can be extracted from the resumes with various characteristics and various types, the resume screening efficiency can be effectively improved, and the screening error rate can be reduced.
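The overall flow of steps 210 to 230 can be summarized as a small pipeline skeleton. This is only a structural sketch: the four callables stand in for the detection model, the recognition model, the sorting step, and the encoding and decoding step, and their names and signatures are assumptions made for illustration rather than interfaces defined by the invention.

```python
from typing import Any, Callable, Dict, List

def extract_resume(
    image: Any,
    detect_lines: Callable[[Any], List[Any]],
    recognize_line: Callable[[Any, Any], str],
    sort_lines: Callable[[List[Any], List[str]], List[int]],
    infer_structure: Callable[[List[Dict[str, Any]]], Any],
) -> Dict[str, Any]:
    """Run detection, recognition, sorting, and structure inference in order."""
    # Step 210: detect text lines, then recognize the text of each line.
    boxes = detect_lines(image)
    texts = [recognize_line(image, box) for box in boxes]
    # Step 220: obtain the reading order as a list of indices into boxes/texts.
    order = sort_lines(boxes, texts)
    sorted_lines = [{"box": boxes[i], "text": texts[i]} for i in order]
    # Step 230: derive the text line structured information.
    structure = infer_structure(sorted_lines)
    return {"lines": sorted_lines, "structure": structure}
```

Passing the stages in as functions keeps the skeleton independent of any particular detection or recognition model.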
Fig. 6 is a schematic flowchart of a process of performing text line detection and text recognition on a target resume image according to an embodiment of the present invention, as shown in fig. 6, in some embodiments of the present invention, step 210 includes:
step 211, performing text line detection on the target resume image to obtain position information of each text line in the target resume image;
optionally, the target resume image is input to a text line detection model for text line detection, and an output result of the text line detection model is obtained to obtain position information of contour points of each text line in the target resume image, that is, position information of each text line. It is understood that the position information of the contour points of each text line may describe the position of the text line in the target resume image.
The text line detection model is obtained by training based on the resume image sample and the labeling information of the text line corresponding to the resume image sample.
In some embodiments, the text line detection model may be a PSENet network model or a DB model (from "Real-time Scene Text Detection with Differentiable Binarization").
PSENet is an instance segmentation network with two advantages. First, as a segmentation-based method, PSENet can locate text of arbitrary shape. Second, the model uses a progressive scale expansion algorithm that can separate adjacent text instances. Performing text line detection on the target resume image with the PSENet network model allows the text information in the target resume image to be extracted effectively.
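The idea behind progressive expansion can be illustrated on small binary masks. The sketch below is a simplified, pure-Python illustration and not the PSENet implementation: it assumes the network has already produced a list of binary kernel masks ordered from smallest to largest, labels the connected components of the smallest kernel, and then grows each label breadth-first into the larger kernels, with contested pixels going to whichever label reaches them first. The function names are hypothetical.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity as (dy, dx)

def label_components(mask):
    """Label 4-connected components of a binary mask with 1, 2, 3, ..."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not labels[r][c]:
                next_label += 1
                labels[r][c] = next_label
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in NEIGHBORS:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels

def progressive_scale_expansion(kernels):
    """Grow labels from the smallest kernel into each larger kernel in turn."""
    labels = label_components(kernels[0])
    h, w = len(labels), len(labels[0])
    for kernel in kernels[1:]:
        queue = deque((r, c) for r in range(h) for c in range(w) if labels[r][c])
        while queue:
            y, x = queue.popleft()
            for dy, dx in NEIGHBORS:
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and kernel[ny][nx] and not labels[ny][nx]):
                    labels[ny][nx] = labels[y][x]  # first label to arrive wins
                    queue.append((ny, nx))
    return labels
```

Because each instance starts from its own small kernel, two text lines that touch in the largest mask still end up with different labels, which a single connected-component pass over the largest mask would merge.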
It should be noted that, the present invention may also adopt other models capable of performing text line detection as the text line detection model, and the present invention does not specifically limit this.
Step 212, correcting each text line in the target resume image according to the position information of each text line to obtain a text line correction image;
Because the target resume image may suffer from problems such as an improper shooting angle or poor lighting, the text lines may be tilted, curved, or shadowed. Therefore, text line detection is performed first to obtain the position information of each text line, and each detected text line is then corrected according to its position information to obtain the text line correction image.
Text recognition results obtained from the corrected text lines help ensure the completeness and accuracy of the resume information extraction.
Step 213, performing text recognition on the text line correction image to obtain a text recognition result corresponding to each text line.
After the correction, the content of the text line on each position information is recognized, and a text recognition result corresponding to each text line can be obtained.
Optionally, the text line correction image is input to a text recognition model, and an output result of the text recognition model is obtained to obtain a text recognition result corresponding to each text line.
The text recognition model is obtained by training according to the resume image sample and the text labeling information corresponding to the resume image sample.
In some embodiments, the text recognition model may employ a CRNN (Convolutional Recurrent Neural Network) model, which can recognize longer text sequences. The CRNN model comprises a CNN feature extraction layer and a BLSTM sequence feature extraction layer, and can be trained jointly end to end. Context relations in the character images are learned by the BLSTM and CTC components, so that the text recognition accuracy is effectively improved and the model is more robust. In the prediction process, a standard CNN first extracts features from the text image, the BLSTM fuses the feature vectors to extract context features of the character sequence, a probability distribution is then obtained for each feature column, and finally the transcription layer (CTC rule) predicts the text sequence.
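By way of illustration only, the transcription step described above can be sketched as a greedy CTC decoding over per-frame probability distributions. The function name `ctc_greedy_decode`, the choice of class 0 as the blank symbol, and the character set layout are assumptions made for this sketch and are not specified by the embodiment.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (not specified by the source)


def ctc_greedy_decode(probs: Sequence[Sequence[float]], charset: str) -> str:
    """Collapse per-frame class probabilities into a text sequence.

    probs[t][k] is the probability of class k at frame t. Class 0 is the
    blank symbol; class k >= 1 maps to charset[k - 1].
    """
    # Pick the most probable class for every frame (the "best path").
    best_path: List[int] = [
        max(range(len(frame)), key=frame.__getitem__) for frame in probs
    ]
    chars: List[str] = []
    previous = BLANK
    for index in best_path:
        # CTC rule: drop blanks and merge repeats of the same class.
        if index != BLANK and index != previous:
            chars.append(charset[index - 1])
        previous = index
    return "".join(chars)
```

A blank frame between two identical classes keeps both characters, which is how a repeated letter survives the merge step.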
It should be noted that, the embodiment of the present invention may also use other models capable of performing text recognition to perform text recognition on the text line correction image, and the present invention is not limited in this respect.
In the embodiment of the invention, after the text line detection is carried out on the target resume image, the detected text line is corrected to obtain the text line correction image, and then the text line correction image is subjected to text recognition, so that the accuracy of text recognition can be improved, the completeness of resume information extraction is ensured, and the information loss possibly existing in the text line extraction in a complex scene is fundamentally avoided.
Fig. 7 is a schematic flowchart of a process of correcting each text line in the target resume image according to an embodiment of the present invention, as shown in fig. 7, in some embodiments of the present invention, step 212 includes:
Step 2121, according to the position information of each text line, obtaining a line image corresponding to each text line in the target resume image by adopting a circumscribed rectangle method;
and drawing a circumscribed rectangle according to the position information of the contour points of each text line, thereby obtaining a corresponding line image of each text line.
Step 2122, according to the line image corresponding to each text line, obtaining a rotating image corresponding to each text line by adopting a minimum circumscribed rectangle method;
it can be understood that the minimum circumscribed rectangle of the line image corresponding to each text line is determined, and a rotated image corresponding to the text line is obtained.
Step 2123, calculating a rotation angle of each text line based on the rotation image corresponding to each text line;
the rotation image corresponding to each text line is obtained, and the included angle between the rotation image and the horizontal direction, that is, the rotation angle of each text line can be calculated.
Step 2124, performing affine transformation on each text line based on the rotation angle of each text line to obtain a line image corresponding to each rotation-corrected text line;
According to the rotation angle, an affine transformation matrix can be determined, and affine transformation is then performed on each text line by using the affine transformation matrix to obtain a line image corresponding to each text line after rotation correction.
Step 2125, removing the interference information in the line image corresponding to each text line subjected to the rotation correction by using the mask image to obtain a text line correction image.
Finally, removing the interference information in the line image corresponding to each text line after rotation correction, specifically, removing the interference information by using a mask image, and finally obtaining a text line correction image, namely, a corrected clean image, and the position information of each text line after correction, such as coordinate information, angle information and the like.
In the embodiment of the invention, the line images corresponding to the rotation corrected text lines are obtained by determining the rotation images corresponding to the text lines, calculating the rotation angle and carrying out affine transformation based on the rotation angle, and the interference information in the line images corresponding to the rotation corrected text lines is removed by adopting the mask image, so that the problems of incorrect photographing angle, complex background information and the like can be solved, and the method is suitable for extracting the resume information in a complex scene.
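The rotation angle calculation and the affine transformation in the steps above can be sketched as follows, using only the four corner points of a minimum circumscribed rectangle. This is a minimal sketch under stated assumptions: the helper names `rotation_angle`, `rotation_matrix` and `apply_affine` are hypothetical, the angle is measured from the longest edge of the rectangle, and only point coordinates (not pixels) are transformed. An actual implementation would typically rely on an image processing library for the rectangle fitting and the image warp.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotation_angle(box: List[Point]) -> float:
    """Angle in degrees between the longest edge of a 4-point box and the horizontal."""
    edges = [
        (box[(i + 1) % 4][0] - box[i][0], box[(i + 1) % 4][1] - box[i][1])
        for i in range(4)
    ]
    dx, dy = max(edges, key=lambda e: e[0] ** 2 + e[1] ** 2)
    angle = math.degrees(math.atan2(dy, dx))
    # Fold into (-90, 90] so the line is never turned upside down.
    if angle > 90:
        angle -= 180
    elif angle <= -90:
        angle += 180
    return angle


def rotation_matrix(center: Point, angle_deg: float) -> List[List[float]]:
    """2x3 affine matrix that rotates points by -angle_deg around center."""
    theta = math.radians(-angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    cx, cy = center
    return [
        [c, -s, cx - c * cx + s * cy],
        [s, c, cy - s * cx - c * cy],
    ]


def apply_affine(matrix: List[List[float]], points: List[Point]) -> List[Point]:
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    return [
        (
            matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
        )
        for x, y in points
    ]
```

Applying `rotation_matrix` with the angle returned by `rotation_angle` maps the corners of a tilted rectangle back to an axis-aligned rectangle, which corresponds to the rotation-corrected line image in step 2124.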
In some embodiments of the present invention, step 220 comprises:
determining the height direction coincidence degree between semantic blocks in the target resume image based on the position information of each text line;
sorting the semantic blocks based on the height direction coincidence degree between the semantic blocks in the target resume image;
and sorting the text lines within the sorted semantic blocks to obtain the content of the sorted text lines.
In order to improve the readability of the text recognition result and overcome the ambiguity problem caused by the reading sequence, the embodiment of the invention provides that after the position information of each text line and the text recognition result are obtained, the text lines in the target resume image are sequenced based on the position information of each text line and the text recognition result, and the content of the sequenced text lines is obtained.
Since the semantic blocks in the resume generally have distinct boundary distinctions, the ordering of the text lines can be achieved by outputting the semantic blocks.
The output of the semantic block is realized by the following steps:
firstly, determining the height direction coincidence degree between semantic blocks in the target resume image based on the position information of each text line.
Then, based on the height direction coincidence degree between the semantic blocks in the target resume image, sequencing the semantic blocks;
Optionally, if the height direction coincidence degree is smaller than a preset value, for example smaller than 0.4, the semantic blocks are sorted according to the height direction; otherwise, they are sorted according to the left-right direction.
Finally, the text lines within the sorted semantic blocks are sorted to obtain the content of the sorted text lines.
The output of the semantic block can be realized through a semantic block generation model, the semantic block generation model is obtained by training according to the position information samples and the text recognition result samples of all the text lines and the corresponding semantic block samples, and the semantic block generation model executes the output steps of the semantic block.
In the embodiment of the invention, the text lines in the target resume image are sequenced by sequencing the semantic blocks and then sequencing the text lines in the semantic blocks, so that the sequencing of the text lines in the target resume image is realized, the readability of the text recognition result can be effectively improved, and the extraction of resume information can be better realized.
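The sorting rule for semantic blocks can be illustrated with a short sketch. The embodiment does not define how the height direction coincidence degree is computed, so the sketch assumes it is the vertical overlap of two blocks divided by the height of the shorter block. The `Block` type and the function names are hypothetical; only the 0.4 threshold comes from the example given above.

```python
from dataclasses import dataclass
from functools import cmp_to_key
from typing import List


@dataclass
class Block:
    name: str
    left: float
    top: float
    right: float
    bottom: float


def height_overlap(a: Block, b: Block) -> float:
    """Assumed definition: vertical overlap divided by the shorter block height."""
    overlap = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    if overlap <= 0 or shorter <= 0:
        return 0.0
    return overlap / shorter


def compare(a: Block, b: Block, threshold: float = 0.4) -> int:
    """Order by height when overlap is low, otherwise from left to right."""
    if height_overlap(a, b) < threshold:
        key_a, key_b = a.top, b.top
    else:
        key_a, key_b = a.left, b.left
    return (key_a > key_b) - (key_a < key_b)


def sort_blocks(blocks: List[Block], threshold: float = 0.4) -> List[Block]:
    return sorted(blocks, key=cmp_to_key(lambda a, b: compare(a, b, threshold)))
```

A pairwise rule of this kind is not guaranteed to define a consistent total order for every layout, which may be one reason the embodiment also allows a trained semantic block generation model to perform this step.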
Fig. 8 is a schematic flowchart of encoding and decoding the sorted text line content according to the embodiment of the present invention. As shown in fig. 8, in some embodiments of the invention, step 230 comprises:
Step 231, obtaining text features based on the sorted text line contents, obtaining layout features corresponding to the text lines based on the position information of the text lines and the target resume image, and obtaining image features based on the target resume image;
Optionally, the embodiment of the present invention employs a LayoutXLM model to encode the content of the sorted text lines. The LayoutXLM model requires information of three different modalities as input, namely a text modality, a layout modality, and an image modality.
Therefore, it is necessary to acquire text features, layout features, and image features.
The contents of the sorted text lines are input into a text feature extraction model for feature extraction to obtain the text features.
In an embodiment, a bert_wm model may be used as the text feature extraction model, that is, the contents of the sorted text lines are input into the bert_wm model to obtain the text features.
The layout features corresponding to each text line are obtained based on the position information of each text line and the target resume image; that is, the layout features are extracted from the original target resume image according to the position information of each text line.
The image features are obtained based on the target resume image; that is, the image features are determined from the target resume image.
Step 232, performing feature fusion on the text features, the layout features and the image features to obtain fused text line features;
optionally, inputting the text features, the layout features and the image features into a coding model for feature fusion to obtain fused text line features;
the coding model is obtained by training according to the text feature sample, the layout feature sample, the image feature sample and the text line feature sample.
The coding model uses an attention mechanism to fuse the inputs of multiple modalities.
The fused text line features output by the coding model are features in units of text lines.
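As a rough illustration of how an attention mechanism can fuse inputs of several modalities, the sketch below applies scaled dot-product attention to one text vector, one layout vector and one image vector of a single text line. This is not the coding model of the embodiment: a real encoder learns projection weights and attends over full token sequences, whereas here the vectors are used directly and the names `softmax` and `attention_fuse` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(query: Vector, features: List[Vector]) -> Vector:
    """Weight each modality vector by its scaled dot product with the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, f)) / scale for f in features]
    weights = softmax(scores)
    return [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(len(query))
    ]
```

With a zero query all modalities receive equal weight and the result is their average; a query aligned with one modality pulls the fused vector toward that modality.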
Step 233, performing hierarchical relation reasoning on the fused text line features to obtain a text line structured relation of the target resume image;
optionally, the fused text line features are input into a decoding model for hierarchical relationship reasoning, so that a text line structural relationship of the target resume image is obtained.
It can be understood that the decoding model in the embodiment of the present invention is used for performing hierarchical relationship reasoning on the text line features after the fusion, so as to obtain the text line structural relationship of the target resume image.
The decoding model is obtained by training according to the text line characteristic sample and the text line structure relation corresponding to the text line characteristic sample.
Optionally, in the embodiment of the present invention, a GRU (Gated Recurrent Unit) model structure is used to implement the hierarchical relationship reasoning.
It should be noted that other decoding models may also be used, and the present invention is not limited in this respect.
In the embodiment of the invention, the text feature, the layout feature and the image feature are obtained, the text feature, the layout feature and the image feature are input into the coding model for feature fusion to obtain the fused text line feature, and the fused text line feature is input into the decoding model for hierarchical relationship reasoning, so that the text line structural relationship of the target resume image can be obtained, and subsequently, different resume extraction rules or requirements can be set according to the recruitment requirements of different recruitment companies based on the text line structural information to extract different resume information, thereby breaking through various limits on resumes.
In some embodiments of the present invention, step 233 specifically includes:
determining, based on the fused text line features, the title corresponding to each text line, the parent node of each text line, and the relation between each text line and its parent node;
for each text line, determining the hierarchical information of the text line according to the title corresponding to the text line and the relation between the text line and its parent node;
and obtaining the text line structural relationship of the target resume image based on the hierarchical information of each text line.
Specifically, first, based on the fused text line features, the title corresponding to each text line, the parent node of each text line, and the relation between each text line and its parent node are determined, where the relation is an inclusion relation, a parallel relation, or a connection relation.
Then, for each text line, the steps shown in fig. 9 are performed in sequence:
step 900, judging whether the title is a first-level title;
step 901, if the title is a first-level title, outputting the hierarchical information of the current text line as 1;
step 902, if the title is not a first-level title, judging whether the title is a start;
step 903, if the title is a start, outputting the hierarchical information of the current text line as 1;
step 904, if the title is not a start, judging whether the relation is a connection or parallel relation;
step 905, if the relation is a connection or parallel relation, outputting the hierarchical information of the current text line as the hierarchical information of the parent node;
step 906, if the relation is not a connection or parallel relation, judging whether the relation is an inclusion relation;
step 907, if the relation is an inclusion relation, outputting the hierarchical information of the current text line as the hierarchical information of the parent node plus 1;
step 908, if the relation is not an inclusion relation, judging whether the title is an end;
step 909, if the title is an end, outputting the hierarchical information of the current text line as -1.
After the above steps 900 to 909 are performed for each text line, the hierarchical information of each text line can be obtained.
And finally, obtaining the text line structured relation of the target resume image based on the hierarchical information of each text line.
The trained decoding model can perform the above-mentioned hierarchical relationship inference step.
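The decision flow of steps 900 to 909 can be written out as a plain function for clarity. In the embodiment this reasoning is performed by the trained decoding model, so the following is only a rule-based sketch: the label strings (`"first_level"`, `"start"`, `"end"`, `"connection"`, `"parallel"`, `"inclusion"`) and the function name are hypothetical, and returning `None` when no branch applies is an assumption, since the flow does not state that case.

```python
from typing import Optional


def hierarchy_level(title: str, relation: str, parent_level: int) -> Optional[int]:
    """Follow steps 900 to 909 for a single text line."""
    if title == "first_level":      # steps 900 and 901
        return 1
    if title == "start":            # steps 902 and 903
        return 1
    if relation in ("connection", "parallel"):  # steps 904 and 905
        return parent_level
    if relation == "inclusion":     # steps 906 and 907
        return parent_level + 1
    if title == "end":              # steps 908 and 909
        return -1
    return None  # case not covered by the described flow (assumption)
```

For example, a line contained in a parent of level 2 receives level 3, while a line that continues or parallels its parent stays at level 2.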
In the embodiment of the invention, the reasoning of the hierarchical relationship among the text lines is realized through the decoding process, and the text line structured information of the target resume image is obtained, and because the text line structured information can be closely related to the structure of the resume, the method provided by the invention can obtain the structured information of the resume with any structure, thereby breaking through various limitations of the resume, such as the situations of complex sentences, multi-layer nesting of contents, english resume and the like, realizing the extraction of useful information from various types of resumes with various characteristics, effectively improving the resume screening efficiency and reducing the screening error rate.
On the basis of the above embodiments, the method further includes:
according to the text line structural relationship of the target resume image, checking the semantic block to which each text line belongs to obtain a checking result;
and processing the semantic block according to the checking result.
The present embodiment mainly considers the following situations:
Case one: the current line and the previous line are in a connection or parallel relation. In this case, the current line and the previous text line need to be merged into one semantic block.
Case two: a page crossing exists. Specifically, the original position information of the text lines may be combined to determine whether a page crossing exists. If a page crossing exists, the text lines that cross the page need to be merged into one semantic block.
Case three: the page number information needs to be used. In this case, the semantic block needs to be split.
In consideration of the situations, after the text line structural relationship of the target resume image is obtained, the semantic blocks to which the text lines belong are verified according to the text line structural relationship of the target resume image to obtain a verification result; and processing the semantic block according to the checking result. The processing comprises merging into one semantic block or splitting one semantic block.
The checking process comprises the following steps. Whether the current line and the previous line are in a connection or parallel relation is judged; if so, the current line and the previous text line are merged into one semantic block.
Whether a page crossing exists is judged by combining the original position information of the text lines. If a page crossing exists, the text lines that cross the page are merged into one semantic block.
Whether the page number information needs to be used is judged, and the semantic block is split if the page number information needs to be used. It should be noted that, although the semantic block is split into two, the hierarchical information of the two resulting blocks should remain consistent.
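The merging and splitting described above can be sketched as two small functions. This is one possible reading of the checking process rather than the embodiment's implementation: the `Line` type, its field names and the relation labels are hypothetical, merging across a page break is treated as a by-product of the relation check, and the condition under which page number information "needs to be used" is left to the caller because the description does not define it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    page: int
    level: int
    relation: str  # relation to the previous line, e.g. "connection" or "parallel"


def merge_lines(lines: List[Line]) -> List[List[Line]]:
    """Cases one and two: join a line to the previous block when it continues it."""
    blocks: List[List[Line]] = []
    for line in lines:
        if blocks and line.relation in ("connection", "parallel"):
            blocks[-1].append(line)  # also joins lines that continue over a page break
        else:
            blocks.append([line])
    return blocks


def split_by_page(block: List[Line]) -> List[List[Line]]:
    """Case three: split a block at page boundaries; line levels are left unchanged."""
    parts: List[List[Line]] = []
    for line in block:
        if parts and parts[-1][-1].page == line.page:
            parts[-1].append(line)
        else:
            parts.append([line])
    return parts
```

Because `split_by_page` only regroups the lines, the hierarchical information carried by each line stays the same in both resulting blocks.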
According to the resume information extraction method provided by the invention, the semantic blocks to which the text lines belong are verified according to the text line structural relationship of the target resume image to obtain a verification result; and processing the semantic block according to the verification result, thereby further effectively extracting the resume information, solving the limitation of the conventional scheme on the resume format and acquiring the resume information in a complex scene.
The following describes the resume information extraction device provided by the present invention, and the resume information extraction device described below and the resume information extraction method described above may be referred to in correspondence with each other.
Fig. 10 is a schematic structural diagram of a resume information extraction apparatus provided in the present invention, as shown in fig. 10, the apparatus includes:
the text detection and recognition module 1010 is configured to perform text line detection and text recognition on the target resume image to obtain position information of each text line in the target resume image and a text recognition result corresponding to each text line;
a text sorting module 1020, configured to sort, based on the position information of each text line and the text recognition result, each text line in the target resume image to obtain a content of the sorted text line;
and an encoding and decoding processing module 1030, configured to perform encoding and decoding processing on the sorted text line content to obtain text line structured information of the target resume image.
In some embodiments of the present invention, the text detection and recognition module 1010 comprises:
the text detection submodule is used for detecting text lines of the target resume image to obtain the position information of each text line in the target resume image;
the text line correction submodule is used for correcting each text line in the target resume image according to the position information of each text line to obtain a text line correction image;
and the text recognition sub-module is used for performing text recognition on the text line correction image to obtain a text line recognition result corresponding to each text line.
In some embodiments of the invention, the text line correction sub-module is configured to:
obtaining line images corresponding to the text lines in the target resume image by adopting a circumscribed rectangle method according to the position information of the text lines;
obtaining a rotating image corresponding to each text line by adopting a minimum circumscribed rectangle method according to the line image corresponding to each text line;
calculating the rotation angle of each text line based on the rotation image corresponding to each text line;
performing affine transformation on each text line based on the rotation angle of each text line to obtain line images corresponding to each text line subjected to rotation correction;
and removing the interference information in the line image corresponding to each text line subjected to the rotation correction by adopting a mask image to obtain a text line correction image.
In some embodiments of the present invention, the text ordering module 1020 is configured to:
determining the height direction coincidence degree between semantic blocks in the target resume image based on the position information of each text line;
sorting the semantic blocks based on the height direction coincidence degree between the semantic blocks in the target resume image;
and sorting the text lines within the sorted semantic blocks to obtain the content of the sorted text lines.
In some embodiments of the present invention, the codec processing module 1030 includes:
the feature extraction sub-module is used for obtaining text features based on the sorted text line contents, obtaining layout features corresponding to the text lines based on the position information of the text lines and the target resume image, and obtaining image features based on the target resume image;
the coding submodule is used for carrying out feature fusion on the text features, the layout features and the image features to obtain fused text line features;
and the decoding submodule is used for carrying out hierarchical relation reasoning on the fused text line characteristics to obtain the text line structural relation of the target resume image.
In some embodiments of the invention, the decoding submodule is operable to:
determining, based on the fused text line features, the title corresponding to each text line, the parent node of each text line, and the relation between each text line and its parent node;
for each text line, determining the hierarchical information of the text line according to the title corresponding to the text line and the relation between the text line and its parent node;
and obtaining the text line structured relation of the target resume image based on the hierarchical information of each text line.
In some embodiments of the present invention, the resume information extraction apparatus further includes a checking module, and the checking module is configured to:
according to the text line structural relation of the target resume image, checking semantic blocks to which the text lines belong to obtain a checking result;
and processing the semantic block according to the checking result.
It should be noted that the resume information extraction device provided in the embodiment of the present invention can implement all the method steps implemented by the resume information extraction method embodiment, and can achieve the same technical effect, and detailed descriptions of the same parts and beneficial effects as those of the method embodiment in this embodiment are not repeated herein.
Fig. 11 illustrates a physical structure diagram of an electronic device, and as shown in fig. 11, the electronic device may include: a processor (processor) 1110, a communication Interface (Communications Interface) 1120, a memory (memory) 1130, and a communication bus 1140, wherein the processor 1110, the communication Interface 1120, and the memory 1130 communicate with each other via the communication bus 1140. Processor 1110 may invoke logic instructions in memory 1130 to perform a resume information extraction method comprising: performing text line detection and text recognition on a target resume image to obtain position information of each text line in the target resume image and a text recognition result corresponding to each text line; sequencing all text lines in the target resume image based on the position information of all the text lines and the text recognition result to obtain the content of the sequenced text lines; and carrying out coding and decoding processing on the content of the sequenced text lines to obtain the text line structured information of the target resume image.
In addition, the logic instructions in the memory 1130 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, a computer is capable of executing the resume information extraction method provided by the above methods, the method including: performing text line detection and text recognition on a target resume image to obtain position information of each text line in the target resume image and a text recognition result corresponding to each text line; sequencing all text lines in the target resume image based on the position information of all the text lines and the text recognition result to obtain the content of the sequenced text lines; and carrying out coding and decoding processing on the content of the sequenced text lines to obtain the text line structured information of the target resume image.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements a method for extracting resume information provided by the above methods, the method including: performing text line detection and text recognition on a target resume image to obtain position information of each text line in the target resume image and a text recognition result corresponding to each text line; sequencing all text lines in the target resume image based on the position information of all the text lines and the text recognition result to obtain the content of the sequenced text lines; and carrying out coding and decoding processing on the content of the sequenced text lines to obtain the text line structured information of the target resume image.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on the understanding, the above technical solutions substantially or otherwise contributing to the prior art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A resume information extraction method is characterized by comprising the following steps:
performing text line detection and text recognition on a target resume image to obtain position information of each text line in the target resume image and a text recognition result corresponding to each text line;
sequencing all text lines in the target resume image based on the position information of all the text lines and the text recognition result to obtain the content of the sequenced text lines;
and carrying out coding and decoding processing on the content of the sequenced text lines to obtain the text line structured information of the target resume image.
2. The resume information extraction method according to claim 1, wherein the performing text line detection and text recognition on the target resume image to obtain the position information of each text line in the target resume image and the text recognition result corresponding to each text line comprises:
performing text line detection on a target resume image to obtain position information of each text line in the target resume image;
correcting each text line in the target resume image according to the position information of each text line to obtain a text line correction image;
and performing text recognition on the text line corrected image to obtain a text line recognition result corresponding to each text line.
3. The resume information extraction method according to claim 2, wherein the correcting each text line in the target resume image according to the position information of each text line to obtain a text line corrected image comprises:
obtaining a line image corresponding to each text line in the target resume image by a bounding rectangle method according to the position information of each text line;
obtaining a rotated image corresponding to each text line by a minimum bounding rectangle method according to the line image corresponding to each text line;
calculating a rotation angle of each text line based on the rotated image corresponding to each text line;
performing an affine transformation on each text line based on the rotation angle of each text line to obtain a rotation-corrected line image corresponding to each text line;
and removing interference information from the rotation-corrected line image corresponding to each text line by using a mask image to obtain the text line corrected image.
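For illustration only, the rotation-angle calculation and affine rotation recited in claim 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: it assumes the position information of a text line is available as four corner points ordered top-left, top-right, bottom-right, bottom-left; the function names `rotation_angle`, `rotation_matrix`, `apply_affine` and `deskew_quad` are hypothetical; and it transforms corner coordinates rather than pixels. The bounding-rectangle cropping and the mask-based removal of interference information are not shown.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotation_angle(quad: List[Point]) -> float:
    """Tilt of the text line in degrees, measured on its top edge.

    Assumes quad[0] is the top-left corner and quad[1] the top-right corner.
    """
    (x0, y0), (x1, y1) = quad[0], quad[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


def rotation_matrix(center: Point, angle_deg: float) -> List[List[float]]:
    """2x3 affine matrix that rotates by -angle_deg about `center`,
    i.e. the rotation that undoes a tilt of angle_deg."""
    theta = math.radians(-angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    cx, cy = center
    # p' = R (p - center) + center, written as a single 2x3 matrix.
    return [
        [c, -s, cx - c * cx + s * cy],
        [s, c, cy - s * cx - c * cy],
    ]


def apply_affine(matrix: List[List[float]], point: Point) -> Point:
    """Apply a 2x3 affine matrix to a single (x, y) point."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )


def deskew_quad(quad: List[Point]) -> List[Point]:
    """Rotate the four corners about their centroid so the top edge is horizontal."""
    cx = sum(p[0] for p in quad) / len(quad)
    cy = sum(p[1] for p in quad) / len(quad)
    matrix = rotation_matrix((cx, cy), rotation_angle(quad))
    return [apply_affine(matrix, p) for p in quad]
```

When operating on pixels, an image-processing library would normally perform these steps; in OpenCV, for example, `cv2.minAreaRect`, `cv2.getRotationMatrix2D` and `cv2.warpAffine` cover the minimum bounding rectangle, the rotation matrix and the warp. The use of OpenCV is an assumption and is not stated in the claims.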
4. The resume information extraction method according to claim 1, wherein the sorting the text lines in the target resume image based on the position information of the text lines and the text recognition results to obtain the content of the sorted text lines comprises:
determining a degree of overlap in the height direction between semantic blocks in the target resume image based on the position information of each text line;
sorting the semantic blocks based on the degree of overlap in the height direction between the semantic blocks in the target resume image;
and sorting the text lines within the sorted semantic blocks to obtain the content of the sorted text lines.
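For illustration only, the two-level sorting recited in claim 4 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims do not define how the height-direction overlap is computed or how it drives the ordering, so the sketch assumes the overlap is the shared vertical extent divided by the shorter of the two heights, that items whose overlap reaches a hypothetical `threshold` of 0.5 are treated as lying on the same row and are ordered left to right, and that rows are ordered top to bottom. The names `Box`, `height_overlap`, `reading_order`, `enclosing_box` and `sort_resume` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float
    text: str = ""


def height_overlap(a: Box, b: Box) -> float:
    """Shared vertical extent divided by the shorter height (0.0 if disjoint)."""
    shared = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    if shared <= 0 or shorter <= 0:
        return 0.0
    return shared / shorter


def reading_order(items: Sequence[T], box_of: Callable[[T], Box],
                  threshold: float = 0.5) -> List[T]:
    """Group items into rows by height overlap, then read rows top to bottom
    and each row left to right."""
    rows: List[List[T]] = []
    for item in sorted(items, key=lambda it: box_of(it).top):
        # Compare against the first (topmost) item of the current row.
        if rows and height_overlap(box_of(rows[-1][0]), box_of(item)) >= threshold:
            rows[-1].append(item)
        else:
            rows.append([item])
    ordered: List[T] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda it: box_of(it).left))
    return ordered


def enclosing_box(lines: Sequence[Box]) -> Box:
    """Smallest axis-aligned box containing every text line of a semantic block."""
    return Box(min(l.left for l in lines), min(l.top for l in lines),
               max(l.right for l in lines), max(l.bottom for l in lines))


def sort_resume(blocks: Sequence[Sequence[Box]]) -> List[str]:
    """Sort semantic blocks first, then the text lines inside each block."""
    texts: List[str] = []
    for block in reading_order(blocks, enclosing_box):
        texts.extend(line.text for line in reading_order(block, lambda line: line))
    return texts
```

With a two-column header followed by a full-width section, for example, the left-hand block is read before the right-hand block because the two overlap vertically, and the lower section is read last.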
5. The resume information extraction method according to claim 1, wherein the performing encoding and decoding processing on the content of the sorted text lines to obtain the text line structured information of the target resume image comprises:
obtaining text features based on the content of the sorted text lines, obtaining layout features corresponding to the text lines based on the position information of the text lines and the target resume image, and obtaining image features based on the target resume image;
performing feature fusion on the text features, the layout features and the image features to obtain fused text line features;
and performing hierarchical relationship reasoning on the fused text line features to obtain a text line structured relationship of the target resume image.
6. The resume information extraction method according to claim 5, wherein the performing hierarchical relationship reasoning on the fused text line features to obtain the text line structured relationship of the target resume image comprises:
determining, based on the fused text line features, a title and a parent node corresponding to each text line and a relation between each text line and its parent node;
for each text line, determining hierarchical information of the text line according to the title corresponding to the text line and the relation between the text line and its parent node;
and obtaining the text line structured relationship of the target resume image based on the hierarchical information of each text line.
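For illustration only, the step of turning per-line parent predictions into hierarchical information, as recited in claim 6, can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: it assumes the reasoning step has already produced, for each text line, the index of its parent line (or `None` for a top-level title), and it takes the level of a line to be its number of ancestors. The function names `line_levels` and `build_tree` and the dictionary layout of the tree are hypothetical.

```python
from typing import Dict, List, Optional


def line_levels(parents: List[Optional[int]]) -> List[int]:
    """Level of each text line, counted as its number of ancestors.

    parents[i] is the index of the parent of line i, or None for a root.
    """
    levels: List[int] = []
    for i in range(len(parents)):
        depth, node, seen = 0, parents[i], {i}
        while node is not None:
            if node in seen:
                raise ValueError(f"cycle in parent links at line {node}")
            seen.add(node)
            depth += 1
            node = parents[node]
        levels.append(depth)
    return levels


def build_tree(texts: List[str], parents: List[Optional[int]]) -> List[Dict]:
    """Nest the text lines under their parents and return the root nodes."""
    nodes: List[Dict] = [{"text": t, "children": []} for t in texts]
    roots: List[Dict] = []
    for node, parent in zip(nodes, parents):
        if parent is None:
            roots.append(node)
        else:
            nodes[parent]["children"].append(node)
    return roots
```

In this sketch a section title such as "Education" would be a root, an institution name would be its child, and the degree line would be a child of the institution, giving levels 0, 1 and 2 respectively.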
7. The resume information extraction method according to claim 1, wherein the method further comprises:
checking, according to the text line structured relationship of the target resume image, the semantic blocks to which the text lines belong to obtain a check result;
and processing the semantic blocks according to the check result.
8. A resume information extraction apparatus, characterized by comprising:
a text detection and recognition module, configured to perform text line detection and text recognition on a target resume image to obtain position information of each text line in the target resume image and a text recognition result corresponding to each text line;
a text sorting module, configured to sort the text lines in the target resume image based on the position information of the text lines and the text recognition results to obtain the content of the sorted text lines;
and an encoding and decoding processing module, configured to perform encoding and decoding processing on the content of the sorted text lines to obtain text line structured information of the target resume image.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the resume information extraction method of any of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the resume information extraction method of any of claims 1 to 7.
CN202211364710.3A 2022-11-02 2022-11-02 Resume information extraction method and device, electronic equipment and storage medium Pending CN115690795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211364710.3A CN115690795A (en) 2022-11-02 2022-11-02 Resume information extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211364710.3A CN115690795A (en) 2022-11-02 2022-11-02 Resume information extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115690795A (en) 2023-02-03

Family

ID=85047495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211364710.3A Pending CN115690795A (en) 2022-11-02 2022-11-02 Resume information extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115690795A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116502625A (en) * 2023-06-28 2023-07-28 浙江同花顺智能科技有限公司 Resume analysis method and system
CN116502625B (en) * 2023-06-28 2023-09-15 浙江同花顺智能科技有限公司 Resume analysis method and system

Similar Documents

Publication Publication Date Title
CN112199375B (en) Cross-modal data processing method and device, storage medium and electronic device
CN111476067B (en) Character recognition method and device for image, electronic equipment and readable storage medium
CN109343920B (en) Image processing method and device, equipment and storage medium thereof
CN108334805B (en) Method and device for detecting document reading sequence
CN112836650B (en) Semantic analysis method and system for quality inspection report scanning image table
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN108734159B (en) Method and system for detecting sensitive information in image
CN112949476B (en) Text relation detection method, device and storage medium based on graph convolution neural network
CN115424282A (en) Unstructured text table identification method and system
CN113051356A (en) Open relationship extraction method and device, electronic equipment and storage medium
CN116049397A (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
CN116311310A (en) Universal form identification method and device combining semantic segmentation and sequence prediction
CN114241499A (en) Table picture identification method, device and equipment and readable storage medium
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN115690795A (en) Resume information extraction method and device, electronic equipment and storage medium
CN117423124A (en) Table data processing method, device, equipment and medium based on table image
CN112966676A (en) Document key information extraction method based on zero sample learning
CN115984886A (en) Table information extraction method, device, equipment and storage medium
CN114443898B (en) Video big data pushing method for Internet intelligent education
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN115410185A (en) Method for extracting specific name and unit name attributes in multi-modal data
CN114743204A (en) Automatic question answering method, system, equipment and storage medium for table
CN113962196A (en) Resume processing method and device, electronic equipment and storage medium
CN110909737A (en) Picture character recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination