CN105183742A - Resume identification method

Info

Publication number: CN105183742A
Application number: CN201510321901.5A
Authority: CN
Inventors: 蔡志旻; 沈峰; 王峰; 邹阳; 张海涛
Original assignee: Nanjing Fujitsu Nanda Software Technology Co Ltd
Current assignee: Nanjing Fujitsu Nanda Software Technology Co Ltd
Priority date: 2015-06-12
Filing date: 2015-06-12
Publication date: 2015-12-23

Abstract

The invention discloses a resume identification method comprising the following steps: step one, setting all potential keywords in a resume; step two, selecting a resume to be analyzed; step three, preprocessing the resume according to the preset keywords and analyzing the keywords contained in the resume; step four, distinguishing simple information fields from complex information fields according to the keyword information contained in the resume; step five, performing secondary analysis on the complex information fields to extract sub-item information; and step six, outputting the simple information fields and the complex information fields. The method extracts resume information efficiently and with high accuracy.

Description

A resume identification method

Technical field

The present invention relates to a text recognition method, specifically to a resume identification method, and belongs to the field of text recognition.

Background art

A resume is a common type of text. Functionally, a resume is an important means by which its author introduces and promotes himself or herself and ultimately achieves effective communication; structurally, it is a semi-structured text. Such texts are widely used and numerous, so extracting their information efficiently and accurately has become an urgent need. On the one hand, in terms of extraction efficiency, manual reading clearly cannot meet current demand, and computer technology must be used. On the other hand, in terms of the feasibility of accurate extraction, the characteristics of semi-structured text combined with text information extraction techniques, such as regular-expression matching, correlation analysis, and statistical methods, can produce extraction results that meet practical needs; that is, machine recognition is feasible. However, the prior art does not yet provide a technique for effectively identifying resumes.

Summary of the invention

To overcome the deficiencies of the prior art, the object of the present invention is to provide a resume identification method that solves the technical problem that the prior art has difficulty identifying resumes effectively.

To achieve the above object, the present invention adopts the following technical solution:

A resume identification method, characterized by comprising the following steps:

Step one: set all potential keywords in a resume;

Step two: select the resume to be analyzed;

Step three: preprocess the resume according to the set keywords;

Step four: distinguish simple information fields from complex information fields according to the keyword information contained in the resume;

Step five: perform secondary analysis on the complex information fields and extract sub-item information;

Step six: output the simple information fields and the complex information fields.

The aforesaid resume identification method is characterized in that step one further comprises setting a keyword conflict analysis, which is used to determine the actual position of a keyword in the resume when the keyword appears at multiple positions in the resume.

The aforesaid resume identification method is characterized in that, in step three, regular-expression matching is used to analyze the keywords contained in the resume.

The aforesaid resume identification method is characterized in that, in step three, if a keyword appears at multiple positions in the resume, the actual position of the keyword in the resume is determined according to the keyword conflict analysis.

The aforesaid resume identification method is characterized in that the keyword conflict analysis comprises: if a keyword appears at multiple positions in the resume, then for each position at which the keyword appears, analyzing the text before and after that position to search for verification information corresponding to the keyword; if the verification information exists, determining that the position is the actual position of the keyword; if the verification information does not exist, determining that the position is not the actual position of the keyword.

The aforesaid resume identification method is characterized in that, in step three, if keywords are found in the resume, the method continues to the next step; if no keyword is found in the resume, the analysis ends.

The aforesaid resume identification method is characterized in that the simple information fields include name, age, and date of birth; the complex information fields contain sub-items, for example work experience and project experience.

The aforesaid resume identification method is characterized in that the secondary analysis of a complex information field comprises: analyzing the keywords contained in the complex information field, defining the keywords obtained as secondary keywords, and extracting each secondary keyword together with its corresponding specific information.

The aforesaid resume identification method is characterized in that the resume is in any one of Word, HTML, PDF, and TXT formats.

The aforesaid resume identification method is characterized in that the simple information fields and the complex information fields are output as standard XML data or JSON data.

The beneficial effect of the present invention is that resume information can be extracted efficiently and with a high accuracy rate.

Brief description of the drawings

Fig. 1 is a flowchart of a preferred implementation of the present invention;

Fig. 2 is a schematic diagram of a resume in the present invention;

Fig. 3 is a schematic diagram of the keyword dictionary in the present invention;

Fig. 4 shows the resume identification result actually output by the present invention.

Embodiment

The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment.

Referring to Fig. 1, the present invention comprises the following steps:

Step one: set all potential keywords in a resume. The keywords are stored in the form of a dictionary. This embodiment uses a job-application resume, shown in Fig. 2, as an illustration. The corresponding keyword dictionary is shown in Fig. 3; its keywords include name, sex, date of birth, residence, and so on. In this step, a keyword conflict analysis can also be set; it is used to determine the actual position of a keyword in the resume when the keyword appears at multiple positions.
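A minimal sketch of such a keyword dictionary is given below. The class name `KeywordEntry`, the field `check_pattern`, and the English keyword strings are assumptions made for illustration; they are not the contents of Fig. 3. A keyword that may appear at multiple positions carries an optional regular expression that a later conflict analysis could use.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Pattern


@dataclass(frozen=True)
class KeywordEntry:
    """One dictionary entry (hypothetical structure)."""
    keyword: str
    # Optional conflict-analysis rule: text expected near a real occurrence.
    check_pattern: Optional[Pattern[str]] = None


# Assumed date-range notation, e.g. "2012-09 ~ 2013-02".
DATE_RANGE = re.compile(r"\d{4}-\d{2}\s*~\s*\d{4}-\d{2}")

KEYWORD_DICTIONARY: Dict[str, KeywordEntry] = {
    entry.keyword: entry
    for entry in [
        KeywordEntry("Name"),
        KeywordEntry("Sex"),
        KeywordEntry("Date of birth"),
        KeywordEntry("Residence"),
        KeywordEntry("Work experience", check_pattern=DATE_RANGE),
    ]
}
```

Keeping the conflict rule next to the keyword means step three can look up both in one place; this is one possible layout, not the only one.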

Step two: select the resume to be analyzed. The preferred resume format is Word or HTML. Besides the common Word format, XML-based Web text is a kind of semi-structured text, and XML is a semi-structured data description language. XML overcomes a shortcoming of the traditional Web description language HTML, which can express only the content of data and not the structural features of Web data, making semi-structured data hard to query; XML has therefore been gradually replacing HTML as the new-generation standard for Web data description and data exchange. Among semi-structured texts, XML-based Web text carries relatively rich format information and follows a fixed standard, so information extraction from this kind of text is easier than from other semi-structured texts.

Step three: preprocess the resume according to the set keywords and analyze the keywords contained in the resume. In this step, the resume text is first segmented. The goal of segmentation is to break a passage of resume text into many units. Because the unit is the basic building block of semi-structured text, decomposing a passage into a sequence of units is the key to machine text information extraction. Segmentation is performed with regular expressions and can be implemented with reference to the prior art.
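Regular-expression segmentation of this kind can be sketched as follows. The choice of delimiters (line breaks, tabs, and runs of two or more spaces) is an assumption, as are the function name `segment` and the sample text; the source does not specify which delimiters are used.

```python
import re
from typing import List

# Assumed delimiters: line breaks, tabs, or runs of two or more spaces.
_DELIMITERS = re.compile(r"[\r\n\t]+| {2,}")


def segment(text: str) -> List[str]:
    """Split resume text into a sequence of non-empty units."""
    return [unit.strip() for unit in _DELIMITERS.split(text) if unit.strip()]


# Hypothetical sample text, not taken from Fig. 2.
sample = "Name: Zhang San\tSex: Male\nDate of birth: 1990-01-01   Residence: Nanjing"
units = segment(sample)
for unit in units:
    print(unit)
```

Single spaces are left intact so that multi-word values such as a full name stay within one unit.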

After text segmentation, text recognition is performed on the resume: the keywords are compared with the resume text to determine which keywords the resume contains. If a keyword appears at multiple positions in the resume, only one of those positions is the real keyword and the remaining occurrences are plain text. The repeated keyword is therefore subjected to fuzzy analysis to determine which occurrence is the real keyword and which are plain text. Specifically, the actual position of the keyword in the resume is determined according to the keyword conflict analysis.

One keyword conflict analysis is given below. If a keyword appears at multiple positions in the resume, then for each position at which the keyword appears, the text before and after that position is analyzed to search for verification information corresponding to the keyword. If the verification information exists, the position is determined to be the actual position of the keyword; if it does not exist, the position is determined not to be the actual position of the keyword. The verification information is a word that appears in the text before or after the real keyword and stands in a verifying relationship to it. For example, for a potentially conflicting keyword such as "work experience" that appears several times, the text around each occurrence is analyzed. If the "work experience" at a given position is the real keyword, it is generally followed by time information, such as 2012-09 ~ 2013-02, so that position is selected as the position of the keyword. If the "work experience" at some position is not followed by a time, that occurrence is judged to be plain text rather than a real keyword.
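The conflict analysis described above can be sketched as follows, under stated assumptions: the verification information for "work experience" is taken to be a date range in the notation of the example, the search window of 30 characters on each side is arbitrary, and the function name `real_keyword_positions` and the sample resume text are hypothetical.

```python
import re
from typing import List, Pattern

# Assumed form of the verification information: "YYYY-MM ~ YYYY-MM".
DATE_RANGE = re.compile(r"\d{4}-\d{2}\s*~\s*\d{4}-\d{2}")


def real_keyword_positions(
    text: str, keyword: str, check_pattern: Pattern[str], window: int = 30
) -> List[int]:
    """Return start offsets of occurrences that have verification text nearby."""
    positions = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        before = text[max(0, match.start() - window):match.start()]
        after = text[match.end():match.end() + window]
        # Keep the occurrence only if verification text is found on either side.
        if check_pattern.search(before) or check_pattern.search(after):
            positions.append(match.start())
    return positions


# Hypothetical resume fragment with one plain-text and one real occurrence.
resume = (
    "Self-evaluation: five years of work experience in software testing.\n"
    "Work experience\n"
    "2012-09 ~ 2013-02 Example Software Co., test engineer\n"
)
positions = real_keyword_positions(resume, "work experience", DATE_RANGE)
print(positions)
```

Only the second occurrence is kept, because only it is followed by a date range. A fixed character window is a simplification; a stricter variant could check only the text that follows the keyword.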

Step four: distinguish simple information fields from complex information fields according to the keyword information contained in the resume. Simple information fields include name, age, date of birth, and the like, whereas complex information fields contain sub-items, for example work experience and project experience. The two kinds of field are distinguished because a complex information field contains sub-items that require further analysis. For example, the sub-items of work experience include reason for leaving and employer.
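One way to make this distinction is a lookup table that maps each complex keyword to its sub-item keywords; any keyword absent from the table is treated as simple. The table name `COMPLEX_FIELDS`, the function name `split_fields`, and the sub-items listed for project experience are assumptions for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical table: complex keyword -> its sub-item (secondary) keywords.
# The sub-items of "Project experience" are invented placeholders.
COMPLEX_FIELDS: Dict[str, List[str]] = {
    "Work experience": ["Employer", "Reason for leaving"],
    "Project experience": ["Project name", "Role"],
}


def split_fields(found_keywords: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition the keywords found in a resume into (simple, complex)."""
    simple: List[str] = []
    complex_: List[str] = []
    for keyword in found_keywords:
        if keyword in COMPLEX_FIELDS:
            complex_.append(keyword)
        else:
            simple.append(keyword)
    return simple, complex_


simple_fields, complex_fields = split_fields(
    ["Name", "Age", "Date of birth", "Work experience"]
)
print(simple_fields, complex_fields)
```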

Step five: perform secondary analysis on the complex information fields and extract sub-item information; for example, extract the reason for leaving and the employer from the work experience above. The secondary analysis is in fact performed in the same manner as step three. The sub-item keywords extracted are called secondary keywords, and the specific information corresponding to each secondary keyword is obtained at the same time. That is, the keywords contained in the complex information field are analyzed, the keywords obtained are defined as secondary keywords, and each secondary keyword is extracted together with its corresponding specific information.
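The secondary analysis can be sketched as follows, assuming that each sub-item occupies its own line in the form "keyword: value". That layout, the function name `extract_sub_items`, and the sample block are assumptions; the source states only that the same approach as step three is reused.

```python
from typing import Dict, Iterable


def extract_sub_items(block: str, secondary_keywords: Iterable[str]) -> Dict[str, str]:
    """Map each secondary keyword found in the block to its value."""
    wanted = set(secondary_keywords)
    items: Dict[str, str] = {}
    for line in block.splitlines():
        # Split at the first colon; lines without a colon are skipped.
        key, separator, value = line.partition(":")
        if separator and key.strip() in wanted:
            items[key.strip()] = value.strip()
    return items


# Hypothetical text of one "work experience" field.
work_block = (
    "2012-09 ~ 2013-02\n"
    "Employer: Example Software Co.\n"
    "Reason for leaving: Relocation\n"
)
sub_items = extract_sub_items(work_block, ["Employer", "Reason for leaving"])
print(sub_items)
```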

Step six: output the simple information fields and the complex information fields. The output format can be standard XML data or JSON data.

In practice, the output simple and complex information fields are as shown in Fig. 4.
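A sketch of both output formats using only the Python standard library is shown below. The element names (`resume`, `field`, `group`, `item`), the JSON key names, and the sample values are assumptions; the source does not define an output schema.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict, List

SimpleFields = Dict[str, str]
ComplexFields = Dict[str, List[Dict[str, str]]]


def to_json(simple: SimpleFields, complex_: ComplexFields) -> str:
    """Serialize both kinds of field as one JSON document."""
    return json.dumps({"simple": simple, "complex": complex_},
                      ensure_ascii=False, indent=2)


def to_xml(simple: SimpleFields, complex_: ComplexFields) -> str:
    """Serialize both kinds of field as one XML document."""
    root = ET.Element("resume")
    for key, value in simple.items():
        ET.SubElement(root, "field", name=key).text = value
    for key, entries in complex_.items():
        group = ET.SubElement(root, "group", name=key)
        for entry in entries:
            item = ET.SubElement(group, "item")
            for sub_key, sub_value in entry.items():
                ET.SubElement(item, "field", name=sub_key).text = sub_value
    return ET.tostring(root, encoding="unicode")


# Hypothetical extraction result.
simple_out = {"Name": "Zhang San", "Date of birth": "1990-01-01"}
complex_out = {"Work experience": [{"Employer": "Example Software Co.",
                                    "Reason for leaving": "Relocation"}]}
print(to_json(simple_out, complex_out))
print(to_xml(simple_out, complex_out))
```

Each complex field is represented as a list of entries so that several jobs or projects can be output under the same keyword.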

The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the above embodiment does not limit the present invention in any form, and that all technical solutions obtained by equivalent replacement or equivalent transformation fall within the protection scope of the present invention.

Claims

1. A resume identification method, characterized by comprising the following steps:

Step one: set all potential keywords in a resume;

Step two: select the resume to be analyzed;

Step three: preprocess the resume according to the set keywords;

2. The resume identification method according to claim 1, characterized in that step one further comprises setting a keyword conflict analysis, which is used to determine the actual position of a keyword in the resume when the keyword appears at multiple positions in the resume.

3. The resume identification method according to claim 2, characterized in that, in step three, regular-expression matching is used to analyze the keywords contained in the resume.

4. The resume identification method according to claim 3, characterized in that, in step three, if a keyword appears at multiple positions in the resume, the actual position of the keyword in the resume is determined according to the keyword conflict analysis.

5. The resume identification method according to claim 4, characterized in that the keyword conflict analysis comprises: if a keyword appears at multiple positions in the resume, then for each position at which the keyword appears, analyzing the text before and after that position to search for verification information corresponding to the keyword; if the verification information exists, determining that the position is the actual position of the keyword; if the verification information does not exist, determining that the position is not the actual position of the keyword.

6. The resume identification method according to claim 5, characterized in that, in step three, if keywords are found in the resume, the method continues to the next step; if no keyword is found in the resume, the analysis ends.

7. The resume identification method according to claim 6, characterized in that the simple information fields include name, age, and date of birth; the complex information fields contain sub-items, for example work experience and project experience.

8. The resume identification method according to claim 7, characterized in that the secondary analysis of a complex information field comprises: analyzing the keywords contained in the complex information field, defining the keywords obtained as secondary keywords, and extracting each secondary keyword together with its corresponding specific information.

9. The resume identification method according to claim 8, characterized in that the resume is in any one of Word, HTML, PDF, and TXT formats.

10. The resume identification method according to claim 9, characterized in that the simple information fields and the complex information fields are output as standard XML data or JSON data.