CN105183742A - Resume identification method - Google Patents
Resume identification method
- Publication number
- CN105183742A (application CN201510321901.5A)
- Authority
- CN
- China
- Prior art keywords
- resume
- key word
- recognition methods
- information
- methods according
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/83—Querying
- G06F16/835—Query processing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a resume identification method comprising the following steps: step 1, setting all potential keywords in a resume; step 2, selecting a resume to be analyzed; step 3, preprocessing the resume according to the preset keywords and identifying the keywords the resume contains; step 4, distinguishing simple information fields from complex information fields according to the keywords the resume contains; step 5, performing secondary analysis on the complex information fields to extract sub-item information; and step 6, outputting the simple information fields and the complex information fields. The method extracts resume information efficiently and with high accuracy.
Description
Technical field
The present invention relates to a text recognition method, and specifically to a resume identification method. The invention belongs to the field of text recognition.
Background
A resume is a common type of text. Functionally, it is an important means by which its author introduces and promotes himself or herself and ultimately communicates effectively; structurally, it is a semi-structured text. Such texts are widely used and numerous, so extracting their information efficiently and accurately has become an urgent need. In terms of efficiency, manual reading clearly cannot meet current demand, so computer technology must be used. In terms of feasibility, given the characteristics of semi-structured text, text information extraction techniques such as regular expression matching, correlation analysis, and statistical methods can produce extraction results that meet practical needs, so machine recognition is feasible. However, the prior art does not yet provide a technique for effectively recognizing resumes.
Summary of the invention
To remedy the deficiencies of the prior art, the object of the present invention is to provide a resume identification method that solves the technical problem that the prior art has difficulty recognizing resumes effectively.
To achieve this object, the present invention adopts the following technical solution:
A resume identification method, characterized by comprising the following steps:
Step 1: set all potential keywords in the resume;
Step 2: select the resume to be analyzed;
Step 3: preprocess the resume according to the set keywords;
Step 4: distinguish simple information fields from complex information fields according to the keywords the resume contains;
Step 5: perform secondary analysis on the complex information fields and extract the sub-item information;
Step 6: output the simple information fields and the complex information fields.
In the aforesaid resume identification method, step 1 further comprises setting a keyword conflict analysis, which is used to determine the actual position of a keyword in the resume when the keyword occurs at multiple positions in the resume.
In the aforesaid resume identification method, step 3 uses regular expression matching to identify the keywords contained in the resume.
In the aforesaid resume identification method, in step 3, if a keyword occurs at multiple positions in the resume, the actual position of the keyword in the resume is determined according to the keyword conflict analysis.
In the aforesaid resume identification method, the keyword conflict analysis comprises: if a keyword occurs at multiple positions in the resume, then for each position at which it occurs, the text before and after that position is analyzed to determine whether check information corresponding to the keyword is present; if the check information is present, that position is judged to be the actual position of the keyword; if it is absent, that position is judged not to be the actual position of the keyword.
In the aforesaid resume identification method, in step 3, if keywords are found in the resume, the method proceeds to the next step; if no keyword is found in the resume, the analysis ends.
In the aforesaid resume identification method, the simple information fields include name, age, and date of birth; the complex information fields contain sub-items, for example work experience and project experience.
In the aforesaid resume identification method, the secondary analysis of a complex information field comprises: identifying the keywords contained in the complex information field, defining the keywords obtained as secondary keywords, and extracting the secondary keywords and their corresponding specific information.
In the aforesaid resume identification method, the resume is in any one of Word format, HTML format, PDF format, or txt format.
In the aforesaid resume identification method, the simple information fields and the complex information fields are output as standard XML data or JSON data.
The benefit of the present invention is that it extracts resume information efficiently and with high accuracy.
Brief description of the drawings
Fig. 1 is a flowchart of a preferred implementation of the present invention;
Fig. 2 is a schematic diagram of a resume in the present invention;
Fig. 3 is a schematic diagram of the keyword dictionary in the present invention;
Fig. 4 shows the resume identification result actually output by the present invention.
Detailed description
The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment.
Referring to Fig. 1, the present invention comprises the following steps:
Step 1: set all potential keywords in the resume. The keywords are stored in the form of a dictionary. This embodiment uses a recruitment resume, shown in Fig. 2, as the example. The corresponding keyword dictionary is shown in Fig. 3. Its keywords include name, gender, date of birth, place of residence, and so on. In this step, a keyword conflict analysis can also be set, which is used to determine the actual position of a keyword in the resume when the keyword occurs at multiple positions in the resume.
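As an illustration only (the patent publishes no code), the keyword dictionary of step 1 might be sketched in Python as follows. The field names, the alias lists, and the helper `build_label_index` are assumptions made for this sketch; the actual dictionary is the one shown in Fig. 3.

```python
# Hypothetical keyword dictionary: canonical field name -> labels that may
# appear in a resume. The aliases are illustrative assumptions.
KEYWORD_DICT = {
    "name": ["Name", "Full name"],
    "gender": ["Gender", "Sex"],
    "date_of_birth": ["Date of birth", "Birthday"],
    "residence": ["Residence", "Place of residence"],
}


def build_label_index(keyword_dict):
    """Map every lower-cased label to its canonical field name."""
    index = {}
    for field, labels in keyword_dict.items():
        for label in labels:
            index[label.lower()] = field
    return index


LABEL_INDEX = build_label_index(KEYWORD_DICT)
```

Keeping aliases under one canonical name lets later steps report a single field regardless of which label a particular resume uses.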
Step 2: select the resume to be analyzed. The resume is preferably in Word format or HTML format. Besides the commonly used Word format, XML-based Web text is also a kind of semi-structured text. XML is a semi-structured data description language; it overcomes the shortcoming of the traditional Web description language HTML, which can express only the content of data and not the structural features of Web data, a shortcoming that makes semi-structured data hard to query. XML-based Web text is gradually replacing HTML and becoming the new generation of Web data description and data exchange standard. Among semi-structured texts, XML-based Web text carries relatively rich format information and follows a fixed standard, so information extraction from this kind of text is easier than from other semi-structured texts.
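The patent does not say how text is obtained from an HTML resume. One possible approach, sketched here with Python's standard `html.parser` module, collects the text nodes in document order. The class `TextExtractor`, the function `html_to_lines`, and the sample markup are hypothetical, and the sketch does not filter out script or style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML document in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_lines(html):
    """Return the text nodes of an HTML string as a list of strings."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample: a two-row table of label/value cells.
sample = ("<table><tr><td>Name</td><td>Zhang San</td></tr>"
          "<tr><td>Gender</td><td>Male</td></tr></table>")
lines = html_to_lines(sample)
```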
Step 3: preprocess the resume according to the set keywords and identify the keywords the resume contains. In this step, the resume text is first segmented. The goal of segmentation is to break a resume text into many units. Because the unit is the basic building block of semi-structured text, decomposing a text into a sequence of units is the key to machine text information extraction. Segmentation is performed with regular expressions and can be implemented with reference to existing techniques.
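A minimal sketch of regular-expression segmentation, assuming that units are separated by line breaks, tabs, and ASCII or full-width colons and semicolons. The patent does not specify the delimiters, so `UNIT_SEPARATOR` and `split_into_units` are illustrative choices only.

```python
import re

# Assumed delimiters: line breaks, tabs, and ASCII or full-width
# colons and semicolons.
UNIT_SEPARATOR = re.compile(r"[\n\t:：;；]+")


def split_into_units(text):
    """Split resume text into a sequence of non-empty, trimmed units."""
    return [u.strip() for u in UNIT_SEPARATOR.split(text) if u.strip()]


sample = "Name: Zhang San\nGender: Male\nDate of birth: 1990-01-01"
units = split_into_units(sample)
```

With these delimiters a value such as a clock time written as 09:30 would be split in two, so a real implementation would need a more careful pattern.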
After text segmentation, text recognition is performed on the resume: the keywords are compared with the resume text to determine which keywords the resume contains. If a keyword occurs at multiple positions in the resume, only one of those positions is the real keyword and the remaining occurrences are ordinary text. Fuzzy analysis is performed on these repeated keywords to judge which occurrence is the real keyword and which are ordinary text. Specifically, the actual position of the keyword in the resume is determined according to the keyword conflict analysis.
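The comparison of keywords against the resume text can be sketched as a search that records every position of every keyword, so that repeated keywords can be flagged for conflict analysis. The function name `find_keyword_positions` and the sample text are assumptions for illustration.

```python
import re


def find_keyword_positions(text, keywords):
    """Return {keyword: [start offsets]} for every keyword found in text."""
    positions = {}
    for keyword in keywords:
        hits = [m.start() for m in re.finditer(re.escape(keyword), text)]
        if hits:
            positions[keyword] = hits
    return positions


text = ("Name: Li Si\nWork experience\n2012-09 ~ 2013-02 Acme\n"
        "I have rich Work experience in sales.")
positions = find_keyword_positions(text, ["Name", "Work experience", "Residence"])

# Keywords that occur more than once need conflict analysis.
repeated = [k for k, hits in positions.items() if len(hits) > 1]
```

An empty result would correspond to the case in which no keyword is found and the analysis ends.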
One keyword conflict analysis is given below. If a keyword occurs at multiple positions in the resume, then for each position at which it occurs, the text before and after that position is analyzed to determine whether check information corresponding to the keyword is present. If the check information is present, that position is judged to be the actual position of the keyword; if it is absent, that position is judged not to be the actual position of the keyword. Check information is a word that appears in the text before or after a real keyword and is related to that keyword in a way that can be verified. For example, for a potentially conflicting keyword, the text around each occurrence is analyzed. Suppose "work experience" occurs several times. If the "work experience" at a given position is the real keyword, it is generally followed by time information, such as 2012-09 ~ 2013-02, so that occurrence is taken as the keyword position. If an occurrence of "work experience" is not followed by a time, that occurrence is judged to be ordinary text rather than a real keyword.
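The "work experience" example can be sketched as follows. This is a simplification under stated assumptions: the check information is taken to be a year-month value such as 2012-09, only the text after the keyword is inspected (the patent also allows the text before it), and the window size, `CHECK_PATTERNS`, and `resolve_conflict` are hypothetical.

```python
import re

# Assumed check pattern: a year-month value such as 2012-09 signals a
# real "Work experience" heading.
CHECK_PATTERNS = {"Work experience": re.compile(r"\d{4}-\d{2}")}
WINDOW = 30  # assumed number of characters inspected after the keyword


def resolve_conflict(text, keyword):
    """Return the offsets of occurrences followed by check information."""
    check = CHECK_PATTERNS[keyword]
    real = []
    for m in re.finditer(re.escape(keyword), text):
        context = text[m.end():m.end() + WINDOW]
        if check.search(context):
            real.append(m.start())
    return real


text = ("Work experience\n2012-09 ~ 2013-02 Sales assistant\n"
        "Self-evaluation\nMy Work experience taught me teamwork.")
real_positions = resolve_conflict(text, "Work experience")
```

Here the first occurrence is kept because a date range follows it, while the occurrence inside the self-evaluation sentence is treated as ordinary text.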
Step 4: distinguish simple information fields from complex information fields according to the keywords the resume contains. Simple information fields include name, age, date of birth, and the like; complex information fields contain sub-items, for example work experience and project experience. The two are distinguished because a complex information field contains sub-items that require further analysis. For example, the sub-items of work experience include reason for leaving and employer.
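Step 4 amounts to partitioning the keywords found into two groups. A minimal sketch, assuming the set of complex fields is known in advance; `COMPLEX_FIELDS` and `classify_fields` are illustrative names.

```python
# Assumed set of fields that contain sub-items.
COMPLEX_FIELDS = {"Work experience", "Project experience"}


def classify_fields(found_keywords):
    """Partition keywords into (simple, complex) lists, preserving order."""
    simple = [k for k in found_keywords if k not in COMPLEX_FIELDS]
    complex_ = [k for k in found_keywords if k in COMPLEX_FIELDS]
    return simple, complex_


simple, complex_fields = classify_fields(
    ["Name", "Age", "Date of birth", "Work experience", "Project experience"]
)
```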
Step 5: perform secondary analysis on the complex information fields and extract the sub-item information, for example the reason for leaving and the employer from the work experience above. The secondary analysis is in fact carried out in the same way as step 3. The extracted sub-items are called secondary keywords, and the specific information corresponding to each secondary keyword is obtained at the same time. That is, the keywords contained in the complex information field are identified, the keywords obtained are defined as secondary keywords, and the secondary keywords and their corresponding specific information are extracted.
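The secondary analysis can be sketched by applying the same kind of regular-expression matching to the text of one complex field. The secondary keyword labels, the "label: value" line layout, and `extract_sub_items` are assumptions for illustration.

```python
import re

SECONDARY_KEYWORDS = ["Employer", "Reason for leaving"]  # assumed labels


def extract_sub_items(block, secondary_keywords):
    """Return {secondary keyword: value} for each 'keyword: value' line."""
    result = {}
    for keyword in secondary_keywords:
        # Capture the rest of the line after an ASCII or full-width colon.
        m = re.search(re.escape(keyword) + r"\s*[:：]\s*(.+)", block)
        if m:
            result[keyword] = m.group(1).strip()
    return result


block = "2012-09 ~ 2013-02\nEmployer: Acme Ltd\nReason for leaving: Relocation"
sub_items = extract_sub_items(block, SECONDARY_KEYWORDS)
```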
Step 6: output the simple information fields and the complex information fields. The output format can be standard XML or JSON.
In practice, the output simple and complex information fields are as shown in Fig. 4.
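The output step can be sketched with Python's standard `json` and `xml.etree.ElementTree` modules. The record layout, the element names, and the sample values are assumptions; the actual output is the one shown in Fig. 4.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical extraction result: flat simple fields, nested complex fields.
record = {
    "simple": {"name": "Zhang San", "gender": "Male"},
    "complex": {
        "work_experience": [
            {"employer": "Acme Ltd", "reason_for_leaving": "Relocation"}
        ]
    },
}


def to_json(rec):
    """Serialize the record as JSON, keeping non-ASCII characters readable."""
    return json.dumps(rec, ensure_ascii=False, indent=2)


def to_xml(rec):
    """Serialize the record as XML with one <item> per complex sub-entry."""
    root = ET.Element("resume")
    for key, value in rec["simple"].items():
        ET.SubElement(root, key).text = value
    for key, items in rec["complex"].items():
        group = ET.SubElement(root, key)
        for item in items:
            entry = ET.SubElement(group, "item")
            for sub_key, sub_value in item.items():
                ET.SubElement(entry, sub_key).text = sub_value
    return ET.tostring(root, encoding="unicode")


json_text = to_json(record)
xml_text = to_xml(record)
```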
The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the above embodiment does not limit the present invention in any way, and that any technical solution obtained by equivalent replacement or equivalent transformation falls within the scope of protection of the present invention.
Claims (10)
1. A resume identification method, characterized by comprising the following steps:
Step 1: set all potential keywords in the resume;
Step 2: select the resume to be analyzed;
Step 3: preprocess the resume according to the set keywords;
Step 4: distinguish simple information fields from complex information fields according to the keywords the resume contains;
Step 5: perform secondary analysis on the complex information fields and extract the sub-item information;
Step 6: output the simple information fields and the complex information fields.
2. The resume identification method according to claim 1, characterized in that step 1 further comprises setting a keyword conflict analysis, which is used to determine the actual position of a keyword in the resume when the keyword occurs at multiple positions in the resume.
3. The resume identification method according to claim 2, characterized in that step 3 uses regular expression matching to identify the keywords contained in the resume.
4. The resume identification method according to claim 3, characterized in that, in step 3, if a keyword occurs at multiple positions in the resume, the actual position of the keyword in the resume is determined according to the keyword conflict analysis.
5. The resume identification method according to claim 4, characterized in that the keyword conflict analysis comprises: if a keyword occurs at multiple positions in the resume, then for each position at which it occurs, the text before and after that position is analyzed to determine whether check information corresponding to the keyword is present; if the check information is present, that position is judged to be the actual position of the keyword; if it is absent, that position is judged not to be the actual position of the keyword.
6. The resume identification method according to claim 5, characterized in that, in step 3, if keywords are found in the resume, the method proceeds to the next step; if no keyword is found in the resume, the analysis ends.
7. The resume identification method according to claim 6, characterized in that the simple information fields include name, age, and date of birth, and the complex information fields contain sub-items, for example work experience and project experience.
8. The resume identification method according to claim 7, characterized in that the secondary analysis of a complex information field comprises: identifying the keywords contained in the complex information field, defining the keywords obtained as secondary keywords, and extracting the secondary keywords and their corresponding specific information.
9. The resume identification method according to claim 8, characterized in that the resume is in any one of Word format, HTML format, PDF format, or txt format.
10. The resume identification method according to claim 9, characterized in that the simple information fields and the complex information fields are output as standard XML data or JSON data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510321901.5A CN105183742A (en) | 2015-06-12 | 2015-06-12 | Resume identification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510321901.5A CN105183742A (en) | 2015-06-12 | 2015-06-12 | Resume identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105183742A (en) | 2015-12-23 |
Family
ID=54905826
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510321901.5A Pending CN105183742A (en) | 2015-06-12 | 2015-06-12 | Resume identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105183742A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106933783A (en) * | 2015-12-31 | 2017-07-07 | 远光软件股份有限公司 | A kind of method and device on the intelligent extraction date from text |
CN107392143A (en) * | 2017-07-20 | 2017-11-24 | 中国科学院软件研究所 | A kind of resume accurate Analysis method based on SVM text classifications |
CN109165295A (en) * | 2018-09-27 | 2019-01-08 | 天涯社区网络科技股份有限公司 | A kind of intelligence resume appraisal procedure |
CN109271479A (en) * | 2018-09-29 | 2019-01-25 | 广东润弘科技有限公司 | A kind of resume structuring processing method |
CN109948120A (en) * | 2019-04-02 | 2019-06-28 | 深圳市前海欢雀科技有限公司 | A kind of resume analytic method based on dualization |
CN110020327A (en) * | 2019-04-16 | 2019-07-16 | 上海大易云计算股份有限公司 | A kind of resume resolution system based on vertical search engine |
CN110222292A (en) * | 2019-04-29 | 2019-09-10 | 毕昀 | Website resume automatically parses method, computer equipment and storage medium |
CN112214572A (en) * | 2020-10-20 | 2021-01-12 | 济南浪潮高新科技投资发展有限公司 | Method for secondarily extracting entities in resume analysis |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102043796A (en) * | 2009-10-14 | 2011-05-04 | 无锡华润上华半导体有限公司 | Information collecting method and device based on Internet |
CN102999523A (en) * | 2011-09-16 | 2013-03-27 | 陆敏 | Intelligence digitizing method |
CN103634420A (en) * | 2013-11-22 | 2014-03-12 | 北京极客优才科技有限公司 | Resume e-mail screening system and method |
CN104572849A (en) * | 2014-12-17 | 2015-04-29 | 西安美林数据技术股份有限公司 | Automatic standardized filing method based on text semantic mining |
-
2015
- 2015-06-12 CN CN201510321901.5A patent/CN105183742A/en active Pending
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106933783A (en) * | 2015-12-31 | 2017-07-07 | 远光软件股份有限公司 | A kind of method and device on the intelligent extraction date from text |
CN107392143A (en) * | 2017-07-20 | 2017-11-24 | 中国科学院软件研究所 | A kind of resume accurate Analysis method based on SVM text classifications |
CN107392143B (en) * | 2017-07-20 | 2019-12-27 | 中国科学院软件研究所 | Resume accurate analysis method based on SVM text classification |
CN109165295A (en) * | 2018-09-27 | 2019-01-08 | 天涯社区网络科技股份有限公司 | A kind of intelligence resume appraisal procedure |
CN109271479A (en) * | 2018-09-29 | 2019-01-25 | 广东润弘科技有限公司 | A kind of resume structuring processing method |
CN109948120A (en) * | 2019-04-02 | 2019-06-28 | 深圳市前海欢雀科技有限公司 | A kind of resume analytic method based on dualization |
CN109948120B (en) * | 2019-04-02 | 2023-03-14 | 深圳市前海欢雀科技有限公司 | Binary resume parsing method |
CN110020327A (en) * | 2019-04-16 | 2019-07-16 | 上海大易云计算股份有限公司 | A kind of resume resolution system based on vertical search engine |
CN110222292A (en) * | 2019-04-29 | 2019-09-10 | 毕昀 | Website resume automatically parses method, computer equipment and storage medium |
CN112214572A (en) * | 2020-10-20 | 2021-01-12 | 济南浪潮高新科技投资发展有限公司 | Method for secondarily extracting entities in resume analysis |
CN112214572B (en) * | 2020-10-20 | 2022-11-01 | 山东浪潮科学研究院有限公司 | Method for secondarily extracting entities in resume analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105183742A (en) | Resume identification method | |
CN105975499B (en) | A kind of text subject detection method and system | |
CN104598535B (en) | A kind of event extraction method based on maximum entropy | |
CN104572958A (en) | Event extraction based sensitive information monitoring method | |
CN103294664A (en) | Method and system for discovering new words in open fields | |
CN106777957B (en) | The new method of biomedical more ginseng event extractions on unbalanced dataset | |
CN111309910A (en) | Text information mining method and device | |
CN102270212A (en) | User interest feature extraction method based on hidden semi-Markov model | |
CN108664474A (en) | A kind of resume analytic method based on deep learning | |
CN110929520B (en) | Unnamed entity object extraction method and device, electronic equipment and storage medium | |
CN106372053B (en) | Syntactic analysis method and device | |
CN103838796A (en) | Webpage structured information extraction method | |
CN110032649A (en) | Relation extraction method and device between a kind of entity of TCM Document | |
CN105975475A (en) | Chinese phrase string-based fine-grained thematic information extraction method | |
CN104462041A (en) | Method for completely detecting hot event from beginning to end | |
CN101782897A (en) | Chinese corpus labeling method based on events | |
CN103246641A (en) | Text semantic information analyzing system and method | |
CN106503256B (en) | A kind of hot information method for digging based on social networks document | |
CN105868408A (en) | Machine learning based recruitment information analyzing system and method thereof | |
CN105389303B (en) | A kind of automatic fusion method of heterologous corpus | |
CN104991909B (en) | A kind of dictionary method for auto constructing for specific software history codes storehouse | |
CN110866172B (en) | Data analysis method for block chain system | |
CN102737244B (en) | Method for determining corresponding relationships between areas and annotations in annotated image | |
CN103646117A (en) | Link-based bilingual parallel page identification method and system | |
CN109325159A (en) | A kind of microblog hot event method for digging |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20151223 |