CN113886588A - Major professional employment direction identification method based on recruitment text mining - Google Patents

Major professional employment direction identification method based on recruitment text mining

Info

Publication number
CN113886588A
Authority
CN
China
Prior art keywords
recruitment
word
data
professional
employment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111220573.1A
Other languages
Chinese (zh)
Inventor
张建桃
曾莉
刘洁荧
韦婷婷
黄文玲
宋世领
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Agricultural University
Priority to CN202111220573.1A
Publication of CN113886588A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for identifying the main employment directions of an academic major based on recruitment text mining. Taking the recruitment information on a recruitment website as the data source, the method analyzes the names of the recruitment posts for the major through four main steps (data acquisition, data preprocessing, word vectorization and K-means clustering) to obtain the main employment directions of the major. This text-mining-based approach to research on major-oriented talent training can quickly, efficiently and accurately identify, from online recruitment text data, the employment directions in which the job market demands graduates of the major; it can help colleges and universities optimize and improve their talent training programs and provides decision support for training graduates who meet market demand.

Description

Major professional employment direction identification method based on recruitment text mining
Technical Field
The invention relates to the field of identifying the employment directions of academic majors, and in particular to a method for identifying the main employment directions of a major based on recruitment text mining.
Background
As professional education in colleges and universities diversifies, the employment directions open to graduates of a major become broader, and enterprises have different requirements for graduates in the different directions of that major. Against a background in which the mismatch between the supply of and demand for talent is increasingly prominent, accurate insight into the employment directions in which the market demands graduates of a major is therefore the key for colleges and universities to train graduates who meet market demand, to promote their employment, and to resolve the mismatch between supply and demand.
According to the "2020 China Online Recruitment Industry Market Development Research Report" issued by iResearch (艾瑞网), the number of enterprise employers using online recruitment reached 4.866 million (486.6万) in 2019, and online recruitment has become the main way in which enterprises recruit. Extracting enterprise recruitment requirements from online recruitment information is therefore an effective way to learn what the job market demands. Text mining is a technique for extracting meaningful information from unstructured text data.
Disclosure of Invention
The invention aims to provide a method for identifying the main employment directions of an academic major based on recruitment text mining. The method uses text mining to acquire the names of the recruitment posts for the major from a recruitment website and analyzes them to obtain the main employment directions of the major, so that colleges and universities can optimize and improve their talent training programs, and so as to provide decision support for training graduates who meet market demand.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows:
a method for identifying the main employment directions of an academic major based on recruitment text mining comprises the following steps:
step 1: data acquisition, namely crawling the names of the recruitment posts for the major from a selected recruitment website using Web crawler technology, with the name of the major as the keyword;
step 2: data preprocessing, namely preprocessing the collected recruitment post name data;
step 3: word vectorization, namely vectorizing the recruitment post names with the Word2vec algorithm to obtain a vector representation of each recruitment post name;
step 4: K-means clustering, namely performing cluster analysis on the recruitment post names with the K-means clustering algorithm to obtain the main employment directions of the major.
Preferably, the data acquisition comprises the following sub-steps:
step 1.1: formulating the crawler rule, namely determining the web page URL, the page number range and the post screening conditions used to acquire the recruitment post name data;
step 1.2: crawling, namely acquiring the names of the online recruitment posts for the major using Web crawler technology according to the formulated crawler rule.
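Sub-steps 1.1 and 1.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler used in the invention: the `CrawlRule` fields, the query parameter names (`keyword`, `page`), the example URL and the CSS class `job-name` are all hypothetical, and the page HTML is passed in as a string so that no particular HTTP client or site layout is assumed.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urlencode


@dataclass(frozen=True)
class CrawlRule:
    """Step 1.1: URL, page range and screening conditions (all values hypothetical)."""
    base_url: str
    first_page: int
    last_page: int
    filters: dict = field(default_factory=dict)  # post screening conditions
    name_class: str = "job-name"                 # CSS class assumed to hold a post name


def build_page_urls(rule, keyword):
    """Expand a rule into one search URL per result page."""
    return [
        rule.base_url + "?" + urlencode({"keyword": keyword, "page": page, **rule.filters})
        for page in range(rule.first_page, rule.last_page + 1)
    ]


class _PostNameParser(HTMLParser):
    """Collects the text of every element carrying the target CSS class.

    Simplification: the element is assumed to contain plain text only.
    """

    def __init__(self, name_class):
        super().__init__()
        self.name_class = name_class
        self.names = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.name_class in classes:
            self._capturing = True
            self.names.append("")

    def handle_endtag(self, tag):
        self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.names[-1] += data


def extract_post_names(page_html, name_class="job-name"):
    """Step 1.2: pull the post names out of one downloaded result page."""
    parser = _PostNameParser(name_class)
    parser.feed(page_html)
    parser.close()
    return [name.strip() for name in parser.names if name.strip()]
```

In practice each URL returned by `build_page_urls` would be fetched with an HTTP client of choice, subject to the website's terms of use, and the response body passed to `extract_post_names`.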
Preferably, the data preprocessing comprises the following sub-steps:
step 2.1: data cleaning, namely cleaning the collected recruitment post name data to remove noise, where the noise comprises null values, duplicate values, abnormal values, HTML (hypertext markup language) tags and the like;
step 2.2: constructing the custom dictionaries, namely selecting compound words specific to post names from the data after word segmentation and stop-word removal and adding them to the custom word segmentation dictionary, and selecting words with no research significance and adding them to the custom stop-word library;
step 2.3: word segmentation and stop-word removal, namely segmenting the data with the Jieba word segmentation package in Python together with the constructed custom segmentation dictionary, and removing stop words using the Harbin Institute of Technology (HIT) stop-word list combined with the constructed custom dictionary.
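The cleaning and stop-word removal sub-steps can be sketched as below. This is an illustrative sketch only: the `max_len` cut-off used to drop abnormal values is a hypothetical rule that the source does not specify, and the function operates on tokens that have already been segmented, so the Jieba call itself is not shown.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def clean_post_names(raw_names, max_len=30):
    """Step 2.1: remove HTML tags, null values, duplicates and abnormal values.

    max_len is a hypothetical threshold: names longer than this are treated
    as abnormal values and dropped.
    """
    seen = set()
    cleaned = []
    for raw in raw_names:
        if raw is None:
            continue                                   # null value
        text = html.unescape(TAG_RE.sub("", raw))      # HTML tags and entities
        text = re.sub(r"\s+", " ", text).strip()
        if not text or len(text) > max_len:
            continue                                   # empty or abnormal value
        if text in seen:
            continue                                   # duplicate value
        seen.add(text)
        cleaned.append(text)
    return cleaned


def remove_stop_words(tokens, stop_words):
    """Step 2.3 (second half): drop blank tokens and tokens in the stop-word set."""
    return [tok for tok in tokens if tok.strip() and tok not in stop_words]
```

With Jieba installed, the tokens would typically come from `jieba.lcut(name)` after the custom segmentation dictionary has been loaded with `jieba.load_userdict(path)`; the stop-word set would be the union of the HIT list and the custom stop-word library.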
Preferably, said word vectorization comprises the sub-steps of:
step 3.1: initializing the word vectors, namely initializing the vector representation of every word in the dictionary with a fixed-length random sequence drawn from a uniform distribution;
step 3.2: training the word vectors, namely modeling the problem, through a conditional probability model, as a language model that predicts a target word given its context; the vector representation of the target word is obtained by maximizing the log-likelihood objective function using gradient descent and back propagation,
$$L = \sum_{t=1}^{T} \log P(\omega_t \mid \omega_{t-c}:\omega_{t+c}) \tag{1}$$
in the formula, $P(\omega_t \mid \omega_{t-c}:\omega_{t+c})$ is the conditional probability; $T$ is the length of the sentence; $\omega_t$ is the predicted target word; $c$ is the context size; $\omega_{t-c}:\omega_{t+c}$ denotes the $c$ words before through the $c$ words after the target word, excluding the target word itself;
the conditional probability $P$ is obtained by softmax,
$$P(\omega_t \mid \omega_{t-c}:\omega_{t+c}) = \frac{\exp(\bar{v}^{\top} v_{\omega_t})}{\sum_{n=1}^{N} \exp(\bar{v}^{\top} v_n)} \tag{2}$$
$$\bar{v} = \frac{1}{2c} \sum_{-c \le j \le c,\; j \ne 0} v_j \tag{3}$$
in the formula, $N$ is the size of the word list; $\bar{v}^{\top}$ is the transpose of $\bar{v}$; $v_{\omega_t}$ is the vector representation of the target word; $\exp()$ is the exponential function with the natural constant as its base; $v_n$ is the vector representation of the $n$-th word in the word list; $v_j$ is the vector representation of the $j$-th word in the context of the target word.
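The softmax step and the log-likelihood objective described above can be illustrated with a small pure-Python sketch. It makes several assumptions that are not in the source: the initialization range of [-0.5, 0.5], a single vector table shared by context and target words (common Word2vec implementations keep separate input and output matrices), and averaging over however many context words exist at the sentence edges. It only evaluates the objective; the gradient descent update is not shown.

```python
import math
import random


def init_vectors(vocab, dim, seed=0):
    """Step 3.1: one fixed-length, uniformly distributed random vector per word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for word in vocab}


def mean_vector(words, vectors):
    """Average of the context word vectors."""
    columns = zip(*(vectors[w] for w in words))
    return [sum(col) / len(words) for col in columns]


def cbow_probabilities(context, vectors):
    """Softmax over the whole word list given the context words."""
    h = mean_vector(context, vectors)
    scores = {w: sum(a * b for a, b in zip(h, v)) for w, v in vectors.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def log_likelihood(sentence, c, vectors):
    """Sum of log P(target | context) over the positions of one sentence."""
    total = 0.0
    for t, target in enumerate(sentence):
        context = sentence[max(0, t - c):t] + sentence[t + 1:t + 1 + c]
        if context:  # a one-word sentence has no context to condition on
            total += math.log(cbow_probabilities(context, vectors)[target])
    return total
```

Training would adjust the vectors to increase `log_likelihood` over all segmented post names; an existing Word2vec implementation would normally be used for that rather than hand-written updates.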
Preferably, the K-means clustering comprises the following sub-steps:
step 4.1: performing K-means clustering on the recruitment post names, namely clustering the post name vectors with the K-means algorithm, which takes minimizing the sum of squared errors (SSE) between the samples and their cluster centroids as its objective function, calculated as follows:
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in E_i} \lVert x - e_i \rVert^{2} \tag{4}$$
$$e_i = \frac{1}{N_i} \sum_{x \in E_i} x \tag{5}$$
in the formula: $K$ is the number of clusters; $E_i$ is the $i$-th cluster; $e_i$ is the centroid of $E_i$; $x$ is a sample in $E_i$; $N_i$ is the number of samples in $E_i$;
the smallest $k$ satisfying the following constraint is selected as the optimal number of clusters, namely the value of $K$,
$$\mathrm{Gap}_k \ge \mathrm{Gap}_{k+1} - s_{k+1} \tag{6}$$
$$\mathrm{Gap}_k = \frac{1}{B} \sum_{b=1}^{B} \log(\mathrm{SSE}_{kb}) - \log(\mathrm{SSE}_k) \tag{7}$$
$$s_k = \sqrt{\frac{1+B}{B}} \, \sqrt{\frac{1}{B} \sum_{b=1}^{B} \Bigl( \log(\mathrm{SSE}_{kb}) - \frac{1}{B} \sum_{b'=1}^{B} \log(\mathrm{SSE}_{kb'}) \Bigr)^{2}} \tag{8}$$
in the formula, $B$ is the number of Monte Carlo simulations; $\mathrm{SSE}_k$ is the SSE computed on the current samples when the number of clusters is $k$; $\mathrm{SSE}_{kb}$ is the sum of squared errors to the centroids computed in the $b$-th Monte Carlo simulation when the number of clusters is $k$;
step 4.2: summarizing the main employment directions, namely summarizing the posts in each cluster after K-means clustering to obtain the main employment directions of the major.
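The clustering objective in step 4.1 can be sketched with a plain implementation of Lloyd's algorithm that returns the SSE alongside the cluster labels. This is a minimal sketch, assuming random initial centroids drawn from the samples and a fixed iteration cap; the source does not state how the centroids are initialized, and the function names are illustrative.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Centroid of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids, sse)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # assumed initialization
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = mean_point(members)
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse
```

Running `kmeans` for a range of `k` values on the post name vectors yields the SSE values that the selection rule for the optimal number of clusters compares.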
The invention has the following beneficial effects: compared with traditional survey methods such as questionnaires, enterprise visits and expert consultation, text mining makes it possible to identify, quickly, efficiently and accurately, the employment directions in which the job market demands graduates of a major from online recruitment text data. The method applies text mining to the names of the online recruitment posts for the major, obtains the main employment directions of the major through four main steps (data acquisition, data preprocessing, word vectorization and K-means clustering), and provides decision support for optimizing and improving talent training programs in colleges and universities.
Drawings
Fig. 1 is a flow chart of the method for identifying the main employment directions of a major based on recruitment text mining according to the present invention.
Fig. 2 is a flow chart of the construction of the custom lexicon.
Fig. 3 is a graph of the Gap value as a function of k.
Detailed Description
In order to make the technical features, objects and effects of the present invention more clearly understood, the present invention is described in further detail below with reference to the accompanying drawings and an example. The embodiments described herein only explain the technical solution of the present invention and are not intended to limit it.
The invention provides a method for identifying the main employment directions of an academic major based on recruitment text mining, whose flow is shown in Fig. 1. Taking the industrial engineering major as an example, the implementation comprises the following steps:
step 1: data acquisition, namely using web crawler technology to crawl, from the currently popular recruitment website 51job (前程无忧, https://www.51job.com), the names of the recruitment posts for the industrial engineering major nationwide, with "industrial engineering" as the search keyword;
step 2: data preprocessing, namely performing data cleaning operations such as deduplication, null removal and noise removal on the collected post name data, constructing the custom segmentation dictionary and stop-word library according to the custom lexicon construction flow shown in Fig. 2, and performing word segmentation and stop-word removal on the recruitment data using the Jieba word segmentation package in Python together with the Harbin Institute of Technology (HIT) stop-word list. The post names before and after preprocessing are compared in Table 1; 8169 valid industrial engineering recruitment post names were obtained.
Table 1. Comparison of post name data before and after preprocessing (the table is an image in the original publication and is not reproduced here)
step 3: word vectorization, namely vectorizing the recruitment post name data with the Word2vec algorithm to obtain a vector representation of each post name.
step 4: K-means clustering, namely clustering the recruitment post names with the K-means algorithm. The curve of Gap(k) is shown in Fig. 3, from which the optimal number of clusters is K = 4. The clustering results for the post names are shown in Table 2. Analyzing the post names in each category shows that: industrial engineer, process engineer, IE engineer and the like in category 1 can be classified as engineering posts; production planning, lean production, production management and the like in category 2 as production posts; logistics specialist, supply chain director, supply chain specialist and the like in category 3 as logistics and supply chain posts; and ERP implementation consultant, SAP implementation consultant and MES implementation consultant in category 4 as consulting posts. Cluster analysis of the post names thus yields four main employment directions for the industrial engineering major: engineering management, production management, logistics and supply chain, and consulting.
Table 2. Post name clustering results (the table is an image in the original publication and is not reproduced here)
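The rule used to pick the number of clusters, the smallest k for which Gap_k is at least Gap_{k+1} minus s_{k+1}, can be sketched as follows. The sketch assumes the standard gap statistic definitions (Gap_k as the mean log SSE over the B Monte Carlo reference simulations minus the log SSE of the real samples, and s_k as the standard deviation of those log values scaled by the square root of 1 + 1/B). The SSE values are taken as inputs, and the function names are illustrative.

```python
import math


def gap_and_s(sse_k, reference_sses):
    """Gap_k and s_k for one k, from SSE_k and the B simulated SSE_kb values."""
    b = len(reference_sses)
    logs = [math.log(value) for value in reference_sses]
    mean_log = sum(logs) / b
    gap = mean_log - math.log(sse_k)
    sd = math.sqrt(sum((value - mean_log) ** 2 for value in logs) / b)
    return gap, sd * math.sqrt(1 + 1 / b)


def choose_k(sse_by_k, reference_by_k):
    """Smallest k with Gap_k >= Gap_{k+1} - s_{k+1}; None if no k qualifies.

    sse_by_k maps k to SSE_k; reference_by_k maps k to the list of SSE_kb.
    """
    ks = sorted(sse_by_k)
    stats = {k: gap_and_s(sse_by_k[k], reference_by_k[k]) for k in ks}
    for k in ks:
        if k + 1 in stats:
            gap_next, s_next = stats[k + 1]
            if stats[k][0] >= gap_next - s_next:
                return k
    return None
```

For example, with purely illustrative inputs (not data from the invention), `choose_k({1: 100, 2: 40, 3: 10, 4: 9}, {1: [100, 100], 2: [80, 80], 3: [60, 60], 4: [50, 50]})` selects 3, because the gap stops increasing after k = 3.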
The example shows that the method for identifying the main employment directions of a major based on recruitment text mining provided by the invention can quickly, efficiently and accurately identify, from online recruitment text data, the employment directions in which the job market demands graduates of a major. The method can be applied and extended to other majors, and provides decision support for colleges and universities to optimize and improve their talent training programs and to train graduates who meet market demand.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
The above-described embodiments of the present invention do not limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention shall be included in the protection scope of the claims of the present invention.

Claims (5)

1. A method for identifying the main employment directions of an academic major based on recruitment text mining, characterized by comprising the following steps:
step 1: data acquisition, namely crawling the names of the recruitment posts for the major from a selected recruitment website using Web crawler technology, with the name of the major as the keyword;
step 2: data preprocessing, namely preprocessing the collected recruitment post name data;
step 3: word vectorization, namely vectorizing the recruitment post names with the Word2vec algorithm to obtain a vector representation of each recruitment post name;
step 4: K-means clustering, namely performing cluster analysis on the recruitment post names with the K-means clustering algorithm to obtain the main employment directions of the major.
2. The method for identifying the main employment directions of an academic major based on recruitment text mining according to claim 1, wherein the data acquisition of step 1 comprises the following sub-steps:
step 1.1: formulating the crawler rule, namely determining the web page URL, the page number range and the post screening conditions used to acquire the recruitment post name data;
step 1.2: crawling, namely acquiring the names of the online recruitment posts for the major using Web crawler technology according to the formulated crawler rule.
3. The method for identifying the main employment directions of an academic major based on recruitment text mining according to claim 1, wherein the data preprocessing of step 2 comprises the following sub-steps:
step 2.1: data cleaning, namely cleaning the collected recruitment post name data to remove noise, where the noise comprises null values, duplicate values, abnormal values and HTML (hypertext markup language) tags;
step 2.2: constructing the custom dictionaries, namely selecting compound words specific to post names from the data after word segmentation and stop-word removal and adding them to the custom word segmentation dictionary, and selecting words with no research significance and adding them to the custom stop-word library;
step 2.3: word segmentation and stop-word removal, namely segmenting the data with the Jieba word segmentation package in Python together with the constructed custom segmentation dictionary, and removing stop words using the Harbin Institute of Technology (HIT) stop-word list combined with the constructed custom dictionary.
4. The method for identifying the main employment directions of an academic major based on recruitment text mining according to claim 1, wherein the word vectorization of step 3 comprises the following sub-steps:
step 3.1: initializing the word vectors, namely initializing the vector representation of every word in the dictionary with a fixed-length random sequence drawn from a uniform distribution;
step 3.2: training the word vectors, namely modeling the problem, through a conditional probability model, as a language model that predicts a target word given its context; the vector representation of the target word is obtained by maximizing the log-likelihood objective function using gradient descent and back propagation,
$$L = \sum_{t=1}^{T} \log P(\omega_t \mid \omega_{t-c}:\omega_{t+c}) \tag{1}$$
in the formula, $P(\omega_t \mid \omega_{t-c}:\omega_{t+c})$ is the conditional probability; $T$ is the length of the sentence; $\omega_t$ is the predicted target word; $c$ is the context size; $\omega_{t-c}:\omega_{t+c}$ denotes the $c$ words before through the $c$ words after the target word, excluding the target word itself;
the conditional probability $P$ is obtained by softmax,
$$P(\omega_t \mid \omega_{t-c}:\omega_{t+c}) = \frac{\exp(\bar{v}^{\top} v_{\omega_t})}{\sum_{n=1}^{N} \exp(\bar{v}^{\top} v_n)} \tag{2}$$
$$\bar{v} = \frac{1}{2c} \sum_{-c \le j \le c,\; j \ne 0} v_j \tag{3}$$
in the formula, $N$ is the size of the word list; $\bar{v}^{\top}$ is the transpose of $\bar{v}$; $v_{\omega_t}$ is the vector representation of the target word; $\exp()$ is the exponential function with the natural constant as its base; $v_n$ is the vector representation of the $n$-th word in the word list; $v_j$ is the vector representation of the $j$-th word in the context of the target word.
5. The method for identifying the main employment directions of an academic major based on recruitment text mining according to any one of claims 1 to 4, wherein the K-means clustering of step 4 comprises the following sub-steps:
step 4.1: performing K-means clustering on the recruitment post names, namely clustering the post name vectors with the K-means algorithm, which takes minimizing the sum of squared errors (SSE) between the samples and their cluster centroids as its objective function, calculated as follows:
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in E_i} \lVert x - e_i \rVert^{2} \tag{4}$$
$$e_i = \frac{1}{N_i} \sum_{x \in E_i} x \tag{5}$$
in the formula: $K$ is the number of clusters; $E_i$ is the $i$-th cluster; $e_i$ is the centroid of $E_i$; $x$ is a sample in $E_i$; $N_i$ is the number of samples in $E_i$;
the smallest $k$ satisfying the following constraint is selected as the optimal number of clusters, namely the value of $K$,
$$\mathrm{Gap}_k \ge \mathrm{Gap}_{k+1} - s_{k+1} \tag{6}$$
$$\mathrm{Gap}_k = \frac{1}{B} \sum_{b=1}^{B} \log(\mathrm{SSE}_{kb}) - \log(\mathrm{SSE}_k) \tag{7}$$
$$s_k = \sqrt{\frac{1+B}{B}} \, \sqrt{\frac{1}{B} \sum_{b=1}^{B} \Bigl( \log(\mathrm{SSE}_{kb}) - \frac{1}{B} \sum_{b'=1}^{B} \log(\mathrm{SSE}_{kb'}) \Bigr)^{2}} \tag{8}$$
in the formula, $B$ is the number of Monte Carlo simulations; $\mathrm{SSE}_k$ is the SSE computed on the current samples when the number of clusters is $k$; $\mathrm{SSE}_{kb}$ is the sum of squared errors to the centroids computed in the $b$-th Monte Carlo simulation when the number of clusters is $k$;
step 4.2: summarizing the main employment directions, namely summarizing the posts in each cluster after K-means clustering to obtain the main employment directions of the major.
CN202111220573.1A 2021-10-20 2021-10-20 Major professional employment direction identification method based on recruitment text mining Pending CN113886588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111220573.1A CN113886588A (en) 2021-10-20 2021-10-20 Major professional employment direction identification method based on recruitment text mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111220573.1A CN113886588A (en) 2021-10-20 2021-10-20 Major professional employment direction identification method based on recruitment text mining

Publications (1)

Publication Number Publication Date
CN113886588A true CN113886588A (en) 2022-01-04

Family

ID=79003645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111220573.1A Pending CN113886588A (en) 2021-10-20 2021-10-20 Major professional employment direction identification method based on recruitment text mining

Country Status (1)

Country Link
CN (1) CN113886588A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008338A (en) * 2019-03-04 2019-07-12 华南理工大学 A kind of electric business evaluation sentiment analysis method of fusion GAN and transfer learning
CN112861530A (en) * 2021-03-17 2021-05-28 华南农业大学 Course setting analysis method based on text mining
CN113139599A (en) * 2021-04-22 2021-07-20 北方工业大学 Service distributed clustering method fusing word vector expansion and topic model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
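Zhu Ailu (朱爱璐): "Analysis of talent demand for data analysis positions based on text mining" (基于文本挖掘的数据分析岗位人才需求分析), China Master's Theses Full-text Database (electronic journal), no. 11, 15 November 2020 (2020-11-15), pages 123 - 81 *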

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523225A (en) * 2023-04-18 2023-08-01 泸州职业技术学院 Data mining-based overturning classroom hybrid teaching method
CN116523225B (en) * 2023-04-18 2024-01-23 泸州职业技术学院 Data mining-based overturning classroom hybrid teaching method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination