CN113111187B - Method and system for mining employment platform comments - Google Patents

Method and system for mining employment platform comments

Info

Publication number
CN113111187B
CN113111187B CN202110369952.0A CN202110369952A CN113111187B CN 113111187 B CN113111187 B CN 113111187B CN 202110369952 A CN202110369952 A CN 202110369952A CN 113111187 B CN113111187 B CN 113111187B
Authority
CN
China
Prior art keywords
word
unit
denotes
recruitment
employment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110369952.0A
Other languages
Chinese (zh)
Other versions
CN113111187A (en
Inventor
吴方同
吴晓军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Jilian Human Resources Service Group Co ltd
Original Assignee
Hebei Jilian Human Resources Service Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Jilian Human Resources Service Group Co ltd
Priority to CN202110369952.0A
Publication of CN113111187A
Application granted
Publication of CN113111187B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for mining employment platform comments based on natural language processing. Worker comment data are acquired from an employment platform, stored in a comment data table, and marked as new data. An employment unit thesaurus and a feature vector matrix are constructed from the comment data, the co-occurrence frequency of employment units is analyzed, and the co-occurrence frequency is output and displayed according to the word frequency data. Analyzing the co-occurrence frequency of employment units makes it possible to find enterprises whose work is similar in nature and to push resumes to recruiting enterprises as required, which improves the calculation speed of the algorithm and the efficiency of matching resumes to employment units.

Description

Method and system for mining employment platform comments
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a system for mining employment platform comments.
Background
Online recruitment now takes increasingly diverse forms, and accurate resume pushing is a new technical means that lets applicants obtain job opportunities quickly and lets recruiters be matched quickly with suitable workers.
In the prior art, resume matching and pushing is commonly implemented with neural network algorithms, which use deep learning to match worker data to a particular industry or enterprise. These techniques generally require building a complex mathematical model and performing large-scale data computation, so they are time-consuming and inefficient. A solution that obtains and pushes resumes quickly and with high accuracy is therefore needed in the art.
Disclosure of Invention
To address these problems, the invention provides a method and a system for mining employment platform comments based on natural language processing. Employment platform comment data are acquired, an employment unit thesaurus and a feature vector matrix are constructed from the comment data, and the co-occurrence frequency of employment units is analyzed so that enterprises whose work is similar in nature can be found and resumes can be pushed to recruiting enterprises as required.
To achieve this purpose, the invention provides a method for mining employment platform comments, comprising:
Step 101, acquiring worker comment data from the employment platform, storing the comment data in a comment data table, and marking the comment data as new data;
Step 102, constructing an employment unit thesaurus;
Step 103, constructing a job type thesaurus;
Step 104, constructing a feature vector matrix;
Step 105, analyzing co-occurrence frequency;
Step 106, outputting and displaying the co-occurrence frequency according to the word frequency data.
Further, the method comprises periodically traversing the newly acquired comment data.
Further, constructing the employment unit thesaurus specifically comprises:
(1) Loading the new comments into the text set XText_i(j), where i is the number of new comments and j denotes the j-th comment;
(2) Using the indexOf() function to judge whether the new comment data contain employment unit information: when XText_i(j).indexOf("company") != -1, or XText_i(j).indexOf("unit") != -1, or XText_i(j).indexOf("plant") != -1, or XText_i(j).indexOf("factory") != -1, the comment data are considered to contain employment unit information;
(3) For data XText_i(j) containing employment unit information, introducing the jieba word segmentation function, segmenting the comment data, and defining a word segmentation chain Dword_n(w), where n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia; w denotes the position of the word in order; and the value of Dword_n(w) is the specific word;
(4) Processing the segmented text Dword_n(w): if n = 1, the segmented word is a noun and is looked up in the standard noun dictionary Mdic to judge whether it is a common noun; if it is not in the common noun dictionary, the function returns 0; if it is, processing jumps to the next word;
(5) When the function returns 0, checking whether the noun already exists in the employment unit thesaurus Bdic; if it does, the noun is skipped and execution continues;
(6) If the noun does not exist in the employment unit thesaurus and its position number is less than the position number p at which the employment unit appears, adding the word to the employment unit thesaurus with the AddDIC(Dword_n(w)) function.
Further, constructing the feature vector matrix comprises: traversing the newly generated employment units in the employment unit thesaurus and, for each new employment unit, constructing the corresponding feature vector matrix
Figure BDA0003008868000000021
where pp denotes the position index in the employment unit thesaurus, cp denotes the position index of the production job type, and e is the number of co-occurrences.
Further, the co-occurrence frequency analysis comprises:
(1) Loading all comments into a text set Atext;
(2) Introducing the jieba word segmentation function, segmenting the comment data Atext, and defining a thesaurus chain Aword_n(w), where n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia; w denotes the position of the word in order; and the value of Aword_n(w) is the specific word;
(3) Performing word frequency analysis on all words in the thesaurus chain Aword_n(w), selecting the words whose frequency of occurrence exceeds a threshold, and constructing the word frequency matrix Aword_n(w, c) of the thesaurus chain, where n denotes the part of speech, w the word position, and c the word frequency;
(4) Constructing a complete binary Huffman tree from c in the word frequency matrix Aword_n(w, c), generating the binary code k of each word from its position in the tree, and constructing the Huffman vector matrix Hword_n(w, c, k), where k stores the binary code;
(5) For the job type cp of employment unit pp in the feature vector matrix
Figure BDA0003008868000000035
looking up the vector matrix Hword_n(w, c, k) to obtain the binary code value K_1 corresponding to job type cp of employment unit pp, and judging whether each vector in Hword_n(w, c, k) is a word in the job type thesaurus of an employment unit; if a vector belongs to the job type thesaurus of some employment unit, extracting the corresponding K_i and calculating the cosine distance with the cosine similarity formula:
$$\cos(K_1, K_i) = \frac{\sum_{j} K_{1j} K_{ij}}{\sqrt{\sum_{j} K_{1j}^{2}} \, \sqrt{\sum_{j} K_{ij}^{2}}}$$
where j indexes the components of the binary code value K. The top 10 employment units with the closest cosine distance are selected as the co-occurrence words of job type cp of employment unit pp and added to the co-occurrence word matrix
Figure BDA0003008868000000032
where n denotes the part of speech, w the position, c the word frequency, and k the binary code value. The matrix
Figure BDA0003008868000000033
is saved to
Figure BDA0003008868000000034
(6) Updating the employment unit thesaurus and the job type thesaurus.
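The comparison in step (5) can be illustrated with a short Python sketch. It is only one possible reading of the text: the function names are hypothetical, Huffman codes are treated as 0/1 vectors, and shorter codes are right-padded with zeros so that codes of different lengths can be compared, which the patent does not specify.

```python
import math


def code_to_vector(code, length):
    """Turn a binary code string such as "101" into a 0/1 vector.

    Assumption: shorter codes are right-padded with zeros.
    """
    return [int(bit) for bit in code.ljust(length, "0")]


def cosine_similarity(k1, ki):
    """Cosine similarity between two binary codes K_1 and K_i."""
    length = max(len(k1), len(ki))
    a = code_to_vector(k1, length)
    b = code_to_vector(ki, length)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero code has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_cooccurring(k1, candidates, limit=10):
    """Return up to `limit` candidate words whose codes are closest to k1.

    `candidates` maps each word to its binary code K_i.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(k1, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:limit]]
```

With `limit=10` this mirrors the selection of the 10 closest entries; a higher similarity value corresponds to a closer cosine distance.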
In addition, the invention provides a system for mining employment platform comments, comprising:
a data acquisition module 201, configured to acquire worker comment data from the employment platform, store the comment data in a comment data table, and mark the comment data as new data;
a unit thesaurus construction module 202, configured to construct an employment unit thesaurus;
a job type construction module 203, configured to construct a job type thesaurus;
a vector matrix module 204, configured to construct a feature vector matrix;
a co-occurrence analysis module 205, configured to analyze co-occurrence frequency;
and a display module 206, configured to output and display the co-occurrence frequency according to the word frequency data.
Further, the data acquisition module 201 periodically traverses the newly acquired comment data.
Further, the construction of the employment unit thesaurus by the unit thesaurus construction module 202 specifically comprises:
(1) Loading the new comments into the text set XText_i(j), where i is the number of new comments and j denotes the j-th comment;
(2) Using the indexOf() function to judge whether the new comment data contain employment unit information: when XText_i(j).indexOf("company") != -1, or XText_i(j).indexOf("unit") != -1, or XText_i(j).indexOf("plant") != -1, or XText_i(j).indexOf("factory") != -1, the comment data are considered to contain employment unit information;
(3) For data XText_i(j) containing employment unit information, introducing the jieba word segmentation function, segmenting the comment data, and defining a word segmentation chain Dword_n(w), where n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia; w denotes the position of the word in order; and the value of Dword_n(w) is the specific word;
(4) Processing the segmented text Dword_n(w): if n = 1, the segmented word is a noun and is looked up in the standard noun dictionary Mdic to judge whether it is a common noun; if it is not in the common noun dictionary, the function returns 0; if it is, processing jumps to the next word;
(5) When the function returns 0, checking whether the noun already exists in the employment unit thesaurus Bdic; if it does, the noun is skipped and execution continues;
(6) If the noun does not exist in the employment unit thesaurus and its position number is less than the position number p at which the employment unit appears, adding the word to the employment unit thesaurus with the AddDIC(Dword_n(w)) function.
Further, the construction of the feature vector matrix by the vector matrix module 204 comprises: traversing the newly generated employment units in the employment unit thesaurus and, for each new employment unit, constructing the corresponding feature vector matrix
Figure BDA0003008868000000051
where pp denotes the position index in the employment unit thesaurus, cp denotes the position index of the production job type, and e is the number of co-occurrences.
Further, the analysis performed by the co-occurrence analysis module 205 comprises:
(1) Loading all comments into a text set Atext;
(2) Introducing the jieba word segmentation function, segmenting the comment data Atext, and defining a thesaurus chain Aword_n(w), where n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia; w denotes the position of the word in order; and the value of Aword_n(w) is the specific word;
(3) Performing word frequency analysis on all words in the thesaurus chain Aword_n(w), selecting the words whose frequency of occurrence exceeds a threshold, and constructing the word frequency matrix Aword_n(w, c) of the thesaurus chain, where n denotes the part of speech, w the word position, and c the word frequency;
(4) Constructing a complete binary Huffman tree from c in the word frequency matrix Aword_n(w, c), generating the binary code k of each word from its position in the tree, and constructing the Huffman vector matrix Hword_n(w, c, k), where k stores the binary code;
(5) For the job type cp of employment unit pp in the feature vector matrix
Figure BDA0003008868000000052
looking up the vector matrix Hword_n(w, c, k) to obtain the binary code value K_1 corresponding to job type cp of employment unit pp, and judging whether each vector in Hword_n(w, c, k) is a word in the job type thesaurus of an employment unit; if a vector belongs to the job type thesaurus of some employment unit, extracting the corresponding K_i and calculating the cosine distance with the cosine similarity formula:
$$\cos(K_1, K_i) = \frac{\sum_{j} K_{1j} K_{ij}}{\sqrt{\sum_{j} K_{1j}^{2}} \, \sqrt{\sum_{j} K_{ij}^{2}}}$$
where j indexes the components of the binary code value K. The top 10 employment units with the closest cosine distance are selected as the co-occurrence words of job type cp of employment unit pp and added to the co-occurrence word matrix
Figure BDA0003008868000000061
where n denotes the part of speech, w the position, c the word frequency, and k the binary code value. The matrix
Figure BDA0003008868000000062
is saved to
Figure BDA0003008868000000063
(6) Updating the employment unit thesaurus and the job type thesaurus.
Furthermore, the invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program carries out the method for mining employment platform comments described above.
The invention provides a method and a system for mining employment platform comments based on natural language processing. Worker comment data are acquired from the employment platform, stored in a comment data table, and marked as new data. An employment unit thesaurus and a feature vector matrix are constructed from the comment data, the co-occurrence frequency of employment units is analyzed, and the co-occurrence frequency is output and displayed according to the word frequency data. Analyzing the co-occurrence frequency of employment units makes it possible to find enterprises whose work is similar in nature and to push resumes to recruiting enterprises as required, which improves the calculation speed of the algorithm and the efficiency of matching resumes to employment units.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method of mining employment platform reviews in accordance with the present invention;
FIG. 2 is a structural block diagram of a system for mining employment platform comments in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a method and a system for mining recruitment platform comments, which are used for acquiring recruitment platform comment data based on natural language processing, constructing a recruitment unit lexicon and a characteristic vector matrix based on the comment data, analyzing the co-occurrence frequency of recruitment units to find out enterprises with similar working properties, and pushing resumes to recruitment enterprises as required.
First, the invention provides a method for mining employment platform comments, whose flow chart is shown in FIG. 1:
Step 101, acquiring worker comment data from the employment platform, storing the comment data in a comment data table, and marking the comment data as new data.
The data acquisition module H acquires the employment platform comment data with a prior-art data acquisition algorithm, stores the data in the employment platform comment data table in database D, and marks the acquired data as new data.
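A minimal Python sketch of step 101, assuming an SQLite database: the table name `comment`, its columns, and the `is_new` flag are hypothetical, since the text only states that comments are stored in a comment data table and marked as new data.

```python
import sqlite3


def store_comments(conn, comments):
    """Insert acquired comments; each row is marked as new (is_new = 1)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comment ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "body TEXT NOT NULL, "
        "is_new INTEGER NOT NULL DEFAULT 1)"
    )
    conn.executemany(
        "INSERT INTO comment (body) VALUES (?)", [(c,) for c in comments]
    )
    conn.commit()


def fetch_new_comments(conn):
    """Return the comments still marked as new and clear their flag."""
    rows = conn.execute(
        "SELECT id, body FROM comment WHERE is_new = 1 ORDER BY id"
    ).fetchall()
    conn.executemany(
        "UPDATE comment SET is_new = 0 WHERE id = ?", [(row[0],) for row in rows]
    )
    conn.commit()
    return [row[1] for row in rows]
```

A periodic job could call `fetch_new_comments` to obtain only the comments that have not yet been traversed.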
Step 102, constructing an employment unit thesaurus.
The employment unit thesaurus is constructed as follows: the employment unit construction module M1 periodically traverses the newly acquired comment data.
(1) Load the new comments into the text set XText_i(j), where i is the number of new comments and j denotes the j-th comment.
(2) Use the indexOf() function to judge whether the new comment data contain employment unit information: when XText_i(j).indexOf("company") != -1, or XText_i(j).indexOf("unit") != -1, or XText_i(j).indexOf("plant") != -1, or XText_i(j).indexOf("factory") != -1, the comment data are considered to contain employment unit information.
(3) For data XText_i(j) containing employment unit information, introduce the jieba word segmentation function, segment the comment data, and define a word segmentation chain Dword_n(w), where n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia. w denotes the position of the word in order. The value of Dword_n(w) is the specific word.
(4) Process the segmented text Dword_n(w): if n = 1, the segmented word is a noun, and the have_dic() function looks it up in the standard noun dictionary Mdic to judge whether it is a common noun. If it is not in the common noun dictionary, the function returns 0; if it is, processing jumps to the next word. The logic of the have_dic() function is constructed as follows:
Figure BDA0003008868000000081
(5) If have_dic() returns 0, the have_brand() function checks whether the noun already exists in the employment unit thesaurus Bdic; if it does, the noun is skipped. The logic of the have_brand() function is constructed as follows:
Figure BDA0003008868000000082
(6) If the noun's position number is less than the position number p at which the employment unit appears, the word is added to the employment unit thesaurus with the AddDIC(Dword_n(w)) function. The logic of the AddDIC() function is constructed as follows:
Figure BDA0003008868000000091
The specific logic of the employment unit thesaurus construction module M1 is as follows:
Figure BDA0003008868000000092
Figure BDA0003008868000000101
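As a rough illustration of steps (2) to (6), the following Python sketch works on comments that have already been segmented into (n, word) pairs, where n is the part-of-speech code defined in step (3). It is a reading of the prose, not the logic shown in the figures: the lowercase function names are stand-ins for have_dic(), have_brand() and AddDIC(), the English keywords stand in for the original terms, and p is assumed to be the position of the first token that contains a unit keyword. In practice the pairs would come from a segmenter such as jieba.

```python
NOUN = 1  # part-of-speech code n = 1 in Dword_n(w)
UNIT_KEYWORDS = ("company", "unit", "plant", "factory")


def contains_unit_info(text):
    """Step (2): str.find returns -1 when the keyword is absent, like indexOf()."""
    return any(text.find(keyword) != -1 for keyword in UNIT_KEYWORDS)


def have_dic(word, mdic):
    """Step (4): return 1 if the noun is in the common noun dictionary Mdic, else 0."""
    return 1 if word in mdic else 0


def have_brand(word, bdic):
    """Step (5): return 1 if the noun is already in the unit thesaurus Bdic, else 0."""
    return 1 if word in bdic else 0


def add_dic(word, bdic):
    """Step (6): add the word to the unit thesaurus (stand-in for AddDIC)."""
    bdic.add(word)


def process_comment(text, tokens, mdic, bdic):
    """Update bdic from one comment; tokens is a list of (n, word) in word order."""
    if not contains_unit_info(text):
        return
    # Assumption: p is the position of the first token containing a unit keyword.
    p = next(
        (w for w, (_, word) in enumerate(tokens)
         if any(keyword in word for keyword in UNIT_KEYWORDS)),
        None,
    )
    if p is None:
        return
    for w, (n, word) in enumerate(tokens):
        if n != NOUN or have_dic(word, mdic) == 1:
            continue  # not a noun, or a common noun: move to the next word
        if have_brand(word, bdic) == 1:
            continue  # already recorded as an employment unit
        if w < p:
            add_dic(word, bdic)
```

For a comment segmented as "Acme / company / pays / salary", only "Acme" precedes the keyword and is absent from the common noun dictionary, so it is the only candidate added to the thesaurus.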
Step 103, constructing a job type thesaurus.
The job types mentioned in worker comments are obtained with a prior-art acquisition algorithm, and the job type thesaurus is constructed from them. A job type is a specific occupation, such as Java development engineer or Python algorithm engineer.
Step 104, constructing a feature vector matrix.
The feature vector matrix construction module traverses the newly generated employment units in the employment unit thesaurus and, for each new employment unit, constructs the corresponding feature vector matrix
Figure BDA0003008868000000102
where pp denotes the position index in the employment unit thesaurus, cp denotes the position index of the production job type, and e is the number of co-occurrences.
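One way to picture the (pp, cp, e) entries is the Python sketch below. It assumes that a unit and a job type "co-occur" when both appear in the same comment and that matching is plain substring search; neither detail is stated in the text, and `build_feature_matrix` is a hypothetical name.

```python
from collections import Counter


def build_feature_matrix(comments, units, job_types):
    """Return sorted (pp, cp, e) triples.

    pp indexes the employment unit thesaurus, cp indexes the job type
    thesaurus, and e counts the comments mentioning both (assumption).
    """
    counts = Counter()
    for text in comments:
        for pp, unit in enumerate(units):
            if unit not in text:
                continue
            for cp, job in enumerate(job_types):
                if job in text:
                    counts[(pp, cp)] += 1
    return [(pp, cp, e) for (pp, cp), e in sorted(counts.items())]
```

Pairs that never co-occur are simply omitted, so the result is a sparse representation of the matrix.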
Step 105, analyzing co-occurrence frequency.
The co-occurrence analysis module analyzes the co-occurrence of words using a given employment unit and its potential competing employment units as keywords, and uses these data to analyze the key characteristics of the job types of that employment unit.
(1) Load all comments into the text set Atext.
(2) Introduce the jieba word segmentation function, segment the comment data Atext, and define a thesaurus chain Aword_n(w), where n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia. w denotes the position of the word in order. The value of Aword_n(w) is the specific word.
(3) Perform word frequency analysis on all words in the thesaurus chain Aword_n(w), select the words that occur more than 50 times, and construct the word frequency matrix Aword_n(w, c) of the thesaurus chain, where n denotes the part of speech, w the word position, and c the word frequency.
(4) Construct a complete binary Huffman tree from c in the word frequency matrix Aword_n(w, c), generate the binary code k of each word from its position in the tree, and construct the Huffman vector matrix Hword_n(w, c, k), where k stores the binary code.
(5) For the job type cp of employment unit pp in the feature vector matrix
Figure BDA0003008868000000111
look up the vector matrix Hword_n(w, c, k) to obtain the binary code value K_1 corresponding to job type cp of employment unit pp, and judge whether each vector in Hword_n(w, c, k) is a word in the job type thesaurus of an employment unit. If a vector belongs to the job type thesaurus of some employment unit, extract the corresponding K_i and calculate the cosine distance with the cosine similarity formula:
$$\cos(K_1, K_i) = \frac{\sum_{j} K_{1j} K_{ij}}{\sqrt{\sum_{j} K_{1j}^{2}} \, \sqrt{\sum_{j} K_{ij}^{2}}}$$
where j indexes the components of the binary code value K. The top 10 employment units with the closest cosine distance are selected as the co-occurrence words of job type cp of employment unit pp and added to the co-occurrence word matrix
Figure BDA0003008868000000113
where n denotes the part of speech, w the position, c the word frequency, and k the binary code value. The matrix
Figure BDA0003008868000000114
is saved to
Figure BDA0003008868000000115
(6) Update the employment unit thesaurus and the job type thesaurus.
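Steps (3) and (4) can be sketched in Python as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, ties between equal frequencies are broken by insertion order, and the left branch is labelled "0" and the right branch "1", none of which the text specifies.

```python
import heapq
from collections import Counter
from itertools import count


def word_frequencies(words, threshold=50):
    """Step (3): keep only words that occur more than `threshold` times."""
    return {w: c for w, c in Counter(words).items() if c > threshold}


def huffman_codes(freq):
    """Step (4): build a Huffman tree from word frequencies and return
    a dict mapping each word to its binary code string."""
    tie = count()  # unique tie-breaker so tree nodes are never compared
    heap = [(c, next(tie), ("leaf", w)) for w, c in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2][1]: "0"}  # a single word still needs one bit
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tie), ("node", left, right)))
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if node[0] == "leaf":
            codes[node[1]] = prefix
        else:
            stack.append((node[1], prefix + "0"))
            stack.append((node[2], prefix + "1"))
    return codes
```

Frequent words end up near the root and receive short codes, while rare words receive longer ones; the resulting word-to-code mapping plays the role of the k column in Hword_n(w, c, k).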
Step 106, outputting and displaying the co-occurrence frequency according to the word frequency data.
According to the user's selection on the visual interface and the word frequency data, the co-occurrence frequency of each employment unit is output and presented as a chart. For example, when the user selects the co-occurrence frequency between job type B of company A and the posts of other companies, the co-occurrence frequency data with "job type B of company A" as the comparison target are output and displayed.
Next, the invention provides a system for mining employment platform comments, whose structural block diagram is shown in FIG. 2:
The data acquisition module 201 is configured to acquire worker comment data from the employment platform, store the comment data in a comment data table, and mark the comment data as new data.
The data acquisition module H acquires the employment platform comment data with a prior-art data acquisition algorithm, stores the data in the employment platform comment data table in database D, and marks the acquired data as new data.
The unit thesaurus construction module 202 is configured to construct an employment unit thesaurus.
The employment unit thesaurus is constructed as follows: the employment unit construction module M1 periodically traverses the newly acquired comment data.
(1) Load the new comments into the text set XText_i(j), where i is the number of new comments and j denotes the j-th comment.
(2) Use the indexOf() function to judge whether the new comment data contain employment unit information: when XText_i(j).indexOf("company") != -1, or XText_i(j).indexOf("unit") != -1, or XText_i(j).indexOf("plant") != -1, or XText_i(j).indexOf("factory") != -1, the comment data are considered to contain employment unit information.
(3) For data XText_i(j) containing employment unit information, introduce the jieba word segmentation function, segment the comment data, and define a word segmentation chain Dword_n(w), where n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia. w denotes the position of the word in order. The value of Dword_n(w) is the specific word.
(4) Process the segmented text Dword_n(w): if n = 1, the segmented word is a noun, and the have_dic() function looks it up in the standard noun dictionary Mdic to judge whether it is a common noun. If it is not in the common noun dictionary, the function returns 0; if it is, processing jumps to the next word. The logic of the have_dic() function is constructed as follows:
Figure BDA0003008868000000121
(5) If have_dic() returns 0, the have_brand() function checks whether the noun already exists in the employment unit thesaurus Bdic; if it does, the noun is skipped. The logic of the have_brand() function is constructed as follows:
Figure BDA0003008868000000131
(6) If the noun's position number is less than the position number p at which the employment unit appears, the word is added to the employment unit thesaurus with the AddDIC(Dword_n(w)) function. The logic of the AddDIC() function is constructed as follows:
Figure BDA0003008868000000132
The specific logic of the employment unit thesaurus construction module M1 is as follows:
Figure BDA0003008868000000133
Figure BDA0003008868000000141
the post work kind construction module 203 is used for constructing a post work kind word bank;
the post work type obtains the post work type of the worker comment through the obtaining algorithm in the prior art, and a post work type word library is constructed. The job type is a specific job type, such as a Java development engineer, a Python algorithm engineer, and the like.
A vector matrix module 204, configured to construct a feature vector matrix;
the characteristic vector matrix construction module traverses newly generated recruitment units in the recruitment unit word bank, and constructs a characteristic vector matrix corresponding to the recruitment unit word bank for each new recruitment unit
Figure BDA0003008868000000142
where Pp denotes the position index in the employment unit library, Cp denotes the position index of the post work type, and e is the co-occurrence count.
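One possible reading of this matrix is a table of counts indexed by employment unit and post work type. The sketch below assumes that e counts the comments in which a unit and a work type both appear, and that each comment is already available as a list of tokens; the function name build_feature_matrix() is hypothetical.

```python
def build_feature_matrix(comments, units, work_types):
    """Return e[pp][cp]: number of comments mentioning both unit pp and work type cp."""
    e = [[0] * len(work_types) for _ in units]
    for tokens in comments:
        present = set(tokens)
        for pp, unit in enumerate(units):
            if unit not in present:
                continue
            for cp, work_type in enumerate(work_types):
                if work_type in present:
                    e[pp][cp] += 1
    return e


# Usage: three tokenized comments, two units, two work types.
comments = [["Acme", "welder"], ["Acme", "welder", "driver"], ["Globex", "driver"]]
print(build_feature_matrix(comments, ["Acme", "Globex"], ["welder", "driver"]))
```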
A co-occurrence analysis module 205 for co-occurrence frequency analysis;
The co-occurrence analysis module analyzes word co-occurrence using a given employment unit and its potential competing employment units as keywords, and uses this data to analyze the key characteristics of that employment unit's post work types.
(1) All comments are loaded into the text collection Atext.
(2) Apply the jieba word segmentation function to segment the comment data Atext, and define a word library chain Aword_n(w), where: n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia. w denotes the order of the words. The value of Aword_n(w) is the specific word.
(3) Perform word frequency analysis on all words in the word library chain Aword_n(w), select the words that occur more than 50 times, and construct the word frequency matrix Aword_n(w, c) of the word library chain, where n denotes the part of speech, w denotes the word position, and c denotes the word frequency.
(4) According to c in the word frequency matrix Aword_n(w, c) of the word library chain, construct a complete binary Huffman tree, generate the binary code k of each word according to its position in the tree, and construct the Huffman vector matrix Hword_n(w, c, k), where k holds the binary code.
(5) For feature vectors
Figure BDA0003008868000000151
take the Cp post work type of the Pp employment unit and compare it against the vector matrix Hword_n(w, c, k) to obtain the corresponding binary code value K1. Then determine whether each vector in Hword_n(w, c, k) belongs to the work type library vocabulary of an employment unit; if it belongs to the Cp work type library of some employment unit, extract the corresponding Ki and calculate the cosine distance using the cosine similarity formula, as follows:
Figure BDA0003008868000000152
where j indexes the components of the binary code K. The top 10 employment units with the closest cosine distance are selected as the co-occurrence words of the Cp post work type of the Pp employment unit and added to the co-occurrence word matrix
Figure BDA0003008868000000153
where n denotes the part of speech, w denotes the position, c denotes the word frequency, and k denotes the binary code value. Finally,
Figure BDA0003008868000000154
is saved to
Figure BDA0003008868000000155
(6) Update the employment unit word library and the post work type word library.
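Steps (3) to (5) of the co-occurrence analysis can be sketched as follows. The sketch rests on several assumptions that the text does not settle: Huffman codes are represented as bit strings, codes of different lengths are zero-padded on the right before comparison, and "closest cosine distance" is read as highest cosine similarity. All function names are hypothetical.

```python
import heapq
import math
from collections import Counter
from itertools import count


def word_frequencies(words, threshold=50):
    """Keep the words whose count exceeds the threshold (50 in the description)."""
    return {w: c for w, c in Counter(words).items() if c > threshold}


def huffman_codes(freq):
    """Build a Huffman tree from word frequencies; return word -> bit string."""
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(c, next(tie), w) for w, c in sorted(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tie), (left, right)))
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):  # internal node: (left, right)
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:  # leaf: a word
            codes[node] = prefix
    return codes


def cosine(code_a, code_b):
    """Cosine similarity of two bit strings, zero-padded to equal length (assumption)."""
    size = max(len(code_a), len(code_b))
    a = [int(ch) for ch in code_a.ljust(size, "0")]
    b = [int(ch) for ch in code_b.ljust(size, "0")]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(a) * sum(b))  # bits are 0 or 1, so x * x == x
    return dot / norm if norm else 0.0


def top_cooccurring(target, candidates, codes, limit=10):
    """Return up to `limit` candidates ranked by similarity to the target's code."""
    ranked = sorted(candidates, key=lambda w: cosine(codes[target], codes[w]), reverse=True)
    return ranked[:limit]


# Usage on toy frequencies (threshold lowered so the small sample survives).
freq = word_frequencies(["a"] * 5 + ["b"] * 2 + ["c"], threshold=0)
print(huffman_codes(freq))
```

Because Huffman codes differ in length and an all-zero code has zero norm, the padding rule and the zero-norm fallback both affect the ranking; a real implementation would need to fix these choices explicitly.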
A display module 206, configured to output and display the co-occurrence frequency according to the word frequency data.
The co-occurrence frequency of each employment unit is output according to the user's selection on the visual interface and the word frequency data, and is presented as a chart. For example, when the user selects the co-occurrence frequency between work type B of company A and the posts of other companies, the co-occurrence frequency data compared against "work type B of company A" is output and displayed.
In addition, the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the method for mining employment platform comments described above.
The invention provides a method and a system for mining employment platform comments. Based on natural language processing, the method acquires worker comment data from the employment platform, stores the comment data in a comment data table and marks it as new data, constructs an employment unit word library and a feature vector matrix from the comment data, analyzes the co-occurrence frequency of employment units, and outputs and displays the co-occurrence frequency according to the word frequency data. By analyzing the co-occurrence frequency of employment units, enterprises with similar work characteristics can be identified and resumes pushed to recruiting enterprises as needed, which improves the computation speed of the algorithm and the efficiency of matching resumes to employment units.
The principles and embodiments of the present invention have been described herein using specific examples, which are presented only to assist in understanding the method of the present invention and its core concepts. It should be noted that those skilled in the art may make various improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present invention.

Claims (3)

1. A method for mining employment platform comments, characterized by comprising:
step 101, acquiring worker comment data from an employment platform, storing the comment data in a comment data table, and marking the comment data as new data;
step 102, constructing an employment unit word library;
wherein constructing the employment unit word library specifically comprises:
(1) loading new comments into the text collection XText_i(j), wherein i denotes the number of new comments and j denotes the j-th comment;
(2) using the indexOf() function to determine whether the new comment data contains employment unit information: when XText_i(j).indexOf("company") == -1, or XText_i(j).indexOf("unit") == -1, XText_i(j).indexOf("factory") == -1, XText_i(j).indexOf("factory") == -1, the comment data is considered to contain employment unit information;
(3) for data XText_i(j) containing employment unit information, applying the jieba word segmentation function to segment the comment data, and defining a word segmentation chain Dword_n(w), wherein: n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia; w denotes the order of the words; the value of Dword_n(w) is the specific word;
(4) processing the segmented text Dword_n(w): if n = 1, the token is a noun; referring to a standard noun dictionary Mdic, determining whether the token is in the common noun dictionary; if it is not, the function returns 0, and if it is, skipping to the next word;
(5) when the function return value is 0, checking whether the noun already exists in the employment unit library Bdic, and if it does, skipping it and continuing execution;
(6) if the noun does not exist in the employment unit word library and the noun's position number is less than the position number p at which the employment unit appears, using the AddDIC(Dword_n(w)) function to add the word to the employment unit word library;
step 103, acquiring the post work types from the worker comments, and constructing a post work type word library;
step 104, constructing a feature vector matrix; wherein constructing the feature vector matrix comprises: traversing the newly generated employment units in the employment unit word library, and constructing, for each new employment unit, the corresponding feature vector matrix
Figure FDA0004053559580000021
wherein Pp denotes the position index in the employment unit library, Cp denotes the position index of the post work type, and e is the co-occurrence count;
step 105, performing co-occurrence frequency analysis;
wherein the co-occurrence frequency analysis comprises:
(1) loading all comments into a text set Atext;
(2) applying the jieba word segmentation function to segment the comment data Atext, and defining a word library chain Aword_n(w), wherein: n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia; w denotes the order of the words; the value of Aword_n(w) is the specific word;
(3) performing word frequency analysis on all words in the word library chain Aword_n(w), selecting the words whose occurrence frequency exceeds a threshold, and constructing the word frequency matrix Aword_n(w, c) of the word library chain, wherein n denotes the part of speech, w denotes the word position, and c denotes the word frequency;
(4) according to c in the word frequency matrix Aword_n(w, c) of the word library chain, constructing a complete binary Huffman tree, generating the binary code k of each word according to its position in the tree, and constructing the Huffman vector matrix Hword_n(w, c, k), wherein k holds the binary code;
(5) For feature vectors
Figure FDA0004053559580000022
taking the Cp post work type of the Pp employment unit and comparing it against the vector matrix Hword_n(w, c, k) to obtain the corresponding binary code value K1; determining whether each vector in Hword_n(w, c, k) belongs to the work type library vocabulary of an employment unit; and, if it belongs to the Cp work type library of some employment unit, extracting the corresponding Ki and calculating the cosine distance using the cosine similarity formula, as follows:
Figure FDA0004053559580000031
wherein j indexes the components of the binary code K; the top 10 employment units with the closest cosine distance are selected as the co-occurrence words of the Cp post work type of the Pp employment unit and added to the co-occurrence word matrix
Figure FDA0004053559580000032
wherein n denotes the part of speech, w denotes the position, c denotes the word frequency, and k denotes the binary code value; and saving
Figure FDA0004053559580000033
to
Figure FDA0004053559580000034
(6) updating the employment unit word library and the post work type word library;
step 106, outputting and displaying the co-occurrence frequency according to the word frequency data.
2. A system for mining employment platform reviews, the system comprising:
a data acquisition module 201, configured to acquire worker comment data from an employment platform, store the comment data in a comment data table, and mark the comment data as new data;
a unit word library construction module 202, configured to construct an employment unit word library;
wherein constructing the employment unit word library specifically comprises:
(1) loading new comments into the text collection XText_i(j), wherein i denotes the number of new comments and j denotes the j-th comment;
(2) using the indexOf() function to determine whether the new comment data contains employment unit information: when XText_i(j).indexOf("company") == -1, or XText_i(j).indexOf("unit") == -1, XText_i(j).indexOf("plant") == -1, XText_i(j).indexOf("factory") == -1, the comment data is considered to contain employment unit information;
(3) for data XText_i(j) containing employment unit information, applying the jieba word segmentation function to segment the comment data, and defining a word segmentation chain Dword_n(w), wherein: n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia; w denotes the order of the words; the value of Dword_n(w) is the specific word;
(4) processing the segmented text Dword_n(w): if n = 1, the token is a noun; referring to a standard noun dictionary Mdic, determining whether the token is in the common noun dictionary; if it is not, the function returns 0, and if it is, skipping to the next word;
(5) when the function return value is 0, checking whether the noun already exists in the employment unit library Bdic, and if it does, skipping it and continuing execution;
(6) if the noun does not exist in the employment unit word library and the noun's position number is less than the position number p at which the employment unit appears, using the AddDIC(Dword_n(w)) function to add the word to the employment unit word library;
a post work type construction module 203, configured to acquire the post work types from the worker comments and construct a post work type word library;
a vector matrix module 204, configured to construct a feature vector matrix;
wherein constructing the feature vector matrix comprises: traversing the newly generated employment units in the employment unit word library, and constructing, for each new employment unit, the corresponding feature vector matrix
Figure FDA0004053559580000041
wherein Pp denotes the position index in the employment unit library, Cp denotes the position index of the post work type, and e is the co-occurrence count;
a co-occurrence analysis module 205, configured to perform co-occurrence frequency analysis;
wherein the co-occurrence frequency analysis comprises:
(1) loading all comments into a text set Atext;
(2) applying the jieba word segmentation function to segment the comment data Atext, and defining a word library chain Aword_n(w), wherein: n = 1 denotes a noun, n = 2 a verb, n = 3 an adjective, n = 4 a quantifier, n = 5 a pronoun, n = 6 an adverb, n = 7 a preposition, n = 8 a conjunction, n = 9 an auxiliary word, n = 10 an interjection, and n = 11 an onomatopoeia; w denotes the order of the words; the value of Aword_n(w) is the specific word;
(3) performing word frequency analysis on all words in the word library chain Aword_n(w), selecting the words whose occurrence frequency exceeds a threshold, and constructing the word frequency matrix Aword_n(w, c) of the word library chain, wherein n denotes the part of speech, w denotes the word position, and c denotes the word frequency;
(4) according to c in the word frequency matrix Aword_n(w, c) of the word library chain, constructing a complete binary Huffman tree, generating the binary code k of each word according to its position in the tree, and constructing the Huffman vector matrix Hword_n(w, c, k), wherein k holds the binary code;
(5) For feature vectors
Figure FDA0004053559580000051
taking the Cp post work type of the Pp employment unit and comparing it against the vector matrix Hword_n(w, c, k) to obtain the corresponding binary code value K1; determining whether each vector in Hword_n(w, c, k) belongs to the work type library vocabulary of an employment unit; and, if it belongs to the Cp work type library of some employment unit, extracting the corresponding Ki and calculating the cosine distance using the cosine similarity formula, as follows:
Figure FDA0004053559580000052
wherein j indexes the components of the binary code K; the top 10 employment units with the closest cosine distance are selected as the co-occurrence words of the Cp post work type of the Pp employment unit and added to the co-occurrence word matrix
Figure FDA0004053559580000053
wherein n denotes the part of speech, w denotes the position, c denotes the word frequency, and k denotes the binary code value; and saving
Figure FDA0004053559580000054
to
Figure FDA0004053559580000055
(6) updating the employment unit word library and the post work type word library;
a display module 206, configured to output and display the co-occurrence frequency according to the word frequency data.
3. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the method of claim 1.
CN202110369952.0A 2021-04-07 2021-04-07 Method and system for mining employment platform comments Active CN113111187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110369952.0A CN113111187B (en) 2021-04-07 2021-04-07 Method and system for mining employment platform comments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110369952.0A CN113111187B (en) 2021-04-07 2021-04-07 Method and system for mining employment platform comments

Publications (2)

Publication Number Publication Date
CN113111187A CN113111187A (en) 2021-07-13
CN113111187B true CN113111187B (en) 2023-03-10

Family

ID=76714397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110369952.0A Active CN113111187B (en) 2021-04-07 2021-04-07 Method and system for mining employment platform comments

Country Status (1)

Country Link
CN (1) CN113111187B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013107031A1 (en) * 2012-01-20 2013-07-25 华为技术有限公司 Method, device and system for determining video quality parameter based on comment
WO2014002774A1 (en) * 2012-06-25 2014-01-03 日本電気株式会社 Synonym extraction system, method, and recording medium
JP2014142800A (en) * 2013-01-24 2014-08-07 Mega Chips Corp Feature amount extraction device, image detection device, and control program, and feature amount extraction method
CN110688407A (en) * 2019-09-09 2020-01-14 创新奇智(南京)科技有限公司 Social relationship mining method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667374B (en) * 2020-06-10 2023-07-18 创新奇智(上海)科技有限公司 Method and device for constructing user portrait, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113111187A (en) 2021-07-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant