CN112115705A - Method and device for screening electronic resume - Google Patents
- Publication number
- CN112115705A, CN202011008492.0A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/105—Human resources
- G06Q10/1053—Employment or hiring
Abstract
The invention provides a method and a device for screening electronic resumes. The electronic resume to be screened is parsed to obtain the target post and a plurality of preset field data corresponding to that resume, where the preset fields can be configured as the fields that best reflect the recruitment requirements. According to the type of each preset field data, structured feature extraction and/or semi-structured feature extraction and/or unstructured feature extraction is performed on each preset field data, yielding feature data that comprehensively and accurately reflects the characteristics of the applicant. The feature data of the electronic resume to be screened is then input into a resume screening model for processing. The model is trained with the feature data of electronic resumes that passed screening for the target post as positive samples and the feature data of electronic resumes that did not pass screening for the target post as negative samples. The electronic resume is thus screened quickly and accurately according to the processing result of the resume screening model, which improves the screening efficiency of electronic resumes.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for screening electronic resumes.
Background
In recent years, with the development of information technology, job recruitment has moved online and become digital. With network technology and recruitment platforms, organizations can quickly obtain a large number of electronic resumes by publishing recruitment information.
However, when the number of electronic resumes is large, manual screening involves a heavy workload and low efficiency. How to quickly and accurately screen out the electronic resumes that meet the post requirements from a massive number of electronic resumes has therefore become a technical problem to be solved in this field.
Disclosure of Invention
In view of this, the invention provides a method and a device for screening electronic resumes, which improve the screening efficiency of the electronic resumes.
In order to achieve the above purpose, the invention provides the following specific technical scheme:
a method for screening electronic resumes comprises the following steps:
acquiring an electronic resume to be screened;
analyzing the electronic resume to be screened to obtain a target post corresponding to the electronic resume to be screened and a plurality of preset field data;
performing feature extraction on each preset field data according to the feature extraction mode corresponding to the type of that preset field data, to obtain feature data of the electronic resume to be screened, wherein the feature extraction modes comprise: structured feature extraction, semi-structured feature extraction and unstructured feature extraction;
and inputting the feature data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing, to obtain a processing result indicating whether the electronic resume to be screened passes the screening, wherein a positive sample in the training data of the resume screening model corresponding to the target post is the feature data of an electronic resume that passed screening for the target post, and a negative sample is the feature data of an electronic resume that did not pass screening for the target post.
Optionally, the analyzing the electronic resume to be screened to obtain a target post and a plurality of preset field data corresponding to the electronic resume to be screened includes:
extracting text information of the electronic resume to be screened;
extracting post data from the text information, and determining the target post corresponding to the electronic resume to be screened;
judging whether the format of the text information meets a preset resume format or not;
if the preset resume format is met, extracting a plurality of preset field data from the text information according to the corresponding relation between preset fields and positions in the preset resume format;
and if the preset resume format is not met, extracting a plurality of preset field data from the text information by adopting a mode based on rules and keywords.
Optionally, when the type of the preset field data is structured data, performing structured feature extraction on the preset field data, including:
extracting preset attribute feature data in the preset field data;
and performing feature coding on the preset attribute feature data to obtain feature data of the preset field data.
Optionally, when the type of the preset field data is semi-structured data, performing semi-structured feature extraction on the preset field data, including:
carrying out named entity recognition on the preset field data to obtain a plurality of entities;
determining a characteristic value corresponding to each characteristic name according to a preset corresponding relationship between the entity and the characteristic name and a preset corresponding relationship between the entity and the characteristic value;
and performing feature coding on the feature value corresponding to each feature name to obtain feature data of the preset field data.
Optionally, when the type of the preset field data is unstructured data, performing unstructured feature extraction on the preset field data, including:
performing word segmentation processing, new word discovery processing and stop word removal on the preset field data to obtain a plurality of words;
clustering the plurality of words to obtain a plurality of word classes;
performing distribution statistics over positive and negative samples for each word class, and removing low-discrimination words according to the distribution statistics of each word class;
calculating the JS (Jensen-Shannon) divergence of each word class after the low-discrimination words are removed, and determining each word class whose JS divergence is larger than a preset value as a word class to be extracted;
and calculating the word frequency score of each word class to be extracted, to obtain the feature data of the preset field data.
Optionally, the inputting the feature data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing to obtain a processing result indicating whether the electronic resume to be screened passes the screening includes:
judging whether the feature data of the electronic resume to be screened conforms to a preset prerequisite rule;
and if the preset prerequisite rule is met, inputting the feature data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing, to obtain a processing result indicating whether the electronic resume to be screened passes the screening.
A screening apparatus of electronic resumes, comprising:
the resume acquisition unit is used for acquiring the electronic resumes to be screened;
the resume analyzing unit is used for analyzing the electronic resumes to be screened to obtain target posts and a plurality of preset field data corresponding to the electronic resumes to be screened;
the feature extraction unit is configured to perform feature extraction on each preset field data according to the feature extraction mode corresponding to the type of that preset field data, so as to obtain feature data of the electronic resume to be screened, where the feature extraction modes include: structured feature extraction, semi-structured feature extraction and unstructured feature extraction;
and the resume screening unit is used for inputting the characteristic data of the electronic resumes to be screened into the resume screening model corresponding to the target post for processing to obtain a processing result indicating whether the electronic resumes to be screened pass the screening, wherein a positive sample in the training data of the resume screening model corresponding to the target post is the characteristic data of the electronic resumes which pass the screening of the target post, and a negative sample is the characteristic data of the electronic resumes which do not pass the screening of the target post.
Optionally, the resume parsing unit is specifically configured to:
extracting text information of the electronic resume to be screened;
extracting post data from the text information, and determining the target post corresponding to the electronic resume to be screened;
judging whether the format of the text information meets a preset resume format or not;
if the preset resume format is met, extracting a plurality of preset field data from the text information according to the corresponding relation between preset fields and positions in the preset resume format;
and if the preset resume format is not met, extracting a plurality of preset field data from the text information by adopting a mode based on rules and keywords.
Optionally, when the type of the preset field data is structured data, the feature extraction unit is specifically configured to:
extracting preset attribute feature data in the preset field data;
and performing feature coding on the preset attribute feature data to obtain feature data of the preset field data.
Optionally, when the type of the preset field data is semi-structured data, the feature extraction unit is specifically configured to:
carrying out named entity recognition on the preset field data to obtain a plurality of entities;
determining a characteristic value corresponding to each characteristic name according to a preset corresponding relationship between the entity and the characteristic name and a preset corresponding relationship between the entity and the characteristic value;
and performing feature coding on the feature value corresponding to each feature name to obtain feature data of the preset field data.
Optionally, when the type of the preset field data is unstructured data, the feature extraction unit is specifically configured to:
performing word segmentation processing, new word discovery processing and stop word removal on the preset field data to obtain a plurality of words;
clustering the plurality of words to obtain a plurality of word classes;
performing distribution statistics over positive and negative samples for each word class, and removing low-discrimination words according to the distribution statistics of each word class;
calculating the JS (Jensen-Shannon) divergence of each word class after the low-discrimination words are removed, and determining each word class whose JS divergence is larger than a preset value as a word class to be extracted;
and calculating the word frequency score of each word class to be extracted, to obtain the feature data of the preset field data.
Optionally, the resume screening unit is specifically configured to:
judging whether the feature data of the electronic resume to be screened conforms to a preset prerequisite rule;
and if the preset prerequisite rule is met, inputting the feature data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing, to obtain a processing result indicating whether the electronic resume to be screened passes the screening.
Compared with the prior art, the invention has the following beneficial effects:
According to the screening method of the electronic resume, the electronic resume to be screened is parsed to obtain the target post and the plurality of preset field data corresponding to that resume, and the preset fields can be configured as the fields that best reflect the recruitment requirements. According to the type of each preset field data, structured feature extraction and/or semi-structured feature extraction and/or unstructured feature extraction is performed on each preset field data, yielding feature data that comprehensively and accurately reflects the characteristics of the applicant. The feature data of the electronic resume to be screened is then input into a resume screening model for processing. The model is trained with the feature data of electronic resumes that passed screening for the target post as positive samples and the feature data of electronic resumes that did not pass screening for the target post as negative samples. The electronic resume is thus screened quickly and accurately according to the processing result of the resume screening model, which improves the screening efficiency of electronic resumes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for screening electronic resumes according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of an electronic resume parsing method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method for extracting unstructured features according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the score distribution of sample resumes for a financial post on the "bank" word class according to an embodiment of the present invention;
FIG. 5 is a diagram of the distribution result after the zero-score data is removed according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a screening apparatus for electronic resumes according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a method for screening electronic resumes. Referring to FIG. 1, the method comprises the following steps:
S101: acquiring an electronic resume to be screened;
The electronic resume to be screened may be in a format such as doc, docx, PDF or HTML.
S102: analyzing the electronic resume to be screened to obtain a target post corresponding to the electronic resume to be screened and a plurality of preset field data;
Referring to FIG. 2, parsing the electronic resume to be screened specifically includes the following steps:
S201: extracting text information of the electronic resume to be screened;
Specifically, if the electronic resume to be screened is in a text format such as doc or docx, its text information is extracted directly; if it is in a non-text format such as PDF or HTML, it is first converted into text, and the text information is then extracted.
S202: extracting post data from the text information, and determining the target post corresponding to the electronic resume to be screened;
S203: judging whether the format of the text information meets a preset resume format;
If the preset resume format is met, S204: extracting a plurality of preset field data from the text information according to the correspondence between preset fields and positions in the preset resume format;
The preset resume format is a predefined resume layout containing the preset fields; each preset field is marked by identification text, and each preset field corresponds to a fixed position in the layout.
If the preset resume format is not met, S205: extracting a plurality of preset field data from the text information using a rule-and-keyword-based approach.
The preset field data may be basic data, educational experience data, work experience data, professional skill data, qualification certificate data, awards and honors data, and so on.
Specifically, taking the work experience data as an example, most resumes mark this part with identification text such as "work experience" or a close variant, and the identification text is relatively fixed. Recognition rules can therefore be constructed by summarizing the corresponding identification-text sets from positive and negative samples, so that the corresponding information is extracted more accurately. The following table gives an example of the identification-text set for each field of a sales-post resume.
In addition, in some resumes the basic data is not set apart as its own paragraph; instead, the demographic attributes are listed directly, such as name, gender, age, date of birth, ethnicity, native place, address, political affiliation, employment status and highest education level. Each of these items also has its own leading identification text. For example, the date of birth is usually preceded by text such as "year and month of birth" or "date of birth", so this information can also be extracted directly.
Similarly, the identification text of the other parts can be summarized from the sample data to construct recognition rules for locating the corresponding information.
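The rule-and-keyword-based extraction described above can be illustrated with a minimal Python sketch. The identification-text sets in `FIELD_MARKERS`, the field names and both function names are hypothetical; the source only states that such sets are summarized from sample resumes and does not give their contents.

```python
import re

# Hypothetical identification-text sets; a real system would summarize
# these from positive and negative sample resumes.
FIELD_MARKERS = {
    "education": ["Education", "Educational Background"],
    "work_experience": ["Work Experience", "Employment History"],
    "skills": ["Skills", "Professional Skills"],
}


def split_into_fields(text):
    """Assign each non-empty line to the most recently seen field heading."""
    marker_to_field = {
        marker.lower(): field
        for field, markers in FIELD_MARKERS.items()
        for marker in markers
    }
    fields = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.rstrip(":：").lower()
        if key in marker_to_field:
            # A heading line starts a new field.
            current = marker_to_field[key]
            fields.setdefault(current, [])
        elif current is not None and stripped:
            fields[current].append(stripped)
    return {field: "\n".join(lines) for field, lines in fields.items()}


def extract_labeled_value(text, labels):
    """Return the value that follows any of the given inline labels, or None."""
    alternatives = "|".join(re.escape(label) for label in labels)
    match = re.search(r"(?:%s)\s*[:：]\s*(.+)" % alternatives, text)
    return match.group(1).strip() if match else None
```

In this sketch, section headings locate paragraph-style fields such as work experience, while `extract_labeled_value` handles basic data that is listed inline behind its leading identification text.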
S103: according to the feature extraction mode corresponding to the type of each preset field data, feature extraction is respectively carried out on each preset field data to obtain the feature data of the electronic resume to be screened, and the feature extraction mode comprises the following steps: extracting structural features, extracting semi-structural features and extracting non-structural features;
structured feature extraction, semi-structured feature extraction, and unstructured feature extraction are described separately below.
First, structured feature extraction
The structured data mainly comprises basic information, the content mainly comprises demographic data such as personal information and the like, the structure is obvious, attribute information such as name, gender, age, birth date, ethnicity, native place, address, political face, job status, highest academic calendar, highest academic degree and the like can be extracted, and the formats are unified.
And according to the requirement of the target post, selecting part of attribute feature data as preset attribute feature data.
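As an illustration of feature coding for structured fields, the following Python sketch encodes a few selected attributes. The source does not specify a coding scheme; the one-hot and ordinal encodings, the vocabularies and the record keys used here are all assumptions made for the example.

```python
# Hypothetical category vocabularies; the source does not specify them.
GENDER = ["female", "male"]
EDUCATION_LEVELS = ["high school", "associate", "bachelor", "master", "doctorate"]


def one_hot(value, vocabulary):
    """One-hot encode a categorical value; unknown values give all zeros."""
    return [1 if value == item else 0 for item in vocabulary]


def ordinal(value, ordered_levels):
    """Encode an ordered category as its rank, or -1 if it is unknown."""
    return ordered_levels.index(value) if value in ordered_levels else -1


def encode_basic_info(record):
    """Turn the selected preset attributes into a flat numeric feature list."""
    features = []
    features += one_hot(record.get("gender"), GENDER)
    features.append(ordinal(record.get("highest_education"), EDUCATION_LEVELS))
    features.append(record.get("age", -1))
    return features
```

Which attributes are encoded would depend on the target post; the three used here are only placeholders.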
Second, semi-structured feature extraction
The semi-structured data primarily includes educational experience information. This type of information typically includes school names, major names, years of attendance, education level and degree, grade point average, courses taken, and the like. This part of the information generally has a certain structure, so information related to the preset feature names, such as school name, major name and the corresponding education level, can be extracted by a rule-based method or by named entity recognition. Specifically, named entity recognition is performed on the semi-structured field data to obtain a plurality of entities; the feature value corresponding to each feature name is determined according to the preset correspondence between entities and feature names and the preset correspondence between entities and feature values; finally, feature coding is performed on the feature value corresponding to each feature name to obtain the feature data of the semi-structured field data.
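The mapping from recognized entities to feature names and feature values can be sketched as follows. The sketch assumes that named entity recognition has already produced `(text, entity_type)` pairs; the entity types, the correspondence tables and the value vocabularies are hypothetical examples, not taken from the source.

```python
# Hypothetical preset correspondences: entity type -> feature name,
# and entity text -> feature value.
ENTITY_TYPE_TO_FEATURE = {
    "SCHOOL": "school_tier",
    "MAJOR": "major_group",
    "DEGREE": "degree_level",
}
ENTITY_TO_VALUE = {
    "XYZ University": "tier_1",
    "Finance": "finance_related",
    "Bachelor": "bachelor",
}
# Hypothetical value vocabularies used for feature coding.
FEATURE_VOCAB = {
    "school_tier": ["tier_1", "tier_2", "other"],
    "major_group": ["finance_related", "other"],
    "degree_level": ["associate", "bachelor", "master", "other"],
}


def entities_to_features(entities):
    """Map recognized (text, entity_type) pairs to a feature value per feature name."""
    values = {name: "other" for name in FEATURE_VOCAB}
    for text, entity_type in entities:
        name = ENTITY_TYPE_TO_FEATURE.get(entity_type)
        if name is not None:
            values[name] = ENTITY_TO_VALUE.get(text, "other")
    return values


def encode(values):
    """Encode each feature value as its index in the feature's vocabulary."""
    return {name: FEATURE_VOCAB[name].index(value) for name, value in values.items()}
```

Entities that are missing from the correspondence tables fall back to the value "other", which keeps the encoded vector the same length for every resume.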
The following table gives example features corresponding to educational experience information; features can be derived and feature values adjusted as needed.
Third, unstructured feature extraction
Unstructured data is typified by work experience data. Such data often contains long natural-language descriptions of job content, whose features are difficult to identify with rule-based methods. Some current industry methods use word vectors to encode the semantics of natural-language text directly, but they are limited by the capability of semantic encoding technology, especially for long text. These methods struggle to reach a human level of semantic understanding, are easily disturbed by redundant descriptions, and have weak interpretability. The embodiment of the invention provides a method for analyzing and extracting the features of this unstructured data, which captures the features better and is more interpretable.
The general idea of the feature extraction method is to start from the sample space: the resume text of the positive and negative samples is analyzed with a mathematical model, and a mathematical method is used to find, at the word level, the sets of key words that distinguish positive samples from negative samples. That is, the keywords or phrases that separate resumes that should pass screening from those that should be rejected are used as the extracted features, and these features are finally used to extract the features of the electronic resume. Referring to FIG. 3, the method for extracting unstructured features disclosed in the embodiment of the present invention specifically includes the following steps:
S301: performing word segmentation processing, new word discovery processing and stop word removal on the preset field data to obtain a plurality of words;
The unstructured data in the electronic resume is segmented with a word segmentation tool, and common, meaningless stop words are removed. Because general-purpose word segmentation tools do not handle well the specialized terms that may appear in resumes, new word discovery is also required. New word discovery is implemented with a method based on mutual information and left and right information entropy.
Pointwise mutual information (PMI) measures the degree of cohesion between two adjacent words: it compares the probability that the two words occur together with the probabilities that each occurs on its own, to judge whether the two form a single word. The calculation formula is as follows:
$$\mathrm{PMI}(A, B) = \log \frac{P(A, B)}{P(A)\,P(B)}$$
where A and B are two adjacent words, P(A, B) is the probability that they co-occur, and P(A) and P(B) are the probabilities that each occurs separately; the probabilities can be estimated from word frequencies in the sample space. The calculated result is compared with a preset threshold: if it is larger than the threshold, the pair is a possible new word; otherwise the pair is treated as two independent words.
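A minimal Python sketch of this PMI test follows, using the standard definition PMI(A, B) = log(P(A, B) / (P(A) P(B))) with probabilities estimated from counts. The base-2 logarithm, the function names and the way probabilities are normalized (unigram counts over all tokens, bigram counts over all adjacent pairs) are assumptions made for the example.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Compute PMI for every adjacent word pair.

    sentences: list of token lists produced by an initial word segmentation.
    Returns a dict mapping (a, b) to the PMI of the adjacent pair.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi          # estimated co-occurrence probability
        p_a = unigrams[a] / n_uni    # estimated probability of a alone
        p_b = unigrams[b] / n_uni    # estimated probability of b alone
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def candidate_new_words(sentences, threshold):
    """Return adjacent pairs whose PMI exceeds the preset threshold."""
    return sorted(pair for pair, score in pmi_scores(sentences).items() if score > threshold)
```

Pairs returned by `candidate_new_words` are only candidates; in the method described here they are additionally checked with left and right information entropy.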
Besides mutual information, left and right information entropy also needs to be considered; it measures how freely a word combines with its neighbors. A complete word should combine with a rich variety of preceding and following words; if it appears only in a very fixed combination, it should be merged further with its neighbor. For example, the word "finance" can be preceded (on the left) by many words such as "study" and "apply", and followed (on the right) by many words such as "direction", "major" and "field", so it is suitable as a word on its own. By contrast, the first three characters of the Chinese word for "artificial intelligence" (人工智能) can be preceded by many words such as "study" and "use", but on the right they are almost always followed by the word's final character only. The degree of freedom on the right is low, so the fragment needs to be merged further to form a new word. Specifically, the comparison can be made with the following conditional entropy formula:
$$H(B \mid A) = -\sum_{b \in B} P(b \mid A) \log P(b \mid A)$$
where A is the candidate word, B is the set of all its adjacent preceding words (left neighbors) or all its adjacent following words (right neighbors), and P(b | A) is the conditional probability, which can be estimated from word frequencies in the sample space. After the left entropy and the right entropy are calculated with this formula, the smaller of the two is usually compared with a predefined threshold: if it is larger than the threshold, A is a complete word; if it is smaller, A can be merged further.
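The left and right entropy check can be sketched in Python as below, using the standard conditional entropy H(B | A) = -Σ P(b | A) log P(b | A) over the neighbors b of candidate A. The base-2 logarithm, the function names and the token names in the usage note are assumptions for illustration.

```python
import math
from collections import Counter


def neighbor_entropy(sentences, candidate, side):
    """Entropy of the words immediately to the left or right of `candidate`.

    sentences: list of token lists; side: "left" or "right".
    """
    neighbors = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != candidate:
                continue
            j = i - 1 if side == "left" else i + 1
            if 0 <= j < len(tokens):
                neighbors[tokens[j]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def is_complete_word(sentences, candidate, threshold):
    """Treat the candidate as a word if the smaller of its two entropies exceeds the threshold."""
    left = neighbor_entropy(sentences, candidate, "left")
    right = neighbor_entropy(sentences, candidate, "right")
    return min(left, right) > threshold
```

A candidate such as "finance" that is followed by several different words gets a high right entropy, while a fragment that is always followed by the same token gets a right entropy of zero and is flagged for further merging.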
For example, in resumes submitted for senior financial sales posts, new word discovery finds the following new words:
"high net worth, due diligence, private fund, business plan, employment consultant, private equity, blockchain, developing new clients, maintaining client relationships, public fund, overseas employment, maintaining old clients, risk management and control, professional qualification certificate, comprehensive teller, taking photo hanging, New Third Board, fixed income ..."
After word segmentation, new word discovery and stop word removal, the unstructured data of every resume sample in the sample space has been split into words.
S302: clustering the resulting words to obtain multiple word classes;
The vocabulary obtained by word segmentation is high-dimensional and contains near-synonyms, words from the same field and so on. Using the words directly to build distribution statistics would involve a heavy workload and give poor results, so all the words are clustered and divided into several main feature domains for subsequent analysis.
The word clustering operation mainly relies on an unsupervised clustering algorithm based on word vectors. First, a word2vec word vector model (fastText or GloVe may be used instead) is trained on a general corpus and then further trained on professional domain data and electronic resume data to obtain a word vector representation of each word.
The words are then clustered with the K-means algorithm. The number of classes K is selected from the range 10 to 30: clustering is run for each candidate value and the clustering quality is evaluated with the silhouette coefficient. The words obtained in the previous step are thereby grouped into multiple word classes, where the words of each class jointly express a certain aspect of the resume. For example, applying this word clustering to the sample resumes for the senior financial sales post yields word classes such as the bank-related word class whose score distribution is shown in FIG. 4.
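The selection of K by silhouette coefficient can be sketched as below. This is a pure-Python illustration under stated assumptions: the toy 2-D tuples stand in for trained word vectors, the candidate range 2 to 4 stands in for the range 10 to 30 mentioned above, and the farthest-point initialization is a choice made here for determinism, not something the embodiment specifies.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    """Lloyd's K-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center, then recompute the centers.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy 2-D "word vectors" forming three visible groups (hypothetical data).
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
scores = {k: silhouette(vectors, kmeans(vectors, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

In practice a library implementation of K-means and of the silhouette score would normally replace these hand-written helpers; they are spelled out here only to keep the sketch self-contained.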
S303: performing distribution statistics over the positive and negative samples for each word class, and removing low-discrimination words according to the distribution statistics of each word class;
After the word classes are obtained by clustering, the distribution of each word class over the positive and negative samples can be counted for the subsequent evaluation of distribution differences.
The word class distribution statistics proceed as follows:
In each word class, words whose distributions over the positive and negative samples differ little (which may be marked as non-discriminating words) and low-probability words, collectively called low-discrimination words, are removed. This leaves the high-discrimination words and removes noise for the subsequent analysis.
Specifically, the word frequency, sample coverage count and sample coverage rate of each word in each class are counted separately on the positive and negative samples, and a word is removed if the difference ratio of its positive and negative sample coverage rates is smaller than t1, or if its word frequency is smaller than t2 or its sample coverage rate is smaller than t3. Here t1, t2 and t3 are preset thresholds; for example, t1 may be 0.5, t2 may be 0.0001 and t3 may be 0.008.
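The removal rule can be illustrated with a short Python sketch. The text does not define the "difference ratio" precisely, so the form |cov_pos - cov_neg| / max(cov_pos, cov_neg) used here is an assumption, as are the function names, the use of relative word frequency and the toy samples; only the example thresholds t1, t2 and t3 come from the description.

```python
def coverage(word, samples):
    """Fraction of samples (each a list of words) that contain the word."""
    return sum(word in s for s in samples) / len(samples)

def relative_frequency(word, samples):
    """Occurrences of the word divided by the total number of words."""
    total = sum(len(s) for s in samples)
    return sum(s.count(word) for s in samples) / total

def is_low_discrimination(word, pos, neg, t1=0.5, t2=0.0001, t3=0.008):
    cov_pos, cov_neg = coverage(word, pos), coverage(word, neg)
    top = max(cov_pos, cov_neg)
    # Assumed definition of the coverage-rate difference ratio.
    diff_ratio = abs(cov_pos - cov_neg) / top if top > 0 else 0.0
    freq = relative_frequency(word, pos + neg)
    cov = coverage(word, pos + neg)
    # Non-discriminating word, or low-probability word.
    return diff_ratio < t1 or freq < t2 or cov < t3

# Hypothetical segmented resumes.
pos = [["bank", "client", "sales"], ["bank", "fund", "client"]]
neg = [["client", "game"], ["client", "design"]]
kept = [w for w in ["bank", "client"] if not is_low_discrimination(w, pos, neg)]
print(kept)
```

Here "client" appears equally often in both sample sets and is dropped, while "bank" appears only in the positive samples and is kept.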
For each word class, a score S is assigned to each sample, indicating the degree to which the electronic resume involves the characteristics represented by that word class. The score is based on word frequency statistics: the word frequency score of the i-th sample x_i on the k-th word class c_k is computed from word counts, where count(A) denotes the number of words in A and x_i ∩ c_k denotes the set of words in sample x_i that belong to the k-th word class c_k. The score is mapped to the range 0 to 10.
For each word class, the score of every sample in the positive sample set and of every sample in the negative sample set is calculated and binned, that is, rounded to an integer score in the interval 0 to 10. The proportion of samples at each score is then counted separately for the positive and negative samples, and the distributions are plotted for visual analysis.
For example, FIG. 4 shows the score distribution, over the "bank-related" word class, of the positive and negative sample resumes for a financial sales post. The horizontal axis represents the score, and the vertical axis represents the proportion of the corresponding samples in the positive and negative sample sets.
Optionally, as the example of FIG. 4 shows, samples with a score of 0 clearly dominate, so for ease of analysis the 0-score data can be removed before detailed analysis. FIG. 5 shows the distribution after the 0-score data is removed, which is clearly easier to analyze.
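The scoring and binning steps can be sketched as follows. Because the exact score formula is not reproduced in this text, the sketch assumes the score is the share of a sample's words that belong to the class, scaled to 0 to 10; the function names and toy data are likewise hypothetical.

```python
from collections import Counter

def class_score(sample_words, word_class):
    """Assumed score: share of the sample's words that fall in the class, scaled to 0-10."""
    if not sample_words:
        return 0.0
    hits = sum(1 for w in sample_words if w in word_class)
    return 10.0 * hits / len(sample_words)

def score_distribution(samples, word_class):
    """Proportion of samples in each integer score bin 0..10 (round half up)."""
    bins = Counter(min(10, int(class_score(s, word_class) + 0.5)) for s in samples)
    return [bins.get(b, 0) / len(samples) for b in range(11)]

# Hypothetical "bank-related" word class and segmented resumes.
bank_class = {"bank", "teller", "deposit"}
pos = [["bank", "teller", "sales", "client"], ["bank", "deposit", "teller", "loan", "fund"]]
neg = [["game", "design"], ["bank", "game", "art", "ui"]]
pos_dist = score_distribution(pos, bank_class)
neg_dist = score_distribution(neg, bank_class)
print(pos_dist)
print(neg_dist)
```

The two resulting lists play the role of the per-score sample proportions plotted for the positive and negative sets.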
S304: calculating, with the JS divergence algorithm, the JS divergence of each word class after the low-discrimination words are removed, and determining the word classes whose JS divergence is larger than a preset value as the word classes to be extracted;
The difference between the distributions of a word class over the positive and negative samples represents the ability of that word class feature to distinguish positive samples from negative samples. Word class features with strong distinguishing ability can be spotted to some extent from the plots, but a mathematical analysis is clearly more convincing.
The KL divergence is commonly used to measure the dissimilarity of two distributions. For two probability distributions P (positive samples) and Q (negative samples), the KL divergence is calculated as follows:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Note that the KL divergence is asymmetric, that is, swapping the positions of P and Q yields a different result, so embodiments of the present invention use the JS divergence, which is based on the KL divergence, to measure the difference between the positive and negative sample distributions. The JS divergence is calculated as follows:

$$D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M), \quad M = \frac{1}{2}(P + Q)$$

The JS divergence is symmetric, ranges from 0 to 1 when base-2 logarithms are used, and equals 0 when the two distributions are identical. Therefore, after the JS divergence of each word class over the positive and negative samples is calculated, a larger value indicates a stronger ability of the word class to separate positive and negative samples, and such a word class may be selected as a feature. A corresponding threshold can be preset for comparison, and the word classes whose JS divergence is larger than the threshold are determined as the word classes to be extracted.
For example, the two distributions (0.65, 0.25, 0.07, 0.03) and (0.1, 0.2, 0.3, 0.4) have a JS divergence of about 0.247 (computed with natural logarithms), which indicates a certain degree of difference.
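The worked example can be reproduced with a few lines of Python. This sketch uses natural logarithms, which is the base under which the value 0.247 is obtained; the function names are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) in nats; terms with P(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JS divergence: average KL divergence of P and Q to their midpoint M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = (0.65, 0.25, 0.07, 0.03)   # score distribution of the positive samples
q = (0.1, 0.2, 0.3, 0.4)       # score distribution of the negative samples
js = js_divergence(p, q)
print(round(js, 3))
```

Dividing the result by ln 2 converts it to base-2 units, in which the JS divergence is bounded by 1.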
S305: calculating the word frequency score of each word class to be extracted, to obtain the feature data of the preset field.
The word frequency score of each word class to be extracted is calculated in the same way as in S303: it is computed from word counts, where count(A) denotes the number of words in A and x_i ∩ c_k denotes the set of words in sample x_i that belong to the k-th word class c_k. The score is mapped to the range 0 to 10.
S104: inputting the feature data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing, to obtain a processing result indicating whether the electronic resume to be screened passes the screening, wherein a positive sample in the training data of the resume screening model corresponding to the target post is the feature data of an electronic resume that passed the screening for the target post, and a negative sample is the feature data of an electronic resume that did not pass the screening for the target post.
Specifically, the resume screening model is a statistical learning model such as a logistic regression model. Compared with a neural network model or an ensemble learning model, it is simple and stable, needs fewer samples and is highly interpretable: the weight learned for each feature reflects the importance of that feature, which supports in-depth analysis of the business process alongside prediction and fits the data basis and business needs of resume analysis.
Optionally, when accuracy is the priority, an ensemble model such as XGBoost can be selected; if more samples are available, a neural network model can also be selected.
After the model is selected, it is trained with the features of the positive and negative samples to obtain the final prediction model. The positive and negative samples are defined as above, and their features are extracted in the same way as described above, which is not repeated here. The model typically outputs a score between 0 and 1. A threshold can be preset to make the business decision on the score: a score greater than or equal to the threshold indicates that the resume passes the screening, and a score less than the threshold indicates that it does not. The threshold is 0.5 by default but can be adjusted according to actual conditions: lowering the threshold increases the number of candidate resumes that pass but may let in more poorly matched resumes; raising the threshold makes the screening more precise but reduces the number of candidate resumes that pass.
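The scoring and threshold decision of a logistic regression model can be sketched as below. The weights, bias and feature vector are hypothetical placeholders for trained parameters and extracted features; only the default threshold of 0.5 and the pass rule come from the description.

```python
import math

def predict_score(features, weights, bias):
    """Logistic regression score in (0, 1): sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def passes_screening(score, threshold=0.5):
    """A resume passes when its score is greater than or equal to the threshold."""
    return score >= threshold

# Hypothetical trained parameters and one resume's feature vector.
weights = [0.8, -0.4, 0.3]
bias = -2.0
features = [1.0, 0.0, 6.0]
score = predict_score(features, weights, bias)
print(round(score, 3), passes_screening(score), passes_screening(score, threshold=0.7))
```

Raising the threshold from 0.5 to 0.7 flips this resume from pass to fail, which illustrates the trade-off between the number of passing resumes and screening precision.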
Preferably, in the model training stage, the ratio of positive to negative samples is analyzed for each type of feature, and features whose ratio is larger than a certain threshold or smaller than a certain threshold can be extracted as pre-rules.
For example, if samples whose major in the education experience falls outside the set {statistics, mathematics, computer science} account for less than 2% of the samples, it can be inferred that the post recruits only from the majors in the set {statistics, mathematics, computer science}, and a pre-rule can be derived accordingly.
In addition to the pre-rules derived from the sample data, in actual use business personnel can construct new pre-rules from prior knowledge at any time and add them to the rule set, so as to fully meet actual business needs.
On this basis, it is judged whether the feature data of the electronic resume to be screened conforms to the preset pre-rules. If not, the electronic resume is excluded; if so, the feature data of the electronic resume to be screened is input into the resume screening model corresponding to the target post for processing, to obtain a processing result indicating whether the electronic resume to be screened passes the screening. The screening efficiency of electronic resumes is thereby improved.
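A minimal sketch of the two-stage decision (pre-rules first, model second) is given below. The rule function, the dictionary layout of a resume and the stand-in model are all hypothetical; the major set is the one used in the example above.

```python
ALLOWED_MAJORS = {"statistics", "mathematics", "computer science"}

def major_rule(resume):
    """Pre-rule from the example: the major must be in the allowed set."""
    return resume.get("major") in ALLOWED_MAJORS

def screen(resume, pre_rules, model, threshold=0.5):
    """Exclude resumes that violate any pre-rule; otherwise defer to the model score."""
    if not all(rule(resume) for rule in pre_rules):
        return False
    return model(resume) >= threshold

calls = []

def stub_model(resume):
    """Stand-in for a trained screening model; records calls and returns a fixed score."""
    calls.append(resume["major"])
    return 0.8

accepted = screen({"major": "mathematics"}, [major_rule], stub_model)
rejected = screen({"major": "history"}, [major_rule], stub_model)
print(accepted, rejected, calls)
```

Because excluded resumes never reach the model, the pre-rules save model evaluations in addition to enforcing hard business constraints.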
Based on the method for screening electronic resumes disclosed in the above embodiments, this embodiment correspondingly discloses a device for screening electronic resumes. Referring to FIG. 6, the device includes:
a resume obtaining unit 100, configured to obtain an electronic resume to be screened;
a resume parsing unit 200, configured to parse the electronic resume to be screened to obtain a target post and a plurality of preset field data corresponding to the electronic resume to be screened;
a feature extraction unit 300, configured to perform feature extraction on each preset field data according to the feature extraction manner corresponding to the type of that preset field data, to obtain feature data of the electronic resume to be screened, where the feature extraction manners include: structured feature extraction, semi-structured feature extraction and unstructured feature extraction;
and a resume screening unit 400, configured to input the feature data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing, to obtain a processing result indicating whether the electronic resume to be screened passes the screening, where a positive sample in the training data of the resume screening model corresponding to the target post is the feature data of an electronic resume that passed the screening for the target post, and a negative sample is the feature data of an electronic resume that did not pass the screening for the target post.
Optionally, the resume parsing unit 200 is specifically configured to:
extracting text information of the electronic resume to be screened;
extracting post data from the text information, and determining the target post corresponding to the electronic resume to be screened;
judging whether the format of the text information meets a preset resume format or not;
if the preset resume format is met, extracting a plurality of preset field data from the text information according to the corresponding relation between preset fields and positions in the preset resume format;
and if the preset resume format is not met, extracting a plurality of preset field data from the text information by adopting a mode based on rules and keywords.
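The parsing branch described above (position-based extraction for a known format, rule- and keyword-based extraction otherwise) can be sketched as follows. The preset layout, the keyword patterns and the sample texts are hypothetical; a real preset resume format would be defined by the deployment.

```python
import re

# Hypothetical preset format: each field sits on a fixed line as "Label: value".
PRESET_LAYOUT = {"name": 0, "education": 1, "work_experience": 2}
KEYWORDS = {"name": "Name", "education": "Education", "work_experience": "Work Experience"}

def matches_preset_format(lines):
    """True when every preset field label is found at its expected line position."""
    return len(lines) == len(PRESET_LAYOUT) and all(
        lines[pos].startswith(KEYWORDS[field] + ":") for field, pos in PRESET_LAYOUT.items()
    )

def extract_fields(text):
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if matches_preset_format(lines):
        # Position-based extraction using the field-to-position correspondence.
        return {f: lines[pos].split(":", 1)[1].strip() for f, pos in PRESET_LAYOUT.items()}
    # Fallback: keyword rules that tolerate reordering and other separators.
    fields = {}
    for field, keyword in KEYWORDS.items():
        match = re.search(keyword + r"\s*[:：-]\s*(.+)", text, flags=re.IGNORECASE)
        if match:
            fields[field] = match.group(1).strip()
    return fields

preset = extract_fields("Name: Li Lei\nEducation: MSc Statistics\nWork Experience: 3 years bank sales")
free_form = extract_fields("RESUME\nwork experience - 3 years bank sales\nname：Li Lei")
print(preset)
print(free_form)
```

In the free-form case no education keyword is present, so that field is simply absent from the result rather than guessed.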
Optionally, when the type of the preset field data is structured data, the feature extraction unit 300 is specifically configured to:
extracting preset attribute feature data in the preset field data;
and performing feature coding on the preset attribute feature data to obtain feature data of the preset field data.
Optionally, when the type of the preset field data is semi-structured data, the feature extraction unit 300 is specifically configured to:
carrying out named entity recognition on the preset field data to obtain a plurality of entities;
determining a characteristic value corresponding to each characteristic name according to a preset corresponding relationship between the entity and the characteristic name and a preset corresponding relationship between the entity and the characteristic value;
and performing feature coding on the feature value corresponding to each feature name to obtain feature data of the preset field data.
Optionally, when the type of the preset field data is unstructured data, the feature extraction unit 300 is specifically configured to:
performing word segmentation processing, new word discovery processing and stop word processing on the preset field to obtain a plurality of formed words;
clustering the multiple formed words to obtain multiple word classes;
performing distribution statistics over the positive and negative samples for each word class, and removing low-discrimination words according to the distribution statistics of each word class;
calculating, with the JS divergence algorithm, the JS divergence of each word class after the low-discrimination words are removed, and determining the word classes whose JS divergence is larger than a preset value as the word classes to be extracted;
and calculating the word frequency score of each word class to be extracted, to obtain the feature data of the preset field.
Optionally, the resume screening unit 400 is specifically configured to:
judging whether the characteristic data of the electronic resume to be screened accords with a preset preposed rule or not;
and if the preset rule is met, inputting the characteristic data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing to obtain a processing result indicating whether the electronic resume to be screened passes the screening.
With the device for screening electronic resumes, the electronic resume to be screened is parsed to obtain the target post and the plurality of preset field data corresponding to the electronic resume to be screened, and the preset fields can be configured as the fields that best reflect the recruitment requirements. According to the type of each preset field data, structured feature extraction and/or semi-structured feature extraction and/or unstructured feature extraction is performed on the preset field data, yielding feature data that comprehensively and accurately reflects the characteristics of the applicant. The feature data of the electronic resume to be screened is then input into a resume screening model trained with the feature data of electronic resumes that passed the screening for the target post as positive samples and the feature data of electronic resumes that did not pass the screening for the target post as negative samples. The electronic resume is thus screened quickly and accurately according to the processing result of the resume screening model, which improves the screening efficiency of electronic resumes.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (12)
1. A method for screening electronic resumes is characterized by comprising the following steps:
acquiring an electronic resume to be screened;
analyzing the electronic resume to be screened to obtain a target post corresponding to the electronic resume to be screened and a plurality of preset field data;
performing feature extraction on each preset field data according to the feature extraction manner corresponding to the type of that preset field data, to obtain feature data of the electronic resume to be screened, wherein the feature extraction manners comprise: structured feature extraction, semi-structured feature extraction and unstructured feature extraction;
and inputting the feature data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing, to obtain a processing result indicating whether the electronic resume to be screened passes the screening, wherein a positive sample in the training data of the resume screening model corresponding to the target post is the feature data of an electronic resume that passed the screening for the target post, and a negative sample is the feature data of an electronic resume that did not pass the screening for the target post.
2. The method according to claim 1, wherein the analyzing the e-resume to be filtered to obtain a target post and a plurality of preset field data corresponding to the e-resume to be filtered comprises:
extracting text information of the electronic resume to be screened;
extracting post data from the text information, and determining the target post corresponding to the electronic resume to be screened;
judging whether the format of the text information meets a preset resume format or not;
if the preset resume format is met, extracting a plurality of preset field data from the text information according to the corresponding relation between preset fields and positions in the preset resume format;
and if the preset resume format is not met, extracting a plurality of preset field data from the text information by adopting a mode based on rules and keywords.
3. The method of claim 1, wherein when the type of the preset field data is structured data, performing structured feature extraction on the preset field data comprises:
extracting preset attribute feature data in the preset field data;
and performing feature coding on the preset attribute feature data to obtain feature data of the preset field data.
4. The method according to claim 1, wherein when the type of the preset field data is semi-structured data, performing semi-structured feature extraction on the preset field data comprises:
carrying out named entity recognition on the preset field data to obtain a plurality of entities;
determining a characteristic value corresponding to each characteristic name according to a preset corresponding relationship between the entity and the characteristic name and a preset corresponding relationship between the entity and the characteristic value;
and performing feature coding on the feature value corresponding to each feature name to obtain feature data of the preset field data.
5. The method of claim 1, wherein when the type of the preset field data is unstructured data, performing unstructured feature extraction on the preset field data comprises:
performing word segmentation processing, new word discovery processing and stop word processing on the preset field to obtain a plurality of formed words;
clustering the multiple formed words to obtain multiple word classes;
performing distribution statistics over the positive and negative samples for each word class, and removing low-discrimination words according to the distribution statistics of each word class;
calculating, with the JS divergence algorithm, the JS divergence of each word class after the low-discrimination words are removed, and determining the word classes whose JS divergence is larger than a preset value as the word classes to be extracted;
and calculating the word frequency score of each word class to be extracted, to obtain the feature data of the preset field.
6. The method according to claim 1, wherein the inputting the feature data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing to obtain a processing result indicating whether the electronic resume to be screened passes the screening includes:
judging whether the characteristic data of the electronic resume to be screened accords with a preset preposed rule or not;
and if the preset rule is met, inputting the characteristic data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing to obtain a processing result indicating whether the electronic resume to be screened passes the screening.
7. A device for screening electronic resumes, characterized by comprising:
the resume acquisition unit is used for acquiring the electronic resumes to be screened;
the resume analyzing unit is used for analyzing the electronic resumes to be screened to obtain target posts and a plurality of preset field data corresponding to the electronic resumes to be screened;
the feature extraction unit is configured to perform feature extraction on each preset field data according to the feature extraction manner corresponding to the type of that preset field data, to obtain feature data of the electronic resume to be screened, wherein the feature extraction manners comprise: structured feature extraction, semi-structured feature extraction and unstructured feature extraction;
and the resume screening unit is used for inputting the characteristic data of the electronic resumes to be screened into the resume screening model corresponding to the target post for processing to obtain a processing result indicating whether the electronic resumes to be screened pass the screening, wherein a positive sample in the training data of the resume screening model corresponding to the target post is the characteristic data of the electronic resumes which pass the screening of the target post, and a negative sample is the characteristic data of the electronic resumes which do not pass the screening of the target post.
8. The apparatus according to claim 7, wherein the resume parsing unit is specifically configured to:
extracting text information of the electronic resume to be screened;
extracting post data from the text information, and determining the target post corresponding to the electronic resume to be screened;
judging whether the format of the text information meets a preset resume format or not;
if the preset resume format is met, extracting a plurality of preset field data from the text information according to the corresponding relation between preset fields and positions in the preset resume format;
and if the preset resume format is not met, extracting a plurality of preset field data from the text information by adopting a mode based on rules and keywords.
9. The apparatus according to claim 7, wherein when the type of the preset field data is structured data, the feature extraction unit is specifically configured to:
extracting preset attribute feature data in the preset field data;
and performing feature coding on the preset attribute feature data to obtain feature data of the preset field data.
10. The apparatus according to claim 7, wherein when the type of the preset field data is semi-structured data, the feature extraction unit is specifically configured to:
carrying out named entity recognition on the preset field data to obtain a plurality of entities;
determining a characteristic value corresponding to each characteristic name according to a preset corresponding relationship between the entity and the characteristic name and a preset corresponding relationship between the entity and the characteristic value;
and performing feature coding on the feature value corresponding to each feature name to obtain feature data of the preset field data.
11. The apparatus according to claim 7, wherein when the type of the preset field data is unstructured data, the feature extraction unit is specifically configured to:
performing word segmentation processing, new word discovery processing and stop word processing on the preset field to obtain a plurality of formed words;
clustering the multiple formed words to obtain multiple word classes;
performing distribution statistics over the positive and negative samples for each word class, and removing low-discrimination words according to the distribution statistics of each word class;
calculating, with the JS divergence algorithm, the JS divergence of each word class after the low-discrimination words are removed, and determining the word classes whose JS divergence is larger than a preset value as the word classes to be extracted;
and calculating the word frequency score of each word class to be extracted, to obtain the feature data of the preset field.
12. The apparatus according to claim 7, wherein the resume screening unit is specifically configured to:
judging whether the characteristic data of the electronic resume to be screened accords with a preset preposed rule or not;
and if the preset rule is met, inputting the characteristic data of the electronic resume to be screened into the resume screening model corresponding to the target post for processing to obtain a processing result indicating whether the electronic resume to be screened passes the screening.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011008492.0A CN112115705A (en) | 2020-09-23 | 2020-09-23 | Method and device for screening electronic resume |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112115705A (en) | 2020-12-22 |
Family
ID=73800686
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011008492.0A Pending CN112115705A (en) | 2020-09-23 | 2020-09-23 | Method and device for screening electronic resume |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112115705A (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160232456A1 (en) * | 2015-02-06 | 2016-08-11 | Box, Inc. | Method and system for implementing machine learning analysis of documents |
CN106354872A (en) * | 2016-09-18 | 2017-01-25 | 广州视源电子科技股份有限公司 | Text clustering method and system |
CN106663038A (en) * | 2014-06-30 | 2017-05-10 | 亚马逊科技公司 | Feature processing recipes for machine learning |
CN106934220A (en) * | 2017-02-24 | 2017-07-07 | 黑龙江特士信息技术有限公司 | Towards the disease class entity recognition method and device of multi-data source |
CN107220311A (en) * | 2017-05-12 | 2017-09-29 | 北京理工大学 | A kind of document representation method of utilization locally embedding topic modeling |
US20190096386A1 (en) * | 2017-09-28 | 2019-03-28 | Baidu Online Network Technology (Beijing) Co., Ltd | Method and apparatus for generating speech synthesis model |
CN109960720A (en) * | 2019-03-21 | 2019-07-02 | 于建岗 | For the information extraction method of semi-structured text |
CN110647626A (en) * | 2019-07-30 | 2020-01-03 | 浙江工业大学 | REST data service clustering method based on Internet service domain |
CN110765759A (en) * | 2019-10-21 | 2020-02-07 | 普信恒业科技发展(北京)有限公司 | Intention identification method and device |
CN110931128A (en) * | 2019-12-05 | 2020-03-27 | 中国科学院自动化研究所 | Method, system and device for automatically identifying unsupervised symptoms of unstructured medical texts |
CN110941703A (en) * | 2019-12-03 | 2020-03-31 | 南京烽火星空通信发展有限公司 | Integrated resume information extraction method based on machine learning and fuzzy rules |
CN111242565A (en) * | 2019-12-31 | 2020-06-05 | 广州轩辕研究院有限公司 | Resume optimization method and device based on intelligent personnel model |
CN111311180A (en) * | 2020-02-10 | 2020-06-19 | 腾讯云计算(北京)有限责任公司 | Resume screening method and device |
-
2020
- 2020-09-23 CN CN202011008492.0A patent/CN112115705A/en active Pending
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106663038A (en) * | 2014-06-30 | 2017-05-10 | 亚马逊科技公司 | Feature processing recipes for machine learning |
US20160232456A1 (en) * | 2015-02-06 | 2016-08-11 | Box, Inc. | Method and system for implementing machine learning analysis of documents |
CN106354872A (en) * | 2016-09-18 | 2017-01-25 | 广州视源电子科技股份有限公司 | Text clustering method and system |
CN106934220A (en) * | 2017-02-24 | 2017-07-07 | 黑龙江特士信息技术有限公司 | Towards the disease class entity recognition method and device of multi-data source |
CN107220311A (en) * | 2017-05-12 | 2017-09-29 | 北京理工大学 | A kind of document representation method of utilization locally embedding topic modeling |
US20190096386A1 (en) * | 2017-09-28 | 2019-03-28 | Baidu Online Network Technology (Beijing) Co., Ltd | Method and apparatus for generating speech synthesis model |
CN109960720A (en) * | 2019-03-21 | 2019-07-02 | 于建岗 | For the information extraction method of semi-structured text |
CN110647626A (en) * | 2019-07-30 | 2020-01-03 | 浙江工业大学 | REST data service clustering method based on Internet service domain |
CN110765759A (en) * | 2019-10-21 | 2020-02-07 | 普信恒业科技发展(北京)有限公司 | Intention identification method and device |
CN110941703A (en) * | 2019-12-03 | 2020-03-31 | 南京烽火星空通信发展有限公司 | Integrated resume information extraction method based on machine learning and fuzzy rules |
CN110931128A (en) * | 2019-12-05 | 2020-03-27 | 中国科学院自动化研究所 | Method, system and device for automatically identifying unsupervised symptoms of unstructured medical texts |
CN111242565A (en) * | 2019-12-31 | 2020-06-05 | 广州轩辕研究院有限公司 | Resume optimization method and device based on intelligent personnel model |
CN111311180A (en) * | 2020-02-10 | 2020-06-19 | 腾讯云计算(北京)有限责任公司 | Resume screening method and device |
Non-Patent Citations (5)
Title |
---|
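Daniel J. Trosten et al.: "Recurrent Deep Divergence-based Clustering for Simultaneous Feature Learning and Clustering of Variable Length Time Series", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3257 * |
Tian Shihai et al.: "Research on public opinion event clustering based on NRL and k-means", Information Science, vol. 39, no. 2, pages 129 * |
Gu Nannan; Feng Jun; Sun Xia; Zhao Yan; Zhang Lei: "Automatic parsing and recommendation algorithm for Chinese resumes", Computer Engineering and Applications, vol. 53, no. 18, pages 141 * |
Di Liang et al.: "Application of the LDA model in microblog user recommendation", Computer Engineering, vol. 40, no. 5, pages 1 * |
Li Nan; Du Yongping; He Ming: "Patent inventor recommendation method based on topic discovery", Technology Intelligence Engineering, vol. 1, no. 03, pages 90 * |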
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Grimmer et al. | Text as data: A new framework for machine learning and the social sciences | |
CN108376151B (en) | Question classification method and device, computer equipment and storage medium | |
Stein et al. | Intrinsic plagiarism analysis | |
Grimmer et al. | Text as data: The promise and pitfalls of automatic content analysis methods for political texts | |
Azpiazu et al. | Multiattentive recurrent neural network architecture for multilingual readability assessment | |
US20230075341A1 (en) | Semantic map generation employing lattice path decoding | |
US7827133B2 (en) | Method and arrangement for SIM algorithm automatic charset detection | |
CN113094578B (en) | Deep learning-based content recommendation method, device, equipment and storage medium | |
US9348901B2 (en) | System and method for rule based classification of a text fragment | |
Jerzak et al. | An improved method of automated nonparametric content analysis for social science | |
Minhas et al. | From spin to swindle: Identifying falsification in financial text | |
CN109165529B (en) | Dark chain tampering detection method and device and computer readable storage medium | |
KR20180120488A (en) | Classification and prediction method of customer complaints using text mining techniques | |
Schofield et al. | Identifying hate speech in social media | |
Imperial et al. | Developing a machine learning-based grade level classifier for Filipino children’s literature | |
Akram et al. | ISE-Hate: A benchmark corpus for inter-faith, sectarian, and ethnic hatred detection on social media in Urdu | |
US11599580B2 (en) | Method and system to extract domain concepts to create domain dictionaries and ontologies | |
CN112487132A (en) | Keyword determination method and related equipment | |
CN112507115B (en) | Method and device for classifying emotion words in barrage text and storage medium | |
CN112115705A (en) | Method and device for screening electronic resume | |
US20170293863A1 (en) | Data analysis system, and control method, program, and recording medium therefor | |
Suhud et al. | Recognizing Public Satisfaction Toward Kampus Mengajar Program with Naive Bayes | |
Dhanya et al. | Comparative performance of machine learning algorithms in detecting offensive speech in malayalam-english code-mixed data | |
CN111611394A (en) | Text classification method and device, electronic equipment and readable storage medium | |
Slobodzian et al. | An Approach Based on the Visualization Model for the Ukrainian Web Content Classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||