CN111274404A

CN111274404A - Small sample entity multi-field classification method based on man-machine cooperation

Info

Publication number: CN111274404A
Application number: CN202010088532.0A
Authority: CN
Inventors: 高汕; 李健; 宗畅; 吴海燕
Original assignee: Hangzhou Liangzhi Data Technology Co ltd
Current assignee: Hangzhou Liangzhi Data Technology Co ltd
Priority date: 2020-02-12
Filing date: 2020-02-12
Publication date: 2020-06-12
Anticipated expiration: 2040-02-12
Also published as: CN111274404B

Abstract

The invention discloses a multi-field classification method for entities. Attribute semantic vocabularies of the entities in each field are first obtained by crowdsourcing and are then matched against the attribute texts of the entities. A score is calculated from the matching results with a calculation formula and compared with a threshold to obtain the classification results. Small batches of training samples are generated by checking the correctness of the results with expert knowledge, and the formula coefficients are adjusted automatically by grid search on these small samples to improve recall and accuracy. By continuously and automatically optimizing the classification effect, the method addresses the large amount of text that must be checked in manual entity classification. The invention makes full use of crowdsourcing, man-machine cooperation and semi-supervised learning, and can quickly perform multi-field classification of entities when labeled data are lacking.

Description

Small sample entity multi-field classification method based on man-machine cooperation

Technical Field

The invention relates to the fields of computer technology, artificial intelligence, natural language processing and label classification, in particular to a man-machine cooperation multi-source text content cognition method under the classification scene of the industrial chain field.

Background

Industrial chain analysis plays an important role in the development of regional economies and industries, but there is no good method for classifying the entities in an industrial chain. At present, the field to which an entity belongs can only be judged and labeled manually from the entity's attribute descriptions.

In manual labeling, the same field is described with different words in different attribute texts. For example, the computer vision field is described as "vision algorithm" in patents, as "face recognition" in products, and as "CV algorithm engineer" in recruitment positions. Enumerating by hand all the words that carry field semantics would require enormous effort.

Automatic classification with keywords specified by simple rules cannot achieve both high accuracy and high recall: if the selected keywords cover too little, recall is low, and if they cover everything, accuracy is low. The feature descriptions that help determine the field of an entity are reflected in the text data of each attribute dimension, and the closeness of the association between keywords and fields can be reasonably quantified by statistical probability analysis.

If entity field classification relies purely on deep learning and machine learning algorithms, there are three main defects: first, a large amount of labeled corpus is needed for training; second, the text must be specially preprocessed and quantized into computable data before use; third, the black-box nature of deep learning models makes the final result hard to interpret and the classification basis hard to trace.

Therefore, those skilled in the art urgently need a semi-supervised entity field classification method that collects semantics through crowd wisdom and achieves high classification accuracy when trained on a small corpus.

Disclosure of Invention

In view of the above, the invention provides a statistical probability text matching algorithm based on man-machine cooperation. By combining crowdsourced collection, expert verification and similar means, the method solves the multi-field classification of entities; it achieves high classification accuracy and can be applied to entities of various types and to fields in different industries.

In order to achieve the purpose, the invention adopts the following technical scheme:

A multi-field classification method for small sample entities based on man-machine cooperation comprises the following steps:

S1: obtaining semantic vocabularies related to the entities by crowdsourcing, wherein each semantic word returned by crowdsourcing carries three dimensions: the field to which it belongs, the attribute to which it belongs, and its semantic association degree with that field;

S2: initializing the parameters required for entity field classification, wherein the initialization parameters comprise the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and the classification threshold;

S3: acquiring the multi-attribute texts of the entities, matching each attribute text of an entity with the semantic vocabularies of the different fields obtained in S1, and calculating the score of each entity in each field according to the matching results;

S4: comparing the scores obtained in S3 with the classification threshold to obtain classification results, and generating training data after the classification results are verified;

S5: determining optimal parameters through grid search based on the training data;

S6: predicting the field of an unknown entity to be classified based on the optimal parameters.

Based on the technical scheme, the steps can be realized by adopting the following preferred mode:

preferably, the specific method of step S1 is as follows:

S11: on a crowdsourcing solution platform, obtaining the semantic vocabularies in the multi-attribute texts of the entities by crowdsourcing, wherein contributors either mark semantic words in each attribute text of the entity or directly provide semantic words and mark where they come from; each crowdsourcing result returned contains the semantic word together with three dimensions: the field to which it belongs, the attribute to which it belongs, and its semantic association degree with that field; a semantic word belongs to one or more attribute dimensions;

S12: checking the crowdsourcing results and writing them into a database after they pass the check; all semantic words in the database that belong to the j-th field form the dictionary $D_j$, where j = 1, 2, …, M and M is the total number of field categories of the entities.
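The grouping in S12 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout (`SemanticWord` with fields `word`, `domain`, `attribute`, `level`) and the function name `build_dictionaries` are assumptions, and the sample records other than the "vision algorithm" patent entry use illustrative association levels.

```python
from collections import defaultdict
from typing import NamedTuple


class SemanticWord(NamedTuple):
    """One verified crowdsourcing record (hypothetical layout)."""
    word: str        # the semantic vocabulary item
    domain: str      # field the word belongs to
    attribute: str   # attribute dimension, e.g. "patent"
    level: str       # semantic association degree: "high" / "medium" / "low"


def build_dictionaries(records):
    """Group verified records by domain, giving one dictionary D_j per domain."""
    dictionaries = defaultdict(list)
    for rec in records:
        dictionaries[rec.domain].append(rec)
    return dict(dictionaries)


records = [
    SemanticWord("vision algorithm", "computer vision", "patent", "high"),
    SemanticWord("face recognition", "computer vision", "product", "high"),
    SemanticWord("CV algorithm engineer", "computer vision", "recruitment", "medium"),
    SemanticWord("remote sensing", "geographic information", "patent", "high"),
]
D = build_dictionaries(records)
```

Here `D["computer vision"]` plays the role of one dictionary $D_j$; a word that belongs to several attribute dimensions would simply appear in several records.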

Preferably, the specific method of step S2 is as follows:

S21: initializing the total score of each field to 100 and distributing it evenly over the attribute dimensions, so that the attribute score of the i-th attribute is $A_i = 100/I$, where I is the number of attributes;

S22: initializing the weight coefficient of the association degree of the semantic vocabulary under each attribute, wherein the higher the association degree between a semantic word and the field to which it belongs, the higher the weight coefficient;

S23: initializing the classification threshold to be equal to $A_i$.

Preferably, in step S2, the association degree between the semantic vocabulary and the field is divided into three levels: high, medium and low; when the association degree is high, the weight coefficient $B_{1i} = 1.0$; when the association degree is medium, the weight coefficient $B_{2i} = 0.8$; when the association degree is low, the weight coefficient $B_{3i} = 0.4$.
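A minimal sketch of the initialization in S21 to S23 with the three-level weights above. The function name `init_parameters`, its argument names and the dictionary layout are assumptions made for illustration; the attribute names are those of an enterprise entity.

```python
def init_parameters(attributes, levels=("high", "medium", "low"),
                    level_weights=(1.0, 0.8, 0.4), total_score=100.0):
    """Return initial attribute scores A_i, weights B_ni and the threshold."""
    a = total_score / len(attributes)            # S21: A_i = 100 / I
    attr_score = {attr: a for attr in attributes}
    # S22: the same level weights are assigned to every attribute at the start
    weight = {attr: dict(zip(levels, level_weights)) for attr in attributes}
    threshold = a                                # S23: threshold starts at A_i
    return attr_score, weight, threshold


attributes = ["name", "introduction", "patent", "software copyright", "recruitment"]
A, B, threshold = init_parameters(attributes)
```

With five attributes each attribute score is 20 and the initial threshold is 20.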

Preferably, the specific method of step S3 is as follows:

For each field in turn, the score of each entity in the j-th field, j = 1, 2, …, M, is calculated based on the semantic vocabulary dictionary $D_j$ of that field obtained in S1, wherein the calculation method comprises the following steps:

S31: obtaining the multi-attribute text of an entity, then matching each attribute text against each semantic word in the dictionary $D_j$, and outputting the number of occurrences of each semantic word of $D_j$ in the attribute text; within one attribute text, a semantic word that appears several times is counted only once;

S32: in the matching result obtained in S31, according to the semantic association degree of each semantic word in the dictionary $D_j$, counting the total number of occurrences of all semantic words of each semantic association degree in each attribute text of the entity;

S33: according to the statistical result obtained in S32, calculating the score of the entity in the j-th field, wherein the calculation formula is as follows:

$$\mathrm{Score}_j = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni}\, C_{ni}$$

wherein $A_i$ represents the attribute score of the i-th attribute, $B_{ni}$ represents the weight of the n-th semantic association degree for the i-th attribute, $C_{ni}$ represents the total number of occurrences of all semantic words of the n-th semantic association degree in the i-th attribute text of the entity, and N is the number of semantic association degree levels; if the value of $\sum_{n=1}^{N} B_{ni}\, C_{ni}$ is greater than 1, it is set equal to 1, so that the maximum score accumulated over all attribute dimensions stays the same.
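The scoring step S33 can be sketched as below, assuming the score is the sum over attributes of $A_i$ times the weighted match count $\sum_n B_{ni} C_{ni}$, with that weighted count capped at 1 as the text states. The function name `domain_score` and the nested-dictionary layout are assumptions, and the numbers are illustrative.

```python
def domain_score(attr_score, weight, counts):
    """Score of one entity in one domain.

    attr_score: {attribute: A_i}
    weight:     {attribute: {level: B_ni}}
    counts:     {attribute: {level: C_ni}}, counted against dictionary D_j only
    """
    total = 0.0
    for attr, a_i in attr_score.items():
        weighted = sum(weight[attr][level] * c
                       for level, c in counts.get(attr, {}).items())
        total += a_i * min(1.0, weighted)   # cap each attribute at A_i
    return total


A = {"patent": 50.0, "recruitment": 50.0}
B = {attr: {"high": 1.0, "medium": 0.8, "low": 0.4} for attr in A}
C = {"patent": {"high": 2, "low": 1}, "recruitment": {"low": 1}}
score = domain_score(A, B, C)
```

In this example the patent attribute reaches the cap (2.4 is clipped to 1) and contributes 50, while the recruitment attribute contributes 50 × 0.4 = 20.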

Preferably, the specific method of step S4 is as follows:

S41: comparing the score of each entity in each field with the classification threshold, and if the score of the entity in a certain field is higher than the classification threshold, judging that the entity belongs to that field;

S42: checking the judgment results based on expert knowledge, and taking the entities confirmed as correct in each field, according to the result data that pass the check, as training data.
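A minimal sketch of S41 and S42. How expert verification is recorded is not specified in the source; here it is assumed to be a set of approved (entity, field) pairs, and the names `classify` and `build_training_data` and the sample scores are hypothetical.

```python
def classify(scores_by_entity, threshold):
    """S41: assign each entity every field whose score exceeds the threshold."""
    return {entity: {d for d, s in scores.items() if s > threshold}
            for entity, scores in scores_by_entity.items()}


def build_training_data(predicted, expert_approved):
    """S42: keep only the (entity, field) pairs that an expert confirmed.

    expert_approved: set of (entity, field) pairs that passed the check.
    """
    training = {}
    for entity, domains in predicted.items():
        kept = {d for d in domains if (entity, d) in expert_approved}
        if kept:
            training[entity] = kept
    return training


scores = {"Company A": {"cv": 60.0, "gis": 10.0},
          "Company B": {"cv": 25.0, "gis": 40.0}}
predicted = classify(scores, threshold=20.0)
training = build_training_data(predicted, {("Company A", "cv"), ("Company B", "gis")})
```

Company B is predicted for both fields, but only the expert-confirmed field survives into the training data.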

Preferably, the specific method of step S5 is as follows:

determining optimal parameters through grid search based on the training data obtained in S4, wherein the parameters of the grid search comprise the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and the classification threshold; the Jaccard coefficient is selected as the evaluation index of the optimal parameters, and its calculation formula is as follows:

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$

wherein x represents the field labels predicted for an entity, y represents the real field labels of the entity, $|x \cap y|$ represents the number of labels in the intersection of the predicted labels and the real labels, and $|x \cup y|$ represents the number of labels in their union; finally, the parameters corresponding to the maximum average Jaccard coefficient over all samples are selected as the optimal parameters of the grid search.
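The evaluation index can be sketched in a few lines. The handling of two empty label sets is not covered by the source; treating that case as a perfect match is an assumption made here to avoid dividing by zero, and the function names are illustrative.

```python
def jaccard(predicted, actual):
    """Jaccard coefficient |x ∩ y| / |x ∪ y| between two label sets."""
    x, y = set(predicted), set(actual)
    if not x and not y:
        return 1.0   # assumption: two empty label sets count as a perfect match
    return len(x & y) / len(x | y)


def mean_jaccard(predictions, truths):
    """Average Jaccard coefficient over all samples."""
    return sum(jaccard(p, t) for p, t in zip(predictions, truths)) / len(truths)
```

For instance, predicting only "cv" for an entity whose real labels are "cv" and "nlp" gives 1/2.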

Preferably, the training sample is expanded through multiple rounds of expansion of the semantic vocabulary base and expert knowledge verification, and the step of grid search in the step S5 is repeated after each expansion to determine new optimal parameters.

Preferably, the specific method of step S6 is as follows:

s61: according to the method of the step S3, acquiring the multi-attribute text of the unknown entity to be classified, matching each attribute text of the unknown entity with the semantic vocabulary of different fields obtained in the step S1, and calculating the scores of the unknown entity in the different fields according to the matching result;

s62: and then comparing the scores of the unknown entities belonging to each field with the classification threshold value in the optimal parameter, and if the scores of the entities belonging to a certain field are higher than the classification threshold value in the optimal parameter, judging the entities belonging to the field.

Preferably, when the multi-attribute text of the entity is acquired, if a plurality of texts exist under the same attribute, the plurality of texts are spliced to obtain the attribute text.

According to the technical scheme, compared with the prior art, the invention discloses a method for obtaining a semantic library by a crowdsourcing mode, grading and quantizing semantics, counting scores of an entity in a certain field according to whether the attribute of the entity contains semantic words in the field, and finally setting a threshold value to judge a classification result. When the method is used for entity classification, only a semantic vocabulary library and a database of various parameters need to be maintained, and the entity attribute text to be classified is transmitted into a system to obtain a classification result.

Enterprise entities in the database are classified by the classification method, the recall rate and the accuracy rate are calculated by random sampling, and the recall rate is more than 80% and the accuracy rate is more than 90% finally obtained after parameters are adjusted. The invention is applied to the classification of enterprise entities and expert entities in the fields of artificial intelligence and geographic information industrial chains, and can obtain good application effect.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flowchart illustrating an entity multi-domain classification algorithm according to an embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The method has the main innovation points that a mode for classifying the keywords by direct hard matching is softened in a probability counting mode, the accumulation efficiency of semantic vocabularies is improved in a crowdsourcing mode, training data is obtained by checking classification results by experts, parameters are optimized by grid search of machine learning, and the classification effect is improved by fully utilizing the advantages of man-machine cooperation. The method makes full use of knowledge precipitation and reduces the dependence on the labeled data.

The following details a specific implementation manner of the small sample entity multi-field classification method based on human-computer cooperation, which comprises the following steps:

s1: semantic vocabularies related to the entities are obtained in a crowdsourcing mode, and the semantic vocabularies returned in the crowdsourcing mode comprise three dimensions of the affiliated field, the affiliated attribute and the semantic association degree of the affiliated field of the semantic vocabularies.

In this implementation, the specific method of step S1 is as follows:

S11: on a crowdsourcing solution platform, semantic vocabularies in the multi-attribute texts (the texts of the entity's various attributes) of an entity are obtained by crowdsourcing, in which contributors either mark semantic words in each attribute text of the entity or directly provide semantic words and mark where they come from; each crowdsourcing result returned contains the semantic word together with three dimensions: the field to which it belongs, the attribute to which it belongs, and its semantic association degree with that field; a semantic word belongs to one or more attribute dimensions. For example, for the semantic word "vision algorithm" in a patent text, the crowdsourcing result may label the field as "computer vision", the attribute as "patent" and the semantic association degree as "high", and the result is returned for subsequent verification. The crowdsourcing solution platform may be an open source tool or an independently developed tool for a specific scenario; when a crowdsourcing task is issued, a number of fixed fields, attribute dimensions and semantic association degrees can be preset so that the returned crowdsourcing results meet the requirements.

S2: initializing the parameters required for entity field classification, wherein the initialization parameters comprise the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and the classification threshold.

In this implementation, the specific method of step S2 is as follows:

S21: the total score of each field is initialized to 100 and then distributed evenly over the attribute dimensions, i.e. the attribute score of the i-th attribute is $A_i = 100/I$, where I is the number of attributes.

In the invention, the specific attributes differ from one entity type to another. For example, an enterprise entity may contain attributes such as the enterprise introduction, enterprise name, patents, software copyrights and recruitment positions; an expert entity may contain attributes such as papers, patents, a personal profile, research fields and works.

S22: the weight coefficient of the association degree of the semantic vocabulary under each attribute is initialized, wherein the higher the association degree between a semantic word and the field to which it belongs, the higher the weight coefficient. The number of association degree levels can be adjusted as needed, and 2 to 5 levels are generally suitable. For example, in this implementation the association degree is divided into three levels: high, medium and low; when the association degree is high, the weight coefficient $B_{1i} = 1.0$; when the association degree is medium, the weight coefficient $B_{2i} = 0.8$; when the association degree is low, the weight coefficient $B_{3i} = 0.4$.

S23: the classification threshold is initialized to be equal to $A_i$.

S3: and acquiring a multi-attribute text of the entity, matching each attribute text of the entity with the semantic vocabulary in the different fields acquired in the step S1, and calculating the score of each entity in the different fields according to the matching result.

In this implementation, the specific method of step S3 is as follows:

For each field in turn, the score of each entity in the j-th field (j = 1, 2, …, M) is calculated based on the semantic vocabulary dictionary $D_j$ of that field obtained in S1, as follows:

S31: first, the multi-attribute text of the entity is acquired; the attribute texts differ according to the entity type. For example, when the entity to be classified is an enterprise entity, its attribute texts may include the enterprise introduction, enterprise name, patents, software copyrights and recruitment positions; when the entity to be classified is an expert entity, its attribute texts may include papers, patents, a personal profile, research fields and works. If several texts exist under the same attribute, they are spliced to obtain the attribute text. The attribute texts may be crawled from the web or obtained in other ways.

Then each attribute text is matched against each semantic word in the dictionary $D_j$: regular-expression matching is used to determine whether the attribute text contains the semantic word, which yields the number of occurrences of each semantic word of $D_j$ in the attribute text. Within one attribute text, a semantic word that appears several times is counted only once.
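The matching in S31 could look like the following sketch. The source only states that regular-expression matching is used and that repeated occurrences count once; escaping the words, matching case-insensitively, the function name `count_matches` and the sample vocabulary are assumptions.

```python
import re
from collections import Counter


def count_matches(attribute_text, dictionary):
    """Count matched words per association level for one attribute text.

    dictionary: iterable of (word, level) pairs from D_j for this attribute.
    A word found several times in the text is still counted once.
    """
    counts = Counter()
    for word, level in dictionary:
        # re.escape keeps vocabulary items literal; matching ignores case
        if re.search(re.escape(word), attribute_text, flags=re.IGNORECASE):
            counts[level] += 1
    return dict(counts)


patent_words = [("vision algorithm", "high"), ("image", "low"), ("lidar", "medium")]
text = "A vision algorithm for image denoising; the vision algorithm runs on image patches."
c = count_matches(text, patent_words)
```

Although "vision algorithm" and "image" each appear twice in the text, each contributes a count of 1 to its level.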

The matching result counts the number of matched words under each semantic association degree for each attribute, denoted $C_{ni}$, where the subscript i denotes the i-th attribute and n denotes the n-th semantic association degree, i = 1, 2, …, I; n = 1, 2, …, N. N is the total number of levels of association between the semantic vocabulary and the field, generally 2 to 5. In this implementation, since the association degree has three levels (high, medium and low), N = 3.

The score of the entity in the j-th field is then calculated as

$$\mathrm{Score}_j = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni}\, C_{ni}$$

If the value of $\sum_{n=1}^{N} B_{ni}\, C_{ni}$ is greater than 1, it is set equal to 1.

Note that when calculating the score of an entity in the j-th field, $C_{ni}$ should count only the occurrences of the semantic words in the dictionary $D_j$ corresponding to the j-th field. That is, in the invention, the score of an entity in a certain field is determined by whether the attributes of the entity contain semantic words of that field.

S4: and comparing and judging the score obtained in the step S3 with the classification threshold value to obtain a classification result, and verifying the classification result to generate training data.

In this implementation, the specific method of step S4 is as follows:

S42: checking the judgment results based on expert knowledge, eliminating the data that fail the check, and taking the entities confirmed as correct in each field, according to the result data that pass the check, as small-sample training data.

S5: based on the training data obtained in S42, the optimal parameters are determined through grid search.

In this implementation, the specific method of step S5 is as follows:

The optimal parameters are determined through grid search based on the training data obtained in S4. The parameters of the grid search include the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and the classification threshold. The Jaccard coefficient is selected as the evaluation index of the optimal parameters, and its calculation formula is as follows:

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$

wherein x represents the field labels predicted for an entity; y represents the real field labels of the entity; $|x \cap y|$ represents the number of labels in the intersection of the predicted labels and the real labels, and $|x \cup y|$ represents the number of labels in their union. The parameter ranges are generally set as follows: the attribute score $A_i$ ranges from 0 to 100, the attribute scores sum to 100, and the step in grid search is 5; the weight coefficient $B_{ni}$ of the semantic association degree ranges from 0 to 1.5, and the step in grid search is 0.1; the classification threshold ranges from 100/I to 100 (I is the number of attributes), and the step in grid search is 5. Finally, the grid search selects the parameters corresponding to the maximum average Jaccard coefficient over all samples as the optimal parameters.
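The grid search can be sketched as an exhaustive loop over parameter combinations. This is a reduced illustration under several assumptions, not the patent's implementation: the grid is much coarser than the steps given above (attribute scores in steps of 25, weights from a four-value set, thresholds in steps of 10) so that it runs instantly, one weight triple is shared by all attributes instead of a separate $B_{ni}$ per attribute, and the sample layout and function names are hypothetical.

```python
from itertools import product


def score(a, b, counts):
    """Sum over attributes of A_i * min(1, sum_n B_n * C_ni)."""
    return sum(a_i * min(1.0, sum(b_n * c for b_n, c in zip(b, c_i)))
               for a_i, c_i in zip(a, counts))


def jaccard(x, y):
    """|x ∩ y| / |x ∪ y|; two empty sets are treated as a match (assumption)."""
    return 1.0 if not (x or y) else len(x & y) / len(x | y)


def grid_search(samples, n_attrs, a_step=25, b_values=(0.0, 0.5, 1.0, 1.5), t_step=10):
    """Try every parameter combination and keep the best mean Jaccard.

    samples: list of (counts_by_domain, true_domains), where
             counts_by_domain = {domain: [[C_1i, C_2i, C_3i] for each attribute i]}
    """
    # attribute scores must add up to the total of 100
    a_grid = [a for a in product(range(0, 101, a_step), repeat=n_attrs) if sum(a) == 100]
    t_grid = range(100 // n_attrs, 101, t_step)
    best = (-1.0, None)
    for a, b, t in product(a_grid, product(b_values, repeat=3), t_grid):
        total = 0.0
        for counts_by_domain, truth in samples:
            pred = {d for d, c in counts_by_domain.items() if score(a, b, c) > t}
            total += jaccard(pred, truth)
        mean = total / len(samples)
        if mean > best[0]:          # ties keep the first combination found
            best = (mean, {"A": a, "B": b, "threshold": t})
    return best


samples = [
    ({"cv": [[1, 0, 0], [0, 1, 0]], "gis": [[0, 0, 1], [0, 0, 0]]}, {"cv"}),
    ({"cv": [[0, 0, 1], [0, 0, 0]], "gis": [[1, 1, 0], [1, 0, 0]]}, {"gis"}),
]
best_score, best_params = grid_search(samples, n_attrs=2)
```

With the full ranges and per-attribute weights the number of combinations grows very quickly, so in practice the grid would need to be restricted or searched in stages; that design choice is outside what the source describes.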

In practical use, the training samples should be expanded through multiple rounds of semantic vocabulary library expansion and expert knowledge verification, and the grid search in step S5 must be repeated after each expansion to determine new optimal parameters.

S6: and predicting the field of the unknown entity to be classified based on the determined optimal parameters.

In this implementation, the specific method of step S6 is as follows:

s61: and according to the method in the step S3, acquiring the multi-attribute text of the unknown entity to be classified, matching each attribute text of the unknown entity with the semantic vocabulary in the different fields obtained in the step S1, and calculating the score of the unknown entity in the different fields according to the matching result, which is specifically referred to in steps S31 to S33.

S62: and then comparing the scores of the unknown entities belonging to each field with the latest classification threshold value in the optimal parameter, and if the scores of the entities belonging to a certain field are higher than the classification threshold value in the optimal parameter, judging the entities belonging to the field. Thereby, a prediction result of the domain of the unknown entity is obtained, and the domain may have one or more or no corresponding domain.

The following shows a specific implementation of the method by way of example based on the above. In this embodiment, the specific steps are as described above, and are not described in detail, and the specific parameter settings and technical effects are mainly shown.

Examples

Referring to fig. 1, the method for classifying entities in multiple fields provided in this embodiment includes the steps of S1-S6, and the specific implementation process of each step is as follows:

Step 1: crowdsourcing to obtain semantic vocabulary

In the embodiment, semantic vocabularies belonging to different fields in texts with different attributes are obtained through a crowdsourcing platform, and the high, medium and low relevance importance of the vocabularies is distinguished. And writing the checked semantic vocabulary into a database.

Step 2: initializing various parameters in a calculation formula

In this embodiment, an enterprise entity is taken as the example: the name, introduction, patents, software copyrights and recruitment data of the enterprise are collected from the web, giving 5 attribute dimensions in total. The total score is set to 100 points, so each attribute is assigned 20 points, and the weight coefficients of all attribute dimensions are initialized to 1.0 (high), 0.8 (medium) and 0.4 (low).

Step 3: acquiring the multi-attribute text of the entity, matching it with the semantic vocabulary, and calculating the field category score according to the formula.

In this embodiment, the attribute texts of the entity are spliced first: patents are spliced from patent names and patent abstracts, software copyrights from the software copyright entries, and recruitment data from position names and position details. After each attribute text is matched with the corresponding semantic vocabulary, the number of matched words at each of the three levels (high, medium and low) is counted for each attribute. The matching results are stored in the database to facilitate querying, statistics and result analysis.

The calculation formula in this embodiment is:

$$\mathrm{Score}_j = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni}\, C_{ni}$$

wherein $A_i$ represents the attribute score of the i-th attribute, $B_{ni}$ represents the weight of the n-th semantic association degree for the i-th attribute, and $C_{ni}$ represents the total number of occurrences of all semantic words of the n-th semantic association degree in the i-th attribute text of the entity. In particular, if the value of $\sum_{n=1}^{N} B_{ni}\, C_{ni}$ is greater than 1, it is set equal to 1.
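As a worked illustration with hypothetical match counts (not taken from the source), take the embodiment's 5 attributes with $A_i = 20$ and weights 1.0, 0.8 and 0.4, and assume the score is the capped weighted sum over attributes described above. Suppose the patent text matches one high-level word, the recruitment text matches one medium-level and one low-level word, the introduction matches one low-level word, and the remaining two attributes match nothing:

```latex
\begin{aligned}
\text{patent:}\quad & \min(1,\ 1.0 \cdot 1) = 1.0 \\
\text{recruitment:}\quad & \min(1,\ 0.8 \cdot 1 + 0.4 \cdot 1) = \min(1,\ 1.2) = 1.0 \\
\text{introduction:}\quad & \min(1,\ 0.4 \cdot 1) = 0.4 \\
\mathrm{Score}_j &= 20 \cdot 1.0 + 20 \cdot 1.0 + 20 \cdot 0.4 + 20 \cdot 0 + 20 \cdot 0 = 48
\end{aligned}
```

Since 48 exceeds the initial threshold of 20, the entity would be assigned to this field under the initial parameters.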

Step 4: obtaining the classification result by threshold judgment, and generating training data from the expert-verified results.

In this embodiment, with the initial threshold of 20 points, every field in which an entity scores more than 20 points is taken as a classified field of that entity, and the classified fields of the entities are then checked by experts. The verified data are arranged into training data for the subsequent grid search for optimal parameters.

Step 5: using the training data in a grid search for the optimal parameters

The parameters of the grid search in this embodiment include the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and the classification threshold. The evaluation index is the Jaccard coefficient. In the parameter ranges, the attribute score generally ranges from 0 to 100 with a step of 5, subject to a total of 100 points; the weight coefficient of the semantic association degree ranges from 0 to 1.5 with a step of 0.1; the classification threshold ranges from 100/I to 100 (I is the number of attributes) with a step of 5. Finally, the grid search selects the parameters corresponding to the maximum average Jaccard coefficient over all samples as the final optimization result.

In the embodiment, training samples are expanded through multi-round expansion of the semantic library and expert verification, the grid search optimization parameters in the step 5 are repeated, final parameters are determined, and the parameters and corresponding versions adjusted each time are stored in the database.

Step 6: predicting unknown entities using finalized parameters

In this embodiment, the final parameter is read from the database according to the version number, then all semantic words are obtained, the attribute text of the entity is input, and the domain to which the entity belongs is output, where the output domain may be a single value, a multiple value, or a null value.
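Storing each adjusted parameter set under a version number and reading the final one back could be sketched as follows. The source does not name a database or schema; the in-memory SQLite database, the table `parameters`, its columns and the JSON encoding are all assumptions for illustration.

```python
import json
import sqlite3


def save_parameters(conn, version, params):
    """Store one parameter set under a version number."""
    conn.execute("INSERT INTO parameters (version, body) VALUES (?, ?)",
                 (version, json.dumps(params)))


def load_parameters(conn, version):
    """Read back the parameter set saved for the given version."""
    row = conn.execute("SELECT body FROM parameters WHERE version = ?",
                       (version,)).fetchone()
    if row is None:
        raise KeyError(f"no parameters stored for version {version}")
    return json.loads(row[0])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parameters (version INTEGER PRIMARY KEY, body TEXT)")
save_parameters(conn, 1, {"threshold": 20, "A": {"patent": 20}})
save_parameters(conn, 2, {"threshold": 35, "A": {"patent": 30}})
final = load_parameters(conn, 2)
```

Keeping every version makes it possible to compare classification results across rounds of grid search and to roll back to an earlier parameter set.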

It should be noted that if attributes of an entity are missing, the entity with the missing data should be handled separately.

To ensure the reliability of parameter adjustment, the training data should be as accurate as possible, and well-known entities in the field can be selected. For example, SenseTime, a well-known company in the computer vision field of the artificial intelligence industry, is used as training data for enterprise entity classification.

Enterprise entities in the database are classified by the classification method, the recall rate and the accuracy rate are calculated by random sampling, and the recall rate is more than 80% and the accuracy rate is more than 90% finally obtained after parameters are adjusted.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A multi-field classification method of small sample entities based on man-machine cooperation is characterized by comprising the following steps:

2. The method according to claim 1, wherein the specific method of step S1 is as follows:

3. The method according to claim 1, wherein the specific method of step S2 is as follows:

S23: initializing the classification threshold to be equal to $A_i$.

4. The method according to claim 3, wherein in step S2 the association degree between the semantic vocabulary and the field to which it belongs is divided into three levels: high, medium and low; when the association degree is high, the weight coefficient $B_{1i} = 1.0$; when the association degree is medium, the weight coefficient $B_{2i} = 0.8$; when the association degree is low, the weight coefficient $B_{3i} = 0.4$.

5. The method according to claim 1, wherein the specific method of step S3 is as follows:

S31: obtaining the multi-attribute text of an entity, then matching each attribute text against each semantic word in the dictionary $D_j$, and outputting the number of occurrences of each semantic word of $D_j$ in the attribute text; within one attribute text, a semantic word that appears several times is counted only once;

if the value of $\sum_{n=1}^{N} B_{ni}\, C_{ni}$ is greater than 1, it is set equal to 1.

6. The method according to claim 1, wherein the specific method of step S4 is as follows:

7. The method according to claim 1, wherein the specific method of step S5 is as follows:

determining the optimal parameters through grid search based on the training data obtained in S4, the parameters of the grid search including the attribute score $A_i$, the weight coefficient $B_{ni}$ of the semantic association degree, and the classification threshold; the Jaccard coefficient is selected as the evaluation index of the optimal parameters, and its calculation formula is as follows:

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$

8. The method of claim 1, wherein the training sample is expanded by multiple rounds of expanding the semantic vocabulary library and by expert knowledge verification, and the step of grid search to determine new optimal parameters in step S5 is repeated after each expansion.

9. The method according to claim 1, wherein the specific method of step S6 is as follows:

10. The method according to claim 1, wherein when the multi-attribute text of the entity is obtained, if a plurality of texts exist under the same attribute, the plurality of texts are spliced to obtain the attribute text.