CN111274404A - Small sample entity multi-field classification method based on man-machine cooperation - Google Patents

Publication number
CN111274404A
Authority
CN
China
Prior art keywords
semantic
attribute
entity
field
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010088532.0A
Other languages
Chinese (zh)
Other versions
CN111274404B (en)
Inventor
高汕
李健
宗畅
吴海燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Liangzhi Data Technology Co ltd
Original Assignee
Hangzhou Liangzhi Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Liangzhi Data Technology Co ltd
Priority to CN202010088532.0A
Publication of CN111274404A
Application granted
Publication of CN111274404B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for multi-field classification of entities. Attribute semantic vocabularies of the entities in each field are first obtained by crowdsourcing and are then matched against the attribute texts of the entities. From the matching results, a score is calculated with a scoring formula and compared with a threshold to obtain the classification result. Experts verify the correctness of the results, which yields small batches of training samples; on these small samples, grid search automatically adjusts the formula coefficients to improve recall and accuracy. Because the classification effect is optimized continuously and automatically, the method avoids the large volume of text that must be checked in manual entity classification. By combining crowdsourcing, man-machine cooperation, and semi-supervised learning, the invention can quickly implement multi-field classification of entities when labeled data are lacking.

Description

Small sample entity multi-field classification method based on man-machine cooperation
Technical Field
The invention relates to the fields of computer technology, artificial intelligence, natural language processing, and label classification, and in particular to a man-machine cooperative method for understanding multi-source text content in industrial chain field classification scenarios.
Background
Industrial chain analysis plays an important role in the development of regional economies and industries, but there is no good method for classifying entities in an industrial chain. At present, the field to which an entity belongs can only be judged and labeled manually from the entity's attribute descriptions.
During manual labeling, the field of an entity is described by different words in different attribute texts. For example, the computer vision field is described as "vision algorithm" in patents, as "face recognition" in products, and as "CV algorithm engineer" in recruitment postings. Exhaustively enumerating these field-bearing words by hand would require a huge effort.
Automatic classification with keywords specified by simple rules cannot achieve both accuracy and recall: if the selected keywords have incomplete coverage, recall is usually low; if coverage is complete, accuracy is low. The features that help judge the field of an entity are reflected in the text data of each attribute dimension, and a statistical probability analysis can reasonably quantify how closely a keyword is associated with a field.
Classifying entity fields purely with deep learning or machine learning algorithms has three main defects: first, a large labeled corpus is needed for training; second, the text must be specially preprocessed and quantized into computable data before use; third, the black-box nature of deep learning models makes the final result hard to interpret and the classification basis hard to trace.
Therefore, those skilled in the art urgently need a semi-supervised entity field classification method that collects semantics through crowd wisdom and achieves high classification accuracy with only a small training corpus.
Disclosure of Invention
In view of the above, the invention provides a statistical-probability text matching algorithm based on man-machine cooperation. By combining crowdsourced collection, expert verification, and similar means, the method solves the problem of multi-field classification of entities; it achieves high classification accuracy and can be applied to entities of different types and to fields in different industries.
To achieve this purpose, the invention adopts the following technical scheme:
A small sample entity multi-field classification method based on man-machine cooperation comprises the following steps:
S1: obtaining semantic vocabularies related to the entities by crowdsourcing, wherein each semantic word returned by crowdsourcing carries three dimensions: the field to which it belongs, the attribute to which it belongs, and its degree of semantic association with that field;
S2: initializing the parameters required for entity field classification, the initialized parameters comprising the attribute score A_i, the weight coefficient B_ni of the semantic association degree, and a classification threshold;
S3: acquiring the multi-attribute texts of the entities, matching each attribute text of an entity with the semantic vocabularies of the different fields obtained in S1, and calculating the score of each entity in each field according to the matching results;
S4: comparing the scores obtained in S3 with the classification threshold to obtain a classification result, and generating training data after the classification result is verified;
S5: determining the optimal parameters by grid search based on the training data;
S6: predicting the field of an unknown entity to be classified based on the optimal parameters.
On the basis of this technical scheme, the steps can be implemented in the following preferred manner:
Preferably, the specific method of step S1 is as follows:
S11: on a crowdsourcing platform, obtaining the semantic vocabularies in the multi-attribute texts of the entities by crowdsourcing, wherein contributors either mark semantic words in each attribute text of the entity or directly provide semantic words and mark where they come from; each crowdsourced result comprises the semantic word together with three dimensions, namely the field to which it belongs, the attribute to which it belongs, and its degree of semantic association with that field; a semantic word may belong to one or more attribute dimensions;
S12: verifying the crowdsourced results and writing them into a database after verification; all semantic words in the database that belong to the j-th field form a dictionary D_j, j = 1, 2, …, M, where M is the total number of field categories for the entities.
Preferably, the specific method of step S2 is as follows:
S21: initializing the total score of each field to 100 and dividing it evenly among the attribute dimensions, so that the attribute score of the i-th attribute is A_i = 100/I, where I is the number of attributes;
S22: initializing the weight coefficient of each semantic association degree under each attribute, wherein the higher the degree of association between a semantic word and its field, the larger the weight coefficient;
S23: initializing the classification threshold to be equal to A_i.
Preferably, in step S2, the degree of association between a semantic word and its field is divided into three levels: high, medium, and low. When the degree of association is high, the weight coefficient B_1i = 1.0; when it is medium, B_2i = 0.8; when it is low, B_3i = 0.4.
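The initialization in S21 to S23 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation; the function name `init_parameters`, the dictionary layout, and the attribute names are assumptions made for the example.

```python
def init_parameters(attributes, levels=("high", "medium", "low")):
    """Return initial attribute scores A_i, weights B_ni and the threshold.

    The layout (plain dicts keyed by attribute and level) is an assumption.
    """
    total_score = 100.0
    share = total_score / len(attributes)            # A_i = 100 / I
    attribute_scores = {attr: share for attr in attributes}
    default_weights = dict(zip(levels, (1.0, 0.8, 0.4)))
    # Every attribute starts with the same high/medium/low weights B_ni.
    weights = {attr: dict(default_weights) for attr in attributes}
    threshold = share                                # initial threshold = A_i
    return attribute_scores, weights, threshold


# Hypothetical attribute names for an enterprise entity.
scores, weights, threshold = init_parameters(
    ["name", "introduction", "patent", "software_copyright", "recruitment"]
)
```

Keeping the weights per attribute, even though they start out identical, leaves room for a later search to tune them separately.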
Preferably, the specific method of step S3 is as follows:
For each field in turn, based on the semantic vocabulary dictionary D_j obtained in S1 for that field, the score of each entity in the j-th field is calculated, j = 1, 2, …, M, as follows:
S31: acquiring the multi-attribute texts of the entity, matching each attribute text against each semantic word in the dictionary D_j, and outputting the number of occurrences of each semantic word of D_j in the attribute text; within one attribute text, a semantic word that appears several times is counted only once;
S32: in the matching results obtained in S31, according to the semantic association degree of each semantic word in the dictionary D_j, counting for each attribute text of the entity the total number of occurrences of all semantic words at each semantic association degree;
S33: according to the statistics obtained in S32, calculating the score of the entity for the j-th field by the following formula:

$$\mathrm{Score}_j = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni}\, C_{ni}$$

wherein A_i is the attribute score of the i-th attribute, B_ni is the weight of the n-th semantic association degree for the i-th attribute, and C_ni is the total number of occurrences, in the i-th attribute text of the entity, of all semantic words at the n-th semantic association degree. If the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1, which ensures that the accumulated score over all attribute dimensions has the same upper limit.
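A minimal Python sketch of the scoring step in S33, assuming the inner weighted sum is capped at 1 per attribute as described above. The function name `field_score`, the dictionary layout, and the two-attribute sample values are illustrative assumptions, not part of the source.

```python
def field_score(attribute_scores, weights, counts):
    """Score of one entity for one field.

    attribute_scores: {attribute: A_i}
    weights:          {attribute: {level: B_ni}}
    counts:           {attribute: {level: C_ni}}
    """
    total = 0.0
    for attr, a_i in attribute_scores.items():
        weighted = sum(weights[attr][level] * c
                       for level, c in counts.get(attr, {}).items())
        total += a_i * min(1.0, weighted)   # cap the inner sum at 1
    return total


# Hypothetical two-attribute setup with 50 points per attribute.
levels = {"high": 1.0, "medium": 0.8, "low": 0.4}
a = {"patent": 50.0, "product": 50.0}
b = {"patent": dict(levels), "product": dict(levels)}
c = {"patent": {"high": 1, "medium": 1}, "product": {"low": 1}}
example_score = field_score(a, b, c)
```

Here the patent attribute would sum to 1.8 and is capped at 1, so it contributes its full 50 points, while the product attribute contributes 0.4 of its 50 points.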
Preferably, the specific method of step S4 is as follows:
S41: comparing the score of each entity for each field with the classification threshold, and if the score of the entity for a field is higher than the classification threshold, judging that the entity belongs to that field;
S42: verifying the judgment results with expert knowledge, and taking the entities confirmed as correct in each field, according to the results that pass verification, as training data.
Preferably, the specific method of step S5 is as follows:
Based on the training data obtained in S4, the optimal parameters are determined by grid search. The parameters of the grid search include the attribute score A_i, the weight coefficient B_ni of the semantic association degree, and the classification threshold. The Jaccard coefficient is chosen as the evaluation index for the optimal parameters and is calculated as follows:

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$

wherein x is the set of field labels predicted for an entity, y is the set of true field labels of the entity, |x ∩ y| is the number of labels in the intersection of the predicted and true labels, and |x ∪ y| is the number of labels in their union. Finally, the parameters that maximize the average Jaccard coefficient over all samples are selected as the optimal parameters of the grid search.
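The evaluation index can be sketched in Python as below. The treatment of two empty label sets is not specified in the source; returning 1.0 in that case is an assumption of this sketch, and the function names are illustrative.

```python
def jaccard(predicted, actual):
    """Jaccard coefficient between predicted and true field label sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    if not union:
        # Assumption: an entity with no true and no predicted field
        # counts as a perfect match.
        return 1.0
    return len(predicted & actual) / len(union)


def mean_jaccard(pairs):
    """Average Jaccard coefficient over (predicted, actual) pairs."""
    return sum(jaccard(p, a) for p, a in pairs) / len(pairs)
```

For instance, predicting {"cv", "nlp"} for an entity whose only true field is {"cv"} gives an intersection of one label and a union of two.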
Preferably, the training samples are expanded through multiple rounds of semantic vocabulary library expansion and expert knowledge verification, and the grid search of step S5 is repeated after each expansion to determine new optimal parameters.
Preferably, the specific method of step S6 is as follows:
S61: following the method of step S3, acquiring the multi-attribute texts of the unknown entity to be classified, matching each attribute text of the unknown entity with the semantic vocabularies of the different fields obtained in S1, and calculating the score of the unknown entity in each field according to the matching results;
S62: comparing the score of the unknown entity for each field with the classification threshold in the optimal parameters, and if the score of the entity for a field is higher than that threshold, judging that the entity belongs to that field.
Preferably, when the multi-attribute texts of an entity are acquired, if several texts exist under the same attribute, they are spliced together to form the attribute text.
Compared with the prior art, the technical scheme above obtains a semantic library by crowdsourcing, grades and quantizes the semantics, computes the score of an entity in a given field according to whether the attribute texts of the entity contain semantic words of that field, and finally applies a threshold to determine the classification result. To classify entities with this method, only a semantic vocabulary library and a database of the parameters need to be maintained; the attribute texts of the entity to be classified are passed into the system to obtain the classification result.
Enterprise entities in the database were classified by this method, and recall and accuracy were computed by random sampling; after parameter adjustment, the recall rate exceeded 80% and the accuracy exceeded 90%. The invention is applied to the classification of enterprise entities and expert entities in the artificial intelligence and geographic information industrial chains, where it achieves good results.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description are briefly introduced below. The drawings described below are only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating an entity multi-domain classification algorithm according to an embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The main innovations of the method are as follows: classification by direct hard keyword matching is softened through probability-style counting; crowdsourcing improves the efficiency of accumulating semantic vocabulary; experts check the classification results to produce training data; and grid search from machine learning optimizes the parameters, so that the advantages of man-machine cooperation are fully used to improve the classification effect. The method makes full use of accumulated knowledge and reduces the dependence on labeled data.
A specific implementation of the small sample entity multi-field classification method based on man-machine cooperation is detailed below. It comprises the following steps:
S1: Semantic vocabularies related to the entities are obtained by crowdsourcing. Each semantic word returned by crowdsourcing carries three dimensions: the field to which it belongs, the attribute to which it belongs, and its degree of semantic association with that field.
In this implementation, the specific method of step S1 is as follows:
S11: On a crowdsourcing platform, the semantic vocabularies in the multi-attribute texts of an entity (which contain several attribute texts) are obtained by crowdsourcing. Contributors either mark semantic words in each attribute text of the entity or directly provide semantic words and mark where they come from. Each crowdsourced result comprises the semantic word together with three dimensions: the field to which it belongs, the attribute to which it belongs, and its degree of semantic association with that field. A semantic word may belong to one or more attribute dimensions. For example, for the semantic word "vision algorithm" found in a patent text, the crowdsourced result may label the field as "computer vision", the attribute as "patent", and the semantic association degree as "high", and the result is then returned for subsequent verification. The crowdsourcing platform may be an open source tool or an independently developed tool for the specific scenario. When a crowdsourcing task is issued, a fixed set of fields, attribute dimensions, and semantic association degrees can be preset so that the returned results meet the requirements.
S12: The crowdsourced results are verified and written into a database after verification. All semantic words in the database that belong to the j-th field form a dictionary D_j, j = 1, 2, …, M, where M is the total number of field categories for the entities.
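The record structure and the per-field dictionaries D_j described in S11 and S12 can be sketched as follows. The class `SemanticWord`, the function `build_dictionaries`, and the sample records are illustrative assumptions; only the first record mirrors the "vision algorithm" example in the text, and the association degrees of the other two are invented for the sketch.

```python
from collections import defaultdict
from typing import NamedTuple


class SemanticWord(NamedTuple):
    word: str        # the semantic word itself
    field: str       # field to which the word belongs
    attribute: str   # attribute text in which it applies
    degree: str      # semantic association degree: high / medium / low


def build_dictionaries(records):
    """Group verified crowdsourced records by field: {field: [records]}."""
    dictionaries = defaultdict(list)
    for record in records:
        dictionaries[record.field].append(record)
    return dict(dictionaries)


records = [
    SemanticWord("vision algorithm", "computer vision", "patent", "high"),
    SemanticWord("face recognition", "computer vision", "product", "high"),
    SemanticWord("CV algorithm engineer", "computer vision",
                 "recruitment", "medium"),
]
D = build_dictionaries(records)
```

A word that belongs to several attribute dimensions would simply appear as several records, one per attribute.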
S2: The parameters required for entity field classification are initialized. The initialized parameters comprise the attribute score A_i, the weight coefficient B_ni of the semantic association degree, and a classification threshold.
In this implementation, the specific method of step S2 is as follows:
S21: The total score of each field is initialized to 100 and divided evenly among the attribute dimensions, so that the attribute score of the i-th attribute is A_i = 100/I, where I is the number of attributes.
In the invention, the specific attributes differ from one kind of entity to another. For example, an enterprise entity may have attributes such as enterprise profile, enterprise name, patents, software copyrights, and recruitment postings; an expert entity may have attributes such as papers, patents, personal profile, research fields, and published works.
S22: The weight coefficient of each semantic association degree under each attribute is initialized; the higher the degree of association between a semantic word and its field, the larger the weight coefficient. The number of association levels can be adjusted as needed, and 2 to 5 levels are generally suitable. For example, in this implementation the degree of association is divided into three levels: high, medium, and low. When the degree of association is high, the weight coefficient B_1i = 1.0; when it is medium, B_2i = 0.8; when it is low, B_3i = 0.4.
S23: The classification threshold is initialized to be equal to A_i.
S3: The multi-attribute texts of the entities are acquired, each attribute text of an entity is matched with the semantic vocabularies of the different fields obtained in step S1, and the score of each entity in each field is calculated according to the matching results.
In this implementation, the specific method of step S3 is as follows:
For each field in turn, based on the semantic vocabulary dictionary D_j obtained in S1 for that field, the score of each entity in the j-th field is calculated (j takes the values 1, 2, …, M in sequence), as follows:
S31: First, the multi-attribute texts of the entity are acquired; the attribute texts differ according to the kind of entity. For example, when the entity to be classified is an enterprise entity, its attribute texts may include the enterprise introduction, enterprise name, patents, software copyrights, and recruitment postings; when the entity to be classified is an expert entity, its attribute texts may include papers, patents, personal profile, research fields, and published works. If several texts exist under the same attribute, they are spliced together to form the attribute text. The attribute texts may be crawled from the web or obtained in other ways.
Then each attribute text is matched against each semantic word in the dictionary D_j. Regular-expression matching is used to determine whether the attribute text contains the semantic word, that is, to output the number of occurrences of each semantic word of D_j in the attribute text. Within one attribute text, a semantic word that appears several times is counted only once.
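The matching rule just described, where a word is counted at most once per attribute text, can be sketched with Python's `re` module. Treating each semantic word as a literal pattern and ignoring case are assumptions of this sketch; the source only states that regular matching is used.

```python
import re


def match_words(attribute_text, words):
    """Return {word: 0 or 1}; 1 means the word occurs at least once."""
    hits = {}
    for word in words:
        # re.escape treats the semantic word as a literal string.
        pattern = re.compile(re.escape(word), flags=re.IGNORECASE)
        hits[word] = 1 if pattern.search(attribute_text) else 0
    return hits


text = ("A vision algorithm for tracking; "
        "the vision algorithm runs on edge devices.")
hits = match_words(text, ["vision algorithm", "face recognition"])
```

Although "vision algorithm" appears twice in the sample text, `search` stops at the first hit, so the word is recorded once.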
From the matching results, the number of matched words at each semantic association degree under each attribute is counted and denoted C_ni, where the subscript i denotes the i-th attribute and n denotes the n-th semantic association degree, i = 1, 2, …, I; n = 1, 2, …, N. N is the total number of association degree levels between semantic words and a field and is generally 2 to 5. In this implementation the association degree has three levels (high, medium, and low), so N = 3.
S32: In the matching results obtained in S31, according to the semantic association degree of each semantic word in the dictionary D_j, the total number of occurrences of all semantic words at each semantic association degree is counted for each attribute text of the entity.
S33: According to the statistics obtained in S32, the score of the entity for the j-th field is calculated by the following formula:

$$\mathrm{Score}_j = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni}\, C_{ni}$$

wherein A_i is the attribute score of the i-th attribute, B_ni is the weight of the n-th semantic association degree for the i-th attribute, and C_ni is the total number of occurrences, in the i-th attribute text of the entity, of all semantic words at the n-th semantic association degree. If the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1, which ensures that the accumulated score over all attribute dimensions has the same upper limit.
Note that when the score of an entity for the j-th field is calculated, the count C_ni is taken over the semantic words of the dictionary D_j corresponding to the j-th field. That is, in the present invention, the score of an entity in a given field is computed according to whether the attribute texts of the entity contain semantic words of that field.
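The counting of C_ni over one dictionary D_j, as described in S31 and S32, can be sketched as follows. Two simplifications are assumptions of this sketch: each word is matched only against the attribute it was labeled with, and plain case-insensitive substring search stands in for regular-expression matching. The sample words and texts are invented.

```python
from collections import Counter


def count_by_degree(attribute_texts, dictionary):
    """Count matched words per attribute and association degree.

    attribute_texts: {attribute: text}
    dictionary:      [(word, attribute, degree), ...] for one field D_j
    Returns {attribute: Counter({degree: C_ni})}.
    """
    counts = {attr: Counter() for attr in attribute_texts}
    for word, attr, degree in dictionary:
        text = attribute_texts.get(attr, "")
        if word.lower() in text.lower():   # counted once per attribute text
            counts[attr][degree] += 1
    return counts


texts = {
    "patent": "A vision algorithm for face recognition",
    "recruitment": "Hiring a CV algorithm engineer",
}
D_cv = [
    ("vision algorithm", "patent", "high"),
    ("face recognition", "patent", "high"),
    ("CV algorithm engineer", "recruitment", "medium"),
    ("camera", "patent", "low"),
]
counts = count_by_degree(texts, D_cv)
```

The resulting counters feed directly into the score formula as the C_ni terms for this field.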
S4: The scores obtained in step S3 are compared with the classification threshold to obtain a classification result, and the classification result is verified to generate training data.
In this implementation, the specific method of step S4 is as follows:
S41: The score of each entity for each field is compared with the classification threshold; if the score of the entity for a field is higher than the classification threshold, the entity is judged to belong to that field.
S42: The judgment results are verified with expert knowledge, data that fail verification are discarded, and the entities confirmed as correct in each field, according to the results that pass verification, are used as small sample training data.
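Steps S41 and S42 can be sketched as below. Expert review is a human step; here it is represented by a set of approved entity identifiers, which is purely a stand-in for illustration, as are the function names and sample scores.

```python
def classify(field_scores, threshold):
    """Return the fields whose score is strictly above the threshold."""
    return sorted(f for f, s in field_scores.items() if s > threshold)


def build_training_data(predictions, expert_approved):
    """Keep only the predictions that passed expert verification."""
    return {entity: fields for entity, fields in predictions.items()
            if entity in expert_approved}


# Hypothetical scores for two entities with a threshold of 20.
predictions = {
    "company_a": classify(
        {"computer vision": 28.0, "geographic information": 8.0}, 20.0),
    "company_b": classify(
        {"computer vision": 20.0, "geographic information": 36.0}, 20.0),
}
training = build_training_data(predictions, expert_approved={"company_a"})
```

Because the comparison is strict, a score exactly equal to the threshold does not assign the field, matching the wording "higher than the classification threshold".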
S5: Based on the training data from S42, the optimal parameters are determined by grid search.
In this implementation, the specific method of step S5 is as follows:
Based on the training data obtained in S4, the optimal parameters are determined by grid search. The parameters of the grid search include the attribute score A_i, the weight coefficient B_ni of the semantic association degree, and the classification threshold. The Jaccard coefficient is chosen as the evaluation index for the optimal parameters and is calculated as follows:

$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$

wherein x is the set of field labels predicted for an entity, y is the set of true field labels of the entity, |x ∩ y| is the number of labels in the intersection of the predicted and true labels, and |x ∪ y| is the number of labels in their union. The parameter ranges are generally set as follows: the attribute score A_i ranges from 0 to 100, with the attribute scores summing to 100, and is adjusted in steps of 5 during grid search; the weight coefficient B_ni of the semantic association degree ranges from 0 to 1.5 and is adjusted in steps of 0.1; the classification threshold ranges from 100/I to 100 (I is the number of attributes) and is adjusted in steps of 5. Finally, the grid search selects the parameters that maximize the average Jaccard coefficient over all samples as the optimal parameters.
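A reduced grid search over these parameters can be sketched as follows. To keep the example small, the attribute scores A_i are held fixed, one set of weights is shared by all attributes, and the grids are far coarser than the step sizes given above; all of these are simplifications of this sketch, and the sample data are invented.

```python
from itertools import product


def score(a, b, counts):
    """Capped weighted score; counts is {attribute: {level: C_ni}}."""
    return sum(
        a[i] * min(1.0, sum(b[n] * c for n, c in counts.get(i, {}).items()))
        for i in a
    )


def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def grid_search(samples, a, weight_grid, threshold_grid):
    """samples: list of ({field: counts}, true_field_set)."""
    best = (-1.0, None)
    for hi, mid, lo, t in product(weight_grid, weight_grid,
                                  weight_grid, threshold_grid):
        b = {"high": hi, "medium": mid, "low": lo}
        total = 0.0
        for per_field_counts, truth in samples:
            pred = {f for f, c in per_field_counts.items()
                    if score(a, b, c) > t}
            total += jaccard(pred, truth)
        mean = total / len(samples)
        if mean > best[0]:          # keep the first best combination
            best = (mean, {"weights": b, "threshold": t})
    return best


a = {"patent": 50.0, "recruitment": 50.0}
samples = [
    ({"cv": {"patent": {"high": 1}}, "gis": {"patent": {"low": 1}}},
     {"cv"}),
    ({"cv": {"recruitment": {"low": 1}},
      "gis": {"patent": {"high": 1}, "recruitment": {"medium": 1}}},
     {"gis"}),
]
best_score, best_params = grid_search(
    samples, a, weight_grid=[0.0, 0.5, 1.0], threshold_grid=[25, 50, 75])
```

With the full ranges described in the text the number of combinations grows quickly, so a practical implementation would likely need to constrain or stage the search; the source does not say how this is handled.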
In practical use, the training samples should be expanded through multiple rounds of semantic vocabulary library expansion and expert knowledge verification, and the grid search of step S5 should be repeated after each expansion to determine new optimal parameters.
S6: The field of an unknown entity to be classified is predicted based on the determined optimal parameters.
In this implementation, the specific method of step S6 is as follows:
S61: Following the method of step S3, the multi-attribute texts of the unknown entity to be classified are acquired, each attribute text of the unknown entity is matched with the semantic vocabularies of the different fields obtained in step S1, and the score of the unknown entity in each field is calculated according to the matching results; see steps S31 to S33 for details.
S62: The score of the unknown entity for each field is then compared with the latest classification threshold in the optimal parameters; if the score of the entity for a field is higher than that threshold, the entity is judged to belong to that field. This yields the predicted field of the unknown entity, which may be one field, several fields, or no field at all.
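An end-to-end prediction along the lines of S61 and S62 can be sketched as follows. The function `predict`, the shared weight table, substring matching, and the sample dictionaries and texts are all assumptions made for the example.

```python
def predict(attribute_texts, dictionaries, attribute_scores,
            weights, threshold):
    """Return the list of fields predicted for one entity.

    dictionaries: {field: [(word, attribute, degree), ...]}
    weights:      {degree: weight}, shared by all attributes here
    """
    result = []
    for field, words in dictionaries.items():
        total = 0.0
        for attr, a_i in attribute_scores.items():
            text = attribute_texts.get(attr, "").lower()
            weighted = sum(
                weights[degree]
                for word, word_attr, degree in words
                if word_attr == attr and word.lower() in text
            )
            total += a_i * min(1.0, weighted)
        if total > threshold:
            result.append(field)
    return result


dictionaries = {
    "computer vision": [
        ("vision algorithm", "patent", "high"),
        ("CV algorithm engineer", "recruitment", "medium"),
    ],
    "geographic information": [("remote sensing", "patent", "high")],
}
params = dict(
    attribute_scores={"patent": 50.0, "recruitment": 50.0},
    weights={"high": 1.0, "medium": 0.8, "low": 0.4},
    threshold=50.0,
)
entity_1 = predict(
    {"patent": "A vision algorithm for drones",
     "recruitment": "Hiring a CV algorithm engineer"},
    dictionaries, **params)
entity_2 = predict({"patent": "A battery charging method"},
                   dictionaries, **params)
```

The second entity matches no dictionary word, so its prediction is an empty list, which corresponds to the "no field" outcome mentioned above.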
A specific implementation of the method is shown below by way of example. The steps in this embodiment are as described above and are not repeated; the description focuses on the parameter settings and technical effects.
Examples
Referring to FIG. 1, the entity multi-field classification method provided in this embodiment comprises steps S1 to S6, implemented as follows:
Step 1: Obtain semantic vocabulary by crowdsourcing.
In this embodiment, semantic words belonging to different fields are obtained from texts of different attributes through a crowdsourcing platform, and each word is graded as having high, medium, or low association. The verified semantic words are written into a database.
Step 2: Initialize the parameters of the calculation formula.
In this embodiment, the attribute dimensions are illustrated with an enterprise entity: the enterprise's name, introduction, patents, software copyrights, and recruitment data are collected from the web, giving 5 dimensions in total. The total score is set to 100 points, so each attribute is assigned 20 points, and the weight coefficients of all attribute dimensions are initialized to 1.0 for high, 0.8 for medium, and 0.4 for low.
Step 3: Acquire the multi-attribute texts of the entity, match them with the semantic words, and calculate the field category scores by the formula.
In this embodiment, the attribute texts of the entity are spliced first: patents are spliced from patent names and patent abstracts, software copyrights are spliced from the software copyright entries, and recruitment data are spliced from position names and position details. After each attribute text is matched with the corresponding semantic vocabulary, the number of matched words at each of the three levels (high, medium, and low) is counted for each attribute. The matching results are stored in a database to facilitate querying, statistics, and result analysis.
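The splicing of several texts under one attribute can be sketched as below. The source does not say which separator is used; joining with a single space is an assumption, and the function name and sample fragments are illustrative.

```python
def splice_attribute_texts(raw):
    """Join the text fragments of each attribute into one attribute text.

    raw: {attribute: [fragment, ...]}; empty fragments are skipped.
    """
    return {
        attr: " ".join(p.strip() for p in parts if p and p.strip())
        for attr, parts in raw.items()
    }


raw = {
    "patent": ["Vision algorithm for drones",
               "An abstract describing the method."],
    "recruitment": ["CV algorithm engineer", " Develops detection models. "],
}
attribute_texts = splice_attribute_texts(raw)
```

For Chinese text an empty separator may be more natural, since a space could split a word that spans two fragments.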
The calculation formula in this embodiment is:

$$\mathrm{Score}_j = \sum_{i=1}^{I} A_i \sum_{n=1}^{N} B_{ni}\, C_{ni}$$

wherein A_i is the attribute score of the i-th attribute, B_ni is the weight of the n-th semantic association degree for the i-th attribute, and C_ni is the total number of occurrences, in the i-th attribute text of the entity, of all semantic words at the n-th semantic association degree. In particular, if the value of $\sum_{n=1}^{N} B_{ni} C_{ni}$ is greater than 1, it is set equal to 1, which ensures that the accumulated score over all attribute dimensions has the same upper limit.
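As a worked illustration of this formula under the initial parameters of the embodiment (five attributes at 20 points each; weights 1.0, 0.8, and 0.4), suppose, purely hypothetically, that an entity's patent text matches one high-association word and one medium-association word of a field, that its recruitment text matches one low-association word, and that the other three attributes match nothing:

```latex
\begin{aligned}
\text{patent: } & 1.0 \cdot 1 + 0.8 \cdot 1 = 1.8 > 1
  \;\Rightarrow\; \text{set to } 1, \quad 20 \cdot 1 = 20 \\
\text{recruitment: } & 0.4 \cdot 1 = 0.4, \quad 20 \cdot 0.4 = 8 \\
\text{score: } & 20 + 8 + 0 + 0 + 0 = 28
\end{aligned}
```

The hypothetical score of 28 is higher than the initial threshold of 20, so under these assumed matches the entity would be assigned to that field.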
And 4, step 4: and (5) obtaining a classification result by threshold judgment, and generating training data by an expert knowledge verification result.
In this embodiment, according to the initial threshold of 20 points, for the classified area with the area score greater than 20 points, the classified area of the statistical entity is checked by the expert. And arranging the verified data into training data for subsequent grid search optimization parameters.
Step 5: Use the training data in a grid search for the optimal parameters.
The grid-search parameters in this embodiment are the attribute scores A_i, the semantic association weight coefficients B_ni, and the classification threshold. The evaluation metric is the Jaccard coefficient. The parameter ranges are as follows: with a total score of 100, each attribute score ranges from 0 to 100 in steps of 5; each semantic association weight coefficient ranges from 0 to 1.5 in steps of 0.1; and the classification threshold ranges from 100/N to 100 (N being the number of attributes) in steps of 5. The grid search finally selects the parameters that maximize the average Jaccard coefficient over all samples as the optimization result.
In this embodiment, the training samples are expanded through multiple rounds of semantic-library expansion and expert verification, the grid-search optimization of step 5 is repeated to determine the final parameters, and the parameters from each adjustment are stored in the database together with their version.
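The grid search can be sketched as below for a toy case with N = 2 attributes. To keep the example small, the attribute scores are held fixed and only the low-level weight and the threshold are searched. That simplification, the capped scoring rule, the treatment of two empty label sets as a perfect match, and all names and sample data are assumptions for illustration.

```python
from itertools import product


def jaccard(pred, true):
    """Jaccard coefficient of two label sets."""
    pred, true = set(pred), set(true)
    if not pred and not true:
        return 1.0  # assumption: two empty label sets agree perfectly
    return len(pred & true) / len(pred | true)


def score(a, weights, counts):
    """Assumed scoring rule: sum_i A_i * min(1, sum_n B_n * C_ni)."""
    return sum(
        a_i * min(1.0, sum(w * c for w, c in zip(weights, counts[i])))
        for i, a_i in enumerate(a)
    )


def predict(sample, a, weights, threshold):
    return {d for d, counts in sample["counts"].items() if score(a, weights, counts) > threshold}


def grid_search(samples, a, weight_grid, threshold_grid):
    """Return (best mean Jaccard, weights, threshold); the first maximum wins ties."""
    best = None
    for weights, threshold in product(weight_grid, threshold_grid):
        mean_j = sum(
            jaccard(predict(s, a, weights, threshold), s["labels"]) for s in samples
        ) / len(samples)
        if best is None or mean_j > best[0]:
            best = (mean_j, weights, threshold)
    return best


# Toy samples: per domain, one (high, medium, low) count tuple per attribute.
SAMPLES = [
    {"counts": {"cv": [(1, 0, 0), (0, 0, 1)], "speech": [(0, 0, 1), (0, 0, 0)]}, "labels": {"cv"}},
    {"counts": {"cv": [(0, 0, 1), (0, 0, 0)], "speech": [(1, 0, 0), (1, 0, 0)]}, "labels": {"speech"}},
]
A = [50.0, 50.0]                                          # fixed attribute scores, N = 2
WEIGHT_GRID = [(1.0, 0.8, i / 10) for i in range(0, 16)]  # low weight 0.0 to 1.5, step 0.1
THRESHOLD_GRID = list(range(50, 101, 5))                  # 100/N to 100, step 5

best = grid_search(SAMPLES, A, WEIGHT_GRID, THRESHOLD_GRID)
```

A full search over every A_i and B_ni at the stated step sizes grows combinatorially, so a practical implementation would likely restrict or stage the grid.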
Step 6: Predict unknown entities using the finalized parameters.
In this embodiment, the final parameters are read from the database by version number, all semantic words are loaded, the attribute texts of the entity are taken as input, and the domains to which the entity belongs are output. The output may be a single domain, multiple domains, or empty.
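Reading parameters by version and producing a single, multiple, or empty result could be sketched as follows. The table layout, the JSON payload, and the function names are hypothetical; an in-memory SQLite database stands in for whatever database the embodiment uses.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE params (version INTEGER PRIMARY KEY, payload TEXT NOT NULL)")


def save_params(conn, version, params):
    """Store one tuned parameter set under its version number."""
    conn.execute(
        "INSERT INTO params (version, payload) VALUES (?, ?)",
        (version, json.dumps(params)),
    )


def load_params(conn, version):
    """Read the parameter set stored under the given version number."""
    row = conn.execute(
        "SELECT payload FROM params WHERE version = ?", (version,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no parameters stored for version {version}")
    return json.loads(row[0])


def predict_domains(domain_scores, params):
    """Domains whose score exceeds the stored threshold; may be empty."""
    return sorted(d for d, s in domain_scores.items() if s > params["threshold"])


save_params(conn, 1, {"threshold": 20.0})
save_params(conn, 2, {"threshold": 25.0})

scores = {"computer vision": 28.0, "speech": 22.0}
multi = predict_domains(scores, load_params(conn, 1))
single = predict_domains(scores, load_params(conn, 2))
empty = predict_domains({"robotics": 4.0}, load_params(conn, 2))
```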
It should be noted that if attributes of an entity are missing, the entity with the missing data should be handled separately.
To keep parameter tuning reliable, the training data should be as accurate as possible, for example by selecting well-known entities in each field. For instance, SenseTime, a well-known company in the computer vision field of the artificial intelligence industry, can serve as training data for enterprise entity classification.
Enterprise entities in the database were classified with this method, and recall and accuracy were estimated by random sampling; after parameter tuning, the recall exceeded 80% and the accuracy exceeded 90%.
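The sampling-based evaluation might be sketched as below. The text does not define how the two rates are computed for multi-label output, so this example assumes micro-averaged precision and recall over (predicted, true) label sets. The data, the sample size, and the function names are hypothetical.

```python
import random


def precision_recall(pairs):
    """Micro-averaged precision and recall over (predicted, true) label-set pairs."""
    true_positive = sum(len(p & t) for p, t in pairs)
    n_predicted = sum(len(p) for p, _ in pairs)
    n_true = sum(len(t) for _, t in pairs)
    precision = true_positive / n_predicted if n_predicted else 0.0
    recall = true_positive / n_true if n_true else 0.0
    return precision, recall


def sample_for_review(records, k, seed=0):
    """Draw a reproducible random sample of classified entities for manual checking."""
    return random.Random(seed).sample(records, k)


checked = [
    ({"cv"}, {"cv", "speech"}),
    ({"speech"}, {"speech"}),
    ({"cv", "robotics"}, {"cv"}),
]
precision, recall = precision_recall(checked)
review_batch = sample_for_review(list(range(1000)), k=50)
```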
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A small sample entity multi-field classification method based on man-machine cooperation, characterized by comprising the following steps:
S1: obtaining semantic words related to the entities by crowdsourcing, wherein each semantic word returned by crowdsourcing carries three dimensions: the field it belongs to, the attribute it belongs to, and its semantic association degree with that field;
S2: initializing the parameters required for entity field classification, the initialized parameters comprising the attribute scores A_i, the semantic association weight coefficients B_ni, and a classification threshold;
S3: acquiring the multi-attribute texts of the entities, matching each attribute text of an entity against the semantic words of the different fields obtained in S1, and calculating the score of each entity in each field from the matching results;
S4: comparing the scores obtained in S3 with the classification threshold to obtain classification results, and generating training data after the classification results have been verified;
S5: determining the optimal parameters by grid search based on the training data;
S6: predicting the fields of unknown entities to be classified based on the optimal parameters.
2. The method according to claim 1, wherein step S1 is specifically as follows:
S11: on a crowdsourcing platform, obtaining the semantic words in the multi-attribute texts of the entities by crowdsourcing, wherein the crowdsourcing consists of either marking out semantic words in each attribute text of an entity, or directly providing semantic words and marking their positions; the crowdsourcing result comprises the semantic word together with three dimensions: the field it belongs to, the attribute it belongs to, and its semantic association degree with that field; a semantic word belongs to one or more attribute dimensions;
S12: verifying the crowdsourcing results and writing them into a database once verified; all semantic words in the database belonging to the j-th field form a dictionary D_j, j = 1, 2, …, M, where M is the total number of field categories for the entities.
3. The method according to claim 1, wherein step S2 is specifically as follows:
S21: initializing the total score of each field to 100 and distributing it evenly over the attribute dimensions, so that the attribute score of the i-th attribute is A_i = 100/I, where I is the number of attributes;
S22: initializing the weight coefficient of the association degree of the semantic words under each attribute, wherein the higher the association degree between a semantic word and its field, the higher the weight coefficient;
S23: initializing the classification threshold to be equal to A_i.
4. The method according to claim 3, wherein in step S2 the association degree between a semantic word and its field is divided into three levels: high, medium, and low; when the association degree is high, the weight coefficient B_1i = 1.0; when it is medium, the weight coefficient B_2i = 0.8; when it is low, the weight coefficient B_3i = 0.4.
5. The method according to claim 1, wherein step S3 is specifically as follows:
for each field in turn, based on the semantic dictionary D_j obtained in S1 for that field, calculating the score of each entity in the j-th field, j = 1, 2, …, M, as follows:
S31: acquiring the multi-attribute texts of the entity, matching each attribute text against every semantic word in dictionary D_j, and outputting the number of occurrences of each semantic word of D_j in that attribute text; within one attribute text, a semantic word that appears several times is counted only once;
S32: from the matching results of S31, according to the semantic association degree of each semantic word in dictionary D_j, counting for each attribute text of the entity the total number of occurrences of all semantic words at each semantic association degree;
S33: from the statistics of S32, calculating the score of the entity in the j-th field by the following formula:
Figure FDA0002382909120000021
wherein A_i denotes the attribute score of the i-th attribute, B_ni denotes the weight of the n-th semantic association degree for the i-th attribute, and C_ni denotes the total number of occurrences, in the i-th attribute text of the entity, of all semantic words with the n-th semantic association degree; if the value of
Figure FDA0002382909120000022
is greater than 1, then
Figure FDA0002382909120000023
is set equal to 1, to ensure that the scores finally accumulated over all attribute dimensions are on the same scale.
6. The method according to claim 1, wherein step S4 is specifically as follows:
S41: comparing the score of each entity in each field with the classification threshold, and judging that the entity belongs to a field if its score in that field is higher than the classification threshold;
S42: verifying the judgment results based on expert knowledge, and taking the entities confirmed as correct in each field from the verified results as training data.
7. The method according to claim 1, wherein step S5 is specifically as follows:
determining the optimal parameters by grid search based on the training data obtained in S4, the grid-search parameters comprising the attribute scores A_i, the semantic association weight coefficients B_ni, and the classification threshold; the Jaccard coefficient is used as the evaluation metric for the optimal parameters and is calculated as:
$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
wherein x denotes the predicted field labels of an entity and y denotes its true field labels; the numerator is the number of labels in the intersection x ∩ y of the predicted and true labels, and the denominator is the number of labels in their union x ∪ y; the grid search finally selects the parameters that maximize the average Jaccard coefficient over all samples as the optimal parameters.
8. The method according to claim 1, wherein the training samples are expanded through multiple rounds of semantic vocabulary library expansion and expert-knowledge verification, and the grid search of step S5 is repeated after each expansion to determine new optimal parameters.
9. The method according to claim 1, wherein step S6 is specifically as follows:
S61: following the method of step S3, acquiring the multi-attribute texts of the unknown entity to be classified, matching each of its attribute texts against the semantic words of the different fields obtained in S1, and calculating the score of the unknown entity in each field from the matching results;
S62: comparing the score of the unknown entity in each field with the classification threshold in the optimal parameters, and judging that the entity belongs to a field if its score in that field is higher than that threshold.
10. The method according to claim 1, wherein, when the multi-attribute texts of an entity are acquired, if several texts exist under the same attribute, they are spliced together to form the attribute text.
CN202010088532.0A 2020-02-12 2020-02-12 Small sample entity multi-field classification method based on man-machine cooperation Active CN111274404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010088532.0A CN111274404B (en) 2020-02-12 2020-02-12 Small sample entity multi-field classification method based on man-machine cooperation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010088532.0A CN111274404B (en) 2020-02-12 2020-02-12 Small sample entity multi-field classification method based on man-machine cooperation

Publications (2)

Publication Number Publication Date
CN111274404A true CN111274404A (en) 2020-06-12
CN111274404B CN111274404B (en) 2023-07-14

Family

ID=70997015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010088532.0A Active CN111274404B (en) 2020-02-12 2020-02-12 Small sample entity multi-field classification method based on man-machine cooperation

Country Status (1)

Country Link
CN (1) CN111274404B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10254883A (en) * 1997-03-10 1998-09-25 Mitsubishi Electric Corp Automatic document sorting method
JP2005250841A (en) * 2004-03-04 2005-09-15 Energia Communications Inc Method for matching expert and likely buyer
US20060136467A1 (en) * 2004-12-17 2006-06-22 General Electric Company Domain-specific data entity mapping method and system
US20110184926A1 (en) * 2010-01-26 2011-07-28 National Taiwan University Of Science & Technology Expert list recommendation methods and systems
CN103324692A (en) * 2013-06-04 2013-09-25 北京大学 Classified knowledge acquiring method and device
CN105260482A (en) * 2015-11-16 2016-01-20 金陵科技学院 Network new word discovery device and method based on crowdsourcing technology
CN106339806A (en) * 2016-08-24 2017-01-18 北京创业公社征信服务有限公司 Industry holographic image constructing method and industry holographic image constructing system for enterprise information
CN106682128A (en) * 2016-12-13 2017-05-17 成都数联铭品科技有限公司 Method for automatic establishment of multi-field dictionaries
CN106897371A (en) * 2017-01-18 2017-06-27 南京云思创智信息科技有限公司 Chinese text classification system and method
CN106934020A (en) * 2017-03-10 2017-07-07 东南大学 A kind of entity link method based on multiple domain entity index
CN109101477A (en) * 2018-06-04 2018-12-28 东南大学 A kind of enterprise's domain classification and enterprise's keyword screening technique
CN109783818A (en) * 2019-01-17 2019-05-21 上海三零卫士信息安全有限公司 A kind of enterprises ' industry multi-tag classification method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
耿爽,杨辰,牛奔,蚁文洁,刘雷: "《面向企业信息检索的语义扩展查询方法》" *
陈果,许天祥: "《小规模知识库指导下的细分领域实体关系发现研究》", 《情报学报》, vol. 38, no. 11 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506671A (en) * 2020-03-17 2020-08-07 北京捷通华声科技股份有限公司 Method, device, equipment and storage medium for processing attribute of entity object
CN111506671B (en) * 2020-03-17 2021-02-12 北京捷通华声科技股份有限公司 Method, device, equipment and storage medium for processing attribute of entity object

Also Published As

Publication number Publication date
CN111274404B (en) 2023-07-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant