CN110968665A - Method for recognizing upper and lower level word relation based on gradient enhanced decision tree - Google Patents

Method for recognizing upper and lower level word relation based on gradient enhanced decision tree

Info

Publication number
CN110968665A
CN110968665A (application CN201911086620.0A; granted as CN110968665B)
Authority
CN
China
Prior art keywords
sample
word
path
samples
decision tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911086620.0A
Other languages
Chinese (zh)
Other versions
CN110968665B (en)
Inventor
潘翔
阮义彰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201911086620.0A priority Critical patent/CN110968665B/en
Publication of CN110968665A publication Critical patent/CN110968665A/en
Application granted granted Critical
Publication of CN110968665B publication Critical patent/CN110968665B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for identifying hypernym-hyponym (upper-lower level word) relations based on a gradient enhanced decision tree. To train the classification model, the inputs are entity pairs together with their path information, and the output is 1 (hypernym-hyponym relation) or 0 (no such relation). A high-confidence recommendation set, drawn from positively classified results, is obtained by jointly training two classifiers. By continuously iterating on the high-confidence set, the model adapts quickly to the regularities of unlabeled corpus text. The method is well suited to mining hypernym-hyponym relations in the e-commerce domain.

Description

Method for recognizing upper and lower level word relation based on gradient enhanced decision tree
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method for recognizing hypernym-hyponym (upper and lower level word) relations based on a gradient enhanced decision tree (that is, a gradient boosted decision tree, GBDT).
Background
Automatically mining and verifying hypernym-hyponym relationships between entities is an important task in e-commerce. Such a relationship holds between a generic entity (hypernym) and a specific instance of it (hyponym), for example "appliance" and "refrigerator". In e-commerce, mining these relations helps to better understand user queries and to recommend commodities.
However, in e-commerce this task faces many challenges. First, web text corpora contain a great deal of noise and are updated frequently. The noise makes it difficult for general-purpose methods to extract valid information from e-commerce text, and the high update frequency wastes significant labeling effort on new and proprietary words. Second, there are currently about 10 billion known commodity entities (including a large number of isomorphs). Assume each node in the category tree has at least one root node (y) and zero or more associated leaf nodes (x); the category tree over commodity entities is then enormous, so a mining method must achieve good recall while maintaining precision. To address the particularities of e-commerce corpus text, this invention draws on semi-supervised learning and proposes a gradient enhanced decision tree method based on joint training, which can automatically mine hypernym-hyponym entity relations from domain-specific and noisy text.

Existing entity-relation mining methods can be classified as supervised, semi-supervised, or unsupervised. In binary classification, jointly training multiple classifiers achieves higher accuracy than training each alone, provided the related tasks share a similar representation. Bootstrapping trains a classifier on a small number of labeled samples and then iteratively augments the training set with samples the current model trusts highly; it is good at absorbing new or large unlabeled e-commerce corpora from a small set of seed samples, but suffers from "semantic drift" after many iterations. To reduce the errors that semi-supervised iteration continually introduces, one approach cross-trains on different kinds of samples to prevent precision from degrading; another reduces the bias from labeling errors through conditionally independent splits of the feature space. Other, non-bootstrapping techniques obtain predictions from multiple extractors whose errors are largely independent, and combine those predictions to improve extraction accuracy. Beyond these, two complementary families of methods handle hypernym-hyponym relation mining: distributional methods and path-based methods. Distributional methods excel at discovering entity relations, while some path-based methods encode paths with a recurrent neural network and achieve comparable results.
Mining hypernym-hyponym relations in complex text mainly serves the following purposes:
First, when a user searches for commodities, the search content can be expanded through hypernyms and hyponyms, reducing repeated searches and improving the user experience.
Second, increasing commodity-information recall: without changing the dimensionality, the precision of recalled information improves and its volume grows richer.
Third, improving how many times a scene card can be reused within an application scenario.
Fourth, layering the related words of the commodity domain into a system of categories, attributes, attribute values, and an auxiliary classification tree.
Fifth, locating trending new words.
Disclosure of Invention
The aim of the invention is to overcome the above drawbacks by providing a method for identifying hypernym-hyponym relationships based on a gradient enhanced decision tree. To train the classification model, the inputs are entity pairs together with their path information, and the output is 1 (hypernym-hyponym relation) or 0 (no such relation). A high-confidence recommendation set, drawn from positively classified results, is obtained by jointly training two classifiers. By continuously iterating on the high-confidence set, the model adapts quickly to the regularities of unlabeled corpus text. The method is well suited to mining hypernym-hyponym relations in the e-commerce domain.
The invention achieves this aim through the following technical scheme: a method for recognizing hypernym-hyponym relations based on a gradient enhanced decision tree, comprising the following steps:
(1) constructing a randomly misplaced sample training set;
(2) constructing a path-based sample training set;
(3) training a semi-supervised joint gradient enhanced decision tree model on the constructed randomly misplaced and path-based sample training sets, and using the trained model to recognize hypernym-hyponym relations.
Preferably, the randomly misplaced sample training set is constructed as follows:
(1.1) the corpus text is segmented with the Alibaba Word Segmenter lexical analysis system; hypernym-hyponym word pairs extracted from an existing lexicon are matched against the text, and positive samples are constructed from each pair together with the text between its words;
(1.2) the hypernym and hyponym of each successfully matched pair are misplaced to form negative word pairs, and the misplaced pairs are matched against the text to construct randomly misplaced negative samples;
(1.3) the positive and negative samples obtained above are combined into the randomly misplaced sample training set.
Preferably, the path-based sample training set is constructed as follows:
(2.1) the corpus text is fragmented, recorded as S_split = Split({S_1, S_2, S_3, …, S_n});
(2.2) the misplaced word pairs from the randomly misplaced samples are matched against the corpus text to obtain the set of sentences containing a misplaced hypernym-hyponym pair, S_<x,y> = {S_<x1,y1>, S_<x2,y2>, S_<x3,y3>, …, S_<xn,yn>};
(2.3) the path between each misplaced word pair is extracted and recorded as P = {P_1, P_2, P_3, …, P_n};
(2.4) the extracted paths are matched against the corpus fragments {S_1, S_2, S_3, …, S_n}; after a successful match, the fragment's original sentence is looked up, and the first word before and after the path P′ that does not belong to the original misplaced pair is taken as a path-based negative word pair; the path-based sample training set is obtained by combining these with the positive samples.
Preferably, the corpus fragmentation uses an N-gram algorithm to enumerate the sentence fragments formed by all runs of consecutive tokens; each token counts as length 1, and fragments of length at most 5 are taken.
Preferably, the semi-supervised joint gradient enhanced decision tree model is an additive model, the learning algorithm is the forward stagewise algorithm, and the base function is a CART tree; the loss function is the squared-error loss, i.e.:
L(y, f(x)) = (1/2) · (y − f(x))²
The negative gradient is then:
−∂L(y, f(x)) / ∂f(x) = y − f(x)
where y − f(x) is the residual; the output is the classification tree f(x).
Preferably, the semi-supervised joint gradient enhanced decision tree training method comprises the following steps:
Input: a text corpus T, pre-trained word embeddings, and a maximum number of iterations I;
(i) preprocess the data in T and extract the two kinds of training samples X_p and X_d, where X_p is the path-based sample training set and X_d is the randomly misplaced sample training set;
(ii) convert each training sample into a vector representation using the word embeddings W;
(iii) set up the labeled sets X′_p and X′_d, where X′_p denotes path samples and X′_d denotes randomly misplaced samples;
(iv) train two classifiers f_1 and f_2 on X_p ∪ X′_p and X_d ∪ X′_d respectively;
(v) predict the unlabeled samples and select high-confidence positive samples to expand X′_p and X′_d as new training samples;
(vi) repeat steps (iv) and (v) until no new labeled samples are added to X′_p and X′_d;
Output: the two classifiers and the predicted labels for the test samples.
The beneficial effects of the invention are: the method can construct samples from complex text and predict labels for unlabeled entities; it analyzes the characteristics of e-commerce-domain text, summarizes hypernym-hyponym pairs of the e-commerce domain by means of substrings, patterns, and rule learning, and thus mines hypernym-hyponym relations in the e-commerce domain more effectively.
Drawings
FIG. 1 is a schematic flow diagram of the method of the invention;
FIG. 2 is a schematic diagram of the randomly misplaced sample training set according to an embodiment of the invention;
FIG. 3 is a schematic diagram of the construction of the path-based sample training set according to an embodiment of the invention;
FIG. 4 is a schematic diagram of the training process of the gradient enhanced decision tree model according to an embodiment of the invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto:
Example: as shown in FIG. 1, a method for recognizing hypernym-hyponym relations based on a gradient enhanced decision tree includes the following steps:
(1) The randomly misplaced sample training set is constructed as follows (see FIG. 2):
The corpus text is first tokenized by the AliWS (Alibaba Word Segmenter) lexical analysis system. Hypernym-hyponym word pairs extracted from an existing lexicon are matched against the text, and positive samples are constructed from each pair together with the text between its words. The hypernym and hyponym of each successfully matched pair are then misplaced to form negative word pairs, and the misplaced pairs are matched against the text to construct randomly misplaced negative samples, for example:
(1) < apple is a fruit >
(2) < fruits such as apple >
(3) < dog is an animal >
After misplacement, <apple, fruit> and <dog, animal> become <apple, animal> and <dog, fruit>. Sentence paths matching these pairs are then sought in the corpus. After screening, the following are obtained:
(1) < apple for tropical animals >
(2) < dogs do not eat fruit >
Each misplaced word pair, together with its path information, is taken as a whole to form a negative sample (see the sketch below). The positive and negative samples are combined into the randomly misplaced sample training set.
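The following minimal sketch illustrates this construction, assuming whitespace-tokenized English text; the function name misplaced_negatives, the substring matching, and the tiny corpus are illustrative stand-ins for the AliWS-based pipeline, not the patent's implementation.

```python
# Hedged sketch: derive misplaced negative pairs from matched positive
# pairs, then keep only pairs that co-occur in some corpus sentence.
import random

def misplaced_negatives(positive_pairs, sentences):
    hyponyms = [x for x, _ in positive_pairs]
    hypernyms = [y for _, y in positive_pairs]
    shuffled = hypernyms[:]
    random.shuffle(shuffled)                 # randomly misplace the hypernyms
    negatives = []
    for hypo, wrong_hyper in zip(hyponyms, shuffled):
        if (hypo, wrong_hyper) in positive_pairs:
            continue                         # skip accidentally correct pairs
        for s in sentences:                  # find a sentence containing both words
            if hypo in s and wrong_hyper in s:
                negatives.append((hypo, wrong_hyper, s))
                break
    return negatives

pairs = [("apple", "fruit"), ("dog", "animal")]
corpus = ["apple for tropical animals", "dogs do not eat fruit"]
print(misplaced_negatives(pairs, corpus))    # e.g. [('apple', 'animal', ...)]
```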
The path between the two words of a pair is kept only if it satisfies the following conditions (see the sketch after this list):
1. Its length does not exceed 5 tokens; for example, "is a" has length 3 under the original tokenization.
2. Single-character tokens (byte length less than 2) are excluded; in non-hypernym-hyponym pairs these include words such as "one" and "no".
3. If no corpus sentence contains both words of a pair at the same time, no training sample is constructed from that pair.
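A hedged sketch of these three screening rules follows; the whitespace tokenization and the exact treatment of single-character tokens are assumptions about the original pipeline.

```python
# Sketch of the path-screening rules; returns True if a candidate path
# between word pair (x, y) should be kept as training material.
def keep_path(path_tokens, pair, sentence_tokens):
    x, y = pair
    if len(path_tokens) > 5:                   # rule 1: at most 5 tokens
        return False
    if any(len(t) < 2 for t in path_tokens):   # rule 2: drop single-character tokens
        return False
    if x not in sentence_tokens or y not in sentence_tokens:
        return False                           # rule 3: both words must co-occur
    return True

print(keep_path(["such", "as"], ("fruit", "apple"),
                ["fruit", "such", "as", "apple"]))   # True
```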
Based on the word-pattern feature vector obtained above, the word embeddings of the two words are concatenated with the word-pattern feature vector, and the final concatenated feature vector is used as the representation of the word pair, i.e., the vector representing a given word pair <x, y> together with its path.
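A minimal sketch of this pair representation follows, assuming 100-dimensional embeddings and a path vector obtained by averaging the path tokens' embeddings; the dimensions and the pooling choice are illustrative assumptions, since the published text does not reproduce the exact equations.

```python
# Sketch: represent a word pair <x, y> by concatenating the two word
# embeddings with a pooled feature vector of the path between them.
import numpy as np

def pair_vector(emb, x, y, path_tokens):
    path_vec = np.mean([emb[t] for t in path_tokens], axis=0)  # pooled path features
    return np.concatenate([emb[x], emb[y], path_vec])

rng = np.random.default_rng(0)
emb = {w: rng.random(100) for w in ["apple", "is", "a", "fruit"]}
v = pair_vector(emb, "apple", "fruit", ["is", "a"])
print(v.shape)   # (300,): hyponym + hypernym + path features
```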
(2) The path-based sample training set is constructed as follows (see FIG. 3): First, the corpus text is fragmented, recorded as S_split = Split({S_1, S_2, S_3, …, S_n}). Fragmentation uses an N-gram algorithm that enumerates the sentence fragments formed by all runs of consecutive tokens; each token counts as length 1, and fragments of length at most 5 are taken. For example, fragmenting the sentence "dragon fruit is a rose fruit", of length 7 under the original tokenization, yields:
(1) dragon fruit
(2) dragon fruit is
(3) dragon fruit is a
(4) dragon fruit is a rose
(5) dragon fruit is a rose fruit
… and so on, for 28 fragments in total.
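A small sketch of the fragment enumeration follows; the tokenized English sentence is an illustrative stand-in for the Chinese original, and for 7 tokens the enumeration indeed yields 7 + 6 + … + 1 = 28 fragments, matching the count above.

```python
# Enumerate all fragments of consecutive tokens (each token has length 1);
# an optional max_len implements the length-at-most-5 restriction.
def fragments(tokens, max_len=None):
    n = len(tokens)
    max_len = max_len or n
    return [tokens[i:i + k]
            for k in range(1, max_len + 1)
            for i in range(n - k + 1)]

sent = "dragon fruit is a rose family fruit".split()  # 7 tokens, illustrative
print(len(fragments(sent)))      # 28 fragments in total
print(len(fragments(sent, 5)))   # 25 fragments of length at most 5
```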
The misplaced word pairs from the randomly misplaced samples are matched against the corpus text to obtain the set of sentences containing a misplaced hypernym-hyponym pair, S_<x,y> = {S_<x1,y1>, S_<x2,y2>, S_<x3,y3>, …, S_<xn,yn>}, for example:
(1) S_<x1,y1> = <apple for tropical animals>
(2) S_<x2,y2> = <dogs do not eat fruit>
The path between each misplaced word pair is extracted and recorded as P = {P_1, P_2, P_3, …, P_n}. The extracted paths are matched against the corpus fragments {S_1, S_2, S_3, …, S_n}. After a successful match, the fragment's original sentence is looked up, and the first word before and after the path P′ that does not belong to the original misplaced pair is taken as a path-based negative word pair. For example:
P_1 = <for tropical zones>
P_2 = <cannot eat>
Matching against the text fragments yields the sentences:
S′_1 = <such a temperature is very suitable for tropical animals>
S′_2 = <people in cold weather do not eat cold food>
from which the path-based negative word pairs <temperature, animal> and <people, cold food> are obtained. Finally, these are combined with the positive samples to obtain the path-based sample training set.
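The sketch below illustrates this flanking-word extraction under a simplified whitespace tokenization; under the original Chinese tokenization the patent's examples yield <temperature, animal> and <people, cold food>, whereas the simplified English version picks the literal neighboring tokens.

```python
# Hedged sketch of step (2.4): locate an extracted path inside a source
# sentence and take the words immediately before and after it as a
# path-based negative word pair.
def path_negative(path, sentences):
    p = path.split()
    for sent in sentences:
        tokens = sent.split()
        for i in range(len(tokens) - len(p) + 1):
            if tokens[i:i + len(p)] == p:
                left = tokens[i - 1] if i > 0 else None
                right = tokens[i + len(p)] if i + len(p) < len(tokens) else None
                if left and right:
                    return (left, right, sent)   # negative pair and its source
    return None

print(path_negative("do not eat",
                    ["people in cold weather do not eat cold food"]))
# ('weather', 'cold', 'people in cold weather do not eat cold food')
```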
(3) The semi-supervised joint gradient enhanced decision tree model is trained on the constructed randomly misplaced and path-based sample training sets; the training process is shown in FIG. 4. The trained model is then used to recognize hypernym-hyponym relations.
After the two kinds of training samples, randomly misplaced samples and path-based samples, are constructed, training of the semi-supervised joint gradient enhanced decision tree model begins. When constructed, the misplacement-based samples vary the 100-dimensional vectors of the path and of the hyponym (300 dimensions in total), while the path-based samples vary the 200-dimensional vector of the hypernym and hyponym.
The semi-supervised joint gradient enhanced decision tree training method comprises the following steps:
Input: a text corpus T, pre-trained word embeddings, and a maximum number of iterations I;
(i) preprocess the data in T and extract the two kinds of training samples X_p and X_d, where X_p is the path-based sample training set and X_d is the randomly misplaced sample training set;
(ii) convert each training sample into a vector representation using the word embeddings W;
(iii) set up the labeled sets X′_p and X′_d, where X′_p denotes path samples and X′_d denotes randomly misplaced samples;
(iv) train two classifiers f_1 and f_2 on X_p ∪ X′_p and X_d ∪ X′_d respectively;
(v) predict the unlabeled samples and select high-confidence positive samples to expand X′_p and X′_d as new training samples;
(vi) repeat steps (iv) and (v) until no new labeled samples are added to X′_p and X′_d;
Output: the two classifiers and the predicted labels for the test samples (a sketch of this loop follows).
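A minimal sketch of steps (i)-(vi) follows, assuming scikit-learn-style classifiers and row-aligned featurizations of the same unlabeled pairs in the two views; the 0.8 threshold anticipates the high-confidence rule described below, and the function name and max_iter cap are illustrative.

```python
# Co-training sketch: two GBDT classifiers, each trained on its own view,
# jointly promote high-confidence unlabeled positives into the training set.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def co_train(Xp, yp, Xd, yd, Up, Ud, threshold=0.8, max_iter=10):
    """Xp/Xd: labeled path / misplaced features; Up/Ud: the same unlabeled
    pairs featurized in the two views (row-aligned)."""
    Xp_new, Xd_new = np.empty((0, Xp.shape[1])), np.empty((0, Xd.shape[1]))
    y_new = np.empty(0)
    f1 = f2 = None
    for _ in range(max_iter):                                  # steps (iv)-(vi)
        f1 = GradientBoostingClassifier().fit(
            np.vstack([Xp, Xp_new]), np.concatenate([yp, y_new]))
        f2 = GradientBoostingClassifier().fit(
            np.vstack([Xd, Xd_new]), np.concatenate([yd, y_new]))
        p1 = f1.predict_proba(Up)[:, 1]                        # step (v)
        p2 = f2.predict_proba(Ud)[:, 1]
        mask = (p1 > threshold) & (p2 > threshold)             # high confidence
        if not mask.any():
            break                                              # step (vi): stop
        Xp_new = np.vstack([Xp_new, Up[mask]])
        Xd_new = np.vstack([Xd_new, Ud[mask]])
        y_new = np.concatenate([y_new, np.ones(mask.sum())])
        Up, Ud = Up[~mask], Ud[~mask]
    return f1, f2
```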
Hypernym-hyponym relation mining is essentially a binary classification task. Among conventional machine-learning algorithms, the gradient enhanced decision tree is one of the best at fitting the true distribution: it classifies or regresses data with an additive model (a linear combination of base functions) that continually reduces the residuals produced during training. The model is additive, the learning algorithm is the forward stagewise algorithm, and the base function is a CART tree. The loss function is the squared-error loss, i.e.:
L(y, f(x)) = (1/2) · (y − f(x))²
The negative gradient is then:
−∂L(y, f(x)) / ∂f(x) = y − f(x)
where y − f(x) is the residual; in each iteration the model learns a weak classifier by fitting this residual. Weak classifiers are generally required to be simple, with low variance and high bias, because the training process improves the accuracy of the final classifier by steadily reducing the bias. The core idea is that each weak classifier fits the residual of the sum of all previous classifiers, the residual being the amount that, added to the predicted value, gives the true value. The model's input is labeled samples, divided into path samples and misplaced samples. Since this is a binary task, the label indicates a hypernym-hyponym pair or a non-hypernym-hyponym pair by 1 or 0; each sample thus has the format (label, feature vector). The output is the classification tree f(x).
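As an off-the-shelf stand-in for this classifier, the sketch below trains scikit-learn's GradientBoostingClassifier on (label, feature-vector) samples of the format just described; the feature dimension and the random data are illustrative only.

```python
# Train a GBDT on labeled pair vectors: label 1 = hypernym-hyponym pair,
# label 0 = not. Random features stand in for the real pair vectors.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 300))             # 200 pair vectors (illustrative)
y = rng.integers(0, 2, 200)            # labels in {0, 1}

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
clf.fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])  # positive-class probabilities
```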
The method mainly comprises the following steps:
(1) Initialization: f_0(x) = argmin_c Σ_{i=1..N} L(y_i, c), a tree with a single root node whose constant value c minimizes the loss function; for the squared loss this constant is the mean of the node:
f_0(x) = (1/N) Σ_{i=1..N} y_i
(2) For m = 1, 2, 3, …, M:
(a) compute the residual for each sample i = 1, 2, 3, …, N:
r_{mi} = −[∂L(y_i, f(x_i)) / ∂f(x_i)]_{f = f_{m−1}} = y_i − f_{m−1}(x_i)
(b) fit a classification tree to {(x_1, r_{m1}), …, (x_N, r_{mN})}, obtaining the leaf-node regions R_{mj}, j = 1, 2, …, J of the m-th tree;
(c) for j = 1, 2, …, J, estimate the value of each leaf-node region by line search to minimize the loss function:
c_{mj} = argmin_c Σ_{x_i ∈ R_{mj}} L(y_i, f_{m−1}(x_i) + c) = (1/K) Σ_{x_i ∈ R_{mj}} r_{mi}
where K is the number of samples in the j-th node of the m-th tree; c_{mj} is thus the mean of the residuals in that node;
(d) update the model, where I(·) is the indicator function:
f_m(x) = f_{m−1}(x) + Σ_{j=1..J} c_{mj} · I(x ∈ R_{mj})
(3) Obtain the final classification tree:
f_M(x) = f_0(x) + Σ_{m=1..M} Σ_{j=1..J} c_{mj} · I(x ∈ R_{mj})
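A from-scratch sketch of steps (1)-(3) under the squared loss follows; it uses one CART regression tree per round (scikit-learn's DecisionTreeRegressor already sets each leaf value to the mean of its targets, matching c_{mj} above), and M and max_depth are illustrative choices.

```python
# Gradient boosting by residual fitting: initialize with the mean, then
# repeatedly fit a CART tree to the residuals and add its predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, M=50, max_depth=3):
    f0 = y.mean()                        # step (1): constant minimizing squared loss
    F = np.full(len(y), f0)
    trees = []
    for _ in range(M):                   # step (2)
        r = y - F                        # (a) residual = negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # (b), (c)
        F += tree.predict(X)             # (d) leaf values are residual means
        trees.append(tree)
    return f0, trees

def gbdt_predict(f0, trees, X):          # step (3): f_M = f_0 + sum of trees
    return f0 + sum(t.predict(X) for t in trees)
```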
After the gradient enhanced decision tree classification function is obtained, path samples are constructed from the unlabeled data T′_1 and fed into the classification tree for prediction.
(4) during training, a classification regression tree is trained for each possible class of the sample X. The training set has two types, namely an upper-lower relation or a non-upper-lower word relation, for the sample<x,y>The prediction result 0 indicates a non-hypernym relationship, and 1 indicates a hypernym relationship. After multi-round iterative training, two trees are generated, and a new sample is obtained<x’,y’>Is respectively F1(x),F2(x) Then the probability that the sample belongs to a certain class c is:
Figure BDA0002265607880000122
the method comprises the steps of training two classifiers by constructing different samples, taking a sample of which the prediction result of the same sample on the two classifiers is greater than 0.8 as a high-confidence sample.
When the texts {T_1, T_2, T_3, …, T_n} are pairwise disjoint, the high-confidence set generated by a new text T_2 is added directly to the training set after review, and the training set grows at a rate determined by that text. By the time the semi-supervised model has learned the high-confidence sets of the n mutually disjoint texts, the growth rate tends to 0.
when text { T'1,T′2,T′3,...,T′NT 'for any two texts'n,T′mWhen there is intersection
T′n∪T′m=T′n\T′m+T′n∩T′m+T′m\Tn
I.e. for T'1Newly-added T'm,T′nThe effect of the text is equivalent to newly added T'n\T′m+T′n∩T′m+T′m\T′n. And taking the intersection of the upper and lower word pairs in the meaning of the intersection.
Then, when n texts are newly added, any n pairwise intersecting texts can be split into at most 2^n − 1 mutually disjoint texts. When the model learns the n-th text (assuming i ≠ j), let T′_i \ T′_j = T′_ij, T′_j \ T′_i = T′_ji, and T′_i ∩ T′_j = T′_(j,i); then T′_ij, T′_ji, and T′_(j,i) are mutually disjoint, and the text growth rate can be computed over these disjoint parts. As n → N, the number of disjoint parts involved equals that of the fully disjoint case, so the amount of newly added information tends to 0; that is, as the texts approach the full corpus, learning adds nothing new. Since for any T′_i the model's growth rate upon learning T′_i is greater than or equal to 0, and the growth rate tends to 0 as i tends to infinity, the model converges.
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A method for recognizing the relation between upper and lower level words based on a gradient enhanced decision tree is characterized by comprising the following steps:
(1) constructing a randomly misplaced sample training set;
(2) constructing a path-based sample training set;
(3) training a semi-supervised joint gradient enhanced decision tree model on the constructed randomly misplaced and path-based sample training sets, and using the trained model to recognize the relation between upper and lower level words.
2. The method for recognizing the relation between upper and lower level words based on the gradient enhanced decision tree as claimed in claim 1, wherein the randomly misplaced sample training set is constructed as follows:
(1.1) the corpus text is segmented with the Alibaba Word Segmenter lexical analysis system; hypernym-hyponym word pairs extracted from an existing lexicon are matched against the text, and positive samples are constructed from each pair together with the text between its words;
(1.2) the hypernym and hyponym of each successfully matched pair are misplaced to form negative word pairs, and the misplaced pairs are matched against the text to construct randomly misplaced negative samples;
(1.3) the positive and negative samples obtained above are combined into the randomly misplaced sample training set.
3. The method for recognizing the relation between upper and lower level words based on the gradient enhanced decision tree as claimed in claim 1, wherein the path-based sample training set is constructed as follows:
(2.1) the corpus text is fragmented, recorded as S_split = Split({S_1, S_2, S_3, …, S_n});
(2.2) the misplaced word pairs from the randomly misplaced samples are matched against the corpus text to obtain the set of sentences containing a misplaced hypernym-hyponym pair, S_<x,y> = {S_<x1,y1>, S_<x2,y2>, S_<x3,y3>, …, S_<xn,yn>};
(2.3) the path between each misplaced word pair is extracted and recorded as P = {P_1, P_2, P_3, …, P_n};
(2.4) the extracted paths are matched against the corpus fragments {S_1, S_2, S_3, …, S_n}; after a successful match, the fragment's original sentence is looked up, and the first word before and after the path P′ that does not belong to the original misplaced pair is taken as a path-based negative word pair; the path-based sample training set is obtained by combining these with the positive samples.
4. The method for recognizing the relation between upper and lower level words based on the gradient enhanced decision tree as claimed in claim 3, wherein the corpus fragmentation uses an N-gram algorithm to enumerate the sentence fragments formed by all runs of consecutive tokens; each token counts as length 1, and fragments of length at most 5 are taken.
5. The method for recognizing the relation between upper and lower level words based on the gradient enhanced decision tree as claimed in claim 1, wherein the semi-supervised joint gradient enhanced decision tree model is an additive model, the learning algorithm is the forward stagewise algorithm, and the base function is a CART tree; the loss function is the squared-error loss, i.e.:
L(y, f(x)) = (1/2) · (y − f(x))²
The negative gradient is then:
−∂L(y, f(x)) / ∂f(x) = y − f(x)
where y − f(x) is the residual; the output is the classification tree f(x).
6. The method for recognizing the relation between upper and lower level words based on the gradient enhanced decision tree as claimed in claim 1, wherein the semi-supervised joint gradient enhanced decision tree training method comprises the following steps:
Input: a text corpus T, pre-trained word embeddings, and a maximum number of iterations I;
(i) preprocess the data in T and extract the two kinds of training samples X_p and X_d, where X_p is the path-based sample training set and X_d is the randomly misplaced sample training set;
(ii) convert each training sample into a vector representation using the word embeddings W;
(iii) set up the labeled sets X′_p and X′_d, where X′_p denotes path samples and X′_d denotes randomly misplaced samples;
(iv) train two classifiers f_1 and f_2 on X_p ∪ X′_p and X_d ∪ X′_d respectively;
(v) predict the unlabeled samples and select high-confidence positive samples to expand X′_p and X′_d as new training samples;
(vi) repeat steps (iv) and (v) until no new labeled samples are added to X′_p and X′_d;
Output: the two classifiers and the predicted labels for the test samples.
CN201911086620.0A 2019-11-08 2019-11-08 Method for recognizing upper and lower level word relation based on gradient enhanced decision tree Active CN110968665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911086620.0A CN110968665B (en) 2019-11-08 2019-11-08 Method for recognizing upper and lower level word relation based on gradient enhanced decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911086620.0A CN110968665B (en) 2019-11-08 2019-11-08 Method for recognizing upper and lower level word relation based on gradient enhanced decision tree

Publications (2)

Publication Number Publication Date
CN110968665A (en) 2020-04-07
CN110968665B CN110968665B (en) 2022-09-23

Family

ID=70030486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911086620.0A Active CN110968665B (en) 2019-11-08 2019-11-08 Method for recognizing upper and lower level word relation based on gradient enhanced decision tree

Country Status (1)

Country Link
CN (1) CN110968665B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2008202384A1 (en) * 2008-05-23 2009-12-10 O'Collins, Frank Anthony Mr Ucadia Semantic Classification System
CN108733702A (en) * 2017-04-20 2018-11-02 北京京东尚科信息技术有限公司 User inquires method, apparatus, electronic equipment and the medium of hyponymy extraction
CN107506486A (en) * 2017-09-21 2017-12-22 北京航空航天大学 A kind of relation extending method based on entity link
CN109408642A (en) * 2018-08-30 2019-03-01 昆明理工大学 A kind of domain entities relation on attributes abstracting method based on distance supervision
CN110196982A (en) * 2019-06-12 2019-09-03 腾讯科技(深圳)有限公司 Hyponymy abstracting method, device and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭茂盛 (Guo Maosheng) et al., "Research Progress and Prospects of Textual Entailment Recognition and Knowledge Acquisition" (文本蕴含关系识别与知识获取研究进展及展望), Chinese Journal of Computers (《计算机学报》) *

Also Published As

Publication number Publication date
CN110968665B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN107330032B (en) Implicit discourse relation analysis method based on recurrent neural network
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN108897857B (en) Chinese text subject sentence generating method facing field
CN108829722B (en) Remote supervision Dual-Attention relation classification method and system
CN108182295B (en) Enterprise knowledge graph attribute extraction method and system
CN111444343B (en) Cross-border national culture text classification method based on knowledge representation
CN111753024B (en) Multi-source heterogeneous data entity alignment method oriented to public safety field
CN111694924A (en) Event extraction method and system
CN113268995B (en) Chinese academy keyword extraction method, device and storage medium
CN106649561A (en) Intelligent question-answering system for tax consultation service
CN108304373B (en) Semantic dictionary construction method and device, storage medium and electronic device
CN101561805A (en) Document classifier generation method and system
CN112307351A (en) Model training and recommending method, device and equipment for user behavior
CN112069312B (en) Text classification method based on entity recognition and electronic device
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN110210036A (en) A kind of intension recognizing method and device
CN112131876A (en) Method and system for determining standard problem based on similarity
CN113157859A (en) Event detection method based on upper concept information
CN111159405B (en) Irony detection method based on background knowledge
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN112115259A (en) Feature word driven text multi-label hierarchical classification method and system
Katumullage et al. Using neural network models for wine review classification
CN113722439B (en) Cross-domain emotion classification method and system based on antagonism class alignment network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant