CN110532548A

CN110532548A - A hyponymy extraction method based on the FP-Growth algorithm

Info

Publication number: CN110532548A
Application number: CN201910738173.6A
Authority: CN
Inventors: 骆祥峰; 黄敬; 皇苏斌
Original assignee: Alibaba Group Holding Ltd; University of Shanghai for Science and Technology
Current assignee: Alibaba Group Holding Ltd; University of Shanghai for Science and Technology
Priority date: 2019-08-12
Filing date: 2019-08-12
Publication date: 2019-12-03

Abstract

The invention discloses a hyponymy extraction method based on the FP-Growth algorithm, comprising the following steps: (1) input an unstructured text set from any domain; (2) preprocess the given text set sentence by sentence; (3) design formal templates for hyponymy extraction and match seed hyponymy relations; (4) build the seed hyponymy classification expansion set and use the FP-Growth algorithm to extract a candidate set of hyponymy relations; (5) screen the candidate set using PMI (pointwise mutual information); (6) output the extracted hyponymy relations. The method uses a small number of formal templates to obtain high-accuracy seed hyponymy relations and then extracts hyponymy relations from unstructured text with the FP-Growth algorithm, so that extraction is automatic and both the accuracy and the coverage of hyponymy extraction are improved.

Description

A hyponymy extraction method based on the FP-Growth algorithm

Technical field

The present invention relates to a method for extracting hyponymy relations from unstructured text, and in particular to a hyponymy extraction method based on the FP-Growth algorithm. The method extracts relations automatically and improves the accuracy and coverage of hyponymy extraction.

Background art

Hyponymy relations are an important foundation for natural language processing tasks. Concepts and entities are linked through hypernym and hyponym relations so that knowledge acquisition and knowledge reasoning can be carried out more effectively. Traditional hyponymy extraction methods fall into two main groups: methods based on encyclopedic knowledge bases and methods based on statistics. Methods based on encyclopedic knowledge bases mainly extract hyponymy relations from the semi-structured text of encyclopedias and suffer from low extraction coverage; faced with structurally complex unstructured text, they are unlikely to give satisfactory results. Statistics-based methods extract hyponymy relations from co-occurrence features between words. Because they rely on word co-occurrence alone to judge whether a hyponymy relation holds between two words, their accuracy is low and cannot meet the needs of practical applications.

Summary of the invention

To address the shortcomings of existing hyponymy extraction methods, the present invention proposes a hyponymy extraction method based on the FP-Growth algorithm. The method first uses a small number of formal templates to match high-accuracy seed hyponymy relations, and then extracts hyponymy relations from unstructured text with the FP-Growth algorithm. This addresses the low accuracy and low coverage of hyponymy extraction from unstructured text, requires no manual involvement or annotated corpus, and achieves automatic extraction of hyponymy relations from unstructured text.

In order to achieve the above objectives, the present invention adopts the following technical scheme:

A hyponymy extraction method based on the FP-Growth algorithm, comprising the following steps:

(1) input an unstructured text set from any domain;

(2) preprocess the given text set sentence by sentence;

(3) design formal templates for hyponymy extraction and match seed hyponymy relations;

(4) build the seed hyponymy classification expansion set and use the FP-Growth algorithm to extract a candidate set of hyponymy relations;

(5) screen the candidate set of hyponymy relations using PMI (pointwise mutual information);

(6) output the extracted hyponymy relations.

Step (2) includes the following sub-steps:

(2-1) using the natural language processing tool Hanlp, perform word segmentation on the given text set sentence by sentence to obtain the segmented sentence set S₁;

(2-2) using the natural language processing tool Hanlp, perform part-of-speech tagging on sentence set S₁ to obtain the segmented and part-of-speech-tagged sentence set S₂.
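The data flow of step (2) can be sketched as follows. This is only an illustration under stated assumptions: HanLP itself is not called here, the `segment` argument is a hypothetical caller-supplied function (for instance a thin wrapper around HanLP) that returns (word, part-of-speech) pairs for one sentence, and the helper names `split_sentences` and `preprocess` are invented for this sketch. Unlike sub-steps (2-1) and (2-2), the sketch obtains words and tags in a single call and derives S₁ from the tagged result.

```python
import re

def split_sentences(text):
    """Split raw text into sentences on Chinese/Western end punctuation."""
    return [s.strip() for s in re.findall(r"[^。！？!?]+[。！？!?]?", text) if s.strip()]

def preprocess(text, segment):
    """Build S1 (word lists) and S2 ((word, pos) lists), sentence by sentence.

    `segment` is an assumed, caller-supplied function: sentence -> [(word, pos), ...].
    """
    s1, s2 = [], []
    for sentence in split_sentences(text):
        tagged = segment(sentence)
        s1.append([word for word, _ in tagged])  # segmentation only
        s2.append(tagged)                        # segmentation + POS tags
    return s1, s2
```

Passing the segmenter in as a parameter keeps the sketch independent of any particular HanLP version or binding.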

Step (3) includes the following sub-steps:

(3-1) according to the lexical and syntactic features of natural language, construct formal templates for seed hyponymy extraction, as shown in Table 1, where a₁ and a₂ denote hyponyms and b₁ denotes a hypernym;

Table 1: formal templates for seed hyponymy extraction

Serial number	Template
1	a₁ [is | belongs to | is | refers to] one [kind | class | a] b₁
2	a₁ [is | belongs to | is | refers to] b₁ one [kind | class | a]
3	b₁ [such as | such as | such as] {a₁,}+ [and | or] a₂ etc.
4	b₁ [including] {a₁,}+ [and | or] a₂ etc.

(3-2) using regular expressions, match the above formal templates against sentence set S₂ to obtain the seed hyponymy relation set Z = {(a_i, isa, b_j)}.
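The template matching of step (3-2) can be sketched as follows. This is a minimal illustration, not the patent's implementation. The Chinese cue words (是, 属于, 为, 指, 如, 例如, 比如, 包括, 和, 或, 等) are assumed back-translations of the glossed templates in Table 1, the input is assumed to be one segmented sentence whose tokens are joined by single spaces, part-of-speech tags are ignored, and the names `TEMPLATE_ISA`, `TEMPLATE_LIST` and `match_seed_relations` are hypothetical.

```python
import re

# Template 1 (assumed Chinese form): a1 [是|属于|为|指] 一[种|类|个] b1
TEMPLATE_ISA = re.compile(
    r"(?P<hypo>\S+) (?:是|属于|为|指) 一(?:种|类|个) (?P<hyper>\S+)"
)

# Templates 3 and 4 (assumed Chinese form): b1 [如|例如|比如|包括] {a1,}+ [和|或] a2 等
TEMPLATE_LIST = re.compile(
    r"(?P<hyper>\S+) (?:如|例如|比如|包括) "
    r"(?P<items>(?:\S+ [、,，] )*\S+ (?:和|或) \S+) 等"
)

# Tokens inside the enumerated list that are separators, not hyponyms.
SEPARATORS = {"、", ",", "，", "和", "或"}

def match_seed_relations(sentence):
    """Return (hyponym, 'isa', hypernym) triples found in one segmented sentence."""
    triples = []
    for m in TEMPLATE_ISA.finditer(sentence):
        triples.append((m.group("hypo"), "isa", m.group("hyper")))
    for m in TEMPLATE_LIST.finditer(sentence):
        for token in m.group("items").split(" "):
            if token not in SEPARATORS:
                triples.append((token, "isa", m.group("hyper")))
    return triples
```

Applied to every sentence of S₂, the union of the returned triples plays the role of the seed set Z.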

Step (4) includes the following sub-steps:

(4-1) using the seed hyponymy relation set Z, build the seed hyponymy classification expansion set TSC_k = (hyper_k | hypo_k = {hypo_k^1, hypo_k^2, ..., hypo_k^n}), where hyper_k denotes the hypernym of a hyponymy relation, hypo_k denotes the set of hyponyms whose hypernym is hyper_k, and hypo_k^i denotes one specific hyponym of hyper_k;

(4-2) based on the FP-Growth algorithm from the paper "Mining frequent patterns without candidate generation", extract from sentence set S₂ the frequent-item hyponym sets W = {hypo_k^1, ..., hypo_k^m; w_i; f} that co-occur with the hyponyms hypo_k^i of the seed hyponymy classification expansion set, where m ∈ [0, n] is the number of hyponyms from hypo_k that co-occur in the frequent item, w_i is a candidate hyponym, and f is the number of times the frequent item occurs;

(4-3) set the threshold α = 5; when the frequent-item count f ≥ α, obtain the candidate hyponymy relation set HX = (w_i, isa, hyper_k).
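Sub-steps (4-1) and (4-3) can be sketched as follows, taking the frequent itemsets of (4-2) as an already computed input. This is an illustration under stated assumptions rather than the patent's implementation: the function names are hypothetical, frequent itemsets are assumed to arrive as a mapping from `frozenset` of words to occurrence count, and an itemset is assumed to yield a candidate only when it contains at least one known hyponym of hyper_k plus exactly one unknown word w_i.

```python
from collections import defaultdict

ALPHA = 5  # frequency threshold α from step (4-3)

def build_tsc(seed_relations):
    """Group seed triples (hyponym, 'isa', hypernym) into {hypernym: {hyponyms}}."""
    tsc = defaultdict(set)
    for hypo, _, hyper in seed_relations:
        tsc[hyper].add(hypo)
    return dict(tsc)

def candidates_from_itemsets(frequent_itemsets, tsc, alpha=ALPHA):
    """Derive the candidate set HX from {frozenset(words): count}."""
    hx = set()
    for itemset, count in frequent_itemsets.items():
        if count < alpha:
            continue  # below α: not frequent enough
        for hyper, hypos in tsc.items():
            known = itemset & hypos                # seed hyponyms in the itemset
            unknown = itemset - hypos - {hyper}    # words not yet linked to hyper
            if known and len(unknown) == 1:        # assumed qualifying condition
                (w,) = unknown
                hx.add((w, "isa", hyper))
    return hx
```

With the food-domain seeds of the embodiment, an itemset {soybean oil, peanut oil, sesame oil} occurring 15 times would produce the candidate (sesame oil, isa, edible oil).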

Step (5) includes the following sub-steps:

(5-1) compute the pointwise mutual information between the hypernym and the hyponym of each relation in the candidate hyponymy relation set HX, as follows:

$$PMI(V_i, V_j) = \log \frac{p(V_i, V_j)}{p(V_i)\,p(V_j)}$$

where p(V_i, V_j) is the probability that hypernym V_i and hyponym V_j co-occur in the corpus, p(V_i) is the probability that the hypernym occurs in the corpus, and p(V_j) is the probability that the hyponym occurs in the corpus;

(5-2) set the threshold β = 8 and traverse the candidate hyponymy relation set HX; when the pointwise mutual information of hypernym V_i and hyponym V_j satisfies PMI(V_i, V_j) ≥ β, add the hyponymy relation to set Z;

(5-3) after completing step (5-2), return to step (4-1) and iterate the extraction until no new hyponymy relation is added to set Z.
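The screening of sub-steps (5-1) and (5-2) can be sketched as follows. The source does not say how the probabilities are estimated or which logarithm base is used, so this sketch makes two explicit assumptions: probabilities are estimated as fractions of sentences containing the word or word pair, and the logarithm is base 2. The names `pmi` and `filter_candidates` are hypothetical.

```python
import math

BETA = 8  # PMI threshold β from step (5-2)

def pmi(hyper, hypo, sentences):
    """Sentence-level PMI of two words; `sentences` is a list of token lists."""
    n = len(sentences)
    n_hyper = sum(1 for s in sentences if hyper in s)
    n_hypo = sum(1 for s in sentences if hypo in s)
    n_both = sum(1 for s in sentences if hyper in s and hypo in s)
    if n == 0 or n_both == 0:
        return float("-inf")  # the words never co-occur
    # p(x,y) / (p(x) p(y)) simplifies to n_both * n / (n_hyper * n_hypo)
    return math.log2(n_both * n / (n_hyper * n_hypo))  # base 2 is an assumption

def filter_candidates(hx, sentences, beta=BETA):
    """Keep candidate triples (hyponym, 'isa', hypernym) whose PMI reaches β."""
    return {(hypo, rel, hyper) for hypo, rel, hyper in hx
            if pmi(hyper, hypo, sentences) >= beta}
```

The triples returned by `filter_candidates` are the ones that would be added to Z before the next iteration.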

Compared with existing methods, the present invention has the following features and advantages:

(1) during hyponymy extraction, a seed hyponymy classification expansion set is built, and the existing seed relations are used to identify other hyponymy relations implicit in the corpus, which avoids the low accuracy that comes from using co-occurrence information directly;

(2) based on the FP-Growth algorithm, a large candidate set of hyponymy relations is extracted from the corpus, which addresses the low coverage of hyponymy extraction from unstructured corpora;

(3) the candidate set is screened using PMI (pointwise mutual information) to obtain high-accuracy hyponymy relations, which further addresses the low accuracy of hyponymy extraction from unstructured corpora. The present invention requires no manual involvement or annotated corpus and can automatically extract hyponymy relations with high accuracy and high coverage from unstructured corpora.

Description of the drawings

Fig. 1 is a flow diagram of the hyponymy extraction method based on the FP-Growth algorithm of the present invention.

Specific embodiment

The invention is further described below with reference to the drawing and an embodiment.

As shown in Fig. 1, the present invention proposes a hyponymy extraction method based on the FP-Growth algorithm. Taking unstructured data from the food domain as an example, 20,000 unstructured food-domain texts from May 2018 to December 2018 were crawled from the food science and technology website (https://www.tech-food.com/). The method includes the following steps:

S1. Input the 20,000 unstructured texts from the food domain;

S2. Preprocess the given text set sentence by sentence. The specific sub-steps are as follows:

S2.1. Using the natural language processing tool Hanlp, perform word segmentation on the given text set sentence by sentence to obtain the segmented sentence set S₁;

S2.2. Using the natural language processing tool Hanlp, perform part-of-speech tagging on sentence set S₁ to obtain the segmented and part-of-speech-tagged sentence set S₂. A specific sentence preprocessing example is as follows:

S3. Design formal templates for hyponymy extraction and match seed hyponymy relations. The specific sub-steps are as follows:

S3.1. According to the lexical and syntactic features of natural language, construct formal templates for seed hyponymy extraction, as shown in Table 2, where a₁ and a₂ denote hyponyms and b₁ denotes a hypernym;

Table 2: Chinese hyponymy extraction templates

S3.2. Using regular expressions, match the above formal templates against sentence set S₂ to obtain the seed hyponymy relation set Z = {(a_i, isa, b_j)}. Examples of seed hyponymy relations obtained in the food domain are:

(watermelon, isa, fruit), (wax gourd, isa, vegetables), (soybean oil, isa, edible oil), (white wine, isa, alcoholic drinks), etc.

S4. Build the seed hyponymy classification expansion set and use the FP-Growth algorithm to extract the candidate set of hyponymy relations. The specific sub-steps are as follows:

S4.1. Using the seed hyponymy relation set Z, build the seed hyponymy classification expansion set TSC_k = (hyper_k | hypo_k = {hypo_k^1, hypo_k^2, ..., hypo_k^n}), where hyper_k denotes the hypernym of a hyponymy relation, hypo_k denotes the set of hyponyms whose hypernym is hyper_k, and hypo_k^i denotes one specific hyponym of hyper_k; for example:

TSC₁ = (hyper₁ = edible oil | hypo₁ = {soybean oil, peanut oil, olive oil}), TSC₂ = (hyper₂ = alcoholic drinks | hypo₂ = {white wine, beer, red wine}), etc.

S4.2. Based on the FP-Growth algorithm, extract from sentence set S₂ the frequent-item hyponym sets W = {hypo_k^1, ..., hypo_k^m; w_i; f} that co-occur with the hyponyms hypo_k^i of the seed hyponymy classification expansion set, where m ∈ [0, n] is the number of hyponyms from hypo_k that co-occur in the frequent item, w_i is a candidate hyponym, and f is the number of times the frequent item occurs. A specific example of a frequent-item hyponym set in the food domain is:

W₁ = {soybean oil, peanut oil; sesame oil; 15}, where m = 2, w₁ = sesame oil, f = 15.

S4.3. Set the threshold α = 5; when the frequent-item count f ≥ α, obtain the candidate hyponymy relation set HX = (w_i, isa, hyper_k). A specific example of an obtained candidate hyponymy relation is:

HX = (sesame oil, isa, edible oil).
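The frequent-item mining used in step S4.2 can be sketched with a compact FP-Growth implementation. This is a teaching sketch, not the patent's code: it assumes each segmented sentence is treated as one transaction whose items are its words, it returns every frequent itemset rather than only those that involve seed hyponyms, and the names `Node`, `_build_tree`, `_mine` and `fp_growth` are hypothetical.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def _build_tree(weighted, min_support):
    """Build an FP-tree; return (header table, supports of frequent items)."""
    support = defaultdict(int)
    for items, count in weighted:
        for item in set(items):
            support[item] += count
    support = {i: c for i, c in support.items() if c >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for items, count in weighted:
        # Keep frequent items only, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in support),
                         key=lambda i: (-support[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += count
            node = child
    return header, support

def _mine(weighted, min_support, suffix):
    """Recursively mine itemsets that extend `suffix`."""
    header, support = _build_tree(weighted, min_support)
    found = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        found[itemset] = support[item]
        # Conditional pattern base: the prefix path above every node of `item`.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        found.update(_mine(base, min_support, itemset))
    return found

def fp_growth(sentences, min_support):
    """Return {frozenset(items): support} for itemsets with support >= min_support."""
    return _mine([(s, 1) for s in sentences], min_support, frozenset())
```

Called with `min_support` set to α = 5 on the tokenized sentences, an entry such as {soybean oil, peanut oil, sesame oil} with support 15 would correspond to W₁ in the example above.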

S5. Screen the candidate set of hyponymy relations using PMI (pointwise mutual information). The specific sub-steps are as follows:

S5.1. Compute the pointwise mutual information between the hypernym and the hyponym of each relation in the candidate hyponymy relation set HX, as follows:

$$PMI(V_i, V_j) = \log \frac{p(V_i, V_j)}{p(V_i)\,p(V_j)}$$

where p(V_i, V_j) is the probability that hypernym V_i and hyponym V_j co-occur in the corpus, p(V_i) is the probability that the hypernym occurs in the corpus, and p(V_j) is the probability that the hyponym occurs in the corpus. A specific example of the pointwise mutual information of a hypernym and a hyponym in HX is as follows:

S5.2. Set the threshold β = 8 and traverse the candidate hyponymy relation set HX; when the pointwise mutual information of hypernym V_i and hyponym V_j satisfies PMI(V_i, V_j) ≥ β, add the hyponymy relation to set Z. A specific example of the hyponymy relations in the updated Z is:

Z = {(soybean oil, isa, edible oil), (peanut oil, isa, edible oil), ..., (sesame oil, isa, edible oil)}.

S5.3. After completing step S5.2, return to step S4.1 and iterate the extraction until no new hyponymy relation is added to set Z.

S6. Output the extracted hyponymy relations.
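The iteration described in S5.3 amounts to repeating S4 and S5 until Z reaches a fixed point. A minimal sketch of that control flow follows. The two callables are hypothetical stand-ins: `extract_candidates` is assumed to perform step S4 for the current Z and return HX, and `passes_filter` is assumed to apply the PMI test of step S5.2 to one candidate.

```python
def bootstrap(seed_relations, extract_candidates, passes_filter):
    """Repeat S4 and S5 until Z stops growing, then return Z (step S6)."""
    z = set(seed_relations)
    while True:
        candidates = extract_candidates(z)  # S4: candidate set HX for the current Z
        accepted = {c for c in candidates if passes_filter(c)} - z  # S5.2
        if not accepted:                    # S5.3: nothing new was added, stop
            return z
        z |= accepted
```

Because only relations not already in Z count as new, the loop ends once a full pass over the corpus yields no additional relation.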

Claims

1. A hyponymy extraction method based on the FP-Growth algorithm, characterized in that it comprises the following steps:

(1) input an unstructured text set from any domain;

(2) preprocess the given text set sentence by sentence;

(3) design formal templates for hyponymy extraction and match seed hyponymy relations;

(4) build the seed hyponymy classification expansion set and use the FP-Growth algorithm to extract a candidate set of hyponymy relations;

(5) screen the candidate set of hyponymy relations using PMI (pointwise mutual information);

(6) output the extracted hyponymy relations.

2. The hyponymy extraction method based on the FP-Growth algorithm according to claim 1, characterized in that step (2) includes the following sub-steps:

(2-1) using the natural language processing tool Hanlp, perform word segmentation on the given text set sentence by sentence to obtain the segmented sentence set S₁;

(2-2) using the natural language processing tool Hanlp, perform part-of-speech tagging on sentence set S₁ to obtain the segmented and part-of-speech-tagged sentence set S₂.

3. The hyponymy extraction method based on the FP-Growth algorithm according to claim 1, characterized in that step (3) includes the following sub-steps:

(3-1) according to the lexical and syntactic features of natural language, construct formal templates for seed hyponymy extraction;

(3-2) using regular expressions, match the above formal templates against sentence set S₂ to obtain the seed hyponymy relation set Z = {(a_i, isa, b_j)}.

4. The hyponymy extraction method based on the FP-Growth algorithm according to claim 1, characterized in that step (4) includes the following sub-steps:

(4-1) using the seed hyponymy relation set Z, build the seed hyponymy classification expansion set TSC_k = (hyper_k | hypo_k = {hypo_k^1, hypo_k^2, ..., hypo_k^n}), where hyper_k denotes the hypernym of a hyponymy relation, hypo_k denotes the set of hyponyms whose hypernym is hyper_k, and hypo_k^i denotes one specific hyponym of hyper_k;

(4-2) based on the FP-Growth algorithm, extract from sentence set S₂ the frequent-item hyponym sets W = {hypo_k^1, ..., hypo_k^m; w_i; f} that co-occur with the hyponyms hypo_k^i of the seed hyponymy classification expansion set, where m ∈ [0, n] is the number of hyponyms from hypo_k that co-occur in the frequent item, w_i is a candidate hyponym, and f is the number of times the frequent item occurs;

(4-3) set the threshold α = 5; when the frequent-item count f ≥ α, obtain the candidate hyponymy relation set HX = (w_i, isa, hyper_k).

5. The hyponymy extraction method based on the FP-Growth algorithm according to claim 1, characterized in that step (5) includes the following sub-steps:

(5-1) compute the pointwise mutual information between the hypernym and the hyponym of each relation in the candidate hyponymy relation set HX, as follows:

$$PMI(V_i, V_j) = \log \frac{p(V_i, V_j)}{p(V_i)\,p(V_j)}$$

where p(V_i, V_j) is the probability that hypernym V_i and hyponym V_j co-occur in the corpus, p(V_i) is the probability that the hypernym occurs in the corpus, and p(V_j) is the probability that the hyponym occurs in the corpus;

(5-2) set the threshold β = 8 and traverse the candidate hyponymy relation set HX; when the pointwise mutual information of hypernym V_i and hyponym V_j satisfies PMI(V_i, V_j) ≥ β, add the hyponymy relation to set Z;

(5-3) after completing step (5-2), return to step (4-1) and iterate the extraction until no new hyponymy relation is added to set Z.