CN110532548A - A hypernym-hyponym relation extraction method based on the FP-Growth algorithm - Google Patents

A hypernym-hyponym relation extraction method based on the FP-Growth algorithm

Info

Publication number
CN110532548A
Authority
CN
China
Prior art keywords
hyponymy
seed
growth algorithm
hyponym
hypernym
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910738173.6A
Other languages
Chinese (zh)
Inventor
骆祥峰
黄敬
皇苏斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
University of Shanghai for Science and Technology
Original Assignee
Alibaba Group Holding Ltd
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd and University of Shanghai for Science and Technology
Priority to CN201910738173.6A
Publication of CN110532548A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for extracting hypernym-hyponym relations based on the FP-Growth algorithm, comprising the following steps: (1) input a set of unstructured texts from any domain; (2) preprocess the given text set sentence by sentence; (3) design formal templates for extracting hypernym-hyponym relations and match seed relations; (4) build a category expansion set from the seed relations and use the FP-Growth algorithm to extract a candidate set of hypernym-hyponym relations; (5) filter the candidate set using PMI (pointwise mutual information); (6) output the extracted hypernym-hyponym relations. The method uses a small number of formal templates to obtain high-accuracy seed relations and then extracts further relations from unstructured text with the FP-Growth algorithm, so that extraction is automatic and both its accuracy and its coverage improve.

Description

A hypernym-hyponym relation extraction method based on the FP-Growth algorithm
Technical field
The present invention relates to a method for extracting hypernym-hyponym relations from unstructured text, and in particular to a hypernym-hyponym relation extraction method based on the FP-Growth algorithm. The method extracts relations automatically and improves both the accuracy and the coverage of the extraction.
Background
Hypernym-hyponym relations are an important foundation for natural language processing tasks. Concepts and entities are linked through hypernym (superordinate) and hyponym (subordinate) relations, which supports knowledge acquisition and knowledge reasoning. Traditional extraction methods fall into two main groups: methods based on encyclopedic knowledge bases and methods based on statistics. Methods based on encyclopedic knowledge bases mainly extract hypernym-hyponym relations from the semi-structured text of encyclopedias and suffer from low extraction coverage; faced with structurally complex unstructured text, they are unlikely to achieve satisfactory results. Statistics-based methods extract relations from co-occurrence features between words. Because they rely only on word co-occurrence to judge whether a hypernym-hyponym relation holds between two words, their accuracy is low and they cannot meet the needs of practical applications.
Summary of the invention
To address the shortcomings of existing extraction methods, the present invention proposes a hypernym-hyponym relation extraction method based on the FP-Growth algorithm. The method first uses a small number of formal templates to match high-accuracy seed hypernym-hyponym relations, and then extracts further relations from unstructured text based on the FP-Growth algorithm. This addresses the low accuracy and low coverage of relation extraction from unstructured text, requires neither manual involvement nor an annotated corpus, and extracts hypernym-hyponym relations from unstructured text automatically.
To achieve the above objective, the present invention adopts the following technical solution:
A hypernym-hyponym relation extraction method based on the FP-Growth algorithm, comprising the following steps:
(1) input a set of unstructured texts from any domain;
(2) preprocess the given text set sentence by sentence;
(3) design formal templates for extracting hypernym-hyponym relations and match seed hypernym-hyponym relations;
(4) build the category expansion set of the seed relations and use the FP-Growth algorithm to extract a candidate set of hypernym-hyponym relations;
(5) filter the candidate set using PMI (pointwise mutual information);
(6) output the extracted hypernym-hyponym relations.
Step (2) comprises the following sub-steps:
(2-1) use the natural language processing tool HanLP to perform word segmentation on the given text set sentence by sentence, obtaining the segmented sentence set S_1;
(2-2) use HanLP to perform part-of-speech tagging on the sentence set S_1, obtaining the segmented and part-of-speech-tagged sentence set S_2.
Step (3) comprises the following sub-steps:
(3-1) according to the lexical and syntactic features of natural language, construct formal templates for extracting seed hypernym-hyponym relations, as shown in Table 1, where a1 and a2 denote hyponyms and b1 denotes a hypernym;
Table 1: Formal templates for extracting seed hypernym-hyponym relations
The templates are Chinese surface patterns, shown here as word-for-word English glosses; the slot order follows the Chinese.
No.  Template
1    a1 [is | belongs to | is | refers to] one [kind | class | a] b1
2    a1 [is | belongs to | is | refers to] b1 one [kind | class | a]
3    b1 [such as | such as | such as] {a1,}+ [and | or] a2 etc.
4    b1 [including] {a1,}+ [and | or] a2 etc.
(3-2) using regular expressions, match the above formal templates against the sentence set S_2 to obtain the seed hypernym-hyponym relation set Z = {(a_i, isa, b_j)}.
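The template matching of step (3) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's implementation: the Chinese trigger words are a back-translation of the English glosses in Table 1, the input is assumed to be one segmented sentence with tokens joined by single spaces, part-of-speech tags are ignored, and `match_seeds` is a hypothetical helper name. Only templates 1 and 3 are shown.

```python
import re

# Template 1 (assumed wording): a1 [是|属于|为|指] 一[种|类|个] b1
ISA = re.compile(r"(?P<hypo>\S+) (?:是|属于|为|指) 一 ?(?:种|类|个) (?P<hyper>\S+)")

# Template 3 (assumed wording): b1 [如|例如|比如] {a1,}+ [和|或] a2 等
SUCH_AS = re.compile(
    r"(?P<hyper>\S+) (?:如|例如|比如) "
    r"(?P<items>\S+(?: [、,,] \S+)*) (?:和|或) (?P<last>\S+) 等"
)


def match_seeds(tokenized_sentence):
    """Return (hyponym, 'isa', hypernym) triples matched in one sentence.

    `tokenized_sentence` is assumed to be a segmented sentence whose
    tokens are separated by single spaces.
    """
    relations = []
    m = ISA.search(tokenized_sentence)
    if m:
        relations.append((m["hypo"], "isa", m["hyper"]))
    m = SUCH_AS.search(tokenized_sentence)
    if m:
        # Split the enumerated hyponyms on the list separators.
        hyponyms = re.split(r" [、,,] ", m["items"]) + [m["last"]]
        relations.extend((h, "isa", m["hyper"]) for h in hyponyms)
    return relations
```

Calling `match_seeds("西瓜 是 一种 水果")` yields the single triple for 西瓜 (watermelon) and 水果 (fruit). In practice the patterns would probably also constrain the part-of-speech tags of the slots, which this sketch leaves out.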
Step (4) comprises the following sub-steps:
(4-1) using the seed relation set Z, build the category expansion set of seed hypernym-hyponym relations TSC_k = (hyper_k | hypo_k = {a_1^k, a_2^k, ..., a_n^k}), where hyper_k denotes a hypernym in a hypernym-hyponym relation, hypo_k denotes the set of hyponyms whose hypernym is hyper_k, and a_i^k denotes one specific hyponym of hyper_k;
(4-2) based on the FP-Growth algorithm of the paper "Mining frequent patterns without candidate generation", extract from the sentence set S_2 the frequent-itemset hyponym sets W = {a_1, ..., a_m; w_i; f} whose items co-occur with hyponyms in the hypo_k of the category expansion set, where m ∈ [0, n] is the number of hyponyms from hypo_k that co-occur in the frequent itemset, w_i is a candidate hyponym, and f is the number of times the frequent itemset occurs;
(4-3) set the threshold α = 5; when the frequency of a frequent itemset satisfies f ≥ α, obtain the candidate hypernym-hyponym relation set HX = {(w_i, isa, hyper_k)}.
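To make the frequent-itemset mining of step (4-2) concrete, here is a minimal Python sketch. It is a simplification, not FP-Growth proper: it grows patterns depth-first over projected transaction lists and therefore returns the same frequent itemsets and supports as FP-Growth, but it omits the FP-tree compression that makes FP-Growth efficient. Each transaction is assumed to be the list of words of one sentence from S_2, and `frequent_itemsets` is a hypothetical name.

```python
from collections import Counter


def frequent_itemsets(transactions, min_support):
    """Return {frozenset(items): support} for every itemset whose
    support (number of containing transactions) is >= min_support."""
    results = {}

    def grow(prefix, projected):
        # Support of prefix + {item} equals the item's count in the
        # projected database (transactions that contain the prefix).
        counts = Counter(item for t in projected for item in t)
        frequent = sorted(i for i, c in counts.items() if c >= min_support)
        for idx, item in enumerate(frequent):
            new_prefix = prefix | {item}
            results[new_prefix] = counts[item]
            # Project onto transactions containing `item`, keeping only
            # items later in the order so each itemset is produced once.
            later = set(frequent[idx + 1:])
            conditional = [
                [i for i in t if i in later] for t in projected if item in t
            ]
            grow(new_prefix, conditional)

    # Deduplicate words inside each sentence before mining.
    grow(frozenset(), [list(set(t)) for t in transactions])
    return results
```

With the threshold of step (4-3), the call would be `frequent_itemsets(sentences, 5)`; itemsets that mix known hyponyms of hyper_k with a new word then supply the candidates w_i.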
Step (5) comprises the following sub-steps:
(5-1) compute the pointwise mutual information of the hypernym and hyponym of each relation in the candidate relation set HX, as follows:
$$\mathrm{PMI}(V_i, V_j) = \log \frac{p(V_i, V_j)}{p(V_i)\, p(V_j)}$$
where p(V_i, V_j) is the probability that hypernym V_i and hyponym V_j co-occur in the corpus, p(V_i) is the probability that the hypernym occurs in the corpus, and p(V_j) is the probability that the hyponym occurs in the corpus;
(5-2) set the threshold β = 8 and traverse the candidate relation set HX; when the pointwise mutual information of hypernym V_i and hyponym V_j satisfies PMI(V_i, V_j) ≥ β, add the relation to the set Z;
(5-3) after completing step (5-2), return to step (4-1) and iterate the extraction until no new hypernym-hyponym relation is added to the set Z.
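The PMI filter of step (5) might look like the following Python sketch. Two points are assumptions, because the text does not specify them: probabilities are estimated as fractions of sentences containing a word (or both words), and the logarithm is taken to base 2. The names `pmi` and `filter_by_pmi` are hypothetical.

```python
import math


def pmi(sentences, hypernym, hyponym):
    """Pointwise mutual information of two words.

    `sentences` is a list of token collections (sets work best).
    Probabilities are sentence-level relative frequencies (assumption).
    """
    n = len(sentences)
    c_hyper = sum(1 for s in sentences if hypernym in s)
    c_hypo = sum(1 for s in sentences if hyponym in s)
    c_both = sum(1 for s in sentences if hypernym in s and hyponym in s)
    if c_both == 0:
        # The pair never co-occurs, so log(0) is treated as -infinity.
        return float("-inf")
    p_both = c_both / n
    return math.log2(p_both / ((c_hyper / n) * (c_hypo / n)))


def filter_by_pmi(hx, sentences, beta=8.0):
    """Keep the candidate triples (w, 'isa', hyper) whose PMI >= beta."""
    return {
        (w, rel, hyper)
        for (w, rel, hyper) in hx
        if pmi(sentences, hyper, w) >= beta
    }
```

The relations returned by `filter_by_pmi` are the ones that would be added to Z in step (5-2).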
Compared with existing methods, the present invention has the following features and advantages:
(1) during extraction, the category expansion set built from the seed relations lets the existing seed relations identify other hypernym-hyponym relations implied in the corpus, which avoids the low accuracy caused by using co-occurrence information directly;
(2) based on the FP-Growth algorithm, a large set of candidate hypernym-hyponym relations is extracted from the corpus, which addresses the low coverage of relation extraction from unstructured corpora;
(3) filtering the candidate set with PMI (pointwise mutual information) yields high-accuracy hypernym-hyponym relations, which further addresses the low accuracy of relation extraction from unstructured corpora. The invention requires neither manual involvement nor an annotated corpus, and automatically extracts hypernym-hyponym relations with high accuracy and high coverage from unstructured corpora.
Brief description of the drawings
Fig. 1 is a flow diagram of the hypernym-hyponym relation extraction method based on the FP-Growth algorithm according to the present invention.
Specific embodiment
The invention is further described below with reference to the drawing and an embodiment.
As shown in Fig. 1, the present invention proposes a hypernym-hyponym relation extraction method based on the FP-Growth algorithm. Taking unstructured data from the food domain as an example, 20,000 unstructured food-domain texts dated from May 2018 to December 2018 were crawled from the food science and technology website (https://www.tech-food.com/). The method comprises the following steps:
S1. Input the 20,000 unstructured food-domain texts;
S2. Preprocess the given text set sentence by sentence, with the following sub-steps:
S2.1. Use the natural language processing tool HanLP to perform word segmentation on the given text set sentence by sentence, obtaining the segmented sentence set S_1;
S2.2. Use HanLP to perform part-of-speech tagging on the sentence set S_1, obtaining the segmented and part-of-speech-tagged sentence set S_2; a specific example of sentence preprocessing is as follows:
S3. Design formal templates for extracting hypernym-hyponym relations and match seed relations, with the following sub-steps:
S3.1. According to the lexical and syntactic features of natural language, construct formal templates for extracting seed hypernym-hyponym relations, as shown in Table 2, where a1 and a2 denote hyponyms and b1 denotes a hypernym;
Table 2: Templates for extracting Chinese hypernym-hyponym relations
S3.2. Using regular expressions, match the above formal templates against the sentence set S_2 to obtain the seed relation set Z = {(a_i, isa, b_j)}; examples of seed relations obtained for the food domain are as follows:
(watermelon, isa, fruit), (wax gourd, isa, vegetables), (soybean oil, isa, edible oil), (white wine, isa, alcoholic beverages), etc.
S4. Build the category expansion set of the seed relations and use the FP-Growth algorithm to extract a candidate set of hypernym-hyponym relations, with the following sub-steps:
S4.1. Using the seed relation set Z, build the category expansion set of seed hypernym-hyponym relations TSC_k = (hyper_k | hypo_k = {a_1^k, a_2^k, ..., a_n^k}), where hyper_k denotes a hypernym in a hypernym-hyponym relation, hypo_k denotes the set of hyponyms whose hypernym is hyper_k, and a_i^k denotes one specific hyponym of hyper_k; for example:
TSC_1 = (hyper_1 = edible oil | hypo_1 = {soybean oil, peanut oil, olive oil}), TSC_2 = (hyper_2 = alcoholic beverages | hypo_2 = {white wine, beer, red wine}), etc.
S4.2. Based on the FP-Growth algorithm, extract from the sentence set S_2 the frequent-itemset hyponym sets W = {a_1, ..., a_m; w_i; f} whose items co-occur with hyponyms in the hypo_k of the category expansion set, where m ∈ [0, n] is the number of hyponyms from hypo_k that co-occur in the frequent itemset, w_i is a candidate hyponym, and f is the number of times the frequent itemset occurs; an example of a frequent-itemset hyponym set from the food domain is as follows:
W_1 = {soybean oil, peanut oil; sesame oil; 15}, where m = 2, w_1 = sesame oil, f = 15.
S4.3. Set the threshold α = 5; when the frequency of a frequent itemset satisfies f ≥ α, obtain the candidate hypernym-hyponym relation set HX = {(w_i, isa, hyper_k)}; an example of an obtained candidate relation is as follows:
HX = {(sesame oil, isa, edible oil)}.
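Steps S4.1 and S4.3 can be illustrated with the following Python sketch, which groups the seed relations by hypernym and then turns frequent itemsets into candidate relations. It rests on assumptions the text leaves open: the frequent itemsets arrive as a mapping from `frozenset` to frequency (for example from a separate FP-Growth run), an itemset must contain at least one known hyponym of hyper_k, and every remaining word in it is treated as a candidate w_i. The names `build_tsc` and `candidate_relations` are hypothetical.

```python
from collections import defaultdict


def build_tsc(z):
    """Group seed triples (hyponym, 'isa', hypernym) by hypernym,
    giving {hyper_k: hypo_k} as in the category expansion set."""
    tsc = defaultdict(set)
    for hyponym, _, hypernym in z:
        tsc[hypernym].add(hyponym)
    return dict(tsc)


def candidate_relations(frequent, tsc, alpha=5):
    """Derive the candidate set HX from frequent itemsets.

    `frequent` maps frozenset(words) -> frequency f.
    """
    hx = set()
    for items, f in frequent.items():
        if f < alpha:
            continue  # below the frequency threshold alpha
        for hypernym, hyponyms in tsc.items():
            seeds = items & hyponyms
            others = items - hyponyms - {hypernym}
            if seeds:  # assumption: at least one known hyponym co-occurs
                hx.update((w, "isa", hypernym) for w in others)
    return hx
```

Applied to the example above, the itemset {soybean oil, peanut oil, sesame oil} with f = 15 produces the candidate (sesame oil, isa, edible oil). A real pipeline would likely restrict the candidate words to nouns using the part-of-speech tags in S_2.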
S5. Filter the candidate set of hypernym-hyponym relations using PMI (pointwise mutual information), with the following sub-steps:
S5.1. Compute the pointwise mutual information of the hypernym and hyponym of each relation in the candidate relation set HX, as follows:
$$\mathrm{PMI}(V_i, V_j) = \log \frac{p(V_i, V_j)}{p(V_i)\, p(V_j)}$$
where p(V_i, V_j) is the probability that hypernym V_i and hyponym V_j co-occur in the corpus, p(V_i) is the probability that the hypernym occurs in the corpus, and p(V_j) is the probability that the hyponym occurs in the corpus; an example of the pointwise mutual information of a hypernym and hyponym in HX is as follows:
S5.2. Set the threshold β = 8 and traverse the candidate relation set HX; when the pointwise mutual information of hypernym V_i and hyponym V_j satisfies PMI(V_i, V_j) ≥ β, add the relation to the set Z; an example of the updated set Z is as follows:
Z = {(soybean oil, isa, edible oil), (peanut oil, isa, edible oil), ..., (sesame oil, isa, edible oil)}.
S5.3. After completing step S5.2, return to step S4.1 and iterate the extraction until no new hypernym-hyponym relation is added to the set Z.
S6. Output the extracted hypernym-hyponym relations.
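The iteration between S4.1 and S5.3 is a fixed-point loop over the relation set Z, which the following Python sketch illustrates. The candidate extraction and scoring steps are passed in as callables; `bootstrap`, `extract_candidates`, and `score` are hypothetical names, and the loop assumes that the pool of possible candidates is finite so that it terminates.

```python
def bootstrap(seed_relations, extract_candidates, score, beta):
    """Grow the relation set Z until an iteration adds nothing new.

    extract_candidates(z) -> iterable of candidate triples (step S4)
    score(triple)         -> filter score such as PMI (step S5.1)
    """
    z = set(seed_relations)
    while True:
        accepted = {
            c for c in extract_candidates(z)
            if c not in z and score(c) >= beta
        }
        if not accepted:
            return z  # no new relation was added: stop (step S5.3)
        z |= accepted  # the enlarged Z feeds the next round
```

Each round rebuilds the category expansion set from the enlarged Z, so hyponyms accepted in one round can help discover further hyponyms in the next.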

Claims (5)

1. A hypernym-hyponym relation extraction method based on the FP-Growth algorithm, characterized by comprising the following steps:
(1) input a set of unstructured texts from any domain;
(2) preprocess the given text set sentence by sentence;
(3) design formal templates for extracting hypernym-hyponym relations and match seed hypernym-hyponym relations;
(4) build the category expansion set of the seed relations and use the FP-Growth algorithm to extract a candidate set of hypernym-hyponym relations;
(5) filter the candidate set using PMI (pointwise mutual information);
(6) output the extracted hypernym-hyponym relations.
2. The hypernym-hyponym relation extraction method based on the FP-Growth algorithm according to claim 1, characterized in that step (2) comprises the following sub-steps:
(2-1) use the natural language processing tool HanLP to perform word segmentation on the given text set sentence by sentence, obtaining the segmented sentence set S_1;
(2-2) use HanLP to perform part-of-speech tagging on the sentence set S_1, obtaining the segmented and part-of-speech-tagged sentence set S_2.
3. The hypernym-hyponym relation extraction method based on the FP-Growth algorithm according to claim 1, characterized in that step (3) comprises the following sub-steps:
(3-1) according to the lexical and syntactic features of natural language, construct formal templates for extracting seed hypernym-hyponym relations;
(3-2) using regular expressions, match the above formal templates against the sentence set S_2 to obtain the seed hypernym-hyponym relation set Z = {(a_i, isa, b_j)}.
4. The hypernym-hyponym relation extraction method based on the FP-Growth algorithm according to claim 1, characterized in that step (4) comprises the following sub-steps:
(4-1) using the seed relation set Z, build the category expansion set of seed hypernym-hyponym relations TSC_k = (hyper_k | hypo_k = {a_1^k, a_2^k, ..., a_n^k}), where hyper_k denotes a hypernym in a hypernym-hyponym relation, hypo_k denotes the set of hyponyms whose hypernym is hyper_k, and a_i^k denotes one specific hyponym of hyper_k;
(4-2) based on the FP-Growth algorithm, extract from the sentence set S_2 the frequent-itemset hyponym sets W = {a_1, ..., a_m; w_i; f} whose items co-occur with hyponyms in the hypo_k of the category expansion set, where m ∈ [0, n] is the number of hyponyms from hypo_k that co-occur in the frequent itemset, w_i is a candidate hyponym, and f is the number of times the frequent itemset occurs;
(4-3) set the threshold α = 5; when the frequency of a frequent itemset satisfies f ≥ α, obtain the candidate hypernym-hyponym relation set HX = {(w_i, isa, hyper_k)}.
5. The hypernym-hyponym relation extraction method based on the FP-Growth algorithm according to claim 1, characterized in that step (5) comprises the following sub-steps:
(5-1) compute the pointwise mutual information of the hypernym and hyponym of each relation in the candidate relation set HX, as follows:
$$\mathrm{PMI}(V_i, V_j) = \log \frac{p(V_i, V_j)}{p(V_i)\, p(V_j)}$$
where p(V_i, V_j) is the probability that hypernym V_i and hyponym V_j co-occur in the corpus, p(V_i) is the probability that the hypernym occurs in the corpus, and p(V_j) is the probability that the hyponym occurs in the corpus;
(5-2) set the threshold β = 8 and traverse the candidate relation set HX; when the pointwise mutual information of hypernym V_i and hyponym V_j satisfies PMI(V_i, V_j) ≥ β, add the relation to the set Z;
(5-3) after completing step (5-2), return to step (4-1) and iterate the extraction until no new hypernym-hyponym relation is added to the set Z.
CN201910738173.6A 2019-08-12 2019-08-12 A kind of hyponymy abstracting method based on FP-Growth algorithm Pending CN110532548A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191203