CN106126545A - Distributed fission querying method and device - Google Patents

Info

Publication number
CN106126545A
Authority
CN
China
Prior art keywords
fission
word
pattern
file
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610425275.9A
Other languages
Chinese (zh)
Inventor
郭瑞
郭祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Intelligent Housekeeper Technology Co Ltd
Original Assignee
Beijing Intelligent Housekeeper Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a distributed fission query method and device. The method comprises: creating a fission word file and a fission pattern file in a preset distributed system development framework; finding the fission patterns to be placed in the fission pattern file according to first support information of a preset fission word for a fission pattern, and finding the fission words to be placed in the fission word file according to second support information of a word's fission pattern for that word; performing an iterative fission search based on the fission word file and the fission pattern file to determine a fission word set and a fission pattern set; and performing fission processing on a statement to be queried and obtaining a query result from the processing result. By creating the fission files in a distributed system development framework and determining the fission sets, the invention performs fission processing on the statement entered by a user and returns a query result, which shortens data mining time, improves query accuracy and query efficiency, and improves the user experience.

Description

Distributed fission querying method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a distributed fission query method and device.
Background
The rapid development of network, communication and computer technology has greatly promoted progress in artificial intelligence. As text sentiment analysis and natural language processing mature, using computers to analyze big data intelligently has become a major demand and trend of the Internet era. Against this background, speech processing and data mining are receiving increasing attention. Speech processing can recognize what a speaker says and convert it into text data, so that this continuously growing data can be stored persistently and provide the basis for subsequent data mining and, in turn, data querying.
At present, the offline data mining performed before data querying mostly processes data in a single thread and a single process. Because the data grows very rapidly, mining takes too long, is inefficient, and can even crash the system, which in turn affects the accuracy of subsequent user queries. In addition, prior-art methods that rely on manually formulated keywords and patterns and identify targets by pattern matching scale poorly and are difficult to apply at large scale. Methods that build fixed patterns from keywords and sentence structures and identify targets by computing similarity against those patterns need a large amount of labeled data; their effectiveness depends heavily on sample coverage, and the up-front investment is excessive.
Summary of the invention
To eliminate the drawbacks of the offline data mining performed in existing data querying (long mining time, poor scalability of the mining method, low mining efficiency, heavy dependence on sample coverage, excessive up-front investment, and reduced accuracy of user queries), the present invention proposes the following technical solution:
A distributed fission query method, comprising:
creating, in a preset distributed system development framework, a fission word file for storing fission words and a fission pattern file for storing fission patterns;
determining, from a preset corpus and a preset fission word, the fission pattern corresponding to the preset fission word, and adding the fission pattern to the fission pattern file according to first support information of the preset fission word for the fission pattern;
finding, in the preset corpus, sentences that match a fission pattern in the fission pattern file, extracting the word at the fission-word position in each such sentence, and adding the word to the fission word file according to second support information of the word's fission pattern for the word;
determining a fission word set and a fission pattern set from the fission words in the fission word file and the fission patterns in the fission pattern file;
performing fission processing on a statement to be queried according to the fission word set and the fission pattern set, to obtain a query result.
Optionally, before determining the fission word set and the fission pattern set from the fission words in the fission word file and the fission patterns in the fission pattern file, the method further comprises:
annotating a preset quantity of corpus in the distributed system development framework to obtain an annotation result;
Correspondingly, performing fission processing on the statement to be queried according to the fission word set and the fission pattern set, and obtaining a query result from the processing result, comprises:
performing fission processing on the statement to be queried according to the annotation result, the fission word set and the fission pattern set, and obtaining a query result from the processing result.
Optionally, the distributed system development framework includes, but is not limited to, the distributed system infrastructure Hadoop.
Optionally, annotating the preset quantity of corpus in the distributed system development framework comprises:
using Hadoop map/reduce to call a word segmentation program that segments the sentences in the corpus and annotates them by part of speech.
Optionally, determining the fission word set and the fission pattern set from the fission words in the fission word file and the fission patterns in the fission pattern file comprises:
mining new fission patterns from the fission words in the fission word file and the annotation result obtained by annotating the preset quantity of corpus, and putting the new fission patterns into the fission pattern file;
mining new fission words from the fission patterns in the fission pattern file and the corpus, and putting the new fission words into the fission word file;
alternately repeating the above steps of mining new fission patterns and mining new fission words until no new fission pattern or new fission word appears, and taking the resulting fission word file and fission pattern file as the fission word set and the fission pattern set.
A distributed fission query device, comprising:
a file initialization unit, configured to create, in a preset distributed system development framework, a fission word file for storing fission words and a fission pattern file for storing fission patterns;
a fission pattern file determining unit, configured to determine, from a preset corpus and a preset fission word, the fission pattern corresponding to the preset fission word, and to add the fission pattern to the fission pattern file according to first support information of the preset fission word for the fission pattern;
a fission word file determining unit, configured to find, in the preset corpus, sentences that match a fission pattern in the fission pattern file, to extract the word at the fission-word position in each such sentence, and to add the word to the fission word file according to second support information of the word's fission pattern for the word;
a set determining unit, configured to determine a fission word set and a fission pattern set from the fission words in the fission word file and the fission patterns in the fission pattern file;
a fission query unit, configured to perform fission processing on a statement to be queried according to the fission word set and the fission pattern set, to obtain a query result.
Optionally, the device further comprises:
an annotation unit, configured to annotate a preset quantity of corpus in the distributed system development framework to obtain an annotation result;
Correspondingly, the fission query unit is further configured to perform fission processing on the statement to be queried according to the annotation result, the fission word set and the fission pattern set, and to obtain a query result from the processing result.
Optionally, the distributed system development framework includes, but is not limited to, the distributed system infrastructure Hadoop.
Optionally, the annotation unit is further configured to use Hadoop map/reduce to call a word segmentation program that segments the sentences in the corpus and annotates them by part of speech.
Optionally, the set determining unit is further configured to:
mine new fission patterns from the fission words in the fission word file and the annotation result obtained by annotating the preset quantity of corpus, and put the new fission patterns into the fission pattern file;
mine new fission words from the fission patterns in the fission pattern file and the corpus, and put the new fission words into the fission word file;
alternately repeat the above steps of mining new fission patterns and mining new fission words until no new fission pattern or new fission word appears, and take the resulting fission word file and fission pattern file as the fission word set and the fission pattern set.
The distributed fission query method and device of the present invention determine the fission word set and the fission pattern set based on a distributed system development framework and a support information calculation method, so that the statement entered by a user can be matched against fission patterns using these sets and a query result returned. This eliminates the drawbacks of the offline data mining performed in existing data querying (long mining time, poor scalability of the mining method, low mining efficiency, heavy dependence on sample coverage, excessive up-front investment, and reduced accuracy of user queries). It improves the efficiency of offline data mining and shortens mining time, which in turn improves query accuracy and query efficiency and improves the user experience.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a distributed fission query method provided by one embodiment of the present invention;
Fig. 2 is a flowchart of a method for determining the fission word set and the fission pattern set provided by one embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a distributed fission query device provided by one embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a distributed fission query device provided by another embodiment of the present invention.
Detailed description
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a flowchart of a distributed fission query method provided by one embodiment of the present invention. As shown in Fig. 1, the method includes:
S1: creating, in a preset distributed system development framework, a fission word file for storing fission words and a fission pattern file for storing fission patterns;
Here, a fission word is a keyword used to divide a statement into a fission pattern, for example a [verb + pronoun] or [noun] such as "why", "is what", "is who" or "Liu Dehua". A fission pattern is the pattern obtained by dividing a statement according to the part of speech of the keyword it contains and the parts of speech of its other words. For example, for the statement "Liu Dehua is who", if its fission word is "is who", its fission pattern can be identified as "[name] + [fission word]".
Note that a preset number of fission words are added to the fission word file as soon as it is created, to serve as the initial corpus. No fission pattern needs to be added to the fission pattern file after it is created; the file is left empty to store the fission patterns obtained later through iterative processing.
In addition, the distributed system development framework may be any one of the Hadoop, Spark and Storm distributed frameworks, or any other framework capable of distributed system development; the present invention does not limit this.
S2: determining, from a preset corpus and a preset fission word, the fission pattern corresponding to the preset fission word, and adding the fission pattern to the fission pattern file according to first support information of the preset fission word for the fission pattern;
Specifically, the fission pattern of each sentence that contains a preset fission word (such as [name] + [fission word]) is determined from the parts of speech of the words in each sentence of the preset corpus (such as [name], [verb], [pronoun]) and from the sentences that contain the preset fission word. The fission pattern is then added to the fission pattern file according to the first support information of the preset fission word for the fission pattern (including information such as support, confidence, information gain and chi-square);
S3: finding, in the preset corpus, sentences that match a fission pattern in the fission pattern file, extracting the word at the fission-word position in each such sentence, and adding the word to the fission word file according to second support information of the word's fission pattern for the word;
Specifically, sentences containing any fission pattern in the fission pattern set are extracted from the preset corpus, and the word at the fission-word position in each sentence is obtained. That word is then added to the fission word file as a fission word according to the second support information of its fission pattern for the word (including information such as support, confidence, information gain and chi-square);
S4: determining a fission word set and a fission pattern set from the fission words in the fission word file and the fission patterns in the fission pattern file (for example, by an iterative fission search, i.e. repeating steps S2 to S3 until no new fission word or fission pattern appears);
S5: performing fission processing on a statement to be queried according to the fission word set and the fission pattern set, and obtaining a query result from the processing result.
Specifically, fission processing is performed on the statement to be queried using the final fission word set and the final fission pattern set, and a query result is obtained from the processing result and returned.
The distributed fission query method of this embodiment determines the fission word set and the fission pattern set based on a distributed system development framework and a support information calculation method, performs fission processing on the statement entered by a user using these sets, and returns a query result. This eliminates the drawbacks of the offline data mining performed in existing data querying (long mining time, poor scalability of the mining method, low mining efficiency, heavy dependence on sample coverage, excessive up-front investment, and reduced accuracy of user queries), improves the efficiency of offline data mining, shortens mining time, and in turn improves query accuracy, query efficiency and the user experience.
Further, preferably in this embodiment, step S2 may include:
S21: generating the segmentation pattern of each sentence from the parts of speech of the words in each sentence of the preset corpus, extracting the sentences in the preset corpus that contain a preset fission word, and converting the segmentation pattern of each such sentence into a fission pattern according to the preset fission word;
Here, the parts of speech of the words in each sentence include noun, verb, pronoun and so on. On this basis, the segmentation patterns generated from the parts of speech of the words in each sentence are, for example, [name], [verb] [pronoun], [name] [verb] [pronoun], and so on.
Specifically, a word segmentation program (for example, one called from Hadoop map/reduce) can segment each sentence in the preset corpus and annotate the entities by part of speech, to generate the segmentation pattern of each sentence. A fission word is a keyword used to divide a statement into a fission pattern, for example a [verb + pronoun] or [noun] such as "why", "is what", "is who" or "Liu Dehua". A fission pattern is the pattern obtained by dividing a statement according to the part of speech of the keyword it contains and the parts of speech of its other words. For example, for the statement "Liu Dehua is who", if its fission word is "is who", its fission pattern can be identified as "[name] + [fission word]".
Specifically, for example, "is who" is added to the fission word set, and the corpus is scanned for original sentences containing "is who", yielding each original sentence and its pattern:
Liu Dehua is who [name] [verb] [pronoun], where [verb] [pronoun] = [fission word], so that the resulting pattern "[name] [verb] [pronoun]" is added to the fission pattern set.
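The conversion in step S21 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the representation of a tagged sentence as (token, tag) pairs, the three-part pattern tuple, and the function name `to_fission_pattern` are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Tagged = List[Tuple[str, str]]   # (token, part-of-speech tag) pairs
# Assumed pattern layout: (tags before the slot, tags in the slot, tags after the slot)
Pattern = Tuple[Tuple[str, ...], Tuple[str, ...], Tuple[str, ...]]

def to_fission_pattern(tagged: Tagged, fission_word: Tuple[str, ...]) -> Optional[Pattern]:
    """Locate the fission word in a tagged sentence and split the tag sequence around it."""
    tokens = [tok for tok, _ in tagged]
    tags = [tag for _, tag in tagged]
    n = len(fission_word)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == fission_word:
            return tuple(tags[:i]), tuple(tags[i:i + n]), tuple(tags[i + n:])
    return None  # the sentence does not contain the fission word

sentence = [("Liu Dehua", "name"), ("is", "verb"), ("who", "pronoun")]
pattern = to_fission_pattern(sentence, ("is", "who"))
print(pattern)
```

Keeping the slot's own tags ([verb] [pronoun]) alongside the surrounding tags lets the same structure stand for both notations used in the text, "[name] [verb] [pronoun]" and "[name] + [fission word]".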
S22: calculating a first confidence of the fission pattern and a first support of the preset fission word for the fission pattern, and adding the fission pattern to the fission pattern file according to the first confidence and the first support.
Specifically, the first confidence of the fission pattern and the first support of the preset fission word for the fission pattern are compared with a confidence threshold and a support threshold respectively, and the fission pattern is added to the fission pattern file when both exceed their thresholds.
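The threshold check in step S22 might look like the sketch below. The text does not give formulas for support or confidence, so the definitions here are assumptions borrowed from association-rule mining: support is the number of sentences with a given tag sequence that contain the fission word, and confidence is that count divided by the number of sentences with the tag sequence. The function names and the toy corpus are also hypothetical.

```python
from collections import Counter

def contains(tokens, word):
    """True if the token sequence `word` occurs contiguously in `tokens`."""
    n = len(word)
    return any(tuple(tokens[i:i + n]) == word for i in range(len(tokens) - n + 1))

def accepted_patterns(corpus, word, min_support, min_confidence):
    """Return {tag sequence: (support, confidence)} for patterns above both thresholds."""
    total = Counter()   # sentences per tag sequence
    hits = Counter()    # sentences per tag sequence that contain the fission word
    for tagged in corpus:
        tokens = [tok for tok, _ in tagged]
        tags = tuple(tag for _, tag in tagged)
        total[tags] += 1
        if contains(tokens, word):
            hits[tags] += 1
    result = {}
    for tags, support in hits.items():
        confidence = support / total[tags]   # assumed definition, see lead-in
        if support > min_support and confidence > min_confidence:
            result[tags] = (support, confidence)
    return result

corpus = [
    [("Liu Dehua", "name"), ("is", "verb"), ("who", "pronoun")],
    [("Liang Chaowei", "name"), ("is", "verb"), ("who", "pronoun")],
    [("Liang Chaowei", "name"), ("is", "verb"), ("where", "pronoun")],
    [("Liu Dehua", "name"), ("sings", "verb")],
]
accepted = accepted_patterns(corpus, ("is", "who"), min_support=1, min_confidence=0.5)
print(accepted)
```

Both comparisons are strict, matching the requirement that the confidence and the support each exceed their threshold.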
Further, preferably in this embodiment, step S3 may also include:
S31: extracting, from the preset corpus, the sentences that contain any fission pattern in the fission pattern file, and extracting the word at the fission-word position in each such sentence;
Specifically, for example, the preset corpus is scanned for a sentence matching the fission pattern "[name] [verb] [pronoun]" in the fission pattern file, such as "Liang Chaowei is where". As noted above, [verb] [pronoun] = [fission word], so the new word "is where" at the fission-word position can be extracted.
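A minimal sketch of the extraction in step S31 follows. It assumes a fission pattern is stored as three tag tuples (before the slot, the slot itself, after the slot) and that a sentence matches when its whole tag sequence equals their concatenation; both the layout and the name `extract_slot_words` are illustrative choices, not taken from the patent.

```python
def extract_slot_words(corpus, pattern):
    """Collect the tokens found at the fission-word position of each matching sentence."""
    before, slot, after = pattern
    expected = before + slot + after
    start, end = len(before), len(before) + len(slot)
    found = []
    for tagged in corpus:
        tags = tuple(tag for _, tag in tagged)
        if tags == expected:   # the whole sentence must fit the pattern
            found.append(tuple(tok for tok, _ in tagged[start:end]))
    return found

corpus = [
    [("Liu Dehua", "name"), ("is", "verb"), ("who", "pronoun")],
    [("Liang Chaowei", "name"), ("is", "verb"), ("where", "pronoun")],
    [("Liu Dehua", "name"), ("sings", "verb")],
]
pattern = (("name",), ("verb", "pronoun"), ())
known_words = {("is", "who")}

candidates = extract_slot_words(corpus, pattern)
new_words = [w for w in candidates if w not in known_words]
print(new_words)
```

The candidates that are not yet known would then go through the confidence and support check before being added to the fission word file.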
S32: calculating a second confidence of the word at the fission-word position and a second support of the word's fission pattern for the word, and adding the word to the fission word file as a fission word according to the second confidence and the second support.
Specifically, the second confidence of the word at the fission-word position and the second support of the word's fission pattern for the word are compared with a confidence threshold and a support threshold respectively, and the word is added to the fission word file as a new fission word when both exceed their thresholds.
Preferably in this embodiment, before step S4 the method may further include:
S40: annotating a preset quantity of corpus in the distributed system development framework to obtain an annotation result;
Correspondingly, performing fission processing on the statement to be queried according to the fission word set and the fission pattern set, and obtaining a query result from the processing result, includes:
performing fission processing on the statement to be queried according to the annotation result, the fission word set and the fission pattern set, and obtaining a query result from the processing result.
Preferably in this embodiment, annotating the preset quantity of corpus in the distributed system development framework includes:
using Hadoop map/reduce to call a word segmentation program that segments the sentences in the corpus and annotates them by part of speech. Specifically, for the preset corpus, Hadoop map/reduce calls the word segmentation program to segment the sentences in the corpus and tag their parts of speech, and the output format is original sentence + pattern, for example "Liu Dehua is who" [name] [verb] [pronoun].
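The annotation output format (original sentence + pattern) can be illustrated with a toy tagger. The patent does not name the word segmentation program, so the greedy longest-match segmenter, the `LEXICON` table and the tab-separated record layout below are all hypothetical stand-ins; a real deployment would call an actual segmenter from the map phase.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon; multi-word entries allow names such as "Liu Dehua".
LEXICON: Dict[Tuple[str, ...], str] = {
    ("Liu", "Dehua"): "name",
    ("is",): "verb",
    ("who",): "pronoun",
}

def segment_and_tag(sentence: str) -> List[Tuple[str, str]]:
    """Greedy longest-match segmentation over whitespace-separated words."""
    words = sentence.split()
    tagged, i = [], 0
    while i < len(words):
        for n in range(len(words) - i, 0, -1):   # try the longest span first
            key = tuple(words[i:i + n])
            if key in LEXICON:
                tagged.append((" ".join(key), LEXICON[key]))
                i += n
                break
        else:                                    # no lexicon entry starts here
            tagged.append((words[i], "unknown"))
            i += 1
    return tagged

def to_record(sentence: str) -> str:
    """Format one output line as original sentence + pattern (tab-separated, assumed)."""
    pattern = "".join(f"[{tag}]" for _, tag in segment_and_tag(sentence))
    return f"{sentence}\t{pattern}"

record = to_record("Liu Dehua is who")
print(record)
```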
Fig. 2 is a flowchart of a method for determining the fission word set and the fission pattern set provided by one embodiment of the present invention. As shown in Fig. 2, building on the previous embodiment, determining the fission word set and the fission pattern set in step S4 from the fission words in the fission word file and the fission patterns in the fission pattern file may further include:
S41: mining new fission patterns from the fission words in the fission word file and the annotation result obtained by annotating the preset quantity of corpus, and putting the new fission patterns into the fission pattern file;
Specifically, a "word-pattern" Hadoop task is built in the distributed system development framework, for example Hadoop. The details of its map and reduce phases are:
Map: takes the original sentence + pattern records and the fission words as input, emits output keyed by fission word, and configures partitioning (bucketing) and sorting with the word as the key;
Reduce: because bucketing is by word, data with the same fission word are routed to one handler, which outputs the sentences containing the fission word together with their patterns. For example, "Liu Dehua is who" contains "is who", so the output is:
Liu Dehua is who [name] + [fission word].
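The "word-pattern" job can be simulated locally to show the data flow. The sketch below is plain Python, not the Hadoop API: `map_phase`, `shuffle` and `reduce_phase` are hypothetical names, and `shuffle` stands in for the bucketing and sorting by key that the framework performs between the two phases.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[Tuple[str, ...], Tuple[str, ...]]   # (tokens, part-of-speech tags)
Word = Tuple[str, ...]
Value = Tuple[str, Tuple[str, ...]]                # (sentence, fission pattern)

def map_phase(record: Record, fission_words: Iterable[Word]) -> Iterator[Tuple[Word, Value]]:
    """Emit (fission word, (sentence, pattern)) for each fission word the sentence contains."""
    tokens, tags = record
    for word in fission_words:
        n = len(word)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == word:
                pattern = tags[:i] + ("fission word",) + tags[i + n:]
                yield word, (" ".join(tokens), pattern)
                break

def shuffle(pairs) -> List[Tuple[Word, List[Value]]]:
    """Group values by key and sort the keys, as the framework would between phases."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return sorted(buckets.items())

def reduce_phase(word: Word, values: List[Value]) -> List[str]:
    """All records for one fission word arrive together; output sentence + pattern lines."""
    return [f"{sentence}\t" + "+".join(f"[{t}]" for t in pattern) for sentence, pattern in values]

records = [
    (("Liu Dehua", "is", "who"), ("name", "verb", "pronoun")),
    (("Liang Chaowei", "is", "where"), ("name", "verb", "pronoun")),
]
fission_words = [("is", "who")]
pairs = [kv for record in records for kv in map_phase(record, fission_words)]
output = [line for word, values in shuffle(pairs) for line in reduce_phase(word, values)]
print(output)
```

Keying by fission word means that every sentence containing the same word reaches one reducer, which is what lets a single handler collect all patterns for that word.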
S42: mining new fission words from the fission patterns in the fission pattern file and the corpus, and putting the new fission words into the fission word file;
Specifically, a "pattern-word" Hadoop task is built in Hadoop. The details of its map and reduce phases are:
Map: takes the result of step S21 and the original sentences as input, emits output keyed by pattern, and configures partitioning (bucketing) and sorting with the pattern as the key.
Reduce: because bucketing is by pattern, data with the same pattern are routed to one handler, which outputs the sentences that contain the pattern. For example, "Liu Dehua is where" matches the pattern obtained from "is who", and the output is:
Liu Dehua is where [name] + "is where";
This result is output as original sentence + pattern.
S43: alternately repeating the above steps of mining new fission patterns and mining new fission words until no new fission pattern or new fission word appears, and taking the resulting fission word file and fission pattern file as the fission word set and the fission pattern set.
Specifically, the result output after step S42, i.e. original sentence + pattern, is fed back to step S41, and steps S41 to S42 are repeated until the result converges and no new data is output.
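The alternation of steps S41 and S42 is a fixed-point iteration, which the following single-process sketch illustrates. It is a simplification under stated assumptions: the support and confidence thresholds are omitted, patterns use a hypothetical (before, slot, after) tag layout, and the function names are invented for the example.

```python
def find(tokens, word):
    """Index of the first contiguous occurrence of `word` in `tokens`, or -1."""
    n = len(word)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == word:
            return i
    return -1

def mine_patterns(corpus, words):
    """S41 (simplified): derive a pattern from every sentence containing a known word."""
    patterns = set()
    for tokens, tags in corpus:
        for word in words:
            i = find(tokens, word)
            if i >= 0:
                n = len(word)
                patterns.add((tags[:i], tags[i:i + n], tags[i + n:]))
    return patterns

def mine_words(corpus, patterns):
    """S42 (simplified): take the slot tokens of every sentence matching a known pattern."""
    words = set()
    for tokens, tags in corpus:
        for before, slot, after in patterns:
            if tags == before + slot + after:
                words.add(tokens[len(before):len(before) + len(slot)])
    return words

def fission(corpus, seed_words):
    """Alternate S41 and S42 until neither step produces anything new."""
    words, patterns = set(seed_words), set()
    while True:
        new_patterns = mine_patterns(corpus, words) - patterns
        patterns |= new_patterns
        new_words = mine_words(corpus, patterns) - words
        words |= new_words
        if not new_patterns and not new_words:
            return words, patterns

corpus = [
    (("Liu Dehua", "is", "who"), ("name", "verb", "pronoun")),
    (("Liang Chaowei", "is", "where"), ("name", "verb", "pronoun")),
    (("Beijing", "is", "where"), ("place", "verb", "pronoun")),
]
words, patterns = fission(corpus, {("is", "who")})
print(sorted(words), sorted(patterns))
```

On this toy corpus the seed "is who" yields the [name] pattern, that pattern yields "is where", and "is where" in turn yields the [place] pattern, after which a further pass finds nothing new and the loop stops.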
The distributed fission query method of this embodiment determines the fission word set and the fission pattern set based on the Hadoop distributed system development framework and the support information calculation method. It enables multi-threaded, multi-process data querying, shortens data processing time, and improves the efficiency of data mining and data querying.
Fig. 3 is a schematic structural diagram of a distributed fission query device provided by one embodiment of the present invention. As shown in Fig. 3, the device includes:
a file initialization unit 10, configured to create, in a preset distributed system development framework, a fission word file for storing fission words and a fission pattern file for storing fission patterns;
Here, a fission word is a keyword used to divide a statement into a fission pattern, for example a [verb + pronoun] or [noun] such as "why", "is what", "is who" or "Liu Dehua". A fission pattern is the pattern obtained by dividing a statement according to the part of speech of the keyword it contains and the parts of speech of its other words. For example, for the statement "Liu Dehua is who", if its fission word is "is who", its fission pattern can be identified as "[name] + [fission word]".
Note that a preset number of fission words are added to the fission word file as soon as it is created, to serve as the initial corpus. No fission pattern needs to be added to the fission pattern file after it is created; the file is left empty to store the fission patterns obtained later through iterative processing.
In addition, the distributed system development framework may be any one of the Hadoop, Spark and Storm distributed frameworks, or any other framework capable of distributed system development; the present invention does not limit this.
Fission schema file determines unit 20, for determining described default fission according to default language material and default fission word The fission pattern that word is corresponding, and support that information is by described fission mould according to described default fission word to the first of described fission pattern Formula adds described fission schema file;
Specifically, the part of speech of the word comprised according to sentence each in default language material is (such as [name], [verb], [generation Word] etc.) and comprise preset the sentence of fission word determine described in comprise and preset the fission pattern of sentence of fission word (such as [name] + [fission word] etc.), and according to described default fission word, the first support information of described fission pattern (is included support, confidence The information such as degree, information gain and card side) described fission pattern is joined in fission schema file;
A fission word file determining unit 30, configured to find, in the preset corpus, sentences that match a fission pattern in the fission schema file, to extract the word at the fission-word position of each such sentence, and to add that word to the fission word file according to the second support information of the word's fission pattern for the word;
Specifically, sentences containing any fission pattern in the fission pattern set are extracted from the preset corpus, the word at the fission-word position of each such sentence is obtained, and that word is added to the fission word file as a fission word according to the second support information of its fission pattern for the word (including support, confidence, information gain, chi-square and similar information);
A set determining unit 40, configured to determine the fission word set and the fission pattern set according to the fission words in the fission word file and the fission patterns in the fission schema file;
A fission query unit 50, configured to perform fission processing on a statement to be queried according to the fission word set and the fission pattern set, so as to obtain a query result.
Specifically, the statement to be queried is fission-processed according to the final fission word set and the final fission pattern set, and the query result is obtained from the processing result and returned.
The distributed fission query device of this embodiment may be used to perform the above method embodiment; its principle and technical effect are similar and are not repeated here.
Fig. 4 is a schematic structural diagram of the distributed fission query device provided by another embodiment of the present invention. As shown in Fig. 4, as a preferred implementation of this embodiment, the device may further include:
An annotation unit 60, configured to annotate a preset quantity of corpus data in the distributed system development framework, so as to obtain annotation results;
Correspondingly, the fission query unit 50 may be further configured to perform fission processing on the statement to be queried according to the annotation results, the fission word set and the fission pattern set, and to obtain the query result according to the processing result.
As a preferred implementation of this embodiment, the annotation unit 60 may be further configured to use Hadoop map/reduce to call a word segmentation program that segments the statements in the corpus into words and annotates them by part of speech.
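The annotation step can be pictured with the following Python sketch of a map function that takes one sentence per input line and emits one tab-separated record. The toy lexicon, the greedy longest-match segmenter and the record layout are all assumptions standing in for the real word segmentation program, which the text does not specify; the sentences are written in segmented token order.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Toy lexicon standing in for a real word segmentation program (assumption).
LEXICON: Dict[str, str] = {
    "Liu Dehua": "name",
    "Liang Chaowei": "name",
    "alpaca": "animal",
    "is at": "verb",
    "is": "verb",
    "whom": "pronoun",
    "which": "pronoun",
    "what": "pronoun",
}


def segment_and_tag(sentence: str) -> List[Tuple[str, str]]:
    """Greedy longest-match segmentation over whitespace-separated words."""
    words = sentence.split()
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest candidate first
            candidate = " ".join(words[i:j])
            if candidate in LEXICON:
                tagged.append((candidate, LEXICON[candidate]))
                i = j
                break
        else:  # no lexicon entry starts at position i
            tagged.append((words[i], "unknown"))
            i += 1
    return tagged


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Map step: emit sentence, tagged sentence and tag sequence, tab-separated."""
    for line in lines:
        sentence = line.strip()
        if sentence:
            tagged = segment_and_tag(sentence)
            tagged_text = " ".join(f"{w} [{t}]" for w, t in tagged)
            pattern = " ".join(f"[{t}]" for _, t in tagged)
            yield f"{sentence}\t{tagged_text}\t{pattern}"
```

Because each line is processed independently, a function of this shape can be run in parallel over shards of the corpus, which is the property the map/reduce arrangement relies on.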
As a preferred implementation of this embodiment, the set determining unit 40 is further configured to:
mine new fission patterns according to the fission words in the fission word file and the annotation results obtained by annotating the preset quantity of corpus data, and put the new fission patterns into the fission schema file;
mine new fission words according to the fission patterns in the fission schema file and the corpus, and put the new fission words into the fission word file;
repeat the above steps of mining new fission patterns and mining new fission words alternately until no new fission pattern or new fission word appears, and take the resulting fission word file and fission schema file as the fission word set and the fission pattern set.
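The alternating procedure above can be sketched as a loop that runs until neither set grows. This is a simplified illustration: the data representation and the names `mine_patterns`, `mine_words` and `bootstrap` are assumptions, and the support-information filtering described elsewhere in the text is deliberately left out.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]                     # (word, part-of-speech tag)
Word = Tuple[str, ...]                      # fission word as a token sequence
Pattern = Tuple[Tuple[str, ...], int, int]  # (tag sequence, start, end of fission word)


def mine_patterns(corpus: List[List[Token]], words: Set[Word]) -> Set[Pattern]:
    """Patterns of all sentences that contain a known fission word."""
    found: Set[Pattern] = set()
    for sentence in corpus:
        toks = tuple(w for w, _ in sentence)
        tags = tuple(t for _, t in sentence)
        for word in words:
            n = len(word)
            for i in range(len(toks) - n + 1):
                if toks[i:i + n] == word:
                    found.add((tags, i, i + n))
    return found


def mine_words(corpus: List[List[Token]], patterns: Set[Pattern]) -> Set[Word]:
    """Words found at the fission-word position of sentences matching a pattern."""
    found: Set[Word] = set()
    for sentence in corpus:
        toks = tuple(w for w, _ in sentence)
        tags = tuple(t for _, t in sentence)
        for p_tags, start, end in patterns:
            if tags == p_tags:
                found.add(toks[start:end])
    return found


def bootstrap(corpus: List[List[Token]], seed_words: Set[Word]) -> Tuple[Set[Word], Set[Pattern]]:
    """Alternate the two mining steps until no new pattern or word appears."""
    words, patterns = set(seed_words), set()
    while True:
        new_patterns = mine_patterns(corpus, words) - patterns
        patterns |= new_patterns
        new_words = mine_words(corpus, patterns) - words
        words |= new_words
        if not new_patterns and not new_words:
            return words, patterns
```

The loop always terminates on a finite corpus, because each round either adds at least one new word or pattern drawn from that corpus or stops.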
This embodiment creates the fission word file and the fission schema file on the basis of the Hadoop distributed system development framework and determines the fission word set and the fission pattern set. It can therefore perform multi-threaded, multi-process data queries, which shortens data processing time and improves the efficiency of data mining and data querying.
The present invention is described below by means of a specific embodiment, which does not limit the scope of protection of the present invention. The steps of the distributed fission query method of this embodiment are as follows:
1. Preprocess the corpus with Hadoop; the result is shown in Table 1 below:
Original sentence | Sentence after word segmentation and annotation | Fission pattern
whom Liu Dehua is | Liu Dehua [name] is [verb] whom [pronoun] | [name] [verb] [pronoun]
Liang Chaowei is at which | Liang Chaowei [name] is at [verb] which [pronoun] | [name] [verb] [pronoun]
what alpaca is | alpaca [animal] is [verb] what [pronoun] | [animal] [verb] [pronoun]
2. Create a new fission schema file in the Hadoop file system and leave it empty;
3. Create a new fission word file in the Hadoop file system and add a fission word to it, for example "whom is";
4. Write the Hadoop task "fission word to fission pattern":
Map: take the data of step 1 and step 3 as input, with the word as the key;
Reduce: "whom Liu Dehua is" hits "whom is", so the output is:
whom Liu Dehua is [name] [verb] [pronoun] ([verb] [pronoun] = [fission word]);
"Liang Chaowei is at which" and "what alpaca is" do not hit a fission word, so they produce no output;
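Step 4 can be simulated in plain Python as a reduce-side join keyed by word sequence. This is only a sketch under stated assumptions: no Hadoop API is used, the shuffle is imitated with a dictionary, the fission word "whom is" is mapped onto the segmented tokens ("is", "whom"), and keying by every contiguous word sequence is one possible reading of "with the word as the key".

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Step 1 data: sentence label -> tagged tokens in segmented order (assumed layout).
CORPUS: Dict[str, List[Tuple[str, str]]] = {
    "whom Liu Dehua is": [("Liu Dehua", "name"), ("is", "verb"), ("whom", "pronoun")],
    "Liang Chaowei is at which": [("Liang Chaowei", "name"), ("is at", "verb"), ("which", "pronoun")],
    "what alpaca is": [("alpaca", "animal"), ("is", "verb"), ("what", "pronoun")],
}
# Step 3 data: the seed fission word and its assumed token sequence.
FISSION_WORDS: Dict[str, Tuple[str, ...]] = {"whom is": ("is", "whom")}


def map_phase() -> Iterator[Tuple[Tuple[str, ...], tuple]]:
    """Emit (key, value) pairs keyed by word sequence."""
    for name, tokens in FISSION_WORDS.items():
        yield tokens, ("FISSION_WORD", name)
    for sentence, tagged in CORPUS.items():
        words = tuple(w for w, _ in tagged)
        tags = tuple(f"[{t}]" for _, t in tagged)
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                yield words[i:j], ("SENTENCE", sentence, tags, i, j)


def reduce_phase(pairs) -> Iterator[Tuple[str, str, str]]:
    """Output sentences whose key also carries a fission-word record."""
    groups = defaultdict(list)  # stands in for the shuffle
    for key, value in pairs:
        groups[key].append(value)
    for values in groups.values():
        if not any(v[0] == "FISSION_WORD" for v in values):
            continue  # no fission word under this key: no output
        for v in values:
            if v[0] == "SENTENCE":
                _, sentence, tags, i, j = v
                yield sentence, " ".join(tags), " ".join(tags[i:j]) + "=[fission word]"


results = list(reduce_phase(map_phase()))
```

With the three sentences of Table 1 and the single seed word, only "whom Liu Dehua is" shares a key with a fission word, so it is the only sentence that produces output.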
5. Write the Hadoop task "fission pattern to fission word":
Map: take the data of step 4 and the data of step 1 as input, with the pattern as the key;
Reduce: match by pattern; the final output is:
"whom Liu Dehua is" "[name] [verb] [pronoun]";
"Liang Chaowei is at which" "[name] [verb] [pronoun]";
6. Set a support-information threshold empirically; when the support information of a fission word exceeds this threshold, add the fission word to the fission word set;
The support information includes support, confidence, information gain, chi-square and similar information. Taking support and confidence as an example:
Set the confidence threshold to 0.3 and the support threshold to 0.3, and write a Hadoop task:
Map: take the data of steps 4 and 5 as input, in the following format:
[name] | [name] "at which" | Liang Chaowei is at which
[name] | [name] [verb] [pronoun] | Liang Chaowei is at which
[name] | [name] "whom is" | whom Liu Dehua is
[name] | [name] [verb] [pronoun] | whom Liu Dehua is
[animal] | [animal] [verb] [pronoun] | what alpaca is
The data received by Reduce is:
[name] | [name] "at which" | Liang Chaowei is at which
[name] | [name] "whom is" | whom Liu Dehua is
[name] | [name] [verb] [pronoun] | Liang Chaowei is at which
[name] | [name] [verb] [pronoun] | whom Liu Dehua is
[animal] | [animal] [verb] [pronoun] | what alpaca is
Calculation:
The confidence of [name] [verb] [pronoun] is 0.67 and its support is 0.67;
The confidence of [name] and "at which" is 0.33 and its support is 0.33;
The final output is:
[name] | [name] [verb] [pronoun] | [name] "at which" | Liang Chaowei is at which;
[name] | [name] [verb] [pronoun] | [name] "whom is" | whom Liu Dehua is;
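The support figures in step 6 can be reproduced with the sketch below, on the assumption that support is the fraction of corpus sentences matching a candidate. That definition is not stated in the text; it is chosen here only because it yields 0.67 and 0.33 on the three sentences of Table 1. Confidence is not computed, since the text does not say how it is defined.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag), segmented order assumed

CORPUS: List[List[Token]] = [
    [("Liu Dehua", "name"), ("is", "verb"), ("whom", "pronoun")],
    [("Liang Chaowei", "name"), ("is at", "verb"), ("which", "pronoun")],
    [("alpaca", "animal"), ("is", "verb"), ("what", "pronoun")],
]


def support(matches: Callable[[List[Token]], bool]) -> float:
    """Assumed definition: share of corpus sentences that match the candidate."""
    return sum(1 for s in CORPUS if matches(s)) / len(CORPUS)


def has_tags(*tags: str) -> Callable[[List[Token]], bool]:
    """Match sentences whose tag sequence equals `tags`."""
    return lambda s: tuple(t for _, t in s) == tags


def name_plus(*fission_word: str) -> Callable[[List[Token]], bool]:
    """Match sentences of the form [name] followed by exactly the fission word."""
    return lambda s: bool(s) and s[0][1] == "name" and tuple(w for w, _ in s[1:]) == fission_word


THRESHOLD = 0.3  # support threshold from step 6

candidates: Dict[str, float] = {
    "[name] [verb] [pronoun]": support(has_tags("name", "verb", "pronoun")),
    '[name] "whom is"': support(name_plus("is", "whom")),
    '[name] "at which"': support(name_plus("is at", "which")),
}
# Keep only candidates whose support exceeds the threshold.
kept = {k: round(v, 2) for k, v in candidates.items() if v > THRESHOLD}
```

Under this assumed definition all three candidates exceed the 0.3 threshold, which is consistent with both "whom is" and "at which" appearing in the final fission word set.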
7. End
The fission word set obtained is: "whom is", "at which";
The fission pattern set obtained is: [name] [verb] [pronoun];
8. Annotation
The target of [name]+"whom is" is "who";
The target of [name]+"at which" is "where";
9. In online use, when a query matches the [name]+"whom is" pattern, the question target is "who", and the answer stating who that [name] is is returned;
when a query matches the [name]+"at which" pattern, the question target is "where", and the answer stating where that [name] is is returned.
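Steps 8 and 9 amount to a lookup from a matched fission word to a question target. A minimal sketch follows, assuming a tagged query in segmented token order and a hypothetical function name `question_target`; retrieving the actual answer is outside the scope of the sketch.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Step 8 annotations: fission word (as assumed token sequence) -> question target.
TARGETS: Dict[Tuple[str, ...], str] = {
    ("is", "whom"): "who",
    ("is at", "which"): "where",
}


def question_target(tagged: List[Token]) -> Optional[Tuple[str, str]]:
    """Return (entity, target) if the query matches [name]+[fission word]."""
    if tagged and tagged[0][1] == "name":
        fission_word = tuple(w for w, _ in tagged[1:])
        target = TARGETS.get(fission_word)
        if target is not None:
            return tagged[0][0], target
    return None  # the query matches no annotated pattern


query = [("Liu Dehua", "name"), ("is", "verb"), ("whom", "pronoun")]
print(question_target(query))  # ('Liu Dehua', 'who')
```

A query whose first token is not tagged [name], such as the alpaca sentence, matches no annotated pattern and yields `None`.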
The distributed fission query method and device of the present invention determine the fission word set and the fission pattern set on the basis of a distributed system development framework and support-information calculations, and then perform fission processing on the query statement entered by the user according to these sets and return the query result. This removes the drawbacks of existing query methods, which process data in a single thread and a single process and therefore suffer from long data-mining times and low efficiency, which in turn harms query accuracy and can even crash the system. The invention improves the efficiency of offline data mining, shortens data-mining time, identifies the query target accurately, is highly extensible, and can discover new words automatically, thereby improving query accuracy and query efficiency and improving the user experience.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A distributed fission query method, characterized by comprising:
creating, in a preset distributed system development framework, a fission word file for storing fission words and a fission schema file for storing fission patterns;
determining, according to a preset corpus and a preset fission word, the fission pattern corresponding to the preset fission word, and adding the fission pattern to the fission schema file according to first support information of the preset fission word for the fission pattern;
finding, in the preset corpus, a sentence that matches a fission pattern in the fission schema file, extracting the word at the fission-word position of the sentence, and adding the word to the fission word file according to second support information, for the word, of the fission pattern to which the word belongs;
determining a fission word set and a fission pattern set according to the fission words in the fission word file and the fission patterns in the fission schema file;
performing fission processing on a statement to be queried according to the fission word set and the fission pattern set, so as to obtain a query result.
2. The method according to claim 1, characterized in that, before the determining of the fission word set and the fission pattern set according to the fission words in the fission word file and the fission patterns in the fission schema file, the method further comprises:
annotating a preset quantity of corpus data in the distributed system development framework, so as to obtain annotation results;
correspondingly, the performing of fission processing on the statement to be queried according to the fission word set and the fission pattern set, and the obtaining of the query result according to the processing result, comprises:
performing fission processing on the statement to be queried according to the annotation results, the fission word set and the fission pattern set, and obtaining the query result according to the processing result.
3. The method according to claim 2, characterized in that the distributed system development framework includes, but is not limited to, the distributed system infrastructure Hadoop.
4. The method according to claim 3, characterized in that the annotating of the preset quantity of corpus data in the distributed system development framework comprises:
using Hadoop map/reduce to call a word segmentation program that segments the statements in the corpus into words and annotates them by part of speech.
5. The method according to any one of claims 1-4, characterized in that the determining of the fission word set and the fission pattern set according to the fission words in the fission word file and the fission patterns in the fission schema file comprises:
mining new fission patterns according to the fission words in the fission word file and the annotation results obtained by annotating the preset quantity of corpus data, and putting the new fission patterns into the fission schema file;
mining new fission words according to the fission patterns in the fission schema file and the corpus, and putting the new fission words into the fission word file;
repeating the above steps of mining new fission patterns and mining new fission words alternately until no new fission pattern or new fission word appears, and taking the resulting fission word file and fission schema file as the fission word set and the fission pattern set.
6. A distributed fission query device, characterized by comprising:
a file initialization unit, configured to create, in a preset distributed system development framework, a fission word file for storing fission words and a fission schema file for storing fission patterns;
a fission schema file determining unit, configured to determine, according to a preset corpus and a preset fission word, the fission pattern corresponding to the preset fission word, and to add the fission pattern to the fission schema file according to first support information of the preset fission word for the fission pattern;
a fission word file determining unit, configured to find, in the preset corpus, a sentence that matches a fission pattern in the fission schema file, to extract the word at the fission-word position of the sentence, and to add the word to the fission word file according to second support information, for the word, of the fission pattern to which the word belongs;
a set determining unit, configured to determine a fission word set and a fission pattern set according to the fission words in the fission word file and the fission patterns in the fission schema file;
a fission query unit, configured to perform fission processing on a statement to be queried according to the fission word set and the fission pattern set, so as to obtain a query result.
7. The device according to claim 6, characterized in that the device further comprises:
an annotation unit, configured to annotate a preset quantity of corpus data in the distributed system development framework, so as to obtain annotation results;
correspondingly, the fission query unit is further configured to perform fission processing on the statement to be queried according to the annotation results, the fission word set and the fission pattern set, and to obtain the query result according to the processing result.
8. The device according to claim 7, characterized in that the distributed system development framework includes, but is not limited to, the distributed system infrastructure Hadoop.
9. The device according to claim 8, characterized in that the annotation unit is further configured to use Hadoop map/reduce to call a word segmentation program that segments the statements in the corpus into words and annotates them by part of speech.
10. The device according to any one of claims 6-9, characterized in that the set determining unit is further configured to:
mine new fission patterns according to the fission words in the fission word file and the annotation results obtained by annotating the preset quantity of corpus data, and put the new fission patterns into the fission schema file;
mine new fission words according to the fission patterns in the fission schema file and the corpus, and put the new fission words into the fission word file;
repeat the above steps of mining new fission patterns and mining new fission words alternately until no new fission pattern or new fission word appears, and take the resulting fission word file and fission schema file as the fission word set and the fission pattern set.
CN201610425275.9A 2016-06-15 2016-06-15 Distributed fission querying method and device Pending CN106126545A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610425275.9A CN106126545A (en) 2016-06-15 2016-06-15 Distributed fission querying method and device

Publications (1)

Publication Number Publication Date
CN106126545A true CN106126545A (en) 2016-11-16

Family

ID=57469698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610425275.9A Pending CN106126545A (en) 2016-06-15 2016-06-15 Distributed fission querying method and device

Country Status (1)

Country Link
CN (1) CN106126545A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101414310A (en) * 2008-10-17 2009-04-22 山西大学 Method and apparatus for searching natural language
US7962487B2 (en) * 2008-12-29 2011-06-14 Microsoft Corporation Ranking oriented query clustering and applications
CN102375853A (en) * 2010-08-24 2012-03-14 ***通信集团公司 Distributed database system, method for building index therein and query method
CN103034735A (en) * 2012-12-26 2013-04-10 北京讯鸟软件有限公司 Big data distributed file export method
CN103092979A (en) * 2013-01-31 2013-05-08 中国科学院对地观测与数字地球科学中心 Processing method and device for searching of natural language by remote sensing data
CN103177120A (en) * 2013-04-12 2013-06-26 同方知网(北京)技术有限公司 Index-based XPath query mode tree matching method
CN103927358A (en) * 2014-04-15 2014-07-16 清华大学 Text search method and system
CN104252533A (en) * 2014-09-12 2014-12-31 百度在线网络技术(北京)有限公司 Search method and search device
CN104899262A (en) * 2015-05-22 2015-09-09 华中师范大学 Information categorization method supporting user-defined categorization rules
CN105243052A (en) * 2015-09-15 2016-01-13 浪潮软件集团有限公司 Corpus labeling method, device and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20161116