CN109033076A - information mining method and device - Google Patents

Publication number
CN109033076A
Authority
CN
China
Prior art keywords
query statement
template
query
high frequency
particular category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810716210.9A
Other languages
Chinese (zh)
Inventor
王文敏
纪友升
凌光
徐威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810716210.9A
Publication of CN109033076A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention provides an information mining method and device. The method comprises: mining the query statements of each particular category from a search log; specifying seed entities of the particular category; generating, according to the seed entities of the particular category and the query statements, the expression template corresponding to each query statement of the particular category; and mining high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates. Because the users' search log is the data source, the resulting high-frequency statements and expressions are rich, cover the expression habits of a wide range of users, and can include content, such as colloquial expressions, that manually collected templates cannot cover.

Description

Information mining method and device
Technical field
The present invention relates to the technical field of information retrieval, and in particular to an information mining method and device.
Background art
In human-computer interaction systems, users express their requests to the robot in many different ways. An existing template-based parsing module needs the full set of user query statements (queries) in order to improve the recall and parsing accuracy of user understanding. User expressions have the following characteristics, which cause many problems when rules and vocabularies are collected manually in the traditional way.
(1) Expressions are highly varied. Users phrase the same question in many forms, and the expression habits of different users also vary, so manual collection cannot cover all expressions.
(2) Expressions are colloquial. User phrasing is heavily colloquial and cannot be covered by manually collected templates.
(3) The vocabulary of each dimension is enormous. A vocabulary of that order of magnitude cannot be built manually.
Because of these characteristics, collecting rules and vocabularies manually incurs high time and labor costs, low efficiency, and poor parsing quality, which leads to a weak user-understanding module and a poor human-computer interaction experience. In addition, manual vocabulary collection cannot gather a large-scale, complete vocabulary, which lowers parsing recall. Manual collection of expressions cannot gather a large-scale, complete set of expression templates or colloquial expressions, which lowers parsing recall and accuracy; the system then fails to understand user expressions and cannot give accurate answers, which lowers user satisfaction.
Summary of the invention
Embodiments of the present invention provide an information mining method and device to solve one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides an information mining method, comprising:
mining the query statements of each particular category from a search log;
specifying seed entities of the particular category;
generating, according to the seed entities of the particular category and the query statements, the expression template corresponding to each query statement of the particular category;
mining high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates.
With reference to the first aspect, in a first implementation of the first aspect, generating the expression template corresponding to each query statement of the particular category according to the seed entities of the particular category and the query statements comprises:
if a query statement of the particular category contains a seed entity, replacing the seed entity with a wildcard character to obtain the corresponding expression template.
With reference to the first aspect, in a second implementation of the first aspect, the method further comprises:
mining entities from the search log using the expression templates to obtain high-frequency words and/or colloquial words; and/or
extracting the full set of words belonging to the particular category from the full data of a selected website.
With reference to the second implementation of the first aspect, in a third implementation of the first aspect, the method further comprises:
performing scalable vector graphics (SVG) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
clustering the feature vectors corresponding to the multiple expression templates to obtain the expression templates included in the particular category.
With reference to the first aspect or any implementation thereof, in a fourth implementation of the first aspect, the method further comprises:
screening the search log to obtain relevant query statements and expression templates.
With reference to the fourth implementation of the first aspect, in a fifth implementation of the first aspect, mining high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates comprises:
obtaining annotated query statements from the relevant query statements, the annotated query statements including class labels;
calculating the semantic similarity of two query statements among the annotated query statements according to their word vectors;
if the semantic similarity of the two query statements is greater than a threshold, establishing a link between them;
establishing a sentence-template relational graph according to the links between query statements and the connections between each query statement and its corresponding expression template;
calculating a parameter of a random algorithm according to the class label of each annotated query statement and the total number of annotated query statements;
applying the random algorithm to the sentence-template relational graph to obtain a ranking of the query statements and their corresponding expression templates;
selecting high-frequency query statements and high-frequency expression templates according to the ranking result.
In a second aspect, an embodiment of the present invention provides an information mining device, comprising:
a statement mining module, configured to mine the query statements of each particular category from a search log;
an entity specification module, configured to specify seed entities of the particular category;
a template generation module, configured to generate, according to the seed entities of the particular category and the query statements, the expression template corresponding to each query statement of the particular category;
a high-frequency mining module, configured to mine high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates.
With reference to the second aspect, in a first implementation of the second aspect, the template generation module is further configured to, if a query statement of the particular category contains a seed entity, replace the seed entity with a wildcard character to obtain the corresponding expression template.
With reference to the second aspect, in a second implementation of the second aspect, the device further comprises:
a query word mining module, configured to mine entities from the search log using the expression templates to obtain high-frequency words and/or colloquial words; and/or
a full word extraction module, configured to extract the full set of words belonging to the particular category from the full data of a selected website.
With reference to the second implementation of the second aspect, in a third implementation of the second aspect, the query word mining module comprises:
a dimensionality reduction submodule, configured to perform scalable vector graphics (SVG) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
a clustering submodule, configured to cluster the feature vectors corresponding to the multiple expression templates to obtain the expression templates included in the particular category.
With reference to the second aspect or any implementation thereof, in a fourth implementation of the second aspect, the device further comprises:
a relevant expression mining module, configured to screen the search log to obtain relevant query statements and expression templates.
With reference to the fourth implementation of the second aspect, in a fifth implementation of the second aspect, the high-frequency mining module comprises:
an annotated statement acquisition submodule, configured to obtain annotated query statements from the relevant query statements, the annotated query statements including class labels;
a similarity calculation submodule, configured to calculate the semantic similarity of two query statements among the annotated query statements according to their word vectors;
a link establishment submodule, configured to establish a link between the two query statements if their semantic similarity is greater than a threshold;
a relational graph establishment submodule, configured to establish a sentence-template relational graph according to the links between query statements and the connections between each query statement and its corresponding expression template;
a parameter calculation submodule, configured to calculate a parameter of a random algorithm according to the class label of each annotated query statement and the total number of annotated query statements;
a ranking submodule, configured to apply the random algorithm to the sentence-template relational graph to obtain a ranking of the query statements and their corresponding expression templates;
a high-frequency screening submodule, configured to select high-frequency query statements and high-frequency expression templates according to the ranking result.
In a third aspect, an embodiment of the present invention provides an information mining device whose functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.
In one possible design, the structure of the information mining device includes a processor and a memory. The memory stores a program that supports the information mining device in executing the above information mining method, and the processor is configured to execute the program stored in the memory. The information mining device may further include a communication interface for communicating with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing computer software instructions used by the information mining device, including a program for executing the above information mining method.
One of the above technical solutions has the following advantage or beneficial effect: because the users' search log is the data source, the resulting high-frequency statements and expressions are rich, cover the expression habits of a wide range of users, and can include content, such as colloquial expressions, that manually collected templates cannot cover.
Another of the above technical solutions has the following advantage or beneficial effect: by combining multiple data mining and artificial intelligence techniques, it can mine a large-scale vocabulary, mine relevant expression templates, extract user expression templates by clustering, and extract users' high-frequency expression templates with a random algorithm, thereby achieving high efficiency, high recall, and high parsing accuracy. The above summary is for the purpose of the specification only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and the following detailed description.
Brief description of the drawings
In the drawings, unless otherwise specified, the same reference numerals denote the same or similar parts or elements throughout the several drawings. The drawings are not necessarily drawn to scale. It should be understood that the drawings depict only some embodiments disclosed according to the present invention and should not be regarded as limiting its scope.
Fig. 1 shows a flowchart of an information mining method according to an embodiment of the present invention.
Fig. 2 shows a flowchart of an information mining method according to an embodiment of the present invention.
Fig. 3 shows a flowchart of an information mining method according to an embodiment of the present invention.
Fig. 4 shows a structural block diagram of an information mining device according to an embodiment of the present invention.
Fig. 5 shows a structural block diagram of an information mining device according to an embodiment of the present invention.
Fig. 6 shows a structural block diagram of an information mining device according to an embodiment of the present invention.
Fig. 7 shows a schematic diagram of the general form of the sentence-template relational graph.
Fig. 8 shows a schematic diagram of an example sentence-template relational graph.
Fig. 9 shows a structural block diagram of an information mining device according to an embodiment of the present invention.
Detailed description
Hereinafter, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not restrictive.
Fig. 1 shows a flowchart of an information mining method according to an embodiment of the present invention. As shown in Fig. 1, the information mining method may include the following steps:
Step 101: mine the query statements of each particular category from a search log;
Step 102: specify seed entities of the particular category;
Step 103: generate, according to the seed entities of the particular category and the query statements, the expression template corresponding to each query statement of the particular category;
Step 104: mine high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates.
In embodiments of the present invention, the search log may include information related to users' search behavior, for example the query statement entered for a search, the search results returned, and the results the user actually clicked. Query statements belonging to a particular category can be mined from the search log. For example, if the particular category is film, query statements containing related information such as film titles, stars, and roles can be found and used as the query statements of the film category.
The given seed entities may include entities belonging to the particular category. For example, entities belonging to the film category include role A, star B, and movie name C.
In one possible implementation, step 103 comprises: if a query statement of the particular category contains a seed entity, replacing the seed entity with a wildcard character to obtain the corresponding expression template.
Specifically, the seed entities can be matched against each query statement of the category, and each seed entity matched in a query statement is replaced with a wildcard to generate the corresponding expression template.
For example, if query statement Q1 is "movie name C first broadcast", "movie name C" in Q1 can be replaced with a wildcard such as "*" to generate the expression template <* first broadcast>.
As another example, if query statement Q2 is "star B participates in film festival", "star B" in Q2 can be replaced with a wildcard such as "*" to generate the expression template <* participates in film festival>.
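The wildcard replacement described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `make_template`, the longest-entity-first matching order, and the toy queries are assumptions made for the example.

```python
from typing import List, Optional


def make_template(query: str, seed_entities: List[str], wildcard: str = "*") -> Optional[str]:
    """Return the expression template for a query, or None if no seed entity occurs in it."""
    # Try longer entities first so a short entity does not shadow a longer one (an assumption).
    for entity in sorted(seed_entities, key=len, reverse=True):
        if entity in query:
            return query.replace(entity, wildcard)
    return None


# Hypothetical seed entities and queries for the film category.
seeds = ["movie name C", "star B"]
queries = [
    "movie name C first broadcast",
    "star B participates in film festival",
    "weather today",
]
templates = {q: make_template(q, seeds) for q in queries}
```

Queries that contain no seed entity yield `None` here, so they contribute no template; how such queries are treated in practice is not specified in the text.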
In one possible implementation, as shown in Fig. 2, after the expression templates are obtained, the method further comprises: step 201, mining entities from the search log using the expression templates to obtain high-frequency words and/or colloquial words.
After a large number of expression templates have been obtained, they can be used to mine the search log for all entities that fit these templates.
For example, using the template <* first broadcast>, query statements Q11 "movie name C1 first broadcast", Q12 "movie name C2 first broadcast", and Q13 "movie name C3 first broadcast" are mined from the log, which yields the entities "movie name C1", "movie name C2", and "movie name C3".
As another example, using the template <* participates in film festival>, query statements Q21 "star B1 participates in film festival", Q22 "star B2 participates in film festival", and Q23 "star B3 participates in film festival" are mined from the log, which yields the entities "star B1", "star B2", and "star B3".
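One way to picture this mining step is to compile each template into a regular expression whose capture group sits where the wildcard was, then count what fills the slot across the log. The sketch below assumes one wildcard per template and a log that is simply a list of query strings; `template_to_regex` and `mine_entities` are illustrative names, not part of the described method.

```python
import re
from collections import Counter


def template_to_regex(template, wildcard="*"):
    """Compile a template such as '* first broadcast' into a regex that captures the wildcard slot."""
    parts = [re.escape(p) for p in template.split(wildcard)]
    return re.compile("^" + "(.+?)".join(parts) + "$")


def mine_entities(templates, log_queries):
    """Count the entities that fill the wildcard slot of any template in the logged queries."""
    patterns = [template_to_regex(t) for t in templates]
    counts = Counter()
    for query in log_queries:
        for pattern in patterns:
            match = pattern.match(query)
            if match:
                counts[match.group(1)] += 1
                break  # one template match per query is enough for this sketch
    return counts


# Hypothetical search-log queries.
log = [
    "movie name C1 first broadcast",
    "movie name C2 first broadcast",
    "movie name C1 first broadcast",
    "star B1 participates in film festival",
    "cheap flights",
]
entity_counts = mine_entities(["* first broadcast", "* participates in film festival"], log)
```

Sorting `entity_counts` by frequency (for example with `entity_counts.most_common()`) gives a simple notion of high-frequency words; colloquial variants surface the same way because they come straight from what users typed.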
In one possible implementation, as shown in Fig. 2, the full set of words of the particular category can also be mined from the offline full data of encyclopedia websites such as Wikipedia and Baidu Baike. The method may include: step 202, extracting the full set of words belonging to the particular category from the full data of a selected website. For example, all entries of the film category can be extracted from an encyclopedia website and then further classified based on the encyclopedia's abstracts, tables of contents, and so on.
In one possible implementation, mining the search log according to the expression templates to obtain the corresponding query words comprises:
mining entities from the search log using the expression templates;
performing scalable vector graphics (Scalable Vector Graphics, SVG) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
clustering the feature vectors corresponding to the expression templates to obtain the expression templates included in the particular category.
Using expression templates as the features of each entity yields sparse features. Applying SVG dimensionality reduction to the expression templates to obtain the corresponding feature vectors makes clustering more effective and more efficient.
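To make the sparsity concrete, the following sketch builds the entity-by-template count matrix that would be the input to dimensionality reduction and clustering. It only illustrates the feature construction; the reduction itself is left out, on the assumption that it would be delegated to a numerical library (the text's "SVG"/"SVD" wording most plausibly refers to a truncated singular value decomposition, but that reading is an assumption). The function names and data are hypothetical.

```python
from collections import defaultdict


def build_feature_matrix(observations):
    """observations: iterable of (entity, template) pairs mined from the log.

    Returns (entities, templates, matrix), where matrix[i][j] counts how often
    entity i was observed in template j.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for entity, template in observations:
        counts[entity][template] += 1
    entities = sorted(counts)
    templates = sorted({t for row in counts.values() for t in row})
    matrix = [[counts[e].get(t, 0) for t in templates] for e in entities]
    return entities, templates, matrix


def sparsity(matrix):
    """Fraction of zero cells; high values motivate reducing dimensionality before clustering."""
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)


# Hypothetical (entity, template) observations.
observations = [
    ("movie name C1", "* first broadcast"),
    ("movie name C2", "* first broadcast"),
    ("movie name C1", "* online HD viewing"),
    ("star B1", "* participates in film festival"),
    ("star B2", "* participates in film festival"),
]
entities, templates, matrix = build_feature_matrix(observations)
```

Even in this toy case more than half of the cells are zero; with thousands of templates mined from a real log, almost every cell would be.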
In one possible implementation, as shown in Fig. 3, the method further comprises:
Step 301: screen the search log to obtain relevant query statements and expression templates.
In one possible implementation, step 104 may include mining high-frequency query statements and high-frequency expression templates from the relevant query statements and expression templates according to the query statements of each category and their corresponding expression templates, and may specifically include:
obtaining annotated query statements from the relevant query statements, the annotated query statements including class labels;
calculating the semantic similarity of two query statements among the annotated query statements according to their word vectors, for example using the cosine similarity of the word vectors of the two query statements as their semantic similarity;
if the semantic similarity of the two query statements is greater than a threshold, establishing a link between them;
establishing a sentence-template relational graph according to the links between query statements and the connections between each query statement and its corresponding expression template;
calculating a parameter of a random algorithm according to the class label of each annotated query statement and the total number of annotated query statements, for example R value = value of the class label / total number of annotated query statements;
applying the random algorithm (using the above R value) to the sentence-template relational graph to obtain a ranking of the query statements and their corresponding expression templates;
selecting high-frequency query statements and high-frequency expression templates according to the ranking result.
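The similarity, linking, and R-value steps above can be sketched as follows. The two-dimensional vectors stand in for real word vectors and are purely illustrative; the threshold of 0.9 is borrowed from the application example given later in the description, and the function names are assumptions.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_links(vectors, threshold=0.9):
    """Link every pair of queries whose cosine similarity exceeds the threshold."""
    return [
        (q1, q2)
        for q1, q2 in combinations(sorted(vectors), 2)
        if cosine_similarity(vectors[q1], vectors[q2]) > threshold
    ]


def r_values(labels):
    """R value per query: class label (0 or 1) divided by the total number of annotated queries."""
    total = len(labels)
    return {q: label / total for q, label in labels.items()}


# Hypothetical annotated queries with toy vectors and class labels.
vectors = {"qa": [1.0, 0.0], "qb": [0.9, 0.1], "qc": [0.0, 1.0]}
labels = {"qa": 1, "qb": 1, "qc": 0}
links = build_links(vectors)
r = r_values(labels)
```

Comparing all pairs is quadratic in the number of annotated queries, which is acceptable for a small annotated batch; a larger batch would likely need approximate nearest-neighbor search instead.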
Because embodiments of the present invention use the users' search log as the data source, the resulting high-frequency statements and expressions are rich, cover the expression habits of a wide range of users, and can include content, such as colloquial expressions, that manually collected templates cannot cover.
In addition, by combining multiple data mining and artificial intelligence techniques, embodiments can mine a large-scale vocabulary, mine relevant expression templates, extract user expression templates by clustering, and extract users' high-frequency expression templates with a random algorithm. This achieves high efficiency, high recall, and high parsing accuracy, makes accurate answers possible, and improves user satisfaction and the human-computer interaction experience.
Fig. 4 shows a structural block diagram of an information mining device according to an embodiment of the present invention. As shown in Fig. 4, the information mining device may include:
a statement mining module 41, configured to mine the query statements of each particular category from a search log;
an entity specification module 42, configured to specify seed entities of the particular category;
a template generation module 43, configured to generate, according to the seed entities of the particular category and the query statements, the expression template corresponding to each query statement of the particular category;
a high-frequency mining module 44, configured to mine high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates.
In one possible implementation, the template generation module 43 is further configured to, if a query statement of the particular category contains a seed entity, replace the seed entity with a wildcard character to obtain the corresponding expression template.
In one possible implementation, as shown in Fig. 5, the device further comprises:
a query word mining module 51, configured to mine entities from the search log using the expression templates to obtain high-frequency words and/or colloquial words; and/or
a full word extraction module 52, configured to extract the full set of words belonging to the particular category from the full data of a selected website.
In one possible implementation, the query word mining module 51 comprises:
a dimensionality reduction submodule, configured to perform scalable vector graphics (SVG) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
a clustering submodule, configured to cluster the feature vectors corresponding to the multiple expression templates to obtain the expression templates included in the particular category.
In one possible implementation, as shown in Fig. 6, the device further comprises: a relevant expression mining module 61, configured to screen the search log to obtain relevant query statements and expression templates.
In one possible implementation, the high-frequency mining module 44 comprises:
an annotated statement acquisition submodule, configured to obtain annotated query statements from the relevant query statements, the annotated query statements including class labels;
a similarity calculation submodule, configured to calculate the semantic similarity of two query statements among the annotated query statements according to their word vectors;
a link establishment submodule, configured to establish a link between the two query statements if their semantic similarity is greater than a threshold;
a relational graph establishment submodule, configured to establish a sentence-template relational graph according to the links between query statements and the connections between each query statement and its corresponding expression template;
a parameter calculation submodule, configured to calculate a parameter of a random algorithm according to the class label of each annotated query statement and the total number of annotated query statements;
a ranking submodule, configured to apply the random algorithm to the sentence-template relational graph to obtain a ranking of the query statements and their corresponding expression templates;
a high-frequency screening submodule, configured to select high-frequency query statements and high-frequency expression templates according to the ranking result.
For the functions of the modules in the devices of the embodiments of the present invention, refer to the corresponding descriptions in the method above; they are not repeated here.
By combining multiple data mining and artificial intelligence techniques, embodiments of the present invention can mine a large-scale vocabulary, mine relevant expression templates, extract user expression templates by clustering, and extract users' high-frequency expression templates with a random algorithm (randomwalk), thereby achieving high efficiency, high recall, and high parsing accuracy.
In one application example, the information mining method of the embodiments of the present invention may include the following parts:
Part one: large-scale core vocabulary mining. The mined vocabulary may include high-frequency words, colloquial words, and full-set words.
1. Mining high-frequency words and colloquial words:
1.1 Mine all query statements (queries) of the particular category from the search log.
1.2 Specify a small number of seed entities (which can be understood as specific query objects in a given field). If an entity occurs in a query, replace it with a wildcard to generate a corresponding expression template. For example, given the seed entity "Transformers" and the query statement "Transformers online HD viewing", the expression template (pattern) <* online HD viewing> is generated.
1.3 After the previous step has produced a large number of expression templates, use them to mine all entities from the search log.
1.4 Use the expression templates as the features of each entity. These features are very sparse, so clustering on them directly is ineffective and inefficient. Therefore, the expression templates can first be reduced in dimensionality (Scalable Vector Graphics, SVD) and then clustered using the reduced feature vectors.
2. Mining full-set words: extract all entries of the particular category from the offline full data of Wikipedia or Baidu Baike, and classify them based on Wikipedia abstracts.
Part two: mining query expressions across the whole web
Mine the search log (for example, Baidu's web search click log), and screen it to retain the log records of clicks on relevant mainstream websites, thereby filtering out the relevant user query statements (queries) and expression templates (also called expression patterns).
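A screening step of this kind might look like the sketch below. Everything about the record layout is an assumption: the field names `query` and `clicked_url`, the whitelist of sites, and the rule of stripping a leading `www.` are invented for illustration, since the text does not describe the log format.

```python
from urllib.parse import urlparse

# Hypothetical whitelist of mainstream sites for the film category.
MAINSTREAM_SITES = {"example-movies.com", "example-video.com"}


def relevant_queries(click_log, sites=MAINSTREAM_SITES):
    """Keep the queries whose clicked URL belongs to one of the whitelisted sites."""
    kept = []
    for record in click_log:
        host = urlparse(record["clicked_url"]).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in sites:
            kept.append(record["query"])
    return kept


# Hypothetical click-log records.
click_log = [
    {"query": "Transformers online HD viewing", "clicked_url": "https://www.example-movies.com/t/1"},
    {"query": "cheap flights", "clicked_url": "https://travel.example.org/deals"},
]
kept_queries = relevant_queries(click_log)
```

The retained queries can then be passed through the same template generation step as before to obtain the relevant expression templates.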
Three: extracting high frequency expression query and expression template
The class label of mark a batch query, query can indicate whether the query belongs to particular category.Such as electricity Shadow, otherwise it is 0 that label, which is for 1, label,.
Using label/sum, (formula indicates R of the label value 0 or 1 divided by sum (text sum) as each query It is worth (parameter of random algorithm).
Represent the semantics of each query with its word vector (for example lstm_encoding), and use cosine similarity to compute the semantic similarity between every two queries. For two queries whose semantic similarity exceeds a threshold, such as 0.9, construct a link.
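The link-construction step can be sketched as below. The three-dimensional vectors are made-up stand-ins for real query encodings (the text mentions lstm_encoding but gives no values), and `build_links` is a hypothetical helper name; only the cosine measure and the 0.9 threshold come from the description.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def build_links(query_vectors, threshold=0.9):
    """Link every pair of queries whose similarity exceeds the threshold."""
    links = []
    for (q1, v1), (q2, v2) in combinations(query_vectors.items(), 2):
        if cosine_similarity(v1, v2) > threshold:
            links.append((q1, q2))
    return links


# Hypothetical encodings; a real system would use learned query vectors.
query_vectors = {
    "jobs in chicago": [0.9, 0.1, 0.0],
    "jobs in boston": [0.8, 0.2, 0.0],
    "401k plans": [0.0, 0.1, 0.9],
}
links = build_links(query_vectors)
print(links)
```

Comparing all pairs is quadratic in the number of queries, so a large log would likely need an approximate nearest-neighbour index instead; the description does not say how this is handled.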
Combine the links between queries with the connection relationships between queries and their expression templates (patterns) to build a sentence template relational graph (QQT-Graph for short).
Run the random algorithm (such as a random walk) on the QQT-Graph using the R values above to obtain the final ranking of queries and patterns.
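The description does not spell out how the R values enter the random walk. One plausible reading, sketched here purely as an assumption, is a random walk with restart in which the restart distribution is proportional to R = label/sum, so that probability mass keeps flowing back to queries labelled as belonging to the category. The function name, the damping factor 0.85, the iteration count, and the toy graph are all hypothetical.

```python
def random_walk_with_restart(edges, restart, damping=0.85, iterations=50):
    """Score nodes of an undirected weighted graph by a random walk with restart."""
    neighbors = {}
    for u, v, weight in edges:
        neighbors.setdefault(u, {})[v] = weight
        neighbors.setdefault(v, {})[u] = weight
    nodes = sorted(neighbors)

    # Normalise the restart weights (the R values) into a distribution.
    total = sum(restart.get(n, 0.0) for n in nodes)
    if total > 0:
        prior = {n: restart.get(n, 0.0) / total for n in nodes}
    else:
        prior = {n: 1.0 / len(nodes) for n in nodes}

    score = dict(prior)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * prior[n] for n in nodes}
        for u in nodes:
            out_weight = sum(neighbors[u].values())
            for v, weight in neighbors[u].items():
                nxt[v] += damping * score[u] * weight / out_weight
        score = nxt
    return score


# R value of each labelled query: label (0 or 1) divided by the number of texts.
labels = {"q1": 1, "q2": 1, "q3": 0}
r_values = {q: label / len(labels) for q, label in labels.items()}

# Toy QQT-Graph: query-query links plus query-template connections.
edges = [
    ("q1", "q2", 1.0),
    ("q2", "q3", 1.0),
    ("q1", "t1", 1.0),
    ("q2", "t1", 1.0),
    ("q3", "t2", 1.0),
]
scores = random_walk_with_restart(edges, r_values)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy graph, template t1 is attached to both in-category queries and ends up ranked above t2, which is attached only to the out-of-category query.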
Finally, select the high-frequency expression queries and expression templates and use them for parsing and recall coverage, so as to improve the parsing rate.
Fig. 7 shows the general form of the sentence template relational graph, where q denotes a query statement, s denotes a link between query statements, and t denotes an expression template. Cqs denotes the weight of the edge connecting s and q, Cs denotes the score of s, and Cq denotes the score of q. Iqt denotes the weight of the edge connecting q and t, Iq denotes the score of q, and It denotes the score of t. In addition, when the sentence template relational graph is built, s need not be used to represent the connection between two query statements; instead, a direct edge can be established between the two query statements q and q.
Fig. 8 shows an illustrative sentence template relational graph. Assume that the query statements in it are as follows:
q1: jobs in chicago
q2: jobs in boston
q3: jobs in microsoft
q4: jobs in motorola
q5: marketing jobs in motorola
q6: 401k plans
q7: illinois employment statistics
By computing the semantic similarity of the query statements, links between pairs of statements can be established. Examples of links are as follows:
s1: monster.com
s2: motorola.com
s3: us401k.com
Here, the links between q1 and q7, q1 and q2, q1 and q3, and q1 and q6 are s1; the link between q4 and q5 is s2; and the link between q6 and q7 is s3.
Based on the relationships between the previously mined expression templates and the query statements, connections between templates and query statements are established. Examples of expression templates are as follows:
t1: jobs in #location
t2: jobs in #company
t3: #category jobs in #company
t4: #location employment statistics
Here, q1 and q2 are connected to t1; q2, q3, and q4 are connected to t2; q5 is connected to t3; and q7 is connected to t4.
In the sentence template relational graph shown in Fig. 8, the values 1, 2, 5, 4, 10, 12, and so on denote the weights of the corresponding edges.
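The Fig. 8 example can be written down as a small adjacency structure. The sketch below encodes only the links and template connections listed in the text; edge weights are left out because the text does not say which edge carries which value, and the helper name `build_graph` is hypothetical.

```python
from collections import defaultdict

# Links between query statements, keyed by link node, as listed for Fig. 8.
links = {
    "s1": [("q1", "q7"), ("q1", "q2"), ("q1", "q3"), ("q1", "q6")],
    "s2": [("q4", "q5")],
    "s3": [("q6", "q7")],
}

# Query statements connected to each expression template.
templates = {
    "t1": ["q1", "q2"],
    "t2": ["q2", "q3", "q4"],
    "t3": ["q5"],
    "t4": ["q7"],
}


def build_graph(links, templates):
    """Return an undirected adjacency map over q, s, and t nodes."""
    graph = defaultdict(set)
    for link, pairs in links.items():
        for pair in pairs:
            for query in pair:
                graph[query].add(link)
                graph[link].add(query)
    for template, queries in templates.items():
        for query in queries:
            graph[query].add(template)
            graph[template].add(query)
    return graph


graph = build_graph(links, templates)
for node in sorted(graph):
    print(node, sorted(graph[node]))
```

Routing each query pair through its link node keeps the graph in the q-s-t form of Fig. 7; the alternative the description mentions, direct q-q edges, would simply add the pair as an edge instead.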
The information mining method and device of the embodiments of the present invention have the following clear advantages:
They save time and labor costs: with machine mining, the cold-start knowledge mining for a new category can be completed quickly.
They combine several data mining and artificial intelligence techniques: mining a generalized vocabulary, mining relevant expression templates across the whole web, extracting user expression templates by clustering, and extracting users' high-frequency expression templates by random walk. Together these achieve high efficiency, high recall, and high parsing accuracy, and improve the user experience.
Fig. 9 shows a structural block diagram of an information mining device according to an embodiment of the present invention. As shown in Fig. 9, the device includes a memory 910 and a processor 920; the memory 910 stores a computer program that can run on the processor 920. The processor 920 implements the information mining method of the above embodiments when executing the computer program. There may be one or more memories 910 and one or more processors 920.
The device further includes:
A communication interface 930, for communicating with external devices and exchanging data.
The memory 910 may include high-speed RAM, and may also include non-volatile memory, for example at least one magnetic disk memory.
If the memory 910, the processor 920, and the communication interface 930 are implemented independently, they can be connected to one another by a bus and communicate with one another over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in Fig. 9, but this does not mean that there is only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 910, the processor 920, and the communication interface 930 are integrated on a single chip, they can communicate with one another through internal interfaces.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements any of the methods in the above embodiments.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with one another, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means two or more, unless otherwise expressly and specifically limited.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order, depending on the functions involved. This should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered list of executable instructions that can be considered to implement logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any apparatus that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiment methods may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, includes one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (14)

1. An information mining method, characterized by comprising:
Mining the query statements of each particular category from a search log;
Giving the seed entities of the particular category;
Generating, according to the seed entities of the particular category and the query statements, the expression template corresponding to each query statement of the particular category;
Mining high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates.
2. The method according to claim 1, characterized in that generating, according to the seed entities of the particular category and the query statements, the expression template corresponding to each query statement of the particular category comprises:
If a query statement of the particular category includes a seed entity, replacing the seed entity with a wildcard to obtain the corresponding expression template.
3. The method according to claim 1, characterized by further comprising:
Mining various entities from the search log using the expression templates, to obtain high-frequency words and/or colloquial words; and/or
Extracting the full set of words belonging to the particular category from the full data of a selected website.
4. The method according to claim 3, characterized by further comprising:
Performing singular value decomposition (SVD) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
Clustering the feature vectors corresponding to multiple expression templates to obtain the expression templates that the particular category includes.
5. The method according to any one of claims 1 to 4, characterized by further comprising:
Screening the search log to obtain relevant query statements and expression templates.
6. The method according to claim 5, characterized in that mining high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates comprises:
Obtaining labelled query statements from the relevant query statements, the labelled query statements including class labels;
Calculating the semantic similarity of two query statements among the labelled query statements according to their word vectors;
If the semantic similarity of the two query statements is greater than a threshold, establishing a link between the two;
Establishing a sentence template relational graph according to the links between the query statements and the connection relationships between the query statements and their corresponding expression templates;
Calculating a parameter of a random algorithm according to the class label of each query statement among the labelled query statements and the total number of labelled query statements;
Applying the random algorithm to the sentence template relational graph to obtain a ranking of the query statements and their corresponding expression templates;
Filtering out high-frequency query statements and high-frequency query templates according to the ranking results.
7. An information mining device, characterized by comprising:
A statement mining module, for mining the query statements of each particular category from a search log;
An entity giving module, for giving the seed entities of the particular category;
A template generation module, for generating, according to the seed entities of the particular category and the query statements, the expression template corresponding to each query statement of the particular category;
A high-frequency mining module, for mining high-frequency query statements and high-frequency expression templates from the search log according to the query statements of each category and their corresponding expression templates.
8. The device according to claim 7, characterized in that the template generation module is further configured to, if a query statement of the particular category includes a seed entity, replace the seed entity with a wildcard to obtain the corresponding expression template.
9. The device according to claim 7, characterized by further comprising:
A query word mining module, for mining various entities from the search log using the expression templates, to obtain high-frequency words and/or colloquial words; and/or
A full word extraction module, for extracting the full set of words belonging to the particular category from the full data of a selected website.
10. The device according to claim 9, characterized in that the query word mining module comprises:
A dimensionality reduction submodule, for performing singular value decomposition (SVD) dimensionality reduction on the expression templates of the mined entities to obtain corresponding feature vectors;
A clustering submodule, for clustering the feature vectors corresponding to multiple expression templates to obtain the expression templates that the particular category includes.
11. The device according to any one of claims 7 to 10, characterized by further comprising:
A relevant expression mining module, for screening the search log to obtain relevant query statements and expression templates.
12. The device according to claim 11, characterized in that the high-frequency mining module comprises:
A labelled statement acquisition submodule, for obtaining labelled query statements from the relevant query statements, the labelled query statements including class labels;
A similarity calculation submodule, for calculating the semantic similarity of two query statements among the labelled query statements according to their word vectors;
A link establishing submodule, for establishing a link between the two query statements if their semantic similarity is greater than a threshold;
A relational graph establishing submodule, for establishing a sentence template relational graph according to the links between the query statements and the connection relationships between the query statements and their corresponding expression templates;
A parameter calculation submodule, for calculating a parameter of a random algorithm according to the class label of each query statement among the labelled query statements and the total number of labelled query statements;
A ranking submodule, for applying the random algorithm to the sentence template relational graph to obtain a ranking of the query statements and their corresponding expression templates;
A high-frequency screening submodule, for filtering out high-frequency query statements and high-frequency query templates according to the ranking results.
13. An information mining device, characterized by comprising:
One or more processors;
A storage device, for storing one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 6.
14. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN201810716210.9A 2018-06-29 2018-06-29 information mining method and device Pending CN109033076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810716210.9A CN109033076A (en) 2018-06-29 2018-06-29 information mining method and device

Publications (1)

Publication Number Publication Date
CN109033076A true CN109033076A (en) 2018-12-18

Family

ID=65521476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810716210.9A Pending CN109033076A (en) 2018-06-29 2018-06-29 information mining method and device

Country Status (1)

Country Link
CN (1) CN109033076A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102419778A (en) * 2012-01-09 2012-04-18 中国科学院软件研究所 Information searching method for discovering and clustering sub-topics of query statement
CN103425714A (en) * 2012-05-25 2013-12-04 北京搜狗信息服务有限公司 Query method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
伍大勇: "搜索引擎中命名实体查询处理相关技术研究", 《中国博士学位论文全文数据库》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990451A (en) * 2019-11-15 2020-04-10 浙江大华技术股份有限公司 Data mining method, device and equipment based on sentence embedding and storage device
CN110990451B (en) * 2019-11-15 2023-05-12 浙江大华技术股份有限公司 Sentence embedding-based data mining method, device, equipment and storage device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218
