CN106156035A - A general-purpose text mining method and system - Google Patents

A general-purpose text mining method and system

Info

Publication number
CN106156035A
Authority
CN
China
Prior art keywords
concept
digging
text
excavation
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510135053.9A
Other languages
Chinese (zh)
Other versions
CN106156035B (en
Inventor
孟涛
李佳静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Network Sense To Inspect Mdt Infotech Ltd
Original Assignee
Nanjing Network Sense To Inspect Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Network Sense To Inspect Mdt Infotech Ltd
Publication of CN106156035A
Application granted
Publication of CN106156035B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a general-purpose text mining method, comprising: step 1, running a web crawler to load the mining objects within the mining scope; step 2, performing actual text extraction on the mining objects to obtain the actual text; step 3, forming a concept tagging graph from the actual text; step 4, according to the concepts corresponding to the mining target and the relations between those concepts, compiling the concepts and relations into bytecode and then forming an instruction graph; step 5, matching the concept tagging graph against the instruction graph, and forming the analysis result from the content in the concept tagging graph that satisfies the concepts and relations of the instruction graph. The general-purpose text mining method of the present invention can formally describe the mining target, the mining scope, the mining methods and so on, so that text retrieval and mining can be carried out in different fields.

Description

A general-purpose text mining method and system
Technical field
The invention belongs to the technical field of text search, and more specifically to the field of text mining and natural language processing. It relates to a general-purpose text mining method that provides a formal means of describing the mining target, mining scope, analysis algorithms and analysis-result filtering rules of text mining tasks in different fields.
Background art
Since the 1990s, information extraction has received sustained attention from academia and industry, and a large body of related research exists. With the rapid growth of the amount of information on the Internet, mining unstructured text has become a research hotspot. Text mining is increasingly widely applied in fields such as public opinion monitoring, intelligence analysis and business intelligence.
Existing extraction techniques are generally designed for one concrete field and are used only within that field. For example, much of the work on web page information extraction, especially the work on wrapper generation, only extracts useful information by exploiting the structure of web pages. Other systems, such as DBLife, use the Datalog language to let users customize extraction targets and to improve extraction efficiency, but this approach can only be used in the field of document analysis.
The inventors have recognized that existing information extraction techniques offer no general approach that can be used across multiple fields, so users cannot in practice be given information extraction support for multiple fields. Existing techniques also have technical shortcomings that prevent them from achieving general-purpose text mining. On the one hand, they describe the matching target with Boolean logic over keywords, which limits how precisely the target can be characterized. On the other hand, they are not general: switching to another industry or scenario usually requires a new implementation.
Summary of the invention
An object of the present invention is to provide a general-purpose text mining method. The inventors believe that a rule language independent of the application can be developed on top of rule-based extraction techniques to achieve generality across fields, letting users customize extraction targets in different fields in a declarative way. To meet this need, a general-purpose text mining system has to solve three main problems: 1) how to provide a formal method for describing text mining patterns; 2) how to support mining targets that contain complex semantic structures; 3) how to make large-scale semantic extraction efficient. The method of the invention proposes a formal definition mechanism that can describe the mining target, mining scope and analysis algorithms of text mining; it provides analysis results in a layered manner and also provides rules for further filtering the analysis results. In another aspect, the present invention provides a system that implements the above text mining method.
The general-purpose text mining method provided by the present invention includes:
Step 1, running a web crawler to load the mining objects within the mining scope;
Step 2, performing actual text extraction on the mining objects to obtain the actual text;
Step 3, forming a concept tagging graph from the actual text;
Step 4, according to the concepts corresponding to the mining target and the relations between those concepts, compiling the concepts and relations into bytecode and then forming an instruction graph;
Step 5, matching the concept tagging graph against the instruction graph, and forming the analysis result from the content in the concept tagging graph that satisfies the concepts and relations of the instruction graph.
After step 3, the method may include an analysis optimization step for optimizing the concept tagging graph; the mining methods used in the analysis optimization step include word segmentation, part-of-speech analysis and named entity recognition. Before the analysis optimization step, the method may further include a mining method definition step for selecting the mining methods.
Before step 4, the method may further include a mining target definition step that defines the concepts corresponding to the mining target and the relations between the concepts; the mining target is the concrete value of the concepts and relations. Before step 1, the method may include a mining scope definition step that defines the mining scope.
After step 5, the method may further include a step of classifying the text according to the matching results of the concepts and relations.
The relations between the concepts include:
"SENT": all concepts within the scope must appear in one sentence;
"DIST_n": the distance between any two adjacent concepts within the scope must not exceed n;
"ORD": all concepts within the scope appear in order;
"CONT": all concepts within the scope are adjacent.
The method of the present invention further includes step 6, performing subject classification on the actual text. It may additionally include step 7, filtering the analysis result, in which the analysis result is restricted according to the frequency of occurrence of the matched concepts and relations in the actual text.
The general-purpose text mining system provided by the present invention includes:
A loading module, for using a web crawler to load the mining object text within the mining scope;
A text extraction module, for extracting the actual text from the mining objects;
A tagging graph generation module, for forming a concept tagging graph from the actual text;
A compiling module, for compiling the concepts and relations into bytecode according to the concepts corresponding to the mining target and the relations between them, and then forming an instruction graph;
A matching module, for matching the concept tagging graph against the instruction graph and forming the analysis result from the content in the concept tagging graph that satisfies the concepts and relations of the instruction graph.
The system can include an analysis optimization module for optimizing the concept tagging graph. The system can also include a mining method definition module for selecting the mining methods used in the analysis optimization module. Further, the system can include a method model library for storing the mining methods, from which the mining method definition module selects mining methods. The mining methods in the method model library include a word segmentation module, a part-of-speech analysis module, a named entity recognition module and the like.
The system can also include a mining target definition module, for defining the concepts corresponding to the mining target and the relations between the concepts; the mining target serves as the concrete value of the concepts.
The relations between the concepts may include:
"SENT": all concepts within the scope must appear in one sentence;
"DIST_n": the distance between any two adjacent concepts within the scope must not exceed n;
"ORD": all concepts within the scope appear in order;
"CONT": all concepts within the scope are adjacent.
The system of the present invention can also include a mining scope definition module for defining the mining scope. The system also includes a text classification module for performing subject classification on the actual text according to the matching results of the concepts and relations in the text and the field the text relates to. In addition, the system includes a result filtering module for restricting the analysis result according to the frequency of occurrence of the matched concepts and relations in the text.
The present invention proposes a general-purpose text mining method that formally describes the mining target, scope and mining methods of text mining and obtains the mining results. With a corresponding context-sensitive language, the invention supports the extraction of arbitrary concepts and relations, and also provides means of customizing the mining scope and the mining methods used. A user only needs to use the rules of the invention to define concepts and the relations between them, and to assign concrete search targets to the concepts, in order to describe a text mining requirement and then obtain the analysis result of the text mining.
According to the mining method of the present invention, the user can define the required concepts and assign values to them, define the relations between the concepts, and then select the mining scope and mining methods. The method can use the mining methods to annotate the text within the search scope and generate the concept tagging graph. In parallel, the invention compiles the user-defined concepts and relations into bytecode and generates the instruction graph. The invention then matches the concept tagging graph against the instruction graph, thereby identifying the text content within the mining scope that satisfies the user-defined concepts and relations, and forms the analysis result.
The inventors have found that those skilled in the art have not attempted to provide such a general-purpose text mining method or template, nor have they appreciated the importance of a truly general-purpose text mining method. The technical task to be accomplished, or the technical problem to be solved, by the present invention has therefore never been contemplated or anticipated by those skilled in the art, and the present invention is a new technical solution.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the description, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a block diagram of the general-purpose text mining method of the present invention;
Fig. 2 is a block diagram of the general-purpose text mining method in a specific embodiment of the invention;
Fig. 3 is the simplified grammar of the TML language corresponding to the method of the invention;
Fig. 4 is an example of a concept tagging graph in a specific embodiment of the invention;
Fig. 5 is an example of relation definition statements of a TML program in a specific embodiment of the invention;
Fig. 6 is an example of the bytecode of a relation statement of a TML program in a specific embodiment of the invention;
Fig. 7 is a structural block diagram of the general-purpose text mining system in a specific embodiment provided by the invention;
Fig. 8 is the concept and relation hierarchy diagram of a TML program in a specific embodiment of the invention.
Detailed description of the embodiments
Various exemplary embodiments of the present invention are now described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the invention.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application or its uses.
Techniques, methods and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be considered part of the description.
In all examples shown and discussed here, any concrete value should be interpreted as merely illustrative and not as a limitation. Other examples of the exemplary embodiments can therefore have different values.
It should also be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item has been defined in one drawing it need not be discussed further in subsequent drawings.
The invention provides a general-purpose text mining method; Fig. 1 is a block diagram of its steps. The method implements web crawling, actual text extraction, text tagging graph generation, word segmentation, part-of-speech tagging, named entity recognition, text classification and related techniques, and combines them into a pipeline analysis. As shown in Fig. 1, it comprises the following steps: 1, running a web crawler to load the mining objects within the mining scope; 2, extracting the actual text from the mining objects; 3, forming a concept tagging graph from the actual text; 4, compiling the concepts and relations into bytecode and forming an instruction graph; 5, matching the concept tagging graph against the instruction graph to obtain the analysis result. In the method of the present invention, a mining object can be any carrier of text, such as a web page, a book or a file.
The text mining method provided by the invention can be based on the TML language. As shown in Fig. 3, the mining method is implemented by writing regular expression definition statements, concept definition statements, assignment statements, relation definition statements, mining scope definition statements, analysis algorithm definition statements, output statements, load statements and the like. Fig. 3 shows the TML program framework corresponding to the general-purpose text mining method; with this framework, a user of the method can carry out text mining analysis on any field and any target mining scope. The regular expressions are used to define the form of concepts and of their concrete values.
In step 1, the text mining method first runs the web crawler and loads the mining objects within the selected mining scope. Before step 1, as shown in Fig. 2, a mining scope definition step may first be performed, in which the mining scope is selected by the mining scope definition statement shown in lines 17-19 of Fig. 3. The mining scope definition statement is a compound statement consisting of the reserved word "PAGES", a variable name <string> and a mining scope attribute list <pagerestricts>. The mining scope attribute list in line 18 can contain multiple mining scope attributes <pagerestrict>. In line 19, a mining scope attribute consists of an attribute and an attribute value; the attributes are defined in Table 1.
Table 1. Mining scope attribute definitions
In step 2, because the mining scope can take many forms, the mining objects in it may also contain content that is meaningless for mining, such as layout markup in web pages or typesetting characters in books. The pipeline analysis can therefore include an actual text extraction step, which removes the parts of a mining object that are meaningless for mining and retrieval and yields the actual text that really needs to be mined and analyzed.
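As an illustration of actual text extraction for web pages, the following minimal sketch strips markup, scripts and style sheets using Python's standard html.parser module. The class and function names are hypothetical, and the patent does not state which extraction technique it uses; this is only one plausible reading of the step.

```python
from html.parser import HTMLParser


class ActualTextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content (assumed rule)."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_actual_text(html):
    """Return the text of a page with layout markup removed."""
    parser = ActualTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><style>p{}</style></head>"
        "<body><p>Hello</p><script>x=1</script><p>World</p></body></html>")
print(extract_actual_text(page))
```

A production extractor would also have to drop navigation bars, advertisements and similar page furniture, which this sketch does not attempt.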
In step 3, the present invention generates the concept tagging graph from the actual text. In the concept tagging graph, the words and phrases that match the mining target, that is, that match a concrete value of a concept, are tagged with the corresponding concept. In a specific embodiment, the concept tagging graph is as shown in Fig. 4.
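The tagging idea can be sketched as follows. This is a simplified illustration that represents the tagging graph as a flat list of (position, word, concept) entries; the structure of the graph in Fig. 4 is not reproduced here, and the concept values and the function name are hypothetical.

```python
# Hypothetical concept values; in the described method they would come from
# TML assignment statements rather than a Python dictionary.
CONCEPT_VALUES = {
    "department": {"marketing", "sales"},
    "title": {"director", "manager"},
}


def tag_concepts(tokens, concept_values):
    """Return (token_index, token, concept) for each token that is a value of a concept."""
    tags = []
    for index, token in enumerate(tokens):
        for concept, values in concept_values.items():
            if token.lower() in values:
                tags.append((index, token, concept))
    return tags


tokens = "Alice is the marketing director of Acme".split()
for tag in tag_concepts(tokens, CONCEPT_VALUES):
    print(tag)
```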
In step 3 and the steps after it, the method can optimize the concept tagging graph. In particular, because several mining methods may be involved, the invention arranges the analysis into an ordered sequence of tagging graph analysis optimization steps. As shown in Fig. 2, for example, the analysis optimization step can perform a Chinese word segmentation step, a part-of-speech tagging step and a named entity extraction step in turn. The Chinese word segmentation step applies to Chinese text within the mining scope and cuts a sequence of Chinese characters into individual words. The part-of-speech tagging step labels the part of speech of each word in the actual text, such as adjective, noun or verb, to describe the role the word plays in its context. The named entity extraction step identifies entities with a specific meaning in the text, mainly names of persons, places and organizations. As shown in Fig. 2, in certain embodiments the tagging graph optimization step can include several analysis steps and methods in sequence, and those skilled in the art can select or omit mining methods according to actual needs.
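The sequential arrangement of analysis steps can be sketched as a list of functions applied in order, each adding one layer of annotation. Everything in this sketch is an assumption made for illustration: whitespace splitting stands in for Chinese word segmentation, and the small lookup tables stand in for real part-of-speech and named entity models.

```python
def segment(doc):
    # Stand-in for word segmentation: split on whitespace.
    doc["tokens"] = doc["text"].split()
    return doc


def tag_parts_of_speech(doc):
    # Toy lexicon; a real tagger would use a trained model.
    lexicon = {"Alice": "noun", "visited": "verb", "Nanjing": "noun"}
    doc["pos"] = [lexicon.get(token, "unknown") for token in doc["tokens"]]
    return doc


def recognize_entities(doc):
    # Toy gazetteer of person and place names.
    gazetteer = {"Alice": "person", "Nanjing": "place"}
    doc["entities"] = [
        (index, token, gazetteer[token])
        for index, token in enumerate(doc["tokens"])
        if token in gazetteer
    ]
    return doc


def run_analysis(text, steps):
    """Apply the selected analysis steps in order to one piece of actual text."""
    doc = {"text": text}
    for step in steps:
        doc = step(doc)
    return doc


result = run_analysis(
    "Alice visited Nanjing",
    [segment, tag_parts_of_speech, recognize_entities],
)
print(result["entities"])
```

Passing the steps as a list mirrors the idea that methods can be selected or omitted according to need: dropping recognize_entities from the list simply leaves that annotation layer out.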
Before the analysis optimization step, the method may also include a mining method definition step for selecting the mining methods. This step can be performed with the mining method definition statement shown in lines 20-23 of Fig. 3, which loads and selects existing mining methods and method model libraries, including predefined word segmentation, part-of-speech tagging and classification steps. As shown in line 20 of Fig. 3, the mining method definition statement comprises the reserved word "USE", the tool name and the path where the tool is located. In particular, before step 4, as shown in Fig. 2, the method may also include a mining target definition step that defines the concepts and relations. In the present invention, the concepts and relations can be defined by the concept definition statement, relation definition statement and assignment statement shown in lines 4-7 of Fig. 3.
The assignment statement may be:
<assignstatement> ::= <string> ":=" <string> ";" | <string> ":=" "OR" "(" <stmtargs> ")" ";"
Here the right-hand <string> of the ":=" symbol is assigned to the left-hand <string>; the "OR" operator can also be used to give a concept several alternative values.
The concept definition statement may be:
<conceptstatement> ::= "CONCEPT" <stmtvars> ";" | "CONCEPT" <string> "(" <stmtarg> ")" "{" <stmtlimits> "}"
It comprises the reserved word "CONCEPT" and a statement variable list <stmtvars>; a concept can be assigned a value at the same time as it is defined. In particular, the mining target is the concrete value of the concept: the mining target can be assigned to the concept either with the assignment statement or at the time the concept is defined.
The relation definition statement may be:
<predicatestatement> ::= "PREDICATE" <string> "(" <stmtargs> ")" "{" <stmtlimits> "}"
It comprises the reserved word "PREDICATE", a statement parameter list <stmtargs> and a compound statement <stmtlimits> that expresses the constraint relations between concepts. The operators for constraint relations between concepts fall into two classes, Boolean operators and context operators, defined in Table 2 and Table 3 respectively. The relations between concepts can be of several types, and in the present invention they can be expressed with context operators and Boolean operators, for example:
"SENT": all concepts within the scope must appear in one sentence;
"DIST_n": the distance between any two adjacent concepts within the scope must not exceed n;
"ORD": all concepts within the scope appear in order;
"CONT": all concepts within the scope are adjacent;
The scope is the content inside the parentheses of an operator.
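The four context operators can be read as predicates over the token positions at which the concepts were matched. The sketch below is one possible interpretation under stated assumptions: distance is measured in tokens, "adjacent" in CONT means consecutive token positions, and sentence_of is a hypothetical list that maps each token position to its sentence index. None of these details are fixed by the text above.

```python
def sent(positions, sentence_of):
    """SENT: every matched position lies in the same sentence."""
    return len({sentence_of[p] for p in positions}) <= 1


def dist(positions, n):
    """DIST_n: neighbouring matched positions are at most n tokens apart."""
    ordered = sorted(positions)
    return all(b - a <= n for a, b in zip(ordered, ordered[1:]))


def ord_(positions):
    """ORD: positions, given in the order the concepts were listed, are increasing."""
    return all(a < b for a, b in zip(positions, positions[1:]))


def cont(positions):
    """CONT: matched positions are consecutive tokens."""
    ordered = sorted(positions)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


# Six tokens: positions 0-3 belong to sentence 0, positions 4-5 to sentence 1.
sentence_of = [0, 0, 0, 0, 1, 1]
print(sent([1, 3], sentence_of), dist([1, 3], 2), ord_([1, 3]), cont([1, 3]))
```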
Table 2. Boolean relation operator definitions
Operator    Definition
AND         All phrases within the scope of AND must appear together in the input text
OR          At least one of the phrases within the scope of OR must appear in the input text
NOT         The phrases within the scope of NOT must not appear; otherwise the input text does not match
Table 3. Context operator definitions
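The Boolean operators of Table 2 can be sketched as simple checks over an input text. Treating a phrase as "appearing" when it is a substring of the text is an assumption made for this illustration, and the function names are hypothetical.

```python
def bool_and(phrases, text):
    """AND: every phrase in the scope appears in the input text."""
    return all(phrase in text for phrase in phrases)


def bool_or(phrases, text):
    """OR: at least one phrase in the scope appears in the input text."""
    return any(phrase in text for phrase in phrases)


def bool_not(phrases, text):
    """NOT: no phrase in the scope appears in the input text."""
    return not any(phrase in text for phrase in phrases)


text = "Acme Corp will acquire Beta Inc"
print(bool_and(["acquire", "Corp"], text))
print(bool_or(["buy", "acquire"], text))
print(bool_not(["bankrupt"], text))
```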
As shown in Fig. 2, the mining target definition step can be implemented in a TML program. In addition, the TML program can be optimized: complicated or logically contradictory relations are adjusted to form precise logical relations.
In step 4, the present invention compiles the concepts and relations into bytecode according to the concepts corresponding to the mining target and the relations between them, and then forms the instruction graph.
In a specific embodiment, the process of step 4 is as shown in Fig. 5. In Fig. 5, the relation position is defined as the two concepts department and title occurring in one sentence; relations such as position are then used together with other concepts to define the concept manager. The bytecode generated by compiling the relation position is shown in Fig. 6: for the SENT operator, virtual machine instructions such as START_SENT, START_MATCH, END_MATCH and END_SENT are added. When the bytecode is executed, the virtual machine carries out the corresponding matching logic for these instructions. In the same way, all concepts and relations can be converted into bytecode. After the conversion, the instruction graph can be generated according to the dependencies between targets in the bytecode.
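The compilation of a SENT relation can be sketched as emitting a bracketed instruction sequence. Only the four instruction names come from the description above; the operand layout, the order in which the instructions are emitted and the function name compile_sent are assumptions, since the bytecode of Fig. 6 is not reproduced in the text.

```python
def compile_sent(relation_name, concepts):
    """Emit a hypothetical instruction list for SENT(concept_1, ..., concept_k)."""
    code = [("START_SENT", relation_name)]
    for concept in concepts:
        # Each concept match is bracketed by START_MATCH / END_MATCH (assumed layout).
        code.append(("START_MATCH", concept))
        code.append(("END_MATCH", concept))
    code.append(("END_SENT", relation_name))
    return code


bytecode = compile_sent("position", ["department", "title"])
for instruction in bytecode:
    print(*instruction)
```

A virtual machine for such bytecode would open a sentence-level matching context on START_SENT, try each bracketed concept inside it, and close the context on END_SENT.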
In step 5, the concept tagging graph is matched against the instruction graph, and the analysis result is formed from the content in the concept tagging graph that satisfies the concepts and relations of the instruction graph.
While the instruction graph is matched against the tagging graph, the relevant information of each step needs to be recorded. To guarantee the efficiency and correctness of rule matching, the present invention can also solve the following problems:
(1) Concurrent execution of rules
Concepts and relations that do not depend on one another can be matched concurrently. For example, the two statements SENT (person, "accepting an interview") and SENT (person, "meeting reporter") differ only in the final part of their second argument, so the part they have in common can be matched once for both. What has to be solved here is the problem caused by nodes at the top of the instruction graph being shared by several bytecode fragments.
(2) Dependencies between concepts during matching
Multiple concepts in a program may have complex dependencies on one another, so their order must be computed at the compilation stage, that is, when step 4 is performed. In Fig. 5, for example, "title" and "department" must be matched before "position" can be matched.
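Computing a valid matching order from such dependencies is a topological sort. The sketch below uses the standard graphlib module (Python 3.9 or later); the dependency table mirrors the position and manager example, and the function name is hypothetical. The patent does not say which ordering algorithm it uses.

```python
from graphlib import TopologicalSorter


def matching_order(dependencies):
    """Return the concepts in an order where every concept follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


# Each key depends on the concepts in its value set.
dependencies = {
    "position": {"title", "department"},
    "manager": {"position"},
}
order = matching_order(dependencies)
print(order)
```

TopologicalSorter raises graphlib.CycleError if the definitions depend on each other in a cycle, which is one way a compiler could reject contradictory rule sets.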
(3) Matching of context relations
For the simple Boolean operators AND, OR and NOT, the concept tagging graph can be matched efficiently against the instruction graph first and the logical relation evaluated afterwards. For context relations such as SENT and DIST_n, however, the present invention must check the logical relations of the context of the matched text in addition to the matched text itself.
(4) Nesting of relations
The relations in rules can be nested in one another. For example, DIST_3(SENT(OR("inc", "corp"), OR("acquire", "buy"))) describes a simple "corporate buyout" relation. When its bytecode is executed and the instruction graph is matched against the concept tagging graph, the SENT and OR operators nested inside the DIST_n operator must be taken into account, so this nesting has to be handled during execution.
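Nested relations can be handled by evaluating the expression tree recursively, with each operator consuming the matches produced by its arguments. The sketch below interprets the rule directly instead of compiling it to bytecode, so it illustrates the nesting only, not the virtual machine described above. The tuple encoding of rules, the single-token phrases and the token-based distance are all assumptions.

```python
from itertools import product


def evaluate(expr, tokens, sentence_of):
    """Return the matches of expr as sorted tuples of token positions."""
    if isinstance(expr, str):  # a literal phrase, restricted here to one token
        return [(i,) for i, token in enumerate(tokens) if token == expr]
    op, *args = expr
    results = [evaluate(arg, tokens, sentence_of) for arg in args]
    if op == "OR":
        # Any match of any argument is a match of OR.
        return [match for matches in results for match in matches]
    if op == "SENT":
        # Combine one match per argument and keep combinations within one sentence.
        combined = []
        for combo in product(*results):
            positions = tuple(sorted(p for match in combo for p in match))
            if len({sentence_of[p] for p in positions}) == 1:
                combined.append(positions)
        return combined
    if op.startswith("DIST_"):
        n = int(op[len("DIST_"):])
        return [m for m in results[0]
                if all(b - a <= n for a, b in zip(m, m[1:]))]
    raise ValueError(f"unknown operator: {op}")


sentences = [["acme", "inc", "will", "buy", "beta", "corp"],
             ["they", "acquire", "nothing"]]
tokens = [token for sentence in sentences for token in sentence]
sentence_of = [i for i, sentence in enumerate(sentences) for _ in sentence]

rule = ("DIST_3", ("SENT", ("OR", "inc", "corp"), ("OR", "acquire", "buy")))
print(evaluate(rule, tokens, sentence_of))
```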
After step 5, the method may also include step 6, a text classification step. The classification attributes of the text can be specified in the mining method program in order to perform step 6. As shown in line 21 of Fig. 3, the classification method statement comprises the reserved word "CLASS", a variable name and a classification attribute list; the list can consist of multiple classification attributes or be empty. Line 23 shows that a classification attribute is used to specify the method model library, the method steps and so on; the semantics of each item are given in Table 4.
Table 4. Extraction method attribute definitions
In particular, step 5 may also be followed by step 7, in which the analysis result is filtered: the analysis result is restricted according to the frequency of occurrence of the matched concepts and relations in the actual text.
In the present invention, the filtering method of step 7 can be defined with the result filter statement shown in lines 24-29 of Fig. 3. The result filter statement further restricts the analysis result of text mining according to the classification of the document and the frequency of occurrence of concepts and relations. As shown in line 24, the result filter statement comprises the reserved word "SELECT" and optional "FROM" and "WHERE" clauses. The selection objects <selectobjects> are several concepts separated by ",", as shown in line 25. The selection sources <selectsources> can consist of multiple strings denoting analysis sources, such as web pages or classification results, as shown in line 26. The selection condition list <selecetconditions> consists of several selection conditions, as shown in line 27. A selection condition <selecetcondition> consists of selection clauses linked by a comparison operator, as shown in line 28; the comparison operators are listed in Table 5. A selection clause <selectconditionclause> can be a "FREQ" operation that takes a concept name as its parameter, as shown in line 29. After the analysis result has been obtained, the analysis result filtering step can be performed according to the above.
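Frequency-based filtering in the spirit of SELECT ... WHERE FREQ(concept) can be sketched as follows. The set of comparison operators is an assumption, because the contents of Table 5 are not reproduced in the text, and the function name and the encoding of conditions as (concept, operator, threshold) triples are hypothetical.

```python
import operator
from collections import Counter

# Assumed comparison operators; Table 5 of the patent is not available here.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def select(matches, concepts, conditions):
    """Return {concept: frequency} for the selected concepts if all conditions hold.

    matches    -- one concept or relation name per match found in the text
    concepts   -- the names listed after SELECT
    conditions -- (name, comparator, threshold) triples, each standing for
                  FREQ(name) <comparator> threshold
    """
    freq = Counter(matches)
    for name, comparator, threshold in conditions:
        if not COMPARATORS[comparator](freq[name], threshold):
            return {}
    return {concept: freq[concept] for concept in concepts}


matches = ["position", "position", "manager", "position"]
print(select(matches, ["position"], [("position", ">=", 2)]))
```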
Table 5. Comparison operator definitions
In the method of the invention, the processed actual text forms the concept tagging graph, which is matched against the bytecode stream compiled from the concepts and relations written in the TML language, and the tagging result is output.
In another aspect, the invention provides a general-purpose text mining system that can implement the above general-purpose text mining method. As shown in Fig. 7, the system includes:
Loading module 100, for using a web crawler to load the mining objects within the mining scope;
Actual text extraction module 200, for extracting the actual text within the mining scope;
Tagging graph generation module 300, for forming a concept tagging graph from the actual text;
Tagging graph optimization module 400, for optimizing the tagging graph generated by the tagging graph generation module 300 according to the mining method definition module 710 and the method model library 720. The tagging graph optimization module 400 uses the mining methods in the method model library 720 for the optimization; the method model library 720 can include a word segmentation module 410, a part-of-speech tagging module 420 and a named entity recognition module 430.
Compiling module 500, for compiling the concepts and relations into bytecode according to the concepts and the relations between them, and then forming an instruction graph;
Matching module 600, for matching the concept tagging graph against the instruction graph and forming the analysis result from the content in the concept tagging graph that satisfies the concepts and relations of the instruction graph.
The system also includes a definition module 700, which includes a mining method definition module 710 for selecting, from the method model library 720, the mining methods used by the analysis optimization module. Accordingly, the system also includes the method model library 720 for storing mining methods, from which the mining method definition module 710 selects mining methods. The mining methods in the method model library 720 can include part-of-speech tagging, named entity extraction, text classification, keyword extraction and the like.
The definition module 700 can also include a mining scope definition module 730 for defining the mining scope.
The definition module 700 can also include a mining target definition module 740, for defining the concepts corresponding to the mining target and the relations between the concepts; the mining target serves as the concrete value of the concepts.
Preferably, the relations between concepts used in the system include:
"SENT": all concepts within the scope must appear in the same sentence;
"DIST_n": the distance between any two adjacent concepts within the scope must not exceed n;
"ORD": all concepts within the scope appear in order;
"CONT": all concepts within the scope are adjacent.
Preferably, the system may further include a text classification module 900 and a result filtering module 800. The text classification module 900 is configured to classify the actual text by subject. The result filtering module 800 is configured to restrict the analysis result according to the frequency of occurrence of the concepts and relations in the concept annotation graph. The two modules may process the result independently or one after the other; the present invention does not limit the order of text classification and result filtering.
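The four relations listed above (SENT, DIST_n, ORD, CONT) can be read as simple checks over the positions of concept occurrences. The Python sketch below is one possible reading, not the patented implementation: the `Span` type, the token-based distance measure, and the choice to treat CONT as "zero tokens in between" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Span:
    """Hypothetical record of one concept occurrence in a tokenized document."""
    sent: int   # index of the sentence containing the occurrence
    start: int  # first token position (inclusive, document-wide)
    end: int    # token position just past the occurrence (exclusive)


def gap(a: Span, b: Span) -> int:
    """Number of tokens between two spans; overlapping spans count as 0."""
    first, second = sorted((a, b), key=lambda s: s.start)
    return max(0, second.start - first.end)


def sent(spans: Sequence[Span]) -> bool:
    """SENT: every occurrence lies in the same sentence."""
    return len({s.sent for s in spans}) <= 1


def dist(spans: Sequence[Span], n: int) -> bool:
    """DIST_n: neighbouring occurrences are at most n tokens apart."""
    ordered = sorted(spans, key=lambda s: s.start)
    return all(gap(a, b) <= n for a, b in zip(ordered, ordered[1:]))


def ord_(spans: Sequence[Span]) -> bool:
    """ORD: occurrences appear in the order in which the arguments are given."""
    return all(a.end <= b.start for a, b in zip(spans, spans[1:]))


def cont(spans: Sequence[Span]) -> bool:
    """CONT: occurrences are directly adjacent (no tokens in between)."""
    return dist(spans, 0)


# Made-up positions: a company name, then two filler tokens, then
# a department word immediately followed by a title word.
company = Span(sent=0, start=0, end=2)
department = Span(sent=0, start=4, end=5)
title = Span(sent=0, start=5, end=6)

is_manager = cont([department, title])
same_sentence = sent([company, department, title])
```

With these positions, `cont([department, title])` holds because the two spans touch, while `dist([company, department], 1)` does not because two tokens separate them.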
In one embodiment of the present invention, the TML program described below solves the problem of finding the marketing department directors of well-known companies, starting from specified seed nodes.
First, the mining target is defined. The mining target needs to include concepts such as well-known company, marketing department, and director, together with the relations formed from these concepts, such as "marketing department director" and "marketing department director of a well-known company". The corresponding TML program fragment consists of the CONCEPT and PREDICATE definitions in the complete program given later in this embodiment.
The hierarchical relationship between these concepts and relations is shown in Fig. 8.
Second, the mining scope is defined in this embodiment. The seed nodes to be crawled are http://finance.sina.com.cn and http://it.sohu.com, and the crawl depth is 3. The corresponding TML program fragment is the PAGES block in the complete program given later in this embodiment.
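A mining scope given as seed nodes plus a crawl depth can be pictured as a depth-limited breadth-first traversal of the link graph. The Python sketch below works on an in-memory link table instead of the live web; the function name `pages_in_scope`, the placeholder page names, and the convention that a seed has depth 0 (so depth 3 means at most three link hops) are assumptions, since the text does not state how depth is counted.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def pages_in_scope(seeds: Iterable[str],
                   links: Dict[str, List[str]],
                   max_depth: int) -> List[str]:
    """Breadth-first traversal from the seeds, following at most max_depth links."""
    seen: Set[str] = set()
    order: List[str] = []
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in seen:
            continue  # already loaded via an equal or shorter path
        seen.add(url)
        order.append(url)
        if depth < max_depth:
            for target in links.get(url, []):
                queue.append((target, depth + 1))
    return order


# Made-up link table; only the two seed URLs come from the embodiment.
links = {
    "http://finance.sina.com.cn": ["page-1"],
    "http://it.sohu.com": ["page-1"],
    "page-1": ["page-2"],
    "page-2": ["page-3"],
    "page-3": ["page-4"],
}
scope = pages_in_scope(
    ["http://finance.sina.com.cn", "http://it.sohu.com"], links, max_depth=3
)
```

Here `page-4` is four hops from either seed and therefore falls outside the scope, while `page-1` is loaded only once even though both seeds link to it.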
Third, the mining method is defined in this embodiment; at least one Chinese word segmenter is loaded. Assume that the segmenter is named chn_tkzo_mini.bin and is located in the current path. The corresponding TML program fragment is as follows:
USE(tokenizer:"./chn_tkzo_mini.bin");
Fourth, a result filtering rule may be defined in this embodiment. The filtering rule is to find matching comppos relations only in analyzed documents that contain at least one named entity of type PERSON. The corresponding TML program fragment is as follows:
SELECT comppos FROM sample1 WHERE FREQ(PERSON)>0;
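The SELECT statement above can be read as "keep comppos matches only from documents in which PERSON occurs at least once". The Python sketch below mimics that reading on made-up data. The `AnalyzedDoc` structure, the `freq` and `select` helper names, and the sample values are hypothetical and serve only to make the filtering rule concrete.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnalyzedDoc:
    """Hypothetical per-document analysis output."""
    url: str
    annotations: List[str]  # concept labels found in the document
    matches: Dict[str, List[str]] = field(default_factory=dict)  # predicate -> matched text


def freq(doc: AnalyzedDoc, concept: str) -> int:
    """Number of times a concept label occurs in the document's annotations."""
    return Counter(doc.annotations)[concept]


def select(predicate: str, docs: List[AnalyzedDoc],
           concept: str, min_freq: int = 0) -> List[str]:
    """Return matches of `predicate` from documents where FREQ(concept) > min_freq."""
    results: List[str] = []
    for doc in docs:
        if freq(doc, concept) > min_freq:
            results.extend(doc.matches.get(predicate, []))
    return results


# Made-up analysis output for two documents.
docs = [
    AnalyzedDoc("doc-a", ["PERSON", "company"], {"comppos": ["match-1"]}),
    AnalyzedDoc("doc-b", ["company"], {"comppos": ["match-2"]}),
]
selected = select("comppos", docs, "PERSON")
```

Only `match-1` survives, because `doc-b` contains no PERSON annotation.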
The complete program of this embodiment is as follows:
USE(tokenizer:"./chn_tkzo_mini.bin");
CONCEPT brand,company;
brand:="IBM";
brand:="General Motors";
company:="China PetroChemical Corporation";
company:="State Grid Corporation of China";
…
CONCEPT department,title;
department:=OR("market", "brand", "marketing", "public relations");
title:=OR("manager", "higher level manager", "chief inspector", "assistant director", "department head");
PREDICATE manager(department d,title t){
    CONT(d,t);
}
PREDICATE comppos(manager per,company comp){
    OR(SENT(per,comp),DIST_15(per,comp));
}
PREDICATE comppos(manager per,brand prod){
    OR(SENT(per,prod),DIST_15(per,prod));
}
PAGES sample1{
    SEED("http://finance.sina.com.cn");
    SEED("http://it.sohu.com");
    DEPTH("3");
}
SELECT comppos FROM sample1 WHERE FREQ(PERSON)>0;
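To make the role of the CONCEPT definitions more concrete, the Python sketch below annotates a text with concepts defined as OR-lists of literal strings, in the spirit of the `department`, `title`, and `brand` definitions of the program. It is only a sketch under stated assumptions: the `Annotation` type, the `annotate` function, the longest-literal-first rule, and the English sample sentence are not taken from the source, and a real system would apply word segmentation first for Chinese text.

```python
import re
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    concept: str
    start: int  # character offset (inclusive)
    end: int    # character offset (exclusive)
    text: str


def annotate(text: str, concepts: Dict[str, List[str]]) -> List[Annotation]:
    """Find every occurrence of each concept's literals in the text."""
    found: List[Annotation] = []
    for concept, literals in concepts.items():
        if not literals:
            continue
        # Try longer literals first so that "marketing" wins over "market".
        alternatives = sorted(literals, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(lit) for lit in alternatives))
        for m in pattern.finditer(text):
            found.append(Annotation(concept, m.start(), m.end(), m.group()))
    return sorted(found, key=lambda a: a.start)


# Literals copied from the program; the sentence is a made-up example.
concepts = {
    "brand": ["IBM", "General Motors"],
    "department": ["market", "brand", "marketing", "public relations"],
    "title": ["manager", "higher level manager", "chief inspector",
              "assistant director", "department head"],
}
text = "The marketing manager of IBM spoke."
annotations = annotate(text, concepts)
```

The resulting annotations label "marketing" as `department`, "manager" as `title`, and "IBM" as `brand`, which is the kind of input on which predicates such as `manager` and `comppos` would then be checked.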
In another embodiment of the present invention, purchase intention analysis is taken as an example to illustrate the use of the method and system described herein.
The basic approach to mining potential purchase intentions from user posts is as follows. After the user posts are downloaded from social networks, the purchase patterns common in this field are formulated by programming in TML. In addition, knowledge bases are written for industries such as automobiles, real estate, travel, insurance, and cosmetics, covering products, product attributes, effects, brands, companies, and so on. These are then used to define the specific purchase intention patterns of the different industries. The following are some program fragments related to the cosmetics industry:
COSMETIC-PRODUCT:=OR("skin base solution", "eye concealer", "toiletries", "cosmetics storage", ..., "body sunscreen");
ATTRIBUTE(COSMETIC-PRODUCT, classname):="cosmetics";
CONCEPT COSMETIC-ASPECTS:=OR("moisturizing", "moistening", "whitening", "deep cleansing", ...);
# budget-related
CONCEPT BUDGET:=OR("budget", "affording", "can afford", "enough", "short of money", ...);
PREDICATE cosmetics-intending(BUDGET bd,COSMETIC-PRODUCT product){
    AND(SENT(DIST_5(bd,product),NOT(NOT-PUR-DIST)),NOT(AD));
}
Table 6 gives the figures for the TML programs used in purchase intention analysis. The average matching speed is the speed at which the virtual machine downloads documents and executes the program; it is about 1 MBps in a typical environment with a single CPU and 2 GB of memory. However, when the text contains a large number of purchase intentions, so that every instruction may be executed, the matching speed drops to about 20 KBps. Across the 7 industries in which TML is used to analyze purchase intention, tens of thousands of purchase intentions are identified per day, with an effective accuracy between 40% and 50%. The running time of a TML program is not directly related to the program length, so very large-scale problems can be solved.
Table 6: Purchase intention analysis in a specific embodiment of the present invention
Although some specific embodiments of the present invention have been described in detail by way of example, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the present invention. Those skilled in the art should also understand that the above embodiments may be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims (10)

1. A general-purpose text mining method, characterized by comprising:
Step 1: running a web crawler to load the mining objects within the mining scope;
Step 2: performing actual text extraction on the mining objects to obtain actual text;
Step 3: forming a concept annotation graph from the actual text;
Step 4: compiling the concepts and relations into bytecode according to the concepts corresponding to the mining target and the relations between the concepts, and then forming an instruction graph;
Step 5: matching the concept annotation graph against the instruction graph, and forming an analysis result from the content of the concepts and relations in the concept annotation graph that satisfy the instruction graph.
2. The general-purpose text mining method according to claim 1, characterized in that, after step 3, the method includes an analysis optimization step for optimizing the concept annotation graph, the mining methods of the analysis optimization step including word segmentation, part-of-speech analysis, and named entity recognition.
3. The general-purpose text mining method according to claim 2, characterized in that, before the analysis optimization step, the method includes a mining method definition step for selecting the mining method.
4. The general-purpose text mining method according to claim 1, characterized in that, before step 4, the method includes a mining object definition step of defining the concepts corresponding to the mining target and defining the relations between the concepts, the mining target being a specific instance of the concepts and relations.
5. The general-purpose text mining method according to claim 1, characterized in that a mining scope definition step of defining the mining scope is included before step 1.
6. A general-purpose text mining system, characterized by comprising:
a loading module (100), configured to use a web crawler to load the mining objects within the mining scope;
an actual text extraction module (200), configured to perform actual text extraction on the mining objects to obtain actual text;
an annotation graph generation module (300), configured to form a concept annotation graph from the actual text;
a compiling module (500), configured to compile the concepts and relations into bytecode according to the concepts corresponding to the mining target and the relations between the concepts, and then to form an instruction graph;
a matching module (600), configured to match the concept annotation graph against the instruction graph, and to form an analysis result from the content of the concepts and relations in the concept annotation graph that satisfy the instruction graph.
7. The general-purpose text mining system according to claim 6, characterized by comprising an analysis optimization module (400), configured to optimize the concept annotation graph.
8. The general-purpose text mining system according to claim 6, characterized by comprising a mining method definition module (710), configured to select the mining method used in the analysis optimization module (400).
9. The general-purpose text mining system according to claim 6, characterized by comprising a mining object definition module (740), configured to define the concepts corresponding to the mining target and to define the relations between the concepts, the mining target being a specific instance of the concepts.
10. The general-purpose text mining system according to claim 6, characterized by comprising a mining scope definition module (730), configured to define the mining scope.
CN201510135053.9A 2015-02-28 2015-03-25 A kind of generic text method for digging and system Active CN106156035B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2015100918874 2015-02-28
CN201510091887 2015-02-28

Publications (2)

Publication Number Publication Date
CN106156035A true CN106156035A (en) 2016-11-23
CN106156035B CN106156035B (en) 2019-10-22

Family

ID=57340021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510135053.9A Active CN106156035B (en) 2015-02-28 2015-03-25 A kind of generic text method for digging and system

Country Status (1)

Country Link
CN (1) CN106156035B (en)

Also Published As

Publication number Publication date
CN106156035B (en) 2019-10-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant