CN106156035A - A general text mining method and system
- Publication number: CN106156035A (also listed as CN201510135053.9A, CN201510135053A)
- Authority: CN (China)
- Prior art keywords: concept, mining, text, relation
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a general text mining method, comprising: step 1, running a
web crawler to load the mining objects within the mining scope; step 2, performing
actual text extraction on the mining objects to obtain the actual text; step 3,
forming a concept annotation graph from the actual text; step 4, according to the
concepts corresponding to the mining target and the relations between those
concepts, compiling the concepts and relations into bytecode and then forming an
instruction graph; step 5, matching the concept annotation graph against the
instruction graph, and forming the analysis result from the content of the concept
annotation graph that satisfies the concepts and relations of the instruction
graph. The general text mining method of the invention allows the mining target,
mining scope, and mining methods to be described formally, so that text retrieval
and mining can be carried out in different domains.
Description
Technical field
The invention belongs to the technical field of text search, and more specifically
to the field of text mining and natural language processing. It relates to a
general text mining method that provides a formal means of describing the mining
target, mining scope, analysis algorithms, and result filtering rules of text
mining requirements in different domains.
Background
Since the 1990s, information extraction has received sustained attention from
academia and industry, and a large body of related research exists. With the rapid
growth in the amount of information on the Internet, the mining of unstructured
text has become a research hotspot. Text mining is increasingly widely applied in
fields such as public opinion monitoring, intelligence analysis, and business
intelligence.
Existing extraction techniques are generally built for a concrete domain, that is,
they are used within a specific field decided in advance. For example, much of the
work on extracting information from web pages, and especially the work on wrapper
generation, extracts useful information only by exploiting the structure of the
page. Other systems, such as DBLife, use the Datalog language to let users
customize the extraction target and to improve extraction efficiency, but this
approach can be used only in the field of document analysis.
The inventors have recognized that existing information extraction techniques
offer no general approach that can be used across multiple domains, so in practice
they cannot give users information extraction support for multiple domains.
Existing techniques also have technical shortcomings that prevent them from
achieving general text mining. On the one hand, they describe the matching target
with Boolean logic over keywords, which limits how expressively the target can be
described. On the other hand, they are not general: after a change of industry or
scenario they usually have to be implemented again.
Summary of the invention
An object of the invention is to provide a general text mining method. The
inventors believe that, building on rule-based extraction techniques, a rule
language that is independent of the application can be developed to achieve
generality across domains, letting users customize extraction targets in different
domains by means of declarative extraction. To meet this need, a general text
mining system has to solve three main problems: 1) how to provide a formal method
for describing text mining patterns; 2) how to support mining targets that contain
complex semantic structures; 3) how to make large-scale semantic extraction
efficient. The method of the invention proposes a formal means of definition that
can describe the mining target, mining scope, and analysis algorithms of text
mining; it delivers analysis results in a layered manner and further provides
filtering rules for those results. In another aspect, the invention also provides
a system that can implement the above text mining method.
The general text mining method provided by the invention includes:
Step 1, running a web crawler to load the mining objects within the mining scope;
Step 2, performing actual text extraction on the mining objects to obtain the
actual text;
Step 3, forming a concept annotation graph from the actual text;
Step 4, according to the concepts corresponding to the mining target and the
relations between those concepts, compiling the concepts and relations into
bytecode, and then forming an instruction graph;
Step 5, matching the concept annotation graph against the instruction graph, and
forming the analysis result from the content of the concept annotation graph that
satisfies the concepts and relations of the instruction graph.
After step 3, an analysis optimization step for optimizing the concept annotation
graph may be included. The mining methods used in the analysis optimization step
include word segmentation, part-of-speech analysis, and named entity recognition.
Before the analysis optimization step, a mining method definition step for
selecting those mining methods may also be included.
Before step 4, a mining target definition step may also be included, which defines
the concepts corresponding to the mining target and the relations between those
concepts; the mining target is the concrete value of the concepts and relations.
Before step 1, a mining scope definition step that defines the mining scope may be
included.
After step 5, a step of classifying the text according to the matching results of
the concepts and relations may also be included.
The relations between the concepts include:
"SENT": all concepts in the scope must appear in the same sentence;
"DIST_n": the distance between any two adjacent concepts in the scope must not
exceed n;
"ORD": all concepts in the scope must appear in order;
"CONT": all concepts in the scope must be adjacent.
The method of the invention also includes step 6, performing topic classification
on the actual text. In addition, it may include step 7, filtering the analysis
result: the analysis result is restricted according to the frequency with which
the matched concepts and relations occur in the actual text.
The general text mining system provided by the invention includes:
A loading module, for using a web crawler to load the text of the mining objects
within the mining scope;
A text extraction module, for extracting the actual text from the mining objects;
An annotation graph generation module, for forming a concept annotation graph from
the actual text;
A compilation module, for compiling the concepts corresponding to the mining
target, and the relations between them, into bytecode, and then forming an
instruction graph;
A matching module, for matching the concept annotation graph against the
instruction graph and forming the analysis result from the content of the concept
annotation graph that satisfies the concepts and relations of the instruction
graph.
The system may include an analysis optimization module for optimizing the concept
annotation graph. The system may also include a mining method definition module
for selecting the mining methods used in the analysis optimization module.
Further, the system may include a method model library for storing the mining
methods, from which the mining method definition module selects. The mining
methods in the method model library include a word segmentation module, a
part-of-speech analysis module, a named entity recognition module, and so on.
The system may also include a mining target definition module for defining the
concepts corresponding to the mining target and the relations between those
concepts; the mining target serves as the concrete value of the concepts.
The relations between the concepts may include:
"SENT": all concepts in the scope must appear in the same sentence;
"DIST_n": the distance between any two adjacent concepts in the scope must not
exceed n;
"ORD": all concepts in the scope must appear in order;
"CONT": all concepts in the scope must be adjacent.
The system of the invention may also include a mining scope definition module for
defining the mining scope. The system also includes a text classification module,
which performs topic classification on the actual text according to the matching
results of the concepts and relations in the text and the domain the text relates
to. In addition, the system includes a result filtering module, which restricts
the analysis result according to the frequency with which the matched concepts and
relations occur in the text.
The invention proposes a general text mining method that can formally describe the
mining target, mining scope, and mining methods of text mining and obtain the
mining result. With a corresponding context-sensitive language, the invention
supports the extraction of arbitrary concepts and relations, and also provides
means for customizing the mining scope and the mining methods used. The user only
needs to use the rules of the invention to define concepts and the relations
between them, and to assign the concrete search targets to those concepts, in
order to describe a text mining requirement and then obtain the analysis result of
the text mining.
According to the mining method of the invention, the user defines the required
concepts and assigns values to them, defines the relations between the concepts,
and then selects the mining scope and mining methods. The method can use the
mining methods to annotate the text within the mining scope and generate a concept
annotation graph. Separately, the concepts and relations defined by the user are
compiled into bytecode to generate an instruction graph. The concept annotation
graph and the instruction graph are then matched, so that the text content within
the mining scope that satisfies the user-defined concepts and relations is
separated out to form the analysis result.
The inventors have found that, in the prior art, practitioners have not attempted
to provide a general text mining method or template of this kind, and those
skilled in the art have not recognized the importance of a truly general text
mining method. The technical task to be achieved, or the technical problem to be
solved, by the invention is therefore one that those skilled in the art have never
conceived of or anticipated, and the invention is accordingly a new technical
solution.
Other features and advantages of the invention will become apparent from the
following detailed description of exemplary embodiments with reference to the
accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the
specification, show embodiments of the invention and, together with the
description, serve to explain the principles of the invention.
Fig. 1 is a block diagram of the general text mining method of the invention;
Fig. 2 is a block diagram of the general text mining method in a specific
embodiment of the invention;
Fig. 3 is the simplified grammar of the TML language corresponding to the method
of the invention;
Fig. 4 is an example of a concept annotation graph in a specific embodiment of the
invention;
Fig. 5 is an example of relation definition statements of a TML program in a
specific embodiment of the invention;
Fig. 6 is an example of the bytecode of a relation statement of a TML program in a
specific embodiment of the invention;
Fig. 7 is a structural block diagram of the general text mining system in a
specific embodiment of the invention;
Fig. 8 is a hierarchy diagram of the concepts and relations of a TML program in a
specific embodiment of the invention.
Detailed description of the invention
Various exemplary embodiments of the invention are now described in detail with
reference to the accompanying drawings. It should be noted that, unless
specifically stated otherwise, the relative arrangement of the components and
steps, the numerical expressions, and the numerical values set out in these
embodiments do not limit the scope of the invention.
The following description of at least one exemplary embodiment is merely
illustrative and in no way limits the invention or its application or use.
Techniques, methods, and apparatus known to those of ordinary skill in the
relevant art may not be discussed in detail, but where appropriate they should be
regarded as part of the specification.
In all examples shown and discussed here, any concrete value should be interpreted
as merely illustrative and not as a limitation. Other examples of the exemplary
embodiments may therefore have different values.
It should also be noted that similar reference numerals and letters denote similar
items in the following drawings, so once an item has been defined in one drawing
it need not be discussed further in subsequent drawings.
The invention provides a general text mining method, and Fig. 1 is a block diagram
of the method. The text mining method implements web crawling, actual text
extraction, text annotation graph generation, word segmentation, part-of-speech
tagging, named entity recognition, text classification, and related techniques,
and combines them into a pipeline analysis. As shown in Fig. 1, it includes the
following steps: 1, running a web crawler to load the mining objects within the
mining scope; 2, extracting the actual text from the mining objects; 3, forming a
concept annotation graph from the actual text; 4, compiling the concepts and
relations into bytecode and forming an instruction graph; 5, matching the concept
annotation graph against the instruction graph to obtain the analysis result. In
the method of the invention, a mining object can be any carrier of text, such as a
web page, a book, or a file.
The text mining method provided by the invention can be based on the TML language.
As shown in Fig. 3, the method is implemented by writing regular expression
definition statements, concept definition statements, assignment statements,
relation definition statements, mining scope definition statements, analysis
algorithm definition statements, output statements, load statements, and so on.
Fig. 3 shows the TML program framework corresponding to the general text mining
method; with this framework, a user of the method can perform text mining analysis
on any domain and any mining scope. The regular expressions are used to define the
form of the concepts and of their concrete values.
In step 1, the text mining method first runs the web crawler and loads the mining
objects within the selected mining scope. Before step 1, as shown in Fig. 2, a
mining scope definition step may be performed first: the mining scope is selected
by the mining scope definition statement shown in lines 17-19 of Fig. 3. The
mining scope definition statement is a compound statement consisting of the
reserved word "PAGES", a variable name <string>, and a mining scope attribute list
<pagerestricts>. The attribute list on line 18 can contain several mining scope
attributes <pagerestrict>. On line 19, a mining scope attribute consists of an
attribute and an attribute value; the attributes are defined in Table 1.
Table 1: Mining scope attribute definitions
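To make the role of the mining scope concrete, the following Python sketch shows how a set of seed pages and a crawl depth could bound the set of mining objects. It is an illustration only. It assumes just the two attributes that the embodiment later in this description names (seed nodes and crawl depth); the function name `pages_in_scope` and the in-memory `links` dictionary, which stands in for pages a real crawler would fetch and parse, are hypothetical, and treating seeds as depth 0 is likewise an assumption.

```python
from collections import deque
from typing import Dict, Iterable, List


def pages_in_scope(seeds: Iterable[str],
                   links: Dict[str, List[str]],
                   max_depth: int) -> List[str]:
    """Breadth-first walk from the seeds, stopping at max_depth.

    `links` maps a page URL to the URLs it links to. In a real crawler
    this information would come from fetching and parsing each page.
    Seeds are treated as depth 0 (an assumption of this sketch).
    """
    seen = set()
    order = []
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        # Only follow links while the depth limit has not been reached.
        if depth < max_depth:
            for target in links.get(url, []):
                if target not in seen:
                    queue.append((target, depth + 1))
    return order
```

Breadth-first order is used so that the depth recorded for a page is the shortest distance from any seed, which keeps the depth limit meaningful when pages link back to each other.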
In step 2, because the mining scope can take many forms, the mining objects in it
may contain elements that are meaningless for mining, such as the layout markup of
a web page or the typesetting characters of a book. The pipeline analysis can
therefore include an actual text extraction step. This step removes the parts of a
mining object that are meaningless for mining and retrieval, and yields the actual
text that really needs to be mined and analyzed.
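For a web page, the extraction step can be pictured as stripping markup and non-visible content. The sketch below uses Python's standard `html.parser` module; it is a minimal illustration rather than the extraction algorithm of the invention, and the class and function names are hypothetical. It assumes that only `<script>` and `<style>` content needs to be discarded in addition to the tags themselves.

```python
from html.parser import HTMLParser


class ActualTextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_actual_text(html: str) -> str:
    """Return the text of an HTML page with markup and scripts removed."""
    parser = ActualTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A production extractor would also have to drop navigation bars, advertisements, and similar page furniture, which this sketch does not attempt.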
In step 3, the invention generates a concept annotation graph from the actual
text. In the concept annotation graph, the words and phrases that satisfy the
mining target, that is, those that match a concrete value of a concept, are
labeled with the corresponding concept. In a specific embodiment, the concept
annotation graph is as shown in Fig. 4.
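The labeling described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the annotation graph is reduced to a flat list of labeled character spans, concept values are plain strings matched by exact substring search, and the names `Annotation` and `annotate` are hypothetical. The structure shown in Fig. 4 is not reproduced here and may differ.

```python
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    start: int    # offset of the first character of the match
    end: int      # offset one past the last character of the match
    concept: str  # concept whose value matched


def annotate(text: str, concepts: Dict[str, List[str]]) -> List[Annotation]:
    """Label every occurrence of a concept value in the text."""
    found = []
    for concept, values in concepts.items():
        for value in values:
            if not value:
                continue  # an empty value would match everywhere
            start = text.find(value)
            while start != -1:
                found.append(Annotation(start, start + len(value), concept))
                start = text.find(value, start + 1)
    # Sort by position so later stages can walk the text left to right.
    return sorted(found)
```

For example, with the concepts `department` and `title`, the sentence "The marketing department director spoke." yields one span for each concept, in text order.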
In step 3 and the steps that follow it, the method of the invention can optimize
the concept annotation graph. Because several mining methods may be involved, the
invention arranges the analysis into an ordered sequence of annotation graph
analysis optimization steps. As shown in Fig. 2, for example, the analysis
optimization step can perform a Chinese word segmentation step, a part-of-speech
tagging step, and a named entity extraction step in turn. The Chinese word
segmentation step applies to Chinese text within the mining scope and cuts a
sequence of Chinese characters into individual words. The part-of-speech tagging
step labels the part of speech of the words in the actual text, such as adjective,
noun, or verb, to describe the role a word plays in its context. The named entity
extraction step identifies entities in the text that have a specific meaning,
mainly the names of persons, places, and organizations. As shown in Fig. 2, in
specific embodiments of the invention the annotation graph optimization step can
include several analysis steps and methods in sequence, and those skilled in the
art can select or omit mining methods according to the needs of actual use.
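The ordering of the optimization steps matters because later steps consume the output of earlier ones. The Python sketch below illustrates only that chaining. Every name in it is hypothetical, and the two stages are deliberately toy stand-ins (whitespace splitting and a capitalization test), not real word segmentation or named entity extraction.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]


def run_stages(document: Document, stages: List[Stage]) -> Document:
    """Apply each optimization stage in order; each adds its own layer."""
    for stage in stages:
        document = stage(document)
    return document


def toy_segmenter(document: Document) -> Document:
    """Toy stand-in for word segmentation: split on whitespace."""
    return {**document, "tokens": str(document["text"]).split()}


def toy_entity_tagger(document: Document) -> Document:
    """Toy stand-in for entity extraction: keep capitalized tokens.

    Relies on the "tokens" layer, so it must run after segmentation.
    """
    tokens = document["tokens"]
    return {**document, "entities": [t for t in tokens if t[:1].isupper()]}
```

Selecting or omitting a mining method then amounts to editing the list passed to `run_stages`.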
Before the analysis optimization step, a mining method definition step for
selecting the mining methods may also be included. The mining method definition
statement shown in lines 20-23 of Fig. 3 loads and selects existing mining methods
and the method model library, including predefined word segmentation,
part-of-speech tagging, and classification steps, in order to carry out the mining
method definition step. As shown on line 20 of Fig. 3, the mining method
definition statement includes the reserved word "USE", the name of the tool, and
the path where the tool is located. In particular, before step 4, as shown in
Fig. 2, a mining target definition step that defines the concepts and relations
may also be included. In the invention, the concepts and relations can be defined
by the concept definition statement, relation definition statement, and assignment
statement shown in lines 4-7 of Fig. 3.
An assignment statement may be:
<assignstatement> ::= <string> ":=" <string> ";"
                    | <string> ":=" "OR" "(" <stmtargs> ")" ";"
Here the right-hand <string> of the ":=" symbol is assigned to the left-hand
<string>; the "OR" operator can also be used to give a concept several alternative
values.
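The two forms of the assignment statement can be recognized with a small parser, sketched below in Python. The sketch makes assumptions that the grammar above does not settle: the left-hand <string> is taken to be an identifier, assigned values are taken to be double-quoted literals without embedded quotes, and <stmtargs> is taken to be a comma-separated list of such literals. The function name `parse_assignment` is hypothetical.

```python
import re
from typing import List, Tuple

# name := "value";   or   name := OR("v1", "v2", ...);
_ASSIGN = re.compile(
    r'^\s*(?P<name>\w+)\s*:=\s*'
    r'(?:"(?P<single>[^"]*)"|OR\s*\((?P<args>[^)]*)\))\s*;\s*$'
)
_QUOTED = re.compile(r'"([^"]*)"')


def parse_assignment(statement: str) -> Tuple[str, List[str]]:
    """Return (concept name, list of assigned values)."""
    match = _ASSIGN.match(statement)
    if match is None:
        raise ValueError(f"not an assignment statement: {statement!r}")
    if match.group("single") is not None:
        # First form: a single value on the right-hand side.
        return match.group("name"), [match.group("single")]
    # Second form: OR(...) lists several alternative values.
    return match.group("name"), _QUOTED.findall(match.group("args"))
```

Returning a list in both cases lets later stages treat a single value and an "OR" of several values uniformly.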
A concept definition statement may be:
<conceptstatement> ::= "CONCEPT" <stmtvars> ";"
                     | "CONCEPT" <string> "(" <stmtarg> ")" "{" <stmtlimits> "}"
It includes the reserved word "CONCEPT" and a statement variable list <stmtvars>;
a concept can be assigned a value at the time it is defined. In particular, the
mining target is the concrete value of a concept: the mining target can be
assigned to the concept either with the assignment statement or at the time the
concept is defined.
A relation definition statement may be:
<predicatestatement> ::= "PREDICATE" <string> "(" <stmtargs> ")" "{" <stmtlimits> "}"
It includes the reserved word "PREDICATE", a statement parameter list <stmtargs>,
and a compound statement <stmtlimits> that expresses the constraints between
concepts. The constraint operators between concepts fall into two classes, Boolean
operators and context operators, which are defined in Table 2 and Table 3
respectively. The relations between concepts can be of many types; in the
invention they can be expressed with the context operators and Boolean operators,
for example:
"SENT": all concepts in the scope must appear in the same sentence;
"DIST_n": the distance between any two adjacent concepts in the scope must not
exceed n;
"ORD": all concepts in the scope must appear in order;
"CONT": all concepts in the scope must be adjacent.
The scope of an operator is the content inside its parentheses.
Operator | Definition |
AND | All words and phrases in the scope of AND must appear together in the input text |
OR | At least one of the words and phrases in the scope of OR must appear in the input text |
NOT | The words and phrases in the scope of NOT must not appear; otherwise the input text does not match |
Table 2: Boolean operator definitions
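The definitions in Table 2 translate directly into predicates over the input text. The Python sketch below is one straightforward reading, assuming that a word or phrase "appears" when it occurs as a substring of the text; the function names are hypothetical.

```python
from typing import Iterable


def op_and(text: str, phrases: Iterable[str]) -> bool:
    """AND: every phrase in the scope occurs in the text."""
    return all(phrase in text for phrase in phrases)


def op_or(text: str, phrases: Iterable[str]) -> bool:
    """OR: at least one phrase in the scope occurs in the text."""
    return any(phrase in text for phrase in phrases)


def op_not(text: str, phrases: Iterable[str]) -> bool:
    """NOT: no phrase in the scope occurs in the text."""
    return not any(phrase in text for phrase in phrases)
```

These predicates look only at whether a phrase is present, not where, which is why the context operators need position information in addition.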
Table 3: Context operator definitions
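Unlike the Boolean operators, the context operators SENT, DIST_n, ORD, and CONT depend on where the matched concepts occur. The Python sketch below reads them as predicates over match positions. It rests on assumptions that the description does not fix: each match is a hypothetical `Hit` record with a sentence index and a token index, hits are listed in the order the concepts appear in the operator's scope, distance is counted in tokens, and "adjacent" means consecutive token positions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """One matched concept occurrence (hypothetical structure)."""
    concept: str
    sentence: int  # index of the sentence containing the match
    token: int     # index of the matched token in the whole text


def sent(hits: List[Hit]) -> bool:
    """SENT: all concepts in the scope occur in the same sentence."""
    return len({h.sentence for h in hits}) <= 1


def dist(hits: List[Hit], n: int) -> bool:
    """DIST_n: concepts adjacent in the scope are at most n tokens apart."""
    return all(abs(b.token - a.token) <= n for a, b in zip(hits, hits[1:]))


def ord_(hits: List[Hit]) -> bool:
    """ORD: concepts occur in the text in the order they are listed."""
    return all(a.token < b.token for a, b in zip(hits, hits[1:]))


def cont(hits: List[Hit]) -> bool:
    """CONT: concepts occupy consecutive token positions."""
    positions = sorted(h.token for h in hits)
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))
```

Other readings are possible, for example measuring DIST_n in characters, and would change only the bodies of these functions.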
As shown in Fig. 2, the mining target definition step can be implemented in a TML
program. The TML program can also be optimized: relations that are complicated or
logically contradictory are adjusted to form precise logical relations.
In step 4, the invention compiles the concepts corresponding to the mining target,
and the relations between them, into bytecode, and then forms the instruction
graph.
In a specific embodiment, the process of step 4 is as shown in Fig. 5. In Fig. 5,
the relation "position" is defined as the two concepts "department" and "title"
appearing in the same sentence; the concept "manager" is then defined using
relations and concepts such as "position". The bytecode generated by compiling the
relation "position" is shown in Fig. 6: for the SENT operator, virtual machine
instructions such as START_SENT, START_MATCH, END_MATCH, and END_SENT are added.
When the bytecode is executed, the virtual machine performs the corresponding
matching logic for these instructions. In the same way, all of the concepts and
relations can be converted into bytecode. Once they have been converted, the
instruction graph can be generated from the dependencies between targets in the
bytecode.
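The two halves of step 4, emitting bytecode for a relation and ordering definitions by their dependencies, can be sketched in Python as follows. This is a loose illustration, not the compiler of the invention. Only the four instruction names come from the description; the layout of the instruction sequence, the operand attached to START_MATCH, and the function names are assumptions, since Fig. 6 is not reproduced here. The ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

Instruction = Tuple[str, ...]


def compile_sent(operands: List[str]) -> List[Instruction]:
    """Emit a bytecode fragment for SENT(operand, ...).

    The bracketing of each operand by START_MATCH / END_MATCH is an
    assumed layout, chosen only to make the idea concrete.
    """
    code: List[Instruction] = [("START_SENT",)]
    for operand in operands:
        code.append(("START_MATCH", operand))
        code.append(("END_MATCH",))
    code.append(("END_SENT",))
    return code


def compile_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order definitions so that each one follows everything it uses.

    `depends_on` maps a concept or relation to the names it refers to.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

With the names from Fig. 5, `compile_order({"position": {"department", "title"}, "manager": {"position"}})` places "department" and "title" before "position", and "position" before "manager".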
In step 5, the concept annotation graph is matched against the instruction graph,
and the content of the concept annotation graph that satisfies the concepts and
relations of the instruction graph forms the analysis result.
While the instruction graph is matched against the annotation graph, the relevant
information of each step needs to be recorded. To ensure that the matching rules
are executed efficiently and correctly, the invention can also solve the following
problems:
(1) Concurrent execution of rules
Concepts and relations that do not depend on one another can be executed
concurrently during matching. For example, the two statements
SENT(person, "accepting an interview") and SENT(person, "meeting reporters")
differ only in "accepting an interview" and "meeting reporters", so their common
leading part can be matched concurrently; what then has to be solved is only the
problem caused by several bytecode fragments sharing the same top portion of the
instruction graph.
(2) Dependencies between matched concepts
Several concepts in a program may depend on one another in complex ways, so the
order in which they are matched is computed at the compilation stage, that is,
when step 4 is performed. In Fig. 5, for example, "title" and "department" must be
matched before "position" can be matched.
(3) Matching of context relations
For the simple Boolean operators AND, OR, and NOT, the logical relation can be
evaluated directly after the concept annotation graph has been efficiently matched
against the instruction graph. For context relations such as SENT and DIST_n,
however, the invention must check the logical relations of the context of the
matched text in addition to the matched text itself.
(4) Nesting of relations
The relations in a rule can be nested within one another. For example,
DIST_3(SENT(OR("inc", "corp"), OR("acquire", "buy"))) describes a simple
"corporate acquisition" relation. When its bytecode is executed and the
instruction graph is matched against the concept annotation graph, the SENT and OR
operators nested inside the DIST_n operator have to be taken into account, and
this nesting has to be resolved during execution.
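One way to picture the nesting in problem (4) is to parse a relation expression into a tree before it is compiled. The Python sketch below is a small recursive-descent parser for expressions written in the style of the example above. It is an illustration under assumptions: operators and concept names are taken to be identifiers, phrases are double-quoted literals without embedded quotes, and the tree representation (nested tuples) and all function names are hypothetical.

```python
import re
from typing import List, Tuple

# A token is a quoted literal, a name, or one of "(", ")", ",".
_TOKEN = re.compile(r'\s*(?:"([^"]*)"|(\w+)|([(),]))')


def tokenize(source: str) -> List[Tuple[str, str]]:
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = _TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        literal, name, punct = match.groups()
        if literal is not None:
            tokens.append(("STR", literal))
        elif name is not None:
            tokens.append(("NAME", name))
        else:
            tokens.append(("PUNCT", punct))
        pos = match.end()
    return tokens


def _parse_node(tokens, i):
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    kind, value = tokens[i]
    if kind == "STR":
        return value, i + 1  # a phrase literal is a leaf
    if kind != "NAME":
        raise ValueError(f"unexpected token {value!r}")
    if i + 1 < len(tokens) and tokens[i + 1] == ("PUNCT", "("):
        args, i = [], i + 2
        while True:  # parse the comma-separated scope of the operator
            arg, i = _parse_node(tokens, i)
            args.append(arg)
            if i >= len(tokens):
                raise ValueError("missing closing parenthesis")
            if tokens[i] == ("PUNCT", ")"):
                return (value, args), i + 1
            if tokens[i] != ("PUNCT", ","):
                raise ValueError("expected ',' or ')'")
            i += 1
    return (value, []), i + 1  # bare name: a concept reference


def parse(source: str):
    """Parse one expression into nested (operator, [arguments]) tuples."""
    tokens = tokenize(source)
    node, end = _parse_node(tokens, 0)
    if end != len(tokens):
        raise ValueError("trailing input after expression")
    return node
```

In the resulting tree the outermost operator is the root, so an evaluator or compiler can handle nesting by recursing into each operator's arguments before applying the operator itself.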
After step 5, a step 6, the text classification step, may also be included; the
classification attributes of the text can be specified in the mining method
program in order to perform step 6. As shown on line 21 of Fig. 3, the
classification method statement includes the reserved word "CLASS", a variable
name, and a classification attribute list; the list can consist of several
classification attributes or be empty. Line 23 shows that classification
attributes are used to specify the method model library, method steps, and so on;
the semantics of each item are given in Table 4.
Table 4: Extraction method attribute definitions
In particular, after step 5, a step 7 is also included, in which the analysis
result is filtered: the analysis result is restricted according to the frequency
with which the matched concepts and relations occur in the actual text.
In the invention, the filtering method of step 7 can be defined with the result
filter statement shown in lines 24-29 of Fig. 3. The result filter statement
further restricts the analysis result of text mining according to the
classification of the documents and the frequency with which concepts and
relations occur. As shown on line 24, the result filter statement contains the
reserved word "SELECT" and optional "FROM" and "WHERE" clauses. The selection
objects <selectobjects> are several concepts separated by ",", as shown on line
25. The selection sources <selectsources> can consist of several character strings
that denote analysis sources, such as web pages or classification results, as
shown on line 26. The selection condition list <selecetconditions> consists of
several selection conditions, as shown on line 27. A selection condition
<selecetcondition> consists of selection clauses linked by a comparison operator,
as shown on line 28; the comparison operators are listed in Table 5. A selection
clause <selectconditionclause> can be a "FREQ" operation that takes a concept name
as its parameter, as shown on line 29. After the analysis result has been
obtained, the analysis result filtering step can be performed according to the
operations above.
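The effect of a "WHERE" condition built on "FREQ" can be sketched in Python as a frequency filter over the matched results. The sketch is an assumption-laden illustration: the comparison symbols in `_COMPARE` are guesses at common notation, since Table 5 is not reproduced here, and the function name, the per-document match lists, and the document identifiers are hypothetical.

```python
import operator
from collections import Counter
from typing import Dict, List

# Assumed comparison symbols; Table 5 may define a different set.
_COMPARE = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def filter_results(matches: Dict[str, List[str]], concept: str,
                   comparison: str, threshold: int) -> List[str]:
    """Keep documents for which FREQ(concept) <comparison> threshold holds.

    `matches` maps a document id to the names of the concepts and
    relations matched in that document, one entry per occurrence.
    """
    compare = _COMPARE[comparison]
    kept = []
    for doc_id, names in matches.items():
        frequency = Counter(names)[concept]  # 0 if the concept never matched
        if compare(frequency, threshold):
            kept.append(doc_id)
    return kept
```

For example, filtering on FREQ of "position" being at least 2 keeps only the documents in which that relation matched two or more times.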
Table 5: Comparison operator definitions
In the method of the invention, the processed actual text forms the concept
annotation graph, which is matched against the bytecode stream compiled from the
concepts and relations written in the TML language, and the annotation result is
output.
In another aspect, the invention provides a general text mining system that can
implement the general text mining method described above. As shown in Fig. 7, the
system includes:
A loading module 100, for using a web crawler to load the mining objects within
the mining scope;
An actual text extraction module 200, for extracting the actual text within the
mining scope;
An annotation graph generation module 300, for forming a concept annotation graph
from the actual text;
An annotation graph optimization module 400, for optimizing the annotation graph
generated by the annotation graph generation module 300 according to the mining
method definition module 710 and the method model library 720. The annotation
graph optimization module 400 uses the mining methods in the method model library
720 for the optimization; the method model library 720 can include a word
segmentation module 410, a part-of-speech tagging module 420, and a named entity
recognition module 430.
A compilation module 500, for compiling the concepts, and the relations between
the concepts, into bytecode, and then forming an instruction graph;
A matching module 600, for matching the concept annotation graph against the
instruction graph and forming the analysis result from the content of the concept
annotation graph that satisfies the concepts and relations of the instruction
graph.
The system also includes a definition module 700, which includes a mining method
definition module 710 for selecting, from the method model library 720, the mining
methods used by the analysis optimization module. Accordingly, the system also
includes the method model library 720 for storing mining methods, from which the
mining method definition module 710 selects. The mining methods in the method
library 720 can include part-of-speech tagging methods, named entity extraction
methods, text classification methods, keyword extraction methods, and so on.
The definition module 700 can also include a mining scope definition module 730
for defining the mining scope.
The definition module 700 can also include a mining target definition module 740
for defining the concepts corresponding to the mining target and the relations
between those concepts; the mining target serves as the concrete value of the
concepts.
Preferably, the relations between the concepts used in the system include:
"SENT": all concepts in the scope must appear in the same sentence;
"DIST_n": the distance between any two adjacent concepts in the scope must not
exceed n;
"ORD": all concepts in the scope must appear in order;
"CONT": all concepts in the scope must be adjacent.
Preferably, the system may also include a text classification module 900 and a result
filtering module 800. The text classification module 900 performs subject classification on
the actual text. The result filtering module 800 restricts the analysis result according to
the frequency of occurrence of the concepts and relations in the concept annotation graph.
These two modules may process the results independently or one after the other; the
invention does not restrict the order of text classification and result filtering.
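The four relations listed above (SENT, DIST_n, ORD, CONT) can be read as predicates over concept occurrences in a tokenized text. The following Python sketch is one possible reading, not the implementation of the invention. The `Match` record, the measurement of distance in tokens, and the treatment of CONT as "no gap between neighbouring occurrences" are all assumptions, since the text does not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Match:
    """One occurrence of a concept in the text (hypothetical structure)."""
    concept: str
    start: int      # index of the first token of the occurrence
    end: int        # index one past the last token of the occurrence
    sentence: int   # index of the sentence containing the occurrence


def sent(matches: List[Match]) -> bool:
    """SENT: all concepts in scope appear in the same sentence."""
    return len({m.sentence for m in matches}) <= 1


def dist(matches: List[Match], n: int) -> bool:
    """DIST_n: the gap between any two adjacent concepts is at most n tokens."""
    ordered = sorted(matches, key=lambda m: m.start)
    return all(b.start - a.end <= n for a, b in zip(ordered, ordered[1:]))


def ord_(matches: List[Match]) -> bool:
    """ORD: the concepts appear in the order in which they are listed."""
    return all(a.start < b.start for a, b in zip(matches, matches[1:]))


def cont(matches: List[Match]) -> bool:
    """CONT: neighbouring concepts touch, with no token between them."""
    ordered = sorted(matches, key=lambda m: m.start)
    return all(a.end == b.start for a, b in zip(ordered, ordered[1:]))
```

Under these assumptions, a predicate such as OR(SENT(per,comp),DIST_15(per,comp)) would correspond to `sent([per, comp]) or dist([per, comp], 15)`.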
In one embodiment of the invention, the TML language is used to find the marketing
department directors of well-known companies, starting from specified seed nodes.
First, the mining target is defined. The mining target must include concepts such as
well-known company, marketing department, and director, together with the relations formed
from these concepts, such as marketing department director and well-known-company marketing
department director. The corresponding TML program fragment is as follows:
The hierarchical relationship between the concepts and the relations is shown in Figure 8.
Second, the method in this embodiment defines the mining scope. The seed nodes to be
crawled are http://finance.sina.com.cn and http://it.sohu.com, and the crawl depth
is 3. The corresponding TML program fragment is as follows:
Third, the method in this embodiment defines the mining methods; at least one Chinese
word segmenter must be loaded. Assume the segmenter is named chn_tkzo_mini.bin and is
located in the current path. The corresponding TML program fragment is as follows:
USE(tokenizer:"./chn_tkzo_mini.bin");
Fourth, the method in this embodiment may define a result filtering rule. Here the rule
is to find matching comppos relations only in analyzed documents that contain at least one
person-name entity. The corresponding TML program fragment is as follows:
SELECT comppos FROM sample1 WHERE FREQ(PERSON)>0;
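The filtering rule above keeps a document only when it contains the requested relation and the PERSON concept occurs at least once. A minimal Python sketch of that idea follows. The representation of each document as a flat list of annotation labels, and the `freq` and `select` helper names, are assumptions for illustration and are not part of TML.

```python
from collections import Counter
from typing import Dict, List


def freq(annotations: List[str], concept: str) -> int:
    """Number of times `concept` occurs among a document's annotation labels."""
    return Counter(annotations)[concept]


def select(relation: str, pages: Dict[str, List[str]],
           concept: str, min_freq: int = 1) -> List[str]:
    """Documents containing `relation` in which `concept` occurs at least `min_freq` times."""
    return [doc for doc, labels in pages.items()
            if relation in labels and freq(labels, concept) >= min_freq]
```

With `min_freq=1` the condition is equivalent to FREQ(PERSON)>0: a document annotated with both `comppos` and `PERSON` is kept, while a document annotated with `comppos` alone is dropped.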
The complete program of this embodiment is as follows:
# mining method: load the Chinese word segmenter
USE(tokenizer:"./chn_tkzo_mini.bin");
# mining target: concepts and their values
CONCEPT brand,company;
brand:="IBM";
brand:="General Motors";
…
company:="China PetroChemical Corporation";
company:="State Grid Corporation of China";
…
CONCEPT department,title;
department:=OR("market","brand","marketing","public relations");
title:=OR("manager","higher level manager","chief inspector","assistant director","department head");
# mining target: relations between the concepts
PREDICATE manager(department d,title t){
    CONT(d,t);
}
PREDICATE comppos(manager per,company comp){
    OR(SENT(per,comp),DIST_15(per,comp));
}
PREDICATE comppos(manager per,brand prod){
    OR(SENT(per,prod),DIST_15(per,prod));
}
# mining scope: seed nodes and crawl depth
PAGES sample1{
    SEED("http://finance.sina.com.cn");
    SEED("http://it.sohu.com");
    DEPTH("3");
}
# result filter: keep matches from documents with at least one PERSON entity
SELECT comppos FROM sample1 WHERE FREQ(PERSON)>0;
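The PAGES block of the program above bounds the crawl by a set of seed nodes and a crawl depth. The following Python sketch shows one way such a bound could be applied, using a breadth-first traversal over an in-memory link table instead of real network requests. The `pages_in_scope` function, the `links` table, and the convention that a seed has depth 0 are assumptions; the text does not state how depth is counted.

```python
from collections import deque
from typing import Dict, List, Set


def pages_in_scope(seeds: List[str], depth: int,
                   links: Dict[str, List[str]]) -> Set[str]:
    """Pages reachable from the seeds by following at most `depth` links."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, level = queue.popleft()
        if level == depth:
            continue  # do not follow links beyond the crawl depth
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return seen


# Hypothetical link table: each page maps to the pages it links to.
LINKS = {
    "http://finance.sina.com.cn": ["p1"],
    "p1": ["p2"],
    "p2": ["p3"],
    "p3": ["p4"],
}
```

With a depth of 3, `pages_in_scope(["http://finance.sina.com.cn"], 3, LINKS)` contains the seed and the placeholder pages `p1` to `p3`, while `p4` lies outside the scope.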
In another embodiment of the invention, the use of the method and system is illustrated
with purchase intention analysis as an example.
The basic approach to mining potential purchase intention from user posts is as follows.
After user posts are downloaded from a social network, the purchase patterns common to the
field are formulated in TML. In addition, knowledge bases are written for industries such as
automobiles, real estate, travel, insurance, and cosmetics, covering products, product
attributes, effects, brands, companies, and so on. These are then used to define the specific
purchase-intention patterns for each industry. The following are some program fragments
related to the cosmetics industry:
COSMETIC-PRODUCT:=OR("skin base solution","eyes hide glue","articles for washing","receive by cosmetics",...,"health is sun-proof");
ATTRIBUTE(COSMETIC-PRODUCT,classname):="cosmetics";
CONCEPT COSMETIC-ASPECTS:=OR("moisturizing","moistening","moisturizing","whitening","deep cleaning",...);
…
# budget-related
CONCEPT BUDGET:=OR("budget","affording","can afford","affording","the most enough","short of money",...);
PREDICATE cosmetics-intending(BUDGET bd,COSMETIC-PRODUCT product){
    AND(SENT(DIST_5(bd,product),NOT(NOT-PUR-DIST)),NOT(AD));
}
Table 6 summarizes the TML program used for purchase intention analysis. The average
matching speed is the speed at which the virtual machine downloads documents and executes
the program; in a typical environment with a single CPU and 2 GB of memory it is about
1 MBps. When the text contains a large number of purchase intentions, so that every
instruction may be executed, the matching speed drops to about 20 KBps. Across the 7
industries in which TML is used to analyze purchase intention, tens of thousands of purchase
intentions are identified every day, with an effective accuracy between 40% and 50%. The
running time of a TML program does not depend directly on program length, so very
large-scale problems can be handled.
Table 6: Purchase intention analysis in a specific embodiment of the invention
Although some specific embodiments of the invention have been described in detail by way
of example, those skilled in the art should understand that the above examples are for
illustration only and are not intended to limit the scope of the invention. Those skilled in
the art should also understand that the above embodiments may be modified without departing
from the scope and spirit of the invention. The scope of the invention is defined by the
appended claims.
Claims (10)
1. A general text mining method, characterized by comprising:
Step 1: running a web crawler to load the mining objects within the mining scope;
Step 2: performing actual text extraction on the mining objects to obtain the actual text;
Step 3: forming a concept annotation graph from the actual text;
Step 4: compiling the concepts corresponding to the mining target, and the relations
between those concepts, into bytecode, and thereby forming an instruction graph;
Step 5: matching the concept annotation graph against the instruction graph, and forming
the analysis result from the content of the concepts and relations in the concept annotation
graph that satisfy the instruction graph.
2. The general text mining method according to claim 1, characterized in that, after
step 3, the method includes an analysis optimization step for optimizing the concept
annotation graph, the mining methods of the analysis optimization step including word
segmentation, part-of-speech analysis, and named entity recognition.
3. The general text mining method according to claim 2, characterized in that, before
the analysis optimization step, the method includes a mining method definition step for
selecting the mining methods.
4. The general text mining method according to claim 1, characterized in that, before
step 4, the method includes a mining target definition step of defining the concepts
corresponding to the mining target and defining the relations between the concepts, the
mining target being the value of the concepts and relations.
5. The general text mining method according to claim 1, characterized in that a mining
scope definition step of defining the mining scope is included before step 1.
6. A general text mining system, characterized by comprising:
a loading module (100), configured to use a web crawler to load the mining objects within
the mining scope;
a text extraction module (200), configured to perform actual text extraction on the mining
objects to obtain the actual text;
an annotation graph generation module (300), configured to form a concept annotation graph
from the actual text;
a compiling module (500), configured to compile the concepts corresponding to the mining
target, and the relations between those concepts, into bytecode, and thereby form an
instruction graph;
a matching module (600), configured to match the concept annotation graph against the
instruction graph, and to form the analysis result from the content of the concepts and
relations in the concept annotation graph that satisfy the instruction graph.
7. The general text mining system according to claim 6, characterized by including an
analysis optimization module (400) for optimizing the concept annotation graph.
8. The general text mining system according to claim 6, characterized by including a
mining method definition module (710) for selecting the mining methods used in the analysis
optimization module (400).
9. The general text mining system according to claim 6, characterized by including a
mining target definition module (740) for defining the concepts corresponding to the mining
target and defining the relations between the concepts, the mining target serving as the
value of the concepts.
10. The general text mining system according to claim 6, characterized by including a
mining scope definition module (730) for defining the mining scope.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2015100918874 | 2015-02-28 | ||
CN201510091887 | 2015-02-28 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106156035A (en) | 2016-11-23 |
CN106156035B (en) | 2019-10-22 |
Family
ID=57340021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510135053.9A Active CN106156035B (en) | 2015-02-28 | 2015-03-25 | A kind of generic text method for digging and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106156035B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140082003A1 (en) * | 2012-09-17 | 2014-03-20 | Digital Trowel (Israel) Ltd. | Document mining with relation extraction |
CN103678499A (en) * | 2013-11-19 | 2014-03-26 | 肖冬梅 | Data mining method based on multi-source heterogeneous patent data semantic integration |
Non-Patent Citations (2)
Title |
---|
孙珠婷 等: "概念图构建中概念术语自动提取的研究与实现", 《计算机工程与设计》 * |
车海燕 等: "面向中文自然语言文档的自动知识抽取方法", 《计算机研究与发展》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107526726A (en) * | 2017-07-27 | 2017-12-29 | 山东科技大学 | A kind of method that Chinese procedural model is automatically converted to English natural language text |
CN108052577A (en) * | 2017-12-08 | 2018-05-18 | 北京百度网讯科技有限公司 | A kind of generic text content mining method, apparatus, server and storage medium |
US11062090B2 (en) | 2017-12-08 | 2021-07-13 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for mining general text content, server, and storage medium |
CN108052577B (en) * | 2017-12-08 | 2022-06-14 | 北京百度网讯科技有限公司 | Universal text content mining method, device, server and storage medium |
CN110059176A (en) * | 2019-02-28 | 2019-07-26 | 南京大学 | A kind of rule-based generic text information extracts and information generating method |
CN110059176B (en) * | 2019-02-28 | 2021-07-13 | 南京大学 | Rule-based general text information extraction and information generation method |
CN110321549A (en) * | 2019-04-09 | 2019-10-11 | 广州数说故事信息科技有限公司 | Based on the new concept method for digging for serializing study, relation excavation, Time-Series analysis |
Also Published As
Publication number | Publication date |
---|---|
CN106156035B (en) | 2019-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Mühlroth et al. | A systematic literature review of mining weak signals and trends for corporate foresight | |
Enríquez et al. | Entity reconciliation in big data sources: A systematic mapping study | |
CN102831121A (en) | Method and system for extracting webpage information | |
CN105893485A (en) | Automatic special subject generating method based on book catalogue | |
CN103106211B (en) | Emotion recognition method and emotion recognition device for customer consultation texts | |
CN106156035A (en) | A kind of generic text method for digging and system | |
Navadiya et al. | Web Content Mining Techniques-A Comprehensive Survey | |
Schulz et al. | Practical Web data extraction: are we there yet?-a short survey | |
Haris et al. | Mining graphs from travel blogs: a review in the context of tour planning | |
Bhardwaj et al. | A novel approach for content extraction from web pages | |
Aung et al. | Random forest classifier for multi-category classification of web pages | |
CN105893574A (en) | Data processing method and electronic device | |
Sam et al. | Ontology-based text-mining model for social network analysis | |
CN105447191A (en) | Intelligent abstracting method for providing graphic guidance steps and corresponding device | |
Sabri et al. | Improving performance of DOM in semi-structured data extraction using WEIDJ model | |
Baazouzi et al. | A matching approach to confer semantics over tabular data based on knowledge graphs | |
Li et al. | Shape analysis for unstructured sharing | |
Alam et al. | RV-Xplorer: A way to navigate lattice-based views over RDF graphs | |
Margitus et al. | RDF versus attributed graphs: The war for the best graph representation | |
Sabri et al. | WEIDJ: An improvised algorithm for image extraction from web pages | |
Gupta et al. | A heuristic approach for web content extraction | |
Su et al. | Capturing architecture documentation navigation trails for content chunking and sharing | |
Hellal et al. | Nodar: mining globally distributed substructures from a single labeled graph | |
Akundi et al. | Identifying the thematic trends of model based systems engineering in manufacturing and production engineering domains | |
Mukherjee et al. | Browsing fatigue in handhelds: semantic bookmarking spells relief |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||