CN106156286A - Type extraction system and method towards technical literature knowledge entity - Google Patents

Info

Publication number
CN106156286A
Authority
CN
China
Prior art keywords
type
entity
knowledge
knowledge entity
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610488849.7A
Other languages
Chinese (zh)
Other versions
CN106156286B (en)
Inventor
温雯
伍思杰
蔡瑞初
郝志峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN201610488849.7A
Publication of CN106156286A
Application granted
Publication of CN106156286B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a type extraction system for knowledge entities in technical literature. The system comprises a user query and feedback interface, an online crawler and management module, a knowledge entity recognition module, a knowledge entity type extraction module, a type label propagation and index library building module, a knowledge entity type relation graph model construction module, and a data visualization module. The system extracts entity types according to the entity keyword queried by a user and then visually presents the type relations, hierarchical relations, and temporal evolution among knowledge entities. The invention also proposes a type extraction method for knowledge entities in technical literature. The method effectively extracts type labels for knowledge entities in specialized fields, resolves the limitation and subjectivity of manually predefined types, and facilitates the structuring of specialized knowledge networks.

Description

Type extraction system and method towards technical literature knowledge entity
Technical field
The present invention relates to the field of text mining and information extraction, and specifically to a type extraction system and extraction method for knowledge entities in technical literature.
Background art
With the growing popularity of the Internet and the development of hardware storage technology, people can browse and obtain all kinds of digital resources on different devices, and can also obtain the technical literature they need through numerous academic databases or academic search engines, such as Google Scholar, Baidu Scholar, CNKI, and Wanfang Data. Obtaining massive electronic resources from the Internet has thus become easy, but the accompanying problem is that existing knowledge services cannot meet people's demand for "quick, simple, accurate" information. To meet this demand, it is necessary to perform entity recognition on such technical literature text and extract the type information of the entities, so as to build a structured body of professional knowledge that assists users in literature retrieval. Most current type information extraction systems and techniques target everyday social text, such as microblogs, Facebook, and Twitter, while academic literature, which contains many technical terms, has received comparatively little study.
At present, although there has been little research on information extraction in the technical literature field, its considerable application prospects and the needs of knowledge services have prompted a surge of research in China and abroad, and some results have been achieved. Examples include Google Knowledge Graph and Google Trends abroad, and, in China, the Chinese thesaurus of the Harbin Institute of Technology and the knowledge-context retrieval of Wanfang Data. Google Knowledge Graph treats the object of a user's search as an entity rather than performing simple keyword matching, and can effectively obtain the attributes and resources related to the entity. Google Trends analyzes users' search records to obtain the popularity trends of keywords. The "Chinese thesaurus" mines hypernym-hyponym relations between entities from Internet data and thereby obtains such relations for most entities, but it lacks analysis of specialized technical terms such as the knowledge entities of technical literature. The knowledge-context retrieval of Wanfang Data associates the keywords of documents according to the relations between related documents and references, and then arranges them chronologically to show the vocabulary most relevant to the user's search term in a given period.
Existing type extraction techniques have the following main deficiencies: A) types must be predefined manually, which is limiting; B) a large amount of manual annotation is required, which is time-consuming and laborious; C) there is little work on type extraction for specialized fields, and most conventional entity information extraction methods are not applicable to them; D) intuitive, vivid tree-graph visualization is lacking, and most systems still present results mainly as text and data.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of existing entity type extraction techniques for specialized fields by proposing a type extraction method and system for knowledge entities in technical literature.
To achieve the above object, the technical solution of the present invention is as follows:
The invention discloses a type extraction system for knowledge entities in technical literature, comprising the following 7 modules:
(a) Query and feedback interface: handles user input and query processing, and returns the data visualization results to the user;
(b) Online crawler and management module: automatically crawls, in the background, the technical literature pages specified by the administrator or set by default, and preprocesses the page data;
(c) Knowledge entity recognition module: performs knowledge entity recognition on the preprocessed document titles and abstracts;
(d) Type label extraction module: performs type label extraction and partial entity type annotation on the knowledge entities obtained in module (c), yielding a type label set and a partial set of labeled entities;
(e) Type label propagation and index library building module: takes as input the unlabeled knowledge entity set from module (c) and the type label set and partial set of labeled entities from module (d), performs label propagation based on multi-label weighting, and builds an index library of knowledge entities and their type relations;
(f) Knowledge entity type relation graph model construction module: searches the index library according to the keyword entered by the user and constructs different knowledge entity type relation graph models;
(g) Data visualization module: renders the models from module (f) as Web visualizations.
The invention also discloses a type extraction method for knowledge entities in technical literature, which uses the above extraction system and performs the following steps:
S1. Data crawling and preprocessing: the administrator sets the crawl addresses and scope; the online crawler and management module crawls the literature pages in the background according to the specified scope and preprocesses the crawled page data;
S2. Knowledge entity recognition and extraction: the knowledge entity recognition module recognizes and extracts entities from the preprocessed document information;
S3. Type extraction and annotation: the knowledge entity type extraction module performs type extraction and annotation on the extracted knowledge entities, yielding a type label set and a partial set of labeled entities;
S4. Index library building: the obtained knowledge entities, their type label set, and the partial set of labeled entities are stored in a database; label propagation based on multi-label weighting is performed to obtain the type label matrix and build the index library of knowledge entities and their types;
S5. Keyword acquisition: the knowledge entity keyword queried by the user is obtained through the user query and feedback interface;
S6. Type list building: the keyword is matched against the knowledge entity index entries in the index library created in step S4 to obtain a list of knowledge entities related to the keyword; sorting by similarity gives the final list of knowledge entities and their types;
S7. Modeling on demand: according to the user's needs, the knowledge entity type relation graph model construction module models the obtained knowledge entities and type list;
S8. Data visualization: the data visualization module processes the models obtained in step S7 for Web visualization, returns JSON data to the front end, and renders the visualization in the Web front end.
The type extraction system and method for knowledge entities in technical literature of the present invention have the following advantages:
1) The invention removes the limitation of manually defined types. It applies an unsupervised heuristic-rule method to all entities to extract type labels and obtain the most likely type label set. Because the proposed type extraction method combines unsupervised and semi-supervised methods, the extraction process does not require a large amount of manual annotation, and it is more flexible and general than typical supervised or semi-supervised methods. In addition, the method has been refined by analyzing the characteristics of knowledge entities in specialized fields, so it is suitable for type extraction of knowledge entities in different specialized fields and facilitates the structuring of specialized knowledge networks.
2) The literature pages to crawl can be specified. The administrator can specify the addresses and scope of the pages to crawl, so the system can easily be extended to collect technical literature data from other fields, and the amount of retrievable data is not limited to the local database. For example, when an online paper database is updated, the administrator can update the crawl scope, and the system's crawler will automatically crawl the new data and update the local database.
3) The retrieved knowledge entity types are open and diverse. The system does not manually predefine entity types. Instead, it uses a heuristic-rule-based method combined with abstracts to extract a type label set, then screens out unreliable type labels to obtain the final type label set. The resulting label set avoids the limitation and subjectivity of manual predefinition, yields a reasonably sound type set in an open, comprehensive, and objective way, and covers most knowledge entities.
4) Users can obtain type-related knowledge-context graphs through the visualization interface. The system uses the knowledge entity type relation graph model construction module to model the obtained knowledge entities and their type list, producing an entity hierarchy tree model based on the same type, a knowledge relation graph model based on type grouping, and a time-series-based knowledge hotspot tracking graph model, which are then returned to the user as visualizations.
5) The system performs well and is easy to use. The system follows the MVC architecture: the user search and visualization modules in the foreground are separated from the crawling and analysis modules in the background, so background processes such as data crawling, preprocessing, extraction, and annotation do not slow down the front-end visualization. In addition, because an index library is built, front-end retrieval and data acquisition are fast. The Web-based visualization also makes the system simple and convenient to use, with no client installation required.
Brief description of the drawings
Fig. 1 is the architecture diagram of the type extraction system for knowledge entities in technical literature of the present invention.
Fig. 2 is the flow chart of the type extraction method for knowledge entities in technical literature of the present invention.
Fig. 3 is the flow chart of the knowledge entity recognition step based on conditional random fields of the present invention.
Fig. 4 is a schematic diagram of the implementation of the entity type extraction and annotation steps of the present invention.
Fig. 5 is a schematic diagram of the implementation of the label propagation algorithm based on multi-label weighting of the present invention.
Detailed description of the invention
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 shows the architecture of the type extraction system for knowledge entities in technical literature of the present invention.
Referring to Fig. 1, the entity type extraction system of the present invention comprises 7 modules: the user query and feedback interface, the online crawler and management module, the knowledge entity recognition module, the knowledge entity type extraction module, the type label propagation and index library building module, the knowledge entity type relation graph model construction module, and the data visualization module.
Query and feedback interface: handles user input and query processing, and returns the data visualization results to the user;
Online crawler and management module: automatically crawls, in the background, the technical literature pages specified by the administrator or set by default, and preprocesses the page data;
Knowledge entity recognition module: performs knowledge entity recognition on the preprocessed document titles and abstracts to obtain the knowledge entity set;
Knowledge entity type extraction module: performs type label extraction and partial entity type annotation on the obtained knowledge entity set, yielding a type label set and a partial set of labeled entities;
Type label propagation and index library building module: takes the unlabeled knowledge entity set, the type label set, and the partial set of labeled entities as input, performs label propagation based on multi-label weighting, then builds the index library of knowledge entities and their type relations and stores it locally;
Knowledge entity type relation graph model construction module: searches the index library according to the keyword entered by the user and constructs different knowledge entity type relation graph models;
Data visualization module: renders the constructed tree-graph models as Web visualizations.
The invention also discloses the extraction method used with the above entity type extraction system. Fig. 2 is the flow chart of the knowledge entity type extraction method for technical literature of the present invention. The steps of the method are described below.
S1. Data crawling and preprocessing
The administrator sets the crawl addresses and scope through the management module; the online crawler module crawls the literature pages in the background according to the specified scope; the crawled page data is then preprocessed, for example by Chinese word segmentation, stop-word removal, and feature screening.
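The preprocessing in step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes text that can be tokenized with a regular expression (Chinese text would first need a word segmenter, which is not shown), and the stop-word list, the function names `preprocess` and `screen_features`, and the document-frequency cutoff used for feature screening are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the patent does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "to", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, and drop stop words from one title or abstract."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in stop_words]

def screen_features(docs, min_doc_freq=2):
    """Assumed screening rule: keep tokens found in at least min_doc_freq documents."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_doc_freq] for doc in docs]
```

The document-frequency cutoff is only one possible reading of "feature screening"; any other selection criterion could be substituted in `screen_features`.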
S2. Knowledge entity recognition and extraction
The knowledge entity recognition module recognizes and extracts entities from the cleaned document information, such as titles, abstracts, and keywords.
S3. Type extraction and annotation
The knowledge entity type extraction module performs type extraction and annotation on the knowledge entities obtained in step S2, yielding a type label set and a partial set of labeled entities. The detailed process is as follows:
(S3-1) Additional type labels are extracted by using the context of the knowledge entities in the abstracts: the extracted knowledge entities are matched against the document abstracts, and each knowledge entity matched in an abstract is extracted together with its nearest adjacent noun and added to the knowledge entity set;
(S3-2) A heuristic-rule-based method is applied to the knowledge entity set obtained in step (S3-1) to extract type labels, yielding a candidate type label set; the partial set of labeled entities is obtained during type extraction;
(S3-3) Unreliable type labels are screened out: the co-occurrence frequency between each type label and the knowledge entities it belongs to is measured, type labels with a low co-occurrence frequency and few corresponding knowledge entity occurrences are removed according to these frequency characteristics, and the screened type label set is output.
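Steps (S3-2) and (S3-3) can be sketched as below. The patent does not spell out its heuristic rules, so this sketch assumes one simple rule: the last word of a multi-word entity is treated as its type word (for example, "apriori algorithm" yields "algorithm"). The function names and the `min_count` cutoff are hypothetical.

```python
from collections import Counter

def candidate_labels(entities):
    """Assumed heuristic: the last word of a multi-word entity is its type word."""
    labeled = {}
    for entity in entities:
        words = entity.split()
        if len(words) > 1:
            labeled[entity] = words[-1]
    return labeled

def screen_labels(labeled, min_count=2):
    """Drop type labels that co-occur with fewer than min_count entities."""
    counts = Counter(labeled.values())
    kept = {label for label, n in counts.items() if n >= min_count}
    partially_labeled = {e: t for e, t in labeled.items() if t in kept}
    return kept, partially_labeled
```

Entities that receive no label here (single-word entities, or entities whose candidate label was screened out) form the unlabeled set that the later label propagation step works on.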
S4. Index library building
The obtained knowledge entities, their type label set, and the partial set of labeled entities are stored in a database; label propagation based on multi-label weighting is then performed to obtain the type label matrix and build the index library of knowledge entities and their types. Label propagation based on multi-label weighting comprises the following steps:
(S4-1) Build and initialize the transition probability matrix $T$, which represents the transition probabilities between knowledge entities.
The transition probability matrix $T$ is calculated by Formula 1.
Here $T_{ij}$ is the probability of transferring from node $X_j$ to node $X_i$, that is, from knowledge entity $e_j$ to knowledge entity $e_i$, and the transition probability $W_{ij}$ is calculated by Formula 2 below.
Here $s_{ij}$ is the similarity between knowledge entities $e_i$ and $e_j$; one parameter adjusts the proportion of $s_{ij}$, and a second parameter is the mean of $s_{ij}$. The similarity $S$ between knowledge entities is measured with edit distance: the larger the edit distance, the smaller the similarity. Assuming the maximum of the source and target string lengths is $L_{\max}$ and the edit distance is $LD$, the similarity $S$ is calculated by Formula 3 below.
$S = 1 - LD / L_{\max}$ (Formula 3)
(S4-2) Build and initialize the type label matrix $Y$, which represents the type labels each knowledge entity has and their weights. If $l$ knowledge entities had a type word successfully extracted in the first extraction layer and $u$ knowledge entities did not, the type label matrix $Y$ is defined as an $(l+u) \times R$ matrix, where $R$ is the number of distinct extracted type words (the size of the deduplicated type word dictionary). Let $Y_L$ be the labeled type matrix, $Y_U$ the unlabeled type matrix, and $Y_N$ the newly added label matrix after each propagation iteration. The type label weights and the type label matrix $Y$ are calculated by Formulas 4 and 5.
Here, if knowledge entity $e_i$ has $k$ type labels after the first layer of type annotation, $C_{ik}$ is the number of occurrences of the $k$-th label of the $i$-th entity and $W_{ik}$ is the weight with which $e_i$ holds type label $k$, measured by the frequency with which label $k$ appears in $e_i$. When $e_i$ has type label $k$, $Y_{ik} = W_{ik}$; otherwise $Y_{ik} = 0$.
(S4-3) For each labeled entity, the transition probability to every unlabeled entity is calculated in a loop; if the transition probability between two knowledge entities is greater than the threshold $\zeta$ (calculated by Formula 6), the labels are propagated. After a round of propagation ends, the newly labeled knowledge entity set replaces the original labeled knowledge entity set, giving the newly added label matrix of generation $t$, $Y_N^{t}$.
Here $N$ is the number of rows of $Y_N^{t}$, and $Y_N^{t}$ is the newly added label matrix at the $t$-th iteration.
(S4-4) The label propagation of step (S4-3) is iterated until the newly labeled knowledge entity set is empty or the unlabeled type matrix no longer changes; the iteration then ends and the latest labeled type matrix $Y_L^{t+1}$ is output ($t+1$ being the iteration at which label propagation completes).
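The propagation loop of steps (S4-1) to (S4-4) can be sketched as follows. Only Formula 3 is implemented as given in the text. Formulas 1, 2, and 6 are not reproduced here, so the sketch makes two loudly stated assumptions: the raw edit-distance similarity stands in for the transition probability, and the threshold is a fixed constant. The weight update (source weight times transition value, accumulated when several sources carry the same label) follows the description of Fig. 5. All function and parameter names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Formula 3: S = 1 - LD / L_max."""
    l_max = max(len(a), len(b))
    return 1.0 if l_max == 0 else 1.0 - edit_distance(a, b) / l_max

def propagate(labeled, unlabeled, threshold=0.5, max_rounds=10):
    """labeled: {entity: {label: weight}}; unlabeled: list of entity strings.

    Assumptions: similarity() stands in for the transition probability T,
    and threshold is a fixed constant rather than the patent's Formula 6.
    """
    result = {e: dict(w) for e, w in labeled.items()}
    frontier = dict(result)      # entities labeled in the previous round
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        newly = {}
        for target in remaining:
            for source, weights in frontier.items():
                t = similarity(source, target)
                if t > threshold:
                    acc = newly.setdefault(target, {})
                    for label, w in weights.items():
                        # Weights for the same label accumulate across sources.
                        acc[label] = acc.get(label, 0.0) + w * t
        if not newly:            # no new labels: propagation has converged
            break
        result.update(newly)
        remaining = [e for e in remaining if e not in newly]
        frontier = newly         # new labeled set replaces the old one
    return result
```

Entities that never exceed the threshold against any labeled entity stay out of the result, which corresponds to the stopping condition that the newly labeled set is empty.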
S5. Keyword acquisition
The knowledge entity keyword queried by the user is obtained through the user query and feedback interface.
S6. Type list building
The keyword entered by the user is matched against the knowledge entity index entries in the index library to obtain a list of knowledge entities related to the keyword; sorting by similarity gives the final list of knowledge entities and their types.
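Step S6 can be illustrated with a small lookup over an in-memory index. This is a sketch under assumptions: the index is modeled as a plain dictionary from entity to type labels, matching is done by case-insensitive substring containment, and ranking uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for whatever similarity measure the system actually uses. The name `search_index` and the `top_k` parameter are hypothetical.

```python
from difflib import SequenceMatcher

def search_index(index, keyword, top_k=10):
    """index: {entity: [type labels]}. Returns (entity, labels, score) tuples."""
    key = keyword.lower()
    hits = []
    for entity, labels in index.items():
        if key in entity.lower():
            score = SequenceMatcher(None, key, entity.lower()).ratio()
            hits.append((entity, labels, score))
    # Most similar entities first.
    hits.sort(key=lambda h: h[2], reverse=True)
    return hits[:top_k]
```

An exact match scores 1.0 and therefore ranks first; longer entities that merely contain the keyword rank lower.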
S7. Modeling on demand
According to the user's needs, the knowledge entity type relation graph model construction module models the obtained knowledge entities and their type list, producing an entity hierarchy tree model based on the same type, a knowledge relation graph model based on type grouping, and a time-series-based knowledge hotspot tracking graph model. The modeling process is as follows:
(S7-1) According to the keyword entered by the user, extract the knowledge entity set related to the keyword from the knowledge entity index library; the relevant relations include co-occurrence in titles and abstracts, inclusion, and extension.
(S7-2) Build the entity hierarchy tree model based on the same type: check the extension or inclusion relation between each pair of entities in the knowledge entity set; if entity $e_i$ contains entity $e_j$, establish the parent-child relation $R(e_i, e_j)$ in the tree model, meaning that $e_i$ is the parent node of $e_j$, and so on, to build the hierarchy model.
(S7-3) Build the knowledge relation graph model based on type grouping: group the knowledge entities in the set by type, compute the weight of each type group, and sort the entities within each group in descending order of entity weight; select the N groups with the highest weights and, in each, the top M knowledge entities, and construct a three-layer graph model in the order keyword, type group, entity.
(S7-4) Build the time-series-based knowledge hotspot tracking graph model: sort the knowledge entities by time, group them into half-year periods, count the related knowledge entities appearing in each period, sort the entities within each period by entity weight, and finally build the hotspot tracking graph model from the time groups and their entity lists.
(S7-5) Convert the models described in steps (S7-2), (S7-3), and (S7-4) into JSON data and output them to the data visualization module.
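The three models of steps (S7-2) to (S7-5) can be sketched together as below. Several points are assumptions made for illustration: "contains" is read as substring containment, a group's weight is taken as the sum of its entity weights, and the JSON field names (`hierarchy`, `groups`, `timeline`, and so on) are hypothetical, since the patent does not define the output schema.

```python
import json
from collections import defaultdict
from datetime import date

def hierarchy_edges(entities):
    """Pairs R(e_i, e_j): e_i is the parent when it contains e_j (assumed: as a substring)."""
    return [(p, c) for p in entities for c in entities if p != c and c in p]

def type_groups(records, top_n=2, top_m=2):
    """records: (entity, type_label, weight). Top N groups, top M entities each."""
    groups = defaultdict(list)
    for entity, label, weight in records:
        groups[label].append((entity, weight))
    ranked = sorted(groups.items(),
                    key=lambda g: sum(w for _, w in g[1]), reverse=True)
    return [
        {"type": label,
         "entities": [e for e, _ in
                      sorted(members, key=lambda m: m[1], reverse=True)[:top_m]]}
        for label, members in ranked[:top_n]
    ]

def half_year_buckets(dated):
    """dated: (entity, weight, date). Group into half-year periods in time order."""
    buckets = defaultdict(list)
    for entity, weight, day in dated:
        period = f"{day.year}-H{1 if day.month <= 6 else 2}"
        buckets[period].append((entity, weight))
    return [
        {"period": p, "count": len(items),
         "entities": [e for e, _ in
                      sorted(items, key=lambda m: m[1], reverse=True)]}
        for p, items in sorted(buckets.items())
    ]

def to_json(keyword, entities, records, dated):
    """Bundle the three models into one JSON document for the front end."""
    model = {
        "keyword": keyword,
        "hierarchy": [{"parent": p, "child": c}
                      for p, c in hierarchy_edges(entities)],
        "groups": type_groups(records),
        "timeline": half_year_buckets(dated),
    }
    return json.dumps(model, ensure_ascii=False)
```

`ensure_ascii=False` keeps Chinese entity names readable in the JSON output instead of escaping them.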
S8. Data visualization
The data visualization module processes the three models from step S7 for Web visualization, returns JSON data to the front end, and renders the visualization in the Web front end.
Fig. 3 is the flow chart of the knowledge entity recognition step based on conditional random fields (CRF) of the present invention. First, features are extracted from the preprocessed literature data set, including part-of-speech features, preceding- and following-word features, and prefix and suffix features. Next, the partially labeled data set and the extracted features are fed into a CRF model for training. The trained CRF model is then used to label entities in the unlabeled data, and the F1 value is calculated on the resulting labeled data set. If the F1 value improves on that of the previous generation, the semi-supervised iteration proceeds: the labeled data set is first divided into 10 parts, the F1 value of each part is calculated, the best part is added to the manually labeled data set, and the CRF model is retrained. The training and labeling process is repeated until the F1 value no longer improves; the iteration then ends and the entity annotation set is output.
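Two pieces of the Fig. 3 flow are easy to show in isolation: the per-token features and the F1 value that drives the semi-supervised loop. The sketch below is illustrative only. It assumes part-of-speech tags are supplied by an external tagger, uses a hypothetical affix length of two characters, and omits CRF training itself, which would need a CRF library. The names `token_features` and `f1_score` are hypothetical.

```python
def token_features(tokens, pos_tags, i, affix_len=2):
    """Feature dict for token i, in the style commonly fed to CRF toolkits."""
    word = tokens[i]
    return {
        "word": word,
        "pos": pos_tags[i],                      # assumed to come from a tagger
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

def f1_score(gold, predicted):
    """F1 over sets of labeled items, e.g. (sentence_id, start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In a self-training loop of the kind described, `f1_score` would be evaluated after each retraining round, and the loop would stop once the value no longer improves.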
Fig. 4 is a schematic diagram of the implementation of the entity type extraction and annotation steps of the present invention. The first step of the flow is entity recognition. The heuristic-rule-based type extraction method combined with abstracts is then used to extract types, yielding the full type label set and a partial set of labeled data (entities in which the type word appears inside the entity). Then the label propagation algorithm based on multi-label weighting propagates and annotates type labels, giving the final type annotation results.
Fig. 5 is a schematic diagram of the implementation of the label propagation algorithm based on multi-label weighting of the present invention. The figure illustrates the implementation principle of this algorithm in the entity type annotation step. On the left of the figure are the $l$ labeled entities and their $k$ labels as input data; each label has its own weight $W_{ik}$. On the right are the $n-l-1$ unlabeled entities to which labels are propagated; before propagation, the output labels on the far right do not exist. In the example of Fig. 5, when labeled entities $e_1$ and $e_2$ both satisfy the label propagation condition for entity $e_{l+1}$, labels 1 to 3 of $e_1$ are propagated to $e_{l+1}$, and the new weight of each of the new labels 1 to 3 on the far right is $W_{ik} \cdot T_{ij}$. Then labels 2, 4, and 5 of $e_2$ are propagated to $e_{l+1}$; the new weights of labels 4 and 5 are likewise $W_{ik} \cdot T_{ij}$, while label 2 already has a weight, so the weights are accumulated and the weight of label 2 becomes $W_{12} \cdot T_{1,l+1} + W_{22} \cdot T_{2,l+1}$.
In summary, the type extraction system and method for knowledge entities in technical literature of the present invention take the technical literature data crawled by the online crawler as their basis, perform knowledge entity recognition, entity type label extraction, type annotation, and label propagation to obtain the types of knowledge entities and their type-based relations, and build an index library that is stored locally. Then, according to the keyword entered by the user, the knowledge entity set related to the keyword is extracted from the knowledge entity index library, and the entity hierarchy tree model based on the same type, the knowledge relation graph model based on type grouping, and the time-series-based knowledge hotspot tracking graph model are built. Finally, data visualization techniques are used to draw the results in the front end and present them to the user. The invention is simple to implement, achieves high extraction accuracy, and has strong practical value and significance.
The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing are only specific embodiments of the present invention and do not limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. towards the type extraction system of technical literature knowledge entity, it is characterised in that include following 7 modules:
A () inquiry and feedback interface, for input processing and the query processing of user, feed back to use by data visualization result Family;
B () online reptile and management module, automatically crawl manager and specify or the technical literature page of acquiescence for backstage And carry out the pretreatment of page data;
C () knowledge Entity recognition module, for carrying out knowledge Entity recognition to pretreated document title and summary data;
D () type label abstraction module, carries out type label extraction and portion for realization to the knowledge entity obtained in module (c) Divide entity type mark, obtain type label set and part has marked entity;
E () type label is propagated and index database sets up module, with the class not marking knowledge entity sets, module (d) in module (c) Type tag set and part marked entity for input, carry out based on multi-tag weighting label propagate and set up knowledge entity and Its type of relationship index database;
F () knowledge entity type graph of a relation model construction module, retrieves index database according to the key word of user's input, and Construct different knowledge entity type graph of a relation models;
G () data visualization module, carries out Web the Visual Implementation to the model in module (f).
2. A type extraction method for knowledge entities in technical literature, characterized in that it uses the extraction system of claim 1 and proceeds according to the following steps:
S1. Data crawling and preprocessing: the administrator sets the literature crawl addresses and scope in the back end; the online crawler and management module crawls the literature pages within the specified scope and preprocesses the crawled page data;
S2. Knowledge entity recognition and extraction: the knowledge entity recognition module recognizes and extracts entities from the preprocessed literature information;
S3. Type extraction and labeling: the knowledge entity type extraction module extracts and labels types for the extracted knowledge entities, yielding a type label set and a partial set of labeled entities;
S4. Index library construction: the obtained knowledge entities, their type label set, and the labeled entities are stored in a database; label propagation based on multi-label weighting is performed to obtain the type label matrix, and the index library of knowledge entities and their types is built;
S5. Keyword acquisition: the knowledge entity keyword queried by the user is obtained through the query and feedback interface;
S6. Type list construction: the keyword is matched against the knowledge entity index entries in the index library created in step S4 to obtain the list of knowledge entities related to the keyword, which is then sorted by similarity to give the final knowledge entity and type list;
S7. Modeling on demand: the knowledge entity type relation graph model construction module models the obtained knowledge entity and type list according to the user's requirements;
S8. Data visualization: the data visualization module processes the model obtained in step S7 for Web visualization, returns JSON data to the front end, and renders the visualization in the Web front end.
3. The type extraction method for knowledge entities in technical literature according to claim 2, characterized in that the knowledge entity type label extraction in step S3 comprises the following steps:
(S3-1) extracting additional type labels from the context of the knowledge entities in the literature abstracts: taking the extracted knowledge entities as a basis, the abstracts are matched against them, and each knowledge entity matched in an abstract is extracted together with the noun immediately following it and added to the knowledge entity set;
(S3-2) extracting type labels from the knowledge entity set obtained in step (S3-1) using a method based on heuristic rules, yielding a candidate type label set; the partial set of labeled entities is obtained during this extraction;
(S3-3) screening out unreliable type labels: the frequency with which each type label co-occurs with the knowledge entity it belongs to is measured, type labels that have a low co-occurrence frequency and whose corresponding knowledge entity occurs rarely are removed, and the screened type label set is output.
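Step (S3-3) can be illustrated with a minimal sketch. The claim does not state numeric cut-offs, so the function name `screen_type_labels`, the input layout, and both thresholds below are hypothetical choices made only for illustration.

```python
from collections import Counter


def screen_type_labels(pairs, entity_frequency,
                       min_cooccurrence=2, min_entity_frequency=2):
    """Keep type labels that are not judged unreliable.

    pairs: iterable of (entity, type_label) observations.
    entity_frequency: dict mapping entity -> number of occurrences.
    Both thresholds are assumptions; the claim gives no numeric values.
    """
    cooccurrence = Counter(pairs)
    kept = set()
    for (entity, label), count in cooccurrence.items():
        # A label is dropped only when it rarely co-occurs with its entity
        # AND that entity itself is rare, mirroring the wording of (S3-3).
        unreliable = (count < min_cooccurrence
                      and entity_frequency.get(entity, 0) < min_entity_frequency)
        if not unreliable:
            kept.add(label)
    return kept
```

For example, a label seen once with an entity that itself appears only once would be removed, while a label seen repeatedly with a frequent entity would be kept.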
4. The type extraction method for knowledge entities in technical literature according to claim 2, characterized in that the label propagation based on multi-label weighting in step S4 comprises the following steps:
(S4-1) building and initializing a transition probability matrix T, which represents the transition probabilities between knowledge entities;
(S4-2) building and initializing a type label matrix Y, which represents the type labels each knowledge entity has and their weights, where $Y_L$ denotes the labeled type matrix, $Y_U$ the unlabeled type matrix, and $Y_N$ the newly added label matrix after each propagation iteration;
(S4-3) for each labeled entity, computing in a loop the transition probability to every unlabeled entity; if the transition probability between two knowledge entities exceeds a threshold, the label is propagated; after one round of propagation ends, the set of newly labeled knowledge entities replaces the previous set of labeled knowledge entities, giving the newly added label matrix $Y_N^{t}$ of generation t;
(S4-4) repeating the label propagation of step (S4-3) until the set of newly labeled knowledge entities is empty or the unlabeled type matrix no longer changes, at which point the iteration ends; if label propagation iteration t+1 has completed, the latest labeled type matrix $Y_L^{t+1}$ is output.
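The iteration in steps (S4-3) and (S4-4) can be sketched as follows. The claim does not say how a propagated label's weight is computed, so scaling the source entity's label weights by the transition probability is an assumption, as are the function name `propagate_labels` and the fixed `threshold` argument.

```python
def propagate_labels(transition, labels, labeled, threshold):
    """Round-based label propagation sketch.

    transition[i][j]: probability of moving from entity j to entity i.
    labels: list of rows, one per entity, holding type label weights.
    labeled: indices of entities that start out labeled.
    Returns the updated label rows and the indices still unlabeled.
    """
    labels = [row[:] for row in labels]          # do not mutate the input
    frontier = set(labeled)                      # entities labeled last round
    unlabeled = set(range(len(labels))) - frontier
    while frontier and unlabeled:
        newly_labeled = set()
        for j in frontier:
            for i in unlabeled:
                if transition[i][j] > threshold:
                    # Assumption: accumulate the source's label weights
                    # scaled by the transition probability.
                    for r, weight in enumerate(labels[j]):
                        labels[i][r] += transition[i][j] * weight
                    newly_labeled.add(i)
        unlabeled -= newly_labeled
        frontier = newly_labeled                 # new set replaces the old one
    return labels, unlabeled
```

The loop stops when a round labels nothing new or when no unlabeled entity remains, which corresponds to the two stopping conditions named in step (S4-4).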
5. The type extraction method for knowledge entities in technical literature according to claim 4, characterized in that, in step (S4-1), the transition probability matrix T is:
$$T_{ij} = P(j \to i) = \frac{W_{ij}}{\sum_{k=1}^{n} W_{kj}},$$
where $T_{ij}$ is the probability of transferring from node $X_j$ to node $X_i$, that is, from knowledge entity $e_j$ to knowledge entity $e_i$, and the transition weight $W_{ij}$ is calculated by the following formula:
$$W_{ij} = \exp\left(-\frac{s_{ij}^{2}}{\partial^{2}}\right)$$
where $s_{ij}$ is the similarity between knowledge entities $e_i$ and $e_j$, and the parameter $\partial$, which scales $s_{ij}$, is set to the mean of $s_{ij}$.
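A minimal sketch of these two formulas, following them literally: weights are computed from a square similarity matrix, then each column is normalized so that it sums to 1. The function name `transition_matrix`, the argument name `scale` (standing for the parameter ∂), and taking the mean over all entries of the similarity matrix are assumptions.

```python
import math


def transition_matrix(similarity, scale=None):
    """Build T from a square similarity matrix given as a list of lists.

    W[i][j] = exp(-s_ij^2 / scale^2);  T[i][j] = W[i][j] / sum_k W[k][j].
    scale defaults to the mean of all s_ij (an assumed reading of the claim).
    """
    n = len(similarity)
    if scale is None:
        scale = sum(sum(row) for row in similarity) / (n * n)
    if scale == 0:
        raise ValueError("scale must be non-zero")
    weights = [[math.exp(-(similarity[i][j] ** 2) / (scale ** 2))
                for j in range(n)] for i in range(n)]
    # Column sums implement the denominator sum over k of W_kj.
    column_sums = [sum(weights[k][j] for k in range(n)) for j in range(n)]
    return [[weights[i][j] / column_sums[j] for j in range(n)]
            for i in range(n)]
```

Because the normalization runs over k for a fixed j, every column of T sums to 1, which matches reading $T_{ij}$ as the probability of leaving entity $e_j$ for entity $e_i$.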
6. The type extraction method for knowledge entities in technical literature according to claim 5, characterized in that the similarity S between knowledge entities is measured by edit distance: the larger the edit distance, the smaller the similarity. Let $L_{max}$ be the larger of the lengths of the source string and the target string, and let LD be the edit distance; the similarity S is then calculated as:
$$S = 1 - \frac{LD}{L_{max}}$$
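The similarity measure of claim 6 can be sketched as follows, assuming LD is the standard Levenshtein distance with unit costs for insertion, deletion, and substitution. The function names and the handling of two empty strings are illustrative choices, not part of the claim.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(source: str, target: str) -> float:
    """S = 1 - LD / Lmax; two empty strings are treated as identical."""
    l_max = max(len(source), len(target))
    if l_max == 0:
        return 1.0
    return 1.0 - edit_distance(source, target) / l_max
```

With this definition, identical strings have similarity 1 and strings with no characters in common have similarity 0.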
7. The type extraction system and method for knowledge entities in technical literature according to claim 5 or 6, characterized in that, in step (S4-2), if l knowledge entities had type words successfully extracted in the first-layer extraction and u knowledge entities did not, the type label matrix Y is defined as an (l+u) × R matrix, where R is the number of entries in the deduplicated dictionary of extracted type words; the type label weights and the type label matrix Y are calculated as follows:
$$W_{ik} = \frac{C_{ik}}{\sum_{l=0}^{K} C_{il}}$$
$$Y_{ij} = \begin{cases} W_{ik}, & \text{if } y_i \text{ is label } r_j; \\ 0, & \text{otherwise.} \end{cases}$$
where knowledge entity $e_i$ has K type labels after the first-layer type labeling, $C_{ik}$ is the occurrence frequency of the k-th label of the i-th entity, and $W_{ik}$ is the weight with which knowledge entity $e_i$ has type label k, measured by the frequency with which label k appears for $e_i$; when knowledge entity $e_i$ has type label k, $Y_{ij} = W_{ik}$, otherwise $Y_{ij} = 0$.
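A minimal sketch of building the type label matrix Y from per-entity label counts. The input layout (one dict of label counts per entity, empty for entities with no extracted type word), the function name `build_label_matrix`, and the alphabetical column order are assumptions made for illustration.

```python
def build_label_matrix(label_counts):
    """Return (type_dictionary, Y) for a list of per-entity label counts.

    label_counts[i] maps each type label of entity i to its frequency C_ik
    (assumed positive); an empty dict marks an unlabeled entity.
    Y has one row per entity and one column per distinct type label.
    """
    # Deduplicated dictionary of extracted type words (R entries).
    type_dictionary = sorted({label for counts in label_counts
                              for label in counts})
    column = {label: j for j, label in enumerate(type_dictionary)}
    matrix = [[0.0] * len(type_dictionary) for _ in label_counts]
    for i, counts in enumerate(label_counts):
        total = sum(counts.values())
        for label, count in counts.items():
            # W_ik: frequency of label k normalized over entity i's labels.
            matrix[i][column[label]] = count / total
    return type_dictionary, matrix
```

Rows of labeled entities sum to 1, and rows of unlabeled entities stay all zero until label propagation fills them in.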
8. The type extraction system and method for knowledge entities in technical literature according to claim 7, characterized in that, in step (S4-3), the transition probability threshold ζ is calculated as:
$$\zeta = \frac{\sum_{k=0}^{N} T_{kj}}{N}$$
where N is the number of rows of $Y_N^{t}$, and $Y_N^{t}$ is the newly added label matrix after the t-th iteration.
9. The type extraction method for knowledge entities in technical literature according to claim 2 or 8, characterized in that, in step S7, the modeling generates three tree or graph models: an entity hierarchy tree model based on entities of the same type, a knowledge relation graph model based on type grouping, and a time-series-based knowledge hotspot tracking graph model.
10. The type extraction system for knowledge entities in technical literature according to claim 9, characterized in that the modeling in step S7 specifically comprises:
(S7-1) extracting from the knowledge entity index library, according to the keyword entered by the user, the set of knowledge entities related to that keyword, where the relations considered include co-occurrence, inclusion, and extension relations in titles and abstracts;
(S7-2) building the entity hierarchy tree model based on entities of the same type: checking whether an extension or inclusion relation exists between the individual entities in the knowledge entity set; if entity $e_i$ contains entity $e_j$, a parent-child relation $R(e_i, e_j)$ is established in the tree model, indicating that $e_i$ is the parent node of $e_j$; proceeding in this way yields the hierarchy model;
(S7-3) building the knowledge relation graph model based on type grouping: the knowledge entities in the knowledge entity set are grouped by type, the weight of each type group is computed, and the knowledge entities within each group are sorted by entity weight in descending order; the N groups with the highest weights are selected, the top M knowledge entities are selected within each group, and a three-layer graph model is constructed in the order keyword, type group, entity;
(S7-4) building the time-series-based knowledge hotspot tracking graph model: the knowledge entities are sorted by time and grouped into time periods of half a year each; the number of related knowledge entities appearing in each period is counted, the knowledge entities within each period are sorted by entity weight, and the hotspot tracking graph model is built from the time groups and their corresponding entity lists;
(S7-5) converting the models described in steps (S7-2), (S7-3), and (S7-4) into JSON data and outputting them to the data visualization module.
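Steps (S7-3) and (S7-5) can be sketched together: entities are grouped by type, the top groups and top entities are selected, and the three-layer model is serialized to JSON. Taking a group's weight as the sum of its entity weights, the `name`/`children`/`weight` field names, and the function name `build_type_group_model` are assumptions; the claim does not fix a JSON schema.

```python
import json
from collections import defaultdict


def build_type_group_model(keyword, entities, top_groups=2, top_entities=2):
    """Build a keyword -> type group -> entity model as a nested dict.

    entities: iterable of (entity_name, entity_type, weight) tuples.
    """
    groups = defaultdict(list)
    for name, entity_type, weight in entities:
        groups[entity_type].append((name, weight))
    # Assumption: a group's weight is the sum of its entity weights.
    ranked = sorted(groups.items(),
                    key=lambda item: sum(w for _, w in item[1]),
                    reverse=True)
    children = []
    for entity_type, members in ranked[:top_groups]:
        members = sorted(members, key=lambda m: m[1], reverse=True)
        children.append({
            "name": entity_type,
            "children": [{"name": n, "weight": w}
                         for n, w in members[:top_entities]],
        })
    return {"name": keyword, "children": children}


def to_json(model):
    """Serialize a model for the data visualization module."""
    return json.dumps(model, ensure_ascii=False)
```

A nested `name`/`children` layout is a common input shape for tree and graph rendering in Web front ends, which is why it is used here.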

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610488849.7A CN106156286B (en) 2016-06-24 2016-06-24 Type extraction system and method towards technical literature knowledge entity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610488849.7A CN106156286B (en) 2016-06-24 2016-06-24 Type extraction system and method towards technical literature knowledge entity

Publications (2)

Publication Number Publication Date
CN106156286A (en) 2016-11-23
CN106156286B (en) 2019-09-17

Family

ID=57350111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610488849.7A Active CN106156286B (en) 2016-06-24 2016-06-24 Type extraction system and method towards technical literature knowledge entity

Country Status (1)

Country Link
CN (1) CN106156286B (en)

Also Published As

Publication number Publication date
CN106156286B (en) 2019-09-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant