CN103729402B - Method for establishing mapping knowledge domain based on book catalogue - Google Patents

Publication number
CN103729402B
Authority
CN
China
Prior art keywords
speech
node
catalogue
coordination
superior
Legal status
Active
Application number
CN201310601668.7A
Other languages
Chinese (zh)
Other versions
CN103729402A (en
Inventor
鲁伟明
张萌
魏宝刚
庄越挺
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Application filed by Zhejiang University ZJU
Priority to CN201310601668.7A
Publication of CN103729402A
Application granted
Publication of CN103729402B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90324 Query formulation using system suggestions
    • G06F16/90328 Query formulation using system suggestions using search space presentation or visualization, e.g. category or range presentation and selection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9017 Indexing; Data structures therefor; Storage structures using directory or table look-up
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for constructing a knowledge graph (mapping knowledge domain) from book catalogues. The catalogue pages of a digitized book are extracted and the catalogue entries are divided into long and short entries by length. The long entries are tagged with parts of speech using a natural language processing tool to obtain part-of-speech arrays, and candidate nodes are extracted according to conjunction, punctuation and part-of-speech rules. The long and short entries are verified against Baidu Baike and Hudong Baike; superior-subordinate relations and parallel relations are formed from the catalogue structure and serve as the skeleton of the knowledge graph. Strong and weak parallel relations are distinguished, and each is used to supplement the superior-subordinate relations incrementally. A suffix-based algorithm for mining noisy data selects nodes from the entries that failed the encyclopedia check and adds them to the knowledge graph. Finally, the weights of the relations in the supplemented knowledge graph are computed and ranked so that noise is screened out. Compared with existing knowledge graphs, the knowledge graph built by this method has richer nodes, better extensibility and higher accuracy.

Description

A method for constructing a knowledge graph based on book catalogues
Technical field
The present invention relates to generating knowledge graphs by means of artificial intelligence, data mining and similar methods, and in particular to a method for constructing a knowledge graph based on book catalogues.
Background art
With the rapid development and spread of computers, obtaining information, learning, and analysing how pieces of knowledge relate and evolve increasingly call for a knowledge graph that is rich in content and levels, highly accurate, and easy to extend. How to build such a knowledge graph has naturally become a focus of current research.
Existing Chinese knowledge graphs include HowNet, the Hudong Baike knowledge tree and the CNKI classification, but each has its own limitations and problems.
HowNet was developed by Mr. Dong Zhendong of the Chinese Academy of Sciences. It is a common-sense knowledge base that describes the concepts represented by Chinese and English words, and its basic content is the relations between concepts and between the attributes of concepts. Most HowNet nodes are common vocabulary, the hierarchy cannot go deep, the numbers of nodes and relations are comparatively small, and the content has to be produced manually.
The Hudong Baike knowledge tree follows the traditional encyclopaedia classification and divides the encyclopaedia into 12 top-level categories: people, history, culture, art, nature, geography, science, economy, life, society, sports and technology. Each top-level category is subdivided step by step into second-level, third-level and further subcategories. The structure of the Hudong Baike knowledge tree is fixed, its hierarchy is fairly shallow, and it is built manually, which makes it hard to extend.
CNKI is the China National Knowledge Infrastructure project. It is an information-system project aimed at the dissemination, sharing and value-added use of knowledge resources across society; it was initiated by Tsinghua University and Tsinghua Tongfang and founded in June 1999. The CNKI classification is based on academic disciplines: the documents in its database are divided into ten series, each split into several topics, 168 topics in total. Its shortcomings are that it has few levels, relations between nodes are sparse, the structure is fairly fixed and hard to extend, and it is built manually.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a method that automatically generates a knowledge graph from book catalogues.
The method for constructing a knowledge graph based on book catalogues comprises the following steps:
1) Select a book, digitize its catalogue pages by optical character recognition and, on the digitized catalogue structure, divide the catalogue entries into long entries and short entries according to their length;
2) Take the short entries directly as one batch of candidate nodes; tag the long entries with parts of speech using the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, and extract a second batch of candidate nodes using conjunction, punctuation and part-of-speech rules;
3) First filter both batches of candidate nodes strictly by checking whether each entry exists in Baidu Baike or Hudong Baike. For the entries that pass this check, form superior-subordinate relations from the parent-child structure of the catalogue and parallel relations from the sibling structure of the catalogue; these two parts form the skeleton of the knowledge graph;
4) Distinguish strong and weak parallel relations, select nodes from each kind, and add them incrementally to the superior-subordinate relations, enriching the skeleton obtained in the previous step;
5) Using the proposed suffix-based method for mining the useful part of noisy data, select some nodes from the entries that did not pass the Baidu Baike and Hudong Baike check and add them to the knowledge graph;
6) For each relation in the supplemented knowledge graph, compute its weight and rank the relations, so that part of the noise is screened out.
Step 2) is carried out as follows:
Each sentence is segmented into words and tagged with parts of speech using the natural language processing tool. The sentence is then split at enumeration commas (、) and at coordinating conjunctions (words tagged as conjunctions), which yields a string array a.
For each substring a[i] of the string array a, an array of <word, part of speech> pairs is formed; this is the part-of-speech array of a[i].
Adjacent elements of each part-of-speech array are then merged, and the merging order depends on the position of the substring in a. For the first substring a[0], consecutive adjectives and nouns are merged into one word, scanning from back to front. The part of speech of the last word in the part-of-speech array of a[0] is taken as the reference part of speech, which is used to improve the accuracy of merging the part-of-speech arrays of the subsequent substrings in a.
The part-of-speech arrays of the second substring a[1] and of every later substring are processed as follows:
2.1) If the reference part of speech is noun, then in the part-of-speech array of a[1] and of each later substring, match from front to back up to the last word tagged as a noun and form one word; otherwise return no result;
2.2) Knowledge nodes such as film and book titles carry delimiters such as 《》 or “”. If the reference part of speech is a delimiter, i.e. a title mark or a quotation mark, then in the part-of-speech array of a[1] and of each later substring, match from front to back up to the last delimiter and form one word; otherwise return no result;
If the reference part of speech is neither noun nor delimiter, then in the part-of-speech array of a[1] and of each later substring, form one word when the first part of speech matches the reference part of speech; otherwise return no result;
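The merging rules above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the substrings have already been segmented and tagged (the patent uses FudanNLP for that), the tag names "n", "adj", "v" and "u" are hypothetical stand-ins for the tagger's real tag set, and only rule 2.1) is covered for the later substrings.

```python
from typing import List, Optional, Tuple

# A part-of-speech array: list of (word, part of speech) pairs.
# The tag names used here ("n", "adj", ...) are hypothetical.
Tagged = List[Tuple[str, str]]


def merge_first(tagged: Tagged) -> Tuple[Optional[str], Optional[str]]:
    """Handle a[0]: merge the trailing run of adjectives and nouns,
    scanning from back to front.

    Returns (merged word or None, reference part of speech), where the
    reference part of speech is the tag of the last word of a[0].
    """
    if not tagged:
        return None, None
    reference_pos = tagged[-1][1]
    words = []
    for word, pos in reversed(tagged):
        if pos not in ("n", "adj"):
            break
        words.append(word)
    merged = "".join(reversed(words)) or None
    return merged, reference_pos


def merge_later(tagged: Tagged, reference_pos: str) -> Optional[str]:
    """Handle a[1] and later substrings, rule 2.1) only.

    If the reference part of speech is noun, take everything from the
    front up to the last noun as one word; otherwise return None.
    """
    if reference_pos != "n":
        return None  # delimiter and other cases are not sketched here
    last_noun = max(
        (i for i, (_, pos) in enumerate(tagged) if pos == "n"), default=-1
    )
    if last_noun < 0:
        return None
    return "".join(word for word, _ in tagged[: last_noun + 1])
```

For an entry such as 顺序存储和链式存储, split at the conjunction 和, `merge_first` would handle the tagged form of 顺序存储 and `merge_later` the tagged form of 链式存储.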
For words that are not found in Baidu Baike or Hudong Baike, the normalized Google distance is used to measure how strongly the two parts cohere when placed together. Two nodes whose value lies between 0 and a threshold are regarded as a reasonable word that may be merged and accepted. Likewise, when a word has been merged by the part-of-speech rules, the normalized Google distance computed over its part-of-speech array decides whether it is included:
$$\mathrm{ngd}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log m - \min\{\log f(x), \log f(y)\}}$$
where ngd(x, y) is the value computed with the normalized Google distance,
f(x, y) is the number of results Google returns for the keyword "xy",
f(x) is the number of results Google returns for the keyword "x",
f(y) is the number of results Google returns for the keyword "y",
m is the total number of web pages indexed by Google.
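As a small illustration of the normalized Google distance formula, the sketch below computes it from hit counts that the caller supplies. How the counts are obtained is outside the sketch, and the handling of zero counts (returning infinity) is an assumption made here, not something the text specifies.

```python
import math


def ngd(fx: float, fy: float, fxy: float, m: float) -> float:
    """Normalized Google distance from search hit counts.

    fx, fy -- hit counts for the keywords x and y
    fxy    -- hit count for the combined keyword "xy"
    m      -- total number of indexed pages (must exceed fx and fy)
    """
    if min(fx, fy, fxy) <= 0:
        # Assumption: terms that never occur together are maximally distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))
```

A value close to 0 means the two parts almost always occur together, which is the case in which the merged word would be accepted.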
Step 3) includes:
3.1) In the catalogue pages obtained by optical character recognition of each book, book-structure entries such as "Exercises", "Experiments" and "Examples" usually repeat many times among the catalogue entries of the same level within one book. They are therefore counted and filtered out by building, for each catalogue level, a hash table of <catalogue entry, count> pairs;
3.2) Catalogue entries are limited in length: only entries of 9 Chinese characters or fewer are taken;
An index is built from the entries of Baidu Baike and Hudong Baike. Each catalogue node is then checked against this index, and the nodes that pass the check are included;
Each catalogue node is also analysed lexically with the open-source natural language processing software FudanNLP; a node tagged as a verb is not included;
3.3) For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", the parallel relation is extracted from it using the approach of steps 2.1) and 2.2) for sentences. The extracted nodes then form superior-subordinate relations with the superior node of the "noun + conjunction + noun" entry. The superior-subordinate relations, the parallel relations and the numbers of the books in which they occur are saved;
Meanwhile, the catalogue substructure of every book must be kept during processing, so two tables are saved. The first table stores the knowledge nodes obtained by the above method together with the book numbers and page numbers at which they occur, and the book numbers in which the superior-subordinate and parallel relations between knowledge nodes occur. Because an entry that is temporarily discarded may still be a valid node, a second table stores all catalogue nodes, whether they passed or not, with their book numbers and page numbers, together with the book numbers of every superior-subordinate and parallel relation formed by all entries. Later, by computing the relevance between nodes and by counting occurrences across books, reasonable and useful knowledge nodes are recovered from the discarded entries and added incrementally to the knowledge graph. The saved position information of each entry can also be used for subsequent definition extraction.
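The filtering in steps 3.1) and 3.2) can be sketched as below. This is an illustration under stated assumptions: `encyclopedia_index` is a hypothetical in-memory set standing in for the index built from the two encyclopedias, `max_repeats` is a hypothetical cut-off for "repeats many times", and the verb check done with the part-of-speech tagger is left out.

```python
from collections import Counter
from typing import List, Set

MAX_ENTRY_LEN = 9  # step 3.2): keep entries of 9 Chinese characters or fewer


def filter_level(
    entries: List[str],
    encyclopedia_index: Set[str],
    max_repeats: int = 1,
) -> List[str]:
    """Filter the catalogue entries of one level of one book.

    entries            -- entries of the same catalogue level, in page order
    encyclopedia_index -- hypothetical set of known encyclopedia entry names
    max_repeats        -- hypothetical limit; entries occurring more often
                          are treated as book-structure noise ("Exercises")
    """
    counts = Counter(entries)  # hash table of <catalogue entry, count>
    kept = []
    for entry in dict.fromkeys(entries):  # unique entries, original order
        if counts[entry] > max_repeats:
            continue  # repeated structural entry
        if len(entry) > MAX_ENTRY_LEN:
            continue  # long entries are handled separately (step 3.3)
        if entry not in encyclopedia_index:
            continue  # did not pass the encyclopedia check
        kept.append(entry)
    return kept
```

Entries rejected here would still be written to the second table described above, so that later steps can reconsider them.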
Step 4) includes:
4.1) Distinguishing strong and weak parallel relations
The occurrences of each parallel relation are counted, and the relations are sorted by absolute frequency. Parallel-relation pairs whose absolute frequency is greater than a threshold are selected as strong parallel relations; pairs whose absolute frequency is less than the threshold are taken as weak parallel relations;
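A minimal sketch of the strong/weak split by absolute frequency follows. Two choices here are assumptions rather than part of the text: a parallel (coordination) pair is treated as unordered, and a pair whose count equals the threshold is placed in the weak group.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def split_parallel(
    pairs: Iterable[Tuple[str, str]], threshold: int
) -> Tuple[Set[Pair], Set[Pair]]:
    """Split parallel-relation pairs into (strong, weak) by frequency."""
    # Count each unordered pair; ignore degenerate pairs (x, x).
    counts = Counter(frozenset(p) for p in pairs if p[0] != p[1])
    strong = {pair for pair, n in counts.items() if n > threshold}
    weak = {pair for pair, n in counts.items() if n <= threshold}
    return strong, weak
```

The input would be every sibling pair observed across all processed catalogues, one tuple per occurrence.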
4.2) Relevance between knowledge nodes
Knowledge nodes are often ambiguous. In a data set made up of superior-subordinate relations and parallel relations, several superior nodes may point to the same subordinate node; this problem is resolved with the help of other nodes related to the knowledge node.
The detailed procedure is as follows:
4.2.1) For each a → b, find the group of subordinate nodes suba of a using the superior-subordinate relations. For each node eleofsuba in this group, find a group of parallel nodes paraofeleofsuba using the strong parallel relations, and merge all paraofeleofsuba into one set set1;
4.2.2) Using the weak parallel relations, find a group of parallel nodes parab of b. For each node c in parab, find in turn a group of parallel nodes paraofparab using the strong parallel relations; each paraofparab serves as a set set2;
4.2.3) For the sets set1 and set2, the relevance is
$$\mathrm{relevancy} = \mathrm{sameelementcount} + (\mathrm{weight1} + \mathrm{weight2}) \times 10$$
$$\mathrm{weight1} = \frac{\mathrm{sameelementcount}}{\mathrm{set1totalelementcount}},$$
$$\mathrm{weight2} = \frac{\mathrm{sameelementcount}}{\mathrm{set2totalelementcount}},$$
where sameelementcount is the number of elements the two sets have in common, weight1 and weight2 are the proportions of common elements in each set, set1totalelementcount is the total number of elements in set1, and set2totalelementcount is the total number of elements in set2.
This gives the relevance between the subordinate node c and the superior node a. When relevancy is greater than a threshold, c is regarded as a subordinate node of a and the corresponding a → c is included.
When several superior nodes point to the same subordinate node, the relevance between each of these superior nodes and the subordinate node is computed, and the superior node that forms the superior-subordinate relation with this subordinate node is selected according to the relevance values;
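The relevance computation of 4.2.1) to 4.2.3) can be sketched as follows. The dictionaries `children`, `strong` and `weak` are hypothetical in-memory stand-ins for the stored relations (node to set of subordinate nodes, node to set of strongly parallel nodes, node to set of weakly parallel nodes), and returning 0 for an empty set is an assumption.

```python
from typing import Dict, List, Set

Relations = Dict[str, Set[str]]


def relevancy(set1: Set[str], set2: Set[str]) -> float:
    """relevancy = sameelementcount + (weight1 + weight2) * 10."""
    if not set1 or not set2:
        return 0.0  # assumption: no evidence means no relevance
    same = len(set1 & set2)
    return same + (same / len(set1) + same / len(set2)) * 10


def build_set1(a: str, children: Relations, strong: Relations) -> Set[str]:
    """4.2.1): union of the strong parallel nodes of every subordinate of a."""
    result: Set[str] = set()
    for ele in children.get(a, set()):
        result |= strong.get(ele, set())
    return result


def candidate_subordinates(
    a: str, b: str, children: Relations, strong: Relations,
    weak: Relations, threshold: float,
) -> List[str]:
    """4.2.2)-4.2.3): weak parallel nodes c of b accepted as a -> c."""
    set1 = build_set1(a, children, strong)
    accepted = []
    for c in sorted(weak.get(b, set())):
        set2 = strong.get(c, set())  # strong parallel nodes of c
        if relevancy(set1, set2) > threshold:
            accepted.append(c)
    return accepted
```

The same `relevancy` value could also be used to choose among several superior nodes that point to one subordinate node, by keeping the superior node with the highest score.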
4.3) Supplementing with strong and weak parallel relations
Strong parallel relations are merged directly into the superior-subordinate relations; weak parallel relations are merged into the superior-subordinate relations by means of the relevance between knowledge nodes introduced above.
Step 5) includes:
When some entries under the same first-level catalogue heading have passed the Baidu Baike and Hudong Baike check, the entries of that book that did not pass the check are filled back in, using the catalogue substructure saved in step 1) and the book numbers and page numbers of the entries.
When a subdirectory still contains data after the Baidu Baike and Hudong Baike check, the useful part of the catalogue structure is mined on the basis of suffixes to supplement the knowledge graph.
The specific method is as follows:
Let setx be the set of elements in the superior-subordinate relations that passed the Baidu Baike and Hudong Baike check. For each superior-subordinate relation a → b, find the list of all book numbers in which a → b occurs.
For each book number, find the set sety of subordinate nodes of a recorded in that book's catalogue.
For each node node in the intersection of sety and setx (node being an entry that passed the check), find the set setz of entries in sety that share a suffix with node but did not pass the Baidu Baike and Hudong Baike check. If
$$\mathrm{percentage} = \frac{|setx \cap sety| + |setz|}{|sety|} > \mathrm{level}$$
then all remaining entries in sety other than node that did not pass the check are included; otherwise, only the entries in sety that share a suffix with node are included. Here level is a preset threshold.
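The suffix-based mining rule can be illustrated with the sketch below. Several points are assumptions made for the illustration: a "suffix" is taken to be the last `suffix_len` characters (a hypothetical parameter), setz is gathered over all verified nodes at once instead of one node at a time, and an empty sety yields nothing.

```python
from typing import Set


def mine_by_suffix(
    sety: Set[str], setx: Set[str], level: float, suffix_len: int = 1
) -> Set[str]:
    """Return the unverified entries of sety that should be added.

    sety  -- subordinate nodes of a in one book's catalogue
    setx  -- entries that passed the encyclopedia check
    level -- threshold on the share of verified or suffix-matched entries
    """
    if not sety:
        return set()
    passed = sety & setx   # verified entries under a in this book
    failed = sety - setx   # entries that did not pass the check
    suffixes = {node[-suffix_len:] for node in passed}
    setz = {entry for entry in failed if entry[-suffix_len:] in suffixes}
    percentage = (len(passed) + len(setz)) / len(sety)
    # Above the threshold every unverified sibling is trusted;
    # otherwise only those sharing a suffix with a verified node.
    return failed if percentage > level else setz
```

With sety = {冒泡排序, 快速排序, 希尔排序, 小结}, the two verified sorting entries and a two-character suffix, 希尔排序 is recovered through the shared suffix 排序, while 小结 is recovered only if the threshold is low enough.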
Step 6) includes:
6.1) Cleaning the superior-subordinate relations
The superior-subordinate relations contain three kinds of noise: type 1, redundant relations of the form a → a; type 2, short relations a → b produced when a long relation a → bc was truncated; type 3, meaningless cyclic relations a → b → a. The supplemented superior-subordinate relations are therefore read in again, types 1 and 3 are screened out, and each short relation a → b of type 2 is merged into the corresponding long relation a → bc;
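One possible reading of this cleaning step is sketched below. The interpretation is an assumption in two places: for a cycle a → b → a both edges are dropped, and a relation a → b counts as a truncated form of a → bc when b is a proper prefix of another subordinate of the same superior node, in which case only the longer relation is kept.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (superior node, subordinate node)


def clean_relations(edges: Set[Edge]) -> Set[Edge]:
    """Remove type 1, type 2 and type 3 noise from superior-subordinate edges."""
    # Type 1: redundant self relations a -> a.
    edges = {(a, b) for a, b in edges if a != b}
    # Type 3: two-node cycles a -> b -> a (assumption: drop both directions).
    edges = {(a, b) for a, b in edges if (b, a) not in edges}
    # Type 2: a -> b is absorbed by a longer a -> bc under the same superior.
    cleaned = set()
    for a, b in edges:
        truncated = any(
            parent == a and child != b and child.startswith(b)
            for parent, child in edges
        )
        if not truncated:
            cleaned.add((a, b))
    return cleaned
```

The prefix test is deliberately simple; a real pipeline would probably need extra evidence before discarding a short node that is also a legitimate concept on its own.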
6.2) Ranking the superior-subordinate relations
For the cleaned superior-subordinate relations, a weight is computed for each relation to represent its confidence;
The weight w'(t → l) of each superior-subordinate relation produced by steps 1) to 5) is computed with the following formulas, and the relations are then ranked;
$$\mathrm{idf}(l) = c(t \to l) \cdot \frac{1 + n}{1 + \mathrm{df}(l)} \qquad (1)$$
where
c(t → l) is the number of times t → l occurs;
df(l) is the number of times l occurs in the parallel relations;
n is the total number of nodes in the parallel relations;
idf(l) is the inverse document frequency of l in the parallel relations;
$$w(t \to l) = c(t \to l) \cdot \mathrm{idf}(l) \qquad (2)$$
where
w(t → l) is the weight of the edge t → l after taking into account the number of occurrences of t → l and the inverse document frequency of the subordinate node;
$$\mathrm{sim}(t, t_1) = \log\left[1 + \frac{n(t, t_1)}{\mathrm{idf}(t) \cdot \mathrm{idf}(t_1)}\right] \qquad (3)$$
where
sim(t, t1) is the similarity between t and t1;
n(t, t1) is the number of times t and t1 occur together in the parallel relations;
$$\tilde{w}(t \to l) = \frac{\log\left(\sum_{l'} w(t \to l')\right)}{\sum_{l'} w(t \to l')} \cdot w(t \to l) \qquad (4)$$
where
$\tilde{w}(t \to l)$ is the updated weight of t → l after also taking into account the different subordinates of t in the parallel relations;
$$w'(t \to l) = \tilde{w}(t \to l) + \sum_{t_1 \neq t} \left[\mu \cdot \mathrm{sim}(t, t_1) \cdot \tilde{w}(t_1 \to l)\right] \qquad (5)$$
where
w'(t → l) is the updated weight of the relation t → l after also taking into account the different superiors of l in the parallel relations and the relevance between nodes,
μ is a weighting coefficient, set to 0.5.
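The last two formulas, (4) and (5), can be illustrated as follows. The sketch takes the edge weights w(t → l) of formula (2) and the similarities sim(t, t1) of formula (3) as ready-made inputs, since how the counts are stored is not described; the dictionary layouts, and treating a missing similarity as 0, are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

Edge = Tuple[str, str]  # (superior t, subordinate l)


def normalise(w: Dict[Edge, float]) -> Dict[Edge, float]:
    """Formula (4): rescale each edge by log(S) / S, where S is the sum of
    the weights of all edges leaving the same superior node t."""
    totals: Dict[str, float] = defaultdict(float)
    for (t, _), value in w.items():
        totals[t] += value
    return {
        (t, l): (math.log(totals[t]) / totals[t] * value) if totals[t] > 0 else 0.0
        for (t, l), value in w.items()
    }


def final_weights(
    w: Dict[Edge, float],
    sim: Dict[FrozenSet[str], float],
    mu: float = 0.5,
) -> Dict[Edge, float]:
    """Formula (5): add the similarity-weighted support that other
    superiors t1 of the same subordinate l give to the edge t -> l."""
    w_tilde = normalise(w)
    superiors = defaultdict(set)
    for t, l in w_tilde:
        superiors[l].add(t)
    result = {}
    for (t, l), value in w_tilde.items():
        support = sum(
            mu * sim.get(frozenset((t, t1)), 0.0) * w_tilde[(t1, l)]
            for t1 in superiors[l]
            if t1 != t
        )
        result[(t, l)] = value + support
    return result
```

Sorting the returned dictionary by value in descending order gives the ranking from which low-weight relations can be screened out.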
Compared with the prior art, the present invention has the following beneficial effects:
1. The whole process is completed automatically by machine, without manual intervention.
2. The method is highly extensible: adding new book catalogues enriches the knowledge graph.
3. The resulting hierarchy is deep and the relations between nodes are rich; as new book catalogues are added, the depth of the hierarchy, the links between nodes and the accuracy all improve.
Brief description of the drawings
Fig. 1 is the overall flow chart of the present invention;
Fig. 2 is the flow chart of step 2);
Fig. 3 is the flow chart of step 3);
Fig. 4 is the flow chart of step 4);
Fig. 5 is the flow chart of step 5).
Detailed description of the embodiments
A kind of construction method of the knowledge mapping based on library catalogue comprises the following steps:
1) select a book, its catalogue page is carried out optical character identification and realizes digitlization, and in digitized catalogue knot On structure, according to the length of entry in catalogue, distinguish long entry and billet mesh two class entry;
2) to billet mesh directly as a collection of both candidate nodes, simultaneously by long entry using the natural language processing instrument increased income Fudannlp carries out part-of-speech tagging and obtains part of speech array, then extracts in addition a collection of time using conjunction, punctuate and part-of-speech rule Select node;
3) to two batches both candidate nodes, strictly filtered first, go to identify that this entry exists in Baidupedia, interactive encyclopaedia Whether, utilize the level structure up and down of catalogue to form relationship between superior and subordinate by the part of Baidupedia, interactive encyclopaedia identification, using mesh The peer-to-peer architecture of record forms coordination, using this two parts as the skeleton of knowledge mapping;
4) distinguish strong and weak coordination, from two kinds of coordinations, choose egress respectively, carry out on incremental supplementation enters Inferior relation, enriches the skeleton of knowledge mapping obtained in the previous step;
5) according to the method excavating useful part in noise data proposing based on suffix, never go through Baidu hundred Select a part of node in section, the entry of interactive encyclopaedia identification to be supplemented in knowledge mapping;
6) to each relation in the knowledge mapping having supplemented, calculate its weight and be ranked up again, thus screening out one Divide noise, realize sequence screening.
Described step 2) as follows:
Using natural language processing instrument, cutting word is gone to sentence and marks part of speech, be the arranged side by side of conjunction according to pause mark and part of speech Conjunction removes distich quantum splitting, and a sentence divides the character string dimension,
First to each of character string dimension a that each splits into substring a [i], formed to each a The part of speech array of one<word, the part of speech>of [i] such substring,
Next adjacent element in part of speech array is merged, merge during to character string dimension a in not Adopt different merging orders, the i.e. part of speech to first substring a [0] in character string dimension a with the character string of position Using back to front continuous adjective, noun being merged into a word during array manipulation, meanwhile, with the part of speech array of a [0] In last word part of speech on the basis of part of speech, using benchmark part of speech improve to ensuing character string in character string dimension a, Accuracy rate when merging part of speech array,
During to the part of speech array manipulation of second substring a [1] and each later substring, using as lower section Method:
2.1) if benchmark part of speech be noun, in a [1] and later each character string respective part of speech array by Match after forward direction till last part of speech is noun, form a word, otherwise not returning result;
2.2) film, title knowledge node can add " " or " " symbol, if then benchmark part of speech is to demarcate symbol, i.e. title Number, quotation marks when, then match last from the front to the back in a [1] and later each character string respective part of speech array Till individual demarcation symbol, form a word, otherwise not returning result;
If benchmark part of speech is not noun or demarcates symbol, to a [1] and the later respective part of speech of each character string In array, when first part of speech is mated with benchmark part of speech, then form a word, otherwise not returning result;
To the word do not included on Baidupedia, interactive encyclopaedia, using normalization Google distance (normalized *** Distance) both condensation degree together lower are calculated, two nodes between value is for 0 to one threshold value are considered as one group of energy The merged rational word receiving, when being identified by part of speech, when being the word merging out using part-of-speech rule, then to its word Property array, calculating value using normalization Google distance decides whether to include,
ngd ( x , y ) = max { log f ( x ) , log f ( y ) } - log f ( x , y ) log m - min { log f ( x ) , log f ( y ) }
Ngd (x, y) represents the value calculating using normalization Google distance,
F (x, y) represents the number of results that " xy " searches out in Google for keyword xy,
F (x) represents the number of results that " x " searches out in Google for keyword x,
F (y) represents the number of results that " y " searches out in Google for keyword y,
M is all of webpage number included in Google.
Described step 3) includes:
3.1) exercise in the library catalogue page that the optical character recognition process of each book is gone out, experiment, example this The books distribution information of sample is it will usually repeat repeatedly, so to pass through in the catalogue of the same level in same book The catalogue of each level is set up with a Hash table having<catalogue entry counts>just to count and filter out;
3.2) length for catalogue entry limits, and only takes the catalogue entry of 9 Chinese characters and following length;
Using the entry on Baidupedia, interactive encyclopaedia, set up out index, then process each of catalogue node When, carry out Baidupedia, interactive encyclopaedia identification, what Baidupedia, interactive encyclopaedia identification were passed through is included;
Process each catalogue in node when, carry out morphology using the natural language processing software fudannlp that increases income and divide Analysis, when part of speech is labeled as verb, then does not go to include;
3.3) for catalogue entry length the above length of 9 Chinese characters catalogue entry, if the entry class processing Type be " noun+conjunction+noun " type, using step 2.1), step 2.2) in in sentence extract coordination way , then again and the superior node of " noun+conjunction+noun " forms relationship between superior and subordinate, side by side by relationship between superior and subordinate and simultaneously The book number that relation is located preserves;
Meanwhile, need during process to keep the catalogue minor structure of every book, that is, need to preserve two tables, a table preserves and leads to Cross the above method knowledge node obtaining and its book number occurring in, page number numbering, and between knowledge node, the superior and the subordinate close The book number that system and coordination occur in, even if but the entry being temporarily dropped is also likely to be qualified node, needs to build Another table by all of directory node that pass through and unsanctioned, number and also preserve by the book number being located, the page number, The book number simultaneously each relationship between superior and subordinate that all entries are formed and coordination being located also preserves into, connects Correlation and after being counted by books distribution between getting off by calculate node, finds out rationally useful from the entry being dropped Knowledge node, the positional information entering each entry in knowledge mapping, simultaneously preserving as incremental supplementation can be used for connecing down The extraction of the definition coming.
Described step 4) includes:
4.1) differentiation of strong and weak coordination
The number of times that coordination is occurred counts, and is ranked up according to the absolute frequency that coordination occurs, will absolutely Out as strong coordination, absolute frequency is less than the pass arranged side by side of threshold value for the coordination binary group selection that frequency is more than with threshold value It is two tuples as weak coordination;
4.2) Degree of correlation between knowledge nodes
Knowledge nodes are often ambiguous. In a data set made up of superior-subordinate relations and coordinate relations, the problem of multiple superior nodes pointing to the same subordinate node is resolved with the help of other nodes related to the knowledge node.
The detailed process is as follows:
4.2.1) For each a → b, use the superior-subordinate relations to find the group of subordinate nodes suba of a. For each node eleofsuba in this group, use the strong coordinate relations to find a group of coordinate nodes paraofeleofsuba. Merge all paraofeleofsuba into one set set1;
4.2.2) Use the weak coordinate relations to find the group of coordinate nodes parab of b. For each node c in parab, use the strong coordinate relations in turn to find a group of coordinate nodes paraofparab. Each paraofparab is taken as a set set2;
4.2.3) For the sets set1 and set2, the degree of correlation is

$$\text{relevancy} = \text{sameelementcount} + (\text{weight1} + \text{weight2}) \times 10$$

$$\text{weight1} = \frac{\text{sameelementcount}}{\text{set1totalelementcount}},$$

$$\text{weight2} = \frac{\text{sameelementcount}}{\text{set2totalelementcount}},$$

where sameelementcount is the number of elements common to the two sets, weight1 and weight2 are the proportions of each set made up by the common elements, set1totalelementcount is the total number of elements in set1, and set2totalelementcount is the total number of elements in set2.
The degree of correlation between the subordinate node c and the superior node a is calculated. When relevancy is greater than a threshold, c is regarded as a subordinate node of a, and the corresponding a → c is included.
For the problem of multiple superior nodes pointing to the same subordinate node, the degree of correlation between each superior node and this subordinate node is calculated, and the superior node that forms the superior-subordinate relation with this subordinate node is selected according to the magnitude of the degree of correlation;
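The procedure of steps 4.2.1) to 4.2.3) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names `build_index`, `relevancy`, and `candidate_children` are hypothetical, coordinate relations are assumed to be symmetric, and set2 is assumed to be built separately for each candidate node c.

```python
from collections import defaultdict

def build_index(pairs):
    """Map each node to the set of nodes it is coordinate with (assumed symmetric)."""
    index = defaultdict(set)
    for x, y in pairs:
        index[x].add(y)
        index[y].add(x)
    return index

def relevancy(set1, set2):
    """relevancy = sameelementcount + (weight1 + weight2) * 10, as in step 4.2.3)."""
    if not set1 or not set2:
        return 0.0  # assumption: an empty set gives no evidence of correlation
    same = len(set1 & set2)
    weight1 = same / len(set1)
    weight2 = same / len(set2)
    return same + (weight1 + weight2) * 10

def candidate_children(a, b, children, strong, weak, threshold):
    """Return weak coordinates c of b that are accepted as new subordinates of a."""
    # set1: union of the strong coordinates of every known subordinate of a
    set1 = set()
    for child in children.get(a, ()):
        set1 |= strong.get(child, set())
    accepted = []
    for c in weak.get(b, ()):
        set2 = strong.get(c, set())  # strong coordinates of the candidate c
        if relevancy(set1, set2) > threshold:
            accepted.append(c)
    return accepted
```

Given several competing superior nodes for one subordinate node, the same `relevancy` value could be computed per superior node and the largest one kept, which is how the selection described above might be realized.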
4.3) Supplementing with strong and weak coordinate relations
Strong coordinate relations are merged directly into the superior-subordinate relations. Weak coordinate relations are merged into the superior-subordinate relations using the degree of correlation between knowledge nodes introduced above.
Said step 5) comprises:
When the data under the same first-level catalogue contain entries recognized by Baidupedia and Interactive Encyclopedia, the data in the book that were not recognized by Baidupedia and Interactive Encyclopedia are filled in again, according to the catalogue substructure saved in step 1) and the book numbers and page numbers of the entries.
When a subdirectory still contains data after Baidupedia and Interactive Encyclopedia identification, useful parts are mined from the catalogue structure based on suffixes to supplement the knowledge graph.
The concrete method is as follows:
setx is the set of elements in the superior-subordinate relations recognized by Baidupedia and Interactive Encyclopedia. For each a → b in the superior-subordinate relations, find the list of all book numbers in which a → b occurs.
For each book number, find the set sety of subordinate nodes of a in the catalogue record of that book.
For each node node in the intersection of sety and setx, that is, the part recognized by Baidupedia and Interactive Encyclopedia, find the set setz of entries in sety that have the same suffix as node but were not recognized by Baidupedia and Interactive Encyclopedia. If
percentage = (|setx ∩ sety| + |setz|) / |sety| > level
then all remaining entries in sety, other than node, that were not recognized by Baidupedia and Interactive Encyclopedia are included; otherwise, only the entries in sety that have the same suffix as node are included, where level is a preset threshold.
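The suffix-based supplement for a single book can be sketched as follows. This is a hedged illustration only: the function name `supplement_by_suffix` and the `suffix_len` parameter are hypothetical, the source does not define how long a suffix is, setz is aggregated here over all recognized nodes rather than computed node by node, and the default `level=0.75` is the value given later in the embodiment.

```python
def supplement_by_suffix(sety, setx, level=0.75, suffix_len=1):
    """Pick unrecognized entries of sety to add to the knowledge graph.

    sety: subordinate entries of one superior node in one book's catalogue.
    setx: entries recognized by the encyclopedias.
    suffix_len: assumed suffix length in characters (not specified in the source).
    """
    recognized = sety & setx
    unrecognized = sety - setx
    if not recognized:
        return set()  # nothing to compare suffixes against
    suffixes = {node[-suffix_len:] for node in recognized}
    # setz: unrecognized entries sharing a suffix with some recognized node
    setz = {entry for entry in unrecognized if entry[-suffix_len:] in suffixes}
    percentage = (len(recognized) + len(setz)) / len(sety)
    # Above the threshold, every unrecognized sibling is trusted;
    # otherwise only the suffix-matching ones are.
    return unrecognized if percentage > level else setz
```

For Chinese catalogue entries a suffix of one or two characters may be a natural choice, but that is a guess rather than something the source states.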
Said step 6) comprises:
6.1) Cleaning the superior-subordinate relations
Three types of noise exist in the superior-subordinate relations: type 1, redundant relations of the form a → a; type 2, a long relation a → bc that has been truncated into a short relation a → b; type 3, meaningless circular relations a → b → a. The supplemented superior-subordinate relations are therefore read in again, types 1 and 3 are screened out, and the short relation a → b of type 2 is merged into the long relation a → bc;
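The three cleaning rules can be sketched as below. The function name `clean_relations` is hypothetical, and two interpretations are assumptions rather than statements from the source: both directions of a circular pair are dropped, and a short relation is detected as a strict string prefix of a longer subordinate under the same superior node.

```python
def clean_relations(edges):
    """Clean (superior, subordinate) pairs according to the three noise types."""
    edges = set(edges)
    # Type 1: drop redundant self-relations a -> a
    no_loops = {(a, b) for a, b in edges if a != b}
    # Type 3: drop circular relations a -> b -> a (assumption: drop both directions)
    no_cycles = {(a, b) for a, b in no_loops if (b, a) not in no_loops}
    # Type 2: drop a -> b when a longer a -> bc exists (assumption: b is a prefix of bc)
    by_parent = {}
    for a, b in no_cycles:
        by_parent.setdefault(a, set()).add(b)
    cleaned = set()
    for a, b in no_cycles:
        truncated = any(other != b and other.startswith(b) for other in by_parent[a])
        if not truncated:
            cleaned.add((a, b))
    return cleaned
```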
6.2) Ranking the superior-subordinate relations
For the cleaned superior-subordinate relations, a weight is calculated for each relation to represent its confidence;
For the superior-subordinate relations produced by step 1) to step 5), the weight w'(t → l) is calculated according to the following formulas, and the relations are then ranked;

$$idf(l) = c(t \to l) \cdot \frac{1 + n}{1 + df(l)} \qquad (1)$$

where
c(t → l) is the number of times t → l occurs;
df(l) is the number of times l occurs in the coordinate relations;
n is the total number of nodes in the coordinate relations;
idf(l) is the inverse document frequency of l in the coordinate relations;

$$w(t \to l) = c(t \to l) \cdot idf(l) \qquad (2)$$

where
w(t → l) is the weight of the edge t → l after considering the number of times t → l occurs and the inverse document frequency of the subordinate node;

$$sim(t, t_1) = \log\left[1 + \frac{n(t, t_1)}{idf(t) \cdot idf(t_1)}\right] \qquad (3)$$

where
sim(t, t1) is the similarity between t and t1;
n(t, t1) is the number of times t and t1 occur together in the coordinate relations;

$$\tilde{w}(t \to l) = \frac{\log\left(\sum_{l'} w(t \to l')\right)}{\sum_{l'} w(t \to l')} \cdot w(t \to l) \qquad (4)$$

where
$\tilde{w}(t \to l)$ is the weight of t → l updated after additionally considering the different subordinates of t in the coordinate relations;

$$w'(t \to l) = \tilde{w}(t \to l) + \sum_{t_1 \neq t} \left[\mu \cdot sim(t, t_1) \cdot \tilde{w}(t_1 \to l)\right] \qquad (5)$$

where
w'(t → l) is the updated weight of the relation t → l after additionally considering the different superiors of l in the coordinate relations and adding the correlation between nodes,
μ is a weighting coefficient, set to 0.5.
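The ranking can be sketched in Python as follows. The sketch covers only equations (2), (4), and (5): the values idf(l) and the similarity function sim(t, t1) are assumed to be computed elsewhere and passed in, the function name `rank_relations` is hypothetical, and a zero total weight is handled with a guard that the source does not mention.

```python
import math
from collections import defaultdict

def rank_relations(counts, idf, sim, mu=0.5):
    """Rank relations t -> l by w'(t -> l), highest first.

    counts: {(t, l): c(t -> l)}; idf: {l: idf(l)}; sim: function (t, t1) -> float.
    """
    # Equation (2): w(t -> l) = c(t -> l) * idf(l)
    w = {(t, l): c * idf[l] for (t, l), c in counts.items()}

    # Equation (4): rescale by log(total) / total over all subordinates of t
    totals = defaultdict(float)
    for (t, _), value in w.items():
        totals[t] += value
    w_tilde = {}
    for (t, l), value in w.items():
        total = totals[t]
        scale = math.log(total) / total if total > 0 else 0.0  # guard is an assumption
        w_tilde[(t, l)] = scale * value

    # Equation (5): add support from the other superiors t1 of the same l
    parents = defaultdict(set)
    for t, l in w:
        parents[l].add(t)
    w_prime = {}
    for (t, l), value in w_tilde.items():
        support = sum(mu * sim(t, t1) * w_tilde[(t1, l)]
                      for t1 in parents[l] if t1 != t)
        w_prime[(t, l)] = value + support
    return sorted(w_prime.items(), key=lambda item: item[1], reverse=True)
```

Passing `sim` as a callable keeps the sketch independent of how equation (3) is evaluated, so a cached or precomputed similarity table could be substituted.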
Embodiment
The concrete steps of this embodiment are described in detail below with reference to the method of the present invention:
1) Optical character recognition (OCR) is performed on 10,000 computer books, and on the digitized catalogue structure the entries are divided by length, with 9 Chinese characters as the boundary, into two classes: long entries and short entries;
2) As shown in Fig. 1 and Fig. 2, the short entries are taken directly as one batch of candidate nodes, while the long entries are part-of-speech tagged with the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, and another batch of candidate nodes is then extracted using conjunctions, punctuation, and part-of-speech rules;
Word segmentation, part-of-speech tagging, and the splitting and merging of words for long entries proceed as follows:
A natural language processing tool segments the sentence into words and tags their parts of speech. The sentence is split at pause marks (、) and at words tagged as coordinating conjunctions, so that one sentence yields a string array a.
First, for each substring a[i] of the string array a, a part-of-speech array of <word, part of speech> pairs is formed.
Next, adjacent elements in each part-of-speech array are merged. Strings at different positions in a are merged in different orders. For the first substring a[0], consecutive adjectives and nouns are merged into one word from back to front. The part of speech of the last word in the part-of-speech array of a[0] is taken as the reference part of speech, which is used to improve the accuracy of merging the part-of-speech arrays of the subsequent strings in a.
The part-of-speech arrays of the second substring a[1] and of each later substring are processed as follows:
2.1) If the reference part of speech is a noun, the part-of-speech array of a[1] and of each later string is matched from front to back up to the last noun, forming one word; otherwise no result is returned;
2.2) Knowledge nodes that are film or book titles carry title marks (《》) or quotation marks (“”). If the reference part of speech is a delimiter, i.e. a title mark or quotation mark, the part-of-speech array of a[1] and of each later string is matched from front to back up to the last delimiter, forming one word; otherwise no result is returned;
If the reference part of speech is neither a noun nor a delimiter, then for a[1] and each later string a word is formed when the first part of speech in its part-of-speech array matches the reference part of speech; otherwise no result is returned;
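The splitting step and the noun case of the merging rules might look like the sketch below. It operates on tokens that are already segmented and tagged, so no FudanNLP call is shown. The tag names "a" (adjective), "n" (noun), and "c" (conjunction) and all function names are hypothetical placeholders, not FudanNLP's actual tag set, and only rule 2.1) is covered.

```python
def split_segments(tokens, pause="、", conj_tag="c"):
    """Split tagged tokens into the string array a at pause marks and conjunctions."""
    segments, current = [], []
    for word, pos in tokens:
        if word == pause or pos == conj_tag:
            if current:
                segments.append(current)
                current = []
        else:
            current.append((word, pos))
    if current:
        segments.append(current)
    return segments

def merge_first(segment, mergeable=("a", "n")):
    """For a[0]: merge trailing adjectives/nouns back to front; return (word, reference POS)."""
    words = []
    for word, pos in reversed(segment):
        if pos not in mergeable:
            break
        words.append(word)
    return "".join(reversed(words)), segment[-1][1]

def merge_later(segment, base_pos):
    """For a[1:], noun case: take tokens from the front up to the last reference POS."""
    last = max((i for i, (_, pos) in enumerate(segment) if pos == base_pos),
               default=None)
    if last is None:
        return None  # "no result is returned"
    return "".join(word for word, _ in segment[:last + 1])

tokens = [("线性", "a"), ("表", "n"), ("、", "w"), ("栈", "n"), ("和", "c"), ("队列", "n")]
segments = split_segments(tokens)
first_word, base = merge_first(segments[0])
later_words = [merge_later(seg, base) for seg in segments[1:]]
```

With this hand-tagged example, the long entry splits into three segments and yields the candidate nodes held in `first_word` and `later_words`.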
For words not included in Baidupedia or Interactive Encyclopedia, the Normalized Google Distance is used to calculate how cohesive the two parts are when combined. Two nodes whose value lies between 0 and a threshold are regarded as a pair that can be merged into an acceptable, reasonable word. When a word is identified by part of speech, that is, merged using the part-of-speech rules, the Normalized Google Distance is calculated over its part-of-speech array to decide whether to include it,

$$ngd(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log m - \min\{\log f(x), \log f(y)\}}$$

where
ngd(x, y) is the value calculated with the Normalized Google Distance,
f(x, y) is the number of results Google returns for the keyword "xy",
f(x) is the number of results Google returns for the keyword "x",
f(y) is the number of results Google returns for the keyword "y",
m is the total number of web pages indexed by Google.
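As a small numeric illustration, the Normalized Google Distance formula above can be evaluated from hit counts as follows. The function name `ngd` and the handling of zero counts are assumptions of this sketch, and the hit counts must be obtained separately, since no search API is called here.

```python
import math

def ngd(fx, fy, fxy, m):
    """Normalized Google Distance from hit counts f(x), f(y), f(x, y) and index size m."""
    if min(fx, fy, fxy) <= 0:
        return math.inf  # assumption: terms that never co-occur are maximally distant
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(m) - min(log_fx, log_fy)
    if denominator == 0:
        return math.inf  # degenerate case, not discussed in the source
    return (max(log_fx, log_fy) - log_fxy) / denominator
```

A value near 0 means the two parts almost always appear together, which is the case the text treats as a mergeable word.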
3) As shown in Fig. 3,
In the catalogue pages obtained by OCR for each book, book distribution information such as exercises, experiments, and examples usually repeats many times within the same level of the catalogue of a single book. Such entries can therefore be counted and filtered out by building, for each level of the catalogue, a hash table of <catalogue entry, count>;
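The per-level hash table can be sketched with a counter, as below. The function name `filter_repeated_entries` and the `max_count` cutoff are hypothetical; the source does not state how many repetitions mark an entry as book distribution information.

```python
from collections import Counter

def filter_repeated_entries(entries_by_level, max_count=1):
    """Drop entries that repeat within the same catalogue level of one book.

    entries_by_level: {level: [entry, ...]} for a single book.
    max_count: assumed cutoff; entries seen more often than this are filtered out.
    """
    kept = {}
    for level, entries in entries_by_level.items():
        counts = Counter(entries)  # the <catalogue entry, count> hash table
        kept[level] = [entry for entry in entries if counts[entry] <= max_count]
    return kept
```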
A length limit is applied to catalogue entries: only entries of 9 Chinese characters or fewer are taken;
An index is built from the entries of Baidupedia and Interactive Encyclopedia. Each catalogue node is then checked against this index during processing, and nodes recognized by Baidupedia and Interactive Encyclopedia are included;
When each catalogue node is processed, lexical analysis is performed with the open-source natural language processing software FudanNLP; nodes whose part of speech is tagged as a verb are not included;
For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", coordinate relations are extracted from the sentence in the manner shown in Fig. 2, superior-subordinate relations are then formed with the superior node of the "noun + conjunction + noun" entry, and the book numbers of the superior-subordinate relations and coordinate relations are saved;
Meanwhile, the catalogue substructure of every book must be kept during processing, which requires two tables. The first table stores the knowledge nodes obtained by the above method together with the book numbers and page numbers where they occur, and the book numbers in which each superior-subordinate relation and coordinate relation between knowledge nodes occurs. Because an entry that is temporarily discarded may still be a qualified node, a second table stores every catalogue node, whether it passed or not, with its book number and page number, as well as the book number of every superior-subordinate relation and coordinate relation formed by all entries. Later, by calculating the correlation between nodes and collecting statistics from the book distribution information, reasonable and useful knowledge nodes are recovered from the discarded entries and added to the knowledge graph as incremental supplements. The stored position information of each entry can also be used for the subsequent extraction of definitions.
4) As shown in Fig. 4, the number of occurrences of each coordinate relation is counted, and the relations are sorted by absolute frequency. Coordinate-relation pairs whose absolute frequency is greater than 4 are selected as strong coordinate relations, and pairs whose absolute frequency is less than 4 are taken as weak coordinate relations;
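The split into strong and weak coordinate relations can be sketched as follows. The function name `split_strong_weak` is hypothetical and pairs are treated as unordered. The text does not say how a pair occurring exactly 4 times is classified; this sketch places it with the weak relations, which is an assumption.

```python
from collections import Counter

def split_strong_weak(pairs, threshold=4):
    """Count unordered coordinate pairs and split them by absolute frequency."""
    counts = Counter(frozenset(pair) for pair in pairs if pair[0] != pair[1])
    ranked = counts.most_common()  # sorted by absolute frequency, highest first
    strong = [(tuple(sorted(key)), n) for key, n in ranked if n > threshold]
    # Assumption: a count equal to the threshold is treated as weak.
    weak = [(tuple(sorted(key)), n) for key, n in ranked if n <= threshold]
    return strong, weak
```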
The detailed process is as follows:
For each a → b, use the superior-subordinate relations to find the group of subordinate nodes suba of a. For each node eleofsuba in this group, use the strong coordinate relations to find a group of coordinate nodes paraofeleofsuba. Merge all paraofeleofsuba into one set set1;
Use the weak coordinate relations to find the group of coordinate nodes parab of b. For each node c in parab, use the strong coordinate relations in turn to find a group of coordinate nodes paraofparab. Each paraofparab is taken as a set set2;
For the sets set1 and set2, the degree of correlation is

$$\text{relevancy} = \text{sameelementcount} + (\text{weight1} + \text{weight2}) \times 10$$

$$\text{weight1} = \frac{\text{sameelementcount}}{\text{set1totalelementcount}},$$

$$\text{weight2} = \frac{\text{sameelementcount}}{\text{set2totalelementcount}},$$

where sameelementcount is the number of elements common to the two sets, weight1 and weight2 are the proportions of each set made up by the common elements, set1totalelementcount is the total number of elements in set1, and set2totalelementcount is the total number of elements in set2.
The degree of correlation between the subordinate node c and the superior node a is calculated. When relevancy is greater than the threshold 0.5, c is regarded as a subordinate node of a, and the corresponding a → c is included.
For the problem of multiple superior nodes pointing to the same subordinate node, the degree of correlation between each superior node and this subordinate node is calculated, and the superior node that forms the superior-subordinate relation with this subordinate node is selected according to the magnitude of the degree of correlation;
Strong coordinate relations are merged directly into the superior-subordinate relations. Weak coordinate relations are merged into the superior-subordinate relations using the degree of correlation between knowledge nodes introduced above.
5) As shown in Fig. 5, according to the proposed suffix-based method for mining the useful part of noisy data, a portion of nodes is selected from the entries not recognized by Baidupedia and Interactive Encyclopedia and added to the knowledge graph. When the data under the same first-level catalogue contain entries recognized by Baidupedia and Interactive Encyclopedia, the data in the book that were not recognized are filled in again, according to the saved catalogue substructure and the book numbers and page numbers of the entries.
When a subdirectory still contains data after Baidupedia and Interactive Encyclopedia identification, useful parts are mined from the catalogue structure based on suffixes to supplement the knowledge graph.
The concrete method is as follows:
setx is the set of elements in the superior-subordinate relations recognized by Baidupedia and Interactive Encyclopedia. For each a → b in the superior-subordinate relations, find the list of all book numbers in which a → b occurs.
For each book number, find the set sety of subordinate nodes of a in the catalogue record of that book.
For each node node in the intersection of sety and setx, that is, the part recognized by Baidupedia and Interactive Encyclopedia, find the set setz of entries in sety that have the same suffix as node but were not recognized by Baidupedia and Interactive Encyclopedia. If
percentage = (|setx ∩ sety| + |setz|) / |sety| > level
then all remaining entries in sety, other than node, that were not recognized by Baidupedia and Interactive Encyclopedia are included; otherwise, only the entries in sety that have the same suffix as node are included, where the threshold level is set to 0.75.
6) Taking the similarity between nodes into account, a weight is calculated for each relation in the supplemented knowledge graph and the relations are ranked, thereby screening out part of the noise and achieving ranking-based screening.
Three types of noise exist in the superior-subordinate relations: type 1, redundant relations of the form a → a; type 2, a long relation a → bc that has been truncated into a short relation a → b; type 3, meaningless circular relations a → b → a. The supplemented superior-subordinate relations are therefore read in again, types 1 and 3 are screened out, and the short relation a → b of type 2 is merged into the long relation a → bc. This constitutes the cleaning step;
For the cleaned superior-subordinate relations, a weight is calculated for each relation to represent its confidence;
For the superior-subordinate relations produced by step 1) to step 5), the weight w'(t → l) is calculated according to the following formulas, and the relations are then ranked;

$$idf(l) = c(t \to l) \cdot \frac{1 + n}{1 + df(l)} \qquad (1)$$

where
c(t → l) is the number of times t → l occurs;
df(l) is the number of times l occurs in the coordinate relations;
n is the total number of nodes in the coordinate relations;
idf(l) is the inverse document frequency of l in the coordinate relations;

$$w(t \to l) = c(t \to l) \cdot idf(l) \qquad (2)$$

where
w(t → l) is the weight of the edge t → l after considering the number of times t → l occurs and the inverse document frequency of the subordinate node;

$$sim(t, t_1) = \log\left[1 + \frac{n(t, t_1)}{idf(t) \cdot idf(t_1)}\right] \qquad (3)$$

where
sim(t, t1) is the similarity between t and t1;
n(t, t1) is the number of times t and t1 occur together in the coordinate relations;

$$\tilde{w}(t \to l) = \frac{\log\left(\sum_{l'} w(t \to l')\right)}{\sum_{l'} w(t \to l')} \cdot w(t \to l) \qquad (4)$$

where
$\tilde{w}(t \to l)$ is the weight of t → l updated after additionally considering the different subordinates of t in the coordinate relations;

$$w'(t \to l) = \tilde{w}(t \to l) + \sum_{t_1 \neq t} \left[\mu \cdot sim(t, t_1) \cdot \tilde{w}(t_1 \to l)\right] \qquad (5)$$

where
w'(t → l) is the updated weight of the relation t → l after additionally considering the different superiors of l in the coordinate relations and adding the correlation between nodes,
μ is a weighting coefficient, set to 0.5.
The result of running this example: after all four kinds of increments have been added, there are 25,426 superior-subordinate relations in total and 741 root nodes. The generated knowledge graph contains 843,998 nodes, with a maximum depth of 85 levels and an average depth of 28.2 levels, and the accuracy is 75.1%.
Because the level depth of the HowNet and CNKI knowledge classifications is generally in the single digits, and their node counts are far below those of the Interactive Encyclopedia classification tree, the Interactive Encyclopedia knowledge tree is chosen here as the object of comparison. Counting the computer-related subclasses of the Interactive Encyclopedia knowledge tree gives 21 root nodes and 75,434 nodes in total, with a maximum level depth of 48 and an average level depth of 7.3.
The comparison shows that this method far exceeds the current classification method in indicators such as number of nodes and level depth, while maintaining relatively high accuracy, requiring no manual intervention, and offering good extensibility.
Below, 6 examples with a level depth of 5 are selected from the knowledge graph produced by this method, together with their respective accuracy statistics:

Claims (4)

1. A method for constructing a knowledge graph based on book catalogues, characterized by comprising the following steps:
1) A book is selected, optical character recognition is performed on its catalogue pages to digitize them, and on the digitized catalogue structure the entries are divided by length into two classes: long entries and short entries;
2) The short entries are taken directly as one batch of candidate nodes, while the long entries are part-of-speech tagged with the open-source natural language processing tool FudanNLP to obtain part-of-speech arrays, and another batch of candidate nodes is then extracted using conjunctions, punctuation, and part-of-speech rules;
3) The two batches of candidate nodes are first strictly filtered by checking whether each node exists in Baidupedia or Interactive Encyclopedia. For the part recognized by Baidupedia and Interactive Encyclopedia, superior-subordinate relations are formed from the hierarchical structure of the catalogue and coordinate relations from its same-level structure, and these two parts serve as the skeleton of the knowledge graph;
4) Strong and weak coordinate relations are distinguished, nodes are selected from each of the two kinds of coordinate relation and incrementally added to the superior-subordinate relations, enriching the skeleton of the knowledge graph obtained in the previous step;
5) According to the proposed suffix-based method for mining the useful part of noisy data, a portion of nodes is selected from the entries not recognized by Baidupedia and Interactive Encyclopedia and added to the knowledge graph;
6) For each relation in the supplemented knowledge graph, its weight is calculated and the relations are ranked, thereby screening out part of the noise and achieving ranking-based screening.
2. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that said step 2) is as follows:
A natural language processing tool segments the sentence into words and tags their parts of speech. The sentence is split at pause marks (、) and at words tagged as coordinating conjunctions, so that one sentence yields a string array a.
First, for each substring a[i] of the string array a, a part-of-speech array of <word, part of speech> pairs is formed.
Next, adjacent elements in each part-of-speech array are merged. Strings at different positions in a are merged in different orders. For the first substring a[0], consecutive adjectives and nouns are merged into one word from back to front. The part of speech of the last word in the part-of-speech array of a[0] is taken as the reference part of speech, which is used to improve the accuracy of merging the part-of-speech arrays of the subsequent strings in a.
The part-of-speech arrays of the second substring a[1] and of each later substring are processed as follows:
2.1) If the reference part of speech is a noun, the part-of-speech array of a[1] and of each later string is matched from front to back up to the last noun, forming one word; otherwise no result is returned;
2.2) Knowledge nodes that are film or book titles carry title marks (《》) or quotation marks (“”). If the reference part of speech is a delimiter, i.e. a title mark or quotation mark, the part-of-speech array of a[1] and of each later string is matched from front to back up to the last delimiter, forming one word; otherwise no result is returned;
If the reference part of speech is neither a noun nor a delimiter, then for a[1] and each later string a word is formed when the first part of speech in its part-of-speech array matches the reference part of speech; otherwise no result is returned;
For words not included in Baidupedia or Interactive Encyclopedia, the Normalized Google Distance is used to calculate how cohesive the two parts are when combined. Two nodes whose value lies between 0 and a threshold are regarded as a pair that can be merged into an acceptable, reasonable word. When a word is identified by part of speech, that is, merged using the part-of-speech rules, the Normalized Google Distance is calculated over its part-of-speech array to decide whether to include it,

$$ngd(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log m - \min\{\log f(x), \log f(y)\}}$$

where
ngd(x, y) is the value calculated with the Normalized Google Distance,
f(x, y) is the number of results Google returns for the keyword "xy",
f(x) is the number of results Google returns for the keyword "x",
f(y) is the number of results Google returns for the keyword "y",
m is the total number of web pages indexed by Google.
3. The method for constructing a knowledge graph based on book catalogues according to claim 2, characterized in that said step 3) comprises:
3.1) In the catalogue pages obtained by optical character recognition for each book, book distribution information such as exercises, experiments, and examples usually repeats many times within the same level of the catalogue of a single book. Such entries can therefore be counted and filtered out by building, for each level of the catalogue, a hash table of <catalogue entry, count>;
3.2) A length limit is applied to catalogue entries: only entries of 9 Chinese characters or fewer are taken;
An index is built from the entries of Baidupedia and Interactive Encyclopedia. Each catalogue node is then checked against this index during processing, and nodes recognized by Baidupedia and Interactive Encyclopedia are included;
When each catalogue node is processed, lexical analysis is performed with the open-source natural language processing software FudanNLP; nodes whose part of speech is tagged as a verb are not included;
3.3) For catalogue entries longer than 9 Chinese characters, if the entry is of the type "noun + conjunction + noun", coordinate relations are extracted from the sentence in the manner of step 2.1) and step 2.2), superior-subordinate relations are then formed with the superior node of the "noun + conjunction + noun" entry, and the book numbers of the superior-subordinate relations and coordinate relations are saved;
Meanwhile, the catalogue substructure of every book must be kept during processing, which requires two tables. The first table stores the knowledge nodes obtained by the above method together with the book numbers and page numbers where they occur, and the book numbers in which each superior-subordinate relation and coordinate relation between knowledge nodes occurs. Because an entry that is temporarily discarded may still be a qualified node, a second table stores every catalogue node, whether it passed or not, with its book number and page number, as well as the book number of every superior-subordinate relation and coordinate relation formed by all entries. Later, by calculating the correlation between nodes and collecting statistics from the book distribution information, reasonable and useful knowledge nodes are recovered from the discarded entries and added to the knowledge graph as incremental supplements. The stored position information of each entry can also be used for the subsequent extraction of definitions.
4. The method for constructing a knowledge graph based on book catalogues according to claim 1, characterized in that said step 4) comprises:
4.1) Distinguishing strong and weak coordinate relations
The number of occurrences of each coordinate relation is counted, and the relations are sorted by absolute frequency. Coordinate-relation pairs whose absolute frequency is greater than a threshold are selected as strong coordinate relations, and pairs whose absolute frequency is less than the threshold are taken as weak coordinate relations;
4.2) Degree of correlation between knowledge nodes
Knowledge nodes are often ambiguous. In a data set made up of superior-subordinate relations and coordinate relations, the problem of multiple superior nodes pointing to the same subordinate node is resolved with the help of other nodes related to the knowledge node.
The detailed process is as follows:
4.2.1) For each a → b, use the superior-subordinate relations to find the group of subordinate nodes suba of a. For each node eleofsuba in this group, use the strong coordinate relations to find a group of coordinate nodes paraofeleofsuba. Merge all paraofeleofsuba into one set set1;
4.2.2) Use the weak coordinate relations to find the group of coordinate nodes parab of b. For each node c in parab, use the strong coordinate relations in turn to find a group of coordinate nodes paraofparab. Each paraofparab is taken as a set set2;
4.2.3) For the sets set1 and set2, the degree of correlation is

$$\text{relevancy} = \text{sameelementcount} + (\text{weight1} + \text{weight2}) \times 10$$

$$\text{weight1} = \frac{\text{sameelementcount}}{\text{set1totalelementcount}},$$

$$\text{weight2} = \frac{\text{sameelementcount}}{\text{set2totalelementcount}},$$

where sameelementcount is the number of elements common to the two sets, weight1 and weight2 are the proportions of each set made up by the common elements, set1totalelementcount is the total number of elements in set1, and set2totalelementcount is the total number of elements in set2.
The degree of correlation between the subordinate node c and the superior node a is calculated. When relevancy is greater than a threshold, c is regarded as a subordinate node of a, and the corresponding a → c is included.
For the problem of multiple superior nodes pointing to the same subordinate node, the degree of correlation between each superior node and this subordinate node is calculated, and the superior node that forms the superior-subordinate relation with this subordinate node is selected according to the magnitude of the degree of correlation;
4.3) Supplementing with strong and weak coordinate relations
Strong coordinate relations are merged directly into the superior-subordinate relations. Weak coordinate relations are merged into the superior-subordinate relations using the degree of correlation between knowledge nodes introduced above.
CN201310601668.7A 2013-11-22 2013-11-22 Method for establishing mapping knowledge domain based on book catalogue Active CN103729402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310601668.7A CN103729402B (en) 2013-11-22 2013-11-22 Method for establishing mapping knowledge domain based on book catalogue

Publications (2)

Publication Number Publication Date
CN103729402A CN103729402A (en) 2014-04-16
CN103729402B true CN103729402B (en) 2017-01-18

Family

ID=50453477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310601668.7A Active CN103729402B (en) 2013-11-22 2013-11-22 Method for establishing mapping knowledge domain based on book catalogue

Country Status (1)

Country Link
CN (1) CN103729402B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462227A (en) * 2014-11-13 2015-03-25 中国测绘科学研究院 Automatic construction method of graphic knowledge genealogy
US10417280B2 (en) * 2014-12-23 2019-09-17 Intel Corporation Assigning global edge IDs for evolving graphs
CN106355627A (en) * 2015-07-16 2017-01-25 中国石油化工股份有限公司 Method and system used for generating knowledge graphs
CN105653706B (en) * 2015-12-31 2018-04-06 北京理工大学 A kind of multilayer quotation based on literature content knowledge mapping recommends method
CN105893485B (en) * 2016-03-29 2019-02-12 浙江大学 A kind of thematic automatic generation method based on library catalogue
CN108205564B (en) * 2016-12-19 2021-04-09 北大方正集团有限公司 Knowledge system construction method and system
CN107609639A (en) * 2017-09-18 2018-01-19 前海梧桐(深圳)数据有限公司 The business data layering method and its system of imitative neuron
CN110019730A (en) * 2017-12-25 2019-07-16 上海智臻智能网络科技股份有限公司 Automatic interaction system and intelligent terminal
CN110110089B (en) * 2018-01-09 2021-03-30 网智天元科技集团股份有限公司 Cultural relation graph generation method and system
CN108491469B (en) * 2018-03-07 2021-03-30 浙江大学 Neural collaborative filtering concept descriptor recommendation method introducing concept label
CN108416024A (en) * 2018-03-08 2018-08-17 网易乐得科技有限公司 Data processing method and device, medium and computing device
CN108509420A (en) * 2018-03-29 2018-09-07 赵维平 Gu spectrum and ancient culture knowledge mapping natural language processing method
CN110019948B (en) * 2018-08-31 2022-04-26 北京字节跳动网络技术有限公司 Method and apparatus for outputting information
CN109657074B (en) * 2018-09-28 2023-11-10 北京信息科技大学 News knowledge graph construction method based on address tree
CN109597856B (en) * 2018-12-05 2020-12-25 北京知道创宇信息技术股份有限公司 Data processing method and device, electronic equipment and storage medium
CN110379520A (en) * 2019-06-18 2019-10-25 北京百度网讯科技有限公司 The method for digging and device of medical knowledge map, computer equipment and readable medium
CN111061884B (en) * 2019-11-14 2023-11-21 临沂市拓普网络股份有限公司 Method for constructing K12 education knowledge graph based on deep technology
CN111090754B (en) * 2019-11-20 2023-04-07 新华智云科技有限公司 Method for automatically constructing movie comprehensive knowledge map based on encyclopedic entries
CN112015792B (en) * 2019-12-11 2023-12-01 天津泰凡科技有限公司 Material repeated code analysis method and device and computer storage medium
CN111177411A (en) * 2019-12-27 2020-05-19 赣州市智能产业创新研究院 Knowledge graph construction method based on NLP
CN113393201A (en) * 2020-03-12 2021-09-14 阿里巴巴集团控股有限公司 Contract processing system and method and electronic equipment
CN111444352A (en) * 2020-03-26 2020-07-24 深圳壹账通智能科技有限公司 Knowledge graph construction method and device based on knowledge node membership
CN115129890A (en) * 2022-06-22 2022-09-30 青岛海尔电冰箱有限公司 Feedback data map generation method and generation device, question answering device and refrigerator
CN115809371B (en) * 2023-02-01 2023-09-01 中信联合云科技有限责任公司 Learning requirement determining method and system based on data analysis
CN117494811B (en) * 2023-11-20 2024-05-28 南京大经中医药信息技术有限公司 Knowledge graph construction method and system for Chinese medicine books

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1380620A (en) * 2001-12-18 2002-11-20 张弦 Automatic editing method of book index
CN1389811A (en) * 2002-02-06 2003-01-08 北京造极人工智能技术有限公司 Intelligent search method of search engine
CN102332023A (en) * 2011-09-27 2012-01-25 北京中科希望软件股份有限公司 Method and system for fast semantic annotation of e-book
KR20120105796A (en) * 2011-03-16 2012-09-26 주식회사 유비온 Method for intelligent tutoring and system therefor

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120324346A1 (en) * 2011-06-15 2012-12-20 Terrence Monroe Method for relational analysis of parsed input for visual mapping of knowledge information
TW201344652A (en) * 2012-04-24 2013-11-01 Richplay Information Co Ltd Method for manufacturing knowledge map

Also Published As

Publication number Publication date
CN103729402A (en) 2014-04-16

Similar Documents

Publication Publication Date Title
CN103729402B (en) Method for establishing mapping knowledge domain based on book catalogue
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
Lehmberg et al. The mannheim search join engine
Papadopoulou et al. A corpus of debunked and verified user-generated videos
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
CN104504150B (en) News public sentiment monitoring system
Stamatatos et al. Clustering by authorship within and across documents
Ryu et al. Open domain question answering using Wikipedia-based knowledge model
Caldarola et al. An approach to ontology integration for ontology reuse
CN103617280B (en) Method and system for mining Chinese event information
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
Li et al. Efficient similarity join and search on multi-attribute data
CN104281653A (en) Viewpoint mining method for ten million microblog texts
CN105528437A (en) Question-answering system construction method based on structured text knowledge extraction
KR20060122276A (en) Relation extraction from documents for the automatic construction of ontologies
CN108763348A (en) A kind of classification improved method of extension short text word feature vector
CN107871002A (en) A kind of across language plagiarism detection method based on fingerprint fusion
CN106649823A (en) Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler
CN111090994A (en) Chinese-internet-forum-text-oriented event place attribution province identification method
CN116244446A (en) Social media cognitive threat detection method and system
Campbell et al. Content+ context networks for user classification in twitter
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN111008285B (en) Author disambiguation method based on thesis key attribute network
Vosoughi et al. A semi-automatic method for efficient detection of stories on social media
CN112307364A (en) Character representation-oriented news text place extraction method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant