CN101017504A - Literature retrieval method based on semantic small-world model

Info

Publication number
CN101017504A
Authority
CN
China
Prior art keywords
node
document
semantic
query statement
chain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200710051607
Other languages
Chinese (zh)
Other versions
CN100517331C (en)
Inventor
金海
宁小敏
袁平鹏
武浩
余一娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CNB2007100516072A priority Critical patent/CN100517331C/en
Publication of CN101017504A publication Critical patent/CN101017504A/en
Application granted granted Critical
Publication of CN100517331C publication Critical patent/CN100517331C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document retrieval method based on a semantic small-world model. First, latent semantic indexing is used to extract document feature vectors, which preserves the features of the document information as far as possible while reducing its dimensionality and the amount of information to be stored. Next, a support vector machine classifies all shared documents of each node to form classification information that identifies the node's interest proportion in each document category. Finally, following the small-world phenomenon of social networks, each node links to a small number of nodes whose interests are similar to its own and to a few nodes with a very high interest proportion in a particular document category, forming a network topology with small-world properties.

Description

Document retrieval method based on a semantic small-world model
Technical field
The invention belongs to the fields of distributed computing and information retrieval in computer science, and specifically relates to a document retrieval method based on a semantic small-world model. The method uses the semantic small-world model mainly to solve the problem of efficient information storage and retrieval in peer-to-peer networks that share document information.
Background art
Peer-to-peer network systems are attracting growing attention in large-scale information retrieval because of their scalability, fault tolerance, autonomy and self-organization. However, storing and retrieving information effectively in a peer-to-peer network that shares document information remains a very challenging problem.
The small-world phenomenon is widespread in social networks: any two people in the world can be connected through a very short chain of social relationships, generally no longer than six links, which is known as the "six degrees of separation" theory. The phenomenon arises because people in a social network usually have some friends whose interests are similar to their own, and also a few friends whose interests are not necessarily similar but who have numerous social ties, so that people can reach one another through a very short "friend of a friend" chain.
Latent semantic indexing is an extension of the vector space model of traditional information retrieval. It can eliminate the synonymy and polysemy that are widespread in information retrieval and that degrade recall and precision, and it reduces the dimensionality of documents on the basis of their semantic concept space, which reduces the amount of document information to be stored.
The support vector machine is a machine learning method widely used in fields such as pattern recognition and data classification; it can classify large document collections efficiently and accurately.
At present, information storage and retrieval in peer-to-peer networks relies mainly on the following methods: centralized indexing (as in Napster and BitTorrent), query flooding (Gnutella) or random walks. All of these methods require exact metadata matching (such as a file name or keyword) to satisfy a search request. Because the semantic information of other nodes in the network is unavailable, a large number of nodes must be searched blindly to guarantee recall, which places a heavy load on the network. Query performance can be improved by using neighbor-node index information (such as a local index) to guide query messages, but updating the index information incurs a very large overhead. Structured peer-to-peer networks based on distributed hash tables (such as CAN and Chord) provide scalability and effective search performance, but they support only key/value lookups, which are unsuitable for full-text search in information retrieval, and the overhead of maintaining the structured network is very large.
Summary of the invention
The purpose of the invention is to provide a document retrieval method based on a semantic small-world model that improves the recall and query speed of retrieval.
The document retrieval method based on a semantic small-world model according to the invention comprises the following steps:
(1) establishing an overall network topology with semantic small-world characteristics, which comprises:
(1.1) extracting document feature vectors using latent semantic indexing, where a document feature vector comprises the frequency with which each word occurs in the document and the number of times it occurs in the whole collection of documents whose feature vectors are to be extracted;
(1.2) on the basis of the document feature vectors, performing supervised learning on the training documents with a support vector machine to obtain a support vector model;
(1.3) every machine in the peer-to-peer network that participates in document sharing is called a node; after obtaining the support vector model, each node classifies all of its shared documents to form classification information, which identifies the node's interest proportion in each document category;
(1.4) each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own, the selection criterion being that the similarity exceeds a predetermined similarity threshold;
(1.5) if a node's interest proportion in some document category is very high and exceeds a predetermined threshold, the node is designated a super semantic node, which other nodes may link to directly;
(1.6) every node links directly, with probability p, to super semantic nodes beyond the two-hop range, where 0<p≤0.001;
(2) performing document information retrieval on the established network topology with semantic small-world characteristics, which comprises:
(2.1) a node issues a query request, each query statement comprising the query keywords and the document category of the query;
(2.2) if the document category of the query belongs to the document categories of the node issuing the query statement and its proportion there is greater than 50%, go to step (2.3); otherwise, go to step (2.5);
(2.3) the node performs a local search and returns the query results;
(2.4) the query statement is forwarded to each short-link node of this node; in addition, for each long-link node, if the document category with its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.5) the query statement is forwarded to each neighbor node that is directly physically connected to this node; in addition, for each long-link node, if the document category with its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.6) the query ends.
To address the storage and retrieval-efficiency problems of peer-to-peer networks that share document information, the invention combines latent semantic indexing and support vector machines with the small-world phenomenon of social networks to provide a storage and retrieval method suited to such networks. The method organizes document information in a semantic manner and exploits the small-world phenomenon of social networks (people in a social network can become acquainted through very short paths) to improve the recall and query speed of retrieval while reducing message transmission and network load. With the method, a query statement is routed to the nodes most likely to answer the request rather than being routed blindly as in traditional approaches, which improves query efficiency. At the same time, the long links of the small world are fully exploited so that a query statement can also be routed quickly to other parts of the network instead of being trapped in a small search range, which improves recall, an important metric of information retrieval. Specifically, the invention has the following features:
(1) using latent semantic indexing to extract document feature vectors reduces the amount of stored information while preserving the features of the document information as far as possible;
(2) using a support vector machine to classify the document information of a node gives high accuracy; more importantly, the document classification information of a node expresses the semantics of that node and provides effective support for subsequent searches;
(3) exploiting the small-world phenomenon allows query information to be routed quickly to relevant nodes, which improves recall and reduces network overhead.
Description of drawings
Fig. 1 is a flowchart of establishing the network topology with semantic small-world characteristics.
Fig. 2 is a flowchart of document information retrieval based on the semantic topology.
Embodiments
The invention is further described below with reference to the drawings and specific embodiments.
The invention comprises two main steps: first, a network topology with semantic small-world characteristics is established; second, document information retrieval is performed on the established topology. The two steps are described in turn below.
(1) establishing an overall network topology with semantic small-world characteristics, which comprises:
(1.1) extracting document feature vectors using latent semantic indexing, where a document feature vector comprises the frequency with which each word occurs in the document and the number of times it occurs in the whole collection of documents whose feature vectors are to be extracted;
(1.2) on the basis of the document feature vectors, performing supervised learning on the training documents with a support vector machine to obtain a support vector model;
(1.3) after obtaining the support vector model, each node in the peer-to-peer network classifies all of its shared documents to form classification information, which identifies the node's interest proportion in each document category; the classification standard is determined by the specific application; for sharing computer science documents, for example, the ACM computing classification system can be used, with categories such as Computer Systems Organization, Mathematics of Computing and Information Systems;
(1.4) each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own, the selection criterion being that the similarity exceeds a predetermined similarity threshold whose value lies in the range [0.5, 1], which satisfies the short-link requirement of the small-world phenomenon;
(1.5) if a node's interest proportion in some document category is very high and exceeds a predetermined threshold whose value lies in the range [0.8, 1], the node is designated a super semantic node, which other nodes may link to directly;
(1.6) every node links directly, with probability p, to super semantic nodes beyond the two-hop range, that is, the likelihood that a node links to a super semantic node is p, where 0<p≤0.001, which satisfies the long-link requirement of the small-world phenomenon;
After steps (1.1)-(1.6) are completed, every node in the peer-to-peer network has a small number of directly connected short-link nodes whose interests are similar to its own, and also a few long links to nodes whose interests are not necessarily similar but whose interest proportion in some document category is very high, which forms a network topology with semantic small-world characteristics.
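Step (1.4) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the adjacency dictionary, the function names `within_two_hops` and `choose_short_links`, and the toy similarity (a plain inner product of interest-proportion vectors) are assumptions introduced here, and the threshold 0.5 is simply the lower end of the stated range [0.5, 1].

```python
def within_two_hops(adjacency, node):
    """Return every node reachable from `node` in one or two hops."""
    one_hop = set(adjacency.get(node, ()))
    two_hop = set()
    for neighbor in one_hop:
        two_hop.update(adjacency.get(neighbor, ()))
    return (one_hop | two_hop) - {node}


def choose_short_links(adjacency, node, similarity, threshold=0.5):
    """Keep the two-hop candidates whose similarity to `node` exceeds the threshold."""
    candidates = within_two_hops(adjacency, node)
    return sorted(c for c in candidates if similarity(node, c) > threshold)


# Hypothetical four-node chain a - b - c - d with toy interest proportions.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
proportions = {"a": [0.8, 0.2], "b": [0.1, 0.9], "c": [0.9, 0.1], "d": [0.8, 0.2]}


def toy_similarity(u, v):
    # Assumed stand-in for the similarity measure: inner product of proportions.
    return sum(x * y for x, y in zip(proportions[u], proportions[v]))


short_links = choose_short_links(adjacency, "a", toy_similarity)
```

In this toy network node d has interests close to those of node a, but it lies three hops away, so only node c is selected as a short link.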
(2) performing document information retrieval on the established network topology with semantic small-world characteristics, which comprises:
(2.1) a node issues a query request, each query statement comprising the query keywords and the document category of the query;
(2.2) if the document category of the query belongs to the higher-proportion part (chosen as a proportion greater than 50%) of the node issuing the query statement, go to step (2.3); otherwise, go to step (2.5);
(2.3) the node performs a local search and returns the query results;
(2.4) the query statement is forwarded to each short-link node of this node; in addition, for each long-link node, if the document category with its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.5) the query statement is forwarded to each neighbor node that is directly physically connected to this node; in addition, for each long-link node, if the document category with its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.6) the query ends.
Example:
(1) A concrete implementation of establishing the network topology with semantic small-world characteristics comprises the following steps:
(1.1) Document feature vectors are extracted using latent semantic indexing, as follows:
Latent semantic indexing is an extension of the vector space model of traditional information retrieval. In the vector space model, documents and queries are represented by the weights of all the words in the document collection, and the similarity between a query statement and a document is given by the cosine of the angle between the two in the vector space. If a collection of $d$ documents contains $t$ distinct words, the collection is represented by the word-document matrix $A = (a_{ij}) \in \mathbb{R}^{t \times d}$. Each column vector $a_j$ corresponds to document $j$, and $a_{ij}$ is the weight of word $i$ in document $j$. By singular value decomposition, $A$ is factored into three matrices $U$, $\Sigma$ and $V$, so that $A = U \Sigma V'$, where $\Sigma$ is a diagonal matrix with $t$ rows and $d$ columns whose singular values satisfy $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{\min(t,d)}$. Keeping the $k$ largest singular values in $\Sigma$, $A$ is approximated by $A_k = U_k \Sigma_k V_k'$;
(1.2) The support vector machine performs supervised learning on the training documents to obtain the support vector model, which is represented by the $d$ column vectors of the matrix $\Sigma_k V_k'$;
(1.3) All shared documents of the node are classified to form classification information, as follows:
Each document on the node is represented as a document vector $p'$, and the support vector model is used to classify the vector $U_k' p'$. The document semantics of the node are represented as $S = \{N, Pr\}$, where $N$ is the total number of documents on the node and $Pr = \{Pr_1, Pr_2, \dots, Pr_m\}$ gives the proportion of each document category;
(1.4) Each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own, the selection criterion being that the similarity exceeds the predetermined similarity threshold of 0.5, as follows:
For nodes $P_1$ and $P_2$ with document semantics $S_1 = \{N_1, Pr_1\}$ and $S_2 = \{N_2, Pr_2\}$, the similarity between them is $\mathrm{Sim}(P_1, P_2) = \frac{1 + \log \min(N_1, N_2)}{1 + \log \min(N_1, N_2)} \times (\lVert Pr_1 \rVert \, \lVert Pr_2 \rVert)$, where $\lVert Pr_1 \rVert \, \lVert Pr_2 \rVert$ denotes vector multiplication.
(1.5) If a node's interest proportion in some document category is very high and exceeds a predetermined threshold, the node is designated a super semantic node, which other nodes may link to directly. Every node links directly, with probability 0.001, to super semantic nodes beyond the two-hop range. The measure of a super semantic node $P$ is $U(P) = \frac{1 + \log N}{1 + \log(\max N_i)} \times \max Pr_i$, where $i$ ranges over all nodes in the peer-to-peer network and the predetermined threshold is 0.8: if $U(P) > 0.8$, the node is defined as a super semantic node. The probability that another node links directly to this super semantic node is $d(u, v)^{-r}$, where $d(u, v)$ is the shortest hop count between node $u$ and node $v$, and $r$ is half the average degree of the peer-to-peer network.
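Steps (1.3) and (1.4) can be sketched in a few lines. The sketch rests on assumptions that the text does not confirm: the function names are hypothetical, the logarithm is taken as natural, the "vector multiplication" of the two proportion vectors is read as an inner product, and the denominator of the size factor is taken as max(N1, N2) rather than the printed min(N1, N2), since with the printed form the factor would always equal 1.

```python
import math
from collections import Counter


def node_semantics(labels, categories):
    """Build S = (N, Pr) from the category label assigned to each shared document."""
    n = len(labels)
    counts = Counter(labels)
    if n == 0:
        return 0, [0.0] * len(categories)
    return n, [counts[c] / n for c in categories]


def similarity(s1, s2):
    """Sim(P1, P2) under the assumptions stated in the lead-in."""
    (n1, pr1), (n2, pr2) = s1, s2
    if min(n1, n2) < 1:
        return 0.0  # a node without documents has no measurable interest
    # Assumed size factor: (1 + log min) / (1 + log max).
    size_factor = (1 + math.log(min(n1, n2))) / (1 + math.log(max(n1, n2)))
    # Assumed reading of "vector multiplication": inner product of proportions.
    overlap = sum(a * b for a, b in zip(pr1, pr2))
    return size_factor * overlap


# Hypothetical category labels, as if produced by the classifier of step (1.3).
categories = ["systems", "math", "info"]
s1 = node_semantics(["systems"] * 3 + ["math"], categories)
s2 = node_semantics(["systems"] * 4, categories)
sim = similarity(s1, s2)
is_short_link_candidate = sim > 0.5  # threshold from step (1.4)
```

Under these assumptions two nodes score highly only when they hold comparable numbers of documents and concentrate on the same categories.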
Following the above process, every node in the peer-to-peer network has a small number of directly connected short-link nodes whose interests are similar to its own, and also a few long links to nodes whose interests are not necessarily similar but whose interest proportion in some document category is very high, which forms a network topology with semantic small-world characteristics.
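The super semantic node test and the long-link probability of step (1.5) can be illustrated as below. This is a sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, the logarithm is taken as natural, N is read as the node's own document count, max N_i as the largest document count over all nodes, and max Pr_i as the node's largest category proportion.

```python
import math


def super_node_score(n_docs, max_docs, proportions):
    """U(P) = (1 + log N) / (1 + log max N_i) * max Pr_i."""
    if n_docs < 1 or max_docs < 1:
        return 0.0  # no documents, so no measurable specialization
    return (1 + math.log(n_docs)) / (1 + math.log(max_docs)) * max(proportions)


def is_super_semantic_node(n_docs, max_docs, proportions, threshold=0.8):
    """Apply the threshold of 0.8 given in step (1.5)."""
    return super_node_score(n_docs, max_docs, proportions) > threshold


def link_probability(hops, average_degree):
    """d(u, v) ** (-r), with r taken as half the average degree of the network."""
    if hops < 1:
        raise ValueError("hop count must be at least 1")
    return hops ** (-(average_degree / 2))


# Hypothetical nodes: both are 90% focused on one category,
# but only the first holds as many documents as the largest node.
large_specialist = is_super_semantic_node(1000, 1000, [0.9, 0.1])
small_specialist = is_super_semantic_node(10, 1000, [0.9, 0.1])
p_link = link_probability(3, 4)  # three hops away, average degree 4
```

The size factor means that a strongly specialized node with few documents does not qualify, and the link probability falls off quickly with hop distance.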
(2) On the established network topology with semantic small-world characteristics, document information retrieval can be performed as follows:
(2.1) A node issues a query request. Each query statement comprises the query keywords and the document category of the query: the query statement is $Q = \{K, c\}$, where $K$ is the query keywords and $c$ is the query document category;
(2.2) If the document category of the query belongs to the higher-proportion part (a proportion greater than 50%) of the node issuing the query statement, the node performs a local search and returns the query results. At the same time, the query statement $Q$ is forwarded to each short-link node of this node. For each long-link node, if the highest-proportion document category of that long-link node is $c$, the query statement $Q$ is forwarded to it for processing;
(2.3) If the document category of the query does not belong to the higher-proportion part of the node issuing the query statement, the query statement $Q$ is forwarded to the neighbor nodes that are directly physically connected to this node. In addition, for each long-link node, if the highest-proportion document category of that long-link node is $c$, the query statement $Q$ is forwarded to it for processing;
(2.4) The query ends.
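The forwarding decision of steps (2.1)-(2.4) can be sketched as a single function. Everything named here is an assumption made for illustration: the `Node` class, its fields, `route_query`, and the local search, which is reduced to a case-insensitive substring match on document titles as a placeholder for a real search.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    proportions: dict                      # category -> interest proportion
    documents: list = field(default_factory=list)
    short_links: list = field(default_factory=list)
    long_links: list = field(default_factory=list)
    physical_neighbors: list = field(default_factory=list)


def top_category(node):
    """Category in which the node has its highest interest proportion."""
    return max(node.proportions, key=node.proportions.get)


def route_query(node, keyword, category, threshold=0.5):
    """Return (local hits, names of the nodes that receive Q = {K, c})."""
    hits, targets = [], []
    if node.proportions.get(category, 0.0) > threshold:
        # Step (2.2): search locally, then forward along the short links.
        hits = [d for d in node.documents if keyword.lower() in d.lower()]
        targets.extend(node.short_links)
    else:
        # Step (2.3): fall back to the directly connected physical neighbors.
        targets.extend(node.physical_neighbors)
    # Both branches: forward to long links whose top category matches c.
    targets.extend(n for n in node.long_links if top_category(n) == category)
    names = []
    for target in targets:
        if target.name not in names:  # avoid sending the query twice
            names.append(target.name)
    return hits, names


hub = Node("hub", {"math": 0.9, "systems": 0.1})
peer = Node("peer", {"systems": 0.7, "math": 0.3})
nbr = Node("nbr", {"info": 1.0})
origin = Node("origin", {"systems": 0.6, "math": 0.4},
              documents=["Distributed hash tables", "Cache design"],
              short_links=[peer], long_links=[hub], physical_neighbors=[nbr])

hits_a, targets_a = route_query(origin, "hash", "systems")
hits_b, targets_b = route_query(origin, "svd", "math")
```

A "systems" query is answered locally and passed to the similar peer, whereas a "math" query skips the local search and reaches the math-focused hub through the long link.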
With the above method, a query statement is routed to the nodes most likely to answer the request rather than being routed blindly as in traditional approaches, which improves query efficiency. At the same time, the long links of the small world are fully exploited so that a query statement can also be routed quickly to other parts of the network instead of being confined to a small network range, which improves the recall of information retrieval.
The scheme is suitable not only for peer-to-peer networks that share document information; equivalent changes or substitutions can also be made according to the technical scheme of the invention, for example for peer-to-peer networks that share image information, and all such changes or substitutions fall within the protection scope of the claims of the invention.

Claims (1)

1. A document retrieval method based on a semantic small-world model, comprising the following steps:
(1) establishing an overall network topology with semantic small-world characteristics, which comprises:
(1.1) extracting document feature vectors using latent semantic indexing, where a document feature vector comprises the frequency with which each word occurs in the document and the number of times it occurs in the whole collection of documents whose feature vectors are to be extracted;
(1.2) on the basis of the document feature vectors, performing supervised learning on the training documents with a support vector machine to obtain a support vector model;
(1.3) after obtaining the support vector model, each node in the peer-to-peer network classifies all of its shared documents to form classification information, which identifies the node's interest proportion in each document category;
(1.4) each node selects, within a range of two hops, nodes whose interest proportions across document categories are similar to its own, the selection criterion being that the similarity exceeds a predetermined similarity threshold;
(1.5) if a node's interest proportion in some document category is very high and exceeds a predetermined threshold, the node is designated a super semantic node, which other nodes may link to directly;
(1.6) every node links directly, with probability p, to super semantic nodes beyond the two-hop range, where 0<p≤0.001;
(2) performing document information retrieval on the established network topology with semantic small-world characteristics, which comprises:
(2.1) a node issues a query request, each query statement comprising the query keywords and the document category of the query;
(2.2) if the document category of the query belongs to the document categories of the node issuing the query statement and its proportion there is greater than 50%, go to step (2.3); otherwise, go to step (2.5);
(2.3) the node performs a local search and returns the query results;
(2.4) the query statement is forwarded to each short-link node of this node; in addition, for each long-link node, if the document category with its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.5) the query statement is forwarded to each neighbor node that is directly physically connected to this node; in addition, for each long-link node, if the document category with its highest interest proportion matches the document category of the query statement, the query statement is forwarded to that long-link node; then go to step (2.6);
(2.6) the query ends.
CNB2007100516072A 2007-03-02 2007-03-02 Literature retrieval method based on semantic small-word model Expired - Fee Related CN100517331C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100516072A CN100517331C (en) 2007-03-02 2007-03-02 Literature retrieval method based on semantic small-word model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007100516072A CN100517331C (en) 2007-03-02 2007-03-02 Literature retrieval method based on semantic small-word model

Publications (2)

Publication Number Publication Date
CN101017504A (en) 2007-08-15
CN100517331C (en) 2009-07-22

Family

ID=38726511

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100516072A Expired - Fee Related CN100517331C (en) 2007-03-02 2007-03-02 Literature retrieval method based on semantic small-word model

Country Status (1)

Country Link
CN (1) CN100517331C (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101877711A (en) * 2009-04-28 2010-11-03 华为技术有限公司 Social network establishment method and device, and community discovery method and device
CN102136007A (en) * 2011-03-31 2011-07-27 石家庄铁道大学 Small world property-based engineering information organization method
CN102663001A (en) * 2012-03-15 2012-09-12 华南理工大学 Automatic blog writer interest and character identifying method based on support vector machine
US8910252B2 (en) 2009-04-14 2014-12-09 Huwei Technologies Co., Ltd. Peer enrollment method, route updating method, communication system, and relevant devices
CN107038155A (en) * 2017-04-23 2017-08-11 四川用联信息技术有限公司 The extracting method of text feature is realized based on improved small-world network model

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8910252B2 (en) 2009-04-14 2014-12-09 Huwei Technologies Co., Ltd. Peer enrollment method, route updating method, communication system, and relevant devices
US9819688B2 (en) 2009-04-14 2017-11-14 Huawei Technologies Co., Ltd. Peer enrollment method, route updating method, communication system, and relevant devices
US10616243B2 (en) 2009-04-14 2020-04-07 Huawei Technologies Co., Ltd. Route updating method, communication system, and relevant devices
CN101877711A (en) * 2009-04-28 2010-11-03 华为技术有限公司 Social network establishment method and device, and community discovery method and device
CN101877711B (en) * 2009-04-28 2013-08-28 华为技术有限公司 Social network establishment method and device, and community discovery method and device
CN102136007A (en) * 2011-03-31 2011-07-27 石家庄铁道大学 Small world property-based engineering information organization method
CN102136007B (en) * 2011-03-31 2013-07-10 石家庄铁道大学 Small world property-based engineering information organization method
CN102663001A (en) * 2012-03-15 2012-09-12 华南理工大学 Automatic blog writer interest and character identifying method based on support vector machine
CN107038155A (en) * 2017-04-23 2017-08-11 四川用联信息技术有限公司 The extracting method of text feature is realized based on improved small-world network model

Also Published As

Publication number Publication date
CN100517331C (en) 2009-07-22

Similar Documents

Publication Publication Date Title
Batsakis et al. Improving the performance of focused web crawlers
Ahamed et al. An intelligent web search framework for performing efficient retrieval of data
CN100517331C (en) Literature retrieval method based on semantic small-word model
CN104391908B (en) Multiple key indexing means based on local sensitivity Hash on a kind of figure
Hammouda et al. Collaborative document clustering
CN101272399A (en) Method for implementing full text retrieval system based on P2P network
CN107153687B (en) Indexing method for social network text data
Vishwakarma et al. A comparative study of K-means and K-medoid clustering for social media text mining
Li et al. Efficient continual cohesive subgraph search in large temporal graphs
Gunaratna et al. Alignment and dataset identification of linked data in semantic web
Yuan et al. A distributed link prediction algorithm based on clustering in dynamic social networks
Hammouda et al. Distributed collaborative web document clustering using cluster keyphrase summaries
CN104281714A (en) Hospital portal website clinic specialist information extracting system
Ke et al. Scalability of findability: effective and efficient IR operations in large information networks
CN107169037B (en) Personalized search method combining sequencing dynamic modeling and emotion semantics
Selvakumara Samy et al. Intelligent web-history based on a hybrid clustering algorithm for future-internet systems
Du et al. A novel page ranking algorithm based on triadic closure and hyperlink-induced topic search
Su et al. Bibliometric assessments of network formations by keyword-based vector space model
Gaou et al. The optimization of search engines to improve the ranking to detect user’s intent
Ban et al. CICPV: A new academic expert search model
Li et al. Document clustering for distributed fulltext search
CN112749246A (en) Search phrase evaluation method, device, server and storage medium
Liu et al. Discovery of web services based on collaborated semantic link network
Ke et al. Studying the clustering paradox and scalability of search in highly distributed environments
Gadamshetti et al. RDRLLJ: Integrating Deep Learning Approach with Latent Semantic Analysis for Document Retrieval

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090722

Termination date: 20120302