CN104182464A - Semantic-based text retrieval method

Info

Publication number: CN104182464A
Application number: CN201410348390.1A
Authority: CN
Inventors: 贾岩
Original assignee: ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Current assignee: ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date: 2014-07-21
Filing date: 2014-07-21
Publication date: 2014-12-03

Abstract

The invention provides a semantic-based text retrieval method that addresses the failure of conventional retrieval to distinguish polysemy from synonymy, which causes wrong or missed retrieval results. In the method, the concepts of words replace the words themselves for indexing and retrieval, and the retrieved documents are ranked. The method comprises the following steps: S1, a concept tree is built from the concepts of words, and a word similarity matrix is calculated; S2, the concepts of a target document are extracted with reference to a preset ontology, the target document is indexed by concept, and an index file is generated; S3, the user's initial query is segmented into words, similar items whose similarity to a query word is greater than a threshold M are looked up in the word similarity matrix, and the similar items are added to the user query with an OR relation; S4, a search engine searches the target documents according to the user query; S5, document similarity is evaluated according to word similarity, and the documents are ranked; and S6, the document data are read and the ranking result is output.

Description

A semantic-based text retrieval method

Technical field

The present invention relates to the technical field of information retrieval, and in particular to a semantic-based text retrieval method.

Background art

Modern society has entered the information age, and the steadily growing information resources on the Internet have become an important source of information. Many technologies currently offer customized information search. Some of them place heavy demands on the underlying information infrastructure, take a long time to deploy, and are costly to build and maintain; their main customers are very large enterprises and governments, and ordinary enterprises and individuals cannot afford them. Others support only the most basic retrieval functions, cover a narrow search range, and return incomplete results. In Chinese in particular, polysemy (one word with several meanings) and synonymy (several words with one meaning) are very common. Existing retrieval techniques have difficulty distinguishing the two cases, which often leads to wrong or missed retrieval results.

Summary of the invention

In view of the problems in the background art, the present invention proposes a semantic-based text retrieval method that solves the problem of wrong or missed retrieval results caused by the inability to distinguish polysemy from synonymy during retrieval.

The semantic-based text retrieval method proposed by the present invention converts words into concepts, uses the concepts of words in place of the words for indexing and retrieval, and ranks the retrieved documents. It comprises:

S1, building a concept tree according to the concepts of words, and calculating a word similarity matrix;

S2, with reference to a preset ontology, extracting the concepts of the target documents, indexing the target documents by concept, and generating an index file;

S3, segmenting the user's initial query into words, finding in the word similarity matrix the similar items whose similarity to a query word is greater than a threshold M, and adding the similar items to the user query with an OR relation;

S4, searching, by a search engine, the target documents according to the user query;

S5, evaluating document similarity according to word similarity and ranking the documents;

S6, reading the document data and outputting the ranking result.
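
Step S5 does not spell out how document similarity is derived from word similarity. The following is a minimal sketch under an assumed rule: each document is represented by its set of concepts and is scored by the average, over the query terms, of the best similarity to any concept in the document. The names `word_sim`, `score_document` and `rank_documents`, the dictionary layout of the similarity data, and the scoring rule itself are illustrative assumptions, not the method as claimed.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical similarity lookup: sim[(w1, w2)] is a value in [0, 1].
Sim = Dict[Tuple[str, str], float]


def word_sim(sim: Sim, w1: str, w2: str) -> float:
    """Symmetric lookup; identical words score 1.0, unknown pairs 0.0."""
    if w1 == w2:
        return 1.0
    return sim.get((w1, w2), sim.get((w2, w1), 0.0))


def score_document(query: List[str], concepts: Set[str], sim: Sim) -> float:
    """ASSUMED rule: mean over query terms of the best match in the document."""
    if not query or not concepts:
        return 0.0
    best = [max(word_sim(sim, q, c) for c in concepts) for q in query]
    return sum(best) / len(best)


def rank_documents(
    query: List[str], docs: Dict[str, Set[str]], sim: Sim
) -> List[Tuple[str, float]]:
    """Score every document and sort in descending order of similarity."""
    scored = [(doc_id, score_document(query, c, sim)) for doc_id, c in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Sorting in descending order matches the output order the method prefers for step S6.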

Preferably, step S6 comprises:

S61, determining the user's preferences by combining the user's high-frequency search terms with the user's most recent search terms;

S62, adjusting the document ranking according to the user's preferences, then reading the document data and outputting the ranking result.

Preferably, the word similarity of words W1 and W2 is calculated by the formula:

where dis(W1, W2) is the distance on the concept tree between the concepts corresponding to words W1 and W2, and a is a computational constant.
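
As an illustration of a distance-based similarity on a concept tree, the sketch below computes dis as the number of edges between two concepts through their lowest common ancestor. It then ASSUMES the form sim = a / (dis + a), which is common in HowNet-style similarity measures; the text here does not confirm that form. The child-to-parent dictionary, the function names, and the default value of `a` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical concept tree stored as child -> parent; the root maps to None.
Tree = Dict[str, Optional[str]]


def path_to_root(tree: Tree, node: str) -> List[str]:
    """Return the list of concepts from `node` up to the root, inclusive."""
    path = [node]
    while tree[node] is not None:
        node = tree[node]
        path.append(node)
    return path


def tree_distance(tree: Tree, c1: str, c2: str) -> int:
    """Number of edges between two concepts via their lowest common ancestor."""
    steps_from_c1 = {n: i for i, n in enumerate(path_to_root(tree, c1))}
    for steps_from_c2, n in enumerate(path_to_root(tree, c2)):
        if n in steps_from_c1:
            return steps_from_c1[n] + steps_from_c2
    raise ValueError("concepts are not in the same tree")


def word_similarity(tree: Tree, c1: str, c2: str, a: float = 1.0) -> float:
    """ASSUMED form a / (dis + a): 1.0 at distance 0, decreasing with distance."""
    return a / (tree_distance(tree, c1, c2) + a)
```

Evaluating `word_similarity` for every pair of words would give the word similarity matrix of step S1.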

Preferably, the preset ontology comprises a general ontology and an industry ontology.

Preferably, the search engine uses inverted-index technology to search the target documents.

Preferably, the threshold M satisfies 0.1 < M < 1.

Preferably, the threshold M satisfies 0.2 < M < 1.

Preferably, the document data are read and the ranking result is output in descending order of similarity.

In the present invention, words are converted into concepts, and the concepts replace the words for indexing and retrieval, which avoids the problems of polysemy and synonymy. At indexing time, each word is represented by its concept and documents are indexed by these concepts; at retrieval time, the query words are likewise converted into concepts and retrieval is performed by concept, which ensures retrieval efficiency and practicality.

Brief description of the drawings

Fig. 1 is a flowchart of the semantic-based text retrieval method proposed by the present invention;

Fig. 2 is a distribution chart of word similarity.

Detailed description

With reference to Fig. 1, the semantic-based text retrieval method proposed by the present invention converts words into concepts, uses the concepts of words in place of the words for indexing and retrieval, and ranks the retrieved documents, so that polysemy and synonymy do not mislead the retrieval results.

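The retrieval method of the present invention specifically comprises the following steps:

S1, building a concept tree according to the concepts of words, and calculating a word similarity matrix;

S2, with reference to a preset ontology, extracting the concepts of the target documents, indexing the target documents by concept, and generating an index file;

S3, segmenting the user's initial query into words, finding in the word similarity matrix the similar items whose similarity to a query word is greater than a threshold M, and adding the similar items to the user query with an OR relation;

S4, searching, by a search engine, the target documents according to the user query;

S5, evaluating document similarity according to word similarity and ranking the documents;

S6, reading the document data and outputting the ranking result.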

In step S3 of the method, similar items whose concepts are the same as or similar to those of the user's initial query are added to the user query with an OR relation. A target document is added to the search results as long as it hits the user's initial query or any one of the similar items, so the search range is covered comprehensively and information is not missed.
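
The OR expansion of step S3 can be sketched as follows. This is a minimal illustration that assumes the query has already been segmented into terms and that the word similarity matrix is held as a nested dictionary; `expand_query`, `matches`, and the flat OR list are hypothetical choices, not an API defined by the method.

```python
from typing import Dict, List, Set

# Hypothetical similarity matrix: matrix[word][other_word] = similarity.
Matrix = Dict[str, Dict[str, float]]


def expand_query(terms: List[str], matrix: Matrix, m: float = 0.2) -> List[str]:
    """Append every similar item whose similarity to a query term exceeds m."""
    expanded = list(terms)
    for term in terms:
        row = matrix.get(term, {})
        # Most similar items first, so the expanded query reads in priority order.
        for word, s in sorted(row.items(), key=lambda kv: -kv[1]):
            if s > m and word not in expanded:
                expanded.append(word)
    return expanded


def matches(doc_terms: Set[str], expanded: List[str]) -> bool:
    """OR semantics: a document hits if it contains any term of the query."""
    return any(t in doc_terms for t in expanded)
```

For example, with a row `{"automobile": 0.9, "vehicle": 0.5, "banana": 0.05}` for the term `car` and M = 0.2, the expanded query is `car OR automobile OR vehicle`, and a document that mentions only `automobile` is still retrieved.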

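Step S6 comprises:

S61, determining the user's preferences by combining the user's high-frequency search terms with the user's most recent search terms;

S62, adjusting the document ranking according to the user's preferences, then reading the document data and outputting the ranking result.

In the above steps, analysing the user's preferences lets the adjusted document ranking better capture the user's needs, provides a more user-friendly service, reduces the time the user spends screening results, and improves retrieval efficiency. The document data are read and the ranking result is output in descending order of similarity.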
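
The text does not say how high-frequency and recent search terms are combined or how the ranking is adjusted. One possible reading is sketched below: a term's preference weight is its share of the search history plus a fixed bonus if it appears among the most recent searches, and each document's score is multiplied by one plus the summed weights of the terms it contains. The weighting scheme, the parameters `recent_n` and `recent_bonus`, and the function names are all assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def preference_weights(
    history: List[str], recent_n: int = 3, recent_bonus: float = 0.5
) -> Dict[str, float]:
    """history is ordered oldest to newest. ASSUMED: frequency share + recency bonus."""
    if not history:
        return {}
    counts = Counter(history)
    weights = {w: c / len(history) for w, c in counts.items()}
    for w in set(history[-recent_n:]):
        weights[w] += recent_bonus
    return weights


def rerank(
    ranked: List[Tuple[str, float]],
    doc_terms: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Boost each score by the preference weights of the terms the document contains."""
    adjusted = []
    for doc_id, score in ranked:
        boost = sum(weights.get(t, 0.0) for t in doc_terms.get(doc_id, set()))
        adjusted.append((doc_id, score * (1.0 + boost)))
    return sorted(adjusted, key=lambda item: item[1], reverse=True)
```

A multiplicative boost keeps the similarity score as the primary signal, so a document with zero similarity is never promoted by preference alone.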

In the method, the word similarity of words W1 and W2 is calculated by the formula:

In the method, the preset ontology comprises a general ontology and an industry ontology. Combining the two makes the concept conversion of the target documents more complete, and swapping in a different industry ontology improves the specificity and accuracy of the concept conversion, so that it better meets the user's needs.
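
A minimal sketch of combining a general ontology with a swappable industry ontology is given below. It assumes each ontology is a plain word-to-concept dictionary, that industry entries take precedence over general ones, and that words found in neither are kept unchanged; none of these details is stated in the text, and the sample entries are invented.

```python
from typing import Dict, List


def build_concept_lookup(general: Dict[str, str], industry: Dict[str, str]) -> Dict[str, str]:
    """Merge the two ontologies. ASSUMPTION: industry entries override general ones."""
    lookup = dict(general)
    lookup.update(industry)
    return lookup


def to_concepts(words: List[str], lookup: Dict[str, str]) -> List[str]:
    """Replace each word by its concept; unknown words are kept as-is (assumption)."""
    return [lookup.get(w, w) for w in words]


# Invented sample data: synonyms share one concept, and the industry
# ontology decides which sense of an ambiguous word applies.
GENERAL = {"automobile": "car", "car": "car", "apple": "fruit:apple"}
IT_INDUSTRY = {"apple": "company:apple"}
FOOD_INDUSTRY = {"apple": "fruit:apple"}
```

Building the lookup with `IT_INDUSTRY` maps `apple` to `company:apple`, while building it with `FOOD_INDUSTRY` maps the same word to `fruit:apple`; in both cases `automobile` and `car` collapse to the single concept `car`.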

In the method, the search engine uses inverted-index technology to search the target documents. This technology is mature among search algorithms and further ensures retrieval efficiency and practicality.
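
For readers unfamiliar with the technique, an inverted index maps each indexed unit (here, a concept) to the set of documents that contain it, so a query only touches the posting sets of its own terms. The sketch below is a generic in-memory illustration with hypothetical function names, not the index format used by any particular search engine.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_inverted_index(docs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each concept to the set of document ids in which it occurs."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, concepts in docs.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return dict(index)


def search_or(index: Dict[str, Set[str]], terms: Iterable[str]) -> Set[str]:
    """Union of the posting sets: a document hits if it contains any term."""
    hits: Set[str] = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits
```

The union in `search_or` corresponds to the OR relation with which similar items are added to the user query.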

In the method, the threshold satisfies 0.1 < M < 1. When word similarity is less than 0.1, the value is too low to be useful, yet such words make up the largest share of similar words. Discarding words whose similarity is less than 0.1 can therefore significantly improve retrieval speed.

Fig. 2 shows the distribution of the number of similarity values over different intervals in a similarity matrix calculated on the basis of HowNet (《知网》). As Fig. 2 shows, words whose similarity falls in the interval [0, 0.1] account for 70%, and words whose similarity falls in the interval [0, 0.2] account for more than 90%. With M = 0.2, the optimized data are about 8.7% of the raw data: data that originally needed 5 GB of storage need less than 450 MB, while each word still retains on average nearly 9,000 of its higher similarity values. For an ordinary word, this is enough to store all the near-synonyms that are semantically similar enough to be valuable. In a specific implementation, 0.2 < M < 1 is therefore appropriate.
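
The pruning and the storage estimate can be checked with a short sketch. It assumes the similarity matrix is a nested dictionary, that entries are kept only when their similarity is strictly greater than M, and that 1 GB is taken as 1024 MB; the function names are hypothetical.

```python
from typing import Dict

Matrix = Dict[str, Dict[str, float]]


def prune_matrix(matrix: Matrix, m: float = 0.2) -> Matrix:
    """Keep only the entries whose similarity is strictly greater than m."""
    return {
        word: {other: s for other, s in row.items() if s > m}
        for word, row in matrix.items()
    }


def pruned_size_mb(original_gb: float, kept_fraction: float) -> float:
    """Estimated storage after pruning, taking 1 GB = 1024 MB."""
    return original_gb * 1024 * kept_fraction


# 8.7% of 5 GB is roughly 445 MB, consistent with "less than 450 MB".
estimate = pruned_size_mb(5, 0.087)
```

With decimal units (1 GB = 1000 MB) the estimate is 435 MB, so the figure holds under either convention.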

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent replacement or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solution of the present invention and its inventive concept, shall fall within the protection scope of the present invention.

Claims

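1. A semantic-based text retrieval method, characterized in that words are converted into concepts, the concepts of words are used in place of the words for indexing and retrieval, and the retrieved documents are ranked, the method comprising:

S1, building a concept tree according to the concepts of words, and calculating a word similarity matrix;

S2, with reference to a preset ontology, extracting the concepts of the target documents, indexing the target documents by concept, and generating an index file;

S3, segmenting the user's initial query into words, finding in the word similarity matrix the similar items whose similarity to a query word is greater than a threshold M, and adding the similar items to the user query with an OR relation;

S4, searching, by a search engine, the target documents according to the user query;

S5, evaluating document similarity according to word similarity and ranking the documents;

S6, reading the document data and outputting the ranking result.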

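2. The semantic-based text retrieval method as claimed in claim 1, characterized in that step S6 comprises:

S61, determining the user's preferences by combining the user's high-frequency search terms with the user's most recent search terms;

S62, adjusting the document ranking according to the user's preferences, then reading the document data and outputting the ranking result.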

3. The semantic-based text retrieval method as claimed in claim 1, characterized in that the word similarity of words W1 and W2 is calculated by the formula:

4. The semantic-based text retrieval method as claimed in claim 1 or 2, characterized in that the preset ontology comprises a general ontology and an industry ontology.

5. The semantic-based text retrieval method as claimed in claim 1, characterized in that the search engine uses inverted-index technology to search the target documents.

6. The semantic-based text retrieval method as claimed in claim 1, characterized in that the threshold M satisfies 0.1 < M < 1.

7. The semantic-based text retrieval method as claimed in claim 6, characterized in that the threshold M satisfies 0.2 < M < 1.

8. The semantic-based text retrieval method as claimed in claim 1 or 2, characterized in that the document data are read and the ranking result is output in descending order of similarity.