CN1763739A

CN1763739A - Search method based on semantics in search engine

Info

Publication number: CN1763739A
Application number: CN 200410009691
Authority: CN
Inventors: 谢欣; 李晓明
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2004-10-21
Filing date: 2004-10-21
Publication date: 2006-04-26

Abstract

The invention discloses a semantics-based search method for a file search engine, characterized by: establishing a resource information base, together with the matching relationships between that base and files and between that base and the query terms users enter; if the user's query term matches the resource information base, using the matching relationship to find the corresponding files; if the match fails, searching for files directly with the query term and returning the search results.

Description

Semantics-based search method in a search engine

Technical field

The present invention relates to an information retrieval method, and in particular to a semantics-based search method in a file search engine.

Background technology

Search engines can be divided into Web search engines and file search engines. At present, the retrieval method of known file search engines, such as Peking University's Tianwang file search engine (http://bingle.pku.edu.cn), is based on string matching: the user enters a query term, and the retrieval system looks through all indexed file entries for those containing that string and returns them to the user. However, neither the recall nor the precision of this method is high. First, many files are not named in a standard way, and many resources such as software and films have titles in several languages. When the user enters a title in one language as the query term and the file is named in another language, the file entry cannot be retrieved, which lowers recall. Second, the string a user enters usually carries semantic information, and simple string matching often returns file entries the user does not want. For example, a picture file whose name contains the query term is usually not the right result for a software query. Moreover, current retrieval tools can only return the query results and cannot provide more useful descriptive information along with them.

Summary of the invention

In view of the problems and shortcomings of the retrieval methods in existing search engines described above, the purpose of the present invention is to provide a semantics-based search method in a file search engine that improves both recall and precision.

The present invention is realized as follows: a semantics-based search method in a file search engine, comprising the following steps:

1) establish a resource information base, and at the same time establish the matching relationships between this resource information base and files and between it and user query terms;

2) after the user enters a query term, first match it against the resource information base; if the match succeeds, use the matching relationship between the resource information in the base and the files to match the corresponding files, and return the search results; if the match fails, search the files directly with the query term and return the search results.
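The two steps above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the dictionaries `QUERY_TERMS` and `RESOURCE_FILES`, the list `ALL_FILES`, the function `search`, and every file name are hypothetical, and case-insensitive substring matching is assumed for the fallback.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for the resource information base.
# Step 1a: query term -> resource identifiers.
QUERY_TERMS: Dict[str, List[str]] = {
    "netants": ["netants"],
    "网络蚂蚁": ["netants"],
}
# Step 1b: resource identifier -> file entries matched to that resource.
RESOURCE_FILES: Dict[str, List[str]] = {
    "netants": ["netants.exe", "netants_setup.zip"],
}
# Every indexed file entry, used only by the fallback path.
ALL_FILES: List[str] = [
    "netants.exe",
    "netants_setup.zip",
    "netants_logo.jpg",
    "readme.txt",
]


def search(query: str) -> List[str]:
    """Step 2: consult the resource information base first, else fall back."""
    key = query.strip().lower()
    resource_ids = QUERY_TERMS.get(key)
    if resource_ids:
        # Match succeeded: return the files tied to the matched resources.
        results: List[str] = []
        for resource_id in resource_ids:
            results.extend(RESOURCE_FILES.get(resource_id, []))
        return results
    # Match failed: plain substring search over all file entries.
    return [name for name in ALL_FILES if key in name.lower()]
```

In this sketch a query for "NetAnts" returns only the two files tied to the resource and leaves out `netants_logo.jpg`, which a plain substring search would have included.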

Further, the resource information in the resource information base comprises a resource category and a resource description.

Further, the resource information base classifies resource information by resource category.

Further, the matching relationship between a resource in the resource information base and files comprises information about the files corresponding to the resource, such as file name, size, file type, and path.

Further, the matching relationship between a resource in the resource information base and user query terms comprises the query terms a user might enter when searching for that resource.

The present invention first establishes a resource information base for users. When a user enters a query term, it is first matched against the resource information base, and the matched resource information is then used to match files.

Because the invention queries files using several pieces of information about each basic resource, when the user searches with one title the system internally also searches with the resource's other information, so recall improves. The invention processes only trusted files and restricts the size and extension type of the files it matches, so precision improves correspondingly.
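The size and extension restriction mentioned above might look like the following sketch. The text does not say how the limits are chosen, so the `FileEntry` type, the `is_plausible` function, the allowed extensions, and the size bounds are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Collection


@dataclass
class FileEntry:
    name: str
    size: int  # size in bytes


def is_plausible(entry: FileEntry,
                 allowed_exts: Collection[str],
                 min_size: int,
                 max_size: int) -> bool:
    """Keep a candidate only if its extension and size fit the resource."""
    ext = os.path.splitext(entry.name)[1].lower()
    return ext in allowed_exts and min_size <= entry.size <= max_size


# Hypothetical limits for a software resource of roughly 850 KB.
SOFTWARE_EXTS = {".exe", ".zip"}
candidates = [
    FileEntry("netants.exe", 871424),
    FileEntry("netants_logo.jpg", 12000),  # wrong type for a software query
    FileEntry("netants.exe", 1024),        # far too small to be the program
]
kept = [c for c in candidates
        if is_plausible(c, SOFTWARE_EXTS, 500_000, 2_000_000)]
```

Only the first candidate survives the filter here, which is the kind of pruning that would raise precision over plain string matching.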

Embodiment

To illustrate the soundness and practicality of the present invention, file search engines are briefly analyzed below. The analysis can be made from two sides: the files offered by file servers and the query needs of users. Although the files currently offered by file servers are vast in number and varied in kind, their types are highly concentrated. For the present invention, the 13,306,765 files on 839 FTP servers indexed by the Tianwang file search engine were counted by file extension. The files turn out to be mainly of a few types, such as music, film and video, executables, and compressed archives, so the categories are clearly distinguishable. Statistics and analysis of the search engine's query terms further show that these are also the files users query most often. Statistics gathered on other sites agree with these results. This shows that although users' query terms vary enormously and the contents of file servers differ, the bulk of both supply and demand falls into a few relatively concentrated categories, specifically five: software, singers, songs, films, and games. This is a key characteristic of file search engines that distinguishes them from Web search engines. A Web search aims to obtain specific pages containing given information, whereas a user of a file search engine aims to obtain the download address of a particular logical resource.
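A count by file extension of the kind described above could be sketched as follows. The mapping `CATEGORY_BY_EXT` and the function `count_by_category` are hypothetical; the text does not list the extensions actually used in the survey.

```python
import os
from collections import Counter
from typing import Iterable

# Hypothetical extension-to-category table; the real survey's table is not given.
CATEGORY_BY_EXT = {
    ".mp3": "music",
    ".avi": "video",
    ".rm": "video",
    ".exe": "executable",
    ".zip": "archive",
    ".rar": "archive",
}


def count_by_category(file_names: Iterable[str]) -> Counter:
    """Tally file names by the category their extension maps to."""
    counts: Counter = Counter()
    for name in file_names:
        ext = os.path.splitext(name)[1].lower()
        counts[CATEGORY_BY_EXT.get(ext, "other")] += 1
    return counts


sample = ["a.mp3", "b.MP3", "c.exe", "d.txt", "e"]
sample_counts = count_by_category(sample)
```

Running the same tally over a full index would show how strongly the files concentrate in a handful of categories.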

The present invention first needs to establish a basic resource base containing descriptive information about files; this information can generally be obtained from specialized download sites on the Web using information extraction methods. Taking "NetAnts" as an example, the information in the resource base looks as follows:

[Software name] NetAnts netants;

[Chinese name] NetAnts;

[English name] netants;

[Category] /Software/network tool;

[Software size] 871424 bytes;

[Developer] http://www.netants.com/;

[Language] Simplified Chinese;

[Platform] win9x/me/nt/2000/xp;

[Detailed description] NetAnts is a tool for downloading files. This version has the Chinese edition built in. Features of NetAnts: it further extends resumable (breakpoint) transfer and can transfer from multiple points at once. New features: support for plug-in resource packs, and download speed display in the drop basket.
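One way to hold such a record in memory is sketched below. The `ResourceRecord` class and its field names are assumptions chosen for illustration; only the field values come from the example record above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceRecord:
    """Hypothetical container for one entry of the resource base."""
    names: List[str]       # all known titles, e.g. Chinese and English
    category: str          # slash-separated category path
    size_bytes: int
    developer: str
    language: str
    platforms: List[str]
    description: str = ""


netants = ResourceRecord(
    names=["NetAnts", "netants"],
    category="/Software/network tool",
    size_bytes=871424,
    developer="http://www.netants.com/",
    language="Simplified Chinese",
    platforms="win9x/me/nt/2000/xp".split("/"),
    description="NetAnts is a tool for downloading files.",
)

# The top-level category is the first non-empty component of the path.
top_level = netants.category.split("/")[1]
```

Keeping the names in a list makes it straightforward to register every title as a query term for the same resource.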

Query-term matching features are established for each resource, usually its Chinese and English names, so that a many-to-many correspondence is ultimately formed between resources and query terms. Besides different query terms mapping to the same resource, the invention also supports one query term mapping to multiple resources.
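The many-to-many correspondence can be sketched with two dictionaries of sets. The function `build_indexes`, the resource identifiers, and the sample pairs are hypothetical; lowercasing the terms is an assumption, not something the text specifies.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Set, Tuple


def build_indexes(pairs: Iterable[Tuple[str, str]]):
    """Build both directions of the query-term <-> resource relation."""
    term_to_resources: DefaultDict[str, Set[str]] = defaultdict(set)
    resource_to_terms: DefaultDict[str, Set[str]] = defaultdict(set)
    for term, resource_id in pairs:
        key = term.lower()
        term_to_resources[key].add(resource_id)
        resource_to_terms[resource_id].add(key)
    return term_to_resources, resource_to_terms


# Hypothetical (query term, resource id) pairs.
pairs = [
    ("NetAnts", "software:netants"),
    ("网络蚂蚁", "software:netants"),   # two terms, one resource
    ("Titanic", "film:titanic"),
    ("Titanic", "song:titanic"),        # one term, two resources
]
term_to_resources, resource_to_terms = build_indexes(pairs)
```

With both directions stored, a query term can be expanded to all of its resources, and each resource to all of its alternative titles.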

The resource base of the present invention also classifies the various files, by the categories above such as software, songs, films, and games, to make matching easier.

When a query is submitted, the user's query term is mapped to specific matchable resource information, and that resource information is then used to match files. File matching with resource information is done as string matching, that is, each descriptor is matched separately. The file entities that match the resource are then output together with the description of that resource. Here, a file entity is a file that can be uniquely located, or a directory made up of multiple files. It includes a file name or directory name, time, size, and path. For example, an FTP address with a definite URL is a file entity. In a file search engine, the purpose of a user's query is to obtain one or more file entities.
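Matching file entities against each descriptor of a resource, and returning the resource description with every hit, might be sketched as follows. The `FileEntity` and `Resource` types, the `match_entities` function, and the example FTP paths are hypothetical, and case-insensitive substring matching on the name is assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileEntity:
    """A uniquely locatable file or directory."""
    name: str   # file name or directory name
    path: str   # e.g. an FTP URL
    size: int   # bytes
    mtime: str  # modification time as recorded by the server


@dataclass
class Resource:
    descriptors: List[str]  # each descriptor is matched separately
    description: str


def match_entities(resource: Resource,
                   entities: List[FileEntity]) -> List[Tuple[FileEntity, str]]:
    """Return each matching entity paired with the resource description."""
    hits = []
    for entity in entities:
        lowered = entity.name.lower()
        if any(d.lower() in lowered for d in resource.descriptors):
            hits.append((entity, resource.description))
    return hits


resource = Resource(["NetAnts", "网络蚂蚁"], "A tool for downloading files.")
entities = [
    FileEntity("netants.exe", "ftp://ftp.example.org/pub/netants.exe", 871424, "2004-01-01"),
    FileEntity("网络蚂蚁", "ftp://ftp.example.org/pub/网络蚂蚁/", 0, "2004-01-02"),
    FileEntity("readme.txt", "ftp://ftp.example.org/pub/readme.txt", 512, "2004-01-03"),
]
hits = match_entities(resource, entities)
```

The second entity is a directory named in Chinese; it is found even though the user may have typed only the English title, because the resource carries both descriptors.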

If no matchable resource is matched, the invention likewise supports a conventional query by simple string matching.

Claims

1. A semantics-based search method in a file search engine, comprising the following steps:

2. The semantics-based search method in a file search engine according to claim 1, characterized in that the resource information in the resource information base comprises a resource category and a resource description.

3. The semantics-based search method in a file search engine according to claim 1, characterized in that the matching relationship between a resource in the resource information base and files comprises the file name, size, and file type of the files corresponding to the resource.

4. The semantics-based search method in a file search engine according to claim 1, characterized in that the matching relationship between a resource in the resource information base and user query terms comprises the query terms a user might enter when searching for that resource.

5. The semantics-based search method in a file search engine according to claim 2, characterized in that the resource information base classifies resource information by resource category.

6. The semantics-based search method in a file search engine according to claim 3, characterized in that the available files are specifically files whose file information is trustworthy.