CN104182464A - Semantic-based text retrieval method - Google Patents
Semantic-based text retrieval method
- Publication number
- CN104182464A (application CN201410348390.1A)
- Authority
- CN
- China
- Prior art keywords
- word
- concept
- semantic
- similarity
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a semantic-based text retrieval method that addresses the inability of existing retrieval to distinguish polysemy from synonymy, which causes wrong or missed retrieval results. In this method, the concepts of words replace the words themselves for indexing and retrieval, and the retrieved documents are ranked. The method comprises the following steps: S1, a concept tree is built from the concepts of words, and a word similarity matrix is calculated; S2, the concepts of a target document are extracted with reference to a preset ontology, the target document is indexed by those concepts, and an index file is generated; S3, the user's initial query is segmented into words, similar terms whose similarity to a query word is greater than a threshold M are found in the word similarity matrix, and these similar terms are added to the user query with an OR relation; S4, a search engine searches the target documents according to the user query; S5, the documents are evaluated for similarity and ranked according to word similarity; and S6, the document data are read and the ranking result is output.
Description
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a semantic-based text retrieval method.
Background art
Modern society has entered the information age, and the continuously growing information resources of the Internet have become an important source of information. Many technologies currently offer customized information search. Some of them place high demands on the underlying information infrastructure, take a long time to implement, and are costly to build and maintain; their main customers are very large enterprises and governments, and ordinary enterprises and individuals cannot afford them. Others support only the most basic retrieval functions, cover a small search scope, and return incomplete results. In Chinese in particular, polysemy (one word with several meanings) and synonymy (several words with one meaning) are very common; existing retrieval technology has difficulty distinguishing the two cases, which often leads to wrong or missed retrieval results.
Summary of the invention
In view of the problems described in the background, the present invention proposes a semantic-based text retrieval method. It addresses the problem that polysemy and synonymy cannot be distinguished during retrieval, which causes wrong or missed retrieval results.
The semantic-based text retrieval method proposed by the present invention converts words into concepts, uses the concepts in place of the words for indexing and retrieval, and ranks the retrieved documents. It comprises:
S1, building a concept tree from the concepts of words, and calculating a word similarity matrix;
S2, extracting the concepts of the target documents with reference to a preset ontology, indexing the target documents by concept, and generating an index file;
S3, segmenting the user's initial query into words, finding in the word similarity matrix the similar terms whose similarity to a query word is greater than a threshold M, and adding the similar terms to the user query with an OR relation;
S4, having the search engine search the target documents according to the user query;
S5, evaluating the similarity of the documents and ranking them according to word similarity;
S6, reading the document data and outputting the ranking result.
Preferably, step S6 comprises:
S61, determining the user's preferences from the user's high-frequency search terms and most recent search terms;
S62, adjusting the document ranking according to the user's preferences, then reading the document data and outputting the ranking result.
Preferably, the word similarity formula for words W1 and W2 is:
[formula not reproduced in the extracted text]
where dis(W1, W2) is the distance on the concept tree between the concepts corresponding to words W1 and W2, and a is a computational constant.
Preferably, the preset ontology comprises a general ontology and an industry ontology.
Preferably, the search engine uses inverted index technology to search the target documents.
Preferably, the threshold M satisfies 0.1 < M < 1.
Preferably, the threshold M satisfies 0.2 < M < 1.
Preferably, the document data are read and the ranking result is output in descending order of similarity.
In the present invention, words are converted into concepts, and the concepts replace the words for indexing and retrieval, which avoids the problems of polysemy and synonymy. During indexing, each word is represented by its concept and documents are indexed by these concepts; during retrieval, the query words are likewise converted into concepts and retrieval is performed by concept, which ensures search efficiency and practicality.
Brief description of the drawings
Fig. 1 is a flow chart of the semantic-based text retrieval method proposed by the present invention;
Fig. 2 is a chart of the word similarity distribution.
Detailed description
With reference to Fig. 1, the semantic-based text retrieval method proposed by the present invention converts words into concepts, uses the concepts in place of the words for indexing and retrieval, and ranks the retrieved documents, so that polysemy and synonymy do not distort the retrieval results.
The retrieval method of the present invention comprises the following steps:
S1, building a concept tree from the concepts of words, and calculating a word similarity matrix;
S2, extracting the concepts of the target documents with reference to a preset ontology, indexing the target documents by concept, and generating an index file;
S3, segmenting the user's initial query into words, finding in the word similarity matrix the similar terms whose similarity to a query word is greater than a threshold M, and adding the similar terms to the user query with an OR relation;
S4, having the search engine search the target documents according to the user query;
S5, evaluating the similarity of the documents and ranking them according to word similarity;
S6, reading the document data and outputting the ranking result.
In step S3 of this method, similar terms whose concepts are identical or similar to those of the user's initial query are added to the user query with an OR relation. A target document is added to the search results as long as it hits the user's initial query or any one of the similar terms, so the search scope is covered comprehensively and information is not missed.
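The OR-style expansion of step S3 can be sketched as follows. This is only an illustration under stated assumptions: the query is assumed to be already segmented into words, the similarity matrix is represented as a nested dictionary, and the function names `expand_query` and `matches_any` are hypothetical rather than taken from the patent.

```python
def expand_query(query_terms, similarity, threshold):
    """For each query term, collect terms whose similarity exceeds `threshold`.

    `similarity` is a nested dict standing in for the word similarity matrix:
    similarity[w1][w2] -> score. Returns one OR-group (list) per query term.
    """
    groups = []
    for term in query_terms:
        group = [term]
        for other, score in similarity.get(term, {}).items():
            if score > threshold and other not in group:
                group.append(other)
        groups.append(group)
    return groups


def matches_any(document_terms, groups):
    """True if the document hits the initial query or any similar term."""
    doc = set(document_terms)
    return any(term in doc for group in groups for term in group)


# Hypothetical toy data: "pc" is similar enough to "computer", "calculator" is not.
similarity = {"computer": {"pc": 0.8, "calculator": 0.15}}
groups = expand_query(["computer"], similarity, threshold=0.2)
print(groups)
print(matches_any(["pc", "desk"], groups))
```

A document containing only "pc" is still retrieved because the expanded query treats "computer" and "pc" as alternatives.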
Step S6 comprises:
S61, determining the user's preferences from the user's high-frequency search terms and most recent search terms;
S62, adjusting the document ranking according to the user's preferences, then reading the document data and outputting the ranking result.
In the above steps, analyzing the user's preferences lets the adjusted document ranking better capture the user's needs, provides a more user-friendly service, reduces the time the user spends screening results, and improves retrieval efficiency. The document data are read and the ranking result is output in descending order of similarity.
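The text does not say how the preference is computed from high-frequency and recent search terms, so the following is only one possible sketch. The weighting scheme (relative frequency plus a fixed bonus for recent terms), the multiplicative boost, and the names `preference_weights` and `rerank` are all assumptions made for illustration.

```python
from collections import Counter


def preference_weights(history, recent_n=3, recent_bonus=0.5):
    """Weight each past search term by its relative frequency, plus a bonus
    if it occurs among the last `recent_n` searches (assumed scheme).

    `history` is a list of past search terms, oldest first.
    """
    if not history:
        return {}
    counts = Counter(history)
    total = len(history)
    recent = set(history[-recent_n:])
    return {
        term: count / total + (recent_bonus if term in recent else 0.0)
        for term, count in counts.items()
    }


def rerank(scored_docs, doc_terms, weights):
    """Re-sort (doc_id, similarity) pairs in descending adjusted score.

    Adjusted score = similarity * (1 + sum of the preference weights of the
    distinct terms that occur in the document).
    """
    def adjusted(item):
        doc_id, sim = item
        boost = sum(weights.get(t, 0.0) for t in set(doc_terms[doc_id]))
        return sim * (1.0 + boost)

    return sorted(scored_docs, key=adjusted, reverse=True)


# Hypothetical usage: the user searches for "python" often and recently.
history = ["java", "python", "python", "rust", "python"]
weights = preference_weights(history, recent_n=2)
docs = [("d1", 0.9), ("d2", 0.8)]
terms = {"d1": ["java"], "d2": ["python"]}
print(rerank(docs, terms, weights))
```

With these toy values the document about "python" moves ahead of a document with a slightly higher raw similarity, which is the kind of adjustment step S62 describes.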
In this method, the word similarity formula for words W1 and W2 is:
[formula not reproduced in the extracted text]
where dis(W1, W2) is the distance on the concept tree between the concepts corresponding to words W1 and W2, and a is a computational constant.
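Because the formula itself is missing from this text, the sketch below only illustrates how a concept-tree distance and a constant a could be combined into a similarity. It assumes the widely used form sim = a / (dis + a), which may differ from the patent's actual formula. The toy tree, the value a = 1.6, and the names `PARENT`, `tree_distance`, and `word_similarity` are likewise hypothetical.

```python
# Toy concept tree stored as child -> parent links (hypothetical data).
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}


def path_to_root(concept):
    """Return the list of concepts from `concept` up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def tree_distance(c1, c2):
    """Number of edges between two concepts on the concept tree."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    depth_in_p2 = {c: i for i, c in enumerate(p2)}
    for i, c in enumerate(p1):
        if c in depth_in_p2:  # first common ancestor
            return i + depth_in_p2[c]
    raise ValueError("concepts are not in the same tree")


def word_similarity(c1, c2, a=1.6):
    """ASSUMED form a / (dis + a); `a` is an arbitrary illustrative constant."""
    return a / (tree_distance(c1, c2) + a)


print(tree_distance("dog", "cat"))   # siblings under "animal"
print(word_similarity("dog", "cat") > word_similarity("dog", "tree"))
```

Under this assumed form, identical concepts get similarity 1 and the similarity decreases as the two concepts move further apart on the tree.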
In this method, the preset ontology comprises a general ontology and an industry ontology. Combining the two makes the concept conversion of the target documents more complete, and swapping in a different industry ontology makes the concept conversion more targeted and accurate, so that it better meets the user's needs.
In this method, the search engine uses inverted index technology to search the target documents. This technology is quite mature among search algorithms and further ensures search efficiency and practicality.
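A minimal sketch of an inverted index keyed by concept rather than by word is given below. It assumes documents are already segmented and that a plain dictionary `word_to_concept` can stand in for the ontology lookup; mapping each word to a single concept is a simplification that ignores sense disambiguation. The function names are hypothetical.

```python
from collections import defaultdict


def build_concept_index(documents, word_to_concept):
    """Map each concept to the set of ids of documents that contain it.

    `documents` maps doc id -> list of segmented words.
    Words with no known concept are skipped.
    """
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            concept = word_to_concept.get(word)
            if concept is not None:
                index[concept].add(doc_id)
    return index


def search(index, query_words, word_to_concept):
    """Return ids of documents hitting any query concept (OR semantics)."""
    hits = set()
    for word in query_words:
        concept = word_to_concept.get(word)
        if concept is not None:
            hits |= index.get(concept, set())
    return hits


# Hypothetical toy data: two synonyms share the concept "vehicle".
word_to_concept = {"car": "vehicle", "automobile": "vehicle", "apple": "fruit"}
documents = {1: ["car", "road"], 2: ["automobile"], 3: ["apple"]}
index = build_concept_index(documents, word_to_concept)
print(search(index, ["car"], word_to_concept))
```

Because "car" and "automobile" share a concept, a query for one retrieves documents that contain only the other.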
In this method, the threshold M satisfies 0.1 < M < 1. This is because a word similarity below 0.1 is not worth using, and such entries make up the largest share of the similar words; discarding entries whose similarity is below 0.1 can significantly improve retrieval speed.
Fig. 2 shows how the similarity values in a similarity matrix computed on the basis of HowNet (《知网》) are distributed across intervals. As Fig. 2 shows, entries whose similarity falls in the interval [0, 0.1] account for 70%, and entries whose similarity falls in the interval [0, 0.2] account for more than 90%. If M = 0.2, the data remaining after this optimization is about 8.7% of the original data: data that originally required 5 GB of storage needs less than 450 MB. Even then, each word can on average still store nearly 9000 higher-similarity entries, which for an ordinary word is enough to hold all of its semantically close and valuable near-synonyms. In practice, therefore, 0.2 < M < 1 is an appropriate choice.
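The storage figures above can be checked with a short calculation, and the pruning itself is a simple filter. The sketch below assumes 1 GB = 1024 MB and a nested-dictionary representation of the similarity matrix; `prune` and `pruned_size_mb` are hypothetical helper names.

```python
def prune(similarity, threshold):
    """Keep only the entries whose similarity exceeds `threshold`."""
    return {
        word: {other: s for other, s in row.items() if s > threshold}
        for word, row in similarity.items()
    }


def pruned_size_mb(original_gb, kept_fraction, mb_per_gb=1024):
    """Storage remaining after pruning, in MB."""
    return original_gb * mb_per_gb * kept_fraction


# 8.7% of 5 GB: 5 * 1024 * 0.087 = 445.44 MB, below the 450 MB stated above.
print(pruned_size_mb(5, 0.087))

# Toy matrix: with M = 0.2 only the 0.8 entry survives.
matrix = {"computer": {"pc": 0.8, "calculator": 0.15, "river": 0.05}}
print(prune(matrix, 0.2))
```

Using 1 GB = 1000 MB instead gives 435 MB, so the "less than 450 MB" statement holds under either convention.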
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent replacement or modification that a person skilled in the art makes within the technical scope disclosed by the present invention, according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.
Claims (8)
1. A semantic-based text retrieval method, characterized in that words are converted into concepts, the concepts are used in place of the words for indexing and retrieval, and the retrieved documents are ranked, the method comprising:
S1, building a concept tree from the concepts of words, and calculating a word similarity matrix;
S2, extracting the concepts of the target documents with reference to a preset ontology, indexing the target documents by concept, and generating an index file;
S3, segmenting the user's initial query into words, finding in the word similarity matrix the similar terms whose similarity to a query word is greater than a threshold M, and adding the similar terms to the user query with an OR relation;
S4, having the search engine search the target documents according to the user query;
S5, evaluating the similarity of the documents and ranking them according to word similarity;
S6, reading the document data and outputting the ranking result.
2. The semantic-based text retrieval method as claimed in claim 1, characterized in that step S6 comprises:
S61, determining the user's preferences from the user's high-frequency search terms and most recent search terms;
S62, adjusting the document ranking according to the user's preferences, then reading the document data and outputting the ranking result.
3. The semantic-based text retrieval method as claimed in claim 1, characterized in that the word similarity formula for words W1 and W2 is:
[formula not reproduced in the extracted text]
where dis(W1, W2) is the distance on the concept tree between the concepts corresponding to words W1 and W2, and a is a computational constant.
4. The semantic-based text retrieval method as claimed in claim 1 or 2, characterized in that the preset ontology comprises a general ontology and an industry ontology.
5. The semantic-based text retrieval method as claimed in claim 1, characterized in that the search engine uses inverted index technology to search the target documents.
6. The semantic-based text retrieval method as claimed in claim 1, characterized in that the threshold M satisfies 0.1 < M < 1.
7. The semantic-based text retrieval method as claimed in claim 6, characterized in that the threshold M satisfies 0.2 < M < 1.
8. The semantic-based text retrieval method as claimed in claim 1 or 2, characterized in that the document data are read and the ranking result is output in descending order of similarity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410348390.1A CN104182464A (en) | 2014-07-21 | 2014-07-21 | Semantic-based text retrieval method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104182464A true CN104182464A (en) | 2014-12-03 |
Family
ID=51963504
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410348390.1A Pending CN104182464A (en) | 2014-07-21 | 2014-07-21 | Semantic-based text retrieval method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104182464A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5576965A (en) * | 1992-04-16 | 1996-11-19 | Hitachi, Ltd. | Method and apparatus for aiding of designing process |
US5642502A (en) * | 1994-12-06 | 1997-06-24 | University Of Central Florida | Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text |
CN101251841A (en) * | 2007-05-17 | 2008-08-27 | 华东师范大学 | Method for establishing and searching feature matrix of Web document based on semantics |
CN103389998A (en) * | 2012-05-11 | 2013-11-13 | 安徽华贞信息科技有限公司 | Novel Internet commercial intelligence information semantic analysis technology based on cloud service |
Non-Patent Citations (1)
Title |
---|
Zhang Sheng (张胜): "A semantic retrieval model based on domain ontology" (一种基于领域本体的语义检索模型), Software Guide (《软件导刊》) * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016009321A1 (en) * | 2014-07-14 | 2016-01-21 | International Business Machines Corporation | System for searching, recommending, and exploring documents through conceptual associations and inverted table for storing and querying conceptual indices |
US10496683B2 (en) | 2014-07-14 | 2019-12-03 | International Business Machines Corporation | Automatically linking text to concepts in a knowledge base |
US10496684B2 (en) | 2014-07-14 | 2019-12-03 | International Business Machines Corporation | Automatically linking text to concepts in a knowledge base |
US10503761B2 (en) | 2014-07-14 | 2019-12-10 | International Business Machines Corporation | System for searching, recommending, and exploring documents through conceptual associations |
US10503762B2 (en) | 2014-07-14 | 2019-12-10 | International Business Machines Corporation | System for searching, recommending, and exploring documents through conceptual associations |
US10572521B2 (en) | 2014-07-14 | 2020-02-25 | International Business Machines Corporation | Automatic new concept definition |
US10956461B2 (en) | 2014-07-14 | 2021-03-23 | International Business Machines Corporation | System for searching, recommending, and exploring documents through conceptual associations |
CN105069080A (en) * | 2015-07-31 | 2015-11-18 | 中国农业科学院农业信息研究所 | Document retrieval method and system |
CN105069080B (en) * | 2015-07-31 | 2018-06-29 | 中国农业科学院农业信息研究所 | A kind of document retrieval method and system |
CN107193930A (en) * | 2017-05-17 | 2017-09-22 | 东莞市华睿电子科技有限公司 | A kind of website sensitive word screen method |
CN117290460A (en) * | 2023-11-24 | 2023-12-26 | 中孚信息股份有限公司 | Method, system, device and storage medium for calculating similarity of massive texts |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20141203 |