CN104182464A - Semantic-based text retrieval method - Google Patents

Publication number
CN104182464A
Authority
CN
China
Prior art keywords
word
concept
semantic
similarity
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410348390.1A
Other languages
Chinese (zh)
Inventor
贾岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Original Assignee
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority to CN201410348390.1A
Publication of CN104182464A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a semantic-based text retrieval method that solves the problem of retrieval results being wrongly selected or missed because polysemy and synonymy cannot be distinguished during retrieval. In the method, the concepts of words replace the words themselves for indexing and retrieval, and the retrieved documents are sorted. The method comprises the following steps: S1, a concept tree is built from the concepts of words, and a word similarity matrix is calculated; S2, the concepts of a target document are extracted with reference to a predefined ontology, the target document is indexed by those concepts, and an index file is generated; S3, the user's initial query is segmented into words, similar items whose similarity to a query word is greater than a threshold M are found in the word similarity matrix, and the similar items are added to the user query with an "or" relation; S4, a search engine searches the target documents according to the user query; S5, the similarity of the documents is evaluated according to word similarity, and the documents are sorted; and S6, the document data are read, and the sorted results are output.

Description

A semantic-based text retrieval method
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a semantic-based text retrieval method.
Background
Modern society has entered the information age, and the steadily growing information resources of the Internet have become an important source of information. Many technologies currently offer customized information search. Some of them place high demands on the underlying information infrastructure, take a long time to deploy, and are expensive to build and maintain; their main customers are very large enterprises and governments, and ordinary enterprises and individuals cannot afford them. Others support only the most basic retrieval functions, cover a narrow retrieval range, and return incomplete results. In Chinese in particular, polysemy (one word with several meanings) and synonymy (several words with one meaning) are very common. Existing retrieval techniques have difficulty distinguishing these two cases, which often leads to wrongly selected or omitted retrieval results.
Summary of the invention
In view of the problems described in the background, the present invention proposes a semantic-based text retrieval method that solves the problem of retrieval results being wrongly selected or omitted because polysemy and synonymy cannot be distinguished during retrieval.
The semantic-based text retrieval method proposed by the present invention converts words into concepts, uses the concepts of words instead of the words themselves for indexing and retrieval, and sorts the retrieved documents. It comprises:
S1, building a concept tree from the concepts of words, and calculating a word similarity matrix;
S2, extracting the concepts of a target document with reference to a predefined ontology, indexing the target document by those concepts, and generating an index file;
S3, segmenting the user's initial query into words, finding in the word similarity matrix the similar items whose similarity to a query word is greater than a threshold M, and adding the similar items to the user query with an "or" relation;
S4, searching the target documents with a search engine according to the user query;
S5, evaluating the similarity of the documents according to word similarity, and sorting them;
S6, reading the document data and outputting the sorted results.
Preferably, step S6 comprises:
S61, determining the user's preferences from the user's high-frequency search terms and most recent search terms;
S62, adjusting the document order according to the user's preferences, then reading the document data and outputting the sorted results.
Preferably, the word similarity of words W1 and W2 is computed by a formula in which dis(W1, W2) is the distance on the concept tree between the concepts corresponding to words W1 and W2, and a is a calculation constant.
Preferably, the predefined ontology comprises a general ontology and an industry ontology.
Preferably, the search engine searches the target documents using inverted index technology.
Preferably, the threshold satisfies 0.1<M<1.
Preferably, the threshold satisfies 0.2<M<1.
Preferably, the document data are read and the sorted results are output in descending order of similarity.
In the present invention, words are converted into concepts, and the concepts of words are used instead of the words themselves for indexing and retrieval, which avoids the problems of polysemy and synonymy. During indexing, each word is represented by its concept and documents are indexed by these concepts; during retrieval, the query words are likewise converted into concepts and retrieval is performed by concept, which ensures search efficiency and practicality.
Brief description of the drawings
Fig. 1 is a flowchart of the semantic-based text retrieval method proposed by the present invention;
Fig. 2 is a word similarity distribution chart.
Detailed description
With reference to Fig. 1, the semantic-based text retrieval method proposed by the present invention converts words into concepts, uses the concepts of words instead of the words themselves for indexing and retrieval, and sorts the retrieved documents, so that polysemy and synonymy do not mislead the retrieval results.
The retrieval method of the present invention specifically comprises the following steps:
S1, building a concept tree from the concepts of words, and calculating a word similarity matrix;
S2, extracting the concepts of a target document with reference to a predefined ontology, indexing the target document by those concepts, and generating an index file;
S3, segmenting the user's initial query into words, finding in the word similarity matrix the similar items whose similarity to a query word is greater than a threshold M, and adding the similar items to the user query with an "or" relation;
S4, searching the target documents with a search engine according to the user query;
S5, evaluating the similarity of the documents according to word similarity, and sorting them;
S6, reading the document data and outputting the sorted results.
In step S3 of this method, similar items whose concepts are identical or similar to those of the user's initial query are added to the user query with an "or" relation. A target document is added to the search results as long as it hits the user's initial query or any one of the similar items, so the search range is covered comprehensively and no information is omitted.
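The query expansion of step S3 can be illustrated with a minimal Python sketch. The similarity values, the example words, and the function names expand_query and matches are hypothetical, and the query is assumed to be already segmented into words; the patent does not specify a data layout.

```python
from typing import Dict, List, Set

# Hypothetical word similarity matrix: word -> {other word: similarity in [0, 1]}.
SIMILARITY: Dict[str, Dict[str, float]] = {
    "car": {"automobile": 0.9, "vehicle": 0.6, "road": 0.15},
    "price": {"cost": 0.8, "value": 0.3},
}


def expand_query(terms: List[str],
                 similarity: Dict[str, Dict[str, float]],
                 threshold: float) -> List[str]:
    """Add every similar item with similarity > threshold to the query (OR relation)."""
    expanded: List[str] = []
    for term in terms:
        row = similarity.get(term, {})
        # Most similar items first; items at or below the threshold are dropped.
        ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
        candidates = [term] + [w for w, s in ranked if s > threshold]
        for word in candidates:
            if word not in expanded:
                expanded.append(word)
    return expanded


def matches(doc_terms: Set[str], expanded: List[str]) -> bool:
    """A document is a hit if it contains the initial query or any similar item."""
    return any(word in doc_terms for word in expanded)


if __name__ == "__main__":
    query = expand_query(["car", "price"], SIMILARITY, threshold=0.2)
    print(query)
    print(matches({"automobile", "repair"}, query))
```

Because the expanded terms are combined with "or", a document that mentions only "automobile" is still returned for the query "car", which is the omission the step is meant to prevent.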
Step S6 comprises:
S61, determining the user's preferences from the user's high-frequency search terms and most recent search terms;
S62, adjusting the document order according to the user's preferences, then reading the document data and outputting the sorted results.
In the above steps, analyzing the user's preferences lets the adjusted document order better reflect the user's needs, provides a more user-friendly service, reduces the time the user spends screening results, and improves retrieval efficiency. The document data are read and the sorted results are output in descending order of similarity.
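Steps S61 and S62 leave open how preferences are computed and how they adjust the order. The following Python sketch shows one possible reading, under stated assumptions: a term's preference weight is its relative frequency in the search history plus a fixed bonus if it appears among the most recent searches, and a document's similarity score is raised by a small multiple of the weights of the terms it contains. The weighting scheme, the parameters recent_n, recent_bonus and boost, and the function names are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def preference_weights(history: List[str],
                       recent_n: int = 3,
                       recent_bonus: float = 0.5) -> Dict[str, float]:
    """S61 (assumed scheme): relative frequency plus a bonus for recent terms."""
    counts = Counter(history)
    top = max(counts.values(), default=1)
    recent = set(history[-recent_n:])
    return {term: count / top + (recent_bonus if term in recent else 0.0)
            for term, count in counts.items()}


def rerank(results: List[Tuple[str, float, Set[str]]],
           prefs: Dict[str, float],
           boost: float = 0.1) -> List[str]:
    """S62 (assumed scheme): sort by similarity score plus a preference boost."""
    def adjusted(item: Tuple[str, float, Set[str]]) -> float:
        _, score, terms = item
        return score + boost * sum(prefs.get(t, 0.0) for t in terms)

    ordered = sorted(results, key=adjusted, reverse=True)
    return [doc_id for doc_id, _, _ in ordered]


if __name__ == "__main__":
    history = ["python", "java", "python", "rust", "python", "go"]
    prefs = preference_weights(history)
    results = [("d1", 0.80, {"java"}), ("d2", 0.75, {"python"})]
    print(rerank(results, prefs))
```

In this example d2 has the lower similarity score but contains the user's most frequent and recent term, so it moves ahead of d1 after adjustment.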
In this method, the word similarity of words W1 and W2 is computed by a formula in which dis(W1, W2) is the distance on the concept tree between the concepts corresponding to words W1 and W2, and a is a calculation constant.
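The formula itself does not survive in this text, so the Python sketch below rests on an assumption: it uses the form sim = a / (dis + a), which is a common choice in concept-tree similarity measures but is not confirmed by the source. The toy concept tree, the default a = 2.0, and the function names are likewise hypothetical. The sketch works directly on concepts; mapping each word to its concept is assumed to have been done beforehand.

```python
from typing import Dict, List, Optional

# Hypothetical concept tree stored as child -> parent (root has parent None).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}


def ancestors(concept: str) -> List[str]:
    """Path from a concept up to the root, starting with the concept itself."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def tree_distance(c1: str, c2: str) -> int:
    """Number of edges between two concepts via their lowest common ancestor."""
    steps_from_c1 = {node: i for i, node in enumerate(ancestors(c1))}
    for j, node in enumerate(ancestors(c2)):
        if node in steps_from_c1:
            return steps_from_c1[node] + j
    raise ValueError("concepts do not share a root")


def similarity(c1: str, c2: str, a: float = 2.0) -> float:
    """ASSUMED formula a / (dis + a); the source does not reproduce its formula."""
    return a / (tree_distance(c1, c2) + a)


if __name__ == "__main__":
    print(tree_distance("dog", "cat"), similarity("dog", "cat"))
    print(tree_distance("dog", "tree"), similarity("dog", "tree"))
```

Under this assumed form, identical concepts have similarity 1, and the similarity falls toward 0 as the tree distance grows; the constant a sets the distance at which the similarity is one half.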
In this method, the predefined ontology comprises a general ontology and an industry ontology. Combining the two makes the concept conversion of the target document more complete, and swapping in the appropriate industry ontology makes the concept conversion more targeted and accurate, so that it better meets the user's needs.
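One way to picture the combination of a general ontology and an industry ontology is a lookup in which industry entries take precedence over general ones. The Python sketch below assumes each ontology is a plain word-to-concept dictionary and that unknown words fall back to themselves; the vocabularies and the function name to_concepts are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical general ontology: word -> concept.
GENERAL: Dict[str, str] = {
    "apple": "fruit",
    "mouse": "rodent",
    "laptop": "computer",
    "notebook": "stationery",
}

# Hypothetical industry ontology for an IT corpus; its entries override GENERAL.
INDUSTRY_IT: Dict[str, str] = {
    "apple": "computer_vendor",
    "mouse": "pointing_device",
    "notebook": "computer",
}


def to_concepts(words: List[str],
                industry: Optional[Dict[str, str]] = None) -> List[str]:
    """Map words to concepts; industry entries take precedence over general ones."""
    merged = {**GENERAL, **(industry or {})}
    # Assumption: a word missing from both ontologies is kept as its own concept.
    return [merged.get(word, word) for word in words]


if __name__ == "__main__":
    words = ["mouse", "laptop", "notebook", "cable"]
    print(to_concepts(words))
    print(to_concepts(words, INDUSTRY_IT))
```

With the IT ontology swapped in, "notebook" and "laptop" resolve to the same concept (synonymy) and "mouse" no longer resolves to the animal sense (polysemy).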
In this method, the search engine searches the target documents using inverted index technology. This technology is mature among search algorithms and further ensures search efficiency and practicality.
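A minimal inverted index over concepts can be sketched in Python as follows. It assumes each document has already been reduced to a list of concepts (step S2) and that the query terms are combined with "or" (step S3); the document identifiers, concepts, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_index(docs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each concept to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, concepts in docs.items():
        for concept in concepts:
            index[concept].add(doc_id)
    return dict(index)


def search_any(index: Dict[str, Set[str]],
               query_concepts: Iterable[str]) -> Set[str]:
    """Return documents containing at least one query concept (OR semantics)."""
    hits: Set[str] = set()
    for concept in query_concepts:
        hits |= index.get(concept, set())
    return hits


if __name__ == "__main__":
    docs = {
        "d1": ["computer", "price"],
        "d2": ["fruit", "price"],
        "d3": ["computer"],
    }
    index = build_index(docs)
    print(sorted(search_any(index, ["computer"])))
```

Each lookup touches only the posting sets of the query concepts rather than every document, which is why the structure scales well for retrieval.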
In this method, the threshold satisfies 0.1<M<1. Word pairs with a similarity below 0.1 are not worth using, yet they make up the largest share of similar words; discarding them can significantly improve retrieval speed.
Fig. 2 shows how the similarity values in a similarity matrix computed from HowNet are distributed over different intervals. As Fig. 2 shows, similarities in the interval [0, 0.1] account for 70% of the total, and similarities in the interval [0, 0.2] account for more than 90%. With M=0.2, the pruned data are approximately 8.7% of the raw data: data that originally needed 5 GB of storage need less than 450 MB. Each word can still store on average nearly 9000 higher similarity values, which for an ordinary word is enough to store all of its semantically similar and useful near-synonyms. In a specific implementation, 0.2<M<1 is therefore appropriate.
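The pruning and the storage figure can be checked with a short Python sketch. The toy matrix and the function names are hypothetical, and the size estimate assumes 1 GB = 1024 MB and that storage scales linearly with the number of retained entries.

```python
from typing import Dict


def prune(matrix: Dict[str, Dict[str, float]],
          m: float) -> Dict[str, Dict[str, float]]:
    """Keep only entries whose similarity is strictly greater than the threshold m."""
    return {word: {other: s for other, s in row.items() if s > m}
            for word, row in matrix.items()}


def pruned_size_mb(raw_gb: float, kept_fraction: float) -> float:
    """Estimated size after pruning, assuming size is proportional to entry count."""
    return raw_gb * 1024 * kept_fraction


if __name__ == "__main__":
    matrix = {"car": {"automobile": 0.9, "road": 0.15, "sky": 0.05}}
    print(prune(matrix, 0.2))
    # 8.7% of 5 GB, the figures quoted in the text for M = 0.2.
    print(pruned_size_mb(5, 0.087))
```

The estimate comes to about 445 MB, which agrees with the statement that less than 450 MB is needed.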
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent replacement or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (8)

1. A semantic-based text retrieval method, characterized in that words are converted into concepts, the concepts of words are used instead of the words themselves for indexing and retrieval, and the retrieved documents are sorted, the method comprising:
S1, building a concept tree from the concepts of words, and calculating a word similarity matrix;
S2, extracting the concepts of a target document with reference to a predefined ontology, indexing the target document by those concepts, and generating an index file;
S3, segmenting the user's initial query into words, finding in the word similarity matrix the similar items whose similarity to a query word is greater than a threshold M, and adding the similar items to the user query with an "or" relation;
S4, searching the target documents with a search engine according to the user query;
S5, evaluating the similarity of the documents according to word similarity, and sorting them;
S6, reading the document data and outputting the sorted results.
2. The semantic-based text retrieval method according to claim 1, characterized in that step S6 comprises:
S61, determining the user's preferences from the user's high-frequency search terms and most recent search terms;
S62, adjusting the document order according to the user's preferences, then reading the document data and outputting the sorted results.
3. The semantic-based text retrieval method according to claim 1, characterized in that the word similarity of words W1 and W2 is computed by a formula in which dis(W1, W2) is the distance on the concept tree between the concepts corresponding to words W1 and W2, and a is a calculation constant.
4. The semantic-based text retrieval method according to claim 1 or 2, characterized in that the predefined ontology comprises a general ontology and an industry ontology.
5. The semantic-based text retrieval method according to claim 1, characterized in that the search engine searches the target documents using inverted index technology.
6. The semantic-based text retrieval method according to claim 1, characterized in that the threshold satisfies 0.1<M<1.
7. The semantic-based text retrieval method according to claim 6, characterized in that the threshold satisfies 0.2<M<1.
8. The semantic-based text retrieval method according to claim 1 or 2, characterized in that the document data are read and the sorted results are output in descending order of similarity.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410348390.1A CN104182464A (en) 2014-07-21 2014-07-21 Semantic-based text retrieval method

Publications (1)

Publication Number Publication Date
CN104182464A true CN104182464A (en) 2014-12-03

Family

ID=51963504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410348390.1A Pending CN104182464A (en) 2014-07-21 2014-07-21 Semantic-based text retrieval method

Country Status (1)

Country Link
CN (1) CN104182464A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5576965A (en) * 1992-04-16 1996-11-19 Hitachi, Ltd. Method and apparatus for aiding of designing process
US5642502A (en) * 1994-12-06 1997-06-24 University Of Central Florida Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text
CN101251841A (en) * 2007-05-17 2008-08-27 华东师范大学 Method for establishing and searching feature matrix of Web document based on semantics
CN103389998A (en) * 2012-05-11 2013-11-13 安徽华贞信息科技有限公司 Novel Internet commercial intelligence information semantic analysis technology based on cloud service

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张胜 (Zhang Sheng): "一种基于领域本体的语义检索模型" ("A Semantic Retrieval Model Based on Domain Ontology"), 《软件导刊》 (Software Guide) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016009321A1 (en) * 2014-07-14 2016-01-21 International Business Machines Corporation System for searching, recommending, and exploring documents through conceptual associations and inverted table for storing and querying conceptual indices
US10496683B2 (en) 2014-07-14 2019-12-03 International Business Machines Corporation Automatically linking text to concepts in a knowledge base
US10496684B2 (en) 2014-07-14 2019-12-03 International Business Machines Corporation Automatically linking text to concepts in a knowledge base
US10503761B2 (en) 2014-07-14 2019-12-10 International Business Machines Corporation System for searching, recommending, and exploring documents through conceptual associations
US10503762B2 (en) 2014-07-14 2019-12-10 International Business Machines Corporation System for searching, recommending, and exploring documents through conceptual associations
US10572521B2 (en) 2014-07-14 2020-02-25 International Business Machines Corporation Automatic new concept definition
US10956461B2 (en) 2014-07-14 2021-03-23 International Business Machines Corporation System for searching, recommending, and exploring documents through conceptual associations
CN105069080A (en) * 2015-07-31 2015-11-18 中国农业科学院农业信息研究所 Document retrieval method and system
CN105069080B (en) * 2015-07-31 2018-06-29 中国农业科学院农业信息研究所 A kind of document retrieval method and system
CN107193930A (en) * 2017-05-17 2017-09-22 东莞市华睿电子科技有限公司 A kind of website sensitive word screen method
CN117290460A (en) * 2023-11-24 2023-12-26 中孚信息股份有限公司 Method, system, device and storage medium for calculating similarity of massive texts

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20141203