CN104281693A - Semantic search method and semantic search system - Google Patents

Semantic search method and semantic search system

Info

Publication number
CN104281693A
Authority
CN
China
Prior art keywords
semantic
industry
concept
ontology
storehouse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410537867.0A
Other languages
Chinese (zh)
Inventor
贾岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Original Assignee
ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ANHUI HUAZHEN INFORMATION SCIENCE & TECHNOLOGY Co Ltd
Priority to CN201410537867.0A
Publication of CN104281693A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic search method and a semantic search system. The semantic search method includes the steps of: establishing a semantic ontology library; analyzing sentences according to the semantic ontology library to obtain the nominal concepts, action concepts, and tendency of each sentence and thus its semantic description, compiling statistics on the main semantic references of the analyzed paragraphs, using the structure of the document text to summarize basic semantic information such as the main objects described and the semantic tendency of the text, and storing this information in association with the document; and probing for and capturing industry-related data according to the semantic ontology library. The method has the advantages that websites with highly similar content can be discovered automatically by means of network probe technology, and that the degree of repetition of an article can be judged accurately by extracting web page text and encoding each paragraph.

Description

A semantic search method and system
Technical field
The present invention relates to the field of grid computing technology, and in particular to a semantic search method and system.
Background art
Information on the Internet is currently reposted at a very high rate. In addition, search engines such as Baidu and *** favor recall in their results. Together these make the results of general-purpose search highly repetitive, which makes it hard for enterprises to find valuable content quickly.
Summary of the invention
To solve the technical problem described in the background art, the present invention proposes a semantic search method and system. Network probe technology is used to automatically discover websites with highly similar content, and the text of each web page is extracted and encoded paragraph by paragraph so that the degree of repetition of an article can be judged accurately.
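The paragraph-encoding idea can be illustrated with a minimal sketch. The patent does not say how paragraphs are encoded, so the use of SHA-256 over whitespace-normalized text, the blank-line paragraph split, and the function names `paragraph_codes` and `repetition_rate` are all assumptions made for illustration.

```python
import hashlib
import re


def paragraph_codes(text):
    """Return one hash code per non-empty paragraph (paragraphs split on blank lines)."""
    codes = []
    for paragraph in re.split(r"\n\s*\n", text):
        # Normalize case and whitespace so trivial reformatting does not change the code.
        normalized = re.sub(r"\s+", " ", paragraph).strip().lower()
        if normalized:
            codes.append(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
    return codes


def repetition_rate(candidate, known):
    """Fraction of the candidate's paragraphs whose code also occurs in the known article."""
    candidate_codes = paragraph_codes(candidate)
    known_codes = set(paragraph_codes(known))
    if not candidate_codes:
        return 0.0
    return sum(code in known_codes for code in candidate_codes) / len(candidate_codes)
```

Exact hashing only catches paragraphs that are identical after normalization; detecting lightly edited reposts would need a similarity-preserving code, which this sketch does not attempt.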
The semantic search method proposed by the present invention comprises the following steps:
establishing an ontology library;
analyzing sentences according to the ontology library to obtain the nominal concepts, action concepts, and tendency of each sentence and thus its semantic description; statistically analyzing the main semantic references of each paragraph; then using the structure of the document text to summarize basic semantic information such as the main objects the text describes and its semantic tendency; and storing this information in association with the document;
detecting and crawling industry-related data according to the ontology library.
Preferably, the ontology library comprises the industry concept system, the semantic relations between concepts, and the relations between words and concepts.
Preferably, the ontology library comprises an industry-independent internal ontology library and an industry-specific industry ontology library.
Preferably, detecting and crawling industry-related data according to the ontology library specifically comprises: using a network industry-information probe together with the ontology library to find candidate websites by means such as URL links and search engines used as springboards; then verifying whether a website, sub-site, or subdirectory contains company-related information and how dense the relevant content is; and mining the deep web through the website topology, URL links, and form structures to find potential data sources.
Preferably, using the network industry-information probe specifically comprises: continuously probing the pages of a site, filling in forms automatically and testing the returned data to find the most suitable form, and, once that form is found, submitting it automatically and comparing the pages obtained.
The semantic search system proposed by the present invention comprises:
an establishing module, configured to establish an ontology library;
an analysis module, connected to the establishing module and configured to analyze sentences according to the ontology library to obtain the nominal concepts, action concepts, and tendency of each sentence and thus its semantic description, to statistically analyze the main semantic references of each paragraph, then to use the structure of the document text to summarize basic semantic information such as the main objects the text describes and its semantic tendency, and to store this information in association with the document;
a detection and crawling module, connected to the analysis module and configured to detect and crawl industry-related data according to the ontology library.
Preferably, the ontology library comprises the industry concept system, the semantic relations between concepts, and the relations between words and concepts.
Preferably, the ontology library comprises an industry-independent internal ontology library and an industry-specific industry ontology library.
Preferably, detecting and crawling industry-related data according to the ontology library specifically comprises: using a network industry-information probe together with the ontology library to find candidate websites by means such as URL links and search engines used as springboards; then verifying whether a website, sub-site, or subdirectory contains company-related information and how dense the relevant content is; and mining the deep web through the website topology, URL links, and form structures to find potential data sources.
Preferably, using the network industry-information probe specifically comprises: continuously probing the pages of a site, filling in forms automatically and testing the returned data to find the most suitable form, and, once that form is found, submitting it automatically and comparing the pages obtained.
In the present invention, sentences are analyzed to obtain the nominal concepts, action concepts, and tendency of each sentence and its semantic description; the main semantic references of each paragraph are then analyzed statistically; and the structure of the document text is used to summarize basic semantic information such as the main objects the text describes and its semantic tendency, which is stored in association with the document to support semantic search and intelligence analysis. Moreover, enterprise search needs generally focus on highly targeted information within a single industry. Taking advantage of this, and drawing on the already abundant resources of the Internet, the ontology required by the semantic search model proposed here can be built quickly. The rich semantic information in the ontology then makes possible a practical, semantic-level search engine customized for an industry.
Brief description of the drawings
Fig. 1 is a flow chart of a semantic search method proposed by an embodiment of the present invention;
Fig. 2 is a structural diagram of a semantic search system proposed by an embodiment of the present invention.
Detailed description of the embodiments
As shown in Fig. 1, an embodiment of the present invention proposes a semantic search method comprising the following steps:
Step 101: establish the ontology library. The main points the ontology library describes include the industry concept system, the semantic relations between concepts, and the relations between words and concepts. Building the ontology library requires using data mining and Internet resources to cross-check the actual concept system, semantic relations, and so on; a visual tool for manual curation is also provided, which greatly reduces the construction cost. The ontology library consists mainly of two sets. One is the industry-independent internal ontology library, which describes general, industry-independent vocabulary and language concepts; users can update it through the system's automatic update. The other is the industry ontology library, which mainly describes industry concepts and the relations between them.
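One way to picture the two ontology libraries is as small in-memory structures holding a concept hierarchy, relations between concepts, and a word-to-concept mapping. The patent does not define a storage format, so the class `OntologyLibrary`, its field names, and the lookup order in `resolve` are hypothetical choices made only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyLibrary:
    """Hypothetical container for one ontology library (internal or industry)."""
    name: str
    parent_of: dict = field(default_factory=dict)        # concept -> broader concept (or None)
    relations: set = field(default_factory=set)          # (concept, relation, concept) triples
    concept_of_word: dict = field(default_factory=dict)  # lower-cased word -> concept

    def add_concept(self, concept, parent=None):
        self.parent_of[concept] = parent

    def relate(self, source, relation, target):
        self.relations.add((source, relation, target))

    def add_word(self, word, concept):
        self.concept_of_word[word.lower()] = concept

    def ancestors(self, concept):
        """Walk up the concept system from a concept to its root."""
        chain = []
        parent = self.parent_of.get(concept)
        while parent is not None:
            chain.append(parent)
            parent = self.parent_of.get(parent)
        return chain


def resolve(word, *libraries):
    """Map a word to (library name, concept), trying the libraries in the order given."""
    for library in libraries:
        concept = library.concept_of_word.get(word.lower())
        if concept is not None:
            return library.name, concept
    return None
```

In this sketch a caller would pass the industry library before the internal one, so that an industry-specific sense of a word takes precedence over its general sense; that ordering is a design assumption, not something the description states.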
Step 102: analyze sentences according to the ontology library to obtain the nominal concepts, action concepts, and tendency of each sentence and thus its semantic description; statistically analyze the main semantic references of each paragraph; then use the structure of the document text to summarize basic semantic information such as the main objects the text describes and its semantic tendency; and store this information in association with the document to support semantic search and intelligence analysis.
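The sentence-to-paragraph-to-document aggregation in this step can be sketched as follows. The three small lexicons (`NOMINAL`, `ACTION`, `TENDENCY`), the whitespace tokenization, and the scoring of tendency as a sum of +1/-1 values are assumptions for illustration; a real system would draw these from the ontology library and a proper parser.

```python
from collections import Counter

# Hypothetical lexicons standing in for lookups in the ontology library.
NOMINAL = {"engine": "Engine", "motor": "Engine", "price": "Price"}
ACTION = {"rises": "Increase", "falls": "Decrease"}
TENDENCY = {"excellent": 1, "poor": -1}


def analyze_sentence(sentence):
    """Return the nominal concepts, action concepts, and tendency score of one sentence."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return {
        "nominal": [NOMINAL[w] for w in words if w in NOMINAL],
        "action": [ACTION[w] for w in words if w in ACTION],
        "tendency": sum(TENDENCY.get(w, 0) for w in words),
    }


def summarize_document(paragraphs):
    """Aggregate sentence-level descriptions into document-level semantic information."""
    references = Counter()
    tendency = 0
    for paragraph in paragraphs:
        for sentence in paragraph.split("."):
            if sentence.strip():
                description = analyze_sentence(sentence)
                references.update(description["nominal"])
                tendency += description["tendency"]
    main_object = references.most_common(1)[0][0] if references else None
    return {"main_object": main_object, "tendency": tendency, "references": dict(references)}
```

The returned dictionary is the kind of record that could be stored alongside the document; weighting sentences by their position in the document structure, which the step mentions, is left out of this sketch.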
Step 103: detect and crawl industry-related data according to the ontology library. A network industry-information (deep web) probe is used together with the ontology library to find candidate websites by means such as URL links and search engines used as springboards. The probe then verifies whether a website, sub-site, or subdirectory contains company-related information and how dense the relevant content is, and mines the deep web through the website topology, URL links, form structures, and the like to find potential data sources. Much deep web content is well-structured data that is easy to analyze, yet it often cannot be found through general-purpose search engines, so it is of great value to customers. Without reducing the amount of industry data collected, this strategy greatly reduces bandwidth use and the volume of data retrieved, shortens the data loading cycle, and improves timeliness.
The network industry-information probe continuously probes the pages of a site, fills in forms automatically, and tests the returned data to find the most suitable form; once that form is found, it submits the form automatically and compares the pages obtained.
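The form-probing loop can be sketched without any network access by injecting the submission step. Here `submit` is a hypothetical callable standing in for a real HTTP form submission, and "most suitable" is assumed to mean "returns the most records"; the description does not define the criterion.

```python
def find_best_form(candidate_forms, submit):
    """Try every candidate form filling and keep the one that returns the most records.

    `submit` is a hypothetical callable: it takes a dict of field values and
    returns a list of result records.
    """
    best_form, best_records = None, []
    for form in candidate_forms:
        records = submit(form)
        if len(records) > len(best_records):
            best_form, best_records = form, records
    return best_form, best_records


def fake_submit(fields):
    """Simulated site for illustration: only the 'keyword' field returns data."""
    database = {"steel": ["page-1", "page-2"], "coal": ["page-3"]}
    return database.get(fields.get("keyword", ""), [])
```

Candidate values for the form fields could be drawn from words in the industry ontology library, which is one plausible reading of how the probe uses it.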
Here, the Deep Web refers to the collection of resources that are stored in web databases and that cannot be reached through hyperlinks but must be accessed through dynamic web page technology. Web page parsing means parsing the HTML page by analyzing its tags and extracting the body content: using the HTML specification and vision-based segmentation techniques, the meta-information of the page (such as the title and keywords) and its body text are extracted, which effectively avoids interference from irrelevant information.
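A tag-based version of this extraction can be sketched with Python's standard `html.parser` module. This sketch covers only the tag-analysis part; it does not implement vision-based segmentation, and the choice of which tags to skip (`script`, `style`, `nav`, `footer`) is an assumption about what counts as irrelevant.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the title, meta keywords, and visible body text of an HTML page."""

    SKIP = {"script", "style", "nav", "footer"}  # assumed sources of irrelevant text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.body_parts = []
        self._in_title = False
        self._in_body = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._in_body and self._skip_depth == 0:
            self.body_parts.append(text)


def extract(html):
    """Return (title, keywords, body text) for an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, parser.keywords, " ".join(parser.body_parts)
```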
In tests of the present invention, the result pages returned by the Deep Web resources of a single website were found to differ very little in structure. Exploiting this, the DOM trees of the pages are obtained and compared, and the nodes whose content differs between them are extracted; these nodes hold the data to be collected. After the correct data are extracted, the administrator is notified to configure the data format, which completes the discovery and collection of the Deep Web site.
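The DOM comparison can be sketched by mapping every text node to a path in the tree and diffing two result pages from the same site. The path notation (`tag[index]` segments joined by `/`), the short list of void tags, and the assumption that the pages are well-formed are choices made for this sketch, not details from the description.

```python
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Map each DOM path such as 'html[0]/body[0]/p[1]' to the text found at that node."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.stack = []     # path segments of the currently open elements
        self.counts = [{}]  # per-level counters used to index sibling tags
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        siblings = self.counts[-1]
        index = siblings.get(tag, 0)
        siblings[tag] = index + 1
        self.stack.append(f"{tag}[{index}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Assumes well-formed HTML: only the innermost open element is closed.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def path_texts(html):
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def changed_nodes(page_a, page_b):
    """Return {path: (text in page_a, text in page_b)} for nodes whose content differs."""
    a, b = path_texts(page_a), path_texts(page_b)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }
```

Nodes that stay the same across pages (headings, labels, navigation) drop out of the result, leaving the paths where the site places its record data.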
In the present invention, semantic analysis technology is used to analyze every sentence of a text, marking its verbal semantic points, nominal semantic points, and semantic tendency. These are then aggregated into the semantic focus of each paragraph and of the whole text. Finally, using the semantic focus together with the features of the text, and with a word count (for example 400 words) as the constraint, several "sentence groups" that cover the semantics of the full text as completely as possible are selected to form the full-text summary. The document summary shown in search results differs in implementation only in adding a further constraint: the density of the search terms (including conceptually close words).
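Selecting sentence groups under a word limit can be approximated with a greedy coverage heuristic. The description does not name a selection algorithm, so the greedy strategy, the `focus_weights` dictionary, and counting words by whitespace (Chinese text would more likely be counted by characters) are all assumptions in this sketch.

```python
def build_summary(sentences, focus_weights, max_words=400):
    """Greedily pick sentences that add the most uncovered semantic focus within a word limit.

    `sentences` is a list of (text, set of concepts) pairs; `focus_weights` maps each
    concept to its weight in the semantic focus of the text.
    """
    covered = set()
    chosen = []
    used_words = 0
    remaining = list(range(len(sentences)))

    def gain(index):
        # Weight of the concepts this sentence would newly cover.
        return sum(focus_weights.get(c, 0) for c in sentences[index][1] - covered)

    while remaining:
        best = max(remaining, key=gain)
        if gain(best) <= 0:
            break  # nothing left adds new focus
        remaining.remove(best)
        length = len(sentences[best][0].split())
        if used_words + length <= max_words:
            chosen.append(best)
            covered |= sentences[best][1]
            used_words += length

    # Emit the chosen sentences in their original document order.
    return [sentences[i][0] for i in sorted(chosen)]
```

For a search-result summary, one could add a bonus to `gain` for sentences containing the search terms or conceptually close words, reflecting the extra density constraint mentioned above.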
Concept-based indexing of documents builds on the document semantic representation technique described above. The document is first given a semantic description (in the ontology space); the concepts, together with additional semantic information such as concept weights, are then treated as index objects and stored in inverted index files. Concept-based rewriting of search terms means that the user's search terms are likewise mapped into the semantic space defined by the ontology. In this system, semantic search technology also provides the underlying support for some other modules (such as key information recommendation and information roaming). In implementation, the user's frequent and recent search terms are ranked, and the degree to which recently collected data match them is checked to estimate how interested the user is in those data; this serves as an important reference for recommending information and for ordering browsing.
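A concept-level inverted index with query rewriting can be sketched in a few lines. The word-to-concept table `WORD_TO_CONCEPT`, the use of relative concept frequency as the weight, and the additive scoring in `search` are illustrative assumptions rather than the weighting the patent uses.

```python
from collections import defaultdict

# Hypothetical word-to-concept mapping standing in for the ontology library.
WORD_TO_CONCEPT = {
    "car": "Vehicle", "automobile": "Vehicle",
    "engine": "Engine", "motor": "Engine",
}


def to_concepts(text):
    """Rewrite free text as the list of ontology concepts it mentions."""
    return [WORD_TO_CONCEPT[w] for w in text.lower().split() if w in WORD_TO_CONCEPT]


def build_index(documents):
    """Build an inverted index: concept -> {document id: concept weight}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        concepts = to_concepts(text)
        for concept in set(concepts):
            # Weight = share of the document's concept mentions taken by this concept.
            index[concept][doc_id] = concepts.count(concept) / len(concepts)
    return index


def search(index, query):
    """Map the query into concept space, then rank documents by summed concept weight."""
    scores = defaultdict(float)
    for concept in set(to_concepts(query)):
        for doc_id, weight in index.get(concept, {}).items():
            scores[doc_id] += weight
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because both documents and queries are mapped to concepts before matching, a query for "motor" can retrieve a document that only says "engine", which is the practical effect of indexing in the ontology space.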
As shown in Fig. 2, an embodiment of the present invention proposes a semantic search system comprising: an establishing module 10, configured to establish an ontology library; an analysis module 20, connected to the establishing module 10 and configured to analyze sentences according to the ontology library to obtain the nominal concepts, action concepts, and tendency of each sentence and thus its semantic description, to statistically analyze the main semantic references of each paragraph, then to use the structure of the document text to summarize basic semantic information such as the main objects the text describes and its semantic tendency, and to store this information in association with the document; and a detection and crawling module 30, connected to the analysis module 20 and configured to detect and crawl industry-related data according to the ontology library.
The ontology library comprises the industry concept system, the semantic relations between concepts, and the relations between words and concepts.
The ontology library comprises an industry-independent internal ontology library and an industry-specific industry ontology library.

Claims (10)

1. A semantic search method, characterized in that it comprises the following steps:
establishing an ontology library;
analyzing sentences according to the ontology library to obtain the nominal concepts, action concepts, and tendency of each sentence and thus its semantic description; statistically analyzing the main semantic references of each paragraph; then using the structure of the document text to summarize basic semantic information such as the main objects the text describes and its semantic tendency; and storing this information in association with the document;
detecting and crawling industry-related data according to the ontology library.
2. The semantic search method according to claim 1, characterized in that the ontology library comprises the industry concept system, the semantic relations between concepts, and the relations between words and concepts.
3. The semantic search method according to claim 1, characterized in that the ontology library comprises an industry-independent internal ontology library and an industry-specific industry ontology library.
4. The semantic search method according to claim 1, characterized in that detecting and crawling industry-related data according to the ontology library specifically comprises: using a network industry-information probe together with the ontology library to find candidate websites by means such as URL links and search engines used as springboards; then verifying whether a website, sub-site, or subdirectory contains company-related information and how dense the relevant content is; and mining the deep web through the website topology, URL links, and form structures to find potential data sources.
5. The semantic search method according to claim 4, characterized in that using the network industry-information probe specifically comprises: continuously probing the pages of a site, filling in forms automatically and testing the returned data to find the most suitable form, and, once that form is found, submitting it automatically and comparing the pages obtained.
6. A semantic search system, characterized in that it comprises:
an establishing module, configured to establish an ontology library;
an analysis module, connected to the establishing module and configured to analyze sentences according to the ontology library to obtain the nominal concepts, action concepts, and tendency of each sentence and thus its semantic description, to statistically analyze the main semantic references of each paragraph, then to use the structure of the document text to summarize basic semantic information such as the main objects the text describes and its semantic tendency, and to store this information in association with the document;
a detection and crawling module, connected to the analysis module and configured to detect and crawl industry-related data according to the ontology library.
7. The semantic search system according to claim 6, characterized in that the ontology library comprises the industry concept system, the semantic relations between concepts, and the relations between words and concepts.
8. The semantic search system according to claim 6, characterized in that the ontology library comprises an industry-independent internal ontology library and an industry-specific industry ontology library.
9. The semantic search system according to claim 6, characterized in that detecting and crawling industry-related data according to the ontology library specifically comprises: using a network industry-information probe together with the ontology library to find candidate websites by means such as URL links and search engines used as springboards; then verifying whether a website, sub-site, or subdirectory contains company-related information and how dense the relevant content is; and mining the deep web through the website topology, URL links, and form structures to find potential data sources.
10. The semantic search system according to claim 9, characterized in that using the network industry-information probe specifically comprises: continuously probing the pages of a site, filling in forms automatically and testing the returned data to find the most suitable form, and, once that form is found, submitting it automatically and comparing the pages obtained.
CN201410537867.0A 2014-10-13 2014-10-13 Semantic search method and semantic search system Pending CN104281693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410537867.0A CN104281693A (en) 2014-10-13 2014-10-13 Semantic search method and semantic search system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410537867.0A CN104281693A (en) 2014-10-13 2014-10-13 Semantic search method and semantic search system

Publications (1)

Publication Number Publication Date
CN104281693A true CN104281693A (en) 2015-01-14

Family

ID=52256566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410537867.0A Pending CN104281693A (en) 2014-10-13 2014-10-13 Semantic search method and semantic search system

Country Status (1)

Country Link
CN (1) CN104281693A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843791A (en) * 2015-01-15 2016-08-10 克拉玛依红有软件有限责任公司 Semantic network model establishing method based on 6W semantic identification
CN106021339A (en) * 2016-05-09 2016-10-12 中国联合网络通信集团有限公司 A semantic query method and system for a resource tree
US10678820B2 (en) 2018-04-12 2020-06-09 Abel BROWARNIK System and method for computerized semantic indexing and searching

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5781879A (en) * 1996-01-26 1998-07-14 Qpl Llc Semantic analysis and modification methodology
CN101004760A (en) * 2007-01-10 2007-07-25 苏州大学 Method for extracting page query interface based on character of vision
CN101639840A (en) * 2008-07-29 2010-02-03 华天清 Method and device for identifying semantic structure of network information
CN101655862A (en) * 2009-08-11 2010-02-24 华天清 Method and device for searching information object
CN103116635A (en) * 2013-02-07 2013-05-22 中国科学院计算技术研究所 Field-oriented method and system for collecting invisible web resources
CN103389998A (en) * 2012-05-11 2013-11-13 安徽华贞信息科技有限公司 Novel Internet commercial intelligence information semantic analysis technology based on cloud service

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843791A (en) * 2015-01-15 2016-08-10 克拉玛依红有软件有限责任公司 Semantic network model establishing method based on 6W semantic identification
CN105843791B (en) * 2015-01-15 2018-08-03 克拉玛依红有软件有限责任公司 A kind of semantic network models construction method based on 6W semantemes mark
CN106021339A (en) * 2016-05-09 2016-10-12 中国联合网络通信集团有限公司 A semantic query method and system for a resource tree
CN106021339B (en) * 2016-05-09 2019-07-26 中国联合网络通信集团有限公司 The semantic query method and system of resourceoriented tree
US10678820B2 (en) 2018-04-12 2020-06-09 Abel BROWARNIK System and method for computerized semantic indexing and searching

Similar Documents

Publication Publication Date Title
CN103049575B (en) A kind of academic conference search system of topic adaptation
CN103365924B (en) A kind of method of internet information search, device and terminal
EP2057557B1 (en) Joint optimization of wrapper generation and template detection
CN106126648B (en) It is a kind of based on the distributed merchandise news crawler method redo log
CN102930059B (en) Method for designing focused crawler
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN107423391B (en) Information extraction method of webpage structured data
CN104182412A (en) Webpage crawling method and webpage crawling system
US20110208715A1 (en) Automatically mining intents of a group of queries
CN105045901A (en) Search keyword push method and device
CN104899273A (en) Personalized webpage recommendation method based on topic and relative entropy
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN105159930A (en) Search keyword pushing method and apparatus
CN103530429B (en) Webpage content extracting method
CN104715064A (en) Method and server for marking keywords on webpage
CN106844640A (en) A kind of web data analysis and processing method
CN103389998A (en) Novel Internet commercial intelligence information semantic analysis technology based on cloud service
CN103838732A (en) Vertical search engine in life service field
US11263062B2 (en) API mashup exploration and recommendation
CN103559234A (en) System and method for automated semantic annotation of RESTful Web services
US20220292160A1 (en) Automated system and method for creating structured data objects for a media-based electronic document
CN104317845A (en) Method and system for automatic extraction of deep web data
CN103838862A (en) Video searching method, device and terminal
CN116775972A (en) Remote resource arrangement service method and system based on information technology
CN105528357A (en) Webpage content extraction method based on similarity of URLs and similarity of webpage document structures

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150114