CN105786964B - Web-mining-based semantic expansion method for restriction terms in remote sensing product retrieval - Google Patents

Web-mining-based semantic expansion method for restriction terms in remote sensing product retrieval

Info

Publication number
CN105786964B
Authority
CN
China
Prior art keywords
time
retrieval
word
disclosure
paragraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610048113.8A
Other languages
Chinese (zh)
Other versions
CN105786964A (en)
Inventor
何建军
李玉堂
陈婷
关盛勇
王西亚
高宇
武文斌
高松峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Twenty First Century Aerospace Technology Co Ltd
Original Assignee
Twenty First Century Aerospace Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Twenty First Century Aerospace Technology Co Ltd
Priority to CN201610048113.8A
Publication of CN105786964A
Application granted
Publication of CN105786964B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3338 Query expansion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention proposes a web-mining-based semantic expansion method for restriction terms in remote sensing product retrieval. The method mainly comprises: extracting the content summaries and time information from web search results to form an excerpt text; extracting the time words and spatial words in the excerpt text and marking the spatial words and the basic time expressions formed from the time words; normalizing the marked basic time expressions; and counting the normalized time expressions and the spatial words, taking the most frequent spatio-temporal words as the expansion result. The spatio-temporal information that users enter when retrieving remote sensing products is ambiguous, referential, and dynamic. To address this, the method uses web mining to locate the complete spatio-temporal information and semantically expand the user's input, so that the user's need is understood accurately and the accuracy and timeliness of retrieval improve.

Description

Web-mining-based semantic expansion method for restriction terms in remote sensing product retrieval
Technical field
The invention belongs to the field of remote sensing data processing and information extraction, and relates to a technique, based on web mining, for semantically expanding the restriction terms of remote sensing product retrieval.
Background art
Semantic expansion of restriction terms in remote sensing product retrieval means that, when remote sensing products are retrieved, the time and place information contained in the restriction terms of the query is semantically expanded. A restriction term is the temporal and spatial constraint information contained in a user's query for remote sensing products. It is divided into explicit and implicit spatio-temporal information: explicit spatio-temporal information states the time and place directly in the query, whereas implicit spatio-temporal information does not state them directly, but the relevant time and place can be obtained by analyzing or expanding the query.
At present, remote sensing product retrieval services are mainly semantic-based. For this kind of retrieval, most existing work on expansion concerns the construction of object semantics and spatial-relation semantics. Semantic expansion at the higher level of natural language has received less study, particularly the automatic expansion of the semantics of remote sensing product users.
Chen Xu et al. of Wuhan University proposed a method for automatically expanding the semantics of remote sensing product users. By extending the ISO 19115-2 model (ISO 19115-2 is an international standard for geographic information metadata) and combining UML (an object-oriented modeling language) with a data dictionary, they built an image metadata ontology and realized query expansion for remote sensing image products. However, constrained by the principles of ontology construction, ontology-based query expansion is highly specialized and inconvenient for ordinary users. Moreover, as remote sensing product services become public, the domain-specific character of remote sensing products is weakening while the heterogeneity and dynamics of these services grow more pronounced, so user semantic expansion based on ontology alone cannot meet the precision and recall requirements of retrieval.
Summary of the invention
The technical problem to be solved by the invention is to provide a web-mining-based semantic expansion method for restriction terms in remote sensing product retrieval that is convenient for public use and achieves high recall.
To solve the above technical problem, the invention proposes a web-mining-based semantic expansion method for restriction terms in remote sensing product retrieval, comprising the following steps:
S1. Enter the restriction term of the query into a search engine and extract the web search results: the content summary of each record is extracted as a paragraph, and the paragraphs, in order, make up the excerpt text;
At the same time, extract the publication time of each record, or the document creation time when there is no publication time. Define a canonical time format, convert the publication time or document creation time to that format to serve as the benchmark reference time, and record the benchmark reference time in the corresponding content-summary paragraph. If a record has neither a publication time nor a document creation time, or the time cannot be converted to the canonical time format, its content-summary paragraph has no benchmark reference time;
S2. Segment the excerpt text into words, identify the time words and spatial words among the segmented words, combine the time words into basic time expressions, and mark the basic time expressions and the spatial words;
S3. For a paragraph without a benchmark reference time, determine whether any of its marked basic time expressions matches the canonical time format; if so, set that expression as the benchmark reference time of the paragraph; if not, delete the paragraph. For a paragraph with a benchmark reference time, convert its marked basic time expressions to the canonical time format; if a basic time expression is incomplete at conversion, fill in the missing parts from the benchmark reference time of the paragraph;
S4. Count the normalized time expressions and the spatial words, and take the most frequent time expression and spatial word as the semantic expansion result.
In the above semantic expansion method, step S1 comprises the following steps:
S11. Establish a web retrieval extraction information table containing the search engine domain name, the search engine address template, the content summary node identifier, the publication time identifier, the document creation time identifier, the extraction page quantity, and the search result page number identifier;
The search engine domain name is the character string that a search website registers with the administrative authority to identify its Internet address; this field records the web address used to retrieve the remote sensing product restriction term;
The search engine address template is the address structure through which the search engine accepts retrieval input; wildcards in the template mark the dynamically entered information;
The content summary node identifier is the string that identifies the content summary in the structure of the search result page;
The publication time identifier is the string that identifies the publication time of a document in the structure of the search result page;
The document creation time identifier is the string that identifies the creation time of a document in the structure of the search result page;
The extraction page quantity is the number of top search results that the user wishes to use as the semantic expansion source;
The search result page number identifier is the identifier used in the page-turning access address when the number of search results exceeds what one page can display;
S12. Obtain the uniform resource locator (URL) encoding scheme of the search engine, transcode the retrieval restriction term according to that scheme to obtain its URL-encoded form, and substitute the encoded term for the wildcard in the search engine address template; write the extraction page quantity from the web retrieval extraction information table into the search result page number identifier;
S13. Parse the search result page into a DOM tree;
S14. According to the content summary node identifier in the web retrieval extraction information table, extract the text content of each record as its content summary, and form a paragraph from the content summary of the record;
S15. According to the publication time identifier or document creation time identifier that corresponds to the content summary node identifier in the web retrieval extraction information table, extract the time of the record. Define a canonical time format, convert the publication time or document creation time to that format to serve as the benchmark reference time, and record the benchmark reference time in the corresponding content-summary paragraph. If the record has neither a publication time nor a document creation time, or the time cannot be converted to the canonical time format, its content-summary paragraph has no benchmark reference time;
S16. Loop over each search result page, and save the paragraphs in order as the excerpt text of all search results.
In the above semantic expansion method, the semantic expansion of time expressions in step S4 may use the following steps:
S41. Build an array for each normalized time, ordered from the largest time unit to the smallest;
S42. Compare how often each identical array occurs; the array with the highest frequency is the time expansion result. If several arrays share the highest frequency, count frequencies unit by unit from the largest time unit to the smallest, and take the most frequent value of each unit as the result for that unit, forming the final time expansion result. If a unique result still cannot be obtained unit by unit, take the first time expression to appear as the time expansion result.
Beneficial effects of the invention:
Having analyzed the natural-language characteristics of user queries for remote sensing products, the invention proposes using web mining to semantically expand the content of the user's query and thereby obtain the spatio-temporal information of the remote sensing products the user needs.
Web mining is the application of data mining to the processing of network information; it handles the real-time, dynamic information of the Internet. The web-mining-based semantic expansion method is not limited by the need to build a model or specification, avoids the specialization that ontology-based query expansion entails, and is convenient for public use. Because web mining can obtain information from the network in real time and dynamically, it also improves recall.
The web-mining-based semantic expansion method of the invention markedly improves the accuracy and timeliness of remote sensing product retrieval.
Brief description of the drawings
Fig. 1 is a flowchart of the web-mining-based semantic expansion method for restriction terms in remote sensing product retrieval.
Detailed description of embodiments
The invention is described in further detail below with reference to the drawing and specific embodiments.
The invention proposes a web-mining-based semantic expansion method for restriction terms in remote sensing product retrieval, with the following steps:
S1. Enter the restriction term of the query into a search engine and extract the web search results: the content summary of each record is extracted as a paragraph, and the paragraphs, in order, make up the excerpt text.
At the same time, extract the publication time of each record, or the document creation time when there is no publication time. Define a canonical time format, convert the publication time or document creation time to that format to serve as the benchmark reference time, and record the benchmark reference time in the corresponding content-summary paragraph. If a record has neither a publication time nor a document creation time, or the time cannot be converted to the canonical time format, its content-summary paragraph has no benchmark reference time.
The detailed process is as follows:
S11. Enter the restriction term of the query into the search engine and obtain the search results;
The restriction term of the query is the temporal and spatial constraint information in a user's query for remote sensing products, and includes explicit and implicit spatio-temporal information. Explicit spatio-temporal information states the time and place directly in the query, whereas implicit spatio-temporal information does not state them directly, but the relevant time and place can be obtained by analyzing or expanding the query. For example, for the query "2014 Beijing winter wheat image products", "Beijing, 2014" is the explicit restriction term of the winter wheat remote sensing products; for the query "Wenchuan earthquake images", "Wenchuan earthquake" is the implicit spatio-temporal restriction term for remote sensing products related to "Wenchuan City, May 12, 2008".
In this embodiment, taking the remote sensing product query "Wenchuan earthquake images" as an example, the retrieval restriction term "Wenchuan earthquake" is entered into the Baidu search engine and the search is confirmed, which returns a search result page.
S12. Crawl the excerpt pages of the search results; typically the first 50 search results are crawled;
S13. Convert the excerpt pages to source code, extract the content summary of each record as a paragraph, and arrange the paragraphs in order to make up the excerpt text. At the same time, extract the publication time of each record, or the document creation time when there is no publication time. Define a canonical time format, convert the publication time or document creation time to that format to serve as the benchmark reference time, and record the benchmark reference time in the corresponding content-summary paragraph. If a record has neither a publication time nor a document creation time, or the time cannot be converted to the canonical time format, its content-summary paragraph has no benchmark reference time.
In this step the source code of the excerpt pages is downloaded locally, and a conventional text interception method extracts the content summaries from the source code. At the same time, it is determined whether a publication time exists: if so, the publication time is converted to the canonical time format (the canonical time format can be defined freely, for example as "** month ** day") and marked as the benchmark reference time; if not, the document creation time is looked up, converted to the canonical time format, and marked as the benchmark reference time.
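The conversion to a canonical time format can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the record keys `publish_time` and `doc_time`, the two input patterns, and the choice of a (year, month, day) tuple as the canonical form are not specified by the patent. The sketch also falls back to the creation time whenever the publication time is missing or unconvertible, which is a simplification.

```python
import re
from typing import Optional, Tuple

Ymd = Tuple[int, int, int]  # hypothetical canonical form: (year, month, day)

# Hypothetical input patterns; a real system would list whatever the engine emits.
_PATTERNS = [
    re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})"),       # e.g. 2014-12-25
    re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日"),    # e.g. 2014年12月25日
]

def to_canonical(raw: Optional[str]) -> Optional[Ymd]:
    """Return (year, month, day), or None when the text cannot be converted."""
    if not raw:
        return None
    for pattern in _PATTERNS:
        match = pattern.search(raw)
        if match:
            year, month, day = (int(g) for g in match.groups())
            if 1 <= month <= 12 and 1 <= day <= 31:
                return (year, month, day)
    return None

def reference_time(record: dict) -> Optional[Ymd]:
    """Prefer the publication time; otherwise try the document creation time."""
    return to_canonical(record.get("publish_time")) or to_canonical(record.get("doc_time"))
```

A record for which `reference_time` returns `None` corresponds to a content-summary paragraph without a benchmark reference time.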
There are many web information extraction techniques. By extraction principle and extraction style they fall into five types: natural-language-processing-based, wrapper-induction-based, ontology-based, HTML-structure-based, and web-query-based. The invention proposes a method, based on the DOM (Document Object Model) tree, for extracting results from web search pages; it suits multiple search engines and extracts information automatically. Its steps are as follows:
(1) Establish the web retrieval extraction information table, which supplies the parameters for automatically constructing the search engine access address, for setting up the return parameters of the search result page, and for obtaining the attributes of search result nodes. The table contains: the search engine domain name (Domain), the search engine address template (URL_Form), the content summary node identifier (Abstract), the publication time identifier (CreateTime), the document creation time identifier (DocumnetTime), the extraction page quantity (Page_Num), and the search result page number identifier (Page_NumCode).
The search engine domain name (Domain) is the character string that a search website registers with the administrative authority to identify its Internet address; this field records the web address used to retrieve the remote sensing product restriction term, for example: www.***.com.
The search engine address template (URL_Form) is the address structure through which the search engine accepts retrieval input; wildcards in the template mark the dynamically entered information. By choosing the search engine address template, advertising information can be excluded from extraction. For example, the address template of the Baidu search engine is https://www.***.com/#ie=*&f=3&rsv_bp=1&rsv_idx=1&tn=***local&wd=*, and using this address template extracts no advertising information.
The content summary node identifier (Abstract) is the string that identifies the content summary in the structure of the search result page. For example, in the structure of the Baidu search result page it is "c-abstract".
The publication time identifier (CreateTime) is the string that identifies the publication time of a document in the structure of the search result page. For example, in the structure of the Baidu search result page it is "f13m".
The document creation time identifier (DocumnetTime) is the string that identifies the creation time of a document in the structure of the search result page. For example, in the structure of the Baidu search result page it is "g".
The extraction page quantity (Page_Num) is the number of top search results that the user wishes to use as the semantic expansion source. For example, to use the first 50 Baidu search results as the expansion source, enter 50.
The search result page number identifier (Page_NumCode) is the identifier used in the page-turning access address when the number of search results exceeds what one page can display, for example "*" for Baidu.
(2) Obtain the uniform resource locator (URL) encoding scheme of the search engine, transcode the retrieval restriction term according to that scheme to obtain its URL-encoded form, and substitute the encoded term for the wildcard in the search engine address template; write the extraction page quantity from the web retrieval extraction information table into the search result page number identifier.
(3) Parse the search result page into a DOM tree.
(4) According to the content summary node identifier in the web retrieval extraction information table, extract the text content of each record as its content summary, and form a paragraph from the content summary of the record.
(5) According to the publication time identifier or document creation time identifier that corresponds to the content summary node identifier in the web retrieval extraction information table, extract the time of the record. Define a canonical time format, convert the publication time or document creation time to that format to serve as the benchmark reference time, and record the benchmark reference time in the corresponding content-summary paragraph. If the record has neither a publication time nor a document creation time, or the time cannot be converted to the canonical time format, its content-summary paragraph has no benchmark reference time.
(6) Loop over each search result page, and save the paragraphs in order as the excerpt text of all search results.
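Step (2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the template `https://search.example/s?wd=*&pn=*` is hypothetical, the first wildcard is assumed to carry the encoded restriction term and the second a result offset, and ten results per page is likewise an assumed value. UTF-8 percent-encoding stands in for whatever encoding scheme a given engine actually uses.

```python
from typing import List
from urllib.parse import quote

def build_search_urls(url_form: str, term: str, page_num: int,
                      per_page: int = 10) -> List[str]:
    """Fill a wildcard address template for every result page to fetch.

    url_form : template with two '*' wildcards (query, then result offset)
    term     : retrieval restriction term as entered by the user
    page_num : how many top results are wanted (Page_Num in the table)
    """
    # Percent-encode the term; with safe="" a literal '*' is encoded as well,
    # so the second wildcard of the template is still the only '*' left.
    encoded = quote(term, safe="", encoding="utf-8")
    with_term = url_form.replace("*", encoded, 1)
    offsets = range(0, page_num, per_page)
    return [with_term.replace("*", str(offset), 1) for offset in offsets]

# Hypothetical template and the restriction term of the embodiment.
urls = build_search_urls("https://search.example/s?wd=*&pn=*", "汶川地震", 50)
```

With 50 wanted results and 10 results per page, the sketch produces five addresses, one per result page.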
With the DOM-tree-based method for extracting web search page results, the parsed form is convenient for information extraction, and relevant information from multiple search engines can be extracted automatically while advertising information is filtered out.
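Extraction of content summaries by node identifier might look like the sketch below. Note the substitution: rather than building a full DOM tree, it uses the streaming `html.parser.HTMLParser` from the Python standard library and tracks nesting depth, which is enough to collect the text under marked nodes. The sample markup and the handling of void tags are assumptions; only the class name "c-abstract" comes from the description.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they must not affect the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class SummaryExtractor(HTMLParser):
    """Collect the text of every element whose class list contains a marker."""

    def __init__(self, abstract_class: str):
        super().__init__()
        self.abstract_class = abstract_class
        self.depth = 0            # > 0 while inside a content summary node
        self.paragraphs = []      # one paragraph per search record
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1       # nested element inside a summary
        elif self.abstract_class in classes:
            self.depth = 1        # entering a summary node
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:       # leaving the summary node
            self.paragraphs.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)

page = ('<div><div class="c-abstract">2008年<em>汶川</em>地震</div>'
        '<div class="ad">buy</div>'
        '<div class="c-abstract">second<br>line</div></div>')
extractor = SummaryExtractor("c-abstract")
extractor.feed(page)
```

Nodes without the marker class, such as the advertising block in the sample, contribute nothing to `extractor.paragraphs`.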
S2. Segment the excerpt text into words, identify the time words and spatial words among the segmented words, combine the time words into basic time expressions, and mark the basic time expressions and the spatial words.
A basic time expression combines several consecutive time words, according to a certain format, into one complete time phrase that expresses a complete time.
This step can be implemented as follows:
S21. Segment the excerpt text using open-source word segmentation software
Call the ICTCLAS 5.0 segmentation system according to the interface documentation of ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), the Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences. Feed the excerpt text into the segmentation system and run it to obtain the segmentation result of the excerpt text, which is a sequence of words.
S22. Scan the segmentation result, identify which words are time words according to the time trigger word dictionary, form basic time expressions according to the time expression templates, and mark the type of each expression.
The time trigger word dictionary in the invention may be an existing time dictionary that standardizes time words. This embodiment builds a new time trigger word dictionary containing three classes of time trigger words: the time word class, the prefix modifier class, and the suffix modifier class.
The time word class comprises words that express time in date form: time units (such as year, month, day, hour, minute, second), festivals (such as "National Day"), and abbreviations of special dates (such as "May Day").
The prefix modifier class comprises common time modifiers that stand before a time of day (such as 13:58:23), a date (such as August 20, 2015), a period (such as summer or winter), or a compound time phrase; the modifier combines with the time word to express a time, for example "since ...".
The suffix modifier class comprises modifiers that stand after a time of day, a date, a period, or a compound time phrase; the modifier combines with the time word to express a time, for example "... until" and "... before".
The time expression templates above can be built from rules that follow the conventions of Chinese time expressions; by marking each expression with a type from the classification table, they provide the basis for normalizing time expressions. This embodiment proposes a set of time expression templates, shown in Table 1, where "time expression template" is the format of a time expression (a time expression being a combination of time words) and "type" is the category of the template.
Table 1. Time expression templates
The time trigger word dictionary is used to identify the time words in the excerpt text; the time words are combined into basic time expressions in the format specified by the time expression templates; the type of each expression is determined, and the corresponding time type is marked after the basic time expression.
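As a rough illustration of step S22, the Python sketch below recognizes time words in an already segmented token list and joins consecutive time words into one basic time expression. The tiny dictionary (a unit pattern plus two festival words) is a hypothetical stand-in for the time trigger word dictionary; prefix and suffix modifiers, the templates of Table 1, and type marking are left out.

```python
import re
from typing import List

# Hypothetical miniature of the "time word class": number + unit, or a festival name.
UNIT_WORD = re.compile(r"^\d+(年|月|日|时|分|秒)$")
FESTIVAL_WORDS = {"国庆节", "五一"}

def is_time_word(token: str) -> bool:
    """True when the token is listed in, or matches, the trigger dictionary."""
    return bool(UNIT_WORD.match(token)) or token in FESTIVAL_WORDS

def basic_time_expressions(tokens: List[str]) -> List[str]:
    """Join each run of consecutive time words into one basic time expression."""
    expressions, run = [], []
    for token in tokens:
        if is_time_word(token):
            run.append(token)
        elif run:
            expressions.append("".join(run))
            run = []
    if run:                      # flush a run that ends the token list
        expressions.append("".join(run))
    return expressions

tokens = ["2008年", "5月", "12日", "汶川", "发生", "地震", "五一"]
found = basic_time_expressions(tokens)
```

Here the three consecutive unit words form a single expression, while the festival word at the end forms its own.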
S23. Scan the segmentation result of the excerpt text, and identify and mark spatial words using basic geographic data as the spatial dictionary. The basic geographic data are issued by the national fundamental geographic information service platform and include the names and boundaries of administrative regions such as the provinces, cities, and counties of the country.
S231. Scan the segmentation result and match each word against the "name" field of the basic geographic data;
S232. If a word matches a "name" in the basic geographic data, mark it as a spatial word by appending "/ns" to it; if it does not match, move on to the next word;
S233. Loop over every word until the end of the excerpt text.
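Steps S231 to S233 amount to a dictionary lookup per token. A minimal Python sketch, assuming the "name" field of the basic geographic data has been loaded into a set (the three names below are placeholders for that data):

```python
from typing import Iterable, List, Set

def tag_spatial_words(tokens: Iterable[str], place_names: Set[str]) -> List[str]:
    """Append '/ns' to every token that matches a place name; keep the rest as is."""
    return [token + "/ns" if token in place_names else token for token in tokens]

# Placeholder for the "name" field of the basic geographic data.
place_names = {"汶川县", "四川省", "北京市"}
tagged = tag_spatial_words(["2008年", "汶川县", "发生", "地震"], place_names)
```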
S3. Determine whether each paragraph of the excerpt text has a benchmark reference time. For a paragraph without one, determine whether any of its marked basic time expressions matches the canonical time format; if so, set that expression as the benchmark reference time of the paragraph; if not, delete the paragraph.
For every paragraph with a benchmark reference time, convert its marked basic time expressions to the canonical time format; if a basic time expression is incomplete at conversion, fill in the missing parts from the benchmark reference time of the paragraph.
Besides conventional methods such as atomic-time normalized expression, the following method can be used for time normalization in the invention:
Convert every marked basic time expression into a combination of numbers and time units and match it against the canonical time format: if it matches completely, move on to the next basic time expression; if it matches partially, keep the matched part, fill the unmatched part from the benchmark reference time of the paragraph, and move on to the next basic time expression, until all marked basic time expressions of the paragraph have been normalized.
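The paragraph handling of step S3 can be sketched in Python as below. The data layout is an assumption: each paragraph is a dict with a `reference_time` (or `None`) and a list of `expressions`, where an expression is a (year, month, day) tuple with `None` for any part the text did not state. A fully specified tuple plays the role of an expression that matches the canonical time format.

```python
from typing import List, Optional, Tuple

Partial = Tuple[Optional[int], Optional[int], Optional[int]]  # (year, month, day)

def fill_from_reference(expr: Partial, ref: Partial) -> Partial:
    """Keep the parts the expression states; take missing parts from the reference."""
    return tuple(r if e is None else e for e, r in zip(expr, ref))

def normalize_paragraphs(paragraphs: List[dict]) -> List[dict]:
    """Apply S3: assign or inherit a reference time, then complete each expression."""
    kept = []
    for para in paragraphs:
        ref = para.get("reference_time")
        if ref is None:
            # Use the first fully specified expression as the reference time.
            ref = next((e for e in para["expressions"] if None not in e), None)
            if ref is None:
                continue  # no usable time at all: the paragraph is deleted
        normalized = [fill_from_reference(e, ref) for e in para["expressions"]]
        kept.append({"reference_time": ref, "expressions": normalized})
    return kept

paragraphs = [
    {"reference_time": (2014, 12, 25), "expressions": [(None, 5, 1)]},
    {"reference_time": None, "expressions": [(2008, 5, 12), (None, None, 13)]},
    {"reference_time": None, "expressions": [(None, 5, None)]},
]
result = normalize_paragraphs(paragraphs)
```

In this sample the third paragraph is dropped because it has neither a reference time nor a complete expression.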
Basic time expressions are converted into combinations of numbers and time units as follows:
A calendar-type time such as "1997-09-01" is converted to year 1997, month 09, day 01;
An absolute time such as "May Day" is converted to month 05, day 01;
A period such as "the year 2001" is converted to the span from month 01 of 2001 to December 31, 2001;
In addition, weekdays or week periods and relative times are resolved against the benchmark reference time of the paragraph in which they appear in the excerpt text: the absolute time is inferred from the benchmark reference time using the calendar.
For example, if "this Friday" appears in a paragraph of the excerpt text whose benchmark reference time is "December 25, 2014", the calendar gives the absolute time "December 26, 2014" for "this Friday";
Likewise, if the relative time "last year" appears in a paragraph of the excerpt text whose benchmark reference time is "December 25, 2014", "last year" is converted to "December 25, 2013".
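Resolving weekday and relative expressions against a benchmark reference time is calendar arithmetic, which Python's standard `datetime` module covers. The sketch below makes two assumptions that the description does not settle: "this" weekday means the named day within the Monday-to-Sunday week of the reference date, and a shift by whole years keeps the month and day (falling back to 28 February when 29 February does not exist in the target year).

```python
from datetime import date, timedelta

def this_weekday(reference: date, weekday: int) -> date:
    """Return the given weekday (Monday=0 ... Sunday=6) in the reference date's week."""
    return reference + timedelta(days=weekday - reference.weekday())

def years_before(reference: date, years: int) -> date:
    """Shift the reference date back by whole years, keeping month and day."""
    try:
        return reference.replace(year=reference.year - years)
    except ValueError:  # 29 February has no counterpart in a non-leap year
        return reference.replace(year=reference.year - years, day=28)

reference = date(2014, 12, 25)          # benchmark reference time of the paragraph
friday = this_weekday(reference, 4)     # Friday of the same week
one_year_back = years_before(reference, 1)
```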
S4. Count the normalized time expressions and the spatial words, and take the most frequent time expression and spatial word as the semantic expansion result.
The semantic expansion of time expressions may use the following steps:
S41. Build an array for each normalized time, ordered from the largest time unit to the smallest;
S42. Compare how often each identical array occurs; the array with the highest frequency is the time expansion result. If several arrays share the highest frequency, count frequencies unit by unit from the largest time unit to the smallest, and take the most frequent value of each unit as the result for that unit, forming the final time expansion result. If a unique result still cannot be obtained unit by unit, take the first time expression to appear as the time expansion result.
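Steps S41 and S42 can be read as a frequency vote with two tie-breakers. The Python sketch below is one possible reading, assuming each normalized time is a (year, month, day) tuple listed in order of appearance; the function name and the tuple layout are illustrative only.

```python
from collections import Counter
from typing import List, Optional, Tuple

Ymd = Tuple[int, int, int]  # array ordered from the largest unit to the smallest

def expand_time(times: List[Ymd]) -> Optional[Ymd]:
    """Pick the time expansion result from normalized times in order of appearance."""
    if not times:
        return None
    counts = Counter(times)
    top = max(counts.values())
    winners = [t for t in counts if counts[t] == top]
    if len(winners) == 1:
        return winners[0]            # one array is clearly the most frequent
    # Tie between whole arrays: vote unit by unit, largest unit first.
    result = []
    for unit in range(3):
        unit_counts = Counter(t[unit] for t in times)
        unit_top = max(unit_counts.values())
        unit_winners = [v for v, c in unit_counts.items() if c == unit_top]
        if len(unit_winners) != 1:
            return times[0]          # still ambiguous: first expression to appear
        result.append(unit_winners[0])
    return tuple(result)

clear = expand_time([(2008, 5, 12), (2008, 5, 12), (2008, 5, 13)])
tied = expand_time([(2008, 5, 12), (2008, 6, 12), (2008, 5, 13),
                    (2008, 5, 12), (2008, 6, 12)])
```

One consequence of the unit-by-unit vote is that the combined result need not equal any single time that occurred in the text; the sketch follows the description in accepting that.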
The semantic expansion of spatial words may use the following steps:
S43. Count how often each spatial word occurs and take the most frequent spatial word as the retrieval result; if several spatial words share the highest frequency, take all of them as retrieval results;
S44. If the spatial retrieval result is at county level, the expansion result also includes the names of the prefecture-level city and province it belongs to, according to the basic geographic data; if it is a prefecture-level name, the expansion result also lists the province it belongs to; if it is a province name, the spatial expansion result is the spatial extent of the province.
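Steps S43 and S44 can be sketched in Python as a frequency count followed by a walk up the administrative hierarchy. The `parent_of` mapping is a hypothetical, hand-written stand-in for the basic geographic data, and returning a list of names (rather than a spatial extent) for a province is a simplification.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical excerpt of the basic geographic data: place name -> parent region.
parent_of: Dict[str, Optional[str]] = {
    "汶川县": "阿坝藏族羌族自治州",    # county -> prefecture
    "阿坝藏族羌族自治州": "四川省",    # prefecture -> province
    "四川省": None,                    # province: top of the chain
}

def expand_space(words: List[str], parents: Dict[str, Optional[str]]) -> List[List[str]]:
    """Return, for each most frequent spatial word, the chain up to its province."""
    counts = Counter(words)
    if not counts:
        return []
    top = max(counts.values())
    results = []
    for word in counts:                      # Counter keeps first-seen order
        if counts[word] != top:
            continue
        chain = [word]
        parent = parents.get(word)
        while parent:                        # climb county -> prefecture -> province
            chain.append(parent)
            parent = parents.get(parent)
        results.append(chain)
    return results

expansion = expand_space(["汶川县", "四川省", "汶川县"], parent_of)
```

When several spatial words tie for the highest frequency, each of them yields its own chain in the returned list.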
The embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments. Variations made within the knowledge of a person of ordinary skill in the art without departing from the concept of the present invention still fall within the protection scope of the present invention.

Claims (3)

1. A semantic extension method for restriction items in remote sensing product retrieval based on web mining, comprising the following steps:
S1. Input the restriction item of the query content into a search engine and extract the web search results; extract the content summary of each record to form a paragraph, and arrange the paragraphs in order to form an excerpt chapter;
Meanwhile, extract the publication time of each record, or the document creation time if there is no publication time; define a canonical time format; convert the publication time or document creation time according to the canonical time format and use it as the benchmark reference time, recording the benchmark reference time in the corresponding summary paragraph; if a record has neither a publication time nor a document creation time, or if its publication time or document creation time cannot be converted according to the canonical time format, the corresponding summary paragraph has no benchmark reference time;
S2. Perform word segmentation on the excerpt chapter, identify the time words and spatial words among the segmented words, combine the time words into basic time expressions, and mark the basic time expressions and spatial words;
S3. For a paragraph without a benchmark reference time, determine whether the marked basic time expressions include one that matches the canonical time format; if so, set it as the benchmark reference time of the paragraph; if not, delete the paragraph;
For a paragraph with a benchmark reference time, convert the marked basic time expressions into the canonical time format; if a basic time expression is incomplete at conversion, fill the missing part from the benchmark reference time of the paragraph;
S4. Count the normalized temporal expressions and spatial words, and take the temporal expression and spatial word that occur most frequently as the semantic extension result.
2. The semantic extension method for restriction items in remote sensing product retrieval according to claim 1, characterized in that step S1 comprises the following steps:
S11. Establish a network retrieval extraction information table, which includes the search engine domain name, the search engine address template, the summary content node identifier, the publication time identifier, the document creation time identifier, the extraction page quantity, and the search result page quantity identifier;
The search engine domain name is the character string, registered by the search website with the administrative certification authority, that identifies its Internet address; this string serves as the network address for retrieving the remote sensing product restriction item;
The search engine address template is the retrieval address input structure of the search engine, in which wildcards mark the dynamically entered information;
The summary content node identifier is the character string that identifies the content summary in the structure of the search result page;
The publication time identifier is the character string that identifies the publication time of the document in the structure of the search result page;
The document creation time identifier is the character string that identifies the creation time of the document in the structure of the search result page;
The extraction page quantity is the number of leading search results that the user wishes to use as the source of semantic extension;
The search result page quantity identifier is the page-turning access address identifier used when the number of search results exceeds what one page can display;
S12. Obtain the uniform resource locator (URL) encoding scheme of the search engine, transcode the retrieval restriction item according to that scheme to obtain its URL-encoded form, and replace the wildcard in the search engine address template with the encoded form; write the extraction page quantity from the network retrieval extraction information table into the search result page quantity identifier;
S13. Parse the search result page into a DOM tree;
S14. According to the summary content node identifier in the network retrieval extraction information table, extract the text content of each record as its summary, and form the summary of each record into a paragraph;
S15. According to the publication time identifier or document creation time identifier that corresponds to the summary content node identifier in the network retrieval extraction information table, extract the time of each record; define a canonical time format; convert the publication time or document creation time according to the canonical time format and use it as the benchmark reference time, recording the benchmark reference time in the corresponding summary paragraph; if a record has neither a publication time nor a document creation time, or if its publication time or document creation time cannot be converted according to the canonical time format, the corresponding summary paragraph has no benchmark reference time;
S16. Loop over each search result page, and save the paragraphs in order as the excerpt chapter of all search results.
3. The semantic extension method for restriction items in remote sensing product retrieval according to claim 1 or 2, characterized in that the semantic extension of the temporal expressions in step S4 comprises the following steps:
S41. For each normalized time, build an array of its units ordered from the largest unit to the smallest;
S42. Compare how often each identical array occurs; the array with the highest frequency is the time extension result. If several arrays share the highest frequency, count the frequencies unit by unit, from the largest time unit to the smallest, and take the most frequent value of each unit as the result for that unit, forming the final time extension result. If a unique result still cannot be obtained unit by unit, take the temporal expression that appears first as the time extension result.
CN201610048113.8A 2016-01-15 2016-01-15 Remote Sensing Products retrieval based on Web Mining limits item semantic extension method Active CN105786964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610048113.8A CN105786964B (en) 2016-01-15 2016-01-15 Remote Sensing Products retrieval based on Web Mining limits item semantic extension method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610048113.8A CN105786964B (en) 2016-01-15 2016-01-15 Remote Sensing Products retrieval based on Web Mining limits item semantic extension method

Publications (2)

Publication Number Publication Date
CN105786964A CN105786964A (en) 2016-07-20
CN105786964B true CN105786964B (en) 2019-08-09

Family

ID=56403184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610048113.8A Active CN105786964B (en) 2016-01-15 2016-01-15 Remote Sensing Products retrieval based on Web Mining limits item semantic extension method

Country Status (1)

Country Link
CN (1) CN105786964B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528644B (en) * 2016-10-14 2020-07-31 航天恒星科技有限公司 Remote sensing data retrieval method and device
CN106776556B (en) * 2016-12-12 2019-10-11 北京蓝海讯通科技股份有限公司 A kind of Text Mode generation method, device and calculate equipment
CN107729314B (en) * 2017-09-29 2021-10-26 东软集团股份有限公司 Chinese time identification method and device, storage medium and program product

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5873107A (en) * 1996-03-29 1999-02-16 Apple Computer, Inc. System for automatically retrieving information relevant to text being authored
CN102073692A (en) * 2010-12-16 2011-05-25 北京农业信息技术研究中心 Agricultural field ontology library based semantic retrieval system and method
CN103186556A (en) * 2011-12-28 2013-07-03 北京百度网讯科技有限公司 Method for obtaining and searching structural semantic knowledge and corresponding device
CN104239300A (en) * 2013-06-06 2014-12-24 富士通株式会社 Method and device for excavating semantic keywords from text

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060218115A1 (en) * 2005-03-24 2006-09-28 Microsoft Corporation Implicit queries for electronic documents

Also Published As

Publication number Publication date
CN105786964A (en) 2016-07-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant