CN104484388A - Method and device for screening scarce information pages - Google Patents

Info

Publication number
CN104484388A
Authority
CN
China
Prior art keywords
participle
page
rare
word
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410759482.9A
Other languages
Chinese (zh)
Inventor
魏少俊
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201410759482.9A priority Critical patent/CN104484388A/en
Publication of CN104484388A publication Critical patent/CN104484388A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for screening scarce information pages. The method comprises the following steps: performing word segmentation on pages crawled by a search engine to generate a plurality of segmented words; searching for scarce words among the segmented words; and selecting the pages that contain the scarce words as scarce information pages. With this technical scheme, scarce information pages can be screened from the pages crawled by the search engine, so coverage is wide and users can be given relatively rich data support. Because the pages are screened according to scarce words, the selected pages are of relatively high quality, the information they provide can meet users' information needs, search accuracy is high, and the user's search experience can be improved.

Description

Method and device for screening rare information pages
Technical Field
The present invention relates to the field of information search, and in particular to a method and device for screening rare information pages.
Background
The set of pages crawled by a search engine is very large. For reasons of cost and efficiency, the search engine selects only part of these pages for indexing, and the selection is based mainly on how much a page's content is repeated and on the quality of the content itself.
This screening reduces the amount of processing required for the huge page set, deletes a large number of duplicate pages, and improves the efficiency with which the index set provides information. However, some information is easily overlooked during search for various reasons (for example, a low degree of repetition), such as certain personal names, obscure place names, or product model numbers.
Summary of the Invention
In view of the above problems, the present invention is proposed to provide a method and device for screening rare information pages that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for screening rare information pages is provided, comprising: performing word segmentation on pages crawled by a search engine to generate a plurality of segmented words; searching for rare words among the plurality of segmented words; and selecting the pages that contain the rare words as rare information pages.
Optionally, searching for rare words among the plurality of segmented words comprises: for each segmented word, looking up in the index the number of pages that contain the segmented word; and marking as a rare word each segmented word whose page count is less than a first quantity threshold.
Optionally, marking as a rare word each segmented word whose page count is less than the first quantity threshold comprises: marking as a rare word each segmented word whose page count is less than the first quantity threshold and greater than a second quantity threshold, wherein the second quantity threshold is less than the first quantity threshold.
Optionally, searching for rare words among the plurality of segmented words comprises: for each segmented word, determining the number of pages that contain the segmented word; calculating an inverse document frequency (IDF) value for each segmented word from the determined number; and marking as a rare word each segmented word whose IDF value is greater than a specified numerical threshold.
Optionally, before determining, for each segmented word, the number of pages that contain it, the method further comprises: calculating the frequency with which each segmented word occurs in the page where it is located; and removing from the plurality of segmented words those whose frequency of occurrence is greater than a specified frequency threshold.
Optionally, performing word segmentation on pages crawled by the search engine comprises: extracting the content text of the pages crawled by the search engine; and performing word segmentation on the extracted content text.
Optionally, the rare words comprise at least one of the following: a personal name, a place name, a name, or a product model.
Optionally, after the pages that contain the rare words are selected as rare information pages, the method further comprises: performing screening processing on the rare information pages; and building an index from the rare information pages that remain after the screening processing, so that the search engine can provide users with a service for retrieving rare information.
Optionally, the screening processing comprises at least one of the following: spam removal, deduplication, and anti-cheating.
According to another aspect of the present invention, a device for screening rare information pages is also provided, comprising:
a word segmenter, adapted to perform word segmentation on pages crawled by a search engine to generate a plurality of segmented words;
a rare word finder, adapted to search for rare words among the plurality of segmented words; and
a rare information page screener, adapted to select the pages that contain the rare words as rare information pages.
Optionally, the rare word finder is further adapted to: for each segmented word, look up in the index the number of pages that contain the segmented word; and mark as a rare word each segmented word whose page count is less than a first quantity threshold.
Optionally, the rare word finder is further adapted to: mark as a rare word each segmented word whose page count is less than the first quantity threshold and greater than a second quantity threshold, wherein the second quantity threshold is less than the first quantity threshold.
Optionally, the rare word finder is further adapted to: for each segmented word, determine the number of pages that contain the segmented word; calculate an IDF value for each segmented word from the determined number; and mark as a rare word each segmented word whose IDF value is greater than a specified numerical threshold.
Optionally, before determining, for each segmented word, the number of pages that contain it, the rare word finder is further adapted to: calculate the frequency with which each segmented word occurs in the page where it is located; and remove from the plurality of segmented words those whose frequency of occurrence is greater than a specified frequency threshold.
Optionally, the word segmenter is further adapted to: extract the content text of the pages crawled by the search engine; and perform word segmentation on the extracted content text.
Optionally, the rare words comprise at least one of the following: a personal name, a place name, a name, or a product model.
Optionally, after the pages that contain the rare words are selected as rare information pages, the rare information page screener is further adapted to: perform screening processing on the rare information pages; and build an index from the rare information pages that remain after the screening processing, so that the search engine can provide users with a service for retrieving rare information.
Optionally, the screening processing comprises at least one of the following: spam removal, deduplication, and anti-cheating.
The technical solution provided by the present invention screens rare information pages from the pages crawled by a search engine, so coverage is wide and users can be given richer data support. Further, the crawled pages are processed by means such as word segmentation, rare words are found in them, and rare information pages are then selected. Search results containing rare information pages can therefore be provided accurately and efficiently when a user queries, which addresses the problem that existing search engines easily overlook such pages during search, for reasons such as the low degree of repetition of their content. In addition, because the rare information pages are screened according to rare words, the selected pages are of relatively high quality, the information they provide can meet users' information needs, search accuracy is high, and the user's search experience is improved.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
From the following detailed description of specific embodiments, read in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the present invention.
Brief Description of the Drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for screening rare information pages according to an embodiment of the invention; and
Fig. 2 shows a schematic structural diagram of a device for screening rare information pages according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
To solve the above technical problem, an embodiment of the present invention provides a method for screening rare information pages. Fig. 1 shows a flowchart of the method according to an embodiment of the invention. Referring to Fig. 1, the method comprises at least steps S102 to S106.
Step S102: perform word segmentation on pages crawled by a search engine to generate a plurality of segmented words.
Step S104: search for rare words among the plurality of segmented words.
Step S106: select the pages that contain the rare words as rare information pages.
For step S102 above, in which word segmentation is performed on pages crawled by the search engine to generate a plurality of segmented words, an embodiment of the invention provides a preferred approach: extract the content text of the crawled pages, then perform word segmentation on the extracted content text. Here, extracting the content text of a page means filtering out the programming statements in the page: after the HTML (Hypertext Markup Language) tags, scripts, and the like are all removed, the text that remains represents the substantive content. This is not only the body (content) text; it also includes the text of the title, abstract (summary), author, time, and similar fields. Word segmentation then divides a passage of content text into conventional words. For example, "Beijing Jiuxianqiao Road 798 Art District" becomes "Beijing / Jiuxianqiao Road / 798 / Art District /" after segmentation.
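The extraction step described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the class and function names are hypothetical, only Python's standard library is used, and `segment` is a deliberately simple stand-in that splits on non-word characters. A real system handling Chinese text would need a proper word segmenter, which the source does not specify.

```python
import re
from html.parser import HTMLParser


class ContentTextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text from the title, body, and other visible elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_content_text(html):
    """Return the page text with markup and scripts removed."""
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def segment(text):
    """Placeholder segmenter (assumption): split on non-word characters,
    keeping dotted or hyphenated items such as domain names together."""
    return re.findall(r"\w+(?:[.\-]\w+)*", text)
```

Note that the title text is kept along with the body text, which matches the statement above that the content text is not limited to the body of the page.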
After step S102 has performed word segmentation on the crawled pages and generated a plurality of segmented words, step S104 searches for rare words among them. An embodiment of the invention provides a preferred scheme in which, for each segmented word, the number of pages in the index that contain the word is looked up, and each segmented word whose page count is less than the first quantity threshold is then marked as a rare word. A rare word here may be a segmented word that carries content-word meaning and can describe event content. From a grammatical point of view, the segmented words that occur most frequently in sentences usually lack content-word meaning, for example common modal particles, conjunctions, auxiliary words, and category names. Modal particles are words that strengthen the tone of an utterance; they have no concrete meaning of their own and serve only to strengthen the tone. Conjunctions connect different subjects, predicates, objects, and so on; common conjunctions include "and" and "or". Auxiliary words usually refer to particles attached to the predicate, for example a structural particle that accompanies a verb. Category names are names of a class of things that cannot describe specific event content and therefore do not distinguish one thing from another, such as "company", "team", or "association". In addition, "event" here is a broad concept that covers, for example, time events, place events, person events, contact-information events, and so on. Because rare words can describe event content, rare words may accordingly be words with event meaning such as times, places, persons, telephone numbers, and e-mail addresses.
Further, because of typing errors or other causes, pages can easily contain "false" rare information (for example, a misspelled English word), and in practice such information needs to be weeded out. For this purpose, in a preferred scheme of the invention, each segmented word whose page count is less than the first quantity threshold and greater than the second quantity threshold is marked as a rare word, where the second quantity threshold is less than the first quantity threshold. The values of the two thresholds can be determined by analyzing sampled data. Once such an interval is set, false rare information is largely excluded.
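The page-count test with its two thresholds might look like the sketch below. It assumes, purely for illustration, that the index is an in-memory mapping from a page ID to that page's list of segmented words; the function names and the threshold values used in the usage note are hypothetical and not taken from the patent.

```python
from collections import defaultdict


def page_counts(pages):
    """Count, for every segmented word, how many pages contain it."""
    counts = defaultdict(int)
    for words in pages.values():
        for word in set(words):  # each page counts once per word
            counts[word] += 1
    return dict(counts)


def find_rare_words(pages, first_threshold, second_threshold):
    """Mark words whose page count lies strictly between the two thresholds."""
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be less than the first")
    counts = page_counts(pages)
    return {
        word
        for word, n in counts.items()
        if second_threshold < n < first_threshold
    }


def screen_rare_pages(pages, rare_words):
    """Select the IDs of pages that contain at least one rare word."""
    return {
        page_id
        for page_id, words in pages.items()
        if rare_words.intersection(words)
    }
```

With a first threshold of 3 and a second threshold of 1, a word found on two pages is marked as rare, while a misspelling found on a single page falls below the lower bound and is ignored. In practice both thresholds would be chosen by analyzing sampled data, as the text above notes.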
In another preferred scheme of the invention, searching for rare words in step S104 may be implemented as follows: for each of the segmented words, determine the number of pages that contain it, calculate an IDF value for each segmented word from the determined number, and then mark as a rare word each segmented word whose IDF value is greater than a specified numerical threshold. Note that IDF (inverse document frequency) is a term from text retrieval. It refers to the reciprocal of the probability that a word occurs across all documents (usually with the logarithm then taken); the higher a word's IDF value, the less common the word. In the embodiments provided by the invention, therefore, the reciprocal of the probability that a segmented word occurs across all pages can be calculated for each segmented word (usually with the logarithm then taken). The higher the resulting IDF value, the less common the segmented word, so segmented words whose IDF value is greater than the specified numerical threshold can be marked as rare words. Further, an embodiment of the invention also provides an anti-cheating measure: before the number of pages containing each segmented word is determined, the frequency with which each segmented word occurs in the page where it is located is calculated, and segmented words whose frequency of occurrence is greater than a specified frequency threshold are then removed. For example, if only 2 pages contain a certain segmented word but the word occurs 50 and 60 times in those two pages respectively, the segmented word can be removed.
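The IDF-based variant, together with the anti-cheating filter, can be sketched as below. Several points are assumptions made for illustration: the IDF is taken as idf(w) = log(N / n_w), where N is the total number of pages and n_w is the number of pages containing w, with a natural logarithm (the source only says the reciprocal is usually followed by a logarithm); a word is treated as stuffed if its count in any single page exceeds the frequency threshold; and all function names and threshold values are hypothetical.

```python
import math
from collections import Counter


def drop_stuffed_words(pages, frequency_threshold):
    """Remove words that occur more than frequency_threshold times in a page."""
    stuffed = set()
    for words in pages.values():
        for word, n in Counter(words).items():
            if n > frequency_threshold:
                stuffed.add(word)
    return {
        page_id: [w for w in words if w not in stuffed]
        for page_id, words in pages.items()
    }


def idf_values(pages):
    """Compute log(N / n_w) for every word, N being the number of pages."""
    total = len(pages)
    doc_freq = Counter()
    for words in pages.values():
        doc_freq.update(set(words))  # each page counts once per word
    return {word: math.log(total / n) for word, n in doc_freq.items()}


def rare_words_by_idf(pages, idf_threshold, frequency_threshold):
    """Apply the anti-cheating filter first, then the IDF threshold."""
    cleaned = drop_stuffed_words(pages, frequency_threshold)
    return {
        word
        for word, value in idf_values(cleaned).items()
        if value > idf_threshold
    }
```

Running the frequency filter before counting pages mirrors the order given in the text: a word repeated dozens of times on one or two pages would otherwise receive a high IDF value and be mistaken for genuinely rare information.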
After step S106 has selected the pages containing rare words as rare information pages, an embodiment of the invention may also perform screening processing on the rare information pages and then build an index from the pages that remain, so that the search engine can provide users with a service for retrieving rare information. The screening processing here may be deduplication, spam removal, anti-cheating, and so on. Screening is performed at this point because many rare information pages are low-quality pages, so the spam and duplicate pages among them need to be filtered out. In a preferred scheme of the invention, spam removal can be achieved through deduplication. Deduplication of ordinary pages generally computes a signature for the page (a longest-sentence signature is one kind), and pages with identical signatures are treated as the same page. Deduplication of rare information pages additionally computes a signature over the rare words: a page signature is generated for the rare information page, a word signature is generated for the rare words in that page, and among multiple pages whose page signature and word signature are both identical, only one is retained. This approach safeguards the quality of the rare information pages while retaining more pages that contain different rare words, so that richer search results can be provided when users query.
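The two-signature deduplication can be illustrated with the sketch below. The source names a longest-sentence signature as one kind of page signature but gives no further detail, so the choices here are assumptions: sentences are split on common sentence-ending punctuation, SHA-256 stands in for whatever hash a real system would use, and the word signature hashes the sorted set of rare words found in the page. All function names are hypothetical.

```python
import hashlib
import re


def _digest(text):
    """Hash helper; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def page_signature(text):
    """Longest-sentence signature: hash the longest sentence in the page."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？]", text) if s.strip()]
    longest = max(sentences, key=len) if sentences else ""
    return _digest(longest)


def word_signature(words, rare_words):
    """Hash the sorted set of rare words that appear in the page."""
    present = sorted(set(words) & rare_words)
    return _digest("\x1f".join(present))


def deduplicate(pages, rare_words):
    """Keep one page per (page signature, word signature) pair.

    pages maps a page ID to a (content_text, segmented_words) tuple.
    """
    seen = set()
    kept = []
    for page_id, (text, words) in pages.items():
        key = (page_signature(text), word_signature(words, rare_words))
        if key not in seen:
            seen.add(key)
            kept.append(page_id)
    return kept
```

Because the key combines both signatures, two pages with the same longest sentence are both kept when they contain different rare words, which is the behavior the paragraph above describes.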
The various implementations of each step in the embodiment shown in Fig. 1 have been described above. The method for screening rare information pages provided by the embodiments of the invention is further described below through a specific embodiment. In this embodiment, a page crawled by the search engine contains the following passage: "Rename China news, February 19: it is reported that the 360 Search open platform oneBox (application box) was formally launched recently. The platform uses the so second-level domain onebox.so.com, and provides special display types for many kinds of vertical information such as news, celebrities, film and television, travel ticketing, and medical health."
First, word segmentation is performed on the crawled page. After segmentation the passage becomes "Rename / China / February / 19th / news / , / it is reported / , / recently / 360 Search / 's / open platform / oneBox / ( / application box / ) / formally / launched / , / this / platform / uses / so / second-level domain / onebox.so.com / , / this / platform / provides / news / , / celebrities / , / film and television / , / travel / ticketing / , / medical / health / etc. / many kinds / vertical information / 's / special types / display / . /", where "/" serves as the separator between segmented words.
Second, the types of segmented words that occur with high frequency are removed, for example modal particles, conjunctions, auxiliary words, and category names that lack content-word meaning. These are removed to simplify subsequent processing; otherwise a large amount of computation would be required.
Next, for each remaining segmented word, the number of pages in the index that contain it is looked up, and each segmented word whose page count is less than the first quantity threshold and greater than the second quantity threshold is marked as a rare word. Alternatively, for each remaining segmented word, the number of pages in the index that contain it is looked up, an IDF value is calculated for each segmented word from the determined number, and each segmented word whose IDF value is greater than the specified numerical threshold is marked as a rare word. A rare word here may be a segmented word that carries content-word meaning and can describe event content; for example, rare words may be words with event meaning such as times, places, persons, telephone numbers, and e-mail addresses. Further, an embodiment of the invention also provides an anti-cheating measure: before the number of pages containing each segmented word is determined, the frequency with which each segmented word occurs in the page where it is located is calculated, and segmented words whose frequency of occurrence is greater than the specified frequency threshold are then removed. For example, if only 2 pages contain a certain segmented word but the word occurs 50 and 60 times in those two pages respectively, the segmented word can be removed.
After that, the pages that contain rare words are selected as rare information pages.
Finally, screening processing is performed on the rare information pages, and an index is then built from the pages that remain, so that the search engine can provide users with a service for retrieving rare information. The screening processing here may be deduplication, spam removal, anti-cheating, and so on. Screening is performed at this point because many rare information pages are low-quality pages, so the spam and duplicate pages among them need to be filtered out.
Based on the same inventive concept, an embodiment of the invention also provides a device for screening rare information pages, which implements the screening method described above.
Fig. 2 shows a schematic structural diagram of a device for screening rare information pages according to an embodiment of the invention. Referring to Fig. 2, the device comprises at least: a word segmenter 210, a rare word finder 220, and a rare information page screener 230.
The functions of the components of the device and the connections between them are described below:
the word segmenter 210 is adapted to perform word segmentation on pages crawled by a search engine to generate a plurality of segmented words;
the rare word finder 220 is coupled to the word segmenter 210 and is adapted to search for rare words among the plurality of segmented words;
the rare information page screener 230 is coupled to the rare word finder 220 and is adapted to select the pages that contain the rare words as rare information pages.
In one embodiment of the invention, the word segmenter 210 is further adapted to extract the content text of the pages crawled by the search engine and then perform word segmentation on the extracted content text. Here, extracting the content text of a page means filtering out the programming statements in the page: after the HTML tags, scripts, and the like are all removed, the text that remains represents the substantive content. This is not only the body (content) text; it also includes the text of the title, abstract (summary), author, time, and similar fields. Word segmentation then divides a passage of content text into conventional words. For example, "Beijing Jiuxianqiao Road 798 Art District" becomes "Beijing / Jiuxianqiao Road / 798 / Art District /" after segmentation.
In one embodiment of the invention, the rare word finder 220 is further adapted to look up, for each segmented word, the number of pages in the index that contain it, and then mark as a rare word each segmented word whose page count is less than the first quantity threshold. A rare word here may be a segmented word that carries content-word meaning and can describe event content. Because rare words can describe event content, rare words may accordingly be words with event meaning such as times, places, persons, telephone numbers, and e-mail addresses.
In one embodiment of the invention, it is taken into account that, because of typing errors or other causes, pages can easily contain "false" rare information (for example, a misspelled English word), and that in practice such information needs to be weeded out. For this purpose, the rare word finder 220 is further adapted to mark as a rare word each segmented word whose page count is less than the first quantity threshold and greater than the second quantity threshold, where the second quantity threshold is less than the first quantity threshold. The values of the two thresholds can be determined by analyzing sampled data. Once such an interval is set, false rare information is largely excluded.
In one embodiment of the invention, the rare word finder 220 is further adapted to determine, for each segmented word, the number of pages that contain it, calculate an IDF value for each segmented word from the determined number, and then mark as a rare word each segmented word whose IDF value is greater than the specified numerical threshold.
In one embodiment of the invention, as an anti-cheating measure, before determining for each segmented word the number of pages that contain it, the rare word finder 220 is further adapted to calculate the frequency with which each segmented word occurs in the page where it is located, and then remove the segmented words whose frequency of occurrence is greater than the specified frequency threshold.
In one embodiment of the invention, after the rare information page screener 230 has selected the pages that contain rare words as rare information pages, it is further adapted to perform screening processing on the rare information pages and then build an index from the pages that remain, so that the search engine can provide users with a service for retrieving rare information. The screening processing here may be deduplication, spam removal, anti-cheating, and so on. Screening is performed at this point because many rare information pages are low-quality pages, so the spam and duplicate pages among them need to be filtered out.
With any one of the above preferred embodiments, or a combination of several of them, the embodiments of the invention can achieve the following beneficial effects:
The technical solution provided by the present invention screens rare information pages from the pages crawled by a search engine, so coverage is wide and users can be given richer data support. Further, the crawled pages are processed by means such as word segmentation, rare words are found in them, and rare information pages are then selected. Search results containing rare information pages can therefore be provided accurately and efficiently when a user queries, which addresses the problem that existing search engines easily overlook such pages during search, for reasons such as the low degree of repetition of their content. In addition, because the rare information pages are screened according to rare words, the selected pages are of relatively high quality, the information they provide can meet users' information needs, search accuracy is high, and the user's search experience is improved.
The invention also discloses:
A1. A method for screening rare information pages, comprising:
Performing word segmentation on pages crawled by a search engine to generate multiple word segments;
Searching for rare words among the multiple word segments;
Selecting the pages that contain the rare words as rare information pages.
A2. The method according to A1, wherein searching for rare words among the multiple word segments comprises:
For each word segment, looking up in an index the number of pages that contain the word segment;
Determining the word segments whose corresponding page count is less than a first quantity threshold, and marking them as rare words.
A3. The method according to any one of A1-A2, wherein determining the word segments whose corresponding page count is less than the first quantity threshold and marking them as rare words comprises:
Determining the word segments whose corresponding page count is less than the first quantity threshold and greater than a second quantity threshold, and marking them as rare words, wherein the second quantity threshold is less than the first quantity threshold.
A4. The method according to any one of A1-A3, wherein searching for rare words among the multiple word segments comprises:
For each word segment, determining the number of pages that contain the word segment;
Calculating the inverse document frequency (IDF) value of each word segment according to the determined number;
Finding, among the multiple word segments, the word segments whose IDF value is greater than a specified numerical threshold, and marking them as rare words.
A5. The method according to any one of A1-A4, wherein, for each word segment, before determining the number of pages that contain the word segment, the method further comprises:
Calculating the frequency of occurrence of each of the multiple word segments in the page in which it appears;
Selecting, from the multiple word segments, the word segments whose frequency of occurrence is greater than a specified frequency threshold.
A6. The method according to any one of A1-A5, wherein performing word segmentation on pages crawled by a search engine comprises:
Extracting the body text of the pages crawled by the search engine;
Performing word segmentation on the extracted body text.
A7. The method according to any one of A1-A6, wherein the rare word comprises at least one of the following: a personal name, a place name, a designation, a product model.
A8. The method according to any one of A1-A7, wherein, after selecting the pages that contain the rare words as rare information pages, the method further comprises:
Performing screening processing on the rare information pages;
Building an index of the rare information pages after the screening processing, so that the search engine can provide users with a service for retrieving rare information.
A9. The method according to any one of A1-A8, wherein the screening processing comprises at least one of the following: spam removal, de-duplication, anti-cheating.
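The IDF-based variant, together with the in-page frequency pre-filter, can be sketched as follows. This is a sketch under stated assumptions only: the text does not give an IDF formula, so the common form log(N / df) is assumed, in-page frequency is taken as a segment's count divided by the page's segment total, and the names `frequent_segments`, `rare_words_by_idf`, `min_freq` and `min_idf` are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def frequent_segments(segments: list[str], min_freq: float) -> set[str]:
    """Keep segments whose relative frequency within one page exceeds min_freq."""
    if not segments:
        return set()
    counts = Counter(segments)
    total = len(segments)
    return {w for w, c in counts.items() if c / total > min_freq}


def rare_words_by_idf(
    segmented_pages: dict[str, list[str]], min_freq: float, min_idf: float
) -> set[str]:
    """Mark as rare the pre-filtered segments whose IDF exceeds min_idf."""
    # Step 1: per page, keep only segments frequent enough within that page.
    kept = {pid: frequent_segments(segs, min_freq) for pid, segs in segmented_pages.items()}
    candidates = set().union(*kept.values())

    # Step 2: count the pages that contain each candidate segment.
    page_sets = [set(segs) for segs in segmented_pages.values()]
    n_pages = len(page_sets)
    rare = set()
    for word in candidates:
        df = sum(1 for segs in page_sets if word in segs)
        # Step 3: assumed formula IDF = log(N / df); df >= 1 for any candidate.
        if math.log(n_pages / df) > min_idf:
            rare.add(word)
    return rare


segmented_pages = {
    "p1": ["news", "news", "weather", "today"],
    "p2": ["news", "sports", "news", "today"],
    "p3": ["zx81", "zx81", "news", "manual"],
}
rare = rare_words_by_idf(segmented_pages, min_freq=0.3, min_idf=1.0)
print(rare)  # {'zx81'}
```

Here "news" passes the frequency pre-filter but appears on every page, so its IDF is zero. "zx81" is frequent on its own page and appears on only one of three pages, so its IDF of log(3) exceeds the threshold.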
B10. A device for screening rare information pages, comprising:
A word segment generator, adapted to perform word segmentation on pages crawled by a search engine to generate multiple word segments;
A rare word finder, adapted to search for rare words among the multiple word segments;
A rare information page screener, adapted to select the pages that contain the rare words as rare information pages.
B11. The device according to B10, wherein the rare word finder is further adapted to:
For each word segment, look up in an index the number of pages that contain the word segment;
Determine the word segments whose corresponding page count is less than a first quantity threshold, and mark them as rare words.
B12. The device according to any one of B10-B11, wherein the rare word finder is further adapted to:
Determine the word segments whose corresponding page count is less than the first quantity threshold and greater than a second quantity threshold, and mark them as rare words, wherein the second quantity threshold is less than the first quantity threshold.
B13. The device according to any one of B10-B12, wherein the rare word finder is further adapted to:
For each word segment, determine the number of pages that contain the word segment;
Calculate the inverse document frequency (IDF) value of each word segment according to the determined number;
Find, among the multiple word segments, the word segments whose IDF value is greater than a specified numerical threshold, and mark them as rare words.
B14. The device according to any one of B10-B13, wherein, for each word segment, before determining the number of pages that contain the word segment, the rare word finder is further adapted to:
Calculate the frequency of occurrence of each of the multiple word segments in the page in which it appears;
Select, from the multiple word segments, the word segments whose frequency of occurrence is greater than a specified frequency threshold.
B15. The device according to any one of B10-B14, wherein the word segment generator is further adapted to:
Extract the body text of the pages crawled by the search engine;
Perform word segmentation on the extracted body text.
B16. The device according to any one of B10-B15, wherein the rare word comprises at least one of the following: a personal name, a place name, a designation, a product model.
B17. The device according to any one of B10-B16, wherein, after selecting the pages that contain the rare words as rare information pages, the rare information page screener is further adapted to:
Perform screening processing on the rare information pages;
Build an index of the rare information pages after the screening processing, so that the search engine can provide users with a service for retrieving rare information.
B18. The device according to any one of B10-B17, wherein the screening processing comprises at least one of the following: spam removal, de-duplication, anti-cheating.
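One possible software arrangement of the three device components is sketched below for illustration. The class names `SegmentGenerator`, `RareWordFinder` and `RarePageScreener` are hypothetical and only mirror the roles listed above. The segmenter is a whitespace stand-in, and of the screening options only exact de-duplication (by content hash) is shown. Spam removal and anti-cheating are omitted.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


class SegmentGenerator:
    """Splits page text into word segments (whitespace split as a stand-in)."""

    def generate(self, text: str) -> list[str]:
        return text.lower().split()


class RareWordFinder:
    """Marks segments contained in fewer than max_pages pages as rare."""

    def __init__(self, max_pages: int) -> None:
        self.max_pages = max_pages

    def find(self, segmented: dict[str, list[str]]) -> set[str]:
        page_count: dict[str, int] = defaultdict(int)
        for segs in segmented.values():
            for word in set(segs):  # count each page once per word
                page_count[word] += 1
        return {w for w, n in page_count.items() if n < self.max_pages}


class RarePageScreener:
    """Selects pages containing a rare word, drops exact duplicates, builds an index."""

    def screen(
        self,
        pages: dict[str, str],
        segmented: dict[str, list[str]],
        rare_words: set[str],
    ) -> dict[str, list[str]]:
        index: dict[str, list[str]] = defaultdict(list)
        seen_hashes: set[str] = set()
        for page_id, text in pages.items():
            hits = rare_words.intersection(segmented[page_id])
            if not hits:
                continue  # page contains no rare word
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue  # exact duplicate of a page already indexed
            seen_hashes.add(digest)
            for word in sorted(hits):
                index[word].append(page_id)
        return dict(index)


pages = {
    "p1": "cheap flights and hotels",
    "p2": "cheap hotels and flights",
    "p3": "flights and cheap hotels",
    "p4": "cheap zx81 repair manual",
    "p5": "cheap zx81 repair manual",
}
generator = SegmentGenerator()
segmented = {pid: generator.generate(text) for pid, text in pages.items()}
rare_words = RareWordFinder(max_pages=3).find(segmented)
index = RarePageScreener().screen(pages, segmented, rare_words)
print(index)  # {'manual': ['p4'], 'repair': ['p4'], 'zx81': ['p4']}
```

Page p5 is byte-for-byte identical to p4, so the screener indexes only p4. A production system would more likely use near-duplicate detection, which this sketch does not attempt.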
Numerous specific details are set forth in the description provided here. It will be understood, however, that embodiments of the invention may be practised without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This manner of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in a device of an embodiment can be changed adaptively and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment can be combined into a single module, unit, or component, and can in addition be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described here include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the rare information page screening device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
Thus far, those skilled in the art will recognize that, although multiple exemplary embodiments of the invention have been shown and described in detail here, many other variations or modifications consistent with the principles of the invention can still be determined or derived directly from the disclosed content without departing from the spirit and scope of the invention. The scope of the invention should therefore be understood and deemed to cover all such other variations or modifications.

Claims (10)

1. A method for screening rare information pages, comprising:
Performing word segmentation on pages crawled by a search engine to generate multiple word segments;
Searching for rare words among the multiple word segments;
Selecting the pages that contain the rare words as rare information pages.
2. The method according to claim 1, wherein searching for rare words among the multiple word segments comprises:
For each word segment, looking up in an index the number of pages that contain the word segment;
Determining the word segments whose corresponding page count is less than a first quantity threshold, and marking them as rare words.
3. The method according to any one of claims 1-2, wherein determining the word segments whose corresponding page count is less than the first quantity threshold and marking them as rare words comprises:
Determining the word segments whose corresponding page count is less than the first quantity threshold and greater than a second quantity threshold, and marking them as rare words, wherein the second quantity threshold is less than the first quantity threshold.
4. The method according to any one of claims 1-3, wherein searching for rare words among the multiple word segments comprises:
For each word segment, determining the number of pages that contain the word segment;
Calculating the inverse document frequency (IDF) value of each word segment according to the determined number;
Finding, among the multiple word segments, the word segments whose IDF value is greater than a specified numerical threshold, and marking them as rare words.
5. The method according to any one of claims 1-4, wherein, for each word segment, before determining the number of pages that contain the word segment, the method further comprises:
Calculating the frequency of occurrence of each of the multiple word segments in the page in which it appears;
Selecting, from the multiple word segments, the word segments whose frequency of occurrence is greater than a specified frequency threshold.
6. The method according to any one of claims 1-5, wherein performing word segmentation on pages crawled by a search engine comprises:
Extracting the body text of the pages crawled by the search engine;
Performing word segmentation on the extracted body text.
7. The method according to any one of claims 1-6, wherein the rare word comprises at least one of the following: a personal name, a place name, a designation, a product model.
8. The method according to any one of claims 1-7, wherein, after selecting the pages that contain the rare words as rare information pages, the method further comprises:
Performing screening processing on the rare information pages;
Building an index of the rare information pages after the screening processing, so that the search engine can provide users with a service for retrieving rare information.
9. The method according to any one of claims 1-8, wherein the screening processing comprises at least one of the following: spam removal, de-duplication, anti-cheating.
10. A device for screening rare information pages, comprising:
A word segment generator, adapted to perform word segmentation on pages crawled by a search engine to generate multiple word segments;
A rare word finder, adapted to search for rare words among the multiple word segments;
A rare information page screener, adapted to select the pages that contain the rare words as rare information pages.
CN201410759482.9A 2014-12-10 2014-12-10 Method and device for screening scarce information pages Pending CN104484388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410759482.9A CN104484388A (en) 2014-12-10 2014-12-10 Method and device for screening scarce information pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410759482.9A CN104484388A (en) 2014-12-10 2014-12-10 Method and device for screening scarce information pages

Publications (1)

Publication Number Publication Date
CN104484388A true CN104484388A (en) 2015-04-01

Family

ID=52758929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410759482.9A Pending CN104484388A (en) 2014-12-10 2014-12-10 Method and device for screening scarce information pages

Country Status (1)

Country Link
CN (1) CN104484388A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016000511A1 (en) * 2014-06-30 2016-01-07 北京奇虎科技有限公司 Method and apparatus for mining rare resource of internet

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101042709A (en) * 2007-04-11 2007-09-26 芦树鹏 Active mode search
CN101968801A (en) * 2010-09-21 2011-02-09 上海大学 Method for extracting key words of single text
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN103136300A (en) * 2011-12-05 2013-06-05 北京百度网讯科技有限公司 Recommendation method and device of text related subject
CN103258000A (en) * 2013-03-29 2013-08-21 北界创想(北京)软件有限公司 Method and device for clustering high-frequency keywords in webpages
CN103324745A (en) * 2013-07-04 2013-09-25 微梦创科网络科技(中国)有限公司 Text garbage identifying method and system based on Bayesian model
CN104142918A (en) * 2014-07-31 2014-11-12 天津大学 Short text clustering and hotspot theme extraction method based on TF-IDF characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
白凡: "改进的K近邻算法在网页文本分类中的应用", 《中国优秀硕士学位论文全文数据库》 *

Similar Documents

Publication Publication Date Title
CN102831248B (en) Network focus method for digging and device
CN104077402B (en) Data processing method and data handling system
CN102968297B (en) The software management system of mobile terminal and method
CN104715064B (en) It is a kind of to realize the method and server that keyword is marked on webpage
CN104951512A (en) Public sentiment data collection method and system based on Internet
CN104462501A (en) Knowledge graph construction method and device based on structural data
CN102710795B (en) Hotspot collecting method and device
CN104462508A (en) Character relation search method and device based on knowledge graph
CN102207961B (en) Automatic web page classification method and device
CN105468744A (en) Big data platform for realizing tax public opinion analysis and full text retrieval
CN102982157A (en) Device and method used for mining microblog hot topics
CN103279476B (en) The detection method of a kind of WEB application system sensitive word and system
CN106021418A (en) News event clustering method and device
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN104618132A (en) Generation method and generation device for application program recognition rule
CN104462504A (en) Method and device for providing reasoning process data in search
CN103455758A (en) Method and device for identifying malicious website
CN109753656A (en) Data processing method, device and storage medium
CN103530336A (en) Equipment and method for identifying invalid parameters in URLs
CN105095391A (en) Device and method for identifying organization name by word segmentation program
CN104933171A (en) Method and device for associating data of interest point
CN112948664A (en) Method and system for automatically processing sensitive words
Barbaresi Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources
CN103399874B (en) The method and apparatus that webpage capture under same domain name is optimized
CN103838865B (en) For excavating the method and device of ageing kind of subpage

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150401