CN105528422B - Topic crawler processing method and device

Info

Publication number: CN105528422B
Application number: CN201510890437.1A
Authority: CN
Inventors: 张晨; 邵小亮; 谢隆飞; 王全礼
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2015-12-07
Filing date: 2015-12-07
Publication date: 2019-04-26
Anticipated expiration: 2035-12-07
Also published as: CN105528422A

Abstract

The present invention provides a topic crawler processing method and device. After a web document is obtained, at least page title feature information, keyword feature information from the meta information, description feature information from the meta information, and body text feature information are extracted from it. Topic relevance analysis is performed on the web document based on this feature information to obtain a classification result. When the web document is stored into a web document set based on the classification result, a topic classification model is trained according to the increment of web documents in the set. The topic classification model associated with the topic crawler can therefore be trained while the crawler is crawling, so that the model tracks the search topic more closely. When the topic crawler then crawls based on the model, the crawled content is more relevant to the search topic, which improves the precision and recall of crawling.

Description

A topic crawler processing method and device

Technical field

The invention belongs to the technical field of web crawlers and, more specifically, relates to a topic crawler processing method and device.

Background art

A web crawler is a program that "browses the network automatically", or in other words a kind of network robot. Web crawlers are now widely used by internet search engines and similar websites. A crawler can automatically collect all the page content it is able to access, so that users can retrieve the information they need more quickly, and the collected page content can be processed further by the search engine or website, for example so that the search engine or website can be trained on it.

The topic crawler was developed on the basis of the web crawler and is one kind of web crawler. It is a web crawler with a topic discrimination module, and it crawls the network information on the internet that is relevant to a given search topic. At present topic crawlers are mainly built on keywords or regular expressions, and with this approach the crawled content suffers from low recall.

Summary of the invention

In view of this, the purpose of the present invention is to provide a topic crawler processing method for improving recall. The technical solution is as follows:

The present invention provides a topic crawler processing method, the method comprising:

obtaining the web document corresponding to a uniform resource locator in a queue to be crawled;

extracting feature information from the web document, wherein the feature information includes at least page title feature information, keyword feature information from the meta information, description feature information from the meta information, and body text feature information;

performing topic relevance classification on the web document based on the feature information to obtain a classification result;

determining, based on the classification result, whether to store the web document into a web document set;

when the web document is stored into the web document set based on the classification result, training a topic classification model associated with the topic crawler according to the increment of web documents in the web document set.

Preferably, after obtaining the web document corresponding to the uniform resource locator in the queue to be crawled, the method further comprises: judging whether the page corresponding to the uniform resource locator is a navigation page;

if so, parsing the navigation page, obtaining the uniform resource locators in the navigation page, and writing the obtained uniform resource locators into the queue to be crawled;

if not, triggering the step of extracting feature information from the web document.

Preferably, extracting feature information from the web document comprises:

segmenting the title of the web document into words to obtain a first word segmentation result, and obtaining a unigram set of the title based on the first word segmentation result;

using a first feature function to determine the relationship between each word in the title and the unigram set of the title, obtaining a title feature vector, the title feature vector indicating the relationship between each word in the title and the unigram set;

segmenting the keyword meta information in the meta information of the web document to obtain a second word segmentation result, and obtaining a unigram set of the keyword meta information based on the second word segmentation result;

using a second feature function to determine the relationship between each keyword in the keyword meta information and the unigram set of the keyword meta information, obtaining a keyword feature vector, the keyword feature vector indicating the relationship between each keyword in the keyword meta information and the unigram set of the keyword meta information;

segmenting the description meta information in the meta information of the web document to obtain a third word segmentation result, and obtaining a unigram set of the description meta information based on the third word segmentation result;

using a third feature function to determine the relationship between each page description word in the description meta information and the unigram set of the description meta information, obtaining a description feature vector, the description feature vector indicating the relationship between each page description word in the description meta information and the unigram set of the description meta information;

processing the body text of the web document to obtain a unigram set of the body text and a bigram set of the body text;

using a fourth feature function to determine the relationship between each keyword in the body text and the unigram set of the body text, obtaining a first feature vector of the body text, the first feature vector indicating the relationship between each keyword in the body text and the unigram set of the body text;

using a fifth feature function to determine the relationship between each keyword in the body text and the bigram set of the body text, obtaining a second feature vector of the body text, the second feature vector indicating the relationship between each keyword in the body text and the bigram set of the body text.

Preferably, determining, based on the classification result, whether to store the web document into the web document set comprises:

when the classification result indicates that the web document is relevant to the search topic, judging whether the topic relevance probability of the web document is greater than a topic relevance probability threshold, wherein the search topic is the topic crawled by the topic crawler;

when the topic relevance probability of the web document is judged to be greater than the topic relevance probability threshold, storing the web document into the web document set;

when the classification result indicates that the web document is not relevant to the search topic, judging whether the ratio of the number of topic-relevant documents to the number of non-topic-relevant documents in the web document set is less than a topic relevance ratio threshold, wherein the number of topic-relevant documents is the number of web documents relevant to the search topic, and the number of non-topic-relevant documents is the number of web documents not relevant to the search topic;

when the ratio of the number of topic-relevant documents to the number of non-topic-relevant documents in the web document set is judged to be less than the topic relevance ratio threshold, storing the web document into the web document set.
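The storage decision above can be sketched as a small function. This is a minimal sketch, not the patented implementation: the function name, the parameter names, and the default threshold values are assumptions chosen for illustration. The ratio test is written as a cross-multiplication to avoid dividing by zero; how an empty set of non-topic-relevant documents should be handled is not stated in the text, so that behavior is an artifact of the sketch.

```python
def should_store(is_relevant, relevance_prob, n_relevant, n_irrelevant,
                 prob_threshold=0.8, ratio_threshold=1.0):
    """Decide whether a classified web document joins the web document set.

    prob_threshold and ratio_threshold are hypothetical values; the text
    names the thresholds but does not give numbers.
    """
    if is_relevant:
        # Relevant branch: store only if the topic relevance probability
        # exceeds the topic relevance probability threshold.
        return relevance_prob > prob_threshold
    # Irrelevant branch: store only if
    #   n_relevant / n_irrelevant < ratio_threshold,
    # rewritten without division so n_irrelevant == 0 does not raise.
    return n_relevant < ratio_threshold * n_irrelevant
```

For example, `should_store(False, 0.1, 1, 4)` stores the document because 1/4 is below the assumed ratio threshold of 1.0.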

Preferably, when the web document is stored into the web document set based on the classification result, training the topic classification model associated with the topic crawler according to the increment of web documents in the web document set comprises:

when the web document is stored into the web document set, incrementing an increment counter by one, wherein the initial value of the increment counter is 0, and the increment counter is automatically incremented by one each time a web document is stored in the web document set;

judging whether the value of the increment counter is greater than an increment threshold, and if so, retraining the topic classification model and resetting the value of the increment counter to the initial value.
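The counter-driven retraining can be illustrated as follows. This is a sketch under assumptions: the class name, the `retrain` callback, and the in-memory list standing in for the web document set are all hypothetical, and the text does not say how retraining itself is carried out.

```python
class IncrementalTrainer:
    """Counts stored documents and triggers retraining past a threshold."""

    def __init__(self, retrain, increment_threshold):
        self.retrain = retrain                  # hypothetical callback
        self.increment_threshold = increment_threshold
        self.counter = 0                        # initial value is 0
        self.documents = []                     # stands in for the document set

    def store(self, document):
        """Store a document; return True if retraining was triggered."""
        self.documents.append(document)
        self.counter += 1                       # one increment per stored document
        if self.counter > self.increment_threshold:
            self.retrain(self.documents)        # retrain on the whole set
            self.counter = 0                    # reset to the initial value
            return True
        return False
```

With an increment threshold of 2, the third stored document triggers retraining, because the comparison described in the text is "greater than" rather than "greater than or equal to".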

The present invention also provides a topic crawler processing device, the device comprising:

an acquiring unit, for obtaining the web document corresponding to a uniform resource locator in a queue to be crawled;

an extraction unit, for extracting feature information from the web document, wherein the feature information includes at least page title feature information, keyword feature information from the meta information, description feature information from the meta information, and body text feature information;

a classification unit, for performing topic relevance classification on the web document based on the feature information to obtain a classification result;

a judging unit, for determining, based on the classification result, whether to store the web document into a web document set;

a training unit, for training a topic classification model associated with the topic crawler according to the increment of web documents in the web document set when the web document is stored into the web document set based on the classification result.

Preferably, the device further comprises a page judging unit, for judging whether the page corresponding to the uniform resource locator is a navigation page; if so, the acquiring unit is triggered to parse the navigation page, obtain the uniform resource locators in the navigation page, and write the obtained uniform resource locators into the queue to be crawled; if not, the extraction unit is triggered.

Preferably, the extraction unit comprises:

a first word segmentation subunit, for segmenting the title of the web document to obtain a first word segmentation result, and obtaining a unigram set of the title based on the first word segmentation result;

a title feature vector obtaining subunit, for using a first feature function to determine the relationship between each word in the title and the unigram set of the title, obtaining a title feature vector, the title feature vector indicating the relationship between each word in the title and the unigram set;

a second word segmentation subunit, for segmenting the keyword meta information in the meta information of the web document to obtain a second word segmentation result, and obtaining a unigram set of the keyword meta information based on the second word segmentation result;

a keyword feature vector obtaining subunit, for using a second feature function to determine the relationship between each keyword in the keyword meta information and the unigram set of the keyword meta information, obtaining a keyword feature vector, the keyword feature vector indicating the relationship between each keyword in the keyword meta information and the unigram set of the keyword meta information;

a third word segmentation subunit, for segmenting the description meta information in the meta information of the web document to obtain a third word segmentation result, and obtaining a unigram set of the description meta information based on the third word segmentation result;

a description feature vector obtaining subunit, for using a third feature function to determine the relationship between each page description word in the description meta information and the unigram set of the description meta information, obtaining a description feature vector, the description feature vector indicating the relationship between each page description word in the description meta information and the unigram set of the description meta information;

a fourth word segmentation subunit, for processing the body text of the web document to obtain a unigram set of the body text and a bigram set of the body text;

a first feature vector obtaining subunit, for using a fourth feature function to determine the relationship between each keyword in the body text and the unigram set of the body text, obtaining a first feature vector of the body text, the first feature vector indicating the relationship between each keyword in the body text and the unigram set of the body text;

a second feature vector obtaining subunit, for using a fifth feature function to determine the relationship between each keyword in the body text and the bigram set of the body text, obtaining a second feature vector of the body text, the second feature vector indicating the relationship between each keyword in the body text and the bigram set of the body text.

Preferably, the judging unit comprises:

a first judging subunit, for judging, when the classification result indicates that the web document is relevant to the search topic, whether the topic relevance probability of the web document is greater than a topic relevance probability threshold, wherein the search topic is the topic crawled by the topic crawler;

a first storing subunit, for storing the web document into the web document set when the topic relevance probability of the web document is judged to be greater than the topic relevance probability threshold;

a second judging subunit, for judging, when the classification result indicates that the web document is not relevant to the search topic, whether the ratio of the number of topic-relevant documents to the number of non-topic-relevant documents in the web document set is less than a topic relevance ratio threshold, wherein the number of topic-relevant documents is the number of web documents relevant to the search topic, and the number of non-topic-relevant documents is the number of web documents not relevant to the search topic;

a second storing subunit, for storing the web document into the web document set when the ratio of the number of topic-relevant documents to the number of non-topic-relevant documents in the web document set is judged to be less than the topic relevance ratio threshold.

Preferably, the training unit comprises:

a counter, for incrementing an increment counter by one when the web document is stored into the web document set, wherein the initial value of the increment counter is 0, and the increment counter is automatically incremented by one each time a web document is stored in the web document set;

a judging subunit, for judging whether the value of the increment counter is greater than an increment threshold;

a training subunit, for retraining the topic classification model and resetting the value of the increment counter to the initial value when the value of the increment counter is greater than the increment threshold.

Compared with the prior art, the technical solution provided by the present invention has the following advantages:

In the technical solution provided by the present invention, after a web document is obtained, at least page title feature information, keyword feature information from the meta information, description feature information from the meta information, and body text feature information are extracted from it. Topic relevance analysis is performed on the web document based on this feature information to obtain a classification result, and when the web document is stored into the web document set based on the classification result, the topic classification model is trained according to the increment of web documents in the set. The topic classification model associated with the topic crawler can therefore be trained while the crawler is crawling, so that the model tracks the search topic more closely. When the topic crawler then crawls based on the model, the crawled content is more relevant to the search topic, which improves the precision and recall of crawling.

Moreover, when the embodiments of the present invention train the topic classification model, the feature information used is collected automatically by the topic crawler during crawling. Compared with training the topic classification model on manually labeled data, this reduces the manual labeling workload. Furthermore, each time the topic classification model is retrained, the web documents newly added to the web document set are brought into training as training input variables, so the number of training input variables grows. A new topic classification model can thus be obtained, and new topics can be judged based on it.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a topic crawler processing method provided by an embodiment of the present invention;

Fig. 2 is a sub-flowchart of the topic crawler processing method provided by an embodiment of the present invention;

Fig. 3 is another sub-flowchart of the topic crawler processing method provided by an embodiment of the present invention;

Fig. 4 is another flowchart of the topic crawler processing method provided by an embodiment of the present invention;

Fig. 5 is a structural schematic diagram of a topic crawler processing device provided by an embodiment of the present invention;

Fig. 6 is a structural schematic diagram of the extraction unit in the topic crawler processing device provided by an embodiment of the present invention;

Fig. 7 is a structural schematic diagram of the judging unit in the topic crawler processing device provided by an embodiment of the present invention;

Fig. 8 is another structural schematic diagram of the topic crawler processing device provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, which shows a flowchart of a topic crawler processing method provided by an embodiment of the present invention, the method may include the following steps:

101: Obtain the web document corresponding to a uniform resource locator (Uniform Resource Locator, URL) in the queue to be crawled. In the embodiments of the present invention, the topic crawler can issue page resource requests using existing techniques, and can use existing techniques to parse out the URLs in each requested page and add them to the queue to be crawled.

For example, the topic crawler issues page resource requests with Apache HttpClient from the open-source Hypertext Transfer Protocol (HTTP) toolkit, combined with a wrapper around the native multithreading facilities of the Java Development Kit (JDK) so that page resource requests run in parallel. Parsing likewise uses the native multithreading package provided in the JDK, and the parsed URLs are added to the queue to be crawled.

102: Extract feature information from the web document, wherein the feature information includes at least page title feature information, keyword feature information from the meta information, description feature information from the meta information, and body text feature information.

That is, the embodiment of the present invention extracts feature information from at least the page title, the meta information, and the body text of the web document. These three parts, especially the page title and the meta information, indicate the topic of the web document, so feature information extracted from them fits the topic more closely.

The meta information is the summary information of the web document, such as keywords and topic, contained in the meta tags of the HyperText Markup Language (HTML) file corresponding to the web document. Various kinds of feature information can be extracted from the content of the meta tags. The embodiment of the present invention uses the keyword meta information and the description meta information: keyword feature information is extracted from the keyword meta information, and description feature information is extracted from the description meta information.

Feature information is extracted from the keyword meta information and the description meta information for the following reason. The keyword meta information is summarizing keyword information written by the page developer and contains at least the key topic of the web document, in a form such as: <meta name="keywords" content="Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe">. The description meta information is the description entry of the meta tag; like the keyword meta information, it records summarizing information related to the main subject of the page, for example: <meta name="Description" content="Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe; Poland is the largest economy in Central and Eastern Europe.">. Because the keyword meta information and the description meta information correlate strongly with the topic of the web document, features are preferably extracted from these two parts of the meta information.
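As an illustration of how the title, the keyword meta information, and the description meta information might be pulled out of an HTML file, the following sketch uses Python's standard `html.parser` module. The class and function names are assumptions made for this example; the patent text does not prescribe a parser.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and the keywords/description meta content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # The name attribute may be written "Description" or "keywords";
            # compare case-insensitively.
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_meta(html):
    """Return (title, keywords content, description content) of an HTML page."""
    parser = MetaExtractor()
    parser.feed(html)
    return (parser.title.strip(),
            parser.meta.get("keywords", ""),
            parser.meta.get("description", ""))
```

The three returned strings correspond to the inputs of the first, second, and third word segmentation steps described in this section.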

In the embodiment of the present invention, the process of obtaining the above feature information is shown in Fig. 2 and may include the following steps:

201: Segment the title of the web document into words to obtain a first word segmentation result, and obtain a unigram set of the title based on the first word segmentation result. It should be understood that the title is a short piece of descriptive text that summarizes the web document, and it is a good indicator of whether the web document is relevant to the topic.

In the embodiments of the present invention, existing word segmentation techniques can be used to segment the title, for example the Stanford word segmenter. The first word segmentation result is the set of words obtained after segmenting the title. A unigram set is then obtained based on the order in which the words of the first word segmentation result appear in the title.

For example, after segmenting "Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe", the resulting unigram set is: (Poland), (applies), (join), (AIIB), (first), (Central and Eastern Europe), (applicant country).

It should be noted that, after segmentation, each word segmentation result can be regarded as a time series ordered by the position at which each word appears in the document, with each word located at a time t. Each unigram in the unigram set is the word at the current time t, (w(t)). Likewise, each bigram in the bigram set is the combination of the words at time t-1 and time t, (w(t-1), w(t)). Taking "Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe" as an example, the bigram set is (Poland, applies), (applies, join), (join, AIIB), (AIIB, first), (first, Central and Eastern Europe), (Central and Eastern Europe, applicant country).
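The construction of unigrams (w(t)) and bigrams (w(t-1), w(t)) from an ordered word segmentation result can be sketched in a few lines. The function names are assumptions; the sketch returns ordered lists rather than mathematical sets so that the order of appearance described above is preserved.

```python
def unigrams(tokens):
    """Each unigram is the word at position t: (w(t),)."""
    return [(w,) for w in tokens]


def bigrams(tokens):
    """Each bigram pairs the words at positions t-1 and t: (w(t-1), w(t))."""
    return [(tokens[t - 1], tokens[t]) for t in range(1, len(tokens))]
```

A segmentation result of n words yields n unigrams and n-1 bigrams.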

202: Use a first feature function to determine the relationship between each word in the title and the unigram set of the title, obtaining a title feature vector that indicates the relationship between each word in the title and the unigram set. In the embodiment of the present invention, the form of the first feature function is as follows:

As can be seen from the first feature function, when a word w in the title belongs to a unigram in the unigram set, the feature value is 1; otherwise the feature value is 0. Through the first feature function, a title feature vector consisting of 0s and 1s is obtained.
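The formula itself is not reproduced in this text. A reconstruction consistent with the verbal description above is the indicator function below; the symbols f_1 and U_title are assumptions introduced here, not notation taken from the patent.

```latex
f_1(w) =
\begin{cases}
1, & \text{if } w \in U_{\mathrm{title}} \\
0, & \text{otherwise}
\end{cases}
```

Here w is a word and U_title denotes the unigram set of the title.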

203: Segment the keyword meta information in the meta information of the web document to obtain a second word segmentation result, and obtain a unigram set of the keyword meta information based on the second word segmentation result. The keyword meta information has a form such as: <meta name="keywords" content="Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe">. To segment it, the information in the content attribute is extracted first: "Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe". It is then segmented to obtain a unigram set. For this example, the unigram set constructed is: (Poland), (applies), (join), (AIIB), (first), (Central and Eastern Europe), (applicant country).

204: Use a second feature function to determine the relationship between each keyword in the keyword meta information and the unigram set of the keyword meta information, obtaining a keyword feature vector that indicates the relationship between each keyword in the keyword meta information and the unigram set of the keyword meta information. In the embodiment of the present invention, the form of the second feature function is as follows:

Through the second feature function, a keyword feature vector consisting of 0s and 1s is obtained.

205: Segment the description meta information in the meta information of the web document to obtain a third word segmentation result, and obtain a unigram set of the description meta information based on the third word segmentation result. The description meta information has a form such as: <meta name="Description" content="Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe; Poland is the largest economy in Central and Eastern Europe.">. To segment it, the information in the content attribute is extracted first: "Poland applies to join the AIIB, the first applicant country from Central and Eastern Europe; Poland is the largest economy in Central and Eastern Europe". It is then segmented to obtain a unigram set.

206: Use a third feature function to determine the relationship between each page description word in the description meta information and the unigram set of the description meta information, obtaining a description feature vector that indicates the relationship between each page description word in the description meta information and the unigram set of the description meta information. In the embodiment of the present invention, the form of the third feature function is as follows:

Through the third feature function, a description feature vector consisting of 0s and 1s is obtained.

207: Process the body text of the web document to obtain a unigram set of the body text and a bigram set of the body text. After a web document is obtained, its body text must first be extracted from it. For example, the body text can be extracted with CXExtractor, the open-source code of the open-source algorithm "general-purpose web page body text extraction based on a line-block distribution function" proposed at Harbin Institute of Technology in China.

After the body text is extracted, a series of preprocessing steps can be applied to it, for example filtering out or replacing the special characters in the body text with regular expressions, as shown in Table 1:

Table 1: Special characters in the web page body text

After the special characters are removed, the body text is segmented into words, and the stop words in the word segmentation result are removed according to a stop word list. The final word segmentation result of the body text therefore contains neither special characters nor stop words.
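The preprocessing steps just described (regular-expression filtering, segmentation, stop word removal) can be sketched as follows. The contents of Table 1 are not available in this text, so the special-character pattern below is purely hypothetical, as are the stop word list and the use of whitespace splitting as a stand-in for a real word segmenter.

```python
import re

# Hypothetical pattern: the actual special characters of Table 1 are not
# reproduced in this text.
SPECIAL_CHARS = re.compile(r"&nbsp;|&amp;|[\t\r\n]+")

# Hypothetical stop word list for illustration only.
STOP_WORDS = {"the", "a", "of", "with"}


def preprocess(body_text, segment=str.split):
    """Replace special characters, segment, then drop stop words.

    `segment` defaults to whitespace splitting; a real deployment would pass
    a word segmenter suited to the page language.
    """
    cleaned = SPECIAL_CHARS.sub(" ", body_text)
    tokens = segment(cleaned)
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```

The returned token list keeps the original word order, which the unigram and bigram construction in the next step relies on.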

According to the order in which the words of the word segmentation result appear in the body text, unigrams and bigrams are constructed for each word. The constructed unigrams form the unigram set of the body text, and the constructed bigrams form the bigram set of the body text.

For example, suppose part of the body text reads: "Poland is willing to join the China-led AIIB as a founding member state". Its unigram set is then: (Poland), (willing), (founding), (member state), (identity), (join), (China), (led), (AIIB); and its bigram set is: (Poland, willing), (willing, founding), (founding, member state), (member state, identity), (identity, join), (join, China), (China, led), (led, AIIB).

208: Use a fourth feature function to determine the relationship between each keyword in the body text and the unigram set of the body text, obtaining a first feature vector of the body text. The first feature vector indicates the relationship between each keyword in the body text and the unigram set of the body text. In this embodiment of the invention, the fourth feature function has the following form:

The fourth feature function yields a first feature vector consisting of 0s and 1s.

209: Use a fifth feature function to determine the relationship between each keyword in the body text and the bigram set of the body text, obtaining a second feature vector of the body text. The second feature vector indicates the relationship between each keyword in the body text and the bigram set of the body text. In this embodiment of the invention, the fifth feature function has the following form:

The fifth feature function yields a second feature vector consisting of 0s and 1s.
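The formulas of the fourth and fifth feature functions are not reproduced in this text. Since each function yields a vector of 0s and 1s, one plausible reading is an indicator function over a fixed vocabulary of topic terms. The sketch below illustrates that reading only; the function name `indicator_vector` and the vocabularies are assumptions, not the patent's definitions.

```python
from typing import Hashable, Iterable, List, Sequence


def indicator_vector(vocabulary: Sequence[Hashable],
                     ngram_set: Iterable[Hashable]) -> List[int]:
    """Return 1 for each vocabulary entry found in the n-gram set, else 0.

    Assumption: the vocabulary is a fixed, ordered list of topic terms
    (1-tuples for the unigram case, 2-tuples for the bigram case).
    """
    present = set(ngram_set)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical vocabularies and page n-gram sets, for illustration only.
unigram_vocabulary = [("Poland",), ("bank",), ("China",)]
page_unigrams = {("Poland",), ("willing",), ("China",)}
first_feature_vector = indicator_vector(unigram_vocabulary, page_unigrams)

bigram_vocabulary = [("China", "lead"), ("central", "bank")]
page_bigrams = {("China", "lead"), ("lead", "AIIB")}
second_feature_vector = indicator_vector(bigram_vocabulary, page_bigrams)
```

Under this reading, the same helper serves both steps 208 and 209; only the vocabulary and the n-gram set differ.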

103: Perform topic relevance classification on the web document based on the feature information to obtain a classification result. The classification result indicates whether the web document is relevant to the search topic of the topic crawler, where the search topic may be a topic entered by a user. In this embodiment of the invention, the determination can be made with a topic classification model, as follows:

The feature information described above, namely the title feature vector, the keyword feature vector, the description feature vector, the first feature vector, and the second feature vector, is concatenated into a total feature vector of one row and multiple columns. The total feature vector is then input to the topic classification model, whose output is the classification result indicating whether the web document is relevant to the search topic. Specifically, an output of 1 indicates that the web document is relevant to the search topic of the topic crawler, and an output of 0 indicates that it is not.
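The concatenation and classification described above can be sketched as follows. The sketch assumes the topic classification model is a logistic regression (the experiment in this document mentions parameter training of a logistic regression model); the weights, bias, and the 0.5 cutoff are hypothetical values chosen for illustration.

```python
import math
from typing import List, Sequence


def concatenate(*vectors: Sequence[int]) -> List[int]:
    """Join the per-part feature vectors into one row vector."""
    return [x for vector in vectors for x in vector]


def relevance_probability(features: Sequence[float],
                          weights: Sequence[float],
                          bias: float) -> float:
    """Logistic (sigmoid) score of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(features: Sequence[float], weights: Sequence[float],
             bias: float, cutoff: float = 0.5) -> int:
    """Return 1 (relevant) if the probability exceeds the cutoff, else 0."""
    return 1 if relevance_probability(features, weights, bias) > cutoff else 0


# Hypothetical feature vectors for title, keywords, description, and body text.
total = concatenate([1], [0], [1], [1, 0], [0])
label = classify(total, weights=[2.0, 1.0, 1.0, 0.5, 0.5, 0.5], bias=-1.0)
```

The probability returned by `relevance_probability` can also serve as the topic relevance probability that the storage step compares against its threshold.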

104: Determine, based on the classification result, whether to store the web document in the web document set. In this embodiment of the invention, the web document set stores multiple web documents, which can further serve as training data for training the topic classification model. In other words, this embodiment uses a semi-supervised learning approach to extend the topic classification model: when the topic crawler makes predictions with the topic classification model, web documents can be handled according to their classification results so as to expand the web documents in the web document set that serve as training data.

The web document storage procedure is shown in Fig. 3 and may include the following steps:

301: Judge whether the classification result indicates that the web document is relevant to the search topic; if so, execute step 302; if not, execute step 305.

302: When the classification result indicates that the web document is relevant to the search topic, judge whether the topic relevance probability of the web document is greater than the topic relevance probability threshold; if so, execute step 303; if not, execute step 304.

303: When the topic relevance probability of the web document is judged to be greater than the topic relevance probability threshold, store the web document in the web document set.

304: When the topic relevance probability of the web document is judged to be less than or equal to the topic relevance probability threshold, discard the web document.

305: When the classification result indicates that the web document is not relevant to the search topic, judge whether the ratio of the number of topic-relevant documents to the number of non-topic-relevant documents in the web document set is less than the topic-relevance ratio threshold; if so, execute step 306; if not, execute step 307. Here the number of topic-relevant documents is the number of web documents relevant to the search topic, and the number of non-topic-relevant documents is the number of web documents not relevant to the search topic.

306: When the ratio of the number of topic-relevant documents to the number of non-topic-relevant documents in the web document set is judged to be less than the topic-relevance ratio threshold, store the web document in the web document set.

307: When the ratio of the number of topic-relevant documents to the number of non-topic-relevant documents in the web document set is judged to be greater than or equal to the topic-relevance ratio threshold, discard the web document.

As the storage procedure above shows, this embodiment of the invention decides whether to store a web document in the web document set based on the topic relevance probability threshold and the topic-relevance ratio threshold. In particular, when storing topic-relevant web documents, only those whose topic relevance probability exceeds the topic relevance probability threshold are selected, so that the topic-relevant web documents used to train the topic classification model are closer to the search topic, which improves the accuracy of the topic classification model.

The topic relevance probability threshold and the topic-relevance ratio threshold are empirical values set manually. Different values can be chosen for different application scenarios, and this embodiment of the invention does not limit their specific values.
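The storage decision of steps 301 to 307 can be sketched as a single function. The default thresholds 0.8 and 0.75 are the values used in the experiment reported in this document; the function and parameter names are illustrative assumptions.

```python
def should_store(is_relevant: bool,
                 relevance_probability: float,
                 relevant_count: int,
                 irrelevant_count: int,
                 probability_threshold: float = 0.8,
                 ratio_threshold: float = 0.75) -> bool:
    """Decide whether a classified web document joins the training set."""
    if is_relevant:
        # Steps 302-304: keep only confidently relevant documents.
        return relevance_probability > probability_threshold
    # Steps 305-307: keep an irrelevant document only while the ratio
    # relevant_count / irrelevant_count is below the ratio threshold.
    # Cross-multiplying avoids dividing by zero; with no irrelevant
    # documents the ratio is treated as not below the threshold.
    return relevant_count < ratio_threshold * irrelevant_count
```

For example, `should_store(True, 0.9, 10, 10)` stores the document, while `should_store(True, 0.8, 10, 10)` discards it because the probability does not exceed the threshold.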

105: When the web document is stored in the web document set based on the classification result, train the topic classification model associated with the topic crawler according to the increase in web documents in the web document set. One feasible approach is as follows. When a web document is stored in the web document set, the increment counter is incremented by one; the increment counter has an initial value of 0 and increases by one automatically each time a web document is stored in the web document set. It is then judged whether the value of the increment counter is greater than the increment threshold; if so, the topic classification model is retrained and the increment counter is reset to its initial value.

An increment counter thus tracks the increase in web documents in the web document set. When the increment counter indicates that the increase exceeds the increment threshold, the topic classification model needs to be retrained. Retraining is based on the multiple web documents stored in the web document set, and the feature information of each web document is extracted automatically in the manner shown in Fig. 2. The web documents used to retrain the topic classification model are therefore labeled automatically, which reduces the workload of manual data labeling.
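The counter-driven retraining of step 105 can be sketched as follows. The class name `IncrementalTrainer` and the `retrain` callback are hypothetical; the callback stands in for whatever routine retrains the topic classification model on the stored documents.

```python
from typing import Any, Callable, List


class IncrementalTrainer:
    """Stores documents and triggers retraining when enough have arrived."""

    def __init__(self, retrain: Callable[[List[Any]], None],
                 increment_threshold: int = 1000) -> None:
        self.retrain = retrain
        self.increment_threshold = increment_threshold
        self.counter = 0            # increment counter, initial value 0
        self.documents: List[Any] = []

    def store(self, document: Any) -> bool:
        """Store one document; return True if retraining was triggered."""
        self.documents.append(document)
        self.counter += 1
        if self.counter > self.increment_threshold:
            self.retrain(self.documents)
            self.counter = 0        # reset to the initial value
            return True
        return False
```

Because the comparison is strictly "greater than", a threshold of 1000 triggers retraining on the 1001st stored document since the last reset.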

It should be noted that, when the topic classification model is trained for the first time, a user must manually label a small number of web documents. These manually labeled web documents serve as the training data in the initial web document set for training the topic classification model, and subsequent training of the topic classification model is based on the web documents stored in the web document set. The manually labeled web documents may come from web pages crawled at random from the internet by a general-purpose web crawler, from web pages obtained manually from the internet, or from an open-source web page corpus.

As the above technical solution shows, after obtaining a web document, the topic crawler processing method provided by this embodiment of the invention extracts from the web document at least web page title feature information, keyword feature information in the meta information, description feature information in the meta information, and web page body text feature information, and performs topic relevance analysis on the web document based on this feature information to obtain a classification result. When the web document is stored in the web document set based on the classification result, the topic classification model is trained according to the increase in web documents in the web document set. The topic classification model associated with the topic crawler can therefore also be trained while the topic crawler is crawling, so that the model on which the crawler relies moves closer to the search topic. As a result, the content crawled using the topic classification model is more relevant to the search topic, which improves the precision and recall of crawling.

Moreover, when the topic classification model is trained in this embodiment of the invention, the feature information used is collected automatically by the topic crawler during crawling. Compared with training the topic classification model on manually labeled data, this reduces the workload of manual data labeling.

Referring to Fig. 4, which shows another flowchart of the topic crawler processing method provided by an embodiment of the invention, the method may include the following steps:

401: Obtain the web document corresponding to a URL in the queue to be crawled.

402: Judge whether the page corresponding to the URL is a navigation page; if so, execute step 403; if not, execute step 404. In this embodiment of the invention, existing techniques, such as a logistic regression model, can be used to judge whether the page is a navigation page.

403: Parse the navigation page to obtain the URLs it contains, and write the obtained URLs into the queue to be crawled.

It should be understood that web pages are divided by function into navigation pages and content pages. A navigation page contains no substantive content, only a series of anchor texts used for navigation, whereas a content page contains substantive content and relatively few anchor texts. Therefore, once the page corresponding to a URL is judged to be a navigation page, the URLs of content pages, which contain substantive content and relatively few anchor texts, need to be obtained from the navigation page and added to the queue to be crawled so that the web documents corresponding to these URLs can be classified.

404: Extract feature information from the web document, where the feature information includes at least web page title feature information, keyword feature information in the meta information, description feature information in the meta information, and web page body text feature information.

405: Perform topic relevance classification on the web document based on the feature information to obtain a classification result.

406: Determine, based on the classification result, whether to store the web document in the web document set.

407: When the web document is stored in the web document set based on the classification result, train the topic classification model associated with the topic crawler according to the increase in web documents in the web document set.

The specific implementation of steps 404 to 407 is the same as that of steps 102 to 105 above and is not described again here.

As the above technical solution shows, the topic crawler processing method provided by this embodiment of the invention can judge whether the page corresponding to a URL is a navigation page. When the page is judged to be a navigation page, feature extraction and classification are no longer performed on it, which reduces the amount of data to be processed.
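The control flow of steps 401 to 407 can be sketched as a queue-driven loop. All callbacks (`fetch`, `is_navigation_page`, `extract_links`, `handle_content_page`) are hypothetical placeholders for the components described above, and the `seen` set that prevents re-queuing a URL is an added assumption that the text does not mention.

```python
from collections import deque
from typing import Any, Callable, Iterable


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], Any],
          is_navigation_page: Callable[[Any], bool],
          extract_links: Callable[[Any], Iterable[str]],
          handle_content_page: Callable[[str, Any], None],
          max_pages: int = 100) -> int:
    """Process the queue to be crawled; return the number of pages fetched."""
    queue = deque(seed_urls)
    seen = set(queue)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        document = fetch(url)                      # step 401
        fetched += 1
        if is_navigation_page(document):           # step 402
            for link in extract_links(document):   # step 403
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        else:
            handle_content_page(url, document)     # steps 404 to 407
    return fetched
```

Navigation pages only feed the queue, so feature extraction and classification run on content pages alone.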

The precision and recall of the topic crawler processing method provided above were verified. For the initial web document set, 1000 web documents were collected manually from the internet and labeled by hand one by one to serve as training data for the topic classification model. The topic relevance probability threshold was set to 0.8, the topic-relevance ratio threshold to 0.75, and the increment threshold to 1000; the gradient descent step size used for parameter training of the logistic regression model was 0.05. The method was run in the following environment:

Central processing unit (CPU): Intel E5 2620;

Random access memory (RAM): 64 GB;

Operating system: Windows 7 Ultimate SP1;

Java virtual machine environment: JDK 1.6;

Network bandwidth: 100 Mbps;

In this environment, 10 threads were used for web page requests, 2 threads for page URL parsing, and 1 thread for body text extraction.

Based on the above environment, the results of running the topic crawler are shown in Table 2:

Table 2: Run result statistics

Item	Quantity
Web pages crawled	370,561

A random sample of 100 results was drawn from the run and evaluated; the confusion matrix is shown in Table 3:

Table 3: Confusion matrix

From the confusion matrix, the precision is 87% and the recall is 82.1%.
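For reference, precision and recall follow from the confusion matrix counts in the standard way, as sketched below. The counts in the usage line are illustrative only and are not the values of Table 3, which is not reproduced in this text.

```python
from typing import Tuple


def precision_recall(true_positive: int, false_positive: int,
                     false_negative: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted_positive = true_positive + false_positive
    actual_positive = true_positive + false_negative
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    return precision, recall


# Illustrative counts only (not taken from Table 3).
example_precision, example_recall = precision_recall(8, 2, 2)
```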

For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. However, those skilled in the art should understand that the invention is not limited by the described order of actions, because according to the invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily required by the invention.

Referring to Fig. 5, which shows a schematic structural diagram of a topic crawler processing apparatus provided by an embodiment of the invention, the apparatus may include: an acquisition unit 11, an extraction unit 12, a classification unit 13, a judging unit 14, and a training unit 15.

The acquisition unit 11 is configured to obtain the web document corresponding to a URL in the queue to be crawled. In this embodiment of the invention, the acquisition unit 11, on behalf of the topic crawler, issues page resource requests using existing techniques, and likewise uses existing techniques to parse the URLs in each request and add them to the queue to be crawled.

The extraction unit 12 is configured to extract feature information from the web document, where the feature information includes at least web page title feature information, keyword feature information in the meta information, description feature information in the meta information, and web page body text feature information. That is, this embodiment of the invention extracts feature information from at least the title, the meta information, and the body text of the web document. These three parts, especially the title and the meta information, indicate the topic of the web document, so feature information extracted from them fits the topic more closely.

Preferably, the extraction unit 12 may have the structure shown in Fig. 6, comprising: a first word segmentation subunit 121, a title feature vector obtaining subunit 122, a second word segmentation subunit 123, a keyword feature vector obtaining subunit 124, a third word segmentation subunit 125, a description feature vector obtaining subunit 126, a fourth word segmentation subunit 127, a first feature vector obtaining subunit 128, and a second feature vector obtaining subunit 129.

The first word segmentation subunit 121 is configured to segment the title of the web document into words to obtain a first word segmentation result and, based on the first word segmentation result, obtain the unigram set of the title.

The title feature vector obtaining subunit 122 is configured to use a first feature function to determine the relationship between each word in the title and the unigram set of the title, obtaining a title feature vector. The title feature vector indicates the relationship between each word in the title and the unigram set. In this embodiment of the invention, the first feature function has the following form:

The second word segmentation subunit 123 is configured to segment the keyword information of the meta information in the web document into words to obtain a second word segmentation result and, based on the second word segmentation result, obtain the unigram set of the keyword information.

The keyword feature vector obtaining subunit 124 is configured to use a second feature function to determine the relationship between each keyword in the keyword information and the unigram set of the keyword information, obtaining a keyword feature vector. The keyword feature vector indicates the relationship between each keyword in the keyword information and the unigram set of the keyword information. In this embodiment of the invention, the second feature function has the following form:

The third word segmentation subunit 125 is configured to segment the description meta information of the meta information in the web document into words to obtain a third word segmentation result and, based on the third word segmentation result, obtain the unigram set of the description meta information.

The description feature vector obtaining subunit 126 is configured to use a third feature function to determine the relationship between each page description word in the description meta information and the unigram set of the description meta information, obtaining a description feature vector. The description feature vector indicates the relationship between each page description word in the description meta information and the unigram set of the description meta information. In this embodiment of the invention, the third feature function has the following form:

The fourth word segmentation subunit 127 is configured to process the body text of the web document to obtain the unigram set and the bigram set of the body text. For the specific processing, refer to the description of the method embodiment.

The first feature vector obtaining subunit 128 is configured to use a fourth feature function to determine the relationship between each keyword in the body text and the unigram set of the body text, obtaining a first feature vector of the body text. The first feature vector indicates the relationship between each keyword in the body text and the unigram set of the body text. In this embodiment of the invention, the fourth feature function has the following form:

The second feature vector obtaining subunit 129 is configured to use a fifth feature function to determine the relationship between each keyword in the body text and the bigram set of the body text, obtaining a second feature vector of the body text. The second feature vector indicates the relationship between each keyword in the body text and the bigram set of the body text. In this embodiment of the invention, the fifth feature function has the following form:

The classification unit 13 is configured to perform topic relevance classification on the web document based on the feature information to obtain a classification result. The classification result indicates whether the web document is relevant to the search topic of the topic crawler, where the search topic may be a topic entered by a user. In this embodiment of the invention, the determination can be made with a topic classification model; for the detailed process, refer to the description of the method embodiment.

The judging unit 14 is configured to determine, based on the classification result, whether to store the web document in the web document set. In this embodiment of the invention, the web document set stores multiple web documents, which can further serve as training data for training the topic classification model. In other words, this embodiment uses a semi-supervised learning approach to extend the topic classification model: when the topic crawler makes predictions with the topic classification model, web documents can be handled according to their classification results so as to expand the web documents in the web document set that serve as training data.

The judging unit 14 may use the structure shown in Fig. 7 to determine whether to store the web document. Specifically, it may include: a first judging subunit 141, a first storage subunit 142, a second judging subunit 143, and a second storage subunit 144.

The first judging subunit 141 is configured to judge, when the classification result indicates that the web document is relevant to the search topic, whether the topic relevance probability of the web document is greater than the topic relevance probability threshold, where the search topic is the topic that the topic crawler crawls.

The first storage subunit 142 is configured to store the web document in the web document set when the topic relevance probability of the web document is judged to be greater than the topic relevance probability threshold.

The second judging subunit 143 is configured to judge, when the classification result indicates that the web document is not relevant to the search topic, whether the ratio of the number of topic-relevant documents to the number of non-topic-relevant documents in the web document set is less than the topic-relevance ratio threshold, where the number of topic-relevant documents is the number of web documents relevant to the search topic and the number of non-topic-relevant documents is the number of web documents not relevant to the search topic.

The second storage subunit 144 is configured to store the web document in the web document set when the ratio of the number of topic-relevant documents to the number of non-topic-relevant documents in the web document set is judged to be less than the topic-relevance ratio threshold.

The training unit 15 is configured to train, when the web document is stored in the web document set based on the classification result, the topic classification model associated with the topic crawler according to the increase in web documents in the web document set.

In this embodiment of the invention, the training unit 15 may include a counter, a judging subunit, and a training subunit. The counter is configured to increment the increment counter by one when a web document is stored in the web document set, where the increment counter has an initial value of 0 and increases by one automatically each time a web document is stored in the web document set.

The judging subunit is configured to judge whether the value of the increment counter is greater than the increment threshold.

The training subunit is configured to retrain the topic classification model when the value of the increment counter is greater than the increment threshold, and to reset the increment counter to its initial value.

As the above technical solution shows, after obtaining a web document, the topic crawler processing apparatus provided by this embodiment of the invention extracts from the web document at least web page title feature information, keyword feature information in the meta information, description feature information in the meta information, and web page body text feature information, and performs topic relevance analysis on the web document based on this feature information to obtain a classification result. When the web document is stored in the web document set based on the classification result, the topic classification model is trained according to the increase in web documents in the web document set. The topic classification model associated with the topic crawler can therefore also be trained while the topic crawler is crawling, so that the model on which the crawler relies moves closer to the search topic. As a result, the content crawled using the topic classification model is more relevant to the search topic, which improves the precision and recall of crawling.

Referring to Fig. 8, which shows another schematic structural diagram of the topic crawler processing apparatus provided by an embodiment of the invention, the apparatus may, on the basis of Fig. 5, further include a page judging unit 16, configured to judge whether the page corresponding to the URL is a navigation page. If so, it triggers the acquisition unit 11 to parse the navigation page, obtain the URLs in the navigation page, and write the obtained URLs into the queue to be crawled; if not, it triggers the extraction unit 12.

As the above technical solution shows, the topic crawler processing apparatus provided by this embodiment of the invention can judge whether the page corresponding to a URL is a navigation page. When the page is judged to be a navigation page, feature extraction and classification are no longer performed on it, which reduces the amount of data to be processed.

It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the apparatus embodiments are basically similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any of their variants are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element defined by the statement "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

The foregoing description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The above is only a preferred embodiment of the invention. It should be pointed out that those of ordinary skill in the art may make various improvements and modifications without departing from the principles of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the invention.

Claims

1. A topic crawler processing method, characterized in that the method comprises:

extracting feature information from the web document, wherein the feature information includes at least web page title feature information, keyword feature information in the meta information, description feature information in the meta information, and web page body text feature information;

when the web document is stored in a web document set based on the classification result, training a topic classification model associated with the topic crawler according to the increase in web documents in the web document set;

wherein the extracting feature information from the web document comprises:

segmenting the title of the web document into words to obtain a first word segmentation result, and obtaining a unigram set of the title based on the first word segmentation result;

using a first feature function to determine the relationship between each word in the title and the unigram set of the title, obtaining a title feature vector, the title feature vector indicating the relationship between each word in the title and the unigram set;

segmenting the keyword information of the meta information in the web document into words to obtain a second word segmentation result, and obtaining a unigram set of the keyword information based on the second word segmentation result;

using a second feature function to determine the relationship between each keyword in the keyword information and the unigram set of the keyword information, obtaining a keyword feature vector, the keyword feature vector indicating the relationship between each keyword in the keyword information and the unigram set of the keyword information;

segmenting the description meta information of the meta information in the web document into words to obtain a third word segmentation result, and obtaining a unigram set of the description meta information based on the third word segmentation result;

using a third feature function to determine the relationship between each page description word in the description meta information and the unigram set of the description meta information, obtaining a description feature vector, the description feature vector indicating the relationship between each page description word in the description meta information and the unigram set of the description meta information;

processing the body text of the web document to obtain a unigram set of the body text and a bigram set of the body text;

using a fourth feature function to determine the relationship between each keyword in the body text and the unigram set of the body text, obtaining a first feature vector of the body text, the first feature vector of the body text indicating the relationship between each keyword in the body text and the unigram set of the body text;

using a fifth feature function to determine the relationship between each keyword in the body text and the bigram set of the body text, obtaining a second feature vector of the body text, the second feature vector of the body text indicating the relationship between each keyword in the body text and the bigram set of the body text.

2. The method according to claim 1, characterized in that, after the web document corresponding to a uniform resource locator in the queue to be crawled is obtained, the method further comprises: judging whether the page corresponding to the uniform resource locator is a navigation page;

if so, parsing the navigation page, obtaining the uniform resource locators in the navigation page, and writing the obtained uniform resource locators into the queue to be crawled;

3. The method according to claim 1, characterized in that determining, based on the classification result, whether to store the web document into the web document set comprises:

When the classification result indicates that the web document is related to the search theme, judging whether the theme-related probability of the web document is greater than a theme-related probability threshold, wherein the search theme is the theme crawled by the theme crawler;

When the theme-related probability of the web document is judged to be greater than the theme-related probability threshold, storing the web document into the web document set;

When the classification result indicates that the web document is unrelated to the search theme, judging whether the ratio of the number of theme-related documents to the number of non-theme-related documents in the web document set is less than a theme-related proportion threshold, wherein the number of theme-related documents refers to the number of web documents related to the search theme, and the number of non-theme-related documents refers to the number of web documents unrelated to the search theme;

When the ratio of the number of theme-related documents to the number of non-theme-related documents in the web document set is judged to be less than the theme-related proportion threshold, storing the web document into the web document set.
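The storage decision in this claim can be written as a single predicate. This is a minimal sketch: the function name, parameter names and default threshold values are hypothetical, and the claim does not say what happens when the set contains no unrelated documents, so the sketch assumes the ratio is then treated as not below the threshold.

```python
def should_store(
    is_related,
    related_probability,
    related_count,
    unrelated_count,
    probability_threshold=0.8,  # hypothetical default
    ratio_threshold=1.0,        # hypothetical default
):
    """Decide whether a classified web document goes into the document set."""
    if is_related:
        # Related documents are stored only when the classifier's
        # theme-related probability exceeds the probability threshold.
        return related_probability > probability_threshold
    # Unrelated documents are stored only while
    # related_count / unrelated_count < ratio_threshold.
    # The comparison is cross-multiplied to avoid dividing by zero;
    # with unrelated_count == 0 it is False (assumption, see lead-in).
    return related_count < ratio_threshold * unrelated_count


decisions = [
    should_store(True, 0.9, 0, 0),
    should_store(True, 0.5, 0, 0),
    should_store(False, 0.1, 2, 10),
    should_store(False, 0.1, 10, 2),
]
```

Keeping the rule in one pure function makes both thresholds easy to tune without touching the crawler loop.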

4. The method according to claim 3, characterized in that, when the web document is stored into the web document set based on the classification result, training the subject classification model related to the theme crawler based on the increment of web documents in the web document set comprises:

When the web document is stored into the web document set, incrementing an increment counter by one, wherein the initial value of the increment counter is 0, and the increment counter is automatically incremented by one each time a web document is stored into the web document set;

Judging whether the value of the increment counter is greater than an increment threshold, and if so, retraining the subject classification model and resetting the value of the increment counter to the initial value.
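The counter-driven retraining can be sketched as a small class. The class name `IncrementalTrainer` and the `retrain` callback are hypothetical; the claim only fixes the counter behavior (start at 0, add one per stored document, retrain and reset when the value exceeds the increment threshold).

```python
class IncrementalTrainer:
    """Stores documents and triggers retraining once enough have accumulated."""

    def __init__(self, retrain, increment_threshold):
        self.retrain = retrain  # callable taking the full document list
        self.increment_threshold = increment_threshold
        self.counter = 0        # initial value of the increment counter
        self.documents = []

    def store(self, document):
        """Store one document; return True if retraining was triggered."""
        self.documents.append(document)
        self.counter += 1
        if self.counter > self.increment_threshold:
            self.retrain(self.documents)
            self.counter = 0    # reset to the initial value
            return True
        return False


retrain_calls = []
trainer = IncrementalTrainer(
    retrain=lambda docs: retrain_calls.append(len(docs)),
    increment_threshold=2,
)
triggered = [trainer.store(f"doc{i}") for i in range(6)]
```

With a threshold of 2, retraining fires on every third stored document, because the comparison is strictly greater than the threshold.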

5. A theme crawler processing device, characterized in that the device comprises:

An extraction unit, configured to extract feature information from the web document, wherein the feature information includes at least web page title feature information, keyword feature information in the meta information, description feature information in the meta information, and web page text feature information;

A classification unit, configured to perform theme relevance classification on the web document based on the feature information to obtain a classification result;

A judging unit, configured to determine, based on the classification result, whether to store the web document into a web document set;

A training unit, configured to train a subject classification model related to the theme crawler based on the increment of web documents in the web document set when the web document is stored into the web document set based on the classification result;

Wherein the extraction unit comprises:

A first word segmentation subunit, configured to segment the title of the web document into words to obtain a first word segmentation result, and to obtain a unigram set of the title based on the first word segmentation result;

A title feature vector obtaining subunit, configured to use a first feature function to determine the relationship between each word in the title and the unigram set of the title to obtain a title feature vector, wherein the title feature vector indicates the relationship between each word in the title and the unigram set;

A keyword feature vector obtaining subunit, configured to use a second feature function to determine the relationship between each keyword in the keyword information and the unigram set of the keyword information to obtain a keyword feature vector, wherein the keyword feature vector indicates the relationship between each keyword in the keyword information and the unigram set of the keyword information;

A third word segmentation subunit, configured to segment the description meta information in the meta information of the web document into words to obtain a third word segmentation result, and to obtain a unigram set of the description meta information based on the third word segmentation result;

A description feature vector obtaining subunit, configured to use a third feature function to determine the relationship between each web page description word in the description meta information and the unigram set of the description meta information to obtain a description feature vector, wherein the description feature vector indicates the relationship between each web page description word in the description meta information and the unigram set of the description meta information;

A fourth word segmentation subunit, configured to process the web page text of the web document to obtain a unigram set of the web page text and a bigram set of the web page text;

A first feature vector obtaining subunit, configured to use a fourth feature function to determine the relationship between each keyword in the web page text and the unigram set of the web page text to obtain a first feature vector of the web page text, wherein the first feature vector of the web page text indicates the relationship between each keyword in the web page text and the unigram set of the web page text;

A second feature vector obtaining subunit, configured to use a fifth feature function to determine the relationship between each keyword in the web page text and the bigram set of the web page text to obtain a second feature vector of the web page text, wherein the second feature vector of the web page text indicates the relationship between each keyword in the web page text and the bigram set of the web page text.

6. The device according to claim 5, characterized in that the device further comprises: a page judging unit, configured to judge whether the page corresponding to the uniform resource locator is a navigation page; if so, to trigger the acquiring unit to parse the navigation page, obtain the uniform resource locators in the navigation page, and write the obtained uniform resource locators into the to-be-crawled queue; if not, to trigger the extraction unit.

7. The device according to claim 5, characterized in that the judging unit comprises:

A first judging subunit, configured to judge, when the classification result indicates that the web document is related to the search theme, whether the theme-related probability of the web document is greater than a theme-related probability threshold, wherein the search theme is the theme crawled by the theme crawler;

A first storing subunit, configured to store the web document into the web document set when the theme-related probability of the web document is judged to be greater than the theme-related probability threshold;

A second judging subunit, configured to judge, when the classification result indicates that the web document is unrelated to the search theme, whether the ratio of the number of theme-related documents to the number of non-theme-related documents in the web document set is less than a theme-related proportion threshold, wherein the number of theme-related documents refers to the number of web documents related to the search theme, and the number of non-theme-related documents refers to the number of web documents unrelated to the search theme;

A second storing subunit, configured to store the web document into the web document set when the ratio of the number of theme-related documents to the number of non-theme-related documents in the web document set is judged to be less than the theme-related proportion threshold.

8. The device according to claim 7, characterized in that the training unit comprises:

A counter, configured to increment an increment counter by one when the web document is stored into the web document set, wherein the initial value of the increment counter is 0, and the increment counter is automatically incremented by one each time a web document is stored into the web document set;

A training subunit, configured to retrain the subject classification model when the value of the increment counter is greater than an increment threshold, and to reset the value of the increment counter to the initial value.