Method and device for identifying web page information
Technical field
The present invention relates to the field of computer application technology, and more particularly to a method and a device for identifying web page information.
Background
On a third-party shopping platform, seller users publish product web pages through the platform, and buyer users use the platform's search engine to find, among the web pages published by sellers, those that meet specific search conditions. The search engine presents the web pages that meet the search conditions to the buyer as search results, and the buyer browses the search results and then decides whether to click on a particular result to view the product in detail. In addition, when a buyer user searches through the search engine for product web pages that meet specific search conditions, the search engine also ranks the resulting web pages based on their web page information. Therefore, in order to make the web pages they publish appear in the search results, or to make them rank near the top of the search results and so gain more exposure, some seller users publish web pages containing false web page information on the third-party shopping platform. For example, of all web page information, product price information is the factor buyer users pay the most attention to, and the search engine also provides a price-based ranking function; some seller users therefore deliberately publish false price information when publishing web pages.
Because of such false web page information, two problems arise during information search. On the one hand, the search engine is likely to return web pages containing false web page information to buyer users as search results. On the other hand, the search engine may also rank web pages containing false web page information near the top of the search results. Both situations seriously degrade the search quality of the search engine and worsen the user experience.
Inconsistent information also occurs on other website platforms, such as video websites. A video website typically hosts videos such as films, music, TV series and animation, and the web page information of a video includes title information and attribute information. A film, for example, has title information and film introduction information, where the film introduction information is the attribute information of the film. A video website has users who upload film videos ("uploading users") as well as users who search for, browse and download film videos ("downloading users"). To obtain more exposure, an uploading user may fill in title information and attribute information that are inconsistent with each other. Such inconsistent web page information likewise degrades the search quality of the video website's search engine and, in turn, the search experience of downloading users.
To improve the search quality of the search engine, the prior art relies on manual spot checks to find, among the published web pages, those suspected of containing false web page information. However, the number of web pages published by seller users is enormous and human resources are limited, so only a very small number of web pages can be spot-checked. Manual spot checking is therefore difficult to apply widely and is inefficient. In view of this technical problem in the prior art, there is an urgent need for a method of automatically identifying false web page information on a third-party shopping platform, so as to improve identification efficiency.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a method and a device for identifying web page information, so as to identify false web page information automatically, improve identification efficiency and, at the same time, improve the search quality of the search engine.
The embodiments of the present application disclose the following technical solutions:
A method for identifying web page information, comprising:
obtaining web page log information from a database, the web page log information comprising any one or more of: feature information of a described object in a publishing log, feature information in an exposure log, feature information in a click log, and feature information in a transaction log;
dividing the obtained web page log information according to the category to which the described object belongs, and collecting statistics on the web page log information in each category;
establishing a statistical model for each category using the collected web page log information of that category, and determining, according to the statistical model, the feature information distribution of the described objects of each category;
judging whether the feature information of the object described in the web page information to be identified lies within the normal range of the feature information distribution of the category to which it belongs;
if so, determining that the web page information to be identified is real information; otherwise, determining that the web page information to be identified is false information.
A method for identifying web page information, comprising:
obtaining web page log information from a database, the web page log information comprising any one or more of: feature information of a described object in a publishing log, feature information in an exposure log, feature information in a click log, and feature information in a transaction log;
dividing the obtained web page log information according to the category to which the described object belongs, and collecting statistics on the web page log information in each category;
dividing the web page log information of each category according to the subcategory to which the described object belongs, and collecting statistics on the web page log information of each subcategory in each category;
establishing a statistical model for each subcategory in each category using the collected web page log information of that subcategory, and determining, according to the statistical model, the feature information distribution of the described objects of each subcategory in each category;
judging whether the feature information of the object described in the web page information to be identified lies within the normal range of the feature information distribution of the subcategory to which it belongs under the category to which it belongs;
if so, determining that the web page information to be identified is real information; otherwise, determining that the web page information to be identified is false information.
A device for identifying web page information, comprising:
an acquisition module, configured to obtain web page log information from a database, the web page log information comprising any one or more of: feature information of a described object in a publishing log, feature information in an exposure log, feature information in a click log, and feature information in a transaction log;
a statistics module, configured to divide the obtained web page log information according to the category to which the described object belongs, and to collect statistics on the web page log information in each category;
a first model establishing module, configured to establish a statistical model for each category using the collected web page log information of that category, and to determine, according to the statistical model, the feature information distribution of the described objects of each category;
a first judgment module, configured to judge whether the feature information of the object described in the web page information to be identified lies within the normal range of the feature information distribution of the category to which it belongs;
a first determining module, configured to determine that the web page information to be identified is real information when the result of the first judgment module is yes, and otherwise to determine that the web page information to be identified is false information.
A device for identifying web page information, comprising:
an acquisition module, configured to obtain web page log information from a database, the web page log information comprising any one or more of: feature information of a described object in a publishing log, feature information in an exposure log, feature information in a click log, and feature information in a transaction log;
an industry statistics module, configured to divide the obtained web page log information according to the category to which the described object belongs, and to collect statistics on the web page log information in each category;
a type statistics module, configured to divide the web page log information of each category according to the subcategory to which the described object belongs, and to collect statistics on the web page log information of each subcategory in each category;
a second model establishing module, configured to establish a statistical model for each subcategory in each category using the collected web page log information of that subcategory, and to determine, according to the statistical model, the feature information distribution of the described objects of each subcategory in each category;
a second judgment module, configured to judge whether the feature information of the object described in the web page information to be identified lies within the normal range of the feature information distribution of the subcategory to which it belongs under the category to which it belongs;
a second determining module, configured to determine that the web page information to be identified is real information when the judgment result of the second judgment module is yes, and otherwise to determine that the web page information to be identified is false information.
As can be seen from the above embodiments, the feature information distribution of the described objects of each category, or of each subcategory under each category, is established, and whether a piece of web page information is false information is identified automatically according to that distribution. This automatic way of identifying web page information improves identification efficiency.
In addition, after finding the search results, the search engine may filter out the web pages that contain false web page information, or rank the search results by the probability of the feature information of the object described in each result's web page information within the feature information distribution of the category to which it belongs. Either measure can improve the search quality of the search engine.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for identifying web page information disclosed in Embodiment one of the present application;
Fig. 2 is a flow chart of a method for identifying web page information disclosed in Embodiment two of the present application;
Fig. 3 is a flow chart of a method for identifying web page information disclosed in Embodiment three of the present application;
Fig. 4 is a structural diagram of a device for identifying web page information disclosed in Embodiment four of the present application;
Fig. 5 is a structural diagram of a device for identifying web page information disclosed in Embodiment five of the present application;
Fig. 6 is a diagram of the result of analysing a product title with a semantic analysis tool, as disclosed in the present application.
Detailed description
To make the above objects, features and advantages of the present invention clearer and easier to understand, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment one
Referring to Fig. 1, which is a flow chart of a method for identifying web page information disclosed in Embodiment one of the present application, the method comprises the following steps:
Step 101: Obtain web page log information from a database, the web page log information comprising any one or more of: feature information of a described object in a publishing log, feature information in an exposure log, feature information in a click log, and feature information in a transaction log;
A web page of a website is a kind of information carrier: it carries information about a particular object so that website users can browse it, and that particular object is the described object of the web page. The described objects of web pages differ from website to website. For example, for shopping websites such as Taobao, Jingdong, Amazon and Dangdang, the described object of a web page may be a product (for example clothes, food, furniture, household appliances or books); for video websites such as Youku, iQiyi and Tudou, the described object of a web page may be a video (for example a film, TV programme, animation or music video). Web pages of other websites, such as novel websites and recruitment websites, also have the objects they describe; that is, the web pages of any kind of website have their own described objects.
Among the various pieces of information carried on a web page, the most critical is the feature information of the described object. The "feature information of the described object" is information that characterises the described object in one particular respect. For a product, for example, price is a feature and price information is feature information of the product. The following description takes the described object of a shopping website's web pages, namely a product, as an example. The database of a third-party shopping platform records historical information generated whenever a seller user publishes a web page and saves it as a publishing log; the publishing log contains the feature information of the product. The database of the third-party shopping platform may also record an exposure log, a click log and/or a transaction log, each of which likewise contains feature information of the product, namely the product's feature information in the exposure log, in the click log and in the transaction log.
The "publishing of a product" means that when a seller user publishes a product web page on the third-party shopping platform, the product described on that web page is regarded as published. The publishing price information of the product is recorded in the database accordingly.
The "exposure of a product" means that when a buyer user searches through the search engine of the third-party shopping platform for web pages that meet specific search conditions, and the search engine presents those web pages to the buyer user as search results, the products described in the search results are regarded as exposed. Each time a product is exposed, the database records the number of times the product has been exposed and its exposure price information. For example, a buyer user searches the third-party shopping platform for web pages related to "mobile phone", and the search engine displays the related web pages to the buyer user; the "mobile phone" products involved in those web pages are thereby exposed. The database also records the number of exposures and the exposure price information of those "mobile phone" products.
The "click of a product" means that when a buyer user clicks to browse a web page in the search results, the product described in the clicked web page is regarded as clicked. Each time a product is clicked, the database records the number of times the product has been clicked and its click price information. For example, among all the web pages related to "mobile phone" displayed by the search engine, the buyer user clicks on and views the "iPhone" web page; the "iPhone" product involved in the clicked web page is thereby clicked. The database records the number of clicks and the click price information of the "iPhone" product.
The "transaction of a product" means that when a buyer user successfully purchases the product described in a clicked web page, the purchased product is regarded as transacted. Each time a product is transacted, the database records the number of transactions, the quantity of products in each transaction and the transaction price information.
A product may have feature information in each of the publishing log, exposure log, click log and transaction log, and this feature information may differ from log to log.
For example, when the feature information of a product is price information, the price of a product may be 100 in the publishing process and 100 in the exposure process, while it may be 150 in the click process and 180 in the transaction process. That is, the price information of a product in the publishing process may differ from its price information in the exposure, click and transaction processes.
Take the Iphone4 product on a third-party shopping platform as an example. The database records that the publishing price of the product at the time of publishing is 3100, that the industry to which the product belongs is "mobile phone", and that the product title is "Apple 4th-generation Iphone4 mobile phone, official unlocked 16G genuine original smart iPhone, wholesale". The exposure log information in the database records that the product was exposed 100 times, of which 30 exposures were at a price of 3500 and 70 at a price of 3000 (the exposure price may or may not be the same from one exposure to the next). The product was clicked 40 times, of which 10 clicks were at a price of 3500 and 30 at a price of 3000 (the click price may or may not be the same from one click to the next). The product was transacted 20 times: in 15 transactions, 50 products were traded per transaction at a transaction price of 3000, and in 5 transactions, 40 products were traded per transaction at a transaction price of 3500 (the quantity and the transaction price may or may not be the same from one transaction to the next).
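The log records in the Iphone4 example can be pictured as simple typed entries. The following is only an illustrative sketch: the `LogEntry` class, its field names and the `events_by_type` helper are assumptions introduced here, not structures named in the application; only the figures come from the example.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    # Hypothetical record layout; the application does not define a schema.
    log_type: str      # "publish", "exposure", "click" or "transaction"
    category: str      # category of the described object
    price: float       # feature information recorded for these events
    count: int = 1     # number of events that share this price
    quantity: int = 1  # products per transaction (transaction log only)


# Figures taken from the Iphone4 example.
iphone4_logs = [
    LogEntry("publish", "mobile phone", 3100),
    LogEntry("exposure", "mobile phone", 3500, count=30),
    LogEntry("exposure", "mobile phone", 3000, count=70),
    LogEntry("click", "mobile phone", 3500, count=10),
    LogEntry("click", "mobile phone", 3000, count=30),
    LogEntry("transaction", "mobile phone", 3000, count=15, quantity=50),
    LogEntry("transaction", "mobile phone", 3500, count=5, quantity=40),
]


def events_by_type(entries):
    """Return the total number of recorded events for each log type."""
    totals = {}
    for entry in entries:
        totals[entry.log_type] = totals.get(entry.log_type, 0) + entry.count
    return totals


totals = events_by_type(iphone4_logs)
```

Summing the counts per log type reproduces the totals stated in the example (100 exposures, 40 clicks, 20 transactions).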
Step 102: Divide the obtained web page log information according to the category to which the described object belongs, and collect statistics on the web page log information in each category;
All of the web page log information obtained above is divided according to the category to which the described object belongs. For example, products can be divided into at least the following categories: the mobile phone industry, the computer industry, the apparel industry and the household appliance industry. This list is only illustrative, and other categories may be included. The categories to which described objects belong may be divided at a coarse granularity or at a fine granularity, according to actual needs. Moreover, for different types of described object, the way categories are divided and the resulting categories also differ. The present invention does not limit the category division scheme or the division results for any described object. For the technical solution of the present invention, once the category division scheme of the described objects is determined, the division results are determined as well.
For example, after the categories of described objects have been divided into the mobile phone industry, the computer industry, the apparel industry and the household appliance industry, all of the obtained web page log information is divided into web page log information of the mobile phone industry, of the computer industry, of the apparel industry and of the household appliance industry, and statistics are then collected on the web page log information in each category.
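As a rough illustration of step 102, the division by category amounts to a group-by over the log records. The sketch below assumes each record has already been reduced to a (category, price) pair; the function names and the sample values are hypothetical.

```python
from collections import defaultdict


def group_by_category(records):
    """Split (category, price) log records into per-category price lists."""
    grouped = defaultdict(list)
    for category, price in records:
        grouped[category].append(price)
    return dict(grouped)


def summarize(grouped):
    """Collect simple statistics (count and mean price) for each category."""
    return {
        category: {"count": len(prices), "mean": sum(prices) / len(prices)}
        for category, prices in grouped.items()
    }


# Hypothetical sample records, not data from the application.
records = [
    ("mobile phone", 3000.0),
    ("mobile phone", 3500.0),
    ("computer", 5200.0),
    ("apparel", 120.0),
    ("apparel", 80.0),
]
grouped = group_by_category(records)
stats = summarize(grouped)
```

A finer-grained division (for example by subcategory) would only change the grouping key.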
Step 103: Establish a statistical model for each category using the collected web page log information of that category, and determine the feature information distribution of the described objects of each category according to the statistical model;
When the described object is a product and the feature information is price information, step 103 is specifically: establish a statistical model of the products of each category using the collected web page log information of that category's products, and determine the price information distribution of the products of each category using the statistical model.
Assuming that the price information of the products of each category follows a Gaussian mixture distribution, the statistical model is established from the collected web page log information of each category as follows: parse the collected web page log information of each category using the expectation-maximization (EM) algorithm, and establish a Gaussian mixture model of the described objects of each category from the parsing result; then determine the feature information distribution of the described objects of each category according to the Gaussian mixture model of that category.
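For reference, a Gaussian mixture distribution over a scalar feature such as price is conventionally written as below. The symbols are assumptions introduced here for illustration: $K$ is the number of single Gaussian components, $w_k$, $\mu_k$ and $\sigma_k^2$ are the weight, mean and variance of component $k$, and $x_1, \dots, x_N$ are the training samples. The EM algorithm seeks the parameters that maximize the log-likelihood $\ell$.

```latex
p(x \mid \theta) = \sum_{k=1}^{K} w_k \, \mathcal{N}\!\left(x \mid \mu_k, \sigma_k^2\right),
\qquad \sum_{k=1}^{K} w_k = 1, \quad w_k \ge 0,

\mathcal{N}\!\left(x \mid \mu_k, \sigma_k^2\right)
  = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\!\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right),

\ell(\theta) = \sum_{n=1}^{N} \ln p(x_n \mid \theta).
```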
The Gaussian mixture model is established as follows. The data in the obtained web page log information of each category is used as training data, and a machine learning method is used to train a Gaussian mixture model on the training data so as to fit the probability distribution of the feature information of the described object. If the number of samples in the training data is N, the mixture model contains N single Gaussian functions in total, each with its own mean, covariance matrix and weight; the Gaussian mixture model is obtained by summing these single Gaussian functions according to their parameter values. The EM algorithm iterates over the training data so that the likelihood function value increases until it reaches its maximum, and then takes the model parameters corresponding to that maximum. The Gaussian mixture model is thereby fitted, and the feature information distribution of the described object is obtained from it. Of course, besides the Gaussian mixture distribution, the feature information of the described object may follow other distributions, such as the log-normal distribution, the chi-squared (χ²) distribution, the t-distribution, the F-distribution or the Poisson distribution, and corresponding statistical models can be established for those distributions. The EM algorithm is a clustering algorithm based on the Gaussian mixture model. Besides the EM algorithm, parameter estimation methods such as the K-means algorithm, the least squares algorithm and the maximum likelihood algorithm may also be used to parse the Gaussian mixture model and obtain the feature information distribution of the described object. When other statistical models are used, other algorithms may be used for parsing.
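The EM fitting described above can be sketched for a one-dimensional feature such as price. This is a minimal illustration under stated assumptions, not the application's implementation: the function names are hypothetical, the number of components is fixed at a small `k` (here 2) rather than one component per sample, and the training prices are synthetic.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a single Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, k=2, iterations=100, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with EM.

    Returns (weights, means, variances). k is an assumption of this sketch.
    """
    data = sorted(data)
    n = len(data)
    # Initialise each component mean from one contiguous slice of sorted data.
    chunks = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                min_var,
            )
    return weights, means, variances


# Synthetic prices clustered around two price points (illustrative only).
random.seed(0)
prices = ([random.gauss(3000, 50) for _ in range(300)]
          + [random.gauss(3500, 50) for _ in range(200)])
weights, means, variances = fit_gmm_1d(prices, k=2)
```

Each iteration cannot decrease the likelihood of the training data, which matches the description of EM as iterating until the likelihood function value reaches its maximum. In practice a library implementation would normally be preferred over hand-written EM.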
It should be noted that the embodiments of the present application do not limit which algorithm is used to parse the statistical model; any of the algorithms listed above may be used, as may other algorithms disclosed in the prior art. Likewise, the embodiments of the present application do not limit which statistical model is used; any of the statistical models listed above may be used, as may other statistical models disclosed in the prior art.
It should also be noted that, for the statistical model to be trained well, that is, for the trained model to fit better and be more accurate, the training data must be highly authentic. The transaction log information in the database is the most authentic data and best reflects user behaviour, followed in turn by the click log information, the exposure log information and the product feature information. Therefore, when obtaining web page log information, which information in the database is used as the web page log information can be determined by the number of training samples the statistical model needs. For example, suppose 100 training samples are needed, and 30 pieces of feature information of the described object in the transaction process are obtained from the transaction log, 40 pieces of feature information in the click process are obtained from the click log, and 50 pieces of feature information in the exposure process are obtained from the exposure log. To train the statistical model, all of the information in the transaction log (30 pieces), all of the information in the click log (40 pieces) and part of the information in the exposure log (30 pieces) are used. That is, when extracting web page log information, data is selected in descending order of log authenticity until the required number of training samples is reached. In addition, when the amount of information in the transaction log and/or the click log is small and cannot meet the required number of training samples, the feature information in the transaction log and/or the click log (that is, the feature information of the described object in the transaction process and/or in the click process) can be weighted before training. In other words, to meet both the required number of training samples and the required accuracy, the more authentic information in the web page log information is weighted before training.
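The selection rule in the 100-sample example can be sketched as follows. The log names, the `PRIORITY` list and the function name are assumptions made for illustration; placing the publishing log last is also an assumption, and the weighting of scarce transaction or click data is not covered by this sketch.

```python
# Logs ordered from most to least authentic, following the order in the text.
# Treating "publish" as the last entry is an assumption of this sketch.
PRIORITY = ["transaction", "click", "exposure", "publish"]


def select_training_data(logs, required):
    """Pick `required` samples, exhausting more authentic logs first.

    `logs` maps a log name to the list of feature values found in that log.
    """
    selected = []
    for name in PRIORITY:
        remaining = required - len(selected)
        if remaining <= 0:
            break
        selected.extend((name, value) for value in logs.get(name, [])[:remaining])
    return selected


# Sizes from the example: 30 transaction, 40 click and 50 exposure records.
# The price values themselves are placeholders.
logs = {
    "transaction": [3000.0] * 30,
    "click": [3100.0] * 40,
    "exposure": [3200.0] * 50,
}
chosen = select_training_data(logs, 100)
used = {name: sum(1 for n, _ in chosen if n == name) for name in PRIORITY}
```

With these inputs the selection uses all 30 transaction records, all 40 click records and 30 of the 50 exposure records, as in the example.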
Step 104: Judge whether the feature information of the object described in the web page information to be identified lies within the normal range of the feature information distribution of the category to which it belongs; if so, go to step 105; otherwise, go to step 106;
When the described object is a product and the feature information is price information, step 104 is specifically: judge whether the price information of the product in the web page information to be identified lies within the normal range of the price information distribution of the product category to which it belongs.
When a Gaussian mixture model is used, this judgment is implemented as follows: calculate the range of two standard deviations of the Gaussian mixture distribution according to the Gaussian mixture model of the category to which the described object of the web page information to be identified belongs; judge whether the feature information of the object described in the web page information to be identified lies within the numerical range between the two standard deviations; if so, the feature information lies within the normal range of the feature information distribution of the category to which it belongs; otherwise, it does not.
Under the assumption that the feature information of the described object follows a Gaussian mixture distribution, most of the data in a Gaussian distribution is concentrated between the two standard deviations. The present invention therefore takes the numerical range between the two standard deviations as the normal range of the Gaussian distribution: feature information inside this range is determined to be real information, and feature information outside this range is determined to be false information. Besides the above judgment method, when the feature information is assumed to follow another distribution, the normal numerical range of the feature information distribution can be determined from the characteristics of that distribution.
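One possible reading of the two-standard-deviation test is sketched below. It assumes, and this is an interpretation rather than something the application states, that the normal range is the overall mixture mean plus or minus two overall standard deviations. The function names and the model parameters are hypothetical.

```python
import math


def mixture_mean_std(weights, means, variances):
    """Overall mean and standard deviation of a 1-D Gaussian mixture."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, math.sqrt(max(second_moment - mean * mean, 0.0))


def is_in_normal_range(value, weights, means, variances, n_std=2.0):
    """True if value lies within mean +/- n_std standard deviations.

    Interpreting the normal range as mean +/- 2 std is an assumption.
    """
    mean, std = mixture_mean_std(weights, means, variances)
    return mean - n_std * std <= value <= mean + n_std * std


# Hypothetical fitted model for one category (illustrative parameters only).
weights, means, variances = [0.6, 0.4], [3000.0, 3500.0], [2500.0, 2500.0]
mean, std = mixture_mean_std(weights, means, variances)

plausible = is_in_normal_range(3100.0, weights, means, variances)
suspicious = is_in_normal_range(1000.0, weights, means, variances)
```

For these parameters the overall mean is 3200 and the overall standard deviation is 250, so a price of 3100 falls inside the range and a price of 1000 falls outside it.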
In practical applications, the type of statistical model can be selected according to actual needs and the normal range of the feature information distribution determined accordingly; this is not limited in the present application.
Step 105: Determine that the web page information to be identified is real information;
Step 106: Determine that the web page information to be identified is false information.
In addition, to make the statistical model fit better, the method may further comprise, after step 103: removing from the collected web page log information the part of the data whose values are relatively low and the part whose values are relatively high;
Then, in step 104, the statistical model of each category is established using the web page log information of each category remaining after the removal, and the feature information distribution of the described objects of each category is determined according to the statistical model.
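Removing the lowest and highest values before the model is fitted can be sketched as a simple trim of the sorted data. The trimming fractions and the function name are assumptions; the application does not say how much data is removed.

```python
def trim_extremes(values, lower_fraction=0.1, upper_fraction=0.1):
    """Drop the lowest and highest fractions of the values.

    The default fractions are illustrative assumptions.
    """
    ordered = sorted(values)
    n = len(ordered)
    start = int(n * lower_fraction)
    end = n - int(n * upper_fraction)
    return ordered[start:end]


# Hypothetical prices for one category, with one very low and one very high value.
prices = [1.0, 2900.0, 2950.0, 3000.0, 3000.0, 3050.0, 3100.0, 3150.0, 3200.0, 99999.0]
trimmed = trim_extremes(prices)
```

Here the values 1.0 and 99999.0 are discarded, so they no longer distort the fitted distribution.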
Search engine can utilize the recognition result of above-mentioned webpage information, can be filtered to search result, screen out packet
Search result containing false webpage information.Alternatively, search engine is also based on the webpage information of each webpage in search result
Described in object characteristic information affiliated classification characteristic information distribution in probability, in search result each webpage carry out
Sequence.Here it is possible to which webpage information is identified by search engine, and directly search result was carried out using recognition result
Filter or sequence.It is of course also possible to execute the identification of webpage information by other function modules on third party's shopping platform, search is drawn
It holds up and calls recognition result from the function module.The present invention does not limit this.
Preferably, after webpage information is identified as false information, the method further includes: filtering out of the search results the webpages that contain false webpage information, and feeding the filtered search results back to the client.
Alternatively, and preferably, after the characteristic information distribution of each category is obtained, the method further includes: when ranking the webpages in the search results, calculating, for each webpage, the probability of the characteristic information of the object described in its webpage information within the characteristic information distribution of the category to which it belongs; and ranking the webpages in the search results in descending order of that probability. Other ranking orders may of course also be used.
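The filtering and ranking described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each category's price distribution is a single normal distribution, that the normal range is two standard deviations around the mean, and that the "probability in the distribution" is taken as the probability density; the names `category_dist`, `in_normal_range`, `filter_and_rank`, and the result fields are hypothetical.

```python
from statistics import NormalDist

# Hypothetical per-category price distribution; in the method it would come
# from the statistical model established for each category.
category_dist = {"mobile phone": NormalDist(mu=3000.0, sigma=500.0)}

def in_normal_range(dist, value, k=2.0):
    """Assumed rule: the value is normal if within k standard deviations."""
    return abs(value - dist.mean) <= k * dist.stdev

def filter_and_rank(results):
    """Drop results judged false, then sort the rest by density, largest first."""
    kept = [r for r in results
            if in_normal_range(category_dist[r["category"]], r["price"])]
    return sorted(kept,
                  key=lambda r: category_dist[r["category"]].pdf(r["price"]),
                  reverse=True)

results = [
    {"id": "a", "category": "mobile phone", "price": 3400.0},
    {"id": "b", "category": "mobile phone", "price": 1.0},
    {"id": "c", "category": "mobile phone", "price": 2900.0},
]
ranked = filter_and_rank(results)
print([r["id"] for r in ranked])  # prints ['c', 'a']
```

Result "b" is filtered out because its price lies far outside the assumed normal range, and "c" is placed ahead of "a" because its price is closer to the centre of the distribution.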
As can be seen from the above embodiment, webpage log information is obtained from a database, a statistical model is established for each category, and the characteristic information distribution of the described objects of each category is determined according to the statistical model. This distribution is used to identify whether the webpage information under identification is false information, and it can also be used for ranking so that consumers are offered better choices.
In particular, the search engine can use the identified authenticity of the webpage information to filter out false webpage information and feed the filtered search results back to the client, thereby improving the search quality of the search engine. The search engine can also rank the real webpage information in the search results in descending order of its probability in the distribution, thereby improving the user experience.
Embodiment two
Because each category contains a very large number of described objects whose characteristic information differs widely, the accuracy of a judgment made at category level is limited. The second embodiment of the present invention therefore provides an information identification method that further identifies whether the webpage information of the described objects in each subcategory of each category is false information. Referring to Fig. 2, which is a flowchart of another webpage information identification method disclosed in embodiment two of the present application, the method includes the following steps:
Step 201: Obtain webpage log information from a database, the webpage log information including any one or more of the characteristic information of the described object in the publishing log, the characteristic information in the exposure log, the characteristic information in the click log, and the characteristic information in the transaction log;
The characteristic information of the described object includes at least title information.
Step 202: Divide the obtained webpage log information according to the category to which the described object belongs, and count the webpage log information in each category;
Step 203: Divide the webpage log information of each category according to the subcategory to which the described object belongs, and count the webpage log information of each subcategory in each category;
When the characteristic information of the described object includes at least title information, dividing the webpage log information of each category according to the subcategory to which the described object belongs and counting the webpage log information of each subcategory in each category is specifically: performing semantic analysis on the title information with a semantic analysis tool (such as Termweight) to obtain the subcategory to which each described object in each category belongs; and counting the webpage log information of the described objects that have the same subcategory in each category.
For example: the category of a product is the mobile phone industry and the product title is "Apple 4th-generation Iphone4 mobile phone, official unlocked 16G, genuine original smart iPhone, wholesale". Analyzing the product title with the semantic analysis tool gives the result shown in Fig. 6, from which the subcategory of the product is found to be Apple 4th-generation mobile phone; the webpage log information of all Apple 4th-generation mobile phones in the mobile phone industry is then counted. As another example: the category of a product is the apparel industry and the product title is "Nike/NIKE sports men's jacket, menswear color-block jacket". Semantic analysis shows that the subcategory of the product is Nike men's jacket; the webpage log information of all products in the apparel industry whose subcategory is Nike men's jacket is then counted.
It should be noted that the subcategory to which a described object belongs can be divided at coarse or fine granularity according to actual needs. Moreover, for different types of described objects, the way subcategories are classified and the resulting classification also differ. The present invention does not limit the subcategory classification scheme or the classification results for any described object. For the technical solution of the present invention, once the subcategory classification scheme of the described object is determined, the classification results are determined as well.
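The subcategory division of step 203 can be sketched in Python as follows. The interface of Termweight is not described in this document, so the sketch substitutes a simple keyword lookup for the semantic analysis; `SUBCATEGORY_KEYWORDS`, `subcategory_of`, `group_logs`, and the log field names are all hypothetical and serve only to illustrate grouping log records by category and subcategory.

```python
from collections import defaultdict

# Hypothetical keyword rules standing in for a real semantic analysis tool.
SUBCATEGORY_KEYWORDS = {
    "mobile phone": {"iPhone 4": ["iphone4", "iphone 4"]},
    "apparel": {"Nike men's jacket": ["nike"]},
}

def subcategory_of(category, title):
    """Return the first subcategory whose keywords appear in the title."""
    lowered = title.lower()
    for sub, keywords in SUBCATEGORY_KEYWORDS.get(category, {}).items():
        if any(k in lowered for k in keywords):
            return sub
    return "other"

def group_logs(logs):
    """Collect the prices of log records under (category, subcategory) keys."""
    groups = defaultdict(list)
    for log in logs:
        key = (log["category"], subcategory_of(log["category"], log["title"]))
        groups[key].append(log["price"])
    return dict(groups)

logs = [
    {"category": "mobile phone", "title": "Apple Iphone4 16G unlocked", "price": 3200.0},
    {"category": "mobile phone", "title": "iPhone 4 genuine wholesale", "price": 3100.0},
    {"category": "apparel", "title": "Nike/NIKE sports men's jacket", "price": 450.0},
]
groups = group_logs(logs)
```

A coarser or finer subcategory scheme would only change the keyword table (or the tool behind `subcategory_of`); the grouping step itself stays the same.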
Step 204: Establish a statistical model for each subcategory in each category using the counted webpage log information of each subcategory in each category, and determine the characteristic information distribution of the described objects of each subcategory in each category according to the statistical model;
When the characteristic information of the product is price information, the above step is specifically: establishing a statistical model for each product type in each product industry using the counted webpage log information of the products of each type in each product industry, and determining the price information distribution of each product type in each product industry according to the statistical model.
Step 205: Judge whether the characteristic information of the object described in the webpage information under identification is within the normal range of the characteristic information distribution of the subcategory to which it belongs under the category to which it belongs; if so, proceed to step 206; otherwise, proceed to step 207;
In practical applications, different types of statistical model can be selected according to actual needs, and the normal range of the characteristic information distribution is then determined according to the selected model; this application does not limit the choice.
Step 206: Determine that the webpage information under identification is real information;
Step 207: Determine that the webpage information under identification is false information.
For the execution of steps 204-207 above, reference may be made to steps 103-106 in embodiment one; since that content has been described in detail in embodiment one, it is not repeated here.
In addition, to make the established statistical model more accurate, after the webpage log information of the described objects of each subcategory in each category is counted in step 203, the portion of the data with the lowest values and the portion with the highest values can also be removed from the counted webpage log information. The removed portion can be 5%, 10%, or some other percentage of the data; how much to remove is decided according to the actual situation. Step 204 is then specifically: establishing the statistical model of the described objects of each subcategory in each category using the webpage log information of each subcategory that remains after the removal, and determining the characteristic information distribution of the described objects of each subcategory in each category according to the statistical model.
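The removal of the lowest and highest values can be sketched as a symmetric trim. The text does not say whether the stated percentage applies to each end or to both ends together; the sketch below assumes it applies to each end, and the function name `trim_extremes` and the sample prices are hypothetical.

```python
def trim_extremes(values, fraction=0.05):
    """Drop the lowest and the highest `fraction` of the values (per end)."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    cut = int(len(ordered) * fraction)  # number of records removed at each end
    return ordered[cut:len(ordered) - cut]

prices = [1.0, 2900.0, 3000.0, 3050.0, 3100.0,
          3150.0, 3200.0, 3250.0, 3300.0, 99999.0]
trimmed = trim_extremes(prices, fraction=0.10)
print(len(trimmed))  # prints 8
```

With ten records and a 10% trim, one record is removed at each end, so the implausible prices 1.0 and 99999.0 no longer influence the model that is fitted afterwards.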
The search engine can use the above identification results to filter the search results, screening out those that contain false webpage information. Alternatively, the search engine can rank the webpages in the search results according to the probability, within the characteristic information distribution of the subcategory to which each webpage belongs under its category, of the characteristic information of the object described in that webpage's webpage information. The search engine may identify the webpage information itself and use the identification results directly to filter or rank the search results. Alternatively, another functional module on the third-party shopping platform may perform the identification, with the search engine retrieving the identification results from that module. The present invention does not limit this.
Preferably, after webpage information is identified as false information, the method further includes: filtering out of the search results the webpages that contain false webpage information, and feeding the filtered search results back to the client.
Alternatively, and preferably, after the characteristic information distribution of each subcategory in each category is obtained, the method further includes: when ranking the webpages in the search results, calculating, for each webpage, the probability of the characteristic information of the object described in its webpage information within the characteristic information distribution of the subcategory to which it belongs under its category; and ranking the webpages in descending order of that probability. The webpages may of course also be ranked in other orders.
As can be seen from the above embodiment, webpage log information is obtained from a database, a statistical model is established for each subcategory under each category, and the characteristic information distribution of the described objects of each subcategory under each category is determined according to the statistical model. Identifying whether the webpage information under identification is false information by means of these subcategory-level distributions makes the identification more effective and more precise.
In particular, the search engine can use the identified authenticity of the product webpage information to feed real webpage information back to the client, and can also rank the search results according to the probability of the product webpage information in the distribution; this not only improves search quality but also gives users a better search experience.
Embodiment three
In the following, the webpage information identification method provided by the present application is described in more detail, taking as an example the case where the statistical model is a Gaussian mixture model, the described object is a product, the characteristic information includes price information and title information, product categories are divided according to the industry to which a product belongs, and product subcategories are divided according to the type to which a product belongs. Referring to Fig. 3, which is a flowchart of an information identification method disclosed in embodiment three of the present application, the method includes the following steps:
Step 301: Extract webpage log information from a database, the webpage log information including any one or more of the price information of the product in the publishing log, the price information in the exposure log, the price information in the click log, and the price information in the transaction log;
Step 302: Divide the obtained webpage log information according to the industry to which the product belongs, and count the webpage log information of each product industry;
Step 303: Divide the webpage log information of each product industry according to the type to which the product belongs, and count the webpage log information of each product type in each product industry;
Dividing the webpage log information of each product industry according to the type to which the product belongs is specifically implemented as follows: semantic analysis is performed on the title information with a semantic analysis tool (such as Termweight) to obtain the product types under each product industry; the webpage log information of the products with the same product type in each product industry is then counted.
Step 304: Establish a statistical model for each product type in each product industry using the counted webpage log information of the products of each type in each product industry, and determine the price information distribution of each product type in each product industry according to the statistical model;
Step 305: Judge whether the price information of the product in the webpage information under identification is within the normal range of the price information distribution of the product type to which it belongs under the product industry to which it belongs; if so, proceed to step 306; otherwise, proceed to step 307;
When the statistical model is a Gaussian mixture model, one implementation of step 305 is: calculating the two-standard-deviation range of the Gaussian mixture distribution from the Gaussian mixture model of the product type to which the product webpage information under identification belongs under its product industry;
judging whether the price information of the product in the webpage information under identification is within the two-standard-deviation range; if so, the price information of the product is within the normal range of the product price information distribution of its type under its product industry; otherwise, it is not within that normal range.
Step 306: Determine that the webpage information under identification is real information;
Step 307: Determine that the webpage information under identification is false information;
Step 308: When ranking the webpages in the search results, calculate, for each webpage, the probability of the price information of the product in its webpage information within the product price information distribution of the product type to which it belongs under its product industry;
Step 309: Rank the webpages in the search results in descending order of that probability.
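Steps 305 to 309 can be sketched in Python as follows. The text does not specify how the two-standard-deviation range of a Gaussian mixture is computed; the sketch assumes it is taken around the overall mean of the mixture, using the overall standard deviation obtained from the component weights, means, and standard deviations, and it assumes the "probability" used for ranking is the mixture density. The mixture parameters and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def mixture_mean_std(components):
    """components: list of (weight, mean, std); weights are assumed to sum to 1."""
    mean = sum(w * m for w, m, _ in components)
    second_moment = sum(w * (s * s + m * m) for w, m, s in components)
    return mean, math.sqrt(second_moment - mean * mean)

def mixture_pdf(components, x):
    """Density of the mixture at x (weighted sum of component densities)."""
    return sum(w * NormalDist(m, s).pdf(x) for w, m, s in components)

def is_real(components, price, k=2.0):
    """Steps 305-307: real if the price lies within k overall standard deviations."""
    mean, std = mixture_mean_std(components)
    return mean - k * std <= price <= mean + k * std

# Hypothetical fitted mixture for one product type in one product industry.
gmm = [(0.5, 100.0, 10.0), (0.5, 200.0, 10.0)]

# Steps 308-309: rank prices by mixture density, largest first.
ranked_prices = sorted([150.0, 100.0, 250.0],
                       key=lambda p: mixture_pdf(gmm, p), reverse=True)
print(ranked_prices)  # prints [100.0, 150.0, 250.0]
```

Under these assumptions the range test and the ranking answer different questions: a price of 150.0 falls inside the two-standard-deviation range of this mixture, yet it ranks below 100.0 because it lies between the two components where the density is low.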
In addition, to make the established Gaussian mixture model more accurate, after the webpage log information of each product type in each product industry is counted in step 303, the portion of the data with the lowest values and the portion with the highest values can also be removed from the counted webpage log information. For example, the removed portion can be 5%, 10%, or some other percentage of the data; how much to remove is decided according to the actual situation.
Step 304 is then specifically: establishing the Gaussian mixture model of each product type in each product industry using the webpage log information of each product type that remains after the removal, and determining the price information distribution of each product type in each product industry according to the Gaussian mixture model.
As can be seen from the above embodiment, a Gaussian mixture model is established using the counted product webpage log information of the products of each type in each product industry, the product webpage characteristic information distribution of the products of each type in each product industry is obtained, and the products are ranked. As a result, not only can it be accurately identified whether the product webpage information of each type of product in each industry is false information, making identification more effective and more precise, but consumers can also be given more reliable search information and a more convenient search experience.
Embodiment four
Corresponding to the webpage information identification method of embodiment one above, this embodiment of the present application provides a webpage information identification device. Referring to Fig. 4, which is a structural diagram of a webpage information identification device disclosed in embodiment four of the present application, the device includes: an acquisition module 401, a statistical module 402, a first model-establishing module 403, a first judgment module 404, and a first determining module 405. The internal structure and connection relationships of the device are further described below in conjunction with its operating principle.
The acquisition module 401 is configured to obtain webpage log information from a database, the webpage log information including any one or more of the characteristic information of the described object in the publishing log, the characteristic information in the exposure log, the characteristic information in the click log, and the characteristic information in the transaction log;
The statistical module 402 is configured to divide the obtained webpage log information according to the category to which the described object belongs, and to count the webpage log information in each category;
The first model-establishing module 403 is configured to establish a statistical model for each category using the counted webpage log information in each category, and to determine the characteristic information distribution of the described objects of each category according to the statistical model;
The first judgment module 404 is configured to judge whether the characteristic information of the object described in the webpage information under identification is within the normal range of the characteristic information distribution of the category to which it belongs;
The first determining module 405 is configured to determine that the webpage information under identification is real information when the result of the first judgment module is yes, and otherwise to determine that the webpage information under identification is false information.
Preferably, when the statistical model is a Gaussian mixture model, the first model-establishing module 403 includes analysis submodule one and determination submodule one. Analysis submodule one is configured to analyze the counted webpage log information of each category using the EM algorithm and to establish the Gaussian mixture model of each category from the analysis results; determination submodule one is configured to determine the characteristic information distribution of each category according to the Gaussian mixture model of each category.
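The work of analysis submodule one can be illustrated with a plain expectation-maximization (EM) fit of a one-dimensional Gaussian mixture. This is a minimal sketch, not the implementation used by the device: the number of components, the quantile-based initialization, the fixed iteration count, and the lower bound on the standard deviation are all assumptions, and `fit_gmm_em` is a hypothetical name.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def fit_gmm_em(data, k=2, iterations=50, min_std=1e-3):
    """Fit a k-component 1-D Gaussian mixture; returns (weight, mean, std) tuples."""
    ordered = sorted(data)
    n = len(ordered)
    # Initialization (assumed): means at evenly spaced quantiles,
    # a shared standard deviation, and equal weights.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall = sum(ordered) / n
    spread = math.sqrt(sum((x - overall) ** 2 for x in ordered) / n) or 1.0
    stds = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in ordered:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate the weight, mean, and std of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, ordered)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, ordered)) / nj
            stds[j] = max(math.sqrt(var), min_std)
    return list(zip(weights, means, stds))

prices = [98.0, 100.0, 102.0, 99.0, 101.0, 198.0, 200.0, 202.0, 199.0, 201.0]
params = fit_gmm_em(prices, k=2)
```

For the sample prices, which form two well-separated clusters, the fitted component means settle near the two cluster centres. In practice a library implementation would normally be preferred; the hand-written loop is shown only to make the E-step and M-step explicit.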
Preferably, when the first model-establishing module 403 includes analysis submodule one and determination submodule one, the first judgment module 404 includes: calculation submodule one, configured to calculate the two-standard-deviation range of the Gaussian mixture distribution from the Gaussian mixture model of the category to which the described object of the webpage information under identification belongs; and judgment submodule one, configured to judge whether the characteristic information of the object described in the webpage information under identification is within the two-standard-deviation range; if so, that characteristic information is within the normal range of the characteristic information distribution of the category to which it belongs; otherwise, it is not within that normal range.
Preferably, the device further includes: a first feedback module, configured to filter out of the search results the webpages that contain false webpage information and to feed the filtered search results back to the client.
Preferably, the device further includes: a first probability calculation module, configured to calculate, when the webpages in the search results are ranked, the probability of the characteristic information of the object described in the webpage information of each webpage within the characteristic information distribution of the category to which it belongs;
and a first sorting module, configured to rank the webpages in the search results in descending order of that probability.
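The division of the device into modules can be pictured with the following Python sketch, in which each method stands in for one of modules 401 to 405. It is an illustration under simplifying assumptions only: the logs are assumed to be already loaded, price is the only characteristic used, a single normal distribution replaces the statistical model, and the class name, method names, and field names are hypothetical.

```python
from collections import defaultdict
from statistics import NormalDist, mean, pstdev

class WebpageInfoIdentifier:
    """Sketch of the device; each method stands in for one module."""

    def __init__(self, logs):
        self.logs = logs   # acquisition module 401: logs assumed already fetched
        self.models = {}

    def count_by_category(self):                          # statistical module 402
        grouped = defaultdict(list)
        for log in self.logs:
            grouped[log["category"]].append(log["price"])
        return grouped

    def build_models(self):                               # model-establishing module 403
        for category, prices in self.count_by_category().items():
            self.models[category] = NormalDist(mean(prices), pstdev(prices))

    def in_normal_range(self, category, price, k=2.0):    # judgment module 404
        dist = self.models[category]
        return abs(price - dist.mean) <= k * dist.stdev

    def identify(self, category, price):                  # determining module 405
        return "real" if self.in_normal_range(category, price) else "false"

logs = [{"category": "mobile phone", "price": p}
        for p in (90.0, 100.0, 110.0, 100.0)]
device = WebpageInfoIdentifier(logs)
device.build_models()
print(device.identify("mobile phone", 105.0))  # prints real
print(device.identify("mobile phone", 1.0))    # prints false
```

Keeping each module behind its own method means the statistical model or the normal-range rule can be replaced without touching the other modules.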
As can be seen from the above embodiment, webpage log information is obtained from a database, a statistical model is established for each category, and the characteristic information distribution of the described objects of each category is determined according to the statistical model. This distribution is used to identify whether the webpage information under identification is false information, and it can also be used for ranking so that consumers are offered better choices.
In particular, the search engine can use the identified authenticity of the webpage information to filter out false webpage information and feed the filtered search results back to the client, thereby improving the search quality of the search engine. The search engine can also rank the real webpage information in the search results in descending order of its probability in the distribution, thereby improving the user experience.
Embodiment five
Corresponding to the webpage information identification method of embodiment two above, this embodiment of the present application provides a webpage information identification device. Referring to Fig. 5, which is a schematic diagram of a webpage information identification device disclosed in embodiment five of the present application, the device includes: an acquisition module 501, an industry statistics module 502, a type statistics module 503, a second model-establishing module 504, a second judgment module 505, and a second determining module 506. The internal structure and connection relationships of the device are further described below in conjunction with its operating principle.
The acquisition module 501 is configured to obtain webpage log information from a database, the webpage log information including any one or more of the characteristic information of the described object in the publishing log, the characteristic information in the exposure log, the characteristic information in the click log, and the characteristic information in the transaction log;
The industry statistics module 502 is configured to divide the obtained webpage log information according to the category to which the described object belongs, and to count the webpage log information in each category;
The type statistics module 503 is configured to divide the webpage log information of each category according to the subcategory to which the described object belongs, and to count the webpage log information of each subcategory in each category;
The second model-establishing module 504 is configured to establish a statistical model for each subcategory in each category using the counted webpage log information of each subcategory in each category, and to determine the characteristic information distribution of the described objects of each subcategory in each category according to the statistical model;
The second judgment module 505 is configured to judge whether the characteristic information of the object described in the webpage information under identification is within the normal range of the characteristic information distribution of the subcategory to which it belongs under the category to which it belongs;
The second determining module 506 is configured to determine that the webpage information under identification is real information when the judgment result of the second judgment module is yes, and otherwise to determine that the webpage information under identification is false information.
Preferably, the characteristic information of the described object includes at least title information, and the type statistics module 503 specifically includes an analysis submodule and a statistics submodule, wherein the analysis submodule is configured to perform semantic analysis on the title information with a semantic analysis tool to obtain the subcategory to which each described object in each category belongs;
and the statistics submodule is configured to count the webpage log information of the described objects that have the same subcategory in each category.
Preferably, the device further includes: a second feedback module, configured to filter out of the search results the webpages that contain false webpage information and to feed the filtered search results back to the client.
Preferably, the device further includes a second probability calculation module and a second sorting module;
the second probability calculation module is configured to calculate, when the webpages in the search results are ranked, the probability of the characteristic information of the object described in the webpage information of each webpage within the characteristic information distribution of the subcategory to which it belongs under its category;
the second sorting module is configured to rank the webpages in descending order of that probability.
As can be seen from the above embodiment, webpage log information is obtained from a database, a statistical model is established for each subcategory under each category, and the characteristic information distribution of the described objects of each subcategory under each category is determined according to the statistical model. Identifying whether the webpage information under identification is false information by means of these subcategory-level distributions makes the identification more effective and more precise.
In particular, the search engine can use the identified authenticity of the product webpage information to feed real webpage information back to the client, and can also rank the search results according to the probability of the product webpage information in the distribution; this not only improves search quality but also gives users a better search experience.
It should be noted that a person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The webpage information identification method and device provided by the present invention have been described in detail above. Specific embodiments are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, for a person of ordinary skill in the art, the specific implementation and the scope of application may change in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.