CN103970801B

CN103970801B - Method and device for recognizing microblog advertisement blog posts

Info

Publication number: CN103970801B
Application number: CN201310046176.6A
Authority: CN
Inventors: 张国强
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd; Tencent Cloud Computing Beijing Co Ltd
Priority date: 2013-02-05
Filing date: 2013-02-05
Publication date: 2019-03-26
Anticipated expiration: 2033-02-05
Also published as: CN103970801A

Abstract

The present invention discloses a method and device for recognizing microblog advertisement blog posts. The method includes: creating a microblog filter using known advertisement blog posts and non-advertisement blog posts as samples; and performing advertisement recognition on a current microblog post based on the microblog filter and a Bayesian algorithm. Based on the Bayesian algorithm, and using known advertisement and non-advertisement blog posts as samples, the invention obtains an advertisement or non-advertisement microblog filter and uses it to estimate the probability that the current microblog post is an advertisement blog post. Advertisement blog posts in microblogs are thereby recognized effectively, and the valid-data recall rate of a search engine is improved. Furthermore, by learning from new advertisement blog post samples and normal (non-advertisement) blog post samples, the microblog filter can be continuously retrained and updated, which makes recognition more effective for microblog text, a medium with strong real-time characteristics.

Description

Method and device for recognizing microblog advertisement blog posts

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and device for recognizing microblog advertisement blog posts.

Background art

On the Internet, recognizing advertisement blog posts in microblog communities is an important part of anti-advertising and anti-cheating work. At present, advertisement microblogs are mainly recognized as follows: advertisement blog posts are collected manually, a keyword table for advertisement recognition is generated from them, and the current blog post is then judged against the keywords in that table. The microblog shown in Fig. 1 is an advertisement microblog.

During recognition, suppose the keyword table contains the keywords "number, member, presell" and the recognition rule is: if a microblog contains these keywords, it is regarded as an advertisement. The microblog in Fig. 1 is then recognized as an advertisement microblog.

However, the existing advertisement microblog recognition method has the following disadvantages:

1. The advertisement keyword table is hard to maintain. Keywords must be recognized and collected manually, which is inefficient, and it is difficult to collect a comprehensive set of advertisement blog posts by hand. A comprehensive keyword table therefore cannot be generated and can only be accumulated passively, so the recall of advertisement blog post recognition is insufficient. In addition, selecting keywords from manually discovered advertisement blog posts is itself difficult. Take the two blog posts in Fig. 2 and Fig. 3 as an example: the blog post in Fig. 2 is an advertisement, and judging from that post alone the word "Taobao" could be added to the advertisement keyword table; judging from Fig. 3, however, "Taobao" should not be added.

2. When a blog post is judged to be an advertisement or not solely from the keyword table, accuracy is hard to control, because the occurrence of a word depends heavily on its context (unless very long phrases are themselves used as vocabulary entries, the accuracy of judging a blog post to be an advertisement from vocabulary remains questionable), and the function and meaning of a word differ greatly between contexts. For example, "cheap" is a common word in advertisement blog posts but may also appear in normal blog posts. Compare the two blog posts in Fig. 4 and Fig. 5: Fig. 4 is a normal blog post and Fig. 5 is an advertisement blog post.

3. Microblogs are a highly time-sensitive community medium. Collecting vocabulary manually is slow to update and yields only small updates, so cheating blog posts cannot be discovered in time.

Summary of the invention

The main purpose of the present invention is to provide a method and device for recognizing microblog advertisement blog posts, so as to effectively recognize advertisement blog posts in microblogs and improve the valid-data recall rate of a search engine.

To achieve the above purpose, the present invention proposes a method for recognizing microblog advertisement blog posts, comprising:

creating a microblog filter using known advertisement blog posts and non-advertisement blog posts as samples;

performing advertisement recognition on a current microblog post based on the microblog filter and a Bayesian algorithm.

The present invention also proposes a device for recognizing microblog advertisement blog posts, comprising:

a creation module, configured to create a microblog filter using known advertisement blog posts and non-advertisement blog posts as samples;

an identification module, configured to perform advertisement recognition on a current microblog post based on the microblog filter and a Bayesian algorithm.

The method and device proposed by the present invention are based on a Bayesian algorithm. Using known advertisement blog posts and non-advertisement blog posts as samples, they obtain an advertisement or non-advertisement microblog filter and use it to estimate the probability that the current microblog post is an advertisement blog post. Advertisement blog posts in microblogs are thereby recognized effectively, and the valid-data recall rate of a search engine is improved. Furthermore, by learning from new advertisement blog post samples and normal (non-advertisement) blog post samples, the microblog filter can be continuously retrained and updated, which makes recognition more effective for microblog text, a medium with strong real-time characteristics.

Brief description of the drawings

Fig. 1 is a schematic diagram of a first existing microblog example;

Fig. 2 is a schematic diagram of a second existing microblog example;

Fig. 3 is a schematic diagram of a third existing microblog example;

Fig. 4 is a schematic diagram of a fourth existing microblog example;

Fig. 5 is a schematic diagram of a fifth existing microblog example;

Fig. 6 is a flow diagram of a first embodiment of the microblog advertisement blog post recognition method of the present invention;

Fig. 7 is a flow diagram of one way of creating a microblog filter using known advertisement blog posts and non-advertisement blog posts as samples in the first embodiment of the method;

Fig. 8 is a flow diagram of another way of creating a microblog filter using known advertisement blog posts and non-advertisement blog posts as samples in the first embodiment of the method;

Fig. 9 is a flow diagram of a second embodiment of the microblog advertisement blog post recognition method of the present invention;

Fig. 10 is a structural diagram of a first embodiment of the microblog advertisement blog post recognition device of the present invention;

Fig. 11 is a structural diagram of the creation module in the first embodiment of the device;

Fig. 12 is a structural diagram of the identification module in the first embodiment of the device;

Fig. 13 is a structural diagram of a second embodiment of the microblog advertisement blog post recognition device of the present invention.

To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.

Detailed description of the embodiments

The solution of the embodiments of the present invention is mainly as follows: based on a Bayesian algorithm, and using known advertisement blog posts and non-advertisement blog posts as samples, an advertisement or non-advertisement microblog filter is obtained, and the filter is used to estimate the probability that the current microblog post is an advertisement blog post, thereby effectively recognizing advertisement blog posts in microblogs.

As shown in Fig. 6, a first embodiment of the present invention proposes a method for recognizing microblog advertisement blog posts, comprising:

Step S101: create a microblog filter using known advertisement blog posts and non-advertisement blog posts as samples.

The present invention recognizes microblog advertisement blog posts on the basis of Bayesian theory.

To determine whether a microblog is an advertisement microblog or a normal microblog (in this embodiment, a non-advertisement microblog), this embodiment first creates a microblog filter from known advertisement blog posts and non-advertisement blog posts, and then uses the created filter to perform advertisement recognition on the current microblog.

The microblog filter may be either an advertisement microblog filter or a non-advertisement microblog filter. The output of the advertisement microblog filter is the probability that the current microblog is an advertisement microblog; the output of the non-advertisement microblog filter is the probability that the current microblog is a non-advertisement microblog.

The filter creation process and the advertisement blog post recognition process correspond, respectively, to the offline part and the online part of the overall recognition system of this embodiment.

In the offline part, either an advertisement microblog filter or a non-advertisement microblog filter may be created, or both may be created and used together.

Step S102: perform advertisement recognition on the current microblog post based on the microblog filter and the Bayesian algorithm.

After the microblog filter has been created, the online part of this embodiment begins: the current microblog post is recognized, that is, judged to be either an advertisement blog post or a non-advertisement blog post.

Specifically, the current microblog post is first segmented into words and converted into a vector. The vector is then input into the microblog filter created in step S101, and the probability that the current microblog post is an advertisement blog post is calculated using the Bayesian algorithm together with the total probability formula.

For an advertisement microblog filter, the output is directly the probability that the current microblog is an advertisement microblog. For a non-advertisement microblog filter, the output is the probability that the current microblog is a non-advertisement microblog, and this result is then converted into the probability that the current microblog is an advertisement microblog.
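
The text does not spell out how the output of a non-advertisement filter is converted. A minimal sketch, assuming every blog post is either an advertisement or a normal blog post so that the two probabilities are complementary:

```latex
P(\text{bad} \mid W_1, W_2, \dots, W_n) = 1 - P(\text{good} \mid W_1, W_2, \dots, W_n)
```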

The probability that the current microblog post is an advertisement blog post is then compared with a predetermined threshold; if it exceeds the threshold, the microblog post is determined to be an advertisement blog post.

The predetermined threshold can be obtained by computing statistics over the known advertisement blog post set and non-advertisement blog post set.

More specifically, as shown in Fig. 7, taking the creation of an advertisement blog post filter as an example, step S101 (creating a microblog filter using known advertisement blog posts and non-advertisement blog posts as samples) may include:

Step S1010: collect a number of known advertisement blog posts and non-advertisement blog posts to form an advertisement blog post set and a non-advertisement blog post set, which serve as samples;

Step S1011: segment each blog post in the advertisement blog post set and the non-advertisement blog post set to obtain the word sequence of each blog post;

Step S1012: calculate the probability with which each word in the advertisement blog post set occurs in that set, and calculate the probability with which each word in the non-advertisement blog post set occurs in that set;

Step S1013: from the calculated probabilities, build correspondence hash tables that map each word in the advertisement blog post set and in the non-advertisement blog post set to the probability of that word occurring in the respective set;

Step S1014: based on the correspondence hash tables, establish according to the Bayesian algorithm a mapping hash table between each word in the advertisement blog post set and the probability that a blog post is an advertisement given that the word appears in it, thereby obtaining the advertisement blog post filter.

As shown in Fig. 8, when a non-advertisement blog post filter is created instead, step S101 is similar to the procedure shown in Fig. 7, except that step S1014 of Fig. 7 is replaced by step S1015:

Step S1015: based on the correspondence hash tables, establish according to the Bayesian algorithm a mapping hash table between each word in the non-advertisement blog post set and the probability that a blog post is a non-advertisement blog post given that the word appears in it, thereby obtaining the non-advertisement blog post filter.

The specific implementation of this embodiment is described in detail below with an example.

Offline part of this embodiment (taking the creation of an advertisement blog post filter as an example):

1. Accumulate a massive normal blog post set (non-advertisement blog post set) and a massive advertisement blog post set, denoted SET_GOOD and SET_BAD respectively.

2. For the normal blog post set and the advertisement blog post set, the word sequence obtained by segmenting any blog post D can be represented as a vector, D = (W₁, W₂, ..., W_n), where n is the number of words after segmentation. SET_GOOD and SET_BAD can therefore each be regarded as a collection of individual words. Any word W_i in SET_GOOD is written W_i ∈ SET_GOOD. The probability that a word picked at random from SET_GOOD is W_i (that is, the probability of W_i occurring in SET_GOOD) is denoted P_{i_good}, so that P_{i_good} = [formula not reproduced in this text] or P_{i_good} = [formula not reproduced in this text], where TF(W_i) is the term frequency of word W_i and N is the number of distinct words in SET_GOOD. P_{i_bad} (the probability of W_i occurring in SET_BAD) is calculated in the same way.
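
The expressions for P_{i_good} are not legible in this text. One natural estimate consistent with the stated definitions, assumed here purely for illustration, is the relative term frequency. The subscripted TF symbols and M (the number of distinct words in SET_BAD) are notation introduced for this sketch:

```latex
P_{i\_good} = \frac{TF_{good}(W_i)}{\sum_{j=1}^{N} TF_{good}(W_j)},
\qquad
P_{i\_bad} = \frac{TF_{bad}(W_i)}{\sum_{j=1}^{M} TF_{bad}(W_j)}
```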

3. From the results of step 2, generate the following correspondence hash tables for SET_GOOD and SET_BAD respectively:

GoodHashtable: W_i → P_{i_good}, for W_i ∈ SET_GOOD
BadHashtable: W_i → P_{i_bad}, for W_i ∈ SET_BAD

Here GoodHashtable records, for any word W_i in SET_GOOD, the probability P_{i_good} with which the word occurs in SET_GOOD, and BadHashtable records, for any word W_i in SET_BAD, the probability P_{i_bad} with which the word occurs in SET_BAD.
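
Steps 1 to 3 can be sketched as follows. This is a minimal illustration, assuming the posts are already segmented into word lists and that the word probability is the relative term frequency; the sample data and the names `word_probabilities`, `good_hashtable` and `bad_hashtable` are hypothetical.

```python
from collections import Counter


def word_probabilities(posts):
    """Map each word to its relative frequency over a set of segmented posts.

    Assumption of this sketch: probability = TF(word) / total word count.
    """
    tf = Counter(word for post in posts for word in post)
    total = sum(tf.values())
    return {word: count / total for word, count in tf.items()}


# Hypothetical, already-segmented sample sets.
SET_GOOD = [["weather", "nice", "today"], ["cheap", "lunch", "today"]]
SET_BAD = [["cheap", "sale", "buy"], ["sale", "cheap", "discount"]]

# Python dicts play the role of the correspondence hash tables.
good_hashtable = word_probabilities(SET_GOOD)
bad_hashtable = word_probabilities(SET_BAD)
```

A plain dict gives constant-time lookup per word, which is the property the hash tables are used for here.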

4. For any blog post, the vector after segmentation is D = (W₁, W₂, ..., W_n), where n is the number of words after segmentation. Let P(i_bad | W_i) be the probability that the blog post is an advertisement given that word W_i appears in it. Using the correspondence hash tables GoodHashtable and BadHashtable, the value of P(i_bad | W_i) can be calculated according to the Bayesian algorithm. Then, for each word in SET_BAD, that is, W_i ∈ SET_BAD, the following mapping hash table is established and stored:

BadProbabilityHashtable: W_i → P(i_bad | W_i), for W_i ∈ SET_BAD

Here BadProbabilityHashtable is the advertisement blog post filter of this embodiment: for any word W_i in SET_BAD, it gives the probability that a blog post D is an advertisement when W_i appears in D.
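
The per-word calculation is not written out in the text. A minimal sketch using Bayes' rule, P(bad | W) = P(W | bad) P(bad) / (P(W | bad) P(bad) + P(W | good) P(good)), is shown below. The equal class prior of 0.5 is an assumption of this sketch, and the function names and sample tables are hypothetical.

```python
def word_ad_probability(word, good_table, bad_table, prior_bad=0.5):
    """Return P(bad | word) via Bayes' rule, or None if the word is unseen.

    prior_bad is an assumed class prior; the source does not specify one.
    """
    p_w_bad = bad_table.get(word, 0.0)
    p_w_good = good_table.get(word, 0.0)
    numerator = p_w_bad * prior_bad
    denominator = numerator + p_w_good * (1.0 - prior_bad)
    if denominator == 0.0:
        return None  # word appears in neither table
    return numerator / denominator


def build_bad_probability_table(good_table, bad_table, prior_bad=0.5):
    """Build the advertisement filter: one entry per word in the ad set."""
    return {
        word: word_ad_probability(word, good_table, bad_table, prior_bad)
        for word in bad_table
    }


# Hypothetical correspondence tables.
good_table = {"cheap": 0.1, "today": 0.3}
bad_table = {"cheap": 0.3, "sale": 0.2}
bad_probability_table = build_bad_probability_table(good_table, bad_table)
```

In this toy data "cheap" occurs in both sets, so its score sits between 0.5 and 1, while "sale" occurs only in advertisements and scores 1.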

Online part of this embodiment:

For any blog post, the vector after segmentation is D = (W₁, W₂, ..., W_n), where n is the number of words after segmentation. Let the probability that the blog post is an advertisement be P(bad | W₁, W₂, ..., W_n). Using the BadProbabilityHashtable (advertisement blog post filter) obtained in the offline part, this probability can be calculated according to the Bayesian algorithm and the total probability formula. When P(bad | W₁, W₂, ..., W_n) exceeds a threshold θ, the blog post is considered to be an advertisement blog post.
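
The text does not give the combining formula. A common choice in Bayesian text filtering, assumed here, treats the words as independent and uses equal class priors, which gives P = Πp_i / (Πp_i + Π(1 − p_i)) with p_i = P(bad | W_i). The sketch below evaluates this in log space; the function name, the clamping constant `eps` and the sample table are hypothetical.

```python
import math


def post_ad_probability(words, bad_probability_table, eps=1e-6):
    """Combine per-word ad probabilities into one probability for the post.

    Assumes word independence and equal priors; words missing from the
    table are skipped. Returns 0.5 when no word is known.
    """
    log_bad = 0.0
    log_good = 0.0
    for word in words:
        p = bad_probability_table.get(word)
        if p is None:
            continue
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        log_bad += math.log(p)
        log_good += math.log(1.0 - p)
    diff = log_good - log_bad
    if diff > 700:  # exp() would overflow; the probability is effectively 0
        return 0.0
    # Equals prod(p) / (prod(p) + prod(1 - p)).
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical filter and threshold.
bad_probability_table = {"cheap": 0.75, "sale": 0.9, "today": 0.2}
theta = 0.9
score = post_ad_probability(["cheap", "sale", "hello"], bad_probability_table)
is_advertisement = score > theta
```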

To set the threshold θ, the probability that each blog post is an advertisement blog post can be calculated for the accumulated advertisement blog post set SET_BAD and normal blog post set SET_GOOD in the same way as in the online part, and θ can be obtained by observing the resulting statistics. In theory, when the probability P(bad | W₁, W₂, ..., W_n) that a blog post is an advertisement blog post is greater than 0.5, the blog post tends to be an advertisement blog post.

Here 0 ≤ θ ≤ 1. The larger θ is set, the higher the accuracy of the judgment and the lower the recall of advertisement blog posts; conversely, the smaller θ is set, the lower the accuracy and the higher the recall. θ should therefore be set according to the actual situation so as to balance accuracy and recall.
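
The trade-off can be observed by scoring the labelled sets and computing precision and recall for several candidate values of θ. The following is a minimal sketch; the scores, the labels and the function name are hypothetical, and "accuracy" in the text is read here as precision.

```python
def precision_recall(scores, labels, theta):
    """Precision and recall of the rule 'score > theta means advertisement'.

    labels[i] is True when post i is a known advertisement.
    """
    predicted = [score > theta for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical ad probabilities for five labelled posts.
scores = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = [True, True, False, True, False]

for theta in (0.3, 0.5, 0.7):
    print(theta, precision_recall(scores, labels, theta))
```

On this toy data, raising θ from 0.3 to 0.7 raises precision and lowers recall, which is the behaviour described above.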

In addition, if a non-advertisement blog post filter is created in the offline part, the specific implementation is as follows:

For any blog post, the vector after segmentation is D = (W₁, W₂, ..., W_n), where n is the number of words after segmentation. Let P(i_good | W_i) be the probability that the blog post is a normal blog post given that word W_i appears in it. Using the GoodHashtable and BadHashtable obtained in step 3 of the offline part, the value of P(i_good | W_i) can be calculated according to the Bayesian algorithm. Then, for each word in SET_GOOD, that is, W_i ∈ SET_GOOD, the following mapping hash table is established and stored:

GoodProbabilityHashtable: W_i → P(i_good | W_i), for W_i ∈ SET_GOOD

Here GoodProbabilityHashtable is the non-advertisement blog post filter of this embodiment: for any word W_i in SET_GOOD, it gives the probability that a blog post D is a normal blog post when W_i appears in D.

In addition, this embodiment may also adopt the following optimization strategies:

In both the offline statistics and the online prediction, when a blog post is segmented, unrepresentative words (such as stop words) are removed; or only representative words (such as nouns and verbs) are kept; or the two approaches are combined.
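
Stop-word removal can be sketched as a filter applied to the segmented word list before any counting or scoring. The stop-word list and the function name below are hypothetical; a real system would use a list suited to the language of the posts.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = frozenset({"the", "a", "of", "is", "and"})


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop unrepresentative words from a segmented post, keeping order."""
    return [word for word in words if word not in stop_words]


filtered = remove_stop_words(["the", "sale", "is", "cheap"])
```

Applying the same filter offline and online keeps the filter vocabulary and the scored words consistent.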

In addition, the following consideration may be added in the online part. For any blog post, the vector after segmentation is D = (W₁, W₂, ..., W_n), where n is the number of words after segmentation. When calculating the probability P(bad | W₁, W₂, ..., W_n) that the blog post is an advertisement blog post, if some word W_i appears neither in BadProbabilityHashtable nor in GoodProbabilityHashtable, the current filter has no ability to recognize that word. The effect of the word on the result can therefore be ignored, which reduces misjudgment.
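
This consideration amounts to dropping unknown words before the post is scored. A minimal sketch, with a hypothetical function name and hypothetical tables:

```python
def known_words(words, bad_probability_table, good_probability_table):
    """Keep only the words that at least one of the two filters has seen."""
    return [
        word
        for word in words
        if word in bad_probability_table or word in good_probability_table
    ]


# Hypothetical filters.
bad_probability_table = {"sale": 0.9, "cheap": 0.75}
good_probability_table = {"weather": 0.95, "cheap": 0.25}
kept = known_words(
    ["cheap", "zzz", "weather"], bad_probability_table, good_probability_table
)
```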

Further, when a blog post is segmented, the segmentation sequence may also be changed to N-grams. Some words occur frequently in both advertisement blog posts and normal blog posts and may be split into single characters during segmentation, such as "grab" and "only". On their own, these words do not discriminate well between advertisement and normal blog posts, but once combined with their context they discriminate well: a blog post containing "crazy grab" or "only sells", for example, is very likely to be an advertisement. Therefore, in both the offline part and the online part, the single words obtained by segmentation are combined with their context into 2-grams or higher-order N-grams before the subsequent calculations are performed. This can reduce the scale of the advertisement blog post filter and improve the accuracy of discrimination.
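
The context combination can be sketched as a sliding window over the segmented words. Tuples are used for the combined tokens so that the sketch does not depend on how a given language joins words; the function name is hypothetical.

```python
def ngrams(words, n=2):
    """Return the consecutive n-word combinations of a segmented post."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams(["crazy", "grab", "now"])
```

The resulting tuples can be counted and scored in place of single words, since tuples are hashable and can serve as hash table keys.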

In addition, when the microblog filter is used to recognize the current microblog post, the judgment may be combined with certain rules.

Specifically, although the above scheme achieves high accuracy (above 90%) and recall (above 90%) in recognizing advertisement blog posts, some content-rich advertisements may be let through in order to reduce the harm caused by misjudgment. To mitigate the possible harm of judging a normal post to be an advertisement, certain rules can be applied. For example, a microblog containing a video may be regarded as a non-advertisement blog post. Likewise, a blog post that the advertisement blog post filter recognizes as an advertisement but that contains no obvious advertising words may be regarded as a weak advertisement, and such a post can be recalled normally despite its advertising character.
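
The two example rules can be layered on top of the filter output as follows. This is a minimal sketch; the function name, the label strings and the boolean inputs `has_video` and `has_obvious_ad_words` are hypothetical, and how those inputs are derived is outside the sketch.

```python
def final_label(ad_probability, theta, has_video, has_obvious_ad_words):
    """Combine the filter score with the two example rules from the text."""
    if has_video:
        return "normal"  # rule 1: posts with video are treated as non-ads
    if ad_probability > theta:
        if has_obvious_ad_words:
            return "advertisement"
        return "weak advertisement"  # rule 2: may still be recalled normally
    return "normal"


label = final_label(0.95, theta=0.9, has_video=False, has_obvious_ad_words=False)
```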

Through the above scheme, this embodiment can effectively recognize whether a microblog post is an advertisement blog post. In microblog full-text search, advertisement posts are recalled according to a given strategy (not recalled, or selectively recalled), which improves the valid-data recall rate of the search engine and the user's search experience.

Compared with the prior art, this embodiment has the following advantages:

1. Microblog advertisement blog posts are recognized on the basis of Bayesian theory: a microblog filter is obtained using known advertisement blog posts and non-advertisement blog posts as samples and is used to estimate the probability that a newly arrived blog post is an advertisement blog post. Unlike prior-art recognition techniques based on cheating vocabulary, the present invention computes statistics over a large amount of data and learns from the given data sets to obtain the difference between advertisement blog posts and non-advertisement blog posts. This difference is expressed as a probability and can be applied automatically to later detection, so maintenance is automated and recall is greatly improved.

2. The entire content of a blog post is analyzed, not just certain keywords in it. For example, a blog post containing words such as "cheap" or "sell" is not necessarily an advertisement blog post, and prior-art keyword filtering clearly struggles to achieve a good result in such cases. The method of the present invention considers both the probability of these words appearing in advertisement blog posts and their probability in normal blog posts. By weighing these factors together it can strike a balance between "good" and "bad", and its accuracy is substantially better than that of static, all-or-nothing filtering.

3. The microblog filter created in this embodiment is hard to deceive. Skilled advertisers can bypass ordinary content checks by reducing advertising vocabulary (such as "cheap" and "price") and adding "good" vocabulary (such as news terms and trending words) to a blog post. However, because the advertisement blog post filter has a personalized character, bypassing it would require studying the writing preferences of every microblogger, which is hardly feasible. In addition, by learning from a "special" training set of blog posts, the Bayesian recognition method can obtain a "special" filter, so a particular class of advertisement blog posts can be recognized effectively.

As shown in Fig. 9, a second embodiment of the present invention proposes a method for recognizing microblog advertisement blog posts which, on the basis of the first embodiment, further includes after step S102:

Step S103: learn again from the recognized advertisement blog posts and non-advertisement blog posts, and update the microblog filter.

This embodiment differs from the first embodiment in that it also learns again from the recognized advertisement blog posts and non-advertisement blog posts and periodically updates the microblog filter.

Specifically, at regular intervals the offline part is repeated on the advertisement blog posts and normal blog posts recognized by the online part, a new BadProbabilityHashtable (advertisement blog post filter) and a new GoodProbabilityHashtable (non-advertisement blog post filter) are trained, and the online part is then updated with them.
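
One update cycle can be sketched as appending the newly recognized posts to the sample sets and rerunning the offline training. The sketch below assumes relative-frequency word probabilities and equal class priors, neither of which is specified in the text; all function names and sample data are hypothetical, and scheduling of the periodic job is left out.

```python
from collections import Counter


def word_probabilities(posts):
    """Relative frequency of each word over a set of segmented posts."""
    tf = Counter(word for post in posts for word in post)
    total = sum(tf.values())
    return {word: count / total for word, count in tf.items()}


def train_ad_filter(good_posts, bad_posts):
    """Offline part: build the advertisement filter (equal priors assumed)."""
    good = word_probabilities(good_posts)
    bad = word_probabilities(bad_posts)
    return {word: p / (p + good.get(word, 0.0)) for word, p in bad.items()}


def retrain(good_posts, bad_posts, new_good, new_bad):
    """One update cycle: extend both sample sets, then rebuild the filter."""
    good_posts = good_posts + new_good
    bad_posts = bad_posts + new_bad
    return good_posts, bad_posts, train_ad_filter(good_posts, bad_posts)


# Hypothetical initial samples and newly recognized posts.
good_posts = [["nice", "day"]]
bad_posts = [["cheap", "sale"]]
old_filter = train_ad_filter(good_posts, bad_posts)
good_posts, bad_posts, new_filter = retrain(
    good_posts, bad_posts, new_good=[["cheap", "lunch"]], new_bad=[["flash", "sale"]]
)
```

In this toy run the score of "cheap" drops after retraining because the word has now also been seen in a normal post.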

The Bayesian microblog advertisement blog post recognition method of this embodiment is adaptive: by learning from new advertisement blog post and normal blog post samples and continuously retraining, the microblog filter keeps updating itself. When new blog posts arrive, the newly obtained advertisement blog post filter or non-advertisement blog post filter can be used against the latest advertisement blog posts, which makes advertisement recognition more effective for microblogs, a medium with strong real-time characteristics.

As shown in Fig. 10, a first embodiment of the present invention proposes a device for recognizing microblog advertisement blog posts, comprising a creation module 201 and an identification module 202, wherein:

the creation module 201 is configured to create a microblog filter using known advertisement blog posts and non-advertisement blog posts as samples;

the identification module 202 is configured to perform advertisement recognition on a current microblog post based on the microblog filter and a Bayesian algorithm.

To determine whether a microblog is an advertisement microblog or a normal microblog (in this embodiment, a non-advertisement microblog), the creation module 201 first creates a microblog filter from known advertisement blog posts and non-advertisement blog posts, and the identification module 202 then uses the created filter to perform advertisement recognition on the current microblog.

Specifically, the current microblog post is first segmented into words and converted into a vector. The vector is then input into the created microblog filter, and the probability that the current microblog post is an advertisement blog post is calculated using the Bayesian algorithm together with the total probability formula.

More specifically, as shown in Fig. 11, taking the creation of an advertisement blog post filter as an example, the creation module 201 may include a collection unit 2011, a segmentation unit 2012, a first calculation unit 2013, a first establishing unit 2014 and a second establishing unit 2015, wherein:

the collection unit 2011 is configured to collect a number of known advertisement blog posts and non-advertisement blog posts to form an advertisement blog post set and a non-advertisement blog post set, which serve as samples;

the segmentation unit 2012 is configured to segment each blog post in the advertisement blog post set and the non-advertisement blog post set to obtain the word sequence of each blog post;

the first calculation unit 2013 is configured to calculate the probability with which each word in the advertisement blog post set occurs in that set, and to calculate the probability with which each word in the non-advertisement blog post set occurs in that set;

the first establishing unit 2014 is configured to build, from the calculated probabilities, correspondence hash tables that map each word in the advertisement blog post set and in the non-advertisement blog post set to the probability of that word occurring in the respective set;

the second establishing unit 2015 is configured to establish, based on the correspondence hash tables and according to the Bayesian algorithm, a mapping hash table between each word in the advertisement blog post set and the probability that a blog post is an advertisement given that the word appears in it, thereby obtaining the advertisement blog post filter.

When a non-advertisement blog post filter is created, the second establishing unit 2015 is further configured to:

establish, based on the correspondence hash tables and according to the Bayesian algorithm, a mapping hash table between each word in the non-advertisement blog post set and the probability that a blog post is a non-advertisement blog post given that the word appears in it, thereby obtaining the non-advertisement blog post filter.

As shown in Fig. 12, the identification module 202 may include a segmentation and conversion unit 2021, a second calculation unit 2022 and a judging unit 2023, wherein:

the segmentation and conversion unit 2021 is configured to segment the current microblog post and convert it into a vector;

the second calculation unit 2022 is configured to input the resulting vector into the microblog filter and to calculate, using the Bayesian algorithm together with the total probability formula, the probability that the current microblog post is an advertisement blog post;

the judging unit 2023 is configured to determine that the microblog post is an advertisement blog post if the probability that it is an advertisement blog post exceeds the predetermined threshold.

2. For the normal blog article set and the advertisement blog article set, the word sequence obtained by segmenting any blog article D can be represented as a vector, D = (W₁, W₂, ..., W_n), where n is the number of words after segmentation; SET_GOOD and SET_BAD can therefore each be regarded as a combination of individual words. Any word W_i in SET_GOOD is written W_i ∈ SET_GOOD, and the probability that a word picked at random from SET_GOOD is W_i (that is, the probability that W_i occurs in SET_GOOD) is denoted P_{i_good}. P_{i_good} is given by either of two formulas (not reproduced in this text), in which TF(W_i) is the term frequency of word W_i and N is the number of distinct words in SET_GOOD. P_{i_bad} (the probability that W_i occurs in SET_BAD) is calculated in the same way.
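
Because the formulas for P_{i_good} and P_{i_bad} are not reproduced here, the following sketch assumes the simplest estimate consistent with the definitions above: the term frequency of W_i divided by the summed term frequency of all N distinct words in the set. That estimate, and the names `word_probabilities` and `tokenized_posts`, are assumptions made for illustration.

```python
from collections import Counter


def word_probabilities(tokenized_posts):
    """Map each distinct word to TF(word) / sum of TF over all distinct words.

    tokenized_posts: an iterable of word lists, one list per blog article,
    i.e. the output of the word segmentation step for one sample set.
    """
    term_freq = Counter(word for post in tokenized_posts for word in post)
    total = sum(term_freq.values())
    return {word: count / total for word, count in term_freq.items()}


# Toy stand-in for SET_BAD after word segmentation.
set_bad = [["free", "gift", "free"], ["gift", "sale"]]
p_bad = word_probabilities(set_bad)
```

The resulting dictionary plays the role of the correspondence hash table for one sample set; running the same function on the normal set yields the other table.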

BadProbabilityHashtable is the advertisement blog article filter referred to in this embodiment: for any word W_i in SET_BAD, it gives the probability that a blog article D is an advertisement when W_i appears in D.

The online scheme of this embodiment is as follows:

For any blog article, the vector after word segmentation is D = (W₁, W₂, ..., W_n), where n is the number of words after segmentation, and the probability that the blog article is an advertisement is written P(bad | W₁, W₂, ..., W_n). Using the BadProbabilityHashtable (advertisement blog article filter) obtained by the offline scheme above, this probability can be calculated according to the Bayesian algorithm and the total probability formula. When P(bad | W₁, W₂, ..., W_n) exceeds a threshold θ, the blog article is considered an advertisement blog article.

In addition, the online scheme can take the following into account. For any blog article with segmented vector D = (W₁, W₂, ..., W_n), where n is the number of words after segmentation, if some word W_i appears in neither BadProbabilityHashtable nor GoodProbabilityHashtable when the probability P(bad | W₁, W₂, ..., W_n) is calculated, the current filter has no ability to judge that word. The contribution of that word to the result can therefore be ignored, which reduces misjudgment.
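
The online calculation can be sketched as below. The text does not spell out how the per-word probabilities are combined, so this sketch assumes the common naive Bayes combination P = Πp_i / (Πp_i + Π(1 − p_i)), evaluated in log space. It also assumes that a word found only in the non-advertisement filter contributes 1 − P(good | W_i). The function names, the `eps` clamp and the default threshold of 0.9 are hypothetical.

```python
import math


def ad_probability(words, bad_filter, good_filter, eps=1e-6):
    """Combine per-word probabilities into P(bad | W1..Wn), or None."""
    log_bad = log_good = 0.0
    used = 0
    for word in words:
        if word not in bad_filter and word not in good_filter:
            continue  # neither filter knows the word, so it is ignored
        if word in bad_filter:
            p = bad_filter[word]
        else:
            p = 1.0 - good_filter[word]  # assumption: complement of P(good | word)
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        log_bad += math.log(p)
        log_good += math.log(1.0 - p)
        used += 1
    if used == 0:
        return None  # no known word, so no judgement is possible
    # Clamp the exponent so math.exp cannot overflow on long posts.
    diff = min(max(log_good - log_bad, -700.0), 700.0)
    return 1.0 / (1.0 + math.exp(diff))


def is_advertisement(words, bad_filter, good_filter, threshold=0.9):
    """Apply the threshold test; posts with no known words are not flagged."""
    p = ad_probability(words, bad_filter, good_filter)
    return p is not None and p > threshold


bad_filter = {"free": 0.9, "sale": 0.8}
good_filter = {"lunch": 0.9}
score = ad_probability(["free", "sale", "unseen"], bad_filter, good_filter)
```

Skipping unknown words, rather than assigning them a default probability, mirrors the consideration described above: a word the filters have never seen should not move the result in either direction.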

Compared with the prior art, this embodiment has the following advantages:

As shown in Figure 13, the second embodiment of the present invention provides a microblogging advertisement blog article identification device, which further includes:

an update module 203, configured to re-learn according to the identified advertisement blog articles and non-advertisement blog articles and to update the microblogging filter.

Specifically, at regular intervals, the offline scheme is repeated on the advertisement blog articles and normal blog articles identified by the online scheme to train a new BadProbabilityHashtable (advertisement blog article filter) and a new GoodProbabilityHashtable (non-advertisement blog article filter), which are then pushed to the online part.
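
One way to organize this update cycle is sketched below. It is an illustration under stated assumptions: the class `FilterUpdater`, its methods, and the `train` callable (standing in for the whole offline scheme) are hypothetical, and the scheduling of the periodic refresh is left to the caller.

```python
class FilterUpdater:
    """Collects online identification results and retrains the filters on demand."""

    def __init__(self, train, bad_posts, good_posts):
        # train(bad_posts, good_posts) is assumed to run the offline scheme
        # and return the new filters.
        self._train = train
        self._bad = list(bad_posts)
        self._good = list(good_posts)
        self.filters = train(self._bad, self._good)

    def record(self, words, is_ad):
        """Store a blog article the online part has just identified."""
        (self._bad if is_ad else self._good).append(words)

    def refresh(self):
        """Repeat the offline scheme and replace the filters used online."""
        self.filters = self._train(self._bad, self._good)
        return self.filters


# Dummy trainer that only reports sample counts, to show the flow.
updater = FilterUpdater(lambda bad, good: (len(bad), len(good)), [["free"]], [["lunch"]])
updater.record(["sale", "gift"], is_ad=True)
before = updater.filters
after = updater.refresh()
```

The filters stay unchanged between refreshes, so the online part always works with a consistent pair of tables.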

The above is only a preferred embodiment of the present invention and does not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims

1. A microblogging advertisement blog article identification method, characterized by comprising:

creating a microblogging filter using known advertisement blog articles and non-advertisement blog articles as samples;

performing advertisement identification on a current microblogging blog article based on the microblogging filter and a Bayesian algorithm;

wherein, after the advertisement identification is performed on the current microblogging blog article based on the microblogging filter and the Bayesian algorithm, the method further comprises: re-learning according to the identified advertisement blog articles and non-advertisement blog articles, and updating the microblogging filter;

wherein the step of creating a microblogging filter using known advertisement blog articles and non-advertisement blog articles as samples comprises: collecting a number of known advertisement blog articles and non-advertisement blog articles to form an advertisement blog article set and a non-advertisement blog article set, respectively, as samples; performing word segmentation on each blog article in the advertisement blog article set and the non-advertisement blog article set to obtain a word sequence of each blog article; calculating the probability that each word in the advertisement blog article set occurs in the advertisement blog article set; calculating the probability that each word in the non-advertisement blog article set occurs in the non-advertisement blog article set; building, according to the calculated probabilities and for each word in the advertisement blog article set and the non-advertisement blog article set, a correspondence hash table between the word and the probability that the word occurs in the advertisement blog article set, or a correspondence hash table between the word and the probability that the word occurs in the non-advertisement blog article set; and, based on the correspondence hash table of the probability that the word occurs in the advertisement blog article set and the correspondence hash table of the probability that the word occurs in the non-advertisement blog article set, building, according to the Bayesian algorithm, a mapping hash table between each word in the advertisement blog article set and the probability that a blog article is an advertisement blog article given that the word occurs, thereby obtaining an advertisement blog article filter; or

the step of creating a microblogging filter using known advertisement blog articles and non-advertisement blog articles as samples comprises: collecting a number of known advertisement blog articles and non-advertisement blog articles to form an advertisement blog article set and a non-advertisement blog article set, respectively, as samples; performing word segmentation on each blog article in the advertisement blog article set and the non-advertisement blog article set to obtain a word sequence of each blog article; calculating the probability that each word in the advertisement blog article set occurs in the advertisement blog article set; calculating the probability that each word in the non-advertisement blog article set occurs in the non-advertisement blog article set; building, according to the calculated probabilities and for each word in the advertisement blog article set and the non-advertisement blog article set, a correspondence hash table between the word and the probability that the word occurs in the advertisement blog article set, or a correspondence hash table between the word and the probability that the word occurs in the non-advertisement blog article set; and, based on the correspondence hash table of the probability that the word occurs in the advertisement blog article set and the correspondence hash table of the probability that the word occurs in the non-advertisement blog article set, building, according to the Bayesian algorithm, a mapping hash table between each word in the non-advertisement blog article set and the probability that a blog article is a non-advertisement blog article given that the word occurs, thereby obtaining a non-advertisement blog article filter.

2. The method according to claim 1, characterized in that the step of performing advertisement identification on the current microblogging blog article based on the microblogging filter and the Bayesian algorithm comprises:

performing word segmentation and vector conversion on the current microblogging blog article;

inputting the converted vector into the microblogging filter, and calculating, using the Bayesian algorithm and the total probability formula, the probability that the current microblogging blog article is an advertisement blog article;

if the probability that the current microblogging blog article is an advertisement blog article exceeds a predetermined threshold, determining that the microblogging blog article is an advertisement blog article.

3. The method according to claim 2, characterized in that the step of setting the predetermined threshold comprises:

obtaining the predetermined threshold by statistical calculation over the known advertisement blog article set and non-advertisement blog article set.

4. The method according to claim 1, characterized by further comprising:

when performing word segmentation on the corresponding blog article, removing words that do not meet a predetermined condition and/or selecting specific words.

5. The method according to claim 4, characterized by further comprising:

after performing word segmentation on the corresponding blog article, performing multi-element context combination on the words obtained by the segmentation.

6. The method according to claim 2, characterized in that the step of performing advertisement identification on the current microblogging blog article based on the microblogging filter and the Bayesian algorithm further comprises:

when calculating the probability that the current microblogging blog article is an advertisement blog article, if a word in the current microblogging blog article does not appear in the microblogging filter, ignoring that word in the calculation.

7. The method according to claim 2, characterized in that the step of performing advertisement identification on the current microblogging blog article based on the microblogging filter and the Bayesian algorithm further comprises:

identifying, in combination with a predetermined rule, whether the current microblogging blog article is an advertisement blog article.

8. A microblogging advertisement blog article identification device, characterized by comprising:

a creation module, configured to create a microblogging filter using known advertisement blog articles and non-advertisement blog articles as samples;

an identification module, configured to perform advertisement identification on a current microblogging blog article based on the microblogging filter and a Bayesian algorithm;

wherein the device further comprises: an update module, configured to re-learn according to the identified advertisement blog articles and non-advertisement blog articles and to update the microblogging filter;

wherein the creation module comprises: a collector unit, configured to collect a number of known advertisement blog articles and non-advertisement blog articles to form an advertisement blog article set and a non-advertisement blog article set, respectively, as samples; a word segmentation unit, configured to perform word segmentation on each blog article in the advertisement blog article set and the non-advertisement blog article set to obtain a word sequence of each blog article; a first computing unit, configured to calculate the probability that each word in the advertisement blog article set occurs in the advertisement blog article set, and to calculate the probability that each word in the non-advertisement blog article set occurs in the non-advertisement blog article set; a first establishing unit, configured to build, according to the calculated probabilities and for each word in the advertisement blog article set and the non-advertisement blog article set, a correspondence hash table between the word and the probability that the word occurs in the advertisement blog article set, or a correspondence hash table between the word and the probability that the word occurs in the non-advertisement blog article set; and a second establishing unit, configured to build, based on the correspondence hash table of the probability that the word occurs in the advertisement blog article set and the correspondence hash table of the probability that the word occurs in the non-advertisement blog article set, and according to the Bayesian algorithm, a mapping hash table between each word in the advertisement blog article set and the probability that a blog article is an advertisement blog article given that the word occurs, thereby obtaining an advertisement blog article filter.

9. The device according to claim 8, characterized in that, when a non-advertisement blog article filter is created, the second establishing unit is further configured to:

build, based on the correspondence hash table of the probability that the word occurs in the advertisement blog article set and the correspondence hash table of the probability that the word occurs in the non-advertisement blog article set, and according to the Bayesian algorithm, a mapping hash table between each word in the non-advertisement blog article set and the probability that a blog article is a non-advertisement blog article given that the word occurs, thereby obtaining a non-advertisement blog article filter.

10. The device according to claim 8, characterized in that the identification module comprises:

a word segmentation and conversion unit, configured to perform word segmentation and vector conversion on the current microblogging blog article;

a second computing unit, configured to input the converted vector into the microblogging filter and to calculate, using the Bayesian algorithm and the total probability formula, the probability that the current microblogging blog article is an advertisement blog article;

a judging unit, configured to determine that the microblogging blog article is an advertisement blog article if the probability that the current microblogging blog article is an advertisement blog article exceeds a predetermined threshold.

11. The device according to claim 8 or 9, characterized in that the word segmentation unit is further configured to: when performing word segmentation on the corresponding blog article, remove words that do not meet a predetermined condition and/or select specific words; and/or, after performing word segmentation on the corresponding blog article, perform multi-element context combination on the words obtained by the segmentation.

12. The device according to claim 10, characterized in that the word segmentation and conversion unit is further configured to: when performing word segmentation on the current microblogging blog article, remove words that do not meet a predetermined condition and/or select specific words; and/or, after performing word segmentation on the current microblogging blog article, perform multi-element context combination on the words obtained by the segmentation.

13. The device according to claim 10, characterized in that the second computing unit is further configured to: when calculating the probability that the current microblogging blog article is an advertisement blog article, ignore a word in the calculation if that word in the current microblogging blog article does not appear in the microblogging filter.

14. The device according to claim 10, characterized in that the judging unit is further configured to identify, in combination with a predetermined rule, whether the current microblogging blog article is an advertisement blog article.