CN106372057A - Content auditing method and apparatus - Google Patents
- Publication number
- CN106372057A CN106372057A CN201610727794.0A CN201610727794A CN106372057A CN 106372057 A CN106372057 A CN 106372057A CN 201610727794 A CN201610727794 A CN 201610727794A CN 106372057 A CN106372057 A CN 106372057A
- Authority
- CN
- China
- Prior art keywords
- content
- verification
- examination
- described content
- parameter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a content auditing method and apparatus. According to embodiments of the method and apparatus, content to be submitted by a user is obtained and checked for preset sensitive words. If the content contains no preset sensitive word, it is processed with a classification model to obtain a spam-degree parameter of the content, and whether the content passes the audit is determined from that parameter. The auditing process requires no human involvement, is simple to operate, and is highly accurate, improving the efficiency and reliability of content auditing.
Description
Technical field
The present invention relates to Internet technology, and more particularly to a content auditing method and apparatus.
Background
With the development of communication technology, terminals integrate more and more functions, so that the function list of a terminal contains more and more applications (apps). Most applications require the user to submit personal data, for example the user's nickname, the user's contact information, or content the user intends to publish.
Generally, such data must pass a manual audit after submission, and only data that passes the audit can be accepted as the user's personal data. This approach is cumbersome and error-prone, which reduces the efficiency and reliability of content auditing.
Summary of the invention
Aspects of the present invention provide a content auditing method and apparatus, so as to improve the efficiency and reliability of content auditing.
One aspect of the present invention provides a content auditing method, comprising:
obtaining content to be submitted by a user, and detecting whether the content contains a preset sensitive word;
if the content does not contain a preset sensitive word, processing the content with a classification model to obtain a spam-degree parameter of the content; and
determining, according to the spam-degree parameter, whether the content passes the audit.
In the aspect above and any possible implementation thereof, an implementation is further provided in which, after the content to be audited is obtained, the method further comprises:
if the content contains a preset sensitive word, reporting the content so that it undergoes manual auditing, which determines whether the content passes the audit.
In the aspect above and any possible implementation thereof, an implementation is further provided in which determining, according to the spam-degree parameter, whether the content passes the audit comprises:
if the spam-degree parameter is less than or equal to a preset parameter threshold, determining that the content passes the audit; and
if the spam-degree parameter is greater than the parameter threshold, determining that the content does not pass the audit.
In the aspect above and any possible implementation thereof, an implementation is further provided in which determining, according to the spam-degree parameter, whether the content passes the audit comprises:
if the spam-degree parameter is less than or equal to a preset parameter threshold, determining that the content passes the audit;
if the spam-degree parameter is greater than the parameter threshold and the user is in a blacklist, determining that the content does not pass the audit; and
if the spam-degree parameter is greater than the parameter threshold and the user is not in the blacklist, reporting the content so that it undergoes manual auditing, which determines whether the content passes the audit.
In the aspect above and any possible implementation thereof, an implementation is further provided in which, after the content whose spam-degree parameter exceeds the parameter threshold and whose user is not in the blacklist has been reported for manual auditing, the method further comprises:
taking the content and its manual audit result as a training sample; and
updating the classification model with the training sample.
Another aspect of the present invention provides a content auditing apparatus, comprising:
an acquiring unit, configured to obtain content to be submitted by a user and detect whether the content contains a preset sensitive word;
a classification unit, configured to, if the content does not contain a preset sensitive word, process the content with a classification model to obtain a spam-degree parameter of the content; and
an analysis unit, configured to determine, according to the spam-degree parameter, whether the content passes the audit.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the classification unit is further configured to, if the content contains a preset sensitive word, report the content so that it undergoes manual auditing, which determines whether the content passes the audit.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the analysis unit is specifically configured to:
determine that the content passes the audit if the spam-degree parameter is less than or equal to a preset parameter threshold; and
determine that the content does not pass the audit if the spam-degree parameter is greater than the parameter threshold.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the analysis unit is specifically configured to:
determine that the content passes the audit if the spam-degree parameter is less than or equal to a preset parameter threshold;
determine that the content does not pass the audit if the spam-degree parameter is greater than the parameter threshold and the user is in a blacklist; and
report the content for manual auditing, which determines whether the content passes the audit, if the spam-degree parameter is greater than the parameter threshold and the user is not in the blacklist.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the analysis unit is further configured to:
take the content and its manual audit result as a training sample; and
update the classification model with the training sample.
As the technical solutions above show, embodiments of the present invention obtain content to be submitted by a user and detect whether the content contains a preset sensitive word; if it does not, the content is processed with a classification model to obtain its spam-degree parameter, so that whether the content passes the audit can be determined from that parameter. No human involvement is needed in the auditing process, the operation is simple, and the accuracy is high, which improves the efficiency and reliability of content auditing.
In addition, the technical solutions provided by the present invention can significantly improve the user experience.
The description above is only an overview of the technical solutions of the present invention. So that the technical means of the present invention can be understood more clearly and practiced according to this specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set forth below.
Brief description of the drawings
By reading the detailed description of the preferred embodiments below, various other advantages and benefits will become clear to those of ordinary skill in the art. The accompanying drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, identical parts are denoted by identical reference numerals. In the drawings:
Fig. 1 is a schematic flowchart of a content auditing method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a content auditing apparatus provided by another embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described more fully below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood thoroughly and its scope conveyed fully to those skilled in the art.
It should be noted that the user terminal devices involved in embodiments of the present invention may include, but are not limited to, mobile phones, personal digital assistants (PDAs), wireless handheld devices, tablet computers, personal computers (PCs), MP3 players, MP4 players, and wearable devices (for example, smart glasses, smart watches, smart wristbands, etc.).
In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "a and/or b" may mean: a alone, both a and b, or b alone. Furthermore, the character "/" herein generally indicates an "or" relationship between the preceding and following objects.
Fig. 1 is a schematic flowchart of a content auditing method provided by an embodiment of the present invention, as shown in Fig. 1.
101. Obtain content to be submitted by a user, and detect whether the content contains a preset sensitive word.
102. If the content does not contain a preset sensitive word, process the content with a classification model to obtain a spam-degree parameter of the content.
The so-called spam-degree parameter is an evaluation metric describing how authentic and valid the content is.
It should be noted that before 102, an existing training method can be used to build the classification model. The training samples in the training sample set may all be known (labeled) samples, in which case these known samples can be used directly for training to build the classification model. Alternatively, one part may be known (labeled) samples and another part unknown (unlabeled) samples; in that case, training is first performed with the known samples to build a preliminary classification model, the preliminary model is then used to predict the unknown samples to obtain classification results, the unknown samples are labeled according to these results to form new known samples, and training is repeated with the newly added known samples together with the original known samples to build a new classification model. This continues until the classification model or the known samples meet a cutoff condition, for example the classification accuracy reaching a preset accuracy threshold, or the number of known samples reaching a preset quantity threshold; this embodiment imposes no particular limitation here.
103. Determine, according to the spam-degree parameter, whether the content passes the audit.
It should be noted that the execution body of 101-103 may be, in part or in whole, an application located in a local terminal, or a functional unit such as a plug-in or software development kit (SDK) arranged in an application in the local terminal, or a processing engine in a network-side server, or a distributed system located on the network side; this embodiment imposes no particular limitation here.
It can be understood that the application may be a native program (native app) installed on the terminal, or a web page program (web app) of a browser on the terminal; this embodiment imposes no particular limitation here.
In this way, by obtaining content to be submitted by a user and detecting whether it contains a preset sensitive word, and, if it does not, processing the content with a classification model to obtain its spam-degree parameter, whether the content passes the audit can be determined from that parameter without human involvement in the review process. The operation is simple and the accuracy high, which improves the efficiency and reliability of content auditing.
In the present invention, because the proportion of junk content in user data such as basic user information and content to be published is small, content of very high safety can be audited automatically through keyword filtering (sensitive-word filtering) and the classification model. This greatly reduces the amount of content requiring manual auditing, saves a large amount of manual auditing time and manpower, and effectively improves the efficiency of content auditing.
Optionally, in a possible implementation of this embodiment, a sensitive-word list can be preset, in which a number of sensitive words are maintained. After obtaining the content to be submitted by the user, detecting whether the content contains a preset sensitive word may specifically consist of judging whether the content contains any sensitive word from the sensitive-word list.
If the content does not contain a sensitive word, 102 can be executed: the content is processed with the classification model to obtain its spam-degree parameter.
If the content contains a sensitive word, the content can be reported so that it undergoes manual auditing, which determines whether the content passes the audit. Specifically, any manual auditing rules and standards from the prior art may be used; this embodiment imposes no particular limitation here.
After the content has undergone manual auditing, the content and its manual audit result can further be taken as a training sample, and the training sample then used to update the classification model.
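Feeding a manual audit verdict back as a training sample, as described above, might look roughly like this; the class name, the label strings and the placeholder retrain step are all illustrative assumptions.

```python
# Sketch of the feedback loop: each (content, manual verdict) pair
# becomes a new training sample, and the model is then rebuilt from the
# accumulated samples. The retrain step is only a placeholder here.

class ClassificationModel:
    def __init__(self):
        self.training_samples = []

    def add_manual_result(self, content, passed_audit):
        # Record the manual audit outcome as a labeled training sample.
        self.training_samples.append((content, "ok" if passed_audit else "spam"))

    def retrain(self):
        # Placeholder: rebuild the classifier from self.training_samples.
        return len(self.training_samples)
```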
Optionally, in a possible implementation of this embodiment, the classification model used in 102 may include, but is not limited to, a word-segmentation module, a Bayes classification module and a training dictionary module; this embodiment imposes no particular limitation here. In 102, the following operations can specifically be executed.
First, in the word-segmentation module, word segmentation is performed on the content to obtain segmentation results. After the content is segmented, in order to improve the efficiency of subsequent processing and reduce noise, each obtained word is filtered. The filtering includes, but is not limited to, removing words contained in a preset stop-word list. The stop-word list is compiled in advance, based on word-frequency statistics, from function words, auxiliary words, pronouns, articles, adverbs, modal particles and the like, which generally carry no independent meaning. It can be obtained by collecting words whose frequency of occurrence in existing resources reaches a preset high-frequency condition; for example, a common auxiliary word occurs very frequently but generally has very low expressive power, and is therefore collected into the stop-word list.
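The segmentation-and-filtering step can be sketched as follows. Real Chinese text needs a proper segmenter; whitespace tokenisation and an English stop-word list stand in here purely for illustration.

```python
# Sketch of the word-segmentation and stop-word filtering step. The
# stop-word list (function words, auxiliaries, etc.) is a placeholder;
# whitespace splitting stands in for a real segmenter.

STOP_WORDS = {"the", "a", "of", "is"}  # placeholder stop-word list

def segment_and_filter(content, stop_words=STOP_WORDS):
    """Tokenise the content and drop stop words to reduce noise."""
    return [w for w in content.lower().split() if w not in stop_words]
```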
Second, in the Bayes classification module, the category attribute of each segmented word can be obtained using the words stored in the training dictionary module and their category attributes. A junk score for each segmented word is then obtained from its category attribute, and from the junk scores of the segmented words the spam-degree parameter of the content can be obtained. The spam-degree parameter can be computed by any of several prior-art methods; for details, refer to the relevant prior art, which is not repeated here.
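A minimal sketch of the scoring step, assuming the training dictionary stores a per-word junk score and the content's spam-degree parameter is their average; since the patent defers the exact computation to known prior-art methods, both the dictionary entries and the averaging rule are illustrative assumptions.

```python
# Sketch of the Bayes scoring step: each token from the segmentation
# result is looked up in a training dictionary of per-word junk scores,
# and the scores are averaged into the content's spam-degree parameter.
# Unknown words get a neutral default score.

TRAINING_DICT = {"buy": 0.9, "now": 0.7, "hello": 0.1}  # word -> junk score

def spam_degree(tokens, training_dict=TRAINING_DICT, default=0.5):
    """Average per-word junk scores into a single spam-degree parameter."""
    if not tokens:
        return default
    return sum(training_dict.get(t, default) for t in tokens) / len(tokens)
```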
Optionally, in a possible implementation of this embodiment, a parameter threshold also needs to be preset before 103. Specifically, an empirical value can be set as the parameter threshold, or an optimization algorithm can be used to derive an optimal value as the parameter threshold; this embodiment imposes no particular limitation here. After the spam-degree parameter of the content to be submitted by the user is obtained, it is specifically judged whether the spam-degree parameter is less than or equal to the preset parameter threshold.
In one concrete implementation, if the spam-degree parameter is less than or equal to the preset parameter threshold, it can be determined that the content passes the audit.
In another concrete implementation, if the spam-degree parameter is greater than the parameter threshold, it is determined that the content does not pass the audit.
In another concrete implementation, if the spam-degree parameter is greater than the parameter threshold and the user is in a blacklist, it can be determined that the content does not pass the audit. In this way, no further auditing of the content is needed; it can be directly determined that the content fails the audit, which effectively improves the efficiency of content auditing.
The users stored in the blacklist can be determined according to a specified strategy. The specified strategy may be, for example, that the content submitted by a user has undergone manual auditing and the number of times the user's content failed the audit exceeds a specified threshold, for example 5; this embodiment imposes no particular limitation here.
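The threshold-and-blacklist decision logic described in these implementations can be sketched as follows; the verdict strings and the default threshold value are illustrative.

```python
# Sketch of the decision step: compare the spam-degree parameter against
# a preset threshold, then use the blacklist to decide between outright
# rejection and escalation to manual review.

def audit_decision(spam_deg, user, threshold=0.6, blacklist=frozenset()):
    """Return 'pass', 'fail', or 'manual_review' for the submitted content."""
    if spam_deg <= threshold:
        return "pass"            # low spam degree: audit passes directly
    if user in blacklist:
        return "fail"            # blacklisted user: reject without review
    return "manual_review"       # high spam degree, user not blacklisted
```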
In another concrete implementation, if the spam-degree parameter is greater than the parameter threshold and the user is not in the blacklist, the content can be reported so that it undergoes manual auditing, which determines whether the content passes the audit. Specifically, any manual auditing rules and standards from the prior art may be used; this embodiment imposes no particular limitation here.
After the content has undergone manual auditing, the content and its manual audit result can further be taken as a training sample, and the training sample then used to update the classification model.
In this embodiment, content to be submitted by a user is obtained and checked for preset sensitive words; if none are present, the content is processed with a classification model to obtain its spam-degree parameter, so that whether the content passes the audit can be determined from that parameter. No human involvement is needed in the review process, the operation is simple, and the accuracy is high, which improves the efficiency and reliability of content auditing.
In addition, the technical solutions provided by the present invention can significantly improve the user experience.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of action combinations. Those skilled in the art will appreciate, however, that the present invention is not limited by the described order of actions, since according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art will also appreciate that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Each of the above embodiments is described with its own emphasis; for parts not detailed in one embodiment, refer to the relevant description of the other embodiments.
Fig. 2 is a schematic structural diagram of a content auditing apparatus provided by another embodiment of the present invention, as shown in Fig. 2. The content auditing apparatus of this embodiment may include an acquiring unit 21, a classification unit 22 and an analysis unit 23. The acquiring unit 21 is configured to obtain content to be submitted by a user and detect whether the content contains a preset sensitive word; the classification unit 22 is configured to, if the content does not contain a preset sensitive word, process the content with a classification model to obtain a spam-degree parameter of the content; and the analysis unit 23 is configured to determine, according to the spam-degree parameter, whether the content passes the audit.
It should be noted that the content auditing apparatus provided by this embodiment may be, in part or in whole, an application located in a local terminal, or a functional unit such as a plug-in or software development kit (SDK) arranged in an application in the local terminal, or a processing engine in a network-side server, or a distributed system located on the network side; this embodiment imposes no particular limitation here.
It can be understood that the application may be a native program (native app) installed on the terminal, or a web page program (web app) of a browser on the terminal; this embodiment imposes no particular limitation here.
Optionally, in a possible implementation of this embodiment, the classification unit 22 may further be configured to, if the content contains a preset sensitive word, report the content so that it undergoes manual auditing, which determines whether the content passes the audit.
Further, the classification unit 22 may also be configured to take the content and its manual audit result as a training sample, and to update the classification model with the training sample.
Optionally, in a possible implementation of this embodiment, the analysis unit 23 may specifically be configured to determine that the content passes the audit if the spam-degree parameter is less than or equal to a preset parameter threshold, and to determine that the content does not pass the audit if the spam-degree parameter is greater than the parameter threshold.
Optionally, in a possible implementation of this embodiment, the analysis unit 23 may specifically be configured to: determine that the content passes the audit if the spam-degree parameter is less than or equal to a preset parameter threshold; determine that the content does not pass the audit if the spam-degree parameter is greater than the parameter threshold and the user is in a blacklist; and report the content for manual auditing, which determines whether the content passes the audit, if the spam-degree parameter is greater than the parameter threshold and the user is not in the blacklist.
Further, the analysis unit 23 may also be configured to take the content and its manual audit result as a training sample, and to update the classification model with the training sample.
It should be noted that the method in the embodiment corresponding to Fig. 1 can be implemented by the content auditing apparatus provided by this embodiment; for details, refer to the relevant description in the embodiment corresponding to Fig. 1, which is not repeated here.
In this embodiment, the acquiring unit obtains content to be submitted by a user and detects whether the content contains a preset sensitive word; if it does not, the classification unit processes the content with a classification model to obtain its spam-degree parameter, so that the analysis unit can determine, according to the spam-degree parameter, whether the content passes the audit. No human involvement is needed in the review process, the operation is simple, and the accuracy is high, which improves the efficiency and reliability of content auditing.
In addition, the technical solutions provided by the present invention can significantly improve the user experience.
The above illustrates and describes some preferred embodiments of the application. However, as mentioned above, it should be understood that the application is not limited to the forms disclosed herein, which are not to be taken as excluding other embodiments; the application can be used in various other combinations, modifications, and environments, and can be modified within the scope of the inventive concept described herein according to the above teachings or the technology or knowledge of the related art. Changes and modifications made by those skilled in the art that do not depart from the spirit and scope of the application shall all fall within the protection scope of the appended claims.
Claims (10)
1. A content examination method, characterized by comprising:
obtaining content to be submitted by a user, and detecting whether the content contains preset sensitive vocabulary;
if the content does not contain the preset sensitive vocabulary, processing the content using a classification model to obtain a spam degree parameter of the content; and
determining, according to the spam degree parameter, whether the content passes examination.
2. The method according to claim 1, characterized in that, after obtaining the content to be processed, the method further comprises:
if the content contains the sensitive vocabulary, reporting the content so that manual examination can be performed on the content to determine whether the content passes examination.
3. The method according to claim 1 or 2, characterized in that determining, according to the spam degree parameter, whether the content passes examination comprises:
if the spam degree parameter is less than or equal to a preset parameter threshold, determining that the content passes examination; and
if the spam degree parameter is greater than the parameter threshold, determining that the content does not pass examination.
4. The method according to claim 1 or 2, characterized in that determining, according to the spam degree parameter, whether the content passes examination comprises:
if the spam degree parameter is less than or equal to a preset parameter threshold, determining that the content passes examination;
if the spam degree parameter is greater than the parameter threshold and the user is in a preset blacklist, determining that the content does not pass examination; and
if the spam degree parameter is greater than the parameter threshold and the user is not in the blacklist, reporting the content so that manual examination can be performed on the content to determine whether the content passes examination.
5. The method according to claim 4, characterized in that, after reporting the content when the spam degree parameter is greater than the parameter threshold and the user is not in the blacklist, so that manual examination can be performed on the content to determine whether the content passes examination, the method further comprises:
taking the content and the manual examination result of the content as a training sample; and
updating the classification model using the training sample.
6. A content examination device, characterized by comprising:
an acquiring unit, configured to obtain content to be submitted by a user and detect whether the content contains preset sensitive vocabulary;
a classification unit, configured to, if the content does not contain the preset sensitive vocabulary, process the content using a classification model to obtain a spam degree parameter of the content; and
an analysis unit, configured to determine, according to the spam degree parameter, whether the content passes examination.
7. The device according to claim 6, characterized in that the classification unit is further configured to,
if the content contains the sensitive vocabulary, report the content so that manual examination can be performed on the content to determine whether the content passes examination.
8. The device according to claim 6 or 7, characterized in that the analysis unit is specifically configured to:
if the spam degree parameter is less than or equal to a preset parameter threshold, determine that the content passes examination; and
if the spam degree parameter is greater than the parameter threshold, determine that the content does not pass examination.
9. The device according to claim 6 or 7, characterized in that the analysis unit is specifically configured to:
if the spam degree parameter is less than or equal to a preset parameter threshold, determine that the content passes examination;
if the spam degree parameter is greater than the parameter threshold and the user is in a preset blacklist, determine that the content does not pass examination; and
if the spam degree parameter is greater than the parameter threshold and the user is not in the blacklist, report the content so that manual examination can be performed on the content to determine whether the content passes examination.
10. The device according to claim 9, characterized in that the analysis unit is further configured to take the content and the manual examination result of the content as a training sample, and to update the classification model using the training sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610727794.0A CN106372057A (en) | 2016-08-25 | 2016-08-25 | Content auditing method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610727794.0A CN106372057A (en) | 2016-08-25 | 2016-08-25 | Content auditing method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106372057A true CN106372057A (en) | 2017-02-01 |
Family
ID=57878279
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610727794.0A Pending CN106372057A (en) | 2016-08-25 | 2016-08-25 | Content auditing method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106372057A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101446970A (en) * | 2008-12-15 | 2009-06-03 | 腾讯科技(深圳)有限公司 | Method for censoring and process text contents issued by user and device thereof |
CN201550138U (en) * | 2009-09-10 | 2010-08-11 | 北京盛景无限文化传媒有限公司 | System for providing mobile stream medium service |
CN102098332A (en) * | 2010-12-30 | 2011-06-15 | 北京新媒传信科技有限公司 | Method and device for examining and verifying contents |
- 2016-08-25: CN application CN201610727794.0A (publication CN106372057A) filed; status: Pending
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107612893B (en) * | 2017-09-01 | 2020-06-02 | 北京百悟科技有限公司 | Short message auditing system and method and short message auditing model building method |
CN107612893A (en) * | 2017-09-01 | 2018-01-19 | 北京百悟科技有限公司 | The auditing system and method and structure short message examination & verification model method of short message |
CN108304537A (en) * | 2018-01-30 | 2018-07-20 | 上海康斐信息技术有限公司 | Retain the method and system of user's message |
CN108960782A (en) * | 2018-07-10 | 2018-12-07 | 北京木瓜移动科技股份有限公司 | content auditing method and device |
CN109831751A (en) * | 2019-01-04 | 2019-05-31 | 上海创蓝文化传播有限公司 | A kind of short message content air control system and method based on natural language processing |
CN109918202A (en) * | 2019-03-08 | 2019-06-21 | 上海七牛信息技术有限公司 | Information processing method, device and storage medium |
CN111159354A (en) * | 2019-12-31 | 2020-05-15 | 中国银行股份有限公司 | Sensitive information detection method, device, equipment and system |
CN111611312A (en) * | 2020-05-19 | 2020-09-01 | 四川万网鑫成信息科技有限公司 | Data desensitization method based on rule engine and block chain technology |
CN112232715A (en) * | 2020-11-19 | 2021-01-15 | 湖南红网新媒体集团有限公司 | Volunteer service online management method, system and device |
CN112579771A (en) * | 2020-12-08 | 2021-03-30 | 腾讯科技(深圳)有限公司 | Content title detection method and device |
CN112579771B (en) * | 2020-12-08 | 2024-05-07 | 腾讯科技(深圳)有限公司 | Content title detection method and device |
CN113312449A (en) * | 2021-05-17 | 2021-08-27 | 华南理工大学 | Text auditing method, system and medium based on keywords and deep learning |
CN113360566A (en) * | 2021-08-06 | 2021-09-07 | 成都明途科技有限公司 | Information content monitoring method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106372057A (en) | Content auditing method and apparatus | |
KR101752251B1 (en) | Method and device for identificating a file | |
CN108108902B (en) | Risk event warning method and device | |
CN108121795B (en) | User behavior prediction method and device | |
US10963912B2 (en) | Method and system for filtering goods review information | |
CN108959329B (en) | Text classification method, device, medium and equipment | |
CN108280542A (en) | A kind of optimization method, medium and the equipment of user's portrait model | |
CN110610193A (en) | Method and device for processing labeled data | |
CN112860841A (en) | Text emotion analysis method, device and equipment and storage medium | |
CN106960248B (en) | Method and device for predicting user problems based on data driving | |
CN107305575A (en) | The punctuate recognition methods of human-machine intelligence's question answering system and device | |
CN111460250A (en) | Image data cleaning method, image data cleaning device, image data cleaning medium, and electronic apparatus | |
CN104216876A (en) | Informative text filter method and system | |
CN104915356A (en) | Text classification correcting method and device | |
CN104142912A (en) | Accurate corpus category marking method and device | |
CN109508373A (en) | Calculation method, equipment and the computer readable storage medium of enterprise's public opinion index | |
CN108280164A (en) | A kind of short text filtering and sorting technique based on classification related words | |
CN107491536A (en) | A kind of examination question method of calibration, examination question calibration equipment and electronic equipment | |
CN109902157A (en) | A kind of training sample validation checking method and device | |
CN104809104A (en) | Method and system for identifying micro-blog textual emotion | |
CN109858035A (en) | A kind of sensibility classification method, device, electronic equipment and readable storage medium storing program for executing | |
CN104850540A (en) | Sentence recognizing method and sentence recognizing device | |
CN103886097A (en) | Chinese microblog viewpoint sentence recognition feature extraction method based on self-adaption lifting algorithm | |
CN109933775B (en) | UGC content processing method and device | |
CN105786929A (en) | Information monitoring method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20170201 |