CN101446970B

CN101446970B - Method and device for reviewing and processing text content posted by users

Info

Publication number: CN101446970B
Application number: CN2008102200098A
Authority: CN
Inventors: 刘怀军; 刘昌毅
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2008-12-15
Filing date: 2008-12-15
Publication date: 2012-07-04
Anticipated expiration: 2028-12-15
Also published as: CN101446970A

Abstract

The invention discloses a method and a device for reviewing and processing text content posted by a user. The method comprises the following steps: receiving the text content posted by the user and evaluating the user's information against a list-and-rule database; if the user's information matches neither a whitelist or white rule nor a blacklist or black rule, calculating a first similarity between a first feature vector of the user's text content and a second feature vector of pre-established spam sample content, and judging from the first similarity whether the text content is acceptable; if it is acceptable, publishing the text content; otherwise, sending it for manual review. The method and the device screen both the user's information and the posted text content, so that not everything users post has to be reviewed manually, which greatly reduces manual review time, saves human resources, and correspondingly improves review efficiency.

Description

A method and device for reviewing and processing text content posted by users

Technical field

The present invention relates to the field of communications, and in particular to a method and device for reviewing and processing text content posted by users.

Background technology

At present, the Wenwen community (http://wenwen.soso.com) offers a question-and-answer service similar to Baidu Zhidao and Sina iAsk: on its pages, users can ask questions or answer questions posed by others, which makes it much easier for users to obtain information. The Wenwen community now receives more than 200,000 new questions every day. If all the information that users submit to the community is reviewed manually, a large amount of manual review time is consumed, human resources are wasted, and review efficiency is low.

Summary of the invention

The invention provides a method and device for reviewing and processing text content posted by users, which save a large amount of manual review time and improve review efficiency.

The technical solution of the present invention is a method for reviewing and processing text content posted by a user, comprising the steps of:

receiving the text content posted by the user, and evaluating the user's information against a list-and-rule database, the list-and-rule database comprising a blacklist, a black rule, a whitelist, and a white rule;

if the user's information matches neither the whitelist or white rule nor the blacklist or black rule, performing format conversion on the user's text content and extracting the content words from the text content;

calculating the inverse document frequency weight of each extracted content word in a pre-established document database, to obtain a first feature vector composed of the inverse document frequency weights;

calculating a first similarity between the first feature vector and a second feature vector of pre-established spam sample content, and judging from the first similarity whether the text content posted by the user is acceptable content; if it is acceptable content, publishing the text content posted by the user.

The invention also discloses a device for reviewing and processing text content posted by a user, comprising: a review module, configured to receive the text content posted by the user and to evaluate the user's information against a list-and-rule database, the list-and-rule database comprising a blacklist, a black rule, a whitelist, and a white rule;

a conversion module, configured to, when the user's information matches neither the whitelist or white rule nor the blacklist or black rule, perform format conversion on the text content posted by the user and extract the content words from the text content;

a computing module, configured to calculate the inverse document frequency weight of each extracted content word in a pre-established document database to obtain a first feature vector composed of the inverse document frequency weights, and to calculate a first similarity between the first feature vector and a second feature vector of pre-established spam sample content;

a judgment module, configured to judge from the first similarity whether the user's text content is acceptable content and, if it is, to publish the text content posted by the user.

The method and device of the present invention apply review and filtering only to text content posted by users who match neither the whitelist or white rule nor the blacklist or black rule. Text content posted by users who match the black rule or blacklist, together with text content judged unacceptable, can be sent for manual review, while text content posted by users who match the white rule or whitelist, together with text content judged acceptable, is published directly. In this way, not all information posted by users has to be reviewed manually, which saves a large amount of manual review time, saves human resources, and correspondingly improves review efficiency.

Description of drawings

Fig. 1 is a flowchart of the method of the present invention for reviewing and processing text content posted by a user;

Fig. 2 is a first structural block diagram of the device of the present invention for reviewing and processing text content posted by a user;

Fig. 3 is a second structural block diagram of the device of the present invention for reviewing and processing text content posted by a user;

Fig. 4 is a third structural block diagram of the device of the present invention for reviewing and processing text content posted by a user.

Embodiment

The method and device of the present invention apply review and filtering only to text content posted by users who match neither the whitelist or white rule nor the blacklist or black rule. Text content posted by users who match the black rule or blacklist, together with text content judged unacceptable, is sent for manual review, while text content posted by users who match the white rule or whitelist, together with text content judged acceptable, is published directly. In this way, not all information posted by users has to be reviewed manually, which saves a large amount of manual review time, saves human resources, and correspondingly improves review efficiency.

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

The method of the present invention for reviewing and processing text content posted by a user can be applied to question-and-answer services such as the Wenwen community, Baidu Zhidao, and Sina iAsk.

As shown in Fig. 1, the method of the present invention for reviewing and processing text content posted by a user comprises the following steps.

S100: Receive the text content posted by the user. S101: Evaluate the user's information against the list-and-rule database, which comprises a blacklist, a black rule, a whitelist, and a white rule. In one embodiment, the blacklist may be a list of users with a high probability of posting spam, and the whitelist a list of users with a high probability of posting legitimate information. The black rule is set according to the user's level or credit rating and indicates that the user's level is low or the credit rating is very low; the white rule is likewise set according to the user's level or credit rating and indicates that the user's level is high or the credit rating is very high.
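The list check in step S101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `User` type, the list contents, the level cut-offs, and the choice to test the whitelist before the blacklist are all assumptions, since the source does not specify them.

```python
from dataclasses import dataclass
from enum import Enum


class ListVerdict(Enum):
    WHITE = "white"      # matches the whitelist or white rule
    BLACK = "black"      # matches the blacklist or black rule
    UNKNOWN = "unknown"  # matches neither; content analysis is needed


@dataclass
class User:
    user_id: str
    level: int  # hypothetical numeric user level


# Hypothetical contents of the list-and-rule database.
WHITELIST = {"trusted_alice"}
BLACKLIST = {"spammer_bob"}
WHITE_RULE_MIN_LEVEL = 10  # assumed: level >= 10 satisfies the white rule
BLACK_RULE_MAX_LEVEL = 1   # assumed: level <= 1 satisfies the black rule


def check_user(user: User) -> ListVerdict:
    """Classify a user against the lists and rules (step S101)."""
    # Assumed precedence: the white side is tested first.
    if user.user_id in WHITELIST or user.level >= WHITE_RULE_MIN_LEVEL:
        return ListVerdict.WHITE
    if user.user_id in BLACKLIST or user.level <= BLACK_RULE_MAX_LEVEL:
        return ListVerdict.BLACK
    return ListVerdict.UNKNOWN
```

Only users classified as `UNKNOWN` would continue to the format conversion and similarity steps.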

S102: If the user's information matches neither the whitelist or white rule nor the blacklist or black rule, perform format conversion on the text content posted by the user and extract the content words from it. In one embodiment, format conversion may include converting traditional Chinese characters to simplified characters, converting full-width characters to half-width characters, removing redundant spaces, and so on. Content words are the core words of the text content; function words are not treated as core words.
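A minimal sketch of the format conversion and content-word extraction in step S102, under stated assumptions: the traditional-to-simplified table and the function-word list are tiny hypothetical stand-ins, Unicode NFKC normalization is used for full-width to half-width folding (it also folds other compatibility characters), and a regular expression stands in for the word segmenter that real Chinese text would need.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny tables; a real system would load full ones.
TRADITIONAL_TO_SIMPLIFIED = str.maketrans({"體": "体", "網": "网", "問": "问"})
FUNCTION_WORDS = {"the", "a", "of", "is", "to", "and"}


def normalize(text: str) -> str:
    """Apply the three conversions named in the embodiment."""
    text = text.translate(TRADITIONAL_TO_SIMPLIFIED)  # traditional -> simplified
    text = unicodedata.normalize("NFKC", text)        # full-width -> half-width
    return re.sub(r"\s+", " ", text).strip()          # drop redundant spaces


def extract_content_words(text: str) -> list[str]:
    """Tokenize the normalized text and drop function words."""
    tokens = re.findall(r"\w+", normalize(text).lower())
    return [t for t in tokens if t not in FUNCTION_WORDS]
```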

S103: Calculate the inverse document frequency (IDF) weight of each extracted content word in the pre-established document database, to obtain a first feature vector composed of these IDF weights. In one embodiment, the document database may consist of the text content posted by all users. The IDF weight of each extracted content word may be calculated according to the formula

wgt = t_{f} \times \lg \frac{U}{V}

where wgt is the inverse document frequency (IDF) weight, t_f is the frequency with which the content word occurs in the user's text content, U is the total number of documents in the document database, and V is the number of documents in which the content word occurs.
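The weight formula can be sketched as below. The function and parameter names are hypothetical; the sketch assumes that t_f is a raw occurrence count, that the logarithm is base 10, and that content words absent from the document database (V = 0) are skipped, none of which the source states explicitly.

```python
import math
from collections import Counter


def idf_weights(
    content_words: list[str],
    doc_freq: dict[str, int],
    total_docs: int,
) -> dict[str, float]:
    """Return wgt = t_f * lg(U / V) for each distinct content word."""
    term_freq = Counter(content_words)  # t_f per word (assumed raw count)
    weights: dict[str, float] = {}
    for word, tf in term_freq.items():
        v = doc_freq.get(word, 0)  # V: documents containing the word
        if v == 0:
            continue  # assumption: ignore words unseen in the database
        weights[word] = tf * math.log10(total_docs / v)
    return weights
```

For example, with U = 1000, a word that occurs twice in the post and in 10 documents gets a weight of 2 × lg(100) = 4.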

S104: Calculate the first similarity between the first feature vector and the second feature vector of the pre-established spam sample content. The second feature vector can be obtained in advance, by the same process as the first feature vector: take a piece of spam sample content, perform format conversion on it, extract its content words, calculate the inverse document frequency weight of each content word in the document database, and form the second feature vector from these weights. In one embodiment, the first similarity is calculated according to the formula

\mathrm{Cos}(X, Y) = \frac{\sum_{\alpha = 1, \beta = 1}^{\alpha = m, \beta = n} x_{\alpha} y_{\beta}}{\sqrt{\sum_{\alpha = 1}^{m} x_{\alpha}^{2} \sum_{\beta = 1}^{n} y_{\beta}^{2}}}

where Cos(X, Y) denotes the first similarity, and X = {x_1, ..., x_m} and Y = {y_1, ..., y_n} denote the first feature vector and the second feature vector, respectively.
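As a hedged illustration of step S104, the sketch below computes the standard cosine similarity between two weight vectors stored as word-to-weight dictionaries. Pairing components by shared content word is an assumption about how the numerator of the printed formula is meant to be read; the function name is hypothetical.

```python
import math


def cosine_similarity(x: dict[str, float], y: dict[str, float]) -> float:
    """Standard cosine similarity over sparse word -> weight vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * y[word] for word, weight in x.items() if word in y)
    norm_x = math.sqrt(sum(weight * weight for weight in x.values()))
    norm_y = math.sqrt(sum(weight * weight for weight in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_x * norm_y)
```

A value near 1 means the post uses the same heavily weighted content words as the spam sample; a value of 0 means they share none.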

S105: Judge from the first similarity whether the text content posted by the user is acceptable content. This judgment can be made in many ways and can be configured as needed. In one embodiment, a predetermined threshold is set: if the first similarity is greater than the threshold, the text content posted by the user is judged to be unacceptable content; otherwise it is judged to be acceptable content.

If the content is acceptable, step S107 is performed to publish the text content posted by the user; otherwise, in one embodiment, step S106 is performed to send the text content posted by the user for manual review.
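The threshold decision of steps S105 to S107 reduces to a single comparison, sketched below. The threshold value of 0.8 and the string labels are hypothetical; the source only speaks of "a predetermined threshold".

```python
SPAM_THRESHOLD = 0.8  # hypothetical value for the predetermined threshold


def is_acceptable(first_similarity: float, threshold: float = SPAM_THRESHOLD) -> bool:
    """S105: content is unacceptable only if the similarity exceeds the threshold."""
    return first_similarity <= threshold


def route(first_similarity: float) -> str:
    """S107 publishes acceptable content; S106 sends the rest to manual review."""
    return "publish" if is_acceptable(first_similarity) else "manual_review"
```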

In one embodiment, the method may further comprise, after step S101: S102, if the user's information matches the blacklist or black rule, sending the text content posted by the user for manual review; and S103, if the user's information matches the whitelist or white rule, publishing the text content posted by the user.

To judge more comprehensively and accurately whether the content posted by the user is acceptable, and to reduce the probability of misjudgment, one embodiment further comprises the following step when the user's information matches neither the whitelist or white rule nor the blacklist or black rule: detecting a second similarity between the text content posted by the user and a pre-established feature library comprising phone number formats, web page formats, Martian script formats, and the like, and judging from the second similarity and the first similarity whether the text content is acceptable. For this judgment, weights may be assigned to the first similarity and the second similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unacceptable; otherwise it is acceptable. Alternatively, the second similarity alone may be compared with a predetermined value, and if it is greater, the text content is directly judged to be unacceptable.
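The source does not say how the second similarity is measured, so the following is only one possible reading: the feature library is modeled as a set of regular expressions, and the similarity is the fraction of patterns that match. The two patterns, the omission of Martian script detection, and the weights 0.6 and 0.4 are all assumptions.

```python
import re

# Hypothetical feature library; Martian script detection is omitted because
# it would need a character table that the source does not provide.
FEATURE_PATTERNS = {
    "phone": re.compile(r"\b1\d{10}\b"),  # assumed: 11-digit mobile number
    "url": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
}


def second_similarity(text: str) -> float:
    """Fraction of feature-library patterns found in the text (assumed measure)."""
    hits = sum(1 for pattern in FEATURE_PATTERNS.values() if pattern.search(text))
    return hits / len(FEATURE_PATTERNS)


def combined_score(first: float, second: float, w1: float = 0.6, w2: float = 0.4) -> float:
    """Weighted sum of the two similarities, to be compared with a predetermined value."""
    return w1 * first + w2 * second
```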

For the same purpose, that is, to judge more comprehensively and accurately whether the content posted by the user is acceptable and to reduce the probability of misjudgment, one embodiment further comprises the following step when the user's information matches neither the whitelist or white rule nor the blacklist or black rule: counting the number of characters in the text content posted by the user, and judging from the character count, the first similarity, and the second similarity whether the text content is acceptable. For this judgment, weights may be assigned to the character count, the first similarity, and the second similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unacceptable; otherwise it is acceptable. Alternatively, a predetermined value may be set for the character count alone, and if the character count is less than this value, the text content is directly judged to be unacceptable.
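A small sketch of the character-count check. The standalone test follows the text directly; the `shortness_score` mapping is an assumption, added because the source does not say how a raw character count enters a weighted sum in which larger totals mean unacceptable content. The minimum of 5 characters is hypothetical.

```python
def too_short(text: str, min_chars: int = 5) -> bool:
    """Standalone check: fewer characters than the predetermined value."""
    return len(text.strip()) < min_chars


def shortness_score(text: str, min_chars: int = 5) -> float:
    """Assumed normalization: 1.0 for empty text, 0.0 at min_chars or more."""
    count = len(text.strip())
    return max(0.0, 1.0 - count / min_chars)
```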

For the same purpose, that is, to judge more comprehensively and accurately whether the content posted by the user is acceptable and to reduce the probability of misjudgment, one embodiment further comprises the following step when the user's information matches neither the whitelist or white rule nor the blacklist or black rule: detecting a third similarity between the text content posted by the user and a pre-established database of unpublishable words (a collection of special words and phrases, content recently required to be blocked, or other configured items), and judging from the third similarity, the character count, the first similarity, and the second similarity whether the text content is acceptable. For this judgment, weights may be assigned to the third similarity, the character count, the first similarity, and the second similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unacceptable; otherwise it is acceptable. Alternatively, the third similarity alone may be compared with a predetermined value, and if it is greater, the text content is judged to be unacceptable.
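One way to picture the four-signal judgment is sketched below. Everything concrete here is an assumption: the term list, the substring-based third similarity, the `shortness` field (a character count normalized to the range 0 to 1), the weights, and the limit of 0.5.

```python
from dataclasses import dataclass

# Hypothetical entries standing in for the database of unpublishable words.
UNPUBLISHABLE_TERMS = ("forbidden phrase", "blockedword")


def third_similarity(text: str) -> float:
    """Fraction of unpublishable terms found in the text (assumed measure)."""
    lowered = text.lower()
    hits = sum(1 for term in UNPUBLISHABLE_TERMS if term in lowered)
    return hits / len(UNPUBLISHABLE_TERMS)


@dataclass(frozen=True)
class Signals:
    first: float      # similarity to the spam sample
    second: float     # similarity to the feature library
    third: float      # similarity to the unpublishable-word database
    shortness: float  # assumed normalized character-count signal


ASSUMED_WEIGHTS = Signals(first=0.4, second=0.2, third=0.3, shortness=0.1)
ASSUMED_LIMIT = 0.5


def weighted_sum(s: Signals, w: Signals = ASSUMED_WEIGHTS) -> float:
    return (w.first * s.first + w.second * s.second
            + w.third * s.third + w.shortness * s.shortness)


def is_unacceptable(s: Signals) -> bool:
    """Unacceptable when the weighted sum exceeds the predetermined value."""
    return weighted_sum(s) > ASSUMED_LIMIT
```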

The present invention also discloses a device for reviewing and processing text content posted by a user. As shown in Fig. 2, it comprises a review module, a conversion module, a computing module, and a judgment module connected in sequence.

The review module is configured to receive the text content posted by the user and to evaluate the user's information against the list-and-rule database, which comprises a blacklist, a black rule, a whitelist, and a white rule. In one embodiment, the blacklist may be a list of users with a high probability of posting spam, and the whitelist a list of users with a high probability of posting legitimate information. The black rule is set according to the user's level or credit rating and indicates that the user's level is low or the credit rating is very low; the white rule is likewise set according to the user's level or credit rating and indicates that the user's level is high or the credit rating is very high.

The conversion module is configured to, when the user's information matches neither the whitelist or white rule nor the blacklist or black rule, perform format conversion on the text content posted by the user and extract the content words from it. In one embodiment, format conversion may include converting traditional Chinese characters to simplified characters, converting full-width characters to half-width characters, removing redundant spaces, and so on. Content words are the core words of the text content; function words are not treated as core words.

The computing module is configured to calculate the inverse document frequency (IDF) weight of each extracted content word in the pre-established document database, to obtain a first feature vector composed of these IDF weights, and to calculate the first similarity between the first feature vector and the second feature vector of the pre-established spam sample content. In one embodiment, the document database may consist of the text content posted by all users. The IDF weight of each extracted content word may be calculated according to the formula

wgt = t_{f} \times \lg \frac{U}{V}

where wgt is the inverse document frequency (IDF) weight, t_f is the frequency with which the content word occurs in the user's text content, U is the total number of documents in the document database, and V is the number of documents in which the content word occurs. The second feature vector of the spam sample content can be obtained in advance, by the same process as the first feature vector: take a piece of spam sample content, perform format conversion on it, extract its content words, calculate the inverse document frequency weight of each content word in the document database, and form the second feature vector from these weights. In one embodiment, the first similarity is calculated according to the formula

\mathrm{Cos}(X, Y) = \frac{\sum_{\alpha = 1, \beta = 1}^{\alpha = m, \beta = n} x_{\alpha} y_{\beta}}{\sqrt{\sum_{\alpha = 1}^{m} x_{\alpha}^{2} \sum_{\beta = 1}^{n} y_{\beta}^{2}}}

where Cos(X, Y) denotes the first similarity, and X = {x_1, ..., x_m} and Y = {y_1, ..., y_n} denote the first feature vector and the second feature vector, respectively.

The judgment module is configured to judge from the first similarity whether the user's text content is acceptable content and, if it is, to publish the text content posted by the user. In one embodiment, if the user's text content is judged to be unacceptable content, the judgment module sends the text content posted by the user for manual review.

In one embodiment, when the user's information matches the blacklist or black rule, the review module sends the text content posted by the user for manual review; when the user's information matches the whitelist or white rule, it publishes the text content posted by the user.
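The chain of four modules can be pictured as a small class that wires together four injected callables. This is an illustrative sketch only: the class name, the string verdicts, and the callable signatures are hypothetical and not taken from the patent.

```python
from typing import Callable


class ReviewDevice:
    """Review, conversion, computing, and judgment modules connected in sequence."""

    def __init__(
        self,
        review: Callable[[str], str],           # user id -> "white" | "black" | "unknown"
        convert: Callable[[str], list[str]],    # text -> content words
        compute: Callable[[list[str]], float],  # content words -> first similarity
        judge: Callable[[float], bool],         # first similarity -> acceptable?
    ) -> None:
        self.review = review
        self.convert = convert
        self.compute = compute
        self.judge = judge

    def handle(self, user_id: str, text: str) -> str:
        verdict = self.review(user_id)
        if verdict == "white":
            return "publish"        # trusted user: publish directly
        if verdict == "black":
            return "manual_review"  # distrusted user: manual review
        similarity = self.compute(self.convert(text))
        return "publish" if self.judge(similarity) else "manual_review"
```

Injecting the modules as callables keeps each stage replaceable, which mirrors how the later embodiments add detection and statistics modules between the review and judgment modules.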

To judge more comprehensively and accurately whether the content posted by the user is acceptable, and to reduce the probability of misjudgment, a detection module is further connected between the review module and the judgment module, as shown in Fig. 3. When the user's information matches neither the whitelist or white rule nor the blacklist or black rule, the detection module detects a second similarity between the text content posted by the user and a pre-established feature library comprising phone number formats, web page formats, and Martian script formats, and/or detects a third similarity between the user's text content and a pre-established database of unpublishable words, and sends the second similarity and/or the third similarity to the judgment module. The judgment module judges from the first similarity, the second similarity, and/or the third similarity whether the text content posted by the user is acceptable. For this judgment, weights may be assigned to the first similarity, the second similarity, and/or the third similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unacceptable; otherwise it is acceptable.

For the same purpose, that is, to judge more comprehensively and accurately whether the content posted by the user is acceptable and to reduce the probability of misjudgment, a statistics module is further connected between the review module and the judgment module, as shown in Fig. 4. When the user's information matches neither the whitelist or white rule nor the blacklist or black rule, the statistics module counts the number of characters in the text content posted by the user and sends the character count to the judgment module. The judgment module judges from the character count, the first similarity, the second similarity, and/or the third similarity whether the text content posted by the user is acceptable. For this judgment, weights may be assigned to the character count, the first similarity, the second similarity, and/or the third similarity, and the weighted sum compared with a predetermined value: if the sum is greater than the predetermined value, the text content is judged to be unacceptable; otherwise it is acceptable.

In summary, the method and device of the present invention can review and filter both user information and the text content posted by users. Text content posted by users who match the black rule or blacklist, together with text content judged unacceptable, is sent for manual review, while text content posted by users who match the white rule or whitelist, together with text content judged acceptable, is published directly. In this way, not all information posted by users has to be reviewed manually, which saves a large amount of manual review time, saves human resources, and correspondingly improves review efficiency.

The embodiments of the present invention described above do not limit the scope of protection of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A method for reviewing and processing text content posted by a user, characterized by comprising the steps of:

if the user's information matches neither the whitelist or white rule nor the blacklist or black rule, performing format conversion on the text content posted by the user and extracting the content words from the text content, the content words being the core words of the text content;

calculating a first similarity between the first feature vector and a second feature vector of pre-established spam sample content, and judging from the first similarity whether the text content posted by the user is acceptable content; if it is acceptable content, publishing the text content posted by the user;

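wherein the first similarity is calculated according to the formula

\mathrm{Cos}(X, Y) = \frac{\sum_{\alpha = 1, \beta = 1}^{\alpha = m, \beta = n} x_{\alpha} y_{\beta}}{\sqrt{\sum_{\alpha = 1}^{m} x_{\alpha}^{2} \sum_{\beta = 1}^{n} y_{\beta}^{2}}}

where Cos(X, Y) denotes the first similarity, and X = {x_1, ..., x_m} and Y = {y_1, ..., y_n} denote the first feature vector and the second feature vector, respectively.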

2. The method for reviewing and processing text content posted by a user according to claim 1, characterized in that, when the user's information matches neither the whitelist or white rule nor the blacklist or black rule, the method further comprises the step of: detecting a second similarity between the text content posted by the user and a pre-established feature library comprising phone number formats, web page formats, and Martian script formats, and judging from the second similarity and the first similarity whether the text content posted by the user is acceptable content.

3. The method for reviewing and processing text content posted by a user according to claim 2, characterized in that, when the user's information matches neither the whitelist or white rule nor the blacklist or black rule, the method further comprises the step of: counting the number of characters in the text content posted by the user, and judging from the character count, the first similarity, and the second similarity whether the text content posted by the user is acceptable content.

4. The method for reviewing and processing text content posted by a user according to claim 3, characterized in that, when the user's information matches neither the whitelist or white rule nor the blacklist or black rule, the method further comprises the step of: detecting a third similarity between the text content posted by the user and a pre-established database of unpublishable words, and judging from the third similarity, the character count, the first similarity, and the second similarity whether the text content posted by the user is acceptable content.

5. The method for reviewing and processing text content posted by a user according to any one of claims 1 to 4, characterized in that the inverse document frequency weight of each extracted content word in the pre-established document database is calculated according to the formula

wgt = t_{f} \times \lg \frac{U}{V}

where wgt is the inverse document frequency weight, t_f is the frequency with which the content word occurs in the user's text content, U is the total number of documents in the document database, and V is the number of documents in which the content word occurs.

6. A device for reviewing and processing text content posted by a user, characterized by comprising:

a review module, configured to receive the text content posted by the user and to evaluate the user's information against a list-and-rule database, the list-and-rule database comprising a blacklist, a black rule, a whitelist, and a white rule;

a conversion module, configured to, when the user's information matches neither the whitelist or white rule nor the blacklist or black rule, perform format conversion on the text content posted by the user and extract the content words from the text content, the content words being the core words of the text content;

a judgment module, configured to judge from the first similarity whether the text content posted by the user is acceptable content and, if it is, to publish the text content posted by the user;

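wherein the computing module calculates the first similarity according to the formula

\mathrm{Cos}(X, Y) = \frac{\sum_{\alpha = 1, \beta = 1}^{\alpha = m, \beta = n} x_{\alpha} y_{\beta}}{\sqrt{\sum_{\alpha = 1}^{m} x_{\alpha}^{2} \sum_{\beta = 1}^{n} y_{\beta}^{2}}}

where Cos(X, Y) denotes the first similarity, and X = {x_1, ..., x_m} and Y = {y_1, ..., y_n} denote the first feature vector and the second feature vector, respectively.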

7. The device for reviewing and processing text content posted by a user according to claim 6, characterized by further comprising a detection module configured to, when the user's information matches neither the whitelist or white rule nor the blacklist or black rule, detect a second similarity between the text content posted by the user and a pre-established feature library comprising phone number formats, web page formats, and Martian script formats, detect a third similarity between the user's text content and a pre-established database of unpublishable words, and send the second similarity and the third similarity to the judgment module, the judgment module judging from the first similarity, the second similarity, and the third similarity whether the text content posted by the user is acceptable content.

8. The device for reviewing and processing text content posted by a user according to claim 7, characterized by further comprising a statistics module configured to, when the user's information matches neither the whitelist or white rule nor the blacklist or black rule, count the number of characters in the text content and send the character count to the judgment module, the judgment module judging from the character count, the second similarity, the third similarity, and the first similarity whether the text content posted by the user is acceptable content.