CN110298020B

CN110298020B - Text anti-cheating variant reduction method and equipment, and text anti-cheating method and equipment

Info

Publication number: CN110298020B
Application number: CN201910462812.0A
Authority: CN
Inventors: 袁晖
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-05-30
Filing date: 2019-05-30
Publication date: 2023-05-16
Anticipated expiration: 2039-05-30
Also published as: CN110298020A

Abstract

An embodiment of the invention provides a text anti-cheating variant reduction method comprising the following steps: taking the text as a root node text; taking the root node text itself as a child node text, and also expanding each character in the root node text according to each mapping in one of N mapping relations to generate further child node texts; taking each child node text itself as a new child node text, and also expanding each character in each child node text according to each mapping in one of the mapping relations not yet adopted to generate further new child node texts; repeating this step until every mapping in every mapping relation has been traversed for all characters in all child node texts; and scoring the smoothness of the child node texts and selecting the child node text with the highest smoothness score as the restored text. Restoring the text in this way makes subsequent keyword matching or model recognition easier, so that the cheating text can ultimately be deleted.

Description

Text anti-cheating variant reduction method and equipment, and text anti-cheating method and equipment

Technical Field

The invention relates to the field of text anti-cheating, in particular to a text anti-cheating variant reduction method, a text anti-cheating variant reduction device, a text anti-cheating method and a text anti-cheating device.

Background

With the continuous development of the Internet, the number of Internet users has grown year by year, bringing large Internet companies traffic dividends in many forms. Behind this thriving market, however, a parasitic market of cheating promotion has grown up: cheating promotional posts (that is, soft articles or soft advertisements) are published in community products, feed streams, and other product lines to promote certain goods or services. These posts seriously harm the user experience of the products and, to some extent, divert potential advertisers for free, costing the companies revenue. In today's Internet environment, a large number of black- and gray-market teams send violating text content in bulk for purposes including driving traffic to pornographic websites, selling counterfeit medicine, and online fraud. The cheating text they send is among the hardest to identify. Text anti-cheating refers to identifying advertising cheating content in text. The prior art offers two methods for identifying cheating text: 1. keyword matching; 2. machine learning model recognition.

The keyword matching method works as follows: a dictionary is read and the text to be predicted is matched against its keywords; if a match succeeds, the text is considered to contain cheating content. To handle variants, variant content is enumerated and written into the keyword dictionary. The keyword matching method has the following defects:

first, although the total number of characters is limited, the number of combinations of variant characters is enormous (for example, if the character "micro" has 20 variants and the character "letter" has 15 variants, the two-character keyword "micro letter" has 20×15=300 variants), so enumerating the variant combinations of all characters is almost impossible;

second, enumerated variant character combinations tend to cause more false positives.
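The combinatorial growth behind the first defect can be illustrated with a short sketch. The per-character counts are the 20 and 15 from the example above; the function names and the placeholder variant labels are assumptions made purely for illustration.

```python
from itertools import product
from math import prod
from typing import List, Sequence


def count_variant_combinations(variants_per_char: Sequence[int]) -> int:
    """Number of spellings of a keyword when each character varies independently."""
    return prod(variants_per_char)


def enumerate_variants(variant_lists: Sequence[Sequence[str]]) -> List[str]:
    """List every combination, which is what a keyword dictionary would have to hold."""
    return ["".join(combo) for combo in product(*variant_lists)]


# Placeholder labels stand in for real variant characters (hypothetical data).
first_char_variants = [f"a{i}" for i in range(20)]
second_char_variants = [f"b{i}" for i in range(15)]

two_char_total = count_variant_combinations([20, 15])
dictionary_entries = enumerate_variants([first_char_variants, second_char_variants])
# A third character with 10 variants multiplies the dictionary size again.
three_char_total = count_variant_combinations([20, 15, 10])
```

Each additional character multiplies the number of dictionary entries, which is why enumeration does not scale to longer keywords.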

The machine learning model identification method works as follows: a corpus is built and cheating texts are labeled manually; an NLP classification model is trained by machine learning; the model predicts on the text to be judged and outputs a cheating probability; when the cheating probability exceeds a certain threshold, the text is considered to contain cheating content.

The defects of machine learning model identification are as follows. The model's effectiveness depends on the training samples in the corpus and, as with keyword matching, the corpus cannot cover the variant combinations of all characters. The model can only identify variant cheating text contained in the corpus, not variant cheating text outside its scope. Moreover, existing NLP classification models mainly address the classification of natural language, whereas in variant text the contextual relations among words have been deliberately destroyed; such text cannot be regarded as natural language, so NLP models perform poorly on it.

Disclosure of Invention

The invention provides a method and a device that effectively remove the variants in a text and restore its real information, which facilitates the subsequent execution of other strategies such as keyword matching and model recognition and improves the generalization capability of anti-cheating strategies.

To achieve the above object, in a first aspect of the present invention, there is provided a text anti-cheating variant reduction method, the method comprising:

taking the text as a root node text;

directly taking the root node text as a child node text, and expanding each character in the root node text according to each mapping in one of N mapping relations to generate the child node text;

taking each child node text as a new child node text, and expanding each character in each child node text according to each mapping in one of the mapping relations not yet adopted to generate new child node texts;

repeating the steps until each mapping in each mapping relation of all characters in all child node texts is traversed;

and scoring the smoothness of all the child node texts generated by the last expansion in the traversed child node texts, and selecting the child node text with the highest smoothness score as a restored text.

Optionally, in the step, the N mapping relations include a shape-near word mapping, a homophone mapping, a harmonic word mapping, and an interference character mapping.

Optionally, the method for expanding the shape-near word mapping relation comprises the following steps:

capturing all characters in the root node text or the child node text using image processing software, and performing image recognition on the captured characters;

expanding Chinese characters whose shape is similar to each recognized character into child node text as shape-near words;

capturing the decomposition data and the radical data of each character using image processing software;

and, with the radical of each character removed, if the remaining part of the character can form a Chinese character, expanding the remaining part into child node text as a shape-near word.

Optionally, the method for expanding homophone mapping relation comprises the following steps:

converting each Chinese character in the root node text or the child node text into pinyin;

and expanding each Chinese character that has the same pinyin into child node text as a homophone.

Optionally, the method for expanding the mapping relation of the harmonic words comprises the following steps:

expanding each Chinese character whose pinyin is similar into child node text as a harmonic character.

Optionally, the method for expanding the mapping relation of the interference characters comprises the following steps:

expanding meaningless characters into child node text as null characters.

Optionally, the steps further include: scoring the smoothness of each child node text and, according to the resulting smoothness scores, deleting the M lowest-scoring child node texts.

In another aspect of the present invention, there is also provided a text anti-cheating method, the method including:

reducing the variants in the text using the method described above to obtain a restored text;

performing keyword matching or model recognition on the restored text;

labeling the text whose keywords are successfully matched as cheating text, or labeling the text identified by the model as cheating text.

In a third aspect of the present invention, there is also provided a text anti-cheating variant reduction apparatus, including:

at least one processor;

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method described above by executing the instructions stored by the memory.

The fourth aspect of the present invention also provides a text anti-cheating device, including:

at least one processor;

a memory coupled to the at least one processor;

In a fifth aspect of the invention, there is also provided a machine-readable storage medium having stored thereon instructions which, when executed by a controller, enable the controller to perform the method as described hereinbefore.

The technical scheme of the invention has at least the following effects:

the method effectively removes the variants in the text, restores the real information of the text, facilitates the subsequent execution of other strategies such as keyword matching, model recognition and the like, and improves the generalization capability of the anti-cheating strategy.

Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.

Drawings

The accompanying drawings are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain, without limitation, the embodiments of the invention. In the drawings:

FIG. 1 is a flow chart of a text anti-cheating variant reduction method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a text anti-cheating method provided by an embodiment of the present invention.

Detailed Description

The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.

In the embodiments of the present invention, unless otherwise indicated, terms of orientation such as "upper, lower, top, bottom" generally refer to the orientation shown in the drawings or to the positional relationship of the components relative to one another in the vertical or gravitational direction.

FIG. 1 is a flow chart of a text anti-cheating variant reduction method provided by an embodiment of the present invention. As shown in FIG. 1, the present invention provides a text anti-cheating variant reduction method, which includes:

S11) taking the text as a root node text. When a text is considered to be cheating text, it is taken as the root of the variant reduction tree.

S12) directly taking the root node text as a child node text, and expanding each character in the root node text according to each mapping in one of N mapping relations to generate the child node text. Root node text typically contains multiple characters. Starting from the first character, each mapping in the N-th mapping relation is used to expand the first character into child node texts. For example, using the homophone mapping relation, the first character can be expanded into "you", "Ni", "woolen", etc., and the expanded child node texts then have "you", "Ni", or "woolen" respectively as their first character. The second character, the third character, and so on of the root node text are expanded into child node texts in turn in the same way, using each mapping in the N-th mapping relation. The unreplaced text is also written out as one child node; that is, the root node text itself is also expanded as one child node text.

S13) taking each child node text as a new child node text, and expanding each character in each child node text according to each mapping in one of the mapping relations not yet adopted to generate new child node texts. After step S12) has generated a plurality of child node texts using the N-th mapping relation, each character of each child node text is expanded using the (N-1)-th mapping relation. For example, suppose the child node texts generated in step S12) include "hello", "nigood", "woolen", and "nigood". The (N-1)-th mapping relation, which may be the shape-near word mapping relation, is then applied: "hello" is expanded into "you", "1 girl", "man good", and the like, and "nigood" is expanded into "female", "female good", "Nifemale", and the like. The unreplaced text is also written out as one child node; that is, each child node text itself is also expanded as a new child node text. With the (N-1)-th mapping relation, each of the child node texts generates further child node texts, much as a tree sends out several branches and each branch divides into several more. Variant restoration of the root node text and the child node texts therefore forms a variant reduction tree with many branches.

S14) repeating step S13) until each mapping in each mapping relation has been traversed for all characters in all child node texts. Repeating the step yields a variant reduction tree with many branches. The more characters the root node text has, the more child node texts are generated and the more branches the variant reduction tree has. A complete traversal produces a very large number of child node texts.

S15) scoring the smoothness of all the child node texts generated by the last expansion, and selecting the child node text with the highest smoothness score as the restored text. That is, smoothness is scored at the ends of the branches, on the child node texts obtained by the last expansion. The scoring step is prior art in the field and is not described in detail here.
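Steps S11) to S15) can be sketched as a layer-by-layer expansion followed by scoring of the final layer. This is a minimal sketch under stated assumptions, not the patented implementation: each child replaces a single character, the two toy mapping tables and the bigram-based `toy_smoothness` scorer are hypothetical stand-ins, and the names `expand_once` and `restore` are invented for illustration.

```python
from typing import Callable, Dict, List, Sequence

Mapping = Dict[str, List[str]]


def expand_once(text: str, mapping: Mapping) -> List[str]:
    """Keep the text unchanged as one child, then add every single-character substitution."""
    children = [text]
    for index, char in enumerate(text):
        for replacement in mapping.get(char, []):
            child = text[:index] + replacement + text[index + 1:]
            if child not in children:
                children.append(child)
    return children


def restore(text: str, mappings: Sequence[Mapping], score: Callable[[str], float]) -> str:
    """Build the tree one layer per mapping relation (S11-S14), then pick the best leaf (S15)."""
    layer = [text]  # S11: the input text is the root node text
    for mapping in mappings:
        next_layer: List[str] = []
        for node in layer:
            for child in expand_once(node, mapping):
                if child not in next_layer:
                    next_layer.append(child)
        layer = next_layer
    return max(layer, key=score)


# Hypothetical toy data: a sound-alike table and an interference-character table.
SOUND_MAP: Mapping = {"泥": ["你", "妮"]}
INTERFERENCE: Mapping = {"*": [""]}
KNOWN_BIGRAMS = {"你好": 10.0}


def toy_smoothness(text: str) -> float:
    """Stand-in scorer: sum the weights of known adjacent character pairs."""
    return sum(KNOWN_BIGRAMS.get(text[i:i + 2], 0.0) for i in range(len(text) - 1))


restored = restore("泥*好", [SOUND_MAP, INTERFERENCE], toy_smoothness)
print(restored)  # 你好
```

In practice the scorer would be a real language model or similar fluency measure, which the text treats as prior art.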

In step S12), the N mapping relationships include a shape-near word map, a homophone map, a harmonic word map, and an interference character map. The mapping relation can be increased and decreased according to actual conditions.

Specifically, the method for expanding the shape-near word mapping relation comprises the following steps:

capturing all characters in the root node text or the child node text by adopting image processing software, and carrying out image recognition on the captured characters; according to one embodiment, the image recognition is performed by means of OCR recognition.

Chinese characters similar in shape to each recognized character are expanded into child node text as shape-near words. For example, for the pixel image of the character "micro", OCR results whose similarity exceeds a certain threshold might include "micro", "bar", "sign", "badge", "bear", and "mikania micrantha". The character "micro" itself is removed, and the remaining results are recorded as shape-near words and used as expandable characters.
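The threshold filtering described above can be sketched without an OCR engine. The example below swaps OCR for a plain pixel-overlap (Jaccard) similarity on tiny hand-made bitmaps; the glyph table, the `similarity` measure, and the threshold value are all assumptions chosen for illustration, not the recognition method used in the embodiment.

```python
from typing import Dict, List, Set, Tuple

Bitmap = Tuple[str, ...]

# Hypothetical 3x3 glyph bitmaps; "#" marks an inked pixel.
GLYPHS: Dict[str, Bitmap] = {
    "X": ("#.#", ".#.", "#.#"),
    "Y": ("#.#", ".#.", ".#."),
    "O": ("###", "#.#", "###"),
}


def inked_pixels(bitmap: Bitmap) -> Set[Tuple[int, int]]:
    """Coordinates of all inked pixels in a bitmap."""
    return {(r, c) for r, row in enumerate(bitmap) for c, px in enumerate(row) if px == "#"}


def similarity(a: Bitmap, b: Bitmap) -> float:
    """Jaccard overlap of inked pixels, between 0.0 and 1.0."""
    pixels_a, pixels_b = inked_pixels(a), inked_pixels(b)
    return len(pixels_a & pixels_b) / len(pixels_a | pixels_b)


def shape_near(char: str, threshold: float) -> List[str]:
    """Other glyphs whose similarity to `char` reaches the threshold; `char` itself is removed."""
    target = GLYPHS[char]
    return [
        other
        for other, bitmap in GLYPHS.items()
        if other != char and similarity(target, bitmap) >= threshold
    ]


print(shape_near("X", 0.5))  # ['Y']
```

The resulting list plays the role of the recorded shape-near words, which can then feed a character-to-variants mapping table.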

The method steps of the shape-near word mapping relation expansion can further comprise:

Specifically, many Chinese characters are composed of several radicals or several other Chinese characters. For example, "greeting" can be broken down into the two characters "add" and "shellfish", so "greeting" can be expanded into a child node text containing "add" and, likewise, into a child node text containing "shellfish".
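The decomposition-based expansion can be sketched with a small lookup table. The table entries, the set of standalone characters, and the function name `component_variants` are illustrative assumptions; a real system would draw on full decomposition and radical data.

```python
from typing import Dict, List, Set

# Hypothetical decomposition table: character -> its component parts.
DECOMPOSITION: Dict[str, List[str]] = {
    "贺": ["加", "贝"],
    "你": ["亻", "尔"],
}

# Parts that are also valid standalone Chinese characters (toy subset).
STANDALONE: Set[str] = {"加", "贝", "尔"}


def component_variants(char: str) -> List[str]:
    """Components of `char` that form Chinese characters on their own."""
    return [part for part in DECOMPOSITION.get(char, []) if part in STANDALONE]


print(component_variants("贺"))  # ['加', '贝']
print(component_variants("你"))  # ['尔']
```

The radical 亻 is dropped in the second call because it is not in the standalone set, mirroring the rule that only a remainder that forms a Chinese character is used for expansion.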

Specifically, the method for expanding homophone mapping relation comprises the following steps:

Each Chinese character with the same pinyin is expanded into child node text as a homophone. For example, the Chinese characters with the same pinyin as "sad" may be "enemy", "frame", "thick", and the like. Using the homophone mapping relation, "sad" can therefore be expanded into child node texts containing "enemy", "frame", and "thick".
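A homophone mapping can be built by grouping characters on their pinyin. The sketch below uses a small hand-written pinyin table (an assumption; a production system would use a complete pronunciation dictionary) and treats characters as homophones only when syllable and tone both match, which is one possible reading of "the same pinyin".

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical toy table: character -> pinyin with a trailing tone digit.
PINYIN: Dict[str, str] = {
    "愁": "chou2",
    "仇": "chou2",
    "稠": "chou2",
    "丑": "chou3",
    "江": "jiang1",
}


def build_homophone_map(pinyin: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each character to the other characters sharing its exact pinyin."""
    by_sound: Dict[str, List[str]] = defaultdict(list)
    for char, sound in pinyin.items():
        by_sound[sound].append(char)
    return {
        char: [other for other in by_sound[sound] if other != char]
        for char, sound in pinyin.items()
    }


HOMOPHONES = build_homophone_map(PINYIN)
print(HOMOPHONES["愁"])  # ['仇', '稠']
```

The resulting dictionary has the character-to-variants shape that a tree-expansion step can consume directly.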

Specifically, the method for expanding the harmonic word mapping relation comprises the following steps:

For example, the Chinese characters harmonic with "river" may be "will", "Jiang", "prize", and so forth. Using the harmonic word mapping relation, "river" can therefore be expanded into child node texts containing "will", "Jiang", and "prize".
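"Similar pinyin" is not defined precisely in the text. One simple interpretation, assumed here purely for illustration, is to treat two characters as harmonic when their syllables match once the tone is ignored. The pinyin table and function names below are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy table: character -> pinyin with a trailing tone digit.
PINYIN: Dict[str, str] = {
    "江": "jiang1",
    "将": "jiang1",
    "奖": "jiang3",
    "讲": "jiang3",
    "见": "jian4",
}


def toneless(pinyin: str) -> str:
    """Drop the trailing tone digit, e.g. 'jiang3' -> 'jiang'."""
    return pinyin.rstrip("12345")


def harmonic_characters(char: str, table: Dict[str, str]) -> List[str]:
    """Other characters whose syllable matches `char` when tone is ignored."""
    if char not in table:
        return []
    base = toneless(table[char])
    return [
        other
        for other, sound in table.items()
        if other != char and toneless(sound) == base
    ]


print(harmonic_characters("江", PINYIN))  # ['将', '奖', '讲']
```

A looser definition could also merge commonly confused initials or finals; that choice depends on the variants seen in practice.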

Specifically, the method for expanding the mapping relation of the interference characters comprises the following steps:

A meaningless character, such as "-", "\", "one", or a kana character, is expanded into child node text as a null character.
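Expanding an interference character as a null character amounts to generating children in which that character has been deleted. The sketch below assumes a small hypothetical set of noise characters and keeps the unmodified text as the first child, consistent with the unreplaced text being retained as a child node.

```python
from typing import List, Set

# Hypothetical set of interference characters.
NOISE_CHARS: Set[str] = {"-", "\\", "*", "~"}


def noise_children(text: str) -> List[str]:
    """The text itself plus one child per interference character deleted."""
    children = [text]
    for index, char in enumerate(text):
        if char in NOISE_CHARS:
            child = text[:index] + text[index + 1:]
            if child not in children:
                children.append(child)
    return children


def drop_all_noise(text: str) -> str:
    """Shortcut that deletes every interference character at once."""
    return "".join(char for char in text if char not in NOISE_CHARS)


print(noise_children("a-b-c"))  # ['a-b-c', 'ab-c', 'a-bc']
print(drop_all_noise("a-b\\c"))  # abc
```

Deleting one character per child keeps candidates in which a hyphen or similar symbol is legitimate, leaving the choice to the smoothness score.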

Step S13) further includes: scoring the smoothness of each child node text and, according to the resulting smoothness scores, deleting the M lowest-scoring child node texts. For example, M is 5. Constructing the complete variant reduction tree directly would produce a very large number of nodes, so a suitable pruning operation, such as smoothness scoring, can be performed during actual execution to discard some nodes and narrow the search range, thereby improving system performance.
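The pruning step can be sketched as a sort followed by truncation. The function name `prune`, the choice to always keep at least one node, and the use of string length as a stand-in score are assumptions for illustration only.

```python
from typing import Callable, List, Sequence


def prune(children: Sequence[str], score: Callable[[str], float], m: int) -> List[str]:
    """Drop the M lowest-scoring child node texts, always keeping at least one."""
    if m <= 0:
        return list(children)
    ranked = sorted(children, key=score, reverse=True)
    keep = max(len(ranked) - m, 1)
    return ranked[:keep]


# String length is only a placeholder for a real smoothness score.
survivors = prune(["a", "bbb", "cc", "dddd"], len, 2)
print(survivors)  # ['dddd', 'bbb']
```

Applying this after each expansion bounds the width of every layer, at the cost of possibly discarding a branch that would have led to the best final text.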

FIG. 2 is a flowchart of a text anti-cheating method according to an embodiment of the present invention. A second aspect of the present invention further provides a text anti-cheating method, which includes:

S1) reducing the variants in the text by any one of the methods described above to obtain a restored text;

S2) performing keyword matching or model recognition on the text restored in step S1);

S3) labeling the text whose keywords are successfully matched in step S2) as cheating text, or labeling the text identified by the model as cheating text.
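Steps S1) to S3) can be sketched as a small pipeline for the keyword-matching branch. The `toy_restore` function is a deliberately trivial stand-in for the variant reduction method, and the keyword list is hypothetical.

```python
from typing import Callable, Iterable


def is_cheating(text: str, restore: Callable[[str], str], keywords: Iterable[str]) -> bool:
    """S1: restore the text; S2: match keywords; S3: report whether it is cheating text."""
    restored = restore(text)
    return any(keyword in restored for keyword in keywords)


def toy_restore(text: str) -> str:
    """Placeholder restorer that only strips one interference character."""
    return text.replace("*", "")


KEYWORDS = ["add me"]  # hypothetical keyword dictionary

raw = "please a*dd m*e now"
matched_without_restore = any(keyword in raw for keyword in KEYWORDS)
matched_with_restore = is_cheating(raw, toy_restore, KEYWORDS)
print(matched_without_restore, matched_with_restore)  # False True
```

The same restored text could instead be passed to a classification model; only the matching step in `is_cheating` would change.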

The third aspect of the present invention also provides a text anti-cheating variant reduction apparatus, comprising:

at least one processor;

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method as described above by executing the instructions stored by the memory.

The fourth aspect of the present invention also provides a text anti-cheating apparatus, comprising:

at least one processor;

a memory coupled to the at least one processor;

The fifth aspect of the invention also provides a machine-readable storage medium having stored thereon instructions which, when executed by a controller, enable the controller to perform a method as described above.

According to the above technical scheme, the root node text is expanded step by step into a plurality of child node texts through a plurality of mapping relations; the child node texts are scored for smoothness and some of them are deleted; and the text with the highest smoothness is finally obtained as the restored text. Keyword matching or model recognition can then be performed on the restored text, ultimately achieving the aim of deleting text that matches keywords or is identified by the model.

The alternative embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the embodiments of the present invention are not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the embodiments of the present invention within the scope of the technical concept of the embodiments of the present invention, and all the simple modifications belong to the protection scope of the embodiments of the present invention.

In addition, the specific features described in the above embodiments may be combined in any suitable manner without contradiction. In order to avoid unnecessary repetition, the various possible combinations of embodiments of the invention are not described in detail.

Those skilled in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program stored in a storage medium, the program including several instructions for causing a single-chip microcomputer, chip, or processor to perform all or part of the steps of the methods according to the embodiments of the invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In addition, any combination of the various embodiments of the present invention may be made, so long as it does not deviate from the idea of the embodiments of the present invention, and it should also be regarded as what is disclosed in the embodiments of the present invention.

Claims

1. A text anti-cheating variant reduction method, the method comprising:

S11) taking the text as a root node text;

S12) directly taking the root node text as a child node text, and simultaneously expanding each character in the root node text according to each mapping in one of N mapping relations to generate the child node text, wherein the N mapping relations comprise a shape-near word mapping, a homophone word mapping, a harmonic word mapping and an interference character mapping;

S13) taking each child node text as a new child node text, and expanding each character in each child node text according to each mapping in one mapping relation in the non-adopted mapping relation to generate a new child node text;

S14) repeating the step S13) until each mapping in each mapping relation of all characters in all child node texts is traversed;

S15) carrying out smoothness scoring on all the child node texts generated by the last expansion in the traversed child node texts, and selecting the child node text with the highest smoothness score as a restored text.

2. The text anti-cheating variant reduction method according to claim 1, wherein the method steps of the form-near word mapping relation expansion are as follows:

3. The text anti-cheating variant reduction method according to claim 1, wherein the method steps of the form-near word mapping relation expansion are as follows:

4. The text anti-cheating variant reduction method according to claim 1, wherein the method steps of homophone mapping relation expansion are as follows:

5. The text anti-cheating variant reduction method according to claim 1, wherein the method steps of the harmonic word mapping relation expansion are as follows:

6. The text anti-cheating variant reduction method according to claim 1, wherein the method step of expanding the interference character mapping relation is as follows:

7. The text anti-cheating variant reduction method according to claim 1, wherein step S13) further comprises: scoring the smoothness of each child node text and, according to the resulting smoothness scores, deleting the M lowest-scoring child node texts.

8. A method of text anti-cheating, the method comprising:

S1) reducing the variants in the text by the method of any one of claims 1-7 to obtain reduced text;

9. A text anti-cheating variant reduction apparatus, comprising:

at least one processor;

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method of any one of claims 1-7 by executing the instructions stored by the memory.

10. A text anti-cheating device, comprising:

at least one processor;

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method of claim 8 by executing the instructions stored by the memory.

11. A machine-readable storage medium having stored thereon instructions which, when executed by a controller, cause the controller to perform the method of any of claims 1-7.