CN108804413B - Text cheating identification method and device - Google Patents

Text cheating identification method and device

Publication number
CN108804413B
Authority
CN
China
Prior art keywords
carrier
text
cheating
soft
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810398470.6A
Other languages
Chinese (zh)
Other versions
CN108804413A (en
Inventor
覃丕七
余义祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd filed Critical Baidu Online Network Technology Beijing Co Ltd
Priority to CN201810398470.6A priority Critical patent/CN108804413B/en
Publication of CN108804413A publication Critical patent/CN108804413A/en
Application granted granted Critical
Publication of CN108804413B publication Critical patent/CN108804413B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text cheating identification method and device. The method comprises the following steps: extracting a plurality of suspected carrier words according to a current text and a plurality of historical texts submitted by a user; performing carrier identification on the plurality of suspected carrier words, and identifying the most suspected carrier word as the soft text carrier word; and performing soft text cheating identification on the current text and the plurality of historical texts containing the soft text carrier word, to identify whether the current text is soft text and/or whether the user is a soft text cheating user. Because soft text cheating identification is based on multiple corpora (the current text and a plurality of historical texts), that is, the user's historical texts are consulted as corroborating evidence, the soft text recognition rate and recognition precision are improved.

Description

Text cheating identification method and device
Technical Field
The invention relates to the technical field of computers, in particular to a text cheating identification method and device.
Background
With the continuous development of the internet, the number of internet users increases year by year, bringing traffic dividends of various forms to the major internet companies. However, behind the thriving internet market, a parasitic market for cheating promotion has grown: to promote certain goods or services, various cheating promotion posts (namely, soft texts or soft advertisements) are published under product lines such as communities and feed streams. These posts seriously harm the user experience of the products and, to a certain extent, divert potential advertisers for free, causing the companies to lose revenue.
In the prior art, cheating promotion posts are identified as follows: based on samples labeled as soft text and non-soft text, a binary classification model is built using machine learning techniques such as logistic regression and Support Vector Machine (SVM); the binary classification model performs soft text prediction on a post newly submitted by the user, and whether the post is soft text is determined from the output probability.
However, the prior art has at least the following defect: the corpus used for model training and prediction is a single corpus (a single text), so the recognition rate and recognition precision of the model, and hence of soft text recognition, are low.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the first objective of the present invention is to provide a text cheating recognition method to improve the soft text recognition rate and recognition accuracy.
A second object of the present invention is to provide a device for recognizing text cheating.
A third object of the invention is to propose a computer device.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the invention is to propose a computer program product.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a text cheating identification method, including:
extracting a plurality of suspected carrier words according to a current text and a plurality of historical texts submitted by a user;
carrying out carrier identification on the plurality of suspected carrier words, and identifying the most suspected carrier words as soft text carrier words;
and performing soft text cheating identification on the current text and the plurality of historical texts containing the soft text carrier words, and identifying whether the current text is soft text and/or whether the user is a soft text cheating user.
According to the text cheating identification method provided by the embodiment of the invention, a plurality of suspected carrier words are extracted according to a current text and a plurality of historical texts submitted by a user, the most suspected carrier word is identified as the soft text carrier word, and soft text cheating identification is performed on the current text and the plurality of historical texts containing the soft text carrier word, to identify whether the current text is soft text and/or whether the user is a soft text cheating user. Because soft text cheating identification is based on multiple corpora (the current text and a plurality of historical texts), that is, the user's historical texts are consulted as corroborating evidence, the soft text recognition rate and recognition precision are improved.
In order to achieve the above object, an embodiment of a second aspect of the present invention provides a text cheating recognition apparatus, including:
the extraction module is used for extracting a plurality of suspected carrier words according to a current text and a plurality of historical texts submitted by a user;
the carrier identification module is used for carrying out carrier identification on the plurality of suspected carrier words and identifying the most suspected carrier words as soft text carrier words;
and the cheating identification module is used for carrying out soft text cheating identification on the current text and the plurality of historical texts containing the soft text carrier words and identifying whether the current text is soft text and/or whether the user is a soft text cheating user.
According to the text cheating recognition device provided by the embodiment of the invention, a plurality of suspected carrier words are extracted according to the current text and a plurality of historical texts submitted by a user, the most suspected carrier word is identified as the soft text carrier word, and soft text cheating identification is performed on the current text and the plurality of historical texts containing the soft text carrier word, to identify whether the current text is soft text and/or whether the user is a soft text cheating user. Because soft text cheating identification is based on multiple corpora (the current text and a plurality of historical texts), that is, the user's historical texts are consulted as corroborating evidence, the soft text recognition rate and recognition precision are improved.
In order to achieve the above object, an embodiment of a third aspect of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the text cheating recognition method according to the embodiment of the first aspect of the present invention.
To achieve the above object, an embodiment of a fourth aspect of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the text cheating recognition method according to the embodiment of the first aspect of the present invention.
To achieve the above object, an embodiment of a fifth aspect of the present invention provides a computer program product, where instructions of the computer program product, when executed by a processor, perform the text cheating identification method according to the embodiment of the first aspect of the present invention.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a text cheating identification method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of another text cheating recognition method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an apparatus for identifying text cheating according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of another text cheating recognition apparatus according to an embodiment of the present invention; and
fig. 5 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
The text cheating recognition method and apparatus according to the embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a text cheating recognition method according to an embodiment of the present invention. As shown in fig. 1, the method for identifying text cheating includes the following steps:
S101, extracting a plurality of suspected carrier words according to a current text and a plurality of historical texts submitted by a user.
Specifically, the current text is the text newly submitted by the user, and the historical texts are texts the user submitted in the past. A suspected carrier word is a word that appears in the current text and whose occurrence frequency across the current text and the plurality of historical texts is greater than a set frequency threshold, or whose number of occurrences across those texts is greater than a set count threshold. The occurrence frequency being greater than the set frequency threshold means that the ratio of the number of texts containing the word to the total number of texts exceeds a set ratio threshold. For example, if there are 10 texts and 8 of them contain a word, the ratio is 8/10, which exceeds a set ratio threshold of 30%, so the word is determined to be a suspected carrier word. The number of occurrences being greater than the set count threshold means that the number of texts containing the word exceeds the set count threshold. For example, if there are 10 texts and 8 of them contain a word, the count of 8 exceeds a set count threshold of 5, so the word is determined to be a suspected carrier word. In summary, a word is determined to be a suspected carrier word as long as it appears in the current text and satisfies either of two conditions: its occurrence frequency is greater than the set frequency threshold, or its number of occurrences is greater than the set count threshold.
A cheating user usually submits a plurality of soft text cheating posts with different descriptions, but the carrier word being promoted does not change; that is, the carrier word co-occurs across the texts. Therefore, co-occurring suspected carrier words can be extracted from the current text and the plurality of historical texts submitted by the user, and one or more suspected carrier words may be extracted. For example, the current text and the plurality of historical texts submitted by the user are as follows:
Text 1: Let me speak from experience: ABCD's belly-support pantyhose are great, and I wore them for about half a year.
Text 2: EF and ABCD are both fine, but my friends and I wear ABCD, and the pure cotton is comfortable!
Text 3: Many people wear ABCD's pelvic belt all the time; it feels quite comfortable. Please like, thank you.
Text 4: ABCD's waistband is well worth the money. Hope you adopt this answer, thanks!
Extracting suspected carrier words from these 4 texts finally yields the suspected carrier word ABCD.
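The extraction rule of step S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the regex tokenizer, and the shortened sample texts are assumptions made for the example (a production system would use a proper word segmenter), while the 30% ratio threshold and the count threshold of 5 are the example values given above.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: the set of word-like tokens in a text."""
    return set(re.findall(r"\w+", text))


def extract_suspected_carrier_words(current_text, historical_texts,
                                    ratio_threshold=0.3, count_threshold=5):
    """Return words of the current text that recur across the user's texts.

    A word qualifies if it appears in the current text and either the share
    of texts containing it exceeds ratio_threshold or the number of texts
    containing it exceeds count_threshold.
    """
    texts = [current_text] + list(historical_texts)
    doc_count = Counter()
    for text in texts:
        doc_count.update(tokenize(text))  # each word counted once per text

    suspects = []
    for word in sorted(tokenize(current_text)):
        n = doc_count[word]
        if n / len(texts) > ratio_threshold or n > count_threshold:
            suspects.append(word)
    return suspects


# Shortened stand-ins for the four example texts.
current = "ABCD pantyhose are great"
history = [
    "EF and ABCD both work",
    "ABCD pelvic belt feels comfortable",
    "ABCD waistband is worth it",
]
print(extract_suspected_carrier_words(current, history))
```

Only "ABCD" appears in more than 30% of the four texts, so it is the only word returned for this toy input.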
S102, carrying out carrier identification on a plurality of suspected carrier words, and identifying the most suspected carrier words as soft text carrier words.
Specifically, carrier identification may be performed on the plurality of suspected carrier words extracted in step S101 using a specific carrier identification formula or a carrier identification model obtained through machine learning training, and the most suspected carrier word is identified as the soft text carrier word.
S103, performing soft text cheating identification on the current text and a plurality of historical texts containing soft text carrier words, and identifying whether the current text is soft text and/or whether the user is a soft text cheating user.
Specifically, soft text cheating identification may be performed on the current text and the plurality of historical texts containing the soft text carrier word identified in step S102, using a specific soft text cheating identification formula or a soft text cheating identification model obtained through machine learning training, to identify whether the current text is soft text and/or whether the user is a soft text cheating user. This helps filter out low-quality and prohibited submissions, which improves user experience and reduces the risk of consumers being misled by enterprises; it also squeezes the market for free advertising, bringing more potential commercial value to the product line.
In this embodiment, a plurality of suspected carrier words are extracted according to a current text and a plurality of historical texts submitted by a user, the most suspected carrier word is identified as the soft text carrier word, and soft text cheating identification is performed on the current text and the plurality of historical texts containing the soft text carrier word, to identify whether the current text is soft text and/or whether the user is a soft text cheating user. Because soft text cheating identification is based on multiple corpora (the current text and a plurality of historical texts), that is, the user's historical texts are consulted as corroborating evidence, the soft text recognition rate and recognition precision are improved.
To clearly illustrate the above embodiment, this embodiment provides another text cheating identification method. Fig. 2 is a schematic flowchart of this other text cheating identification method provided in the embodiment of the present invention. As shown in fig. 2, the method for identifying text cheating includes the following steps:
Step S101 in the previous embodiment specifically includes the following step S201.
S201, comparing the similarity of the current text submitted by the user with the plurality of historical texts, and extracting words whose occurrence frequency is greater than a set frequency threshold or whose number of occurrences is greater than a set count threshold as suspected carrier words.
Specifically, similarity comparison may be performed between the current text submitted by the user and the plurality of historical texts, so as to extract words whose occurrence frequency is greater than the set frequency threshold or whose number of occurrences is greater than the set count threshold as the suspected carrier words.
One or more suspected carrier words may be extracted. To further reduce interference and improve carrier identification precision, after the words whose occurrence frequency is greater than the set frequency threshold or whose number of occurrences is greater than the set count threshold are extracted, these words may be filtered against a preset high-frequency non-carrier anti-word list, and only the words that do not match the list are retained as suspected carrier words. The high-frequency non-carrier anti-word list contains words that occur with high frequency but are not soft text carrier words used for cheating promotion, such as 'hello', 'our', and 'adoption'.
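The filtering step can be illustrated with a short sketch. The list below is seeded only with the three example words named in the text; the function name, the constant name, and the case-insensitive match are assumptions made for illustration.

```python
# Hypothetical anti-word list, seeded with the examples named in the text.
HIGH_FREQ_NON_CARRIER_WORDS = frozenset({"hello", "our", "adoption"})


def filter_suspected_carrier_words(candidates,
                                   anti_words=HIGH_FREQ_NON_CARRIER_WORDS):
    """Keep only candidates that do not match the anti-word list.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    return [word for word in candidates if word.lower() not in anti_words]


print(filter_suspected_carrier_words(["ABCD", "hello", "Our", "EF"]))
```

Keeping the list as a set makes each lookup constant-time, which matters when the list of high-frequency words is large.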
Step S102 in the previous embodiment may specifically include the following steps S202 and S203.
S202, performing carrier identification on the plurality of suspected carrier words using a carrier identification binary classification model.
Specifically, an algorithm such as a Long Short-Term Memory (LSTM) network may be adopted to obtain, through machine learning training, a carrier identification binary classification model, which is used to perform carrier identification on the plurality of suspected carrier words extracted in step S201; that is, the suspected carrier words extracted in step S201 are input to the carrier identification binary classification model. The carrier identification binary classification model may be trained as follows: construct a plurality of carrier corpora and a plurality of non-carrier corpora, and train the carrier identification binary classification model on them.
The carrier corpora may be a plurality of soft text carrier word samples produced during historical manual corpus labeling and/or historical soft text carrier word identification. Soft text carrier words resemble names: they often pursue individuality, so their characters lack correlation with one another (they appear unordered), and when the words are decomposed into single characters, many of those characters occur with high frequency. Therefore, the carrier corpora may also be constructed by decomposing the soft text carrier word samples produced during historical manual corpus labeling and/or historical soft text carrier word identification into single characters, yielding carrier corpora of different lengths and/or with different single characters. The length of a carrier corpus may be generated with a probability that follows the length distribution of the soft text carrier words produced during historical manual corpus labeling and/or historical soft text carrier word identification. The non-carrier corpora may be derived from unrelated high-frequency words obtained by on-site word segmentation and from natural-order high-frequency phrases (such as 'hello') extracted from texts, and a non-carrier corpus may be 2 to 4 single characters long.
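Generating a corpus length "with a probability that follows the length distribution" can be read as sampling from the empirical distribution of sample lengths. The sketch below takes that reading; the function name and the optional random-number-generator parameter are assumptions made for illustration.

```python
import random
from collections import Counter


def sample_corpus_length(carrier_word_samples, rng=random):
    """Draw a length with probability proportional to how often that length
    occurs among the known soft text carrier word samples."""
    length_counts = Counter(len(word) for word in carrier_word_samples)
    lengths = list(length_counts.keys())
    weights = list(length_counts.values())
    return rng.choices(lengths, weights=weights, k=1)[0]


# Two samples of length 4 and one of length 2: length 4 is drawn about
# two thirds of the time.
print(sample_corpus_length(["ABCD", "EFGH", "XY"], random.Random(0)))
```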
As one feasible implementation, the soft text carrier word samples produced during historical manual corpus labeling and/or historical soft text carrier word identification may be decomposed into single characters to construct carrier corpora as follows: randomly shuffle the order of the single characters of a soft text carrier word sample to generate a carrier corpus. For example, randomly shuffling the single characters of the soft text carrier word sample "ABCD" may generate carrier corpora such as "DCBA", "CADB", and "ABCD".
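A minimal sketch of this shuffling approach, using the standard library; the function name and the optional generator parameter (useful for reproducible runs) are assumptions.

```python
import random


def shuffle_carrier_word(sample, rng=random):
    """Generate one carrier corpus by shuffling the characters of a sample.

    The result is a permutation of the sample, so the original order
    (for example "ABCD") can also come out, as the text notes.
    """
    chars = list(sample)
    rng.shuffle(chars)
    return "".join(chars)


print(shuffle_carrier_word("ABCD", random.Random(1)))
```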
As another possible implementation, the soft text carrier word samples produced during historical manual corpus labeling and/or historical soft text carrier word identification may be decomposed into single characters to construct carrier corpora as follows: randomly select a set number of soft text carrier word samples from the plurality of samples, and randomly extract one single character from each selected sample to combine into a carrier corpus. For example, to construct a carrier corpus of 4 single characters, 4 soft text carrier word samples may be randomly selected from the samples produced during historical manual corpus labeling and/or historical soft text carrier word identification, and one single character is randomly extracted from each of the 4 samples and combined into a carrier corpus of 4 single characters.
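The second construction can be sketched in the same style. The function name is hypothetical, and selecting the samples without replacement is an assumption of this sketch; the text only says the samples are selected randomly.

```python
import random


def combine_carrier_corpus(samples, length=4, rng=random):
    """Build one carrier corpus of `length` characters.

    Picks `length` distinct samples (requires length <= len(samples)) and
    takes one random character from each.
    """
    chosen = rng.sample(samples, length)
    return "".join(rng.choice(word) for word in chosen)


samples = ["ABCD", "EFGH", "IJKL", "MNOP", "QRST"]
print(combine_carrier_corpus(samples, 4, random.Random(0)))
```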
S203, determining the suspected carrier word whose probability output by the carrier identification binary classification model is the highest and is greater than a first set probability threshold as the soft text carrier word.
Specifically, the output of the carrier identification binary classification model is a prediction of the probability that an input suspected carrier word is a soft text carrier word. The model outputs a corresponding probability for each input suspected carrier word, and the suspected carrier word with the highest probability is determined to be the soft text carrier word if that probability is greater than the first set probability threshold. If the highest probability is equal to or less than the first set probability threshold, no suspected carrier word is determined to be a soft text carrier word, and the process ends.
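The selection rule of step S203 amounts to an arg-max followed by a strict threshold test. In the sketch below, the function name, the dictionary input format, and the threshold value 0.5 are assumptions; the probabilities stand in for the model's outputs.

```python
def select_soft_text_carrier_word(probabilities, first_threshold=0.5):
    """Return the suspected carrier word with the highest probability if it
    is strictly greater than the threshold; otherwise return None, meaning
    no soft text carrier word was found and the process ends."""
    if not probabilities:
        return None
    word, prob = max(probabilities.items(), key=lambda item: item[1])
    return word if prob > first_threshold else None


# Hypothetical model outputs for two suspected carrier words.
print(select_soft_text_carrier_word({"ABCD": 0.93, "EF": 0.41}))
```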
Step S103 in the previous embodiment may specifically include the following steps S204 and S205.
S204, performing soft text cheating prediction on the current text and the plurality of historical texts containing the soft text carrier word using a soft text cheating prediction model.
Specifically, an algorithm such as a multi-text-input Long Short-Term Memory (LSTM) network may be adopted to obtain, through machine learning training, a soft text cheating prediction model, which is used to perform soft text cheating prediction on the current text submitted by the user and the plurality of historical texts containing the soft text carrier word determined in step S203; that is, the current text and those historical texts are input to the soft text cheating prediction model simultaneously. For example, if the soft text cheating prediction model supports the simultaneous input of 5 texts, the current text and the 4 most recently submitted historical texts containing the soft text carrier word determined in step S203 are input simultaneously. If the user has submitted fewer than 4 historical texts containing the soft text carrier word, for example only 1, the 5 inputs are filled with copies of the current text and of that historical text; that is, the current text, the historical text, a copy of the current text, a copy of the historical text, and another copy of the current text are input to the soft text cheating prediction model simultaneously. The soft text cheating prediction model may be trained as follows: train the model on a plurality of manually labeled cheating corpora and non-cheating corpora, where a cheating corpus is a corpus manually labeled as soft text cheating and a non-cheating corpus is a corpus manually labeled as not soft text cheating.
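The input-filling rule in the example (current text, historical text, copy of current, copy of historical, copy of current) is what results from cycling through the available texts until all slots are full. The sketch below takes that reading; the function name is hypothetical, the history is assumed to be ordered most recent first, and the handling of an empty history is an assumption the text does not address.

```python
from itertools import cycle, islice


def build_model_inputs(current_text, matching_history, slots=5):
    """Fill the model's input slots.

    `matching_history` holds the historical texts that contain the soft text
    carrier word, most recent first. At most slots - 1 of them are used; if
    there are fewer, the available texts are repeated in order.
    """
    recent = list(matching_history)[: slots - 1]
    return list(islice(cycle([current_text] + recent), slots))


print(build_model_inputs("current", ["hist1"]))
# ['current', 'hist1', 'current', 'hist1', 'current']
```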
Because a corpus that is too long or too short reduces the soft text cheating recognition rate and recognition precision, during training of the soft text cheating prediction model, carrier sequence interception (that is, extracting a sequence around the carrier word) may be performed on the manually labeled cheating corpora and non-cheating corpora to obtain a plurality of cheating corpus sequences and non-cheating corpus sequences with a set number of characters (for example, 30 to 40 characters); the soft text cheating prediction model is then trained on these sequences. Similarly, during prediction, carrier sequence interception may be performed on the current text and the plurality of historical texts containing the soft text carrier word to obtain a current text sequence and a plurality of historical text sequences containing the soft text carrier word, each with the set number of characters (for example, 30 to 40 characters); the soft text cheating prediction model then performs soft text cheating prediction on the current text sequence and the historical text sequences.
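One way to realize carrier sequence interception is a fixed-size character window placed around the first occurrence of the carrier word. This is a sketch under that assumption: the function name, the roughly centered window, and the fallback for texts without the carrier word are all illustrative choices. The 40-character limit is the upper bound of the example range above; texts shorter than the window are returned whole, so they may fall below the lower bound.

```python
def intercept_carrier_sequence(text, carrier_word, max_chars=40):
    """Return up to max_chars characters of text around the carrier word."""
    pos = text.find(carrier_word)
    if pos < 0:
        # Assumption: without the carrier word, keep the head of the text.
        return text[:max_chars]
    # Split the remaining budget roughly evenly on both sides of the word.
    extra = max(max_chars - len(carrier_word), 0)
    start = max(pos - extra // 2, 0)
    end = min(start + max_chars, len(text))
    # If the window hit the end of the text, shift it left to keep its size.
    start = max(end - max_chars, 0)
    return text[start:end]


long_text = "x" * 50 + "ABCD" + "y" * 50
window = intercept_carrier_sequence(long_text, "ABCD")
print(len(window), "ABCD" in window)
```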
S205, if the probability output by the soft text cheating prediction model is greater than a second set probability threshold, determining that the current text is soft text and/or that the user is a soft text cheating user.
Specifically, the output of the soft text cheating prediction model is a prediction of the probability that the input current text and historical texts containing the soft text carrier word constitute soft text cheating. If this probability is greater than the second set probability threshold, the current text is determined to be soft text and/or the user is determined to be a soft text cheating user. This helps filter out low-quality and prohibited submissions, which improves user experience and reduces the risk of consumers being misled by enterprises; it also squeezes the market for free advertising, bringing more potential commercial value to the product line.
In this embodiment, a plurality of suspected carrier words are extracted according to a current text and a plurality of historical texts submitted by a user, the most suspected carrier word is identified as the soft text carrier word, and soft text cheating identification is performed on the current text and the plurality of historical texts containing the soft text carrier word, to identify whether the current text is soft text and/or whether the user is a soft text cheating user. Because soft text cheating identification is based on multiple corpora (the current text and a plurality of historical texts), that is, the user's historical texts are consulted as corroborating evidence, the soft text recognition rate and recognition precision are improved.
In order to implement the above embodiment, the present invention further provides a device for identifying text cheating. Fig. 3 is a schematic structural diagram of an apparatus for recognizing text cheating according to an embodiment of the present invention. As shown in fig. 3, the text cheating recognition apparatus includes: an extraction module 51, a carrier recognition module 52 and a cheating recognition module 53.
The extracting module 51 is configured to extract a plurality of suspected carrier words according to a current text and a plurality of historical texts submitted by a user.
The carrier identification module 52 is configured to perform carrier identification on the plurality of suspected carrier words, and identify the most suspected carrier word as the soft text carrier word.
The cheating identification module 53 is configured to perform soft text cheating identification on the current text and the plurality of historical texts containing the soft text carrier word, and identify whether the current text is soft text and/or whether the user is a soft text cheating user.
It should be noted that the explanation of the embodiment of the text cheating identification method is also applicable to the text cheating identification apparatus of the embodiment, and details are not repeated here.
In this embodiment, a plurality of suspected carrier words are extracted from a current text and a plurality of historical texts submitted by a user, the most suspected carrier word is identified as the soft text carrier word, and soft text cheating identification is performed on the current text and the historical texts containing the soft text carrier word, so as to identify whether the current text is soft text and/or whether the user is a soft text cheating user. Because soft text cheating identification is based on multiple corpora (the current text and a plurality of historical texts), that is, the user's historical texts serve as corroborating evidence, the soft text identification rate and identification precision are improved.
Based on the above embodiment, the embodiment of the present invention further provides a possible implementation manner of the text cheating recognition apparatus. Fig. 4 is a schematic structural diagram of another text cheating recognition apparatus according to an embodiment of the present invention. As shown in fig. 4, on the basis of the previous embodiment, the extracting module 51 may be specifically configured to:
compare the current text submitted by the user with the plurality of historical texts for similarity, and extract, as suspected carrier words, words whose number of occurrences is greater than a set count threshold or whose occurrence frequency is greater than a set frequency threshold.
Further, in a possible implementation manner of the embodiment of the present invention, the extracting module 51 may further be configured to: filter the words whose number of occurrences is greater than the set count threshold or whose occurrence frequency is greater than the set frequency threshold according to a preset high-frequency non-carrier reverse word list, and retain the words that do not match the high-frequency non-carrier reverse word list as suspected carrier words.
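The extraction and filtering steps above might be sketched as follows. This is a minimal illustration under stated assumptions: texts are already tokenized, an "occurrence" is counted once per text, and the function name, parameter names, and default thresholds are hypothetical rather than taken from the embodiment.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def extract_suspected_carrier_words(
    current_tokens: Sequence[str],
    history_tokens: Iterable[Sequence[str]],
    count_threshold: int = 2,
    freq_threshold: float = 0.6,
    reverse_word_list: frozenset = frozenset(),
) -> List[str]:
    """Return recurring words that are not on the reverse word list."""
    texts = [current_tokens, *history_tokens]
    # Assumption: count the number of texts in which each word appears.
    text_counts = Counter()
    for tokens in texts:
        text_counts.update(set(tokens))
    suspects = []
    for word, count in text_counts.items():
        # Keep a word if either the count or the frequency threshold is exceeded.
        recurring = count > count_threshold or count / len(texts) > freq_threshold
        # Drop high-frequency words known not to be carriers.
        if recurring and word not in reverse_word_list:
            suspects.append(word)
    return sorted(suspects)
```

Passing a non-empty `reverse_word_list` corresponds to the optional filtering step; leaving it empty corresponds to extraction alone.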
Further, in a possible implementation manner of the embodiment of the present invention, the carrier identification module 52 may specifically include: an identification unit 521 and a first determination unit 522.
The identifying unit 521 is configured to perform carrier identification on the plurality of suspected carrier words by using a carrier recognition binary classification model.
The first determining unit 522 is configured to determine, as the soft text carrier word, the suspected carrier word whose probability output by the carrier recognition binary classification model is the highest and is greater than a first set probability threshold.
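The selection rule applied by the first determining unit 522 can be illustrated with a short sketch. The scoring function stands in for the carrier recognition binary classification model, whose architecture is not specified here; `score_fn`, the function name, and the default threshold are assumptions.

```python
from typing import Callable, Optional, Sequence


def select_soft_text_carrier_word(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
    first_threshold: float = 0.5,
) -> Optional[str]:
    """Pick the highest-scoring candidate, if it clears the threshold."""
    if not candidates:
        return None
    # score_fn is a placeholder for the model's carrier probability.
    scored = [(score_fn(word), word) for word in candidates]
    best_prob, best_word = max(scored, key=lambda pair: pair[0])
    # Both conditions must hold: highest probability AND above the threshold.
    return best_word if best_prob > first_threshold else None
```

Returning `None` when no candidate clears the threshold is one possible convention for "no soft text carrier word found".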
Further, in a possible implementation manner of the embodiment of the present invention, the carrier identification module 52 may further include: a construction unit 523 and a first training unit 524.
The constructing unit 523 is configured to construct a plurality of carrier corpora and a plurality of non-carrier corpora.
The first training unit 524 is configured to train to obtain the carrier recognition binary classification model according to the plurality of carrier corpora and the plurality of non-carrier corpora.
Further, in a possible implementation manner of the embodiment of the present invention, the constructing unit 523 may be specifically configured to: perform single-character disassembly on a plurality of soft text carrier word samples generated in the historical manual corpus labeling process and/or the historical soft text carrier word identification process, so as to construct a plurality of carrier corpora with different lengths and/or different single characters.
Further, in a possible implementation manner of the embodiment of the present invention, the constructing unit 523 may be specifically configured to: randomly shuffle the order of the single characters of the soft text carrier word samples to generate carrier corpora; and/or randomly extract a set number of soft text carrier word samples from the plurality of soft text carrier word samples, and randomly extract a single character from each extracted soft text carrier word sample to combine into a carrier corpus.
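The two corpus construction strategies described above (shuffling the characters of one sample, and combining one random character from each of several samples) could look like the sketch below. The function names, the seeded `random.Random` instance, and the parameters `n_shuffled`, `n_combined`, and `k` are illustrative assumptions.

```python
import random
from typing import List, Sequence


def shuffle_sample(sample: str, rng: random.Random) -> str:
    """Randomly reorder the single characters of one carrier word sample."""
    chars = list(sample)
    rng.shuffle(chars)
    return "".join(chars)


def combine_samples(samples: Sequence[str], k: int, rng: random.Random) -> str:
    """Draw k samples, then one random character from each, and join them."""
    chosen = rng.sample(list(samples), k)
    return "".join(rng.choice(s) for s in chosen)


def build_carrier_corpora(
    samples: Sequence[str], n_shuffled: int, n_combined: int, k: int, seed: int = 0
) -> List[str]:
    """Generate synthetic carrier corpora from non-empty labeled samples."""
    rng = random.Random(seed)
    corpora = [shuffle_sample(rng.choice(list(samples)), rng) for _ in range(n_shuffled)]
    corpora += [combine_samples(samples, k, rng) for _ in range(n_combined)]
    return corpora
```

Varying `k` yields corpora of different lengths, while the random character choice yields corpora with different single characters.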
Further, in a possible implementation manner of the embodiment of the present invention, the cheating identifying module 53 may specifically include: a prediction unit 531 and a second determination unit 532.
The prediction unit 531 is configured to perform soft text cheating prediction on the current text and a plurality of historical texts containing the soft text carrier word by using a soft text cheating prediction model.
The second determining unit 532 is configured to determine that the current text is a soft text and/or the user is a soft text cheating user if the probability output by the soft text cheating prediction model is greater than a second set probability threshold.
Further, in a possible implementation manner of the embodiment of the present invention, the cheating identifying module 53 further includes a second training unit 533, which is configured to train to obtain the soft text cheating prediction model according to a plurality of manually labeled cheating corpora and a plurality of manually labeled non-cheating corpora.
Further, in a possible implementation manner of the embodiment of the present invention, the second training unit 533 may be specifically configured to: perform carrier sequence interception on the plurality of manually labeled cheating corpora and the plurality of manually labeled non-cheating corpora to obtain a plurality of cheating corpus sequences and a plurality of non-cheating corpus sequences, each with a set number of characters; and train to obtain the soft text cheating prediction model according to the plurality of cheating corpus sequences and the plurality of non-cheating corpus sequences.
Further, in a possible implementation manner of the embodiment of the present invention, the prediction unit 531 may be specifically configured to: perform carrier sequence interception on the current text and the plurality of historical texts containing the soft text carrier word to obtain a current text sequence with the set number of characters and a plurality of historical text sequences containing the soft text carrier word; and perform soft text cheating prediction on the current text sequence and the plurality of historical text sequences containing the soft text carrier word by using the soft text cheating prediction model.
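Carrier sequence interception followed by thresholded prediction might be sketched as below. The windowing rule (a fixed-length window roughly centered on the first occurrence of the carrier word), the `predict_proba` placeholder for the soft text cheating prediction model, and all names and defaults are assumptions; the embodiment only states that sequences with a set number of characters are obtained.

```python
from typing import Callable, List, Sequence


def intercept_carrier_sequence(text: str, carrier_word: str, length: int) -> str:
    """Cut a window of at most `length` characters around the carrier word."""
    pos = text.find(carrier_word)
    if pos < 0:
        # Assumption: fall back to the leading characters if the word is absent.
        return text[:length]
    center = pos + len(carrier_word) // 2
    start = max(0, center - length // 2)
    # Shift the window left if it would run past the end of the text.
    start = min(start, max(0, len(text) - length))
    return text[start:start + length]


def predict_soft_text(
    current_text: str,
    history_texts: Sequence[str],
    carrier_word: str,
    predict_proba: Callable[[List[str]], float],
    length: int = 20,
    second_threshold: float = 0.5,
) -> bool:
    """Flag soft text when the model probability exceeds the threshold."""
    relevant = [t for t in history_texts if carrier_word in t]
    sequences = [intercept_carrier_sequence(current_text, carrier_word, length)]
    sequences += [intercept_carrier_sequence(t, carrier_word, length) for t in relevant]
    # predict_proba is a placeholder for the trained prediction model.
    return predict_proba(sequences) > second_threshold
```

Using the same interception for training corpora and for prediction inputs keeps the model's input length consistent, which is the apparent purpose of the set number of characters.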
It should be noted that the explanation of the embodiment of the text cheating identification method is also applicable to the text cheating identification apparatus of the embodiment, and details are not repeated here.
In this embodiment, a plurality of suspected carrier words are extracted from a current text and a plurality of historical texts submitted by a user, the most suspected carrier word is identified as the soft text carrier word, and soft text cheating identification is performed on the current text and the historical texts containing the soft text carrier word, so as to identify whether the current text is soft text and/or whether the user is a soft text cheating user. Because soft text cheating identification is based on multiple corpora (the current text and a plurality of historical texts), that is, the user's historical texts serve as corroborating evidence, the soft text identification rate and identification precision are improved.
In order to implement the foregoing embodiments, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for recognizing text cheating is implemented as described in the foregoing embodiments.
In order to implement the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the method for recognizing text cheating as shown in the above embodiments.
In order to implement the above embodiments, the present invention further provides a computer program product, wherein when the instructions in the computer program product are executed by a processor, the method for identifying text cheating is performed as shown in the above embodiments.
FIG. 5 illustrates a block diagram of an exemplary computer device suitable for use in implementing embodiments of the present application. The computer device 12 shown in FIG. 5 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present application.
As shown in FIG. 5, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown in FIG. 5, the network adapter 20 communicates with the other modules of the computer device 12 via the bus 18. It should be appreciated that although not shown in FIG. 5, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (15)

1. A text cheating identification method is characterized by comprising the following steps:
extracting a plurality of suspected carrier words according to a current text and a plurality of historical texts submitted by a user;
carrying out carrier identification on the plurality of suspected carrier words, and identifying the most suspected carrier words as soft text carrier words;
performing soft text cheating identification on the current text and a plurality of historical texts containing the soft text carrier words, and identifying whether the current text is soft text and/or whether the user is a soft text cheating user;
wherein the carrying out carrier identification on the plurality of suspected carrier words, and identifying the most suspected carrier word as a soft text carrier word comprises:
adopting a carrier identification binary classification model to carry out carrier identification on the plurality of suspected carrier words;
determining the suspected carrier word with the highest probability output by the carrier identification binary classification model and larger than a first set probability threshold as the soft text carrier word;
wherein the adopting a carrier identification binary classification model to carry out carrier identification on the plurality of suspected carrier words further comprises:
constructing a plurality of carrier corpora and a plurality of non-carrier corpora;
training to obtain the carrier identification binary classification model according to the plurality of carrier corpora and the plurality of non-carrier corpora;
wherein the constructing the plurality of carrier corpora comprises:
performing single-character disassembly on a plurality of soft text carrier word samples generated in a historical manual corpus labeling process and/or a historical soft text carrier word identification process to construct a plurality of carrier corpora with different lengths and/or different single characters;
wherein the performing single-character disassembly on the plurality of soft text carrier word samples to construct the plurality of carrier corpora with different lengths and/or different single characters comprises: randomly shuffling the order of the single characters of the soft text carrier word samples to generate the carrier corpora; and/or randomly extracting a set number of soft text carrier word samples from the plurality of soft text carrier word samples, and randomly extracting a single character from each extracted soft text carrier word sample to combine into a carrier corpus.
2. The identification method of claim 1, wherein the extracting a plurality of suspect carrier words according to the current text and a plurality of historical texts submitted by the user comprises:
comparing the similarity of the current text submitted by the user with the plurality of historical texts, and extracting, as the suspected carrier words, words whose number of occurrences is greater than a set count threshold or whose occurrence frequency is greater than a set frequency threshold.
3. The identification method according to claim 2, further comprising:
filtering the words whose number of occurrences is greater than the set count threshold or whose occurrence frequency is greater than the set frequency threshold according to a preset high-frequency non-carrier reverse word list, and retaining the words that do not match the high-frequency non-carrier reverse word list as the suspected carrier words.
4. The recognition method according to claim 1, wherein said performing soft text cheating recognition on said current text and a plurality of said historical texts containing said soft text carrier words, and recognizing whether said current text is soft text and/or whether said user is a soft text cheating user comprises:
adopting a soft text cheating prediction model to carry out soft text cheating prediction on the current text and a plurality of historical texts containing the soft text carrier words;
and if the probability output by the soft text cheating prediction model is greater than a second set probability threshold, determining that the current text is soft text and/or the user is a soft text cheating user.
5. The recognition method of claim 4, wherein said employing a soft text cheating prediction model to perform soft text cheating prediction on said current text and a plurality of said historical texts containing said soft text carrier words further comprises:
and training to obtain the soft text cheating prediction model according to the plurality of cheating corpora and the plurality of non-cheating corpora marked manually.
6. The recognition method of claim 5, wherein the training of the soft text cheating prediction model based on the artificially labeled cheating corpora and non-cheating corpora comprises:
carrying out carrier sequence interception on the plurality of manually labeled cheating corpora and the plurality of manually labeled non-cheating corpora to obtain a plurality of cheating corpus sequences and a plurality of non-cheating corpus sequences, each with a set number of characters;
training to obtain the soft text cheating prediction model according to the plurality of cheating corpus sequences and the plurality of non-cheating corpus sequences;
the performing soft text cheating prediction on the current text and the plurality of historical texts containing the soft text carrier words by adopting a soft text cheating prediction model comprises the following steps:
carrying out carrier sequence interception on the current text and the plurality of historical texts containing the soft text carrier words to obtain the current text sequence with the set number of characters and the plurality of historical text sequences containing the soft text carrier words;
and performing soft text cheating prediction on the current text sequence and a plurality of historical text sequences containing the soft text carrier words by adopting the soft text cheating prediction model.
7. An apparatus for recognizing text cheating, comprising:
the extraction module is used for extracting a plurality of suspected carrier words according to a current text and a plurality of historical texts submitted by a user;
the carrier identification module is used for carrying out carrier identification on the plurality of suspected carrier words and identifying the most suspected carrier words as soft text carrier words;
the cheating identification module is used for carrying out soft text cheating identification on the current text and the plurality of historical texts containing the soft text carrier words and identifying whether the current text is soft text and/or whether the user is a soft text cheating user;
the carrier recognition module includes:
the identification unit is used for carrying out carrier identification on the plurality of suspected carrier words by adopting a carrier identification binary model;
the first determining unit is used for determining the suspected carrier words with the highest probability output by the carrier identification binary model and larger than a first set probability threshold as the soft text carrier words;
the carrier recognition module further includes:
the constructing unit is used for constructing a plurality of carrier corpora and a plurality of non-carrier corpora;
the first training unit is used for training to obtain the carrier identification binary model according to the plurality of carrier corpora and the plurality of non-carrier corpora;
the constructing unit is specifically configured to:
perform single-character disassembly on a plurality of soft text carrier word samples generated in a historical manual corpus labeling process and/or a historical soft text carrier word identification process to construct a plurality of carrier corpora with different lengths and/or different single characters;
the constructing unit is specifically configured to:
randomly shuffle the order of the single characters of the soft text carrier word samples to generate the carrier corpora; and/or
randomly extract a set number of soft text carrier word samples from the plurality of soft text carrier word samples, and randomly extract a single character from each extracted soft text carrier word sample to combine into a carrier corpus.
8. The identification device according to claim 7, wherein the extraction module is specifically configured to:
compare the similarity of the current text submitted by the user with the plurality of historical texts, and extract, as the suspected carrier words, words whose number of occurrences is greater than a set count threshold or whose occurrence frequency is greater than a set frequency threshold.
9. The identification device of claim 8, wherein the extraction module is further configured to:
filter the words whose number of occurrences is greater than the set count threshold or whose occurrence frequency is greater than the set frequency threshold according to a preset high-frequency non-carrier reverse word list, and retain the words that do not match the high-frequency non-carrier reverse word list as the suspected carrier words.
10. The identification device of claim 7, wherein the cheat identification module comprises:
the prediction unit is used for performing soft text cheating prediction on the current text and a plurality of historical texts containing the soft text carrier words by adopting a soft text cheating prediction model;
and the second determining unit is used for determining that the current text is the soft text and/or the user is the soft text cheating user if the probability output by the soft text cheating prediction model is greater than a second set probability threshold.
11. The identification device of claim 10, wherein the cheat identification module further comprises:
and the second training unit is used for training to obtain the soft text cheating prediction model according to the plurality of cheating corpora and the plurality of non-cheating corpora marked manually.
12. The identification device of claim 11, wherein the second training unit is specifically configured to:
carry out carrier sequence interception on the plurality of manually labeled cheating corpora and the plurality of manually labeled non-cheating corpora to obtain a plurality of cheating corpus sequences and a plurality of non-cheating corpus sequences, each with a set number of characters;
training to obtain the soft text cheating prediction model according to the plurality of cheating corpus sequences and the plurality of non-cheating corpus sequences;
the prediction unit is specifically configured to:
carrying out carrier sequence interception on the current text and the plurality of historical texts containing the soft text carrier words to obtain the current text sequence with the set number of characters and the plurality of historical text sequences containing the soft text carrier words;
and performing soft text cheating prediction on the current text sequence and a plurality of historical text sequences containing the soft text carrier words by adopting the soft text cheating prediction model.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of text cheating recognition as recited in any one of claims 1-6 when the program is executed by the processor.
14. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the method for identifying text cheating according to any one of claims 1-6.
15. A computer program product, characterized in that instructions in the computer program product, when executed by a processor, perform the method of text cheating identification according to any one of claims 1-6.
CN201810398470.6A 2018-04-28 2018-04-28 Text cheating identification method and device Active CN108804413B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810398470.6A CN108804413B (en) 2018-04-28 2018-04-28 Text cheating identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810398470.6A CN108804413B (en) 2018-04-28 2018-04-28 Text cheating identification method and device

Publications (2)

Publication Number Publication Date
CN108804413A CN108804413A (en) 2018-11-13
CN108804413B true CN108804413B (en) 2022-03-22

Family

ID=64094013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810398470.6A Active CN108804413B (en) 2018-04-28 2018-04-28 Text cheating identification method and device

Country Status (1)

Country Link
CN (1) CN108804413B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298020B (en) * 2019-05-30 2023-05-16 北京百度网讯科技有限公司 Text anti-cheating variant reduction method and equipment, and text anti-cheating method and equipment
CN110704615B (en) * 2019-09-04 2021-01-26 北京航空航天大学 Internet financial non-dominant advertisement identification method and device
CN113591464B (en) * 2021-07-28 2022-06-10 百度在线网络技术(北京)有限公司 Variant text detection method, model training method, device and electronic equipment

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102891838A (en) * 2011-07-22 2013-01-23 腾讯科技(深圳)有限公司 Method and device for detecting promotion content in question and answer club
CN103049501B (en) * 2012-12-11 2016-08-03 上海大学 Based on mutual information and the Chinese domain term recognition method of conditional random field models
CN103176953B (en) * 2013-03-20 2016-02-24 新浪网技术(中国)有限公司 A kind of text handling method and system
CN103886016B (en) * 2014-02-20 2017-11-03 百度在线网络技术(北京)有限公司 A kind of method and apparatus for being used to determine the rubbish text information in the page
CN105447031A (en) * 2014-08-28 2016-03-30 百度在线网络技术(北京)有限公司 Training sample labeling method and device
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN104331396A (en) * 2014-11-26 2015-02-04 深圳市英威诺科技有限公司 Intelligent advertisement identifying method
CN106156017A (en) * 2015-03-23 2016-11-23 北大方正集团有限公司 Information identifying method and information identification system
CN107102981B (en) * 2016-02-19 2020-06-23 腾讯科技(深圳)有限公司 Word vector generation method and device
CN105787133B (en) * 2016-03-31 2020-06-02 北京小米移动软件有限公司 Advertisement information filtering method and device
CN107741933A (en) * 2016-08-08 2018-02-27 北京京东尚科信息技术有限公司 Method and apparatus for detecting text
US20180096390A1 (en) * 2016-09-30 2018-04-05 Facebook, Inc. Systems and methods for promoting content items
CN106845999A (en) * 2017-02-20 2017-06-13 百度在线网络技术(北京)有限公司 Risk subscribers recognition methods, device and server
CN107239440B (en) * 2017-04-21 2021-05-25 同盾控股有限公司 Junk text recognition method and device
CN107894994A (en) * 2017-10-18 2018-04-10 北京京东尚科信息技术有限公司 A kind of method and apparatus for detecting much-talked-about topic classification

Also Published As

Publication number Publication date
CN108804413A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN109657054B (en) Abstract generation method, device, server and storage medium
CN109446430B (en) Product recommendation method and device, computer equipment and readable storage medium
CN106649818B (en) Application search intention identification method and device, application search method and server
CN108733778B (en) Industry type identification method and device of object
CN108256098B (en) Method and device for determining emotional tendency of user comment
CN108804413B (en) Text cheating identification method and device
US20200257757A1 (en) Machine Learning Techniques for Generating Document Summaries Targeted to Affective Tone
CN111476256A (en) Model training method and device based on semi-supervised learning and electronic equipment
CN104978354B (en) Text classification method and device
CN111125354A (en) Text classification method and device
Sanguansat Paragraph2vec-based sentiment analysis on social media for business in thailand
CN112100384B (en) Data viewpoint extraction method, device, equipment and storage medium
CN113722483A (en) Topic classification method, device, equipment and storage medium
CN112257452A (en) Emotion recognition model training method, device, equipment and storage medium
CN111563377A (en) Data enhancement method and device
CN113420122A (en) Method, device and equipment for analyzing text and storage medium
CN107844531B (en) Answer output method and device and computer equipment
CN112151019A (en) Text processing method and device and computing equipment
CN113807096A (en) Text data processing method and device, computer equipment and storage medium
CN110717326B (en) Text information author identification method and device based on machine learning
CN113038175A (en) Video processing method and device, electronic equipment and computer readable storage medium
CN115687790B (en) Advertisement pushing method and system based on big data and cloud platform
CN107590163B (en) The methods, devices and systems of text feature selection
CN113378541B (en) Text punctuation prediction method, device, system and storage medium
CN112597295B (en) Digest extraction method, digest extraction device, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant