CN113591464A

CN113591464A - Variant text detection method, model training method, device and electronic equipment

Info

Publication number: CN113591464A
Application number: CN202110860112.4A
Authority: CN
Inventors: 孙晓洁; 吕中厚; 王洋; 高梦晗
Original assignee: Baidu Online Network Technology Beijing Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd
Priority date: 2021-07-28
Filing date: 2021-07-28
Publication date: 2021-11-02
Anticipated expiration: 2041-07-28
Also published as: CN113591464B

Abstract

The disclosure provides a variant text detection method, a model training method, a device, and electronic equipment, and relates to the technical field of artificial intelligence, in particular to the field of text processing. The method comprises the following steps: inputting each of a plurality of texts into a variant text detection model to obtain a variant score for each text, wherein the variant text detection model is obtained by training a text recognition model with variant text samples; determining a first variant text among the plurality of texts according to the variant score of each text, and determining a first account corresponding to the first variant text; determining a suspicious account associated with the first account; and performing content feature detection on the text submitted by the suspicious account, and determining a second variant text in that text according to the result of the content feature detection, so that variant text is mined actively and in a timely manner.

Description

Variant text detection method, model training method, device and electronic equipment

Technical Field

The present disclosure relates to text processing technology in the technical field of artificial intelligence, and in particular to a variant text detection method, a model training method, a device, and electronic equipment.

Background

A UGC (User Generated Content) platform is often abused by black-industry teams (groups operating in the illicit online economy), which use large numbers of accounts to publish prohibited content, such as illegal websites to which they want to divert traffic, in the form of variant text. To get such content published, a black-industry team constructs variant text by applying homophone, near-homophone, similar-looking-character, or even structural variations to the original text, so that the text still conveys its meaning while bypassing the risk-control checks of the UGC platform. This behavior seriously degrades the experience of normal users, so variant text detection on the content of a UGC platform is necessary.

In the related art, the content of a UGC platform is usually checked with a trained text detection model, whose detection capability depends on the samples used in training. However, variant text has two characteristics, variation in form and normal transmission of meaning, and it is difficult to construct samples that satisfy both at once. Moreover, black-industry teams keep inventing new variant text forms, so such a model has difficulty detecting new variant text.

Disclosure of Invention

The disclosure provides a variant text detection method, a model training method, a device, and electronic equipment, which achieve timely and active mining of variant text.

According to an aspect of the present disclosure, there is provided a variant text detection method, the method including:

inputting each of a plurality of texts into a variant text detection model to obtain a variant score for each text, wherein the variant text detection model is obtained by training a text recognition model with variant text samples;

determining a first variant text in the plurality of texts according to the variant score of each text, and determining a first account corresponding to the first variant text;

determining a suspicious account associated with the first account;

and performing content feature detection on the text submitted by the suspicious account, and determining a second variant text in the text submitted by the suspicious account according to the result of the content feature detection.

According to another aspect of the present disclosure, there is provided a model training method, the method including:

obtaining a first variant text and a second variant text, wherein the first variant text is determined according to variant scores obtained by inputting each of a plurality of texts into a variant text detection model, and the second variant text is determined according to the result of content feature detection performed on text submitted by a suspicious account associated with a first account corresponding to the first variant text;

and training the variant text detection model by adopting the first variant text and the second variant text so as to update the model parameters of the variant text detection model.

According to still another aspect of the present disclosure, there is provided a variant text detecting apparatus, including:

an input module, configured to input each of a plurality of texts into a variant text detection model to obtain a variant score for each text, wherein the variant text detection model is obtained by training a text recognition model with variant text samples;

a first determining module, configured to determine a first variant text among the plurality of texts according to the variant score of each text, and determine a first account corresponding to the first variant text;

a second determining module, configured to determine a suspicious account associated with the first account;

and a detection module, configured to perform content feature detection on the text submitted by the suspicious account, and determine a second variant text in the text submitted by the suspicious account according to the result of the content feature detection.

According to still another aspect of the present disclosure, there is provided a model training apparatus including:

an obtaining module, configured to obtain a first variant text and a second variant text, wherein the first variant text is determined according to variant scores obtained by inputting each of a plurality of texts into a variant text detection model, and the second variant text is determined according to the result of content feature detection performed on text submitted by a suspicious account associated with a first account corresponding to the first variant text;

and a training module, configured to train the variant text detection model with the first variant text and the second variant text so as to update the model parameters of the variant text detection model.

According to still another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first or second aspect described above.

According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first or second aspect described above.

According to yet another aspect of the present disclosure, there is provided a computer program product comprising: a computer program, the computer program being stored in a readable storage medium, from which the computer program can be read by at least one processor of an electronic device, execution of the computer program by the at least one processor causing the electronic device to perform the method of the first aspect or the second aspect.

According to the technical solution of the present disclosure, variant text is mined actively and in a timely manner.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic flow chart diagram of a variant text detection method provided according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow chart diagram of a model training method provided in accordance with an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a variant text detection apparatus provided according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a model training apparatus provided in accordance with an embodiment of the present disclosure;

FIG. 5 is a schematic block diagram of an electronic device for implementing a variant text detection method of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

When a text detection model is used to detect variant text in the content of a UGC platform, the detection capability of the model depends on the samples used in training; to improve that capability, the samples used in model training need to be expanded.

For example, a traditional method applies adversarial augmentation to existing samples: keywords in an existing sample are first extracted and then subjected to variations such as pinyin substitution or glyph substitution, and the resulting adversarial samples provide more training data, which strengthens the detection capability of the model. This method ensures that the generated variant text still conveys its meaning normally, but the variant forms it produces are limited, whereas in practice black-industry teams construct variant text in endlessly changing forms. Training the model with samples obtained this way is therefore still insufficient for detecting new variant text forms in time.

As another example, a deep-learning-based method such as a generative adversarial network uses a generator and a discriminator: the generator aims to produce variant samples that evade the discriminator, and the discriminator aims to correctly identify the variant samples produced by the generator, which improves the discriminator's ability to detect variant text. However, such a method can hardly guarantee that the variant samples produced by the generator keep the correct semantics, so the generated samples may drift without limit toward variation in form alone and bias the training of the discriminator. It is therefore still insufficient for detecting new variant text forms in time.

For these reasons, variant text detection can currently collect the relevant variant text to train the model and improve its detection capability only after a new variant text form used by a black-industry team has already appeared on a large scale. Detection therefore lags seriously, and new variant text forms cannot be found in time.

To solve the above problem, an embodiment of the present disclosure provides a variant text detection method. A variant text detection model is first used to detect variant forms that are already widespread, so that some accounts of a black-industry team can be caught from the detected variant text. Considering that one black-industry team often organizes many accounts in batches to carry out purposeful, "loosely synchronized" attacks within a short time, the remaining suspicious accounts belonging to the same team can be uncovered from the known accounts. The text submitted by these suspicious accounts is then judged by the variant text detection model in combination with content feature detection, so that new variant text forms of the black-industry team are found before they become widespread.

The present disclosure provides a variant text detection method, a model training method, a device, and electronic equipment, which are applied to the fields of artificial intelligence and text processing and achieve timely and active mining of variant text.

In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with relevant laws and regulations and do not violate public order and good customs.

Hereinafter, the variant text detection method provided by the present disclosure will be described in detail by specific examples. It is to be understood that the following detailed description may be combined with other embodiments, and that the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 1 is a schematic flow chart diagram of a variant text detection method provided according to an embodiment of the present disclosure. As shown in fig. 1, the method includes:

S101, inputting each of a plurality of texts into a variant text detection model to obtain a variant score for each text.

The variant text detection model is obtained by training a text recognition model by adopting a variant text sample. The input of the variant text detection model is text, and the output is a variant score of the text, wherein the variant score is used for representing the probability that the text is variant text, and the higher the variant score is, the higher the probability that the text is variant text is.

Variant text detection is required for all text submitted by users on the UGC platform. The plurality of texts in this step may include any text submitted by users, and each text is input into the variant text detection model to obtain its variant score.

S102, determining a first variant text in the plurality of texts according to the variant score of each text, and determining a first account corresponding to the first variant text.

Since the variant score of each text represents the probability that the text is variant text, the texts with a higher probability may be determined as variant text, that is, the first variant text. The account of the user who published the first variant text, that is, the first account, can be determined at the same time. Optionally, the determined first variant text may be deleted from the UGC platform, and the first account may be prohibited from publishing further text.

S103, determining a suspicious account related to the first account.

Since black-industry teams often publish variant text from a large number of accounts within a period of time, a suspicious account associated with the first account may be determined based on the associations between accounts, so that new variant text can be discovered in time by detecting the text submitted by the suspicious account.

S104, performing content feature detection on the text submitted by the suspicious account, and determining a second variant text in the text submitted by the suspicious account according to the result of the content feature detection.

For the text submitted by the suspicious account, in addition to identifying through the variant text detection model any first variant text that may already exist in it, content feature detection is also performed. Content feature detection may include detection of various features of the text, such as similarity, semantics, special characters, or keywords. Based on the result of content feature detection on the text submitted by the suspicious account, new variant text forms are mined in time, and second variant text that the variant text detection model did not detect is identified.

In the variant text detection method of the embodiment of the present disclosure, a variant text detection model is first used for preliminary detection of any text submitted by users, which determines the first variant text and the corresponding first account. Suspicious accounts of the same black-industry team associated with the first account are then uncovered, and, for the text submitted by these suspicious accounts, second variant text not detected by the variant text detection model is determined in combination with content feature detection. The second variant text can thus be discovered before a new variant text form of the black-industry team becomes widespread, which achieves timely and active mining of variant text.
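Steps S101 to S104 can be pictured with a minimal Python sketch. It is illustrative only and is not taken from the disclosure: the record shape `(account_id, text)`, the callables `score_fn`, `related_accounts`, and `content_check`, and the threshold value 0.9 are all assumptions standing in for the trained model, the account association logic, and the content feature detection.

```python
from typing import Callable, Iterable, List, Set, Tuple

Post = Tuple[str, str]  # (account_id, text); hypothetical record shape


def detect_variants(
    posts: List[Post],
    score_fn: Callable[[str], float],            # stand-in for the detection model
    related_accounts: Callable[[str], Iterable[str]],  # stand-in for S103 logic
    content_check: Callable[[str], bool],        # stand-in for content feature detection
    first_threshold: float = 0.9,                # assumed value, not from the source
) -> Tuple[List[Post], List[Post]]:
    """Return (first_variant_posts, second_variant_posts) following S101 to S104."""
    # S101: obtain a variant score for every text.
    scored = [(acct, text, score_fn(text)) for acct, text in posts]

    # S102: high-scoring texts are first variant texts; their authors are first accounts.
    first = [(a, t) for a, t, s in scored if s >= first_threshold]
    first_accounts = {a for a, _ in first}

    # S103: accounts associated with any first account are suspicious.
    suspicious: Set[str] = set()
    for acct in first_accounts:
        suspicious.update(related_accounts(acct))

    # S104: content feature detection on suspicious-account text the model did not flag.
    second = [
        (a, t)
        for a, t, s in scored
        if a in suspicious and s < first_threshold and content_check(t)
    ]
    return first, second
```

In this sketch the second variant text is decided directly by `content_check`; the embodiments described later insert a manual labeling step before that decision.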

On the basis of the above embodiment, first, the determination of the first variant text in the plurality of texts according to the variant score of each text in S102 is explained.

Optionally, a first text whose variant score is greater than or equal to a first threshold is determined as the first variant text. Optionally, for a second text whose variant score is greater than or equal to a second threshold and smaller than the first threshold, first indication information is output, where the first indication information indicates that a label is to be added to the second text; if the received label of the second text is variant text, the second text is determined as the first variant text.

The variant score of each text is obtained after the text is input into the variant text detection model. A first text whose variant score is greater than or equal to the first threshold has a high probability of being variant text and can be determined directly as the first variant text, which ensures fast automatic recognition of variant text.

For a second text whose variant score is greater than or equal to the second threshold and smaller than the first threshold, further review can be carried out to confirm whether it is variant text. First indication information may be output for the second text to prompt an auditor to add a label to it after auditing it. If the auditor determines that the second text is variant text, the label added is variant text, and the second text is determined as the first variant text. If the auditor determines that the second text is not variant text, the label added is non-variant text, and the second text is not determined as the first variant text. This ensures detection accuracy for second texts, whose variant scores are relatively high but below the first threshold, and prevents non-variant text from being falsely detected as variant text.

After the first variant text is determined, suspicious accounts of the same black-industry team associated with the first account can be determined from the first account corresponding to the first variant text. The determination of a suspicious account associated with the first account in S103 is described below.
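The two-threshold decision described above can be sketched in Python as follows. The threshold values 0.9 and 0.6, the `Decision` names, and the boolean `reviewer_label` are assumptions chosen for illustration; the disclosure does not fix them.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    VARIANT = "variant"          # determined directly as first variant text
    NEEDS_REVIEW = "review"      # first indication information: request a label
    NOT_FLAGGED = "not_flagged"  # score below the second threshold


def triage(score: float, first_threshold: float = 0.9,
           second_threshold: float = 0.6) -> Decision:
    """Map a variant score to a decision; threshold values are assumed."""
    if score >= first_threshold:
        return Decision.VARIANT
    if score >= second_threshold:
        return Decision.NEEDS_REVIEW
    return Decision.NOT_FLAGGED


def is_first_variant(score: float, reviewer_label: Optional[bool] = None) -> bool:
    """Combine the model score with an optional reviewer label.

    reviewer_label is True when the auditor labeled the text as variant text,
    False when labeled non-variant, and None when no label has been received.
    """
    decision = triage(score)
    if decision is Decision.VARIANT:
        return True
    if decision is Decision.NEEDS_REVIEW:
        return reviewer_label is True
    return False
```

Keeping the automatic band and the review band separate means that only texts in the middle band consume auditor time.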

Optionally, an account with the same Internet Protocol (IP) address as the first account is determined as the suspicious account.

A black-industry team often publishes variant text from a large number of accounts within a period of time, and because so many accounts are used, multiple accounts often end up sharing one IP address in actual operation. An account with the same IP address as the first account can therefore be determined as a suspicious account, so that a large number of suspicious accounts can be located quickly and potential variant text can be mined.

Optionally, an account whose interaction value with the first account is greater than a preset value is determined as a suspicious account.

Accounts of the same black-industry team often follow one another and forward or comment on one another's posts on the UGC platform, so accounts that interact heavily with the first account can be determined as suspicious accounts. In one example, the interaction value between accounts is calculated, and an account whose interaction value is greater than the preset value is determined as a suspicious account, which makes the mining of suspicious accounts more comprehensive.

The text submitted by a suspicious account requires focused inspection in order to mine new variant text forms, as described below.

The text submitted by the suspicious account is input into the variant text detection model to obtain its variant score. Among the text submitted by the suspicious account, first texts whose variant score is greater than or equal to the first threshold and second texts whose variant score is greater than or equal to the second threshold and smaller than the first threshold are processed in the same way as in the foregoing embodiment, which is not repeated here.
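The two optional ways of determining suspicious accounts can be sketched as below. This is a sketch under assumptions: the disclosure does not say how the interaction value is calculated, so the per-event `WEIGHTS`, the `(actor, kind, target)` event shape, and the `ip_of` mapping are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical weights: the source does not define the interaction value.
WEIGHTS = {"follow": 1.0, "forward": 2.0, "comment": 2.0}

Event = Tuple[str, str, str]  # (actor_account, kind, target_account); assumed shape


def same_ip_accounts(first_account: str, ip_of: Dict[str, str]) -> Set[str]:
    """Accounts sharing the first account's IP address."""
    ip = ip_of.get(first_account)
    if ip is None:
        return set()
    return {a for a, addr in ip_of.items() if addr == ip and a != first_account}


def interacting_accounts(first_account: str, events: Iterable[Event],
                         preset_value: float) -> Set[str]:
    """Accounts whose summed interaction value with the first account exceeds preset_value."""
    totals: Dict[str, float] = defaultdict(float)
    for actor, kind, target in events:
        # Only events that involve the first account and another account count.
        if actor == target or first_account not in (actor, target):
            continue
        other = target if actor == first_account else actor
        totals[other] += WEIGHTS.get(kind, 0.0)
    return {a for a, value in totals.items() if value > preset_value}
```

The two sets can be united to obtain the suspicious accounts, since either condition alone is described as sufficient.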

For fourth texts, that is, texts submitted by the suspicious account whose variant score is greater than or equal to a third threshold and smaller than the second threshold, third indication information is output, where the third indication information indicates that a label is to be added to the fourth text; if the received label of the fourth text is variant text, the fourth text is determined as second variant text. As in the foregoing, the third indication information prompts an auditor to add a label to the fourth text after auditing it. If the auditor determines that the fourth text is variant text, the label added is variant text, and the fourth text is determined as second variant text. If the auditor determines that the fourth text is not variant text, the label added is non-variant text, and the fourth text is not determined as second variant text. In this way, some of the variant text in the text submitted by the suspicious account that the variant text detection model failed to detect is detected.

Content feature detection is performed on third texts, that is, texts submitted by the suspicious account whose variant score is smaller than the third threshold. Since there may be many such third texts, using content feature detection improves detection efficiency.

Optionally, after content feature detection is performed on the third text, if the result of the content feature detection meets a preset condition, second indication information is output, where the second indication information indicates that a label is to be added to the third text; if the received label of the third text is variant text, the third text is determined as the second variant text. A result that meets the preset condition indicates that the third text has a high probability of being variant text, so further review is needed to determine whether it is. A result that does not meet the preset condition indicates that the probability is low, and no processing is needed. Performing content feature detection first, and prompting an auditor to label the third text only when the result meets the preset condition, reduces the number of texts to be labeled while maintaining the accuracy of variant text detection.
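For text submitted by a suspicious account, the score bands described above can be summarized in a short Python sketch. The threshold values and the returned strings are assumptions for illustration; `content_check` is a hypothetical callable standing in for whichever content feature detection is used, and it is invoked only for third texts.

```python
from typing import Callable


def route_suspicious_text(
    score: float,
    content_check: Callable[[], bool],
    first_threshold: float = 0.9,   # assumed value
    second_threshold: float = 0.6,  # assumed value
    third_threshold: float = 0.3,   # assumed value
) -> str:
    """Decide how a text from a suspicious account is handled."""
    if score >= first_threshold:
        return "variant"            # first text: determined directly
    if score >= second_threshold:
        return "first_indication"   # second text: request a label
    if score >= third_threshold:
        return "third_indication"   # fourth text: request a label
    # Third text: request a label only if content feature detection
    # meets its preset condition; otherwise no processing is needed.
    return "second_indication" if content_check() else "no_action"
```

Passing the content check as a callable keeps the comparatively costly feature detection out of the higher bands, where the decision is made on the score alone.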

The manner of content feature detection and the result of content feature detection are further described below.

In one embodiment, performing content feature detection on the third text includes: performing similarity detection on every pair of third texts, and if the similarity of two texts is greater than a similarity threshold, determining that the two texts are similar texts. Correspondingly, outputting the second indication information if the result of the content feature detection meets the preset condition includes: outputting the second indication information if the number of similar texts among the third texts is greater than a preset value.

Because the variant texts published by accounts belonging to the same black-industry team usually convey the same information, their variation methods and content are often very similar, so the third texts can be compared with one another pairwise for similarity. Similarity comparison effectively mines the variant text submitted by the same black-industry team and finds new variant text forms in time.
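The pairwise similarity check can be sketched in Python as follows. The disclosure does not name a similarity measure; this sketch assumes the character-level ratio of `difflib.SequenceMatcher` from the standard library, and the threshold 0.8 and preset value 1 are likewise assumed.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Set


def similar_text_indices(texts: List[str], similarity_threshold: float = 0.8) -> Set[int]:
    """Indices of texts that are similar to at least one other text."""
    similar: Set[int] = set()
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        # ratio() is in [0, 1]; 1.0 means the two strings are identical.
        if SequenceMatcher(None, a, b).ratio() > similarity_threshold:
            similar.update((i, j))
    return similar


def needs_review_by_similarity(texts: List[str], similarity_threshold: float = 0.8,
                               preset_value: int = 1) -> bool:
    """True when the number of similar texts exceeds the preset value."""
    return len(similar_text_indices(texts, similarity_threshold)) > preset_value
```

Pairwise comparison grows quadratically with the number of third texts, so a production system would likely need a cheaper candidate selection step before this comparison; that choice is outside what the disclosure describes.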

In one embodiment, performing content feature detection on the third text includes: performing semantic relevance detection on the third text and the upper-level text object of the third text to obtain a relevance score between them. Correspondingly, outputting the second indication information if the result of the content feature detection meets the preset condition includes: outputting the second indication information if the relevance score between the third text and its upper-level text object is smaller than a relevance threshold.

To evade supervision, text published by accounts of a black-industry team is often completely unrelated semantically to its upper-level text object; for example, the content is completely unrelated to the title. Semantic relevance detection can therefore be performed on each third text and its upper-level text object, and for a third text with low relevance, the second indication information is likewise output to prompt an auditor to review and label it. Mining the variant text that a suspicious account may have published through semantic relevance in this way allows new variant text forms to be discovered in time.
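The relevance check can be illustrated with a deliberately crude Python sketch. The disclosure does not specify how semantic relevance is scored; the code below substitutes word-set Jaccard overlap as a lexical proxy, which is an assumption and not a real semantic model, and the threshold 0.1 is also assumed. In practice the `relevance_score` function would be replaced by whatever semantic model is available.

```python
import re
from typing import Set


def _words(text: str) -> Set[str]:
    """Lowercased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def relevance_score(text: str, upper_level_text: str) -> float:
    """Jaccard overlap of word sets, a lexical stand-in for semantic relevance."""
    a, b = _words(text), _words(upper_level_text)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def needs_review_by_relevance(text: str, upper_level_text: str,
                              relevance_threshold: float = 0.1) -> bool:
    """True when the relevance score falls below the relevance threshold."""
    return relevance_score(text, upper_level_text) < relevance_threshold
```

For example, a comment that shares no words with the title it is posted under scores 0.0 and is routed to review, while a comment that repeats several title words is not.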

In one embodiment, performing content feature detection on the third text includes: matching the third text against a preset special character library to obtain the number of special characters contained in the third text. Correspondingly, outputting the second indication information if the result of the content feature detection meets the preset condition includes: outputting the second indication information if the number of special characters contained in the third text is greater than a special character threshold.

Because the variant text published by a black-industry team often includes a large number of special characters, variant text that a suspicious account may have published can be mined through special character detection. For a third text containing many special characters, the second indication information is likewise output to prompt an auditor to review and label it, so that new variant text is mined in time.
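A minimal Python sketch of the special character check follows. The contents of `SPECIAL_CHARACTERS` and the threshold value 3 are assumptions; the disclosure only states that a preset special character library and a special character threshold exist.

```python
from typing import FrozenSet

# Hypothetical library; a real one would be curated from observed variant text.
SPECIAL_CHARACTERS: FrozenSet[str] = frozenset("@#$%^&*~|_+=<>★☆◆●①②③")


def count_special_characters(text: str,
                             library: FrozenSet[str] = SPECIAL_CHARACTERS) -> int:
    """Number of characters in the text that appear in the library."""
    return sum(1 for ch in text if ch in library)


def needs_review_by_special_chars(text: str, special_char_threshold: int = 3,
                                  library: FrozenSet[str] = SPECIAL_CHARACTERS) -> bool:
    """True when the special character count exceeds the threshold."""
    return count_special_characters(text, library) > special_char_threshold
```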

In one embodiment, performing content feature detection on the third text includes: matching the third text against a keyword lexicon to obtain the number of keywords in the third text, where the keyword lexicon is obtained by extracting keywords from preset variant text. Correspondingly, outputting the second indication information if the result of the content feature detection meets the preset condition includes: outputting the second indication information if the number of keywords in the third text is greater than a keyword threshold.

Although a black-industry team often publishes variant text in new variant forms, the published content is usually similar illegal content, so the published variant texts often contain recurring keywords. When content feature detection is performed on the third text of a suspicious account, the text can therefore be matched against a keyword lexicon obtained by extracting keywords from variant text that has already been detected. For a third text that hits the keyword lexicon many times, the second indication information is likewise output to prompt an auditor to review and label it, so that new variant text is mined in time.

It can be understood that, in practical applications, one or more of the content feature detection methods in the foregoing embodiments may be applied to the third text. As long as the result of any one method meets the preset condition of the corresponding method, the second indication information is output to prompt an auditor to review and label the third text. In this way, possible variant text is mined as comprehensively as possible.
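Keyword lexicon matching can be sketched in Python as below. The sketch assumes that "number of keywords" means the total number of case-insensitive, non-overlapping occurrences of lexicon entries in the text; the disclosure does not say whether occurrences or distinct keywords are counted. The lexicon contents and the threshold value 2 are hypothetical.

```python
from typing import Iterable


def count_keyword_hits(text: str, lexicon: Iterable[str]) -> int:
    """Total occurrences of lexicon keywords in the text (case-insensitive)."""
    lowered = text.lower()
    # str.count counts non-overlapping occurrences; empty keywords are skipped.
    return sum(lowered.count(kw.lower()) for kw in lexicon if kw)


def needs_review_by_keywords(text: str, lexicon: Iterable[str],
                             keyword_threshold: int = 2) -> bool:
    """True when the number of keyword hits exceeds the keyword threshold."""
    return count_keyword_hits(text, lexicon) > keyword_threshold
```

Substring matching suits text without word boundaries; for space-delimited languages a tokenized match may be preferable to avoid hits inside longer words.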
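Combining several content feature detection methods, where any single hit is enough, can be sketched in Python as follows. The detector names and the dictionary-of-callables arrangement are assumptions for illustration; each callable is expected to return True when its own preset condition is met.

```python
from typing import Callable, Dict, List

Detector = Callable[[str], bool]  # returns True when its preset condition is met


def fired_detectors(text: str, detectors: Dict[str, Detector]) -> List[str]:
    """Names of the detectors whose preset condition the text meets."""
    return [name for name, detector in detectors.items() if detector(text)]


def should_output_second_indication(text: str, detectors: Dict[str, Detector]) -> bool:
    """True when at least one detector's preset condition is met."""
    return bool(fired_detectors(text, detectors))
```

Returning the names of the detectors that fired, rather than only a boolean, lets the indication shown to the auditor state why a text was selected for labeling.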

In the above embodiment, the variant text detection method is introduced, and after the variant text is determined by the above method, the determined variant text (for example, the first variant text and the second variant text) may be used as a training sample. Fig. 2 is a schematic flow chart diagram of a model training method provided according to an embodiment of the present disclosure. As shown in fig. 2, the method includes:

s201, obtaining a first variant text and a second variant text.

The first variant text is obtained according to the variant score of each text after the variant scores of the texts are obtained by respectively inputting the texts into a variant text detection model, and the second variant text is determined according to the content feature detection of the texts submitted by the suspicious accounts related to the first accounts corresponding to the first variant text.

S202, training the variant text detection model by adopting the first variant text and the second variant text so as to update the model parameters of the variant text detection model.

The variant texts detected by the variant text detection method of the above embodiments are used as training samples to train the variant text detection model. Because the variant texts obtained by that method may include newly mined forms of variant text, the model trained on them can detect the latest variant texts, which can greatly improve the detection capability of the variant text detection model. This also avoids the limitations and bias in detection capability that arise when a model is trained on poorly constructed samples. A full closed loop is thereby formed between mining new variant texts and iteratively strengthening the model's detection capability, which addresses the problem in the related art that detection is possible only after passively waiting for a variant text to spread on a large scale.
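The closed loop of S201 and S202 can be illustrated with a toy sketch. The class below is a hypothetical stand-in: the disclosure does not specify the model architecture, and scoring by character bigram overlap is an assumption chosen only so that the example is self-contained.

```python
from typing import Iterable, List, Set


class ToyVariantModel:
    """Hypothetical stand-in for the variant text detection model.

    It scores a text by the fraction of its character bigrams that were
    seen in variant samples; the real model is not specified here.
    """

    def __init__(self) -> None:
        self.known_bigrams: Set[str] = set()

    @staticmethod
    def _bigrams(text: str) -> List[str]:
        return [text[i:i + 2] for i in range(len(text) - 1)]

    def train(self, variant_texts: Iterable[str]) -> None:
        """Update the model state with variant samples (the role of S202)."""
        for text in variant_texts:
            self.known_bigrams.update(self._bigrams(text))

    def score(self, text: str) -> float:
        """Return a variant score in [0, 1]."""
        grams = self._bigrams(text)
        if not grams:
            return 0.0
        return sum(g in self.known_bigrams for g in grams) / len(grams)


model = ToyVariantModel()
model.train(["wechat abc123"])             # initial variant samples
before = model.score("we-chat abc456")     # a new variant form, only partly known

# S201: variant texts mined by the detection method (toy values).
first_variant_texts = ["wechat abc123"]
second_variant_texts = ["we-chat abc456"]

# S202: retrain with the mined samples so the new form is covered.
model.train(first_variant_texts + second_variant_texts)
after = model.score("we-chat abc456")
```

In this toy setting the score of the new variant form rises after retraining, which mirrors the iterative strengthening described above.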

Fig. 3 is a schematic structural diagram of a variant text detection apparatus provided according to an embodiment of the present disclosure. As shown in fig. 3, the variant text detecting apparatus 300 includes:

the input module 301 is configured to input the multiple texts into a variant text detection model respectively to obtain a variant score of each text in the multiple texts, where the variant text detection model is obtained by training a text recognition model by using a variant text sample;

a first determining module 302, configured to determine a first variant text in the multiple texts according to the variant score of each text, and determine a first account corresponding to the first variant text;

a second determination module 303 for determining a suspicious account associated with the first account;

the detection module 304 is configured to perform content feature detection on the text submitted by the suspicious account, and determine a second variant text in the text submitted by the suspicious account according to a result of the content feature detection.

Optionally, the first determining module 302 includes:

a first determination unit configured to determine a first text having a variant score greater than or equal to a first threshold as a first variant text.

Optionally, the variant text detecting apparatus 300 further includes:

a second determining unit, configured to output first indication information for a second text whose variant score is greater than or equal to a second threshold and smaller than the first threshold, where the first indication information is used to indicate that an annotation is to be added to the second text, and to determine the second text as the first variant text if the received annotation of the second text is variant text.
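The two score bands handled by the first and second determining units can be sketched as a simple triage function. The enum names and the example threshold values are assumptions for illustration; the disclosure does not fix concrete values.

```python
from enum import Enum


class Decision(Enum):
    FIRST_VARIANT = "first_variant"          # score >= first threshold
    NEEDS_ANNOTATION = "needs_annotation"    # second threshold <= score < first threshold
    NO_ACTION = "no_action"                  # score < second threshold


def triage(score: float, first_threshold: float, second_threshold: float) -> Decision:
    """Map a variant score to the band it falls in."""
    if second_threshold > first_threshold:
        raise ValueError("second_threshold must not exceed first_threshold")
    if score >= first_threshold:
        return Decision.FIRST_VARIANT
    if score >= second_threshold:
        return Decision.NEEDS_ANNOTATION
    return Decision.NO_ACTION


# Hypothetical thresholds: 0.9 (first) and 0.6 (second).
print(triage(0.75, first_threshold=0.9, second_threshold=0.6))
```

A text in the middle band becomes a first variant text only after the returned annotation marks it as variant text; that manual step is outside this sketch.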

Optionally, the second determining module includes:

a third determining unit, configured to determine an account having the same Internet Protocol (IP) address as the first account as the suspicious account.

Optionally, the second determining module includes:

a fourth determining unit, configured to determine an account whose interaction value with the first account is greater than a preset value as the suspicious account.
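The two optional criteria for finding suspicious accounts (same IP address, or an interaction value above a preset value) could be sketched as follows. The data layouts are assumptions: `ip_of` maps each account to one IP address, and `interaction_value` holds precomputed numbers, since how the interaction value is calculated is not specified here.

```python
from typing import Dict, Set


def suspicious_by_ip(first_account: str, ip_of: Dict[str, str]) -> Set[str]:
    """Accounts that share the first account's IP address."""
    ip = ip_of.get(first_account)
    if ip is None:
        return set()
    return {acc for acc, addr in ip_of.items() if addr == ip and acc != first_account}


def suspicious_by_interaction(
    first_account: str,
    interaction_value: Dict[str, Dict[str, float]],
    preset_value: float,
) -> Set[str]:
    """Accounts whose interaction value with the first account exceeds preset_value."""
    peers = interaction_value.get(first_account, {})
    return {acc for acc, value in peers.items() if value > preset_value and acc != first_account}


# Toy data using documentation-only IP ranges.
ip_table = {"acct_a": "192.0.2.1", "acct_b": "192.0.2.1", "acct_c": "198.51.100.7"}
print(suspicious_by_ip("acct_a", ip_table))
```

The two result sets can be united when both criteria are enabled.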

Optionally, the detecting module 304 includes:

the input unit is used for inputting the text submitted by the suspicious account into the variant text detection model to obtain the variant score of the text submitted by the suspicious account;

and the detection unit is used for detecting the content characteristics of a third text of which the variant score is smaller than a third threshold value in the texts submitted by the suspicious account.

Optionally, the detecting module 304 includes: a fifth determining unit, configured to output second indication information if the result of the content feature detection meets a preset condition, where the second indication information is used to indicate that a label is to be added to the third text, and to determine the third text as the second variant text if the received label of the third text is variant text.

Optionally, the detection unit includes: the first detection subunit is used for carrying out similarity detection on every two texts in the third text, and if the similarity of the two texts is greater than a similarity threshold value, determining that the two texts are similar texts;

the fifth determination unit includes: and the first output subunit is used for outputting second indication information if the number of the similar texts in the third text is greater than a preset value.
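The pairwise similarity check performed by the first detection subunit and first output subunit might look like the sketch below. It uses Python's standard `difflib.SequenceMatcher` as the similarity measure, which is an assumption; the disclosure does not name a similarity algorithm. Counting a text as "similar" when it matches at least one other text is likewise an interpretation made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Set


def similar_text_indices(texts: List[str], similarity_threshold: float) -> Set[int]:
    """Indices of texts that are similar to at least one other text."""
    similar: Set[int] = set()
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        if SequenceMatcher(None, a, b).ratio() > similarity_threshold:
            similar.update((i, j))
    return similar


def needs_similarity_review(
    texts: List[str], similarity_threshold: float, preset_value: int
) -> bool:
    """Return True when the number of similar texts exceeds the preset value."""
    return len(similar_text_indices(texts, similarity_threshold)) > preset_value


third_texts = ["buy now at site A1", "buy now at site A2", "lovely weather today"]
print(needs_similarity_review(third_texts, similarity_threshold=0.8, preset_value=1))
```

Comparing every pair grows quadratically with the number of texts, so a practical system would likely restrict the comparison to one account or a small group of accounts at a time.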

Optionally, the detection unit includes: the second detection subunit is used for performing semantic relevance detection on the third text and the superior text object of the third text to obtain relevance scores of the third text and the superior text object of the third text;

the fifth determination unit includes: and the second output subunit outputs second indication information if the relevance scores of the third text and the upper text object of the third text are smaller than the relevance threshold.
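The relevance check of the second detection subunit and second output subunit can be illustrated with a deliberately crude stand-in. Word-overlap (Jaccard) scoring is used here in place of a real semantic relevance model, which the disclosure does not specify; the superior text object is assumed, for this example, to be the text a comment replies to, and the threshold is hypothetical.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens of the text."""
    return set(re.findall(r"\w+", text.lower()))


def relevance_score(third_text: str, parent_text: str) -> float:
    """Jaccard overlap of word tokens, used as a placeholder relevance score."""
    a, b = _tokens(third_text), _tokens(parent_text)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def needs_relevance_review(
    third_text: str, parent_text: str, relevance_threshold: float
) -> bool:
    """Return True when the relevance score falls below the threshold."""
    return relevance_score(third_text, parent_text) < relevance_threshold


parent = "review of the new phone camera"
print(needs_relevance_review("add me for free gifts", parent, relevance_threshold=0.1))
```

A text that has little to do with the content it is attached to scores low and is routed to an auditor, which is the behaviour the output subunit describes.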

Optionally, the detection unit includes: the third detection subunit matches the third text with a preset special character library to obtain the number of special characters contained in the third text;

the fifth determination unit includes: and the third output subunit outputs second indication information if the number of the special characters contained in the third text is greater than the special character threshold value.
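The special character matching of the third detection subunit and third output subunit reduces to counting characters that belong to a preset library. The library contents and the threshold below are hypothetical examples.

```python
from typing import Set


def count_special_chars(text: str, special_chars: Set[str]) -> int:
    """Count characters of the text that appear in the special character library."""
    return sum(1 for ch in text if ch in special_chars)


def needs_special_char_review(
    text: str, special_chars: Set[str], special_char_threshold: int
) -> bool:
    """Return True when the count exceeds the special character threshold."""
    return count_special_chars(text, special_chars) > special_char_threshold


# Hypothetical library of symbols often used to break up words.
example_library = {"★", "☆", "①", "②", "丨"}
print(needs_special_char_review("w★e☆b①", example_library, special_char_threshold=2))
```

Each occurrence is counted, so a text that repeats one special character many times also exceeds the threshold.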

Optionally, the detection unit includes: the fourth detection subunit matches the third text with a keyword lexicon to obtain the number of keywords in the third text, wherein the keyword lexicon is obtained by extracting keywords from a preset variant text;

the fifth determination unit includes: and the fourth output subunit outputs the second indication information if the number of the keywords included in the third text is greater than the keyword threshold value.

Optionally, the variant text detecting apparatus 300 further includes: a fifth output subunit, configured to output third indication information for a fourth text whose variant score is greater than or equal to the third threshold and smaller than the second threshold among the texts submitted by the suspicious account, where the third indication information is used to indicate that a label is to be added to the fourth text, and to determine the fourth text as the second variant text if the received label of the fourth text is variant text.
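For texts submitted by a suspicious account, the third threshold adds a lower band below the second threshold. The routing could be sketched as follows; the route labels and example values are assumptions, and texts scoring at or above the second threshold are simply reported as belonging to the bands already handled by the first and second determining units.

```python
def route_suspicious_text(score: float, second_threshold: float, third_threshold: float) -> str:
    """Pick the handling route for a text from a suspicious account."""
    if third_threshold > second_threshold:
        raise ValueError("third_threshold must not exceed second_threshold")
    if score < third_threshold:
        # Third text: goes through content feature detection.
        return "content_feature_detection"
    if score < second_threshold:
        # Fourth text: third indication information, sent for labeling.
        return "third_indication"
    # Higher scores fall in the bands covered by the first/second thresholds.
    return "higher_band"


# Hypothetical thresholds: 0.6 (second) and 0.3 (third).
print(route_suspicious_text(0.45, second_threshold=0.6, third_threshold=0.3))
```

Lowering the bar for manual labeling in this way reflects the higher prior suspicion attached to accounts associated with a known first account.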

The apparatus of the embodiment of the present disclosure may be configured to perform the variant text detection method in the above method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again.

Fig. 4 is a schematic structural diagram of a model training apparatus provided according to an embodiment of the present disclosure. As shown in fig. 4, the model training apparatus 400 includes:

an obtaining module 401, configured to obtain a first variant text and a second variant text, where the first variant text is a variant text obtained by inputting multiple texts into a variant text detection model respectively to obtain a variant score of each text, and the second variant text is a variant text determined by performing content feature detection on a text submitted by a suspicious account related to a first account corresponding to the first variant text and according to a result of the content feature detection;

a training module 402, configured to train the variant text detection model using the first variant text and the second variant text to update the model parameters of the variant text detection model.

The apparatus of the embodiment of the present disclosure may be configured to execute the model training method in the above method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again.

The present disclosure also provides an electronic device and a non-transitory computer-readable storage medium storing computer instructions, according to embodiments of the present disclosure.

According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.

Fig. 5 is a schematic block diagram of an electronic device for implementing a variant text detection method of an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 5, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the respective methods and processes described above, such as the variant text detection method. For example, in some embodiments, the variant text detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the variant text detection method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the variant text detection method in any other suitable manner (e.g., by means of firmware).

The electronic device for implementing the model training method according to the embodiment of the present disclosure is similar to the electronic device for implementing the variant text detection method shown in fig. 5, and is not described herein again.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and addresses the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

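1. A variant text detection method, comprising:

respectively inputting a plurality of texts into a variant text detection model to obtain a variant score of each text in the plurality of texts, wherein the variant text detection model is obtained by training a text recognition model by using a variant text sample;

determining a first variant text in the plurality of texts according to the variant score of each text, and determining a first account corresponding to the first variant text;

determining a suspicious account associated with the first account;

and performing content feature detection on the text submitted by the suspicious account, and determining a second variant text in the text submitted by the suspicious account according to a result of the content feature detection.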

2. The method of claim 1, wherein said determining a first variant text of said plurality of texts from said variant score for each text comprises:

determining a first text having a variant score greater than or equal to a first threshold as the first variant text.

3. The method of claim 2, further comprising:

and outputting first indication information for second text with a variant score larger than or equal to a second threshold and smaller than the first threshold, wherein the first indication information is used for indicating that an annotation is added to the second text, and if the received second text is annotated as variant text, determining the second text as the first variant text.

4. The method of any of claims 1-3, wherein the determining a suspicious account related to the first account comprises:

determining an account having the same Internet Protocol (IP) address as the first account as the suspicious account.

5. The method of any of claims 1-3, wherein the determining a suspicious account related to the first account comprises:

determining an account whose interaction value with the first account is greater than a preset value as the suspicious account.

6. The method of any of claims 1-5, wherein the content feature detection of the text submitted by the suspicious account comprises:

inputting the text submitted by the suspicious account into the variant text detection model to obtain the variant score of the text submitted by the suspicious account;

and detecting content characteristics of a third text of which the variant score is smaller than a third threshold value in the texts submitted by the suspicious account.

7. The method of claim 6, wherein the determining a second variant text in the text submitted by the suspicious account based on the result of the content feature detection comprises:

and if the result of the content feature detection meets a preset condition, outputting second indication information, wherein the second indication information is used for indicating that a label is added to the third text, and if the received label of the third text is a variant text, determining the third text as the second variant text.

8. The method of claim 7, wherein the content feature detection of a third text with a variant score smaller than a third threshold value in the texts submitted by the suspicious account comprises:

performing similarity detection on every two texts in the third text, and if the similarity of the two texts is greater than a similarity threshold value, determining that the two texts are similar texts;

if the content feature detection result meets a preset condition, outputting second indication information, including:

and if the number of the similar texts in the third text is greater than a preset value, outputting the second indication information.

9. The method of claim 7, wherein the content feature detection of a third text with a variant score smaller than a third threshold value in the texts submitted by the suspicious account comprises:

semantic relevance detection is carried out on the third text and the superior text object of the third text, and relevance scores of the third text and the superior text object of the third text are obtained;

and if the relevance scores of the third text and the superior text object of the third text are smaller than a relevance threshold value, outputting the second indication information.

10. The method of claim 7, wherein the content feature detection of a third text with a variant score smaller than a third threshold value in the texts submitted by the suspicious account comprises:

matching the third text with a preset special character library to obtain the number of special characters contained in the third text;

and if the number of the special characters contained in the third text is greater than a special character threshold value, outputting the second indication information.

11. The method of claim 7, wherein the content feature detection of a third text with a variant score smaller than a third threshold value in the texts submitted by the suspicious account comprises:

matching the third text with a keyword word bank to obtain the number of keywords in the third text, wherein the keyword word bank is obtained by extracting keywords from a preset variant text;

and if the number of the keywords in the third text is greater than a keyword threshold value, outputting the second indication information.

12. The method according to any one of claims 6-11, further comprising:

and outputting third indication information for fourth texts of which the variant scores are greater than or equal to a third threshold and smaller than a second threshold in the texts submitted by the suspicious account, wherein the third indication information is used for indicating that labels are added to the fourth texts, and if the labels of the received fourth texts are variant texts, determining the fourth texts as the second variant texts.

13. A model training method, comprising:

obtaining a first variant text and a second variant text, wherein the first variant text is a variant text determined according to a variant score of each text after a plurality of texts are respectively input into a variant text detection model to obtain the variant score of each text, and the second variant text is a variant text determined according to a result of content feature detection performed on a text submitted by a suspicious account related to a first account corresponding to the first variant text;

and training the variant text detection model by using the first variant text and the second variant text to update model parameters of the variant text detection model.

14. A variant text detection apparatus comprising:

the input module is used for respectively inputting the plurality of texts into a variant text detection model to obtain a variant score of each text in the plurality of texts, and the variant text detection model is obtained by training a text recognition model by adopting a variant text sample;

a first determining module, configured to determine, according to the variant score of each text, a first variant text in the plurality of texts, and determine a first account corresponding to the first variant text;

a second determination module to determine a suspicious account associated with the first account;

and the detection module is used for carrying out content characteristic detection on the texts submitted by the suspicious account and determining a second variant text in the texts submitted by the suspicious account according to the result of the content characteristic detection.

15. The apparatus of claim 14, wherein the first determining module comprises:

a first determination unit configured to determine a first text having a variant score greater than or equal to a first threshold as the first variant text.

16. The apparatus of claim 15, further comprising:

and a second determining unit, configured to output, for a second text with a variant score greater than or equal to a second threshold and smaller than the first threshold, first indication information, where the first indication information is used to indicate that an annotation is added to the second text, and if the received annotation of the second text is a variant text, determine the second text as the first variant text.

17. The apparatus of any of claims 14-16, wherein the second determining module comprises:

a third determining unit, configured to determine an account having the same Internet Protocol (IP) address as the first account as the suspicious account.

18. The apparatus of any of claims 14-16, wherein the second determining module comprises:

19. The apparatus of any one of claims 14-18, wherein the detection module comprises:

20. The apparatus of claim 19, wherein the detection module comprises:

a fifth determining unit, configured to output second indication information if the result of the content feature detection meets a preset condition, where the second indication information is used to indicate that a label is added to the third text, and if a label of the received third text is a variant text, determine the third text as the second variant text.

21. The apparatus of claim 20, wherein the detection unit comprises:

the first detection subunit is configured to perform similarity detection on every two texts in the third text, and if the similarity of two texts is greater than a similarity threshold, determine that the two texts are similar texts;

the fifth determination unit includes:

and the first output subunit is used for outputting the second indication information if the number of the similar texts in the third text is greater than a preset value.

22. The apparatus of claim 20, wherein the detection unit comprises:

the second detection subunit is configured to perform semantic relevance detection on the third text and a superior text object of the third text, so as to obtain relevance scores of the third text and the superior text object of the third text;

the fifth determination unit includes:

and the second output subunit outputs the second indication information if the relevance scores of the third text and the superior text object of the third text are smaller than a relevance threshold.

23. The apparatus of claim 20, wherein the detection unit comprises:

the third detection subunit matches the third text with a preset special character library to obtain the number of special characters contained in the third text;

the fifth determination unit includes:

and the third output subunit outputs the second indication information if the number of the special characters contained in the third text is greater than a special character threshold value.

24. The apparatus of claim 20, wherein the detection unit comprises:

the fourth detection subunit matches the third text with a keyword lexicon to obtain the number of keywords in the third text, wherein the keyword lexicon is obtained by extracting keywords from a preset variant text;

the fifth determination unit includes:

and the fourth output subunit outputs the second indication information if the number of the keywords included in the third text is greater than a keyword threshold value.

25. The apparatus of any of claims 19-24, further comprising:

and a fifth output subunit, configured to output third indication information for a fourth text whose variant score is greater than or equal to a third threshold and smaller than a second threshold in the text submitted by the suspicious account, where the third indication information is used to indicate that a label is added to the fourth text, and if a label of the fourth text is received and is a variant text, determine the fourth text as the second variant text.

26. A model training apparatus comprising:

27. An electronic device, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.

28. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-13.

29. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-13.