CN111046658B - Method, device and equipment for recognizing disorder text - Google Patents


Info

Publication number
CN111046658B
CN111046658B (application CN201911306126.0A)
Authority
CN
China
Prior art keywords
model
sequence
text
identified
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911306126.0A
Other languages
Chinese (zh)
Other versions
CN111046658A (en)
Inventor
孙建举
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911306126.0A
Publication of CN111046658A
Application granted
Publication of CN111046658B

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06N: Computing arrangements based on specific computational models
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the specification disclose a method, a device and equipment for recognizing disorder text. The disorder text recognition scheme comprises the following steps: inputting the extracted feature vector of the text to be identified into a trained sequence-to-sequence model to obtain the feature vector of the sequenced text, wherein the encoder sub-model and the decoder sub-model in the trained sequence-to-sequence model are both recurrent neural network models; calculating a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified; and when the difference value is larger than a preset value, determining that the text information to be identified is disorder text information.

Description

Method, device and equipment for recognizing disorder text
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method, an apparatus, and a device for recognizing a disorder text.
Background
Currently, lawbreakers use random information generators to produce large amounts of false information and use it for profit. This way of profiting from spurious information has gradually formed a black industry chain. To combat illicit activities that profit from spurious information, the spurious information needs to be identified. Since spurious information is typically randomly generated out-of-order text, a need arises to identify the out-of-order text.
In summary, how to accurately identify the disorder text is a technical problem to be solved.
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure provide a method, an apparatus, and a device for recognizing disorder text, which are used for accurately recognizing the disorder text.
In order to solve the above technical problems, the embodiments of the present specification are implemented as follows:
the method for recognizing the disorder text provided by the embodiment of the specification comprises the following steps:
acquiring text information to be identified;
extracting the characteristics of the text information to be identified to obtain the characteristic vector of the text to be identified;
inputting the feature vector of the text to be identified into a trained sequence-to-sequence model to obtain the feature vector of the sequenced text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model by using feature vectors of positive sequence texts, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models;
calculating a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified;
and when the difference value is larger than a preset value, determining that the text information to be identified is disorder text information.
The training method for the sequence-to-sequence model provided by the embodiment of the specification comprises the following steps:
acquiring a sample set, wherein samples in the sample set are positive sequence text information;
extracting the characteristics of each sample in the sample set to obtain a sample characteristic vector set;
and training an initial sequence-to-sequence model by using the sample feature vector set to obtain a trained sequence-to-sequence model, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models.
The embodiment of the specification provides a disorder text recognition device, which comprises:
the acquisition module is used for acquiring text information to be identified;
the feature extraction module is used for extracting features of the text information to be identified to obtain feature vectors of the text to be identified;
the sequencing module is used for inputting the feature vector of the text to be identified into the trained sequence-to-sequence model to obtain the feature vector of the sequenced text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model by using feature vectors of positive sequence texts, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models;
the difference value calculation module is used for calculating a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified;
and the disorder text information determining module is used for determining that the text information to be identified is disorder text information when the difference value is larger than a preset value.
The embodiment of the present specification provides a training device for a sequence-to-sequence model, including:
the acquisition module is used for acquiring a sample set, wherein samples in the sample set are positive sequence text information;
the feature extraction module is used for extracting features of each sample in the sample set to obtain a sample feature vector set;
the training module is used for training an initial sequence to sequence model by using the sample feature vector set to obtain a trained sequence to sequence model, wherein the initial sequence to sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models.
The embodiment of the specification provides a disorder text recognition device, which comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring text information to be identified;
extracting the characteristics of the text information to be identified to obtain the characteristic vector of the text to be identified;
inputting the feature vector of the text to be identified into a trained sequence-to-sequence model to obtain the feature vector of the sequenced text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model by using feature vectors of positive sequence texts, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models;
calculating a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified;
and when the difference value is larger than a preset value, determining that the text information to be identified is disorder text information.
The embodiment of the specification provides training equipment of a sequence-to-sequence model, which comprises the following components:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring a sample set, wherein samples in the sample set are positive sequence text information;
extracting the characteristics of each sample in the sample set to obtain a sample characteristic vector set;
and training an initial sequence-to-sequence model by using the sample feature vector set to obtain a trained sequence-to-sequence model, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models.
At least one embodiment of the present disclosure can achieve the following advantages: the extracted feature vector of the text to be identified is input into the trained sequence-to-sequence model, so that the feature vector of the sequenced text can be obtained. When the difference value between the feature vector of the sequenced text and the feature vector of the text to be identified is larger than a preset value, the text information to be identified is determined to be disorder text information. The encoder sub-model and the decoder sub-model in the trained sequence-to-sequence model are both recurrent neural network models, so the feature vector of the sequenced text obtained based on the trained sequence-to-sequence model is highly accurate, and the accuracy of the out-of-order text recognition result obtained based on that feature vector can be improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of one or more embodiments of the specification, illustrate and explain one or more embodiments of the specification, and are not an undue limitation on the one or more embodiments of the specification. In the drawings:
FIG. 1 is a schematic flow chart of a method for recognizing a disorder text according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a sequence-to-sequence model according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a training method of a sequence-to-sequence model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a disorder text recognition device corresponding to FIG. 1 according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a training device corresponding to the sequence-to-sequence model of FIG. 3 according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a disorder text recognition device corresponding to fig. 1 according to an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of one or more embodiments of the present specification more clear, the technical solutions of one or more embodiments of the present specification will be clearly and completely described below in connection with specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without undue burden, are intended to be within the scope of one or more embodiments herein.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
In the prior art, lawbreakers gather the preferential information of financial institutions and merchants and, by registering false accounts or fabricating user information, exchange low-cost or even zero-cost transactions for high-value rewards in business marketing activities, which adversely affects the normal operation of each enterprise's business. Because a large amount of disorder text exists in the information used by lawbreakers, such as user names, address information and mailbox information, illegal actions can be identified through recognition of the disorder text.
At present, when recognizing disorder text, a large number of positive sequence texts are generally counted to obtain the probability that any two words are adjacent. When recognizing a text, the probability of each word appearing at each position can be determined, from which the probability that the text is disorder text is obtained. This method considers only the association between adjacent words, so the accuracy of its recognition results is poor. Moreover, because a large number of text samples are required for statistical training, the method is not suitable for scenarios with a limited number of samples, and therefore has poor applicability.
In order to solve the drawbacks of the prior art, the present solution provides the following embodiments:
fig. 1 is a schematic flow chart of a method for recognizing a disorder text according to an embodiment of the present disclosure. From the program perspective, the execution subject of the flow may be a program installed on an application server.
As shown in fig. 1, the process may include the steps of:
step 102: and acquiring text information to be identified.
In the embodiment of the present disclosure, the type of text information to be recognized differs according to the application scenario. For example, when applied to a scenario of identifying falsely registered accounts, the text information to be identified may be account registration information for a user account, such as the user name, mailbox account identifier, user address and store name entered when the user registers an account. When applied to a scenario where false information is used to obtain resources, the text information to be identified may be account information used by the user to obtain specified resources, such as the user name, user account identifier, contact mailbox and contact address entered when the user claims merchant coupons or red packets. It can be seen that the text information to be recognized may include information in various formats such as Chinese characters, numerals, English words, English letters and punctuation marks.
Step 104: and extracting the characteristics of the text information to be identified to obtain the characteristic vector of the text to be identified.
In the embodiment of the present specification, a word vector (word embedding) is a vector obtained by mapping a word or phrase from a vocabulary to real numbers. Word vectors are an important basis in natural language processing and facilitate the analysis of text, emotion, word sense, and the like. Therefore, the word vectors obtained by extracting features from the text information to be identified can be used as the feature vector of the text to be identified.
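As a minimal illustration of this step (the vocabulary and vector values below are invented for the example, not taken from the patent), feature extraction can be sketched as a lookup from each word of the text to its word vector:

```python
# Hypothetical word-vector table; in practice these vectors would come
# from a trained embedding model such as Word2vec.
WORD_VECTORS = {
    "hang": [0.2, 0.1, -0.4],
    "zhou": [0.3, -0.2, 0.5],
    "city": [-0.1, 0.4, 0.2],
}

def extract_features(tokens):
    """Map each token of the text to its word vector, skipping unknowns."""
    return [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]

features = extract_features(["hang", "zhou", "city"])
```

The list of per-word vectors then serves as the "feature vector of the text to be identified" that later steps consume.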
Step 106: inputting the feature vector of the text to be identified into a trained sequence to a sequence model to obtain the feature vector of the text after sequencing; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model by using feature vectors of positive sequence texts, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models.
In the present embodiment, the sequence-to-sequence model (Seq2Seq model) is a type of encoder-decoder model. Fig. 2 is a schematic structural diagram of a sequence-to-sequence model according to an embodiment of the present disclosure. As shown in fig. 2, the sequence-to-sequence model may include an encoder sub-model 202 and a decoder sub-model 204.
For ease of understanding, the working principle of the sequence-to-sequence model is illustrated with reference to fig. 2. Assume that the text to be recognized contains three words or characters, whose corresponding feature vectors (i.e. word vectors) are w_0, w_1 and w_2 in order, and that the initial hidden state of the encoder sub-model 202 is h_0. The hidden state output by the encoder sub-model 202 at each moment is h_j = L(w_j, h_{j-1}), where L denotes the expression of the encoder sub-model, w_j is the feature vector corresponding to the j-th word or character in the text to be recognized, and h_{j-1} is the hidden state output by the encoder sub-model at the previous moment. The hidden states output by the encoder sub-model 202 at each moment are arranged in order to obtain a vector v, which may be represented as v = (h_1, h_2, h_3).
Assume that the initial output of the decoder sub-model 204 is t_0. The input to the decoder sub-model 204 at each moment is the vector v together with the output of the decoder sub-model 204 at the previous moment. The output of the decoder sub-model 204 at each moment is t_j = F(v, t_{j-1}), where F denotes the expression of the decoder sub-model, v is the vector obtained by arranging the hidden states output by the encoder sub-model at each moment in order, and t_{j-1} is the output of the decoder sub-model at the previous moment. The outputs of the decoder sub-model 204 at each moment are arranged in order to obtain an output vector, which may be represented as T = (t_1, t_2, t_3). This output vector is the feature vector of the sequenced text obtained after reordering the text to be recognized.
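The recurrences h_j = L(w_j, h_{j-1}) and t_j = F(v, t_{j-1}) can be sketched as a toy forward pass. The weights below are random and untrained and the cells are plain tanh units; this only shows the data flow, not the patent's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # dimension of word vectors and hidden states (illustrative)

# Untrained illustrative weights; a real model learns these from data.
W_in = rng.standard_normal((d, d))
W_h = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, 3 * d))  # decoder reads the full vector v
W_t = rng.standard_normal((d, d))

def encoder_step(w_j, h_prev):
    # h_j = L(w_j, h_{j-1}): a simple tanh recurrent cell
    return np.tanh(W_in @ w_j + W_h @ h_prev)

def decoder_step(v, t_prev):
    # t_j = F(v, t_{j-1})
    return np.tanh(W_v @ v + W_t @ t_prev)

words = [rng.standard_normal(d) for _ in range(3)]  # w_0, w_1, w_2
h = np.zeros(d)                                     # h_0
hidden = []
for w in words:                      # encoder pass
    h = encoder_step(w, h)
    hidden.append(h)
v = np.concatenate(hidden)           # v = (h_1, h_2, h_3)

t = np.zeros(d)                      # t_0
outputs = []
for _ in range(3):                   # decoder pass
    t = decoder_step(v, t)
    outputs.append(t)
# outputs plays the role of T = (t_1, t_2, t_3), the feature
# vector of the sequenced text.
```

In the patent's setting the tanh cells would be replaced by trained LSTM cells, but the encoder/decoder wiring is the same.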
In the embodiment of the present specification, the feature vectors of positive sequence text may be used in advance to train the initial sequence-to-sequence model, so that the trained model learns the rules and features of positive sequence text. After the feature vector of the text to be recognized is input into the trained sequence-to-sequence model, the model processes it according to the learned rules and features of positive sequence text and generates the feature vector of a sequenced text that conforms to those rules and features. That is, the feature vector of the sequenced text obtained by inputting the feature vector of the text to be recognized into the trained sequence-to-sequence model can be regarded as the feature vector of the closest positive sequence text that the words or characters of the text to be recognized could constitute.
In the embodiments of the present disclosure, both the encoder sub-model and the decoder sub-model in the sequence-to-sequence model may be implemented using recurrent neural network models. The recurrent neural network family includes various algorithm models, such as the basic recurrent neural network and the long short-term memory (LSTM) network. The encoder sub-model and the decoder sub-model in the sequence-to-sequence model in step 106 may be implemented using any of these recurrent neural network models.
In practical applications, the basic recurrent neural network is prone to exploding gradients due to the long-range dependence problem. The long short-term memory network model introduces a forget gate and converges easily, so it can alleviate the long-range dependence problem. The long short-term memory network model can mine the relationships among sequence elements and is suitable for processing and predicting important events with very long intervals and delays in a time series, which makes it suitable for the scenario of reordering a text sequence according to certain rules.
When the encoder sub-model and the decoder sub-model in the trained sequence-to-sequence model adopt the long short-term memory network model, experiments show that the F1 score of the trained sequence-to-sequence model can reach 0.91, where the F1 score is the harmonic mean of precision and recall. Therefore, the feature vector of the sequenced text obtained by the trained sequence-to-sequence model has good accuracy and precision, which helps improve the effectiveness of the disorder text recognition method. Moreover, the trained sequence-to-sequence model can be obtained by training the initial model with a small number of positive samples, without a large number of positive and negative training samples, so the disorder text recognition method is applicable to scenarios with few training samples, improving its universality.
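The F1 score is, as stated, the harmonic mean of precision and recall; a small helper makes the relationship concrete. (The precision/recall inputs below are invented to land near the reported 0.91, not the patent's measured values.)

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. precision 0.9 and recall 0.92 give an F1 of about 0.91
score = f1_score(0.9, 0.92)
```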
Step 108: and calculating a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified.
Step 110: and when the difference value is larger than a preset value, determining that the text information to be identified is disorder text information.
In the embodiment of the present specification, the feature vector of the sequenced text may be regarded as the feature vector of the closest positive sequence text that the words or characters in the text to be recognized could constitute. Therefore, when the difference value between the feature vector of the sequenced text and the feature vector of the text to be recognized is smaller than or equal to a preset value, the text to be recognized can be considered to conform to the features of positive sequence text, i.e. the text information to be recognized can be considered positive sequence text information. When the difference value is larger than the preset value, the text to be recognized can be considered not to conform to the features of positive sequence text, i.e. the text information to be recognized can be considered disorder text information.
It should be understood that, in the method according to one or more embodiments of the present disclosure, the order of some steps may be exchanged as needed, or some steps may be omitted.
The method in fig. 1 uses the trained sequence-to-sequence model to process the feature vector of the text information to be identified to obtain the feature vector of the sequenced text, then calculates a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified, and compares the difference value with a preset value to identify whether the text information to be identified is disorder text information. Because the trained sequence-to-sequence model is obtained by training with feature vectors of positive sequence text, and its encoder sub-model and decoder sub-model are recurrent neural network models, the feature vector of the sequenced text obtained by the model is highly accurate, which in turn improves the accuracy of the disorder text recognition result.
The examples of the present specification also provide some specific embodiments of the method based on the method of fig. 1, which is described below.
In the present embodiment, step 104: the feature extraction of the text information to be identified may specifically include:
and extracting the characteristics of the text information to be identified by adopting a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model by using positive sequence text information.
In embodiments of the present disclosure, a Word2vec model may be used to map each word in the text information to be identified to a feature vector (i.e. a word vector), which can represent word-to-word relationships. In practical applications, the initial Word2vec model can be trained in advance with positive sequence text information, so that the trained Word2vec model learns the word-to-word relationships in positive sequence text. The feature vectors of the text to be identified extracted by the trained Word2vec model are then more accurate, which further improves the accuracy of the disorder text recognition method in the embodiments of the present specification.
In the embodiment of the present specification, the types of text information to be recognized are various, for example, name, address, mailbox, and the like. In practical application, the same type of positive sequence text sample should be used to train the initial Word2vec model, and the trained Word2vec model is used to perform feature extraction on the text information to be identified, which has the same type as the positive sequence text sample. For example, a Word2vec model obtained by training positive-sequence address text samples can be used for extracting features of address text information, but not extracting features of name text information, so as to ensure the accuracy of extracted feature vectors of texts to be identified.
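A minimal sketch of this per-type routing rule. The types, texts and the dummy extractor below are hypothetical; a real extractor would wrap a Word2vec model trained only on positive sequence samples of that one type:

```python
class TypedExtractor:
    """Stand-in for a Word2vec model trained on one text type only."""
    def __init__(self, text_type):
        self.text_type = text_type

    def extract(self, text):
        # Placeholder featurization; a real model returns word vectors.
        return [float(len(tok)) for tok in text.split()]

EXTRACTORS = {
    "address": TypedExtractor("address"),
    "name": TypedExtractor("name"),
}

def extract_for(text_type, text):
    """Only use the extractor whose training type matches the input type."""
    if text_type not in EXTRACTORS:
        raise KeyError(f"no model trained for type {text_type!r}")
    return EXTRACTORS[text_type].extract(text)
```

The point of the lookup is exactly the rule in the text: address text never reaches the name model, and vice versa.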
In the present description embodiment, step 108: calculating a difference value between the feature vector of the ordered text and the feature vector of the text to be identified may specifically include:
and calculating a distance value between the feature vector of the sequenced text and the feature vector of the text to be identified.
In the embodiment of the present disclosure, there are various methods for calculating the distance value between the feature vector of the sequenced text and the feature vector of the text to be identified, such as the Euclidean distance, the Manhattan distance or the Mahalanobis distance. In practical applications, a root mean squared error (RMSE) formula can be used to calculate the distance value, improving the accuracy of the calculated distance value.
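Combining steps 108 and 110, the RMSE difference and the preset-value comparison can be sketched as follows (the preset value 0.5 is an arbitrary placeholder, not a threshold from the patent):

```python
import math

def rmse(a, b):
    """Root mean squared error between two equal-length feature vectors."""
    assert len(a) == len(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def is_disorder_text(sorted_vec, input_vec, preset=0.5):
    """Flag the text as disorder text when the difference exceeds the preset value."""
    return rmse(sorted_vec, input_vec) > preset
```

A large RMSE means the reordered feature vector moved far from the input, i.e. the input did not already look like positive sequence text.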
In the embodiment of the present specification, the types of text information to be identified in different application scenarios are also different. When applied to a scenario where a false registration account is identified, the text information to be identified may be account registration information. Step 110: after determining that the text information to be identified is the disorder text information, the method may further include:
generating a control instruction for prohibiting the user from using the account corresponding to the account registration information.
In the embodiment of the present specification, when the text information to be identified is account registration information and the account registration information is determined to be disorder text, the user can be considered to have used false registration information when registering the account. Since an account registered with false registration information is very likely a false account registered by a lawbreaker, the account corresponding to the account registration information can be disabled by generating a prohibition control instruction. This reduces the probability of lawbreakers performing illegal actions with falsely registered accounts and further safeguards the normal operation of financial institutions' or merchants' businesses.
In practical application, other information of the account corresponding to the account registration information may be combined at the same time to finally determine whether the account corresponding to the account registration information is a false registration account, which is not specifically limited in this embodiment of the present disclosure.
In the embodiment of the present specification, when applied to a scenario where a user obtains a specified resource using a false registered account or false account information, the text information to be identified may be account information used by the user to obtain the specified resource. Step 110: after determining that the text information to be identified is the disorder text information, the method may further include:
generating a control instruction for prohibiting the user from using the specified resources in the account corresponding to the account information.
In the embodiment of the present specification, for ease of understanding, a scenario in which a user obtains specified resources using a falsely registered account or false account information is illustrated. For example, a lawbreaker may purchase a large number of subscriber identity module (SIM) cards to obtain the right to use a large number of mobile phone numbers, and generate a large amount of false user information with software such as name generators and identity card generators. The lawbreaker then uses the mobile phone numbers and false user information to register a large number of false accounts on each platform, and uses them to obtain offers provided by financial institutions and merchants. Alternatively, the lawbreaker may directly use fabricated false account information to obtain such offers. This behaviour of lawbreakers is colloquially called "pulling wool". The offers provided by financial institutions and merchants are the specified resources obtained by users through falsely registered accounts or false account information; the specified resources may include, but are not limited to, red packets, coupons, physical gifts and mobile data traffic.
In this embodiment of the present disclosure, when the text information to be identified is the account information used by a user to acquire a specified resource, and that account information is determined to be out-of-order text, the user may be considered to be acquiring the specified resource with a falsely registered account or false account information, which does not conform to the issuing rules of the specified resource. Accordingly, a prohibition control instruction may be generated to prohibit the user from using the specified resource in the account corresponding to the account information, or the issuance of the specified resource to that account may be refused. This reduces the probability that lawbreakers exploit falsely registered accounts or false account information for "wool-pulling", and further safeguards the normal operation of financial institutions and merchants.
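The control step described above can be sketched as a small decision function. This is an illustrative sketch only, not the patent's implementation; the function name, the action strings, and the example threshold are assumptions.

```python
# Illustrative sketch of the prohibition control step. The names
# decide_action, PROHIBIT_RESOURCE_USE and ISSUE_RESOURCE are
# hypothetical, not taken from the patent.
def decide_action(difference_value: float, preset_value: float) -> str:
    """Map the out-of-order recognition result to a control decision."""
    if difference_value > preset_value:
        # Account information judged out-of-order: likely false account
        # information, so prohibit use of the specified resource.
        return "PROHIBIT_RESOURCE_USE"
    return "ISSUE_RESOURCE"

print(decide_action(0.9, 0.5))  # difference exceeds preset value: prohibit
print(decide_action(0.2, 0.5))  # difference within preset value: issue
```

In a deployed system the returned action would typically be translated into the control instruction that deactivates or withholds the specified resource.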
In the embodiment of the present specification, after it is determined that the text information to be recognized is out-of-order text information, the method may further include: generating prompt information, the prompt information being used to prompt that the text information to be identified is out-of-order text information. This makes it convenient for the application platform to perform subsequent processing on the identified out-of-order text information, and improves the practicability of the out-of-order text recognition method.
Fig. 3 is a flow chart of a training method of a sequence-to-sequence model according to an embodiment of the present disclosure. From the program perspective, the execution subject of the flow may be a program installed on an application server.
As shown in fig. 3, the process may include the steps of:
Step 302: acquiring a sample set, wherein the samples in the sample set are positive sequence text information.
In the embodiment of the present disclosure, the desired behavior of the trained sequence-to-sequence model differs across application scenarios, and therefore the types of samples in the sample set used to train the initial sequence-to-sequence model also differ. For example, when out-of-order recognition needs to be performed on a name text to be recognized, the trained sequence-to-sequence model is used to reorder that name text; correspondingly, the samples in the sample set used to train the initial sequence-to-sequence model should be positive-order name texts. In the present embodiment, all samples (i.e., positive sequence text information) within one sample set are of the same type.
Step 304: extracting features from each sample in the sample set to obtain a sample feature vector set.
In the embodiment of the present disclosure, step 304 may specifically include: extracting features from each sample in the sample set using a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model with positive sequence text information.
In the present embodiment, the positive sequence text information used to train the initial Word2vec model is of the same type as the samples in the sample set. For example, when the samples in the sample set are name texts, the trained Word2vec model used for feature extraction should be generated by training the initial Word2vec model with positive-order name texts. This ensures the accuracy of the sample feature vectors obtained by extracting features from the samples with the trained Word2vec model.
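In practice the feature extraction step would use a Word2vec model trained on positive sequence text (e.g., via a library such as gensim). As a minimal self-contained stand-in, the sketch below uses deterministic pseudo-embeddings with positional weighting; the function names and the weighting scheme are assumptions, chosen only to illustrate that a text's feature vector must depend on token order for out-of-order detection to be possible.

```python
import hashlib

def char_vector(ch: str, dim: int = 8) -> list:
    """Deterministic pseudo-embedding for one character; a stand-in for
    looking up a vector in a trained Word2vec model."""
    digest = hashlib.md5(ch.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def text_feature(text: str, dim: int = 8) -> list:
    """Position-weighted average of character vectors, so that the same
    characters in a different order yield a different feature vector."""
    vec = [0.0] * dim
    for pos, ch in enumerate(text, start=1):
        cv = char_vector(ch, dim)
        for i in range(dim):
            vec[i] += cv[i] / pos  # positional weighting: order matters
    n = max(len(text), 1)
    return [v / n for v in vec]

# The same characters in positive order and in shuffled order give
# different feature vectors, which is what makes detection possible.
assert text_feature("abc") != text_feature("cba")
```

A plain (unweighted) bag-of-characters average would map "abc" and "cba" to the identical vector and could never reveal disorder, which is why order-sensitive features are essential here.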
Step 306: training an initial sequence-to-sequence model with the sample feature vector set to obtain a trained sequence-to-sequence model, wherein the initial sequence-to-sequence model includes an encoder sub-model and a decoder sub-model, both of which are recurrent neural network models.
In the embodiment of the present disclosure, since the feature vectors in the sample feature vector set are extracted from positive sequence text information, training the initial sequence-to-sequence model on this set lets the trained model learn the rules and features of positive sequence text. When the feature vector of the text to be recognized is input into the trained sequence-to-sequence model, the model processes it according to the learned rules and features and generates the feature vector of a reordered text that conforms to them. In other words, the trained sequence-to-sequence model can be regarded as reordering the text information to be recognized, producing the feature vector of the nearest positive sequence text that can be formed from the words or characters contained in the text information to be recognized.
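The reorder-then-compare idea above can be illustrated with a toy stand-in for the trained model. Here the "learned canonical order" is simply alphabetical sorting and the comparison is done on the raw strings; a real system would use an LSTM encoder-decoder trained on positive sequence samples and would compare feature vectors. All names and the threshold are hypothetical.

```python
def toy_reorder_model(text: str) -> str:
    # Stand-in for the trained sequence-to-sequence model: it outputs
    # the canonical ("positive order") arrangement of the input tokens.
    # A real implementation would be an LSTM encoder-decoder.
    return "".join(sorted(text))

def is_out_of_order(text: str, preset_value: float = 0.0) -> bool:
    """Reorder the input, then measure how much it changed."""
    reordered = toy_reorder_model(text)
    # Fraction of positions that differ between input and reordered
    # output; a real system would compare feature vectors instead.
    diff = sum(a != b for a, b in zip(text, reordered)) / len(text)
    return diff > preset_value

print(is_out_of_order("abc"))  # False: already in canonical order
print(is_out_of_order("cab"))  # True: reordering changes the text
```

Text already in positive order passes through the model nearly unchanged, so its difference value is small; shuffled text is substantially rewritten, producing a large difference value, which is exactly the signal the recognition method thresholds on.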
In the embodiment of the specification, word vectors are used to train the initial sequence-to-sequence model, so the feature vectors (i.e., word vectors) of a small amount of positive sequence text information suffice to obtain a trained sequence-to-sequence model with high accuracy and precision. The sequence-to-sequence model training method of the embodiment of the present disclosure therefore has good universality: it is suitable not only for scenarios with abundant training samples but also for scenarios with few training samples.
In the embodiment of the present disclosure, the encoder sub-model and the decoder sub-model in the initial sequence-to-sequence model may be implemented with any recurrent neural network model. The recurrent neural network family includes various models, such as the vanilla recurrent neural network (RNN) and the long short-term memory network (LSTM). When both the encoder sub-model and the decoder sub-model adopt long short-term memory network models, experiments show that the F1 score of the trained sequence-to-sequence model can reach 0.91, where the F1 score is the harmonic mean of precision and recall. The trained sequence-to-sequence model therefore has good accuracy, and the feature vectors of the reordered text obtained by processing the feature vectors of the text information to be recognized are correspondingly accurate and precise, which improves the effectiveness of the out-of-order text recognition method.
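The F1 score mentioned above is the harmonic mean of precision and recall. A short sketch of the computation (the precision/recall pair below is illustrative, not reported data):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# An illustrative pair: precision 0.92 and recall 0.90 give an F1
# close to the 0.91 reported in the text.
print(round(f1_score(0.92, 0.90), 2))  # 0.91
```

Because the harmonic mean is dominated by the smaller of the two values, a high F1 requires the model to be simultaneously precise (few false alarms) and sensitive (few missed out-of-order texts).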
Based on the same idea, the embodiment of the present disclosure further provides an apparatus corresponding to the method in fig. 1. Fig. 4 is a schematic structural diagram of an out-of-order text recognition apparatus corresponding to fig. 1 according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus may include:
an obtaining module 402, configured to obtain text information to be identified.
The feature extraction module 404 is configured to perform feature extraction on the text information to be identified to obtain a feature vector of the text to be identified.
The sorting module 406 is configured to input the feature vector of the text to be identified into the trained sequence-to-sequence model to obtain a feature vector of the reordered text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model with feature vectors of positive sequence texts, wherein the initial sequence-to-sequence model includes an encoder sub-model and a decoder sub-model, both of which are recurrent neural network models. Specifically, the encoder sub-model and the decoder sub-model may both be long short-term memory network models.
A difference value calculating module 408, configured to calculate a difference value between the feature vector of the ordered text and the feature vector of the text to be identified.
The out-of-order text information determining module 410 is configured to determine that the text information to be identified is out-of-order text information when the difference value is greater than a preset value.
Based on the apparatus in fig. 4, the embodiments of the present specification further provide some specific implementations, as described below.
In the illustrated embodiment, the feature extraction module 404 may be specifically configured to:
Extracting features from the text information to be identified using a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model with positive sequence text information.
In the illustrated embodiment, the variance value calculation module 408 may be specifically configured to:
Calculating a distance value between the feature vector of the reordered text and the feature vector of the text to be identified.
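The distance value can be computed with any standard vector metric; Euclidean distance is one natural choice (the patent does not fix a particular metric, so the choice below is an assumption for illustration):

```python
import math

def distance(u: list, v: list) -> float:
    """Euclidean distance between two feature vectors; other metrics
    (cosine distance, Manhattan distance) could equally serve as the
    difference value."""
    assert len(u) == len(v), "feature vectors must have equal dimension"
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d = distance([1.0, 2.0, 2.0], [0.0, 0.0, 0.0])
print(d)  # 3.0
```

A small distance means the model's reordering barely changed the text (likely positive order); a distance above the preset value flags the input as out-of-order text information.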
In an embodiment of the specification, when the text information to be identified is account registration information, the apparatus may further include:
The prohibition control instruction generation module is configured to generate a prohibition control instruction after it is determined that the text information to be identified is out-of-order text information, so as to prohibit the user from using the account corresponding to the account registration information.
In an embodiment of the present disclosure, when the text information to be identified is account information used by a user to obtain a specified resource, the apparatus in fig. 4 may further include:
The prohibition control instruction generation module is configured to generate a prohibition control instruction after it is determined that the text information to be identified is out-of-order text information, so as to prohibit the user from using the specified resource in the account corresponding to the account information.
In an embodiment of the disclosure, the apparatus in fig. 4 may further include a prompt information generation module. The prompt information generation module may be configured to generate prompt information after it is determined that the text information to be identified is out-of-order text information, wherein the prompt information is used to prompt that the text information to be identified is out-of-order text information.
Based on the same idea, the embodiment of the present disclosure further provides an apparatus corresponding to the method in fig. 3. Fig. 5 is a schematic structural diagram of a training apparatus for the sequence-to-sequence model of fig. 3 according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus may include:
the obtaining module 502 is configured to obtain a sample set, where samples in the sample set are positive sequence text information.
The feature extraction module 504 is configured to perform feature extraction on each sample in the sample set to obtain a sample feature vector set.
The training module 506 is configured to train an initial sequence-to-sequence model with the sample feature vector set to obtain a trained sequence-to-sequence model, wherein the initial sequence-to-sequence model includes an encoder sub-model and a decoder sub-model, both of which are recurrent neural network models. Specifically, the encoder sub-model and the decoder sub-model may both be long short-term memory network models.
In the embodiment of the present disclosure, the feature extraction module 504 may specifically be configured to:
Extracting features from each sample in the sample set using a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model with positive sequence text information.
Based on the same idea, the embodiment of the specification further provides a device corresponding to the method in fig. 1.
Fig. 6 is a schematic structural diagram of a disorder text recognition device corresponding to fig. 1 according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 may include:
at least one processor 610; and
a memory 630 communicatively coupled to the at least one processor 610; wherein
the memory 630 stores instructions 620 executable by the at least one processor 610 to enable the at least one processor 610 to:
and acquiring text information to be identified.
And extracting the characteristics of the text information to be identified to obtain the characteristic vector of the text to be identified.
Inputting the feature vector of the text to be identified into a trained sequence to a sequence model to obtain the feature vector of the text after sequencing; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model by using feature vectors of positive sequence texts, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models.
And calculating a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified.
And when the difference value is larger than a preset value, determining that the text information to be identified is disorder text information.
Based on the same idea, the embodiment of the specification further provides a training device for the sequence-to-sequence model, corresponding to the method in fig. 3. The device includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
and acquiring a sample set, wherein samples in the sample set are positive sequence text information.
And extracting the characteristics of each sample in the sample set to obtain a sample characteristic vector set.
And training an initial sequence to sequence model by using the sample feature vector set to obtain a trained sequence to sequence model, wherein the initial sequence to sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, many improvements to method flows can now be regarded as direct improvements to hardware circuit structures: designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must be written in a specific programming language called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely in computer-readable program code, it is entirely possible to achieve the same functionality by logically programming the method steps such that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component; indeed, means for achieving the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing one or more embodiments of the present description.
One skilled in the art will appreciate that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
One or more embodiments of the present specification are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to one or more embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is illustrative of embodiments of the present disclosure and is not to be construed as limiting one or more embodiments of the present disclosure. Various modifications and alterations to one or more embodiments of this description will be apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of one or more embodiments of the present disclosure, are intended to be included within the scope of the claims of one or more embodiments of the present disclosure.

Claims (20)

1. A method of recognizing out-of-order text, comprising:
acquiring text information to be identified;
extracting the characteristics of the text information to be identified to obtain the characteristic vector of the text to be identified;
inputting the feature vector of the text to be identified into a trained sequence to a sequence model to obtain the feature vector of the text after sequencing; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model by using feature vectors of positive sequence texts, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models;
Calculating a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified;
and when the difference value is larger than a preset value, determining that the text information to be identified is disorder text information.
2. The method of claim 1, wherein the feature extraction of the text information to be identified specifically includes:
and extracting the characteristics of the text information to be identified by adopting a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model by using positive sequence text information.
3. The method according to claim 1, wherein the calculating the difference value between the feature vector of the ordered text and the feature vector of the text to be identified specifically comprises:
and calculating a distance value between the feature vector of the sequenced text and the feature vector of the text to be identified.
4. The method of claim 1, wherein the text information to be identified is account registration information, and further comprising, after determining that the text information to be identified is out-of-order text information:
and generating a control instruction for prohibiting the user from using the account corresponding to the account registration information.
5. The method of claim 1, wherein the text information to be identified is account information used by a user to obtain a specified resource, and further comprising, after determining that the text information to be identified is out-of-order text information:
And generating a control instruction for prohibiting the user from using the designated resources in the account corresponding to the account information.
6. The method of claim 1, further comprising, after the determining that the text information to be identified is out-of-order text information:
and generating prompt information, wherein the prompt information is used for prompting that the text information to be identified is disorder text information.
7. The method of claim 1, wherein the encoder sub-model and the decoder sub-model are both long short-term memory network models.
8. A method of training a sequence-to-sequence model, comprising:
acquiring a sample set, wherein samples in the sample set are positive sequence text information;
extracting the characteristics of each sample in the sample set to obtain a sample characteristic vector set;
training an initial sequence to sequence model by using the sample feature vector set to obtain a trained sequence to sequence model, wherein the initial sequence to sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models; the trained sequence-to-sequence model is used for receiving the feature vector of the text to be recognized and outputting the feature vector of the sequenced text so as to determine whether the text information to be recognized is disorder text information according to the difference value between the feature vector of the sequenced text and the feature vector of the text to be recognized.
9. The method of claim 8, wherein the feature extraction is performed on each sample in the sample set, specifically including:
and extracting the characteristics of each sample in the sample set by adopting a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model by using positive sequence text information.
10. The method of claim 8, wherein the encoder sub-model and the decoder sub-model are both long short-term memory network models.
11. An out-of-order text recognition apparatus, comprising:
the acquisition module is used for acquiring text information to be identified;
the feature extraction module is used for extracting features of the text information to be identified to obtain feature vectors of the text to be identified;
the sequencing module is used for inputting the feature vector of the text to be identified into the trained sequence to the sequence model to obtain the feature vector of the text after sequencing; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model by using feature vectors of positive sequence texts, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models;
The difference value calculation module is used for calculating a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified;
and the disorder text information determining module is used for determining that the text information to be identified is disorder text information when the difference value is larger than a preset value.
12. The apparatus of claim 11, the feature extraction module being specifically configured to:
extract features of the text information to be identified by using a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model with positive sequence text information.
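The Word2vec feature extraction of claim 12 is commonly realized by averaging per-word vectors into one text vector. The toy sketch below uses a hand-built vector table as a stand-in for a Word2vec model trained on positive sequence text; the tokens and the two-dimensional vectors are illustrative assumptions:

```python
import numpy as np

# Toy stand-in for a trained Word2vec model: token -> vector.
word_vectors = {
    "open":    np.array([0.9, 0.1]),
    "account": np.array([0.2, 0.8]),
    "now":     np.array([0.5, 0.5]),
}

def text_feature_vector(tokens, vectors=word_vectors):
    """Average the vectors of known tokens into a single text feature vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(2)  # no known tokens: return the zero vector
    return np.mean(known, axis=0)

vec = text_feature_vector(["open", "account"])  # -> [0.55, 0.45]
```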
13. The apparatus of claim 11, wherein the text information to be identified is account registration information, and the apparatus further comprises:
a disabling control instruction generation module, configured to generate a disabling control instruction after the text information to be identified is determined to be disorder text information, so as to prohibit a user from using an account corresponding to the account registration information.
14. The apparatus of claim 11, wherein the text information to be identified is account information used by a user to obtain a specified resource, and the apparatus further comprises:
a disabling control instruction generation module, configured to generate a disabling control instruction after the text information to be identified is determined to be disorder text information, so as to prohibit the user from using the specified resource in an account corresponding to the account information.
15. The apparatus of claim 11, wherein the encoder sub-model and the decoder sub-model are both long short-term memory (LSTM) network models.
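For reference, the long short-term memory cell named in claims 10, 15 and 18 updates its gates and states with the standard equations, where $\sigma$ is the logistic sigmoid and $\odot$ the elementwise product:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

In the encoder sub-model the final states $(h_T, c_T)$ summarize the input feature-vector sequence; the decoder sub-model is initialized from them to emit the sequenced feature vectors.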
16. A training apparatus for a sequence-to-sequence model, comprising:
the acquisition module is used for acquiring a sample set, wherein samples in the sample set are positive sequence text information;
the feature extraction module is used for extracting features of each sample in the sample set to obtain a sample feature vector set;
the training module is used for training an initial sequence-to-sequence model by using the sample feature vector set to obtain a trained sequence-to-sequence model, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models; the trained sequence-to-sequence model is used for receiving the feature vector of the text to be identified and outputting the feature vector of the sequenced text, so as to determine whether the text information to be identified is disorder text information according to the difference value between the feature vector of the sequenced text and the feature vector of the text to be identified.
17. The apparatus of claim 16, wherein the feature extraction module is specifically configured to:
extract features of each sample in the sample set by using a trained Word2vec model, wherein the trained Word2vec model is generated by training an initial Word2vec model with positive sequence text information.
18. The apparatus of claim 16, wherein the encoder sub-model and the decoder sub-model are both long short-term memory (LSTM) network models.
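The training pipeline of claim 16 (fit a reconstruction model on positive sequence feature vectors only, then score new vectors by their reconstruction difference) can be sketched with a toy linear autoencoder trained by gradient descent. The linear model, dimensions, learning rate and synthetic data are illustrative stand-ins for the LSTM encoder and decoder sub-models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic positive-sequence feature vectors: 4-D points on a 2-D subspace.
basis = rng.normal(size=(2, 4))
X = rng.normal(size=(200, 2)) @ basis

# Linear encoder/decoder weights standing in for the LSTM sub-models.
W_enc = rng.normal(size=(4, 2)) * 0.1
W_dec = rng.normal(size=(2, 4)) * 0.1

err_before = np.linalg.norm(X @ W_enc @ W_dec - X)

lr = 0.05
for _ in range(1000):
    Z = X @ W_enc                      # encode
    E = Z @ W_dec - X                  # reconstruction error on positive samples
    g_dec = Z.T @ E / len(X)           # gradient of 0.5*||E||^2 w.r.t. W_dec
    g_enc = X.T @ (E @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

err_after = np.linalg.norm(X @ W_enc @ W_dec - X)

def difference_value(x):
    """Reconstruction difference used to flag disorder text."""
    return np.linalg.norm(x @ W_enc @ W_dec - x)

in_diff = difference_value(rng.normal(size=2) @ basis)  # in-distribution vector
out_diff = difference_value(rng.normal(size=4) * 3.0)   # off-subspace vector
```

Because training sees only positive sequence vectors, vectors from ordered text reconstruct with a small difference value while anomalous vectors do not, which is the decision criterion the claims attach to the preset value.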
19. A disorder text recognition device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring text information to be identified;
extracting the characteristics of the text information to be identified to obtain the characteristic vector of the text to be identified;
inputting the feature vector of the text to be identified into a trained sequence-to-sequence model to obtain the feature vector of the sequenced text; the trained sequence-to-sequence model is generated by training an initial sequence-to-sequence model by using feature vectors of positive sequence texts, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models;
calculating a difference value between the feature vector of the sequenced text and the feature vector of the text to be identified;
and when the difference value is larger than a preset value, determining that the text information to be identified is disorder text information.
20. A training apparatus for a sequence-to-sequence model, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring a sample set, wherein samples in the sample set are positive sequence text information;
extracting the characteristics of each sample in the sample set to obtain a sample characteristic vector set;
training an initial sequence-to-sequence model by using the sample feature vector set to obtain a trained sequence-to-sequence model, wherein the initial sequence-to-sequence model comprises an encoder sub-model and a decoder sub-model, and the encoder sub-model and the decoder sub-model are both recurrent neural network models; the trained sequence-to-sequence model is used for receiving the feature vector of the text to be identified and outputting the feature vector of the sequenced text, so as to determine whether the text information to be identified is disorder text information according to the difference value between the feature vector of the sequenced text and the feature vector of the text to be identified.
CN201911306126.0A 2019-12-18 2019-12-18 Method, device and equipment for recognizing disorder text Active CN111046658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911306126.0A CN111046658B (en) 2019-12-18 2019-12-18 Method, device and equipment for recognizing disorder text

Publications (2)

Publication Number Publication Date
CN111046658A CN111046658A (en) 2020-04-21
CN111046658B true CN111046658B (en) 2023-05-09

Family

ID=70237113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911306126.0A Active CN111046658B (en) 2019-12-18 2019-12-18 Method, device and equipment for recognizing disorder text

Country Status (1)

Country Link
CN (1) CN111046658B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618863A (en) * 2022-12-20 2023-01-17 中国科学院自动化研究所 Text event sequence generation method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766019A (en) * 2014-01-08 2015-07-08 中兴通讯股份有限公司 Webpage text information protection method, device and system
CN108573048A (en) * 2018-04-19 2018-09-25 中译语通科技股份有限公司 A kind of multidimensional data cut-in method and system, big data access system
CN110134940A (en) * 2019-02-27 2019-08-16 中国科学院电工研究所 A kind of training text identification model, the method and device of Text Coherence
CN110162767A (en) * 2018-02-12 2019-08-23 北京京东尚科信息技术有限公司 The method and apparatus of text error correction

Also Published As

Publication number Publication date
CN111046658A (en) 2020-04-21

Similar Documents

Publication Publication Date Title
JP6894058B2 (en) Hazardous address identification methods, computer-readable storage media, and electronic devices
CN107808098B (en) Model safety detection method and device and electronic equipment
CN110020938B (en) Transaction information processing method, device, equipment and storage medium
CN110263157B (en) Data risk prediction method, device and equipment
CN107391545B (en) Method for classifying users, input method and device
CN113221555B (en) Keyword recognition method, device and equipment based on multitasking model
CN110674188A (en) Feature extraction method, device and equipment
CN108346107B (en) Social content risk identification method, device and equipment
CN110008991B (en) Risk event identification method, risk identification model generation method, risk event identification device, risk identification equipment and risk identification medium
CN113837638A (en) Method, device and equipment for determining dialect
CN112214652B (en) Message generation method, device and equipment
CN114880472A (en) Data processing method, device and equipment
CN110033092B (en) Data label generation method, data label training device, event recognition method and event recognition device
CN109166021A (en) Bookkeeping methods, device and business finance integral system
CN111507726A (en) Message generation method, device and equipment
CN111046658B (en) Method, device and equipment for recognizing disorder text
CN111339910B (en) Text processing and text classification model training method and device
CN115545720B (en) Model training method, business wind control method and business wind control device
CN110879832A (en) Target text detection method, model training method, device and equipment
CN115640810A (en) Method, system and storage medium for identifying communication sensitive information of power system
Lee et al. Named-entity recognition using automatic construction of training data from social media messaging apps
CN114511376A (en) Credit data processing method and device based on multiple models
CN112287130A (en) Searching method, device and equipment for graphic questions
CN112001662B (en) Method, device and equipment for risk detection of merchant image
CN115423485B (en) Data processing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant