CN110704611A

CN110704611A - Illegal text recognition method and device based on feature deinterleaving

Info

Publication number: CN110704611A
Application number: CN201910730306.5A
Authority: CN
Inventors: 任博雅; 刘权; 李扬曦; 赵媛; 时磊; 徐雅静; 林鸿展; 孙忆南; 李思
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2019-08-08
Filing date: 2019-08-08
Publication date: 2020-01-17
Anticipated expiration: 2039-08-08
Also published as: CN110704611B

Abstract

The invention discloses an illegal text recognition method and device based on feature deinterleaving, wherein the method comprises the following steps: step 1, performing a variant-removal operation on a text to be recognized to remove special characters from the text; step 2, judging whether the text to be recognized is a disordered text according to a preset text feature word library and a disorder feature word library, and if so, deinterleaving the text to be recognized to eliminate the variant, otherwise proceeding directly to step 3; and step 3, classifying the text to be recognized, after variant removal, by using a pre-trained classifier group, and outputting a prediction result indicating whether the text is illegal.

Description

Illegal text recognition method and device based on feature deinterleaving

Technical Field

The invention relates to the technical field of computers, in particular to an illegal text identification method and device based on feature deinterleaving.

Background

With the rapid development of the mobile internet and social networks, massive amounts of text data are generated in daily communication. These data contain a great deal of harmful, sensitive, and illegal information, and screening such information out from normal text effectively and reasonably is of great significance for network supervision and for cleaning up the network environment. However, illegal and harmful short texts on the network are highly arbitrary in wording and have complex and diverse variants. In particular, to evade monitoring and review, various metaphor-like harmful texts are encrypted in an interleaving-like manner, which changes the structure of the information as much as possible while leaving its content unchanged, so that the text becomes a disordered text. Such text data are sparse and propagate hidden among large volumes of normal text, and therefore cause great harm.

Currently, there are two general approaches to this problem. The first is rule matching, in which rules are summarized manually into templates used for recognition. This approach requires a large number of rule patterns to be built by hand, which is time-consuming and labor-intensive; it can only find known disorder patterns, cannot effectively use other structural information in the text to assist the judgment, and has almost no generalization capability. The second is to label disordered data and train a disorder classification model. Both approaches rely on people finding sufficiently diverse disordered data before recognition can be effective, so neither performs well in scenarios where disordered data are sparse.

Specifically, FIG. 1 shows a variant short text recognition method based on hierarchical features, which mainly addresses the filtering of short texts with character-level variants. The method defines keywords, a keyword library, and feature forms in advance (for example, specific character-separation forms such as equal spacing or vertical arrangement), and uses an additional neural network module together with feature-level weights. It extracts keyword features from the short text by query matching, obtains the position information of the words in the text, calculates the corresponding feature weights, and finally inputs the extracted features to a classifier for classification. The specific processing steps are as follows:

step 1, predefining a keyword library, a keyword replacement word list and characteristic forms (such as character separation specific forms of equal intervals, vertical arrangement and the like);

step 2, using an additional neural network module to train the weights of the feature word library, the word library, and the feature forms;

step 3, extracting features by a query matching method;

step 4, obtaining the position information of the word in the text;

step 5, corresponding weight calculation is carried out;

step 6, inputting the extracted features to the classifier for classification calculation.

It can be seen from the above processing that the variant short text recognition based on the hierarchical features is specially used for processing various variant short texts, and the method needs to rely on a large-scale variant text corpus for training to obtain a good effect. For the case that the distribution of variant short texts is unbalanced, especially for the case that partial variant or disorder samples are sparse, the method has difficulty in learning sufficient features and achieving the expected effect.

In an actual network environment, normal short texts and illegal non-variant short texts are numerous, whereas variant short texts are few, highly diverse, and extremely unevenly distributed across variant types; disordered short text data, for example, are very sparse. When the sample distribution is unbalanced, and especially when only a small amount of sample data is available, the existing methods cannot obtain enough training corpora and therefore have difficulty achieving the expected effect.

Disclosure of Invention

The embodiment of the invention provides an illegal text recognition method and device based on feature deinterleaving, which are used for solving the problem in the prior art that disordered variant samples are too sparse to form a training set.

The embodiment of the invention provides an illegal text identification method based on feature deinterleaving, which comprises the following steps:

step 1, carrying out a variant removing operation on a text to be recognized, and removing special characters in the text to be recognized;

step 2, judging whether the text to be recognized is a disordered text according to a preset text feature word library and a disorder feature word library, and if so, deinterleaving the text to be recognized to eliminate the variant, otherwise proceeding directly to step 3;

step 3, classifying the text to be recognized, after variant removal, by using a pre-trained classifier group, and outputting a prediction result indicating whether the text is illegal.

The embodiment of the invention also provides an illegal text recognition device based on feature deinterleaving, which comprises:

the de-morphing module is used for performing de-morphing operation on the text to be recognized and removing special characters in the text to be recognized;

the judging module is used for judging whether the text to be recognized is a disordered text according to a preset text feature word library and a disorder feature word library, and if so, calling the de-interleaving module, otherwise calling the classification module;

the de-interleaving module is used for deinterleaving the text to be recognized to eliminate the variant, and calling the classification module;

the classification module is used for classifying the text to be recognized, after variant removal, by using a pre-trained classifier group, and outputting a prediction result indicating whether the text is illegal.

The embodiment of the invention also provides an illegal text recognition device based on feature deinterleaving, which comprises: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the above illegal text recognition method based on feature deinterleaving.

The embodiment of the invention also provides a computer readable storage medium, wherein an implementation program for information transmission is stored on the computer readable storage medium, and when the program is executed by a processor, the steps of the illegal text recognition method based on feature deinterleaving are implemented.

By adopting the embodiment of the invention, in an environment where disordered illegal samples are sparse, a complete end-to-end processing flow uses the features of the abundant normal samples and illegal non-variant training samples to help recognize the data features of the small samples, and the small samples are deinterleaved to eliminate the disorder, so that the disordered small samples can be recognized. Online test verification shows a good recognition effect, and the recognition accuracy and recall rate of harmful short texts can be improved.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a diagram of a variant short text recognition method based on hierarchical features in the prior art;

FIG. 2 is a process flow diagram of an illegal text recognition method based on feature deinterleaving according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the overall solution of an embodiment of the present invention;

FIG. 4 is a detailed schematic of the overall solution of an embodiment of the invention;

FIG. 5 is a schematic diagram of an illegal text recognition device based on feature deinterleaving according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of an illegal text recognition device based on feature deinterleaving according to a second embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Method embodiment

According to an embodiment of the present invention, an illegal text recognition method based on feature deinterleaving is provided, and fig. 2 is a processing flow chart of the illegal text recognition method based on feature deinterleaving according to the embodiment of the present invention, as shown in fig. 2, the method according to the embodiment of the present invention specifically includes:

Before the specific processing steps of the embodiment of the present invention are performed, the following preparation work needs to be done. Specifically, training samples are obtained, and features are extracted from them to form a text feature word library, where the training samples include normal samples and illegal non-variant short text samples; deinterleaving features are then extracted from the text feature word library to form a disorder feature word library.

Specifically, first, for each word in the training samples, its word frequency $tf_{ip}$ in the normal samples and its word frequency $tf_{in}$ in the illegal short text samples are counted, and feature words are selected by comparing $tf_{di} = |tf_{ip} - tf_{in}|$ with a threshold $\varepsilon_1$ to form the text feature word library; finally, words with $tf_{in}$ greater than a threshold $\varepsilon_2$ are extracted from the text feature word library and split to form the disorder feature word library.

In addition, a classifier group needs to be trained by using the training samples, where the classifier group specifically includes: an SVM classifier, a perceptron classifier, and an LR classifier.

After the above processing is executed, the following steps of the embodiment of the invention are executed:

Step 201, performing a variant-removal operation on the text to be recognized to remove special characters from the text to be recognized;

Specifically, in step 201, a rule template is built with regular expressions (RE) from the special characters in the variant library, and the variant-removal operation is performed on the text to be recognized to remove its special characters.

Step 202, judging whether the text to be recognized is a disordered text according to a preset text feature word library and a disorder feature word library, and if so, deinterleaving the text to be recognized to eliminate the variant, otherwise proceeding directly to step 203;

Step 202 specifically includes the following processing: after the text to be recognized is segmented into words, the words are mapped one by one against the feature words in the text feature word library to generate a one-hot encoded text vector; if all elements of the one-hot encoded text vector are equal to 0, whether the text is a disordered text is judged according to the disorder feature word library; and if the text is judged to be a disordered text, the text to be recognized is deinterleaved to eliminate the variant.

Step 203, classifying the text to be recognized, after variant removal, by using the pre-trained classifier group, and outputting a prediction result indicating whether the text is illegal.
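The lexicon check of step 202 can be illustrated with a minimal sketch. The function names, the `min_hits` parameter, and the criterion used to judge disorder (counting how many distinct characters of the text appear in the disorder feature word library) are assumptions made for illustration only; the embodiment does not state how the disorder feature word library is applied. The English toy words stand in for segmented text.

```python
def one_hot_vector(tokens, text_lexicon):
    """x[k] is 1 if the k-th feature word occurs among the tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in text_lexicon]


def looks_disordered(text, tokens, text_lexicon, disorder_lexicon, min_hits=2):
    """Return True when the text should be sent to deinterleaving."""
    x = one_hot_vector(tokens, text_lexicon)
    if any(x):
        # At least one whole feature word was matched: skip deinterleaving.
        return False
    # Assumed criterion (not from the source): enough distinct characters of
    # the text belong to the disorder feature word library.
    hits = sum(1 for ch in set(text) if ch in disorder_lexicon)
    return hits >= min_hits


# Hypothetical toy lexicons.
text_lexicon = ["pills", "now"]
disorder_lexicon = set("pillsnow")

vector = one_hot_vector(["buy", "pills"], text_lexicon)
scrambled = looks_disordered("psinlolw", ["psinlolw"], text_lexicon, disorder_lexicon)
plain = looks_disordered("buy pills", ["buy", "pills"], text_lexicon, disorder_lexicon)
```

Here the all-zero vector acts only as a gate: a text that already contains a whole feature word goes straight to classification.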

The technical solutions of the embodiments of the present invention will be described below with reference to the accompanying drawings.

Fig. 3 is a schematic diagram of the general technical solution of the embodiment of the present invention, as shown in fig. 3, specifically including the following processes:

step S1: performing a variant-removal operation on the text to be recognized to remove special characters, that is, simple variants, from the text;

step S2: the training samples are composed of normal samples and illegal non-variant short text samples;

(2.1) extracting features from the training samples to form a text feature word library;

(2.2) extracting deinterleaving features from the text feature word library to form a disorder feature word library;

step S3: deinterleaving processing algorithm;

(3.1) judging whether the text is disordered by using the feature word libraries obtained in step S2;

(3.2) if so, performing deinterleaving to eliminate the variant;

step S4: training a classifier group by using the training samples, classifying the text after variant removal by using the classifier group, and outputting a prediction result.
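The control flow of steps S1 to S4 can be sketched as a small function that chains the stages. This is only an outline under stated assumptions: the function name `recognize` and the four callables passed to it are hypothetical, and the stand-in stages below are deliberately trivial placeholders rather than the processing of the embodiment.

```python
def recognize(text, remove_variants, is_disordered, deinterleave, classify):
    """Chain the stages: variant removal, optional deinterleaving, classification."""
    cleaned = remove_variants(text)          # step S1
    if is_disordered(cleaned):               # step S3, judgment
        cleaned = deinterleave(cleaned)      # step S3, deinterleaving
    return classify(cleaned)                 # step S4


# Trivial stand-in stages, for illustration only.
def toy_remove_variants(text):
    return text.replace("*", "")


def toy_is_disordered(text):
    return text == "dab"


def toy_deinterleave(text):
    return text[::-1]


def toy_classify(text):
    return "bad" in text


flagged = recognize("d*a*b", toy_remove_variants, toy_is_disordered,
                    toy_deinterleave, toy_classify)
clean = recognize("hello", toy_remove_variants, toy_is_disordered,
                  toy_deinterleave, toy_classify)
```

Passing the stages in as callables keeps the flow independent of how each stage is implemented.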

The above-described processing steps will be described in detail below with reference to the accompanying drawings.

Fig. 4 is a detailed schematic diagram of a general technical solution of an embodiment of the present invention, as shown in fig. 4, specifically including:

For example, for any out-of-order short text $T_1 = (t_1, t_2, t_3, \ldots, t_{i-1}, t_i)$, where $i$ is the number of characters in the text, the processing flow comprises the following steps:

step S1: performing variant removal on the original text by means of regular expressions (RE) to remove the special characters of the text;

(1.1) the variant library is composed of various ASCII English characters, Greek letters, Chinese special characters, and the like, such as !, %, λ, (, +, ○, Σ, ", @, #, ￥, ……, &, etc.

(1.2) a rule template is constructed using regular expressions (RE), so that the above characters, when they appear singly as variants, are removed.
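As an illustration of step S1, the rule template can be sketched as a single regular-expression character class built from the variant library. The contents of `VARIANT_CHARS` and the function name `remove_variants` are assumptions for this example; the actual variant library is not enumerated in full here.

```python
import re

# Hypothetical variant library (an assumption, not the full library).
VARIANT_CHARS = "!%+○Σ@#￥…&*"

# One character class covering the library; re.escape keeps regex
# metacharacters such as + and * literal.
VARIANT_PATTERN = re.compile("[" + re.escape(VARIANT_CHARS) + "]")


def remove_variants(text):
    """Delete every special character listed in the variant library."""
    return VARIANT_PATTERN.sub("", text)


cleaned = remove_variants("b@u#y n%o!w")
```

Characters outside the library, including ordinary spaces, are left untouched.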

Step S2: the training samples are composed of normal samples and non-variant illegal short text samples.

(2.1) for each word in the training samples, counting its word frequency $tf_{ip}$ in the normal samples and its word frequency $tf_{in}$ in the illegal short text samples, and selecting feature words by comparing $tf_{di} = |tf_{ip} - tf_{in}|$ with a threshold $\varepsilon_1$ to form the text feature word library;

(2.2) extracting words with $tf_{in} > \varepsilon_2$ from the text feature word library and splitting them to form the disorder feature word library;
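The construction of the two word libraries in step S2 can be sketched as follows. Several points are assumptions rather than statements of the embodiment: word frequency is taken as relative frequency, a word is selected when $tf_{di}$ exceeds $\varepsilon_1$, and "splitting" is read as breaking a word into its individual characters. The function names and the English toy samples are likewise hypothetical.

```python
from collections import Counter


def term_frequencies(samples):
    """Relative frequency of each word over a list of tokenized samples."""
    counts = Counter(word for sample in samples for word in sample)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def build_lexicons(normal_samples, illegal_samples, eps1, eps2):
    tf_ip = term_frequencies(normal_samples)    # frequency in normal samples
    tf_in = term_frequencies(illegal_samples)   # frequency in illegal samples
    vocab = set(tf_ip) | set(tf_in)
    # Text feature word library: tf_di = |tf_ip - tf_in| above eps1
    # (the direction of the comparison is an assumption).
    text_lexicon = {
        w for w in vocab
        if abs(tf_ip.get(w, 0.0) - tf_in.get(w, 0.0)) > eps1
    }
    # Disorder feature word library: feature words with tf_in > eps2,
    # split into single characters (the splitting granularity is an assumption).
    disorder_lexicon = {
        ch for w in text_lexicon if tf_in.get(w, 0.0) > eps2 for ch in w
    }
    return text_lexicon, disorder_lexicon


normal = [["meet", "for", "lunch"], ["lunch", "at", "noon"]]
illegal = [["buy", "pills", "now"], ["cheap", "pills", "now"]]
text_lexicon, disorder_lexicon = build_lexicons(normal, illegal, eps1=0.2, eps2=0.3)
```

In this toy data, "lunch" enters the text feature word library because it is frequent only in normal samples, but it is excluded from the disorder feature word library because its $tf_{in}$ is zero.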

step S3: de-interleaving processing algorithm

(3.1) the short text output by step S1 is $T_2 = (t_1, t_2, t_3, \ldots, t_{j-1}, t_j)$, where $j$ is the length of $T_2$; the short text $T_2$ is segmented into words, which are then mapped one by one against the feature words in the text feature word library to generate a one-hot text vector $x = (x_1, \ldots, x_n)$;

(3.2) if the vector $x = (0, 0, \ldots, 0)$, the disorder feature word library is used to judge whether deinterleaving is needed; if the text is judged to be a disordered text, deinterleaving is performed:

defining a minimum step size $s_{min}$ and a maximum step size $s_{max}$, where $s_{max} < j$; $T_{temp}$ and $T_s$ are reconstructed texts, and $N_{match}$ is the number of feature words matched in $T_s$;

defining an Aho-Corasick (AC) automaton matching function ACmachine(), whose input is a reconstructed text and whose output $M$ is the number of feature words matched against the feature word library; here, the AC automaton is a string-search algorithm invented by Alfred V. Aho and Margaret J. Corasick for matching, within an input string, the substrings contained in a finite "dictionary" set.

(3.3) otherwise, directly entering the next step.
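The quantities defined in (3.2) suggest a search over step sizes, which can be sketched as below. The search loop itself is an assumption, since the text defines $s_{min}$, $s_{max}$, $T_{temp}$, $T_s$, $N_{match}$, and ACmachine() but does not spell out how they are combined: for each step $s$, the text is rebuilt by reading every $s$-th character from each starting offset, the rebuilt text is scored with an Aho-Corasick matcher, and the best-scoring reconstruction is kept. The names `build_ac`, `ac_machine`, and `deinterleave` are hypothetical.

```python
from collections import deque


def build_ac(words):
    """Build Aho-Corasick tables: goto transitions, failure links, match counts."""
    goto, fail, out = [{}], [0], [0]
    for word in words:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(0)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node] += 1
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] += out[fail[child]]   # inherit matches ending at the suffix
            queue.append(child)
    return goto, fail, out


def ac_machine(text, automaton):
    """Return M, the number of feature-word occurrences found in text."""
    goto, fail, out = automaton
    node, matches = 0, 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        matches += out[node]
    return matches


def deinterleave(text, lexicon, s_min=2, s_max=None):
    """Assumed search: try each step s and keep the best-matching reconstruction."""
    j = len(text)
    s_max = j - 1 if s_max is None else min(s_max, j - 1)   # enforce s_max < j
    automaton = build_ac(lexicon)
    best_text, n_match = text, ac_machine(text, automaton)
    for s in range(s_min, s_max + 1):
        # Each slice text[r::s] plays the role of T_temp; their join is T_s.
        t_s = "".join(text[r::s] for r in range(s))
        m = ac_machine(t_s, automaton)
        if m > n_match:
            best_text, n_match = t_s, m
    return best_text, n_match


restored, n_match = deinterleave("psinlolw", ["pills", "now"])
```

In this toy case "psinlolw" is "pillsnow" interleaved with step 2, so the reconstruction for $s = 2$ matches both feature words. If no reconstruction matches more feature words than the input, the input is returned unchanged.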

Step S4: the text vectors obtained after deinterleaving are input into the trained classifier group; in this example, a combination of three classifiers is used, namely a Support Vector Machine (SVM) classifier, a Perceptron classifier, and a Logistic Regression (LR) classifier, and a voting strategy is adopted to obtain the prediction result of whether the text is illegal.
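The voting strategy of step S4 can be illustrated with a simple majority vote. Majority voting is an assumption here, since the text only says that a voting strategy is adopted; the three toy functions below are hypothetical stand-ins for the trained SVM, perceptron, and LR classifiers and do not reflect real trained models.

```python
def majority_vote(classifiers, features):
    """Return True (illegal) when more than half of the classifiers vote illegal."""
    votes = [bool(clf(features)) for clf in classifiers]
    return 2 * sum(votes) > len(votes)


# Toy stand-ins for the three trained classifiers (illustrative only).
def svm_like(features):
    return features["feature_hits"] >= 1


def perceptron_like(features):
    return features["feature_hits"] >= 2


def lr_like(features):
    return features["length"] < 5


group = [svm_like, perceptron_like, lr_like]
is_illegal = majority_vote(group, {"feature_hits": 1, "length": 3})
is_normal = majority_vote(group, {"feature_hits": 0, "length": 10})
```

With three classifiers, a majority vote cannot tie, which is one practical reason to use an odd-sized group.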

In summary, for the problem that disordered variant samples are too sparse to form a training set, the illegal short text recognition method based on feature deinterleaving provided by the embodiment of the invention has the following technical effects:

1. the features of the abundant normal samples and illegal non-variant training samples are used to help recognize the data features of the small samples, and deinterleaving is performed to eliminate the disorder, so that the disordered small samples can be recognized and the recognition rate for small-sample variant texts is improved;

2. deriving the disorder features from the text features of illegal texts captures more disordered variant features and enhances the generalization capability of the method.

Apparatus embodiment one

According to an embodiment of the present invention, an illegal text recognition apparatus based on feature deinterleaving is provided, fig. 5 is a schematic diagram of an illegal text recognition apparatus based on feature deinterleaving according to an apparatus embodiment of the present invention, and as shown in fig. 5, the illegal text recognition apparatus based on feature deinterleaving according to an embodiment of the present invention specifically includes:

the de-morphing module 50 is used for performing a variant-removal operation on the text to be recognized to remove special characters from the text to be recognized; the de-morphing module 50 is specifically configured to: build a rule template with regular expressions (RE) from the special characters in the variant library, and perform the variant-removal operation on the text to be recognized to remove its special characters;

the judging module 52 is used for judging whether the text to be recognized is a disordered text according to a preset text feature word library and a disorder feature word library, and if so, calling the de-interleaving module 54, otherwise calling the classification module 56;

the de-interleaving module 54 is used for deinterleaving the text to be recognized to eliminate the variant, and calling the classification module 56; the de-interleaving module 54 is specifically configured to:

after the text to be recognized is segmented into words, map the words one by one against the feature words in the text feature word library to generate a one-hot encoded text vector; if all elements of the one-hot encoded text vector are equal to 0, judge whether the text is a disordered text according to the disorder feature word library; and if so, deinterleave the text to be recognized to eliminate the variant;

the classification module 56 is configured to classify the text to be recognized, after variant removal, by using a pre-trained classifier group, and output a prediction result indicating whether the text is illegal.

In an embodiment of the present invention, the apparatus further includes:

the library construction module is used for acquiring training samples, extracting features from the training samples to form a text feature word library, and extracting deinterleaving features from the text feature word library to form a disorder feature word library, wherein the training samples comprise: normal samples and illegal non-variant short text samples; the library construction module is specifically configured to: count, for each word in the training samples, its word frequency $tf_{ip}$ in the normal samples and its word frequency $tf_{in}$ in the illegal short text samples, and select feature words by comparing $tf_{di} = |tf_{ip} - tf_{in}|$ with a threshold $\varepsilon_1$ to form the text feature word library; and extract words with $tf_{in}$ greater than a threshold $\varepsilon_2$ from the text feature word library and split them to form the disorder feature word library.

The classifier group training module is used for training a classifier group by utilizing the training samples, wherein the classifier group specifically comprises: a support vector machine classification algorithm classifier, a perceptron classification algorithm classifier and a logistic regression classification algorithm classifier.

The specific processing of each module in the embodiment of the present invention can be understood by referring to the above method embodiment, and is not described herein again.

Apparatus embodiment two

An embodiment of the present invention further provides an illegal text recognition apparatus based on feature deinterleaving, as shown in fig. 6, including: a memory 60, a processor 62, and a computer program stored on the memory 60 and executable on the processor 62, wherein the computer program, when executed by the processor 62, implements the steps of the illegal text recognition method based on feature deinterleaving described in the above method embodiment.

Apparatus embodiment three

The embodiment of the present invention provides a computer-readable storage medium, on which an implementation program for information transmission is stored, and when executed by a processor 62, the program implements the steps of the illegal text recognition method based on feature deinterleaving described in the above method embodiment.

The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An illegal text recognition method based on feature deinterleaving is characterized by comprising the following steps:

2. The method of claim 1, wherein the method further comprises:

acquiring training samples, and extracting features from the training samples to form a text feature word library, wherein the training samples comprise: normal samples and illegal non-variant short text samples;

and extracting deinterleaving features from the text feature word library to form a disorder feature word library.

3. The method of claim 2,

the extracting of features from the training samples to form the text feature word library specifically comprises:

counting, for each word in the training samples, its word frequency $tf_{ip}$ in the normal samples and its word frequency $tf_{in}$ in the illegal short text samples, and selecting feature words by comparing $tf_{di} = |tf_{ip} - tf_{in}|$ with a threshold $\varepsilon_1$ to form the text feature word library;

the extracting of deinterleaving features from the text feature word library to form the disorder feature word library specifically comprises:

extracting words with $tf_{in}$ greater than a threshold $\varepsilon_2$ from the text feature word library and splitting them to form the disorder feature word library.

4. The method of claim 1, wherein the method further comprises:

training a classifier group by using training samples, wherein the classifier group specifically comprises: a support vector machine classification algorithm classifier, a perceptron classification algorithm classifier and a logistic regression classification algorithm classifier.

5. The method of claim 1, wherein judging whether the text to be recognized is a disordered text according to a preset text feature word library and a disorder feature word library, and if so, deinterleaving the text to be recognized to eliminate the variant, specifically comprises:

after the text to be recognized is segmented into words, mapping the words one by one against the feature words in the text feature word library to generate a one-hot encoded text vector; if all elements of the one-hot encoded text vector are equal to 0, judging whether the text is a disordered text according to the disorder feature word library; and if the text is judged to be a disordered text, deinterleaving the text to be recognized to eliminate the variant;

performing a morphing removing operation on a text to be recognized, wherein the removing of the special characters in the text to be recognized specifically comprises the following steps:

and according to the special characters in the variant library, using an RE regular expression to form a rule template, carrying out variant removing operation on the text to be recognized, and removing the special characters in the text to be recognized.

6. An illegal text recognition device based on feature deinterleaving is characterized in that,

7. The apparatus of claim 6, wherein the apparatus further comprises:

the library construction module is used for acquiring training samples, extracting features from the training samples to form a text feature word library, and extracting deinterleaving features from the text feature word library to form a disorder feature word library, wherein the training samples comprise: normal samples and illegal non-variant short text samples;

the classifier group training module is used for training a classifier group by utilizing a training sample, wherein the classifier group specifically comprises: a support vector machine classification algorithm classifier, a perceptron classification algorithm classifier and a logistic regression classification algorithm classifier.

8. The apparatus of claim 7,

the library construction module is specifically configured to: count, for each word in the training samples, its word frequency $tf_{ip}$ in the normal samples and its word frequency $tf_{in}$ in the illegal short text samples, and select feature words by comparing $tf_{di} = |tf_{ip} - tf_{in}|$ with a threshold $\varepsilon_1$ to form the text feature word library; and extract words with $tf_{in}$ greater than a threshold $\varepsilon_2$ from the text feature word library and split them to form the disorder feature word library;

the de-interleaving module is specifically configured to:

the de-morphing module is specifically configured to: and according to the special characters in the variant library, using an RE regular expression to form a rule template, carrying out variant removing operation on the text to be recognized, and removing the special characters in the text to be recognized.

9. An illegal text recognition device based on feature deinterleaving, comprising: memory, processor and computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the feature-based de-interleaving illegal text recognition method according to any of claims 1 to 5.

10. A computer-readable storage medium, on which an information transfer implementing program is stored, which, when executed by a processor, implements the steps of the feature-deinterleaving based illegal text recognition method according to any one of claims 1 to 5.