CN110704611A - Illegal text recognition method and device based on feature deinterleaving - Google Patents

Info

Publication number
CN110704611A
Authority
CN
China
Prior art keywords
text
recognized
feature
illegal
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910730306.5A
Other languages
Chinese (zh)
Other versions
CN110704611B (en
Inventor
任博雅
刘权
李扬曦
赵媛
时磊
徐雅静
林鸿展
孙忆南
李思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center filed Critical National Computer Network and Information Security Management Center
Priority to CN201910730306.5A priority Critical patent/CN110704611B/en
Publication of CN110704611A publication Critical patent/CN110704611A/en
Application granted granted Critical
Publication of CN110704611B publication Critical patent/CN110704611B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an illegal text recognition method and device based on feature deinterleaving. The method comprises: step 1, performing a de-variant operation on a text to be recognized to remove the special characters in it; step 2, judging, according to a preset text feature word library and an out-of-order feature word library, whether the text to be recognized is an out-of-order text, and if so, de-interleaving the text to eliminate the variant, otherwise proceeding directly to step 3; and step 3, classifying the de-varianted text with a pre-trained classifier group and outputting a prediction of whether the text is illegal.

Description

Illegal text recognition method and device based on feature deinterleaving
Technical Field
The invention relates to the technical field of computers, in particular to an illegal text identification method and device based on feature deinterleaving.
Background
With the rapid development of the mobile internet and social networks, massive amounts of text data are generated in daily communication. This data contains a great deal of harmful, sensitive and illegal information, and screening it effectively from normal text is important for network supervision and for cleaning up the network environment. However, illegal and harmful short texts on the network are highly random and have complex, diverse variants. In particular, to evade monitoring and review, the content of some harmful texts is scrambled by an interleaving-like scheme that changes the structure of the message as much as possible while leaving its content unchanged, turning it into an out-of-order text. Such text is sparse and spreads hidden among large volumes of normal text, so it does considerable harm.
Currently there are two general approaches to this problem. The first is rule matching: rules are summarized manually into templates that are used for recognition. This requires a large number of hand-built rule patterns, which is time-consuming and labor-intensive; it can find only known out-of-order patterns, cannot use the other structural information in the text to assist the judgment, and has almost no generalization ability. The second is to label out-of-order data and train an out-of-order classification model. Both approaches rely on manually collecting sufficiently diverse out-of-order data before recognition becomes effective, and neither works well when out-of-order data is sparse.
Specifically, as shown in fig. 1, a variant short-text recognition method based on hierarchical features has been proposed. It mainly addresses the filtering of short texts that are varied at the character level. Keywords, a keyword library and feature forms (specific forms such as characters separated at equal intervals or arranged vertically) are defined in advance; an additional neural network module is used to train the feature word library and the weights of the feature forms; keyword features are extracted from the short text by query matching; the position of each word in the text is obtained and the corresponding feature weights are calculated; and finally the extracted features are input to a classifier for classification. The specific processing steps are as follows:
step 1, predefining a keyword library, a keyword replacement word list and feature forms (specific forms such as characters separated at equal intervals or arranged vertically);
step 2, training the feature word library and the weights of the feature forms with an additional neural network module;
step 3, extracting features by query matching;
step 4, obtaining the position information of each word in the text;
step 5, calculating the corresponding weights;
step 6, inputting the extracted features to a classifier for classification.
It can be seen from the above processing that the variant short text recognition based on the hierarchical features is specially used for processing various variant short texts, and the method needs to rely on a large-scale variant text corpus for training to obtain a good effect. For the case that the distribution of variant short texts is unbalanced, especially for the case that partial variant or disorder samples are sparse, the method has difficulty in learning sufficient features and achieving the expected effect.
In an actual network environment, the number of normal short texts and illegal non-variant short texts is large, the number of variant short texts is small, the diversity is complex, and the distribution of various variants is extremely unbalanced, for example, disordered short text data is very sparse. The existing method has the defects that the sample distribution is unbalanced, and especially under the condition of small sample data, enough training corpora cannot be obtained, so that the expected effect is difficult to achieve.
Disclosure of Invention
The embodiments of the invention provide an illegal text recognition method and device based on feature deinterleaving, which solve the prior-art problem that out-of-order variant samples are too sparse to form a training set.
The embodiment of the invention provides an illegal text identification method based on feature deinterleaving, which comprises the following steps:
step 1, performing a de-variant operation on a text to be recognized to remove the special characters in it;
step 2, judging, according to a preset text feature word library and an out-of-order feature word library, whether the text to be recognized is an out-of-order text; if so, de-interleaving the text to be recognized to eliminate the variant; otherwise, proceeding directly to step 3;
step 3, classifying the de-varianted text to be recognized with a pre-trained classifier group and outputting a prediction of whether the text is illegal.
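The control flow of steps 1 to 3 can be sketched as follows. This is only an illustration of the branching described above: the function name `recognize` and its four callable parameters are hypothetical stand-ins, not names used by the patent.

```python
from typing import Callable


def recognize(
    text: str,
    strip_variants: Callable[[str], str],    # step 1 (hypothetical helper)
    is_out_of_order: Callable[[str], bool],  # step 2 judgment (hypothetical helper)
    deinterleave: Callable[[str], str],      # step 2 repair (hypothetical helper)
    classify: Callable[[str], bool],         # step 3 (hypothetical helper)
) -> bool:
    """Return True when the text is predicted to be illegal."""
    cleaned = strip_variants(text)       # step 1: remove special characters
    if is_out_of_order(cleaned):         # step 2: only out-of-order text is de-interleaved
        cleaned = deinterleave(cleaned)
    return classify(cleaned)             # step 3: classify the de-varianted text
```

Passing the stages in as callables keeps the sketch independent of any particular de-variant rule, judgment criterion or classifier.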
An embodiment of the invention also provides an illegal text recognition device based on feature deinterleaving, comprising:
a de-variant module, configured to perform a de-variant operation on the text to be recognized and remove the special characters in it;
a judging module, configured to judge, according to a preset text feature word library and an out-of-order feature word library, whether the text to be recognized is an out-of-order text, and to call the de-interleaving module if so and the classification module otherwise;
a de-interleaving module, configured to de-interleave the text to be recognized, eliminate the variant and call the classification module; and
a classification module, configured to classify the de-varianted text to be recognized with a pre-trained classifier group and output a prediction of whether the text is illegal.
An embodiment of the invention also provides an illegal text recognition device based on feature deinterleaving, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the above illegal text recognition method based on feature deinterleaving.
The embodiment of the invention also provides a computer readable storage medium, wherein an implementation program for information transmission is stored on the computer readable storage medium, and when the program is executed by a processor, the steps of the illegal text recognition method based on feature deinterleaving are implemented.
With the embodiments of the invention, in an environment where out-of-order illegal samples are sparse, a complete end-to-end processing flow uses the abundant features of normal samples and non-variant illegal training samples to help identify the data features of the small-sample class, and de-interleaves those samples to eliminate the disorder, so that out-of-order small samples can be recognized. Online test verification shows a good recognition effect, and the recognition accuracy and recall for harmful short texts can be improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a diagram of a variant short text recognition method based on hierarchical features in the prior art;
FIG. 2 is a process flow diagram of an illegal text recognition method based on feature deinterleaving according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the overall solution of an embodiment of the present invention;
FIG. 4 is a detailed schematic of the overall solution of an embodiment of the invention;
FIG. 5 is a schematic diagram of an illegal text recognition device based on feature deinterleaving according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an illegal text recognition device based on feature deinterleaving according to a second embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Method embodiment
According to an embodiment of the present invention, an illegal text recognition method based on feature deinterleaving is provided, and fig. 2 is a processing flow chart of the illegal text recognition method based on feature deinterleaving according to the embodiment of the present invention, as shown in fig. 2, the method according to the embodiment of the present invention specifically includes:
Before the specific processing steps of the embodiment of the present invention are performed, the following preparation is required. Training samples are obtained, and features are extracted from them to form a text feature word library, where the training samples include normal samples and non-variant illegal short-text samples; de-interleaving features are then extracted from the text feature word library to form an out-of-order feature word library.
Specifically, the word frequency tf_ip of each word of the training samples in the normal samples and its word frequency tf_in in the illegal short-text samples are first counted, and feature words are selected by comparing tf_di = |tf_ip - tf_in| with a threshold ε1 to form the text feature word library; words with tf_in greater than a threshold ε2 are then extracted from the text feature word library and split to form the out-of-order feature word library.
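A minimal sketch of this preparation step is given below. Several points are assumptions rather than statements of the patent: raw occurrence counts stand in for word frequency, a word is selected when tf_di is strictly greater than ε1, and "splitting" is taken to mean breaking a word into its individual characters. The name `build_word_libraries` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def build_word_libraries(
    normal_docs: Iterable[List[str]],
    illegal_docs: Iterable[List[str]],
    eps1: float,
    eps2: float,
) -> Tuple[Set[str], Set[str]]:
    """Build (text feature word library, out-of-order feature word library)."""
    tf_p = Counter(w for doc in normal_docs for w in doc)   # tf_ip per word
    tf_n = Counter(w for doc in illegal_docs for w in doc)  # tf_in per word

    # Assumption: keep a word when tf_di = |tf_ip - tf_in| exceeds eps1.
    feature_words = {
        w for w in set(tf_p) | set(tf_n) if abs(tf_p[w] - tf_n[w]) > eps1
    }

    # Assumption: "splitting" a word means taking its individual characters.
    out_of_order: Set[str] = set()
    for w in feature_words:
        if tf_n[w] > eps2:
            out_of_order.update(w)
    return feature_words, out_of_order
```

Both inputs are lists of already segmented documents, so the sketch does not depend on a particular word segmenter.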
In addition, training a classifier group by using the training samples is further required, wherein the classifier group specifically includes: SVM classifier, perceptron classifier and LR classifier.
After the above processing is executed, the following steps of the embodiment of the invention are executed:
Step 201: perform a de-variant operation on the text to be recognized, removing the special characters in it.
Specifically, in step 201, a rule template is built with regular expressions (RE) from the special characters in the variant library, and the de-variant operation removes those special characters from the text to be recognized.
Step 202: judge, according to the preset text feature word library and the out-of-order feature word library, whether the text to be recognized is an out-of-order text; if so, de-interleave the text to be recognized to eliminate the variant; otherwise, proceed directly to step 203.
Step 202 specifically includes the following processing: the text to be recognized is segmented into words and mapped one by one against the feature words in the text feature word library to generate a one-hot encoded text vector; if every component of that vector equals 0, the out-of-order feature word library is used to judge whether the text is out of order; if it is, the text to be recognized is de-interleaved to eliminate the variant.
Step 203: classify the de-varianted text to be recognized with the pre-trained classifier group and output a prediction of whether the text is illegal.
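The judgment in step 202 can be illustrated with the sketch below. The text does not state the exact criterion applied with the out-of-order feature word library, so the `min_hits` threshold is purely an assumption, as are the function names; the input is assumed to be already segmented into tokens.

```python
from typing import Iterable, List, Sequence


def one_hot(tokens: Sequence[str], feature_words: Sequence[str]) -> List[int]:
    """One component per feature word: 1 if the word occurs in the tokens."""
    present = set(tokens)
    return [1 if w in present else 0 for w in feature_words]


def needs_deinterleaving(
    tokens: Sequence[str],
    feature_words: Sequence[str],
    out_of_order_library: Iterable[str],
    min_hits: int = 2,  # hypothetical threshold, not given in the text
) -> bool:
    """Apply the step 202 test: an all-zero vector triggers the out-of-order check."""
    if any(one_hot(tokens, feature_words)):
        return False  # a feature word matched directly, so no de-interleaving
    text = "".join(tokens)
    hits = sum(1 for fragment in out_of_order_library if fragment in text)
    return hits >= min_hits
```

A text whose vector has any non-zero component goes straight to classification, which matches the "otherwise, proceed directly to step 203" branch.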
The technical solutions of the embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 3 is a schematic diagram of the general technical solution of the embodiment of the present invention, as shown in fig. 3, specifically including the following processes:
Step S1: perform a de-variant operation on the text to be recognized, removing its special characters, i.e. simple variants;
Step S2: the training samples consist of normal samples and non-variant illegal short-text samples.
(2.1) extract features from the training samples to form the text feature word library;
(2.2) extract de-interleaving features from the text feature word library to form the out-of-order feature word library;
Step S3: de-interleaving algorithm.
(3.1) judge whether the text is out of order using the feature word libraries obtained in step S2;
(3.2) if so, de-interleave it to eliminate the variant;
Step S4: train a classifier group with the training samples, classify the de-varianted text with the classifier group, and output the prediction result.
The above-described processing steps will be described in detail below with reference to the accompanying drawings.
Fig. 4 is a detailed schematic diagram of a general technical solution of an embodiment of the present invention, as shown in fig. 4, specifically including:
For example, for any out-of-order short text T1 = (t_1, t_2, t_3, …, t_(i-1), t_i), where i is the number of words in the text, the processing flow is as follows:
Step S1: de-variant the original text with regular expressions (RE), removing the special characters of the text;
(1.1) the variant library is composed of various ASCII English characters, Greek letters, Chinese special characters, etc., such as!,%, [ lambda ], (+, ○, Σ, [ lambda ], "@, @ # ¥% … … &, etc.
(1.2) a rule template is built with RE regular expressions, so that the above characters, when they appear on their own as variants, are removed.
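Step S1 can be sketched with Python's `re` module as follows. The character set `VARIANT_CHARS` is a small illustrative sample chosen for this sketch, not the patent's full variant library, and `strip_variants` is a hypothetical name.

```python
import re

# Illustrative sample only; the actual variant library is larger.
VARIANT_CHARS = "!%+○Σ@#¥&…"

# Character class that matches any single variant character.
VARIANT_PATTERN = re.compile("[" + re.escape(VARIANT_CHARS) + "]")


def strip_variants(text: str) -> str:
    """Remove every variant character from the text."""
    return VARIANT_PATTERN.sub("", text)
```

Building the pattern from a string of characters keeps the rule template in one place, so extending the variant library only means extending `VARIANT_CHARS`.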
Step S2: the training samples consist of normal samples and non-variant illegal short-text samples.
(2.1) count the word frequency tf_ip of each word of the training samples in the normal samples and its word frequency tf_in in the illegal short-text samples, and select feature words by comparing tf_di = |tf_ip - tf_in| with a threshold ε1 to form the text feature word library;
(2.2) extract from the text feature word library the words with tf_in > ε2 and split them to form the out-of-order feature word library;
step S3: de-interleaving processing algorithm
(3.1) the short text output by step S1 is T2 = (t_1, t_2, t_3, …, t_(j-1), t_j), where j is the length of T2; the short text T2 is segmented into words and mapped one by one against the feature words in the text feature word library to generate a one-hot text vector x = (x_1, …, x_n);
(3.2) if the vector x equals (0, 0, …, 0), the out-of-order feature word library is used for the de-interleaving judgment; if the text is judged to be out of order, it is de-interleaved as follows:
define a minimum step size s_min and a maximum step size s_max, where s_max < j; T_temp and T_s are reconstructed texts, and N_match is the number of feature words matched in T_s;
define an AC (Aho-Corasick) automaton matching function ACmachine(), whose input is a reconstructed text and whose output M is the number of feature words matched against the feature word library. The AC automaton is a string-search algorithm invented by Alfred V. Aho and Margaret J. Corasick that matches the substrings in a finite "dictionary" set within an input string.
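A compact Aho-Corasick matcher in the spirit of ACmachine() is sketched below. The class name `ACMachine` and the choice to count each distinct feature word once are assumptions; the text only says the output is the number of matched feature words.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


class ACMachine:
    """Aho-Corasick automaton over a fixed set of feature words."""

    def __init__(self, words: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # trie transitions per state
        self.fail: List[int] = [0]              # failure link per state
        self.out: List[Set[str]] = [set()]      # words recognized at each state
        for word in words:
            if word:
                self._add(word)
        self._build_failure_links()

    def _add(self, word: str) -> None:
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].add(word)

    def _build_failure_links(self) -> None:
        # Breadth-first traversal; depth-1 states keep failure link 0 (root).
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit words that end at the failure target.
                self.out[nxt] |= self.out[self.fail[nxt]]

    def count_matches(self, text: str) -> int:
        """Return M, the number of distinct feature words found in the text."""
        state = 0
        matched: Set[str] = set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            matched |= self.out[state]
        return len(matched)
```

The automaton is built once from the feature word library and can then score every reconstructed text in a single pass over its characters.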
[Figures BDA0002160314170000071 and BDA0002160314170000081: images not reproduced in this text]
(3.3) otherwise, directly entering the next step.
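The patent's own de-interleaving listing is given only as figures that are not reproduced in this text, so the following is a generic stride-based reconstruction written under assumptions, not the patented procedure. For each step size between s_min and s_max it rebuilds a candidate text by reading every step-th character, counts the matched feature words, and keeps the candidate with the most matches. A plain substring count stands in for the ACmachine() function, and all names are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def count_matches(text: str, feature_words: Iterable[str]) -> int:
    """Stand-in for the AC automaton: number of feature words found in text."""
    return sum(1 for w in feature_words if w in text)


def deinterleave(
    text: str,
    feature_words: Iterable[str],
    s_min: int = 2,
    s_max: Optional[int] = None,
) -> Tuple[str, int]:
    """Return (T_s, N_match): the best reconstruction and its match count."""
    words = list(feature_words)
    if s_max is None:
        s_max = len(text) - 1  # the source requires s_max < j

    best_text, best_hits = text, count_matches(text, words)
    for step in range(s_min, s_max + 1):
        # T_temp: concatenate the characters at offsets 0, 1, ..., step - 1,
        # each read with stride `step`.
        candidate = "".join(text[offset::step] for offset in range(step))
        hits = count_matches(candidate, words)
        if hits > best_hits:
            best_text, best_hits = candidate, hits
    return best_text, best_hits
```

For instance, "adbecf" is "abc" and "def" interleaved character by character, and reading it with step 2 restores "abcdef". A text that already matches best as given is returned unchanged.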
Step S4: the de-interleaved text vector is input to the trained classifier group. This example uses a combination of three classifiers, a support vector machine (SVM), a perceptron and logistic regression (LR), and a voting strategy yields the prediction of whether the text is illegal.
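The voting in step S4 can be sketched as below. The text says only that "a voting strategy" is used, so simple majority voting is an assumption, and the classifiers are represented by hypothetical callables rather than trained SVM, perceptron and LR models.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(
    classifiers: Sequence[Callable[[Sequence[int]], int]],
    features: Sequence[int],
) -> int:
    """Return the label predicted by most classifiers (1 = illegal, 0 = normal)."""
    votes = Counter(clf(features) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

With three binary classifiers a tie cannot occur, which is one practical reason to use an odd-sized group.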
In summary, the illegal short-text recognition method based on feature deinterleaving provided by the embodiments of the invention addresses the problem that out-of-order variant samples are too sparse to form a training set, and has the following technical effects:
1. the features of the abundant normal samples and non-variant illegal training samples are used to help identify the data features of the small-sample class, and de-interleaving eliminates the disorder, so that out-of-order small samples can be recognized and the recognition rate for small-sample variant texts is improved;
2. extracting the character features of illegal text from the word features captures more out-of-order variant features and enhances the generalization ability of the method.
Apparatus embodiment one
According to an embodiment of the present invention, an illegal text recognition apparatus based on feature deinterleaving is provided, fig. 5 is a schematic diagram of an illegal text recognition apparatus based on feature deinterleaving according to an apparatus embodiment of the present invention, and as shown in fig. 5, the illegal text recognition apparatus based on feature deinterleaving according to an embodiment of the present invention specifically includes:
the de-variant module 50, configured to perform a de-variant operation on the text to be recognized and remove the special characters in it; the de-variant module 50 is specifically configured to: build a rule template with RE regular expressions from the special characters in the variant library, perform the de-variant operation on the text to be recognized, and remove the special characters in it;
the judging module 52, configured to judge, according to a preset text feature word library and an out-of-order feature word library, whether the text to be recognized is an out-of-order text, and to call the de-interleaving module 54 if so and the classification module 56 otherwise;
the de-interleaving module 54, configured to de-interleave the text to be recognized, eliminate the variant and call the classification module 56; the de-interleaving module 54 is specifically configured to:
segment the text to be recognized into words and map it one by one against the feature words in the text feature word library to generate a one-hot encoded text vector; if every component of that vector equals 0, judge according to the out-of-order feature word library whether the text is out of order; and if so, de-interleave the text to be recognized to eliminate the variant;
and the classification module 56, configured to classify the de-varianted text to be recognized with a pre-trained classifier group and output a prediction of whether the text is illegal.
In an embodiment of the present invention, the apparatus further includes:
the library construction module, configured to obtain training samples, extract features from the training samples to form a text feature word library, and extract de-interleaving features from the text feature word library to form an out-of-order feature word library, where the training samples include normal samples and non-variant illegal short-text samples; the library construction module is specifically configured to: count the word frequency tf_ip of each word of the training samples in the normal samples and its word frequency tf_in in the illegal short-text samples, and select feature words by comparing tf_di = |tf_ip - tf_in| with a threshold ε1 to form the text feature word library; and extract from the text feature word library the words with tf_in greater than a threshold ε2 and split them to form the out-of-order feature word library;
the classifier group training module, configured to train a classifier group with the training samples, the classifier group comprising a support vector machine classifier, a perceptron classifier and a logistic regression classifier.
The specific processing of each module in the embodiment of the present invention can be understood by referring to the above method embodiment, and is not described herein again.
Apparatus embodiment two
An embodiment of the present invention further provides an illegal text recognition apparatus based on feature deinterleaving, as shown in fig. 6, including: a memory 60, a processor 62 and a computer program stored on the memory 60 and executable on the processor 62, which computer program, when executed by the processor 62, carries out the following method steps:
Before the specific processing steps of the embodiment of the present invention are performed, the following preparation is required. Training samples are obtained, and features are extracted from them to form a text feature word library, where the training samples include normal samples and non-variant illegal short-text samples; de-interleaving features are then extracted from the text feature word library to form an out-of-order feature word library.
Specifically, the word frequency tf_ip of each word of the training samples in the normal samples and its word frequency tf_in in the illegal short-text samples are first counted, and feature words are selected by comparing tf_di = |tf_ip - tf_in| with a threshold ε1 to form the text feature word library; words with tf_in greater than a threshold ε2 are then extracted from the text feature word library and split to form the out-of-order feature word library.
In addition, a classifier group is trained with the training samples, the classifier group comprising an SVM classifier, a perceptron classifier and an LR classifier.
After the above preparation, the following steps of the embodiment of the invention are executed:
Step 201: perform a de-variant operation on the text to be recognized, removing the special characters in it.
Specifically, in step 201, a rule template is built with regular expressions (RE) from the special characters in the variant library, and the de-variant operation removes those special characters from the text to be recognized.
Step 202: judge, according to the preset text feature word library and the out-of-order feature word library, whether the text to be recognized is an out-of-order text; if so, de-interleave the text to be recognized to eliminate the variant; otherwise, proceed directly to step 203.
Step 202 specifically includes the following processing: the text to be recognized is segmented into words and mapped one by one against the feature words in the text feature word library to generate a one-hot encoded text vector; if every component of that vector equals 0, the out-of-order feature word library is used to judge whether the text is out of order; if it is, the text to be recognized is de-interleaved to eliminate the variant.
Step 203: classify the de-varianted text to be recognized with the pre-trained classifier group and output a prediction of whether the text is illegal.
Apparatus embodiment three
The embodiment of the present invention provides a computer-readable storage medium, on which an implementation program for information transmission is stored, and when being executed by a processor 62, the implementation program implements the following method steps:
Before the specific processing steps of the embodiment of the present invention are performed, the following preparation is required. Training samples are obtained, and features are extracted from them to form a text feature word library, where the training samples include normal samples and non-variant illegal short-text samples; de-interleaving features are then extracted from the text feature word library to form an out-of-order feature word library.
Specifically, the word frequency tf_ip of each word of the training samples in the normal samples and its word frequency tf_in in the illegal short-text samples are first counted, and feature words are selected by comparing tf_di = |tf_ip - tf_in| with a threshold ε1 to form the text feature word library; words with tf_in greater than a threshold ε2 are then extracted from the text feature word library and split to form the out-of-order feature word library.
In addition, a classifier group is trained with the training samples, the classifier group comprising an SVM classifier, a perceptron classifier and an LR classifier.
After the above preparation, the following steps of the embodiment of the invention are executed:
Step 201: perform a de-variant operation on the text to be recognized, removing the special characters in it.
Specifically, in step 201, a rule template is built with regular expressions (RE) from the special characters in the variant library, and the de-variant operation removes those special characters from the text to be recognized.
Step 202: judge, according to the preset text feature word library and the out-of-order feature word library, whether the text to be recognized is an out-of-order text; if so, de-interleave the text to be recognized to eliminate the variant; otherwise, proceed directly to step 203.
Step 202 specifically includes the following processing: the text to be recognized is segmented into words and mapped one by one against the feature words in the text feature word library to generate a one-hot encoded text vector; if every component of that vector equals 0, the out-of-order feature word library is used to judge whether the text is out of order; if it is, the text to be recognized is de-interleaved to eliminate the variant.
Step 203: classify the de-varianted text to be recognized with the pre-trained classifier group and output a prediction of whether the text is illegal.
The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An illegal text recognition method based on feature deinterleaving is characterized by comprising the following steps:
step 1, performing a de-variant operation on a text to be recognized to remove special characters from the text to be recognized;
step 2, judging, according to a preset text feature word library and a disorder feature word library, whether the text to be recognized is a disordered text; if so, deinterleaving the text to be recognized to eliminate variants; otherwise, directly executing step 3;
step 3, classifying the text to be recognized, after variant removal, using a pre-trained classifier group, and outputting a prediction result indicating whether the text is an illegal text.
2. The method of claim 1, wherein the method further comprises:
acquiring training samples and extracting features from the training samples to form the text feature word library, wherein the training samples comprise: normal samples and illegal non-variant short text samples;
and extracting deinterleaving features from the text feature word library to form the disorder feature word library.
3. The method of claim 2, wherein
extracting features from the training samples to form the text feature word library specifically comprises:
counting, for each word $i$ in the training samples, its word frequency $tf_{ip}$ in the normal samples and its word frequency $tf_{in}$ in the illegal short text samples, and selecting feature words by comparing $tfd_i = |tf_{ip} - tf_{in}|$ against a threshold $\varepsilon_1$ to form the text feature word library;
extracting deinterleaving features from the text feature word library to form the disorder feature word library specifically comprises:
extracting from the text feature word library the words whose $tf_{in}$ is larger than a threshold $\varepsilon_2$, and splitting those words to form the disorder feature word library.
4. The method of claim 1, wherein the method further comprises:
training the classifier group using training samples, wherein the classifier group specifically comprises: a support vector machine classifier, a perceptron classifier, and a logistic regression classifier.
5. The method of claim 1, wherein judging, according to the preset text feature word library and the disorder feature word library, whether the text to be recognized is a disordered text, and if so, deinterleaving the text to be recognized to eliminate variants, specifically comprises:
after word segmentation of the text to be recognized, mapping the segmented words one by one against the feature words in the text feature word library to generate a one-hot encoded text vector; if every element of the one-hot encoded text vector equals 0, judging according to the disorder feature word library whether the text is a disordered text; and if the text is judged to be a disordered text, deinterleaving the text to be recognized to eliminate variants;
and performing the de-variant operation on the text to be recognized to remove special characters from the text to be recognized specifically comprises:
constructing rule templates as regular expressions (RE) from the special characters in a variant library, and applying them to the text to be recognized to remove the special characters.
6. An illegal text recognition device based on feature deinterleaving, characterized by comprising:
a de-variant module, configured to perform a de-variant operation on a text to be recognized to remove special characters from the text to be recognized;
a judging module, configured to judge, according to a preset text feature word library and a disorder feature word library, whether the text to be recognized is a disordered text, and if so, to call the deinterleaving module, otherwise to call the classification module;
a deinterleaving module, configured to deinterleave the text to be recognized to eliminate variants, and then to call the classification module;
and a classification module, configured to classify the text to be recognized, after variant removal, using a pre-trained classifier group, and to output a prediction result indicating whether the text is an illegal text.
7. The apparatus of claim 6, wherein the apparatus further comprises:
a library construction module, configured to acquire training samples, extract features from the training samples to form the text feature word library, and extract deinterleaving features from the text feature word library to form the disorder feature word library, wherein the training samples comprise: normal samples and illegal non-variant short text samples;
and a classifier group training module, configured to train the classifier group using the training samples, wherein the classifier group specifically comprises: a support vector machine classifier, a perceptron classifier, and a logistic regression classifier.
8. The apparatus of claim 7, wherein
the library construction module is specifically configured to: count, for each word $i$ in the training samples, its word frequency $tf_{ip}$ in the normal samples and its word frequency $tf_{in}$ in the illegal short text samples; select feature words by comparing $tfd_i = |tf_{ip} - tf_{in}|$ against a threshold $\varepsilon_1$ to form the text feature word library; and extract from the text feature word library the words whose $tf_{in}$ is larger than a threshold $\varepsilon_2$ and split those words to form the disorder feature word library;
the deinterleaving module is specifically configured to:
after word segmentation of the text to be recognized, map the segmented words one by one against the feature words in the text feature word library to generate a one-hot encoded text vector; if every element of the one-hot encoded text vector equals 0, judge according to the disorder feature word library whether the text is a disordered text; and if the text is judged to be a disordered text, deinterleave the text to be recognized to eliminate variants;
the de-variant module is specifically configured to: construct rule templates as regular expressions (RE) from the special characters in a variant library, and apply them to the text to be recognized to remove the special characters.
9. An illegal text recognition device based on feature deinterleaving, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the illegal text recognition method based on feature deinterleaving according to any one of claims 1 to 5.
10. A computer-readable storage medium, on which an information transfer implementing program is stored, wherein the program, when executed by a processor, implements the steps of the illegal text recognition method based on feature deinterleaving according to any one of claims 1 to 5.
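The word-library construction and de-variant operation recited in claims 3, 5 and 8 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: word frequency is taken to be relative frequency within each sample set, a word is selected when its value exceeds the threshold (the claims only say the value is compared against ε1), and a feature word is split into its single characters. The function names and sample data are hypothetical, and the samples are assumed to be segmented already.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def word_freq(samples: Iterable[Sequence[str]]) -> Dict[str, float]:
    """Relative frequency of each word over a set of segmented samples
    (assumption: the claims do not say whether tf is raw or relative)."""
    counts = Counter(w for sample in samples for w in sample)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def build_lexicons(normal: Iterable[Sequence[str]],
                   illegal: Iterable[Sequence[str]],
                   eps1: float,
                   eps2: float) -> Tuple[List[str], Dict[str, Tuple[str, ...]]]:
    tf_p = word_freq(normal)    # tf_ip: frequency in normal samples
    tf_n = word_freq(illegal)   # tf_in: frequency in illegal short texts
    vocab = set(tf_p) | set(tf_n)
    # Text feature word library: tfd_i = |tf_ip - tf_in| above eps1 (assumed ">").
    features = sorted(
        w for w in vocab
        if abs(tf_p.get(w, 0.0) - tf_n.get(w, 0.0)) > eps1
    )
    # Disorder feature word library: feature words with tf_in above eps2,
    # each split into its characters (assumed meaning of "splitting").
    disorder = {w: tuple(w) for w in features if tf_n.get(w, 0.0) > eps2}
    return features, disorder


def strip_variants(text: str, variant_chars: Iterable[str]) -> str:
    """De-variant operation: build one regular expression from the variant
    library and delete every match from the text."""
    alternatives = sorted(set(variant_chars), key=len, reverse=True)
    if not alternatives:
        return text
    pattern = re.compile("|".join(re.escape(c) for c in alternatives))
    return pattern.sub("", text)
```

Sorting the alternatives by length before joining them lets a multi-character variant match ahead of any shorter variant it contains, and `re.escape` keeps characters such as `*` from being read as regex operators.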
CN201910730306.5A 2019-08-08 2019-08-08 Illegal text recognition method and device based on feature de-interleaving Active CN110704611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910730306.5A CN110704611B (en) 2019-08-08 2019-08-08 Illegal text recognition method and device based on feature de-interleaving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910730306.5A CN110704611B (en) 2019-08-08 2019-08-08 Illegal text recognition method and device based on feature de-interleaving

Publications (2)

Publication Number Publication Date
CN110704611A true CN110704611A (en) 2020-01-17
CN110704611B CN110704611B (en) 2022-08-19

Family

ID=69193399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910730306.5A Active CN110704611B (en) 2019-08-08 2019-08-08 Illegal text recognition method and device based on feature de-interleaving

Country Status (1)

Country Link
CN (1) CN110704611B (en)

Also Published As

Publication number Publication date
CN110704611B (en) 2022-08-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant