CN111782811A

CN111782811A - E-government affair sensitive text detection method based on convolutional neural network and support vector machine

Info

Publication number: CN111782811A
Application number: CN202010629592.9A
Authority: CN
Inventors: 王婷; 秦拯; 张吉昕; 胡玉鹏
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2020-07-03
Filing date: 2020-07-03
Publication date: 2020-10-16

Abstract

The invention relates to an e-government sensitive text detection method based on a convolutional neural network and a support vector machine. The invention mainly comprises: (1) an e-government sensitive text detection model based on a convolutional neural network and a support vector machine; (2) a sensitive field text classification model based on TFIDF and a support vector machine; and (3) a policy document recognition model based on word vectors and a convolutional neural network.

Description

E-government affair sensitive text detection method based on convolutional neural network and support vector machine

Technical Field

The invention relates to the technical field of machine learning, in particular to an electronic government affair sensitive text detection method based on a convolutional neural network and a support vector machine.

Background

With the rapid development of internet and computer technologies, network information technology is applied ever more widely in social life. Because networks are open and shared, these technologies improve the work efficiency of government but also bring content security problems and threaten the information security of government departments. Sensitive information and files of government departments may be leaked to the internet through an e-government platform. In particular, policy documents in sensitive fields such as religion, military affairs and politics mostly contain sensitive information, and once that information is leaked and spread it can cause huge losses to national security. Therefore, to protect state secrets, accurately and quickly detecting e-government sensitive information leaked onto the network while reducing the false alarm rate and the missed alarm rate has become a major challenge.

To protect sensitive text from leakage, it must first be determined whether the text content contains sensitive information. At present, most sensitive text detection is carried out according to customized rules. However, as the number and complexity of sensitive electronic documents increase, the existing detection means can no longer meet the requirements of efficiency and convenience. To discover sensitive information leaked to internet portal websites in a timely and comprehensive manner, a more efficient detection solution is needed. Currently, there are two main detection techniques. The first is detection based on keyword matching; sensitive word matching is the core of this method and is generally implemented with a string matching algorithm. This method ignores the relationship between deformed words and the original words, and therefore has low accuracy. The second, which emerged with the development of machine learning, uses text classification to detect sensitive texts; sensitive content detection based on traditional machine learning has low accuracy because few sensitive texts are available for training.

Therefore, to solve the above problems, the present invention provides an e-government sensitive text detection method based on a convolutional neural network and a support vector machine that takes into account the characteristics of e-government sensitive texts (for example, that they concern policy guidelines in sensitive fields).

Disclosure of Invention

The invention provides an electronic government affair sensitive text detection method based on a convolutional neural network and a support vector machine, which mainly comprises the following three contents:

1. providing a sensitive field text classification model based on a TFIDF and a support vector machine;

2. providing a policy document identification model based on word vectors and a convolutional neural network;

3. providing an electronic government affair sensitive text detection model based on sensitive field text classification and policy document identification.

The specific contents are as follows:

1. A sensitive field text classification model based on TFIDF and a support vector machine is provided

TFIDF weighting is used to construct text vectors, and a support vector machine is trained on them iteratively to build the sensitive field text classification model, which judges whether a text belongs to a sensitive field.

(1) The field text data set is converted to text vectors using the TFIDF weighting technique. For each text in the data set, the semantics of the text are represented by a vector; each dimension of the vector corresponds to a word, and its value is the TFIDF value of that word in the text. TFIDF evaluates how important a word is to one text in a corpus.

The weights are calculated with the TFIDF weighting technique as follows.

The formula consists of two parts, term frequency (TF) and inverse document frequency (IDF):

$$tf_{ij} = \frac{n_{ij}}{N_j} \quad (1)$$

$$idf_i = \log\frac{N}{N_i} \quad (2)$$

$$w_{ij} = tf_{ij} \cdot idf_i \quad (3)$$

where $n_{ij}$ is the number of times the i-th feature word occurs in the j-th text; $N_j$ is the total number of words in the j-th text; $N_i$ is the number of texts containing the i-th feature word; $N$ is the total number of texts; and $w_{ij}$ is the TFIDF value of the i-th feature word in the j-th text.

Because training TFIDF on a large-scale corpus yields a very large number of words, the invention, for time and space efficiency, limits the selection to 500 feature words, preferring words with high word frequency. After the vectors are constructed, $X = \{x_1, x_2, \ldots, x_i\}$ is obtained, where $x_1$ to $x_i$ are the vectors corresponding to the texts in the text training set D, $x_i$ being the vector of the i-th text.
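As a minimal sketch of the vector construction described above, the following Python code computes term frequency, inverse document frequency and their product for a small tokenized corpus, keeping only the most frequent words as features. The function name `build_tfidf_vectors`, the parameter `max_features`, the alphabetical tie-breaking and the toy corpus are illustrative assumptions, and the unsmoothed logarithm is the standard textbook form rather than a confirmed detail of the invention.

```python
import math
from collections import Counter


def build_tfidf_vectors(docs, max_features=500):
    """Turn tokenized texts into TF-IDF vectors over the most frequent words.

    docs: list of token lists (one list per text, already segmented).
    Returns (vocabulary, vectors) where vectors[j][k] is the weight of
    vocabulary[k] in text j.
    """
    # Pick the feature words: the max_features words with the highest
    # corpus-wide frequency (ties broken alphabetically, an assumption).
    corpus_counts = Counter(word for doc in docs for word in doc)
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocabulary = [word for word, _ in ranked[:max_features]]

    n_texts = len(docs)
    # Number of texts that contain each feature word.
    doc_freq = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}

    vectors = []
    for doc in docs:
        counts = Counter(doc)      # occurrences of each word in this text
        total = len(doc) or 1      # words in this text, guarded against 0
        row = []
        for w in vocabulary:
            tf = counts[w] / total
            idf = math.log(n_texts / doc_freq[w])
            row.append(tf * idf)
        vectors.append(row)
    return vocabulary, vectors


# Toy corpus (hypothetical), already split into words.
docs = [
    ["policy", "religion", "notice"],
    ["policy", "sports", "news"],
    ["policy", "religion", "religion", "decree"],
]
vocab, X = build_tfidf_vectors(docs, max_features=500)
```

In this toy corpus the word "policy" occurs in every text, so its inverse document frequency and therefore its weight is zero, which illustrates why words common to all texts contribute little to the classification.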

(2) The text data set is trained with a support vector machine algorithm to obtain the sensitive field text classification model. The process is as follows:

Modeling: given training samples $T = \{(v_1, y_1), (v_2, y_2), \ldots, (v_n, y_n)\}$, where $v_1$ to $v_n$ are n text vectors and $y_1$ to $y_n$ are the corresponding field labels: the label of a training text that belongs to the sensitive field is +1, and the label of a text that does not is -1. A hyperplane must be found that separates the instances of the training set into the two classes. The hyperplane is $w \cdot x + b = 0$ and the classification decision model is $f(x) = \mathrm{sign}(w \cdot x + b)$, where sign is the sign function, w is the weight vector of the model, and b is the bias. To obtain the maximum-margin hyperplane that completely separates the sample points in the training set, the following constrained optimization problem is solved, where $x_i$ denotes the training vector $v_i$:

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad (4)$$

$$\text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, n \quad (5)$$

Solving (4) and (5) gives the optimal w and b, and finally the sensitive field classification decision model f(x).

Detection: after modeling is completed, the text vector to be detected is input, and the output value is the classification label of the text: +1 is the positive class, indicating that the text belongs to the sensitive field, and -1 is the negative class, indicating that it does not.
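The decision rule and the separation constraint can be illustrated with a short Python sketch. It does not solve the optimization problem; it only evaluates f(x) = sign(w·x + b) and checks the constraint y_i(w·x_i + b) - 1 ≥ 0 for a hand-picked candidate hyperplane. The function names, the two-dimensional toy vectors and the values of `w` and `b` are assumptions made for illustration, and a score of exactly 0 is mapped to +1 by convention here.

```python
import math


def decision(w, b, x):
    """f(x) = sign(w.x + b): +1 for the sensitive field, otherwise -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def satisfies_margin(w, b, samples):
    """Check y_i (w.x_i + b) - 1 >= 0 for every (x_i, y_i) in samples."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) - 1 >= 0
        for x, y in samples
    )


def margin_width(w):
    """Width 2 / ||w|| of the margin; a smaller ||w|| gives a wider margin."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical 2-D text vectors with labels +1 (sensitive) / -1 (not).
train = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((1.0, 0.0), -1)]
w, b = (1.0, 1.0), -3.0   # hand-picked candidate, not a solver output

feasible = satisfies_margin(w, b, train)
labels = [decision(w, b, x) for x, _ in train]
```

In practice the optimal w and b would come from a quadratic programming solver or an existing support vector machine library; the check above only shows what a feasible solution has to satisfy.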

2. A policy document recognition model based on word vectors and a convolutional neural network is provided

Word2vec is used to train word vectors on the word sequences obtained after word segmentation, giving a word vector for each word. The word vectors serve as the input data of the convolutional neural network, on which a policy document identification model is built; the model judges whether a text is a policy document. It mainly comprises an input layer, a convolutional layer, a pooling layer and a fully connected layer.

(1) The first layer is the input layer. The input is an n × m matrix, denoted A, where n is the number of words in a text word sequence; the invention uses padding to keep the lengths of all text word sequences consistent. m is the dimension of the word vector of each word; the invention trains the word vectors with Word2vec, mapping each word to an m-dimensional word vector.
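A minimal sketch of how one text could be turned into the n × m input matrix is given below. It assumes that the word vectors are already available as a dictionary, that over-long sequences are truncated, and that padding positions and unknown words are represented by zero vectors; none of these choices, nor the names `build_input_matrix` and `pad_token`, are stated in the text.

```python
def build_input_matrix(tokens, embeddings, n, m, pad_token="<PAD>"):
    """Return an n x m matrix A for one text.

    tokens: list of words of the text after segmentation.
    embeddings: dict mapping a word to its m-dimensional vector
                (in practice produced by Word2vec training).
    Sequences longer than n are truncated, shorter ones are padded, and
    padding or unknown words map to the zero vector (an assumption).
    """
    zero = [0.0] * m
    padded = (tokens + [pad_token] * n)[:n]
    return [list(embeddings.get(word, zero)) for word in padded]


# Hypothetical 3-dimensional word vectors, for illustration only.
embeddings = {
    "religion": [0.1, 0.2, 0.3],
    "policy": [0.4, 0.5, 0.6],
}
A = build_input_matrix(["religion", "policy"], embeddings, n=4, m=3)
```

Every text then yields a matrix of the same shape, which is what allows the same convolution kernels to be applied to all texts.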

(2) The second layer is the convolutional layer. Convolution is performed on the matrix with convolution kernels of different sizes. The width of a kernel equals the word vector dimension m and its height is h, so a kernel is an h × m matrix t. The kernel slides downward with stride 1 and performs a convolution at each h × m window, producing a new feature value $c_i$. Processing one kernel yields a feature map c with n-h+1 features in total, calculated as:

$$c_i = f(t \cdot A[i:i+h-1] + b), \quad i = 1, 2, \ldots, n-h+1 \quad (6)$$

where b is the bias term and f is the activation function.
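The sliding-window computation can be sketched in plain Python as follows. The sketch assumes ReLU as the activation function f, which the text leaves unspecified, and uses 0-based indices, so the loop runs over the n-h+1 window positions 0 to n-h. The names `convolve` and `relu` and the toy matrices are illustrative.

```python
def relu(value):
    """Activation f; ReLU is an assumption, the text only names 'f'."""
    return max(0.0, value)


def convolve(A, kernel, bias=0.0, activation=relu):
    """Slide an h x m kernel down the n x m matrix A with stride 1.

    Returns the feature map c with n - h + 1 entries, where each entry is
    f(sum of element-wise products of the kernel and the window + bias).
    """
    n, h = len(A), len(kernel)
    feature_map = []
    for i in range(n - h + 1):
        window = A[i:i + h]            # rows i .. i+h-1 of A
        total = sum(
            k * a
            for k_row, a_row in zip(kernel, window)
            for k, a in zip(k_row, a_row)
        )
        feature_map.append(activation(total + bias))
    return feature_map


# Toy input: n = 4 words, m = 2 dimensions; kernel height h = 2.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
c = convolve(A, kernel, bias=0.5)
```

With n = 4 and h = 2 the feature map has 4 - 2 + 1 = 3 entries, one per window position.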

(3) The third layer is the pooling layer. Because kernels of different sizes produce feature maps of different sizes, the invention applies the 1-max-pooling function to each feature map so that the resulting feature dimensions are consistent. 1-max-pooling takes the maximum value from a set of values.
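A short Python sketch of 1-max pooling follows. The feature maps are made-up values standing for the outputs of three kernels of different heights; the function name `one_max_pool` is illustrative.

```python
def one_max_pool(feature_maps):
    """Keep the largest value of each feature map.

    feature_maps: list of lists that may have different lengths.
    Returns one value per map, so the output length equals the number of
    convolution kernels regardless of kernel height.
    """
    return [max(fmap) for fmap in feature_maps]


# Feature maps from three hypothetical kernels of heights 2, 3 and 4
# applied to a text of n = 5 words (lengths 4, 3 and 2).
feature_maps = [
    [0.2, 1.5, 0.7, 0.1],
    [0.9, 0.3, 0.4],
    [0.0, 2.1],
]
pooled = one_max_pool(feature_maps)
```

Although the three maps have lengths 4, 3 and 2, the pooled output always has one value per kernel, which gives the next layer a fixed-size input.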

(4) The fourth layer is the fully connected layer. It is used for classification: the features extracted by the convolutional and pooling layers are input to a softmax function for classification training, giving the policy document identification model.
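The classification step can be sketched as a linear layer followed by softmax. The weights and biases below are hand-picked for illustration, whereas a real model would learn them during training, and the class order (0 for non-policy text, 1 for policy document) is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, biases):
    """Fully connected layer followed by softmax.

    weights[k] is the weight vector of class k; class 0 = non-policy,
    class 1 = policy document (the class order is an assumption).
    Returns (predicted_class, probabilities).
    """
    scores = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    return probs.index(max(probs)), probs


# Hand-picked parameters for illustration; a real model learns them.
pooled = [1.5, 0.9, 2.1]
weights = [[-1.0, 0.0, -0.5], [1.0, 0.5, 0.5]]
biases = [0.0, 0.0]
label, probs = classify(pooled, weights, biases)
```

The probabilities sum to 1, and the class with the largest probability is taken as the identification result.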

3. An electronic government affair sensitive text detection model based on sensitive field text classification and policy document identification is provided

Since policy documents in sensitive e-government fields mostly involve sensitive content, determining whether a text is sensitive requires detecting both whether the text belongs to a sensitive field and whether it is a policy document. The sensitive field text classification model detects whether the text belongs to a sensitive field, and, for texts that do, the policy document identification model judges whether the text is a policy document.

(1) Sensitive field text classification. First, a text vector of the text to be detected is constructed; then the sensitive field classification model built with the support vector machine algorithm computes the classification result. The input of the model is the text to be detected and the output is its sensitive field classification result, which indicates whether the text belongs to a sensitive field.

(2) Policy document identification. First, word vectors are constructed with Word2vec; then the model built with the convolutional neural network computes the policy document identification result of the text. The input of the model is the text to be detected and the output is its policy document identification result, which indicates whether the text is a policy document.

Finally, the models from the above steps are integrated to obtain an e-government sensitive text detection model based on sensitive field text classification and policy document identification.
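The integration of the two models amounts to a two-stage cascade, which can be sketched as follows. The two classifiers are passed in as callables; the keyword-based stand-ins `stub_field` and `stub_policy` are hypothetical placeholders that only make the sketch runnable and do not represent the trained models.

```python
def detect_sensitive(text, in_sensitive_field, is_policy_document):
    """Two-stage cascade; both arguments are callables returning bool.

    The policy-document model runs only on texts that the field
    classifier has already placed in a sensitive field.
    """
    if not in_sensitive_field(text):
        return False
    return is_policy_document(text)


# Stand-in classifiers (keyword stubs, hypothetical) so the sketch runs;
# real ones would wrap the trained SVM and CNN models.
def stub_field(text):
    return "religion" in text


def stub_policy(text):
    return text.startswith("Notice")


results = [
    detect_sensitive(t, stub_field, stub_policy)
    for t in (
        "Notice on religion administration",
        "Blog post about religion",
        "Notice on road repairs",
    )
]
```

Only the first text passes both stages. Running the field classifier first means the convolutional model is evaluated only on the subset of texts that already belong to a sensitive field.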

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The invention discloses an electronic government affair sensitive text detection method based on a convolutional neural network and a support vector machine. The method mainly comprises the following steps:

Step 1: Preprocess the text data. First, the data set prepared for the invention is cleaned to remove useless parts of the texts; then the texts are segmented with a Chinese word segmentation technique to obtain the word sequence of each text.

Step 2: Establish the sensitive field text classification model. The weights of the words in a text are calculated with TFIDF, converting the text word sequence into the corresponding text vector. A field classification model is then built with the support vector machine algorithm; its inputs are the text vectors of sensitive and non-sensitive fields with their classification labels, and the model reaches a better classification result through continued training.

Step 3: Establish the policy document identification model. First, word vectors are trained on the texts with a Word2vec model so that each word is represented by a word vector, and each text is converted into the corresponding matrix as the input data of the convolutional neural network. Second, convolution is performed on the input matrix with convolution kernels of different sizes to obtain multiple feature maps. Next, the 1-max-pooling function extracts the feature of each feature map, outputting its maximum value. Finally, the extracted features are input to a softmax function for classification, giving the policy document identification result of the text.

Step 4: Detection. First, the text to be detected is converted into a text vector, and the sensitive field text classification model established in step 2 detects whether it belongs to a sensitive field. Then, for a text that belongs to a sensitive field, the policy document identification model established in step 3 judges whether it is a policy document. Finally, a text identified as a policy document in a sensitive e-government field is detected as a sensitive text, since most such documents are sensitive.
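The preprocessing of step 1 can be sketched as a cleaning function followed by a pluggable segmenter. What counts as a "useless part" is not specified in the text, so removing markup tags and redundant whitespace is an assumption, as are the names `clean_text` and `preprocess`. For Chinese text, a segmenter such as the `lcut` function of the jieba package could be passed as `segment`; here `str.split` stands in so the sketch runs without third-party packages.

```python
import re


def clean_text(raw):
    """Drop markup tags and collapse whitespace.

    What counts as a 'useless part' is not specified in the text; tags
    and redundant whitespace are assumptions for this sketch.
    """
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def preprocess(raw, segment):
    """Clean a text, then segment it with the supplied tokenizer."""
    return [tok for tok in segment(clean_text(raw)) if tok.strip()]


# str.split stands in for a Chinese word segmenter so the sketch runs
# without third-party packages.
words = preprocess("<p>religious  affairs\n policy notice</p>", str.split)
```

The resulting word sequence is the common input for both the TFIDF vectorization of step 2 and the word vector training of step 3.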

Claims

1. A method for detecting E-government affair sensitive texts based on a convolutional neural network and a support vector machine is characterized by comprising the following steps:

(1) providing a sensitive field text classification model based on a TFIDF and a support vector machine;

(2) providing a policy document identification model based on word vectors and a convolutional neural network;

(3) providing an electronic government affair sensitive text detection model based on sensitive field text classification and policy document identification.

2. The TFIDF and support vector machine based domain text classification model of claim 1, wherein: the weights of words in sensitive field texts and non-sensitive field texts are calculated with TFIDF to construct two types of text vectors; and a support vector machine algorithm is adopted, with the two types of text vectors and their classification labels as input and output, and trained iteratively to obtain a final converged sensitive field text classification model.

3. The word vector and convolutional neural network based policy document identification model of claim 1, wherein: each keyword in the texts of policy documents and non-policy documents is represented as a vector by a word vector algorithm, and a word vector matrix of each text is obtained according to its word sequence and used as the input of a convolutional neural network; convolution is performed on the input word vector matrices with convolution kernels of different sizes to obtain multiple feature maps; and the features of each feature map are reduced in dimension by a pooling function and finally input to a softmax classifier layer for classification training to obtain the policy document identification model.

4. The sensitive text detection model based on religious domain text classification and policy document identification according to claim 1, characterized in that: since policy documents in sensitive fields of e-government (e.g., religion, military affairs and politics) mostly contain sensitive content, determining whether a text is a sensitive text requires detecting whether the text belongs to a sensitive field and whether the text is a policy document; the sensitive field text classification model detects whether the text belongs to a sensitive field, and, for a text that belongs to a sensitive field, the policy document identification model judges whether the text is a policy document.