CN111209373A - Sensitive text recognition method and device based on natural semantics

Info

Publication number: CN111209373A
Application number: CN202010012173.0A
Authority: CN
Inventors: 万淼; 孙彦芬; 王歆怡; 陈锦; 王禹
Original assignee: Beijing Venus Information Security Technology Co Ltd; China Information Technology Security Evaluation Center
Current assignee: Beijing Venus Information Security Technology Co Ltd; China Information Technology Security Evaluation Center
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2020-05-29

Abstract

A sensitive text recognition method and device based on natural semantics are disclosed. The method comprises the following steps: acquiring a massive corpus word vector library; performing word segmentation on a sample document; performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library; performing vectorization analysis on the sample document and extracting its fingerprint features; performing word segmentation, word-by-word vectorization and document vectorization analysis on a document to be detected, in that order, to obtain its fingerprint features; and comparing the fingerprint features of the document to be detected with those of the sample document to identify sensitive documents. The method and device offer a low missed-detection rate, resistance to evasion and high identification efficiency.

Description

Sensitive text recognition method and device based on natural semantics

Technical Field

The invention relates to the field of computer information processing, in particular to a sensitive text recognition technology based on natural semantic features.

Background

Traditional keyword-based sensitive file identification has the advantage that policies are simple and intuitive to configure, but it suffers from high false-positive and missed-detection rates and is easy to evade. Sensitive file identification based on file hashes is fast, but a small perturbation, such as reordering paragraphs or rewording a phrase, changes the hash completely (the avalanche effect) and causes missed detections, and the similarity between documents cannot truly be judged at the semantic level.

In addition, traditional methods are often constrained by traffic and performance bottlenecks and cannot balance fast identification, high accuracy, a low false-positive rate and a low missed-detection rate.

Disclosure of Invention

The invention provides a sensitive text recognition method and device based on natural semantics that judge the similarity between documents at the semantic level, have low false-positive and missed-detection rates, are difficult to evade, and offer high processing efficiency.

The invention provides a sensitive text recognition method based on natural semantics, which comprises the following steps:

acquiring a massive corpus word vector library based on natural corpus;

performing word segmentation on the sample document;

performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document;

based on the corrected massive corpus word vector library and the small corpus word vector library, performing vectorization analysis on the sample document, and extracting fingerprint features of the sample document;

carrying out word segmentation, word-by-word vectorization and document vectorization analysis on the documents to be detected in sequence to obtain fingerprint characteristics of the documents;

and comparing the fingerprint characteristics of the document to be detected with the fingerprint characteristics of the sample document, and identifying the sensitive document to be detected.

Optionally, the method for obtaining a mass corpus word vector library based on natural corpus includes:

and training the natural corpus by using a natural language processing model to obtain the massive corpus word vector library.

Optionally, performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing the small corpus word vector library based on the new words in the sample document includes:

taking all words obtained by segmenting the sample document as input, and looking up the word vector of each word in the massive corpus word vector library one by one;

correcting the word vector of each word according to its relation to the other words in the sample document;

and for new words that are not in the massive corpus word vector library, deriving word vectors from the context of the document in which the words occur, storing these word vectors in the small corpus word vector library, and continuously updating the small corpus word vector library as new words are added.
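The lookup described above can be sketched as follows. This is a minimal illustration, assuming both libraries are plain in-memory dictionaries; the function name `split_known_and_new` and the toy vectors are hypothetical and do not come from the disclosure.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def split_known_and_new(
    tokens: List[str],
    massive_lib: Dict[str, Vector],
    small_lib: Dict[str, Vector],
) -> Tuple[Dict[str, Vector], List[str]]:
    """Look up each token; return the vectors found and the new words still lacking one."""
    found: Dict[str, Vector] = {}
    new_words: List[str] = []
    for tok in tokens:
        if tok in massive_lib:        # first choice: the massive corpus library
            found[tok] = massive_lib[tok]
        elif tok in small_lib:        # second choice: new words learned earlier
            found[tok] = small_lib[tok]
        elif tok not in new_words:    # unseen word: queue it for context analysis
            new_words.append(tok)
    return found, new_words


# Toy data (hypothetical values, two dimensions for readability).
massive_lib = {"project": [0.1, 0.3], "budget": [0.2, -0.1]}
small_lib: Dict[str, Vector] = {}
found, new_words = split_known_and_new(
    ["project", "X-17", "budget", "X-17"], massive_lib, small_lib
)
```

Words returned in `new_words` would then be given vectors from their context and written into `small_lib`, so that later documents find them on the second lookup.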

Optionally, the skip-gram model of Word2Vec, accelerated with negative sampling, is used to train on the natural corpus to obtain the massive corpus word vector library, or to correct the massive corpus word vector library.
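To make the training step concrete, the following is a minimal sketch of a single skip-gram update with negative sampling, written in plain Python rather than with any particular Word2Vec library. The vocabulary, counts, dimension and learning rate are hypothetical toy values; the 0.75 exponent on word counts follows the usual Word2Vec convention for drawing negative samples.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sgns_step(v_in: Dict[str, Vector], v_out: Dict[str, Vector], center: str,
              context: str, negatives: List[str], lr: float = 0.05) -> float:
    """One SGD step on -log s(u_ctx.v) - sum(log s(-u_neg.v)); returns the loss before the step."""
    v = v_in[center]
    grad_v = [0.0] * len(v)
    loss = 0.0
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = v_out[word]
        score = sigmoid(sum(a * b for a, b in zip(u, v)))
        loss -= math.log(score if label == 1.0 else 1.0 - score)
        g = score - label                  # gradient of the loss w.r.t. the dot product
        for i in range(len(v)):
            grad_v[i] += g * u[i]          # accumulate with the not-yet-updated u[i]
            u[i] -= lr * g * v[i]          # update the output vector in place
    for i in range(len(v)):
        v[i] -= lr * grad_v[i]             # update the center word's input vector
    return loss


rng = random.Random(0)
dim = 8
vocab_counts = {"secret": 5, "confidential": 4, "weather": 9, "banana": 3, "football": 6}
v_in = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab_counts}
v_out = {w: [0.0] * dim for w in vocab_counts}


def sample_negatives(exclude: set, k: int) -> List[str]:
    """Draw k negative words with probability proportional to count ** 0.75."""
    candidates = [w for w in vocab_counts if w not in exclude]
    weights = [vocab_counts[w] ** 0.75 for w in candidates]
    return rng.choices(candidates, weights=weights, k=k)


losses = []
for _ in range(200):
    negs = sample_negatives({"secret", "confidential"}, k=2)
    losses.append(sgns_step(v_in, v_out, "secret", "confidential", negs))
```

Only the one true context word and a handful of sampled negatives are touched per step, instead of a softmax over the whole vocabulary, which is where the speed-up comes from.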

Optionally, the Nonce2Vec method is used to derive word vectors for new words that are not in the massive corpus word vector library from the context of the document in which they occur.
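As a rough illustration of deriving a vector from context, the sketch below sums the vectors of the known words around each occurrence of a new word. This additive step is only the starting point of a Nonce2Vec-style approach, which, broadly, goes on to refine the vector with a high-learning-rate skip-gram variant; that refinement is omitted here. The function name, window size and toy vectors are assumptions made for the example.

```python
from typing import Dict, List

Vector = List[float]


def init_new_word_vector(tokens: List[str], new_word: str,
                         known_vectors: Dict[str, Vector], window: int = 2) -> Vector:
    """Sum the vectors of known words within `window` positions of each occurrence."""
    dim = len(next(iter(known_vectors.values())))
    total = [0.0] * dim
    for pos, tok in enumerate(tokens):
        if tok != new_word:
            continue
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx in tokens[lo:pos] + tokens[pos + 1:hi]:
            vec = known_vectors.get(ctx)
            if vec is not None:            # context words without a vector are skipped
                total = [t + x for t, x in zip(total, vec)]
    return total


# Toy example: "X-17" is the new word; "the" has no vector and is ignored.
known = {"prototype": [1.0, 0.0], "budget": [0.0, 2.0]}
vec = init_new_word_vector(["the", "X-17", "prototype", "budget"], "X-17", known)
```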

Optionally, the SIF (smooth inverse frequency) algorithm is used to perform vectorization analysis on the sample document, based on the corrected massive corpus word vector library and the small corpus word vector library, so as to extract the fingerprint features of the sample document.
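A minimal sketch of SIF-style fingerprinting follows: each document vector is the average of its word vectors weighted by a / (a + p(w)), after which the projection onto the first principal direction of all document vectors is removed. Power iteration stands in for an SVD so that the example needs no external library. The parameter `a`, the word probabilities and the vectors are hypothetical toy values, not figures from the disclosure.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sif_fingerprints(docs: List[List[str]], vectors: Dict[str, Vector],
                     word_prob: Dict[str, float], a: float = 1e-3,
                     iters: int = 100) -> Tuple[List[Vector], Vector]:
    """Return (fingerprints, u), where u is the common direction that was removed."""
    dim = len(next(iter(vectors.values())))
    embs: List[Vector] = []
    for tokens in docs:
        known = [t for t in tokens if t in vectors]
        emb = [0.0] * dim
        for t in known:
            weight = a / (a + word_prob.get(t, 0.0))   # frequent words get small weights
            for i, x in enumerate(vectors[t]):
                emb[i] += weight * x
        embs.append([x / max(len(known), 1) for x in emb])
    # Power iteration for the dominant direction of sum(e e^T) over all documents.
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        nxt = [0.0] * dim
        for e in embs:
            p = dot(e, u)
            for i, x in enumerate(e):
                nxt[i] += p * x
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    # Remove each document's projection onto the common direction.
    fps = [[x - dot(e, u) * ui for x, ui in zip(e, u)] for e in embs]
    return fps, u


vectors = {"secret": [0.9, 0.1, 0.2], "plan": [0.2, 0.8, 0.1],
           "the": [0.5, 0.5, 0.5], "game": [0.1, 0.2, 0.9]}
word_prob = {"the": 0.05, "secret": 0.0005, "plan": 0.001, "game": 0.001}
docs = [["the", "secret", "plan"], ["the", "game"], ["secret", "game"]]
fps, u = sif_fingerprints(docs, vectors, word_prob)
```

With only a few documents the removed direction can carry much of the signal, so in practice the common direction would more plausibly be estimated over a larger collection of fingerprints.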

Optionally, the comparing the fingerprint characteristics of the document to be detected and the sample document to identify a sensitive document to be detected includes:

calculating the cosine similarity of the fingerprints of the document to be detected and the sample document;

and if the similarity is higher than the threshold value, classifying the current document to be detected as a sensitive document.
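The comparison can be sketched as below. The threshold value of 0.8 is a hypothetical placeholder, since the disclosure does not state a specific threshold, and the function names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two fingerprints; 0.0 if either is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def is_sensitive(doc_fp: Vector, sample_fps: List[Vector], threshold: float = 0.8) -> bool:
    """A document is flagged if it is similar enough to any sample fingerprint."""
    return any(cosine_similarity(doc_fp, s) > threshold for s in sample_fps)


samples = [[1.0, 0.0]]
near = is_sensitive([1.0, 0.1], samples)   # small perturbation of the sample
far = is_sensitive([0.0, 1.0], samples)    # orthogonal to the sample
```

Because cosine similarity ignores vector length, a document that repeats or pads the sample content still scores close to the sample.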

Optionally, after comparing the fingerprint characteristics of the document to be detected with the fingerprint characteristics of the sample document and identifying the sensitive document to be detected, the method further includes:

and outputting the serial number of the sample document matched by each document to be detected that is classified as sensitive, thereby completing the alarm.

In another aspect, the present disclosure also provides a sensitive text recognition apparatus, including:

the word segmentation module is used for carrying out Chinese word segmentation on the sample document and the document to be detected;

the word vectorization module is used for carrying out word-by-word vectorization on the sample document based on the massive corpus word vector library;

the document fingerprint calculation module is used for respectively carrying out vectorization analysis on the sample document and the document to be detected and extracting fingerprint characteristics of the sample document and the document to be detected;

and the document fingerprint similarity calculation module is used for comparing the fingerprint characteristics of the document to be detected with the sample document and identifying the sensitive document to be detected.

Optionally, the sensitive text recognition apparatus further includes:

and the natural corpus pre-training module is used for training the natural corpus to obtain a massive corpus word vector library.

The sensitive text recognition method and device based on natural semantics use word vectors to generate a characteristic fingerprint of a document at the natural semantic level, and then identify texts to be detected that contain sensitive information by comparing fingerprints. Because the fingerprint carries the semantics and subject information of the document, it is difficult to evade by conventional means, which effectively reduces the false-positive and missed-detection rates. At the same time, pre-training on a massive corpus and the use of efficient algorithms give the system good processing efficiency.

Drawings

The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.

FIG. 1 shows a flow diagram of a sensitive text recognition method according to an example embodiment of the present disclosure.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 shows a flowchart of an exemplary embodiment of a method for natural semantic based sensitive text recognition according to the present disclosure, which includes:

step S101: acquiring a massive corpus word vector library based on natural corpus;

step S102: performing word segmentation on the sample document;

step S103: performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document;

step S104: based on the corrected massive corpus word vector library and the small corpus word vector library, performing vectorization analysis on the sample document, and extracting fingerprint features of the sample document;

step S105: carrying out word segmentation, word-by-word vectorization and document vectorization analysis on the documents to be detected in sequence to obtain fingerprint characteristics of the documents;

step S106: and comparing the fingerprint characteristics of the document to be detected with the fingerprint characteristics of the sample document, and identifying the sensitive document to be detected.
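The data flow of steps S101 to S106 can be sketched end to end as follows. To keep the example short, several stand-ins are assumed: whitespace splitting replaces Chinese word segmentation, a plain average of word vectors replaces the document vectorization algorithm, library correction and new-word handling are omitted, and the toy library, threshold and function names are hypothetical.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]

# S101: a tiny stand-in for the massive corpus word vector library.
LIBRARY: Dict[str, Vector] = {
    "secret": [0.9, 0.1, 0.0], "confidential": [0.85, 0.15, 0.0],
    "project": [0.1, 0.9, 0.0], "budget": [0.2, 0.8, 0.1],
    "weather": [0.0, 0.1, 0.9], "football": [0.1, 0.0, 0.95],
}


def whitespace_segment(text: str) -> List[str]:
    """S102 / S105: stand-in for a real word segmenter."""
    return text.split()


def fingerprint(text: str, library: Dict[str, Vector],
                segment: Callable[[str], List[str]] = whitespace_segment) -> Vector:
    """S103 + S104: look up word vectors and average them into one document vector."""
    dim = len(next(iter(library.values())))
    vecs = [library[t] for t in segment(text) if t in library]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a: Vector, b: Vector) -> float:
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else sum(x * y for x, y in zip(a, b)) / (na * nb)


def detect(doc: str, sample_fps: List[Vector], threshold: float = 0.8) -> bool:
    """S105 + S106: fingerprint the document under test and compare with the samples."""
    fp = fingerprint(doc, LIBRARY)
    return any(cosine(fp, s) > threshold for s in sample_fps)


sample_fps = [fingerprint("secret project budget", LIBRARY)]
reworded = detect("confidential project budget", sample_fps)   # synonym substituted
unrelated = detect("weather football", sample_fps)
```

In this toy library "secret" and "confidential" have nearby vectors, so the reworded document still matches the sample even though the two texts share no keyword that a hash or exact-match rule could rely on.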

In the above exemplary embodiment, the sample document is a known sensitive file. After Chinese word segmentation is performed on the document, all of its words are taken as input and vectorized word by word using the corpus of a natural language processing model, and a vector characterizing the document, i.e., the document fingerprint, is then generated. Finally, the fingerprint similarity between the sample document and the document to be detected is compared to judge whether the document to be detected is sensitive.
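The disclosure does not name a particular segmentation algorithm. As one simple possibility, the sketch below uses dictionary-based forward maximum matching, a classic baseline for Chinese word segmentation; the dictionary contents and the maximum word length are hypothetical.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedily take the longest dictionary word at each position; fall back to one character."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionary: "confidential", "document", "must not", "circulate externally".
dictionary = {"机密", "文件", "不得", "外传"}
tokens = forward_max_match("机密文件不得外传", dictionary)
```

A production system would more likely rely on a statistical or neural segmenter, but the output has the same shape: a list of words that feeds the word-by-word vectorization.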

A word vector (word embedding) is a vectorized representation of a word. The method and the device perform word-by-word vectorization based on natural semantics, and can capture the semantic features of a document from the vector relations between words in the natural corpus.

For example, if the word "search engine" always appears in the corpus together with "***" or "Baidu", then "***" and "Baidu" are mapped to relatively close positions in the vector space when the words are vectorized. At the same time, "***" co-occurs more often with "USA", so "***" lies closer to other word vectors that co-occur with "USA" (such as "California" and "Apple"), while "Baidu" lies closer to word vectors that co-occur with "China". As a result, the replacement of a word in a sensitive document with a synonym or near-synonym, such as "secret", can be accurately identified, as can two sentences that express similar meanings, such as "Wang will attend a meeting on the day of worship" and "Sunday, Wang will appear at a meeting".
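The co-occurrence intuition can be illustrated with simple count vectors, as sketched below. This is a count-based toy model rather than Word2Vec itself; the sentences are invented, and "OtherEngine" is a hypothetical placeholder for a second search engine name.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def cooccurrence_vectors(sentences: List[List[str]]) -> Dict[str, Counter]:
    """Represent each word by how often every other word shares a sentence with it."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j, ctx in enumerate(sent):
                if i != j:
                    counts[word][ctx] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(a[k] * b[k] for k in a if k in b) / (na * nb)


sentences = [
    ["Baidu", "search", "engine", "China"],
    ["OtherEngine", "search", "engine", "USA"],
    ["Baidu", "search", "engine"],
    ["OtherEngine", "search", "engine"],
    ["banana", "fruit", "yellow"],
]
vecs = cooccurrence_vectors(sentences)
same_role = cosine(vecs["Baidu"], vecs["OtherEngine"])   # share "search" and "engine"
unrelated = cosine(vecs["Baidu"], vecs["banana"])        # share no context word
```

The two engine names never appear in the same sentence, yet they end up with similar vectors because they keep the same company; the country words account for the remaining difference between them.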

Directly using a pre-trained massive corpus word vector library as the basis greatly improves the operating efficiency of the system. Furthermore, the massive corpus word vector library is continuously corrected based on the actual word segmentation results of the sample documents, while a small corpus word vector library is established for the irregular words in the sample documents, i.e., the new words. Combining the two libraries allows the semantic features of the sample documents to be fully expressed.

Based on the above, the characteristic fingerprint of the document is further generated, and the comparison of the similarity of the fingerprint is carried out, so that the identification of the sensitive document can be carried out.

The present disclosure builds a language model for Chinese from the perspective of natural semantics, which may use entry documents from Chinese wiki covering various industries as the corpus. The massive corpus word vector library can be obtained through such pre-training, or an existing word vector library based on natural corpus can be obtained directly, for example by importing it.

Optionally, the word-by-word vectorization of the sample document, the correction of the mass corpus word vector library, and the establishment of the small corpus word vector library based on the new words in the sample document include:

OOV (out-of-vocabulary) words are handled by selecting a fast and efficient low-dimensional word vector model. The Nonce2Vec method can be selected to train vectors for new words quickly and on the fly.

In addition, optionally, after comparing the fingerprint characteristics of the document to be detected with the fingerprint characteristics of the sample document and identifying the sensitive document to be detected, the method further includes:

step S201: and outputting the sample document sequence number corresponding to the document to be detected classified as the sensitive document, and finishing the alarm.
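Step S201 can be sketched as a small reporting function that takes precomputed similarity scores and emits one alarm record per sensitive document. The record layout, the 1-based sample numbering, the document identifiers and the threshold are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple


class Alarm(NamedTuple):
    doc_id: str        # identifier of the document under test
    sample_no: int     # 1-based serial number of the best-matching sample document
    similarity: float  # similarity score that triggered the alarm


def build_alarms(scores: Dict[str, List[float]], threshold: float = 0.8) -> List[Alarm]:
    """scores[doc_id][k] is the similarity between doc_id and sample document k + 1."""
    alarms: List[Alarm] = []
    for doc_id, sims in scores.items():
        if not sims:
            continue
        best = max(range(len(sims)), key=lambda k: sims[k])
        if sims[best] > threshold:
            alarms.append(Alarm(doc_id, best + 1, sims[best]))
    return alarms


# Hypothetical scores for two documents against three sample documents.
scores = {"mail-001": [0.31, 0.93, 0.12], "mail-002": [0.10, 0.22, 0.05]}
alarms = build_alarms(scores)
```

Reporting the serial number of the matched sample tells an operator which known sensitive file the flagged document resembles, not merely that something was flagged.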

The sensitive text recognition apparatus based on natural semantics according to an exemplary embodiment includes:

Optionally, the sensitive text recognition apparatus further includes:

The sensitive text recognition method and device based on natural semantics use word vectors to generate a characteristic fingerprint of a document at the natural semantic level, and then identify texts to be detected that contain sensitive information by comparing fingerprints. Compared with the prior art, they have the following advantages:

① Low missed-detection rate: the method models documents at the semantic level and can recognize linguistic information including, but not limited to, synonyms, near-synonyms, grammar and sentence patterns. Even if the wording or the paragraph order is modified, whether a document is similar to a sample document can still be determined accurately, which blocks attempts to evade detection by reordering or rewording.

② Excellent efficiency and effectiveness: document vectorization and the training of new words are fast, and the similarity comparison outperforms a number of advanced neural network models (such as several RNN and LSTM models).

③ Cross-domain applicability: the model, trained on a massive Chinese corpus, covers most of the semantic information of various industries.

④ Applicability to both long and short documents: the method performs well on long papers as well as on messages of only a few dozen characters.

The foregoing is illustrative of the present invention. Various modifications and changes in form or detail will readily occur to those skilled in the art based on the teachings and principles disclosed herein; the embodiments are to be regarded as illustrative rather than restrictive of the broad principles of the present invention.

Claims

1. A sensitive text recognition method based on natural semantics is characterized by comprising the following steps:

acquiring a massive corpus word vector library based on natural corpus;

performing word segmentation on the sample document;

performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document;

2. The sensitive text recognition method according to claim 1, wherein the method for obtaining a mass corpus word vector library based on natural corpus comprises:

3. The sensitive text recognition method according to claim 1, wherein performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing the small corpus word vector library based on the new words in the sample document comprises:

4. The sensitive text recognition method according to claim 1 or 2, wherein the skip-gram model of Word2Vec, accelerated with negative sampling, is used to train on the natural corpus to obtain the massive corpus word vector library, or to correct the massive corpus word vector library.

5. The sensitive text recognition method according to claim 3, wherein the Nonce2Vec method is used to derive word vectors for new words that are not in the massive corpus word vector library from the context of the document in which they occur.

6. The sensitive text recognition method according to claim 1, wherein the SIF algorithm is used to perform vectorization analysis on the sample document, based on the corrected massive corpus word vector library and the small corpus word vector library, to extract the fingerprint features of the sample document.

7. The sensitive text identification method according to claim 1, wherein the comparing the fingerprint characteristics of the document to be detected and the sample document to identify the sensitive document to be detected comprises:

8. The method of claim 1, wherein after comparing the fingerprint characteristics of the document to be tested with the fingerprint characteristics of the sample document and identifying the sensitive document to be tested, the method further comprises:

9. A sensitive text recognition apparatus applying the sensitive text recognition method according to any one of claims 1 to 8, comprising:

10. The sensitive text recognition apparatus of claim 9, further comprising: