CN114491018A - Construction method of sensitive information detection model, and sensitive information detection method and device - Google Patents

Construction method of sensitive information detection model, and sensitive information detection method and device

Info

Publication number
CN114491018A
Authority
CN
China
Prior art keywords
text
sensitive information
information detection
model
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111595176.2A
Other languages
Chinese (zh)
Inventor
刘羽琦
邱峰
鲁广平
王昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd filed Critical Tianyi Cloud Technology Co Ltd
Priority to CN202111595176.2A priority Critical patent/CN114491018A/en
Publication of CN114491018A publication Critical patent/CN114491018A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6209 Protecting access to data via a platform, e.g. using keys or access control rules to a single file or object, e.g. in a secure envelope, encrypted and accessed using a key, or with access control rules appended to the object itself
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2141 Access rights, e.g. capability lists, access control lists, access tables, access matrices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a construction method of a sensitive information detection model, a sensitive information detection method and a device, wherein the construction method comprises the following steps: acquiring a training sample set and an open corpus; pre-training a BERT model based on the open corpus to obtain a pre-trained BERT model; constructing a preset neural network model from the pre-trained BERT model, a BiLSTM model, a fully connected layer and a classification layer; and training the preset neural network model on the training sample set to obtain the sensitive information detection model. By adopting a BERT model, which encodes each word together with the context of the sentence it appears in, the invention solves the problem of distinguishing polysemous words; at the same time, the method is far more efficient than traditional manual detection and more accurate than word vector approaches based on the word2vec model.

Description

Construction method of sensitive information detection model, and sensitive information detection method and device
Technical Field
The invention relates to the technical field of data security, in particular to a construction method of a sensitive information detection model, a sensitive information detection method and a sensitive information detection device.
Background
In the information and digital age, the types of data continue to multiply and the volume of data grows rapidly.
At present, sensitive information recognition and detection technology mainly identifies sensitive fields (such as names, addresses, certificate numbers and mobile phone numbers) through manually defined sensitive word dictionaries and regular matching. In the sensitive word dictionary approach, sensitive fields are manually defined and added to a dictionary, the data are then matched against the dictionary entries one by one, and data that meet the matching requirement are treated as sensitive. In the regular matching approach, rules are summarised for data with a regular structure (such as certificate numbers and mobile phone numbers), regular expressions are built from those rules, and the data are matched against the defined expressions to determine which of them are sensitive.
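As a rough illustration of the regular matching approach just described, the following Python sketch matches a mainland-China mobile phone number; the pattern and the helper name are illustrative assumptions rather than part of the disclosed rule base.

```python
import re

# Simplified illustration of the regular matching approach described above.
# The mobile-number pattern is an assumption for demonstration only; a real
# rule base would contain many more patterns (certificate numbers, etc.).
MOBILE_PATTERN = re.compile(r"\b1[3-9]\d{9}\b")

def contains_sensitive_number(text: str) -> bool:
    """Return True if the text matches the mobile-number rule."""
    return MOBILE_PATTERN.search(text) is not None

print(contains_sensitive_number("Contact: 13812345678"))  # True
print(contains_sensitive_number("Order id: 20211223"))    # False
```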
Although both of these methods can identify sensitive data, as data become more diverse the sensitive word dictionary and regular matching approaches suffer from the following problems, which affect the accuracy and reliability of recognition and detection. First, a manually defined sensitive lexicon is never complete, so not all sensitive fragments can be accurately identified. Second, a word may carry different meanings in different contexts, so matching against a sensitive lexicon may miss some fields or flag them incorrectly. Third, the regular expression approach requires studying sensitive data with specific structural characteristics; as data types keep increasing so does the number of regular expressions, and maintaining and updating the rule base consumes manpower and time.
In addition, much unstructured text requires context or several sentences of content to judge its sensitivity type or level. It is difficult to manually define a reasonably comprehensive and accurate sensitivity rule base for such text, and judging it by manually reading long documents is inefficient and costly.
Against this background, a fast and accurate method is therefore needed to determine whether unstructured electronic text contains sensitive information, so as to reduce the risk of leaking such file data.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method for constructing a sensitive information detection model, a sensitive information detection method and corresponding apparatus, so as to solve the technical problem in the prior art that recognising sensitive fields with manually defined sensitive word dictionaries and regular matching is inefficient and costly.
The technical scheme provided by the invention is as follows:
the first aspect of the embodiments of the present invention provides a method for constructing a sensitive information detection model, including: acquiring a training sample set and an open corpus; pre-training a BERT model based on the open corpus to obtain a pre-trained BERT model; constructing a preset neural network model from the pre-trained BERT model, a BiLSTM model, a fully connected layer and a classification layer; and training the preset neural network model on the training sample set to obtain the sensitive information detection model.
Optionally, the training sample set is constructed by the following steps: acquiring unstructured text in a data set; parsing the unstructured text to obtain the text information it contains; and constructing the training sample set from that text information.
Optionally, the types of unstructured text include a Microsoft Word document, a Microsoft PowerPoint presentation, a Microsoft Excel worksheet, and an Adobe Acrobat document.
Optionally, parsing the unstructured text to obtain its text information includes: parsing the document.xml file in the folder obtained by decompressing a Microsoft Word document to obtain first text information; parsing the slide.xml file in the folder obtained by decompressing a Microsoft PowerPoint presentation to obtain second text information; parsing the sharedStrings.xml file in the folder obtained by decompressing a Microsoft Excel worksheet to obtain third text information; parsing an Adobe Acrobat document with a preset tool to obtain fourth text information; and obtaining the text information of the unstructured text from the first, second, third and fourth text information.
Optionally, training the preset neural network model according to the training sample set to obtain the sensitive information detection model includes: training the preset neural network model according to the training sample set based on a k-fold cross validation algorithm to obtain the sensitive information detection model.
A second aspect of the embodiments of the present invention provides a method for detecting sensitive information, including: acquiring a text to be detected; inputting the text to be detected into the sensitive information detection model constructed by the method for constructing a sensitive information detection model according to the first aspect or any implementation of the first aspect of the embodiments of the present invention, so as to obtain a sensitive result of the text to be detected; and managing and controlling the text to be detected according to the sensitive result of the text to be detected.
A third aspect of the embodiments of the present invention provides a device for constructing a sensitive information detection model, including: a sample acquisition module, used for acquiring a training sample set and an open corpus; a pre-training module, used for pre-training a BERT model based on the open corpus to obtain a pre-trained BERT model; a model building module, used for building a preset neural network model from the pre-trained BERT model, a BiLSTM model, a fully connected layer and a classification layer; and a training module, used for training the preset neural network model on the training sample set to obtain the sensitive information detection model.
A fourth aspect of the embodiments of the present invention provides a sensitive information detection apparatus, including: a text acquisition module, used for acquiring a text to be detected; a detection module, used for inputting the text to be detected into the sensitive information detection model constructed by the method for constructing a sensitive information detection model according to the first aspect or any implementation of the first aspect of the embodiments of the present invention, so as to obtain a sensitive result of the text to be detected; and a management and control module, used for managing and controlling the text to be detected according to the sensitive result of the text to be detected.
A fifth aspect of the embodiments of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the method for constructing a sensitive information detection model according to the first aspect or any implementation of the first aspect of the embodiments of the present invention and the method for detecting sensitive information according to the second aspect of the embodiments of the present invention.
A sixth aspect of the embodiments of the present invention provides an electronic device, including a memory and a processor communicatively connected to each other, the memory storing computer instructions and the processor executing the computer instructions to perform the method for constructing a sensitive information detection model according to the first aspect or any implementation of the first aspect of the embodiments of the present invention and the method for detecting sensitive information according to the second aspect of the embodiments of the present invention.
The technical scheme provided by the invention has the following effects:
according to the construction method and the device of the sensitive information detection model provided by the embodiment of the invention, firstly, a pre-trained BERT model is obtained through pre-training; and then constructing a preset neural network model by adopting the pre-trained BERT model, the BilSTM model, the full connection layer and the classification layer for training to obtain a sensitive information detection model. The construction method of the sensitive information detection model adopts a BERT model, and combines the context of the sentence in which each word is located when the sentence is coded, thereby successfully solving the problem of distinguishing the polysemous words; meanwhile, the efficiency of applying a traditional manual detection method can be improved, and compared with other word vector obtaining methods based on a word2vec model, the accuracy is improved.
The sensitive information detection method and device provided by the embodiments of the present invention can detect whether text data to be examined contain sensitive information; their efficiency greatly exceeds that of traditional manual detection, and their accuracy exceeds that of existing word vector classification methods based on the word2vec model. In addition, the detection method can apply access-right controls to text containing sensitive information according to the sensitive result, effectively protecting sensitive electronic text and preventing leakage of sensitive content.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a method for constructing a sensitive information detection model according to an embodiment of the present invention;
FIG. 2 is a block diagram of the pre-training structure of the BERT model in a method for constructing a sensitive information detection model according to an embodiment of the present invention;
FIG. 3 is a block diagram of the preset neural network model in a method for constructing a sensitive information detection model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a sensitive information detection method according to an embodiment of the present invention;
FIG. 5 is a block diagram of a construction device of a sensitive information detection model according to an embodiment of the present invention;
FIG. 6 is a block diagram of a sensitive information detection apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As described in the background, recognising sensitive fields with manually defined sensitive word dictionaries and regular matching is inefficient and costly. In addition, existing text classification methods mainly segment the original text into words, vectorise the words, and feed the result into a deep neural network for feature extraction, so as to classify the text and determine the file type.
However, most text classification methods and systems use Word2vec or GloVe (Global Vectors for Word Representation) models to convert text content into feature vectors. Both are text representation methods that take the word as the processing unit: the feature vectors they produce are static word vectors, so a word has the same vector representation in every context, and the problem of polysemous words cannot be solved.
In view of this, embodiments of the present invention provide a method for constructing a sensitive information detection model and a sensitive information detection method, which detect sensitive information in unstructured text by treating the detection as a natural language text classification task.
In accordance with an embodiment of the present invention, a method for constructing a sensitive information detection model and a sensitive information detection method are provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
In this embodiment, a method for constructing a sensitive information detection model is provided, which may be used in electronic devices such as computers, mobile phones and tablet computers. Fig. 1 is a flowchart of a method for constructing a sensitive information detection model according to an embodiment of the present invention; as shown in fig. 1, the method includes the following steps:
step S101: and acquiring a training sample set and an open corpus. Wherein the set of training samples may be unstructured text in an existing dataset. Thus, the training sample set may be obtained in the following manner: acquiring unstructured text in a data set; analyzing according to the unstructured text to obtain text information in the unstructured text; and constructing a training sample set according to the text information.
In one embodiment, the unstructured text may consist of common office file types such as Microsoft Word documents, Microsoft PowerPoint presentations, Microsoft Excel worksheets and Adobe Acrobat documents (PDFs). The unstructured text is parsed to extract its text information.
Specifically, parsing the unstructured text to obtain its text information includes: parsing the document.xml file in the folder obtained by decompressing a Microsoft Word document to obtain first text information; parsing the slide.xml file in the folder obtained by decompressing a Microsoft PowerPoint presentation to obtain second text information; parsing the sharedStrings.xml file in the folder obtained by decompressing a Microsoft Excel worksheet to obtain third text information; parsing an Adobe Acrobat document with a preset tool to obtain fourth text information; and obtaining the text information of the unstructured text from the first, second, third and fourth text information. When parsing an Adobe Acrobat document or other PDF file, the parsing may be implemented with various software tools such as PDF Data extra.
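The following Python sketch illustrates this parsing step under stated assumptions: .docx, .pptx and .xlsx files are ZIP archives whose relevant XML members can be read directly, and pdfminer.six stands in for the unspecified "preset tool" used for Adobe Acrobat documents. The crude tag-stripping regex and the function names are illustrative, not part of the disclosed method.

```python
import re
import zipfile
from pdfminer.high_level import extract_text  # assumed stand-in for the "preset tool"

TAG_RE = re.compile(r"<[^>]+>")  # crude removal of XML markup, keeping the text nodes

def _read_xml_text(archive: zipfile.ZipFile, names) -> str:
    """Concatenate the text content of the given XML members of an Office archive."""
    parts = []
    for name in names:
        xml = archive.read(name).decode("utf-8", errors="ignore")
        parts.append(TAG_RE.sub("", xml))
    return "\n".join(parts)

def extract_unstructured_text(path: str) -> str:
    """Minimal sketch of the parsing step for the four file types discussed above."""
    if path.endswith(".docx"):
        with zipfile.ZipFile(path) as z:                      # first text information
            return _read_xml_text(z, ["word/document.xml"])
    if path.endswith(".pptx"):
        with zipfile.ZipFile(path) as z:                      # second text information
            slides = sorted(n for n in z.namelist() if n.startswith("ppt/slides/slide"))
            return _read_xml_text(z, slides)
    if path.endswith(".xlsx"):
        with zipfile.ZipFile(path) as z:                      # third text information
            return _read_xml_text(z, ["xl/sharedStrings.xml"])
    if path.endswith(".pdf"):
        return extract_text(path)                             # fourth text information
    raise ValueError("unsupported file type: " + path)
```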
Step S102: and pre-training the BERT model based on the open corpus to obtain the pre-trained BERT model. In particular, BERT (Bidirectional Encoder characterization based on Transformer) is a language model for generating word vectors based on context representation, that is, BERT model is a Transformer, and when a word is processed by BERT model, the meaning of word before and after the word can be obtained by considering the word before and after the word. That is, the BERT model combines the context of the sentence in which each word is located when encoding the sentence.
Pre-training the BERT model allows the lower and middle layers, which are common to the downstream tasks, to be trained in advance; each downstream task then trains its own model on its own sample data, which greatly accelerates convergence.
In this method, the BERT model may be pre-trained using Wikipedia as the corpus. During pre-training, as shown in FIG. 2, the input to BERT is the superposition of token, segment and position embeddings, that is, the embedding of each word is the sum of three embeddings. WordPiece embeddings are used as the token embeddings, i.e. the word vectors of the input text; the position embeddings represent the position of each word within the sentence, i.e. the position vectors of the input text; and the segment embeddings apply to the sentence as a whole, i.e. the text vector of the input text.
During pre-training, the BERT model considers both the left and right context of a word: it randomly masks part of the input tokens and trains the model to correctly predict the masked tokens. At the same time, the BERT model is also pre-trained on a binary next-sentence prediction task. The parameters obtained by this semi-supervised feature extraction through multiple Transformer layers, combined with fine-tuning, yield better accuracy. After such pre-training, the output vectors of the BERT model are the representations of each character/word with the semantic information of the whole text fused in.
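As a minimal sketch of how such context-fused token vectors can be obtained, the following code loads a publicly available pre-trained Chinese BERT checkpoint through the Hugging Face transformers library; the checkpoint name is an assumption, since the embodiment only states that BERT is pre-trained on an open corpus such as Wikipedia.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the embodiment only requires a BERT model pre-trained on an open corpus.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

encoded = tokenizer("该文档包含客户手机号等敏感信息", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**encoded)

# Each row is the representation of one token with full-sentence semantics fused in.
token_vectors = outputs.last_hidden_state  # shape: (1, sequence_length, 768)
print(token_vectors.shape)
```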
Step S103: and constructing a preset neural network model according to the pre-trained BERT model, the BilSTM model, the full connection layer and the classification layer. When the preset neural network model is constructed, the adopted pre-trained BERT model may be the BERT model pre-trained in step S102. That is, the parameters in the pre-trained BERT model may be determined through the pre-training in step S102, and the parameters obtained through the pre-training may be directly adopted by the pre-trained BERT model when the preset neural network model is constructed.
The BiLSTM (Bi-directional Long Short-Term Memory) model combines a forward LSTM (Long Short-Term Memory network) and a backward LSTM. An LSTM captures longer-distance dependencies well because it learns during training which information to remember and which to forget. However, modelling a sentence with a single LSTM has a limitation: information cannot be encoded from back to front. A BiLSTM, which combines a forward and a backward LSTM, therefore captures bidirectional semantic dependencies better.
Specifically, the classification layer may use a softmax function, or alternatively a sigmoid. Through the fully connected layer and the classification layer, the preset neural network model can perform not only binary classification tasks (for example, labelling a file as sensitive or non-sensitive) but also multi-class tasks (for example, labelling a file as non-sensitive, politically sensitive or military sensitive).
Step S104: and training the preset neural network model according to the training sample set to obtain a sensitive information detection model, as shown in fig. 3. During training, the pre-trained BERT model can perform word embedding vector representation on the text, the output of the last layer of the pre-trained BERT model is used as the input of the BilSTM model, and a labeled training sample set is used for training. In BilSTM, the bidirectional weighting vectors are fully connected, and a new vector is output to the softmax function for classification.
Specifically, during training the model may be trained with different parameters depending on the characteristics of the files in the training set, so as to achieve the best training effect. Depending on the labels of the labelled training sample set, either binary or multi-class classification can be realised: with the labels sensitive and non-sensitive, binary classification is obtained; with the labels non-sensitive, politically sensitive and military sensitive, three-class classification is obtained.
In an embodiment, the preset neural network model may be trained on the training sample set with a k-fold cross validation algorithm to obtain the sensitive information detection model. The basic idea of cross validation is to partition the original data, using one part as the training set and another as the validation set: the model is first trained on the training set and its generalisation error is then estimated on the validation set. Because data are always limited in practice, k-fold cross validation is used to reuse the data: 1/k of the data serves as the test set in each round, each model is trained and tested k times, the error rate is averaged over the k rounds, and the model with the smallest average error is finally selected.
Specifically, 10-fold cross validation may be used in actual training: the experimental samples are randomly divided into 10 parts, 9 parts are used as training data and 1 part as test data in each round, and the average of the 10 experimental results is taken as the final result. Alternatively, 5-fold or 20-fold cross validation may be used.
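A sketch of this k-fold procedure, with k = 10 as in the embodiment, might look as follows; train_and_evaluate is a hypothetical helper standing in for one round of training the preset neural network model and measuring its error rate.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(samples, labels, train_and_evaluate, k: int = 10):
    """Average the error rate over k rounds of training/testing on k folds."""
    samples, labels = np.asarray(samples), np.asarray(labels)
    error_rates = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(samples):
        error = train_and_evaluate(
            samples[train_idx], labels[train_idx],   # 9 parts as training data (for k = 10)
            samples[test_idx], labels[test_idx],     # 1 part as test data
        )
        error_rates.append(error)
    return sum(error_rates) / k  # the average result is taken as the final result
```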
According to the method for constructing a sensitive information detection model provided by the embodiment of the present invention, a pre-trained BERT model is first obtained through pre-training; a preset neural network model is then constructed from the pre-trained BERT model, a BiLSTM model, a fully connected layer and a classification layer, and trained to obtain the sensitive information detection model. Because the construction method adopts a BERT model, which encodes each word together with the context of the sentence it appears in, the problem of distinguishing polysemous words is solved; at the same time the method is more efficient than traditional manual detection and more accurate than word vector approaches based on the word2vec model.
An embodiment of the present invention further provides a method for detecting sensitive information, as shown in fig. 4, the method includes the following steps:
step S201: and acquiring the text to be detected.
Step S202: and inputting the text to be detected into the sensitive information detection model constructed by the construction method of the sensitive information detection model in the embodiment to obtain the sensitive result of the text to be detected. Specifically, the sensitive information detection model constructed by the construction method of the sensitive information detection model can realize a two-classification task and also can realize a multi-classification task. Therefore, the construction method of the sensitive information detection model can be used for constructing the sensitive information detection model of the second category and the sensitive information detection model of the multi-category, and then inputting the text to be detected into the corresponding models for detection according to the specific requirements of the sensitive information detection. For example, only by determining whether the text to be detected has sensitivity, the text to be detected can be input into a sensitive information detection model of the second classification; if the text to be detected needs to be determined whether the text to be detected is military sensitive or political sensitive, the text to be detected can be input into a sensitive information detection model of multi-category. Thereby obtaining a corresponding sensitive result.
Step S203: and managing and controlling the text to be detected according to the sensitive result of the text to be detected. Specifically, if the text to be detected is judged to have sensitive information according to the sensitive result, control measures can be taken for the text to be detected. In addition, if the text to be detected is detected to be politically sensitive or military sensitive, different grading management and control measures can be adopted according to the corresponding sensitivity.
The sensitive information detection method provided by the embodiment of the present invention can detect whether text data to be examined contain sensitive information; its efficiency greatly exceeds that of traditional manual detection, and its accuracy exceeds that of existing word vector classification methods based on the word2vec model. In addition, the detection method can apply access-right controls to text containing sensitive information according to the sensitive result, effectively protecting sensitive electronic text and preventing leakage of sensitive content.
An embodiment of the present invention further provides a device for constructing a sensitive information detection model, as shown in fig. 5, the device includes:
the system comprises a sample acquisition module, a data processing module and a data processing module, wherein the sample acquisition module is used for acquiring a training sample set and an open corpus; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The pre-training module is used for pre-training the BERT model based on the open corpus to obtain a pre-trained BERT model; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The model building module is used for building a preset neural network model from the pre-trained BERT model, a BiLSTM model, a fully connected layer and a classification layer; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
And the training module is used for training the preset neural network model according to a training sample set to obtain a sensitive information detection model. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The device for constructing a sensitive information detection model provided by the embodiment of the present invention first obtains a pre-trained BERT model through pre-training, then constructs a preset neural network model from the pre-trained BERT model, a BiLSTM model, a fully connected layer and a classification layer, and trains it to obtain the sensitive information detection model. Because the device adopts a BERT model, which encodes each word together with the context of the sentence it appears in, the problem of distinguishing polysemous words is solved; at the same time it is more efficient than traditional manual detection and more accurate than word vector approaches based on the word2vec model.
The functional description of the device for constructing the sensitive information detection model provided by the embodiment of the invention refers to the description of the method for constructing the sensitive information detection model in the embodiment.
An embodiment of the present invention further provides a sensitive information detecting apparatus, as shown in fig. 6, the apparatus includes:
the text acquisition module is used for acquiring a text to be detected; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The detection module is used for inputting the text to be detected into the sensitive information detection model constructed by the construction method of the sensitive information detection model in the embodiment to obtain the sensitive result of the text to be detected; for details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
And the management and control module is used for managing and controlling the text to be detected according to the sensitive result of the text to be detected. For details, reference is made to the corresponding parts of the above method embodiments, which are not described herein again.
The sensitive information detection apparatus provided by the embodiment of the present invention can detect whether text data to be examined contain sensitive information; its efficiency greatly exceeds that of traditional manual detection, and its accuracy exceeds that of existing word vector classification methods based on the word2vec model. In addition, the apparatus can apply access-right controls to text containing sensitive information according to the sensitive result, effectively protecting sensitive electronic text and preventing leakage of sensitive content.
For a detailed description of the functions of the sensitive information detection apparatus provided in the embodiments of the present invention, reference is made to the description of the sensitive information detection method in the above embodiments.
An embodiment of the present invention further provides a storage medium, as shown in fig. 7, on which a computer program 601 is stored; when executed by a processor, the program implements the steps of the construction method of the sensitive information detection model and of the sensitive information detection method in the foregoing embodiments. The storage medium also stores audio and video stream data, characteristic frame data, interaction request signalling, encrypted data, preset data sizes and the like. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the above kinds.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the above kinds.
An embodiment of the present invention further provides an electronic device, as shown in fig. 8, the electronic device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or in another manner, and fig. 8 takes the connection by the bus as an example.
The processor 51 may be a Central Processing Unit (CPU). The Processor 51 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 52, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the corresponding program instructions/modules in the embodiments of the present invention. The processor 51 executes various functional applications and data processing of the processor by running the non-transitory software programs, instructions and modules stored in the memory 52, that is, implements the construction method of the sensitive information detection model and the sensitive information detection method in the above method embodiments.
The memory 52 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store data created by the processor 51, and the like. Further, the memory 52 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 52 may optionally include memory located remotely from the processor 51, and these remote memories may be connected to the processor 51 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more modules are stored in the memory 52, and when executed by the processor 51, perform a sensitive information detection model construction method and a sensitive information detection method as in the embodiments shown in fig. 1 to 4.
The details of the electronic device may be understood by referring to the corresponding descriptions and effects in the embodiments shown in fig. 1 to fig. 4, and are not described herein again.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A construction method of a sensitive information detection model is characterized by comprising the following steps:
acquiring a training sample set and an open corpus;
pre-training the BERT model based on the open corpus to obtain a pre-trained BERT model;
constructing a preset neural network model according to the pre-trained BERT model, the BiLSTM model, the fully connected layer and the classification layer;
and training the preset neural network model according to the training sample set to obtain a sensitive information detection model.
2. The method for constructing the sensitive information detection model according to claim 1, wherein the training sample set is constructed by the following steps:
acquiring unstructured text in a data set;
analyzing the unstructured text to obtain text information in the unstructured text;
and constructing a training sample set according to the text information.
3. The method for constructing the sensitive information detection model according to claim 2, wherein the types of the unstructured text include a Microsoft Word document, a Microsoft PowerPoint presentation, a Microsoft Excel worksheet, and an Adobe Acrobat document.
4. The method for constructing the sensitive information detection model according to claim 3, wherein analyzing the unstructured text to obtain the text information in the unstructured text comprises:
analyzing to obtain first text information according to document.xml files in the file folder after the Microsoft Word document is decompressed;
analyzing to obtain second text information according to the slide.xml file in the folder after the Microsoft PowerPoint presentation is decompressed;
analyzing to obtain third text information according to the sharedStrings.xml file in the file folder after the Microsoft Excel worksheet is decompressed;
analyzing the Adobe Acrobat document according to a preset tool to obtain fourth text information;
and obtaining text information in the unstructured text according to the first text information, the second text information, the third text information and the fourth text information.
5. The method for constructing the sensitive information detection model according to claim 1, wherein training the preset neural network model according to a training sample set to obtain the sensitive information detection model comprises:
and training the preset neural network model according to a training sample set based on a k-fold cross validation algorithm to obtain a sensitive information detection model.
6. A method for sensitive information detection, comprising:
acquiring a text to be detected;
inputting a text to be detected into the sensitive information detection model constructed by the construction method of the sensitive information detection model according to any one of claims 1 to 5 to obtain a sensitive result of the text to be detected;
and managing and controlling the text to be detected according to the sensitive result of the text to be detected.
7. A construction device of a sensitive information detection model, characterized by comprising:
the sample acquisition module is used for acquiring a training sample set and an open corpus;
the pre-training module is used for pre-training the BERT model based on the open corpus to obtain a pre-trained BERT model;
the model building module is used for building a preset neural network model according to the pre-trained BERT model, the BiLSTM model, the fully connected layer and the classification layer;
and the training module is used for training the preset neural network model according to a training sample set to obtain a sensitive information detection model.
8. A sensitive information detecting apparatus, comprising:
the text acquisition module is used for acquiring a text to be detected;
the detection module is used for inputting a text to be detected into the sensitive information detection model constructed by the construction method of the sensitive information detection model according to any one of claims 1 to 5 to obtain a sensitive result of the text to be detected;
and the management and control module is used for managing and controlling the text to be detected according to the sensitive result of the text to be detected.
9. A computer-readable storage medium storing computer instructions for causing a computer to execute the method for constructing the sensitive information detection model according to any one of claims 1 to 5 and the method for detecting sensitive information according to claim 6.
10. An electronic device, comprising: a memory and a processor, the memory and the processor being communicatively connected to each other, the memory storing computer instructions, and the processor executing the computer instructions to perform the method for constructing the sensitive information detection model according to any one of claims 1 to 5 and the method for detecting sensitive information according to claim 6.
CN202111595176.2A 2021-12-23 2021-12-23 Construction method of sensitive information detection model, and sensitive information detection method and device Pending CN114491018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111595176.2A CN114491018A (en) 2021-12-23 2021-12-23 Construction method of sensitive information detection model, and sensitive information detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111595176.2A CN114491018A (en) 2021-12-23 2021-12-23 Construction method of sensitive information detection model, and sensitive information detection method and device

Publications (1)

Publication Number Publication Date
CN114491018A true CN114491018A (en) 2022-05-13

Family

ID=81493365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111595176.2A Pending CN114491018A (en) 2021-12-23 2021-12-23 Construction method of sensitive information detection model, and sensitive information detection method and device

Country Status (1)

Country Link
CN (1) CN114491018A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681086A (en) * 2023-07-31 2023-09-01 深圳市傲天科技股份有限公司 Data grading method, system, equipment and storage medium
CN116681086B (en) * 2023-07-31 2024-04-02 深圳市傲天科技股份有限公司 Data grading method, system, equipment and storage medium
CN117131159A (en) * 2023-08-30 2023-11-28 上海通办信息服务有限公司 Method, device, equipment and storage medium for extracting sensitive information
CN117391076A (en) * 2023-12-11 2024-01-12 东亚银行(中国)有限公司 Acquisition method and device of identification model of sensitive data, electronic equipment and medium
CN117391076B (en) * 2023-12-11 2024-02-27 东亚银行(中国)有限公司 Acquisition method and device of identification model of sensitive data, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN110765265B (en) Information classification extraction method and device, computer equipment and storage medium
WO2021027533A1 (en) Text semantic recognition method and apparatus, computer device, and storage medium
WO2021203581A1 (en) Key information extraction method based on fine annotation text, and apparatus and storage medium
CN112084337B (en) Training method of text classification model, text classification method and equipment
US20210200961A1 (en) Context-based multi-turn dialogue method and storage medium
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN111967242B (en) Text information extraction method, device and equipment
Abdullah et al. Fake news classification bimodal using convolutional neural network and long short-term memory
US10803387B1 (en) Deep neural architectures for detecting false claims
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
CN114491018A (en) Construction method of sensitive information detection model, and sensitive information detection method and device
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN111597803B (en) Element extraction method and device, electronic equipment and storage medium
CN110309282B (en) Answer determination method and device
CN110427612B (en) Entity disambiguation method, device, equipment and storage medium based on multiple languages
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
Chrupała Text segmentation with character-level text embeddings
CN111930939A (en) Text detection method and device
CN114661861B (en) Text matching method and device, storage medium and terminal
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN112632224B (en) Case recommendation method and device based on case knowledge graph and electronic equipment
Jiang et al. An LSTM-CNN attention approach for aspect-level sentiment classification
CN112257444B (en) Financial information negative entity discovery method, device, electronic equipment and storage medium
CN116150201A (en) Sensitive data identification method, device, equipment and computer storage medium
CN115983271A (en) Named entity recognition method and named entity recognition model training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination