CN117216280A

CN117216280A - Incremental learning method, identification method and device for sensitive data identification model

Info

Publication number: CN117216280A
Application number: CN202311483460.XA
Authority: CN
Inventors: 张黎; 吴洋
Original assignee: Flash It Co ltd
Current assignee: Flash It Co ltd
Priority date: 2023-11-09
Filing date: 2023-11-09
Publication date: 2023-12-12
Anticipated expiration: 2043-11-09
Also published as: CN117216280B

Abstract

The invention provides an incremental learning method, an identification method and a device for a sensitive data recognition model.

Description

Incremental learning method, identification method and device for sensitive data identification model

Technical Field

The invention relates to the technical field of electric data processing, and in particular to an incremental learning method, an identification method and a device for a sensitive data identification model.

Background

In today's digitized world, the importance of protecting data privacy is self-evident. With the widespread use of big data and artificial intelligence, a vast amount of personal information is collected, stored and processed, including but not limited to personal identity, location, health status and consumption habits. If such data is illegally used or leaked, it may seriously affect an individual's life, work and even personal safety. Protecting data privacy therefore not only safeguards personal interests but is also an important guarantee of social stability and development. The premise of protecting data privacy is identifying the private data, that is, determining which data is sensitive and needs special protection. Once identified, the data can be given special processing, such as encrypted storage or anonymization, to prevent the private data from being compromised.

At present, privacy data identification generally relies on a privacy data identification model built with machine learning or deep learning technology, and data containing privacy information is identified by the trained model. Such models can identify private data in various forms, such as text, images and sound. However, because the definition and form of private data are diverse and new forms continue to emerge over time, existing models need to be continuously updated and optimized. The privacy data recognition model therefore needs incremental learning, so that when new data is added the model can learn and update on its original basis without being retrained in full. However, incremental learning has certain weaknesses in protecting privacy. First, incremental learning must process the new data, and if that data contains sensitive information, the information may be revealed during processing. Second, the parameters of the model change during incremental learning, and if these changes are maliciously exploited, the original sensitive data may be inferred. How to protect private data during incremental learning is therefore an important current research direction.

Disclosure of Invention

The invention provides an incremental learning method, an identification method and a device for a sensitive data identification model, which are used to overcome the defect in the prior art that data privacy may be leaked when a model performs incremental learning.

The invention provides an incremental learning method of a sensitive data identification model, which comprises the following steps:

receiving a model incremental learning request initiated by a user, and determining an incremental learning mode and training sample update information based on the model incremental learning request; the incremental learning mode comprises addition, deletion and category modification of a data category, a sample file or a sensitive keyword;

after an update sample is determined based on the training sample update information and the keywords of the update sample are extracted, updating the training text set based on the update sample, and updating the classification keyword list of the corresponding category in the data identification model based on the incremental learning mode and the keywords of the update sample;

and carrying out sensitive data recognition on the keyword sequences of all training samples in the training text set based on the data recognition model to obtain recognition results of all training samples, and updating the classification keyword list of the corresponding category based on the difference between the category label of all training samples and the recognition results to obtain the data recognition model after incremental learning.

According to the incremental learning method of the sensitive data recognition model provided by the invention, the classification keyword list of the corresponding category in the data recognition model is updated based on the incremental learning mode and the keywords of the update sample, and the method specifically comprises the following steps:

if the incremental learning mode is addition or deletion of a sample file or a sensitive keyword, adding the keywords of the update sample to the classified keyword list of the category corresponding to the sample file or the sensitive keyword and setting weights corresponding to the keywords of the update sample, or deleting the keywords of the update sample from the classified keyword list of the category corresponding to the sample file or the sensitive keyword;

if the incremental learning mode is new addition or deletion of the data category, constructing a classified keyword list corresponding to the data category based on the keywords of the update sample, and setting weight corresponding to the keywords of the update sample, or deleting the classified keyword list corresponding to the data category;

if the incremental learning mode is the category modification of the sample file or the sensitive keyword, deleting the keyword of the updated sample from the classified keyword list of the original category corresponding to the sample file or the sensitive keyword, and adding the keyword of the updated sample into the classified keyword list of the updated category corresponding to the sample file or the sensitive keyword;

If the incremental learning mode is the category modification of the data category, fusing the classification keyword list of the original category with the classification keyword list of the updated category.
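The four branches above can be sketched as a single dispatch function. This is a minimal illustration only, under several assumptions that are not in the source: lists are keyed by keyword strings rather than word embedding vectors, `DEFAULT_WEIGHT` is a hypothetical initial weight, a keyword already present in the destination list keeps its existing weight, and the original list is removed after fusion. All names are hypothetical.

```python
from enum import Enum


class Target(Enum):
    DATA_CATEGORY = "data_category"
    SAMPLE_OR_KEYWORD = "sample_or_keyword"  # a sample file or a sensitive keyword


class Action(Enum):
    ADD = "add"
    DELETE = "delete"
    MODIFY = "modify"


DEFAULT_WEIGHT = 0.5  # hypothetical initial weight, not specified by the source


def update_keyword_lists(lists, target, action, keywords, category, new_category=None):
    """Update `lists` (category -> {keyword: weight}) in place and return it."""
    if target is Target.SAMPLE_OR_KEYWORD:
        if action is Action.ADD:
            bucket = lists.setdefault(category, {})
            for kw in keywords:
                bucket.setdefault(kw, DEFAULT_WEIGHT)  # existing weights are kept
        elif action is Action.DELETE:
            for kw in keywords:
                lists.get(category, {}).pop(kw, None)
        else:  # MODIFY: move keywords from the original to the update category
            source = lists.get(category, {})
            dest = lists.setdefault(new_category, {})
            for kw in keywords:
                weight = source.pop(kw, DEFAULT_WEIGHT)
                dest.setdefault(kw, weight)  # simplification: keep any existing weight
    else:
        if action is Action.ADD:
            lists[category] = {kw: DEFAULT_WEIGHT for kw in keywords}
        elif action is Action.DELETE:
            lists.pop(category, None)
        else:  # MODIFY: fuse the original list into the update category's list
            dest = lists.setdefault(new_category, {})
            for kw, weight in lists.pop(category, {}).items():
                dest.setdefault(kw, weight)
    return lists
```

For instance, calling `update_keyword_lists(lists, Target.DATA_CATEGORY, Action.MODIFY, [], "finance", "hr")` merges the hypothetical `finance` list into `hr`.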

According to the incremental learning method of the sensitive data recognition model provided by the invention, if the incremental learning mode is new addition or deletion of a sample file or a sensitive keyword, adding the keyword of the updated sample into a classified keyword list of a category corresponding to the sample file or the sensitive keyword, and setting a weight corresponding to the keyword of the updated sample, or deleting the keyword of the updated sample from the classified keyword list of the category corresponding to the sample file or the sensitive keyword, specifically including:

if the incremental learning mode is the deletion of a sample file or a sensitive keyword, clustering the keywords of the updated sample to obtain a plurality of keyword class clusters, and determining the weight of the keywords of the updated sample in a classified keyword list of the corresponding category of the sample file or the sensitive keyword;

for any keyword class cluster, dividing similar keywords in the any keyword class cluster according to the category of the affiliated updated text, and determining whether the any keyword class cluster is a cross-category cluster or not based on the number of the similar keywords in each similar keyword set after obtaining a plurality of similar keyword sets;

if the keyword class cluster is not a cross-category cluster, or if the keyword class cluster is a cross-category cluster and contains more than a preset number of similar keywords whose weight in the classified keyword list of the category corresponding to the sample file or the sensitive keyword is greater than a first preset value, reducing the weights of the similar keywords of the keyword class cluster in the classified keyword list of the category corresponding to the sample file or the sensitive keyword;

if any keyword class cluster is a cross-class cluster and the number of similar keywords with the weight larger than a first preset value in the classified keyword list of the class corresponding to the sample file or the sensitive keyword is smaller than the preset number, deleting the similar keywords with the weight smaller than or equal to the first preset value in any keyword class cluster from the classified keyword list of the class corresponding to the sample file or the sensitive keyword.

According to the incremental learning method of the sensitive data recognition model provided by the invention, if the incremental learning mode is a modification of a category of a sample file or a sensitive keyword, deleting the keyword of the updated sample from a category keyword list corresponding to an original category of the sample file or the sensitive keyword, and adding the keyword of the updated sample to the category keyword list corresponding to the updated category of the sample file or the sensitive keyword, wherein the method specifically comprises the following steps:

for each repeated keyword, that is, each keyword of the update sample that already exists in the classified keyword list of the update category corresponding to the sample file or the sensitive keyword, determining the weight of the repeated keyword in the classified keyword list of the original category corresponding to the sample file or the sensitive keyword and its weight in the classified keyword list of the update category corresponding to the sample file or the sensitive keyword;

if the weights of the repeated keywords in the classified keyword list of the original category corresponding to the sample file or the sensitive keyword and the classified keyword list of the update category corresponding to the sample file or the sensitive keyword are respectively larger than a second preset value, the weights of the repeated keywords in the classified keyword list of the update category corresponding to the sample file or the sensitive keyword are maintained;

otherwise, updating the weight of the repeated keyword in the classified keyword list of the corresponding update category of the sample file or the sensitive keyword based on the average value of the weights of the repeated keyword in the classified keyword list of the corresponding original category of the sample file or the sensitive keyword and the classified keyword list of the corresponding update category of the sample file or the sensitive keyword.
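The weight rule for repeated keywords can be illustrated as follows. This is a sketch under assumptions: lists are keyed by keyword strings, the value `0.6` for the second preset value is arbitrary, and a non-repeated keyword simply carries its original weight into the update category (the source does not specify this case). Function names are hypothetical.

```python
def merge_repeated_weight(orig_weight, update_weight, second_preset):
    """Weight of a repeated keyword in the update category's list after the move."""
    if orig_weight > second_preset and update_weight > second_preset:
        return update_weight  # both weights exceed the preset: keep the existing one
    return (orig_weight + update_weight) / 2  # otherwise use the average


def move_keywords(orig_list, update_list, keywords, second_preset=0.6):
    """Move `keywords` from the original category's list to the update category's list."""
    for kw in keywords:
        if kw not in orig_list:
            continue
        orig_weight = orig_list.pop(kw)
        if kw in update_list:  # repeated keyword
            update_list[kw] = merge_repeated_weight(
                orig_weight, update_list[kw], second_preset
            )
        else:
            update_list[kw] = orig_weight  # assumption: carry the original weight over
    return update_list
```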

According to the incremental learning method of the sensitive data recognition model provided by the invention, the key word sequence of each training sample in the training text set is subjected to sensitive data recognition based on the data recognition model to obtain the recognition result of each training sample, and the incremental learning method concretely comprises the following steps:

extracting word embedding vectors corresponding to the keywords in the keyword sequence of any training sample;

and determining the recognition result of any training sample based on the classified keyword list corresponding to each category in the data recognition model and the word embedding vector corresponding to each keyword in the keyword sequence of any training sample.

The invention also provides an identification method, which comprises the following steps:

receiving a file identification request submitted by a user; the file identification request carries a path and a file name of a file to be identified;

acquiring the file to be identified based on the path and the file name of the file to be identified, extracting text content in the file to be identified, and extracting a keyword sequence of the text content;

performing sensitive data recognition on the keyword sequence of the text content based on a data recognition model to obtain a recognition result of the file to be recognized;

wherein the data identification model is obtained by the incremental learning method of the sensitive data identification model according to any one of the above.

The invention also provides an incremental learning device of the sensitive data identification model, which comprises:

the learning request receiving unit is used for receiving a model incremental learning request initiated by a user and determining an incremental learning mode and training sample update information based on the model incremental learning request; the incremental learning mode comprises addition, deletion and category modification of a data category, a sample file or a sensitive keyword;

the keyword list updating unit is used for updating the training text set based on the updating sample after determining the updating sample based on the training sample updating information and extracting the keywords of the updating sample, and updating the classified keyword list of the corresponding category in the data recognition model based on the incremental learning mode and the keywords of the updating sample;

the model increment learning unit is used for carrying out sensitive data recognition on the keyword sequences of all training samples in the training text set based on the data recognition model to obtain recognition results of all training samples, and updating the classification keyword list of the corresponding category based on the difference between the category label of all training samples and the recognition results to obtain the data recognition model after increment learning.

The invention also provides an identification device, comprising:

the identification request receiving unit is used for receiving a file identification request submitted by a user; the file identification request carries a path and a file name of a file to be identified;

a text content extraction unit, configured to obtain the file to be identified based on a path and a file name of the file to be identified, extract text content in the file to be identified, and extract a keyword sequence of the text content;

the sensitive data identification unit is used for carrying out sensitive data identification on the keyword sequence of the text content based on a data identification model to obtain an identification result of the file to be identified;

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the incremental learning method or the identification method of the sensitive data identification model according to any one of the above when executing the program.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an incremental learning method or recognition method of a sensitive data recognition model as described in any one of the above.

The invention also provides a computer program product comprising a computer program which when executed by a processor implements an incremental learning method or recognition method of a sensitive data recognition model as described in any one of the above.

According to the incremental learning method, identification method and device for the sensitive data recognition model provided by the invention, the incremental learning mode and the training sample update information are determined based on the user's model incremental learning request. After the update sample is determined based on the training sample update information and the keywords of the update sample are extracted, the training text set is updated based on the update sample, and the classified keyword list of the corresponding category in the data recognition model is updated based on the incremental learning mode and the keywords of the update sample. Sensitive data recognition is then performed on the keyword sequence of each training sample in the training text set based on the data recognition model to obtain the recognition result of each training sample, and the classified keyword list of the corresponding category is updated based on the difference between the category label and the recognition result of each training sample, yielding the data recognition model after incremental learning. The original content of the training samples does not need to be obtained during incremental learning, which guards against leakage in the incremental learning process. Parameter changes during incremental learning are reflected only in changes to keyword weights, so the sensitive content of the training samples cannot be inferred from those changes, which improves privacy protection during incremental learning. In addition, texts do not need to be repeatedly processed and embedded during incremental learning, which reduces the associated workload.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of an incremental learning method of a sensitive data recognition model provided by the invention;

FIG. 2 is a flow chart of an identification method provided by the invention;

FIG. 3 is a schematic diagram of the structure of the incremental learning device of the sensitive data recognition model provided by the invention;

FIG. 4 is a schematic diagram of the structure of the identification device provided by the present invention;

FIG. 5 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Fig. 1 is a flow chart of an incremental learning method of a sensitive data identification model provided by the invention, as shown in fig. 1, the method includes:

step 110, receiving a model incremental learning request initiated by a user, and determining an incremental learning mode and training sample update information based on the model incremental learning request; the incremental learning mode comprises addition, deletion and category modification of a data category, a sample file or a sensitive keyword;

step 120, after determining an update sample based on the update information of the training sample and extracting keywords of the update sample, updating a training text set based on the update sample, and updating a classification keyword list of a corresponding category in a data recognition model based on the incremental learning mode and the keywords of the update sample;

step 130, carrying out sensitive data recognition on the keyword sequences of all training samples in the training text set based on the data recognition model to obtain recognition results of all training samples, and updating the classification keyword list of the corresponding class based on the difference between the class labels of all training samples and the recognition results to obtain the data recognition model after incremental learning.

Here, the data recognition model on which the user needs to perform incremental learning is a model that has already been initially trained on the training samples in a training text set, so the classification keyword lists it maintains can already classify text data. The number of classification keyword lists in the data recognition model depends on the type of text classification task the model performs, which is not specifically limited in the embodiments of the present invention. If the data recognition model is a binary classification model, a single classification keyword list may be used to distinguish whether a text belongs to a specific category; if the data recognition model is a multi-class model, the classification keyword lists correspond one-to-one to the classification categories, and the list for each category is used to distinguish whether a text belongs to that category.

It should be noted that, to improve data privacy protection during training of the data recognition model, keyword extraction may be performed locally on each training sample during initial training to obtain the keyword sequence of the training sample. A pre-trained word embedding model (for example, a Word2Vec model or a GloVe model) is then used to obtain the word embedding vector of each keyword in the keyword sequence, and these vectors are combined into the word embedding vector sequence corresponding to the keyword sequence. The word embedding vectors of different keywords are different, and the word embedding vectors of synonyms are similar. In addition, features of each keyword in the training samples, such as word frequency and TF-IDF features, can be extracted locally. The word embedding vector sequence of each training sample (optionally combined with the features of each keyword) and the class label of each training sample are sent to a remote server, and the data recognition model is trained at the remote server on these inputs. This ensures that the original data of the training samples stays local and is kept secret from all other parties, including the data recognition model.

When initial training starts, the data recognition model constructs and initializes a classification keyword list for each class based on the word embedding vector sequence of each training sample and its class label. Each classification keyword list contains a weight for each word embedding vector; the larger the weight, the more strongly the corresponding word embedding vector (or the word it represents) discriminates texts of that class. As training proceeds, the classification keyword list of each category is adjusted based on the difference between the recognition result that the data recognition model outputs for a training sample and the class label of that sample, which improves the text recognition capability of the model. Given the word embedding vector sequence and keyword features of an input training sample, the data recognition model determines the probability that the sample belongs to each category based on the weights that the word embedding vectors in the sequence have in the classification keyword list of each category, and thereby obtains the recognition result of the training sample.
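One way to picture how per-category probabilities could be derived from the weighted keyword lists is sketched below. The source does not give the scoring formula, so everything here is an assumption: each sample vector is matched to list entries by cosine similarity above a hypothetical `min_similarity`, the best `weight * similarity` per vector is summed into a category score, and a softmax turns the scores into probabilities. Keyword features such as TF-IDF are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def category_scores(sample_vectors, keyword_lists, min_similarity=0.8):
    """keyword_lists: category -> list of (embedding, weight) entries."""
    scores = {}
    for category, entries in keyword_lists.items():
        total = 0.0
        for vec in sample_vectors:
            best = 0.0
            for emb, weight in entries:
                sim = cosine(vec, emb)
                if sim >= min_similarity:  # similar vectors (e.g. synonyms) also match
                    best = max(best, weight * sim)
            total += best
        scores[category] = total
    return scores


def to_probabilities(scores):
    """Softmax over the category scores (hypothetical normalisation)."""
    if not scores:
        return {}
    peak = max(scores.values())
    exps = {c: math.exp(s - peak) for c, s in scores.items()}
    norm = sum(exps.values())
    return {c: e / norm for c, e in exps.items()}
```

The recognition result would then be the category with the highest probability, for example `max(probs, key=probs.get)`.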

In some embodiments, the class labels of the training samples may be labeled manually or obtained with automatic labeling techniques. With automatic labeling, the training samples can be classified according to the folders they belong to and given the corresponding class labels, so that training samples in the same folder correspond to the same category. Alternatively, the number of categories can be set as required and a clustering algorithm used to group the training samples into that number of categories, so that each sample belongs to one category and receives the corresponding class label. As a further alternative, a similarity threshold can be set and training samples whose similarity to one another is high classified into the same category, with the corresponding class labels assigned accordingly.
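Two of the automatic labeling options can be sketched briefly. The folder-based variant follows the text directly. The similarity-based variant is only one possible reading: it assumes Jaccard similarity over keyword sets and a greedy assignment to the first sufficiently similar group, neither of which is specified in the source. Function names and the threshold value are hypothetical.

```python
from pathlib import PurePath


def labels_from_folders(paths):
    """Label each sample with the name of the folder that directly contains it."""
    return {str(p): PurePath(p).parent.name for p in paths}


def jaccard(a, b):
    """Jaccard similarity of two keyword collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def labels_from_similarity(keyword_sets, threshold=0.5):
    """Greedy grouping: a sample joins the first group whose representative is similar enough."""
    representatives = []  # keyword set of the first sample in each group
    labels = []
    for keywords in keyword_sets:
        for index, rep in enumerate(representatives):
            if jaccard(keywords, rep) >= threshold:
                labels.append(index)
                break
        else:
            representatives.append(keywords)
            labels.append(len(representatives) - 1)
    return labels
```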

Although the data recognition model obtained after initial training can recognize texts of various categories, the forms of private data and the user's privacy recognition requirements change over time, so the data recognition model needs incremental learning according to the model incremental learning request made by the user in order to adapt to new requirements. The model incremental learning request initiated by the user comprises the incremental learning mode for the data recognition model and the training sample update information. Here, the incremental learning mode includes addition, deletion and category modification of a data category, a sample file or a sensitive keyword. Correspondingly, the training sample update information comprises the information needed to update the training text set according to the incremental learning mode. For example, when the incremental learning mode is addition of a data category, a sample file or a sensitive keyword, the training sample update information includes the training samples corresponding to the added data category, sample file or sensitive keyword, and the training text set is updated by adding these training samples to it. When the incremental learning mode is deletion of a data category, a sample file or a sensitive keyword, the training sample update information includes the training samples corresponding to the deleted data category, sample file or sensitive keyword, and the training text set is updated by deleting these training samples from it. When the incremental learning mode is category modification of a data category, a sample file or a sensitive keyword, the training sample update information includes the training samples corresponding to the data category, sample file or sensitive keyword before the category modification, and the training text set is updated by updating the category labels of the corresponding training samples.
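The three ways the training text set is updated can be sketched as follows. The request structure is hypothetical: the source does not define field names or a wire format, and here the training text set is reduced to a mapping from sample id to category label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncrementalRequest:
    """Hypothetical shape of a model incremental learning request."""
    action: str                         # "add", "delete" or "modify"
    samples: dict                       # sample id -> category label (pre-modification for "modify")
    new_category: Optional[str] = None  # only used for "modify"


def apply_to_training_set(training_set, request):
    """Update `training_set` (sample id -> label) in place; return the update sample ids."""
    if request.action == "add":
        training_set.update(request.samples)
    elif request.action == "delete":
        for sample_id in request.samples:
            training_set.pop(sample_id, None)
    elif request.action == "modify":
        for sample_id in request.samples:
            if sample_id in training_set:
                training_set[sample_id] = request.new_category  # relabel only
    else:
        raise ValueError(f"unknown incremental learning action: {request.action}")
    return list(request.samples)
```

The returned ids identify the update samples whose keywords are extracted in the next step.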

Based on the training sample update information, the training samples that need to be changed (added, deleted, or given an updated category label) can be determined as update samples, and the keywords of each update sample can be extracted. After the training text set has been updated accordingly based on the training sample update information and the update samples, the classification keyword list of the corresponding category in the data recognition model is updated based on the incremental learning mode and the keywords of each update sample. Depending on the incremental learning mode, the word embedding vector of each keyword of an update sample may be added to the classification keyword list of the category to which the update sample belongs, deleted from that list, or moved to the classification keyword list of the update sample's new category. Changing the structure of the classification keyword lists in this way adapts the data recognition model to new classification requirements.

In some embodiments, if the incremental learning mode is addition or deletion of a sample file or a sensitive keyword, the keywords of each update sample are added to the classified keyword list of the category corresponding to the sample file or the sensitive keyword and given weights in that list, or the keywords of the update sample are deleted from that list. When the incremental learning mode is addition of a sample file or a sensitive keyword, the weights of keywords of the update sample that already exist in the corresponding classified keyword list can be kept unchanged. For keywords of the update sample that are not yet in the corresponding classified keyword list, the TF-IDF value of each keyword can be calculated over the training samples belonging to the same category, and the corresponding weight set based on that TF-IDF value. The higher the TF-IDF value calculated in this way, the more important the keyword is to its own update text but the more rarely it appears in other texts of the same category, so its ability to characterize texts of that category is relatively poor and its weight can be set lower.
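The inverse relationship between within-category TF-IDF and the initial weight can be sketched as below. The source only states that a higher TF-IDF value can lead to a lower weight, so the smoothed IDF formula and the mapping `1 / (1 + tfidf)` are assumptions chosen for illustration, and lists are keyed by keyword strings.

```python
import math
from collections import Counter


def tf_idf(keyword, sample_keywords, class_samples):
    """TF-IDF of `keyword` in one update sample, with IDF taken over same-category samples."""
    counts = Counter(sample_keywords)
    tf = counts[keyword] / len(sample_keywords) if sample_keywords else 0.0
    containing = sum(1 for sample in class_samples if keyword in sample)
    # Smoothed IDF (assumed variant): always >= 1, so the product stays non-negative.
    idf = math.log((1 + len(class_samples)) / (1 + containing)) + 1.0
    return tf * idf


def weights_for_new_keywords(sample_keywords, class_samples, keyword_list):
    """Set weights only for keywords that are not yet in `keyword_list`."""
    for kw in set(sample_keywords):
        if kw in keyword_list:
            continue  # weights of existing keywords stay unchanged
        score = tf_idf(kw, sample_keywords, class_samples)
        keyword_list[kw] = 1.0 / (1.0 + score)  # hypothetical mapping: higher TF-IDF, lower weight
    return keyword_list
```

In this sketch, `class_samples` is expected to contain the keyword sequences of all training samples of the category, including the update sample itself.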

If the incremental learning mode is addition or deletion of a data category, a classification keyword list corresponding to the data category may be constructed based on the keywords of the update sample and the weights corresponding to those keywords set (for example, uniformly to a preset initial value), or the classification keyword list corresponding to the data category may be deleted.

If the incremental learning mode is category modification of a sample file or a sensitive keyword, the keywords of the update sample are deleted from the classified keyword list of the original category (i.e. the category before modification) of the sample file or the sensitive keyword and added to the classified keyword list of the update category (i.e. the category after modification). For example, if the category of the sample file or the sensitive keyword is modified from A to B, the keywords of the update sample are deleted from the classified keyword list corresponding to category A and added to the classified keyword list corresponding to category B. If the incremental learning mode is category modification of a data category, the classified keyword list of the original category is fused with the classified keyword list of the update category. The fusion may be performed in the same way as adding the keywords of the update sample to the classified keyword list of the update category when the incremental learning mode is category modification of a sample file or a sensitive keyword.

In other embodiments, if the incremental learning mode is deletion of a sample file or a sensitive keyword, the keyword of the updated sample may be deleted from the classified keyword list of the corresponding category of the sample file or the sensitive keyword by:

The keywords of the update sample are clustered to obtain a plurality of keyword clusters, where the TF-IDF values of the keywords may be used as the keyword features for clustering. At the same time, the weight of each keyword of the update sample in the classified keyword list of the category corresponding to the sample file or sensitive keyword may be determined. The similar keywords in any keyword cluster are then divided according to the category of the update text to which they belong, yielding a plurality of similar keyword sets, such that the similar keywords in each set belong to update texts of the same category. Next, whether the keyword cluster is a cross-category cluster is determined based on the number of similar keywords in each similar keyword set. Specifically, the variance of the numbers of similar keywords across the similar keyword sets may be calculated; if the variance is smaller than a preset variance threshold, that is, the keywords of the cluster are spread roughly evenly over the categories, the keyword cluster is determined to be a cross-category cluster.
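The variance test on a single keyword cluster can be sketched as below; the clustering step itself is omitted. The function name, the use of the population variance, and the treatment of a cluster whose keywords all come from one category as not cross-category are assumptions, since the description does not address these points.

```python
from collections import Counter
from statistics import pvariance

def is_cross_category_cluster(cluster, category_of, variance_threshold):
    """Split the cluster's keywords by the category of the update text they
    came from, then test the variance of the per-category counts."""
    counts = Counter(category_of[kw] for kw in cluster)
    sizes = list(counts.values())
    if len(sizes) < 2:
        return False  # assumption: a single-category cluster is not cross-category
    # A small variance means the keywords are spread evenly over the categories.
    return pvariance(sizes) < variance_threshold

category_of = {
    "iban": "finance", "swift": "finance", "account": "finance",
    "name": "identity", "passport": "identity",
}
balanced = is_cross_category_cluster(["iban", "swift", "name", "passport"], category_of, 0.5)
skewed = is_cross_category_cluster(["iban", "swift", "account", "name"], category_of, 0.5)
```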

If a keyword cluster is not a cross-category cluster, or if it is a cross-category cluster and contains more than a preset number of similar keywords whose weights in the classified keyword list of the category corresponding to the sample file or sensitive keyword are greater than a first preset value, then either the keywords exist only in a specific category, which indicates that they are important for category division, or they appear in texts of multiple categories but the model nevertheless regards them as relatively important for category division. In either case, the weights of the similar keywords of that cluster in the classified keyword list of the category corresponding to the sample file or sensitive keyword may be reduced, but the keywords are not deleted. If a keyword cluster is a cross-category cluster and the number of similar keywords whose weights in the classified keyword list of the category corresponding to the sample file or sensitive keyword are greater than the first preset value is smaller than the preset number, the keywords are relatively unimportant for category division. In that case, the similar keywords of the cluster whose weights are less than or equal to the first preset value are deleted from the classified keyword list of the category corresponding to the sample file or sensitive keyword, and the weights of the similar keywords of the cluster whose weights are greater than the first preset value may be reduced.
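The two branches of this rule can be written out as a short sketch. The parameter names and the multiplicative `decay` factor used to reduce a weight are assumptions; the description says only that weights are reduced. The case in which the number of high-weight keywords exactly equals the preset number is not covered by the description, so the sketch leaves the list unchanged there.

```python
def prune_cluster(keyword_list, cluster, cross_category,
                  first_preset, preset_number, decay=0.5):
    """Apply the deletion rule to one keyword cluster of a deleted sample."""
    present = [kw for kw in cluster if kw in keyword_list]
    heavy = [kw for kw in present if keyword_list[kw] > first_preset]
    if not cross_category or len(heavy) > preset_number:
        # Keywords matter for category division: reduce weights, keep keywords.
        for kw in present:
            keyword_list[kw] *= decay
    elif len(heavy) < preset_number:
        # Keywords matter little: delete the light ones, reduce the heavy ones.
        for kw in present:
            if keyword_list[kw] > first_preset:
                keyword_list[kw] *= decay
            else:
                del keyword_list[kw]
    return keyword_list

kept = prune_cluster({"a": 0.9, "b": 0.8, "c": 0.2}, ["a", "b", "c"],
                     cross_category=False, first_preset=0.5, preset_number=2)
pruned = prune_cluster({"a": 0.9, "b": 0.3, "c": 0.2}, ["a", "b", "c"],
                       cross_category=True, first_preset=0.5, preset_number=2)
```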

In addition, if the incremental learning mode is a modification of the category of a sample file or a sensitive keyword, then for each repeated keyword, i.e., each keyword of the update sample that already exists in the classified keyword list of the update category of the sample file or sensitive keyword, the weights of the repeated keyword in the classified keyword list of the original category and in the classified keyword list of the update category may both be determined. If both of these weights are greater than a second preset value, the weight of the repeated keyword in the classified keyword list of the update category is kept unchanged. Otherwise, the weight of the repeated keyword in the classified keyword list of the update category is updated based on the average of its weights in the classified keyword list of the original category and in the classified keyword list of the update category.
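The rule for a repeated keyword reduces to a small function. The name `merge_repeated_weight` is hypothetical, and setting the new weight equal to the average is one simple reading of "updated based on the average value".

```python
def merge_repeated_weight(original_weight, updated_weight, second_preset):
    """Weight of a repeated keyword in the update category's list after a
    category modification."""
    if original_weight > second_preset and updated_weight > second_preset:
        return updated_weight  # both weights are high: keep it unchanged
    # Otherwise use the average of the two weights (assumed reading).
    return (original_weight + updated_weight) / 2.0

unchanged = merge_repeated_weight(0.8, 0.9, second_preset=0.5)
averaged = merge_repeated_weight(0.2, 0.8, second_preset=0.5)
```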

After the classified keyword list of the corresponding category in the data recognition model has been updated based on the incremental learning mode and the keywords of the update sample, sensitive data recognition may be performed, based on the data recognition model, on the keyword sequence of each training sample in the updated training text set to obtain a recognition result for each training sample. Specifically, the word embedding vector of each keyword in the keyword sequence of a training sample may be extracted, and the resulting word embedding vector sequence may be input to the data recognition model for recognition; the recognition result is obtained in the same manner as in the initial training process described in the above embodiments. In some embodiments, the recognition result of any training sample may be determined based on the classified keyword list of each category in the data recognition model and the word embedding vector of each keyword in the keyword sequence of that training sample. In other embodiments, the recognition result may be determined based on the classified keyword list of each category in the data recognition model in combination with both the word embedding vector of each keyword in the keyword sequence and features of each keyword (such as word frequency and TF-IDF). The classified keyword list of the corresponding category is then updated based on the difference between the category label and the recognition result of each training sample, yielding an incrementally learned data recognition model that can adapt to the user's new requirements.
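The recognize-then-correct loop can be illustrated with a deliberately simplified sketch. It is not the model described above: word embeddings are left out, recognition is replaced by a sum of matched keyword weights, and the perceptron-style correction with a fixed `step` is an assumption standing in for "updating based on the difference between the category label and the recognition result".

```python
def recognize(keyword_sequence, keyword_lists):
    """Score each category by the summed weights of its matched keywords
    (simplified stand-in for the data recognition model)."""
    scores = {cat: sum(weights.get(kw, 0.0) for kw in keyword_sequence)
              for cat, weights in keyword_lists.items()}
    return max(scores, key=scores.get)

def refine(keyword_lists, training_samples, step=0.1, epochs=5):
    """Adjust keyword weights wherever the recognition result differs
    from the category label (assumed update rule)."""
    for _ in range(epochs):
        for keywords, label in training_samples:
            predicted = recognize(keywords, keyword_lists)
            if predicted == label:
                continue
            for kw in set(keywords):
                if kw in keyword_lists[label]:
                    keyword_lists[label][kw] += step
                if kw in keyword_lists[predicted]:
                    lowered = keyword_lists[predicted][kw] - step
                    keyword_lists[predicted][kw] = max(0.0, lowered)
    return keyword_lists

lists = {"finance": {"account": 0.1, "iban": 0.3},
         "identity": {"account": 0.5, "passport": 0.6}}
samples = [(["account", "iban"], "finance"), (["passport", "account"], "identity")]
before = recognize(["account", "iban"], lists)
refine(lists, samples)
after = recognize(["account", "iban"], lists)
```

In this toy data the first sample is initially misrecognized as "identity"; one correction pass raises the "finance" weights and lowers the competing "identity" weight enough for both samples to be recognized correctly.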

In summary, the method provided by the embodiment of the invention determines the incremental learning mode and the training sample update information based on the user's model incremental learning request. After the update sample is determined based on the training sample update information and its keywords are extracted, the training text set is updated based on the update sample, and the classified keyword list of the corresponding category in the data recognition model is updated based on the incremental learning mode and the keywords of the update sample. Sensitive data recognition is then performed, based on the data recognition model, on the keyword sequence of each training sample in the training text set to obtain a recognition result for each training sample, and the classified keyword list of the corresponding category is updated based on the difference between the category label and the recognition result of each training sample, yielding the incrementally learned data recognition model.

Based on any of the above embodiments, Fig. 2 is a schematic flow chart of an identification method provided by the present invention. As shown in Fig. 2, the method includes:

step 210, receiving a file identification request submitted by a user; the file identification request carries a path and a file name of a file to be identified;

step 220, acquiring the file to be identified based on the path and the file name of the file to be identified, extracting text content in the file to be identified, and extracting a keyword sequence of the text content;

step 230, performing sensitive data recognition on the keyword sequence of the text content based on a data recognition model to obtain a recognition result of the file to be recognized;

wherein the data recognition model is obtained through incremental learning based on the incremental learning method of the sensitive data recognition model provided by any of the above embodiments.
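Steps 210 to 230 can be sketched end to end as follows. The request format (a dictionary with `path` and `file_name`), the regular-expression tokenizer with a small stop-word set, and the weight-sum scoring are all assumptions for illustration; the description does not fix the keyword extraction method, and the actual recognition is performed by the incrementally learned data recognition model.

```python
import re
from pathlib import Path

STOPWORDS = frozenset({"the", "a", "of", "and", "is"})  # illustrative only

def extract_keyword_sequence(text):
    """Hypothetical keyword extraction: lowercase tokens minus stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def identify_file(request, keyword_lists):
    # Step 210: the request carries the path and file name of the file.
    file_path = Path(request["path"]) / request["file_name"]
    # Step 220: acquire the file, extract its text and its keyword sequence.
    text = file_path.read_text(encoding="utf-8")
    keywords = extract_keyword_sequence(text)
    # Step 230: recognize sensitive data (simplified weight-sum scoring).
    scores = {cat: sum(weights.get(kw, 0.0) for kw in keywords)
              for cat, weights in keyword_lists.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no sensitive category matched
```

A call such as `identify_file({"path": "/data/in", "file_name": "note.txt"}, keyword_lists)` would return the best-matching category name, or `None` when no keyword of any category occurs in the file; the directory shown is a placeholder.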

The incremental learning device of the sensitive data recognition model provided by the invention is described below. The incremental learning device described below and the incremental learning method described above may be cross-referenced.

Based on any of the above embodiments, Fig. 3 is a schematic structural diagram of an incremental learning device of a sensitive data recognition model according to the present invention. As shown in Fig. 3, the device includes:

a learning request receiving unit 310, configured to receive a model incremental learning request initiated by a user, and determine an incremental learning mode and training sample update information based on the model incremental learning request; the incremental learning mode comprises the addition, deletion and category modification of data categories, sample files and sensitive keywords;

a keyword list updating unit 320, configured to determine an update sample based on the training sample update information and extract keywords of the update sample, update the training text set based on the update sample, and update the classified keyword list of the corresponding category in the data recognition model based on the incremental learning mode and the keywords of the update sample;

a model incremental learning unit 330, configured to perform sensitive data recognition on the keyword sequence of each training sample in the training text set based on the data recognition model to obtain a recognition result of each training sample, and update the classified keyword list of the corresponding category based on the difference between the category label and the recognition result of each training sample, so as to obtain the incrementally learned data recognition model.
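The work of the learning request receiving unit, turning a request into an incremental learning mode plus update information, might be sketched as below. The request layout (keys `target`, `operation`, `update_info`) and all type names are hypothetical; the description states only that the mode covers addition, deletion and category modification of data categories, sample files and sensitive keywords.

```python
from dataclasses import dataclass
from enum import Enum

class Target(Enum):
    DATA_CATEGORY = "data_category"
    SAMPLE_FILE = "sample_file"
    SENSITIVE_KEYWORD = "sensitive_keyword"

class Operation(Enum):
    ADD = "add"
    DELETE = "delete"
    MODIFY_CATEGORY = "modify_category"

@dataclass(frozen=True)
class IncrementalLearningMode:
    target: Target        # what is being changed
    operation: Operation  # how it is being changed

def parse_learning_request(request):
    """Determine the incremental learning mode and the training sample
    update information from a request dict (assumed format)."""
    mode = IncrementalLearningMode(Target(request["target"]),
                                   Operation(request["operation"]))
    update_info = request.get("update_info", {})
    return mode, update_info

mode, info = parse_learning_request({
    "target": "sample_file",
    "operation": "add",
    "update_info": {"category": "finance", "files": ["invoice.txt"]},
})
```

Modeling the mode as a pair of enumerations gives the nine combinations listed in the description and lets the keyword list updating unit dispatch on `mode.target` and `mode.operation`.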

The device provided by the embodiment of the invention determines the incremental learning mode and the training sample update information based on the user's model incremental learning request. After the update sample is determined based on the training sample update information and its keywords are extracted, the training text set is updated based on the update sample, and the classified keyword list of the corresponding category in the data recognition model is updated based on the incremental learning mode and the keywords of the update sample. Sensitive data recognition is then performed, based on the data recognition model, on the keyword sequence of each training sample in the training text set to obtain a recognition result for each training sample, and the classified keyword list of the corresponding category is updated based on the difference between the category label and the recognition result of each training sample, yielding the incrementally learned data recognition model.

Based on any one of the foregoing embodiments, the updating the classification keyword list of the corresponding category in the data recognition model based on the incremental learning mode and the keywords of the update sample specifically includes:

Based on any one of the foregoing embodiments, if the incremental learning mode is the addition or deletion of a sample file or a sensitive keyword, adding the keywords of the update sample to the classified keyword list of the category corresponding to the sample file or sensitive keyword and setting a weight corresponding to each keyword of the update sample, or deleting the keywords of the update sample from the classified keyword list of the category corresponding to the sample file or sensitive keyword, specifically includes:

Based on any one of the above embodiments, if the incremental learning mode is a modification of a category of a sample file or a sensitive keyword, deleting the keyword of the update sample from a category keyword list of an original category corresponding to the sample file or the sensitive keyword, and adding the keyword of the update sample to the category keyword list of an update category corresponding to the sample file or the sensitive keyword, which specifically includes:

Based on any one of the foregoing embodiments, the performing, based on the data recognition model, sensitive data recognition on the keyword sequence of each training sample in the training text set to obtain a recognition result of each training sample specifically includes:

The identification device provided by the invention is described below. The identification device described below and the identification method described above may be cross-referenced.

Based on any of the above embodiments, Fig. 4 is a schematic structural diagram of an identification device provided by the present invention. As shown in Fig. 4, the device includes:

an identification request receiving unit 410, configured to receive a file identification request submitted by a user; the file identification request carries a path and a file name of a file to be identified;

a text content extraction unit 420, configured to obtain the file to be identified based on the path and the file name of the file to be identified, extract text content in the file to be identified, and extract a keyword sequence of the text content;

a sensitive data recognition unit 430, configured to perform sensitive data recognition on the keyword sequence of the text content based on a data recognition model, so as to obtain a recognition result of the file to be identified;

wherein the data recognition model is obtained through incremental learning based on the incremental learning method of the sensitive data recognition model provided by any of the above embodiments.

Fig. 5 is a schematic structural diagram of an electronic device according to the present invention. As shown in Fig. 5, the electronic device may include a processor 510, a memory 520, a communication interface 530, and a communication bus 540, wherein the processor 510, the memory 520, and the communication interface 530 communicate with each other via the communication bus 540. The processor 510 may invoke logic instructions in the memory 520 to perform an incremental learning method of a sensitive data recognition model, the method comprising: receiving a model incremental learning request initiated by a user, and determining an incremental learning mode and training sample update information based on the model incremental learning request, the incremental learning mode comprising the addition, deletion and category modification of data categories, sample files and sensitive keywords; determining an update sample based on the training sample update information and extracting the keywords of the update sample, then updating the training text set based on the update sample and updating the classified keyword list of the corresponding category in the data recognition model based on the incremental learning mode and the keywords of the update sample; and performing sensitive data recognition on the keyword sequence of each training sample in the training text set based on the data recognition model to obtain a recognition result for each training sample, and updating the classified keyword list of the corresponding category based on the difference between the category label and the recognition result of each training sample to obtain the incrementally learned data recognition model.

The processor 510 may also invoke logic instructions in the memory 520 to perform an identification method, the method comprising: receiving a file identification request submitted by a user, the file identification request carrying the path and file name of a file to be identified; acquiring the file to be identified based on its path and file name, extracting the text content of the file, and extracting a keyword sequence of the text content; and performing sensitive data recognition on the keyword sequence of the text content based on a data recognition model to obtain a recognition result of the file to be identified; wherein the data recognition model is obtained through incremental learning based on the incremental learning method of the sensitive data recognition model provided by any of the above embodiments.

Further, the logic instructions in the memory 520 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the incremental learning method of the sensitive data recognition model provided above, the method comprising: receiving a model incremental learning request initiated by a user, and determining an incremental learning mode and training sample update information based on the model incremental learning request, the incremental learning mode comprising the addition, deletion and category modification of data categories, sample files and sensitive keywords; determining an update sample based on the training sample update information and extracting the keywords of the update sample, then updating the training text set based on the update sample and updating the classified keyword list of the corresponding category in the data recognition model based on the incremental learning mode and the keywords of the update sample; and performing sensitive data recognition on the keyword sequence of each training sample in the training text set based on the data recognition model to obtain a recognition result for each training sample, and updating the classified keyword list of the corresponding category based on the difference between the category label and the recognition result of each training sample to obtain the incrementally learned data recognition model.

When the program instructions are executed by the computer, the computer is further able to perform the identification method provided above, the method comprising: receiving a file identification request submitted by a user, the file identification request carrying the path and file name of a file to be identified; acquiring the file to be identified based on its path and file name, extracting the text content of the file, and extracting a keyword sequence of the text content; and performing sensitive data recognition on the keyword sequence of the text content based on a data recognition model to obtain a recognition result of the file to be identified; wherein the data recognition model is obtained through incremental learning based on the incremental learning method of the sensitive data recognition model provided by any of the above embodiments.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the incremental learning method of the sensitive data recognition model provided above, the method comprising: receiving a model incremental learning request initiated by a user, and determining an incremental learning mode and training sample update information based on the model incremental learning request, the incremental learning mode comprising the addition, deletion and category modification of data categories, sample files and sensitive keywords; determining an update sample based on the training sample update information and extracting the keywords of the update sample, then updating the training text set based on the update sample and updating the classified keyword list of the corresponding category in the data recognition model based on the incremental learning mode and the keywords of the update sample; and performing sensitive data recognition on the keyword sequence of each training sample in the training text set based on the data recognition model to obtain a recognition result for each training sample, and updating the classified keyword list of the corresponding category based on the difference between the category label and the recognition result of each training sample to obtain the incrementally learned data recognition model.

The computer program, when executed by a processor, further performs the identification method provided above, the method comprising: receiving a file identification request submitted by a user, the file identification request carrying the path and file name of a file to be identified; acquiring the file to be identified based on its path and file name, extracting the text content of the file, and extracting a keyword sequence of the text content; and performing sensitive data recognition on the keyword sequence of the text content based on a data recognition model to obtain a recognition result of the file to be identified; wherein the data recognition model is obtained through incremental learning based on the incremental learning method of the sensitive data recognition model provided by any of the above embodiments.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An incremental learning method of a sensitive data recognition model, comprising:

2. The incremental learning method of the sensitive data recognition model according to claim 1, wherein the updating of the classification keyword list of the corresponding category in the data recognition model based on the incremental learning mode and the keywords of the update sample specifically comprises:

3. The incremental learning method of the sensitive data recognition model according to claim 2, wherein, if the incremental learning mode is the addition or deletion of a sample file or a sensitive keyword, adding the keywords of the update sample to the classified keyword list of the category corresponding to the sample file or sensitive keyword and setting a weight corresponding to each keyword of the update sample, or deleting the keywords of the update sample from the classified keyword list of the category corresponding to the sample file or sensitive keyword, specifically comprises:

4. The incremental learning method of the sensitive data recognition model according to claim 2, wherein if the incremental learning mode is a modification of a category of a sample file or a sensitive keyword, deleting the keyword of the update sample from the category keyword list of the sample file or the sensitive keyword corresponding to the original category, and adding the keyword of the update sample to the category keyword list of the sample file or the sensitive keyword corresponding to the update category, specifically comprising:

5. The incremental learning method of the sensitive data recognition model according to claim 1, wherein the performing sensitive data recognition on the keyword sequence of each training sample in the training text set based on the data recognition model to obtain a recognition result of each training sample specifically comprises:

6. A method of identification, comprising:

wherein the data recognition model is obtained through incremental learning based on the incremental learning method of the sensitive data recognition model according to any one of claims 1 to 5.

7. An incremental learning device for a sensitive data recognition model, comprising:

8. An identification device, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the incremental learning method of the sensitive data identification model of any one of claims 1 to 5 or the identification method of claim 6 when executing the program.

10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the incremental learning method of the sensitive data identification model of any one of claims 1 to 5 or the identification method of claim 6.