CN117272123A - Sensitive data processing method and device based on large model and storage medium - Google Patents

Sensitive data processing method and device based on large model and storage medium

Info

Publication number
CN117272123A
CN117272123A
Authority
CN
China
Prior art keywords
semantic
data
newly added
metadata
comparison feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311560860.6A
Other languages
Chinese (zh)
Other versions
CN117272123B (en)
Inventor
蔡惠民
文友
谢红韬
支婷
汪榕
马环宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Big Data Research Institute Co Ltd
Original Assignee
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Big Data Research Institute Co Ltd filed Critical CETC Big Data Research Institute Co Ltd
Priority to CN202311560860.6A priority Critical patent/CN117272123B/en
Publication of CN117272123A publication Critical patent/CN117272123A/en
Application granted granted Critical
Publication of CN117272123B publication Critical patent/CN117272123B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a sensitive data processing method, device and storage medium based on a large model, comprising data acquisition, model training, semantic vector mapping, vector clustering, sensitivity level identification, similarity calculation and data classification. The method trains a twin encoder with a Transformer architecture to map the original data into a set of semantic vectors, which are grouped into a plurality of semantic clusters by vector clustering. Each cluster is identified and configured with a corresponding sensitivity level identifier. When new data appears, it is mapped into a new semantic vector and its similarity to the existing clusters is calculated. From the similarity statistics, a comparison feature value is computed that determines the sensitivity level of the newly added data. The method combines a large model, vector clustering and semi-supervised learning to process sensitive data automatically and improve data security.

Description

Sensitive data processing method and device based on large model and storage medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a method and apparatus for processing sensitive data based on a big model, and a storage medium.
Background
In the information age, with the rapid growth and widespread use of data, data security has become a key issue. Leakage of sensitive data can lead to serious consequences, including personal privacy violations, financial losses, and reputational damage. The need to protect sensitive data from unauthorized access and leakage is therefore becoming increasingly important.
Traditional data security techniques include encryption, authentication, access control, and the like. These techniques protect data in general, but they mainly cover data storage and transmission; protection of data while it is in use is relatively limited. In addition, with the growth of data volume and the diversity of data, conventional methods have clear limitations in processing unstructured data.
Current data security technologies thus focus mainly on protecting data in storage and transmission, and lack comprehensive protection of data in use. Furthermore, with the rapid growth of big data and the diversity of data sources, identifying and isolating sensitive data becomes more difficult. Conventional techniques struggle to address this challenge effectively, so a new approach is needed to protect sensitive data from leakage in advance.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a sensitive data processing method, device and storage medium based on a large model, and the following describes the technical solution in the present application:
the first aspect of the application provides a sensitive data processing method based on a large model, which comprises the following steps:
collecting data in a database and creating a pair of samples, the pair of samples comprising a positive sample and a negative sample;
training a twin encoder based on a Transformer architecture through the sample pairs;
taking one of the twin encoders as a semantic encoder, and mapping the acquired metadata into semantic vectors through the semantic encoder to obtain a semantic vector set;
vector clustering is carried out on the semantic vector set to obtain a plurality of semantic clusters based on metadata;
identifying all semantic clusters according to a pre-configured standard specification library, and configuring a sensitivity level identification;
setting M critical values, and mapping the newly added metadata into newly added semantic vectors through the semantic encoder when the newly added metadata exists in the database;
calculating the similarity of the newly added semantic vector and the semantic vectors in all semantic clusters, and counting the first quantity of the semantic vectors which are respectively smaller than M critical values;
calculating the similarity between the newly added semantic vector and the semantic vectors in each semantic cluster, and counting the second quantity of semantic vectors with the similarity smaller than M critical values in each semantic cluster;
calculating a comparison feature value based on the first number and the second number;
and classifying the sensitivity level of the newly added metadata according to the comparison feature values corresponding to the semantic clusters.
Optionally, the calculating the comparison feature value based on the first number and the second number includes:
the comparison feature value is calculated by the following equations:
F = Y / X;
F_normalized = (F − min(F)) / (max(F) − min(F));
wherein F_normalized represents the normalized comparison feature value, F represents the comparison feature value, X represents the first number corresponding to a given critical value, and Y represents the second number corresponding to that critical value; the comparison feature value is defined as the ratio of the two counts.
Optionally, classifying the sensitivity level of the newly added metadata according to the comparison feature values corresponding to the semantic clusters includes:
comparing the magnitudes of the comparison feature values corresponding to the semantic clusters, determining the maximum comparison feature value, and classifying the newly added metadata into the corresponding semantic cluster.
Optionally, if the number of the maximum comparison feature values is greater than or equal to 2, the number of critical values is set to M+1, and the comparison feature values are recalculated and compared until there is only one maximum comparison feature value.
Optionally, the collecting data in the database and creating the sample pairs includes:
when the data in the database is structured data, creating a sample pair according to the metadata of the data;
when the data in the database is unstructured data, the method further comprises:
traversing each template based on a pre-constructed template library, and adapting unstructured data to each template to obtain input information, wherein the input information comprises tasks for extracting entities from the unstructured data;
inputting the input information into a large model, executing a task of entity extraction through the large model, and returning an entity extraction result in a JSON format;
and resolving the entity extraction result through a pre-constructed information resolving model to obtain an entity type and an entity.
Optionally, after obtaining the entity type and the entity, determining a final entity and an entity type according to each entity and the frequency of the entity type.
Optionally, after classifying the sensitivity level of the added metadata according to the comparison feature values corresponding to the semantic clusters, the method further includes:
data of different sensitivity levels are processed using different desensitization modes.
Alternatively, the similarity calculation is performed by the following equation:
C=(A⋅B)/(||A||⋅||B||);
wherein C represents the similarity, A represents the newly added semantic vector, B represents one semantic vector in the semantic cluster, and ||A|| and ||B|| represent the norms of vectors A and B respectively.
A second aspect of the present application provides a sensitive data processing apparatus based on a large model, comprising:
a sample pair creation unit configured to collect data in a database and create a sample pair including a positive sample and a negative sample;
the training unit is used for training a twin encoder based on a Transformer architecture through the sample pairs;
the mapping unit is used for taking one of the twin encoders as a semantic encoder, and mapping the acquired metadata into semantic vectors through the semantic encoder to obtain a semantic vector set;
the clustering unit is used for carrying out vector clustering on the semantic vector set to obtain a plurality of semantic clusters based on metadata;
the configuration unit is used for identifying all semantic clusters according to a pre-configured standard specification library and configuring a sensitivity level identifier;
the critical value setting unit is used for setting M critical values, and mapping the newly added metadata into newly added semantic vectors through the semantic encoder when the newly added metadata exist in the database;
the similarity calculation unit is used for calculating the similarity between the newly added semantic vector and the semantic vectors in all the semantic clusters and counting the first quantity of the semantic vectors which are respectively smaller than M critical values;
the similarity calculation unit is further configured to: calculating the similarity between the newly added semantic vector and the semantic vectors in each semantic cluster, and counting the second quantity of semantic vectors with the similarity smaller than M critical values in each semantic cluster;
a numerical value calculation unit configured to calculate a comparison feature numerical value based on the first number and the second number;
and the classifying unit is used for classifying the sensitivity level of the newly added metadata according to the comparison feature values corresponding to the semantic clusters.
A third aspect of the present application provides a sensitive data processing apparatus based on a large model, the apparatus comprising:
a processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program that the processor invokes to perform the method of any of the first aspect and optionally the method of the first aspect.
A fourth aspect of the present application provides a computer readable storage medium having stored thereon a program which when executed on a computer performs the method of any one of the first aspect and optionally the first aspect.
From the above technical scheme, the application has the following advantages:
1. by using a twin encoder and vector clustering, the method can efficiently identify sensitive data, including structured and unstructured data. This helps to quickly discover and protect potentially sensitive information.
2. The method may be applied to different types of data, including text and structured data. The system can be configured and adjusted according to data types and requirements so as to meet the requirements of different fields.
3. By setting a critical value and comparing characteristic values, the method can update the sensitivity level classification in real time when new data exists in the database, and ensure the sensitivity of responding to the new data in time.
4. The method can classify the newly added data in a fine granularity sensitive level so as to ensure that the data with different sensitive levels are properly protected and processed.
5. The method combines a plurality of technologies such as a large model, a twin encoder, vector clustering, rule configuration and the like to realize comprehensive sensitive data management and protection and cope with different data types and demands.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of one embodiment of a large model-based sensitive data processing method provided in the present application;
FIG. 2 is a flow diagram of one embodiment of a method of constructing a sample pair provided herein;
FIG. 3 is a schematic diagram of one embodiment of a large model-based sensitive data processing apparatus provided herein;
FIG. 4 is a schematic structural diagram of another embodiment of a large model-based sensitive data processing apparatus provided in the present application.
Detailed Description
The embodiments in the present application are described in detail below:
it should be noted that the method provided in the present application may be applied to a terminal, a system, or a server. The terminal may be, for example, a mobile terminal such as a smart phone, a computer, a tablet computer, a smart television, a smart watch, or a portable computer, or a fixed terminal such as a desktop computer. For convenience of explanation, the terminal is taken as the execution body in the description below.
Referring to fig. 1, the present application first provides an embodiment of a large model-based sensitive data processing method, which includes:
s101, collecting data in a database and creating a sample pair, wherein the sample pair comprises a positive sample and a negative sample;
first, data is collected from a database, which may include structured data and unstructured text data. Sample pairs are created where positive samples are data containing sensitive information and negative samples are data not containing sensitive information. These pairs of samples will be used to train the twin encoder so that it can identify sensitive information.
In this embodiment, the data is obtained from a database, including structured data and unstructured text data. These data may contain various information, some of which may be sensitive. The data is labeled and classified as positive and negative samples. Positive samples are data containing sensitive information, while negative samples are data not containing sensitive information. This can be manual marking or can be done in an automated fashion, depending on the type of data and the nature of the sensitive information.
For each positive sample, a negative sample is selected to create a sample pair. Some strategies may also need to be considered, such as randomly selecting negative samples, matching similarities, etc. For unstructured text data, a text similarity algorithm may be used to select the negative sample to be similar to the positive sample to some extent.
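As a minimal sketch of this pairing step (the records, labels, and random pairing strategy below are illustrative assumptions, not the application's actual data), each positive sample could be paired with a randomly drawn negative like this:

```python
import random

# Hypothetical labeled records: (text, is_sensitive). Real labels would come
# from manual marking or an automated process, as described above.
records = [
    ("id_number: 110101...", True),
    ("phone: 138...", True),
    ("favorite_color: blue", False),
    ("weather: sunny", False),
]

def build_sample_pairs(records, seed=0):
    """Pair each positive (sensitive) record with a randomly chosen negative."""
    rng = random.Random(seed)
    positives = [text for text, sensitive in records if sensitive]
    negatives = [text for text, sensitive in records if not sensitive]
    return [(pos, rng.choice(negatives)) for pos in positives]

pairs = build_sample_pairs(records)
```

As the text notes, `rng.choice` could be replaced by a text-similarity-based selection so that negatives resemble their positives to some extent.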
The twin encoder based on the Transformer architecture is trained using the prepared sample pairs. The encoder learns to distinguish positive from negative samples so that sensitive information can be identified later.
S102, training a twin encoder based on a Transformer architecture through the sample pairs;
the sample pairs are used to train a twin encoder based on the Transformer architecture. A twin encoder is a deep learning model that receives input data and maps it to a vector representation in a high-dimensional semantic space.
The training process first requires the prepared sample pairs, where positive and negative samples have already been created and labeled. Each sample pair consists of a positive sample and a corresponding negative sample used to train the encoder.
For unstructured text data, the text needs to be converted into a vector representation. This can be achieved with pre-trained Transformer models (e.g., BERT, GPT, etc.) that encode text into embedding vectors of a fixed dimension.
A twin encoder based on the Transformer architecture is then constructed. The encoder consists of two identical sub-models, one processing positive samples and the other processing negative samples. Each sub-model comprises multiple Transformer layers.
The two sub-models of the twin encoder share the same parameters. This is to ensure that they can learn the similarity in order to compare the positive and negative samples during training.
The goal of training is to pull the anchor and positive samples together while pushing negative samples away. This is achieved by constructing a loss function that encourages similar positive samples to be closer after encoding and negative samples to be farther apart. One common loss function is the triplet loss, which involves an anchor sample, a positive sample, and a negative sample.
The training process is typically performed using small batches (mini-batch) of data, rather than the entire data set. In each iteration, a batch of positive samples and corresponding negative samples are randomly selected, then a loss function is calculated, and model parameters are updated according to a gradient descent method.
The iterative training step is repeated until the loss function converges or a predetermined number of iterations is reached.
After training is complete, a separate validation set or test set may be used to evaluate the performance of the model, such as computational accuracy, recall, F1 score, and the like.
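The triplet objective described above can be sketched in a few lines. The vectors and margin below are toy values; a real system would compute this loss over encoder outputs inside a training framework rather than on hand-written vectors:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_distance(u, v):
    # 1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal ones.
    return 1.0 - dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # max(0, d(a, p) - d(a, n) + margin): the positive must sit closer to the
    # anchor than the negative by at least `margin`, otherwise loss is incurred.
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)

# A "good" triplet (positive nearly parallel to the anchor) incurs no loss;
# a "bad" triplet (positive orthogonal, negative parallel) does.
good = triplet_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
bad = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In mini-batch training, this loss would be averaged over a batch of triplets and minimized by gradient descent, as the steps above describe.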
S103, taking one of the twin encoders as a semantic encoder, and mapping the acquired metadata into semantic vectors through the semantic encoder to obtain a semantic vector set;
one part is selected from the twin encoder to be used as a semantic encoder. The acquired metadata is mapped using a semantic encoder into semantic vectors that represent semantic information of the data. A set of semantic vectors is obtained, which constitute a set of semantic vectors for subsequent analysis and processing.
After training the twin encoder, a part is selected as a semantic encoder, and the semantic encoder is used to map the acquired metadata into semantic vectors, so as to finally obtain a group of semantic vector sets. I.e. one from the two sub-models is selected as semantic encoder. This may be predetermined during training or may be selected randomly for later use.
For metadata to be mapped into semantic vectors, the same preprocessing steps as in training need to be performed. This includes text tokenization, word embedding, etc. The preprocessed metadata is input into the selected semantic encoder. This encoder will convert the metadata into semantic vectors. For each metadata, the semantic encoder will generate a semantic vector. All these semantic vectors are collected to form a set of semantic vectors. This set represents semantic information for all input data.
S104, vector clustering is carried out on the semantic vector set to obtain a plurality of semantic clusters based on metadata;
the set of semantic vectors is clustered using a vector clustering algorithm, such as K-Means.
This will result in a plurality of semantic clusters, each cluster containing similar semantic vectors. These clusters represent semantic groups based on metadata.
In this embodiment, a set of semantic vectors is input to a selected clustering algorithm. The algorithm will perform an iterative process, dividing the semantic vector into K different clusters, each cluster containing similar semantic vectors.
For algorithms such as K-Means, which require a specified K value, some method may be used to determine the optimal K, such as the elbow method or the silhouette coefficient (Silhouette Score).
The clustering results are analyzed to check which semantic vectors each cluster contains. This may involve looking at the center vector, member vector, etc. of the cluster.
These clusters represent semantic groups based on metadata. Different security levels or processing policies may be applied to the data in each semantic cluster depending on the specific application requirements.
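To make the clustering step concrete, here is a deliberately tiny K-Means sketch over toy 2-D "semantic vectors". The deterministic initialization and the data are assumptions made for readability; a real pipeline would use a library implementation with k-means++ initialization and high-dimensional encoder outputs:

```python
import math

def kmeans(vectors, k, iters=20):
    # Deterministic initialization for the sketch; real code would use k-means++.
    centers = list(vectors[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each vector joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[i].append(v)
        # Update step: each center moves to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centers, clusters

# Two well-separated toy groups should end up as two clusters of three.
vecs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, clusters = kmeans(vecs, k=2)
```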
S105, identifying all semantic clusters according to a pre-configured standard specification library, and configuring a sensitivity level identification;
In this step, each semantic cluster is identified using a pre-configured standard specification library. A corresponding sensitivity level identification is configured for each semantic cluster to indicate its sensitivity.
Specifically, a standard specification library is created which contains definitions of various data security specifications and sensitive information. This library may include information on the type of sensitive entity, the scope of sensitive data, processing strategies, etc. The contents of the canonical library should match the application requirements and data types.
For each semantic cluster, its content is identified according to its semantic features using rules and patterns defined in the canonical library. This may include pattern matching, entity extraction, keyword recognition, etc. techniques to determine whether the semantic cluster contains sensitive information.
A corresponding sensitivity level identification is configured for each semantic cluster identified as containing sensitive information. These identifications may represent different degrees of sensitivity, e.g. high, medium, low. The level of identification should be determined based on definitions in the specification library and application requirements.
The configured sensitivity level identification information is recorded in the system for use in a subsequent data processing step. This may be a metadata field, tag or other form of identification.
S106, setting M critical values, and mapping the newly added metadata into newly added semantic vectors through the semantic encoder when the newly added metadata exists in the database;
in the system design or configuration process, M critical values are preset, and these critical values represent different similarity thresholds, which can be determined according to data sensitivity and application requirements. These thresholds may be numerical values, such as 0.9, 0.8, 0.7, etc.
When new metadata exists in the database, the new metadata is mapped into a new semantic vector through the semantic encoder.
S107, calculating the similarity between the newly added semantic vector and the semantic vectors in all semantic clusters, and counting the first quantity of the semantic vectors which are respectively smaller than M critical values;
in this embodiment, all semantic clusters are traversed, and the following is performed on the semantic vectors in each semantic cluster.
And calculating the similarity of the newly added semantic vector and each semantic vector in the current semantic cluster by using the proper similarity measure.
The calculated similarities are compared with M threshold values one by one to determine whether they are smaller than these threshold values. A counter may be maintained for each threshold value, with an initial value of 0. For each threshold, if the similarity is less than the threshold, the corresponding counter is incremented by 1. For each critical value, the values of the counter are recorded, which values will constitute the first number.
Repeating the steps, and performing similarity calculation and statistics operation on each newly added semantic vector to obtain a first number under each critical value. Finally, for each threshold, a corresponding first number is obtained, representing the number of semantic vectors smaller than the threshold.
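The counting described in this step amounts to one counter per critical value. The similarity values and thresholds below are toy numbers chosen for illustration:

```python
def count_below(similarities, thresholds):
    """For each critical value, count how many similarities fall below it."""
    return {t: sum(1 for s in similarities if s < t) for t in thresholds}

# Similarities of the new vector to every vector across all clusters (toy
# values), counted against M = 3 critical values.
sims_all = [0.95, 0.85, 0.75, 0.60, 0.40]
first_number = count_below(sims_all, thresholds=(0.9, 0.8, 0.7))
```

The same helper, applied to the similarities within a single cluster, yields the second number of step S108.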
S108, calculating the similarity between the newly added semantic vector and the semantic vectors in each semantic cluster, and counting the second quantity of the semantic vectors with the similarity smaller than M critical values in each semantic cluster;
for each semantic cluster, a similarity of the newly added semantic vector to the semantic vectors in the cluster is calculated.
And counting the number of semantic vectors with similarity smaller than each critical value in each semantic cluster to form a second number.
In this step, all semantic clusters are traversed, and the following operations are performed for each cluster:
for the current semantic cluster, the similarity of the newly added semantic vector to each semantic vector in the cluster is calculated, using a suitable similarity measure, such as cosine similarity.
The calculated similarities are compared with M thresholds one by one to determine if they are less than the thresholds. A counter is maintained for each threshold value, with an initial value of 0.
For each threshold, if the similarity is less than the threshold, the corresponding counter is incremented by 1.
For each critical value, the values of the counter are recorded, which values will constitute the second number.
Repeating the steps, and performing similarity calculation and statistics operation on each semantic cluster to obtain a second number under each critical value.
Finally, for each semantic cluster, a corresponding second number representing a number of semantic vectors smaller than the threshold value will be obtained.
One specific manner of similarity calculation is provided below:
C=(A⋅B)/(||A||⋅||B||);
wherein C represents the similarity, A represents the newly added semantic vector, B represents one semantic vector in the semantic cluster, and ||A|| and ||B|| represent the norms of vectors A and B respectively.
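This cosine similarity can be written directly from the equation; the input vectors below are toy values:

```python
import math

def cosine_similarity(a, b):
    # C = (A . B) / (||A|| * ||B||): 1.0 for parallel vectors, 0.0 for orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```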
S109, calculating a comparison characteristic value based on the first quantity and the second quantity;
the feature values are compared using the first number and the second number, which feature values will reflect the similarity between the newly added metadata and each semantic cluster.
The first number and the second number are obtained from the previous step, the numbers being calculated at different critical values, each critical value having a corresponding first number and second number.
For each threshold value, a comparison feature value is calculated. The comparison feature values generally represent the similarity between the newly added metadata and each semantic cluster. This can be calculated by the following formula:
F=Y/X;
wherein F represents the comparison feature value, Y represents the second number corresponding to a given critical value, and X represents the first number corresponding to that critical value; the ratio of the two counts is defined as the comparison feature value.
This calculation results in a comparison feature value that represents the degree of similarity of the newly added metadata to the semantic clusters at a given threshold. Higher values of the comparison feature indicate higher similarity, while lower values indicate lower similarity. A similarity calculation is performed for each threshold to obtain a comparison feature value at each threshold. The comparison feature values at each threshold are saved for later use.
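Putting the two counts together, the per-threshold ratio F = Y / X can be computed as below. The counts are toy values, and the zero-division guard is an assumption the text does not spell out:

```python
def comparison_features(first_number, second_number):
    # F = Y / X per critical value: Y is the within-cluster count (step S108),
    # X the count over all clusters (step S107). Guard X == 0 (assumption).
    return {t: (second_number[t] / first_number[t]) if first_number[t] else 0.0
            for t in first_number}

first_number = {0.9: 4, 0.8: 2}    # over all clusters (toy values)
second_number = {0.9: 2, 0.8: 1}   # within one candidate cluster (toy values)
f = comparison_features(first_number, second_number)
```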
The comparative feature values mentioned in this example are further described below:
for each semantic cluster, the similarity of the newly added semantic vector to the semantic vectors in the cluster is calculated.
Counting the number of semantic vectors with similarity smaller than each critical value in each semantic cluster to form a second number;
that is, for a given semantic cluster, each of the M critical values has a corresponding second number, from which a corresponding comparison feature value is calculated; thus, for M critical values, there are M comparison feature values, and the semantic cluster is subsequently categorized based on them.
The final categorization may be implemented in various ways, for example by determining the maximum comparison feature value. When several clusters share the same maximum value, an adjustment strategy can be applied, for example resetting M to M+1, as described in detail in the following examples.
In an alternative embodiment, the comparison feature value may also be normalized, as represented by the following equation:
F_normalized = (F − min(F)) / (max(F) − min(F));
wherein F_normalized represents the normalized comparison feature value, and min(F) and max(F) are the minimum and maximum comparison feature values.
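This min-max normalization can be sketched as follows; the constant-input fallback to zero is an assumption for the degenerate case the text does not cover:

```python
def min_max_normalize(values):
    # F_normalized = (F - min(F)) / (max(F) - min(F)).
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: no spread to normalize
    return [(v - lo) / (hi - lo) for v in values]

normalized = min_max_normalize([2.0, 3.0, 6.0])
```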
S110, classifying the sensitivity level of the newly added metadata according to the comparison feature values corresponding to the semantic clusters.
And classifying the newly added metadata into proper semantic clusters according to the comparison feature values.
For each semantic cluster, its corresponding comparison feature value is obtained, which is the value calculated in the previous step.
For each semantic cluster, the sensitivity level of the newly added metadata is determined using its comparison feature value. The sensitivity level is typically predefined and is determined based on a range of comparison feature values.
In this embodiment, an initial selection method of the critical values is as follows: first, the cosine distances between the newly added metadata and all metadata in all semantic clusters are calculated; a statistical analysis is then performed on all cosine distances, and the distribution is divided into ranges so as to obtain the 80th, 85th, 90th, 95th and 99th percentile values, which serve as initial reference values of the critical values. All critical values are then traversed, and the comparison feature values of each semantic cluster are accumulated. Finally, the semantic cluster to which the newly added metadata belongs is determined by comparing the magnitudes of the comparison feature values of the semantic clusters.
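Assuming the percentile computation is done with NumPy (an illustrative choice, not mandated by the text), the initial critical values could be derived as:

```python
import numpy as np

def initial_thresholds(cosine_distances, percentiles=(80, 85, 90, 95, 99)):
    # The 80th/85th/90th/95th/99th percentiles of all cosine distances
    # serve as initial reference values for the critical values.
    return [float(np.percentile(cosine_distances, p)) for p in percentiles]
```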
In this embodiment, M is preferably 4 or 5, each of the quantile values has its corresponding sensitivity level, and data of different sensitivity levels may be subjected to different binning processes.
The above implementation is only a specific example in the present application, and in practice, different adjustment processes may be performed according to different service types and different data sources of data.
The sensitivity level identifier of the semantic cluster with the largest comparison feature value is used as the sensitivity level identifier of the newly added metadata. All data corresponding to the newly added metadata are then desensitized or isolated according to the level of that identifier.
When the maximum comparison feature value is determined, if a plurality of maximum comparison feature values exist, the value of M is adjusted to M+1 until a unique maximum comparison feature value is obtained.
The newly added metadata is categorized into the appropriate semantic clusters, which are determined from similarity calculations and comparison feature values. This step involves associating the newly added metadata with semantic clusters having similar comparison feature values. Each newly added metadata is assigned a corresponding sensitivity level identification, which will reflect its sensitivity in the semantic cluster. The sensitivity level of the newly added metadata can be recorded in the database, and corresponding security measures, such as sensitive data isolation or desensitization, can be adopted to protect the security of the data.
The above embodiment has the following advantages:
the method comprehensively utilizes a plurality of technologies and methods such as a large model, semi-supervised learning, semantic analysis, vector clustering, a rule engine and the like, and can cope with different types of data including structured data and unstructured text data. Sensitive information can be identified efficiently through large models and semi-supervised learning techniques, and is not limited to known rules and keywords, so that accuracy and coverage are improved. The method has certain dynamic adaptability for the random addition, deletion and modification of the metadata in the structured data, and can cope with the dynamic change of the metadata in the real application scene. Through a pre-configured standard specification library, different semantic clusters can be identified and sensitivity levels can be configured, and standardized sensitive information management is realized. The whole flow has automatic processing capability, and can rapidly identify and classify the newly-added data, thereby reducing the burden of manual intervention. The data can be efficiently processed by using the technologies of vector clustering, similarity calculation and the like, and the method is particularly suitable for large-scale data sets. Through the sensitivity level identification, proper protection measures can be adopted according to the sensitivity of different data, including isolation and desensitization, so that the safety of the data is improved.
The method has certain adaptability and can be configured and adjusted according to specific requirements and data security specifications in different fields.
For the foregoing embodiment, in which the sensitivity level classification is performed on the newly added metadata according to the comparison feature values corresponding to the semantic clusters, a specific sensitivity level classification embodiment is provided below: the maximum comparison feature value is determined by comparing the magnitudes of the comparison feature values corresponding to the semantic clusters, and the newly added metadata is classified into the corresponding semantic cluster. If the number of maximum comparison feature values is greater than or equal to 2, the number of critical values is set to M+1, and the comparison feature values are recalculated and compared until the number of maximum comparison feature values is 1.
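The tie-breaking rule above might be sketched as follows; returning None to signal a tie, so that the caller can raise M to M+1 and recompute, is an illustrative design choice rather than the patent's specified interface:

```python
def classify(feature_values):
    # feature_values: {cluster_id: comparison feature value}.
    # Returns the cluster holding the unique maximum, or None on a tie,
    # in which case the caller sets the number of critical values to
    # M + 1 and recomputes.
    best = max(feature_values.values())
    winners = [c for c, v in feature_values.items() if v == best]
    return winners[0] if len(winners) == 1 else None
```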
The method provided by the application can process structured data as well as unstructured data; when structured data in the database are processed, the sample pairs are created directly from the metadata of the data.
When used to process unstructured data, the extraction of entities and entity types is performed by:
traversing each template based on a pre-constructed template library, and adapting unstructured data to each template to obtain input information, wherein the input information comprises tasks for extracting entities from the unstructured data;
Inputting the input information into a large model, executing a task of entity extraction through the large model, and returning an entity extraction result in a JSON format;
and resolving the entity extraction result through a pre-constructed information resolving model to obtain an entity type and an entity.
Referring to fig. 2, an embodiment of the method is described in detail below:
s201, traversing each template based on a pre-constructed template library, and adapting unstructured data to each template to obtain input information, wherein the input information comprises tasks for extracting entities from the unstructured data;
at this stage, a Prompt template library is first constructed, which contains a plurality of Prompt templates, each of which describes an entity extraction task. Each Prompt template may include some example entity types, keywords, or key phrases to assist the large model in understanding the entity extraction task. For example, a Prompt template may be as follows:
"Extract organizations from the following text: {text}"
this template instructs the large model to extract organization entities from the given text. {text} is a placeholder that is replaced with the unstructured text data in actual use.
For each Prompt template, the unstructured text data is filled into the placeholder in the template. This generates a plurality of input messages, each containing an entity extraction task.
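A minimal sketch of this template-filling step, assuming Python `str.format`-style `{text}` placeholders as in the example above (the second template is an invented illustration, not from the source):

```python
templates = [
    "Extract organizations from the following text: {text}",
    # A second, invented template for illustration:
    "Extract person names from the following text: {text}",
]

def build_inputs(text, template_library):
    # Fill the unstructured text into each template's {text} placeholder,
    # producing one entity-extraction input per template.
    return [t.format(text=text) for t in template_library]
```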
S202, inputting the input information into a large model, executing a task of entity extraction through the large model, and returning an entity extraction result in a JSON format;
at this step, the generated input information is input into a large model, such as a pre-trained language model (e.g., BERT or GPT-3). The large model performs the entity extraction task and returns the result of the entity extraction. The results are returned in JSON format, typically including entity type, entity location (start and end locations in the text), and entity text.
For example, the entity extraction results are as follows:
[
{"entity_type": "Organization", "start": 12, "end": 25, "text": "ABC Corp"},
{"entity_type": "Person", "start": 30, "end": 45, "text": "John Smith"}]
this JSON indicates that two entities are extracted from the text: "ABC Corp" of type Organization, and "John Smith" of type Person.
S203, analyzing the entity extraction result through a pre-constructed information analysis model to obtain an entity type and an entity;
at this step, the information parsing model (which may be a rule engine or a natural language processing model) processes the entity extraction results to obtain entity types and entity text. This can be done by parsing the JSON data in the entity extraction results. The information parsing model identifies the type of each entity and maps it to predefined standard entity types, such as "organization", "personnel", etc.
For the above example, the information parsing model may parse JSON, identifying "ABC Corp" as Organization (Organization), and "John Smith" as Person (Person).
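A sketch of such a parsing step for the JSON output shown above; the mapping table to the standard type names "organization" and "personnel" is an illustrative assumption:

```python
import json

# Illustrative mapping from the model's labels to the standard type names
# mentioned in the text ("organization", "personnel").
TYPE_MAP = {"Organization": "organization", "Person": "personnel"}

def parse_entities(raw_json):
    # Parse the JSON entity-extraction result and map each entity_type
    # to a predefined standard entity type; unknown types pass through.
    return [(TYPE_MAP.get(item["entity_type"], item["entity_type"]), item["text"])
            for item in json.loads(raw_json)]
```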
By using a large model, this embodiment may automatically perform entity extraction tasks without manually defining rules or patterns. This increases the accuracy and flexibility of entity extraction. This approach is applicable to a variety of entity types and unstructured text data, as different template templates can be created as needed.
With the advent of new entity types or requirements, the Prompt template library can be easily extended without modifying the core algorithm. The automated entity extraction and parsing process can be performed quickly with the aid of large models, improving the efficiency of processing unstructured text data. The information parsing model may further process the entity extraction results to identify entity types, which helps normalize and classify the entities.
By creating pairs of samples, training data can be created for the entity for semi-supervised learning, thereby improving the accuracy of entity classification and sensitive information identification. This approach allows flexibility in accommodating new entity extraction tasks and requirements without having to redesign the entire system.
More specifically, in this embodiment, after the entity types and entities are obtained, the final entities and entity types are determined according to the occurrence frequency of each entity and entity type.
Firstly, executing entity extraction tasks through a large model, and analyzing entity extraction results through an information analysis model to obtain entity types and entity texts. Entity types and entities may occur multiple times when processing large amounts of unstructured text data. In the semi-supervised learning process, statistics of each entity type and occurrence frequency of the entity can be accumulated. Based on the entity type and the frequency statistics of the entity, the final entity and entity type can be screened out. In general, entity types and entities that occur more frequently will be considered the final entity and entity type.
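A minimal sketch of this frequency-based screening, with `min_count` as an assumed cutoff parameter (the text says only that more frequent entities are kept, without specifying a concrete rule):

```python
from collections import Counter

def final_entities(extractions, min_count=2):
    # extractions: (entity_type, entity_text) pairs accumulated across
    # many Prompt/model runs; pairs seen at least min_count times are
    # kept as the final entities and entity types.
    counts = Counter(extractions)
    return [pair for pair, n in counts.items() if n >= min_count]
```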
Embodiments of the apparatus provided in the present application are described in detail below:
referring to FIG. 3, the present application provides a large model-based sensitive data processing apparatus embodiment, the embodiment comprising:
a sample pair creation unit 301 for collecting data in a database and creating a sample pair including a positive sample and a negative sample;
a training unit 302, configured to train a twin encoder based on a Transformer architecture through the pair of samples;
A mapping unit 303, configured to use one of the twin encoders as a semantic encoder, and map the acquired metadata into semantic vectors by using the semantic encoder, so as to obtain a semantic vector set;
a clustering unit 304, configured to perform vector clustering on the semantic vector set to obtain a plurality of semantic clusters based on metadata;
the configuration unit 305 is configured to identify all semantic clusters according to a pre-configured standard specification library, and configure a sensitivity level identifier;
the critical value setting unit 306 is configured to set M critical values, and map the newly added metadata into newly added semantic vectors through the semantic encoder when the newly added metadata exists in the database;
a similarity calculating unit 307, configured to calculate the similarity between the newly added semantic vector and the semantic vectors in all the semantic clusters, and count a first number of semantic vectors that are respectively smaller than M critical values;
the similarity calculation unit 307 is also configured to: calculating the similarity between the newly added semantic vector and the semantic vectors in each semantic cluster, and counting the second quantity of semantic vectors with the similarity smaller than M critical values in each semantic cluster;
a value calculation unit 308 for calculating a comparison feature value based on the first number and the second number;
And the classifying unit 309 is configured to classify the sensitivity level of the newly added metadata according to the comparison feature values corresponding to the semantic clusters.
Optionally, the numerical calculation unit 308 is specifically configured to:
the comparison feature value is calculated by the following equation:
F=Y/X;
F_normalized = (F − min(F)) / (max(F) − min(F));
wherein F_normalized represents the normalized comparison feature value; F represents the comparison feature value, obtained as the ratio of the two numerical elements Y and X; X represents the first number corresponding to a given one of the critical values; and Y represents the second number corresponding to the given one of the critical values.
The classifying unit 309 is specifically configured to:
comparing the magnitudes of the comparison feature values corresponding to the semantic clusters, determining the maximum comparison feature value, and classifying the newly added metadata into the corresponding semantic cluster.
Optionally, if the number of the maximum comparison feature values is greater than or equal to 2, the number of the critical values is set to m+1, and the comparison feature values are recalculated and compared until the number of the maximum comparison feature values is 1.
Optionally, the sample pair creation unit 301 is specifically configured to:
when the data in the database is structured data, creating a sample pair according to the metadata of the data;
When the data in the database is unstructured data, the apparatus further comprises an extraction unit 311 for:
traversing each template based on a pre-constructed template library, and adapting unstructured data to each template to obtain input information, wherein the input information comprises tasks for extracting entities from the unstructured data;
inputting the input information into a large model, executing a task of entity extraction through the large model, and returning an entity extraction result in a JSON format;
and resolving the entity extraction result through a pre-constructed information resolving model to obtain an entity type and an entity.
Optionally, after obtaining the entity type and the entity, determining a final entity and an entity type according to each entity and the frequency of the entity type.
Optionally, the apparatus further comprises a desensitizing processing unit 310 for:
data of different sensitivity levels are processed in different ways.
Optionally, the similarity calculation unit 307 is specifically configured to:
the similarity calculation is performed by the following equation:
C=(A⋅B)/(||A||⋅||B||);
wherein C represents the similarity, A represents the newly added semantic vector, B represents one semantic vector in the semantic cluster, and ||A|| and ||B|| represent the moduli of vectors A and B respectively.
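The cosine similarity equation above can be written directly as follows (a plain-Python sketch; any vector library would compute the same quantity):

```python
import math

def cosine_similarity(a, b):
    # C = (A . B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```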
Referring to fig. 4, the present application further provides a sensitive data processing apparatus based on a large model, including:
a processor 401, a memory 402, an input/output unit 403, and a bus 404;
the processor 401 is connected to the memory 402, the input/output unit 403, and the bus 404;
the memory 402 holds a program, and the processor 401 calls the program to execute any of the methods as described above.
The present application also relates to a computer readable storage medium having a program stored thereon, characterized in that the program, when run on a computer, causes the computer to perform any of the methods as above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM, random access memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Claims (11)

1. A method for processing sensitive data based on a large model, the method comprising:
collecting data in a database and creating a pair of samples, the pair of samples comprising a positive sample and a negative sample;
training a twin encoder based on a Transformer architecture through the sample pairs;
taking one of the twin encoders as a semantic encoder, and mapping the acquired metadata into semantic vectors through the semantic encoder to obtain a semantic vector set;
vector clustering is carried out on the semantic vector set to obtain a plurality of semantic clusters based on metadata;
identifying all semantic clusters according to a pre-configured standard specification library, and configuring a sensitivity level identification;
setting M critical values, and mapping the newly added metadata into newly added semantic vectors through the semantic encoder when the newly added metadata exists in the database;
calculating the similarity of the newly added semantic vector and the semantic vectors in all semantic clusters, and counting the first quantity of the semantic vectors which are respectively smaller than M critical values;
calculating the similarity between the newly added semantic vector and the semantic vectors in each semantic cluster, and counting the second quantity of semantic vectors with the similarity smaller than M critical values in each semantic cluster;
Calculating a comparison feature value based on the first number and the second number;
and classifying the sensitivity level of the newly added metadata according to the comparison feature values corresponding to the semantic clusters.
2. The large model based sensitive data processing method of claim 1, wherein said calculating a comparison feature value based on said first number and said second number comprises:
the comparison feature value is calculated by the following equation:
F=Y/X;
F_normalized = (F − min(F)) / (max(F) − min(F));
wherein F_normalized represents the normalized comparison feature value; F represents the comparison feature value, obtained as the ratio of the two numerical elements Y and X; X represents the first number corresponding to a given one of the critical values; and Y represents the second number corresponding to the given one of the critical values.
3. The method for processing sensitive data based on a large model according to claim 2, wherein classifying the new metadata into a sensitivity level according to the comparison feature values corresponding to the semantic clusters comprises:
comparing the magnitudes of the comparison feature values corresponding to the semantic clusters, determining the maximum comparison feature value, and classifying the newly added metadata into the corresponding semantic cluster.
4. A method of processing sensitive data based on a large model according to claim 3, wherein if the number of maximum comparison feature values is greater than or equal to 2, the number of critical values is set to M+1, and the comparison feature values are recalculated and compared until the number of maximum comparison feature values is 1.
5. The large model based sensitive data processing method of claim 1, wherein the collecting data in the database and creating sample pairs comprises:
when the data in the database is structured data, creating a sample pair according to the metadata of the data;
when the data in the database is unstructured data, the method further comprises:
traversing each template based on a pre-constructed template library, and adapting unstructured data to each template to obtain input information, wherein the input information comprises tasks for extracting entities from the unstructured data;
inputting the input information into a large model, executing a task of entity extraction through the large model, and returning an entity extraction result in a JSON format;
and resolving the entity extraction result through a pre-constructed information resolving model to obtain an entity type and an entity.
6. The method for processing sensitive data based on big model as claimed in claim 5, wherein after obtaining the entity type and the entity, determining the final entity and the entity type according to each entity and the frequency of the entity type.
7. The method for processing sensitive data based on a large model according to claim 1, further comprising, after classifying the new metadata into a sensitive class according to the comparison feature values corresponding to the respective semantic clusters:
data of different sensitivity levels are processed using different desensitization modes.
8. The large model based sensitive data processing method according to claim 1, wherein the calculation of the similarity is performed by the following equation:
C=(A⋅B)/(||A||⋅||B||);
wherein C represents the similarity, A represents the newly added semantic vector, B represents one semantic vector in the semantic cluster, and ||A|| and ||B|| represent the moduli of vectors A and B respectively.
9. A large model-based sensitive data processing apparatus, comprising:
a sample pair creation unit configured to collect data in a database and create a sample pair including a positive sample and a negative sample;
the training unit is used for training a twin encoder based on a Transformer architecture through the sample pair;
The mapping unit is used for taking one of the twin encoders as a semantic encoder, and mapping the acquired metadata into semantic vectors through the semantic encoder to obtain a semantic vector set;
the clustering unit is used for carrying out vector clustering on the semantic vector set to obtain a plurality of semantic clusters based on metadata;
the configuration unit is used for identifying all semantic clusters according to a pre-configured standard specification library and configuring a sensitivity level identifier;
the critical value setting unit is used for setting M critical values, and mapping the newly added metadata into newly added semantic vectors through the semantic encoder when the newly added metadata exist in the database;
the similarity calculation unit is used for calculating the similarity between the newly added semantic vector and the semantic vectors in all the semantic clusters and counting the first quantity of the semantic vectors which are respectively smaller than M critical values;
the similarity calculation unit is further configured to: calculating the similarity between the newly added semantic vector and the semantic vectors in each semantic cluster, and counting the second quantity of semantic vectors with the similarity smaller than M critical values in each semantic cluster;
a numerical value calculation unit configured to calculate a comparison feature numerical value based on the first number and the second number;
And the classifying unit is used for classifying the sensitivity level of the newly added metadata according to the comparison feature values corresponding to the semantic clusters.
10. A large model-based sensitive data processing apparatus, the apparatus comprising:
a processor, a memory, an input-output unit, and a bus;
the processor is connected with the memory, the input/output unit and the bus;
the memory holds a program which the processor invokes to perform the method of any one of claims 1 to 8.
11. A computer readable storage medium having a program stored thereon, which when executed on a computer performs the method of any of claims 1 to 8.
CN202311560860.6A 2023-11-22 2023-11-22 Sensitive data processing method and device based on large model and storage medium Active CN117272123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311560860.6A CN117272123B (en) 2023-11-22 2023-11-22 Sensitive data processing method and device based on large model and storage medium

Publications (2)

Publication Number Publication Date
CN117272123A true CN117272123A (en) 2023-12-22
CN117272123B CN117272123B (en) 2024-02-27

Family

ID=89218208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311560860.6A Active CN117272123B (en) 2023-11-22 2023-11-22 Sensitive data processing method and device based on large model and storage medium

Country Status (1)

Country Link
CN (1) CN117272123B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780447A (en) * 2021-09-16 2021-12-10 郑州云智信安安全技术有限公司 Sensitive data discovery and identification method and system based on flow analysis
CN114186058A (en) * 2021-08-18 2022-03-15 中电科大数据研究院有限公司 Policy document title similarity calculation method
CN114358020A (en) * 2022-01-11 2022-04-15 平安科技(深圳)有限公司 Disease part identification method and device, electronic device and storage medium
CN114528844A (en) * 2022-01-14 2022-05-24 中国平安人寿保险股份有限公司 Intention recognition method and device, computer equipment and storage medium
DE102022202017A1 (en) * 2021-03-01 2022-09-01 Robert Bosch Gesellschaft mit beschränkter Haftung Concept-based adversarial generation method with controllable and diverse semantics
CN115270810A (en) * 2022-07-06 2022-11-01 四川长虹电器股份有限公司 Intention recognition device and method based on sentence similarity
CN115270752A (en) * 2022-07-27 2022-11-01 北京邮电大学 Template sentence evaluation method based on multilevel comparison learning
CN116911289A (en) * 2023-09-13 2023-10-20 中电科大数据研究院有限公司 Method, device and storage medium for generating large-model trusted text in government affair field

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEIQIANG RAO ET AL: "Siamese transformer network for hyperspectral image target detection", 《IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING》, vol. 60, pages 1 - 17 *
WEN YU ET AL: "Research on Chinese text similarity measurement method based on similarity fusion", 《Information Technology and Informatization》, no. 10, pages 36 - 39 *

Also Published As

Publication number Publication date
CN117272123B (en) 2024-02-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant