CN116681086A - Data grading method, system, equipment and storage medium - Google Patents

Data grading method, system, equipment and storage medium

Info

Publication number
CN116681086A
Authority
CN
China
Prior art keywords
field
sensitive word
matrix
classified
grading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310945859.9A
Other languages
Chinese (zh)
Other versions
CN116681086B (en)
Inventor
邓理平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Aotian Technology Co ltd
Original Assignee
Shenzhen Aotian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Aotian Technology Co ltd filed Critical Shenzhen Aotian Technology Co ltd
Priority to CN202310945859.9A priority Critical patent/CN116681086B/en
Publication of CN116681086A publication Critical patent/CN116681086A/en
Application granted granted Critical
Publication of CN116681086B publication Critical patent/CN116681086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data management, and discloses a data grading method, a system, equipment and a storage medium, wherein the method comprises the following steps: constructing a sensitive word stock, and determining the sensitive word field level of sensitive word fields in the sensitive word stock; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded through the score value. Compared with the prior art, the method and the device effectively improve the accuracy and the efficiency of data classification.

Description

Data grading method, system, equipment and storage medium
Technical Field
The present invention relates to the field of data management technologies, and in particular, to a data classification method, system, device, and storage medium.
Background
The main idea of current data classification is to automatically discover sensitive data and then perform the grading operation with manual involvement, which helps the relevant personnel find sensitive data quickly. However, this grading approach is inflexible for subjective data and cannot adapt to the data security grading requirements of different organizations.
Because data classification in the industry has no unified standard, most solutions rely on personnel with experience in the industry, business, security and other domains to sort the data out manually. This approach is characterized by high accuracy and good results, but also by low efficiency, long cycles and arbitrary grading criteria.
Therefore, a data classification method is needed to solve the technical problem of how to effectively improve the accuracy and efficiency of data classification.
Disclosure of Invention
The invention mainly aims to provide a data classification method, a system, equipment and a storage medium, which aim to solve the technical problem of how to effectively improve the accuracy and efficiency of data classification.
To achieve the above object, the present invention provides a data classification method, comprising the steps of:
constructing a sensitive word stock, and determining the sensitive word field level of sensitive word fields in the sensitive word stock;
constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode;
converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified;
performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded;
and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded.
Optionally, the step of constructing a similarity association matrix of the field to be ranked and the sensitive word field by means of text semantic matching includes:
converting the field to be classified and the sensitive word field into 768-dimensional word vectors based on a BERT model to obtain a field matrix to be classified and a sensitive word field matrix;
and carrying out Hadamard product operation on the corresponding word vector in the field matrix to be classified and the corresponding word vector in the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
Optionally, the step of performing hadamard product operation on the word vector corresponding to the field matrix to be classified and the word vector corresponding to the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field includes:
respectively adding dimensions to the field matrix to be classified and the sensitive word field matrix according to a preset mode to obtain a first matrix and a second matrix;
and carrying out Hadamard product batch operation on the word vectors corresponding to the first matrix and the word vectors corresponding to the second matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
Optionally, the step of performing hadamard product operation on the word vector corresponding to the field matrix to be classified and the word vector corresponding to the sensitive word field matrix to obtain the similarity association matrix of the field to be classified and the sensitive word field further includes:
determining a field batch value to be classified and a sensitive word field batch value according to the number of the fields to be classified and the number of the sensitive word fields;
splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified;
splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value;
respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix;
and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
Optionally, after the step of converting the similarity association matrix into the target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be ranked, the method further includes:
performing supervised learning training by taking the sensitive word field as input and the sensitive word level corresponding to the sensitive word field as output through a machine learning algorithm to obtain a multi-classification model;
determining a similarity critical threshold according to the multi-classification model;
and replacing the similarity smaller than the similarity critical threshold value in the target two-dimensional table with a missing value to obtain a new target two-dimensional table.
Optionally, the step of performing an aggregation operation according to the sensitive word field level in the target two-dimensional table to obtain a score value corresponding to each grading level of the field to be graded specifically includes:
and carrying out aggregation operation according to the average value based on the sensitive word field levels in the target two-dimensional table to obtain the score value corresponding to each grading level of the field to be graded.
Optionally, the step of determining the grading level corresponding to the field to be graded according to the grading values corresponding to the grading levels of the field to be graded includes:
comparing the score values corresponding to the grading levels of the fields to be graded to obtain a comparison result;
and determining the grading level corresponding to the field to be graded according to the comparison result.
In addition, to achieve the above object, the present invention also proposes a data grading system, the system comprising:
the word stock construction module is used for constructing a sensitive word stock and determining the sensitive word field level of the sensitive word field in the sensitive word stock;
the matrix construction module is used for constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode;
the matrix conversion module is used for converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified;
the aggregation operation module is used for carrying out aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain the score values corresponding to the grading levels of the field to be graded;
and the level determining module is used for determining the grading level corresponding to the field to be graded according to the score value corresponding to each grading level of the field to be graded.
In addition, to achieve the above object, the present invention also proposes a data grading apparatus, the apparatus comprising: a memory, a processor, and a data grading program stored on the memory and executable on the processor, the data grading program configured to implement the steps of the data grading method as described above.
In addition, in order to achieve the above object, the present invention also proposes a storage medium having stored thereon a data grading program which, when executed by a processor, implements the steps of the data grading method as described above.
The method comprises the steps of constructing a sensitive word stock and determining the sensitive word field level of sensitive word fields in the sensitive word stock; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded. According to the method, a sensitive word library is constructed, the level of a sensitive word field is determined, a similarity incidence matrix of a field to be classified and the sensitive word field is constructed in a text semantic matching mode, then the similarity incidence matrix is converted into a target two-dimensional table, and aggregation operation is carried out according to the level of the sensitive word field in the target two-dimensional table, so that a score value corresponding to each classification level of the field to be classified is obtained; and determining the grading level corresponding to the field to be graded through the score value, and compared with the prior art, the accuracy and the efficiency of data grading are effectively improved.
Drawings
FIG. 1 is a schematic diagram of a data hierarchy apparatus of a hardware runtime environment in which embodiments of the present invention are implemented;
FIG. 2 is a flow chart of a first embodiment of the data classification method according to the present invention;
FIG. 3 is a schematic diagram of a sensitive word field and a field to be classified;
FIG. 4 is a flow chart of a second embodiment of the data classification method according to the present invention;
FIG. 5 is a flow chart of a third embodiment of the data classification method according to the present invention;
FIG. 6 is a block diagram of a data hierarchy system of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a data classifying device of a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the data grading apparatus may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable nonvolatile memory (NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not constitute a limitation of the data classification apparatus, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a data hierarchy program may be included in the memory 1005 as one type of storage medium.
In the data grading device shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 in the data grading device of the present invention may be provided in the data grading device, which calls the data grading program stored in the memory 1005 through the processor 1001 and performs the data grading method provided by the embodiments of the present invention.
An embodiment of the present invention provides a data classification method, referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the data classification method of the present invention.
In this embodiment, the data classification method includes the following steps:
step S10: and constructing a sensitive word stock, and determining the sensitive word field level of the sensitive word field in the sensitive word stock.
It should be noted that the execution body of this embodiment may be a computing service device having data processing, network communication and program running functions, such as a server, a tablet computer, a personal computer or a mobile phone, or an electronic device such as a data grading device that can implement the above functions. This embodiment and the following embodiments are described by taking a data grading device as an example.
It should be explained that a sensitive word stock is constructed, and each sensitive word field in the sensitive word stock is marked with a sensitive word field level, such as L5, L4, L3, etc. On the other hand, the text information of the header fields is extracted from the data table fields to be classified in the database, and a text list of fields to be classified is constructed, wherein the text list comprises a plurality of fields to be classified.
In a specific implementation, a sensitive word stock is constructed, and a grading level, namely the sensitive word field level, is given to each sensitive word field in the sensitive word stock. For example, referring to fig. 3, fig. 3 is a schematic diagram of the sensitive word fields and the fields to be classified. In fig. 3, the sensitive word fields payroll and prize correspond to classification level (grade) L5; mobile phone, telephone and mailbox correspond to grading level L4; gender and age correspond to grading level L3. The field texts to be classified (fields to be identified and graded) are employee income, project prize, contact information, e-mail, gender age, birth year and month, and home address, where "?" means that the grading level has not been determined yet. The task of data grading is to assign an appropriate grading level to these fields to be classified.
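As an illustrative sketch of these inputs (the Python structures and variable names are assumptions for illustration only; the field names follow the example of fig. 3):

```python
# Sensitive word stock: each sensitive word field is marked with its level (fig. 3).
sensitive_word_stock = {
    "payroll": "L5", "prize": "L5",
    "mobile phone": "L4", "telephone": "L4", "mailbox": "L4",
    "gender": "L3", "age": "L3",
}

# Header-field texts extracted from the data table to be classified; the grading
# level of each of these fields is still to be determined ("?").
fields_to_grade = [
    "employee income", "project prize", "contact information", "e-mail",
    "gender age", "birth year and month", "home address",
]
```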
Step S20: and constructing a similarity association matrix of the field to be classified and the sensitive word field in a text semantic matching mode.
It can be understood that text semantic matching, in popular terms, is to determine whether the semantics of two texts are the same. Text semantic matching is one of the most basic tasks in natural language processing, and the semantic matching has wide application in search matching, intelligent customer service, news recommendation and the like.
It should be noted that the text semantic matching approach converts the field to be classified and the sensitive word field into 768-dimensional word vectors based on the BERT model. For example, the word vector corresponding to payroll is: [ -0.04097468 0.02901088 0.01454205 0.04620046 0.03558226 … … ], and the word vector corresponding to employee income is: [ -0.00527354 0.03050051 0.02337652 0.05430245 0.07561858 … … ]. The similarity between two word vectors indicates the degree of matching between the semantics of the two short texts: the value lies between 0 and 1, and the larger the value, the higher the degree of semantic matching. The similarity of two word vectors is defined as their Hadamard (element-wise) product with the resulting elements summed, which is equivalent to the dot product of the two vectors.
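A minimal sketch of this similarity calculation (assuming the truncated vectors above stand in for full 768-dimensional, L2-normalized BERT embeddings, which is what keeps the score between 0 and 1):

```python
import numpy as np

# Similarity = Hadamard (element-wise) product summed over all dimensions,
# i.e. the dot product of the two word vectors.
def similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    return float(np.sum(vec_a * vec_b))

# Truncated example vectors from the description (illustrative only).
payroll = np.array([-0.04097468, 0.02901088, 0.01454205, 0.04620046, 0.03558226])
employee_income = np.array([-0.00527354, 0.03050051, 0.02337652, 0.05430245, 0.07561858])
print(similarity(payroll, employee_income))
```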
In a specific implementation, converting a field to be classified and the sensitive word field into 768-dimensional word vectors based on a BERT model to obtain a field matrix to be classified and a sensitive word field matrix; and carrying out Hadamard product operation on the field matrix to be classified and the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
It should be noted that there may be thousands of fields to be classified and thousands of sensitive word fields, and computing the similarity of every pair of short-text field vectors by a two-layer loop traversal is extremely inefficient. Therefore, the calculation performance can be remarkably improved by batch computation: for example, dimensions are added to the field matrix to be classified and the sensitive word field matrix respectively in a preset manner to obtain a first matrix and a second matrix, and then a batch Hadamard product operation is performed with the broadcasting mechanism of Numpy/Tensor to obtain the similarity association matrix of the fields to be classified and the sensitive word fields.
Assuming that the number of fields to be classified is n, that is, there are n field vectors to be classified, the corresponding field matrix to be classified has shape (n, 768), and adding one dimension along the axis=1 direction yields the first matrix of shape (n, 1, 768). The number of sensitive word fields is m, that is, there are m sensitive word field vectors, the corresponding sensitive word field matrix has shape (m, 768), and adding one dimension along the axis=0 direction yields the second matrix of shape (1, m, 768). The Hadamard product operation is then carried out with the broadcasting mechanism of Numpy/Tensor and the result is summed along the axis=2 direction, which yields the similarity association matrix (n, m) of the fields to be classified and the sensitive word fields.
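A minimal Numpy sketch of this broadcast computation (random placeholder vectors stand in for the BERT embeddings; F and S are assumed names for the two matrices):

```python
import numpy as np

n, m = 7, 9                                    # number of fields to grade / sensitive word fields
rng = np.random.default_rng(0)
F = rng.normal(size=(n, 768))                  # placeholder vectors for fields to be classified
S = rng.normal(size=(m, 768))                  # placeholder vectors for sensitive word fields
F /= np.linalg.norm(F, axis=1, keepdims=True)  # L2-normalize so scores fall between -1 and 1
S /= np.linalg.norm(S, axis=1, keepdims=True)

first = F[:, np.newaxis, :]                    # (n, 1, 768): dimension added along axis=1
second = S[np.newaxis, :, :]                   # (1, m, 768): dimension added along axis=0
sim = (first * second).sum(axis=2)             # broadcast Hadamard product, summed along axis=2
print(sim.shape)                               # (n, m) similarity association matrix
```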
Step S30: and converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified.
Referring to table 1, a multi-layer index is constructed based on a sensitive word field (text) and a sensitive word field level (grade), and then the similarity association matrix is converted into a target two-dimensional table based on the field to be ranked.
TABLE 1 target two-dimensional table
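A pandas sketch of such a target two-dimensional table (the similarity values are placeholders, only a subset of the example fields is shown, and all variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

sensitive_words  = ["payroll", "prize", "mailbox", "gender", "age"]
sensitive_levels = ["L5",      "L5",    "L4",      "L3",     "L3"]
fields_to_grade  = ["employee income", "project prize", "e-mail", "gender age"]

# Placeholder similarities of shape (n, m); in practice this is the
# similarity association matrix computed in step S20.
sim = np.array([[0.93, 0.71, 0.28, 0.11, 0.15],
                [0.69, 0.90, 0.22, 0.10, 0.13],
                [0.21, 0.18, 0.88, 0.09, 0.08],
                [0.12, 0.10, 0.14, 0.86, 0.79]])

# Multi-layer column index built from the sensitive word field (text) and its
# level (grade); rows are indexed by the fields to be graded.
columns = pd.MultiIndex.from_arrays([sensitive_words, sensitive_levels],
                                    names=["text", "grade"])
target_table = pd.DataFrame(sim, index=fields_to_grade, columns=columns)
print(target_table)
```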
Step S40: and performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain the score values corresponding to the grading levels of the fields to be graded.
In a specific implementation, aggregation operation is performed according to a mean value based on the sensitive word field level in the target two-dimensional table, and the score value corresponding to each grading level of the field to be graded is obtained, as shown in table 2.
TABLE 2 Score values corresponding to each grading level of the fields to be graded
Step S50: and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded.
It should be noted that, after obtaining the score values corresponding to the classification levels of the field to be classified, comparing the score values corresponding to the classification levels of the field to be classified to obtain a comparison result; and determining the grading level corresponding to the field to be graded according to the comparison result.
The grading level of each field to be graded is obtained by taking, for each row, the maximum score value and the grading level corresponding to that maximum, as shown in table 3.
Table 3 Grading level table of each field to be graded
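Continuing the pandas sketch given after table 1, steps S40 and S50 reduce the target two-dimensional table to a score per grading level and then pick the level with the maximum score (an illustrative sketch, not the only possible implementation):

```python
import pandas as pd

# Step S40: aggregate by mean within each sensitive word field level — transpose,
# group the column MultiIndex by its "grade" level, then transpose back.
score_table = target_table.T.groupby(level="grade").mean().T

# Step S50: for each field to be graded, take the maximum score and the
# grading level at which that maximum occurs.
grading = pd.DataFrame({
    "score": score_table.max(axis=1),
    "level": score_table.idxmax(axis=1),
})
print(grading)
```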
The embodiment constructs a sensitive word library and determines the sensitive word field level of sensitive word fields in the sensitive word library; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing aggregation operation according to the average value based on the sensitive word field levels in the target two-dimensional table to obtain the score value corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded. According to the method, a sensitive word library is constructed, the level of a sensitive word field is determined, a similarity incidence matrix of a field to be classified and the sensitive word field is constructed in a text semantic matching mode, then the similarity incidence matrix is converted into a target two-dimensional table, and aggregation operation is carried out according to the level of the sensitive word field in the target two-dimensional table, so that a score value corresponding to each classification level of the field to be classified is obtained; and comparing the score values corresponding to the grading levels of the fields to be graded to obtain a comparison result, and determining the grading level corresponding to the fields to be graded according to the comparison result.
Referring to fig. 4, fig. 4 is a flowchart illustrating a second embodiment of the data classification method according to the present invention.
Based on the first embodiment, in this embodiment, the step S20 includes:
step S201: and converting the field to be classified and the sensitive word field into 768-dimensional word vectors based on the BERT model, and obtaining a field matrix to be classified and a sensitive word field matrix.
It should be explained that, based on the BERT model, the field to be classified and the sensitive word field are converted into 768-dimensional word vectors. For example, the word vector corresponding to payroll is: [ -0.04097468 0.02901088 0.01454205 0.04620046 0.03558226 … … ], and the word vector corresponding to employee income is: [ -0.00527354 0.03050051 0.02337652 0.05430245 0.07561858 … … ].
Step S202: and carrying out Hadamard product operation on the corresponding word vector in the field matrix to be classified and the corresponding word vector in the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
It can be understood that the word vector corresponding to the field matrix to be classified is 768-dimensional word vector converted by the field to be classified based on the BERT model, namely, the field vector to be classified; the corresponding word vector in the sensitive word field matrix is 768-dimensional word vector converted by the sensitive word field based on the BERT model, namely the sensitive word field vector.
In a specific implementation, dimensions can be added to the field matrix to be classified and the sensitive word field matrix according to a preset mode, so as to obtain a first matrix and a second matrix; and then carrying out Hadamard product batch operation on the corresponding word vectors in the first matrix and the corresponding word vectors in the second matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
It should be noted that, in the foregoing preset manner, assuming that the number of fields to be classified is n, that is, there are n field vectors to be classified, the corresponding field matrix to be classified has shape (n, 768), and adding one dimension along the axis=1 direction yields the first matrix of shape (n, 1, 768). The number of sensitive word fields is m, that is, there are m sensitive word field vectors, the corresponding sensitive word field matrix has shape (m, 768), and adding one dimension along the axis=0 direction yields the second matrix of shape (1, m, 768). The Hadamard product operation is then carried out with the broadcasting mechanism of Numpy/Tensor and the result is summed along the axis=2 direction, which yields the similarity association matrix (n, m) of the fields to be classified and the sensitive word fields.
However, since the values of n and m are large, computing the similarity association matrix in a single batch will cause memory overflow, so small-batch matrix operations are required. For example, let the batch size of the fields to be classified be bs1 and the batch size of the sensitive word fields be bs2; a similarity matrix of size (bs1, bs2) is then calculated for each small batch. Before performing the small-batch matrix calculation, the field matrix to be classified (n, 768) is split into ⌈n/bs1⌉ small matrices of shape (bs1, 768), and the sensitive word field matrix (m, 768) is split into ⌈m/bs2⌉ small matrices of shape (bs2, 768).
Thus, in a specific implementation, the batch value of the fields to be classified and the batch value of the sensitive words can be determined according to the number of the fields to be classified and the number of the sensitive words; splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified; splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value; respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix; and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
It will be appreciated that small-batch matrix operations may be implemented with the Numpy, Paddle and PyTorch frameworks, with the time taken for a single batch calculation increasing as the batch size increases. If n = m = 4500 is taken to complete the full similarity matrix operation, the total time consumed by Numpy and PyTorch is only slightly affected by the batch size; it is therefore recommended to use the Numpy framework to implement the similarity matrix calculation, and the batch size can be set to about 1000.
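A hedged Numpy sketch of the small-batch computation (assuming F of shape (n, 768) and S of shape (m, 768) are the normalized field and sensitive word matrices from the earlier sketch, with the default batch sizes reflecting the roughly 1000 recommended above):

```python
import numpy as np

def batched_similarity(F: np.ndarray, S: np.ndarray,
                       bs1: int = 1000, bs2: int = 1000) -> np.ndarray:
    """Compute the (n, m) similarity association matrix block by block."""
    n, m = F.shape[0], S.shape[0]
    sim = np.empty((n, m), dtype=F.dtype)
    for i in range(0, n, bs1):                     # ceil(n / bs1) blocks of fields to grade
        fi = F[i:i + bs1][:, np.newaxis, :]        # (<=bs1, 1, 768)
        for j in range(0, m, bs2):                 # ceil(m / bs2) blocks of sensitive word fields
            sj = S[j:j + bs2][np.newaxis, :, :]    # (1, <=bs2, 768)
            sim[i:i + bs1, j:j + bs2] = (fi * sj).sum(axis=2)
    return sim

# sim = batched_similarity(F, S)  # keeps peak memory bounded by the block size
```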
According to the number of the fields to be classified and the number of the sensitive word fields, determining a batch value of the fields to be classified and a batch value of the sensitive word fields; splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified; splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value; respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix; and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a Numpy/Tensor broadcasting mechanism to obtain similarity incidence matrixes of the field to be classified and the sensitive word field.
Referring to fig. 5, fig. 5 is a flowchart of a third embodiment of the data classification method according to the present invention.
Based on the above embodiments, in this embodiment, after step S30, the method further includes:
step S301: and performing supervised learning training by taking the sensitive word field as input and the sensitive word level corresponding to the sensitive word field as output through a machine learning algorithm to obtain a multi-classification model.
Step S302: and determining a similarity critical threshold according to the multi-classification model.
The machine learning algorithm may be a Bayesian algorithm, a decision tree algorithm or another classification algorithm, which is not limited in this embodiment.
In a specific implementation, a similarity critical threshold is trained through a machine learning method, sensitive word fields are used as input, sensitive word levels corresponding to the sensitive word fields are used as output, and a training sample data set is constructed. Training to obtain a multi-classification model based on a supervised learning method, and determining a similarity critical threshold according to the multi-classification model.
A similarity critical threshold is thus obtained by training with a machine learning algorithm. When a similarity is smaller than the similarity critical threshold, the degree of matching between the semantics of the two short texts is low, and the corresponding entry of the similarity association matrix is therefore filled with a missing value.
Step S303: and replacing the similarity smaller than the similarity critical threshold value in the target two-dimensional table with a missing value to obtain a new target two-dimensional table.
Referring to table 4, if the similarity critical threshold obtained by training is 0.6, the similarities smaller than 0.6 in the target two-dimensional table are replaced with missing values to obtain a new target two-dimensional table.
TABLE 4 New target two-dimensional Table
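Continuing the same pandas sketch, the thresholding of step S303 can be expressed as a simple mask (0.6 stands in for the similarity critical threshold produced by the trained multi-classification model, and NaN plays the role of the missing value):

```python
threshold = 0.6  # similarity critical threshold from the multi-classification model (assumed)
new_target_table = target_table.where(target_table >= threshold)  # below threshold -> NaN

# The subsequent mean aggregation skips NaN automatically; a grading level whose
# similarities are all missing ends up as NaN, i.e. "None" as in table 5.
score_table = new_target_table.T.groupby(level="grade").mean().T
```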
Then, the aggregation operation is performed by mean according to the sensitive word field levels in the new target two-dimensional table, and the maximum value and the index of the maximum value are taken for each row of the new two-dimensional table to obtain the optimized grading levels, thereby reducing the data volume and improving the data grading efficiency, for example, as shown in table 5.
Table 5 optimized hierarchical level table
As can be seen from table 5, employee income and project prize may be assigned grading level L5; contact information and e-mail may be assigned grading level L4; gender age may be assigned grading level L3; for birth year and month and home address, the matched sensitive word field level is None, so the corresponding grading level is lower than any sensitive word field level, and the assignable data grading level is L2 or L1.
The embodiment constructs a sensitive word library and determines the sensitive word field level of sensitive word fields in the sensitive word library; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing supervised learning training by taking the sensitive word field as input and the sensitive word level corresponding to the sensitive word field as output through a machine learning algorithm to obtain a multi-classification model; determining a similarity critical threshold according to the multi-classification model; replacing the similarity smaller than the similarity critical threshold value in the target two-dimensional table with a missing value to obtain a new target two-dimensional table; performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded. According to the method, a sensitive word library is constructed, the level of a sensitive word field is determined, a similarity incidence matrix of a field to be classified and the sensitive word field is constructed in a text semantic matching mode, then the similarity incidence matrix is converted into a target two-dimensional table, then the similarity smaller than the similarity critical threshold value in the target two-dimensional table is replaced by a missing value, a new target two-dimensional table is obtained, aggregation operation is carried out according to the level of the sensitive word field in the target two-dimensional table, and the score value corresponding to each classification level of the field to be classified is obtained; compared with the prior art, the method and the device effectively improve the accuracy and the efficiency of data classification.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a data grading program, and the data grading program realizes the steps of the data grading method when being executed by a processor.
Referring to fig. 6, fig. 6 is a block diagram illustrating a data hierarchy system according to the present invention.
As shown in fig. 6, a data grading system according to an embodiment of the present invention includes: a word stock construction module 601, a matrix construction module 602, a matrix conversion module 603, an aggregation operation module 604 and a level determination module 605.
The word stock construction module 601 is configured to construct a sensitive word stock, and determine a sensitive word field level of a sensitive word field in the sensitive word stock.
The matrix construction module 602 is configured to construct a similarity association matrix of the field to be ranked and the sensitive word field in a text semantic matching manner.
The matrix conversion module 603 is configured to convert the similarity association matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be ranked.
The aggregation operation module 604 is configured to perform an aggregation operation according to the sensitive word field level in the target two-dimensional table, and obtain a score value corresponding to each classification level of the field to be classified.
The level determining module 605 is configured to determine a classification level corresponding to the field to be classified according to the score value corresponding to each classification level of the field to be classified.
The system constructs a sensitive word library and determines the sensitive word field level of sensitive word fields in the sensitive word library; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded. According to the method, a sensitive word library is constructed, the level of a sensitive word field is determined, a similarity incidence matrix of a field to be classified and the sensitive word field is constructed in a text semantic matching mode, then the similarity incidence matrix is converted into a target two-dimensional table, and aggregation operation is carried out according to the level of the sensitive word field in the target two-dimensional table, so that a score value corresponding to each classification level of the field to be classified is obtained; and determining the grading level corresponding to the field to be graded through the score value, and compared with the prior art, the accuracy and the efficiency of data grading are effectively improved.
Based on the above-described first embodiment of the data classification system of the present invention, a second embodiment of the data classification system of the present invention is proposed.
In this embodiment, the matrix construction module 602 is further configured to convert, based on a BERT model, a field to be classified and the sensitive word field into 768-dimensional word vectors, to obtain a field matrix to be classified and a sensitive word field matrix; and carrying out Hadamard product operation on the corresponding word vector in the field matrix to be classified and the corresponding word vector in the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
The matrix construction module 602 is further configured to add dimensions to the field matrix to be classified and the sensitive word field matrix according to a preset manner, so as to obtain a first matrix and a second matrix; and carrying out Hadamard product batch operation on the word vectors corresponding to the first matrix and the word vectors corresponding to the second matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
The matrix construction module 602 is further configured to determine a batch value of fields to be classified and a batch value of sensitive words according to the number of fields to be classified and the number of sensitive words; splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified; splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value; respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix; and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
Other embodiments or specific implementations of the data classification system of the present invention may refer to the above method embodiments, and are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A method of data classification, the method comprising the steps of:
constructing a sensitive word stock, and determining the sensitive word field level of sensitive word fields in the sensitive word stock;
constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode;
converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified;
performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded;
and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded.
2. The data grading method according to claim 1, wherein the step of constructing a similarity association matrix of the field to be graded and the sensitive word field by means of text semantic matching includes:
converting the field to be classified and the sensitive word field into 768-dimensional word vectors based on a BERT model to obtain a field matrix to be classified and a sensitive word field matrix;
and carrying out Hadamard product operation on the corresponding word vector in the field matrix to be classified and the corresponding word vector in the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
3. The data classification method of claim 2, wherein the step of performing hadamard product operation on the word vector corresponding to the field matrix to be classified and the word vector corresponding to the sensitive word field matrix to obtain the similarity association matrix of the field to be classified and the sensitive word field comprises:
respectively adding dimensions to the field matrix to be classified and the sensitive word field matrix according to a preset mode to obtain a first matrix and a second matrix;
and carrying out Hadamard product batch operation on the word vectors corresponding to the first matrix and the word vectors corresponding to the second matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
4. The data classification method of claim 3, wherein the step of performing hadamard product operation on the word vector corresponding to the field matrix to be classified and the word vector corresponding to the sensitive word field matrix to obtain the similarity association matrix of the field to be classified and the sensitive word field further comprises:
determining a field batch value to be classified and a sensitive word field batch value according to the number of the fields to be classified and the number of the sensitive word fields;
splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified;
splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value;
respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix;
and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
5. The data ranking method of claim 1, wherein after the step of converting the similarity association matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be ranked, further comprising:
performing supervised learning training by taking the sensitive word field as input and the sensitive word level corresponding to the sensitive word field as output through a machine learning algorithm to obtain a multi-classification model;
determining a similarity critical threshold according to the multi-classification model;
and replacing the similarity smaller than the similarity critical threshold value in the target two-dimensional table with a missing value to obtain a new target two-dimensional table.
6. The data classification method according to claim 1, wherein the step of performing an aggregation operation according to the sensitive word field level in the target two-dimensional table to obtain the score value corresponding to each classification level of the field to be classified specifically includes:
and carrying out aggregation operation according to the average value based on the sensitive word field levels in the target two-dimensional table to obtain the score value corresponding to each grading level of the field to be graded.
7. The data grading method according to claim 1, wherein the step of determining the grading level corresponding to the field to be graded by the grading value corresponding to each grading level of the field to be graded includes:
comparing the score values corresponding to the grading levels of the fields to be graded to obtain a comparison result;
and determining the grading level corresponding to the field to be graded according to the comparison result.
8. A data staging system, the data staging system comprising:
the word stock construction module is used for constructing a sensitive word stock and determining the sensitive word field level of the sensitive word field in the sensitive word stock;
the matrix construction module is used for constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode;
the matrix conversion module is used for converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified;
the aggregation operation module is used for carrying out aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain the score values corresponding to the grading levels of the field to be graded;
and the level determining module is used for determining the grading level corresponding to the field to be graded according to the score value corresponding to each grading level of the field to be graded.
9. A data grading apparatus, the apparatus comprising: a memory, a processor and a data grading program stored on the memory and executable on the processor, the data grading program being configured to implement the steps of the data grading method according to any of claims 1 to 7.
10. A storage medium having stored thereon a data grading program which, when executed by a processor, implements the steps of the data grading method according to any of claims 1 to 7.
CN202310945859.9A 2023-07-31 2023-07-31 Data grading method, system, equipment and storage medium Active CN116681086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310945859.9A CN116681086B (en) 2023-07-31 2023-07-31 Data grading method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310945859.9A CN116681086B (en) 2023-07-31 2023-07-31 Data grading method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116681086A true CN116681086A (en) 2023-09-01
CN116681086B CN116681086B (en) 2024-04-02

Family

ID=87782243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310945859.9A Active CN116681086B (en) 2023-07-31 2023-07-31 Data grading method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116681086B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977222A (en) * 2019-03-05 2019-07-05 广州海晟科技有限公司 The recognition methods of data sensitive behavior
CN110826319A (en) * 2019-10-30 2020-02-21 维沃移动通信有限公司 Application information processing method and terminal equipment
CN111767394A (en) * 2020-06-24 2020-10-13 中国工商银行股份有限公司 Abstract extraction method and device based on artificial intelligence expert system
CN113139379A (en) * 2020-01-20 2021-07-20 中国电信股份有限公司 Information identification method and system
US20210224481A1 (en) * 2017-04-07 2021-07-22 Ping An Technology(Shenzhen) Co., Ltd. Method and apparatus for topic early warning, computer equipment and storage medium
CN114491018A (en) * 2021-12-23 2022-05-13 天翼云科技有限公司 Construction method of sensitive information detection model, and sensitive information detection method and device
CN115879455A (en) * 2022-12-29 2023-03-31 华润数字科技有限公司 Word emotion polarity prediction method and device, electronic equipment and storage medium
CN116049397A (en) * 2022-12-29 2023-05-02 北京霍因科技有限公司 Sensitive information discovery and automatic classification method based on multi-mode fusion
CN116150349A (en) * 2021-11-18 2023-05-23 上海数据交易中心有限公司 Data product security compliance checking method, device and server

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210224481A1 (en) * 2017-04-07 2021-07-22 Ping An Technology(Shenzhen) Co., Ltd. Method and apparatus for topic early warning, computer equipment and storage medium
CN109977222A (en) * 2019-03-05 2019-07-05 广州海晟科技有限公司 The recognition methods of data sensitive behavior
CN110826319A (en) * 2019-10-30 2020-02-21 维沃移动通信有限公司 Application information processing method and terminal equipment
CN113139379A (en) * 2020-01-20 2021-07-20 中国电信股份有限公司 Information identification method and system
CN111767394A (en) * 2020-06-24 2020-10-13 中国工商银行股份有限公司 Abstract extraction method and device based on artificial intelligence expert system
CN116150349A (en) * 2021-11-18 2023-05-23 上海数据交易中心有限公司 Data product security compliance checking method, device and server
CN114491018A (en) * 2021-12-23 2022-05-13 天翼云科技有限公司 Construction method of sensitive information detection model, and sensitive information detection method and device
CN115879455A (en) * 2022-12-29 2023-03-31 华润数字科技有限公司 Word emotion polarity prediction method and device, electronic equipment and storage medium
CN116049397A (en) * 2022-12-29 2023-05-02 北京霍因科技有限公司 Sensitive information discovery and automatic classification method based on multi-mode fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUANG Wei et al.: "Research on a Sensitive Entity Recognition Model for Heterogeneous Social Networks Based on Breadth Learning" (基于广度学习的异构社交网络敏感实体识别模型研究), 情报学报 (Journal of the China Society for Scientific and Technical Information), vol. 39, no. 06, pages 579-588 *

Also Published As

Publication number Publication date
CN116681086B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US7756535B1 (en) Lightweight content filtering system for mobile phones
US9817885B2 (en) Method and apparatus for grouping network service users
US11748452B2 (en) Method for data processing by performing different non-linear combination processing
CN111611514B (en) Page display method and device based on user login information and electronic equipment
CN111639247A (en) Method, apparatus, device and computer-readable storage medium for evaluating quality of review
CN112328909A (en) Information recommendation method and device, computer equipment and medium
CN112015562A (en) Resource allocation method and device based on transfer learning and electronic equipment
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN111179055B (en) Credit line adjusting method and device and electronic equipment
CN111190967B (en) User multidimensional data processing method and device and electronic equipment
CN113127621A (en) Dialogue module pushing method, device, equipment and storage medium
US11947617B2 (en) Assigning variants of content to users while maintaining a stable experimental population
CN111626783B (en) Offline information setting method and device for realizing event conversion probability prediction
CN116681086B (en) Data grading method, system, equipment and storage medium
CN111626898B (en) Method, device, medium and electronic equipment for realizing attribution of events
CN114266255B (en) Corpus classification method, apparatus, device and storage medium based on clustering model
WO2022213662A1 (en) Application recommendation method and system, terminal, and storage medium
CN113612777B (en) Training method, flow classification method, device, electronic equipment and storage medium
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN113327133B (en) Data recommendation method, data recommendation device, electronic equipment and readable storage medium
US11983152B1 (en) Systems and methods for processing environmental, social and governance data
JP7355322B1 (en) Email element setting system and email subject setting support system
US20240220902A1 (en) Systems and methods for automatic handling of score revision requests
CN111382244B (en) Deep retrieval matching classification method and device and terminal equipment
CN116861289A (en) Data classification method, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant