CN116681086A - Data grading method, system, equipment and storage medium - Google Patents

Data grading method, system, equipment and storage medium

Info

Publication number
CN116681086A
Authority
CN
China
Prior art keywords
field
sensitive word
matrix
classified
grading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310945859.9A
Other languages
Chinese (zh)
Other versions
CN116681086B (en)
Inventor
邓理平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Aotian Technology Co ltd
Original Assignee
Shenzhen Aotian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Aotian Technology Co ltd filed Critical Shenzhen Aotian Technology Co ltd
Priority to CN202310945859.9A priority Critical patent/CN116681086B/en
Publication of CN116681086A publication Critical patent/CN116681086A/en
Application granted granted Critical
Publication of CN116681086B publication Critical patent/CN116681086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data management, and discloses a data grading method, a system, equipment and a storage medium, wherein the method comprises the following steps: constructing a sensitive word stock, and determining the sensitive word field level of sensitive word fields in the sensitive word stock; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded through the score value. Compared with the prior art, the method and the device effectively improve the accuracy and the efficiency of data classification.

Description

Data grading method, system, equipment and storage medium
Technical Field
The present invention relates to the field of data management technologies, and in particular, to a data classification method, system, device, and storage medium.
Background
The main idea of current data classification is to automatically discover sensitive data and then perform the grading operation with manual involvement, which helps the relevant personnel find sensitive data quickly. However, this grading approach is inflexible for subjective data and cannot adapt to the data security grading requirements of different organizations.
Because data classification in the industry has no unified standard, most solutions rely on personnel with experience in the industry, business, security and other domains to sort the data out manually. This approach is characterized by high accuracy and good results, but also by low efficiency, long cycles and arbitrary grading criteria.
Therefore, a data classification method is needed to solve the technical problem of how to effectively improve the accuracy and efficiency of data classification.
Disclosure of Invention
The invention mainly aims to provide a data classification method, a system, equipment and a storage medium, which aim to solve the technical problem of how to effectively improve the accuracy and efficiency of data classification.
To achieve the above object, the present invention provides a data classification method, comprising the steps of:
constructing a sensitive word stock, and determining the sensitive word field level of sensitive word fields in the sensitive word stock;
constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode;
converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified;
performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded;
and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded.
Optionally, the step of constructing a similarity association matrix of the field to be ranked and the sensitive word field by means of text semantic matching includes:
converting the field to be classified and the sensitive word field into 768-dimensional word vectors based on a BERT model to obtain a field matrix to be classified and a sensitive word field matrix;
and carrying out Hadamard product operation on the corresponding word vector in the field matrix to be classified and the corresponding word vector in the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
Optionally, the step of performing hadamard product operation on the word vector corresponding to the field matrix to be classified and the word vector corresponding to the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field includes:
respectively adding dimensions to the field matrix to be classified and the sensitive word field matrix according to a preset mode to obtain a first matrix and a second matrix;
and carrying out Hadamard product batch operation on the word vectors corresponding to the first matrix and the word vectors corresponding to the second matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
Optionally, the step of performing hadamard product operation on the word vector corresponding to the field matrix to be classified and the word vector corresponding to the sensitive word field matrix to obtain the similarity association matrix of the field to be classified and the sensitive word field further includes:
determining a field batch value to be classified and a sensitive word field batch value according to the number of the fields to be classified and the number of the sensitive word fields;
splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified;
splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value;
respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix;
and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
Optionally, after the step of converting the similarity association matrix into the target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be ranked, the method further includes:
performing supervised learning training by taking the sensitive word field as input and the sensitive word level corresponding to the sensitive word field as output through a machine learning algorithm to obtain a multi-classification model;
determining a similarity critical threshold according to the multi-classification model;
and replacing the similarity smaller than the similarity critical threshold value in the target two-dimensional table with a missing value to obtain a new target two-dimensional table.
Optionally, the step of performing an aggregation operation according to the sensitive word field level in the target two-dimensional table to obtain a score value corresponding to each grading level of the field to be graded specifically includes:
and carrying out aggregation operation according to the average value based on the sensitive word field levels in the target two-dimensional table to obtain the score value corresponding to each grading level of the field to be graded.
Optionally, the step of determining the grading level corresponding to the field to be graded according to the grading values corresponding to the grading levels of the field to be graded includes:
comparing the score values corresponding to the grading levels of the fields to be graded to obtain a comparison result;
and determining the grading level corresponding to the field to be graded according to the comparison result.
In addition, to achieve the above object, the present invention also proposes a data grading system, the system comprising:
the word stock construction module is used for constructing a sensitive word stock and determining the sensitive word field level of the sensitive word field in the sensitive word stock;
the matrix construction module is used for constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode;
the matrix conversion module is used for converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified;
the aggregation operation module is used for carrying out aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain the score values corresponding to the grading levels of the field to be graded;
and the level determining module is used for determining the grading level corresponding to the field to be graded according to the score value corresponding to each grading level of the field to be graded.
In addition, to achieve the above object, the present invention also proposes a data grading apparatus, the apparatus comprising: a memory, a processor, and a data grading program stored on the memory and executable on the processor, the data grading program configured to implement the steps of the data grading method as described above.
In addition, in order to achieve the above object, the present invention also proposes a storage medium having stored thereon a data grading program which, when executed by a processor, implements the steps of the data grading method as described above.
The method comprises the steps of constructing a sensitive word stock and determining the sensitive word field level of sensitive word fields in the sensitive word stock; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded. According to the method, a sensitive word library is constructed, the level of a sensitive word field is determined, a similarity incidence matrix of a field to be classified and the sensitive word field is constructed in a text semantic matching mode, then the similarity incidence matrix is converted into a target two-dimensional table, and aggregation operation is carried out according to the level of the sensitive word field in the target two-dimensional table, so that a score value corresponding to each classification level of the field to be classified is obtained; and determining the grading level corresponding to the field to be graded through the score value, and compared with the prior art, the accuracy and the efficiency of data grading are effectively improved.
Drawings
FIG. 1 is a schematic diagram of a data hierarchy apparatus of a hardware runtime environment in which embodiments of the present invention are implemented;
FIG. 2 is a flow chart of a first embodiment of the data classification method according to the present invention;
FIG. 3 is a schematic diagram of a sensitive word field and a field to be classified;
FIG. 4 is a flow chart of a second embodiment of the data classification method according to the present invention;
FIG. 5 is a flow chart of a third embodiment of the data classification method according to the present invention;
FIG. 6 is a block diagram of a data hierarchy system of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a data classifying device of a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the data grading apparatus may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (Random Access Memory, RAM) or a stable nonvolatile memory (NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 does not constitute a limitation of the data classification apparatus, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a data hierarchy program may be included in the memory 1005 as one type of storage medium.
In the data grading device shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 in the data grading device of the present invention may be provided in the data grading device, which calls the data grading program stored in the memory 1005 through the processor 1001 and performs the data grading method provided by the embodiments of the present invention.
An embodiment of the present invention provides a data classification method, referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the data classification method of the present invention.
In this embodiment, the data classification method includes the following steps:
step S10: and constructing a sensitive word stock, and determining the sensitive word field level of the sensitive word field in the sensitive word stock.
It should be noted that the execution body of this embodiment may be a computing service device having data processing, network communication and program running functions, such as a server, a tablet computer, a personal computer or a mobile phone, or an electronic device such as a data grading device that can implement the above functions. This embodiment and the following embodiments are described by taking a data grading device as an example.
It should be explained that a sensitive word stock is constructed, and each sensitive word field in the sensitive word stock is marked with a sensitive word field level, such as L5, L4, L3, etc. On the other hand, the text information of the header fields is extracted from the data table fields to be classified in the database, and a text list of fields to be classified is constructed, wherein the text list comprises a plurality of fields to be classified.
In a specific implementation, a sensitive word stock is constructed, and a grading level, namely the sensitive word field level, is given to each sensitive word field in the sensitive word stock. For example, referring to fig. 3, fig. 3 is a schematic diagram of the sensitive word fields and the fields to be classified. In fig. 3, the sensitive word fields payroll and prize correspond to classification level (grade) L5; mobile phone, telephone and mailbox correspond to grading level L4; gender and age correspond to grading level L3. The field texts to be classified (fields to be identified and graded) are employee income, project prize, contact information, e-mail, gender age, birth year and month, and home address, where "?" means that the grading level has not been determined yet. The task of data grading is to assign an appropriate grading level to these fields to be classified.
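As an illustrative sketch of these inputs (the Python structures and variable names are assumptions for illustration only; the field names follow the example of fig. 3):

```python
# Sensitive word stock: each sensitive word field is marked with its level (fig. 3).
sensitive_word_stock = {
    "payroll": "L5", "prize": "L5",
    "mobile phone": "L4", "telephone": "L4", "mailbox": "L4",
    "gender": "L3", "age": "L3",
}

# Header-field texts extracted from the data table to be classified; the grading
# level of each of these fields is still to be determined ("?").
fields_to_grade = [
    "employee income", "project prize", "contact information", "e-mail",
    "gender age", "birth year and month", "home address",
]
```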
Step S20: and constructing a similarity association matrix of the field to be classified and the sensitive word field in a text semantic matching mode.
It can be understood that text semantic matching, in popular terms, is to determine whether the semantics of two texts are the same. Text semantic matching is one of the most basic tasks in natural language processing, and the semantic matching has wide application in search matching, intelligent customer service, news recommendation and the like.
It should be noted that the text semantic matching approach converts the field to be classified and the sensitive word field into 768-dimensional word vectors based on the BERT model. For example, the word vector corresponding to payroll is: [ -0.04097468 0.02901088 0.01454205 0.04620046 0.03558226 … … ], and the word vector corresponding to employee income is: [ -0.00527354 0.03050051 0.02337652 0.05430245 0.07561858 … … ]. The similarity between two word vectors indicates the degree of matching between the semantics of the two short texts: the value lies between 0 and 1, and the larger the value, the higher the degree of semantic matching. The similarity of two word vectors is defined as their Hadamard (element-wise) product with the resulting elements summed, which is equivalent to the dot product of the two vectors.
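A minimal sketch of this similarity calculation (assuming the truncated vectors above stand in for full 768-dimensional, L2-normalized BERT embeddings, which is what keeps the score between 0 and 1):

```python
import numpy as np

# Similarity = Hadamard (element-wise) product summed over all dimensions,
# i.e. the dot product of the two word vectors.
def similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    return float(np.sum(vec_a * vec_b))

# Truncated example vectors from the description (illustrative only).
payroll = np.array([-0.04097468, 0.02901088, 0.01454205, 0.04620046, 0.03558226])
employee_income = np.array([-0.00527354, 0.03050051, 0.02337652, 0.05430245, 0.07561858])
print(similarity(payroll, employee_income))
```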
In a specific implementation, converting a field to be classified and the sensitive word field into 768-dimensional word vectors based on a BERT model to obtain a field matrix to be classified and a sensitive word field matrix; and carrying out Hadamard product operation on the field matrix to be classified and the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
It should be noted that there may be thousands of fields to be classified and thousands of sensitive word fields, and computing the similarity of every pair of short-text field vectors by a two-layer loop traversal is extremely inefficient. Therefore, the calculation performance can be remarkably improved by batch computation: for example, dimensions are added to the field matrix to be classified and the sensitive word field matrix respectively in a preset manner to obtain a first matrix and a second matrix, and then a batch Hadamard product operation is performed with the broadcasting mechanism of Numpy/Tensor to obtain the similarity association matrix of the fields to be classified and the sensitive word fields.
Assuming that the number of fields to be classified is n, that is, there are n field vectors to be classified, the corresponding field matrix to be classified has shape (n, 768), and adding one dimension along the axis=1 direction yields the first matrix of shape (n, 1, 768). The number of sensitive word fields is m, that is, there are m sensitive word field vectors, the corresponding sensitive word field matrix has shape (m, 768), and adding one dimension along the axis=0 direction yields the second matrix of shape (1, m, 768). The Hadamard product operation is then carried out with the broadcasting mechanism of Numpy/Tensor and the result is summed along the axis=2 direction, which yields the similarity association matrix (n, m) of the fields to be classified and the sensitive word fields.
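A minimal Numpy sketch of this broadcast computation (random placeholder vectors stand in for the BERT embeddings; F and S are assumed names for the two matrices):

```python
import numpy as np

n, m = 7, 9                                    # number of fields to grade / sensitive word fields
rng = np.random.default_rng(0)
F = rng.normal(size=(n, 768))                  # placeholder vectors for fields to be classified
S = rng.normal(size=(m, 768))                  # placeholder vectors for sensitive word fields
F /= np.linalg.norm(F, axis=1, keepdims=True)  # L2-normalize so scores fall between -1 and 1
S /= np.linalg.norm(S, axis=1, keepdims=True)

first = F[:, np.newaxis, :]                    # (n, 1, 768): dimension added along axis=1
second = S[np.newaxis, :, :]                   # (1, m, 768): dimension added along axis=0
sim = (first * second).sum(axis=2)             # broadcast Hadamard product, summed along axis=2
print(sim.shape)                               # (n, m) similarity association matrix
```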
Step S30: and converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified.
Referring to table 1, a multi-layer index is constructed based on a sensitive word field (text) and a sensitive word field level (grade), and then the similarity association matrix is converted into a target two-dimensional table based on the field to be ranked.
TABLE 1 target two-dimensional table
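A pandas sketch of such a target two-dimensional table (the similarity values are placeholders, only a subset of the example fields is shown, and all variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

sensitive_words  = ["payroll", "prize", "mailbox", "gender", "age"]
sensitive_levels = ["L5",      "L5",    "L4",      "L3",     "L3"]
fields_to_grade  = ["employee income", "project prize", "e-mail", "gender age"]

# Placeholder similarities of shape (n, m); in practice this is the
# similarity association matrix computed in step S20.
sim = np.array([[0.93, 0.71, 0.28, 0.11, 0.15],
                [0.69, 0.90, 0.22, 0.10, 0.13],
                [0.21, 0.18, 0.88, 0.09, 0.08],
                [0.12, 0.10, 0.14, 0.86, 0.79]])

# Multi-layer column index built from the sensitive word field (text) and its
# level (grade); rows are indexed by the fields to be graded.
columns = pd.MultiIndex.from_arrays([sensitive_words, sensitive_levels],
                                    names=["text", "grade"])
target_table = pd.DataFrame(sim, index=fields_to_grade, columns=columns)
print(target_table)
```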
Step S40: and performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain the score values corresponding to the grading levels of the fields to be graded.
In a specific implementation, aggregation operation is performed according to a mean value based on the sensitive word field level in the target two-dimensional table, and the score value corresponding to each grading level of the field to be graded is obtained, as shown in table 2.
TABLE 2 Score values corresponding to each grading level of the fields to be graded
Step S50: and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded.
It should be noted that, after obtaining the score values corresponding to the classification levels of the field to be classified, comparing the score values corresponding to the classification levels of the field to be classified to obtain a comparison result; and determining the grading level corresponding to the field to be graded according to the comparison result.
The grading level of each field to be graded is obtained by taking, for each row, the maximum score value and the grading level corresponding to that maximum, as shown in table 3.
Table 3 Grading level table of each field to be graded
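Continuing the pandas sketch given after table 1, steps S40 and S50 reduce the target two-dimensional table to a score per grading level and then pick the level with the maximum score (an illustrative sketch, not the only possible implementation):

```python
import pandas as pd

# Step S40: aggregate by mean within each sensitive word field level — transpose,
# group the column MultiIndex by its "grade" level, then transpose back.
score_table = target_table.T.groupby(level="grade").mean().T

# Step S50: for each field to be graded, take the maximum score and the
# grading level at which that maximum occurs.
grading = pd.DataFrame({
    "score": score_table.max(axis=1),
    "level": score_table.idxmax(axis=1),
})
print(grading)
```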
The embodiment constructs a sensitive word library and determines the sensitive word field level of sensitive word fields in the sensitive word library; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing aggregation operation according to the average value based on the sensitive word field levels in the target two-dimensional table to obtain the score value corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded. According to the method, a sensitive word library is constructed, the level of a sensitive word field is determined, a similarity incidence matrix of a field to be classified and the sensitive word field is constructed in a text semantic matching mode, then the similarity incidence matrix is converted into a target two-dimensional table, and aggregation operation is carried out according to the level of the sensitive word field in the target two-dimensional table, so that a score value corresponding to each classification level of the field to be classified is obtained; and comparing the score values corresponding to the grading levels of the fields to be graded to obtain a comparison result, and determining the grading level corresponding to the fields to be graded according to the comparison result.
Referring to fig. 4, fig. 4 is a flowchart illustrating a second embodiment of the data classification method according to the present invention.
Based on the first embodiment, in this embodiment, the step S20 includes:
step S201: and converting the field to be classified and the sensitive word field into 768-dimensional word vectors based on the BERT model, and obtaining a field matrix to be classified and a sensitive word field matrix.
It should be explained that, based on the BERT model, the field to be classified and the sensitive word field are converted into 768-dimensional word vectors. For example, the word vector corresponding to payroll is: [ -0.04097468 0.02901088 0.01454205 0.04620046 0.03558226 … … ], and the word vector corresponding to employee income is: [ -0.00527354 0.03050051 0.02337652 0.05430245 0.07561858 … … ].
Step S202: and carrying out Hadamard product operation on the corresponding word vector in the field matrix to be classified and the corresponding word vector in the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
It can be understood that the word vector corresponding to the field matrix to be classified is 768-dimensional word vector converted by the field to be classified based on the BERT model, namely, the field vector to be classified; the corresponding word vector in the sensitive word field matrix is 768-dimensional word vector converted by the sensitive word field based on the BERT model, namely the sensitive word field vector.
In a specific implementation, dimensions can be added to the field matrix to be classified and the sensitive word field matrix according to a preset mode, so as to obtain a first matrix and a second matrix; and then carrying out Hadamard product batch operation on the corresponding word vectors in the first matrix and the corresponding word vectors in the second matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
It should be noted that, in the foregoing preset manner, assuming that the number of fields to be classified is n, that is, there are n field vectors to be classified, the corresponding field matrix to be classified has shape (n, 768), and adding one dimension along the axis=1 direction yields the first matrix of shape (n, 1, 768). The number of sensitive word fields is m, that is, there are m sensitive word field vectors, the corresponding sensitive word field matrix has shape (m, 768), and adding one dimension along the axis=0 direction yields the second matrix of shape (1, m, 768). The Hadamard product operation is then carried out with the broadcasting mechanism of Numpy/Tensor and the result is summed along the axis=2 direction, which yields the similarity association matrix (n, m) of the fields to be classified and the sensitive word fields.
However, since the values of n and m are large, computing the similarity association matrix in a single batch will cause memory overflow, so small-batch matrix operations are required. For example, let the batch size of the fields to be classified be bs1 and the batch size of the sensitive word fields be bs2; a similarity matrix of size (bs1, bs2) is then calculated for each small batch. Before performing the small-batch matrix calculation, the field matrix to be classified (n, 768) is split into ⌈n/bs1⌉ small matrices of shape (bs1, 768), and the sensitive word field matrix (m, 768) is split into ⌈m/bs2⌉ small matrices of shape (bs2, 768).
Thus, in a specific implementation, the batch value of the fields to be classified and the batch value of the sensitive words can be determined according to the number of the fields to be classified and the number of the sensitive words; splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified; splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value; respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix; and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
It will be appreciated that small-batch matrix operations may be implemented with the Numpy, Paddle and PyTorch frameworks, with the time taken for a single batch calculation increasing as the batch size increases. If n = m = 4500 is taken to complete the full similarity matrix operation, the total time consumed by Numpy and PyTorch is only slightly affected by the batch size; it is therefore recommended to use the Numpy framework to implement the similarity matrix calculation, and the batch size can be set to about 1000.
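A hedged Numpy sketch of the small-batch computation (assuming F of shape (n, 768) and S of shape (m, 768) are the normalized field and sensitive word matrices from the earlier sketch, with the default batch sizes reflecting the roughly 1000 recommended above):

```python
import numpy as np

def batched_similarity(F: np.ndarray, S: np.ndarray,
                       bs1: int = 1000, bs2: int = 1000) -> np.ndarray:
    """Compute the (n, m) similarity association matrix block by block."""
    n, m = F.shape[0], S.shape[0]
    sim = np.empty((n, m), dtype=F.dtype)
    for i in range(0, n, bs1):                     # ceil(n / bs1) blocks of fields to grade
        fi = F[i:i + bs1][:, np.newaxis, :]        # (<=bs1, 1, 768)
        for j in range(0, m, bs2):                 # ceil(m / bs2) blocks of sensitive word fields
            sj = S[j:j + bs2][np.newaxis, :, :]    # (1, <=bs2, 768)
            sim[i:i + bs1, j:j + bs2] = (fi * sj).sum(axis=2)
    return sim

# sim = batched_similarity(F, S)  # keeps peak memory bounded by the block size
```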
According to the number of the fields to be classified and the number of the sensitive word fields, determining a batch value of the fields to be classified and a batch value of the sensitive word fields; splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified; splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value; respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix; and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a Numpy/Tensor broadcasting mechanism to obtain similarity incidence matrixes of the field to be classified and the sensitive word field.
Referring to fig. 5, fig. 5 is a flowchart of a third embodiment of the data classification method according to the present invention.
Based on the above embodiments, in this embodiment, after step S30, the method further includes:
step S301: and performing supervised learning training by taking the sensitive word field as input and the sensitive word level corresponding to the sensitive word field as output through a machine learning algorithm to obtain a multi-classification model.
Step S302: and determining a similarity critical threshold according to the multi-classification model.
The machine learning algorithm may be a Bayesian algorithm, a decision tree algorithm or another classification algorithm, which is not limited in this embodiment.
In a specific implementation, a similarity critical threshold is trained through a machine learning method, sensitive word fields are used as input, sensitive word levels corresponding to the sensitive word fields are used as output, and a training sample data set is constructed. Training to obtain a multi-classification model based on a supervised learning method, and determining a similarity critical threshold according to the multi-classification model.
A similarity critical threshold is thus obtained by training with a machine learning algorithm. When a similarity is smaller than the similarity critical threshold, the degree of matching between the semantics of the two short texts is low, and the corresponding entry of the similarity association matrix is therefore filled with a missing value.
Step S303: and replacing the similarity smaller than the similarity critical threshold value in the target two-dimensional table with a missing value to obtain a new target two-dimensional table.
Referring to table 4, if the similarity critical threshold obtained by training is 0.6, the similarities smaller than 0.6 in the target two-dimensional table are replaced with missing values to obtain a new target two-dimensional table.
TABLE 4 New target two-dimensional Table
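Continuing the same pandas sketch, the thresholding of step S303 can be expressed as a simple mask (0.6 stands in for the similarity critical threshold produced by the trained multi-classification model, and NaN plays the role of the missing value):

```python
threshold = 0.6  # similarity critical threshold from the multi-classification model (assumed)
new_target_table = target_table.where(target_table >= threshold)  # below threshold -> NaN

# The subsequent mean aggregation skips NaN automatically; a grading level whose
# similarities are all missing ends up as NaN, i.e. "None" as in table 5.
score_table = new_target_table.T.groupby(level="grade").mean().T
```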
Then, the aggregation operation is performed by mean according to the sensitive word field levels in the new target two-dimensional table, and the maximum value and the index of the maximum value are taken for each row of the new two-dimensional table to obtain the optimized grading levels, thereby reducing the data volume and improving the data grading efficiency, for example, as shown in table 5.
Table 5 optimized hierarchical level table
As can be seen from table 5, employee income and project prize may be assigned grading level L5; contact information and e-mail may be assigned grading level L4; gender age may be assigned grading level L3; for birth year and month and home address, the matched sensitive word field level is None, so the corresponding grading level is lower than any sensitive word field level, and the assignable data grading level is L2 or L1.
The embodiment constructs a sensitive word library and determines the sensitive word field level of sensitive word fields in the sensitive word library; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing supervised learning training by taking the sensitive word field as input and the sensitive word level corresponding to the sensitive word field as output through a machine learning algorithm to obtain a multi-classification model; determining a similarity critical threshold according to the multi-classification model; replacing the similarity smaller than the similarity critical threshold value in the target two-dimensional table with a missing value to obtain a new target two-dimensional table; performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded. According to the method, a sensitive word library is constructed, the level of a sensitive word field is determined, a similarity incidence matrix of a field to be classified and the sensitive word field is constructed in a text semantic matching mode, then the similarity incidence matrix is converted into a target two-dimensional table, then the similarity smaller than the similarity critical threshold value in the target two-dimensional table is replaced by a missing value, a new target two-dimensional table is obtained, aggregation operation is carried out according to the level of the sensitive word field in the target two-dimensional table, and the score value corresponding to each classification level of the field to be classified is obtained; compared with the prior art, the method and the device effectively improve the accuracy and the efficiency of data classification.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a data grading program, and the data grading program realizes the steps of the data grading method when being executed by a processor.
Referring to fig. 6, fig. 6 is a block diagram illustrating a data hierarchy system according to the present invention.
As shown in fig. 6, a data grading system according to an embodiment of the present invention includes: a word stock construction module 601, a matrix construction module 602, a matrix conversion module 603, an aggregation operation module 604 and a level determination module 605.
The word stock construction module 601 is configured to construct a sensitive word stock, and determine a sensitive word field level of a sensitive word field in the sensitive word stock.
The matrix construction module 602 is configured to construct a similarity association matrix of the field to be ranked and the sensitive word field in a text semantic matching manner.
The matrix conversion module 603 is configured to convert the similarity association matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be ranked.
The aggregation operation module 604 is configured to perform an aggregation operation according to the sensitive word field level in the target two-dimensional table, and obtain a score value corresponding to each classification level of the field to be classified.
The level determining module 605 is configured to determine a classification level corresponding to the field to be classified according to the score value corresponding to each classification level of the field to be classified.
The system constructs a sensitive word library and determines the sensitive word field level of sensitive word fields in the sensitive word library; constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode; converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified; performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded; and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded. According to the method, a sensitive word library is constructed, the level of a sensitive word field is determined, a similarity incidence matrix of a field to be classified and the sensitive word field is constructed in a text semantic matching mode, then the similarity incidence matrix is converted into a target two-dimensional table, and aggregation operation is carried out according to the level of the sensitive word field in the target two-dimensional table, so that a score value corresponding to each classification level of the field to be classified is obtained; and determining the grading level corresponding to the field to be graded through the score value, and compared with the prior art, the accuracy and the efficiency of data grading are effectively improved.
Based on the above-described first embodiment of the data classification system of the present invention, a second embodiment of the data classification system of the present invention is proposed.
In this embodiment, the matrix construction module 602 is further configured to convert, based on a BERT model, a field to be classified and the sensitive word field into 768-dimensional word vectors, to obtain a field matrix to be classified and a sensitive word field matrix; and carrying out Hadamard product operation on the corresponding word vector in the field matrix to be classified and the corresponding word vector in the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
The matrix construction module 602 is further configured to add dimensions to the field matrix to be classified and the sensitive word field matrix according to a preset manner, so as to obtain a first matrix and a second matrix; and carrying out Hadamard product batch operation on the word vectors corresponding to the first matrix and the word vectors corresponding to the second matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
The matrix construction module 602 is further configured to determine a batch value of fields to be classified and a batch value of sensitive words according to the number of fields to be classified and the number of sensitive words; splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified; splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value; respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix; and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
Other embodiments or specific implementations of the data classification system of the present invention may refer to the above method embodiments, and are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A method of data classification, the method comprising the steps of:
constructing a sensitive word stock, and determining the sensitive word field level of sensitive word fields in the sensitive word stock;
constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode;
converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified;
performing aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain score values corresponding to each grading level of the field to be graded;
and determining the grading level corresponding to the field to be graded according to the grading value corresponding to each grading level of the field to be graded.
2. The data grading method according to claim 1, wherein the step of constructing a similarity association matrix of the field to be graded and the sensitive word field by means of text semantic matching includes:
converting the field to be classified and the sensitive word field into 768-dimensional word vectors based on a BERT model to obtain a field matrix to be classified and a sensitive word field matrix;
and carrying out Hadamard product operation on the corresponding word vector in the field matrix to be classified and the corresponding word vector in the sensitive word field matrix to obtain a similarity association matrix of the field to be classified and the sensitive word field.
3. The data classification method of claim 2, wherein the step of performing hadamard product operation on the word vector corresponding to the field matrix to be classified and the word vector corresponding to the sensitive word field matrix to obtain the similarity association matrix of the field to be classified and the sensitive word field comprises:
respectively adding dimensions to the field matrix to be classified and the sensitive word field matrix according to a preset mode to obtain a first matrix and a second matrix;
and carrying out Hadamard product batch operation on the word vectors corresponding to the first matrix and the word vectors corresponding to the second matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
4. The data classification method of claim 3, wherein the step of performing hadamard product operation on the word vector corresponding to the field matrix to be classified and the word vector corresponding to the sensitive word field matrix to obtain the similarity association matrix of the field to be classified and the sensitive word field further comprises:
determining a field batch value to be classified and a sensitive word field batch value according to the number of the fields to be classified and the number of the sensitive word fields;
splitting the field matrix to be classified into a plurality of target field matrices to be classified according to the field batch value to be classified;
splitting the sensitive word field matrix into a plurality of target sensitive word field matrixes according to the sensitive word field batch value;
respectively adding dimensions to the target field matrix to be classified and the target sensitive word field matrix according to the preset mode to obtain a third matrix and a fourth matrix;
and carrying out Hadamard product batch operation on the word vectors corresponding to the third matrix and the word vectors corresponding to the fourth matrix by using a broadcasting mechanism of Numpy/Tensor to obtain a similarity association matrix of the field to be classified and the sensitive word field.
5. The data ranking method of claim 1, wherein after the step of converting the similarity association matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be ranked, further comprising:
performing supervised learning training by taking the sensitive word field as input and the sensitive word level corresponding to the sensitive word field as output through a machine learning algorithm to obtain a multi-classification model;
determining a similarity critical threshold according to the multi-classification model;
and replacing the similarity smaller than the similarity critical threshold value in the target two-dimensional table with a missing value to obtain a new target two-dimensional table.
6. The data classification method according to claim 1, wherein the step of performing an aggregation operation according to the sensitive word field level in the target two-dimensional table to obtain the score value corresponding to each classification level of the field to be classified specifically includes:
and carrying out aggregation operation according to the average value based on the sensitive word field levels in the target two-dimensional table to obtain the score value corresponding to each grading level of the field to be graded.
7. The data grading method according to claim 1, wherein the step of determining the grading level corresponding to the field to be graded by the grading value corresponding to each grading level of the field to be graded includes:
comparing the score values corresponding to the grading levels of the fields to be graded to obtain a comparison result;
and determining the grading level corresponding to the field to be graded according to the comparison result.
8. A data staging system, the data staging system comprising:
the word stock construction module is used for constructing a sensitive word stock and determining the sensitive word field level of the sensitive word field in the sensitive word stock;
the matrix construction module is used for constructing a similarity incidence matrix of the field to be classified and the sensitive word field in a text semantic matching mode;
the matrix conversion module is used for converting the similarity incidence matrix into a target two-dimensional table based on the sensitive word field, the sensitive word field level and the field to be classified;
the aggregation operation module is used for carrying out aggregation operation according to the sensitive word field levels in the target two-dimensional table to obtain the score values corresponding to the grading levels of the field to be graded;
and the level determining module is used for determining the grading level corresponding to the field to be graded according to the score value corresponding to each grading level of the field to be graded.
9. A data grading apparatus, the apparatus comprising: a memory, a processor and a data grading program stored on the memory and executable on the processor, the data grading program being configured to implement the steps of the data grading method according to any of claims 1 to 7.
10. A storage medium having stored thereon a data grading program which, when executed by a processor, implements the steps of the data grading method according to any of claims 1 to 7.
CN202310945859.9A 2023-07-31 2023-07-31 Data grading method, system, equipment and storage medium Active CN116681086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310945859.9A CN116681086B (en) 2023-07-31 2023-07-31 Data grading method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310945859.9A CN116681086B (en) 2023-07-31 2023-07-31 Data grading method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116681086A true CN116681086A (en) 2023-09-01
CN116681086B CN116681086B (en) 2024-04-02

Family

ID=87782243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310945859.9A Active CN116681086B (en) 2023-07-31 2023-07-31 Data grading method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116681086B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977222A (en) * 2019-03-05 2019-07-05 广州海晟科技有限公司 The recognition methods of data sensitive behavior
CN110826319A (en) * 2019-10-30 2020-02-21 维沃移动通信有限公司 Application information processing method and terminal equipment
CN111767394A (en) * 2020-06-24 2020-10-13 中国工商银行股份有限公司 Abstract extraction method and device based on artificial intelligence expert system
CN113139379A (en) * 2020-01-20 2021-07-20 中国电信股份有限公司 Information identification method and system
US20210224481A1 (en) * 2017-04-07 2021-07-22 Ping An Technology(Shenzhen) Co., Ltd. Method and apparatus for topic early warning, computer equipment and storage medium
CN114491018A (en) * 2021-12-23 2022-05-13 天翼云科技有限公司 Construction method of sensitive information detection model, and sensitive information detection method and device
CN115879455A (en) * 2022-12-29 2023-03-31 华润数字科技有限公司 Word emotion polarity prediction method and device, electronic equipment and storage medium
CN116049397A (en) * 2022-12-29 2023-05-02 北京霍因科技有限公司 Sensitive information discovery and automatic classification method based on multi-mode fusion
CN116150349A (en) * 2021-11-18 2023-05-23 上海数据交易中心有限公司 Data product security compliance checking method, device and server

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210224481A1 (en) * 2017-04-07 2021-07-22 Ping An Technology(Shenzhen) Co., Ltd. Method and apparatus for topic early warning, computer equipment and storage medium
CN109977222A (en) * 2019-03-05 2019-07-05 广州海晟科技有限公司 The recognition methods of data sensitive behavior
CN110826319A (en) * 2019-10-30 2020-02-21 维沃移动通信有限公司 Application information processing method and terminal equipment
CN113139379A (en) * 2020-01-20 2021-07-20 中国电信股份有限公司 Information identification method and system
CN111767394A (en) * 2020-06-24 2020-10-13 中国工商银行股份有限公司 Abstract extraction method and device based on artificial intelligence expert system
CN116150349A (en) * 2021-11-18 2023-05-23 上海数据交易中心有限公司 Data product security compliance checking method, device and server
CN114491018A (en) * 2021-12-23 2022-05-13 天翼云科技有限公司 Construction method of sensitive information detection model, and sensitive information detection method and device
CN115879455A (en) * 2022-12-29 2023-03-31 华润数字科技有限公司 Word emotion polarity prediction method and device, electronic equipment and storage medium
CN116049397A (en) * 2022-12-29 2023-05-02 北京霍因科技有限公司 Sensitive information discovery and automatic classification method based on multi-mode fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUANG Wei et al.: "Research on a Sensitive Entity Recognition Model for Heterogeneous Social Networks Based on Breadth Learning" (基于广度学习的异构社交网络敏感实体识别模型研究), 情报学报 (Journal of the China Society for Scientific and Technical Information), vol. 39, no. 06, pages 579-588 *

Also Published As

Publication number Publication date
CN116681086B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US7756535B1 (en) Lightweight content filtering system for mobile phones
US9817885B2 (en) Method and apparatus for grouping network service users
US11748452B2 (en) Method for data processing by performing different non-linear combination processing
CN111611514B (en) Page display method and device based on user login information and electronic equipment
CN111639247A (en) Method, apparatus, device and computer-readable storage medium for evaluating quality of review
CN112328909A (en) Information recommendation method and device, computer equipment and medium
CN112015562A (en) Resource allocation method and device based on transfer learning and electronic equipment
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN111179055B (en) Credit line adjusting method and device and electronic equipment
CN111190967B (en) User multidimensional data processing method and device and electronic equipment
CN113127621A (en) Dialogue module pushing method, device, equipment and storage medium
US11947617B2 (en) Assigning variants of content to users while maintaining a stable experimental population
CN111626783B (en) Offline information setting method and device for realizing event conversion probability prediction
CN116681086B (en) Data grading method, system, equipment and storage medium
CN111626898B (en) Method, device, medium and electronic equipment for realizing attribution of events
CN114266255B (en) Corpus classification method, apparatus, device and storage medium based on clustering model
WO2022213662A1 (en) Application recommendation method and system, terminal, and storage medium
CN113612777B (en) Training method, flow classification method, device, electronic equipment and storage medium
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN113327133B (en) Data recommendation method, data recommendation device, electronic equipment and readable storage medium
US11983152B1 (en) Systems and methods for processing environmental, social and governance data
JP7355322B1 (en) Email element setting system and email subject setting support system
US20240220902A1 (en) Systems and methods for automatic handling of score revision requests
CN111382244B (en) Deep retrieval matching classification method and device and terminal equipment
CN116861289A (en) Data classification method, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant