CN111931229A - Data identification method and device and storage medium - Google Patents

Data identification method and device and storage medium

Info

Publication number
CN111931229A
Authority
CN
China
Prior art keywords
column
determining
columns
correlation number
similarity
Prior art date
Legal status
Granted
Application number
CN202010664475.6A
Other languages
Chinese (zh)
Other versions
CN111931229B (en
Inventor
李可
张盼
Current Assignee
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN202010664475.6A priority Critical patent/CN111931229B/en
Publication of CN111931229A publication Critical patent/CN111931229A/en
Application granted granted Critical
Publication of CN111931229B publication Critical patent/CN111931229B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6227 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G06F 40/174 Form filling; Merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G06F 40/186 Templates

Abstract

The invention discloses a data identification method, a data identification device, and a storage medium. The method includes: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy, where the first table includes at least one column, the first column feature vector set includes a feature vector for each of the columns, and each feature vector includes character features of the corresponding column; identifying the first column feature vector set with a preset recognition model to obtain a first analysis result vector, where the recognition model includes at least one classifier model, each classifier model identifies a corresponding class of table, and the first analysis result vector includes, for each column, the analysis result produced by the corresponding classifier model; and determining, from the first analysis result vector, the similarity between the first table and each class of table among at least one class of table, and determining a recognition result from the determined similarities, where the recognition result represents the table type corresponding to the first table.

Description

Data identification method and device and storage medium
Technical Field
The present invention relates to data identification technologies, and in particular, to a data identification method, apparatus, and computer-readable storage medium.
Background
A table is a means of organizing and collating data and may contain many kinds of data; sensitive-data analysis techniques therefore include sensitive-table identification. In the related art, table data is identified mainly by a keyword-content matching scheme: the user must supply the table file to be protected in advance, digest and keyword-matching techniques record specific contents of that table, and detection then analyzes whether the same contents are hit in a candidate table. Although this approach has high recognition accuracy, it detects poorly when the content has changed.
Disclosure of Invention
In view of the above, the present invention provides a data identification method, apparatus and computer readable storage medium.
To achieve the above object, the technical solution of the invention is realized as follows:
the embodiment of the invention provides a data identification method, which comprises the following steps:
acquiring a first table;
determining a first column of feature vector sets of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set comprises feature vectors of columns in the at least one column; the feature vector comprises character features of a respective column;
recognizing the first-column characteristic vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model comprises at least one classifier model; each classifier model in the at least one classifier model is used for identifying a corresponding class table; the first analysis result vector comprises the analysis results of each column in at least one column corresponding to the corresponding classifier model;
determining the similarity between the first table and various tables in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity; the identification result represents the table type corresponding to the first table.
In the above scheme, the method further comprises: training at least one classifier model; training the classifier model includes:
obtaining at least one sample table;
determining a sample column characteristic vector set corresponding to each sample table in the at least one sample table according to a preset characteristic obtaining strategy;
performing similar-column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set;
and training according to the training data set and the labels corresponding to the columns in the training data set to obtain a classifier model.
In the foregoing scheme, the performing similar column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set includes:
determining at least one column of corresponding feature vectors according to the sample column feature vector set corresponding to each sample table;
clustering the characteristic vectors corresponding to the at least one column to obtain at least one cluster as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
In the foregoing solution, the identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector includes:
performing similar-column merging on the columns in the first column feature vector set to obtain a second column feature vector set;
identifying the second column feature vector set to obtain the first analysis result vector;
the determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector comprises:
determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the number of columns of the first table, the second column correlation number represents the number of columns of the corresponding class table, and the third column correlation number represents the number of columns of the first table which are common to the corresponding class table;
determining the correlation number of a fourth column corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column types in various tables; the fourth column correlation number represents the number of similar columns of the first table, the classification result of which is a corresponding column category;
determining the number of columns included in the corresponding cluster corresponding to the classification result in the corresponding class table as a fifth column correlation number;
and determining the similarity between the first table and the corresponding table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
In the foregoing solution, the preset feature obtaining policy includes:
determining the content value of at least one column corresponding to at least one row in the table;
extracting a characteristic vector of at least one column according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character-related features of corresponding columns in the table;
and obtaining a column characteristic vector set corresponding to the table according to the at least one column of characteristic vectors.
In the foregoing solution, the determining the recognition result according to the determined similarity includes:
determining the similarity between the first table and at least one type of table;
determining the category of the table with the similarity exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding a preset similarity threshold, and obtaining an identification result based on a sorting result.
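As a non-authoritative Python sketch of the threshold-and-sort step just described (the threshold value, function names, and dictionary interface are illustrative assumptions, not part of the disclosure):

```python
def recognition_result(similarities, threshold=0.5):
    """Keep the table categories whose similarity to the first table
    exceeds the preset similarity threshold, sort them in descending
    order of similarity, and return the ranked category list; the top
    entry is taken as the recognition result."""
    candidates = {cat: sim for cat, sim in similarities.items() if sim > threshold}
    return sorted(candidates, key=candidates.get, reverse=True)

# Hypothetical similarity scores for three table categories:
ranked = recognition_result({"finance": 0.83, "hr": 0.61, "logs": 0.12})
# ranked == ["finance", "hr"]; "finance" is the recognized table type
```

If no category clears the threshold, the ranked list is empty, which would correspond to the first table matching none of the known table types.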
The embodiment of the invention provides a data identification device, which comprises an acquisition unit, a processing unit, and an identification unit, wherein:
the acquiring unit is used for acquiring a first table;
the processing unit is used for determining a first column of feature vector sets of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set comprises feature vectors of columns in the at least one column; the feature vector comprises character features of a respective column;
the identification unit is used for identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model comprises at least one classifier model; each classifier model in the at least one classifier model is used for identifying a corresponding class of table; the first analysis result vector comprises the analysis results of each column in at least one column corresponding to the corresponding classifier model;
determining the similarity between the first table and various tables in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity; the identification result represents the table type corresponding to the first table.
In the above scheme, the apparatus further comprises: a preprocessing unit for training at least one classifier model;
the preprocessing unit is specifically configured to obtain at least one sample table;
determining a sample column characteristic vector set corresponding to each sample table in the at least one sample table according to a preset characteristic obtaining strategy;
performing similar-column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set;
and training according to the training data set and the labels corresponding to the columns in the training data set to obtain a classifier model.
In the foregoing solution, the preprocessing unit is configured to determine, according to a sample column feature vector set corresponding to each sample table, a feature vector corresponding to at least one column;
clustering the characteristic vectors corresponding to the at least one column to obtain at least one cluster as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
In the foregoing solution, the identifying unit is configured to perform similar-column merging on the columns in the first column feature vector set to obtain a second column feature vector set, and to identify the second column feature vector set to obtain a first analysis result vector;
the identification unit is further used for determining a first column correlation number, a second column correlation number and a third column correlation number; the first column correlation number represents the number of columns of the first table, the second column correlation number represents the number of columns of the corresponding class table, and the third column correlation number represents the number of columns of the first table which are common to the corresponding class table;
determining the correlation number of a fourth column corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column types in various tables; the fourth column correlation number represents the number of similar columns of the first table, the classification result of which is a corresponding column category;
determining the number of columns included in the corresponding cluster corresponding to the classification result in the corresponding class table as a fifth column correlation number;
and determining the similarity between the first table and the corresponding table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
In the foregoing solution, the preset feature obtaining policy includes:
determining the content value of at least one column corresponding to at least one row in the table;
extracting a characteristic vector of at least one column according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character-related features of corresponding columns in the table;
and obtaining a column characteristic vector set corresponding to the table according to the at least one column of characteristic vectors.
In the foregoing solution, the identifying unit is specifically configured to determine a similarity between the first table and at least one type of table;
determining the category of the table with the similarity exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding a preset similarity threshold, and obtaining an identification result based on a sorting result.
The embodiment of the invention provides a data identification device, which comprises a processor and a memory for storing a computer program capable of running on the processor, wherein:
the processor is configured to execute the steps of any one of the above data recognition methods when the computer program is run.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the data identification method described in any one of the above.
The data identification method, device, and computer-readable storage medium provided by the embodiments of the invention include: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy, where the first table includes at least one column, the first column feature vector set includes a feature vector for each of the columns, and each feature vector includes character features of the corresponding column; identifying the first column feature vector set with a preset recognition model to obtain a first analysis result vector, where the recognition model includes at least one classifier model, each classifier model identifies a corresponding class of table, and the first analysis result vector includes, for each column, the analysis result produced by the corresponding classifier model; and determining, from the first analysis result vector, the similarity between the first table and each class of table among at least one class of table, and determining a recognition result from the determined similarities, where the recognition result represents the table type corresponding to the first table. The method thus performs recognition based on the character features of each column of the table; it has better recognition capability for sensitive-data scenarios of the same type but with different key information, and good generalization capability and robustness.
Drawings
Fig. 1 is a schematic flow chart of a data identification method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for training a classifier model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of analyzing features according to columns according to an embodiment of the present invention;
FIG. 4 is a flow chart illustrating merging of similar columns according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating another data identification method according to an embodiment of the present invention;
fig. 6 is a schematic flowchart of another data identification method according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a data recognition apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of another data identification device according to an embodiment of the present invention.
Detailed Description
In order to make the embodiments of the present application better understood, the technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, but not all embodiments.
The terms "first," "second," and "third," etc. in the description and claims of the present application and the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises" and "comprising," as well as any variations thereof, are intended to cover a non-exclusive inclusion, such as a list of steps or elements. A method, system, article, or apparatus is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, system, article, or apparatus.
The related art underlying the data identification method is described below.
As noted above, in the keyword-content matching scheme the table file to be protected is obtained in advance, digest and keyword-matching techniques record the specific contents of the table, and the matching stage analyzes whether the same contents are hit in a candidate table. This approach detects poorly when the content changes, and sensitive data of the same type but with different values is hard to detect; for example, if the specified sensitive content is "Zhang San, abc@hotmail.com", content such as "Li Si, efg@hotmail.com" cannot be identified.
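A minimal Python sketch of the digest-based keyword matching described above, illustrating why it misses same-type-but-changed content (the SHA-256 digest choice and the sample cell values are illustrative assumptions):

```python
import hashlib

def digest_keywords(cells):
    """Record the protected table's cell contents as a set of digests
    (standing in for the abstract/keyword-matching technique)."""
    return {hashlib.sha256(c.encode("utf-8")).hexdigest() for c in cells}

def hits(candidate_cells, protected_digests):
    """Count candidate cells whose digest matches a recorded digest."""
    return sum(
        hashlib.sha256(c.encode("utf-8")).hexdigest() in protected_digests
        for c in candidate_cells
    )

protected = digest_keywords(["Zhang San", "abc@hotmail.com"])
exact_hits = hits(["Zhang San", "abc@hotmail.com"], protected)  # 2: detected
changed_hits = hits(["Li Si", "efg@hotmail.com"], protected)    # 0: missed
```

Exact copies of the protected content are hit, but a row of the same type with different values produces zero hits, which is precisely the weakness the column-feature approach of the invention addresses.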
Based on this, in various embodiments of the present invention, a first table is obtained; determining a first column of feature vector sets of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set comprises feature vectors of columns in the at least one column; the feature vector comprises character features of a respective column; recognizing the first-column characteristic vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model comprises at least one classifier model; each classifier model in the at least one classifier model is used for identifying a corresponding class table; the first analysis result vector comprises the analysis results of each column in at least one column corresponding to the corresponding classifier model; determining the similarity between the first table and various tables in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity; the identification result represents the table type corresponding to the first table.
The present invention will be described in further detail with reference to examples.
Fig. 1 is a schematic flow chart of a data identification method according to an embodiment of the present invention; as shown in fig. 1, the data identification method is applied to a server, and the method includes:
step 101, acquiring a first table;
step 102, determining a first column feature vector set of the first table according to a preset feature acquisition strategy;
wherein the first table comprises at least one column;
the first column feature vector set comprises feature vectors of columns in the at least one column; the feature vector comprises character features of a respective column;
step 103, identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector;
wherein the recognition model comprises at least one classifier model; each classifier model in the at least one classifier model is used for identifying a corresponding class table;
the first analysis result vector comprises the analysis results of each column in at least one column corresponding to the corresponding classifier model;
step 104, determining the similarity between the first table and each table in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity;
and the identification result represents the table type corresponding to the first table.
Here, the data identification method is applicable to the identification of structured data; structured data is data logically expressed and realized by a two-dimensional table structure, and document tables and database tables are the most common structured data types. That is, the first table may be an office-document table, a database table, or the like.
In some embodiments, the recognition model comprises at least one classifier model;
the classifier model is a classifier for classifying the table; different classifier models (i.e., different classifiers) are used to identify different types of tables; the method may pre-train different classifier models to identify different types of tables.
Here, the table types may be set according to the user's needs. For example, if a user needs to identify financial tables, a classifier model for the corresponding financial tables may be trained. A financial table may follow a certain template (or several mutually similar templates); that is, the user may specify the table template, including which columns the table contains and which category each column belongs to.
The method further comprises the following steps: training at least one classifier model;
for each classifier model, training the classifier model, comprising:
obtaining at least one sample table;
determining a sample column characteristic vector set corresponding to each sample table in the at least one sample table according to a preset characteristic obtaining strategy;
performing similar-column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set;
and training according to the training data set and the labels corresponding to the columns in the training data set to obtain a classifier model.
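The disclosure does not name a concrete classification algorithm; as a self-contained illustration of the train-then-classify shape of the steps above, here is a toy nearest-centroid column classifier (the centroid approach and all names are assumptions, not the patented method):

```python
from collections import defaultdict

def train_column_classifier(training_set):
    """Train a toy nearest-centroid column classifier.

    `training_set` is a list of (feature_vector, column_label) pairs,
    i.e. the clustered/merged sample columns with their class labels.
    Returns a dict mapping label -> centroid vector."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for vec, label in training_set:
        if sums[label] is None:
            sums[label] = list(vec)
        else:
            sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] += 1
    return {lbl: [v / counts[lbl] for v in s] for lbl, s in sums.items()}

def classify_column(model, vec):
    """Assign a column to the label of the nearest centroid."""
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(model, key=lambda lbl: dist(model[lbl]))

# Hypothetical two-dimensional column feature vectors:
model = train_column_classifier([
    ([0.9, 0.1], "time"), ([0.8, 0.2], "time"),
    ([0.1, 0.9], "email"), ([0.2, 0.8], "email"),
])
label = classify_column(model, [0.85, 0.15])  # -> "time"
```

Any statistical classifier trained on labeled column feature vectors would fill the same role; the centroid model is chosen here only to keep the sketch dependency-free.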
A classifier model (also simply called a classifier) is the general term in data mining for a method of classifying samples; it is constructed from existing data using a statistical method or a classification algorithm.
In the embodiment of the invention, corresponding classifier models are trained aiming at different types of tables, so that when the method is used, at least one classifier model obtained by training can be used for identifying different tables.
The similar-column merging performed according to the sample column feature vector set corresponding to each sample table to obtain the training data set includes:
determining at least one column of corresponding feature vectors according to the sample column feature vector set corresponding to each sample table;
clustering the characteristic vectors corresponding to the at least one column to obtain at least one cluster as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
Specifically, similar-column merging refers to merging different columns of a table whose contents are the same or highly similar. For example, a table may contain both a "start time" column and an "end time" column; the contents of these two columns are often essentially identical, so the two columns are merged.
Here, the clustering the feature vectors corresponding to the at least one column to obtain at least one cluster includes:
determining a feature vector corresponding to each column in the at least one column;
performing cluster analysis using a clustering algorithm based on a similarity threshold or a density threshold to obtain at least one cluster, where the columns in each cluster are given the same class label; that is, columns grouped into the same cluster share the same label when later used for classifier training (the label here is the label of a column category; as above, "start time" and "end time" may both belong to a time-class cluster).
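A similarity-threshold clustering of column feature vectors might be sketched as follows (the cosine measure, the greedy assignment, the 0.95 threshold, and the sample vectors are all illustrative assumptions):

```python
def merge_similar_columns(col_features, threshold=0.95):
    """Greedily group columns whose feature vectors are highly similar,
    e.g. "start time" and "end time" columns with near-identical
    character statistics. Returns a list of groups of column names."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    groups = []  # each entry: (representative vector, [column names])
    for name, vec in col_features.items():
        for rep, members in groups:
            if cosine(rep, vec) >= threshold:
                members.append(name)
                break
        else:
            groups.append((vec, [name]))
    return [members for _, members in groups]

groups = merge_similar_columns({
    "start time": [1.0, 0.0, 0.2],
    "end time":   [1.0, 0.0, 0.25],
    "email":      [0.0, 1.0, 0.9],
})
# groups == [["start time", "end time"], ["email"]]
```

Each resulting group corresponds to one cluster, and every column in a group would receive the same class label for classifier training.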
In some embodiments, the identifying the first column feature vector set by using a preset recognition model to obtain a first analysis result vector includes:
performing similar-column merging on the columns in the first column feature vector set to obtain a second column feature vector set;
identifying the second column feature vector set to obtain the first analysis result vector;
the determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector comprises:
determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the number of columns of the first table, the second column correlation number represents the number of columns of the corresponding class table, and the third column correlation number represents the number of columns of the first table which are common to the corresponding class table;
determining the correlation number of a fourth column corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column types in various tables; the fourth column correlation number represents the number of similar columns of the first table, the classification result of which is a corresponding column category;
determining the number of columns included in the corresponding cluster corresponding to the classification result in the corresponding class table as a fifth column correlation number;
and determining the similarity between the first table and the corresponding table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
Specifically, suppose a first table is identified by a certain classifier model. The first table originally has A1 columns, and similar-column merging yields A2 column categories; the class table corresponding to that classifier model originally has B1 columns, and similar-column merging yields B2 column categories.
The first column correlation number is then A1 and the second column correlation number is B1; assume the third column correlation number is C1 (i.e., the number of columns common to the A1 columns and the B1 columns, where a common column means one of the same category, such as the time-class columns (start time, end time, etc.) described above).
Further, suppose A1 = 10: the first table has 5 time-class columns and 5 columns of other categories, and similar-column merging yields 6 column categories;
suppose the class table for that classifier model also has time-class columns, specifically 3 of them;
then, for the time-class category (i.e., one classification result), the fourth column correlation number is 5, and the fifth column correlation number is the number of columns in the cluster corresponding to the time-class category, i.e., 3;
for the other categories (i.e., the other classification results), the fourth column correlation number is the number of first-table columns whose classification result is that category, and the fifth column correlation number is the number of columns in the cluster corresponding to that category;
statistics are gathered in this way over the column categories of the first table, and the similarity is calculated.
Specifically, the similarity can be calculated by using the following formula:

$$\theta_b^a = \frac{SC(b, a)}{\max(k_b, k_a)}$$

wherein a corresponds to a classifier model, namely to a table of a certain class; b represents the first table; $k_b$ characterizes the first column correlation number, and $k_a$ characterizes the second column correlation number; SC(b, a) characterizes the third column correlation number:

$$SC(b, a) = \sum_{h=1}^{g} \min(n_b^h, n_a^h)$$

wherein $n_b^h$ characterizes the fourth column correlation number, $n_a^h$ characterizes the fifth column correlation number, and $\min(n_b^h, n_a^h)$ takes the minimum value of the two; g is the total number of column categories after similar columns are merged.
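As a concrete illustration of this calculation, the following sketch (function and variable names are illustrative, not from the embodiment) computes the similarity from the five column correlation numbers, reproducing the worked example above (5 time-category columns in the first table against a cluster of 3):

```python
def table_similarity(k_b, k_a, n_b, n_a):
    """k_b, k_a: original column counts of table b and the class-a table
    (first/second column correlation numbers); n_b[h], n_a[h]: per-column-category
    counts (fourth/fifth column correlation numbers). SC(b, a), the accumulated
    minimum, plays the role of the third column correlation number."""
    sc = sum(min(n_b.get(h, 0), n_a.get(h, 0)) for h in set(n_b) | set(n_a))
    return sc / max(k_b, k_a)

# Worked example from the text: A1 = 10 columns, 5 of the time category;
# the class table has 3 time-category columns.
score = table_similarity(10, 8, {"time": 5, "other": 5}, {"time": 3, "other": 5})
```

Here `score` is (min(5, 3) + min(5, 5)) / max(10, 8) = 0.8.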
In some embodiments, the preset feature obtaining policy includes:
determining the content value of at least one column corresponding to at least one row in the table;
extracting a characteristic vector of at least one column according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character-related features of corresponding columns in the table;
and obtaining a column characteristic vector set corresponding to the table according to the at least one column of characteristic vectors.
The preset feature acquisition strategy can be adopted for both the identified first table and the sample table during training.
Here, the server may use an open source tool or a tool library for data extraction to read contents of the corresponding table, determine the number of rows and columns of the corresponding table, determine content values of different rows and different columns, and determine the feature vector of at least one column according to the content values of different rows and different columns.
The feature vector includes: character features of the column; the character features specifically refer to statistical feature values at the character level in each column.
For example, the feature vector includes character-level statistical feature values of at least one of:
average length (i.e., averaging the lengths of the content values of a column in the table), median length (i.e., determining the median of the lengths of the content values), maximum length (i.e., determining the maximum of the lengths of the content values), minimum length (i.e., determining the minimum of the lengths of the content values), length variance (i.e., taking the variance of the lengths of the content values), average Chinese character proportion (i.e. determining the average proportion of Chinese in each content value), average capitalized English character proportion (i.e. determining the average proportion of capitalized English characters in each content value), average lowercase English character proportion (i.e. determining the average proportion of lowercase English characters in each content value), average number proportion (i.e. determining the average proportion of numbers in each content value), average other special character proportion (i.e. determining the average proportion of other special characters in each content value), and the like. Other special characters represent other characters except the English characters, Chinese characters and numeric characters; such as a special symbol.
That is, the above-described statistical characteristic value at the character level can be determined from the content value of each column.
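A minimal sketch of such character-level statistics (pure Python; the names and the exact set of features are illustrative — the embodiment does not prescribe an implementation):

```python
import statistics

def column_feature_vector(values):
    """Character-level statistical feature values of one column's content values."""
    lengths = [len(v) for v in values]

    def avg_ratio(is_kind):
        # average, over the content values, of the fraction of matching characters
        return sum(sum(map(is_kind, v)) / len(v) for v in values if v) / len(values)

    return {
        "avg_len": statistics.mean(lengths),
        "median_len": statistics.median(lengths),
        "max_len": max(lengths),
        "min_len": min(lengths),
        "len_var": statistics.pvariance(lengths),
        "digit_ratio": avg_ratio(str.isdigit),
        "upper_ratio": avg_ratio(str.isupper),
        "lower_ratio": avg_ratio(str.islower),
        # CJK Unified Ideographs block, standing in for "Chinese character proportion"
        "cjk_ratio": avg_ratio(lambda c: "\u4e00" <= c <= "\u9fff"),
    }
```

For example, `column_feature_vector(["abc", "A1"])` yields an average length of 2.5 and an average digit proportion of (0/3 + 1/2) / 2 = 0.25.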
In some embodiments, the determining the recognition result according to the determined similarity includes:
determining the similarity between the first table and at least one type of table;
determining the category of the table with the similarity exceeding a preset similarity threshold;
and sorting the categories of the table with the determined similarity exceeding a preset similarity threshold, and obtaining an identification result based on a sorting result.
Here, the recognition result based on the ranking result may be a category of a table determined to have the highest degree of similarity according to the ranking result as the category to which the first table belongs.
Specifically, in some embodiments, the method may further comprise: setting a similarity threshold value for each classifier model;
the similarity threshold can be set by developers based on experience and user requirements; the classifier model can also be detected by testing the positive and negative sample sets (namely, the positive and negative samples are identified by the classifier model to obtain corresponding test identification results), and the similarity threshold is adjusted based on the detection results to obtain the similarity threshold for each classifier model.
Here, the same or different similarity thresholds may be set for different classifier models. When the same similarity threshold is adopted, determining the category of the table with the highest similarity as the category to which the first table belongs; when different similarity thresholds are adopted, the category of the table which exceeds the corresponding similarity threshold and has the largest difference with the similarity threshold can be determined as the category to which the first table belongs.
FIG. 2 is a flowchart illustrating a method for training a classifier model according to an embodiment of the present invention; as shown in fig. 2, the training method of the classifier model includes:
step 201, obtaining a sample table set;
here, the set of sample tables includes: at least one sample table;
the form category of the sample form set is marked as category l; the category l may be a certain category required by the user, for example a form conforming to a template set by the user, the template comprising: start time, end time, event condition, and other columns.
Step 202, extracting the content of a sample table;
for each sample table, an open source tool or a tool library for data extraction may be specifically used to read the content of the corresponding sample table, determine the number of rows and columns of the corresponding sample table, and then determine the content values of different rows and different columns.
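As a stand-in for the "open source tool or tool library for data extraction" mentioned above, the following sketch reads a CSV-like table with the standard library and determines the row count, column count, and per-column content values (names are illustrative):

```python
import csv
import io

def read_table_columns(text):
    """Read a CSV-like table, determine its row and column counts, and collect the
    content values column by column."""
    rows = list(csv.reader(io.StringIO(text)))
    n_rows = len(rows)
    n_cols = max(len(r) for r in rows)
    # transpose: one list of content values per column (skipping short rows)
    columns = [[r[j] for r in rows if j < len(r)] for j in range(n_cols)]
    return n_rows, n_cols, columns
```

Real table files (xlsx, databases, etc.) would substitute an appropriate reader; only the row/column/content-value contract matters for the later steps.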
Step 203, analyzing the table characteristics according to the columns to obtain a characteristic matrix of the table;
here, for each sample table, analyzing the feature vectors of the columns in the sample table by columns in units of the columns of the sample table; and obtaining a feature matrix of the sample table according to the feature vectors of the columns. The feature vector comprises character features of each column; the character features specifically refer to statistical feature values at the character level in each column.
In some embodiments, as shown in fig. 3, analyzing the table features by columns to obtain a feature matrix of the table includes:
step 2031, randomly reading the content values of the j columns in n rows of records;
here, assuming the currently read sample table x has j columns, n rows are selected from the whole table;
the selection strategy can be any one of the first n rows, the last n rows, or n completely randomly selected rows of the sample table; the content values of the j columns in these n rows of records are read, that is, for each column j of the sample table, the content values of its n rows are read;
here, the n may be set by a developer based on user requirements; n is greater than or equal to 1;
step 2032, performing character feature statistics on the content values of j columns in the n-row records;
here, character feature statistics is performed on the content values of j columns in the n-row records, and the statistical feature values at the character level of specific statistics include, but are not limited to: average length (i.e., averaging the lengths of the respective content values), median length (i.e., determining the median of the lengths of the respective content values), maximum length (i.e., determining the maximum of the lengths of the respective content values), minimum length (i.e., determining the minimum of the lengths of the respective content values), length variance (i.e., taking the variance of the lengths of the respective content values), average Chinese character proportion (i.e. determining the average proportion of Chinese in each content value), average capitalized English character proportion (i.e. determining the average proportion of capitalized English characters in each content value), average lowercase English character proportion (i.e. determining the average proportion of lowercase English characters in each content value), average number proportion (i.e. determining the average proportion of numbers in each content value), average other special character proportion (i.e. determining the average proportion of other special characters in each content value), and the like; other special characters represent other characters except the English characters, Chinese characters and numeric characters; such as a special symbol.
Step 2033, generating a feature vector of the column j;
through the statistics of the above dimensions, the feature vector of column j of the sample table x is obtained and is denoted $v_{xj}$; that is, the feature vector of column j is obtained from the character-level statistical feature values of column j, and comprises the statistical feature values at each character level;
step 2034, generating a feature matrix of the sample table;
specifically, all the columns of the sample table x are processed according to the above steps 2031 to 2033 to obtain the feature vector $v_{xj}$ of each column of table x, and the feature matrix $V_x$ of sample table x is finally obtained from the feature vectors of the columns: $V_x = [v_{x1}, v_{x2}, \ldots, v_{xk}]$, where k represents the original number of columns of sample table x.
Here, through steps 202 and 203, the server may determine a sample column feature vector set corresponding to each sample table.
Step 204, merging similar columns;
here, consider that for the columns of a table there are cases where the contents of different columns are the same or highly similar, for example: in some class tables, two columns "start time" and "end time" exist at the same time, and the contents of the two columns are often identical; when the classifier model is subsequently trained with columns as classes, the two columns should belong to the same class, so the columns are subjected to merging preprocessing.
Specifically, the merging of similar columns includes: and merging similar columns according to the sample column characteristic vector set corresponding to each sample table.
In some embodiments, as shown in connection with fig. 4, the merging of similar columns includes:
2041, extracting the feature vector of each column;
if the user provides only one sample table file in category l (denoted by x), then the set of feature vectors of the columns, i.e., the feature matrix, is $V_l = V_x = [v_{x1}, v_{x2}, v_{x3}, \ldots, v_{xj}, \ldots, v_{xk}]$; that is, $v_{lj}$, the feature vector of column j in category l, satisfies $v_{lj} = v_{xj}$.
If there are multiple table files with the same structure and the tables in them are considered to belong to the same category, the set of feature vectors of the columns, i.e., the feature matrix, is $V_l = [v_{l1}, v_{l2}, \ldots, v_{lj}, \ldots, v_{lk}]$, wherein l represents the category number and i represents the sample table number within the category; the feature vector $v_{lj}$ of each column is the mean of the statistical feature values of the corresponding column of each table, i.e.

$$v_{lj} = \frac{1}{m}\sum_{i=1}^{m} v_{ij}$$

where m represents the number of tables.
2042, clustering by adopting a similarity threshold/density threshold based clustering algorithm;
here, the set $V_l = [v_{l1}, v_{l2}, v_{l3}, \ldots, v_{lj}, \ldots]$ of feature vectors of the columns of category l is taken as the input of the clustering, and a clustering algorithm based on a similarity threshold (such as agglomerative hierarchical clustering) or a clustering algorithm based on a density threshold (such as DBSCAN, Density-Based Spatial Clustering of Applications with Noise) is adopted for cluster analysis. The advantage of such algorithms is that the number of clusters does not need to be specified in advance; the clusters can be divided automatically according to the measurement threshold. After clustering, a clustering result $C_l = [c_1, c_2, \ldots, c_g]$ is obtained, where g denotes the total number of clusters; each cluster may include the feature vectors of at least one column.
2043, marking the columns in the same cluster as the same class labels;
original columns merged into the same cluster share the same class label in the subsequent classifier model training stage; here g ≤ k, i.e., the number of column clusters after merging is less than or equal to the number of original columns, where g represents the number of column categories after clustering.
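A dependency-free sketch of the merge step: a greedy single-link merge that transitively groups column feature vectors closer than a distance threshold, standing in for the agglomerative/DBSCAN clustering named above (the cluster count is not fixed in advance); all names are illustrative:

```python
import math

def merge_similar_columns(vectors, threshold):
    """Return a cluster label 0..g-1 for each column feature vector; columns whose
    vectors fall within `threshold` of each other (transitively) share a label."""
    labels = list(range(len(vectors)))
    changed = True
    while changed:
        changed = False
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                if labels[i] != labels[j] and math.dist(vectors[i], vectors[j]) < threshold:
                    new, old = labels[i], labels[j]
                    labels = [new if l == old else l for l in labels]
                    changed = True
    # renumber cluster labels to 0..g-1 in order of first appearance
    renumber = {}
    return [renumber.setdefault(l, len(renumber)) for l in labels]
```

For example, two near-identical "start time"/"end time" feature vectors receive the same label while a distant column keeps its own, so g ≤ k as stated above.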
Step 205, generating a training data set.
The feature matrices of all sample tables of category l are summarized to form a set of feature vectors as the training data set, written as $S_l = [V_1, V_2, \ldots, V_i, \ldots, V_m]$, wherein $V_i = [v_{i1}, v_{i2}, \ldots, v_{ik}]$, i represents the sample table number, k represents the number of original table columns, and m represents the number of tables in the category;
step 206, training a classifier model;
here, the step 206 includes: and training according to the training data set and the labels corresponding to the columns in the training data set to obtain a classifier model.
Specifically, $S_l$ is taken as the set of feature vectors and $C_l$ as the labels of the columns (the labels of the columns are preset), and they are input into a classifier algorithm (such as LightGBM) for training to obtain a multi-classification model $M_l$. LightGBM is a gradient boosting framework proposed by Microsoft; it is a learning algorithm based on decision trees and can be used for classification tasks.
The trained classifier model can, for an input feature vector of a column (denoted $v_y$), judge and identify the probability distribution of the feature vector $v_y$ over the merged cluster classes, $P = [p_y^1, p_y^2, \ldots, p_y^g]$, wherein g represents the number of column categories after the clusters are merged and $p_y^h$ characterizes the probability value that the column belongs to cluster h.
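The embodiment names LightGBM for this step; to keep the sketch dependency-free, the following uses a nearest-centroid stand-in for the multi-class column classifier $M_l$ (same interface idea: fit on column feature vectors against merged-cluster labels, then predict the attributed column category). All names are illustrative:

```python
import math
from collections import defaultdict

class NearestCentroidColumnClassifier:
    """Toy stand-in for the multi-class column model M_l (the embodiment uses LightGBM)."""

    def fit(self, X, y):
        sums, counts = defaultdict(list), defaultdict(int)
        for x, label in zip(X, y):
            if not sums[label]:
                sums[label] = list(x)
            else:
                sums[label] = [a + b for a, b in zip(sums[label], x)]
            counts[label] += 1
        self.centroids = {l: [v / counts[l] for v in s] for l, s in sums.items()}
        return self

    def predict(self, x):
        # attributed column category with the highest confidence (smallest distance)
        return min(self.centroids, key=lambda l: math.dist(x, self.centroids[l]))
```

A real deployment would swap in `lightgbm.LGBMClassifier` with the same fit/predict flow, which additionally yields the per-cluster probability distribution described above.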
Step 207, determining a similarity threshold value by using the positive and negative sample sets;
here, after the classification model $M_l$ is trained, the feature vector set extracted in advance for each sample through the same feature extraction steps can be input into the model $M_l$, and the similarity between each table and category l is identified and judged; the similarity calculation formula is:

$$\theta_r^l = \frac{SC(r, l)}{\max(k_r, k_l)}$$

wherein $k_r$ represents the number of original columns of the table of sample r, $k_l$ represents the number of original columns of the tables in category l, and SC(r, l) represents the number of common columns between the table of sample r and the tables of category l:

$$SC(r, l) = \sum_{h=1}^{g} \min(n_r^h, n_l^h)$$

wherein $n_r^h$ represents the number of columns in the sample table r whose classification result is column category h (which can be understood as a column category of the sample table); $n_l^h$ represents the number of original columns in the cluster class h obtained by similar-column merging of the tables of category l; $\min(n_r^h, n_l^h)$ takes the minimum value of the two. The comparison is accumulated column category by column category to obtain $\theta_r^l$; g represents the total number of column categories of the sample table r, and h represents a certain column category in category l.
All detection results of the positive and negative sample sets are comprehensively tested, and a reasonable similarity threshold $\theta_{lt}$ ($\theta_{lt} \in [0, 1]$) is divided according to experience (it can be divided by developers based on their experience, or determined by the server based on a preset dividing rule in combination with the test results of the positive and negative sample sets); $\theta_{lt}$ is the similarity threshold corresponding to category l (i.e., corresponding to the classifier model) used for similarity determination.
For example, the maximum θ satisfying a detection rate of 99% and a false alarm rate of less than 1% is taken; for an unknown table e to be analyzed, if its similarity value $\theta_e^l \geq \theta_{lt}$, then table e is considered highly similar to category l.
The positive and negative sample sets include: at least one positive test sample (e.g., the same table as class l) that is the same as the table for the corresponding class (e.g., class l), and at least one negative test sample (e.g., a different table from class l) that is different from the table for the corresponding class.
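One way to realize the threshold division from the positive/negative sample test results (the constraint values follow the 99% / 1% example above; the embodiment leaves the exact dividing rule open, so names and logic here are illustrative):

```python
def pick_similarity_threshold(pos_scores, neg_scores, min_detect=0.99, max_false=0.01):
    """Return the maximum theta_lt whose detection rate on positive-sample
    similarities is at least min_detect and whose false-alarm rate on
    negative-sample similarities is at most max_false; None if no candidate
    threshold satisfies both constraints."""
    # scan candidate thresholds from high to low: the first hit is the maximum
    for t in sorted(set(pos_scores) | set(neg_scores) | {1.0}, reverse=True):
        detect = sum(s >= t for s in pos_scores) / len(pos_scores)
        false_alarm = sum(s >= t for s in neg_scores) / len(neg_scores)
        if detect >= min_detect and false_alarm <= max_false:
            return t
    return None
```

Because both rates fall as the threshold rises, scanning downward and stopping at the first feasible candidate yields the maximum qualifying θ.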
Step 208, saving the classifier model;
here, the model $M_l$ trained from at least one sample table of category l and the set similarity threshold $\theta_{lt}$ are saved to persistent storage.
If a plurality of sample table categories exist, the above steps are repeated for each category to complete the storage of the classifier models and similarity thresholds of all the categories.
The method provided by the embodiment of the invention utilizes a machine learning algorithm: a classifier model is trained using the content statistics of each column of the sensitive table as features, the table to be recognized is analyzed column by column to obtain a similarity coefficient with respect to the sensitive table template corresponding to each classifier, and the similarity results of all the sensitive table classifiers are combined to obtain the final judgment result. Different from existing schemes in the industry, the method does not match specific content keywords, but analyzes the character-level statistical characteristics of each column (such as the average length, variance, and the proportions of Chinese, English, and special characters of each field). It therefore has very high generalization capability and can still effectively recognize scenes which lack common keywords but have highly similar contents (such as tables containing different name-telephone-address data). The method also retains good robustness when the column names of a homologous table are deleted, the column order is changed, or columns are appropriately added or deleted.
Fig. 5 is a schematic flowchart of another data identification method according to an embodiment of the present invention; as shown in fig. 5, the data identification method includes:
step 501, loading a plurality of classifier models trained in advance;
when the table to be analyzed needs to be judged, firstly reading and loading a plurality of classifier models to obtain a sensitive table classifier model set;
step 502, extracting the feature vectors of each column of the table to be analyzed to obtain a feature vector set of the table to be analyzed;
here, the content of the table e to be analyzed is read to obtain a feature vector set $V_e = [v_{e1}, v_{e2}, v_{e3}, \ldots, v_{ej}]$ of the table to be analyzed, where $v_{ej}$ characterizes the feature vector of the jth column of the table e to be analyzed; the feature vector includes the character-level statistical feature values of each column;
specifically, the feature vectors of the columns may be extracted according to the method for analyzing features by columns shown in fig. 3; and will not be described in detail herein.
Step 503, analyzing a feature vector set of the table to be analyzed through each classifier model;
assume that the set of sensitive table classifier models is $M = [M_1, M_2, \ldots, M_L]$ (the set includes L classification models, obtained by respectively training on the tables of the different categories l to obtain the corresponding classifier models $M_l$). For each classifier model $M_l$, each column feature vector $v_{ej}$ in $V_e$ is input into the model $M_l$ separately for analysis, and the attributed column category with the highest confidence is taken as the classification result, denoted $r_{ej}^l$, which characterizes the analysis and judgment result of the class-l classifier model $M_l$ on column j of table e. The results are finally summarized to obtain the analysis result vector $R_e^l = [r_{e1}^l, r_{e2}^l, \ldots, r_{ej}^l]$. By analogy, after $V_e$ has been analyzed by all models in sequence, the analysis result vector set $R_e = [R_e^1, R_e^2, \ldots, R_e^L]$ is obtained.
Step 504, calculating the similarity between the classification result and each table;
for table e, the similarity between the analysis result vector $R_e^l$ of each sensitive table model and the corresponding category l needs to be calculated; otherwise it cannot be known whether the table is a sensitive table or to which sensitive category it belongs.
The similarity calculation method is the same as the step of determining the similarity threshold using the test data set as described in the method of FIG. 2, i.e., the similarity is calculated using the formula

$$\theta_b^a = \frac{SC(b, a)}{\max(k_b, k_a)}, \qquad SC(b, a) = \sum_{h=1}^{g} \min(n_b^h, n_a^h)$$

and the similarity results of all the categories are summarized to obtain $\Theta_e = [\theta_e^1, \theta_e^2, \ldots, \theta_e^L]$.
Here, $\theta_b^a$ represents the similarity between the table b to be identified and the tables of category a; $k_b$ represents the number of original columns of the table b to be identified, and $k_a$ the number of original columns of the tables of category a; SC(b, a) represents the number of common columns between the tables of category a and the table b; $n_b^h$ represents the number of columns in the table b to be identified whose classification result is column category h; $n_a^h$ represents the number of original columns in cluster h after the tables of category a are merged; $\min(n_b^h, n_a^h)$ takes the minimum value of the two. The comparison is accumulated classification result by classification result (corresponding to the clusters after similar columns are merged) to finally obtain $\theta_e^a$.
Step 505, judging whether the similarity exceeds the threshold of each category;
that is, it is determined whether there exists a $\theta_e^l \in \Theta_e$ such that $\theta_e^l \geq \theta_{lt}$. If such a value exists, the table e is considered to belong to a sensitive table, the file to which table e belongs is a sensitive table file, and the specific attribution category is judged by the subsequent steps; otherwise, table e is not considered to belong to a sensitive table, and the judgment process ends;
step 506, selecting the most similar category as the attributed category;
here, the category z corresponding to the maximum similarity value $\theta_e^z$ in $\Theta_e$ is taken as the sensitive category of table e, and the judgment process ends.
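Steps 505 and 506 can be sketched as follows (illustrative names; as described earlier, the thresholds may differ per category):

```python
def classify_sensitive_table(similarities, thresholds):
    """similarities: {category l: theta_e^l}; thresholds: {category l: theta_lt}.
    Return the attributed sensitive category z, or None (the file is released)."""
    hits = {l: s for l, s in similarities.items() if s >= thresholds[l]}
    if not hits:
        return None  # table e does not belong to any sensitive table
    return max(hits, key=hits.get)  # category with the maximum similarity value
```

For instance, a table scoring 0.9 against a "contract" template with threshold 0.8 is attributed to that category even if it also weakly resembles others.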
The method provided by the embodiment of the invention uses a clustering algorithm to preprocess the features, and the features mainly adopt character-level features of the content; matching is performed by using a classification algorithm and a discrete-set similarity matching method. Specifically, in the classifier model training stage, content character feature analysis is carried out column by column on each type of sensitive table provided in advance, and a classifier is trained independently for each type of sensitive table using a classification algorithm. In the inference stage, the feature vectors of each column of content in the table are first extracted from the table file to be analyzed; all sensitive table classifiers analyze them, the similarity between the classification result and the corresponding class of sensitive table is calculated, and if the highest similarity value is greater than the threshold of the corresponding category, the table is determined to belong to that sensitive category. The method has good generalization capability and does not depend on specific keyword content; besides accurately recognizing sensitive tables with the same homologous table structure, it retains high recognition capability when column names are deleted, column order is exchanged, or columns are appropriately added or deleted, and it has good extensibility.
Fig. 6 is a schematic flow chart of another data identification method according to an embodiment of the present invention, as shown in fig. 6, where the data identification method is applied to a server, and the method includes:
601, training a model by a sensitive form learning module;
a user needs to provide a sensitive table sample file for training, the sensitive table sample file is distinguished according to categories, the sensitive table learning module counts characteristic vectors of all columns in all types of sensitive tables by reading in table contents, and then a corresponding sensitive table classifier model is obtained through training.
The category may be a table of a particular template of the user's needs.
Step 602, the sensitive form identification module identifies a form;
when a security product audits a form file and the form content needs to be analyzed to judge whether it is sensitive content, the sensitive form identification module loads the sensitive form classifier models, reads in the form to be identified, analyzes the content of each column of the form to be identified by using each classifier model in the sensitive form classifier model set, and finally summarizes the analysis results of all the classifier models to obtain the final judgment result; if the form is not a sensitive form, a release operation is executed; if the form is a sensitive form, an alarm is raised and a file blocking policy is carried out.
Fig. 7 is a schematic structural diagram of a data recognition apparatus according to an embodiment of the present invention; as shown in fig. 7, the apparatus includes: an acquisition unit, a processing unit and an identification unit; wherein,
the acquiring unit is used for acquiring a first table;
the processing unit is used for determining a first column of feature vector sets of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set comprises feature vectors of columns in the at least one column; the feature vector comprises character features of a respective column;
the identification unit is used for identifying the first row of characteristic vector sets by using a preset identification model to obtain a first analysis result vector; the recognition model comprises at least one classifier model; each classifier model in the at least one classifier model is used for identifying a corresponding class table; the first analysis result vector comprises the analysis results of each column in at least one column corresponding to the corresponding classifier model;
determining the similarity between the first table and various tables in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity; the identification result represents the table type corresponding to the first table.
In some embodiments, the apparatus further comprises: a preprocessing unit for training at least one classifier model;
the preprocessing unit is specifically configured to obtain at least one sample table;
determining a sample column characteristic vector set corresponding to each sample table in the at least one sample table according to a preset characteristic obtaining strategy;
similar column combination is carried out according to the sample column characteristic vector set corresponding to each sample table to obtain a training data set;
and training according to the training data set and the labels corresponding to the columns in the training data set to obtain a classifier model.
In some embodiments, the preprocessing unit is configured to determine, according to the sample column feature vector set corresponding to each sample table, at least one column of corresponding feature vectors;
clustering the characteristic vectors corresponding to the at least one column to obtain at least one cluster as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
In some embodiments, the identifying unit is configured to perform similar column merging on each column in the first column feature vector set to obtain a second column feature vector set; identifying the second-row characteristic vector set to obtain a first analysis result vector;
the identification unit is further used for determining a first column correlation number, a second column correlation number and a third column correlation number; the first column correlation number represents the number of columns of the first table, the second column correlation number represents the number of columns of the corresponding class table, and the third column correlation number represents the number of columns of the first table which are common to the corresponding class table;
determining the correlation number of a fourth column corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to different column types in various tables; the fourth column correlation number represents the number of similar columns of the first table, the classification result of which is a corresponding column category;
determining the number of columns included in the corresponding cluster corresponding to the classification result in the corresponding class table as a fifth column correlation number;
and determining the similarity between the first table and the corresponding table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
In some embodiments, the preset feature obtaining policy includes:
determining the content value of at least one column corresponding to at least one row in the table;
extracting a characteristic vector of at least one column according to the content value of at least one column corresponding to the at least one row; the feature vector comprises character-related features of corresponding columns in the table;
and obtaining a column characteristic vector set corresponding to the table according to the at least one column of characteristic vectors.
In some embodiments, the identification unit is specifically configured to determine the similarity between the first table and each of the at least one type of table;
determining the categories of tables whose similarity exceeds a preset similarity threshold;
and ranking the categories of tables whose determined similarity exceeds the preset similarity threshold, and obtaining the identification result based on the ranking result.
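A minimal Python sketch of the threshold-and-rank step just described. The dictionary representation and the choice of the top-ranked category as the identification result are illustrative assumptions:

```python
def identify(similarities, threshold):
    """similarities: {table category: similarity between it and the first table}.
    Keep categories whose similarity exceeds the preset threshold, rank them in
    descending order of similarity, and take the top-ranked category as the
    identification result (None if no category passes the threshold)."""
    passed = {cat: sim for cat, sim in similarities.items() if sim > threshold}
    ranking = sorted(passed, key=passed.get, reverse=True)
    return ranking[0] if ranking else None
```

With `identify({"invoice": 0.9, "payroll": 0.7, "log": 0.2}, 0.5)`, only "invoice" and "payroll" pass the threshold and "invoice" ranks first.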
It should be noted that, in the data recognition apparatus provided in the above embodiment, the division into program modules is merely illustrated by way of example when data recognition is performed; in practical applications, the processing may be allocated to different program modules as needed, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the data identification apparatus and the data identification method provided by the above embodiments belong to the same concept; the specific implementation processes thereof are described in detail in the method embodiments and are not repeated here.
Fig. 8 is a schematic structural diagram of another data identification device according to an embodiment of the present invention. The apparatus 80 comprises: a processor 801 and a memory 802 for storing a computer program operable on the processor; wherein the processor 801 is configured to perform the following steps when running the computer program: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set comprises feature vectors of columns in the at least one column; the feature vector comprises character features of a respective column; recognizing the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model comprises at least one classifier model; each classifier model in the at least one classifier model is used for identifying a corresponding class table; the first analysis result vector comprises the analysis results of each column in at least one column corresponding to the corresponding classifier model; determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity; the identification result represents the table type corresponding to the first table.
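The steps executed by the processor 801 can be sketched end to end as follows. Here `extract_features`, the classifier objects, the `similarity` callable, and all names are illustrative stand-ins for components described elsewhere in the specification, not a definitive implementation:

```python
def recognize_table(first_table, classifiers, class_tables,
                    extract_features, similarity):
    """End-to-end sketch of the recognition pipeline (assumed interfaces).

    classifiers: {category: classifier model exposing .predict(feature_vector)}
    class_tables: {category: per-category column statistics used for similarity}
    extract_features: preset feature acquisition strategy for a table
    similarity: callable scoring an analysis result against a class table
    """
    # Determine the first column feature vector set of the first table.
    feature_set = extract_features(first_table)
    # Recognize the feature vector set with each classifier model: the
    # first analysis result vector holds one analysis result per column.
    analysis = {
        category: [model.predict(fv) for fv in feature_set]
        for category, model in classifiers.items()
    }
    # Determine the similarity to each type of table and pick the best match
    # as the identification result (the table type of the first table).
    scores = {
        category: similarity(analysis[category], class_tables[category])
        for category in classifiers
    }
    return max(scores, key=scores.get)
```

The pipeline is deliberately parameterized so that the feature strategy, classifier models, and similarity measure from the earlier embodiments can be plugged in independently.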
It should be noted that the data identification device and the data identification method provided by the above embodiments belong to the same concept; the specific implementation processes thereof are described in detail in the method embodiments and are not repeated here.
In practical applications, the apparatus 80 may further include: at least one network interface 803. The various components in the data recognition device 80 are coupled together by a bus system 804. It is understood that the bus system 804 is used to enable connection and communication among these components. In addition to a data bus, the bus system 804 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are labeled as bus system 804 in Fig. 8. There may be at least one processor 801. The network interface 803 is used for wired or wireless communication between the data recognition apparatus 80 and other devices.
The memory 802 in the embodiment of the present invention is used to store various types of data to support the operation of the data recognition device 80.
The methods disclosed in the embodiments of the present invention described above may be applied to, or implemented by, the processor 801. The processor 801 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above methods may be performed by integrated logic circuits of hardware in the processor 801 or by instructions in the form of software. The processor 801 may be a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 801 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of the present invention may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, the storage medium being located in the memory 802; the processor 801 reads the information in the memory 802 and performs the steps of the aforementioned methods in combination with its hardware.
In an exemplary embodiment, the data recognition device 80 may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, Microcontroller Units (MCUs), microprocessors, or other electronic components, for performing the foregoing methods.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs: acquiring a first table; determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set comprises feature vectors of columns in the at least one column; the feature vector comprises character features of a respective column; recognizing the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model comprises at least one classifier model; each classifier model in the at least one classifier model is used for identifying a corresponding class table; the first analysis result vector comprises the analysis results of each column in at least one column corresponding to the corresponding classifier model; determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity; the identification result represents the table type corresponding to the first table.
When executed by the processor, the computer program implements the corresponding processes performed by the server in the methods according to the embodiments of the present invention; for brevity, details are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation, such as: multiple units or components may be combined, or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, all functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may separately serve as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by program instructions together with related hardware; the foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes: a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
Alternatively, if the above integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence, or the part contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic disk, an optical disk, or various other media capable of storing program code.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. that are within the spirit and principle of the present invention should be included in the present invention.

Claims (14)

1. A method of data identification, the method comprising:
acquiring a first table;
determining a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set comprises feature vectors of columns in the at least one column; the feature vector comprises character features of a respective column;
recognizing the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model comprises at least one classifier model; each classifier model in the at least one classifier model is used for identifying a corresponding class table; the first analysis result vector comprises the analysis results of each column in at least one column corresponding to the corresponding classifier model;
determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity; the identification result represents the table type corresponding to the first table.
2. The method of claim 1, further comprising: training at least one classifier model; training the classifier model includes:
obtaining at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set;
and training according to the training data set and the labels corresponding to the columns in the training data set to obtain a classifier model.
3. The method according to claim 2, wherein the performing similar column merging according to the sample column feature vector sets corresponding to the sample tables to obtain a training data set comprises:
determining feature vectors corresponding to at least one column according to the sample column feature vector set corresponding to each sample table;
clustering the feature vectors corresponding to the at least one column to obtain at least one cluster as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
4. The method of claim 1, wherein the recognizing the first column feature vector set by using a preset recognition model to obtain a first analysis result vector comprises:
performing similar column merging on the columns in the first column feature vector set to obtain a second column feature vector set;
and recognizing the second column feature vector set to obtain the first analysis result vector;
wherein the determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector comprises:
determining a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the number of columns of the first table, the second column correlation number represents the number of columns of the corresponding class table, and the third column correlation number represents the number of columns that the first table has in common with the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to a different column category in each type of table; the fourth column correlation number represents the number of similar columns of the first table whose classification result is the corresponding column category;
determining, as a fifth column correlation number, the number of columns included in the cluster corresponding to the classification result in the corresponding class table;
and determining the similarity between the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
5. The method according to claim 1 or 2, wherein the preset feature acquisition strategy comprises:
determining content values of at least one column corresponding to at least one row in the table;
extracting a feature vector of the at least one column according to the content values of the at least one column corresponding to the at least one row; the feature vector comprises character-related features of the corresponding column in the table;
and obtaining a column feature vector set corresponding to the table according to the feature vectors of the at least one column.
6. The method according to claim 1 or 4, wherein determining the recognition result according to the determined similarity comprises:
determining the similarity between the first table and each of the at least one type of table;
determining the categories of tables whose similarity exceeds a preset similarity threshold;
and ranking the categories of tables whose determined similarity exceeds the preset similarity threshold, and obtaining the identification result based on the ranking result.
7. A data recognition apparatus, the apparatus comprising: an acquisition unit, a processing unit, and an identification unit; wherein,
the acquiring unit is used for acquiring a first table;
the processing unit is configured to determine a first column feature vector set of the first table according to a preset feature acquisition strategy; the first table includes at least one column; the first column feature vector set comprises feature vectors of columns in the at least one column; the feature vector comprises character features of a respective column;
the identification unit is configured to recognize the first column feature vector set by using a preset recognition model to obtain a first analysis result vector; the recognition model comprises at least one classifier model; each classifier model in the at least one classifier model is used for identifying a corresponding class table; the first analysis result vector comprises the analysis results of each column in at least one column corresponding to the corresponding classifier model;
determining the similarity between the first table and each type of table in at least one type of table according to the first analysis result vector, and determining an identification result according to the determined similarity; the identification result represents the table type corresponding to the first table.
8. The apparatus of claim 7, further comprising: a preprocessing unit for training at least one classifier model;
the preprocessing unit is specifically configured to obtain at least one sample table;
determining a sample column feature vector set corresponding to each sample table in the at least one sample table according to a preset feature acquisition strategy;
performing similar column merging according to the sample column feature vector set corresponding to each sample table to obtain a training data set;
and training according to the training data set and the labels corresponding to the columns in the training data set to obtain a classifier model.
9. The apparatus according to claim 8, wherein the preprocessing unit is configured to determine feature vectors corresponding to at least one column according to the sample column feature vector set corresponding to each sample table;
clustering the feature vectors corresponding to the at least one column to obtain at least one cluster as the training data set; each cluster in the at least one cluster comprises at least one column and a feature vector corresponding to each column in the at least one column.
10. The apparatus according to claim 7, wherein the identification unit is configured to perform similar column merging on the columns in the first column feature vector set to obtain a second column feature vector set; and recognizing the second column feature vector set to obtain the first analysis result vector;
the identification unit is further configured to determine a first column correlation number, a second column correlation number, and a third column correlation number; the first column correlation number represents the number of columns of the first table, the second column correlation number represents the number of columns of the corresponding class table, and the third column correlation number represents the number of columns that the first table has in common with the corresponding class table;
determining a fourth column correlation number corresponding to each classification result in at least one classification result corresponding to the first table; each classification result corresponds to a different column category in each type of table; the fourth column correlation number represents the number of similar columns of the first table whose classification result is the corresponding column category;
determining, as a fifth column correlation number, the number of columns included in the cluster corresponding to the classification result in the corresponding class table;
and determining the similarity between the first table and the corresponding class table according to the first column correlation number, the second column correlation number, the third column correlation number, the fourth column correlation number and the fifth column correlation number.
11. The apparatus according to claim 7 or 8, wherein the preset feature obtaining strategy comprises:
determining content values of at least one column corresponding to at least one row in the table;
extracting a feature vector of the at least one column according to the content values of the at least one column corresponding to the at least one row; the feature vector comprises character-related features of the corresponding column in the table;
and obtaining a column feature vector set corresponding to the table according to the feature vectors of the at least one column.
12. The apparatus according to claim 7 or 10, wherein the identification unit is specifically configured to determine the similarity between the first table and each of the at least one type of table;
determining the categories of tables whose similarity exceeds a preset similarity threshold;
and ranking the categories of tables whose determined similarity exceeds the preset similarity threshold, and obtaining the identification result based on the ranking result.
13. A data recognition apparatus, the apparatus comprising: a processor and a memory for storing a computer program capable of running on the processor; wherein,
the processor is adapted to perform the steps of the method of any one of claims 1 to 6 when running the computer program.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202010664475.6A 2020-07-10 2020-07-10 Data identification method, device and storage medium Active CN111931229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010664475.6A CN111931229B (en) 2020-07-10 2020-07-10 Data identification method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010664475.6A CN111931229B (en) 2020-07-10 2020-07-10 Data identification method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111931229A true CN111931229A (en) 2020-11-13
CN111931229B CN111931229B (en) 2023-07-11

Family

ID=73312419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010664475.6A Active CN111931229B (en) 2020-07-10 2020-07-10 Data identification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111931229B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023157074A1 (en) * 2022-02-15 2023-08-24 日本電気株式会社 Teaching data generation assistance device, teaching data generation assistance system, teaching data generation method, and non-transitory computer-readable medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012006509A1 (en) * 2010-07-09 2012-01-12 Google Inc. Table search using recovered semantic information
US20170103107A1 (en) * 2015-10-09 2017-04-13 Informatica Llc Method, apparatus, and computer-readable medium to extract a referentially intact subset from a database
CN108021565A (en) * 2016-11-01 2018-05-11 ***通信有限公司研究院 A kind of analysis method and device of the user satisfaction based on linguistic level
US20180314883A1 (en) * 2017-04-26 2018-11-01 International Business Machines Corporation Automatic Detection on String and Column Delimiters in Tabular Data Files
CN109492640A (en) * 2017-09-12 2019-03-19 ***通信有限公司研究院 Licence plate recognition method, device and computer readable storage medium
CN109635633A (en) * 2018-10-26 2019-04-16 平安科技(深圳)有限公司 Electronic device, bank slip recognition method and storage medium
CN109710725A (en) * 2018-12-13 2019-05-03 中国科学院信息工程研究所 A kind of Chinese table column label restoration methods and system based on text classification
US20190163817A1 (en) * 2017-11-29 2019-05-30 Oracle International Corporation Approaches for large-scale classification and semantic text summarization
CN110222171A (en) * 2019-05-08 2019-09-10 新华三大数据技术有限公司 A kind of application of disaggregated model, disaggregated model training method and device
WO2019174130A1 (en) * 2018-03-14 2019-09-19 平安科技(深圳)有限公司 Bill recognition method, server, and computer readable storage medium
CN110647795A (en) * 2019-07-30 2020-01-03 正和智能网络科技(广州)有限公司 Form recognition method
CN111144282A (en) * 2019-12-25 2020-05-12 北京同邦卓益科技有限公司 Table recognition method and device, and computer-readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xie Jie: "Column Concept Determination for Chinese Web Tables via Convolutional Neural Network" *


Also Published As

Publication number Publication date
CN111931229B (en) 2023-07-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant