CN111488327A

CN111488327A - Data standard management method and system

Info

Publication number: CN111488327A
Application number: CN201910084384.2A
Authority: CN
Inventors: 徐瑞芬
Original assignee: Aspire Digital Technologies Shenzhen Co Ltd
Current assignee: Aspire Digital Technologies Shenzhen Co Ltd
Priority date: 2019-01-29
Filing date: 2019-01-29
Publication date: 2020-08-04
Anticipated expiration: 2039-01-29
Also published as: CN111488327B

Abstract

The invention discloses a data standard management method and system. The method comprises the following steps: parsing the data to be processed according to its data structure to obtain the corresponding original fields, wherein each field comprises a specific mark; traversing the original fields according to a preset classification principle and outputting a plurality of synonymous field sets, wherein each synonymous field set comprises a plurality of original fields with the same specific mark and their corresponding occurrence frequencies; selecting one original field from each synonymous field set to generate a candidate field; and selecting from the candidate fields according to external input to generate a standard field. The system performs the method. By parsing the data to be processed according to its data structure, the method performs a preliminary filtering that eliminates non-target data; synonymous fields can be counted by means of the preset classification rule; a suitable original field is selected from each set of synonymous fields to serve as the standard field; and a suitable data structure can thus be determined from the existing data to guide subsequent data maintenance.

Description

Data standard management method and system

Technical Field

The invention relates to the technical field of databases, in particular to a data standard management method and a data standard management system.

Background

Data elements are the smallest indivisible units of information in a database. If names such as "cell phone number", "contact phone number" and "phone number" are not unified, the corresponding fields in the physical tables may be named completely differently. If "mobile phone number" is adopted uniformly, with "PhoneNum" as its single physical field name, then "PhoneNum" serves as a standardized element; this can greatly reduce the total number of data elements used in an information system and greatly simplify the structure of that system.

For a brand-new database, the data can be kept simple by specifying the format of the data to be stored. For the overall improvement of an existing data set, however, most of the sorting is done manually, which is inefficient and error-prone.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. To this end, it is an object of the present invention to provide a data standard management method and system.

The technical scheme adopted by the invention is as follows:

In a first aspect, the present invention provides a data standard management method, including the steps of: parsing the data to be processed according to a data structure to obtain the corresponding original fields, wherein each field comprises a specific mark; traversing the original fields according to a preset classification principle and outputting a plurality of synonymous field sets, wherein each synonymous field set comprises a plurality of original fields with the same specific mark and their corresponding occurrence frequencies; selecting one original field from each synonymous field set to generate a candidate field; and selecting from the candidate fields according to external input to generate a standard field.

Preferably, the specific mark comprises a Chinese name and/or a physical name and/or an English name, the classification rule comprises the Chinese name, the physical name and the English name, and the original field with the same Chinese name and/or physical name and/or English name as the classification rule is marked as the original field with the same specific mark.

Preferably, the method further comprises the step of: obtaining a field mark, the field mark being used to select an original field from a synonymous field set as a candidate field.

Preferably, the method specifically comprises the following step: judging, for each synonymous field set, whether a field mark exists, and if not, taking the original field with the highest occurrence frequency as the candidate field.

Preferably, the method specifically comprises the following step: segmenting the specific marks in the candidate fields according to external input, and combining the segmented candidate fields to generate a standard field for the data to be processed.

Preferably, the method specifically comprises the following step: selecting a plurality of fields from the candidate fields according to external input, and combining them with a preset fixed data structure to generate a standard field for the data to be processed.

Preferably, the field marks comprise important marks and secondary marks; correspondingly, when an important mark exists in a synonymous field set, the corresponding original field is selected as the candidate field; when no important mark exists but a secondary mark exists in a synonymous field set, the original field corresponding to the secondary mark is selected as the candidate field.

In a second aspect, the present invention provides a data standard management system, comprising: an extraction module for parsing the data to be processed according to the data structure to obtain the corresponding original fields, each field comprising a specific mark; an induction module for traversing the original fields according to a preset classification principle and outputting a plurality of synonymous field sets, each synonymous field set comprising a plurality of original fields with the same specific mark and their corresponding occurrence frequencies; a selection module for selecting one original field from each synonymous field set to generate a candidate field; and a standard module for selecting from the candidate fields according to external input to generate a standard field.

In a third aspect, the present invention provides a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method as described above.

The invention has the beneficial effects that:

By parsing the data to be processed according to its data structure, the method performs a preliminary filtering that eliminates non-target data; synonymous fields can be counted by means of the preset classification rule; a suitable original field is selected from each set of synonymous fields to serve as the standard field; and a suitable data structure can thus be determined from the existing data to guide subsequent data maintenance.

Drawings

FIG. 1 is a schematic diagram of a data standard management method of the present invention;

FIG. 2 is a data standard processing flow diagram of the present invention;

FIG. 3 is a schematic diagram of a data standard management system of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

Example 1

The embodiment provides a data standard management method as shown in fig. 1, which includes the steps of:

S1, parsing the data to be processed according to the data structure to obtain the corresponding original fields, wherein each field comprises a specific mark;

S2, traversing the original fields according to a preset classification principle and outputting a plurality of synonymous field sets, wherein each synonymous field set comprises a plurality of original fields with the same specific mark and their corresponding occurrence frequencies;

S3, selecting one original field from each synonymous field set to generate a candidate field;

S4, selecting from the candidate fields according to external input to generate a standard field.

Step S1 locates the original fields by means of a known data structure (that is, the basic data structure of the data to be processed is known). Its aim is to exclude non-target data, which helps reduce the overall data processing cost.

Step S2 identifies keywords (based on the known data structure of S1) and then determines whether keywords are consistent or equivalent. This embodiment mainly checks for consistency: original fields that contain consistent keywords are classified into the same field set, and the occurrence frequency of each field across all the data to be processed is recorded at the same time.

Step S3 uses a suitable selection method to pick one original field from the synonymous fields as a candidate field for external selection. The selection method mainly consists of manually presetting a mark (the field mark); when a field matches the mark, it is considered more suitable than the other fields as the representative of its synonymous fields. The target the mark points to is mainly a keyword (i.e. a specific mark), and may also be a combination of keywords together with the rule of that combination.

Correspondingly, an additional step may be provided before step S3: configuring the importance level of the data tables, which describes the specific form of the mark and the importance level it denotes, for the purpose of determining the candidate fields.
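Steps S1 and S2 can be illustrated with a minimal Python sketch. The input layout (a list of entries with a hypothetical `kind` key and a `columns` list), the `OriginalField` type and both function names are assumptions made for illustration; the embodiment does not prescribe a concrete format. The sketch groups fields by a single mark (the Chinese name) and counts how often each naming variant occurs.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, NamedTuple


class OriginalField(NamedTuple):
    """Hypothetical record holding the three specific marks of a field."""
    chinese_name: str
    physical_name: str
    english_name: str


def extract_original_fields(model: Iterable[dict]) -> List[OriginalField]:
    """S1: walk the known structure and keep only column definitions of tables."""
    fields = []
    for entry in model:
        if entry.get("kind") != "table":  # non-target data is skipped
            continue
        for col in entry.get("columns", []):
            fields.append(OriginalField(
                col["chinese_name"], col["physical_name"], col["english_name"]))
    return fields


def build_synonym_sets(fields: Iterable[OriginalField],
                       mark: str = "chinese_name") -> Dict[str, Counter]:
    """S2: group fields whose chosen mark is identical and count each variant."""
    sets: Dict[str, Counter] = defaultdict(Counter)
    for f in fields:
        sets[getattr(f, mark)][f] += 1
    return dict(sets)


# Illustrative data only.
model = [
    {"kind": "table", "name": "t_user", "columns": [
        {"chinese_name": "手机号码", "physical_name": "PhoneNum",
         "english_name": "phone number"}]},
    {"kind": "table", "name": "t_order", "columns": [
        {"chinese_name": "手机号码", "physical_name": "PhoneNum",
         "english_name": "phone number"}]},
    {"kind": "table", "name": "t_log", "columns": [
        {"chinese_name": "手机号码", "physical_name": "MobileNo",
         "english_name": "mobile number"}]},
    {"kind": "index", "name": "idx_user_phone"},
]
fields = extract_original_fields(model)
synonym_sets = build_synonym_sets(fields)
```

Here the index entry is dropped in S1, and the single synonymous field set for "手机号码" records the `PhoneNum` variant twice and the `MobileNo` variant once.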

The present embodiment provides a data standard processing flow as shown in fig. 2:

101) the system connects to a business system database to obtain the data model of the live network, or an already established data model is imported directly (the aim is to acquire the data to be processed together with its data structure);

102) configure the importance level of data tables: if configuration is required, go to 103; otherwise go to 104;

103) mark the key data table structures in the system and configure their importance: 1 important, 2 more important (the aim is to determine in advance whether there are preferred candidate fields);

104) system analysis and calculation: the importance of each field is marked according to the importance of its table; the occurrence frequency of the Chinese word segments of the fields is counted, and the field and table names are recorded (that is, the data are parsed to obtain the original fields and the synonymous field sets);

105) the system automatically generates candidate data standards (that is, it selects original fields and generates candidate fields);

106) a decision is made manually on the candidate data standards, and the data element standards are established (that is, a selection is made from the candidate fields according to external input and standard fields are generated);

107) the data element standards are bound to the data model (the data are maintained according to the standard fields, which includes modifying or replacing existing data and constraining the entry format of subsequently added data).
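The analysis in step 104 can be sketched as follows. The embodiment does not name a word segmentation algorithm, so this sketch substitutes forward maximum matching against a small, hypothetical vocabulary (`VOCAB`); a real system would likely use a full Chinese segmenter. The field layout and the names `segment` and `analyze` are likewise assumptions.

```python
from collections import Counter

# Hypothetical vocabulary used only for this illustration.
VOCAB = {"手机", "号码", "联系", "电话"}


def segment(name, vocab=VOCAB, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(name):
        for size in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def analyze(fields, table_importance):
    """Mark each field with its table's importance and count word-segment frequency."""
    word_freq = Counter()
    annotated = []
    for f in fields:
        importance = table_importance.get(f["table"])  # None means the table is unmarked
        annotated.append({**f, "importance": importance})
        word_freq.update(segment(f["chinese_name"]))
    return annotated, word_freq


fields = [
    {"table": "t_user", "chinese_name": "手机号码"},
    {"table": "t_order", "chinese_name": "联系电话"},
    {"table": "t_log", "chinese_name": "电话号码"},
]
annotated, word_freq = analyze(fields, table_importance={"t_user": 1})
```

With this data the words "号码" and "电话" each occur twice, and only the field from `t_user` carries an importance level.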

Step 105 includes:

traversing all the original fields, and classifying original fields into the same synonymous field set if their Chinese names, physical names or English names are the same (that is, their specific marks are consistent; a Chinese name and its English equivalent may also be treated as matching);

once the traversal is complete (that is, once the classification into synonymous field sets is complete), detecting whether the fields in each synonymous field set carry the marks of step 103. The purpose of the marks is to designate keywords that have priority; if a mark is present, the field corresponding to that keyword is preferentially taken as the candidate field. There are two types of priority marks, the important mark and the secondary mark, and the order of preference is: important mark, then secondary mark, then no mark.

When no mark is present, the original field with the highest occurrence frequency is taken as the candidate field.
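The selection rule of step 105 can be sketched in a few lines. The dictionary layout, the mark labels `"important"` and `"secondary"`, and the use of frequency as a tie-breaker among fields with the same mark are assumptions for illustration; the embodiment only fixes the order of preference and the frequency fallback when no mark exists.

```python
# Higher rank wins; None stands for "no mark".
MARK_RANK = {"important": 2, "secondary": 1, None: 0}


def select_candidate(synonym_set):
    """Pick one field: important mark first, then secondary mark, then highest frequency."""
    return max(synonym_set,
               key=lambda f: (MARK_RANK[f.get("mark")], f["frequency"]))


marked_set = [
    {"name": "PhoneNum", "frequency": 5},
    {"name": "MobileNo", "frequency": 2, "mark": "secondary"},
]
unmarked_set = [
    {"name": "UserName", "frequency": 3},
    {"name": "Name", "frequency": 7},
]

marked_choice = select_candidate(marked_set)
unmarked_choice = select_candidate(unmarked_set)
```

In the first set the marked field is chosen even though it is less frequent; in the second set, where no field is marked, the most frequent field is chosen.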

The main purpose of step 106 is to decide whether to perform word segmentation, i.e. whether to split the keyword so that the data format can be divided with finer granularity.

Specifically, a candidate data standard field is taken and the word segmentation statistics of its Chinese name are checked. Whether the segmented words are added to the data standard is judged comprehensively from the field importance, the field statistics and the statistics of the Chinese words. If they are added, segmentation is performed on the Chinese name, the English name and the physical name. Once it has been decided whether the segmented words serve as part of the data standard, the keywords are combined with a preset data structure (that is, the part with a fixed data structure) to form the standard field.

The standard field may be defined as follows:

a field is defined by 1 to n Chinese words plus 1 standard type of the data element standard; more specifically:

the standard type is defined by: type Chinese name, physical name, length, data type, and whether null values are allowed;

the Chinese word is defined by: word Chinese name, physical name, and description.
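The definition above maps naturally onto a small set of record types. The following Python sketch is one possible representation; the class names, the `nullable` attribute name, the sample values, and the rule that a field's names are formed by concatenating the names of its words are all assumptions, since the embodiment lists the attributes but not how they are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChineseWord:
    chinese_name: str
    physical_name: str
    description: str = ""


@dataclass(frozen=True)
class StandardType:
    chinese_name: str
    physical_name: str
    length: int
    data_type: str
    nullable: bool  # "whether empty or not" in the description


@dataclass(frozen=True)
class DataElementStandard:
    words: List[ChineseWord]      # 1 to n Chinese words
    standard_type: StandardType   # exactly one standard type

    def __post_init__(self):
        if not self.words:
            raise ValueError("a data element standard needs at least one Chinese word")

    @property
    def chinese_name(self) -> str:
        # Assumed combination rule: concatenate the word names in order.
        return "".join(w.chinese_name for w in self.words)

    @property
    def physical_name(self) -> str:
        return "".join(w.physical_name for w in self.words)


# Illustrative values only.
phone = DataElementStandard(
    words=[ChineseWord("手机", "Phone", "mobile handset"),
           ChineseWord("号码", "Num", "number")],
    standard_type=StandardType("号码类", "NumberType", 11, "VARCHAR", False),
)
```

Under these assumptions `phone.physical_name` evaluates to `"PhoneNum"`, matching the standardized element used as the example in the Background section.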

The present embodiment provides a data standard management system as shown in fig. 3, including: the extraction module 1, used for parsing the data to be processed according to a data structure to obtain the corresponding original fields, wherein each field comprises a specific mark; the induction module 2, used for traversing the original fields according to a preset classification principle and outputting a plurality of synonymous field sets, wherein each synonymous field set comprises a plurality of original fields with the same specific mark and their corresponding occurrence frequencies; the selection module 3, used for selecting one original field from each synonymous field set to generate a candidate field; and the standard module 4, used for selecting from the candidate fields according to external input to generate a standard field.
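The way the four modules hand data to one another can be sketched as a single class. This is only an illustrative wiring: the class name, method names and input layout are hypothetical, the external input is modelled as a `choose` callback, and the selection module here uses only the frequency fallback (no field marks).

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List


class DataStandardSystem:
    def __init__(self, choose: Callable[[List[str]], List[str]]):
        # `choose` stands in for the external (manual) input of the standard module.
        self.choose = choose

    def extract(self, tables: Dict[str, List[dict]]) -> List[dict]:
        """Extraction module: flatten the known table structure into original fields."""
        return [col for cols in tables.values() for col in cols]

    def induce(self, fields: List[dict]) -> Dict[str, Counter]:
        """Induction module: synonymous field sets keyed by Chinese name, with counts."""
        sets: Dict[str, Counter] = defaultdict(Counter)
        for f in fields:
            sets[f["chinese_name"]][f["physical_name"]] += 1
        return sets

    def select(self, sets: Dict[str, Counter]) -> List[str]:
        """Selection module: most frequent original field of each set."""
        return [counter.most_common(1)[0][0] for counter in sets.values()]

    def standardize(self, candidates: List[str]) -> List[str]:
        """Standard module: the external input decides which candidates become standards."""
        return self.choose(candidates)

    def run(self, tables: Dict[str, List[dict]]) -> List[str]:
        return self.standardize(self.select(self.induce(self.extract(tables))))


tables = {
    "t_user": [{"chinese_name": "手机号码", "physical_name": "PhoneNum"}],
    "t_order": [{"chinese_name": "手机号码", "physical_name": "PhoneNum"},
                {"chinese_name": "姓名", "physical_name": "Name"}],
    "t_log": [{"chinese_name": "手机号码", "physical_name": "MobileNo"}],
}
# The reviewer in this example accepts only the first candidate.
system = DataStandardSystem(choose=lambda candidates: candidates[:1])
standards = system.run(tables)
```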

The present embodiments also provide a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of the above embodiments.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A data standard management method, comprising the steps of:

analyzing the data to be processed according to a data structure to obtain a corresponding original field, wherein the field comprises a specific mark;

traversing the original fields according to a preset classification principle and outputting a plurality of synonymous field sets, wherein the synonymous field sets comprise a plurality of original fields with the same specific marks and corresponding occurrence frequencies;

selecting one original field from each synonymous field set and generating a candidate field;

and selecting and generating a standard field from the candidate fields according to external input.

2. A data standard management method according to claim 1, wherein the specific mark comprises a chinese name and/or a physical name and/or an english name, the classification rule comprises a chinese name, a physical name and an english name, and the original field having the same chinese name and/or physical name and/or english name as the classification rule is marked as the original field having the same specific mark.

3. A data standard management method according to claim 1, further comprising the steps of:

obtaining a field mark, the field mark being used to select an original field from a synonymous field set as a candidate field.

4. The data standard management method according to claim 1, specifically comprising the steps of:

judging, for each synonymous field set, whether a field mark exists, and if not, taking the original field with the highest occurrence frequency as the candidate field.

5. The data standard management method according to claim 2, specifically comprising:

segmenting the specific marks in the candidate fields according to external input, and combining the segmented candidate fields to generate a standard field for the data to be processed.

6. The data standard management method according to claim 2, specifically comprising:

selecting a plurality of fields from the candidate fields according to external input, and combining the fields with a preset fixed data structure to generate a standard field for the data to be processed.

7. The data standard management method of claim 3, wherein the field marks comprise important marks and secondary marks, and correspondingly, when an important mark exists in a synonymous field set, the corresponding original field is selected as a candidate field;

when no important mark exists but a secondary mark exists in a synonymous field set, the original field corresponding to the secondary mark is selected as the candidate field.

8. A data standard management system, comprising:

the extraction module is used for analyzing the data to be processed according to the data structure to obtain a corresponding original field, and the field comprises a specific mark;

the induction module is used for traversing the original fields according to a preset classification principle and outputting a plurality of synonymous field sets, and each synonymous field set comprises a plurality of original fields with the same specific marks and corresponding occurrence frequencies;

the selection module is used for selecting one original field from each synonymous field set and generating a candidate field;

and the standard module is used for selecting from the candidate fields according to external input and generating a standard field.

9. A computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 7.