CN111488327A - Data standard management method and system - Google Patents

Publication number
CN111488327A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910084384.2A
Other languages
Chinese (zh)
Other versions
CN111488327B (en)
Inventor
徐瑞芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aspire Digital Technologies Shenzhen Co Ltd
Original Assignee
Aspire Digital Technologies Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aspire Digital Technologies Shenzhen Co Ltd
Priority to CN201910084384.2A
Publication of CN111488327A
Application granted
Publication of CN111488327B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data standard management method and system. The method comprises the following steps: parsing the data to be processed according to its data structure to obtain the corresponding original fields, each field comprising a specific mark; traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each set comprising a plurality of original fields that share the same specific mark, together with their occurrence frequencies; selecting one original field from each synonymous field set to generate a candidate field; and selecting from the candidate fields according to external input to generate a standard field. The system performs the method. Because the original fields are obtained by parsing the data to be processed against its data structure, non-target data is eliminated in a preliminary pass; synonymous fields can be counted by means of the preset classification rule; a suitable original field is selected from among the synonymous fields to serve as the standard field; and a suitable data structure can thus be determined from the existing data to guide subsequent data maintenance.

Description

Data standard management method and system
Technical Field
The invention relates to the technical field of databases, in particular to a data standard management method and a data standard management system.
Background
A data element is the smallest indivisible unit of information in a database. The same element may be called "cell phone number", "contact phone number", "phone number", and so on, and if these names are not unified, the corresponding fields in the physical tables may be named completely differently. If "mobile phone number" is adopted uniformly, with "PhoneNum" as its physical field name, and "PhoneNum" is used as the standardized element, the total number of data elements used in an information system can be greatly reduced and the structure of the information system greatly simplified.
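As a small illustration of the unification described above, the following sketch maps several display names onto one physical field name. The `SYNONYMS` table and the function `standard_physical_name` are hypothetical and serve only to show how the number of distinct data elements shrinks; they are not part of the invention.

```python
# Hypothetical lookup table: several names for the same data element
# all resolve to one standardized physical field name.
SYNONYMS = {
    "cell phone number": "PhoneNum",
    "contact phone number": "PhoneNum",
    "phone number": "PhoneNum",
}


def standard_physical_name(display_name: str) -> str:
    """Return the unified physical name, or the input unchanged if it is unknown."""
    return SYNONYMS.get(display_name.strip().lower(), display_name)


distinct_before = len(SYNONYMS)               # three differently named fields
distinct_after = len(set(SYNONYMS.values()))  # one standardized data element
```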
For a brand-new database, the data can be kept simple by specifying in advance the format of the data to be stored. For the overall improvement of an existing data set, however, the data is mostly sorted out manually, which is inefficient and error-prone.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. To this end, it is an object of the present invention to provide a data standard management method and system.
The technical solution adopted by the invention is as follows:
In a first aspect, the present invention provides a data standard management method, comprising the steps of: parsing the data to be processed according to a data structure to obtain the corresponding original fields, each field comprising a specific mark; traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each synonymous field set comprising a plurality of original fields with the same specific mark and their occurrence frequencies; selecting one original field from each synonymous field set and generating a candidate field; and selecting from the candidate fields according to external input and generating a standard field.
Preferably, the specific mark comprises a Chinese name and/or a physical name and/or an English name; the classification rule covers the Chinese name, the physical name and the English name; and original fields having the same Chinese name and/or physical name and/or English name under the classification rule are treated as original fields having the same specific mark.
Preferably, the method further comprises the step of: obtaining a field mark, the field mark being used to select an original field from a synonymous field set as the candidate field.
Preferably, the method specifically comprises: judging, for each synonymous field set, whether a field mark exists, and if not, taking the original field with the highest occurrence frequency as the candidate field.
Preferably, the method specifically comprises: segmenting the specific marks in the candidate fields according to external input, and combining the segmented candidate fields to generate a standard field for the data to be processed.
Preferably, the method specifically comprises: selecting a plurality of fields from the candidate fields according to external input, and combining them with a preset fixed data structure to generate a standard field for the data to be processed.
Preferably, the field marks comprise important marks and secondary marks; accordingly, when an important mark exists in a synonymous field set, the original field carrying it is selected as the candidate field; when no important mark exists but a secondary mark exists in a synonymous field set, the original field carrying the secondary mark is selected as the candidate field.
In a second aspect, the present invention provides a data standard management system, comprising: an extraction module for parsing the data to be processed according to the data structure to obtain the corresponding original fields, each field comprising a specific mark; an induction module for traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each synonymous field set comprising a plurality of original fields with the same specific mark and their occurrence frequencies; a selection module for selecting one original field from each synonymous field set and generating a candidate field; and a standard module for selecting from the candidate fields according to external input and generating a standard field.
In a third aspect, the present invention provides a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method as described above.
The beneficial effects of the invention are as follows:
Because the original fields are obtained by parsing the data to be processed against its data structure, non-target data is eliminated in a preliminary pass; synonymous fields can be counted by means of a preset classification rule; a suitable original field is selected from among the synonymous fields to serve as the standard field; and a suitable data structure can thus be determined from the existing data to guide subsequent data maintenance.
Drawings
FIG. 1 is a schematic diagram of a data standard management method of the present invention;
FIG. 2 is a flowchart of the data standard processing of the present invention;
FIG. 3 is a schematic diagram of a data standard management system of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Example 1
This embodiment provides a data standard management method, as shown in FIG. 1, comprising the steps of:
S1, parsing the data to be processed according to the data structure to obtain the corresponding original fields, each field comprising a specific mark;
S2, traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each synonymous field set comprising a plurality of original fields with the same specific mark and their occurrence frequencies;
S3, selecting one original field from each synonymous field set and generating a candidate field;
S4, selecting from the candidate fields according to external input and generating a standard field.
Step S1 locates the original fields by means of a known data structure (that is, the basic data structure of the data to be processed is known). Its purpose is to exclude non-target data, which helps reduce the overall data processing cost.
Step S2 identifies keywords (based on the known data structure from S1) and then judges whether the keywords are consistent or equivalent. This embodiment mainly judges consistency: original fields that contain the same keyword are classified into the same synonymous field set, and the occurrence frequency of each field across all the data to be processed is recorded at the same time.
Step S3 selects, by a suitable selection method, one original field from each synonymous field set as a candidate field for external selection. The selection method mainly relies on a manually preset mark (the field mark): when a field matches the mark, it is considered more suitable than the other fields to represent the synonymous field set. The mark mainly targets a keyword (i.e. a specific mark), but may also target a combination of keywords and the rule for combining them.
Accordingly, an additional step may be provided before step S3: configuring the importance level of the data tables, which describes the specific form of the mark and the importance level it denotes, for use in determining the candidate fields.
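Steps S1 and S2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the `OriginalField` record, the function `group_synonymous_fields`, and the sample rows are hypothetical, and the grouping key is simplified to the Chinese name alone (shown here in English), whereas the text also allows matching on the physical or English name.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OriginalField:
    """One field parsed from a table of known structure (step S1)."""
    table: str
    chinese_name: str   # the keyword used as the specific mark in this sketch
    physical_name: str


def group_synonymous_fields(fields):
    """Step S2 (simplified): group fields by keyword and count each physical name."""
    synonym_sets = defaultdict(Counter)
    for field in fields:
        synonym_sets[field.chinese_name][field.physical_name] += 1
    return synonym_sets


# Hypothetical parsed fields from three tables.
fields = [
    OriginalField("user", "mobile phone number", "PhoneNum"),
    OriginalField("order", "mobile phone number", "mobile"),
    OriginalField("contact", "mobile phone number", "PhoneNum"),
    OriginalField("user", "name", "name"),
]
synonym_sets = group_synonymous_fields(fields)
```

Each entry of `synonym_sets` holds one synonymous field set, with the occurrence frequency of every physical name recorded alongside it.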
This embodiment provides a data standard processing flow, as shown in FIG. 2:
101) the system connects to the business system database to obtain the current network data model, or directly imports an established data model (the purpose being to acquire the data to be processed and the corresponding data structure);
102) configure data table importance levels: if configuration is required, go to 103; otherwise go to 104;
103) mark the key data table structures in the system and configure their importance: 1 important, 2 more important (the purpose being to determine in advance whether there are preferred candidate fields);
104) system analysis and calculation: mark the importance of each field according to the importance of its table; count the occurrence frequency of the Chinese word segments of the fields, and record the field and table names (that is, parse the data to obtain the original fields and the synonymous field sets);
105) the system automatically generates candidate data standards (that is, selects original fields and generates candidate fields);
106) a decision is made manually on the candidate data standards, and the data element standards are established (that is, a selection is made from the candidate fields according to external input and standard fields are generated);
107) the data element standards are bound to the data model (data is maintained according to the standard fields, which includes modifying or replacing existing data as well as constraining the entry format of later data).
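Steps 103 and 104 propagate table-level importance down to fields. The sketch below illustrates one way this could look; the table names, the `TABLE_IMPORTANCE` mapping, the function `mark_field_importance`, and the use of 0 for tables that were never configured are all assumptions made for illustration.

```python
# Hypothetical result of step 103: table name -> configured importance level
# (1 = important, 2 = more important, as in the flow above).
TABLE_IMPORTANCE = {"customer": 2, "order": 1}


def mark_field_importance(fields, table_importance):
    """Step 104 (partial): give each (table, field) pair the importance of its table.

    Tables without a configured level receive 0, an assumed encoding for "unmarked".
    """
    return [
        {"table": table, "field": name, "importance": table_importance.get(table, 0)}
        for table, name in fields
    ]


marked = mark_field_importance(
    [("customer", "PhoneNum"), ("order", "mobile"), ("log", "tel")],
    TABLE_IMPORTANCE,
)
```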
Step 105 includes:
traversing all the original fields and classifying original fields into the same synonymous field set if their Chinese names, physical names or English names are the same (that is, the specific marks are consistent; a Chinese name and its English equivalent may also be treated as equivalent);
completing the traversal (that is, the classification into synonymous field sets is complete);
detecting whether any field in a synonymous field set carries a mark from step 103. The mark designates keywords that have priority; if such a mark is present, the field corresponding to the keyword is preferentially taken as the candidate field. There are two kinds of priority mark, the important mark and the secondary mark, and the order of preference is: important mark, then secondary mark, then no mark.
When no field is marked, the original field with the highest occurrence frequency is taken as the candidate field.
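The selection rule of step 105 can be sketched as a single ordering over (mark, frequency). The numeric encoding of the marks and the function `select_candidate` are hypothetical, and breaking ties between equally marked fields by occurrence frequency is an assumption; the text specifies only the mark priority and the frequency fallback for unmarked sets.

```python
# Assumed numeric encoding of the field marks.
IMPORTANT, SECONDARY, UNMARKED = 2, 1, 0


def select_candidate(synonym_set):
    """Pick the candidate field from one synonymous field set.

    synonym_set is a list of dicts with keys 'name', 'frequency' and 'mark'.
    A higher mark wins; among equal marks the higher frequency wins, so a set
    with no marks at all falls back to the most frequent field.
    """
    best = max(synonym_set, key=lambda f: (f["mark"], f["frequency"]))
    return best["name"]


unmarked_set = [
    {"name": "PhoneNum", "frequency": 7, "mark": UNMARKED},
    {"name": "mobile", "frequency": 3, "mark": UNMARKED},
]
marked_set = [
    {"name": "PhoneNum", "frequency": 7, "mark": UNMARKED},
    {"name": "tel", "frequency": 2, "mark": SECONDARY},
    {"name": "mobile", "frequency": 1, "mark": IMPORTANT},
]
```

With these inputs, `select_candidate(unmarked_set)` falls back to the most frequent field, while `select_candidate(marked_set)` prefers the field carrying the important mark even though it occurs least often.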
The main purpose of step 106 is to decide whether to perform word segmentation, that is, to split the keywords so that the data format can be divided with finer granularity.
Specifically, a candidate data standard field is taken and the word segmentation statistics of its Chinese name are examined. Whether to add the word segments to the data standard is judged comprehensively from the field importance, the field statistics and the statistics of the Chinese words. If they are to be added, segmentation is performed on the Chinese name, the English name and the physical name, and it is then decided whether each word segment is to be used as part of the data standard. Finally, the keywords are integrated with a preset data structure (that is, a fixed data structure) to form the standard field.
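The word statistics consulted in step 106 could be gathered along the following lines. The sketch assumes the field names have already been segmented into words by some external segmenter, which the source does not name; the function `word_frequencies` and the sample data (shown in English) are hypothetical.

```python
from collections import Counter


def word_frequencies(segmented_names):
    """Count how often each word occurs across pre-segmented field names."""
    counts = Counter()
    for words in segmented_names:
        counts.update(words)
    return counts


# Hypothetical segmentation results for three field names.
stats = word_frequencies([
    ["mobile", "phone", "number"],
    ["contact", "phone", "number"],
    ["phone", "number"],
])
```

A reviewer could then weigh frequently recurring words such as `phone` as candidates for inclusion in the data standard, alongside the field importance and field statistics mentioned above.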
The standard field may be defined as follows:
each field is defined by 1 to n Chinese words plus 1 standard type of the data element standard. More specifically:
a standard type is defined by: type Chinese name, physical name, length, data type, and whether it may be empty;
a Chinese word is defined by: word Chinese name, physical name, and description.
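The composition just described (1 to n Chinese words plus one standard type) can be modelled with plain data classes. The class and attribute names below are hypothetical, and joining the physical names with underscores is an assumed naming convention, not one stated in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChineseWord:
    chinese_name: str
    physical_name: str
    description: str = ""


@dataclass
class StandardType:
    chinese_name: str
    physical_name: str
    length: int
    data_type: str
    nullable: bool  # "whether empty or not"


@dataclass
class StandardField:
    words: List[ChineseWord]      # 1 to n words
    standard_type: StandardType   # exactly one standard type

    def __post_init__(self):
        if not self.words:
            raise ValueError("a standard field needs at least one Chinese word")

    @property
    def physical_name(self) -> str:
        # Assumed convention: word names followed by the type name, joined by "_".
        parts = [w.physical_name for w in self.words]
        parts.append(self.standard_type.physical_name)
        return "_".join(parts)


number_type = StandardType("number", "num", 11, "VARCHAR", False)
phone_field = StandardField(
    [ChineseWord("mobile", "mobile"), ChineseWord("phone", "phone")],
    number_type,
)
```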
This embodiment provides a data standard management system, as shown in FIG. 3, comprising: an extraction module 1 for parsing the data to be processed according to a data structure to obtain the corresponding original fields, each field comprising a specific mark; an induction module 2 for traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each synonymous field set comprising a plurality of original fields with the same specific mark and their occurrence frequencies; a selection module 3 for selecting one original field from each synonymous field set and generating a candidate field; and a standard module 4 for selecting from the candidate fields according to external input and generating a standard field.
The present embodiments also provide a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of the above embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A data standard management method, comprising the steps of:
parsing the data to be processed according to a data structure to obtain the corresponding original fields, each field comprising a specific mark;
traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each synonymous field set comprising a plurality of original fields with the same specific mark and their occurrence frequencies;
selecting one original field from each synonymous field set and generating a candidate field;
and selecting from the candidate fields according to external input and generating a standard field.
2. The data standard management method according to claim 1, wherein the specific mark comprises a Chinese name and/or a physical name and/or an English name, the classification rule covers the Chinese name, the physical name and the English name, and original fields having the same Chinese name and/or physical name and/or English name under the classification rule are treated as original fields having the same specific mark.
3. The data standard management method according to claim 1, further comprising the step of:
obtaining a field mark, the field mark being used to select an original field from a synonymous field set as the candidate field.
4. The data standard management method according to claim 1, specifically comprising the step of:
judging, for each synonymous field set, whether a field mark exists, and if not, taking the original field with the highest occurrence frequency as the candidate field.
5. The data standard management method according to claim 2, specifically comprising:
segmenting the specific marks in the candidate fields according to external input, and combining the segmented candidate fields to generate a standard field for the data to be processed.
6. The data standard management method according to claim 2, specifically comprising:
selecting a plurality of fields from the candidate fields according to external input, and combining them with a preset fixed data structure to generate a standard field for the data to be processed.
7. The data standard management method according to claim 3, wherein the field marks comprise important marks and secondary marks, and accordingly, when an important mark exists in a synonymous field set, the original field carrying it is selected as the candidate field;
when no important mark exists but a secondary mark exists in a synonymous field set, the original field carrying the secondary mark is selected as the candidate field.
8. A data standard management system, comprising:
an extraction module for parsing the data to be processed according to the data structure to obtain the corresponding original fields, each field comprising a specific mark;
an induction module for traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each synonymous field set comprising a plurality of original fields with the same specific mark and their occurrence frequencies;
a selection module for selecting one original field from each synonymous field set and generating a candidate field;
and a standard module for selecting from the candidate fields according to external input and generating a standard field.
9. A computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of any one of claims 1-7.
CN201910084384.2A 2019-01-29 2019-01-29 Data standard management method and system Active CN111488327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910084384.2A CN111488327B (en) 2019-01-29 2019-01-29 Data standard management method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910084384.2A CN111488327B (en) 2019-01-29 2019-01-29 Data standard management method and system

Publications (2)

Publication Number Publication Date
CN111488327A (en) 2020-08-04
CN111488327B (en) 2023-08-22

Family

ID=71796572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910084384.2A Active CN111488327B (en) 2019-01-29 2019-01-29 Data standard management method and system

Country Status (1)

Country Link
CN (1) CN111488327B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023092954A1 (en) * 2021-11-26 2023-06-01 华为云计算技术有限公司 Data governance method and device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514285A (en) * 2013-09-29 2014-01-15 方正国际软件有限公司 System and method for generating optimal record data
CN106933972A (en) * 2017-02-14 2017-07-07 杭州数梦工场科技有限公司 The method and device of data element are defined using natural language processing technique
CN108256074A (en) * 2018-01-17 2018-07-06 链家网(北京)科技有限公司 Method, apparatus, electronic equipment and the storage medium of checking treatment
CN109189769A (en) * 2018-08-14 2019-01-11 平安医疗健康管理股份有限公司 Data standardization processing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111488327B (en) 2023-08-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518000 w601, Shenzhen Hong Kong industry university research base, 015 Gaoxin South 7th Road, high tech Zone community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: ASPIRE TECHNOLOGIES (SHENZHEN) LTD.

Address before: 518000 south wing, 6th floor, west block, Shenzhen Hong Kong industry university research base building, South District, high tech Industrial Park, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: ASPIRE TECHNOLOGIES (SHENZHEN) LTD.

GR01 Patent grant