CN111488327A - Data standard management method and system - Google Patents
- Publication number
- CN111488327A (application number CN201910084384.2A)
- Authority
- CN
- China
- Prior art keywords
- field
- data
- original
- fields
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data standard management method and system. The method comprises the following steps: parsing the data to be processed according to its data structure to obtain the corresponding original fields, each of which carries a specific mark; traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each containing the original fields that share the same specific mark together with their occurrence frequencies; selecting one original field from each synonymous field set as a candidate field; and selecting from the candidate fields, according to external input, to generate a standard field. The system performs the method. Parsing the data according to its data structure serves as a preliminary screening that excludes non-target data; synonymous fields are counted under the preset classification rule; a suitable original field is chosen from each group of synonymous fields to serve as the standard field; and a suitable data structure can thus be derived from the existing data to guide subsequent data maintenance.
Description
Technical Field
The invention relates to the technical field of databases, in particular to a data standard management method and a data standard management system.
Background
A data element is the smallest indivisible unit of information in a database. The same element may appear as "cell phone number", "contact phone number", "phone number", and so on; if these names are not unified, the corresponding fields in the physical tables may be named completely differently. If "mobile phone number" is adopted uniformly, with "PhoneNum" as its physical field name and as the standardized element, the total number of data elements used in an information system can be greatly reduced and the structure of the information system greatly simplified.
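As a minimal sketch of this unification, the synonymous names from the example can be resolved to one standardized physical name. The names `SYNONYM_TO_STANDARD` and `standard_physical_name` are hypothetical and serve only as an illustration; they do not appear in the patent.

```python
from typing import Optional

# Hypothetical lookup table: every synonymous field name points to the one
# standardized physical name ("PhoneNum" is the example used in the text).
SYNONYM_TO_STANDARD = {
    "cell phone number": "PhoneNum",
    "contact phone number": "PhoneNum",
    "phone number": "PhoneNum",
}


def standard_physical_name(field_name: str) -> Optional[str]:
    """Return the standardized physical name, or None if no standard exists."""
    return SYNONYM_TO_STANDARD.get(field_name.strip().lower())


for name in ("Cell phone number", "contact phone number", "email"):
    print(name, "->", standard_physical_name(name))
```

Three differently named fields collapse into a single data element, which is the reduction the paragraph describes.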
For a brand-new database, data can be kept simple by specifying the format of the data to be stored. For the overall improvement of an existing data set, however, most of the sorting is done manually, which is inefficient and error-prone.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. To this end, it is an object of the present invention to provide a data standard management method and system.
The technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides a data standard management method, including the steps of: parsing the data to be processed according to a data structure to obtain the corresponding original fields, wherein each field comprises a specific mark; traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, wherein each synonymous field set comprises a plurality of original fields with the same specific mark and their corresponding occurrence frequencies; selecting one original field from each synonymous field set as a candidate field; and selecting from the candidate fields according to external input to generate a standard field.
Preferably, the specific mark comprises a Chinese name and/or a physical name and/or an English name, the classification rule comprises the Chinese name, the physical name and the English name, and the original field with the same Chinese name and/or physical name and/or English name as the classification rule is marked as the original field with the same specific mark.
Preferably, the method further comprises the step of: obtaining a field mark, which is used to select an original field from a synonymous field set as the candidate field.
Preferably, the method specifically comprises the following step: determining, for each synonymous field set, whether a field mark exists, and if not, taking the original field with the highest occurrence frequency as the candidate field.
Preferably, the method specifically comprises the following steps: and segmenting specific marks in the candidate fields according to external input, and combining the candidate fields after segmentation to generate a standard field for the data to be processed.
Preferably, the method specifically comprises the following steps: selecting a plurality of fields from the candidate fields according to external input, and combining the fields with a preset fixed data structure to generate a standard field for the data to be processed.
Preferably, the field marks comprise important marks and secondary marks. Correspondingly, when an important mark exists in a synonymous field set, the corresponding original field is selected as the candidate field; when no important mark exists but a secondary mark exists in a synonymous field set, the corresponding original field is selected as the candidate field.
In a second aspect, the present invention provides a data standard management system, comprising: an extraction module for parsing the data to be processed according to the data structure to obtain the corresponding original fields, each field comprising a specific mark; an induction module for traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each synonymous field set comprising a plurality of original fields with the same specific mark and their corresponding occurrence frequencies; a selection module for selecting one original field from each synonymous field set as a candidate field; and a standard module for selecting from the candidate fields according to external input and generating a standard field.
In a third aspect, the present invention provides a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method as described above.
The invention has the beneficial effects that:
Parsing the data to be processed according to its data structure serves as a preliminary screening that excludes non-target data; synonymous fields are counted under the preset classification rule; a suitable original field is chosen from each group of synonymous fields to serve as the standard field; and a suitable data structure can thus be derived from the existing data to guide subsequent data maintenance.
Drawings
FIG. 1 is a schematic diagram of a data standard management method of the present invention;
FIG. 2 is a flow chart of the data standard processing of the present invention;
FIG. 3 is a schematic diagram of a data standard management system of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Example 1
This embodiment provides a data standard management method as shown in FIG. 1, which includes the steps of:
S1, parsing the data to be processed according to the data structure to obtain the corresponding original fields, wherein each field comprises a specific mark;
S2, traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, wherein each synonymous field set comprises a plurality of original fields with the same specific mark and their corresponding occurrence frequencies;
S3, selecting one original field from each synonymous field set as a candidate field;
S4, selecting from the candidate fields according to external input to generate a standard field.
Step S1 locates the original fields by means of a known data structure (that is, the basic data structure of the data to be processed is known). The aim is to exclude non-target data, which reduces the overall data processing cost.
Step S2 identifies keywords (based on the data structure known from S1) and then determines whether the keywords are identical or equivalent. This embodiment mainly checks for identical keywords: original fields containing the same keyword are classified into the same synonymous field set, and the occurrence frequency of each field in the whole of the data to be processed is recorded at the same time.
Step S3 uses a suitable selection method to pick one original field from each synonymous field set as a candidate field for external selection. The selection mainly relies on manually preset marks (field marks): a field that matches a mark is considered a better representative of its synonymous fields than the others. A mark mainly targets a keyword (i.e., a specific mark), but may also target a combination of keywords and the rule for combining them.
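Steps S1 and S2 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the known data structure is a list of dictionaries with the hypothetical keys `table`, `chinese_name`, and `physical_name`, and it uses the Chinese name as the keyword (specific mark).

```python
from collections import Counter, defaultdict
from typing import NamedTuple

# Keys assumed to be present in every row of the known data structure.
REQUIRED_KEYS = ("table", "chinese_name", "physical_name")


class OriginalField(NamedTuple):
    table: str
    chinese_name: str   # plays the role of the "specific mark" (keyword)
    physical_name: str


def parse_original_fields(rows):
    """S1: keep rows that match the known structure; drop non-target data."""
    return [
        OriginalField(*(row[key] for key in REQUIRED_KEYS))
        for row in rows
        if all(key in row for key in REQUIRED_KEYS)
    ]


def group_synonymous_fields(fields):
    """S2: group fields with the same keyword and count each name's frequency."""
    groups = defaultdict(Counter)
    for field in fields:
        groups[field.chinese_name][field.physical_name] += 1
    return {mark: dict(counts) for mark, counts in groups.items()}


rows = [
    {"table": "user", "chinese_name": "手机号", "physical_name": "PhoneNum"},
    {"table": "order", "chinese_name": "手机号", "physical_name": "phone_no"},
    {"table": "address", "chinese_name": "手机号", "physical_name": "PhoneNum"},
    {"table": "user", "chinese_name": "姓名", "physical_name": "UserName"},
    {"comment": "free text, not a field definition"},  # excluded by S1
]
synonym_sets = group_synonymous_fields(parse_original_fields(rows))
print(synonym_sets)
```

Each synonymous field set here maps a physical name to how often it occurs, which is the information step S3 needs when no field mark is present.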
Correspondingly, an additional step may be provided before step S3: configuring the importance level of the data tables. This describes the specific form of each mark and the importance level it denotes, and is used in determining the candidate fields.
This embodiment provides a data standard processing flow as shown in FIG. 2:
101) The system connects to the database of a business system to obtain the data model currently in use, or directly imports an already established data model (the aim is to acquire the data to be processed and the corresponding data structure);
102) Configure the importance level of the data tables: if configuration is required, go to 103; otherwise go to 104;
103) Mark the key data table structures in the system and configure their importance: 1 important, 2 more important (the aim is to determine in advance whether there are preferred candidate fields);
104) System analysis and calculation: mark the importance of each field according to the importance of its table; count the occurrence frequency of the Chinese word segments of the fields, and record the field and table names (i.e., parse the data to obtain the original fields and the synonymous field sets);
105) The system automatically generates candidate data standards (i.e., selects original fields and generates candidate fields);
106) A decision is made manually on the candidate data standards and the data element standards are established (i.e., a selection is made from the candidate fields according to external input and a standard field is generated);
107) The data element standards are bound to the data model (i.e., data is maintained according to the standard fields, which includes modifying or replacing existing data and constraining the entry format of later data).
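The importance marking in steps 103 and 104 can be sketched as below. The function name `mark_field_importance` and the default level 0 for tables that were not configured are assumptions made for illustration; the text only specifies the levels 1 (important) and 2 (more important).

```python
# Step 103: importance configured per table (1 = important, 2 = more important).
# Tables that are not configured get level 0 here; that default is an assumption.
UNMARKED = 0


def mark_field_importance(fields, table_importance):
    """Step 104: each (table, field) pair inherits the importance of its table."""
    return {
        (table, field): table_importance.get(table, UNMARKED)
        for table, field in fields
    }


table_importance = {"user": 2, "order": 1}
fields = [("user", "PhoneNum"), ("order", "phone_no"), ("log", "mobile")]
field_marks = mark_field_importance(fields, table_importance)
print(field_marks)
```

Propagating the table-level setting to fields means only a handful of key tables need to be configured by hand.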
Step 105 includes:
traversing all the original fields and classifying original fields into the same synonymous field set if their Chinese names, physical names, or English names are the same (i.e., their specific marks are consistent; Chinese and English names may also be treated as equivalent);
when the traversal is complete, the classification into synonymous field sets is complete;
detecting whether any field in a synonymous field set carries a mark from step 103. The marks identify keywords that have priority; if a mark is present, the field corresponding to the keyword is preferentially taken as the candidate field. There are two kinds of priority marks, the major mark and the minor mark, and the order of preference is: major mark, then minor mark, then no mark.
If no mark is present, the original field with the highest occurrence frequency is taken as the candidate field.
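The priority rule of step 105 can be sketched as a single comparison key. The numeric levels (2 for the major mark, 1 for the minor mark, 0 for no mark) and the use of occurrence frequency as a tie-breaker among equally marked fields are assumptions for illustration; the text only fixes the order major, minor, unmarked, and the frequency rule for the unmarked case.

```python
from dataclasses import dataclass

# Assumed numeric encoding of the marks; a higher value means higher priority.
MAJOR, MINOR, UNMARKED = 2, 1, 0


@dataclass(frozen=True)
class CountedField:
    physical_name: str
    frequency: int
    mark: int = UNMARKED


def select_candidate(synonym_set):
    """Pick the field with the highest mark; break ties by frequency."""
    return max(synonym_set, key=lambda f: (f.mark, f.frequency))


marked_set = [
    CountedField("phone_no", frequency=5),
    CountedField("PhoneNum", frequency=2, mark=MINOR),
    CountedField("MobilePhone", frequency=1, mark=MAJOR),
]
unmarked_set = [CountedField("user_name", 3), CountedField("UserName", 7)]

print(select_candidate(marked_set).physical_name)
print(select_candidate(unmarked_set).physical_name)
```

Because the mark is the first element of the key, a marked field wins even when an unmarked field occurs more often; frequency only decides among fields with the same mark level.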
The main purpose of step 106 is to decide whether to perform word segmentation, i.e., whether to split the keyword, so that the data format can be defined at a finer granularity.
Specifically, a candidate data standard field is taken and the word segmentation statistics of its Chinese name are examined. Whether to add the segmented words to the data standard is judged comprehensively from the field importance, the field statistics, and the statistics of the Chinese words. If so, the field is segmented according to its Chinese name, English name, and physical name, and it is then decided whether each segmented word is used as a data standard. Finally, the keywords are combined with a preset data structure (i.e., a fixed part of the data structure) to form the standard field.
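The word segmentation statistics consulted in this step can be sketched as follows. The patent does not name a segmentation algorithm, so this example substitutes simple forward maximum matching against a hypothetical vocabulary; a real system would more likely use a dedicated Chinese word segmenter.

```python
from collections import Counter


def segment(name, vocabulary):
    """Forward maximum matching: at each position take the longest known word;
    characters not covered by the vocabulary are emitted one at a time."""
    max_len = max(map(len, vocabulary), default=1)
    words, i = [], 0
    while i < len(name):
        for size in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


def word_frequencies(chinese_names, vocabulary):
    """Count how often each segmented word occurs across all field names."""
    counts = Counter()
    for name in chinese_names:
        counts.update(segment(name, vocabulary))
    return counts


# Hypothetical vocabulary and field names, for illustration only.
vocabulary = {"客户", "手机", "号码", "联系", "电话"}
names = ["客户手机号码", "联系电话", "客户电话"]
frequencies = word_frequencies(names, vocabulary)
print(frequencies.most_common())
```

Words that recur across many fields, such as 客户 and 电话 in this sample, are natural candidates to become reusable words in the data standard, subject to the manual decision described above.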
A standard field may be defined as follows:
a field is defined by 1 to n Chinese words plus 1 standard type of the data element standard; more specifically:
the standard type is defined by: type Chinese name, physical name, length, data type, nullable or not;
the Chinese word is defined by: word Chinese name, physical name, description.
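The definition above maps naturally onto a small set of record types. The sketch below is one possible modelling; the class names and the rule that the physical name of a field is the concatenation of its parts are assumptions for illustration, not something the text prescribes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StandardType:
    chinese_name: str
    physical_name: str
    length: int
    data_type: str
    nullable: bool


@dataclass(frozen=True)
class Word:
    chinese_name: str
    physical_name: str
    description: str = ""


@dataclass(frozen=True)
class DataElementStandard:
    words: Tuple[Word, ...]       # 1 to n Chinese words
    standard_type: StandardType   # exactly one standard type

    def __post_init__(self):
        if not self.words:
            raise ValueError("a data element standard needs at least one word")

    @property
    def physical_name(self) -> str:
        # Assumed naming rule: word names followed by the type name.
        parts = [w.physical_name for w in self.words]
        return "".join(parts) + self.standard_type.physical_name

    @property
    def chinese_name(self) -> str:
        parts = [w.chinese_name for w in self.words]
        return "".join(parts) + self.standard_type.chinese_name


number_type = StandardType("号码", "Num", length=11, data_type="VARCHAR", nullable=False)
phone = DataElementStandard((Word("手机", "Phone", "mobile phone"),), number_type)
print(phone.chinese_name, phone.physical_name)
```

Keeping length, data type, and nullability on the standard type means every field built from that type inherits the same storage constraints.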
This embodiment provides a data standard management system as shown in FIG. 3, including: an extraction module 1 for parsing the data to be processed according to a data structure to obtain the corresponding original fields, wherein each field comprises a specific mark; an induction module 2 for traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, wherein each synonymous field set comprises a plurality of original fields with the same specific mark and their corresponding occurrence frequencies; a selection module 3 for selecting one original field from each synonymous field set as a candidate field; and a standard module 4 for selecting from the candidate fields according to external input and generating a standard field.
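How the four modules might chain together is sketched below, assuming each module is a plain function and that the external input is simply the set of keywords a reviewer accepts. The function names mirror the module names but are otherwise hypothetical, and the selection module here uses only the highest-frequency rule.

```python
from collections import Counter, defaultdict


def extraction_module(rows):
    """Module 1: keep (chinese_name, physical_name) pairs from well-formed rows."""
    return [
        (row["chinese_name"], row["physical_name"])
        for row in rows
        if "chinese_name" in row and "physical_name" in row
    ]


def induction_module(fields):
    """Module 2: one frequency counter per synonymous field set."""
    sets = defaultdict(Counter)
    for chinese_name, physical_name in fields:
        sets[chinese_name][physical_name] += 1
    return sets


def selection_module(sets):
    """Module 3: the most frequent original field of each set is its candidate."""
    return {mark: counts.most_common(1)[0][0] for mark, counts in sets.items()}


def standard_module(candidates, accepted):
    """Module 4: keep only candidates confirmed by external (manual) input."""
    return {mark: name for mark, name in candidates.items() if mark in accepted}


def run(rows, accepted):
    """Chain the four modules in the order 1, 2, 3, 4."""
    fields = extraction_module(rows)
    candidates = selection_module(induction_module(fields))
    return standard_module(candidates, accepted)


sample_rows = [
    {"chinese_name": "手机号", "physical_name": "PhoneNum"},
    {"chinese_name": "手机号", "physical_name": "phone_no"},
    {"chinese_name": "手机号", "physical_name": "PhoneNum"},
    {"chinese_name": "姓名", "physical_name": "UserName"},
    {"comment": "not a field definition"},
]
print(run(sample_rows, accepted={"手机号"}))
```

Each module consumes only the previous module's output, so any one of them can be replaced (for example, a selection module that honours field marks) without touching the others.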
This embodiment also provides a computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of the above embodiment.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (9)
1. A data standard management method, comprising the steps of:
analyzing the data to be processed according to a data structure to obtain a corresponding original field, wherein the field comprises a specific mark;
traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, wherein each synonymous field set comprises a plurality of original fields with the same specific mark and their corresponding occurrence frequencies;
selecting one original field from each synonymous field set as a candidate field;
and selecting from the candidate fields according to external input to generate a standard field.
2. A data standard management method according to claim 1, wherein the specific mark comprises a Chinese name and/or a physical name and/or an English name, the classification rule comprises a Chinese name, a physical name and an English name, and the original field having the same Chinese name and/or physical name and/or English name as the classification rule is marked as the original field having the same specific mark.
3. A data standard management method according to claim 1, further comprising the steps of:
obtaining a field mark, which is used to select an original field from a synonymous field set as the candidate field.
4. The data standard management method according to claim 1, specifically comprising the steps of:
determining, for each synonymous field set, whether a field mark exists, and if not, taking the original field with the highest occurrence frequency as the candidate field.
5. The data standard management method according to claim 2, specifically comprising:
and segmenting specific marks in the candidate fields according to external input, and combining the candidate fields after segmentation to generate a standard field for the data to be processed.
6. The data standard management method according to claim 2, specifically comprising:
selecting a plurality of fields from the candidate fields according to external input, and combining the fields with a preset fixed data structure to generate a standard field for the data to be processed.
7. The data standard management method of claim 3, wherein the field marks comprise important marks and secondary marks, and correspondingly, when an important mark exists in a synonymous field set, the corresponding original field is selected as a candidate field;
when no important mark exists but a secondary mark exists in a synonymous field set, the corresponding original field is selected as the candidate field.
8. A data standard management system, comprising:
the extraction module is used for analyzing the data to be processed according to the data structure to obtain a corresponding original field, and the field comprises a specific mark;
the induction module is used for traversing the original fields according to a preset classification rule and outputting a plurality of synonymous field sets, each synonymous field set comprising a plurality of original fields with the same specific mark and their corresponding occurrence frequencies;
the selection module is used for selecting one original field from each synonymous field set as a candidate field;
and the standard module is used for selecting from the candidate fields according to external input and generating a standard field.
9. A computer-readable storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910084384.2A CN111488327B (en) | 2019-01-29 | 2019-01-29 | Data standard management method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910084384.2A CN111488327B (en) | 2019-01-29 | 2019-01-29 | Data standard management method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111488327A | 2020-08-04 |
CN111488327B | 2023-08-22 |
Family
ID=71796572
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910084384.2A Active CN111488327B (en) | 2019-01-29 | 2019-01-29 | Data standard management method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111488327B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023092954A1 (en) * | 2021-11-26 | 2023-06-01 | 华为云计算技术有限公司 | Data governance method and device and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103514285A (en) * | 2013-09-29 | 2014-01-15 | 方正国际软件有限公司 | System and method for generating optimal record data |
CN106933972A (en) * | 2017-02-14 | 2017-07-07 | 杭州数梦工场科技有限公司 | The method and device of data element are defined using natural language processing technique |
CN108256074A (en) * | 2018-01-17 | 2018-07-06 | 链家网(北京)科技有限公司 | Method, apparatus, electronic equipment and the storage medium of checking treatment |
CN109189769A (en) * | 2018-08-14 | 2019-01-11 | 平安医疗健康管理股份有限公司 | Data standardization processing method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111488327B (en) | 2023-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107544982B (en) | Text information processing method and device and terminal | |
CN110795919A (en) | Method, device, equipment and medium for extracting table in PDF document | |
US20120054135A1 (en) | Automated parsing of e-mail messages | |
CN110490761B (en) | Power grid distribution network equipment ledger data model modeling method | |
CN112163072A (en) | Data processing method and device based on multiple data sources | |
CN112732655A (en) | Online analysis method and system for unformatted logs | |
CN112364014A (en) | Data query method, device, server and storage medium | |
CN109857842B (en) | Method and device for recognizing fault-reporting text | |
CN112307318A (en) | Content publishing method, system and device | |
CN111492364B (en) | Data labeling method and device and storage medium | |
CN111930949B (en) | Search string processing method and device, computer readable medium and electronic equipment | |
CN111488327B (en) | Data standard management method and system | |
CN105787004A (en) | Text classification method and device | |
CN111324705A (en) | System and method for adaptively adjusting related search terms | |
CN107291938A (en) | Order Query System and method | |
CN109993381B (en) | Demand management application method, device, equipment and medium based on knowledge graph | |
CN103136187A (en) | Method and system for extraction of patent rejection information | |
CN115577147A (en) | Visual information map retrieval method and device, electronic equipment and storage medium | |
CN111737529B (en) | Multi-source heterogeneous data acquisition method | |
CN113760907A (en) | Data uniqueness identification method in database | |
KR101363335B1 (en) | Apparatus and method for generating document categorization model | |
CN108572997B (en) | Integrated storage system and method of multi-source data with network attributes | |
CN110781309A (en) | Entity parallel relation similarity calculation method based on pattern matching | |
CN110569435A (en) | Intelligent dual-ended recommendation engine system and method | |
CN111046195A (en) | Intelligent cataloging method for mass media assets |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 518000 w601, Shenzhen Hong Kong industry university research base, 015 Gaoxin South 7th Road, high tech Zone community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province. Applicant after: ASPIRE TECHNOLOGIES (SHENZHEN) LTD. Address before: 518000 south wing, 6th floor, west block, Shenzhen Hong Kong industry university research base building, South District, high tech Industrial Park, Nanshan District, Shenzhen City, Guangdong Province. Applicant before: ASPIRE TECHNOLOGIES (SHENZHEN) LTD. |
GR01 | Patent grant | |