CN114185869A - Data model auditing method based on data standard - Google Patents

Info

Publication number
CN114185869A
Authority
CN
China
Prior art keywords
data
entity attribute
information item
standard information
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111463766.XA
Other languages
Chinese (zh)
Inventor
王峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan XW Bank Co Ltd
Priority to CN202111463766.XA
Publication of CN114185869A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/212 Schema design and management with details for data modelling support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data model auditing method based on a data standard, belonging to the technical field of data model auditing. It addresses a weakness of existing schemes, in which no data standard is introduced during data model design, so that data quality problems arise easily and considerable rework cost is incurred. The method introduces the data standard at the data model design stage, replacing manual evaluation of the data model after the design is finished. Data quality is thereby monitored in advance, data quality problems are avoided as far as possible, and the cost of modifying the data model is reduced.

Description

Data model auditing method based on data standard
Technical Field
The invention belongs to the technical field of data model auditing, and particularly relates to a data model auditing method based on a data standard.
Background
At present, data models are used ever more widely in many fields, so judging whether a data model design is acceptable has become an unavoidable question. In existing schemes, whether a data model passes is judged mainly by modeling experts on the basis of their understanding of the business and the data. Data standards are not involved in the design of the data model, which can later lead to data quality problems and a certain rework cost.
Disclosure of Invention
The invention discloses a data model auditing method based on a data standard. It aims to solve the problem that, in existing schemes, no data standard is introduced during data model design, so that data quality problems arise easily and considerable rework cost is incurred.
The technical scheme of the invention is as follows:
The invention provides a data model auditing method based on a data standard, comprising the following steps:
S1: collecting data model design information: acquiring the entity attributes of a data model in the design stage, wherein each entity attribute comprises an entity attribute name and an entity attribute business meaning, and storing each entity attribute name together with its business meaning as entity attribute text data in a key-value pair structure;
S2: calculating a similarity coefficient: acquiring the standard information items of the existing data standard, wherein each standard information item comprises a standard information item name and a standard information item business meaning; storing each standard information item name together with its business meaning as standard information item text data in a key-value pair structure; performing word segmentation on the entity attribute text data and the standard information item text data; and calculating, from the word segmentation results, a similarity coefficient between each entity attribute text data and the standard information item text data;
S3: data comparison and arrangement: using the similarity coefficients obtained in step S2, excluding the combinations of entity attribute text data and standard information item text data whose similarity coefficient does not meet the user's requirement, and sorting the remaining combinations in descending order of similarity coefficient;
S4: model auditing: if the similarity coefficient shows that the entity attribute text data is identical to the standard information item text data, directly checking whether the corresponding entity attribute is consistent with the standard information item; if so, the entity attribute passes the audit, and otherwise it fails; if the similarity coefficient shows that the entity attribute text data differs from the standard information item text data, manually determining, with reference to the similarity coefficients, the standard information item text data most similar to the entity attribute text data, and then checking whether the corresponding entity attribute is consistent with that standard information item; if so, the entity attribute passes the audit, and otherwise it fails;
S5: model audit feedback: feeding back the result of step S4, returning each entity attribute that failed the audit in step S4, and modifying the data model according to the returned entity attributes.
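As an illustration of how steps S1 and S2 might prepare their inputs, the following minimal Python sketch builds the key-value text data and segments it into word sets. The function names `to_text_data` and `segment`, the sample attribute names, and the regular-expression tokenizer are assumptions made for illustration only; the method does not prescribe a particular word segmentation tool, and Chinese text would need a dedicated segmenter.

```python
import re


def to_text_data(name, business_meaning):
    """Pair a name with its business meaning as a one-entry key-value structure."""
    return {name: business_meaning}


def segment(text_data):
    """Stand-in for word segmentation (an assumption, not the patent's tool).

    Lowercases the key and the value and splits them into alphanumeric tokens.
    """
    tokens = set()
    for key, value in text_data.items():
        tokens.update(re.findall(r"[a-z0-9]+", f"{key} {value}".lower()))
    return tokens


# Hypothetical entity attribute (S1) and standard information item (S2).
entity_text = to_text_data("cust_name", "Full legal name of the customer")
standard_text = to_text_data("customer_name", "Legal name of the customer")

entity_tokens = segment(entity_text)
standard_tokens = segment(standard_text)
```

The two token sets are the inputs from which a similarity coefficient can then be computed.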
The working principle of the technical scheme is as follows:
The method collects the entity attributes of a data model in the design stage, obtains the standard information items of the data standard, and calculates the similarity coefficients between the entity attribute text data and the standard information item text data. Depending on the value of the similarity coefficient, each entity attribute is handled differently and is judged to pass or fail the audit. Entity attributes that fail the audit are returned to the data model designer, and the data model is modified accordingly.
Compared with the prior art, the technical scheme has the advantages that the data standard is introduced in the data model design stage to replace manual evaluation of the data model after the data model is designed, so that the data quality is monitored in advance, the data quality problem is avoided as much as possible, and meanwhile, the cost for modifying the data model is reduced.
Further, the entity attribute further includes an entity attribute data type, an entity attribute data length, and an entity attribute data precision.
By setting the entity attributes, a basis is provided for auditing the entity attributes, and the accuracy of model auditing is improved.
Further, the standard information item also comprises a standard information item technical attribute and an annotation information item management attribute.
By setting the standard information items, a basis is provided for auditing the entity attributes, and the accuracy of model auditing is further improved.
Further, the similarity coefficient is calculated as $J(A, B) = |A \cap B| / |A \cup B|$; a similarity coefficient of 1 means that A and B are identical, and a similarity coefficient less than 1 means that A and B are not identical.
Through the setting of the similarity function, the similarity between the entity attribute text and the standard information item text can be visually seen, and meanwhile, the judgment standard is convenient to set.
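The similarity coefficient is an intersection-over-union ratio, commonly known as the Jaccard coefficient. The short Python sketch below computes it for two word sets. The function name `jaccard`, the sample sets, and the choice to treat two empty sets as identical are assumptions for illustration, not part of the described method.

```python
def jaccard(a, b):
    """Return |A intersect B| / |A union B| for two sets of words."""
    if not a and not b:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical word sets after segmentation.
a = {"customer", "name", "legal"}
b = {"customer", "name", "full"}

print(jaccard(a, b))  # 2 shared words out of 4 distinct words: 0.5
print(jaccard(a, a))  # identical sets: 1.0
```

Because the coefficient always lies between 0 and 1, a single numeric threshold is enough to express the user's requirement.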
One or more technical schemes provided by the invention at least have the following technical effects or advantages:
1. The data standard is introduced at the data model design stage to replace manual evaluation of the data model after the design is finished, so that data quality is monitored in advance, data quality problems are avoided as far as possible, and the cost of modifying the data model is reduced.
2. By setting the entity attributes, a basis is provided for auditing the entity attributes, and the accuracy of model auditing is improved.
3. By setting the standard information items, a basis is provided for auditing the entity attributes, and the accuracy of model auditing is further improved.
4. Through the setting of the similarity function, the similarity between the entity attribute text and the standard information item text can be visually seen, and meanwhile, the judgment standard is convenient to set.
Drawings
FIG. 1 is a flowchart of a data model auditing method based on data standards according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of embodiments of the present application, generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Referring to FIG. 1, the data model auditing method based on a data standard according to this embodiment includes the following steps:
S1: collecting data model design information: acquiring the entity attributes of a data model in the design stage, wherein each entity attribute comprises an entity attribute name and an entity attribute business meaning, and storing each entity attribute name together with its business meaning as entity attribute text data in a key-value pair structure;
S2: calculating a similarity coefficient: acquiring the standard information items of the existing data standard, wherein each standard information item comprises a standard information item name and a standard information item business meaning; storing each standard information item name together with its business meaning as standard information item text data in a key-value pair structure; performing word segmentation on the entity attribute text data and the standard information item text data; and calculating, from the word segmentation results, a similarity coefficient between each entity attribute text data and the standard information item text data;
S3: data comparison and arrangement: using the similarity coefficients obtained in step S2, excluding the combinations of entity attribute text data and standard information item text data whose similarity coefficient does not meet the user's requirement, and sorting the remaining combinations in descending order of similarity coefficient;
S4: model auditing: if the similarity coefficient shows that the entity attribute text data is identical to the standard information item text data, directly checking whether the corresponding entity attribute is consistent with the standard information item; if so, the entity attribute passes the audit, and otherwise it fails; if the similarity coefficient shows that the entity attribute text data differs from the standard information item text data, manually determining, with reference to the similarity coefficients, the standard information item text data most similar to the entity attribute text data, and then checking whether the corresponding entity attribute is consistent with that standard information item; if so, the entity attribute passes the audit, and otherwise it fails;
S5: model audit feedback: feeding back the result of step S4, returning each entity attribute that failed the audit in step S4, and modifying the data model according to the returned entity attributes.
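The branching in step S4 can be sketched as follows. This is a minimal Python illustration under stated assumptions: `audit_attribute`, the callbacks `is_consistent` and `choose_manually`, and the result labels are hypothetical names, and the manual selection is modeled as a function that returns the chosen standard information item or `None`.

```python
def audit_attribute(candidates, is_consistent, choose_manually):
    """Audit one entity attribute.

    candidates: (standard_item, coefficient) pairs sorted in descending order.
    is_consistent: callback that checks the attribute against a standard item.
    choose_manually: callback standing in for the human reviewer.
    """
    exact = [item for item, coeff in candidates if coeff == 1.0]
    if exact:
        # Text data is identical: check consistency directly.
        chosen = exact[0]
    else:
        # Text data differs: a reviewer picks the most similar item.
        chosen = choose_manually(candidates)
    if chosen is None:
        return "undetermined"
    return "pass" if is_consistent(chosen) else "fail"


# Hypothetical candidates for one entity attribute.
candidates = [("customer_name", 1.0), ("client_name", 0.6)]
result = audit_attribute(
    candidates,
    is_consistent=lambda item: item == "customer_name",
    choose_manually=lambda pairs: None,
)
print(result)  # pass
```

Attributes that end as "fail" are the ones step S5 would return to the designer.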
Specifically, the similarity coefficient between the entity attribute text and the standard information item text is calculated after both have undergone word segmentation. If no standard information item can be determined manually, the corresponding entity attribute is stored in a data standard pending library; if it then passes verification, a new standard information item is formed from it.
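The pending library described above might be modeled as below. This is a sketch only: the class `PendingLibrary`, its method names, and the decision to discard an entry that does not pass verification are assumptions, since the text does not say how the library is stored or what happens to rejected entries.

```python
class PendingLibrary:
    """Holds entity attributes for which no standard information item was found."""

    def __init__(self):
        self.pending = {}   # attribute name -> business meaning
        self.standard = {}  # standard information items formed after verification

    def park(self, name, business_meaning):
        """Store an unmatched entity attribute for later verification."""
        self.pending[name] = business_meaning

    def verify(self, name, approved):
        """Resolve a pending entry; approved entries become standard items."""
        meaning = self.pending.pop(name)
        if approved:
            self.standard[name] = meaning
        # Assumption: entries that are not approved are simply dropped.
        return approved


library = PendingLibrary()
library.park("loyalty_tier", "Customer loyalty programme level")
library.verify("loyalty_tier", approved=True)
print(library.standard)
```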
The working principle of the above embodiment is as follows:
The method collects the entity attributes of a data model in the design stage, obtains the standard information items of the data standard, and calculates the similarity coefficients between the entity attribute text data and the standard information item text data. Depending on the value of the similarity coefficient, each entity attribute is handled differently and is judged to pass or fail the audit. Entity attributes that fail the audit are returned to the data model designer, and the data model is modified accordingly.
Compared with the prior art, the technical scheme has the advantages that the data standard is introduced in the data model design stage to replace manual evaluation of the data model after the data model is designed, so that the data quality is monitored in advance, the data quality problem is avoided as much as possible, and meanwhile, the cost for modifying the data model is reduced.
The entity attributes further comprise entity attribute data type, entity attribute data length and entity attribute data precision.
By setting the entity attributes, a basis is provided for auditing the entity attributes, and the accuracy of model auditing is improved.
The standard information item also comprises a standard information item technical attribute and an annotation information item management attribute.
Specifically, the standard information item technical attributes include data type, data length and data precision.
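One way to picture the consistency check between an entity attribute and its standard information item is a field-by-field comparison of data type, data length and data precision. The Python sketch below assumes that "consistent" means all three fields are equal; the text does not define the check in detail, and the names `TechnicalAttributes` and `mismatches` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TechnicalAttributes:
    data_type: str
    data_length: Optional[int]
    data_precision: Optional[int]


def mismatches(entity, standard):
    """List the technical fields on which the entity attribute deviates."""
    fields = ("data_type", "data_length", "data_precision")
    return [f for f in fields if getattr(entity, f) != getattr(standard, f)]


# Hypothetical attribute and standard item.
entity = TechnicalAttributes("VARCHAR", 50, None)
standard = TechnicalAttributes("VARCHAR", 100, None)

problems = mismatches(entity, standard)
print(problems)  # ['data_length']
```

Returning the list of deviating fields, rather than a bare pass or fail, gives the designer something concrete to correct when the attribute is sent back.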
By setting the standard information items, a basis is provided for auditing the entity attributes, and the accuracy of model auditing is further improved.
The similarity coefficient is calculated by the formula $J(A, B) = |A \cap B| / |A \cup B|$; if the similarity coefficient is 1, A and B are completely the same, and if the similarity coefficient is less than 1, A and B are not completely the same.
Specifically, combinations of entity attribute text and standard information item text whose similarity coefficient is less than 0.3 are excluded.
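The exclusion and ordering of step S3 with the 0.3 threshold can be sketched in a few lines of Python. The function name `shortlist`, the tuple layout, and the sample scores are assumptions for illustration.

```python
def shortlist(scored_pairs, threshold=0.3):
    """Drop pairs scoring below the threshold and sort the rest, best first.

    scored_pairs: (entity_attribute, standard_item, coefficient) tuples.
    """
    kept = [pair for pair in scored_pairs if pair[2] >= threshold]
    return sorted(kept, key=lambda pair: pair[2], reverse=True)


# Hypothetical similarity coefficients for one entity attribute.
scored = [
    ("cust_name", "customer_id", 0.25),
    ("cust_name", "client_name", 0.6),
    ("cust_name", "customer_name", 1.0),
    ("cust_name", "account_name", 0.3),
]

for entity_attribute, standard_item, coefficient in shortlist(scored):
    print(entity_attribute, standard_item, coefficient)
```

A pair scoring exactly 0.3 is kept here, because only coefficients less than 0.3 are excluded.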
Through the setting of the similarity function, the similarity between the entity attribute text and the standard information item text can be visually seen, and meanwhile, the judgment standard is convenient to set.
The above embodiments express only specific implementations of the present application; although their description is specific and detailed, it should not be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make several changes and modifications without departing from the technical idea of the present application, all of which fall within the protection scope of the present application.

Claims (4)

1. A data model auditing method based on a data standard, characterized by comprising the following steps:
S1: collecting data model design information: acquiring the entity attributes of a data model in the design stage, wherein each entity attribute comprises an entity attribute name and an entity attribute business meaning, and storing each entity attribute name together with its business meaning as entity attribute text data in a key-value pair structure;
S2: calculating a similarity coefficient: acquiring the standard information items of the existing data standard, wherein each standard information item comprises a standard information item name and a standard information item business meaning; storing each standard information item name together with its business meaning as standard information item text data in a key-value pair structure; performing word segmentation on the entity attribute text data and the standard information item text data; and calculating, from the word segmentation results, a similarity coefficient between each entity attribute text data and the standard information item text data;
S3: data comparison and arrangement: using the similarity coefficients obtained in step S2, excluding the combinations of entity attribute text data and standard information item text data whose similarity coefficient does not meet the user's requirement, and sorting the remaining combinations in descending order of similarity coefficient;
S4: model auditing: if the similarity coefficient shows that the entity attribute text data is identical to the standard information item text data, directly checking whether the corresponding entity attribute is consistent with the standard information item; if so, the entity attribute passes the audit, and otherwise it fails; if the similarity coefficient shows that the entity attribute text data differs from the standard information item text data, manually determining, with reference to the similarity coefficients, the standard information item text data most similar to the entity attribute text data, and then checking whether the corresponding entity attribute is consistent with that standard information item; if so, the entity attribute passes the audit, and otherwise it fails;
S5: model audit feedback: feeding back the result of step S4, returning each entity attribute that failed the audit in step S4, and modifying the data model according to the returned entity attributes.
2. The method of claim 1, wherein the entity attributes further include entity attribute data type, entity attribute data length, and entity attribute data precision.
3. The method of claim 1, wherein the standard information items further comprise standard information item technical attributes and annotation information item management attributes.
4. The method of claim 1, wherein the similarity coefficient is calculated as $J(A, B) = |A \cap B| / |A \cup B|$; if the similarity coefficient is 1, A and B are completely the same, and if the similarity coefficient is less than 1, A and B are not completely the same.
CN202111463766.XA 2021-12-03 2021-12-03 Data model auditing method based on data standard Pending CN114185869A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111463766.XA CN114185869A (en) 2021-12-03 2021-12-03 Data model auditing method based on data standard

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111463766.XA CN114185869A (en) 2021-12-03 2021-12-03 Data model auditing method based on data standard

Publications (1)

Publication Number Publication Date
CN114185869A (en) 2022-03-15

Family

ID=80603330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111463766.XA Pending CN114185869A (en) 2021-12-03 2021-12-03 Data model auditing method based on data standard

Country Status (1)

Country Link
CN (1) CN114185869A (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102750367A (en) * 2011-12-29 2012-10-24 中华电信股份有限公司 Big data checking system and method thereof on cloud platform
CN103729713A (en) * 2013-11-06 2014-04-16 远光软件股份有限公司 Audit result display configuration method and device
CN106708909A (en) * 2015-11-18 2017-05-24 阿里巴巴集团控股有限公司 Data quality detection method and apparatus
CN107886240A (en) * 2017-11-09 2018-04-06 上海海事大学 A kind of rule-based cross-border electric business commercial quality Risk Identification Method
CN109684533A (en) * 2018-12-29 2019-04-26 ***股份有限公司 A kind of approaches to IM and device
CN110362601A (en) * 2019-06-19 2019-10-22 平安国际智慧城市科技股份有限公司 Mapping method, device, equipment and the storage medium of metadata standard
CN110414579A (en) * 2019-07-18 2019-11-05 北京信远通科技有限公司 Metadata schema closes mark property inspection method and device, storage medium
CN110765337A (en) * 2019-11-15 2020-02-07 中科院计算技术研究所大数据研究院 Service providing method based on internet big data
CN111539633A (en) * 2020-04-26 2020-08-14 北京思特奇信息技术股份有限公司 Service data quality auditing method, system, device and storage medium
CN112541056A (en) * 2020-12-18 2021-03-23 卫宁健康科技集团股份有限公司 Medical term standardization method, device, electronic equipment and storage medium
CN113127458A (en) * 2019-12-30 2021-07-16 北京奇虎科技有限公司 Data quality auditing method and device, electronic equipment and storage medium
CN113342786A (en) * 2021-08-02 2021-09-03 浩鲸云计算科技股份有限公司 Model management and control-based online data management and management method and system
CN113377740A (en) * 2021-05-28 2021-09-10 中国铁道科学研究院集团有限公司电子计算技术研究所 Railway metadata management method, application method and device
CN113377758A (en) * 2021-06-30 2021-09-10 数字郑州科技有限公司 Data quality auditing engine and auditing method thereof
CN113591485A (en) * 2021-06-17 2021-11-02 国网浙江省电力有限公司 Intelligent data quality auditing system and method based on data science

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HOSSEIN TOHIDI 等: "Statistical Character-Based Syntax Similarity Measurement for Detecting Biomedical Syntax Variations through Named Entity Recognition", 《NETWORKED DIGITAL TECHNOLOGIES》, 31 December 2011 (2011-12-31), pages 164 *
刘文奇: "中国公共数据库数据质量控制模型体系及实证", 《中国科学:信息科学》, vol. 44, no. 07, 20 July 2014 (2014-07-20), pages 836 - 856 *

Similar Documents

Publication Publication Date Title
US11100408B2 (en) System and/or method for generating clean records from imperfect data using model stack(s) including classification model(s) and confidence model(s)
Gousios et al. Measuring developer contribution from software repository data
US20180300226A1 (en) System and method for equivalence class analysis-based automated requirements-based test case generation
CN107870956B (en) High-utility item set mining method and device and data processing equipment
CN110134663B (en) Organization structure data processing method and device and electronic equipment
CN109783638A (en) A kind of user comment clustering method based on semi-supervised learning
KR101625124B1 (en) The Technology Valuation Model Using Quantitative Patent Analysis
CN111061998A (en) Analysis model and method for economic measurement
CN113886373A (en) Data processing method and device and electronic equipment
KR950007926B1 (en) Method for assessing the easiness of a process for manipulating a product
CN114185869A (en) Data model auditing method based on data standard
JP5690472B2 (en) Data extraction system
US9785404B2 (en) Method and system for analyzing data in artifacts and creating a modifiable data network
Tsunoda et al. Pitfalls of analyzing a cross-company dataset of software maintenance and support
JP2012014308A (en) Method and device for predicting influence of change
CN115562981A (en) Software quality evaluation method based on machine learning
CN115841359A (en) Object generation method, device, equipment and storage medium
CN111881146B (en) Method, computing device and medium for charging a fee
US9104812B2 (en) Injection of data into a software application
CN115470690A (en) System and method for machine learning based product design automation and optimization
EP3588304B1 (en) System and method for equivalence class analysis-based automated requirements-based test case generation
CN112560952A (en) Supplier assessment method and device, electronic equipment and storage medium
JP2020166443A (en) Data processing method recommendation system, data processing method recommendation method, and data processing method recommendation program
CN109871318B (en) Key class identification method based on software operation network
JP7453932B2 (en) Design support equipment, methods and programs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination