CN113221528B

CN113221528B - Automatic generation and execution method of clinical data quality evaluation rule based on openEHR model

Info

Publication number: CN113221528B
Application number: CN202110507026.5A
Authority: CN
Inventors: 吕旭东; 段会龙; 田琪; 韩喆僖
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-05-10
Filing date: 2021-05-10
Publication date: 2023-09-01
Anticipated expiration: 2041-05-10
Also published as: CN113221528A

Abstract

The invention discloses a method for automatically generating and executing clinical data quality evaluation rules based on an openEHR model, comprising the following steps: (1) establishing a mapping relation between the data quality constraint knowledge of each node in an openEHR model and data quality evaluation rules; (2) acquiring the openEHR models in use and parsing them into JSON objects in a fixed format; (3) processing each node in all openEHR models according to the mapping relation and the JSON objects to generate data quality evaluation rules of a fixed structure; (4) processing the data quality evaluation rules to generate a rule execution configuration file; (5) deploying the relevant information of the database to be evaluated in a Spark rule execution engine; (6) executing the data quality evaluation rules in the Spark rule execution engine by calling functions, according to the rule execution configuration file and the relevant information of the database to be evaluated, so as to obtain a data quality evaluation result.

Description

Automatic generation and execution method of clinical data quality evaluation rule based on openEHR model

Technical Field

The invention belongs to the technical field of electronic medical record data quality evaluation, and particularly relates to an automatic generation and execution method of a clinical data quality evaluation rule based on an openEHR model.

Background

The data stored in electronic medical records is highly valuable for medical care, scientific research, public health and other purposes. However, that value depends on the data being of high quality, that is, ready for research. Quality problems such as missing, erroneous, invalid, incomplete and inconsistent data are common in electronic medical record data in China, and they directly affect how well the data can be used. Data quality assessment helps data users find quality problems so that appropriate measures can be taken to improve data quality, and it is an indispensable step in integrating and using electronic medical record data.

For structured electronic medical record data, rule-based data quality evaluation is highly general, easy to implement and the most widely used approach. In such methods, defining the rules to be applied to the data is the starting point of the assessment and directly affects its result; it is also the step that requires manual work and takes the most time. Most current evaluation methods involve creating data quality queries, that is, writing data quality evaluation rules in Structured Query Language (SQL). However, electronic medical records contain hundreds of thousands of data items, and a single data quality evaluation task usually needs to run hundreds of rules, so writing them by hand is labor-intensive and time-consuming.

Existing research uses parameterization to define rules through structures such as variables, functions and parameters, which simplifies the rule definition process. However, the parameters of each rule still have to be defined in practice, so the heavy workload and the time and labor cost of rule definition have not been effectively resolved.

The medical information model is a standard information model, clinical concepts are expressed in a standardized and reusable mode, a clinical data standard structure is provided, standard medical terms can be bound, and the requirements of consistency of clinical information expression and storage modes are met.

The openEHR model is a representative hierarchical medical information model and is divided into a reference model and a prototype (archetype) model. The reference model is a general base model that defines the semantics and structure of information and is handled at the syntactic level. The prototype model is composed of prototypes and templates: a prototype represents a concept or a set of data elements of one information domain and is defined by constraining the data structures in the reference model, while a template further assembles and constrains prototypes to meet the requirements of a specific scenario. Building and using an information model is the basis for establishing a standardized electronic medical record system. Because the information model contains requirements on data quality, it can serve as a knowledge source from which data quality evaluation rules are generated automatically.

Disclosure of Invention

In view of the above, the embodiment of the invention provides an automatic generation and execution method of clinical data quality evaluation rules based on openEHR model, comprising the following steps:

(1) Establishing a mapping relation between data quality constraint knowledge and a data quality evaluation rule of each node in an openEHR model;

(2) Acquiring a used openEHR model, and analyzing the openEHR model into a JSON object in a fixed format and node information contained in the JSON object;

(3) Processing each node in all openEHR models according to the mapping relation, the JSON object and the node information contained in the JSON object to generate a data quality evaluation rule of a fixed structure;

(4) Processing the data quality evaluation rule to generate a rule execution configuration file;

(5) Deploying relevant information of a database to be evaluated in a Spark rule execution engine;

(6) Executing the data quality evaluation rule in the Spark rule execution engine by calling functions, according to the rule execution configuration file and the relevant information of the database to be evaluated, so as to obtain a data quality evaluation result.

In one embodiment, in step (1), when the mapping relation between the data quality constraint knowledge and the data quality evaluation rule is constructed, the data type in the openEHR model is taken as a classification basis, and each type of attribute and keyword are combined for construction.

In one embodiment, the step (1) specifically includes:

(1-1) analyzing the data acquisition specifications issued during the construction of a regional health information platform to obtain the data quality assessment requirements and data quality assessment rules for electronic medical records;

(1-2) analyzing the structure of an openEHR model according to data quality evaluation requirements, and extracting relevant data quality constraint knowledge in the openEHR model, wherein the openEHR model comprises an openEHR reference model and an openEHR prototype model;

(1-3) establishing a mapping relation between the data quality evaluation rule of each node and the data quality constraint knowledge in the openEHR model, taking the data types defined in the openEHR reference model as the classification basis.

In one embodiment, in step (2), each node information in the obtained JSON object includes a node name, a node path, a database table name corresponding to the node, a column name, a data type, and data constraint information, where data constraint information structures corresponding to nodes of different data types are different.

In one embodiment, in step (3), the generated data quality assessment rule includes: the rule identifier, the rule content and the database information corresponding to the nodes used by the rule, wherein the rule content is defined by using a GDL (Guideline Definition Language) structure.

In one embodiment, in step (3), the rule content included in the generated data quality evaluation rule is bound to a node of the openEHR model, so that the data quality evaluation rule is easy to reuse.

In one embodiment, in step (3), the rule identifier included in the generated data quality assessment rule includes: the openEHR template id, the node path, the local path and the keyword are used for identifying an automatically generated data quality evaluation rule, namely the data quality evaluation rule is generated according to certain constraint information of a certain node of a certain template, so that the rule is convenient to update and maintain.

In one embodiment, in step (4), the generated rule execution configuration file defines the rule execution flow and includes: the database table names, the column names, the names of the called methods, the parameters required by the methods (such as maximum values, minimum values and data formats), and the logical relations among the rules, wherein the logical relations comprise AND (conjunction) and OR (disjunction).

In one embodiment, in step (6), the Spark rule execution engine executes the data quality assessment rule by calling a function, including:

defining corresponding functions, according to the data quality evaluation requirements obtained by analysis, to implement the data quality evaluation rules;

taking the rule execution configuration file as input, parsing the parameters in the rule execution configuration file, and creating a DataFrame for the database table to be evaluated for processing;

for rules connected by AND logic, using the data that conforms to the former rule as the input of the latter rule, and for rules connected by OR logic, merging the data that conforms to the rules on either side of the OR as the overall execution result of the rule; and packaging the resulting data quality evaluation result into an object and returning the object to the user.

In one embodiment, the data quality assessment result obtained in step (6) includes a total amount of data, an amount of failed data, a failed data ID, and a failed data value.

According to the method for automatically generating and executing the clinical data quality evaluation rule based on the openEHR model, which is provided by the embodiment, the workload of manually defining the evaluation rule can be remarkably reduced, the quantitative data quality evaluation result can be automatically counted, a user does not need to know the underlying database structure, and a rule updating and maintaining mechanism is provided to help the user manage the rule. Compared with the prior art, the invention has the beneficial technical effects that:

1) The time and labor cost for manually defining the rules are reduced; the automatic generation of the data quality evaluation rule based on the data constraint knowledge in the template of the openEHR model can remarkably reduce the workload of manually defining the rule and reduce the time and labor cost for defining the rule.

2) Rules are easy to reuse; a rule is bound directly to a template node of the openEHR model rather than to the underlying database structure. Template nodes are composed of general-purpose prototype nodes, so when the same prototype nodes are used on top of a different database structure, the rule content does not need to change. Only the mapping between nodes and the database has to be obtained again, and the rule can then be reused quickly.

3) Rules are easy to manage; because each rule is identified by an automatically generated rule identifier, the knowledge source from which the rule was generated can be located quickly. After a user modifies a template, the affected rules can be updated quickly through their rule identifiers.

4) The universality is achieved; for databases based on openEHR standards, the method can be used for automatically generating rules to evaluate the quality of data.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method of automatically generating and executing clinical data quality assessment rules based on an openEHR model in an embodiment;

FIG. 2 is a schematic diagram of an openEHR model DV_COUNT type node resolution format in one embodiment;

FIG. 3 is a flow diagram that illustrates processing an openEHR template node to automatically generate rules in one embodiment;

FIG. 4 is a flow diagram of a rule base update function in one embodiment;

FIG. 5 is a schematic diagram of a rule execution configuration file in one embodiment.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the detailed description is presented by way of example only and is not intended to limit the scope of the invention.

FIG. 1 is a flow chart of a method of automatically generating and executing clinical data quality assessment rules based on an openEHR model in an embodiment. As shown in fig. 1, the method for automatically generating and executing the clinical data quality assessment rule provided by the embodiment includes the following steps:

Step 1: establishing a mapping relation between the data quality constraint knowledge of each node in the openEHR model and the data quality evaluation rules.

The establishment of the mapping relationship in step 1 is a basis for implementing the method, and in one embodiment, the establishment process of the mapping relationship in step 1 includes:

First, the data acquisition specifications issued during the construction of the regional health information platform are analyzed to obtain the data quality assessment requirements and data quality assessment rules for electronic medical records.

In the embodiment, the data quality evaluation requirements of the electronic medical record are analyzed according to the data acquisition specifications issued in the construction process of the regional health information platform, the data quality evaluation conditions are summarized, and the data quality evaluation requirements are classified according to the clinical data quality framework to obtain the data quality evaluation requirements of each type of data.

Then, the structure of the openEHR model is analyzed according to the data quality evaluation requirements, and the relevant data quality constraint knowledge in the openEHR model is extracted, wherein the openEHR model comprises an openEHR reference model and an openEHR prototype model.

In an embodiment, after structural analysis is performed on the openEHR model, an openEHR reference model and an openEHR prototype model may be determined, where the openEHR prototype model includes a prototype model and a template model, and constraint definitions are added to the openEHR prototype model and the openEHR reference model through the openEHR template.

Knowledge analysis is carried out on the openEHR prototype model according to the data quality evaluation requirements, and the relevant data quality constraint knowledge in the openEHR prototype model is extracted. An openEHR prototype is expressed in the Archetype Definition Language (ADL), in which the cADL (constraint form of ADL) and dADL (data definition form of ADL) syntaxes describe data constraints. These two syntaxes are analyzed against the summarized data quality assessment requirements, and the data quality constraint knowledge that corresponds to those requirements is extracted.

Knowledge analysis is likewise carried out on the openEHR reference model according to the data quality evaluation requirements, and the relevant data quality constraint knowledge in the openEHR reference model is extracted. Within the reference model, the information model of general concepts and the information model of data types are the parts most closely related to the data quality assessment requirements; their structures and attributes are analyzed, and the data quality constraint knowledge that corresponds to the requirements is extracted.

The data quality constraint knowledge obtained from the openEHR prototype model and from the openEHR reference model is collectively referred to as the data quality constraint knowledge obtained from the openEHR model. Once this knowledge has been obtained, a mapping relationship between the data quality evaluation rules and the data quality constraint knowledge in the openEHR model is established, taking the data types defined in the openEHR reference model as the classification basis. In general, one data type contains several kinds of data quality constraint knowledge and corresponds to several data quality evaluation requirements.

The data quality constraint knowledge and data quality evaluation requirements contained in the openEHR model will grow as informatization develops. The mapping relationship established by the method can be extended continuously to meet these needs while remaining stable over a period of time, so the method is extensible.
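As an illustration of a mapping classified by data type, the following minimal Python sketch pairs openEHR data types with rule keywords. The keyword strings are shortened forms of the rule names used in this description; the dictionary layout, the function name `keywords_for` and the choice of `compare` for DV_QUANTITY are assumptions made for the example, not the patented data structure.

```python
from typing import Dict, List

# Constraints that the description applies to every node type
# (non-null, association, data type). Keyword spellings are assumed.
COMMON_KEYWORDS: List[str] = ["not_null", "exist_in", "type"]

# Type-specific constraints; extend this table as new requirements appear.
TYPE_TO_KEYWORDS: Dict[str, List[str]] = {
    "DV_IDENTIFIER": ["is_unique"],
    "DV_CODED_TEXT": ["code_by"],
    "DV_COUNT": ["compare", "precision"],
    "DV_QUANTITY": ["compare"],  # assumed: one range rule per unit
    "DV_DATE_TIME": ["format"],
    "DV_DATE": ["format"],
    "DV_TIME": ["format"],
}


def keywords_for(data_type: str) -> List[str]:
    """Return every rule keyword that applies to a node of this data type."""
    return COMMON_KEYWORDS + TYPE_TO_KEYWORDS.get(data_type, [])


print(keywords_for("DV_COUNT"))
```

Keeping the table keyed by data type means that supporting a new constraint only requires adding an entry, which matches the extensibility argument made above.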

Step 2: acquiring the openEHR models in use, and parsing each openEHR model into a JSON object in a fixed format together with the node information contained in the JSON object.

In an embodiment, the template needs to be parsed in order to extract the data quality constraint knowledge in the openEHR template. The data quality constraint knowledge of each node of the openEHR template is extracted using the mapping relation determined in step 1, which yields the JSON object and the node information contained in the JSON object.

In one embodiment, as shown in FIG. 2, the extraction result of a DV_COUNT type node is represented by a JSON structure that mainly includes the node path, the database structure information corresponding to the node, the node data type, the node name and the node value range, which correspond to the "elementPath", "cdrInfo", "type", "ontology" and "range" keywords, respectively. Different data types contain different data quality constraint knowledge, so their structures also differ; the first four keys are common to all types.
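To make the parsed format concrete, here is a hedged Python sketch of what such a DV_COUNT node object might look like. Only the five top-level keys come from the description; the path, the table and column names, and the inner field names (`table`, `column`, `min`, `max`) are hypothetical, since FIG. 2 is not reproduced here.

```python
import json

# Illustrative parse result for a DV_COUNT node. All values and all
# nested field names are assumptions made for this example.
dv_count_node = {
    "elementPath": "/items[at0001]/value",
    "cdrInfo": [{"table": "vital_signs", "column": "pulse_rate"}],
    "type": "DV_COUNT",
    "ontology": "Pulse rate",
    "range": {"min": 0, "max": 300},
}

# The four keys that the description says are shared by every data type.
GENERIC_KEYS = ("elementPath", "cdrInfo", "type", "ontology")


def has_generic_keys(node: dict) -> bool:
    """Check that the keys common to all node types are present."""
    return all(key in node for key in GENERIC_KEYS)


print(json.dumps(dv_count_node, indent=2))
```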

Step 3: processing each node in all openEHR models according to the mapping relation, the JSON object and the node information contained in the JSON object, to generate data quality evaluation rules of a fixed structure.

In the step 3, the JSON structure extracted by each node of the openEHR template model is mainly utilized for processing, and a data quality evaluation rule of a fixed structure is generated. As shown in fig. 3, the method specifically includes:

(a) Acquiring the JSON object of the current node and determining from its cdrInfo (clinical data center information) structure whether the node is used in the database, that is, whether cdrInfo is empty; nodes whose cdrInfo is not empty are processed, and nodes whose cdrInfo is empty are skipped;

(b) cdrInfo is an array structure and must be traversed so that every database field corresponding to the node is processed. Every type of node contains non-null constraint knowledge, association constraint (Element existlon) knowledge and data type constraint (Element type) knowledge, so these three constraints are handled first. For example, if a node is required to be non-null, the corresponding rule is generated;

(c) Determining the node type: if the node is of the DV_IDENTIFIER type, an "Element is unique" rule is generated for its uniqueness constraint; if the node is of the DV_CODED_TEXT type, an "Element code by" rule is generated for its coding requirement; if the node is of the DV_COUNT or DV_INTERVAL<DV_COUNT> type, "Compare (DataValue)" and "Element precision" rules are generated for its data range and data precision requirements; if the node is of the DV_DATE_TIME, DV_DATE or DV_TIME type, an "Element format" rule is generated for its data format requirement;

(d) If the node is of the DV_QUANTITY type, it contains both the numerical range and the unit information of the node; the two are distinguished by a localPath (local path), and a rule must be generated for the numerical range of each unit. The DV_INTERVAL<DV_QUANTITY> type comprises an upper node and a lower node, each of the DV_QUANTITY type, and the upper and lower nodes are each processed according to the same procedure.

The process is repeated until all nodes of all openEHR templates have been processed, and all generated data quality evaluation rules are inserted into a rule base. Each generated data quality evaluation rule includes the rule identifier, the rule content and the database information corresponding to the nodes used by the rule, wherein the rule content is defined using a GDL structure.
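The node-processing flow above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `required` flag, the nested `table`/`column` field names and the output dictionary layout are hypothetical, and the association constraint and the per-unit handling of DV_QUANTITY are left out for brevity. It does not produce real GDL.

```python
from typing import Dict, List

FORMAT_TYPES = {"DV_DATE_TIME", "DV_DATE", "DV_TIME"}
COUNT_TYPES = {"DV_COUNT", "DV_INTERVAL<DV_COUNT>"}


def keywords_for_node(node: Dict) -> List[str]:
    """Pick the rule keywords for one node from its constraints and type."""
    keywords = []
    if node.get("required"):  # hypothetical flag for a non-null requirement
        keywords.append("not_null")
    keywords.append("type")  # every node carries a data type constraint
    node_type = node["type"]
    if node_type == "DV_IDENTIFIER":
        keywords.append("is_unique")
    elif node_type == "DV_CODED_TEXT":
        keywords.append("code_by")
    elif node_type in COUNT_TYPES:
        keywords += ["compare", "precision"]
    elif node_type in FORMAT_TYPES:
        keywords.append("format")
    return keywords


def generate_rules(node: Dict) -> List[Dict]:
    """Generate one rule stub per (database field, keyword) pair."""
    if not node.get("cdrInfo"):  # step (a): node is not stored, skip it
        return []
    rules = []
    for field in node["cdrInfo"]:  # step (b): traverse every mapped field
        for keyword in keywords_for_node(node):
            rules.append({
                "keyword": keyword,
                "elementPath": node["elementPath"],
                "table": field["table"],
                "column": field["column"],
            })
    return rules


node = {
    "elementPath": "/items[at0001]/value",
    "cdrInfo": [{"table": "vital_signs", "column": "pulse_rate"}],
    "type": "DV_COUNT",
    "required": True,
}
for rule in generate_rules(node):
    print(rule["keyword"], rule["table"], rule["column"])
```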

To identify the knowledge source of each rule, a rule identifier is generated first whenever a data quality assessment rule is generated; rules are then updated, and the rule base is managed, through this identifier. In one embodiment, as shown in FIG. 4, the specific process includes:

generating a rule identifier according to the naming rule "openEHR template id + node path + local path + keyword";

if the rule identifier already exists in the rule base, determining whether the rule needs to be deleted or modified; for example, if a database field was originally required to be non-null but is allowed to be null after the template version is updated, the rule corresponding to that constraint is deleted;

if the rule identifier does not exist in the rule base, processing the node information to generate a new rule and adding the new rule to the rule base.
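A minimal Python sketch of this identifier scheme and update pass is given below. It assumes the rule base is an in-memory dictionary keyed by rule identifier and that the four identifier parts are joined with a `|` separator; both choices, along with the function names, are illustrative assumptions rather than details taken from FIG. 4.

```python
from typing import Dict

SEP = "|"  # assumed separator between the four identifier parts


def make_rule_id(template_id: str, node_path: str,
                 local_path: str, keyword: str) -> str:
    """Build 'template id + node path + local path + keyword'."""
    return SEP.join([template_id, node_path, local_path, keyword])


def update_rule_base(rule_base: Dict[str, dict], template_id: str,
                     generated: Dict[str, dict]) -> Dict[str, dict]:
    """Synchronize the rule base with rules regenerated from one template."""
    prefix = template_id + SEP
    # Delete rules of this template whose constraint no longer exists.
    stale = [rid for rid in rule_base
             if rid.startswith(prefix) and rid not in generated]
    for rid in stale:
        del rule_base[rid]
    # Overwrite modified rules and add new ones.
    rule_base.update(generated)
    return rule_base


path = "/items[at0001]/value"
base = {
    make_rule_id("vital_signs.v1", path, "", "not_null"): {},
    make_rule_id("vital_signs.v1", path, "", "compare"): {"max": 200},
}
regenerated = {
    make_rule_id("vital_signs.v1", path, "", "compare"): {"max": 300},
}
update_rule_base(base, "vital_signs.v1", regenerated)
print(sorted(base))
```

In this run the non-null rule is removed because the regenerated set no longer contains it, and the range rule is overwritten with its new parameters.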

Step 4: processing the data quality evaluation rules to generate a rule execution configuration file.

The rule execution configuration file defines the rule execution flow and is the input to the rule execution engine. It mainly contains the database table names, the column names, the names of the called methods, the parameters required by the methods (such as maximum values, minimum values and data formats), and the logical relations among the rules, wherein the logical relations comprise AND (conjunction) and OR (disjunction). Its format is shown in FIG. 5. In one embodiment, step 4 specifically includes:

dividing the rules to be executed into four kinds that are processed separately: rules involving an association between two database tables, simple rules involving only one database table and one function, complex rules connected only by AND logic, and complex rules connected by OR logic;

a complex rule connected by AND logic may contain both association constraint rules and simple rules, and if an association constraint is present, it is processed first;

a complex rule connected by OR logic is assumed by default not to contain association constraint rules;

if a complex rule connected by OR logic contains AND logic, the AND processing method is called to process each AND block.
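The four rule kinds can be illustrated with a small Python sketch. The configuration layout shown here (nested dictionaries with `logic` and `rules` keys, and `ref_table` marking an association) is an assumption made for the example; the actual file format is the one shown in FIG. 5, which is not reproduced in this text.

```python
from typing import Dict


def contains_or(rule: Dict) -> bool:
    """True if the rule, or any rule nested inside it, uses OR logic."""
    if "logic" not in rule:
        return False
    return rule["logic"] == "or" or any(contains_or(r) for r in rule["rules"])


def classify(rule: Dict) -> str:
    """Sort a configuration entry into one of the four rule kinds."""
    if "logic" in rule:
        return "or_complex" if contains_or(rule) else "and_complex"
    return "association" if "ref_table" in rule else "simple"


# Hypothetical configuration entries.
simple = {"table": "vital_signs", "column": "pulse_rate",
          "method": "compare", "params": {"minimum": 0, "maximum": 300}}
association = {"table": "vital_signs", "column": "patient_id",
               "method": "exist_in", "ref_table": "patient", "ref_column": "id"}
and_rule = {"logic": "and", "rules": [association, simple]}
or_rule = {"logic": "or",
           "rules": [simple, {"logic": "and", "rules": [simple, simple]}]}

for entry in (association, simple, and_rule, or_rule):
    print(classify(entry))
```

Listing the association rule first inside `and_rule` mirrors the requirement that an association constraint is processed before the simple rules it is combined with.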

Step 5: deploying the address, user name and password information of the database to be evaluated in the Spark rule execution engine.

In one embodiment, step 5 uses the jdbc method defined in Spark's DataFrameReader class to create a DataFrame from any JDBC-compatible database as the object of subsequent data processing; this does not modify the original data of the database under evaluation. JDBC-compatible databases include MySQL, PostgreSQL, H2, Oracle, SQL Server, SAP HANA and DB2.

Step 6: executing the data quality evaluation rules in the Spark rule execution engine by calling functions, according to the rule execution configuration file and the relevant information of the database to be evaluated, to obtain a data quality evaluation result.

In one embodiment, the step 6 specifically includes:

creating a SparkSession as the entry point to Spark;

processing the logical relations and rule parameters in the rule execution configuration file;

creating a DataFrame for the target database table as the object of subsequent data processing;

calling the corresponding functions to process the data;

if one rule includes a plurality of constraints, executing the constraints in the order defined in the rule configuration file, so that the data satisfying the previous constraint (a DataFrame) is used as the input of the next constraint; the amount of data shrinks as execution proceeds, and the data that finally remains conforms to all constraints of the current rule;

taking the difference between the original data set for the current rule and the data set that finally conforms to the rule, to obtain the data set that does not conform to the rule;

for each rule, counting the total amount of data processed, the IDs of the data that do not conform to the rule and the values of the data that do not conform to the rule.
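The execution logic described above (chained filtering for AND, merging for OR, then a difference set and summary counts) can be sketched in plain Python. Lists of dictionaries stand in for Spark DataFrames so that the example runs without a cluster; the function names, the configuration layout and the `id` column are assumptions made for illustration, and a real engine would express the same steps with DataFrame operations.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def not_null(rows: List[Row], column: str) -> List[Row]:
    return [r for r in rows if r[column] is not None]


def compare(rows: List[Row], column: str, minimum, maximum) -> List[Row]:
    return [r for r in rows
            if r[column] is not None and minimum <= r[column] <= maximum]


FUNCTIONS: Dict[str, Callable] = {"not_null": not_null, "compare": compare}


def run(rule: Dict, rows: List[Row]) -> List[Row]:
    """Return the rows that conform to a simple or complex rule."""
    if "logic" not in rule:  # simple rule: call the named function
        return FUNCTIONS[rule["method"]](rows, rule["column"],
                                         **rule.get("params", {}))
    if rule["logic"] == "and":  # each output feeds the next constraint
        for sub in rule["rules"]:
            rows = run(sub, rows)
        return rows
    passed = set()  # OR: merge the rows passing either side
    for sub in rule["rules"]:
        passed |= {r["id"] for r in run(sub, rows)}
    return [r for r in rows if r["id"] in passed]


def evaluate(rule: Dict, rows: List[Row], column: str) -> Dict:
    """Difference set between all rows and conforming rows, plus counts."""
    passed = {r["id"] for r in run(rule, rows)}
    failed = [r for r in rows if r["id"] not in passed]
    return {"total": len(rows),
            "failed_count": len(failed),
            "failed_ids": [r["id"] for r in failed],
            "failed_values": [r[column] for r in failed]}


rows = [{"id": 1, "pulse_rate": 72},
        {"id": 2, "pulse_rate": None},
        {"id": 3, "pulse_rate": 999}]
rule = {"logic": "and", "rules": [
    {"column": "pulse_rate", "method": "not_null"},
    {"column": "pulse_rate", "method": "compare",
     "params": {"minimum": 0, "maximum": 300}},
]}
print(evaluate(rule, rows, "pulse_rate"))
```

With the sample data, row 2 fails the non-null constraint and row 3 fails the range constraint, so both end up in the failed set while row 1 conforms.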

According to the method for automatically generating and executing the clinical data quality evaluation rule based on the openEHR model, which is provided by the embodiment, the workload of manually defining the evaluation rule can be remarkably reduced, the quantitative data quality evaluation result can be automatically counted, a user does not need to know the underlying database structure, and a rule updating and maintaining mechanism is provided to help the user manage the rule.

The foregoing describes the technical solutions and advantages of the invention in detail. It should be understood that the foregoing is merely a description of the presently preferred embodiments of the invention and is not intended to limit it; any changes, additions, substitutions and equivalents made within the scope of the principles of the invention are intended to be included within the scope of protection of the invention.

Claims

1. An automatic generation and execution method of clinical data quality evaluation rules based on openEHR model is characterized by comprising the following steps:

(3) Processing each node in all openEHR models according to the mapping relation, the JSON object and the node information contained in the JSON object to generate a data quality evaluation rule of a fixed structure, wherein the generated data quality evaluation rule comprises: the rule identifier, the rule content and the database information corresponding to the nodes used by the rule; the rule content is bound with the node of the openEHR model, so that multiplexing of the data quality evaluation rule is facilitated, and the rule identifier comprises: the openEHR template id, the node path, the local path and the keywords are used for identifying an automatically generated data quality evaluation rule, namely the data quality evaluation rule is generated according to certain constraint information of a certain node of a certain template, so that the rule is convenient to update and maintain;

2. The method for automatically generating and executing the clinical data quality assessment rules based on the openEHR model according to claim 1, wherein in the step (1), when the mapping relation between the data quality constraint knowledge and the data quality assessment rules is constructed, the data type in the openEHR model is used as a classification basis, and each type of attribute and keyword are combined for construction.

3. The method for automatically generating and executing the clinical data quality assessment rule based on the openEHR model according to claim 1, wherein step (1) specifically comprises:

4. The method for automatically generating and executing clinical data quality evaluation rules based on openEHR model according to claim 1, wherein in step (2), each node information in the obtained JSON object includes a node name, a node path, a database table name corresponding to the node, a column name, a data type, and data constraint information, and the data constraint information structures corresponding to the nodes of different data types are different.

5. The method for automatically generating and executing clinical data quality assessment rules based on openEHR model according to claim 1, wherein in step (3), the rule content is defined using GDL structure.

6. The method for automatically generating and executing the clinical data quality assessment rule based on the openEHR model according to claim 1, wherein in step (4), the generated rule execution configuration file is used for defining a rule execution flow, and includes: the database table names, the column names, the names of the called methods, the parameters required by the methods and the logical relations among the rules, wherein the logical relations comprise AND (conjunction) and OR (disjunction).

7. The method for automatically generating and executing a clinical data quality assessment rule based on openEHR model according to claim 3, wherein in step (6), the Spark rule execution engine executes the data quality assessment rule by calling a function, comprising:

8. The method for automatically generating and executing the openEHR model-based clinical data quality assessment rule according to claim 1 or 7, wherein the obtained data quality assessment result in step (6) includes a total data amount, a reject data ID, and a reject data value.