CN111859301B - Data reliability evaluation method based on improved Apriori algorithm and Bayesian network reasoning

Info

Publication number: CN111859301B
Application number: CN202010728042.2A
Authority: CN
Inventors: 邓建新; 叶志兴; 谢彬; 曾向明; 贺德强; 李先旺
Original assignee: Guangxi University
Current assignee: Guangxi University
Priority date: 2020-07-23
Filing date: 2020-07-23
Publication date: 2024-02-02
Anticipated expiration: 2040-07-23
Also published as: CN111859301A

Abstract

The invention discloses a data reliability evaluation method based on an improved Apriori algorithm and Bayesian network reasoning, belonging to the field of data processing. The algorithm evaluates the reliability of data from the data structure and the relationships among the data. Because it requires neither predefined reliability indices nor knowledge of the data distribution, it reduces the subjectivity of earlier data reliability evaluation. The algorithm is general: it applies to discrete data values as well as to interval numbers. It also offers higher accuracy and helps mine the association relationships within the same dimension and between different dimensions of high-dimensional data, yielding the local reliability and the global reliability of each datum.

Description

Data reliability evaluation method based on improved Apriori algorithm and Bayesian network reasoning

Technical Field

The invention relates to the field of data processing, in particular to a data reliability evaluation method based on an improved Apriori algorithm and Bayesian network reasoning.

Background

With the advent of the big data era, data mining algorithms have been widely applied in many fields, making data the most valuable raw material for many organizations. Many organizations sell data, while others offer services and solutions for mining it. Reliance on secondary data sources, such as estimates and predictions, is also increasing, and these sources may have different characteristics that affect overall reliability. In this setting the more traditional reliability approaches become less useful, because the metadata would need to contain information that the system's data representation hides in the background.

Reliability originates from quality control in industrial engineering and was initially defined as the ability of a product to operate successfully for a predetermined time under specified conditions. This ability is typically expressed as a probability, i.e., the probability that a given function is completed within a given time and range under given environmental conditions. If data are regarded as a product, then, unlike the reliability of a general product, the reliability of data has no unified definition. Building on existing reliability theory, a more objective definition of data reliability is proposed here: the reliability of data is related to the conditional probabilities among the data in different dimensions.

Conventional data reliability methods mainly take one of three approaches: 1) setting standards through scoring by users or experts, and establishing a scoring table from statistical methods and domain background; 2) evaluating the reliability of the data transmission process; 3) formulating reliability indices from the source information of the data, and performing a global reliability evaluation with a data mining algorithm. However, whether the evaluation relies on expert scoring tables, on the reliability of the data transmission process, or on the source information of the data, the resulting reliability indices are subjective to some degree. A relatively objective data reliability evaluation method is therefore needed, so that it can be combined with the existing evaluation methods to form a relatively complete data reliability evaluation system.

Disclosure of Invention

The invention aims to provide a data reliability evaluation method based on an improved Apriori algorithm and Bayesian network reasoning, which solves the technical problems in the background art.

The method uses ensemble clustering based on a nonlinear dimension reduction algorithm, together with an integrated algorithm based on association rules and a Bayesian network. It mines the association relationships within the same dimension of high-dimensional data as well as those between different dimensions, and expresses them in probabilistic form. The proposed data reliability calculation method does not require reliability indices or the data distribution to be determined, which reduces the subjectivity of earlier data reliability evaluation. It is general, applying both to discrete data values and to interval numbers, and it can serve as a reference for data reliability evaluation in other fields where the data are correlated. The method has market prospects in data-driven service applications, data preprocessing for big data, prediction applications based on similar principles, collaborative recommendation in e-commerce, and the like.

A data reliability evaluation method based on an improved Apriori algorithm and Bayesian network reasoning, the evaluation method comprising the following steps:

Step 1: Let the input multidimensional correlated data with diverse characteristics be $S_{ij} = \{a_{ji}\}$, a mixed set of interval values and discrete values, where $i = 1, 2, \ldots, n$ indexes the dimensions of the data and $j = 1, 2, \ldots, m$ indexes the samples. Each datum is regarded as an interval number $a_{ji} = [x_{ji}, y_{ji}]$, where $x_{ji}$ and $y_{ji}$ may be equal. The set of left endpoints of $S_{ij}$, $S_{ij}^{-} = \{x_{ji}\}$, is the minimum-value set of the data, and the set of right endpoints, $S_{ij}^{+} = \{y_{ji}\}$, is the maximum-value set. The multidimensional interval data are thus formed into a sample matrix of minimum and maximum values, $S_{ij} = \{S_{ij}^{-}, S_{ij}^{+}\}$. Data encoding is performed on the minimum-value set $S_{ij}^{-}$ and the maximum-value set $S_{ij}^{+}$ to obtain the data code Code and the coding rule Rule;

Step 2: Construct the directed acyclic graph of a Bayesian network according to the data correlations and attribute characteristics. After the original data have been encoded as in step 1, each dimension of the data is represented as a node of the Bayesian network, indexed by the dimension i of the data and by the state k of that dimension, i.e., the Rule corresponding to its code. The node variables are divided into independent node variables, which have no parent node, and dependent node variables, which have parent nodes. Each directed edge represents the relationship between the data of two dimensions and points from a parent node to its child node;

Step 3: Use the improved Apriori algorithm to obtain the support of each node, and use the supports as the conditional probability table L(V) of the Bayesian network;

Step 4: Perform inference on the Bayesian network of the data using the evidence correlation method, and calculate the reliability of each datum.

Further, in the step 1, the data encoding process includes:

Step 1.1: Perform unsupervised cluster learning on the data $S_{ij}^{-}$ and $S_{ij}^{+}$ separately to obtain the maximum number of neighbors N. Using N neighbors, linearly reconstruct the sample matrix $S_{ij}$ with the locally linear embedding algorithm and compute the eigenvectors of the sample matrix. Cluster the eigenvectors to obtain the data code Code of the sample matrix and the set Rule of clusters in each data dimension, where Rule is the coding rule.

Further, the specific process of the step 1.1 is as follows:

Step 1.1.1: Input the data matrix $S_{ij} = (S_{ij}^{-}, S_{ij}^{+})$ and determine a threshold T by cross-validation;

Step 1.1.2: Take a datum from the dataset $S_{ij}^{-}$ (or $S_{ij}^{+}$) and add it to the classification set Canopy;

Step 1.1.3: Take a point P from the dataset $S_{ij}^{-}$ (or $S_{ij}^{+}$) and calculate the distance between P and the classification set Canopy;

Step 1.1.4: Test P against the classification set Canopy: if its distance to Canopy is less than T, store P in Canopy; otherwise delete P from $S_{ij}^{-}$ (or $S_{ij}^{+}$);

Step 1.1.5: Repeat steps 1.1.3 and 1.1.4 until no data remain in $S_{ij}^{-}$ (or $S_{ij}^{+}$); output the number of data $K^{-}$ (or $K^{+}$) in the classification set Canopy to obtain the cluster number K;

Step 1.1.6: Randomly select K data from $S_{ij}^{-}$ (or $S_{ij}^{+}$) and record them as $C^{-}$ (or $C^{+}$);

Step 1.1.7: According to Euclidean distance, assign the data of $S_{ij}^{-}$ (or $S_{ij}^{+}$) to $C^{-}$ (or $C^{+}$), forming the data sets $Q^{-}$ (or $Q^{+}$);

Step 1.1.8: Calculate the center of each class in $Q^{-}$ (or $Q^{+}$) and take it as the new $C^{-}$ (or $C^{+}$);

Step 1.1.9: Repeat steps 1.1.7 and 1.1.8 until $C^{-}$ (or $C^{+}$) no longer changes;

Step 1.1.10: Output the maximum number of neighbors N in $Q^{-}$ (or $Q^{+}$);

Step 1.1.11: Linearly reconstruct the sample matrix $S_{ij} = \{S_{ij}^{-}, S_{ij}^{+}\}$ using the maximum number of neighbors N to obtain the weight coefficient matrix $W = \{W_j\}$ $(j = 1, 2, \ldots, m)$;

Step 1.1.12: Compute the matrix $M = (I - W)^{T}(I - W)$ and obtain the eigenvector $d_j$ of its 2nd eigenvalue; replace $S_{ij}^{-}$ (or $S_{ij}^{+}$) with $d_j$ and repeat steps 1.1.6 to 1.1.9 until $C^{-}$ (or $C^{+}$) no longer changes, obtaining the new clustering result $Q_i'$ based on $d_j$;

Step 1.1.13: Obtain the index of $d_j$ in $Q_i'$, i.e., the data code, by the operation $\mathrm{Code} = \mathrm{Index}(d_j, Q_i')$, where $\mathrm{Index}(d_j, Q_i')$ denotes the index of the cluster in $Q_i'$ to which $d_j$ is assigned. The combination of data sets with the same cluster index is the coding rule, expressed as $\mathrm{Rule} \subset S_{ij}$;

Further, the cluster number of the data is $K = \min(K^{-}, K^{+})$, and the maximum number of neighbors in step 1.1.10 is $N = \max(Q^{-}, Q^{+})$.
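The cluster-counting stage can be illustrated with a short Python sketch. It uses a common single-threshold simplification of Canopy (each new center absorbs every remaining point closer than T), which may differ in detail from the procedure claimed here; the function name `canopy_count`, the sample values, and the threshold are all hypothetical.

```python
import math
from typing import List, Sequence


def canopy_count(points: Sequence[Sequence[float]], t: float) -> int:
    """Return the number of canopies found with a single distance threshold t."""
    remaining = list(points)
    centers: List[Sequence[float]] = []
    while remaining:
        # Take the first remaining point as a new canopy center
        # (a deterministic stand-in for a random pick).
        center = remaining.pop(0)
        centers.append(center)
        # Points closer than t to the center are absorbed by this canopy.
        remaining = [p for p in remaining if math.dist(p, center) >= t]
    return len(centers)


# Hypothetical endpoint sets S^- and S^+ (one row per sample).
s_minus = [[1.0], [1.2], [5.0], [5.1], [9.0]]
s_plus = [[1.5], [1.6], [5.5], [5.9], [6.2]]
t = 1.0  # hypothetical threshold; the text chooses T by cross-validation

k_minus = canopy_count(s_minus, t)
k_plus = canopy_count(s_plus, t)
k = min(k_minus, k_plus)  # cluster number K = min(K^-, K^+)
```

The resulting K would then seed the K-means iterations of steps 1.1.6 to 1.1.9.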

Further, the method for calculating the weight coefficient in the step 1.1.11 is as follows:

wherein eta represents S _ij Is a single-point network.
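The weight formula itself does not survive in this text. For orientation only, the standard locally linear embedding (LLE) weights, which the formula of this method may or may not match exactly, are obtained by minimizing the error of reconstructing each sample from its N nearest neighbors, subject to the weights summing to one. The symbols $x_j$, $\eta_{jk}$, $Z_j$ and $\mathbf{1}$ are notation assumed here, with $\eta_{jk}$ taken to be the k-th neighbor of sample $x_j$:

```latex
\min_{w_j} \Bigl\| x_j - \sum_{k=1}^{N} w_{jk}\,\eta_{jk} \Bigr\|^2
\quad \text{s.t.} \quad \sum_{k=1}^{N} w_{jk} = 1,
\qquad
w_j = \frac{Z_j^{-1}\mathbf{1}}{\mathbf{1}^{T} Z_j^{-1}\mathbf{1}},
\quad
(Z_j)_{kl} = (x_j - \eta_{jk})^{T}(x_j - \eta_{jl}).
```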

Further, the specific process of the modified Apriori algorithm in the step 3 is as follows:

Step 3.1: Input the node variables;

Step 3.2: Compute the conditional probability table of the set of independent node variables, which have no parent nodes, and the data encoding $S^{C}$;

Step 3.3: For the set of dependent node variables, which have parent nodes, join branches, i.e., combine each node with all of its parent nodes;

Step 3.4: Compute the proportion of the total data samples in which the node variable set is in state k, i.e., the support;

Step 3.5: Compute the confidence of the node variable set and its conditional probability table;

Step 3.6: Output the final conditional probability table L(V).

Further, in step 3.1 the node variables are divided into those with parent nodes and those without, and the branch joining in step 3.3 starts scanning from the node variables without parent nodes.
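One way to realize a scan that starts from the parentless nodes is a topological ordering of the network. The sketch below uses Kahn's algorithm as an assumed implementation choice, not a procedure stated in this text; the function name `scan_order` and the four-node network are hypothetical.

```python
from collections import deque
from typing import Dict, List


def scan_order(parents: Dict[str, List[str]]) -> List[str]:
    """Order nodes so that every node appears after all of its parents."""
    remaining = {node: len(ps) for node, ps in parents.items()}
    children: Dict[str, List[str]] = {node: [] for node in parents}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)
    # Independent nodes (no parents) are scanned first.
    queue = deque(node for node, n in remaining.items() if n == 0)
    order: List[str] = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                queue.append(child)
    if len(order) != len(parents):
        raise ValueError("graph has a cycle, so it is not a valid DAG")
    return order


# Hypothetical network: v1 and v2 have no parents; v3 depends on both; v4 on v3.
parents = {"v1": [], "v2": [], "v3": ["v1", "v2"], "v4": ["v3"]}
order = scan_order(parents)
```

Processing nodes in this order guarantees that the table of every parent is available before a dependent node is combined with it.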

Further, the reasoning formula of the evidence correlation method in the step 4:

Here $\mathrm{Sub}(v_i)$ denotes the child nodes of node $v_i$, $\mathrm{Par}(v_i)$ denotes the parent nodes of node $v_i$, and $A(v_i)$ denotes the probability value of the state of node $v_i$; $|S|$ denotes the number of elements in the child node set S, and $|F|$ denotes the number of elements in the parent node set F.

By adopting the above technical solution, the invention has the following technical effects:

1) Objectivity: no reliability indices or data distribution need to be determined, which reduces the subjectivity of earlier data reliability evaluation;

2) Generality: the method applies not only to discrete data values but also to interval numbers, and it can serve as a reference for data reliability evaluation in other fields where the data are correlated;

3) The model can also discriminate among data to identify noise in the samples, even when the samples contain interval values. This provides a basis for improving sample quality and the accuracy of data-driven algorithms, and the model has market prospects in data-driven service applications, data preprocessing for big data, prediction applications based on similar principles, data reliability evaluation in e-commerce, and the like.

Drawings

Fig. 1 is a flow chart of the present invention.

Fig. 2 is a schematic diagram of data encoding of the present invention.

Fig. 3 is a data encoding flow chart of the present invention.

Fig. 4 is a diagram of a Bayesian network of related data according to the present invention.

Fig. 5 is a diagram of a Bayesian network model of the present invention, taking squeeze casting as an example.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to preferred embodiments. It should be noted, however, that many of the details set forth in the description are provided only to give a thorough understanding of one or more aspects of the invention, and that these aspects may be practiced without these specific details.

As shown in Fig. 1, the data reliability evaluation method based on the improved Apriori algorithm and Bayesian network reasoning according to the present invention comprises the following steps:

Step 1: Data encoding. Let the input multidimensional correlated data with diverse characteristics be $S_{ij} = \{a_{ji}\}$, a mixed set of interval values and discrete values, where $i = 1, 2, \ldots, n$ indexes the dimensions of the data and $j = 1, 2, \ldots, m$ indexes the samples. A specific example is shown in Table 1.

Table 1 data sample matrix taking squeeze casting process data as an example

Each datum is regarded as an interval number $a_{ji} = [x_{ji}, y_{ji}]$, where $x_{ji}$ and $y_{ji}$ may be equal. The set of left endpoints of $S_{ij}$, $S_{ij}^{-} = \{x_{ji}\}$, is the minimum-value set of the data, and the set of right endpoints, $S_{ij}^{+} = \{y_{ji}\}$, is the maximum-value set. The multidimensional interval data are thus formed into a sample matrix of minimum and maximum values, $S_{ij} = \{S_{ij}^{-}, S_{ij}^{+}\}$. Unsupervised cluster learning is performed on $S_{ij}^{-}$ and $S_{ij}^{+}$ separately to obtain the maximum number of neighbors N. Using N neighbors, the sample matrix $S_{ij}$ is linearly reconstructed with the locally linear embedding (LLE) algorithm, and the eigenvectors of the sample matrix are computed. The eigenvectors are clustered to obtain the data code Code of the sample matrix and the set Rule of clusters in each data dimension, where Rule is the coding rule. The calculation process is shown schematically in Fig. 2, and the basic flow is shown in Fig. 3. Taking squeeze casting process data as an example, the data of Table 1 are encoded, and the results are shown in Table 2.
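As a minimal sketch of the interval representation described above (an illustration, not text from the patent), the following Python treats every entry as an interval [x, y], stores a discrete value as the degenerate interval with x = y, and splits a sample matrix into its left-endpoint and right-endpoint matrices. The function name `split_intervals` and the sample values are hypothetical.

```python
from typing import List, Tuple, Union

Number = Union[int, float]
Entry = Union[Number, Tuple[Number, Number]]


def split_intervals(samples: List[List[Entry]]):
    """Split a mixed matrix of discrete values and intervals into
    the minimum-value matrix S^- and the maximum-value matrix S^+."""
    s_minus, s_plus = [], []
    for row in samples:
        lo_row, hi_row = [], []
        for entry in row:
            # A discrete value a is treated as the degenerate interval [a, a].
            x, y = entry if isinstance(entry, tuple) else (entry, entry)
            if x > y:
                raise ValueError(f"invalid interval: [{x}, {y}]")
            lo_row.append(x)
            hi_row.append(y)
        s_minus.append(lo_row)
        s_plus.append(hi_row)
    return s_minus, s_plus


# Hypothetical samples: two rows (samples), three dimensions each.
samples = [[(10, 20), 5, (0.1, 0.3)],
           [15, (4, 6), 0.2]]
s_minus, s_plus = split_intervals(samples)
```

Handling discrete values as degenerate intervals lets the same clustering and encoding pipeline run on both kinds of data.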

The specific steps are as follows:

Step 1.1: Input the data matrix $S_{ij} = (S_{ij}^{-}, S_{ij}^{+})$ and determine a threshold T by cross-validation;

Step 1.2: Take a datum from the dataset $S_{ij}^{-}$ (or $S_{ij}^{+}$) and add it to the classification set Canopy;

Step 1.3: Take a point P from the dataset $S_{ij}^{-}$ (or $S_{ij}^{+}$) and calculate the distance between P and Canopy;

Step 1.4: Test P against Canopy. If the distance is less than T, store P in Canopy; otherwise delete P from $S_{ij}^{-}$ (or $S_{ij}^{+}$);

Step 1.5: Repeat steps 1.3 and 1.4 until no data remain in $S_{ij}^{-}$ (or $S_{ij}^{+}$); output the number of data $K^{-}$ (or $K^{+}$) in Canopy, and set $K = \min(K^{-}, K^{+})$;

Step 1.6: Randomly select K data from $S_{ij}^{-}$ (or $S_{ij}^{+}$) and record them as $C^{-}$ (or $C^{+}$);

Step 1.7: According to Euclidean distance, assign the data of $S_{ij}^{-}$ (or $S_{ij}^{+}$) to $C^{-}$ (or $C^{+}$), forming the data sets $Q^{-}$ (or $Q^{+}$);

Step 1.8: Calculate the center of each class in $Q^{-}$ (or $Q^{+}$) and take it as the new $C^{-}$ (or $C^{+}$);

Step 1.9: Repeat steps 1.7 and 1.8 until $C^{-}$ (or $C^{+}$) no longer changes;

Step 1.10: Output the maximum number of neighbors N in $Q^{-}$ (or $Q^{+}$);

Step 1.11: Linearly reconstruct the sample matrix $S_{ij} = \{S_{ij}^{-}, S_{ij}^{+}\}$ using the maximum number of neighbors N to obtain the weight coefficient matrix $W = \{W_j\}$ $(j = 1, 2, \ldots, m)$;

Step 1.12: Compute the matrix $M = (I - W)^{T}(I - W)$ and obtain the eigenvector $d_j$ of its 2nd eigenvalue; replace $S_{ij}^{-}$ (or $S_{ij}^{+}$) with $d_j$ and repeat steps 1.6 to 1.9 until $C^{-}$ (or $C^{+}$) no longer changes, obtaining the new clustering result $Q_i'$ based on $d_j$;

Step 1.13: Obtain the index of $d_j$ in $Q_i'$, i.e., the data code, by the operation $\mathrm{Code} = \mathrm{Index}(d_j, Q_i')$, where $\mathrm{Index}(d_j, Q_i')$ denotes the index of the cluster in $Q_i'$ to which $d_j$ is assigned. The combination of data sets with the same cluster index is the coding rule, expressed as $\mathrm{Rule} \subset S_{ij}$;
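The final encoding step can be sketched as follows, assuming a one-dimensional embedding and cluster centers that have already been computed. The helper names `encode` and `rules`, the nearest-center assignment, and all numeric values are illustrative assumptions rather than details given in the text.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def encode(d: Sequence[float], centers: Sequence[float]) -> List[int]:
    """Code_j = index of the cluster center nearest to the embedded value d_j."""
    return [min(range(len(centers)), key=lambda c: abs(v - centers[c])) for v in d]


def rules(codes: Sequence[int], samples: Sequence[object]) -> Dict[int, List[object]]:
    """Rule for a code = the original samples that share that cluster index."""
    grouped: Dict[int, List[object]] = defaultdict(list)
    for code, sample in zip(codes, samples):
        grouped[code].append(sample)
    return dict(grouped)


# Hypothetical embedded values d_j, cluster centers, and original intervals.
d = [-0.52, -0.48, 0.03, 0.49, 0.51]
centers = [-0.5, 0.0, 0.5]
samples = [(10, 12), (11, 13), (20, 22), (30, 31), (29, 33)]

codes = encode(d, centers)
rule = rules(codes, samples)
```

Each code is thus a small integer state, and the rule records which original intervals that state stands for.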

Table 2 Code and Rule after encoding the data samples, taking squeeze casting process data as an example

Step 2: Construct the directed acyclic graph of a Bayesian network according to the data correlations and attribute characteristics, as shown in Fig. 4. Taking squeeze casting process data as an example, a specific instance is shown in Fig. 5. After the original data have been encoded as in step 1, each dimension of the data is represented as a node of the Bayesian network, indexed by the dimension i of the data and by the state k of that dimension (corresponding to the Rule of its code). The node variables are divided into independent node variables, which have no parent node, and dependent node variables, which have parent nodes. Each directed edge represents the relationship between the data of two dimensions and points from a parent node to its child node;
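A directed acyclic graph of this kind can be held as a simple parent map. The sketch below builds that map from an edge list and separates independent from dependent nodes; the node names are hypothetical placeholders and are not taken from Fig. 5.

```python
from typing import Dict, List, Tuple

# Hypothetical directed edges (parent, child) between encoded dimensions.
edges: List[Tuple[str, str]] = [("material", "pressure"),
                                ("material", "temperature"),
                                ("pressure", "quality"),
                                ("temperature", "quality")]

nodes = sorted({n for edge in edges for n in edge})
parents: Dict[str, List[str]] = {n: [] for n in nodes}
for parent, child in edges:
    parents[child].append(parent)

independent = [n for n in nodes if not parents[n]]   # no parent node
dependent = [n for n in nodes if parents[n]]         # at least one parent
```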

Step 3: Use the improved Apriori algorithm to determine the support of each node and thereby obtain the conditional probability table L(V) of the Bayesian network. The pseudocode of this algorithm is shown in Table 3.

Table 3 improved Apriori algorithm based on bayesian network nodes

The method comprises the following specific steps:

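Step 3.1: Input the node variables;

Step 3.2: Compute the conditional probability table of the set of node variables without parent nodes, and the data encoding $S^{C}$;

Step 3.3: For the set of dependent node variables, which have parent nodes, join branches, i.e., combine each node with all of its parent nodes;

Step 3.4: Compute the proportion of the total data samples in which the node variable set is in state k, i.e., the support;

Step 3.5: Compute the confidence of the node variable set and its conditional probability table;

Step 3.6: Output the final conditional probability table L(V).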
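The support and confidence computations can be sketched in Python as below. The sketch assumes the usual association-rule definitions (support as a relative frequency, confidence as the joint support divided by the support of the parents), which may not coincide exactly with the formulas of this method; the function names and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def support(rows: Sequence[Dict[str, int]], names: List[str]) -> Dict[Tuple[int, ...], float]:
    """Fraction of samples in which the listed nodes take each joint state."""
    counts = Counter(tuple(row[n] for n in names) for row in rows)
    return {state: c / len(rows) for state, c in counts.items()}


def conditional_table(rows, node: str, parent_names: List[str]):
    """Confidence = support(parents + node) / support(parents), i.e. an
    estimate of P(node = k | parents = state) for every observed combination."""
    if not parent_names:
        return support(rows, [node])
    joint = support(rows, parent_names + [node])
    marginal = support(rows, parent_names)
    return {state: s / marginal[state[:-1]] for state, s in joint.items()}


# Hypothetical encoded samples: each value is a cluster code from step 1.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 0, "b": 0}, {"a": 1, "b": 1}]
cpt_a = conditional_table(rows, "a", [])
cpt_b = conditional_table(rows, "b", ["a"])
```

For a parentless node the table reduces to the supports themselves; for a dependent node each key is the tuple of parent states followed by the node's own state.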

Step 4: Perform inference on the Bayesian network of the data using the evidence correlation method, and calculate the reliability of the data. Taking the squeeze casting process data in Table 1 as an example, the resulting reliability inference results are shown in Table 4, where the global reliability is the relative reliability of the process parameters as a whole for each material composition.

Table 4 Reliability inference results for the squeeze casting process data

The method comprises the following specific steps:

Step 4.1: Input: the current node variable $v_i$;

Step 4.2: Assign the node variables $v_i = \{v_P, v_S\}$ and the connections $\mathrm{Join}(v_i, v_{i+1})$ to construct the Bayesian network;

Step 4.3: Initialize the node states and the conditional probability table $L(v_i)$;

Step 4.4: Input the evidence $\mathrm{Sub}(v_i)$, $\mathrm{Par}(v_i)$;

Step 4.6: Perform inference according to the evidence correlation formula;

Step 4.7: Output: the reliability $R(v_i)$ of the current node.

Evidence correlation formula:

Here $\mathrm{Sub}(v_i)$ denotes the child nodes of node $v_i$, $F_i^{\theta} = \mathrm{Par}(v_i)$ denotes the parent nodes of node $v_i$, and $A(v_i)$ denotes the probability value of the state of node $v_i$; $|S|$ denotes the number of elements in the child node set S, and $|F|$ denotes the number of elements in the parent node set F.

The invention forms minimum-value and maximum-value data sets from the left and right endpoints of the interval numbers, performs Canopy-Kmeans clustering on each data set, and applies the LLE manifold learning algorithm to reduce the dimensionality of the whole interval data set, obtaining its data code. Then, taking the data of each dimension as a node and the relationships among the data of different dimensions as directed edges, a Bayesian network graph is constructed; the conditional probability table of each node is obtained with the improved Apriori algorithm; and finally the reliability of each node's data is inferred with the evidence correlation method.

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and these are also intended to fall within the scope of protection of the present invention.

Claims

1. A squeeze casting process data reliability evaluation method based on an improved Apriori algorithm and Bayesian network reasoning, characterized in that the evaluation method comprises the following steps:

Step 1: Let the input multidimensional correlated data with diverse characteristics be $S_{ij} = \{a_{ji}\}$, a mixed set of interval values and discrete values, where $i = 1, 2, \ldots, n$ indexes the dimensions of the data and $j = 1, 2, \ldots, m$ indexes the samples. Each datum is regarded as an interval number $a_{ji} = [x_{ji}, y_{ji}]$, where $x_{ji}$ and $y_{ji}$ may be equal. The set of left endpoints of $S_{ij}$, $S_{ij}^{-} = \{x_{ji}\}$, is the minimum-value set of the data, and the set of right endpoints, $S_{ij}^{+} = \{y_{ji}\}$, is the maximum-value set. The multidimensional interval data are thus formed into a sample matrix of minimum and maximum values, $S_{ij} = \{S_{ij}^{-}, S_{ij}^{+}\}$. Data encoding is performed on the minimum-value set $S_{ij}^{-}$ and the maximum-value set $S_{ij}^{+}$ to obtain the data code Code and the coding rule Rule;

Step 2: Construct the directed acyclic graph of a Bayesian network according to the data correlations and attribute characteristics. After the original data have been encoded as in step 1, each dimension of the data is represented as a node of the Bayesian network, indexed by the dimension i of the data and by the state k of that dimension, i.e., the Rule corresponding to its code. The node variables are divided into independent node variables, which have no parent node, and dependent node variables, which have parent nodes. Each directed edge represents the relationship between the data of two dimensions and points from a parent node to its child node;

Step 3: Use the improved Apriori algorithm to obtain the support of each node, and use the supports as the conditional probability table L(V) of the Bayesian network;

Step 4: Perform inference on the Bayesian network of the data using the evidence correlation method, and calculate the reliability of the data.

2. The squeeze casting process data reliability evaluation method based on the improved Apriori algorithm and Bayesian network reasoning according to claim 1, wherein in step 1 the data encoding process comprises:

Step 1.1: Perform unsupervised cluster learning on the data $S_{ij}^{-}$ and $S_{ij}^{+}$ separately to obtain the maximum number of neighbors N; using N neighbors, linearly reconstruct the sample matrix $S_{ij}$ with the locally linear embedding algorithm and compute the eigenvectors of the sample matrix; cluster the eigenvectors to obtain the data code Code of the sample matrix and the set Rule of clusters in each data dimension, where Rule is the coding rule.

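3. The squeeze casting process data reliability evaluation method based on the improved Apriori algorithm and Bayesian network reasoning according to claim 2, wherein the specific process of step 1.1 is as follows:

Step 1.1.1: Input the data matrix $S_{ij} = (S_{ij}^{-}, S_{ij}^{+})$ and determine a threshold T by cross-validation;

Step 1.1.2: Take a datum from the dataset $S_{ij}^{-}$ (or $S_{ij}^{+}$) and add it to the classification set Canopy;

Step 1.1.3: Take a point P from the dataset $S_{ij}^{-}$ (or $S_{ij}^{+}$) and calculate the distance between P and the classification set Canopy;

Step 1.1.4: Test P against the classification set Canopy: if its distance to Canopy is less than T, store P in Canopy; otherwise delete P from $S_{ij}^{-}$ (or $S_{ij}^{+}$);

Step 1.1.5: Repeat steps 1.1.3 and 1.1.4 until no data remain in $S_{ij}^{-}$ (or $S_{ij}^{+}$); output the number of data $K^{-}$ (or $K^{+}$) in the classification set Canopy to obtain the cluster number K;

Step 1.1.6: Randomly select K data from $S_{ij}^{-}$ (or $S_{ij}^{+}$) and record them as $C^{-}$ (or $C^{+}$);

Step 1.1.7: According to Euclidean distance, assign the data of $S_{ij}^{-}$ (or $S_{ij}^{+}$) to $C^{-}$ (or $C^{+}$), forming the data sets $Q^{-}$ (or $Q^{+}$);

Step 1.1.8: Calculate the center of each class in $Q^{-}$ (or $Q^{+}$) and take it as the new $C^{-}$ (or $C^{+}$);

Step 1.1.9: Repeat steps 1.1.7 and 1.1.8 until $C^{-}$ (or $C^{+}$) no longer changes;

Step 1.1.10: Output the maximum number of neighbors N in $Q^{-}$ (or $Q^{+}$);

Step 1.1.11: Linearly reconstruct the sample matrix $S_{ij} = \{S_{ij}^{-}, S_{ij}^{+}\}$ using the maximum number of neighbors N to obtain the weight coefficient matrix $W = \{W_j\}$ $(j = 1, 2, \ldots, m)$;

Step 1.1.12: Compute the matrix $M = (I - W)^{T}(I - W)$ and obtain the eigenvector $d_j$ of its 2nd eigenvalue; replace $S_{ij}^{-}$ (or $S_{ij}^{+}$) with $d_j$ and repeat steps 1.1.6 to 1.1.9 until $C^{-}$ (or $C^{+}$) no longer changes, obtaining the new clustering result $Q_i'$ based on $d_j$;

Step 1.1.13: Obtain the index of $d_j$ in $Q_i'$, i.e., the data code, by the operation $\mathrm{Code} = \mathrm{Index}(d_j, Q_i')$, where $\mathrm{Index}(d_j, Q_i')$ denotes the index of the cluster in $Q_i'$ to which $d_j$ is assigned. The combination of data sets with the same cluster index is the coding rule, expressed as $\mathrm{Rule} \subset S_{ij}$.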

4. The squeeze casting process data reliability evaluation method based on the improved Apriori algorithm and Bayesian network reasoning according to claim 3, wherein the cluster number of the data is $K = \min(K^{-}, K^{+})$, and the maximum number of neighbors in step 1.1.10 is $N = \max(Q^{-}, Q^{+})$.

5. The squeeze casting process data reliability evaluation method based on the improved Apriori algorithm and Bayesian network reasoning according to claim 3, wherein the weight coefficient in step 1.1.11 is calculated as follows:

wherein S is _ij Is denoted by eta.

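6. The squeeze casting process data reliability evaluation method based on the improved Apriori algorithm and Bayesian network reasoning according to claim 1, wherein the specific process of the improved Apriori algorithm in step 3 is as follows:

Step 3.1: Input the node variables;

Step 3.2: Compute the conditional probability table of the set of independent node variables, which have no parent nodes, and the data encoding $S^{C}$;

Step 3.3: For the set of dependent node variables, which have parent nodes, join branches, i.e., combine each node with all of its parent nodes;

Step 3.4: Compute the proportion of the total data samples in which the node variable set is in state k, i.e., the support;

Step 3.5: Compute the confidence of the node variable set and its conditional probability table;

Step 3.6: Output the final conditional probability table L(V).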

7. The squeeze casting process data reliability evaluation method based on the improved Apriori algorithm and Bayesian network reasoning according to claim 6, wherein in step 3.1 the node variables are divided into those with parent nodes and those without, and the branch joining in step 3.3 starts scanning from the node variables without parent nodes.

8. The squeeze casting process data reliability evaluation method based on the improved Apriori algorithm and Bayesian network reasoning according to claim 1, wherein the inference formula of the evidence correlation method in step 4 is:

Here $\mathrm{Sub}(v_i)$ denotes the child nodes of node $v_i$, $F_i^{\theta} = \mathrm{Par}(v_i)$ denotes the parent nodes of node $v_i$, and $A(v_i)$ denotes the probability value of the state of node $v_i$; $|S|$ denotes the number of elements in the child node set S, and $|F|$ denotes the number of elements in the parent node set F.