CN110011847B - Data source quality evaluation method under sensing cloud environment - Google Patents

Data source quality evaluation method under sensing cloud environment

Info

Publication number
CN110011847B
Authority
CN
China
Prior art keywords
data
quality
data source
value
true
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910256445.9A
Other languages
Chinese (zh)
Other versions
CN110011847A (en)
Inventor
李默涵
田志宏
孙彦斌
顾钊铨
韩伟红
仇晶
苏申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN201910256445.9A
Publication of CN110011847A
Application granted
Publication of CN110011847B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50 Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50 Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L41/5003 Managing SLA; Interaction between SLA and QoS
    • H04L41/5009 Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H04L67/025 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP] for remote control or remote monitoring of applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/12 Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30 Services specially adapted for particular environments, situations or purposes
    • H04W4/38 Services specially adapted for particular environments, situations or purposes for collecting sensor information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W84/00 Network topologies
    • H04W84/18 Self-organising networks, e.g. ad-hoc networks or sensor networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Testing Or Calibration Of Command Recording Devices (AREA)

Abstract

The embodiment of the invention discloses a data source quality evaluation method in a sensing cloud environment, which comprises the following steps: acquiring current and historical monitoring data of a sensing cloud storage data source, wherein the sensing cloud is a combination of cloud computing and a wireless sensor network and is used for collecting and processing monitoring data from a plurality of sensor nodes or sensor sub-networks; integrating monitoring data of the data source based on the spatial correlation and the temporal correlation and determining a data true value; generating an initial quality evaluation vector of a data source based on the data truth value, and adjusting the initial quality evaluation vector of the data source according to a quality rule; and calculating a final quality evaluation result of the data source according to the adjusted initial quality evaluation vector of the data source. By adopting the invention, the quality of the data source can be described in multiple angles, and the quality of the data source can be more comprehensively depicted.

Description

Data source quality evaluation method under sensing cloud environment
Technical Field
The invention relates to the field of quality evaluation, in particular to a data source quality evaluation method in a sensing cloud environment.
Background
Several data source quality evaluation methods have been proposed. Most existing methods evaluate the quality of data in a database and detect poor-quality data based on rules such as conditional functional dependencies, conditional inclusion dependencies, currency constraints, and matching rules. The quality of a data source can then be evaluated from the proportion of poor-quality data it produces. A data quality rule generally has the form A → B, with the semantics "if the value of attribute set A is a, then the value of attribute set B must be b". By selecting the data in the database that satisfy the rule antecedent A = a and checking whether the rule consequent B = b holds, errors in the data can be detected; data that do not satisfy the rule are regarded as poor-quality data (also called erroneous data). The data quality of some data sources may be degraded by intrinsic or extrinsic factors. If such degraded data sources are not discovered in time, they further affect the quality of the services built on them. An accurate data source quality evaluation method is therefore required to find poor-quality data sources.
Disclosure of Invention
In order to solve the problems, the invention provides a data source quality evaluation method in a sensing cloud environment, which can describe the data source quality more accurately from multiple angles and describe the data source quality more comprehensively.
Based on the above, the invention provides a data source quality evaluation method in a sensing cloud environment, which comprises the following steps:
acquiring current and historical monitoring data of a sensing cloud storage data source, wherein the sensing cloud is a combination of cloud computing and a wireless sensor network and is used for collecting and processing monitoring data from a plurality of sensor nodes or sensor sub-networks;
integrating monitoring data of the data source based on the spatial correlation and the temporal correlation and determining a data true value;
generating an initial quality evaluation vector of a data source based on the data truth value, and adjusting the initial quality evaluation vector of the data source according to a quality rule;
and calculating a final quality evaluation result of the data source according to the adjusted initial quality evaluation vector of the data source.
After the current and historical monitoring data of the sensing cloud storage data source are acquired, if the data volume exceeds a threshold, data reduction is performed to reduce the data volume; the data reduction uses piecewise aggregate approximation (PAA) or adaptive piecewise constant approximation (APCA).
Integrating the monitoring data of the data source and determining the data truth value based on the spatial correlation and the temporal correlation comprises: determining whether the data have spatial correlation and, if so, for a given data source $s_i$, reading the set $S_N^{(i)}$ of other sensor nodes in a regular monitoring area around $s_i$ together with the monitoring data sequences of the nodes in $S_N^{(i)}$; the nodes in $S_N^{(i)}$ form $\mathrm{Cluster}^{(i)}$.
Obtaining the cluster $\mathrm{Cluster}^{(i)}$ containing $s_i$ comprises clustering the monitoring data sequences of the nodes in $S_N^{(i)}$ by combining position similarity and data similarity; the centroid of $\mathrm{Cluster}^{(i)}$ at each time is then calculated as the truth candidate sequence.
After the truth value candidate sequence after the spatial correlation processing is obtained, time correlation processing, namely smoothing processing, is also required to be performed on the truth value candidate sequence, wherein the smoothing processing comprises an n-order moving average method or a least square method, and the smoothed sequence is a final truth value sequence.
Generating an initial quality assessment vector for a data source based on the data truth values comprises: evaluating the quality of $s_i$ by comparing the values of $s_i$ with the truth values, and obtaining the quality value $Q(s_i, t_k)$ of $s_i$ at time $t_k$ from a quality evaluation function; the quality values at times $t_1, \dots, t_m$, $\langle Q(s_i, t_1), \dots, Q(s_i, t_m) \rangle$, constitute the initial quality assessment vector $Qvec(s_i)$ of $s_i$. The quality evaluation function is:
$$Q(s_i, t_k) = 1 - \frac{dist(v_{ik}, true(v_{ik}))}{maxdist}$$
where $v_{ik}$ is the value of $s_i$ at time $t_k$, $true(v_{ik})$ is the truth value corresponding to $s_i$ at time $t_k$, $dist(v_{ik}, true(v_{ik}))$ is the distance between $v_{ik}$ and $true(v_{ik})$, and $maxdist$ is the maximum value of that distance.
The quality rule can express positive correlation, negative correlation, and other numerical association relations, and is expressed as:
$$(f(A) \in target_A) \rightarrow (g(B) \in valid_B)$$
where A and B are two attribute sets, f() and g() are functions acting on A and B respectively, $target_A$ is the target value range of f(A), and $valid_B$ is the legal value range of g(B); $target_A$ and $valid_B$ may each be an interval, a set of values, or another function. If the quality rule is satisfied at a given time, the data at that time are considered reasonable and free of quality problems.
Wherein said adjusting an initial quality assessment vector of the data source according to a quality rule comprises:
Step (1): calculate the mean $QMean_i$ and the standard deviation $QSD_i$ of $Qvec(s_i) = \langle Q(s_i, t_1), \dots, Q(s_i, t_m) \rangle$.
Step (2): define the deviation threshold $T_i$ as h times the standard deviation, i.e., $T_i = h \cdot QSD_i$.
Step (3): for each time $t_k$ at which the quality score falls below $QMean_i$ by more than $T_i$ (i.e., the quality score is too low), traverse all rules in the data quality rule set Ψ and check whether the data of $s_i$ at time $t_k$ violate any rule, i.e., whether the condition of the antecedent is satisfied but the condition of the consequent is not:
a) if a violated rule exists, the quality score $Q(s_i, t_k)$ remains unchanged;
b) if the traversal completes without finding a violated rule, the quality score at that time is set to $QMean_i$, and the procedure returns to steps (1) and (2) to update $QMean_i$, $QSD_i$, and $T_i$.
Step (4): repeat steps (1), (2), and (3) until $QMean_i$ and $QSD_i$ no longer change.
The quality assessment of $s_i$ includes the mean of $Qvec(s_i)$, the standard deviation of $Qvec(s_i)$, and the stationarity $QStationary_i$.
The mean $QMean_i$ is the average quality score over the evaluated period; the higher the value, the better the quality of $s_i$;
the standard deviation $QSD_i$ represents the stability of the quality of $s_i$; the smaller the value, the more stable the quality;
the stationarity $QStationary_i$ takes values in {True, False}, where True denotes stationary and False denotes non-stationary.
The invention performs truth discovery and evaluates data source quality by jointly considering spatial and temporal correlation, which overcomes the inability of existing work to handle spatio-temporal attributes and makes truth discovery and quality evaluation more accurate;
in the quality evaluation process, the invention does not rely solely on the truth values found by an unsupervised method: it introduces a new quality rule and uses it to correct the evaluation result, which reduces the possibility of misjudgment;
while current methods of assessing data source quality can only give a one-dimensional estimate (e.g., an error rate), the technique proposed by the invention uses the triple $\langle QMean_i, QSD_i, QStationary_i \rangle$ to describe the final quality of a data source from multiple angles and to characterize it more comprehensively.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a data source quality evaluation method in a sensing cloud environment according to an embodiment of the present invention;
fig. 2 is a flow chart for determining a true value of data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a data source quality evaluation method in a sensing cloud environment according to an embodiment of the present invention, where the data source quality evaluation method in the sensing cloud environment includes:
S101: Acquire current and historical monitoring data of a sensing cloud storage data source, wherein the sensing cloud is a combination of cloud computing and a wireless sensor network and is used for collecting and processing monitoring data from a plurality of heterogeneous sensor nodes or sensor sub-networks.
Data about the physical world can be conveniently collected by deploying a wireless sensor network (WSN) in a target area. Wireless sensor nodes are small and inexpensive, so they are widely used in environmental monitoring, national defense and the military, traffic control, community security, target positioning, and other fields. However, their computation, storage, and communication capabilities are limited, and the application of large-scale sensor networks still faces many challenges. With the development of cloud computing and the growing demand for cyber-physical integration, combining cloud computing with wireless sensor networks to form a sensor cloud (sensor-cloud) has become an inevitable trend. In a sensor cloud, cloud services can collect and process data from multiple heterogeneous sensor nodes or sensor sub-networks, thereby completing data-driven, compute-intensive tasks that would otherwise be difficult to complete at the sensor end. Thanks to the computing capability of the cloud, services that were previously too heavy, such as data source quality assessment, can be applied to sensor networks with limited computing capability: the cloud service integrates heterogeneous sensor nodes and sub-networks and performs data source quality assessment in the cloud, so that data sources can be used or discarded as needed.
However, because sensor nodes are mostly deployed in harsh environments or unattended areas, the correctness and accuracy of the data of some data sources (i.e., sensor nodes or sub-networks) are vulnerable to environmental effects or attacks during collection and transmission. In other words, the data quality of some data sources may be degraded by intrinsic or extrinsic factors. If such degraded data sources are not discovered in time, they further affect the quality of data-driven cloud services. An accurate data source quality evaluation method is therefore required to find poor-quality data sources. Because a sensor acquires data continuously, the quality of a sensor data source fluctuates over time; the evaluation method must therefore also support time series or data stream analysis so that the evaluation result can be updated promptly and accurately. Based on the evaluation result, cloud applications can better select and use data sources, improving the efficiency and quality of service.
Let $S_e = \{s_1, \dots, s_n\}$ denote the set of data sources to be evaluated, let $t_1$ be the initial time, and let $t_m$ be the current time. First, n time series $V = \{\langle v_{11}, \dots, v_{1m} \rangle, \dots, \langle v_{n1}, \dots, v_{nm} \rangle\}$ are read from the database, where $\langle v_{i1}, \dots, v_{im} \rangle$ is the sequence of monitoring values of data source $s_i$ and $v_{ij}$ is the vector of values monitored by data source $s_i$ at time $t_j$.
Next, it is checked whether the data volume exceeds a threshold θ: if $m > \theta$, a reduction operation is performed on each sequence $\langle v_{i1}, \dots, v_{im} \rangle$. For simplicity, the reduction may use PAA (Piecewise Aggregate Approximation) or APCA (Adaptive Piecewise Constant Approximation), i.e., the sequence $\langle v_{i1}, \dots, v_{im} \rangle$ is divided into l segments of equal length (PAA) or unequal length (APCA), and each segment is represented by the mean of its points. The threshold θ is an empirical value set according to the computing power of the cloud.
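The reduction step can be illustrated with a minimal PAA sketch. It assumes scalar readings (the text allows vector-valued readings), and the names `paa` and `reduce_if_needed` as well as the rule for splitting a length that is not divisible by the segment count are illustrative choices, not part of the described method.

```python
from statistics import fmean


def paa(sequence, segments):
    """Piecewise Aggregate Approximation: split the sequence into
    `segments` nearly equal pieces and keep the mean of each piece."""
    m = len(sequence)
    if not 1 <= segments <= m:
        raise ValueError("segments must be between 1 and len(sequence)")
    reduced = []
    for k in range(segments):
        start = k * m // segments        # first index of segment k
        end = (k + 1) * m // segments    # one past the last index
        reduced.append(fmean(sequence[start:end]))
    return reduced


def reduce_if_needed(sequence, theta, segments):
    """Apply PAA only when the sequence length m exceeds the threshold theta."""
    if len(sequence) > theta:
        return paa(sequence, segments)
    return list(sequence)
```

APCA would instead choose segment boundaries adaptively, so that stable stretches get long segments and volatile stretches get short ones; that variant is not shown here.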
S102, integrating monitoring data of the data source based on the spatial correlation and the temporal correlation and determining a data true value.
Existing methods, which target Web data sources and relational databases, evaluate data source quality by simple voting and repeated iteration, and do not consider how correlations in the space, time, and different attributes of data sources affect quality evaluation. In a sensor network, however, sensor nodes are geographically close, monitor the same objects, monitor continuously, and measure physical quantities that are related to one another. Consequently, there is spatial correlation between data generated by different data sources at the same time, temporal correlation between data generated by the same data source at different times, and physical-quantity correlation between different attributes of the same data source. These correlations may appear either as similarity of data values or as positive or negative correlation in the trends of data changes.
The spatial correlation may be processed first to obtain a truth-value time series, and the temporal correlation is then processed on that series. Fig. 2 is a flowchart of determining a true value of data according to an embodiment of the present invention; referring to Fig. 2:
S201: Determine whether the data have spatial correlation.
The sources of spatial correlation considered here are mainly whether the sensor nodes are close in position and whether their monitored objects are consistent. Because sensors are themselves fragile (limited energy and easily damaged), deployments are often redundant, i.e., multiple sensor nodes monitor the same object at the same time. Such sensor nodes are geographically close, and the data they generate should be similar. Given a data source (i.e., a sensor node or sub-network) $s_i$, the set of all data sources whose monitored object is consistent with that of $s_i$ is denoted $\mathrm{Cluster}^{(i)}$, the cluster to which $s_i$ belongs.
S202: For a given data source $s_i$, read the set $S_N^{(i)}$ of other sensor nodes in the same monitoring area as $s_i$ and the monitoring data sequences of the nodes in $S_N^{(i)}$.
If the area corresponding to the monitored object is regular (e.g., monitoring the temperature and humidity of a room), then for a given data source $s_i$, the set $S_N^{(i)}$ of other sensor nodes in a regular monitoring area around $s_i$ and the monitoring data sequences of the nodes in $S_N^{(i)}$ can be read. All nodes in $S_N^{(i)}$ then naturally constitute $\mathrm{Cluster}^{(i)}$.
S203: Cluster the monitoring data sequences of the nodes in $S_N^{(i)}$ by combining position similarity and data similarity.
In some cases, however, rivers, valleys, roads, buildings, and the like make the area corresponding to the monitored object irregular. Some of the nodes in $S_N^{(i)}$ may then monitor objects different from that of $s_i$, with different corresponding truth values. To prevent these nodes from contaminating the truth discovery result, the nodes sufficiently similar to $s_i$ must be selected by clustering, and the clustering must consider data similarity and position similarity together. The similarity of sensor nodes is defined as the weighted average of the position similarity and the data similarity, as shown in formula (1):
$$Sim(s_i, s_j) = w_1 \times Sim_{space}(s_i, s_j) + w_2 \times Sim_{data}(s_i, s_j) \quad (1)$$
where, for $s_i$ and any node $s_j$ in $S_N^{(i)}$, $Sim_{space}(s_i, s_j)$ denotes the position similarity of $s_i$ and $s_j$, for which coordinate similarity may be used; $Sim_{data}(s_i, s_j)$ denotes the data similarity of $s_i$ and $s_j$, for which the normalized Euclidean distance between the time series, or the normalized EMD (Earth Mover's Distance) computed after converting the time series into histograms, may be used; $w_1$ and $w_2$ are weights, both of which may be set to 0.5.
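Formula (1) can be sketched as follows. The text leaves open how a distance is turned into a similarity, so the mapping used here (one minus the distance divided by a normalizing constant, clipped at zero) is an assumption, as are the function names and the `max_distance` and `value_range` parameters. Readings are assumed scalar.

```python
import math


def space_similarity(pos_a, pos_b, max_distance):
    """Position similarity from coordinates: 1 at zero distance,
    0 at or beyond max_distance (assumed normalizer)."""
    return max(0.0, 1.0 - math.dist(pos_a, pos_b) / max_distance)


def data_similarity(series_a, series_b, value_range):
    """Data similarity from a normalized Euclidean distance: the RMS
    difference of two equal-length series divided by value_range."""
    squared = sum((a - b) ** 2 for a, b in zip(series_a, series_b))
    rms = math.sqrt(squared / len(series_a))
    return max(0.0, 1.0 - rms / value_range)


def similarity(pos_a, pos_b, series_a, series_b,
               max_distance, value_range, w1=0.5, w2=0.5):
    """Weighted average of position and data similarity, as in formula (1)."""
    return (w1 * space_similarity(pos_a, pos_b, max_distance)
            + w2 * data_similarity(series_a, series_b, value_range))
```

A clustering step would then group the nodes whose similarity to the given node is high enough; the choice of clustering algorithm is not fixed by the text.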
S204: Obtain the cluster containing $s_i$ and calculate the centroid of the cluster at each time as the truth candidate sequence.
After the cluster $\mathrm{Cluster}^{(i)}$ containing $s_i$ is obtained, the centroid of $\mathrm{Cluster}^{(i)}$ at each time is calculated as the truth candidate sequence (i.e., the candidate truth value at time t is the centroid of $\mathrm{Cluster}^{(i)}$ at time t).
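For scalar readings, the centroid of the cluster at each time is simply the mean of the members' readings at that time. The sketch below assumes that all member sequences are aligned to the same timestamps; `centroid_sequence` is an illustrative name.

```python
from statistics import fmean


def centroid_sequence(cluster_series):
    """Return the per-time centroid of a cluster.

    cluster_series holds one equal-length reading sequence per node
    in the cluster; position k of every sequence refers to time t_k.
    """
    lengths = {len(series) for series in cluster_series}
    if len(lengths) != 1:
        raise ValueError("need at least one series, all of equal length")
    # zip(*...) regroups the readings by time instead of by node
    return [fmean(readings) for readings in zip(*cluster_series)]
```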
S205: Determine whether the data have temporal correlation.
Processing temporal correlation first requires determining whether the data are similar along the time dimension; here, similarity between nearby time instants is considered. Many monitored objects exhibit such similarity: temperature, humidity, altitude, and the like generally change continuously, and this similarity should be reflected in the truth values. Therefore, after the truth candidate sequence has been obtained from the spatial correlation processing, it must be smoothed so that sensor data errors do not cause abrupt changes in the truth values.
S206: Smooth the time series of centroids.
The smoothing strategy can adopt an n-order moving average method or a least square method, and the smoothed sequence is the final truth value sequence.
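A minimal moving-average sketch of the smoothing step follows. The text does not say how the window is positioned or how the first points are handled, so the trailing window and the shortened window at the start are assumptions.

```python
def moving_average(sequence, n):
    """Trailing n-point moving average.

    Point k is replaced by the mean of the last n points up to and
    including k; the first n - 1 points use as many points as exist.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    smoothed = []
    for k in range(len(sequence)):
        window = sequence[max(0, k - n + 1): k + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A larger n suppresses isolated spikes more strongly but also delays the response of the truth sequence to genuine changes.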
S103: Generate an initial quality evaluation vector of the data source based on the data truth values.
After the truth sequence is obtained, the quality of $s_i$ can be evaluated by comparing the values of $s_i$ with the truth values. The quality value $Q(s_i, t_k)$ of $s_i$ at time $t_k$ is obtained from the quality evaluation function in formula (2), and the quality values at times $t_1, \dots, t_m$, $\langle Q(s_i, t_1), \dots, Q(s_i, t_m) \rangle$, constitute the initial quality evaluation vector $Qvec(s_i)$ of $s_i$.
$$Q(s_i, t_k) = 1 - \frac{dist(v_{ik}, true(v_{ik}))}{maxdist} \quad (2)$$
where $v_{ik}$ is the value of $s_i$ at time $t_k$, $true(v_{ik})$ is the truth value corresponding to $s_i$ at time $t_k$, $dist(v_{ik}, true(v_{ik}))$ is the distance between $v_{ik}$ and $true(v_{ik})$, and $maxdist$ is the maximum value of that distance. If only precision errors are considered, the dist function can be the absolute value of the numerical difference. If the sensor network is deployed in a harsh environment, transmission and storage errors in bit strings must also be considered; in that case the values can first be converted into binary strings and the Hamming distance or edit distance used.
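Formula (2) with both kinds of distance can be sketched as follows. Scalar readings are assumed; for the Hamming variant, readings are assumed to be integers of a fixed bit width, and the `bits` parameter and function names are illustrative.

```python
def absolute_difference(value, truth):
    """Distance for precision errors: |value - truth|."""
    return abs(value - truth)


def hamming_distance(value, truth, bits=16):
    """Distance for bit-level errors: number of differing bits between
    two integers, looking only at the lowest `bits` bits."""
    mask = (1 << bits) - 1
    return bin((value ^ truth) & mask).count("1")


def quality(value, truth, maxdist, dist=absolute_difference):
    """Q = 1 - dist(value, truth) / maxdist, as in formula (2)."""
    return 1.0 - dist(value, truth) / maxdist


def quality_vector(values, truths, maxdist, dist=absolute_difference):
    """Quality value for every time step of one data source."""
    return [quality(v, t, maxdist, dist) for v, t in zip(values, truths)]
```

With the Hamming distance, a natural choice for `maxdist` is the bit width, since no two values can differ in more bits than that.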
S104: Adjust the initial quality evaluation vector of the data source according to a quality rule.
There are also problems with using the resulting initial quality assessment vector directly for data source quality assessment. The problem is that the influence of sudden abnormal events is not considered. Some emergencies in the environment (e.g., a sudden fire may cause a sudden increase in temperature readings) may cause sudden changes in the sensor readings that should not be considered quality issues, in other words, should not degrade the quality score of the data source due to such sudden changes.
In quality evaluation in relational databases, quality rules are typically employed to account for which dependencies are legitimate and should not be violated. However, the quality rules of the relational database cannot be directly used in the application scenario of sensor monitoring. Therefore, the invention designs a new quality rule, as shown in formula (3), which can represent positive correlation, negative correlation and other numerical correlation.
$$(f(A) \in target_A) \rightarrow (g(B) \in valid_B) \quad (3)$$
where A and B are two attribute sets, f() and g() are functions acting on A and B respectively, $target_A$ is the target value range of f(A), and $valid_B$ is the legal value range of g(B); $target_A$ and $valid_B$ may each be an interval, a set of values, or another function, e.g., $(-\infty, 0]$, $[0, +\infty)$, or $\{0, 1\}$.
The rule declares an association that should hold in the physical world between attribute sets A and B (e.g., altitude and barometric pressure), with the following semantics: if the value of the function f(A) in the rule antecedent (the left side of the arrow) falls within the target value range $target_A$, then the value of the function g(B) in the rule consequent (the right side of the arrow) should fall within $valid_B$. If the rule is satisfied at a given time (i.e., whenever the condition of the antecedent holds, the condition of the consequent also holds), the data at that time can be considered reasonable and free of quality problems.
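One possible way to represent a rule of the form in formula (3) is a small record of four callables. The class name, the field names, and the altitude and pressure thresholds in the example rule are illustrative assumptions, not values taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

Record = Mapping[str, float]  # attribute name -> reading at one time


@dataclass(frozen=True)
class QualityRule:
    f: Callable[[Record], float]        # function over attribute set A
    in_target: Callable[[float], bool]  # membership test for target_A
    g: Callable[[Record], float]        # function over attribute set B
    in_valid: Callable[[float], bool]   # membership test for valid_B

    def violated_by(self, record: Record) -> bool:
        """True when the antecedent holds but the consequent does not."""
        return (self.in_target(self.f(record))
                and not self.in_valid(self.g(record)))


# Hypothetical rule: readings taken at high altitude should report
# low barometric pressure (thresholds chosen only for illustration).
high_altitude_low_pressure = QualityRule(
    f=lambda r: r["altitude_m"],
    in_target=lambda x: x >= 3000.0,
    g=lambda r: r["pressure_kpa"],
    in_valid=lambda y: y <= 80.0,
)
```

A record whose antecedent does not hold never violates the rule, which matches the implication semantics described above.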
The quality rule set Ψ is derived from domain knowledge about the monitored object, and the quality evaluation vector $Qvec(s_i)$ is iteratively adjusted using the rule set as follows.
Step (1): calculate the mean $QMean_i$ and the standard deviation $QSD_i$ of $Qvec(s_i) = \langle Q(s_i, t_1), \dots, Q(s_i, t_m) \rangle$.
Step (2): define the deviation threshold $T_i$ as h times the standard deviation (h is a predetermined constant), i.e., $T_i = h \cdot QSD_i$.
Step (3): for each time $t_k$ at which the quality score falls below $QMean_i$ by more than $T_i$ (i.e., the quality score is too low), traverse all rules in Ψ and check whether the data of $s_i$ at time $t_k$ violate any rule, i.e., whether the condition of the antecedent is satisfied but the condition of the consequent is not:
a) if a violated rule exists, the quality score $Q(s_i, t_k)$ remains unchanged;
b) if the traversal completes without finding a violated rule, the quality score at that time is set to $QMean_i$, and the procedure returns to steps (1) and (2) to update $QMean_i$, $QSD_i$, and $T_i$.
Step (4): repeat steps (1), (2), and (3) until $QMean_i$ and $QSD_i$ no longer change.
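The four steps above can be sketched as a loop. The rule check is passed in as a callable `violates_rule(k)` that reports whether the data at time index k violate any rule in Ψ; that interface, the use of the population standard deviation, the default h, and the `max_rounds` safety cap are all assumptions of this sketch.

```python
from statistics import fmean, pstdev


def adjust_quality_vector(qvec, violates_rule, h=1.5, max_rounds=1000):
    """Iteratively replace unexplained low scores with the current mean.

    A score counts as too low when it lies more than h standard
    deviations below the mean. It is kept if the data at that time
    violate a rule (likely erroneous data) and is reset to the mean
    otherwise (likely a real event in the physical world).
    """
    q = list(qvec)
    for _ in range(max_rounds):
        mean = fmean(q)                 # step (1): QMean
        threshold = h * pstdev(q)       # steps (1)-(2): T = h * QSD
        for k, score in enumerate(q):   # step (3)
            if mean - score > threshold and not violates_rule(k):
                q[k] = mean             # case b): reset, then recompute
                break
        else:
            return q                    # step (4): nothing changed
    return q                            # safety cap reached
```

Because each reset raises the mean slightly, the same time step can be adjusted several times; repeated updates pull its score toward the scores of the other time steps, and the cap guards against slow convergence in floating point.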
Since quality rules reflect physical rules in the real world, the intuitive idea of the tuning process described above is that if a data anomaly satisfies all physical rules in its application scenario, the data anomaly is more likely to indicate an emergency in the physical world than erroneous data. Correspondingly, if the data is abnormal in value and the physical laws are violated, the data is more likely to be error data rather than the abnormal events actually occurring in the physical world. Based on the above adjustment, the erroneous determination can be corrected.
S105: Calculate the final quality evaluation result of the data source from the adjusted quality evaluation vector of the data source.
After the final quality evaluation vector $Qvec(s_i)$ of data source $s_i$ is obtained, the quality of $s_i$ can be evaluated from that vector. The quality of $s_i$ is expressed as the triple $\langle QMean_i, QSD_i, QStationary_i \rangle$.
Mean $QMean_i$. $QMean_i$ is the mean of $Qvec(s_i)$, i.e., the average quality score over the evaluated period; the higher the value, the better the quality of $s_i$ on average.
Standard deviation $QSD_i$. $QSD_i$ is the standard deviation of $Qvec(s_i)$ and represents the stability of the quality of $s_i$; the smaller the value, the less the quality score of $s_i$ varies and the more stable the quality.
Stationarity $QStationary_i$. $QStationary_i$ takes values in {True, False}, where True denotes stationary and False denotes non-stationary. Normally, if the quality scores of a data source at successive times are regarded as a random process, the process should be a stationary stochastic process; in other words, the data source faithfully provides monitoring data at each time, and this behavior does not change over time. If the process is not stationary, the quality score of the data source has some non-negligible correlation with time, and it can be presumed that some abnormal factor in the data source is affecting data quality over time. A stationarity test must therefore be performed on $Qvec(s_i)$. If $QStationary_i$ is False, i.e., $Qvec(s_i)$ is not stationary, using the data source carries greater risk because of some unknown abnormal factor; where conditions permit, the abnormal factor should be investigated before deciding whether to continue using the data source.
Based on the triple $\langle QMean_i, QSD_i, QStationary_i \rangle$, the overall quality and stability of data source $s_i$ can be characterized. For the set $S_e = \{s_1, \dots, s_n\}$ of data sources participating in the evaluation, computing the triple of each data source completes the data source quality evaluation task.
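Computing the triple can be sketched as below. The text does not name a stationarity test, so the check used here (comparing the means of the two halves of the vector against a tolerance) is only a crude stand-in for a formal test such as a unit-root test; the function names, the tolerance, and the use of the population standard deviation are assumptions.

```python
from statistics import fmean, pstdev


def crude_stationarity(qvec, tolerance=0.1):
    """Stand-in check: treat the vector as stationary when the mean of
    its first half and the mean of its second half differ by at most
    `tolerance`. A real deployment would use a proper statistical test."""
    half = len(qvec) // 2
    if half == 0:
        return True
    return abs(fmean(qvec[:half]) - fmean(qvec[half:])) <= tolerance


def assess(qvec, tolerance=0.1):
    """Return the triple (QMean, QSD, QStationary) for one data source."""
    return (fmean(qvec), pstdev(qvec), crude_stationarity(qvec, tolerance))


def assess_all(qvecs_by_source, tolerance=0.1):
    """Evaluate every data source; keys are source identifiers."""
    return {source: assess(qvec, tolerance)
            for source, qvec in qvecs_by_source.items()}
```

A source whose scores drift downward over the evaluated period would fail the check even when its mean is still acceptable, which is the situation the stationarity component is meant to flag.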
Compared with the prior art, the technology provided by the invention has the following advantages:
truth discovery and data source quality evaluation take spatio-temporal correlation into account, overcoming the inability of the prior art to handle spatio-temporal attributes and making both truth discovery and quality evaluation more accurate;
in the quality evaluation process, the method does not rely only on the true values found by an unsupervised method; new quality rules are introduced and used to correct the evaluation result, reducing the possibility of misjudgment;
whereas current methods of evaluating data source quality give only a one-dimensional estimate (e.g. an error rate), the technique proposed by the invention uses the triple <QMean_i, QSD_i, QStationary_i> to describe the final quality of a data source from several angles and thus characterises it more comprehensively.
The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and substitutions without departing from the technical principle of the present invention, and such modifications and substitutions shall also be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A data source quality assessment method in a sensing cloud environment is characterized by comprising the following steps:
acquiring current and historical monitoring data of a sensing cloud storage data source, wherein the sensing cloud is a combination of cloud computing and a wireless sensor network and is used for collecting and processing monitoring data from a plurality of sensor nodes or sensor sub-networks;
integrating monitoring data of the data source based on the spatial correlation and the temporal correlation and determining a data true value;
generating an initial quality assessment vector for a data source based on the data truth values;
adjusting the initial quality evaluation vector of the data source according to a quality rule, specifically comprising:
step (1), calculating the mean QMean_i and the standard deviation QSD_i of Qvec(s_i) = <Q(s_i, t_1), …, Q(s_i, t_m)>;
step (2), defining a deviation threshold T_i as h times the standard deviation, i.e. T_i = h·QSD_i;
step (3), for each time t_k at which the quality score is lower than QMean_i by more than T_i, traversing the rules in the data quality rule set Ψ and checking whether, at time t_k, the data of s_i satisfies the antecedent of a rule but not its consequent:
a) if a violated rule exists, the quality score Q(s_i, t_k) is kept unchanged;
b) if the traversal completes without finding a violated rule, the quality score at that time is adjusted to QMean_i, and the method jumps back to steps (1) and (2) to update QMean_i, QSD_i and T_i;
step (4), repeating steps (1), (2) and (3) until QMean_i and QSD_i no longer change;
and calculating a final quality evaluation result of the data source according to the adjusted initial quality evaluation vector of the data source.
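As a non-limiting illustration, steps (1) to (4) above might be sketched in Python as follows. The callback `violates_rule`, the default `h`, the safety bound `max_rounds` and the use of the population standard deviation are assumptions introduced for this sketch and are not specified in the claim.

```python
from statistics import mean, pstdev


def adjust_quality_vector(qvec, violates_rule, h=2.0, max_rounds=1000):
    """Rule-based correction of an initial quality vector.

    violates_rule(k) is a hypothetical callback that returns True when
    the data of the source at time index k violates some rule in Ψ.
    """
    q = list(qvec)
    for _ in range(max_rounds):
        q_mean, q_sd = mean(q), pstdev(q)      # step (1)
        threshold = h * q_sd                   # step (2)
        changed = False
        for k, score in enumerate(q):          # step (3)
            if q_mean - score > threshold:
                if violates_rule(k):
                    continue                   # a) genuine problem: keep score
                q[k] = q_mean                  # b) no rule violated: lift score
                changed = True
                break                          # recompute mean, SD, threshold
        if not changed:                        # step (4): nothing moved
            break
    return q


raw = [0.9, 0.9, 0.9, 0.9, 0.2]
lifted = adjust_quality_vector(raw, lambda k: False, h=1.5)
kept = adjust_quality_vector(raw, lambda k: True, h=1.5)
```

In this example the low score at the last time point is raised towards the level of the other scores when no rule is violated, and is left at 0.2 when a rule violation confirms a real quality problem.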
2. The method for evaluating the quality of a data source in a sensing cloud environment according to claim 1, wherein after the current and historical monitoring data of the sensing cloud storage data source are acquired, if the volume of the acquired data exceeds a threshold, data reduction is performed to reduce the data volume, the data reduction including piecewise aggregate approximation (PAA) or adaptive piecewise constant approximation (APCA).
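As a non-limiting illustration of the data reduction named in claim 2, a piecewise aggregate approximation can be sketched in Python as follows. The function name and the way segment boundaries are rounded are assumptions of this sketch.

```python
def paa(series, n_segments):
    """Piecewise aggregate approximation: replace a sequence by the
    means of n_segments consecutive, nearly equal-length segments."""
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    reduced = []
    for j in range(n_segments):
        start = j * n // n_segments          # inclusive segment start
        end = (j + 1) * n // n_segments      # exclusive segment end
        segment = series[start:end]
        reduced.append(sum(segment) / len(segment))
    return reduced


reduced = paa([1, 2, 3, 4, 5, 6], 3)
```

Here six readings are reduced to three segment means.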
3. The method for evaluating the quality of a data source in a sensing cloud environment according to claim 1, wherein the integrating the monitoring data of the data source based on the spatial correlation and the temporal correlation and determining the data true value comprises: judging whether the data has spatial correlation, and if so, for a given data source s_i, reading the set S_N^(i) of other sensor nodes within a regular monitoring area around s_i, together with the monitoring data sequences of the nodes in S_N^(i), the nodes in S_N^(i) forming a cluster Cluster^(i).
4. The method for evaluating the quality of a data source in a sensing cloud environment according to claim 3, wherein after the cluster Cluster^(i) in which s_i is located is obtained, the monitoring data sequences of the nodes in S_N^(i) are clustered by combining position similarity and data similarity, and the centroid of Cluster^(i) at each time is calculated as the true value candidate sequence.
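As a non-limiting illustration of claims 3 and 4, the following Python sketch selects the neighbours of a node and computes a per-time centroid. It assumes, for illustration only, that the regular monitoring area is a circle of radius `radius`, that readings are scalar, and that the centroid is the arithmetic mean of the cluster members at each time; the clustering by combined position and data similarity is not reproduced here.

```python
from math import dist
from statistics import mean


def neighbours(positions, node, radius):
    """Return the other nodes lying within `radius` of `node`.

    positions maps a node id to its (x, y) coordinates."""
    centre = positions[node]
    return [other for other, p in positions.items()
            if other != node and dist(centre, p) <= radius]


def centroid_sequence(sequences):
    """Per-time centroid of equally long scalar reading sequences."""
    return [mean(values_at_t) for values_at_t in zip(*sequences)]


positions = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 5.0)}
near_a = neighbours(positions, "a", radius=2.0)
candidate_truth = centroid_sequence([[1, 2, 3], [3, 4, 5]])
```

With these inputs only node "b" lies in the area around "a", and the candidate true value at each time is the mean of the two sequences.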
5. The method for evaluating the quality of a data source in a sensing cloud environment according to claim 4, wherein after the true value candidate sequence is obtained, it is processed for temporal correlation, i.e. smoothed; the smoothing includes an n-step moving average method or a least squares method, and the smoothed sequence is the final true value sequence.
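As a non-limiting illustration of the smoothing in claim 5, an n-step moving average might be sketched in Python as follows. The trailing window, and the use of a shorter window for the first n-1 points so that the smoothed sequence keeps the same length as the candidate sequence, are assumptions of this sketch.

```python
def moving_average(sequence, n):
    """Trailing n-step moving average that preserves sequence length.

    For the first n-1 points the window is shortened to the points
    available so far (an assumption, not specified in the claim)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    smoothed = []
    for k in range(len(sequence)):
        window = sequence[max(0, k - n + 1):k + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


final_truth = moving_average([1, 2, 3, 4], 2)
```

Keeping the length unchanged means each smoothed true value still lines up with the time index of the reading it is compared against.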
6. The method for evaluating the quality of a data source in a sensing cloud environment according to claim 1, wherein the generating an initial quality evaluation vector of the data source based on the data true values comprises: evaluating the quality of s_i by comparing the values of s_i with the true values; obtaining, based on a quality evaluation function, the quality value Q(s_i, t_k) of s_i at time t_k; the quality values <Q(s_i, t_1), …, Q(s_i, t_m)> at times t_1 to t_m forming the initial quality evaluation vector Qvec(s_i) of s_i; the quality evaluation function including
Q(s_i, t_k) = 1 - dist(v_ik, true(v_ik)) / maxdist
wherein v_ik is the value of s_i at time t_k, true(v_ik) is the true value corresponding to s_i at time t_k, dist(v_ik, true(v_ik)) is the distance between v_ik and true(v_ik), and maxdist is the maximum value of the distance between v_ik and true(v_ik).
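As a non-limiting illustration, the quality evaluation function of claim 6 could be applied as in the following Python sketch. It assumes scalar readings with the absolute difference as dist, and, where `maxdist` is not supplied, takes it as the largest distance observed over the evaluated period; both choices are assumptions made for illustration.

```python
def initial_quality_vector(values, truths, maxdist=None):
    """Compute Q(s_i, t_k) = 1 - dist(v_ik, true(v_ik)) / maxdist
    for every time index k."""
    if len(values) != len(truths):
        raise ValueError("values and truths must have the same length")
    dists = [abs(v - t) for v, t in zip(values, truths)]
    if maxdist is None:
        maxdist = max(dists, default=0.0)   # assumed: max over the period
    if maxdist == 0:
        return [1.0] * len(dists)           # every reading equals the truth
    return [1.0 - d / maxdist for d in dists]


qvec = initial_quality_vector([10, 12, 14], [10, 10, 10])
```

In this example the reading that matches the true value scores 1.0 and the reading farthest from it scores 0.0.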
7. The method for evaluating the quality of the data source in the sensing cloud environment according to claim 1, wherein the quality rule is expressed as:
(f(A) ∈ target_A) → (g(B) ∈ valid_B)
where A and B are two attribute sets, f() and g() are functions acting on A and B respectively, target_A is the target value range of f(A), valid_B is the value range of g(B), and target_A and valid_B are each an interval, a set of values or another function; if the quality rule is satisfied at a given time, the data at that time is reasonable and has no quality problem.
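As a non-limiting illustration, a quality rule of the form (f(A) ∈ target_A) → (g(B) ∈ valid_B) could be represented in Python as follows. The class name, the membership predicates and the night-time light-level rule are hypothetical examples and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping

Record = Mapping[str, Any]


@dataclass(frozen=True)
class QualityRule:
    f: Callable[[Record], Any]          # f acting on attribute set A
    in_target: Callable[[Any], bool]    # membership test for target_A
    g: Callable[[Record], Any]          # g acting on attribute set B
    in_valid: Callable[[Any], bool]     # membership test for valid_B

    def is_violated(self, record: Record) -> bool:
        """True when the antecedent holds but the consequent does not."""
        return (self.in_target(self.f(record))
                and not self.in_valid(self.g(record)))


def violates_any(rules, record):
    """True when the record violates at least one rule in the rule set."""
    return any(rule.is_violated(record) for rule in rules)


# Hypothetical rule: between 00:00 and 05:00 the light level stays at or below 50.
night_rule = QualityRule(
    f=lambda r: r["hour"], in_target=lambda hour: 0 <= hour < 5,
    g=lambda r: r["lux"], in_valid=lambda lux: lux <= 50,
)
suspicious = violates_any([night_rule], {"hour": 2, "lux": 900})
```

A record that does not satisfy the antecedent, such as a daytime reading, never violates the rule regardless of its light level.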
8. The method for evaluating the quality of a data source in a sensing cloud environment according to claim 1, wherein the quality evaluation of s_i includes: the mean of Qvec(s_i), the standard deviation of Qvec(s_i), and the stationarity QStationary_i.
9. The method for evaluating the quality of a data source in a sensing cloud environment according to claim 8, wherein the mean QMean_i indicates the average quality score over the evaluated period, a higher value indicating better quality of s_i;
the standard deviation QSD_i indicates the stability of the quality of s_i, a smaller value indicating more stable quality;
the stationarity QStationary_i takes a value in {True, False}, where True represents stationary and False represents non-stationary.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910256445.9A CN110011847B (en) 2019-03-29 2019-03-29 Data source quality evaluation method under sensing cloud environment


Publications (2)

Publication Number Publication Date
CN110011847A CN110011847A (en) 2019-07-12
CN110011847B true CN110011847B (en) 2022-03-25

Family

ID=67169319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910256445.9A Active CN110011847B (en) 2019-03-29 2019-03-29 Data source quality evaluation method under sensing cloud environment

Country Status (1)

Country Link
CN (1) CN110011847B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110519720B (en) * 2019-08-23 2020-09-01 绍兴文理学院 Burst data stream mapping load capacity optimization method in sensing cloud environment
CN111898871B (en) * 2020-07-08 2023-07-18 南京南瑞水利水电科技有限公司 Method, device and system for evaluating data quality of power grid power supply end
CN115097526B (en) * 2022-08-22 2022-11-11 江苏益捷思信息科技有限公司 Seismic acquisition data quality evaluation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020478A (en) * 2012-12-28 2013-04-03 杭州师范大学 Method for checking reality of ocean color remote sensing product
CN103530347A (en) * 2013-10-09 2014-01-22 北京东方网信科技股份有限公司 Internet resource quality assessment method and system based on big data mining
CN103916860A (en) * 2014-04-16 2014-07-09 东南大学 Outlier data detection method based on space-time correlation in wireless sensor cluster network
CN108614803A (en) * 2018-04-16 2018-10-02 深圳市赑玄阁科技有限公司 A kind of meteorological data method of quality control and system
CN108898311A (en) * 2018-06-28 2018-11-27 国网湖南省电力有限公司 A kind of data quality checking method towards intelligent distribution network repairing dispatching platform


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
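A preliminary study of a combined evaluation method for the data quality of periodic statistical reports; 伍荣坤; 《统计研究》 (Statistical Research); 1993-02-07 (No. 01); full text *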

Also Published As

Publication number Publication date
CN110011847A (en) 2019-07-12


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant