CN118035683A

CN118035683A - Time sequence data analysis method, system and equipment

Info

Publication number: CN118035683A
Application number: CN202410194172.0A
Authority: CN
Inventors: 吴锐彬; 周海军
Original assignee: Guangzhou Yiheng Big Data Research Institute Co ltd; Shenzhen Longshu Information Technology Co ltd; Guangzhou Longshu Technology Co ltd
Current assignee: Guangzhou Yiheng Big Data Research Institute Co ltd; Shenzhen Longshu Information Technology Co ltd; Guangzhou Longshu Technology Co ltd
Priority date: 2024-02-21
Filing date: 2024-02-21
Publication date: 2024-05-14

Abstract

The invention relates to the technical field of data analysis, and in particular to a time series data analysis method comprising the following steps: acquiring data attributes of time series data and performing data conversion; performing extremum removal processing and data cleaning on the converted data attributes; grouping the cleaned time series data; performing data association and sorting on the grouped time series data, and determining valid association groups according to the sorting result; dividing the valid association groups to generate a data sorting training set; and matching the time series data to be processed against the data sorting training set to perform data analysis. Because the time series data are grouped before data association and sorting, and the timing period of the time series data is corrected according to the sorting result, the completion order of the time series data is kept consistent with the upload order. This prevents errors and confusion in time series data analysis and thereby improves its accuracy.

Description

Time sequence data analysis method, system and equipment

Technical Field

The present invention relates to the field of data analysis technologies, and in particular, to a method, a system, and an apparatus for analyzing time series data.

Background

With the continuous improvement of science and technology in China, Internet technology has developed rapidly and the total volume of data generated by users has grown exponentially. Within this large volume of data, time series data are a very common kind of temporal data. They are widely used in industry, agriculture, medical care, finance, transportation and other sectors and are closely tied to daily life, so analyzing time series data accurately has become important.

A commonly used time series analysis method is disclosed in Chinese patent grant publication No. CN108399434B, which describes a method for analyzing and predicting high-dimensional time series data based on feature extraction and relates to the technical field of data analysis. The method first measures the correlation among the condition attributes of the high-dimensional time series data and between the condition attributes and the decision attributes, and adds the condition attributes that are correlated with the decision attributes to an attribute kernel set; it then extracts features of the high-dimensional time series data; it then establishes a multiple linear regression model and optimizes the regression coefficients of the model through a particle swarm optimization algorithm based on health degree; finally, it obtains the value of the decision attribute at a given moment from the constructed multiple linear regression model. That method addresses the problems of low prediction efficiency, large error and a tendency to fall into locally optimal solutions when analyzing and predicting high-dimensional time series data, and improves the prediction effect of the multiple linear regression analysis algorithm on high-dimensional time series data.

However, the above method has the following problem: when time series data are analyzed, the servers are located in different places, so the order in which the time series data finish processing differs from the order in which they were uploaded, which causes errors and confusion in the time series data.

Disclosure of Invention

Therefore, the invention provides a time series data analysis method, system and device to solve the problem in the prior art that, because the servers are located in different places, the processing completion order of the time series data differs from the upload order, which causes errors and confusion in time series data analysis.

To achieve the above object, the present invention provides a time-series data analysis method, including:

Step S1, acquiring data attributes of time series data and performing data conversion;

Step S2, performing extremum removal processing and data cleaning on the data attributes that have undergone the data conversion;

Step S3, grouping the cleaned time series data according to the data attributes;

Step S4, performing data association and sorting on the grouped time series data, and determining a valid association group according to the sorting result;

Step S5, dividing the data of the valid association groups to generate a data sorting training set, and determining the silence duration of the time series data according to the data sorting training set;

Step S6, matching the time series data to be processed with the silence duration to perform data analysis;

wherein the data attributes comprise a data name, an upload time and a data resource, and the data association associates each item of time series data with its corresponding transmission duration and processing duration.

Further, in the step S1, the data conversion includes converting the data time of the time series data of the servers of different time zones into coordinated universal time, and converting the data format of each of the time series data into the same format.

Further, in the step S2, the extremum removal processing classifies the data resources corresponding to the time series data into different magnitudes according to a preset standard, determines the average data resource and the standard deviation for each magnitude, and performs extremum removal within each class;

and the data cleaning rejects time series data that do not satisfy preset screening conditions.

Further, in the step S3, the grouping manner includes grouping according to the quantile of the data resource corresponding to the time series data and grouping according to the resource level of the data resource corresponding to the time series data.

Further, in the step S4, the time series data in each group are sorted by upload time and by completion time respectively;

wherein the timing period comprises the transmission duration and the processing duration;

wherein the completion time is determined by the timing period and the upload time.

Further, in the step S4, whether the time series data in a single group form a valid association group is determined according to whether the sorting order by upload time is consistent with the sorting order by completion time;

if the sorting order of the time series data in the single group by upload time is consistent with the sorting order by completion time, the time series data in the single group are judged to be a valid association group;

and if the sorting order of the time series data in the single group by upload time is inconsistent with the sorting order by completion time, the time series data in the single group are judged to be an invalid association group.

Further, the time sequence period of the time sequence data corresponding to the invalid association group is corrected to generate a corresponding valid association group;

The correction of the timing period selects the longest timing period in the invalid association group as a standard period, and adds a silence duration to each timing period in the invalid association group so that it equals the standard period;

the silence duration is determined by equation (1):

t' = t + t1 = t0 (1)

where t' is the corrected timing period, t0 is the standard period, t is the original timing period, and t1 is the silence duration.

Further, determining whether to adopt data parallelism according to the resource magnitude of the data resources corresponding to the effective association group and the corresponding uploading interval;

and if the resource magnitude of the data resource corresponding to the effective association group is larger than a preset magnitude and the corresponding uploading interval is smaller than a preset interval, adopting data parallelism for the effective association group.

In another aspect, the present invention also provides a time-series data analysis system, including:

a collecting unit for collecting time series data and acquiring the corresponding data attributes;

a preprocessing unit, connected to the collecting unit, for performing data conversion, extremum removal processing and data cleaning on the time series data to generate primary time series data;

a sorting unit, connected to the preprocessing unit, for grouping the primary time series data according to the data attributes to generate data groups;

an association unit, connected to the sorting unit, for performing data association on the data groups to generate association groups;

a judging unit, connected to the association unit, for sorting the association groups and determining the type of each association group according to the sorting result;

a correction unit, connected to the judging unit and the association unit, for determining, according to the type of the association group, whether to correct the association group so as to generate a secondary association group;

a training unit, connected to the association unit and the correction unit, for dividing data according to the association groups and the secondary association groups to generate a data sorting training set, and for determining the silence duration of time series data according to the data sorting training set;

and an application unit, connected to the training unit, for matching the silence duration with the time series data to be processed so as to perform data analysis.

In another aspect, the present invention also provides a time-series data analysis apparatus, including:

A server to receive time-series data;

a processing component for performing data conversion, extremum removal processing and data cleaning on the time-series data to generate primary time-series data;

an analysis component, coupled to the processing component, for grouping the primary time series data and performing data association on it to generate association groups;

A decision component connected with the analysis component for ordering the association groups and determining the types of the association groups according to the ordering result;

a management component, coupled to the decision component and the analysis component, for determining whether to revise the association group to generate a secondary association group based on the type of the association group;

the development component is connected with the management component and the analysis component and is used for dividing data according to the association group and the secondary association group to generate a data sorting training set and determining the silent duration of time series data according to the data sorting training set;

And the simulation component is connected with the development component and is used for matching the silent duration with the time series data to be processed so as to perform data analysis.

Compared with the prior art, the method has the advantages that the time series data are grouped according to the resource magnitude of the corresponding data resource, the grouped time series data are subjected to data association and sorting, the time sequence period of the time series data is corrected according to the sorting result, the completion sequence of the time series is ensured to be consistent with the corresponding uploading sequence, and errors and confusion of time series data analysis are avoided, so that the accuracy of the time series data analysis is improved.

Furthermore, the invention performs data attribute acquisition and data conversion on the time series data, converting the data time recorded by servers in different time zones into coordinated universal time and unifying the data formats into one specific format or unit, thereby eliminating differences in recording time between servers and further improving the accuracy of time series data analysis.

Furthermore, the invention classifies the time series data into different data resource magnitudes according to the preset standard, determines the corresponding average data resource and standard deviation, and then performs extremum removal processing and data cleaning. Classifying by magnitude first makes it easier to remove the extreme outliers that most distort the analysis result while retaining more of the data characteristics, and the data cleaning removes erroneous, repeated, invalid or inconsistent data. Together these improve the quality of the time series data and further improve the accuracy of the time series data analysis.

Further, the invention can be applied to a network server for processing data or transmitting data by equipment, and can improve the efficiency, stability and reliability of data processing of time sequence period by processing and adjusting the time sequence period of time sequence data when the time sequence data is processed; when the method is applied to the transmission of data by the equipment, the normal operation and the data analysis accuracy of the equipment are ensured by processing and adjusting the time sequence period of the time sequence data, the processing speed of the data transmission and the overall performance of the equipment can be improved, the energy consumption is reduced, the reliability of a data server is improved, and the resource utilization of the equipment is optimized.

Drawings

FIG. 1 is a flow chart of a time series data analysis method according to an embodiment of the invention;

FIG. 2 is a schematic diagram illustrating the division of valid association groups and invalid association groups according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a timing cycle correction method for an invalid associated group according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a time-series data analysis system according to an embodiment of the invention.

Detailed Description

In order that the objects and advantages of the invention will become more apparent, the invention will be further described with reference to the following examples; it should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.

It should be noted that, in the description of the present invention, terms such as "upper," "lower," "left," "right," "inner," "outer," and the like indicate directions or positional relationships based on the directions or positional relationships shown in the drawings, which are merely for convenience of description, and do not indicate or imply that the apparatus or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.

Furthermore, it should be noted that, in the description of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "coupled" are to be construed broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or a communication between two elements. The specific meaning of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.

For a better understanding of the present invention, the terms in the present invention are explained as follows:

Extremum removal processing: outliers (i.e., extreme values) in the time series data are excluded to improve the data quality and accuracy of the time series data.

Data cleaning: preprocessing, screening and repairing the original data to ensure the quality and accuracy of the data.

Data format: the organization and representation of time series data in a server.

Data association: analyzing the association relationships among a plurality of data sets to find the correlations and causal relationships among them.

Data sorting training set: a labeled time series data set used to train a machine learning model, so that the model learns the underlying rules and can make predictions on new data.

Coordinated universal time: a time metering system based on the atomic-time second that is kept as close as possible to universal time.

Fig. 1 is a flow chart of a time series data analysis method according to an embodiment of the invention, where the time series data analysis method includes:

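Step S1, acquiring data attributes of time series data and performing data conversion;

Step S2, performing extremum removal processing and data cleaning on the data attributes that have undergone the data conversion;

Step S3, grouping the cleaned time series data according to the data attributes;

Step S4, performing data association and sorting on the grouped time series data, and determining a valid association group according to the sorting result;

Step S5, dividing the data of the valid association groups to generate a data sorting training set, and determining the silence duration of the time series data according to the data sorting training set;

Step S6, matching the time series data to be processed with the silence duration to perform data analysis;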

The data attribute comprises a data name, uploading time and a data resource; the data association is to associate the time series data with the corresponding transmission time length and processing time length.

According to the invention, the time series data are grouped according to the resource magnitude of the corresponding data resource, the grouped time series data are subjected to data association and sorting, the time sequence period of the time series data is corrected according to the sorting result, the completion sequence of the time series is ensured to be consistent with the corresponding uploading sequence, and the time series data analysis is prevented from being wrong and confused, so that the accuracy of the time series data analysis is improved.

The invention can be applied to processing data by a network server or transmitting data by equipment, and can improve the efficiency, stability and reliability of data processing of time sequence period by processing and adjusting the time sequence period of time sequence data when the invention is applied to time sequence data processing; when the method is applied to equipment for transmitting data, the normal operation and the data analysis accuracy of the equipment are ensured by processing and adjusting the time sequence period of the time sequence data, the processing speed of the data transmission and the overall performance of the equipment can be improved, the energy consumption is reduced, the reliability of a data server is improved, and the resource utilization of the equipment is optimized, wherein the equipment is used for transmitting the data.

Specifically, in step S1, the data conversion includes converting the data time of time series data from servers in different time zones into coordinated universal time, and converting the data formats of the respective time series data into the same format.

According to the invention, the time series data is converted from the data time of the servers in different time zones into the coordinated universal time and the data format is converted into the same format to be unified into a specific format or unit by carrying out data attribute acquisition and data conversion on the time series data, so that the difference of the recording time of the different servers is eliminated, and the accuracy of time series data analysis is further improved.
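As an illustrative, non-limiting sketch of the time conversion in step S1, the following Python fragment converts timestamps recorded in different local time zones into coordinated universal time. The function name `to_utc`, the use of fixed UTC offsets in hours, and the sample timestamps are assumptions made for illustration and do not appear in the embodiment.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_time: str, utc_offset_hours: float) -> datetime:
    """Interpret a naive ISO-8601 timestamp as local time at the given
    fixed UTC offset and return the same instant expressed in UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    naive = datetime.fromisoformat(local_time)
    return naive.replace(tzinfo=local_zone).astimezone(timezone.utc)


# Two hypothetical servers record the same instant in their own local time.
server_a = to_utc("2024-02-21T16:00:00", 8)  # a server at UTC+8
server_b = to_utc("2024-02-21T09:00:00", 1)  # a server at UTC+1

print(server_a.isoformat())  # 2024-02-21T08:00:00+00:00
print(server_a == server_b)  # True
```

Once every upload time is expressed in coordinated universal time, upload times from different servers can be compared and sorted directly.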

Specifically, in step S2, the extremum removal processing classifies the data resources corresponding to the time series data into different magnitudes according to the preset standard, determines the average data resource and the standard deviation for each magnitude, and performs extremum removal within each class;

In an implementation, the min-max standardization method is adopted to bring data of different magnitudes onto the same scale.
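Min-max standardization can be sketched as follows. This is a minimal illustration only: the function name `min_max_scale`, the choice to map a constant series to zeros, and the sample sizes are assumptions rather than details given in the embodiment.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant series carries no spread, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data resource sizes in M.
sizes_m = [800, 1000, 2000, 5000]
scaled = min_max_scale(sizes_m)
print(scaled[0], scaled[-1])  # 0.0 1.0
```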

It will be appreciated that the preset criteria relates to the processing capacity of the server to which the time series data corresponds.

For example, when the server processing capacity is 500 M/s the corresponding preset standard is 50 M, and when the server processing capacity is 1000 M/s the corresponding preset standard is 100 M.

It is understood that the average data resource is a ratio of a sum of data resources included in each time-series data to the number of time-series data.

It is understood that the standard deviation is the square root of the arithmetic mean of the squared deviations of the data resources of the time series data from their mean.
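Written out, and using symbols that are assumptions introduced here for illustration (n items of time series data in one magnitude class, with data resources x_1, ..., x_n), the average data resource and the standard deviation under the usual population definition are:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}
```

The quantity under the square root is the arithmetic mean of the squared deviations from the mean.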

In an implementation, the preset screening condition is that the time series data contain no duplicated entries and no missing entries.

According to the invention, the time series data are classified into different data resource magnitudes according to the preset standard, the corresponding average data resource and standard deviation are determined, and extremum removal processing and data cleaning are then carried out. Classifying by magnitude first makes it easier to remove the extreme values that most distort the analysis result while retaining more of the data characteristics, and the data cleaning removes erroneous, repeated, invalid or inconsistent data, which improves the quality of the time series data and further improves the accuracy of the time series data analysis.
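The extremum removal and data cleaning of step S2 can be sketched as below. The embodiment does not state a numeric cut-off, so the threshold of k standard deviations (here k = 2), the record field names and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def remove_extremes(values, k=2.0):
    """Keep values within k population standard deviations of the mean.
    The cut-off k is an assumed parameter, not one given in the embodiment."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) <= k * sigma]


def clean(records):
    """Drop records with a missing field and exact duplicates, keeping order."""
    seen = set()
    kept = []
    for rec in records:
        key = (rec.get("name"), rec.get("upload_time"), rec.get("size_m"))
        if None in key or key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept


# Hypothetical data resources of one magnitude class, with one extreme value.
sizes_m = [100, 102, 98, 101, 99, 500]
filtered = remove_extremes(sizes_m)
print(filtered)  # [100, 102, 98, 101, 99]

records = [
    {"name": "a", "upload_time": "2024-02-21T08:00:00", "size_m": 100},
    {"name": "a", "upload_time": "2024-02-21T08:00:00", "size_m": 100},  # duplicate
    {"name": "b", "upload_time": None, "size_m": 120},                   # missing field
    {"name": "c", "upload_time": "2024-02-21T08:05:00", "size_m": 98},
]
cleaned = clean(records)
print([r["name"] for r in cleaned])  # ['a', 'c']
```

Computing the mean and standard deviation per magnitude class, rather than over all data at once, keeps a large but legitimate data resource from being mistaken for an outlier among small ones.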

Specifically, in step S3, the grouping method includes grouping according to the quantile of the data resource corresponding to the time-series data and grouping according to the resource level of the data resource corresponding to the time-series data.

Please refer to Table 1, which shows the relationship between the grouping of the time series data and the resource magnitude of the data resource according to the present invention;

Table 1. Relationship between time series data groups and the resource magnitude of the data resource

Time series data name	Resource magnitude (M)	Group
Time series data No. 1	5000	First group
Time series data No. 2	2000	First group
Time series data No. 3	1000	Second group
Time series data No. 4	800	Second group

It can be understood that the processing capacity of the server in the table is 3000 M/s and the corresponding preset standard is 1500 M, so the time series data are grouped using 1500 M as the standard. The data resource of time series data No. 1 is 5000 M and that of No. 2 is 2000 M, both greater than the preset standard of 1500 M, so they are placed in the first group. The data resource of time series data No. 3 is 1000 M and that of No. 4 is 800 M, both less than the preset standard of 1500 M, so they are placed in the second group.
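The grouping in Table 1 can be reproduced with a short sketch. The function name `group_by_standard` and the handling of a data resource exactly equal to the preset standard (placed in the second group) are assumptions, since the embodiment does not address the boundary case.

```python
PRESET_STANDARD_M = 1500  # preset standard from the Table 1 example

# Resource magnitudes from Table 1, in M.
series = {"No. 1": 5000, "No. 2": 2000, "No. 3": 1000, "No. 4": 800}


def group_by_standard(sizes, standard):
    """Place data above the preset standard in the first group, the rest in the second."""
    groups = {"first": [], "second": []}
    for name, size in sizes.items():
        groups["first" if size > standard else "second"].append(name)
    return groups


groups = group_by_standard(series, PRESET_STANDARD_M)
print(groups)  # {'first': ['No. 1', 'No. 2'], 'second': ['No. 3', 'No. 4']}
```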

Specifically, in step S4, the time-series data in each group are sorted by upload time and by completion time respectively;

the time sequence period comprises a transmission time length and a processing time length;

Please refer to table 2, which is a ranking comparison table of the time sequence according to the uploading time and the completion time of the present invention;

it will be appreciated that the timing period t is determined by equation (2):

t = t2 + t3 (2)

wherein t2 is a transmission duration, and t3 is a processing duration.

It is understood that the completion time is the time at which the timing period elapses based on the upload time.
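Equation (2) and the definition of the completion time can be illustrated as follows. The function names and the sample durations are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def timing_period(transmission: timedelta, processing: timedelta) -> timedelta:
    """Equation (2): t = t2 + t3."""
    return transmission + processing


def completion_time(upload: datetime, period: timedelta) -> datetime:
    """The completion time is the upload time plus the timing period."""
    return upload + period


# Hypothetical values: 20 s transmission and 40 s processing.
upload = datetime(2024, 2, 21, 8, 0, 0)
t = timing_period(timedelta(seconds=20), timedelta(seconds=40))
done = completion_time(upload, t)
print(done)  # 2024-02-21 08:01:00
```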

Referring to FIG. 2, which is a schematic diagram of the division into valid association groups and invalid association groups according to an embodiment of the present invention, in step S4 whether the time series data in a single group form a valid association group is determined according to whether the sorting order by upload time is consistent with the sorting order by completion time;

If the sorting order of the time series data in the single group by upload time is consistent with the sorting order by completion time, the time series data in the single group are judged to be a valid association group;

If the sorting order of the time series data in the single group by upload time is inconsistent with the sorting order by completion time, the time series data in the single group are judged to be an invalid association group.

Please refer to table 3, which is a relationship comparison table of the two sorting results of the invalid association group according to the present invention;

Table 3

It will be appreciated that the smaller the resource level, the shorter the corresponding processing time, and the shorter the timing cycle.

It will be appreciated that the first group, whose sorting result by upload time is the same as its sorting result by completion time, is determined to be a valid association group, and the second group, whose sorting result by upload time differs from its sorting result by completion time, is determined to be an invalid association group.
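The validity check can be sketched by sorting one group twice and comparing the two orders. The record layout (`name`, `upload`, `period`, with times in seconds) and the sample values are assumptions for illustration; they are not the values of the tables in the embodiment.

```python
def is_valid_group(records):
    """A group is valid when sorting by upload time and sorting by completion
    time (upload time plus timing period) give the same order."""
    by_upload = [r["name"] for r in sorted(records, key=lambda r: r["upload"])]
    by_completion = [
        r["name"] for r in sorted(records, key=lambda r: r["upload"] + r["period"])
    ]
    return by_upload == by_completion


# Hypothetical groups; upload times and timing periods are in seconds.
first_group = [
    {"name": "A", "upload": 0, "period": 60},
    {"name": "B", "upload": 10, "period": 60},
]
second_group = [
    {"name": "C", "upload": 0, "period": 90},   # uploaded first, finishes at 90
    {"name": "D", "upload": 10, "period": 30},  # uploaded second, finishes at 40
]

print(is_valid_group(first_group))   # True
print(is_valid_group(second_group))  # False
```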

Fig. 3 is a flow chart of a timing period correction method of an invalid association group according to an embodiment of the present invention, which corrects the timing period of the time series data corresponding to the invalid association group to generate a corresponding valid association group;

The correction of the time sequence period is to select the longest time sequence period in the invalid association group as a standard period, and add a silent duration to the time sequence period in the invalid association group to generate the standard period;

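the silence duration is determined by equation (1):

t' = t + t1 = t0 (1)

where t' is the corrected timing period, t0 is the standard period, t is the original timing period, and t1 is the silence duration.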

It can be understood that silence is a state in which time series data is kept still without any processing and information transfer.
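The correction of equation (1) can be sketched as follows: the longest timing period in the invalid association group is taken as the standard period t0, and each item receives the silence duration t1 = t0 - t. The function name and the sample timing periods are assumptions for illustration.

```python
def silence_durations(periods):
    """Given {name: timing period t}, return {name: silence duration t1}
    such that t + t1 equals the standard period t0 (the longest period)."""
    t0 = max(periods.values())
    return {name: t0 - t for name, t in periods.items()}


# Hypothetical timing periods, in seconds, of an invalid association group.
periods = {"C": 90, "D": 30}
silence = silence_durations(periods)
corrected = {name: periods[name] + silence[name] for name in periods}

print(silence)    # {'C': 0, 'D': 60}
print(corrected)  # {'C': 90, 'D': 90}
```

Because every corrected timing period equals t0, each completion time becomes the upload time plus the same constant, so the completion order matches the upload order and the group becomes a valid association group.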

Specifically, whether to adopt data parallelism is determined according to the resource magnitude of the data resource corresponding to the time series data to be processed and the corresponding uploading interval;

And if the resource magnitude of the data resources corresponding to the effective association group is larger than a preset magnitude and the corresponding uploading interval is smaller than a preset interval, adopting data parallelism for the effective association group.

It can be understood that if the resource level of the data resource corresponding to the effective association group is not greater than the preset level and the corresponding uploading interval is not less than the preset interval, data parallelism is not adopted for the effective association group.

Typically, the preset magnitude is 1500-2000M and the preset interval is 5-7 min.

It is understood that the upload interval is a time interval of upload times of two time series data.

It will be appreciated that the predetermined magnitude and predetermined interval may be arbitrarily set according to the actual analysis requirements required for the time series data.

Please refer to table 4, which is a table of comparing the effective association of the present invention with the data-parallel relationship;

Table 4

It will be appreciated that the preset magnitude is 1500 M and the preset interval is 5 min. The uploading interval of time series data No. 1 to No. 4 is 4 min, which is smaller than the preset interval, and the magnitude of the data resources in the first group is greater than the preset magnitude. The completion time of time series data No. 2 in the second group should fall after the completion time of time series data No. 1 and before the completion time of time series data No. 3, so data parallelism is adopted for time series data No. 2.

It can be understood that data parallelism is a parallel computing mode in which a large amount of data is divided into small blocks and the blocks are processed by a plurality of processors at the same time, which improves computing speed and efficiency and shortens the timing period.
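The decision rule and the block-wise processing can be sketched as below. The thresholds reuse the example values of 1500 M and 5 min; the function names, the placeholder per-block work (a sum), the number of workers and the use of a thread pool are assumptions made for illustration. For CPU-bound work in Python a process pool would usually be the more suitable choice.

```python
from concurrent.futures import ThreadPoolExecutor

PRESET_MAGNITUDE_M = 1500  # example preset magnitude
PRESET_INTERVAL_MIN = 5    # example preset interval


def use_data_parallelism(size_m, upload_interval_min):
    """Adopt data parallelism when the data resource is larger than the preset
    magnitude and the uploading interval is smaller than the preset interval."""
    return size_m > PRESET_MAGNITUDE_M and upload_interval_min < PRESET_INTERVAL_MIN


def split(data, n_chunks):
    """Divide data into at most n_chunks contiguous blocks of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    """Placeholder for the real per-block processing."""
    return sum(chunk)


def process(data, size_m, upload_interval_min, workers=4):
    """Process data serially, or block by block in parallel when the rule applies."""
    if not use_data_parallelism(size_m, upload_interval_min):
        return process_chunk(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, split(data, workers)))


print(use_data_parallelism(2000, 4))           # True
print(use_data_parallelism(1000, 4))           # False
print(process(list(range(100)), 2000, 4))      # 4950
```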

Referring to fig. 4, a schematic structural diagram of a time-series data analysis system according to an embodiment of the invention is shown, and the time-series data analysis system includes:

A sorting unit, connected to the preprocessing unit, for grouping the primary time series data according to the data attributes to generate data groups;

an association unit, connected to the sorting unit, for performing data association on the data groups to generate association groups;

A server to receive time-series data;

An analysis component, coupled to the processing component, for grouping the primary time series data and performing data association on it to generate association groups;

The decision component is connected with the analysis component and used for sequencing the association groups and determining the types of the association groups according to the sequencing result;

The management component is connected with the decision component and the analysis component and is used for determining whether to correct the association group according to the type of the association group so as to generate a secondary association group;

The development component is connected with the management component and the analysis component and is used for dividing data according to the association group and the secondary association group to generate a data sorting training set and determining the silence duration of the time series data according to the data sorting training set;

And the simulation component is connected with the development component and is used for matching the silence duration with the time series data to be processed so as to perform data analysis.

Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

The foregoing description is only of the preferred embodiments of the invention and is not intended to limit the invention; various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A time-series data analysis method, comprising:

Step S1, acquiring data attributes of time series data and converting the data;

2. The time-series data analysis method according to claim 1, wherein in the step S1, the data conversion includes converting the data time of the time-series data from servers in different time zones into coordinated universal time, and converting the data format of each of the time-series data into the same format.

3. The method according to claim 2, wherein in the step S2, the extremum removal processing classifies the data resources corresponding to the time-series data into different magnitudes according to a preset standard, determines the corresponding average data resource and standard deviation for each magnitude, and performs the extremum removal within each class;

4. A time series data analysis method according to claim 3, wherein in said step S3, the grouping means includes grouping according to the quantile of the data resource corresponding to said time series data and grouping according to the resource level of the data resource corresponding to said time series data.

5. The time-series data analysis method according to claim 4, wherein in the step S4, the time-series data in each group are sorted by the upload time and by the completion time respectively;

6. The time-series data analysis method according to claim 5, wherein in the step S4, the sorting order of the time-series data in a single group by the uploading time is compared with the sorting order by the completion time, and whether the time-series data in the single group is a valid association group is determined according to the comparison result;

if the sorting order of the time series data in the single group by the uploading time is consistent with the sorting order by the completion time, the time series data in the single group is judged to be a valid association group;

and if the sorting order of the time series data in the single group by the uploading time is inconsistent with the sorting order by the completion time, the time series data in the single group is judged to be an invalid association group.

7. The method of claim 6, wherein the time-series period of the time-series data corresponding to the invalid association group is modified to generate a corresponding valid association group;

the silence duration is determined by equation (1):

t' = t + t1 = t0 (1)

8. The method according to claim 7, wherein determining whether to employ data parallelism is performed according to a resource level of a data resource corresponding to the time-series data to be processed and a corresponding uploading interval;

and if the resource magnitude of the data resource corresponding to the effective association group is larger than a preset magnitude and the corresponding uploading interval is smaller than a preset interval, adopting data parallelism for the effective association group.

9. A time series data analysis system, comprising:

10. A time-series data analysis apparatus, characterized by comprising:

A server to receive time-series data;

the development component is connected with the management component and the analysis component and is used for dividing data according to the association group and the secondary association group so as to generate a data sorting training set;