CN111680065A - Processing system, equipment and method for lag data in streaming computation - Google Patents

Processing system, equipment and method for lag data in streaming computation

Info

Publication number
CN111680065A
Authority
CN
China
Prior art keywords
data
module
calculation result
operator
stage water
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010450024.2A
Other languages
Chinese (zh)
Other versions
CN111680065B (en)
Inventor
韩佩利
施小江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taikang Insurance Group Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd filed Critical Taikang Insurance Group Co Ltd
Priority to CN202010450024.2A priority Critical patent/CN111680065B/en
Publication of CN111680065A publication Critical patent/CN111680065A/en
Application granted granted Critical
Publication of CN111680065B publication Critical patent/CN111680065B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24568 - Data stream processing; Continuous queries
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a processing system, device and method for lag data in streaming computation, relating to the technical field of data processing. The method comprises the following steps: when the timestamp of the first-stage water line arrives, the first-stage water line module triggers the operator processing module to process the first lag data to obtain first calculation result information, and the first calculation result is stored in the result storage module; when the timestamp of the second-stage water line arrives, the second-stage water line module triggers the operator processing module to process the second lag data to obtain second calculation result information, and stores the second calculation result in the result storage module; and the result storage module integrates the first calculation result and the second calculation result and then outputs the integrated result. The invention solves the problem of re-processing, in Flink, lag data that arrives after the Watermark.

Description

Processing system, equipment and method for lag data in streaming computation
Technical Field
The present invention relates to the field of data processing technologies, in particular to the processing of data in a streaming system, and more particularly to a processing method for lag data in streaming computation, a processing system for lag data in streaming computation, a computer device, and a computer-readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In an online data analysis platform, data is typically written to a Kafka message queue. To make real-time use of these data, Flink is employed for streaming real-time processing and computation. Flink distinguishes three notions of time: event time (EventTime), processing time (ProcessingTime), and ingestion time (IngestionTime). Because event time depends only on the message itself and is the most accurate, EventTime, combined with the business scenario, is used as the exact time of each piece of message data.
Event Time best reflects the time attribute of the data, but data carrying Event Time may be delayed or arrive out of order, while the Flink system itself can only process data one piece at a time. To handle lag data, the prior art adopts Flink's native water line mechanism. A water line (Watermark) is a marker on Event Time; in terms of content, a Watermark is a timestamp, and the arrival of a Watermark with timestamp X tells the Flink system that all data with Event Time less than X has arrived. At that moment Flink triggers the computation; before the Watermark arrives, Flink performs no computation and only collects the data of the specified time window, and that data may arrive out of order within the window. Although the water line mechanism provided by Flink allows data within a time window to arrive out of order before the Watermark arrives, thereby solving the computation problem caused by partially delayed and out-of-order data, there is no good solution for handling lag data that arrives after the Watermark.
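To make the native mechanism above concrete, the following is a minimal Java sketch of event-time windowing with Flink's built-in Watermark generator (Flink 1.12+ API); the ClickEvent POJO, the 5-second out-of-orderness bound and the 5-minute window are illustrative assumptions, not taken from the patent.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class NativeWatermarkSketch {

    /** Simple POJO standing in for a business event; hypothetical, not from the patent. */
    public static class ClickEvent {
        public String user;
        public long eventTime;   // EventTime in epoch milliseconds
        public int count;

        public ClickEvent() {}
        public ClickEvent(String user, long eventTime, int count) {
            this.user = user; this.eventTime = eventTime; this.count = count;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                new ClickEvent("user-1", 1_000L, 1),
                new ClickEvent("user-1", 3_000L, 2),   // events may arrive out of order
                new ClickEvent("user-1", 2_000L, 1))
            // Watermark = max observed EventTime - 5 s; anything older than the
            // current Watermark is treated as lag data by the window below.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((e, ts) -> e.eventTime))
            .keyBy(e -> e.user)
            // The window fires only after the Watermark passes its end timestamp.
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            .sum("count")
            .print();

        env.execute("native-watermark-sketch");
    }
}
```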
In Flink's original design, two approaches are provided for lag data arriving after the Watermark. The first is to discard it directly, i.e., lag data arriving after the Watermark is considered worthless and is no longer computed. The second is to set a fixed allowed lateness when writing the code, so that as long as the Event Time of the lag data falls within the allowed lateness, a Flink computation is triggered again and a result is computed again. Both approaches have their own drawbacks: discarding is too crude and loses part of the data; setting a fixed allowed lateness can handle part of the lag data, but the allowed lateness is fixed and not flexible enough, and the computation triggered by the delayed data produces a new result that cannot be associated with the result of the computation triggered by the earlier data.
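The two native options described above correspond to Flink's default drop-late-data behaviour and its allowedLateness() setting. The sketch below reuses the hypothetical ClickEvent POJO from the previous sketch with an assumed 3-second lateness, and also routes still-dropped records to a side output so the loss is at least visible; note that each allowed late record re-fires the window as a new, separate result, which is exactly the limitation the invention addresses.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class NativeLateDataOptionsSketch {

    /** Applies Flink's two native late-data options to a stream of ClickEvent. */
    public static SingleOutputStreamOperator<NativeWatermarkSketch.ClickEvent> windowWithLateness(
            DataStream<NativeWatermarkSketch.ClickEvent> events) {

        // Records later than Watermark + allowed lateness are routed here instead of
        // being silently dropped (option 1: discarding is the default behaviour).
        OutputTag<NativeWatermarkSketch.ClickEvent> lateTag =
                new OutputTag<NativeWatermarkSketch.ClickEvent>("dropped-lag-data") {};

        SingleOutputStreamOperator<NativeWatermarkSketch.ClickEvent> summed = events
                .keyBy(e -> e.user)
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))
                // Option 2: a fixed allowed lateness of 3 s. Each late-but-allowed record
                // re-fires the window and emits a NEW, separate result; Flink does not
                // merge it with the result already emitted when the Watermark fired.
                .allowedLateness(Time.seconds(3))
                .sideOutputLateData(lateTag)
                .sum("count");

        // Data beyond the allowed lateness; the application must handle it itself.
        DataStream<NativeWatermarkSketch.ClickEvent> stillDropped = summed.getSideOutput(lateTag);
        stillDropped.print("dropped");

        return summed;
    }
}
```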
Therefore, how to provide a new solution, which can solve the above technical problems, is a technical problem to be solved in the art.
Disclosure of Invention
In view of this, the present invention provides a processing method for lag data in streaming computation, a processing system for lag data in streaming computation, a computer device, and a computer-readable storage medium, in which a second-stage water line is introduced so that lag data arriving after the Watermark is processed again in Flink, and the result of the second computation of the lag data can be compensated into the result of the first trigger and output uniformly, thereby solving the problem of re-processing lag data that arrives after the Watermark in Flink.
To achieve this purpose, a processing system for lag data in streaming computation is provided, comprising a first-stage water line module, a second-stage water line module, an operator processing module and a result storage module;
the first-stage water level line module is used for triggering the operator processing module to process the first hysteresis data to obtain first calculation result information when the time stamp of the first-stage water level line arrives, and storing the first calculation result information to the result storage module;
the second-stage water line module is used for triggering the operator processing module to process the second hysteresis data to obtain second calculation result information when the time stamp of the second-stage water line arrives, and storing the second calculation result in the result storage module;
and the result storage module is used for carrying out data integration on the first calculation result and the second calculation result and then outputting the data.
In a preferred embodiment of the present invention, the second stage water line module comprises:
and the water level line adjusting module is used for adjusting the second-stage water level line according to the source end data stream and the first-stage water level line.
In a preferred embodiment of the present invention, the water level line adjusting module includes:
the data inflow determining module is used for determining data inflow information of the source end according to the log generation speed or the data generation speed of the source end;
the data threshold value presetting module is used for presetting a data threshold value according to the service of a service party and the real-time requirement of the streaming data;
and the water level line setting module is used for adjusting the second-stage water level line according to the data inflow information, the data threshold setting and the first-stage water level line.
In a preferred embodiment of the present invention, the operator processing module comprises:
the operator judging module is used for judging whether the operator of the second hysteresis data belongs to a compensatable operator;
and the data processing module is used for executing the second-stage water line module when the operator judging module judges that the operator is positive.
In a preferred embodiment of the present invention, the result saving module comprises:
the first storage module is used for storing the first calculation result information;
the second storage module is used for storing second calculation result information;
and the data integration module is used for integrating the data of the first calculation result information and the second calculation result information corresponding to the same time window and the same operator and outputting the integrated data.
One of the purposes of the present invention is to provide a processing method of lag data in streaming computing, which comprises the following steps:
when the time stamp of the first-stage water line arrives, the first-stage water line module triggers the operator processing module to process the first hysteresis data to obtain first calculation result information, and the first calculation result is stored in the result storage module;
when the time stamp of the second-level waterline arrives, the second-level waterline module triggers the operator processing module to process the second hysteresis data to obtain second calculation result information, and stores the second calculation result in the result storage module;
and the result storage module performs data integration on the first calculation result and the second calculation result and then outputs the first calculation result and the second calculation result.
In a preferred embodiment of the invention, the method further comprises:
and the second-stage water level line module adjusts a second-stage water level line according to the source end data stream and the first-stage water level line.
In a preferred embodiment of the present invention, the adjusting the second-stage water line according to the source data stream and the first-stage water line by the second-stage water line module includes:
the second-level water line module determines data inflow information of the source end according to the log generation speed or the data generation speed of the source end;
presetting a data threshold according to the service of a service party and the real-time requirement of stream data;
and adjusting the second-stage water level line according to the data inflow information, the data threshold setting and the first-stage water level line.
In a preferred embodiment of the invention, the method further comprises:
the operator processing module judges whether an operator of the second lag data belongs to a compensatable operator;
and if so, executing the step that when the time stamp of the second-level water line arrives, the second-level water line module triggers the operator processing module to process the second hysteresis data to obtain second calculation result information, and storing the second calculation result in the result storage module.
In a preferred embodiment of the present invention, the data integration of the first calculation result and the second calculation result by the result storage module includes:
and the result storage module integrates the data of the first calculation result information and the second calculation result information corresponding to the same time window and the same operator.
One of the objects of the present invention is to provide a computer device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the processing method for lag data in streaming computation.
One of the objects of the present invention is to provide a computer-readable storage medium storing a computer program for executing the processing method for lag data in streaming computation.
The invention has the advantages that a second-stage water line is introduced, lag data arriving after the Watermark is processed again in Flink, and the second computation result of the lag data can be compensated into the first-triggered result for uniform output, thereby solving the problem of re-processing, in Flink, lag data that arrives after the Watermark.
In order to make the aforementioned and other objects, features and advantages of the invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a system for processing lag data in streaming computing according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for processing lag data in streaming computing according to an embodiment of the present invention;
FIG. 3 is a block diagram of a system for processing lag data in streaming computing according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, method or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
In the present invention, Flink is a distributed processing engine that can be used for real-time computation, and Kafka is a message queue. When Flink is used as the tool for streaming computation, the generation time of the business data, namely EventTime, is used as Flink's data processing time in most scenarios. The business data is not transmitted directly to Flink; the general technical scheme is to write the business data into Kafka, and Flink then consumes the data in Kafka and performs the computation, the consumption being multi-threaded. Therefore, when data arrives at Flink, it mostly arrives out of order, and sometimes, due to consumption queue blockage, network congestion and the like, some data may arrive only after the Watermark has triggered.
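As a rough illustration of this Kafka-to-Flink pipeline, the sketch below consumes a Kafka topic with Flink's KafkaSource (assuming Flink 1.14+ with the flink-connector-kafka dependency); the broker address, topic and group id are hypothetical placeholders.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaIngestSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")            // hypothetical broker
                .setTopics("business-events")                 // hypothetical topic
                .setGroupId("flink-lag-data-demo")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Consumption is parallel (one reader per partition), so records generally
        // reach downstream operators out of EventTime order.
        DataStream<String> raw = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "kafka-business-data");

        raw.print();
        env.execute("kafka-ingest-sketch");
    }
}
```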
Based on this, the present invention provides a processing system for lag data in streaming computation. Fig. 1 is a schematic structural diagram of the system; referring to fig. 1, the system includes a first-stage water line module 100, a second-stage water line module 200, an operator processing module 300, and a result storage module 400.
The first-stage water line module 100 is configured to trigger the operator processing module 300 to process the first hysteresis data to obtain first calculation result information when the timestamp of the first-stage water line arrives, and store the first calculation result in the result storing module 400.
In one embodiment of the present invention, the design logic of the first-stage water line may be kept consistent with the prior-art design, with Flink's native Watermark giving a fixed water line value. In one embodiment, for the batch of streaming data in the time window [2019/08/09 18:45 ~ 2019/08/09 18:50], assuming the first water line is 2019/08/09 18:50, the first-stage water line Watermark-level1 is 2019/08/09 18:50.
The second level water line module 200 is configured to trigger the operator processing module 300 to process the second hysteresis data to obtain second calculation result information when the timestamp of the second level water line arrives, and store the second calculation result in the result storing module 400.
The result storage module 400 is configured to integrate the first calculation result and the second calculation result and then output the integrated data.
In one embodiment of the present invention, the water level setting of the second-stage water line depends on the flow rate of Flink's Source-end data stream and on the first-stage water line, i.e., the second-stage water line module 200 includes:
and the water level line adjusting module is used for adjusting the second-stage water level line according to the source end data stream and the first-stage water level line. In one embodiment of the present invention, the water level line adjusting module includes:
the data inflow determining module is used for determining data inflow information of the source end according to the log generation speed or the data generation speed of the source end;
the data threshold value presetting module is used for presetting a data threshold value according to the service of a service party and the real-time requirement of the streaming data;
and the water level line setting module is used for adjusting the second-stage water level line according to the data inflow information, the data threshold setting and the first-stage water level line.
That is, in a specific embodiment, first, the number of records flowing into the Source end per second can be calculated from the logs, the data generation speed, and the like, and the result is recorded as dataStreamValue (the inflow of Source-end data). The business side can set a threshold value according to the real-time requirements of the business and of the streaming-data computation. If dataStreamValue >= threshold, the current data traffic is large, and the value of |Watermark-level1 - Watermark-level2| should be kept as small as possible; if dataStreamValue < threshold, the current data traffic is small, and the value of |Watermark-level1 - Watermark-level2| may be enlarged appropriately.
In an embodiment of the present invention, when the traffic is small, if the value of |Watermark-level1 - Watermark-level2| is adjusted to 3 seconds for the batch of streaming data in the time window [2019/08/09 18:45 ~ 2019/08/09 18:50], then Watermark-level2 is 2019/08/09 18:53.
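A minimal sketch of this adjustment rule follows; the class name, the way dataStreamValue is measured and the concrete 1-second and 3-second gaps are assumptions for illustration, not values prescribed by the text.

```java
/**
 * Sketch of the second-stage water line adjustment described above: the gap between
 * the two water lines shrinks when the Source inflow is high and grows when it is low.
 */
public final class SecondWatermarkAdjuster {

    private final long threshold;        // records per second, set by the business side
    private final long smallGapMillis;   // used when traffic is heavy
    private final long largeGapMillis;   // used when traffic is light

    public SecondWatermarkAdjuster(long threshold, long smallGapMillis, long largeGapMillis) {
        this.threshold = threshold;
        this.smallGapMillis = smallGapMillis;
        this.largeGapMillis = largeGapMillis;
    }

    /**
     * @param watermarkLevel1  timestamp of the first-stage water line (epoch millis)
     * @param dataStreamValue  measured Source inflow, records per second
     * @return timestamp of the second-stage water line (epoch millis)
     */
    public long watermarkLevel2(long watermarkLevel1, long dataStreamValue) {
        // |Watermark-level1 - Watermark-level2| is kept small under heavy traffic
        // and enlarged under light traffic.
        long gap = (dataStreamValue >= threshold) ? smallGapMillis : largeGapMillis;
        return watermarkLevel1 + gap;
    }

    public static void main(String[] args) {
        SecondWatermarkAdjuster adjuster = new SecondWatermarkAdjuster(10_000, 1_000, 3_000);
        long level1 = 1_565_347_800_000L;  // e.g. 2019/08/09 18:50:00 (UTC+8) in epoch millis
        System.out.println(adjuster.watermarkLevel2(level1, 500));     // light traffic -> +3 s
        System.out.println(adjuster.watermarkLevel2(level1, 50_000));  // heavy traffic -> +1 s
    }
}
```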
In one embodiment of the present invention, when the timestamp of the first-stage water line arrives, the operator processing module 300 is triggered to process the first lag data to obtain the first calculation result information. Specifically, for the first lag data of the first-stage water line, the operator processing module performs the computation of the corresponding operator. In an embodiment of the invention, when the first lag data is computed with the sum operator, a summation is performed over the batch of streaming data in the time window [2019/08/09 18:45 ~ 2019/08/09 18:50] delimited by the first water line; the string "[2019/08/09 18:45 ~ 2019/08/09 18:50]+sum" is then used as the key, the value is the result computed by the sum operator, and this key-value pair is placed in the result storage module for caching.
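The caching step can be pictured with the small sketch below: the first computation is stored under a key of the form "[time window]+operator" rather than being emitted immediately. The class and method names are illustrative assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of the result storage module's caching step: results are held back under a
 * "[time window]+operator" key until the system decides to emit them.
 */
public final class ResultStore {

    private final Map<String, Long> cache = new ConcurrentHashMap<>();

    /** Builds the key "[windowStart ~ windowEnd]+operator", e.g. "[... 18:45 ~ ... 18:50]+sum". */
    public static String key(String windowStart, String windowEnd, String operator) {
        return "[" + windowStart + " ~ " + windowEnd + "]+" + operator;
    }

    /** Stores the result of the first-stage (Watermark-level1) computation. */
    public void putFirstResult(String key, long value) {
        cache.put(key, value);
    }

    public Long get(String key) {
        return cache.get(key);
    }
}
```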
In one embodiment of the present invention, for the processing performed when the timestamp of the second-stage water line arrives, the operator processing module includes:
the operator judging module is used for judging whether the operator of the second hysteresis data belongs to a compensatable operator;
and the data processing module is used for executing the second-stage water line module when the operator judging module judges that the operator is positive.
Specifically, consider the data of the second-stage water line (i.e., the second lag data). Since the second-stage water line Watermark-level2 is 2019/08/09 18:53, 3 s later than the first-stage water line Watermark-level1, it covers data whose business time lies in [2019/08/09 18:45 ~ 2019/08/09 18:50] but which, owing to data delay, had not yet arrived when Watermark-level1 triggered; when the second-stage water line reaches its trigger condition, this lag data can be fed into the operator processing module again. In this embodiment, unlike the first pass, the second pass first determines the operator used to compute the data. Most Flink operators support compensated computation, such as the common sum, min (minimum), max (maximum), map (per-record processing), flatMap (flattening) and filter operators, but some operators do not support compensation, such as the average operator. Therefore, for the second lag data of the second-stage water line it is first judged whether the operator can be compensated; if it cannot, the operator processing module ends and the result storage module directly outputs the first calculation result information. If it can be compensated, the second lag data is fed into the operator processing module, which performs the computation of the corresponding operator. In an embodiment of the invention, when the second lag data is computed with the sum operator, after the computation triggered by the second-stage water line finishes, the string "[2019/08/09 18:45 ~ 2019/08/09 18:50]+sum" is used as the key, the value is the result computed by the sum operator, and this key-value pair is placed in the result storage module for caching.
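The operator judgment described above can be sketched as a simple membership test; the enum below covers only the operators named in the text and is an illustrative assumption, not an exhaustive list.

```java
import java.util.EnumSet;
import java.util.Set;

/**
 * Sketch of the operator judging step: only operators whose results can be compensated
 * by a later partial result are sent through the second-stage water line path.
 */
public final class OperatorCompensation {

    public enum Op { SUM, MIN, MAX, MAP, FLAT_MAP, FILTER, AVERAGE }

    /** sum/min/max/map/flatMap/filter can be compensated; average cannot. */
    private static final Set<Op> COMPENSATABLE =
            EnumSet.of(Op.SUM, Op.MIN, Op.MAX, Op.MAP, Op.FLAT_MAP, Op.FILTER);

    public static boolean isCompensatable(Op op) {
        return COMPENSATABLE.contains(op);
    }

    public static void main(String[] args) {
        // If the operator cannot be compensated, the first result is output directly
        // and the second-stage water line computation is skipped.
        System.out.println(isCompensatable(Op.SUM));      // true
        System.out.println(isCompensatable(Op.AVERAGE));  // false
    }
}
```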
In one embodiment of the present invention, the result saving module includes:
the first storage module is used for storing the first calculation result information;
the second storage module is used for storing second calculation result information;
and the data integration module is used for integrating the data of the first calculation result information and the second calculation result information corresponding to the same time window and the same operator and outputting the integrated data.
In an embodiment of the present invention, the result storage module stores key-value data. When the result storage module is triggered and needs to output data, a reduce operation is performed once over the data in the result storage module, merging entries with the same key, i.e., data of the same time window and the same operator. The key is composed as "[2019/08/09 18:45 ~ 2019/08/09 18:50]+sum", that is, a string of time window plus operator, so how entries with the same key are merged in the reduce depends on the operator type: if the operator is sum, the values are added; if it is max, the maximum value is taken.
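The per-operator merge can be sketched as follows; the method names and the example numbers are illustrative assumptions.

```java
/**
 * Sketch of the final reduce step: results cached under the same "[time window]+operator"
 * key are merged according to the operator type before the single combined record is
 * emitted to the Sink.
 */
public final class ResultMerger {

    /** Merges the first-trigger value and the second-trigger (compensation) value. */
    public static long merge(String operator, long firstValue, long secondValue) {
        switch (operator) {
            case "sum":
                return firstValue + secondValue;           // sum: values are added
            case "max":
                return Math.max(firstValue, secondValue);  // max: keep the larger value
            case "min":
                return Math.min(firstValue, secondValue);
            default:
                throw new IllegalArgumentException("operator not compensatable: " + operator);
        }
    }

    public static void main(String[] args) {
        // First trigger computed 120 over [18:45 ~ 18:50]; the late record contributes 5 more.
        System.out.println(merge("sum", 120, 5));   // 125
        System.out.println(merge("max", 120, 5));   // 120
    }
}
```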
With the processing system for lag data in streaming computation provided by the invention, a computation is triggered when the first-stage water line timestamp arrives and the computation result is retained but not yet output. If the computed operator does not meet the condition for introducing the second-stage water line, the result of the first-stage computation is output directly; if it does meet the condition, the system waits for the second-stage water line, triggers a computation again when the second-stage water line timestamp arrives, and merges the two computations before outputting. In this way the allowed-lateness setting can be adjusted dynamically according to the size of the data stream, and the second computation result of the delayed data can be compensated into the first-triggered result and output uniformly.
Furthermore, although several unit modules of the system are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units described above may be embodied in one unit; conversely, the features and functions of one unit described above may be further divided into and embodied by a plurality of units. The terms "module" and "unit" used above may refer to software and/or hardware that realizes a predetermined function. While the modules described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
Having described the system for processing lag data in streaming computing according to an exemplary embodiment of the present invention, a method according to an exemplary embodiment of the present invention will be described with reference to the accompanying drawings. The implementation of the method can be referred to the above overall implementation, and repeated details are not repeated.
The present invention further provides a method for processing lag data in stream computing, where fig. 2 is a schematic flow chart of the method, and please refer to fig. 2, the method includes:
s101: and when the time stamp of the first-stage water line arrives, the first-stage water line module triggers the operator processing module to process the first hysteresis data to obtain first calculation result information and store the first calculation result in the result storage module.
In one embodiment of the present invention, the design logic of the first-stage water line may be kept consistent with the prior-art design, with Flink's native Watermark giving a fixed water line value. In one embodiment, for the batch of streaming data in the time window [2019/08/09 18:45 ~ 2019/08/09 18:50], assuming the first water line is 2019/08/09 18:50, the first-stage water line Watermark-level1 is 2019/08/09 18:50.
S102: and when the time stamp of the second-level waterline arrives, the second-level waterline module triggers the operator processing module to process the second hysteresis data to obtain second calculation result information, and stores the second calculation result in the result storage module.
S103: and the result storage module performs data integration on the first calculation result and the second calculation result and then outputs the first calculation result and the second calculation result.
In one embodiment of the present invention, the water level setting of the second-stage water line depends on the flow rate of Flink's Source-end data stream and on the first-stage water line, and therefore the method further comprises:
and the second-stage water level line module adjusts a second-stage water level line according to the source end data stream and the first-stage water level line. In one embodiment of the invention, the steps include:
the second-level water line module determines data inflow information of the source end according to the log generation speed or the data generation speed of the source end;
presetting a data threshold according to the service of a service party and the real-time requirement of stream data;
and adjusting the second-stage water level line according to the data inflow information, the data threshold setting and the first-stage water level line.
That is, in a specific embodiment, first, the number of records flowing into the Source end per second can be calculated from the logs, the data generation speed, and the like, and the result is recorded as dataStreamValue. The business side can set a threshold value according to the real-time requirements of the business and of the streaming-data computation. If dataStreamValue >= threshold, the current data traffic is large, and the value of |Watermark-level1 - Watermark-level2| should be kept as small as possible; if dataStreamValue < threshold, the current data traffic is small, and the value of |Watermark-level1 - Watermark-level2| may be enlarged appropriately.
In an embodiment of the present invention, when the traffic is small, if the value of |Watermark-level1 - Watermark-level2| is adjusted to 3 seconds for the batch of streaming data in the time window [2019/08/09 18:45 ~ 2019/08/09 18:50], then Watermark-level2 is 2019/08/09 18:53.
In one embodiment of the invention, when the timestamp of the first-stage water line arrives, the operator processing module is triggered to process the first lag data to obtain the first calculation result information. Specifically, for the first lag data of the first-stage water line, the operator processing module performs the computation of the corresponding operator. In an embodiment of the invention, when the first lag data is computed with the sum operator, a summation is performed over the batch of streaming data in the time window [2019/08/09 18:45 ~ 2019/08/09 18:50] delimited by the first water line; the string "[2019/08/09 18:45 ~ 2019/08/09 18:50]+sum" is then used as the key, the value is the result computed by the sum operator, and this key-value pair is placed in the result storage module for caching.
In one embodiment of the invention, when the time stamp of the second stage waterline arrives, the method further comprises:
the operator processing module judges whether an operator of the second lag data belongs to a compensatable operator;
and if so, executing the step that when the time stamp of the second-level water line arrives, the second-level water line module triggers the operator processing module to process the second hysteresis data to obtain second calculation result information, and storing the second calculation result in the result storage module.
Specifically, consider the data of the second-stage water line (i.e., the second lag data). Since the second-stage water line Watermark-level2 is 2019/08/09 18:53, 3 s later than the first-stage water line Watermark-level1, it covers data whose business time lies in [2019/08/09 18:45 ~ 2019/08/09 18:50] but which, owing to data delay, had not yet arrived when Watermark-level1 triggered; when the second-stage water line reaches its trigger condition, this lag data can be fed into the operator processing module again. In this embodiment, unlike the first pass, the second pass first determines the operator used to compute the data. Most Flink operators support compensated computation, such as the common sum, min, max, map, flatMap and filter operators, but some operators do not support compensation, such as the average operator. Therefore, for the second lag data of the second-stage water line it is first judged whether the operator can be compensated; if it cannot, the operator processing module ends and the result storage module directly outputs the first calculation result information. If it can be compensated, the second lag data is fed into the operator processing module, which performs the computation of the corresponding operator. In an embodiment of the invention, when the second lag data is computed with the sum operator, after the computation triggered by the second-stage water line finishes, the string "[2019/08/09 18:45 ~ 2019/08/09 18:50]+sum" is used as the key, the value is the result computed by the sum operator, and this key-value pair is placed in the result storage module for caching.
In one embodiment of the present invention, step S103 includes:
and the result storage module integrates the data of the first calculation result information and the second calculation result information corresponding to the same time window and the same operator.
In an embodiment of the present invention, the result storage module stores key-value data. When the result storage module is triggered and needs to output data, a reduce operation is performed once over the data in the result storage module, merging entries with the same key, i.e., data of the same time window and the same operator. The key is composed as "[2019/08/09 18:45 ~ 2019/08/09 18:50]+sum", that is, a string of time window plus operator, so how entries with the same key are merged in the reduce depends on the operator type: if the operator is sum, the values are added; if it is max, the maximum value is taken.
The processing method for lag data in streaming computation provided by the invention first triggers a computation when the first-stage water line timestamp arrives, retains the computation result and does not output it for the moment. When the computed operator does not meet the condition for introducing the second-stage water line, the result of the first-stage computation is output directly; if it does meet the condition, the method waits for the second-stage water line, triggers a computation again after the second-stage water line timestamp arrives, and merges the two computations before outputting. This not only allows the allowed-lateness setting to be adjusted dynamically according to the size of the data stream, but also lets the second computation result of the delayed data be compensated into the first-triggered result and output uniformly.
The invention also provides a computer device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the processing method for lag data in streaming computation.
The invention also provides a computer-readable storage medium storing a computer program for executing the processing method for lag data in streaming computation.
It should be noted that while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions. Having described exemplary embodiments of the present invention, a system of exemplary embodiments of the present invention will now be described with reference to the accompanying drawings. The implementation of the system can be referred to the above overall implementation, and repeated details are not repeated.
The technical solution of the present invention will be described in detail with reference to specific examples.
Fig. 3 is a schematic diagram of a system for processing lag data in streaming computation according to an embodiment of the present invention. Referring to fig. 3, the system includes four virtual modules: a first-stage water line module, a second-stage water line module, an operator processing module, and a result storage module.
When Flink is used as the tool for streaming computation, the generation time of the business data, namely EventTime, is used as Flink's data processing time in most scenarios. The business data cannot be transmitted directly to Flink; the general technical scheme is to write the business data into Kafka, and Flink consumes the data in Kafka and performs the computation, the consumption being multi-threaded. Therefore, when data arrives at Flink, it mostly arrives out of order, and sometimes, due to consumption queue blockage, network congestion and the like, some data may arrive only after the Watermark has triggered; if such data is not to be discarded and is to be compensated into the result after the Watermark has triggered, the technical scheme provided by the invention can be adopted.
In this embodiment, when the first-stage water line timestamp X1 arrives, a computation may be triggered, and the computation result is retained in the result storage module and not output to the Sink for the moment.
The operator processing module judges the operator type of the computation triggered by Flink this time and whether the computed operator meets the condition for introducing the second-stage water line; if not, the computation result of the first-stage water line in the result storage module is output directly to the Sink; if so, the computation of the second-stage water line is awaited.
When the second water level line timestamp X2 arrives, a calculation is triggered again, and the calculation result is still written into the result storage module.
When the computation triggered by the second-stage water line timestamp is finished, the result storage module merges the two computations and then outputs them to the Sink. In this way, as little data as possible is discarded, and the second computation result of the delayed data can be compensated into the first-triggered result and output uniformly.
In summary, the processing method for lag data in streaming computation, the processing system for lag data in streaming computation, the computer device, and the computer-readable storage medium provided by the invention solve the problem of re-processing, in Flink, lag data that arrives after the Watermark by introducing the concept of a second-stage water line and giving lag data a chance to participate in the computation again. The second-stage water line can be adjusted dynamically according to the data traffic, and, through the operator judgment and the result set, the second computation result of the delayed data can be compensated into the first-triggered result and output uniformly.
Improvements to a technology can clearly be distinguished between hardware improvements (e.g. improvements to the circuit structure of diodes, transistors, switches, etc.) and software improvements (improvements to the process flow). However, as technology advances, many of today's process flow improvements have been seen as direct improvements in hardware circuit architecture. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into the hardware circuit.
Thus, it cannot be said that an improvement in a method flow cannot be realized with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it himself, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually making integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development and writing, while the source code before compilation must also be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can be readily obtained merely by slightly logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller; examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of software products, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer system (which may be a personal computer, a server, or a network system, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable systems, tablet-type systems, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics systems, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or systems, and the like.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing systems that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage systems.
While the present application has been described with examples, those skilled in the art will appreciate that there are numerous variations and permutations of the present application without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and modifications as fall within the true spirit of the application.

Claims (10)

1. The system for processing the hysteresis data in the stream computing is characterized by comprising a first-stage water level line module, a second-stage water level line module, an operator processing module and a result storage module;
the first-stage water level line module is used for triggering the operator processing module to process first hysteresis data to obtain first calculation result information when a time stamp of a first-stage water level line arrives, and storing the first calculation result to the result storage module;
the second-level waterline module is used for triggering the operator processing module to process second hysteresis data to obtain second calculation result information when a time stamp of the second-level waterline is reached, and storing the second calculation result to the result storage module;
and the result storage module is used for carrying out data integration on the first calculation result and the second calculation result and then outputting the data integration.
2. The system of claim 1, wherein the second stage water line module comprises:
the water level line adjusting module is used for adjusting a second-stage water level line according to the source end data stream and the first-stage water level line;
wherein the water level line adjusting module comprises:
the data inflow determining module is used for determining data inflow information of the source end according to the log generation speed or the data generation speed of the source end;
the data threshold value presetting module is used for presetting a data threshold value according to the service of a service party and the real-time requirement of the streaming data;
and the water level line setting module is used for adjusting a second-stage water level line according to the data inflow information, the data threshold setting and the first-stage water level line.
3. The system of claim 1, wherein the operator processing module comprises:
the operator judging module is used for judging whether an operator of the second hysteresis data belongs to a compensatable operator;
and the data processing module is used for executing the second-stage water line module when the operator judgment module judges that the operator is yes.
4. The system of claim 1, wherein the result saving module comprises:
the first storage module is used for storing the first calculation result information;
the second storage module is used for storing second calculation result information;
and the data integration module is used for integrating the data of the first calculation result information and the second calculation result information corresponding to the same time window and the same operator and outputting the integrated data.
5. A method for processing lag data in streaming computing, which is applied to the system for processing lag data in streaming computing according to any one of claims 1 to 4, and comprises:
when the time stamp of the first-stage water line arrives, the first-stage water line module triggers an operator processing module to process first hysteresis data to obtain first calculation result information, and the first calculation result is stored in a result storage module;
when the time stamp of the second-level water line arrives, the second-level water line module triggers the operator processing module to process second hysteresis data to obtain second calculation result information, and stores the second calculation result to the result storage module;
and the result storage module performs data integration on the first calculation result and the second calculation result and then outputs the first calculation result and the second calculation result.
6. The method of claim 5, further comprising:
the second level water line module adjusts a second level water line according to the source end data stream and the first level water line, and the method comprises the following steps:
the second-stage water level line module determines data inflow information of the source end according to the log generation speed or the data generation speed of the source end;
presetting a data threshold according to the service of a service party and the real-time requirement of stream data;
and adjusting the second-stage water level line according to the data inflow information, the data threshold setting and the first-stage water level line.
7. The method of claim 5, further comprising:
the operator processing module judges whether an operator of the second hysteresis data belongs to a compensatable operator;
and if so, executing a step that when the time stamp of the second-level water line arrives, the second-level water line module triggers the operator processing module to process second hysteresis data to obtain second calculation result information, and storing the second calculation result to the result storage module.
8. The method of claim 5, wherein the data integration of the first calculation result and the second calculation result by a result saving module comprises:
and the result storage module is used for carrying out data integration on the first calculation result information and the second calculation result information corresponding to the same time window and the same operator.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 5 to 8 when executing the computer program.
10. A computer-readable storage medium storing a program for performing the method of any one of claims 5 to 8.
CN202010450024.2A 2020-05-25 2020-05-25 Processing system, equipment and method for hysteresis data in stream type calculation Active CN111680065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010450024.2A CN111680065B (en) 2020-05-25 2020-05-25 Processing system, equipment and method for hysteresis data in stream type calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010450024.2A CN111680065B (en) 2020-05-25 2020-05-25 Processing system, equipment and method for hysteresis data in stream type calculation

Publications (2)

Publication Number Publication Date
CN111680065A true CN111680065A (en) 2020-09-18
CN111680065B CN111680065B (en) 2023-11-10

Family

ID=72434665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010450024.2A Active CN111680065B (en) 2020-05-25 2020-05-25 Processing system, equipment and method for hysteresis data in stream type calculation

Country Status (1)

Country Link
CN (1) CN111680065B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170289588A1 (en) * 2015-02-13 2017-10-05 Sk Telecom Co., Ltd. Method and apparatus for providing multi-view streaming service
US20160357476A1 (en) * 2015-06-05 2016-12-08 Microsoft Technology Licensing, Llc Streaming joins in constrained memory environments
CN107870874A (en) * 2016-09-23 2018-04-03 华为数字技术(成都)有限公司 A kind of data write-in control method and storage device
CN109412732A (en) * 2017-08-16 2019-03-01 深圳市中兴微电子技术有限公司 A kind of control method and device of receiving end delay jitter
US20190163545A1 (en) * 2017-11-30 2019-05-30 Oracle International Corporation Messages with delayed delivery in an in-database sharded queue
CN109213793A (en) * 2018-08-07 2019-01-15 泾县麦蓝网络技术服务有限公司 A kind of stream data processing method and system
CN110222246A (en) * 2019-06-17 2019-09-10 广州小鹏汽车科技有限公司 A kind of data screening method and apparatus
CN110990438A (en) * 2019-12-09 2020-04-10 北京明略软件***有限公司 Data processing method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁秋红 et al., "Research on Task Scheduling Optimization Methods for Big Data Stream Computing Frameworks" (大数据流式计算框架的任务调度优化方法研究), Journal of Zhongzhou University (中州大学学报), vol. 36, no. 3, pp. 125-128 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112286582A (en) * 2020-12-31 2021-01-29 浙江岩华文化科技有限公司 Multithreading data processing method, device and medium based on streaming computing framework
CN112286582B (en) * 2020-12-31 2021-03-16 浙江岩华文化科技有限公司 Multithreading data processing method, device and medium based on streaming computing framework

Also Published As

Publication number Publication date
CN111680065B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
Buddhika et al. Neptune: Real time stream processing for internet of things and sensing environments
US8321865B2 (en) Processing of streaming data with a keyed delay
US8768078B2 (en) Intelligent media decoding
US8789017B2 (en) System and method for using stream objects to perform stream processing in a text-based computing environment
JP2012043409A (en) Computer-implementing method, system, and computer program for processing data stream
CN110417609B (en) Network traffic statistical method and device, electronic equipment and storage medium
CN112202595A (en) Abstract model construction method based on time sensitive network system
CN113792240A (en) Page loading method and device and electronic equipment
CN111680065A (en) Processing system, equipment and method for lag data in streaming computation
CN113570033A (en) Neural network processing unit, neural network processing method and device
Mayer et al. Meeting predictable buffer limits in the parallel execution of event processing operators
US10725817B2 (en) Reducing spin count in work-stealing for copying garbage collection based on average object references
CN110968404B (en) Equipment data processing method and device
WO2019019295A1 (en) Ring data buffering implementation method based on synchronization mechanism for embedded system
CN112434092A (en) Data processing method and device, electronic equipment and readable storage medium
CN112202596A (en) Abstract model construction device based on time sensitive network system
US10769063B2 (en) Spin-less work-stealing for parallel copying garbage collection
CN111651267A (en) Method and device for performing performance consumption optimization analysis on parallel operation
CN110727666A (en) Cache assembly, method, equipment and storage medium for industrial internet platform
US20140359429A1 (en) Method, computer program, and system for rearranging a server response
CN111061259A (en) Incident driving method, system, device and storage medium for walking robot
CN115470901A (en) Hybrid precision training method and device supporting load sharing of heterogeneous processor at mobile terminal
US9843550B2 (en) Processing messages in a data messaging system using constructed resource models
Sun et al. DSSP: stream split processing model for high correctness of out-of-order data processing
CN111324458A (en) Large file downloading acceleration method based on Java

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant