CN112486639A - Data saving and restoring method and device for task, server and storage medium

Data saving and restoring method and device for task, server and storage medium

Info

Publication number
CN112486639A
CN112486639A
Authority
CN
China
Prior art keywords
data
task
storage system
checkpoint
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910866743.XA
Other languages
Chinese (zh)
Inventor
宋亚东
杨长江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201910866743.XA
Publication of CN112486639A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/485 Task life-cycle, e.g. stopping, restarting, resuming execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1415 Saving, restoring, recovering or retrying at system level
    • G06F 11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Retry When Errors Occur (AREA)

Abstract

Embodiments of the invention provide a method, a device, a server and a storage medium for saving and recovering the data of a task. The method saves only the data that changed between two adjacent checkpoints of the task, i.e., it saves incrementally; compared with the full-save approach in the related art, this greatly reduces the amount of data to be saved and the redundancy among the saved data, and thus greatly improves resource utilization and storage efficiency. Secondly, embodiments of the invention save the change data into a storage system with a snapshot function instead of into memory, which removes the limitation of memory size and makes the method applicable to a wide range of big data stream computing scenarios; and the snapshot operation can be executed through the snapshot function of the storage system itself, so the required changes are small, easy to implement, and low in cost.

Description

Data saving and restoring method and device for task, server and storage medium
Technical Field
The invention relates to the field of big data stream computing, and in particular to a method, a device, a server and a storage medium for saving and restoring the data of a task.
Background
In the big data field, data processing modes fall into two types according to timeliness: batch processing and stream processing. The streaming mode assumes that the potential value of data lies in its freshness, so a streaming system should process data and produce results as soon as possible; in this mode, data arrives in a streaming manner. In the batch processing mode, data is stored first and analyzed later, so batch processing is not suitable for scenarios with strict processing-latency requirements.
Corresponding to the two processing modes, there are two kinds of big data processing systems: batch processing systems and stream processing systems. With the development of business intelligence and computational advertising, stream processing, which emphasizes real-time performance, has received much attention. In a stream processing system, data typically flows into the system from a variety of data sources and is processed in a near-real-time manner. Because near-real-time processing can surface valuable information as early as possible, many commercial companies now pay more attention to real-time processing systems than to traditional batch processing systems.
The framework of a stream processing system is also referred to as a stream computing framework, and the input to a stream computing job is a continuous stream of data (also referred to as an event stream). Stream computing frameworks typically provide a fault tolerance mechanism: once a job fails, the job is restarted and continues processing from the position in the stream that had already been processed. This is guaranteed by a checkpoint mechanism, which periodically saves the execution state of the job (i.e., the task execution data) so that, after the job is restarted following a failure, it is restored to the most recently saved state and continues execution from there.
In the related art, the checkpoint mechanism of mainstream stream computing frameworks is as follows: the task execution data of each checkpoint of a task is saved in memory in full. This puts great pressure on the memory and limits the applicable scenarios, so the mechanism can generally only be used for jobs with a small state; and because the data is saved in full, storage efficiency is low and there is likely to be much duplicated data among the task execution data of the checkpoints held in memory, which is detrimental to resource utilization.
Disclosure of Invention
Embodiments of the present invention provide a method, a device, a server and a storage medium for saving and recovering the data of a task, which solve the problems in the related art that saving the task execution data of every checkpoint of a task in memory in full limits the applicable scenarios, yields low storage efficiency, stores highly redundant data in memory, and results in low resource utilization.
In order to solve the above technical problem, an embodiment of the present invention provides a data saving method for a task, including:
acquiring task execution data of a current checkpoint of a task;
determining change data of the task between the two checkpoints according to the task execution data of the current checkpoint and the task execution data of the previous checkpoint of the task;
and storing the change data into a storage system with a snapshot function, controlling the storage system to execute a snapshot operation, and maintaining the storage location of the task execution data of the task in the storage system.
In order to solve the above technical problem, an embodiment of the present invention further provides a task recovery method, including:
after the task is restarted, acquiring data corresponding to the checkpoints of the task from a storage system according to the storage positions of the data of the task in the storage system, wherein the data corresponding to the checkpoints of the task has been stored into the storage system by the above data saving method;
and restoring the calculation state of the operator of the task according to the acquired data corresponding to the checkpoint.
In order to solve the above technical problem, an embodiment of the present invention further provides a data saving device for a task, including:
the incremental computation module is used for acquiring task execution data of a current checkpoint of a task, determining change data of the task between the two checkpoints according to the task execution data of the current checkpoint and the task execution data of a previous checkpoint of the task, storing the change data into a storage system with a snapshot function, controlling the storage system to execute snapshot operation, and maintaining the storage position of the task execution data of the task in the storage system.
In order to solve the above technical problem, an embodiment of the present invention further provides a task recovery device, including:
the task recovery module is used for acquiring data corresponding to a checkpoint of the task from the storage system according to the storage position of the data of the task in the storage system after the task is restarted, and recovering the calculation state of an operator of the task according to the acquired data corresponding to the checkpoint; the data corresponding to each checkpoint of the task is stored in the storage system by the data saving method.
In order to solve the above technical problem, an embodiment of the present invention further provides a server, including a processor, a memory, and a communication bus;
the communication bus is used for connecting the processor and the memory;
the processor is configured to execute a first computer program stored in the memory to implement the steps of the data saving method of the task as described above;
and/or,
the processor is adapted to execute a second computer program stored in the memory to implement the steps of the method of task recovery as described above.
In order to solve the technical problem, an embodiment of the present invention further provides a computer-readable storage medium, where a first computer program is stored, where the first computer program is executable by a processor to implement the steps of the data saving method for the task as described above;
and/or,
the computer readable storage medium stores a second computer program executable by a processor to implement the steps of the method for resuming a task as described above.
Advantageous effects
According to the task data saving and recovery method, device, server and storage medium provided by the embodiments of the invention, during task execution the task execution data of the current checkpoint of the task is obtained and, combined with the task execution data of the previous checkpoint, the change data of the task between the two checkpoints is determined and saved; that is, saving is incremental. Compared with the full-save approach in the related art, this greatly reduces the amount of data that must be saved, reduces the redundancy among the saved data, and greatly improves resource utilization and storage efficiency;
secondly, embodiments of the invention store the data into a storage system with a snapshot function rather than into memory, and the snapshot operation can be executed through the snapshot function of the storage system. Because checkpoint data is kept in the storage system instead of memory, the limitation of memory size is removed and the approach is applicable to a wide range of big data stream computing scenarios; in addition, because the data of each checkpoint of the task is preserved using the storage system's own snapshot function, the required changes are small, easy to implement, and low in implementation cost.
Additional features and corresponding advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a schematic flowchart of a task data saving method according to the first embodiment of the present invention;
FIG. 2 is a schematic flowchart of a task recovery method according to the first embodiment of the present invention;
FIG. 3 is a schematic diagram of a task data saving device according to the second embodiment of the present invention;
FIG. 4 is a schematic diagram of a task recovery device according to the second embodiment of the present invention;
FIG. 5 is a schematic diagram of a stream computing framework according to the second embodiment of the present invention;
FIG. 6 is a schematic flowchart of a task data saving method according to the second embodiment of the present invention;
FIG. 7 is a schematic flowchart of a task recovery method according to the second embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a server according to the third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Embodiment one:
the method aims at solving the problems that in the related technology, the task execution data of each check point of the task is completely stored in the memory, the application scene is limited, the storage efficiency is low, the data repetition degree stored in the memory is high, and the resource utilization rate is low. In the embodiment, the changed data between two adjacent check points of the task is stored, namely, incremental storage is adopted, so that compared with a full storage mode in the related technology, the data amount required to be stored can be reduced, the repeatability of the stored data is reduced, and the resource utilization rate and the storage efficiency are improved; secondly, the embodiment stores the changed data into a storage system with a snapshot function instead of a memory, so that the limitation of the space size of the memory is eliminated, the method is applicable to various large data stream calculation fields, the snapshot operation can be executed through the snapshot function of the storage system, the snapshot data obtained by the snapshot operation is stored in association with the task, the change is small, the implementation is easy, and the implementation cost is low.
For ease of understanding, this embodiment is described below taking a task data saving method as an example; referring to fig. 1, the method includes:
s101: task execution data of a current checkpoint of a task is obtained.
In this embodiment, tasks are responsible for concurrently executing the various operators of a stream processing job; for example, one job may include multiple tasks, and the operators may specifically include, but are not limited to: a data source operator, various data computation operators, and a data output operator.
In this embodiment, neither the manner of triggering the generation of a checkpoint for a task of the job nor the manner in which a checkpoint acquires task execution data is limited in any way; various checkpoint-generation triggering manners and task-execution-data acquisition manners may be adopted.
For example, in one example, a barrier event may be inserted into the input data stream (event stream) of the stream processing system, and when a task receives the barrier event, checkpoint creation is triggered. The created checkpoint is a copy of the current task execution data (i.e., the computation state) of the operators of the task, which in one example may include, but is not limited to, at least one of the currently cached data stream, state information associated with the data stream, and intermediate result data.
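As a purely illustrative aid (the Task class, the on_event method and the dict-shaped operator state below are assumptions made for the example, not part of the patent), a minimal Python sketch of this barrier-driven trigger could look as follows:

    import copy

    BARRIER = object()  # sentinel standing in for a barrier event in the input stream

    class Task:
        def __init__(self):
            self.state = {}        # operator computation state: cached stream data, intermediate results
            self.checkpoints = []  # copies of the state, taken whenever a barrier event arrives

        def on_event(self, event):
            if event is BARRIER:
                # a checkpoint is a copy of the operators' current task execution data
                self.checkpoints.append(copy.deepcopy(self.state))
            else:
                self.process(event)

        def process(self, event):
            key, value = event     # placeholder for the job's real operator logic
            self.state[key] = value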
In this embodiment, the checkpoint mechanism may be implemented within the stream computing framework.
In this embodiment, the acquired task execution data of the current checkpoint may be cached in memory, and the task execution data of the previous checkpoint of the task may likewise be cached in memory, so that the change data of the task between the two checkpoints can subsequently be determined. Of course, it should be understood that, to reduce the occupation of memory resources, only the task execution data of the current checkpoint and of the previous checkpoint need be kept in memory. In some application scenarios, when the change data of the task to be determined is not limited to the change between two adjacent checkpoints, the checkpoint task execution data kept in memory may correspondingly not be limited to that of the current checkpoint and the previous checkpoint. It should also be understood that in this embodiment the task execution data of the current checkpoint is not necessarily stored in memory.
S102: and determining the change data of the task between the two checkpoints according to the task execution data of the current checkpoint and the task execution data of the previous checkpoint.
For example, in one example, the task execution data of the previous checkpoint of the task is obtained from the memory, and the task execution data of the current checkpoint is compared with it to obtain the change data of the task between the two checkpoints;
optionally, after the change data of the task between the two checkpoints has been obtained, in order to improve memory resource utilization, the method further includes: deleting the task execution data of the previous checkpoint of the task from the memory.
In some examples of this embodiment, S102 may be performed by, but is not limited to, an incremental computation module disposed within the stream computation framework. In this example, the incremental computation module may be a memory object incremental computation device; the functions performed by the incremental computation module include, but are not limited to:
and tracking state data cached in a memory by an operator of the task, wherein the state data comprises task execution data of a current checkpoint of the task and task execution data of a previous checkpoint of the task, and determining change data between the task execution data of the current checkpoint of the task and the task execution data of the previous checkpoint of the task. The variation data may specifically include, but is not limited to: the operator state data executed by the Task (Task) varies by the amount between two checkpoints. Including but not limited to modification of state data, deletion of state data, addition of state data.
In this embodiment, if the current checkpoint is the first checkpoint, the change data may be the full data of the task's operators at that moment. After the state data (i.e., the task execution data) of the first checkpoint is stored into the storage system, the checkpoint data of the task has a storage location (e.g., a file path or an object ID) in the storage system, and this storage location information is returned to the incremental computation module for maintenance. Subsequent changes of the task's checkpoint state data are written to the storage system without changing the storage location (i.e., the same location continues to be used). The incremental computation module triggers the storage system to take a snapshot before (or after) the subsequent change data is written to the storage system. In one example of this embodiment, the snapshot data obtained when the storage system executes a snapshot may be internal data maintained by the storage system's own snapshot mechanism and need not be exposed externally, which improves the security of storage control.
In this embodiment, when determining the change data between the task execution data of the current checkpoint and that of the previous checkpoint, the comparison may use a comparison rule appropriate to the specific application scenario. For example, when the task execution data includes data for which the current latest value should always be taken, such as a count value or a timing value, then in one application scenario, when both the current checkpoint's and the previous checkpoint's task execution data include such data, the data in the current checkpoint's task execution data is directly determined to be change data. For example, assuming the task execution data of the previous checkpoint includes count data A and the task execution data of the current checkpoint also includes count data A, the count data A in the current checkpoint's task execution data is directly determined to be change data. For another example, in another application scenario, for some or all items of the task execution data, an item in the current checkpoint's task execution data may be compared with the corresponding item in the previous checkpoint's task execution data to determine whether they are consistent; if they are inconsistent, the item is determined to have changed and is treated as change data. When no counterpart of an item in the current checkpoint's task execution data is found in the previous checkpoint's task execution data, the item is newly added data; when the previous checkpoint's task execution data includes an item for which no counterpart is found in the current checkpoint's task execution data, the item is deleted data.
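A minimal Python sketch of these comparison rules follows, under the assumption (made only for illustration) that a checkpoint's task execution data is a flat dict keyed by state ID; the function name and the always_latest parameter are hypothetical:

    def compute_change_data(prev, curr, always_latest=frozenset()):
        # classify the differences between two checkpoints' task execution data
        changes = {"added": {}, "modified": {}, "deleted": set()}
        for key, value in curr.items():
            if key not in prev:
                changes["added"][key] = value                # newly added state data
            elif key in always_latest or prev[key] != value:
                changes["modified"][key] = value             # take the current latest value
        for key in prev:
            if key not in curr:
                changes["deleted"].add(key)                  # state data deleted since the last checkpoint
        return changes

With always_latest={"count"}, for instance, a count value is reported as change data even when its value happens to be identical at both checkpoints, matching the first rule above.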
S103: storing the changed data into a storage system with a snapshot function, controlling the storage system to execute snapshot operation, and maintaining a storage position of task execution data of the task in the storage system; for an example of the storage location maintenance process, please refer to the above example, which is not described herein again.
In some examples of this embodiment, storing the change data into the storage system with the snapshot function, controlling the storage system to execute the snapshot operation, and controlling the storage system to store the snapshot data obtained by the snapshot operation in association with the task may include, but are not limited to:
when the determined change data is not empty, in example a, the storage system may be controlled to store the change data into the storage system after executing the snapshot operation; in example B, the storage system may be controlled to perform the snapshot operation after the change data is stored in the storage system. The manner of maintaining the storage location of the data of the task in the storage system may include, but is not limited to: location information stored in the storage system of the change data of the task is obtained and associated with identification information of the task (for example, including but not limited to a task ID) and identification information of a checkpoint of the task (for example, which may include but not limited to a status ID), which may be performed by, but not limited to, an incremental computation module.
The location information in the storage system is whatever information the storage system defines for locating user data on it, and may include, for example and without limitation, identification information such as a LUN ID, a file path, or an object ID. In this embodiment the storage system may be any system with a snapshot function, as required: for example, but not limited to, a distributed block storage system, a distributed file storage system, or a distributed object storage system.
When the determined change data is empty, the storage system may be directly controlled to execute the snapshot operation. In example A, the snapshot data obtained by this snapshot operation corresponds to the previous checkpoint of the task (e.g., the (n-1)-th checkpoint); in example B, it corresponds to the current checkpoint of the task (e.g., the n-th checkpoint).
Optionally, in some application scenarios of this embodiment, when the change data is not empty, storing the change data in the storage system may include, but is not limited to:
when the change data includes newly added data, the storage system allocates a new storage space and the newly added data is stored into that new storage space;
when the change data includes data that is modified compared with the task execution data of the previous checkpoint (i.e., modified state data), controlling the storage system to use a copy-on-write (COW) mechanism to copy the original data corresponding to the modified data and set the copy as read-only, and then updating the original data corresponding to the modified data in the storage system to the modified data; specifically, when it is determined that a certain item of state data has changed, the position of the original data corresponding to that state data on the storage system can be acquired, the storage system is controlled to make a read-only copy-on-write copy of that original data, and the original data at that position is then modified to the changed state data.
and when the change data includes data deleted from the task execution data of the previous checkpoint, deleting the original data corresponding to the deleted data from the storage system.
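The three cases above, together with the two snapshot orderings of example A and example B, can be sketched in Python as follows; the storage interface (snapshot, allocate, write, delete) and the locations map are assumptions standing in for a real snapshot-capable storage system, and changes has the shape produced by the compute_change_data sketch given earlier:

    def save_checkpoint(storage, locations, changes, snapshot_first=True):
        if snapshot_first:
            storage.snapshot()                           # example A: snapshot, then write
        for key, value in changes["added"].items():
            locations[key] = storage.allocate(value)     # new storage space for newly added data
        for key, value in changes["modified"].items():
            storage.write(locations[key], value)         # the store makes a read-only COW copy first
        for key in changes["deleted"]:
            storage.delete(locations.pop(key))           # remove the original data of deleted items
        if not snapshot_first:
            storage.snapshot()                           # example B: write, then snapshot

Note that when changes is empty the snapshot is still executed, which matches the empty-change-data case described above.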
Through the above process, the incremental computation module in the stream computing framework determines the change data (i.e., the incremental data) of the task between adjacent checkpoints and stores it in the storage system, and snapshots are taken on the storage system using its own snapshot function; the data corresponding to each checkpoint during task execution is thus preserved through the snapshot function and can be flexibly retrieved during task recovery after a task failure.
For example, referring to fig. 2, a task recovery method includes:
S201: after the task is restarted, acquiring the data corresponding to a checkpoint of the task from the storage system according to the storage location of the task's data in the storage system, where the data corresponding to each checkpoint of the task may have been saved into the storage system by the data saving method described above.
In this embodiment, an unexpected failure of the stream processing job includes any failure that may cause the job to crash unexpectedly; when a task failure is detected, the job can be restarted, and as the job is rerun, the tasks belonging to the job in the cluster are restarted. In this embodiment, when the data corresponding to a checkpoint of the task is acquired from the storage system, the data can be found at the corresponding storage location according to the identification information of the task and the identification information of the checkpoint, and the storage system can recover the complete checkpoint data from the snapshot. For a storage system with a snapshot mechanism, this process can be completed very efficiently.
S202: restoring the computation state of the operators of the task according to the acquired data corresponding to the checkpoint.
In some examples of this embodiment, the computation state of the operators of the task may be restored from the data corresponding to the latest checkpoint of the task; alternatively, the data corresponding to another appropriate checkpoint may be selected according to the specific application requirements.
In this embodiment, the operators in the task then continue executing from the recovered computation state.
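Under the same assumed interface, a matching recovery sketch might look as follows (the locations map is the storage-location information maintained by the incremental computation module, and storage.read is a hypothetical call that resolves the latest data through the storage system's snapshot mechanism):

    def recover_task(task, storage, locations):
        state = {}
        for key, location in locations.items():
            state[key] = storage.read(location)  # resolved by the storage system from its snapshot data
        task.state = state                       # operators continue from this recovered computation state
        return task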
The task data saving and recovery method provided by this embodiment implements incremental checkpoint storage using the snapshot function of a storage system: by adding an incremental computation module to the stream processing system, the task execution data at the task's checkpoints at different times is preserved directly through the snapshot function of the underlying storage system. No complex data backup mechanism needs to be introduced into the stream processing system, and no third-party backup component is required; the snapshot mechanism of the underlying storage system is used directly to store and maintain the incremental data. This meets the stream processing system's need to store checkpoint data incrementally while reducing the system's complexity and external dependencies, and the high-performance snapshots of the storage system also keep incremental-data maintenance efficient and reduce its overhead to a certain degree.
Embodiment two:
the present embodiment provides a data saving device for tasks, which can be applied in various servers, please refer to fig. 3, including:
the incremental computation module 301 is configured to obtain task execution data of a current checkpoint of a task, determine change data of the task between two checkpoints according to the task execution data of the current checkpoint and task execution data of a previous checkpoint of the task, control to store the change data in a storage system having a snapshot function, control the storage system to execute a snapshot operation, and maintain a storage location of the task execution data of the task in the storage system. For the above specific processing procedure of the increment calculating module 301, please refer to the above embodiments, which is not described herein again.
The embodiment also provides a task recovery device, which can also be applied in various servers, please refer to fig. 4, including:
the task recovery module 302 is configured to, after a task is restarted, obtain data corresponding to an inspection point of the task from the storage system according to a storage location of the data of the task in the storage system, and recover a computation state of an operator of the task according to the obtained data corresponding to the inspection point; the data corresponding to each inspection point of the task is stored in the storage system by the data saving method as shown in the above embodiment. For the above specific processing procedure of the task recovery module 302, please refer to the above embodiments, which is not described herein again.
In addition, in some examples of this embodiment, when the task recovery device and the task data saving device are provided in a server, the functions of the incremental computation module 301 and the task recovery module 302 described above may be implemented by, but are not limited to, a processor in the server.
For ease of understanding, this embodiment is described below with reference to a schematic diagram of a stream-computing-framework checkpoint mechanism based on storage system snapshots. Referring to fig. 5, in this example the mechanism includes: a job manager (JobManager) 401, tasks (Task) 402, an incremental computation module (DiffComputer) 403, and a storage system (StorageSystem) 404. The job manager 401 is configured to manage all tasks 402 included in a stream processing job, including starting and stopping them on the computing cluster and receiving task execution state data, and also to trigger the generation of checkpoints for the tasks 402.
In this example, the job manager 401 triggering checkpoint generation may include the job manager 401 periodically inserting barrier events into the input data stream at a certain frequency; each barrier event divides the data stream into two segments, and each time an operator of a task 402 receives a barrier event, it creates a checkpoint of its processing state at that moment. A checkpoint is a copy of the current computation state of the operators of task 402, including but not limited to cached data stream messages, state associated with the data stream, intermediate result data of the computation, and the like.
Task 402 is used to execute the computational logic of the job and maintain computational state in memory, including: buffered data flow messages, status associated with the data flow, intermediate result data of the computation, and the like.
The incremental computation module 403 is configured to determine the change data between two checkpoints, that is, the difference in the state of the operators of task 402 between the moments at which the two checkpoints were triggered, and further to determine and maintain the location of the state data of task 402 (i.e., the data in a checkpoint's task execution data) on the storage system.
When the incremental computation module 403 determines the checkpoint data increment, this may include, but is not limited to: for data that only needs its latest value, the state of the operators of task 402 may be tracked with a KV structure, in which later data overwrites earlier data that has the same key.
The incremental computation module 403 maintaining the location of state data on the storage system may specifically include, but is not limited to: after a checkpoint's task execution data is written to the storage system 404, recording the location of each item of state data on the storage system; and, when queried by state ID (i.e., the identification information of the task's checkpoint), returning the location of each item of state data on the storage system. The location identifier depends on the storage system and may include, but is not limited to, a LUN ID for a distributed block storage system, a file path for a distributed file storage system, an object ID for a distributed object storage system, and the like.
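This bookkeeping can be sketched as a small index (all names here are hypothetical; the stored location value would be a LUN ID, a file path or an object ID depending on the storage system in use):

    class LocationIndex:
        # maintains where each item of checkpoint state data lives on the storage system
        def __init__(self):
            self._index = {}  # (task_id, state_id) -> storage location

        def record(self, task_id, state_id, location):
            self._index[(task_id, state_id)] = location

        def lookup(self, task_id, state_id):
            return self._index.get((task_id, state_id))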
The storage system 404 is used for storing the task execution data of the checkpoint and maintaining snapshot data of the task execution data of the checkpoint. The storage system in this embodiment may include, but is not limited to: a distributed block storage system, a distributed file storage system, or a distributed object storage system, etc.
The storage system 404 maintaining checkpoint snapshot information may include, but is not limited to: enabling the snapshot function on the space of the storage system 404 used for storing checkpoints, so that snapshots can be executed on that space; when data is written by the incremental computation module 403, the data of the previous checkpoint can first be copied (copy-on-write, COW), so that checkpoint data at different times is preserved; newly added data is written to a blank location.
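The copy-on-write behaviour described here can be modelled with a toy Python store (purely illustrative, not any real storage product): a snapshot freezes the currently stored locations, and the next write to a frozen location preserves the old version before updating in place.

    class CowStore:
        def __init__(self):
            self.blocks = {}      # location -> current data
            self.snapshots = []   # one dict per snapshot: location -> preserved old data
            self.frozen = set()   # locations made read-only by the latest snapshot

        def snapshot(self):
            self.snapshots.append({})
            self.frozen = set(self.blocks)  # metadata only: no data is copied yet

        def write(self, location, data):
            if location in self.frozen:
                # copy-on-write: preserve the pre-snapshot version before overwriting in place
                self.snapshots[-1][location] = self.blocks[location]
                self.frozen.discard(location)
            self.blocks[location] = data

Because snapshot only records which locations are frozen, a snapshot followed by no writes touches only metadata, which is why step S605 in the flow below is cheap.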
Based on the storage-system-snapshot-based stream computing framework checkpoint mechanism shown in fig. 5, the corresponding save flow in this embodiment is shown in fig. 6 and includes:
S601: the job manager 401 periodically inserts barrier events into the input data stream as the trigger source for checkpoints.
In this example, the job manager 401 inserting barrier events may include, but is not limited to: the user configures the checkpoint frequency, and the job manager 401 periodically inserts barrier events into the data source operator (the data source is also commonly treated as an operator) at the configured frequency, as a way of synchronously generating checkpoints of the distributed global state.
S602: task 402 receives the barrier event and generates a checkpoint for the current operator state.
The barrier event in this example triggers a checkpoint creation operation for a specific task. Specifically, task 402 initiates a processing flow, asynchronous to the operator logic, that calls the incremental computation module's function to determine the state changes.
S603: the incremental computation module 403 determines the changes of the current checkpoint's task execution data relative to the previous checkpoint's task execution data, and their locations on the storage system.
In this example, specifically, the incremental computation module 403 determining the changes and locations of the checkpoint task execution data includes: the incremental computation module 403 queries, by state ID, the state data maintained for each state ID (i.e., each checkpoint) of the task, determines whether the state data has been updated, and obtains the location of the state data on the storage system.
S604: judging whether the state data of the current checkpoint has changed; if not, go to S605; otherwise, go to S606.
In this example, whether there is a change is determined from the output of the incremental computation module 403: the incremental computation module 403 can determine the change status of each specific item of state data, and therefore also the overall change status of all the state data of the current checkpoint.
S605: if the checkpoint data has not changed, the incremental computation module 403 notifies the storage system to execute the snapshot operation, but does not write any data.
In this example, taking a snapshot without writing data means the storage system only takes the snapshot and has no data to write; it merely updates its metadata, so the overhead is low and the operation completes quickly.
S606: if the checkpoint data has changed, the incremental computation module 403 likewise notifies the storage system to execute the snapshot operation.
Similar to S605, when the state data of the current checkpoint has changed since the last checkpoint, the storage system is also notified to execute a snapshot operation, which prepares for the subsequent incremental data writes.
S607: judging whether the state data of the checkpoint contains newly added data; if so, go to S608; otherwise, go to S609.
In this example, whether the state data of the checkpoint contains newly added data is judged according to the output of the incremental computation module 403.
S608: the newly added state data is written at a blank location on the storage system (i.e., a storage location newly allocated by the storage system).
In this example, writing the newly added state data at a blank location of the storage system includes: finding a new space on the storage system, writing the newly added state data into it, and returning the location information of the new space to the incremental computation module 403 so that the location can be found the next time the data is modified.
S609: after a copy of the state data to be modified has been made, the updated data is written at the original location.
In this example, writing the modified state data includes: finding the location of the item of state data to be modified on the storage system according to the location information provided by the incremental computation module 403; the snapshot operation executed by the storage system in step S606 has already made a copy of the state data (the copy becomes read-only); the modified state data is then written in place.
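Tying the sketches above together (reusing the hypothetical CowStore and compute_change_data defined earlier), one pass through the S603-S609 write path could be exercised as follows:

    store = CowStore()
    locations = {"count": "blk-0"}  # location map maintained by the incremental computation module
    store.blocks["blk-0"] = 1       # state from the previous checkpoint, already on storage

    prev = {"count": 1}
    curr = {"count": 2, "window": [5]}
    changes = compute_change_data(prev, curr, always_latest={"count"})

    store.snapshot()                # S606: snapshot before the incremental writes
    for key, value in changes["added"].items():
        locations[key] = "blk-%d" % len(store.blocks)  # S608: blank location for new data
        store.write(locations[key], value)
    for key, value in changes["modified"].items():
        store.write(locations[key], value)             # S609: COW copy is kept, then in-place update

    # store.snapshots[-1] is now {"blk-0": 1}: the previous checkpoint's version was preserved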
Referring to fig. 7, the recovery flow of the storage-system-snapshot-based stream computing framework checkpoint mechanism in this embodiment includes:
S701: the stream processing job fails unexpectedly; this includes any failure that may cause the job to crash unexpectedly.
S702: the job manager detects the task failure and restarts the job.
S703: as the job is rerun, the tasks belonging to the job in the cluster are restarted.
S704: each task queries the incremental computation module for the location information of its checkpoints in the storage system.
S705: the state data of the latest checkpoint is read from the storage system according to the retrieved location information.
Since checkpoints are stored using the snapshot mechanism of the storage system, reading checkpoint data also uses that snapshot mechanism: the storage system recovers the complete checkpoint data from the snapshot. For a storage system with a snapshot mechanism, this process can be completed very efficiently.
S706: the computation state of the operators is recovered from the checkpoint state data acquired by the task, i.e., the state the operators had at the last successful execution.
S707: the operators in the task continue executing from the recovered computation state.
Embodiment three:
the present embodiment also provides a server, as shown in fig. 8, which includes a processor 801, a memory 802, and a communication bus 803;
the communication bus 803 is used for realizing communication connection between the processor 801 and the memory 802;
in one example, the processor 801 may be configured to execute a first computer program stored in the memory 802 to implement the steps of the data saving method of the task as in the above embodiments;
and/or the processor 801 may be adapted to execute a second computer program stored in the memory 802 to implement the steps of the recovery method of the task as in the above embodiments.
The present embodiment also provides a computer-readable storage medium, including volatile or non-volatile, removable or non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, computer program modules or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
In one example, the computer-readable storage medium in the present embodiment may be used to store a first computer program that is executable by a processor to implement the steps of the data saving method of the task in the above embodiments; and/or the computer readable storage medium in this embodiment may be used to store a second computer program that is executable by a processor to implement the steps of the recovery method of the task as in the above embodiments.
The present embodiment also provides a first computer program (or first computer software), which can be distributed on a computer readable medium and executed by a computing device to implement at least one step of the data saving method of the task in the above embodiments; and in some cases at least one of the steps shown or described may be performed in an order different than that described in the embodiments above.
The present embodiment also provides a second computer program (or second computer software), which can be distributed on a computer readable medium and executed by a computing device to implement at least one step of the task recovery method in the above embodiments; and in some cases at least one of the steps shown or described may be performed in an order different than that described in the embodiments above.
The present embodiments also provide a computer program product comprising a computer readable means on which any of the computer programs as set out above is stored. The computer readable means in this embodiment may include a computer readable storage medium as shown above.
It will be apparent to those skilled in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software (which may be implemented in computer program code executable by a computing device), firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit.
In addition, communication media typically embodies computer readable instructions, data structures, computer program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to one of ordinary skill in the art. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is a more detailed description of embodiments of the present invention, and the present invention is not to be considered limited to such descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (11)

1. A data saving method for a task comprises the following steps:
acquiring task execution data of a current checkpoint of a task;
determining change data of the task between the two checkpoints according to the task execution data of the current checkpoint and the task execution data of the previous checkpoint of the task;
and storing the change data into a storage system with a snapshot function, controlling the storage system to execute a snapshot operation, and maintaining the storage location of the task execution data of the task in the storage system.
2. The data saving method for a task according to claim 1, wherein storing the change data into a storage system with a snapshot function and controlling the storage system to execute a snapshot operation comprises:
when the change data is not empty, controlling the storage system to execute the snapshot operation and then storing the change data into the storage system; or, after the change data has been stored into the storage system, controlling the storage system to execute the snapshot operation;
and when the change data is empty, controlling the storage system to execute snapshot operation.
3. The data saving method for a task according to claim 2, wherein, when the change data is not empty, storing the change data into the storage system comprises:
when the changed data comprises new data, allocating a new storage space by the storage system, and storing the new data into the new storage space allocated by the storage system;
when the changed data comprises modified data compared with the task execution data of the previous checkpoint, controlling the storage system to copy original data corresponding to the modified data stored in the storage system to one copy and set the copy as read-only by using a copy-on-write mechanism, and then updating the original data corresponding to the modified data stored in the storage system to the modified data.
4. The data saving method for a task according to any one of claims 1 to 3, wherein the task execution data includes at least one of a currently cached data stream, state information associated with the data stream, and intermediate result data.
5. The data saving method for a task according to any one of claims 1 to 3, wherein the storage system is a distributed block storage system, a distributed file storage system or a distributed object storage system.
6. The data saving method for a task according to any one of claims 1 to 3, wherein determining, from the task execution data of the current checkpoint and the task execution data of the previous checkpoint of the task, the change data of the task between the two checkpoints comprises:
acquiring task execution data of a previous checkpoint of the task from a memory, and comparing the task execution data of the current checkpoint with the task execution data of the previous checkpoint of the task to obtain change data of the task changing between the two checkpoints;
and deleting the task execution data of the previous checkpoint of the task from the memory.
7. A method of task recovery, comprising:
after a task is restarted, acquiring data corresponding to a checkpoint of the task from a storage system according to the storage position of the data of the task in the storage system, wherein the data corresponding to each checkpoint of the task is stored in the storage system by the data saving method according to any one of claims 1 to 6;
and restoring the calculation state of the operator of the task according to the acquired data corresponding to the checkpoint.
8. A data retention device for a task, comprising:
the incremental computation module is used for acquiring task execution data of a current checkpoint of a task, determining change data of the task between the two checkpoints according to the task execution data of the current checkpoint and the task execution data of a previous checkpoint of the task, storing the change data into a storage system with a snapshot function, controlling the storage system to execute snapshot operation, and maintaining the storage position of the task execution data of the task in the storage system.
9. A task recovery apparatus, comprising:
the task recovery module is used for acquiring data corresponding to a checkpoint of the task from the storage system according to the storage position of the data of the task in the storage system after the task is restarted, and recovering the calculation state of an operator of the task according to the acquired data corresponding to the checkpoint; data corresponding to each checkpoint of the task is stored in the storage system by the data saving method according to any one of claims 1 to 6.
10. A server comprising a processor, a memory, and a communication bus;
the communication bus is used for connecting the processor and the memory;
the processor is adapted to execute a first computer program stored in the memory to implement the steps of the data saving method of the task of any of claims 1-6;
and/or,
the processor is adapted to execute a second computer program stored in the memory to implement the steps of the method of recovery of a task as claimed in claim 7.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a first computer program executable by a processor to implement the steps of a data saving method of the task according to any one of claims 1 to 6;
and/or,
the computer-readable storage medium stores a second computer program executable by a processor to implement the steps of the recovery method of the task of claim 7.
CN201910866743.XA 2019-09-12 2019-09-12 Data saving and restoring method and device for task, server and storage medium Pending CN112486639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910866743.XA CN112486639A (en) 2019-09-12 2019-09-12 Data saving and restoring method and device for task, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910866743.XA CN112486639A (en) 2019-09-12 2019-09-12 Data saving and restoring method and device for task, server and storage medium

Publications (1)

Publication Number Publication Date
CN112486639A true CN112486639A (en) 2021-03-12

Family

ID=74920887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910866743.XA Pending CN112486639A (en) 2019-09-12 2019-09-12 Data saving and restoring method and device for task, server and storage medium

Country Status (1)

Country Link
CN (1) CN112486639A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239001A (en) * 2021-05-21 2021-08-10 珠海金山网络游戏科技有限公司 Data storage method and device
CN113760669A (en) * 2021-09-09 2021-12-07 湖南快乐阳光互动娱乐传媒有限公司 Problem data warning method and device, electronic equipment and storage medium
CN113778758A (en) * 2021-09-26 2021-12-10 杭州安恒信息技术股份有限公司 Data recovery method, device and equipment and readable storage medium
CN116662325A (en) * 2023-07-24 2023-08-29 宁波森浦信息技术有限公司 Data processing method and system
CN116662325B (en) * 2023-07-24 2023-11-10 宁波森浦信息技术有限公司 Data processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination