CN116305101A - Task processing method and device, electronic equipment and storage medium

Info

Publication number
CN116305101A
Authority
CN
China
Prior art keywords: tuple, tuples, node, current node, set corresponding
Legal status: Pending
Application number: CN202310301268.8A
Other languages: Chinese (zh)
Inventors: 吴垚, 陈永录, 代甜, 张德晶
Current Assignee: Industrial and Commercial Bank of China Ltd (ICBC)
Original Assignee: Industrial and Commercial Bank of China Ltd (ICBC)
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202310301268.8A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/554Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a task processing method, a task processing device, an electronic device and a storage medium. The task processing method comprises the following steps: obtaining M first tuples to be processed; processing the M first tuples by using a Spark task function of the current node to obtain N second tuples; and generating a tuple set corresponding to the current node according to the N second tuples and a preset tuple number of a single tuple set, wherein the tuple set comprises at least one virtual tuple in addition to the second tuples, and M and N are integers greater than or equal to 1. The method improves confidentiality during Spark task execution.

Description

Task processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to task processing technologies, and in particular, to a task processing method, a device, an electronic apparatus, and a storage medium.
Background
Spark is an open-source big data computing engine based on in-memory iteration. In the big data era, more and more users choose to deploy Spark tasks for execution in third-party cloud services. While providing convenient services to users, the cloud service provider is also a critical link in guaranteeing data security.
In the prior art, in order to ensure data security while a Spark task runs, the cloud service provider chooses to place the Spark task in a trusted execution environment for execution and encrypts data outside the trusted execution environment to protect the confidentiality of the data. However, if the attacker is an administrator of the third-party cloud service, the attacker can combine prior knowledge with observation of the data flow directions among different nodes to acquire partial information about the user data set, thereby completing a side-channel attack and compromising the confidentiality of the Spark task.
Disclosure of Invention
The application provides a task processing method, a task processing device, electronic equipment and a storage medium, which are used for solving the problem of poor confidentiality in the Spark task operation process.
In a first aspect, the present application provides a task processing method, the method including:
obtaining M first tuples to be processed;
processing the M first tuples by using a Spark task function of the current node to obtain N second tuples;
generating a tuple set corresponding to the current node according to the N second tuples and a preset tuple number of a single tuple set, wherein the tuple set comprises at least one virtual tuple in addition to the second tuples; and M and N are integers greater than or equal to 1.
Optionally, the N is greater than or equal to 2, the tuples are represented by key values, and generating the tuple set corresponding to the current node according to the N second tuples and the preset tuple number of the single tuple set includes:
combining the N second tuples to combine the second tuples with the same key into one second tuple to obtain a combined second tuple;
generating at least one virtual tuple according to the number of the combined second tuples and the preset tuple number of the single tuple set;
and generating a tuple set corresponding to the current node according to the merged second tuple and the at least one virtual tuple, wherein the sum of the number of the merged second tuple and the number of the virtual tuples is equal to the preset tuple number.
Optionally, the generating, according to the combined second tuple and the at least one virtual tuple, a tuple set corresponding to the current node includes:
filling the aggregated second tuple and the virtual tuple according to a preset maximum filling length, so that the lengths of the filled second tuple and the filled virtual tuple are the maximum filling length;
And generating a tuple set corresponding to the current node according to the filled second tuple and the filled virtual tuple.
Optionally, the generating a tuple set corresponding to the current node according to the filled second tuple and the filled virtual tuple includes:
compressing the filled second tuple and the filled virtual tuple;
and constructing a tuple set corresponding to the current node by using the compressed second tuple and the compressed virtual tuple.
Optionally, if the current node is the last node of the Spark task, after the generating the tuple set corresponding to the current node, the method further includes:
and encrypting the tuple set corresponding to the current node and storing the tuple set into a file system.
Optionally, if the current node is not the last node of the Spark task, after generating the tuple set corresponding to the current node, the method further includes:
performing hash function operation on tuples in the tuple set corresponding to the current node, and generating a hash value of the tuple set corresponding to the current node;
acquiring the identification of the next execution stage after the current node and the identity information of the next node according to the Spark task directed acyclic graph information;
Embedding the hash value, the identification of the next execution stage after the current node and the identity information of the next node into a tuple set corresponding to the current node;
encrypting a tuple set corresponding to the current node;
and sending the encrypted tuple set corresponding to the current node to the next node.
Optionally, the current node is not the first node of the Spark task, and the obtaining M first tuples to be processed includes:
receiving an encrypted tuple set corresponding to a previous node;
decrypting the encrypted tuple set corresponding to the previous node to obtain the tuple set corresponding to the previous node;
performing integrity verification on the tuple set corresponding to the previous node according to the hash value carried by the tuple set corresponding to the previous node, the identifier of the next execution stage of the previous node and the identity information of the next node;
and after the integrity verification is passed, recovering the M first tuples to be processed from the tuple set corresponding to the previous node.
Optionally, the current node is a first node of the Spark task, and the obtaining M first tuples to be processed includes:
receiving the M first tuples input by the user.
In a second aspect, the present application provides a task processing device, the device comprising:
the acquisition module is used for acquiring M first tuples to be processed;
the processing module is used for processing the M first tuples by utilizing the Spark task function of the current node to obtain N second tuples;
the generating module is used for generating a tuple set corresponding to the current node according to the N second tuples and a preset tuple number of a single tuple set, wherein the tuple set comprises at least one virtual tuple in addition to the second tuples; and M and N are integers greater than or equal to 1.
In a third aspect, the present application provides an electronic device, including: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of any one of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, are adapted to carry out the task processing method according to any one of the first aspects.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to any of the first aspects.
According to the task processing method, the task processing device, the electronic device and the storage medium, the current node generates a tuple set corresponding to the current node according to the N second tuples obtained by processing the M first tuples with the Spark task function of the current node and the preset tuple number of a single tuple set, wherein the tuple set comprises at least one virtual tuple in addition to the second tuples. Because the tuple set includes virtual tuples, an attacker cannot accurately determine the number of real tuples in the tuple set, which interferes with the attacker's attempt to acquire user information; this prevents the attacker from launching a side-channel attack based on such information and destroying the confidentiality of the Spark task, so the data security of the Spark task is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of a Spark architecture;
FIG. 2 is a schematic diagram of a Spark task processing flow;
FIG. 3 is a flow chart of a first task processing method provided in the present application;
FIG. 4 is a flow chart of a second task processing method provided in the present application;
FIG. 5 is a flow chart of a third task processing method provided in the present application;
FIG. 6 is a flowchart of a fourth task processing method provided in the present application;
FIG. 7 is a flowchart of a fifth task processing method provided in the present application;
FIG. 8 is a schematic diagram of a task processing device according to the present application;
fig. 9 is a schematic structural diagram of an electronic device 900 provided in the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
Spark is an open-source big data computing engine based on in-memory iteration. Through its data-sharing abstraction, the resilient distributed dataset (RDD), Spark supports a wide variety of data processing tasks, such as batch processing, stream processing, structured query language (SQL) queries, machine learning, and graph computation. Because Spark is fast, has high throughput, and scales out in a distributed manner, it has become the most popular big data computing engine at present and is widely used in big data systems across industries.
Fig. 1 is a schematic diagram of a Spark architecture, as shown in fig. 1, where a Spark running architecture includes a task control Node (Driver) for each Application (Application), a Cluster resource Manager (Cluster Manager), a work Node (Worker Node) for running a job task (job), and an execution process (Executor) disposed on the work Node and responsible for a specific task (task). The Spark architecture implements the operation on the data set included in the Spark task by running the application, i.e., the Spark application program written by the user. One Spark task includes an application and a data set to be processed, which is made up of a plurality of tuples (tuples).
A tuple is a basic concept in a relational database, and a relation is understood to be a table, where each row in the table (i.e., each record in the database) is a tuple. The tuples may be represented in the form of key-value pairs, and keys in a tuple may correspond to one or more values.
Fig. 2 is a schematic diagram of a Spark task processing flow, as shown in fig. 2, after a Spark task is submitted to a Spark architecture for execution, a task control node receives the Spark task, and creates a Spark Context according to an application, so as to apply for computing resources, allocate tasks, and monitor the tasks accordingly. The Spark Context includes a directed acyclic graph (Directed acyclic graph, DAG). The task control node divides the application into one or more job tasks according to the DAG graph, and divides each job task into one or more execution phases (stages) according to whether or not a shuffle is required to be executed, each execution phase including one or more tasks.
Shuffle is a cross-node, cross-process data distribution process across the cluster in a distributed computing scenario. If the data set needs to be redistributed during Spark task processing, a shuffle must be performed. The shuffle procedure includes a shuffle write procedure and a shuffle read procedure, which are located on different nodes. While a Spark task runs, a node generates an intermediate file from the data set to be sent to the next node through the shuffle write process. After the next node acquires the intermediate file, the data required for executing the task running on that node can be extracted through the shuffle read process.
Spark tasks include aggregate Spark tasks and non-aggregate Spark tasks. An aggregate Spark task is a task that performs an aggregate operation in the Shuffle Write phase, such as the WordCount task, which does not require the original data of the values of the tuples in the dataset to be preserved during the tuple combining process. A non-aggregate Spark task is a task that does not perform an aggregate operation during the Shuffle Write phase, such as the SortByKey task, which requires the original data of the values of the tuples in the dataset to be preserved during the tuple combining process.
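As an illustration only (this code is not part of the claimed method), the two task types correspond to standard Spark pair-RDD operations: reduceByKey combines same-key values during the shuffle write, while sortByKey only redistributes tuples and therefore must preserve every original value. The local-mode setup below is an assumption for demonstration.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: contrasts an aggregate task (reduceByKey) with a non-aggregate task
// (sortByKey). Running in local mode is assumed purely for demonstration.
object TaskTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("TaskTypes").getOrCreate()
    val tuples = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // Aggregate Spark task (WordCount-style): same-key values are combined during
    // shuffle write, so the original per-tuple values need not be preserved.
    val counted = tuples.reduceByKey(_ + _)   // ("a", 2), ("b", 1)

    // Non-aggregate Spark task (SortByKey-style): tuples are only redistributed,
    // so every original value must be preserved.
    val sorted = tuples.sortByKey()           // ("a", 1), ("a", 1), ("b", 1)

    println(counted.collect().toSeq)
    println(sorted.collect().toSeq)
    spark.stop()
  }
}
```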
With continued reference to fig. 2, after the task control node divides each job task into one or more execution phases, the task control node applies for resources to the cluster resource manager to cause the cluster resource manager to allocate nodes for execution processes for running the tasks and to start the execution processes. Then, the node applies for the task to the task control node by executing the process. The task control node sends the application program codes and the data sets corresponding to each task to the nodes corresponding to the tasks. Subsequently, each node executes the task by executing the process to process the data set of the Spark task, so as to obtain an operation result, and the operation result is fed back to the task control node or stored in the distributed file system.
In the big data era, more and more users choose to deploy Spark tasks for execution on the Spark architecture in third-party cloud services. While a cloud service provider offers convenient services to users, guaranteeing the confidentiality of Spark task data is equally important. If the data in a user's Spark task running in the cloud service is stolen by an attacker, the user's data may be leaked, causing irreparable damage to the reputation of the third-party cloud service provider.
However, since the operation of the Spark task in the cloud service belongs to the black box operation, the user cannot learn about security problems such as tampering, malicious operation, monitoring and the like in the data transmission process, so that a great potential safety hazard exists.
The trusted execution environment (Trusted Execution Environment, TEE for short) is a secure area in the host processor. It builds, by means of software or hardware, a secure execution environment in the central processor so as to provide security for the programs and data running inside it. The existing scheme chooses to place the Spark task into the trusted execution environment for execution, and encrypts data outside the trusted execution environment to protect the confidentiality of the data.
However, this approach cannot fully ensure data security during Spark task execution. An attacker can still combine prior knowledge with observation of the data flow in the shuffle process between different nodes during the processing of each Spark task, and of the data characteristics of each node's data set (such as data flow direction, length, and data volume), to obtain part of the information in the user data set, thereby completing a side-channel attack and damaging the confidentiality of the Spark task.
The side-channel attack, also referred to herein as a hidden-channel attack, is an attack mode in which an attacker obtains private information in a user data set by observing side-channel data, such as the data flow described above.
The inventors consider that if the attacker cannot learn the real data characteristics of each node's data set, the attacker cannot acquire user information from those data characteristics and cannot launch a side-channel attack, and therefore cannot destroy the confidentiality of the Spark task.
In view of this, the present application provides a task processing method, which generates a virtual tuple in a Spark task processing process, so that an attacker cannot accurately learn the data characteristics of a data set, thereby avoiding the attacker from launching a side channel attack by acquiring user information, and guaranteeing confidentiality of a Spark task.
The task processing method can be applied to any Spark architecture, and the Spark architecture can be deployed locally, can be deployed at the cloud, can be operated in a TEE environment, and can be operated in a non-TEE environment. When the task processing method runs in the TEE environment, the execution safety of the task processing method can be further improved.
Specifically, the task processing method of the present application describes how a task in one stage of the Spark task execution process should be processed, and the execution subject of the method is the node on which that task is deployed; the node may be, for example, a computer or a server. The present application does not limit whether the node also processes other tasks of the current execution stage or tasks of other execution stages.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 3 is a flow chart of a first task processing method provided in the present application, as shown in fig. 3, where the method includes:
s101, obtaining M first tuples to be processed.
M is an integer greater than or equal to 1, and on the basis, the specific value of M is not limited in the application. The first tuple may be any tuple to be processed, and the application is not limited to the specific data content of the first tuple.
The mode of the current node obtaining M first tuples to be processed is related to the execution stage to which the task processed by the current node belongs. For example, if the execution phase is the first execution phase of the Spark task, the current node may receive, for example, M first tuples input by the user. The task control node obtains M first tuples sent by the user terminal, where the M first tuples are input by the user through the user terminal. The user terminal may be, for example, a computer, a mobile phone, or the like. The current node then obtains the M first tuples entered by the user from the task control node.
If the execution phase is not the first execution phase of the Spark task, the current node may obtain, for example, M first tuples from the previous node. It should be understood that the last node referred to herein is the node where the task comprised by the last execution phase was deployed. The last node may be one node or a plurality of nodes, which is specifically related to the number of tasks and the deployment mode included in the last execution stage, and the application is not limited thereto.
S102, processing M first tuples by using a Spark task function of the current node to obtain N second tuples.
N is an integer greater than or equal to 1, and on the basis of this, the specific value of N is not limited in the present application. The Spark task function is application program code for performing logic processing on the M first tuples. The tasks deployed on the current node include Spark task functions obtained from the task control node, and the specific form of the Spark task functions is not limited in the application, and can be set by those skilled in the art according to practical situations.
In this step, the current node processes the M first tuples by using the Spark task function of the current node, to obtain N second tuples.
S103, generating a tuple set corresponding to the current node according to the N second tuples and a preset tuple number of a single tuple set, wherein the tuple set comprises at least one virtual tuple in addition to the second tuples.
The current node can firstly combine the N second tuples, and then generate a tuple set corresponding to the current node according to the combined second tuples and the preset tuple number of the single tuple set; instead of merging the N second tuples, the tuple set corresponding to the current node may be generated directly according to the N second tuples and the preset tuple number of the single tuple set.
Merging the N second tuples means combining tuples of the same kind among the N second tuples into one tuple. The kinds of tuples may, for example, be divided according to the keys of the tuples: if a plurality of tuples have the same key, they are regarded as belonging to one kind. When the current node generates the tuple set corresponding to the current node from the merged second tuples, the amount of temporary data generated during task processing can be reduced and the data transmission rate between nodes improved.
The preset number of tuples of the single tuple set is related to whether N second tuples are merged. For example, if the current node merges N second tuples, the preset number of tuples of the single tuple set may be, for example, greater than the number of types of tuples in the Spark task.
If the current node does not merge the N second tuples, the preset number of tuples of the single tuple set may be, for example, greater than or equal to the number of all tuples in the Spark task. Specifically, the setting can be performed by those skilled in the art according to the actual situation.
A virtual tuple is a tuple containing virtual data, where the virtual data is not data from the N second tuples. A virtual tuple may be entirely virtual: for example, when the second tuple is a key-value pair, the virtual tuple may be a preset key-value pair whose key and value are both the placeholder "#", so that no real data is present in the virtual tuple. The tuple set comprises at least one virtual tuple.
On this basis, the present application does not limit whether the current node further processes the second tuples and the virtual tuples. Illustratively, the tuple set may also include a virtual tuple generated by padding a real tuple with a virtual value; for example, the real second tuple may be <1,3>, and the current node may generate a virtual tuple by padding it with the virtual value "#". When the tuple set includes a plurality of virtual tuples, they may be identical, all different, or partially identical and partially different, and the application is not limited in this respect.
The length of the virtual tuples in the tuple set is related to whether the current Spark task is an aggregate Spark task, i.e. whether it needs to preserve the original data. The length of a virtual tuple here means the number of value entries the virtual tuple contains. If the current Spark task is an aggregate Spark task, i.e. it does not need to retain the original data, the length of the virtual tuple may be 1, for example. If the current Spark task is a non-aggregate Spark task, i.e. it needs to retain the original data, the length of the virtual tuples in the tuple set may, for example, be greater than or equal to the number of tuples of the tuple kind that has the most tuples in the Spark task.
Optionally, when the current Spark task is an aggregate Spark task, the current node may further perform compression processing on the second tuple first, and then generate a tuple set corresponding to the current node according to the compressed second tuple and a preset tuple number of the single tuple set. The compression referred to herein shortens the length of each second tuple by compression. The compression method specifically adopted in practical application is not limited, and may be, for example, a snappy compression algorithm, or a zip compression algorithm. By the method, the data volume in the Spark task execution process can be reduced, the transmission efficiency of data exchange between nodes is improved, and the Spark task implementation performance is improved.
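A minimal sketch of this optional compression step, assuming the JDK Deflater as a stand-in codec (the description only names snappy and zip as examples, and the helper names here are illustrative):

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets
import java.util.zip.{Deflater, Inflater}

// Sketch only: shorten a serialized second tuple before it is placed in the tuple set.
object TupleCompression {
  def compress(data: Array[Byte]): Array[Byte] = {
    val deflater = new Deflater()
    deflater.setInput(data)
    deflater.finish()
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](1024)
    while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf))
    deflater.end()
    out.toByteArray
  }

  def decompress(data: Array[Byte]): Array[Byte] = {
    val inflater = new Inflater()
    inflater.setInput(data)
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](1024)
    while (!inflater.finished()) out.write(buf, 0, inflater.inflate(buf))
    inflater.end()
    out.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val tuple = "<2,1_2_3_4_5_6>".getBytes(StandardCharsets.UTF_8)
    val packed = compress(tuple)
    assert(new String(decompress(packed), StandardCharsets.UTF_8) == "<2,1_2_3_4_5_6>")
  }
}
```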
In this embodiment, the current node generates a tuple set corresponding to the current node according to the N second tuples obtained by processing the M first tuples with the Spark task function of the current node and the preset tuple number of a single tuple set, where the tuple set includes at least one virtual tuple in addition to the second tuples. Because the tuple set includes virtual tuples, an attacker cannot accurately determine the number of real tuples in the tuple set, which interferes with the attacker's attempt to acquire user information; this prevents the attacker from launching a side-channel attack based on such information and destroying the confidentiality of the Spark task, so the data security of the Spark task is improved.
Optionally, if the current node is the last node of the Spark task, that is, the node of the last execution stage, after generating the tuple set corresponding to the current node, the current node may store the tuple set corresponding to the current node in the file system for other applications or the file system to call for use; the tuple set corresponding to the current node may also be sent to the task control node. Optionally, the current node may encrypt the tuple set corresponding to the current node and store the encrypted tuple set in a file system or send the encrypted tuple set to the task control node, so as to further improve the transmission and storage security of the tuple set.
Optionally, after the current node generates the tuple set corresponding to the current node, the current node may further embed one or more of a hash value of the tuple set corresponding to the current node, an identifier of a next execution stage after the current node, identity information of a next node, and the like into the tuple set corresponding to the current node, so as to implement that the next node may implement verification on the integrity of the tuple set based on the foregoing. By the operation, the attacker can be prevented from damaging the data integrity by means of discarding data, retransmitting data, forming false data and the like, and further, the execution result of the Spark task is interfered.
Next, when N is greater than or equal to 2 and the tuples are represented by key-value pairs, how the current node generates the tuple set corresponding to the current node according to the N second tuples and the preset tuple number of the single tuple set, that is, step S103 in the foregoing embodiment, is described. Fig. 4 is a flow chart of the second task processing method provided in the present application; as shown in fig. 4, step S103 may include the following steps:
s201, combining the N second tuples to combine the second tuples with the same key into one second tuple, so as to obtain a combined second tuple.
In this step, the current node performs merging processing on the N second tuples, so that the number of temporary data generated in the task execution process can be reduced, the data transmission efficiency between execution phases is improved, and the Spark task processing performance is improved.
In this step, the manner in which the N second tuples are combined is related to whether the Spark task is an aggregate Spark task, that is, whether the original data needs to be retained. Therefore, in this step, for different situations, there are two merging processing modes:
case 1: spark task is an aggregate Spark task
In this case, the current node may take the key of the second tuple with the same key corresponding to each task as the key of the merged second tuple, sum the values of the second tuple with the same key, and use the sum of the values of the second tuple as the value of the merged second tuple, where the lengths of the merged second tuple are all 1, that is, the key of each second tuple after merging corresponds to only one value. For example, assuming that there are two second tuples of the same key, <2,1> and <2,2>, respectively, the combined second tuple obtained by combining the two second tuples may be <2,3>.
Case 2: spark task is a non-aggregate Spark task
In this case, the current node may use the key of the second tuple having the same key as the key of the merged second tuple and use the value of the second tuple having the same key as the value of the merged second tuple. For example, assuming that there are two second tuples having the same key, <2,1> and <2,2>, respectively, the combined second tuple obtained by combining the two second tuples may be <2,1_2>. By the merging process, the obtained bonds of the merged second tuple are unique among a plurality of merged second tuples.
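The two merging modes of step S201 can be sketched as follows; the "_" value encoding follows the examples above, and the function names are illustrative rather than the patent's:

```scala
// Sketch only: merge second tuples that share a key, following the examples above.
object MergeSecondTuples {
  // Case 1, aggregate Spark task: <2,1> and <2,2> merge into <2,3>.
  def mergeAggregate(tuples: Seq[(Int, Int)]): Seq[(Int, Int)] =
    tuples.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq

  // Case 2, non-aggregate Spark task: <2,1> and <2,2> merge into <2,1_2>,
  // joining the original values with "_" so they can be restored later.
  def mergeNonAggregate(tuples: Seq[(Int, Int)]): Seq[(Int, String)] =
    tuples.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).mkString("_")) }.toSeq

  def main(args: Array[String]): Unit = {
    val second = Seq((2, 1), (2, 2), (5, 7))
    println(mergeAggregate(second))    // e.g. List((2,3), (5,7))
    println(mergeNonAggregate(second)) // e.g. List((2,1_2), (5,7))
  }
}
```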
S202, generating at least one virtual tuple according to the number of the second tuples after merging and the preset tuple number of the single tuple set.
In the present embodiment, since the merging process is performed on the N second tuples, the preset number of tuples of the single tuple set may be greater than or equal to the number of kinds of second tuples.
In this step, the current node generates at least one virtual tuple according to the number of the second tuples after merging and the preset tuple number of the single tuple set, which is related to whether the Spark task is an aggregate Spark task or a non-aggregate Spark task, specifically there are two cases as follows:
Case 1: spark task is an aggregate Spark task
In this case, since the original data does not need to be retained, the length of the combined second tuple is 1. The attacker cannot analyze and acquire the information included in the data set according to the length of the second tuple, and thus cannot launch the side channel attack. At this time, the current node may generate one or more virtual tuples of length 1 such that the preset number of tuples of the single tuple set is equal to the sum of the number of aggregated second tuples and the number of virtual tuples.
For example, if the number of aggregated second tuples is 2 and the preset number of tuples of a single tuple set is 5, 3 virtual tuples of length 1 are generated.
Case 2: spark task is a non-aggregate Spark task
In this case, since the merged second tuple retains the original data and the number of second tuples of the same kind is different, the number of values corresponding to each key of the merged second tuple may be different, i.e., the length of the merged second tuple may be different. Thus, at least one virtual tuple can be generated in this step using the following implementation.
In implementation 1, the current node may generate one or more virtual tuples such that the preset number of tuples of the single tuple set is equal to the sum of the number of aggregated second tuples and the number of virtual tuples. In this case, the present application does not limit the length of the virtual tuple, and may be, for example, 0 or 1.
In implementation 2, the current node may generate at least one virtual tuple according to the number of second tuples aggregated, the preset number of tuples of the single tuple set, and the preset maximum filling length.
For example, the current node may generate at least one virtual tuple based on the aggregated second tuples, the preset number of tuples of the single tuple set, and a preset maximum filling length. In this case, the preset maximum filling length is greater than or equal to the length of the longest aggregated second tuple. The preset maximum filling length may be, for example, (V_length + 1) × Z - 1, where V_length is the length of the longest first tuple and Z is the number of occurrences of the first tuple in the Spark task.
For example, the current node may generate virtual tuples of a preset maximum fill length such that the preset number of tuples of the single tuple set is equal to the sum of the number of aggregated second tuples and the number of virtual tuples.
Or, the current node may compress the aggregated second tuple first, and then generate at least one virtual tuple according to the compressed aggregated second tuple, the preset tuple number of the single tuple set, and the preset maximum filling length. As described above, the compression referred to herein is to compress the length of each second tuple.
In this case, the preset maximum filling length may be greater than or equal to the length of the longest compressed second tuple. In this way, when the data volume of the Spark task is large, an excessively large preset maximum filling length, which would make the generated virtual tuples too large and consume excessive computing resources, can be avoided; Spark task execution and data transmission efficiency can thus be improved.
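A non-authoritative sketch of step S202, assuming "#" as the virtual placeholder and using the maximum filling length (V_length + 1) × Z - 1 mentioned above; all helper names are assumptions:

```scala
// Sketch only: generate enough virtual tuples to bring the set up to the preset count.
object VirtualTuples {
  // Aggregate case: virtual tuples of length 1, with no real data.
  def forAggregate(mergedCount: Int, presetCount: Int): Seq[(String, String)] =
    Seq.fill(math.max(0, presetCount - mergedCount))(("#", "#"))

  // Non-aggregate case: virtual tuples padded to the preset maximum filling length,
  // here taken as (vLength + 1) * z - 1 following the description above.
  def forNonAggregate(mergedCount: Int, presetCount: Int, vLength: Int, z: Int): Seq[(String, String)] = {
    val maxFillLength = (vLength + 1) * z - 1
    val paddedValue = Seq.fill(maxFillLength)("#").mkString("_")
    Seq.fill(math.max(0, presetCount - mergedCount))(("#", paddedValue))
  }

  def main(args: Array[String]): Unit = {
    println(forAggregate(mergedCount = 2, presetCount = 5))  // three length-1 virtual tuples
    println(forNonAggregate(mergedCount = 2, presetCount = 5, vLength = 2, z = 2))
  }
}
```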
S203, generating a tuple set corresponding to the current node according to the combined second tuple and at least one virtual tuple.
In this step, the manner in which the current node generates the tuple set corresponding to the current node is related to the manner in which the at least one virtual tuple was generated in step S202 according to the number of aggregated second tuples and the preset tuple number of the single tuple set. Specifically, there are the following two cases:
case 1: spark task is an aggregate Spark task
In this case, the current node may directly use the aggregated second tuples of length 1 and the at least one virtual tuple to constitute the tuple set corresponding to the current node. The sum of the number of aggregated second tuples and the number of virtual tuples in the tuple set is equal to the preset number of tuples.
Case 2: spark task is a non-aggregate Spark task
In this case, the present step may take different implementations according to the implementation of step S202.
For example, if implementation 1 is adopted in step S202, the current node may fill the aggregated second tuple and the virtual tuple according to a preset maximum filling length, so that the lengths of the filled second tuple and the filled virtual tuple are both the maximum filling length. And then, the current node generates a tuple set corresponding to the current node according to the filled second tuple and the filled virtual tuple. The preset number of tuples of the single tuple set is equal to the sum of the number of filled second tuples and the number of filled virtual tuples in the tuple set.
For example, if the aggregated second tuple is <1,2_3>, the length of this second tuple is 3 and the preset maximum filling length is 5, then the padded second tuple may be obtained by padding <1,2_3> with the dummy data "#" until its length reaches 5; likewise, a virtual tuple is padded with "#" until its length also reaches the maximum filling length.
Optionally, after the filled second tuples and the filled virtual tuples are obtained, the current node may directly use them to form the tuple set corresponding to the current node. Alternatively, the filled second tuples and the filled virtual tuples may first be compressed, and the compressed second tuples and compressed virtual tuples then used to form the tuple set corresponding to the current node, so as to reduce the data volume of the tuple set, improve its transmission rate between nodes, and improve the overall execution efficiency of the Spark task.
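Implementation 1 can be sketched as follows, assuming the value of each tuple is a "_"-joined string padded with "#" entries (the length-counting convention and helper names are assumptions):

```scala
// Sketch only: pad each tuple's value with "#" entries up to maxFillLength,
// then assemble the tuple set; each padded tuple may optionally be compressed
// (see the compression sketch earlier) before being added to the set.
object PadAndAssemble {
  def pad(value: String, maxFillLength: Int): String = {
    val entries = value.split("_").toSeq
    (entries ++ Seq.fill(math.max(0, maxFillLength - entries.length))("#")).mkString("_")
  }

  def buildTupleSet(merged: Seq[(Int, String)],
                    virtuals: Seq[(String, String)],
                    maxFillLength: Int): Seq[(String, String)] = {
    val paddedReal = merged.map { case (k, v) => (k.toString, pad(v, maxFillLength)) }
    val paddedVirtual = virtuals.map { case (k, v) => (k, pad(v, maxFillLength)) }
    paddedReal ++ paddedVirtual // total size equals the preset tuple number
  }

  def main(args: Array[String]): Unit =
    println(pad("2_3", 4)) // 2_3_#_#
}
```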
If the virtual tuples were generated in step S202 using implementation 2 and the aggregated second tuples were not compressed, then in this step the current node may pad the aggregated second tuples according to the preset maximum filling length, so that the lengths of the padded second tuples all equal the preset maximum filling length. The current node then uses the padded second tuples and the at least one virtual tuple generated in step S202 to constitute the tuple set corresponding to the current node, where the sum of the number of padded second tuples and the number of virtual tuples equals the preset number of tuples.
If the current node compressed the aggregated second tuples in step S202, then in this step the current node may pad the compressed aggregated second tuples according to the preset maximum filling length, so that the lengths of the padded second tuples all equal the preset maximum filling length. The current node then uses the padded second tuples and the at least one virtual tuple generated in step S202 to constitute the tuple set corresponding to the current node, where the sum of the number of padded second tuples and the number of virtual tuples equals the preset number of tuples.
In this embodiment, the current node first performs aggregation processing on N second tuples to obtain an aggregated second tuple; and then, generating at least one virtual tuple according to the number of the aggregated second tuples and the preset tuple number of the single tuple set, and generating a tuple set corresponding to the current node according to the at least one virtual tuple, wherein the sum of the number of the aggregated second tuples and the number of the virtual tuples is equal to the preset tuple number. By carrying out aggregation processing on the N second tuples, the current node can reduce temporary data generated in the Spark task execution process, reduce the data transmission quantity among the nodes, and further improve the data transmission efficiency among the nodes.
By making the sum of the number of aggregated second tuples and the number of virtual tuples equal to the preset number of tuples and generating the tuple set accordingly, the number of real tuples in the tuple set can be hidden, so that an attacker cannot analyze the number of tuples in the tuple set to acquire user information and thus cannot launch a side-channel attack; confidentiality during Spark task execution is thereby improved and data security is guaranteed.
Fig. 5 is a flow chart of a third task processing method provided in the present application, as shown in fig. 5, optionally, if the current node is not the last node of the Spark task, the current node may further include the following steps after generating the tuple set corresponding to the current node. It should be understood that the last node referred to herein is the node that performs the task of the last execution stage.
S301, performing hash function operation on tuples in the tuple set corresponding to the current node, and generating a hash value of the tuple set corresponding to the current node.
The present application does not limit the specific form of the hash function; those skilled in the art can choose it according to the actual situation. Once the tuples in the tuple set are determined, the hash value generated from them is unique. In this step, the current node performs a hash function operation on the tuples in the tuple set corresponding to the current node and generates a hash value of that tuple set, which serves as a verification identifier of the integrity of the tuple set. Subsequently, when the tuple set corresponding to the current node must be delivered to the next node in full, integrity verification can be performed based on this hash value, preventing an attacker from interfering with the smooth execution of the Spark task by destroying the integrity of the tuple set.
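As one possible illustration of step S301 (the description leaves the hash function open), SHA-256 could be computed over the serialized tuples; the serialization convention below is an assumption:

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Sketch only: derive a single hash value for the whole tuple set.
object TupleSetHash {
  def hash(tupleSet: Seq[(String, String)]): String = {
    // Serialize the tuples in a fixed order so the next node can reproduce the hash.
    val serialized = tupleSet.sorted.map { case (k, v) => s"<$k,$v>" }.mkString(";")
    MessageDigest.getInstance("SHA-256")
      .digest(serialized.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x").mkString
  }

  def main(args: Array[String]): Unit =
    println(hash(Seq(("1", "2_3_#"), ("#", "#_#_#"))))
}
```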
S302, according to the Spark task directed acyclic graph information, the identification of the next execution stage after the current node and the identity information of the next node are obtained.
The next node is the node where the task in the next execution stage is deployed. It should be appreciated that the next node as referred to herein may be one or more, and is specifically related to the number of tasks involved in the next execution phase. The present application is not limited to the specific form of the identification of the next execution stage and the identity information of the next node, and the identification of the next execution stage may be, for example, the number of the next execution stage, and the identity information of the next node may be, for example, the number of the next node.
Since the directed acyclic graph information of the Spark task includes the identifications of all execution stages and the identity information of all nodes, in this step the current node can obtain, from the Spark task directed acyclic graph information, the identification of the next execution stage after the current node and the identity information of the next node.
It should be understood that the execution sequence of steps S301 and S302 is not sequential.
S303, embedding the hash value, the identification of the next execution stage after the current node and the identity information of the next node into a tuple set corresponding to the current node.
For example, the current node may add a hash value to the tuple set, and establish a mapping relationship between the tuple set and the hash value, so as to embed the hash value into the tuple set corresponding to the current node. The mapping relation between the tuple set and the hash value can be, for example, that the hash value is added to the tuple set as an attribute parameter of the tuple set.
For another example, the current node may embed the identifier of the next execution stage after the current node in the tuple set, and establish a mapping relationship between each tuple in the tuple set and the identifier of the next execution stage, for example, the identifier of the next execution stage may be used as an attribute parameter of each tuple in the tuple set and added to the tuple set.
For another example, the current node embeds the identity information of the next node into the tuple in the tuple set that needs to be processed by the next node, that is, establishes a mapping relationship between the tuple in the tuple set that needs to be processed by the next node and the corresponding identity information of the next node. The mapping relation establishment here may be, for example, generating a mapping relation table of each filled virtual tuple and filled second tuple and the identifier of the next execution stage, where the current node can find the identity information of the next node corresponding to the filled virtual tuple and filled second tuple through the mapping relation table and the filled virtual tuple and filled second tuple.
In this step, the current node embeds the hash value, the identifier of the next execution stage after the current node and the identity information of the next node into the tuple set corresponding to the current node, so that the next node can verify the integrity of the tuple set based on this content, preventing an attacker from damaging the integrity of the Spark task by discarding data, retransmitting data or constructing false data so that the user cannot obtain a correct execution result.
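The metadata embedded in step S303 can be pictured as an envelope around the tuple set; the type and field names below are illustrative assumptions rather than the patent's data structures:

```scala
// Sketch only: a tuple set carrying the integrity metadata described in steps S301-S303.
final case class RoutedTuple(key: String, value: String, nextNodeId: String)

final case class TupleSetEnvelope(
  tuples: Seq[RoutedTuple], // padded real and virtual tuples, each tagged with its next node
  hashValue: String,        // hash of the tuple set (step S301)
  nextStageId: String       // identification of the next execution stage (step S302)
)
```

Tagging each tuple with the identity information of its next node also allows the next node to discard tuples that are not addressed to it, as described in step S403 below.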
S304, encrypting the tuple set corresponding to the current node.
The encryption scheme is not limited in this application. In this step, the current node encrypts the tuple set corresponding to the current node, so as to further increase data security. It should be appreciated that in some embodiments, the current node may not encrypt the tuple set corresponding to the current node.
The confidentiality and the security of data transmission between nodes can be further improved through the steps.
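Step S304 does not fix an encryption scheme; as one possibility, AES-GCM from the JDK could be used. The key handling below is a placeholder assumption, not the patent's scheme:

```scala
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

// Sketch only: authenticated encryption of the serialized tuple set before it is
// sent to the next node; how the key is provisioned is outside this sketch.
object TupleSetCrypto {
  def encrypt(plain: Array[Byte], key: Array[Byte]): (Array[Byte], Array[Byte]) = {
    val iv = new Array[Byte](12)
    new SecureRandom().nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv))
    (iv, cipher.doFinal(plain))                // return the nonce together with the ciphertext
  }

  def decrypt(iv: Array[Byte], ciphertext: Array[Byte], key: Array[Byte]): Array[Byte] = {
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv))
    cipher.doFinal(ciphertext)                 // fails if the ciphertext was tampered with
  }
}
```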
S305, sending the encrypted tuple set corresponding to the current node to the next node.
In this step, the current node sends the encrypted tuple set corresponding to the current node to the next node, so that the next node reads the encrypted tuple set, and processes the encrypted tuple set by adopting the Spark task function corresponding to the next node.
In this embodiment, the current node embeds the hash value, the identifier of the next execution stage after the current node and the identity information of the next node into the tuple set corresponding to the current node, so that the next node can verify the integrity of the tuple set based on this content. This prevents an attacker from damaging the integrity of the Spark task by discarding data, retransmitting data or constructing false data so that the user cannot obtain a correct execution result, and the security of the Spark task can be further improved.
The following describes how to obtain the M first tuples to be processed when the current node is not the first node of the Spark task, i.e. step S101 in the above embodiment. It should be understood that the first node referred to here is the node that performs the tasks of the first execution stage. This embodiment takes as an example the case where the previous node of the current node performed its Spark task processing using the task processing method of the previous embodiments. Fig. 6 is a flow chart of a fourth task processing method provided in the present application; as shown in fig. 6, step S101 may include:
S401, receiving an encrypted tuple set corresponding to the last node.
It should be understood that the last node may be one node or a plurality of nodes, which is specifically related to the number of tasks involved in the last execution stage.
On the basis of the foregoing implementation, if the previous node encrypts the tuple set corresponding to the previous node and sends the encrypted tuple set to the current node, the current node receives the encrypted tuple set corresponding to the previous node.
S402, decrypting the encrypted tuple set corresponding to the previous node to obtain the tuple set corresponding to the previous node.
In this step, the current node may, for example, store a key for the encrypted tuple set in advance, and decrypt the encrypted tuple set corresponding to the previous node by using the key, to obtain the tuple set corresponding to the previous node.
S403, carrying out integrity verification on the tuple set corresponding to the previous node according to the hash value carried by the tuple set corresponding to the previous node, the identification of the next execution stage of the previous node and the identity information of the next node.
In this step, the current node performs integrity verification on the tuple set corresponding to the previous node, so as to prevent an attacker from interfering with smooth execution of Spark task by destroying data integrity.
In one possible implementation, the current node first performs a hash function operation on the tuples belonging to the same tuple set by using a hash function adopted by the previous node to obtain a hash value corresponding to the tuple set, and then compares the hash value with the hash value embedded in the tuple set. If the two hash values are not equal, the fact that the tuple which belongs to the same tuple set and is acquired by the current node is different from the tuple which is included in the tuple set and is transmitted by the previous node is indicated, and the possibility that an attacker breaks the tuple set exists, and then the integrity verification fails.
If the two hash values are equal, the fact that the tuple which belongs to the same tuple set and is acquired by the current node is the same as the tuple which is included in the tuple set and is transmitted by the previous node is indicated, and then the current node performs integrity verification through the identity information of the next node.
Illustratively, the current node stores its own identity information. The current node judges whether a tuple in the tuple set is embedded with the identity information of the current node, and if so, the tuple is retained. If the identity information of the current node is not embedded, i.e. T_identity ≠ T_id (where T_identity is the identity information of the node embedded in the tuple and T_id is the identity information of the current task), the tuple does not need to be processed by the current node, and the current node removes the tuple.
Subsequently, the current node may verify whether the tuples in the tuple set it received are in the correct execution stage. Illustratively, the current node stores the identification of the current execution stage, and the current node judges whether the identification of the next execution stage of the previous node embedded in the tuple set is consistent with the identification of the current execution stage. If they are consistent, indicating that no tuples in the current node belong to other execution stages, the integrity verification passes. If they are inconsistent, i.e. T_stage ≠ T_current (where T_stage is the identification of the next execution stage of the previous node and T_current is the identification of the current execution stage), there are tuples in the current node that should belong to other execution stages, i.e. an attacker may have sent tuples that should have been executed in other execution stages to the current execution stage, and the integrity verification fails.
Optionally, if the integrity verification fails, the current node may terminate the execution of the Spark task and output a prompt message that the integrity may be damaged.
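The three checks of step S403 can be sketched as follows, reusing the types and hash helper from the sketches above (all names are assumptions):

```scala
// Sketch only: the three integrity checks described above, performed by the current node.
// Returns the tuples addressed to this node, or None if verification fails.
object IntegrityCheck {
  def verify(envelope: TupleSetEnvelope,
             currentNodeId: String,
             currentStageId: String): Option[Seq[RoutedTuple]] = {
    // 1. Recompute the hash over the received tuples and compare it with the embedded value.
    val recomputed = TupleSetHash.hash(envelope.tuples.map(t => (t.key, t.value)))
    if (recomputed != envelope.hashValue) None                 // set incomplete or tampered with
    // 2. Check the execution stage: T_stage must equal T_current.
    else if (envelope.nextStageId != currentStageId) None      // tuples belong to another stage
    // 3. Keep only the tuples whose embedded identity matches this node (T_identity == T_id).
    else Some(envelope.tuples.filter(_.nextNodeId == currentNodeId))
  }
}
```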
S404, after the integrity verification is passed, recovering M first tuples to be processed from the tuple set corresponding to the previous node.
In this step, the manner in which the current node recovers M first tuples to be processed from the tuple set corresponding to the previous node is related to whether the Spark task is an aggregate Spark task or a non-aggregate Spark task, which may include, for example, the following two cases:
case 1: spark tasks are aggregate Spark tasks.
In this case, the current node identifies whether or not a tuple in the tuple set corresponding to the previous node is a virtual tuple. For example, the current node may store a form of a virtual tuple, and the current node compares whether the stored form of the virtual tuple is consistent with a tuple in the tuple set, and if so, indicates that the tuple is a virtual tuple. If the tuple is determined to be a virtual tuple, the current node deletes the virtual tuple from the current node.
If the Spark task is an aggregate Spark task, the previous node does not need to fill the merged second tuples and the virtual tuples to the preset maximum filling length, so the current node can recover the M first tuples to be processed simply by deleting the virtual tuples from the current node.
Case 2: spark tasks are non-aggregate Spark tasks.
In this case, after deleting the virtual tuples, which contain no real data, the current node may identify, according to the pre-stored form of the filled virtual data, whether filled virtual data exists in the remaining tuples, and if so, remove that data.
The current node then restores the original data, i.e., restores the merged tuples to their un-merged state. Illustratively, if the merged second tuple is <1,2_3>, it is restored to <1,2> and <1,3>. In this way, the M first tuples to be processed are obtained.
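The recovery step for this non-aggregate case can be sketched as follows. The virtual-tuple marker, the pad character and the "_" value separator are illustrative assumptions chosen to match the <1,2_3> example above; the patent only requires that virtual tuples and filled virtual data be recognizable from their pre-stored form.

```python
# Minimal sketch of recovering the M first tuples to be processed: delete the
# virtual tuples, strip the filled virtual data, then split each merged value
# back into individual tuples.
VIRTUAL_KEY = "__virtual__"   # assumed pre-stored form of a virtual tuple's key
PAD = "#"                     # assumed filler character used for padding


def recover_first_tuples(tuples):
    recovered = []
    for key, value in tuples:
        if key == VIRTUAL_KEY:          # virtual tuple: contains no real data, delete it
            continue
        value = value.rstrip(PAD)       # remove the filled virtual data
        for part in value.split("_"):   # <1, "2_3"> -> <1, 2>, <1, 3>
            recovered.append((key, part))
    return recovered


# Example from the text: the merged second tuple <1,2_3> is restored to <1,2> and <1,3>.
print(recover_first_tuples([(1, "2_3"), (VIRTUAL_KEY, PAD * 5)]))  # [(1, '2'), (1, '3')]
```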
In this embodiment, the current node performs integrity verification on the tuple set corresponding to the previous node according to the hash value carried by the tuple set corresponding to the previous node, the identifier of the next execution stage of the previous node, and the identity information of the next node, and if the integrity verification is passed, the M first tuples to be processed are recovered from the tuple set corresponding to the previous node.
Through verification of the hash value, the current node can judge whether the previous node transmitted the tuple set to it completely, preventing an attacker from damaging data integrity by discarding tuples; through verification of the identification of the next execution stage of the previous node, the current node can judge whether tuples that should be processed in other execution stages were sent to the current execution stage, preventing an attacker from damaging data integrity by that means; and through verification of the identity information carried in the tuple set of the previous node, tuples that should not be processed by the current node can be removed from the current node, avoiding execution-result errors caused by processing tuples intended for other nodes. Overall, the method ensures data integrity during Spark task processing and thereby ensures the accuracy of the Spark task execution result.
The task processing method provided in the present application is described below with a specific embodiment. Fig. 7 is a schematic flow chart of a fifth task processing method provided in the present application, illustrating a Spark task that includes 2 execution stages, where the first execution stage comprises 3 tasks and the second execution stage comprises 2 tasks. For content in this embodiment that is the same as or similar to the above embodiments, reference may be made to those embodiments, and it is not repeated here. As shown in fig. 7, this embodiment may include, for example, the following steps:
Step 1: the nodes on which the tasks included in the first execution stage are deployed (hereinafter referred to as first nodes) respectively acquire their corresponding encrypted data sets.
Referring to fig. 7, in this step, nodes deployed by 3 tasks included in the first execution stage acquire corresponding encrypted data sets, respectively. The node deployed by the task 1 acquires 4 first tuples, the node deployed by the task 2 acquires 3 first tuples, and the node deployed by the task 3 acquires 1 first tuple.
Step 2: the first nodes respectively process the first tuples included in their acquired encrypted data sets by using the corresponding Spark task functions, and respectively obtain the corresponding processed second tuples.
Step 3: the first nodes respectively merge their second tuples, so that second tuples with the same key are combined into one second tuple, obtaining the merged second tuples.
As in the above embodiments, different merging modes are adopted for the second tuples according to the Spark task type.
For example, with continued reference to fig. 7, when the Spark task is a non-aggregate Spark task, the nodes on which the tasks included in the first execution stage are deployed use the key shared by second tuples having the same key as the key of the merged second tuple, and use the values of those second tuples as the value of the merged second tuple, obtaining the merged second tuple.
For example, the node on which task 1 is deployed merges <2,1>, <2,2>, <2,3> among its second tuples into <2,1_2_3>.
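This merge step can be sketched as follows, assuming the tuples are plain (key, value) pairs and that the values of a shared key are joined with "_" as in the <2,1_2_3> example; the helper name merge_second_tuples is hypothetical.

```python
# Minimal sketch of merging second tuples that share the same key
# (non-aggregate Spark task).
from collections import OrderedDict


def merge_second_tuples(tuples):
    merged = OrderedDict()
    for key, value in tuples:
        merged.setdefault(key, []).append(str(value))
    return [(key, "_".join(values)) for key, values in merged.items()]


# Example from the text: <2,1>, <2,2>, <2,3> are merged into <2,1_2_3>.
print(merge_second_tuples([(2, 1), (2, 2), (2, 3)]))  # [(2, '1_2_3')]
```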
Step 4: the first nodes respectively generate at least one virtual tuple according to the number of the merged second tuples and the preset tuple number of a single tuple set.
Illustratively, referring to FIG. 7, the preset number of tuples for a single tuple set is 4 and the number of merged second tuples corresponding to the node on which task 1 is deployed is 2, so the node generates 2 virtual tuples.
Step 5: the first nodes respectively generate a corresponding tuple set according to the merged second tuples and the at least one virtual tuple, where, for each task, the sum of the number of merged second tuples and the number of virtual tuples is equal to the preset tuple number.
Illustratively, with continued reference to fig. 7, the node on which task 2 is deployed fills the merged second tuples and the virtual tuples so that the lengths of the filled second tuples and the filled virtual tuples are both the preset maximum fill length of 5. The node then generates the corresponding tuple set from the filled second tuples and the filled virtual tuples.
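Steps 4 and 5 can be sketched together as follows, using the numbers from fig. 7 (a preset tuple number of 4 and a maximum fill length of 5); the virtual-tuple marker and the pad character are the same illustrative assumptions used in the earlier sketches.

```python
# Minimal sketch of steps 4 and 5: top the merged second tuples up with virtual
# tuples until the preset tuple number is reached, then pad every value to the
# preset maximum fill length so that all tuples in the set look identical in size.
VIRTUAL_KEY = "__virtual__"
PAD = "#"


def build_tuple_set(merged_tuples, preset_tuple_count=4, max_fill_length=5):
    # Step 4: number of virtual tuples = preset tuple number - number of merged tuples.
    virtual_count = preset_tuple_count - len(merged_tuples)
    virtual_tuples = [(VIRTUAL_KEY, "")] * virtual_count

    # Step 5: fill both the merged and the virtual tuples to the maximum fill length.
    return [
        (key, str(value).ljust(max_fill_length, PAD))
        for key, value in merged_tuples + virtual_tuples
    ]


# Node of task 1 in fig. 7: 2 merged second tuples, so 2 virtual tuples are added.
print(build_tuple_set([(2, "1_2_3"), (3, "4")]))
```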
Step 6: the first nodes respectively acquire the hash value of their corresponding tuple set, the identification of the next execution stage, and the identity information of the next node.
Step 7: the first nodes respectively embed the corresponding hash value, the identification of the next execution stage, and the identity information of the next node into the corresponding tuple set.
Step 8: the first nodes encrypt their corresponding tuple sets.
Step 9: the first nodes respectively send the corresponding encrypted tuple sets to the next nodes.
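Steps 6 through 9 can be sketched as follows. The JSON envelope, the SHA-256 hash, the AES-GCM encryption via the third-party cryptography package and the send callback are illustrative assumptions; the patent only requires that the hash value, the identification of the next execution stage and the identity information of the next node be embedded in the tuple set and that the set be encrypted before it is sent.

```python
# Minimal sketch of steps 6-9: hash the tuple set, embed the hash together with
# the next-stage identification and the next node's identity, encrypt the
# result, and send it to the next node.
import hashlib
import json
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def seal_and_send(tuples, next_stage_id, next_node_id, key, send):
    # Step 6: hash value of the tuple set.
    digest = hashlib.sha256(json.dumps(tuples, sort_keys=True).encode()).hexdigest()

    # Step 7: embed the hash, the next execution stage and the next node identity.
    envelope = {
        "tuples": tuples,
        "hash": digest,
        "next_stage_id": next_stage_id,
        "next_node_id": next_node_id,
    }

    # Step 8: encrypt the tuple set.
    nonce = os.urandom(12)
    ciphertext = AESGCM(key).encrypt(nonce, json.dumps(envelope).encode(), None)

    # Step 9: send the encrypted tuple set to the next node.
    send(next_node_id, nonce + ciphertext)
```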
Correspondingly, the nodes (hereinafter referred to as second nodes) where the tasks included in the second execution stage are deployed receive the encrypted tuple set corresponding to the previous node respectively.
Step 10: the second nodes decrypt the encrypted tuple set corresponding to the previous node to obtain the tuple set corresponding to the previous node.
Step 11: the second nodes perform integrity verification on the tuple set corresponding to the previous node according to the hash value carried by that tuple set, the identification of the next execution stage of the previous node, and the identity information of the next node.
Step 12: after the integrity verification is passed, the second nodes recover the second tuples to be processed from the tuple set corresponding to the previous node.
For example, with continued reference to fig. 7, when the Spark task is a non-aggregate Spark task, the second node deletes the virtual tuples from the tuple set corresponding to the previous node and restores the merged second tuples to their state before merging, obtaining the second tuples to be processed.
Step 13: the second nodes respectively process the second tuples to be processed by using their corresponding Spark task functions, and respectively obtain the corresponding processed third tuples.
Step 14: the second nodes respectively merge their third tuples, so that third tuples with the same key are combined into one third tuple, obtaining the merged third tuples.
Step 15: the second nodes generate at least one virtual tuple according to the number of the merged third tuples and the preset tuple number of a single tuple set.
Step 16: the second nodes respectively generate a corresponding tuple set according to the merged third tuples and the at least one virtual tuple, where, for each task, the sum of the number of merged third tuples and the number of virtual tuples is equal to the preset tuple number.
Step 17: the second nodes encrypt their corresponding tuple sets and store the encrypted tuple sets in a file system.
According to the task processing method provided by this embodiment, virtual tuples can be added to the tuple sets transmitted between nodes, so that an attacker cannot infer user information from prior knowledge, data flow, data characteristics and the like, and therefore cannot launch a side-channel attack; and by embedding the hash value, the identification of the next execution stage, the identity information of the next node and the like in the tuple set, integrity verification of the data in the tuple set can be realized, preventing an attacker from damaging data integrity by discarding data, retransmitting data, forging data and the like and thereby interfering with the execution result of the Spark task. The method therefore improves the security of Spark tasks in multiple respects, including resisting side-channel attacks and guaranteeing data integrity.
Fig. 8 is a schematic structural diagram of a task processing device provided in the present application, and as shown in fig. 8, the task processing device includes: an acquisition module 11, a processing module 12 and a generation module 13. Optionally, the apparatus may further comprise one or more of the following modules: a storage module 14 and a verification module 15.
An obtaining module 11, configured to obtain M first tuples to be processed;
the processing module 12 is configured to process the M first tuples by using a Spark task function of the current node to obtain N second tuples;
a generating module 13, configured to generate a tuple set corresponding to the current node according to the N second tuples and a preset tuple number of a single tuple set, where the tuple set includes at least one virtual tuple in addition to the second tuples; and M and N are integers greater than or equal to 1.
In one possible implementation, N is greater than or equal to 2, the tuples are represented by key values, and the generating module 13 is specifically configured to: merge the N second tuples so that second tuples with the same key are combined into one second tuple, obtaining merged second tuples; generate at least one virtual tuple according to the number of the merged second tuples and the preset tuple number of a single tuple set; and generate a tuple set corresponding to the current node according to the merged second tuples and the at least one virtual tuple, where the sum of the number of merged second tuples and the number of virtual tuples is equal to the preset tuple number.
For example, the generating module 13 is specifically configured to fill the merged second tuples and the virtual tuples according to a preset maximum filling length, so that the lengths of the filled second tuples and the filled virtual tuples are both the maximum filling length; and generate a tuple set corresponding to the current node according to the filled second tuples and the filled virtual tuples.
For example, the generating module 13 is specifically configured to compress the filled second tuples and the filled virtual tuples, and construct a tuple set corresponding to the current node from the compressed second tuples and the compressed virtual tuples.
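A minimal sketch of this fill-then-compress variant is shown below, assuming zlib as the compression algorithm; the patent does not name a specific one.

```python
# Minimal sketch of filling tuples to the maximum filling length and then
# compressing them before the tuple set is constructed. zlib is an
# illustrative choice of compression algorithm.
import zlib

PAD = "#"  # assumed filler character


def fill_and_compress(tuples, max_fill_length=5):
    result = []
    for key, value in tuples:
        filled = str(value).ljust(max_fill_length, PAD)       # fill to the maximum filling length
        result.append((key, zlib.compress(filled.encode())))  # compress the filled value
    return result
```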
In one possible implementation manner, if the current node is the last node of the Spark task, the storage module 14 is configured to encrypt the tuple set corresponding to the current node and store the encrypted tuple set in the file system after the generating module 13 generates the tuple set corresponding to the current node.
In a possible implementation manner, if the current node is not the last node of the Spark task, the verification module 15 is configured to perform a hash function operation on a tuple in the tuple set corresponding to the current node after the generating module 13 generates the tuple set corresponding to the current node, so as to generate a hash value of the tuple set corresponding to the current node; acquiring the identification of the next execution stage after the current node and the identity information of the next node according to the Spark task directed acyclic graph information; embedding the hash value, the identification of the next execution stage after the current node and the identity information of the next node into a tuple set corresponding to the current node; encrypting a tuple set corresponding to the current node; and sending the encrypted tuple set corresponding to the current node to the next node.
In one possible implementation of the foregoing, the current node is not the first node of the Spark task, and the obtaining module 11 is specifically configured to: receive an encrypted tuple set corresponding to the previous node; decrypt the encrypted tuple set corresponding to the previous node to obtain the tuple set corresponding to the previous node; perform integrity verification on the tuple set corresponding to the previous node according to the hash value carried by that tuple set, the identification of the next execution stage of the previous node, and the identity information of the next node; and after the integrity verification is passed, recover the M first tuples to be processed from the tuple set corresponding to the previous node.
In one possible implementation manner, the current node is the first node of the Spark task, and the obtaining module 11 is specifically configured to receive the M first tuples input by the user.
The task processing device provided in the embodiment of the present application may execute the task processing method in the embodiment of the method, and its implementation principle and technical effects are similar, and are not described herein again. The division of the modules shown in fig. 8 is merely an illustration, and the present application does not limit the division of the modules and the naming of the modules.
Fig. 9 is a schematic structural diagram of an electronic device 900 provided in the present application. As shown in fig. 9, the electronic device 900 may include: at least one processor 901, a memory 902.
A memory 902 for storing a program. In particular, the program may include program code including computer-operating instructions.
The memory 902 may include high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory.
The processor 901 is configured to execute computer-executable instructions stored in the memory 902 to implement the task processing method described in the foregoing method embodiment. The processor 901 may be a central processing unit (Central Processing Unit, abbreviated as CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or one or more integrated circuits configured to implement embodiments of the present application.
The electronic device 900 may further include a communication interface 903, so that communication with external devices, such as user terminals (e.g., computers or tablets), can be performed through the communication interface 903. In a specific implementation, if the communication interface 903, the memory 902, and the processor 901 are implemented independently, they may be connected to one another and communicate with one another through a bus. The bus may be an industry standard architecture (Industry Standard Architecture, abbreviated ISA) bus, a peripheral component interconnect (Peripheral Component Interconnect, abbreviated PCI) bus, or an extended industry standard architecture (Extended Industry Standard Architecture, abbreviated EISA) bus, among others. Buses may be divided into an address bus, a data bus, a control bus, and the like, but this does not mean that there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the communication interface 903, the memory 902, and the processor 901 are integrated on a chip, the communication interface 903, the memory 902, and the processor 901 may complete communication through internal interfaces.
The present application also provides a computer-readable storage medium, which may include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or any other medium in which program code may be stored. Specifically, the computer-readable storage medium stores program instructions for the task processing method in the above embodiments.
The present application also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the electronic device may read the execution instructions from the readable storage medium, and execution of the execution instructions by the at least one processor causes the electronic device to implement the task processing methods provided by the various embodiments described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (11)

1. A method of task processing, the method comprising:
obtaining M first tuples to be processed;
processing the M first tuples by using a Spark task function of the current node to obtain N second tuples;
generating a tuple set corresponding to the current node according to the N second tuples and a preset tuple number of a single tuple set, wherein the tuple set comprises at least one virtual tuple in addition to the second tuples; and M and N are integers greater than or equal to 1.
2. The method according to claim 1, wherein N is greater than or equal to 2, the tuples are represented by key values, the generating the tuple set corresponding to the current node according to the N second tuples and a preset tuple number of a single tuple set includes:
combining the N second tuples to combine the second tuples with the same key into one second tuple to obtain a combined second tuple;
Generating at least one virtual tuple according to the number of the combined second tuples and the preset tuple number of the single tuple set;
and generating a tuple set corresponding to the current node according to the merged second tuple and the at least one virtual tuple, wherein the sum of the number of the merged second tuple and the number of the virtual tuples is equal to the preset tuple number.
3. The method according to claim 2, wherein generating the tuple set corresponding to the current node from the merged second tuple and the at least one virtual tuple comprises:
filling the merged second tuple and the virtual tuple according to a preset maximum filling length, so that the lengths of the filled second tuple and the filled virtual tuple are both the maximum filling length;
and generating a tuple set corresponding to the current node according to the filled second tuple and the filled virtual tuple.
4. A method according to claim 3, wherein said generating a tuple set corresponding to the current node from the filled second tuple and the filled virtual tuple comprises:
Compressing the filled second tuple and the filled virtual tuple;
and constructing a tuple set corresponding to the current node by using the compressed second tuple and the compressed virtual tuple.
5. The method according to any one of claims 1-4, wherein, if the current node is the last node of the Spark task, after the generating the tuple set corresponding to the current node, the method further comprises:
and encrypting the tuple set corresponding to the current node and storing the tuple set into a file system.
6. The method according to any one of claims 1-4, wherein, if the current node is not the last node of the Spark task, after generating the tuple set corresponding to the current node, further comprises:
performing hash function operation on tuples in the tuple set corresponding to the current node, and generating a hash value of the tuple set corresponding to the current node;
acquiring the identification of the next execution stage after the current node and the identity information of the next node according to the Spark task directed acyclic graph information;
embedding the hash value, the identification of the next execution stage after the current node and the identity information of the next node into a tuple set corresponding to the current node;
Encrypting a tuple set corresponding to the current node;
and sending the encrypted tuple set corresponding to the current node to the next node.
7. The method of claim 6, wherein the current node is not a first node of a Spark task, and the obtaining M first tuples to be processed comprises:
receiving an encrypted tuple set corresponding to a previous node;
decrypting the encrypted tuple set corresponding to the previous node to obtain the tuple set corresponding to the previous node;
performing integrity verification on the tuple set corresponding to the previous node according to the hash value carried by the tuple set corresponding to the previous node, the identifier of the next execution stage of the previous node and the identity information of the next node;
and after the integrity verification is passed, recovering the M first tuples to be processed from the tuple set corresponding to the previous node.
8. The method according to any one of claims 1-4, wherein the current node is a first node of a Spark task, and the obtaining M first tuples to be processed includes:
the M first tuples of user input are received.
9. A task processing device, the device comprising:
the acquisition module is used for acquiring M first tuples to be processed;
the processing module is used for processing the M first tuples by utilizing the Spark task function of the current node to obtain N second tuples;
the generating module is used for generating a tuple set corresponding to the current node according to the N second tuples and the preset tuple number of a single tuple set, wherein the tuple set comprises at least one virtual tuple in addition to the second tuples; and M and N are integers greater than or equal to 1.
10. An electronic device, the electronic device comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1-8.
11. A computer-readable storage medium, in which computer-executable instructions are stored, which when executed by a processor are adapted to carry out the task processing method according to any one of claims 1 to 8.
CN202310301268.8A 2023-03-24 2023-03-24 Task processing method and device, electronic equipment and storage medium Pending CN116305101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310301268.8A CN116305101A (en) 2023-03-24 2023-03-24 Task processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310301268.8A CN116305101A (en) 2023-03-24 2023-03-24 Task processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116305101A true CN116305101A (en) 2023-06-23

Family

ID=86792403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310301268.8A Pending CN116305101A (en) 2023-03-24 2023-03-24 Task processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116305101A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination