CN112948442B - Data stream theta connection optimization method, system, terminal and storage medium - Google Patents

Data stream theta connection optimization method, system, terminal and storage medium

Info

Publication number
CN112948442B
CN112948442B CN202110331197.7A CN202110331197A CN112948442B CN 112948442 B CN112948442 B CN 112948442B CN 202110331197 A CN202110331197 A CN 202110331197A CN 112948442 B CN112948442 B CN 112948442B
Authority
CN
China
Prior art keywords
partition
data stream
filtering
data
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110331197.7A
Other languages
Chinese (zh)
Other versions
CN112948442A (en)
Inventor
王洋
吴嘉澍
须成忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110331197.7A priority Critical patent/CN112948442B/en
Publication of CN112948442A publication Critical patent/CN112948442A/en
Priority to PCT/CN2021/137981 priority patent/WO2022199147A1/en
Application granted granted Critical
Publication of CN112948442B publication Critical patent/CN112948442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to a data flow theta connection optimization method, which comprises the following steps: performing first pre-filtering on the two data streams according to a theta condition; respectively calculating partition marks for the two data streams after the first pre-filtering according to the partition number and the maximum and minimum values, combining the partition marks of the two data streams to obtain a combined partition mark, and performing combined partitioning according to the combined partition mark; performing secondary pre-filtering on the two data streams subjected to the joint partitioning according to a theta condition; respectively calculating the average partition size of the data stream after the second pre-filtering, and re-partitioning the partitions with the partition size larger than the average partition size; and carrying out Cartesian product operation on the two data streams after the repartition, and obtaining a data stream theta connection result. The invention also relates to a data flow theta connection optimization system. The invention optimizes the theta-join algorithm by utilizing secondary pre-filtering and joint partitioning, and reduces the number of unnecessary Cartesian products as much as possible, thereby greatly improving the operation efficiency.

Description

Data stream theta connection optimization method, system, terminal and storage medium
Technical Field
The invention relates to a data stream theta connection optimization method, a system, a terminal and a storage medium.
Background
With the rapid development of the digital era, data is being generated in large quantities. To process and use this data effectively, efficient data processing algorithms are becoming increasingly important. In many applications, data arrives as an online data stream, for example stock prices and exchange rates that fluctuate over time, or air temperature and wind speed readings that change over time.
An important operation in data stream processing is the theta join (theta-join). Given a θ condition, the theta-join operation finds the data groups in the data streams that satisfy the condition, which facilitates subsequent data analysis and use.
To meet the efficiency requirements of today's big data applications, a theta-join algorithm needs to join data streams as efficiently as possible. However, the Cartesian product (cross-product) operation required by a theta-join is time-consuming and is the most important factor limiting the efficiency of the algorithm.
Existing theta-join algorithms perform a large number of unnecessary Cartesian product operations when carrying out the θ connection, so they run inefficiently. How to reduce the number of unnecessary Cartesian products as far as possible, and thereby improve the operating efficiency of the theta-join algorithm, is therefore a problem to be solved.
Disclosure of Invention
In view of this, it is necessary to provide a data stream θ connection optimization method, system, terminal, and storage medium, which can optimize the theta-join algorithm by using secondary pre-filtering and joint partitioning, and reduce the number of unnecessary cartesian products performed by the theta-join algorithm as much as possible, thereby greatly improving the operating efficiency of the theta-join algorithm.
The embodiment of the application provides a data stream θ connection optimization method, which comprises the following steps:
a. performing a first pre-filtering on the two data streams according to a θ condition, where pre-filtering refers to filtering out the parts of a data stream that cannot meet the θ condition;
b. calculating partition marks for each of the two data streams after the first pre-filtering according to the number of partitions and the maximum and minimum values, merging the partition marks of the two data streams to obtain joint partition marks, and performing joint partitioning according to the joint partition marks;
c. performing a second pre-filtering on the two partitioned data streams according to the θ condition;
d. calculating the average partition size of each data stream after the second pre-filtering, and re-partitioning the partitions whose size is larger than the average partition size;
e. performing the Cartesian product operation on the two data streams after the re-partitioning to obtain the data stream θ connection result.
Wherein, the step b comprises:
after the partition marks of the two data streams are calculated respectively, the partition marks calculated by the two data streams are combined to obtain a combined partition mark, and the combined partition mark is utilized to perform more detailed partition on the two data streams respectively.
Wherein, the average partition size is calculated by the following formula:
$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$
where average_partition_size is the average partition size of the data stream, number_of_elements_in_stream is the amount of data of the data stream in the current processing window, and number_of_partitions is the number of partitions produced for the data stream by the joint partitioning.
The number of partitions produced by re-partitioning is calculated by the following formula:
$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$
where repartition_number is the number of sub-partitions into which an overloaded partition is re-partitioned, overloaded_partition_size is the size of the overloaded partition, average_partition_size is the average partition size, and ⌈ ⌉ denotes rounding up.
The embodiment of the application provides a data stream θ connection optimization system, which comprises a pre-filtering module, a joint partitioning module, a re-partitioning module and a processing module, wherein:
the pre-filtering module is used for performing a first pre-filtering on the two data streams according to a θ condition, where pre-filtering refers to filtering out the parts of a data stream that cannot meet the θ condition;
the joint partitioning module is used for calculating partition marks for each of the two data streams after the first pre-filtering according to the number of partitions and the maximum and minimum values, merging the partition marks of the two data streams to obtain joint partition marks, and performing joint partitioning according to the joint partition marks;
the pre-filtering module is further used for performing a second pre-filtering on the two data streams after the joint partitioning according to the θ condition;
the re-partitioning module is used for calculating the average partition size of each data stream after the second pre-filtering and re-partitioning the partitions whose size is larger than the average partition size;
the processing module is used for performing the Cartesian product operation on the two data streams after the re-partitioning to obtain the data stream θ connection result.
Wherein the joint partitioning module is specifically configured to:
after the partition marks of the two data streams are calculated respectively, combining the partition marks calculated by the two data streams to obtain a combined partition mark, and performing more detailed partition on the two data streams respectively by using the combined partition mark.
Wherein, the average partition size is calculated by the following formula:
$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$
where average_partition_size is the average partition size of the data stream, number_of_elements_in_stream is the amount of data of the data stream in the current processing window, and number_of_partitions is the number of partitions produced for the data stream by the joint partitioning.
The number of partitions produced by re-partitioning is calculated by the following formula:
$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$
where repartition_number is the number of sub-partitions into which an overloaded partition is re-partitioned, overloaded_partition_size is the size of the overloaded partition, average_partition_size is the average partition size, and ⌈ ⌉ denotes rounding up.
An embodiment of the present application provides a terminal, where the terminal includes a processor and a memory coupled to the processor, where:
the memory stores program instructions for implementing data flow θ connection optimization as claimed in any one of claims 1-4;
the processor is configured to execute the program instructions stored by the memory to control the data stream θ connection optimization.
The embodiment of the application provides a storage medium, which stores program instructions executable by a processor, and the program instructions are used for executing the data stream theta connection optimization method in any one of claims 1 to 4.
The application provides a data stream θ connection optimization method and a data stream θ connection optimization system. The two data streams are first pre-filtered according to the θ condition. Compared with an algorithm that partitions the two data streams directly without filtering them according to the θ condition, this preliminarily reduces the amount of data to be partitioned, so that the algorithm runs more efficiently and the partitions are finer, which facilitates the subsequent screening of data and reduces the number of unnecessary Cartesian products. Partition marks are then calculated for each of the two data streams according to the number of partitions. Throughout the process, no time-consuming sorting of the two data streams is needed, and no order is required of the data within a partition. After the partition marks of the two data streams have been calculated, they are merged, so that partitioning the two data streams in isolation does not weaken the subsequent filtering, and the two data streams are partitioned more finely using the joint partition marks, which makes the subsequent filtering more precise. After the two data streams have been partitioned according to the joint partition marks, a second pre-filtering is carried out according to the θ condition, and all partitions that cannot contain a data group meeting the θ condition are filtered out. Partitions whose size is larger than the average partition size are re-partitioned, and the Cartesian product operation is carried out pairwise on the retained partitions. Finally, the data groups meeting the θ condition are screened from the Cartesian product result according to the θ condition, which gives the final theta-join result.
Drawings
Fig. 1 is a flowchart of a data stream θ connection optimization method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the Cartesian product matrices and the Cartesian product quantity histogram generated from a sample data stream according to the embodiments and the prior art (the bold lines in the diagram represent the boundaries of the data requiring a Cartesian product, and the thin black lines represent partition boundaries);
Fig. 3 is a diagram of a hardware architecture of a data stream θ connection optimization system according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flowchart illustrating an operation of a data flow θ connection optimization method according to a preferred embodiment of the present invention.
Step S1: performing a first pre-filtering on the two data streams according to the θ condition. Specifically:
Referring to Fig. 2(e), compared with the prior art, the first pre-filtering mechanism of the present application effectively reduces the amount of data that needs to be partitioned, so the method runs more efficiently. The principle of the first pre-filtering is as follows. In this embodiment the θ condition is greater than (>), that is, a value of data stream 1 (vertical) must be greater than a value of data stream 2 (horizontal). The minimum value in data stream 2 is 0, so any value of data stream 1 greater than 0 may satisfy the θ condition, while values of data stream 1 less than or equal to 0 cannot and are pre-filtered; that is, 0 is filtered out and does not participate in subsequent steps such as partitioning. Similarly, the maximum value of data stream 1 is 9, so no value of data stream 2 greater than or equal to 9 can satisfy the θ condition, and these values, namely 10, 9 and 12, are filtered out. When the θ condition uses another operator (<, ≤ or ≥), the filtering works in the same way and is not described again here.
That is, in this embodiment, data stream 1 and data stream 2 are pre-filtered for the first time according to the θ condition to obtain data stream 1' and data stream 2'.
The data volume participating in the partition is reduced due to the first pre-filtering, so that the subsequent partitions are more refined. Therefore, the first pre-filtering is more beneficial to the subsequent filtering, and the number of unnecessary Cartesian products is reduced.
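The first pre-filtering for the greater-than condition can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `first_prefilter_gt` and the two sample lists are assumptions, chosen only so that the values quoted in the text (0 removed from data stream 1; 10, 9 and 12 removed from data stream 2) are reproduced.

```python
def first_prefilter_gt(stream1, stream2):
    """First pre-filtering for the condition s1 > s2 (illustrative sketch).

    A value of stream1 can only take part in a result if it is greater than
    the minimum of stream2; a value of stream2 can only take part in a result
    if it is smaller than the maximum of stream1.
    """
    if not stream1 or not stream2:
        return [], []
    min2 = min(stream2)
    max1 = max(stream1)
    kept1 = [x for x in stream1 if x > min2]
    kept2 = [y for y in stream2 if y < max1]
    return kept1, kept2


# Hypothetical sample data, not the exact streams of Fig. 2.
stream1 = [0, 1, 2, 3, 4, 5, 6, 7, 9]
stream2 = [0, 4, 6, 8, 10, 9, 12]
kept1, kept2 = first_prefilter_gt(stream1, stream2)
print(kept1)  # [1, 2, 3, 4, 5, 6, 7, 9]
print(kept2)  # [0, 4, 6, 8]
```

Only one minimum and one maximum are computed per window, so this step needs a single pass over each stream and no sorting.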
Step S2: calculating partition marks for each of the two data streams after the first pre-filtering according to the number of partitions and the maximum and minimum values, merging the partition marks of the two data streams to obtain joint partition marks, and performing joint partitioning according to the joint partition marks. Specifically:
In this embodiment, each of the two data streams is divided into three partitions. The maximum and minimum of the remaining values in data stream 1 (vertical) are 9 and 1, so the span of each partition is 8/3 ≈ 2.66, and the partition ranges are [1,3.66), [3.66,6.33) and [6.33,9]. Similarly, the maximum and minimum values of data stream 2 (horizontal) are 8 and 0, so the partition span is 8/3 ≈ 2.66, and the partition ranges are [0,2.66), [2.66,5.33) and [5.33,8]. Because the first pre-filtering reduces the amount of data in the data streams, the partition span shrinks accordingly and the partitions become finer.
After the partition marks of the two data streams are calculated respectively, in order to avoid carrying out isolated partition on the two data streams to damage the subsequent filtering effect, the partition marks obtained by calculating the two data streams are combined and called as joint partition marks, and the joint partition marks are utilized to carry out more refined partition on the two data streams, so that the subsequent filtering is more refined, the number of unnecessary Cartesian product operations is reduced, and the performance efficiency is improved.
As shown in Fig. 2, the partitions of data stream 1 are [1,3.66), [3.66,6.33) and [6.33,9]. The partitions of data stream 2 are [0,2.66), [2.66,5.33) and [5.33,8]. The joint partitions generated for data stream 1 and data stream 2 in this embodiment are [0,1), [1,2.66), [2.66,3.66), [3.66,5.33), [5.33,6.33), [6.33,8) and [8,9]. Taking the partitioning of data stream 1 by the joint partitions as an example, 1 and 2 in data stream 1 fall into one partition, 3 falls into one partition, 4 and 5 fall into one partition, and 6, 7 and 9 each fall into a partition of their own.
If the partition marks of the two data streams are not merged, the [2.66,5.33) partition of data stream 1 cannot be filtered out when it is checked against the [3.66,6.33) partition of data stream 2 under the θ condition: 5 in data stream 1 is greater than 4 in data stream 2, so the two partitions can produce a result pair that meets the θ condition. In this case the Cartesian product is subsequently performed over the whole of the two partitions. However, 6 in data stream 2 is redundant here, because no value in the [2.66,5.33) partition of data stream 1 can be greater than it. Partitioning the two data streams in isolation therefore weakens the subsequent filtering and leads to more meaningless, time-consuming Cartesian product operations.
If the partition marks of the two data streams are merged before partitioning, as in the method of the present application, the [3.66,6.33) partition of data stream 2 is split into [3.66,5.33) and [5.33,6.33) by the joint partitioning. In this way, 6 in data stream 2 is filtered out, which reduces the number of meaningless Cartesian product operations and improves the efficiency of the algorithm. Therefore, the two data streams should not be partitioned in isolation; their partitions should be combined with each other so that unnecessary Cartesian product operations can be filtered out in advance more effectively.
Meanwhile, no time-consuming sorting of the two data streams is needed at any point in the process, and no order is required of the data within a partition. Taking 5 and 4 in a data stream as an example, both are located in the [3.66,5.33) partition, and the subsequent screening and Cartesian product operations are performed with the partition as the unit, so the order of 5 and 4 within the partition has no influence at all. Time-consuming sorting is thus avoided and the operating efficiency is effectively improved.
That is, in this embodiment, the partition marks of data stream 1' and of data stream 2' obtained by the first pre-filtering are calculated separately and merged to obtain the joint partition marks, and data stream 1' and data stream 2' are each partitioned using the joint partition marks.
Step S3: performing a second pre-filtering on the two data streams after the joint partitioning according to the θ condition, filtering out all partitions that cannot contain a data group meeting the θ condition, so as to reduce the number of Cartesian product operations required, make most of the Cartesian product operations that are performed meaningful, and improve the operating efficiency. Specifically:
In this embodiment, when the [1,2.66) partition of data stream 1 is checked against the [8,9] partition of data stream 2 under the θ condition greater than (>), the maximum value in the partition of data stream 1 is smaller than the minimum value in the partition of data stream 2, so no data group in the two partitions can satisfy the θ condition. The two partitions therefore do not need a Cartesian product operation, which reduces the number of Cartesian product operations and improves the operating efficiency. Similar filtering is performed when the θ condition uses another operator (<, ≤ or ≥): no Cartesian product operation is performed for partitions that cannot produce a data group meeting the θ condition.
That is, in this embodiment, the partitioned data stream 1' and data stream 2' are pre-filtered for the second time according to the θ condition.
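The second pre-filtering can be pictured as pruning partition pairs before any Cartesian product is formed. The sketch below assumes the greater-than condition and compares the actual maximum and minimum of each partition; the function name and the partition contents are hypothetical, and comparing interval boundaries instead would be an equally plausible reading of the text.

```python
def second_prefilter_gt(parts1, parts2):
    """Keep only partition pairs (i, j) that can contain some s1 > s2.

    A pair survives only if the largest value of the stream-1 partition
    exceeds the smallest value of the stream-2 partition.
    """
    surviving = []
    for i, vals1 in parts1.items():
        for j, vals2 in parts2.items():
            if max(vals1) > min(vals2):
                surviving.append((i, j))
    return surviving


# Hypothetical joint partitions, keyed by joint-partition index.
parts1 = {1: [1, 2], 2: [3], 3: [4, 5], 4: [6], 5: [7], 6: [9]}
parts2 = {0: [0], 3: [4], 4: [6], 6: [8]}
pairs = second_prefilter_gt(parts1, parts2)
print((1, 6) in pairs)  # False: [1,2.66) of stream 1 vs [8,9] of stream 2 is pruned
print((3, 3) in pairs)  # True: 5 > 4 is possible inside [3.66,5.33)
```

In this toy input 13 of the 24 partition pairs survive, and only those pairs go on to the Cartesian product step.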
Step S4: calculating the average partition size of each data stream after the second pre-filtering, and re-partitioning the partitions whose size is larger than the average partition size, so as to avoid load imbalance and keep the load as balanced as possible. Specifically:
The sizes of the partitions remaining after the second pre-filtering may differ greatly, and some partitions may be too large. To balance the load, the partitions that are too large are split up, that is, re-partitioned. The re-partitioning operation is a necessary operation of the present application.
In the present application, a partition whose size exceeds the average partition size is identified as an overloaded partition and is subjected to the re-partitioning operation.
The calculation of the average partition size comprises:
$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$
where average_partition_size is the average partition size of the data stream, number_of_elements_in_stream is the amount of data of the data stream in the current processing window, and number_of_partitions is the number of partitions produced for the data stream by the joint partitioning.
For example, if a data stream has 9 data elements and is divided into three partitions, each partition should contain 9/3 = 3 data elements under the best load condition.
The following formula gives the number of partitions produced by re-partitioning:
$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$
where repartition_number is the number of sub-partitions into which an overloaded partition is re-partitioned, overloaded_partition_size is the size of the overloaded partition, average_partition_size is the average partition size, and ⌈ ⌉ denotes rounding up. For example, when the average partition size is 2 and there is a heavily overloaded partition of size 9, the number of re-partitions is 9/2 = 4.5, rounded up to 5; that is, the partition is subdivided into 5 parts. Re-partitioning works in the same way as the earlier partitioning, with the partition ranges determined from the maximum and minimum values of the partition, except that no joint partitioning operation is needed when re-partitioning.
That is, in this embodiment, the average partition size of data stream 1' and the average partition size of data stream 2' are calculated separately after the second pre-filtering, and the overloaded partitions of data stream 1' and of data stream 2' are each re-partitioned.
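The two formulas of step S4 and the equal-width split of an overloaded partition can be sketched as follows. The function names mirror the symbols in the formulas; the `repartition` helper, its handling of empty sub-partitions (they are simply dropped) and the sample partition are assumptions made for illustration.

```python
import math


def average_partition_size(number_of_elements_in_stream, number_of_partitions):
    return number_of_elements_in_stream / number_of_partitions


def repartition_number(overloaded_partition_size, avg_size):
    return math.ceil(overloaded_partition_size / avg_size)


def repartition(values, avg_size):
    """Split one overloaded partition into equal-width sub-partitions
    between its own minimum and maximum (no joint marks involved)."""
    n = repartition_number(len(values), avg_size)
    lo, hi = min(values), max(values)
    if n <= 1 or lo == hi:
        return [list(values)]
    subparts = [[] for _ in range(n)]
    for v in values:
        idx = min(int((v - lo) * n / (hi - lo)), n - 1)  # last interval closed
        subparts[idx].append(v)
    return [p for p in subparts if p]  # assumption: drop empty sub-partitions


print(average_partition_size(9, 3))  # 3.0
print(repartition_number(9, 2))      # 5

# Hypothetical overloaded partition of size 9 with average partition size 2.
subparts = repartition([0, 1, 2, 3, 4, 5, 6, 7, 8], 2)
print(subparts)  # [[0, 1], [2, 3], [4], [5, 6], [7, 8]]
```

Because the split is by value range rather than by count, skewed data can still leave sub-partitions of unequal size; the formula only fixes how many sub-partitions are requested.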
Step S5: performing the Cartesian product operation on the two data streams after the re-partitioning to obtain the data stream θ connection result. Specifically:
After the two pre-filtering passes, most of the remaining Cartesian product operations are meaningful, that is, they yield a valid theta-join result. The joint partitioning and the re-partitioning further improve the efficiency of the algorithm. Finally, the data groups meeting the θ condition are screened from the Cartesian product result according to the θ condition, which gives the final theta-join result.
For example, the [3.66,5.33) joint partition places 4 and 5 of data stream 1 in one partition and 4 of data stream 2 in one partition. Performing the Cartesian product on these two partitions yields one result satisfying the θ condition, 5 > 4, whereas the pair of 4 and 4 does not yield a valid theta-join result because 4 is not greater than 4.
It is worth mentioning that, as can be seen from the histogram of the number of Cartesian products in Fig. 2(f), the present application greatly reduces the number of Cartesian products that need to be performed compared with the other methods compared. In addition, since there are 27 data groups in the data that satisfy the θ condition, any optimization requires at least 27 Cartesian product operations. The present application performs 29 Cartesian product operations, which is close to the optimum.
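The final step, a Cartesian product restricted to the surviving partition pairs followed by screening with the θ condition, can be sketched as below. The function name `theta_join_pairs` and its arguments are assumptions; the input reproduces only the [3.66,5.33) example from the text, not the full data of Fig. 2.

```python
from itertools import product
from operator import gt


def theta_join_pairs(parts1, parts2, surviving_pairs, theta):
    """Cartesian product per surviving partition pair, then theta screening.

    Returns the matching (s1, s2) tuples and the number of element pairs
    that were actually compared.
    """
    results = []
    compared = 0
    for i, j in surviving_pairs:
        for a, b in product(parts1[i], parts2[j]):
            compared += 1
            if theta(a, b):
                results.append((a, b))
    return results, compared


# The [3.66,5.33) joint partition from the example: index 3 is hypothetical.
parts1 = {3: [4, 5]}
parts2 = {3: [4]}
results, compared = theta_join_pairs(parts1, parts2, [(3, 3)], gt)
print(results)   # [(5, 4)]
print(compared)  # 2
```

Passing the comparison as a function (`operator.gt` here) lets the same sketch serve the other operators mentioned in the text, provided the pre-filtering steps are adapted to match.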
Fig. 3 is a diagram showing a hardware architecture of the data flow θ connection optimization system 10 according to the embodiment of the present application. The system comprises: a pre-filtering module 101, a joint partitioning module 102, a re-partitioning module 103, and a processing module 104.
The pre-filtering module 101 is configured to pre-filter two data streams for a first time according to a θ condition. Specifically, the method comprises the following steps:
Referring to Fig. 2(e), compared with the prior art, the first pre-filtering mechanism of the present application effectively reduces the amount of data that needs to be partitioned, so the system runs more efficiently. The principle of the first pre-filtering is as follows. In this embodiment the θ condition is greater than (>), that is, a value of data stream 1 (vertical) must be greater than a value of data stream 2 (horizontal). The minimum value in data stream 2 is 0, so any value of data stream 1 greater than 0 may satisfy the θ condition, while values of data stream 1 less than or equal to 0 cannot and are pre-filtered; that is, 0 is filtered out and does not participate in subsequent steps such as partitioning. Similarly, the maximum value of data stream 1 is 9, so no value of data stream 2 greater than or equal to 9 can satisfy the θ condition, and these values, namely 10, 9 and 12, are filtered out. When the θ condition uses another operator (<, ≤ or ≥), the filtering works in the same way and is not described again here.
That is, in this embodiment, data stream 1 and data stream 2 are pre-filtered for the first time according to the θ condition to obtain data stream 1' and data stream 2'.
The data volume participating in the partition is reduced due to the first pre-filtering, so that the subsequent partitions are more refined. Therefore, the first pre-filtering is more beneficial to subsequent filtering, and the number of unnecessary Cartesian products is reduced.
The joint partitioning module 102 is configured to calculate partition marks for the two data streams after the first prefiltering according to the partition number and the maximum and minimum values, respectively, combine the partition marks of the two data streams to obtain a joint partition mark, and perform joint partitioning according to the joint partition mark. The method specifically comprises the following steps:
In this embodiment, each of the two data streams is divided into three partitions. The maximum and minimum of the remaining values in data stream 1 (vertical) are 9 and 1, so the span of each partition is 8/3 ≈ 2.66, and the partition ranges are [1,3.66), [3.66,6.33) and [6.33,9]. Similarly, the maximum and minimum values of data stream 2 (horizontal) are 8 and 0, so the partition span is 8/3 ≈ 2.66, and the partition ranges are [0,2.66), [2.66,5.33) and [5.33,8]. Because the first pre-filtering reduces the amount of data in the data streams, the partition span shrinks accordingly and the partitions become finer.
After the partition marks of the two data streams are calculated respectively, in order to avoid carrying out isolated partition on the two data streams to damage the subsequent filtering effect, the partition marks obtained by calculating the two data streams are combined and called as joint partition marks, and the joint partition marks are utilized to carry out more refined partition on the two data streams, so that the subsequent filtering is more refined, the number of unnecessary Cartesian product operations is reduced, and the performance efficiency is improved.
As shown in fig. 2, the partition of the data stream 1 is [1,3.66 ], [3.66,6.33], and [6.33,9 ]. The partitions of stream 2 are [0,2.66), [2.66,5.33) and [5.33,8 ]. The joint partition generated for data stream 1 and data stream 2 in this embodiment is [0,1 ], [1,2.66 ], [2.66,3.66 ], [3.66,5.33 ], [5.33,6.33 ], [6.33,8) and [8,9 ]. Taking the example of partitioning the data stream 1 by using the joint partition, 1 and 2 in the data stream 1 are divided into one area, 3 is divided into one area, 4 and 5 are divided into one area, and 6, 7 and 9 are respectively divided into one area.
If the partition marks of the two data streams are not merged, the [2.66,5.33) partition of data stream 1 cannot be filtered out when it is checked against the [3.66,6.33) partition of data stream 2 under the θ condition (>): since 5 in data stream 1 is greater than 4 in data stream 2, the two partitions can produce a result pair that satisfies the θ condition. The two partitions must then undergo a Cartesian product operation in their entirety. In that operation, however, 6 in data stream 2 is redundant, because no value in the [2.66,5.33) partition of data stream 1 can be greater than 6. Partitioning the two data streams in isolation therefore weakens the subsequent filtering and leads to more pointless, time-consuming Cartesian product operations.
If the partition marks of the two data streams are merged before partitioning, as in the method of the present application, the [3.66,6.33) partition of data stream 2 is split into [3.66,5.33) and [5.33,6.33) by the joint partitioning. In this way, 6 in data stream 2 is filtered out, which reduces the number of pointless Cartesian product operations and improves the efficiency of the algorithm. The two data streams should therefore not be partitioned in isolation; their partitions should be combined so that unnecessary Cartesian product operations can be filtered out in advance.
Moreover, no time-consuming sorting of the two data streams is needed anywhere in the process, and no order is required of the data within a partition. Take 5 and 4 in the data stream as an example: both lie in the [3.66,5.33) partition, and the subsequent filtering and Cartesian product operations are performed per partition, so the relative order of 5 and 4 within the partition has no effect. Time-consuming sorting is thus avoided and operating efficiency is improved.
That is, in this embodiment, partition marks are calculated separately for data stream 1' and data stream 2' obtained from the first pre-filtering, the partition marks of data stream 1' and data stream 2' are merged to obtain joint partition marks, and data stream 1' and data stream 2' are each partitioned using the joint partition marks.
The pre-filtering module 101 is further configured to perform a second pre-filtering on the two jointly partitioned data streams according to the θ condition, filtering out every partition that cannot contain a data pair satisfying the θ condition. This reduces the number of Cartesian product operations that must be performed, so that most of those that remain are meaningful, and thereby improves operating efficiency. Specifically:
In this embodiment, with the θ condition being greater than (>), when the [1,2.66) partition of data stream 1 is checked against the [8,9] partition of data stream 2, the maximum value in the partition of data stream 1 is smaller than the minimum value in the partition of data stream 2. No data pair satisfying the θ condition can exist between the two partitions, so they need no Cartesian product operation, which reduces the number of Cartesian product operations and improves operating efficiency. Analogous filtering is applied when the θ condition uses another operator (less than, less than or equal to, or greater than or equal to): partitions that cannot produce a data pair satisfying the θ condition are excluded from the Cartesian product operation.
That is, in this embodiment, the partitioned data streams 1' and 2' are pre-filtered a second time according to the θ condition.
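The partition-level check described above can be sketched as a comparison of the extreme values of two partitions. The function names `can_match` and `surviving_pairs`, the dictionary layout (partition index mapped to its values), and the example contents are assumptions made for illustration only.

```python
def can_match(part_a, part_b, theta):
    """Return True if some a in part_a and b in part_b satisfy `a theta b`."""
    if theta == ">":
        return max(part_a) > min(part_b)
    if theta == ">=":
        return max(part_a) >= min(part_b)
    if theta == "<":
        return min(part_a) < max(part_b)
    if theta == "<=":
        return min(part_a) <= max(part_b)
    raise ValueError(f"unsupported theta operator: {theta}")


def surviving_pairs(parts_a, parts_b, theta):
    """List the (index_a, index_b) partition pairs that still need a Cartesian product."""
    return [
        (i, j)
        for i, pa in parts_a.items()
        for j, pb in parts_b.items()
        if can_match(pa, pb, theta)
    ]


# Assumed partition contents, keyed by joint partition index.
parts_stream1 = {1: [1, 2], 3: [4, 5]}
parts_stream2 = {3: [4], 6: [8]}
pairs = surviving_pairs(parts_stream1, parts_stream2, ">")
print(pairs)
```

Only the extreme values of each partition are inspected, so the check costs far less than the Cartesian product it avoids.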
The re-partitioning module 103 is configured to calculate the average partition size of each data stream after the second pre-filtering and to re-partition every partition whose size exceeds the average partition size, so as to avoid load imbalance and keep the load as balanced as possible. Specifically:
The partitions produced after the second pre-filtering may differ greatly in size, and some may be too large. To balance the load, the oversized partitions are split up, that is, the re-partitioning operation is performed; re-partitioning is a necessary operation of the present application.
In the present application, a partition whose size exceeds the average partition size is identified as an overloaded partition and is subjected to the re-partitioning operation.
The average partition size is calculated as:
$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$
where average_partition_size is the average partition size of the data stream, number_of_elements_in_stream is the amount of data of the data stream in the current processing window, and number_of_partitions is the number of partitions produced for the data stream by the joint partitioning.
For example, if a data stream has 9 data elements and is divided into three partitions, each partition should contain 9/3 = 3 data elements under ideal load balance.
The number of partitions into which an overloaded partition is re-partitioned is calculated as:
$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$
where repartition_number is the number of partitions into which the overloaded partition is re-partitioned, overloaded_partition_size is the size of the overloaded partition, and average_partition_size is the average partition size; the upper brackets ⌈ ⌉ denote rounding up. For example, when the average partition size is 2 and a heavily overloaded partition has size 9, the re-partition number is 9/2 = 4.5, rounded up to 5, so the partition is re-partitioned into 5 parts. Re-partitioning works the same way as before: the partition ranges are determined from the maximum and minimum values of the partition, but no joint partitioning operation is needed.
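The two calculations and the re-split can be sketched as follows. The first two functions follow the definitions in the text; `repartition` is a hypothetical helper that splits an overloaded partition into equal-width ranges between its own minimum and maximum, and the partition contents used in the example are assumed.

```python
import math


def average_partition_size(number_of_elements_in_stream, number_of_partitions):
    """Average number of elements per partition in the current window."""
    return number_of_elements_in_stream / number_of_partitions


def repartition_number(overloaded_partition_size, average_size):
    """Number of parts an overloaded partition is split into (rounded up)."""
    return math.ceil(overloaded_partition_size / average_size)


def repartition(values, parts):
    """Split values into at most `parts` equal-width ranges; empty ranges are dropped."""
    lo, hi = min(values), max(values)
    if parts <= 1 or lo == hi:
        return [list(values)]
    span = (hi - lo) / parts
    buckets = [[] for _ in range(parts)]
    for v in values:
        idx = min(int((v - lo) / span), parts - 1)  # the maximum goes to the last bucket
        buckets[idx].append(v)
    return [b for b in buckets if b]


avg = average_partition_size(9, 3)        # 9 elements, 3 partitions
n = repartition_number(9, 2)              # overloaded size 9, average size 2
pieces = repartition(list(range(1, 10)), n)  # assumed contents of the overloaded partition
print(avg, n, pieces)
```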
That is, in this embodiment, the average partition size of data stream 1' and the average partition size of data stream 2' are calculated separately after the second pre-filtering, and the overloaded partitions of data stream 1' and data stream 2' are each re-partitioned.
The processing module 104 is configured to perform a Cartesian product operation on the two re-partitioned data streams to obtain the data stream θ-join result. Specifically:
After the two pre-filtering operations, most of the remaining Cartesian product operations are meaningful, that is, they can yield valid θ-join results. The two partitioning steps improve the efficiency of the optimization. Finally, the data pairs satisfying the θ condition are selected from the Cartesian product results; these form the final θ-join result.
For example, the [3.66,5.33) joint partition places 4 and 5 of data stream 1 in one region and 4 of data stream 2 in one region. Performing the Cartesian product operation on the two regions yields the pair 5 > 4, which satisfies the θ condition, whereas the pair of 4 and 4 yields no valid θ-join result because 4 is not greater than 4.
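The final step for a single pair of regions can be sketched as a Cartesian product followed by a θ filter. The name `theta_join_partitions` and the operator table `THETA` are hypothetical; the input values are those of the example above.

```python
import operator
from itertools import product

# Map each theta operator symbol to its comparison function.
THETA = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def theta_join_partitions(part_a, part_b, theta):
    """Cartesian product of two regions, keeping only pairs that satisfy theta."""
    compare = THETA[theta]
    return [(a, b) for a, b in product(part_a, part_b) if compare(a, b)]


# Region of data stream 1 holds 4 and 5; region of data stream 2 holds 4.
result = theta_join_partitions([4, 5], [4], ">")
print(result)  # [(5, 4)]
```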
It is worth noting that, as the histogram of the number of Cartesian products in Fig. 2(f) shows, the present application greatly reduces the number of Cartesian product operations that must be performed compared with the other methods in the comparison. Moreover, since 27 data pairs in the data satisfy the θ condition, any optimization requires at least 27 Cartesian product operations. The present application performs 29 Cartesian product operations, which is close to the optimum.
Through stronger data stream filtering and more reasonable and effective data stream partitioning, the present application reduces the number of Cartesian product operations required and improves the efficiency of the θ-join operation, so that it can be applied more efficiently in big data applications.
Please refer to Fig. 4, which is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 50 comprises a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the data stream θ connection optimization method described above.
The processor 51 is configured to execute the program instructions stored in the memory 52 to control the data stream θ connection optimization.
The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor or any conventional processor.
Fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium stores a program file 61 capable of implementing all of the methods described above. The program file 61 may be stored in the storage medium in the form of a software product and includes a number of instructions that cause a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The storage medium includes any medium capable of storing program code, such as a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or a terminal device such as a computer, a server, a mobile phone, or a tablet.
Although the present invention has been described with reference to the presently preferred embodiments, it will be understood by those skilled in the art that the foregoing description is illustrative only and is not intended to limit the scope of the invention, as claimed.

Claims (10)

1. A method for optimizing data flow θ connections, the method comprising the steps of:
a. performing first pre-filtering on the two data streams according to a theta condition; the pre-filtering refers to filtering out partitions of the data stream which cannot meet a theta condition;
b. respectively calculating partition marks for the two data streams after the first pre-filtering according to the partition number and the maximum and minimum values, combining the partition marks of the two data streams to obtain a combined partition mark, and performing combined partitioning according to the combined partition mark;
c. performing second pre-filtering on the two data streams subjected to the joint partitioning according to a theta condition; the second pre-filtering is to filter out partitions of the data stream which cannot meet the theta condition;
d. respectively calculating the average partition size of the data stream after the second pre-filtering, and re-partitioning the partitions with the partition size larger than the average partition size;
e. carrying out a Cartesian product operation on the two data streams after the re-partitioning to obtain a data stream θ connection result.
2. The method of claim 1, wherein step b comprises:
after the partition marks of the two data streams are calculated respectively, combining the partition marks calculated by the two data streams to obtain a combined partition mark, and performing more detailed partition on the two data streams respectively by using the combined partition mark.
3. The method of claim 2, wherein the average partition size is calculated by the following equation:
$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$
wherein average_partition_size is the average partition size of the data stream, number_of_elements_in_stream is the amount of data of the data stream in the current processing window, and number_of_partitions is the number of partitions generated for the data stream by the joint partitioning.
4. The method of claim 3 wherein the number of subdivisions of said repartition is calculated by the following equation:
$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$
wherein repartition_number is the number of partitions into which the overloaded partition is re-partitioned, overloaded_partition_size is the size of the overloaded partition, average_partition_size is the average partition size, and the upper brackets ⌈ ⌉ denote rounding up.
5. A data flow θ connection optimization system, comprising a pre-filtering module, a joint partitioning module, a re-partitioning module, and a processing module, wherein:
the pre-filtering module is used for pre-filtering two data streams for the first time according to a theta condition; the pre-filtering refers to filtering out partitions of the data stream which cannot meet a theta condition;
the joint partitioning module is used for respectively calculating partition marks of the two data streams after the first pre-filtering according to the partition number and the maximum and minimum values, combining the partition marks of the two data streams to obtain a joint partition mark, and performing joint partitioning according to the joint partition mark;
the pre-filtering module is also used for carrying out secondary pre-filtering on the two data streams subjected to the joint partition according to a theta condition;
the re-partition module is used for respectively calculating the average partition size of the data stream after the second pre-filtering and re-partitioning the partitions of which the partition size is larger than the average partition size;
and the processing module is used for carrying out a Cartesian product operation on the two data streams after the re-partitioning to obtain a data stream θ connection result.
6. The system of claim 5, wherein the joint partitioning module is specifically configured to:
after the partition marks of the two data streams are calculated respectively, the partition marks calculated by the two data streams are combined to obtain a joint partition mark, and the joint partition mark is used for carrying out more detailed partition on the two data streams respectively.
7. The system of claim 6, wherein the average partition size is calculated by the following equation:
$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$
wherein average_partition_size is the average partition size of the data stream, number_of_elements_in_stream is the amount of data of the data stream in the current processing window, and number_of_partitions is the number of partitions generated for the data stream by the joint partitioning.
8. The system of claim 7, wherein the number of repartitions is calculated by the following equation:
$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$
wherein repartition_number is the number of partitions into which the overloaded partition is re-partitioned, overloaded_partition_size is the size of the overloaded partition, average_partition_size is the average partition size, and the upper brackets ⌈ ⌉ denote rounding up.
9. A terminal comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing data flow θ connection optimization as claimed in any one of claims 1-4;
the processor is configured to execute the program instructions stored by the memory to control the data flow θ connection optimization.
10. A storage medium having stored thereon program instructions executable by a processor to perform the data stream θ connection optimization method of any one of claims 1 to 4.
CN202110331197.7A 2021-03-26 2021-03-26 Data stream theta connection optimization method, system, terminal and storage medium Active CN112948442B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110331197.7A CN112948442B (en) 2021-03-26 2021-03-26 Data stream theta connection optimization method, system, terminal and storage medium
PCT/CN2021/137981 WO2022199147A1 (en) 2021-03-26 2021-12-14 DATA STREAM θ-JOIN OPTIMIZATION METHOD AND SYSTEM, TERMINAL, AND STORAGE MEDIUM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110331197.7A CN112948442B (en) 2021-03-26 2021-03-26 Data stream theta connection optimization method, system, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN112948442A CN112948442A (en) 2021-06-11
CN112948442B true CN112948442B (en) 2022-06-21

Family

ID=76227076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110331197.7A Active CN112948442B (en) 2021-03-26 2021-03-26 Data stream theta connection optimization method, system, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN112948442B (en)
WO (1) WO2022199147A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948442B (en) * 2021-03-26 2022-06-21 深圳先进技术研究院 Data stream theta connection optimization method, system, terminal and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6032144A (en) * 1996-05-29 2000-02-29 Lucent Technologies Inc. Optimization of queries using relational algebraic theta-semijoin operator
US6061676A (en) * 1996-05-29 2000-05-09 Lucent Technologies Inc. Effecting constraint magic rewriting on a query with the multiset version of the relational algebric theta-semijoin operator
CN105955999A (en) * 2016-04-20 2016-09-21 华中科技大学 Large scale RDF graph Thetajoin query processing method
CN106021386A (en) * 2016-05-12 2016-10-12 西北工业大学 Theta-join method for massive distributed data
CN110489452A (en) * 2019-08-21 2019-11-22 中国科学院深圳先进技术研究院 Multiplex data stream θ connection optimization method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7523123B2 (en) * 2006-11-16 2009-04-21 Yahoo! Inc. Map-reduce with merge to process multiple relational datasets
US10387426B2 (en) * 2017-10-27 2019-08-20 Streamsimple, Inc. Streaming microservices for stream processing applications
CN110659304B (en) * 2019-09-09 2023-06-16 杭州中科先进技术研究院有限公司 Multi-path data stream connection system based on data inclination
CN112365075A (en) * 2020-11-19 2021-02-12 中国科学院深圳先进技术研究院 Stock price trend prediction method, system, terminal and storage medium
CN112948442B (en) * 2021-03-26 2022-06-21 深圳先进技术研究院 Data stream theta connection optimization method, system, terminal and storage medium

Also Published As

Publication number Publication date
WO2022199147A1 (en) 2022-09-29
WO2022199147A9 (en) 2022-10-27
WO2022199147A8 (en) 2023-11-02
CN112948442A (en) 2021-06-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant