CN112948442B

CN112948442B - Data stream theta connection optimization method, system, terminal and storage medium

Info

Publication number: CN112948442B
Application number: CN202110331197.7A
Authority: CN
Inventors: 王洋; 吴嘉澍; 须成忠
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2022-06-21
Anticipated expiration: 2041-03-26
Also published as: WO2022199147A1; WO2022199147A9; WO2022199147A8; CN112948442A

Abstract

The invention relates to a data flow theta connection optimization method, which comprises the following steps: performing first pre-filtering on the two data streams according to a theta condition; respectively calculating partition marks for the two data streams after the first pre-filtering according to the partition number and the maximum and minimum values, combining the partition marks of the two data streams to obtain a combined partition mark, and performing combined partitioning according to the combined partition mark; performing secondary pre-filtering on the two data streams subjected to the joint partitioning according to a theta condition; respectively calculating the average partition size of the data stream after the second pre-filtering, and re-partitioning the partitions with the partition size larger than the average partition size; and carrying out Cartesian product operation on the two data streams after the repartition, and obtaining a data stream theta connection result. The invention also relates to a data flow theta connection optimization system. The invention optimizes the theta-join algorithm by utilizing secondary pre-filtering and joint partitioning, and reduces the number of unnecessary Cartesian products as much as possible, thereby greatly improving the operation efficiency.

Description

Data stream theta connection optimization method, system, terminal and storage medium

Technical Field

The invention relates to a data stream theta connection optimization method, a system, a terminal and a storage medium.

Background

With the rapid development of the digital era, data is also generated in large quantities. In order to more efficiently process and utilize generated data, an efficient data processing algorithm is becoming more important. In many applications, data comes online in the form of an online data stream, such as stock and exchange rate data that fluctuates over time, air temperature and wind speed data that changes over time, and so forth.

An important operation in data stream processing is the theta join (theta-join). Given a theta condition, the theta-join operation finds the data pairs in the data streams that satisfy that condition, which facilitates subsequent data analysis and utilization.

To meet the data processing efficiency requirements of today's big data and information applications, the theta-join algorithm needs to perform join operations on data streams as efficiently as possible. However, the Cartesian product (cross-product) operation required in the theta-join is time-consuming and becomes the most important factor affecting the efficiency of the theta-join algorithm.
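As a point of reference, a naive theta-join over one processing window can be sketched as follows. This is only an illustrative baseline, not code from the application; the function name `naive_theta_join` and the sample windows are hypothetical. It compares every pair of values, so the number of Cartesian product operations is always the product of the two window sizes.

```python
import operator
from itertools import product


def naive_theta_join(stream1, stream2, theta=operator.gt):
    """Compare every pair from the two windows.

    Returns the matching pairs and the number of pairs compared.
    """
    pairs = list(product(stream1, stream2))  # full Cartesian product
    matches = [(x, y) for x, y in pairs if theta(x, y)]
    return matches, len(pairs)


# Hypothetical sample windows, chosen only for illustration.
window1 = [1, 5, 9]
window2 = [0, 4, 12]
matches, compared = naive_theta_join(window1, window2)
print(matches, compared)
```

Here every one of the 3 x 3 pairs is compared even though some values (such as 12 in the second window) can never contribute a result, which is the inefficiency the method sets out to reduce.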

Existing theta-join algorithms perform a large number of unnecessary Cartesian product operations when carrying out the theta join, so their operating efficiency is poor. The problem to be solved is therefore how to reduce the number of unnecessary Cartesian products performed by the theta-join algorithm as far as possible and thereby improve its operating efficiency.

Disclosure of Invention

In view of this, it is necessary to provide a data stream θ connection optimization method, system, terminal, and storage medium, which can optimize the theta-join algorithm by using secondary pre-filtering and joint partitioning, and reduce the number of unnecessary cartesian products performed by the theta-join algorithm as much as possible, thereby greatly improving the operating efficiency of the theta-join algorithm.

The embodiment of the application provides a data flow theta connection optimization method, which comprises the following steps: a. performing first pre-filtering on the two data streams according to a theta condition; the pre-filtering refers to filtering out partitions of the data stream which cannot meet a theta condition; b. respectively calculating partition marks for the two data streams after the first pre-filtering according to the partition number and the maximum and minimum values, combining the partition marks of the two data streams to obtain a combined partition mark, and performing combined partitioning according to the combined partition mark; c. performing secondary pre-filtering on the two partitioned data streams according to a theta condition; d. respectively calculating the average partition size of the data stream after the second pre-filtering, and re-partitioning the partitions with the partition size larger than the average partition size; e. and carrying out Cartesian product operation on the two data streams after the repartition, and obtaining a data stream theta connection result.

Wherein, the step b comprises:

after the partition marks of the two data streams are calculated respectively, the partition marks calculated by the two data streams are combined to obtain a combined partition mark, and the combined partition mark is utilized to perform more detailed partition on the two data streams respectively.

Wherein, the average partition size is calculated by the following formula:

$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$

wherein average_partition_size is the average partition size of the data stream, number_of_elements_in_stream is the amount of data of the data stream in the current processing window, and number_of_partitions is the number of partitions generated for the data stream by the joint partitioning.

The number of re-partitions is calculated by the following formula:

$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$

wherein repartition_number is the number of sub-partitions into which the overloaded partition is re-partitioned, overloaded_partition_size is the size of the overloaded partition, average_partition_size is the average partition size, and the upper brackets ⌈ ⌉ denote rounding up.

The embodiment of the application provides a data flow theta connection optimization system, which comprises a pre-filtering module, a joint partitioning module, a re-partitioning module and a processing module, wherein: the pre-filtering module is used for pre-filtering two data streams for the first time according to a theta condition; the pre-filtering refers to filtering out partitions of the data stream which cannot meet a theta condition; the joint partitioning module is used for respectively calculating partitioning marks of the two data streams after the first pre-filtering according to the partitioning number and the maximum and minimum values, combining the partitioning marks of the two data streams to obtain a joint partitioning mark, and performing joint partitioning according to the joint partitioning mark; the pre-filtering module is further used for performing secondary pre-filtering on the two data streams subjected to the joint partitioning according to a theta condition; the re-partitioning module is used for respectively calculating the average partition size of the data stream after the second pre-filtering and re-partitioning the partitions with the partition size larger than the average partition size; and the processing module is used for carrying out Cartesian product operation on the two data streams after the repartition and the partition to obtain a data stream theta connection result.

Wherein the joint partitioning module is specifically configured to:

after the partition marks of the two data streams are calculated respectively, combining the partition marks calculated by the two data streams to obtain a combined partition mark, and performing more detailed partition on the two data streams respectively by using the combined partition mark.

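Wherein, the average partition size is calculated by the following formula:

$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$

The number of re-partitions is calculated by the following formula:

$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$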

An embodiment of the present application provides a terminal, where the terminal includes a processor and a memory coupled to the processor, where:

the memory stores program instructions for implementing data flow θ connection optimization as claimed in any one of claims 1-4;

the processor is configured to execute the program instructions stored by the memory to control the data stream θ connection optimization.

The embodiment of the application provides a storage medium, which stores program instructions executable by a processor, and the program instructions are used for executing the data stream theta connection optimization method in any one of claims 1 to 4.

The application provides a data stream theta connection optimization method and a data stream theta connection optimization system. Compared with an algorithm which directly performs partitioning without filtering two data streams according to a theta condition, the method and the device can preliminarily reduce the data volume needing partitioning, so that the algorithm runs more efficiently and partitions are more refined, the subsequent screening of data is facilitated, and the number of unnecessary Cartesian products is reduced. Then, partition markers will be calculated for the two data streams, respectively, according to the number of partitions. In the whole process, time-consuming sequencing operation is not needed to be carried out on the two data streams, and the data sequence in the partitions is not required. After the partition marks of the two data streams are calculated respectively, in order to avoid that the two data streams are partitioned in an isolated manner to damage the subsequent filtering effect, the partition marks calculated by the two data streams are combined, and the two data streams are partitioned more finely by using the combined partition marks, so that the subsequent filtering is more refined. After the two data streams are partitioned according to the joint partition mark, secondary pre-filtering is carried out according to the theta condition, and all partitions which cannot have a data group meeting the theta condition are filtered. And for the partitions with the partition size larger than the average partition size, carrying out repartitioning operation, and carrying out pairwise Cartesian product operation on the reserved partitions. And finally, screening a data group meeting the theta condition from the Cartesian product result according to the theta condition, namely the final theta-join result.

Drawings

Fig. 1 is a flowchart of a data flow θ connection optimization method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a Cartesian product matrix and a Cartesian product quantity histogram generated according to a sample data stream according to the embodiments and the prior art (the bold sides in the diagram represent the boundaries of data requiring Cartesian product, and the black thin sides represent partition boundaries);

FIG. 3 is a diagram of a hardware architecture of a data flow θ connection optimization system according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present application;

fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a flowchart illustrating an operation of a data flow θ connection optimization method according to a preferred embodiment of the present invention.

In step S1, the two data streams are pre-filtered for the first time according to the θ condition. Specifically, the method comprises the following steps:

As shown in Fig. 2(e), compared with the prior art, the first pre-filtering mechanism of the present application reduces the amount of data that needs to be partitioned, so that the operation is more efficient. The principle of the first pre-filtering is as follows. In this embodiment the θ condition is "greater than" (>), applied as data stream 1 > data stream 2. The minimum value in data stream 2 (horizontal) is 0, so any value in data stream 1 (vertical) that is greater than 0 may satisfy the θ condition, whereas values less than or equal to 0 in data stream 1 cannot. Such values are pre-filtered, that is, 0 is filtered out and does not participate in subsequent steps such as partitioning. Similarly, the maximum value of data stream 1 is 9, so any value in data stream 2 that is greater than or equal to 9 cannot satisfy the θ condition and is filtered out, namely 10, 9 and 12. When the θ condition is another operator (<, ≤ or ≥), the filtering works in the same way and is not listed here.

In this embodiment, that is, the data stream 1 and the data stream 2 are pre-filtered for the first time according to the θ condition to obtain the data stream 1 'and the data stream 2'.

The data volume participating in the partition is reduced due to the first pre-filtering, so that the subsequent partitions are more refined. Therefore, the first pre-filtering is more beneficial to the subsequent filtering, and the number of unnecessary Cartesian products is reduced.
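The first pre-filtering can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `first_prefilter`, the table `_EXTREMES` and the sample values are hypothetical (the values are chosen to agree with the worked example, not taken from Fig. 2), and the handling of <, ≤ and ≥ is an extrapolation of the "greater than" case described above.

```python
import operator

# For each operator: (extreme of stream 2 used to test stream 1 values,
#                     extreme of stream 1 used to test stream 2 values).
# Assumption: only these four comparison operators are supported.
_EXTREMES = {
    operator.gt: (min, max),
    operator.ge: (min, max),
    operator.lt: (max, min),
    operator.le: (max, min),
}


def first_prefilter(stream1, stream2, theta=operator.gt):
    """Drop values that cannot satisfy `x theta y` with any partner."""
    if not stream1 or not stream2:
        return [], []
    pick2, pick1 = _EXTREMES[theta]
    bound2 = pick2(stream2)  # most favourable partner in stream 2
    bound1 = pick1(stream1)  # most favourable partner in stream 1
    kept1 = [x for x in stream1 if theta(x, bound2)]
    kept2 = [y for y in stream2 if theta(bound1, y)]
    return kept1, kept2


# Illustrative windows (hypothetical values).
stream1 = [0, 1, 2, 3, 4, 5, 6, 7, 9]
stream2 = [0, 3, 4, 6, 8, 9, 10, 12]
kept1, kept2 = first_prefilter(stream1, stream2)
print(kept1)  # 0 is removed from stream 1
print(kept2)  # 9, 10 and 12 are removed from stream 2
```

Only the minimum and maximum of each window are needed, so this step requires a single pass over each stream and no sorting.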

And step S2, respectively calculating partition marks for the two data streams after the first pre-filtering according to the partition number and the maximum and minimum values, combining the partition marks of the two data streams to obtain a combined partition mark, and performing combined partitioning according to the combined partition mark. Specifically, the method comprises the following steps:

In this embodiment, each of the two data streams is divided into three partitions. The maximum value and the minimum value of the remaining values in data stream 1 (vertical) are 9 and 1, so the span of each partition is 8/3 ≈ 2.66, and the partition ranges are [1,3.66), [3.66,6.33) and [6.33,9]. Similarly, the maximum and minimum values of data stream 2 (horizontal) are 8 and 0, respectively, so the partition span is 8/3 ≈ 2.66, and the partition ranges are [0,2.66), [2.66,5.33) and [5.33,8]. Because the first pre-filtering reduces the data volume in the data streams, the span of the partitions is correspondingly reduced and the partitions are finer.

After the partition marks of the two data streams are calculated respectively, in order to avoid carrying out isolated partition on the two data streams to damage the subsequent filtering effect, the partition marks obtained by calculating the two data streams are combined and called as joint partition marks, and the joint partition marks are utilized to carry out more refined partition on the two data streams, so that the subsequent filtering is more refined, the number of unnecessary Cartesian product operations is reduced, and the performance efficiency is improved.

As shown in Fig. 2, the partitions of data stream 1 are [1,3.66), [3.66,6.33) and [6.33,9]. The partitions of data stream 2 are [0,2.66), [2.66,5.33) and [5.33,8]. The joint partitions generated for data stream 1 and data stream 2 in this embodiment are [0,1), [1,2.66), [2.66,3.66), [3.66,5.33), [5.33,6.33), [6.33,8) and [8,9]. Taking the partitioning of data stream 1 by the joint partitions as an example, 1 and 2 in data stream 1 fall into one partition, 3 falls into one partition, 4 and 5 fall into one partition, and 6, 7 and 9 each fall into a partition of their own.
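The computation of partition marks and their merging into joint partition marks can be sketched as follows. This is an illustrative sketch, not the application's code: the names `partition_marks`, `joint_marks` and `assign` are hypothetical, the input values are assumed to be the pre-filtered streams of the example, and each stream is assumed to have distinct minimum and maximum values. The code keeps full floating-point precision, so its marks print as 2.67, 3.67 and so on where the text truncates to 2.66 and 3.66.

```python
from bisect import bisect_right


def partition_marks(values, k):
    """Equal-width boundaries between min and max (k partitions, k+1 marks)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) / k
    return [lo + i * span for i in range(k)] + [hi]


def joint_marks(values1, values2, k):
    """Merge the marks of both streams into one sorted list."""
    return sorted(set(partition_marks(values1, k)) | set(partition_marks(values2, k)))


def assign(values, marks):
    """Group values by joint partition index; no sorting of values is needed."""
    last = len(marks) - 2  # index of the final, right-closed partition
    parts = {}
    for v in values:
        idx = min(bisect_right(marks, v) - 1, last)
        parts.setdefault(idx, []).append(v)
    return parts


# Pre-filtered example streams (hypothetical values consistent with the text).
s1 = [1, 2, 3, 4, 5, 6, 7, 9]
s2 = [0, 3, 4, 6, 8]
marks = joint_marks(s1, s2, 3)
parts1 = assign(s1, marks)
parts2 = assign(s2, marks)
print([round(m, 2) for m in marks])
print(parts1)
print(parts2)
```

With these inputs, stream 1 is grouped as {1, 2}, {3}, {4, 5}, {6}, {7} and {9}, matching the grouping described in the text.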

If the partition marks of the two data streams are not merged, then when θ-condition filtering is performed between the [2.66,5.33) partition in data stream 1 and the [3.66,6.33) partition in data stream 2, the pair cannot be filtered out: 5 in data stream 1 is greater than 4 in data stream 2, so the two partitions can produce a result pair that meets the θ condition. The two partitions are therefore passed in their entirety to the subsequent Cartesian product operation. However, 6 in data stream 2 is redundant in this case, because it cannot form a data pair satisfying the "greater than" condition with any value in the [2.66,5.33) partition of data stream 1. Partitioning the two data streams in isolation, rather than jointly, therefore impairs the subsequent filtering and results in more meaningless, time-consuming Cartesian product operations.

If the partition marks of the two data streams are merged before partitioning, according to the method of the present application, the [3.66,6.33) partition of data stream 2 is decomposed into [3.66,5.33) and [5.33,6.33) by the joint partitioning. In this way, 6 in data stream 2 is filtered out, which reduces the number of meaningless Cartesian product operations and improves the efficiency of the algorithm. Therefore, the two data streams should not be partitioned in isolation; their partitions should be combined so that unnecessary Cartesian product operations can be filtered out in advance more effectively.

Meanwhile, no time-consuming sorting of the two data streams is needed at any point in the process, and no particular order of the data within a partition is required. Taking 5 and 4 in a data stream as an example, both are located in the [3.66,5.33) partition. Because the subsequent filtering and Cartesian product operations are performed with the partition as the unit, the order of 5 and 4 within the partition has no influence at all, so time-consuming sorting is avoided and the operating efficiency is improved.

In this embodiment, that is, for the data stream 1 'and the data stream 2' obtained by the first pre-filtering, the partition flag of the data stream 1 'and the partition flag of the data stream 2' are respectively calculated, the partition flag of the data stream 1 'and the partition flag of the data stream 2' are combined to obtain a joint partition flag, and the data stream 1 'and the data stream 2' are respectively partitioned by using the joint partition flag.

And step S3, performing a second pre-filtering on the two data streams after the joint partitioning according to the theta condition, and further filtering all partitions in which data groups meeting the theta condition are not possible to exist, so as to reduce the required Cartesian product operations, make most of the performed Cartesian product operations meaningful, and improve the operation efficiency. Specifically, the method comprises the following steps:

In this embodiment, consider filtering the [1,2.66) partition in data stream 1 against the [8,9] partition in data stream 2 under the θ condition "greater than" (>). Since even the maximum value in the partition of data stream 1 is smaller than the minimum value in the partition of data stream 2, no data pair in the two partitions can satisfy the θ condition, so the two partitions do not need to undergo the Cartesian product operation. This reduces the number of Cartesian product operations and improves the operating efficiency. Similar filtering is performed when the θ condition is another operator (<, ≤ or ≥): partition pairs that cannot generate a data pair meeting the θ condition do not undergo the Cartesian product operation.

In this embodiment, that is, the partitioned data streams 1 'and 2' are pre-filtered for the second time according to the θ condition.
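The second pre-filtering works on pairs of partitions rather than on individual values, and can be sketched as follows. The names `may_match` and `second_prefilter` are hypothetical, the partition contents are the illustrative groupings of the running example, and the treatment of operators other than > is an assumption based on the statement that they are filtered similarly.

```python
import operator


def may_match(part1, part2, theta=operator.gt):
    """True if some x in part1 and y in part2 could satisfy `x theta y`.

    Assumption: theta is one of gt, ge, lt, le.
    """
    if theta in (operator.gt, operator.ge):
        return theta(max(part1), min(part2))
    return theta(min(part1), max(part2))


def second_prefilter(parts1, parts2, theta=operator.gt):
    """Keep only the partition pairs that can still produce a result."""
    return [
        (i, j)
        for i, p1 in parts1.items()
        for j, p2 in parts2.items()
        if may_match(p1, p2, theta)
    ]


# Joint partitions of the running example, keyed by partition index
# (illustrative values, not taken from the figure).
parts1 = {1: [1, 2], 2: [3], 3: [4, 5], 4: [6], 5: [7], 6: [9]}
parts2 = {0: [0], 2: [3], 3: [4], 4: [6], 6: [8]}
surviving = second_prefilter(parts1, parts2)
print(len(surviving), "of", len(parts1) * len(parts2), "partition pairs remain")
```

The pair formed by the [1,2.66) partition of data stream 1 and the [8,9] partition of data stream 2 is dropped, because the largest value 2 is not greater than the smallest value 8.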

Step S4, calculating the average partition size of the data stream after the second pre-filtering, and re-partitioning the partitions whose partition size is larger than the average partition size, so as to avoid load imbalance and ensure balanced load as much as possible. Specifically, the method comprises the following steps:

the sizes of the partitions generated after the second pre-filtering may be different greatly, some partitions may be too large, and in order to balance the load, the too large partitions are disassembled, that is, the re-partitioning operation is performed, and the re-partitioning operation is a necessary operation of the present application.

In the present application, a partition whose partition size exceeds the average partition size is identified as an overloaded partition, which is subjected to a re-partitioning operation.

The average partition size is calculated as follows:

$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$

For example, if a data stream has 9 data elements and is divided into three partitions, each partition should contain 9/3 = 3 data elements under ideal load balance.

The following formula gives the number of re-partitions:

$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$

where repartition_number is the number of sub-partitions into which an overloaded partition is re-partitioned, overloaded_partition_size is the size of the overloaded partition, average_partition_size is the average partition size, and the upper brackets ⌈ ⌉ denote rounding up. For example, when the average partition size is 2 and there is a heavily overloaded partition of size 9, the number of re-partitions is ⌈9/2⌉ = ⌈4.5⌉ = 5, i.e. the partition is re-partitioned into 5 parts. Re-partitioning proceeds in the same way as before: the partition ranges are determined from the maximum and minimum values of the partition, but no joint partitioning is needed when re-partitioning.

In this embodiment, that is, the average partition size of the data stream 1 'and the average partition size of the data stream 2' are respectively calculated for the data stream 1 'and the data stream 2' after the second pre-filtering, and the overload partitions of the data stream 1 'and the data stream 2' are respectively re-partitioned.
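The two formulas and the re-partitioning of an overloaded partition can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the sample partition is invented to match the 9/2 example, and the handling of a partition whose values are all equal (left unsplit) is an assumption not stated in the application.

```python
import math


def average_partition_size(number_of_elements_in_stream, number_of_partitions):
    return number_of_elements_in_stream / number_of_partitions


def repartition_number(overloaded_partition_size, avg_size):
    return math.ceil(overloaded_partition_size / avg_size)


def repartition(partition, avg_size):
    """Split an overloaded partition into equal-width sub-partitions."""
    if len(partition) <= avg_size:
        return [partition]  # not overloaded
    n = repartition_number(len(partition), avg_size)
    lo, hi = min(partition), max(partition)
    span = (hi - lo) / n
    if span == 0:
        return [partition]  # assumption: identical values are left together
    buckets = [[] for _ in range(n)]
    for v in partition:
        idx = min(int((v - lo) / span), n - 1)  # the maximum goes in the last bucket
        buckets[idx].append(v)
    return [b for b in buckets if b]


print(average_partition_size(9, 3))
print(repartition_number(9, 2))
sub = repartition([1, 2, 3, 4, 5, 6, 7, 8, 9], avg_size=2)
print(sub)
```

Because the split is by value range rather than by element count, the resulting sub-partitions are only approximately balanced when the values are unevenly distributed.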

Step S5: and carrying out Cartesian product operation on the two data streams after the repartition, and obtaining a data stream theta connection result. The method specifically comprises the following steps:

after two prefiltering processes, most of the remaining Cartesian product operations are meaningful, i.e., an effective theta-join result can be obtained. The joint partitioning and re-partitioning further improves the algorithm efficiency. And finally, screening a data group meeting the theta condition from the Cartesian product result according to the theta condition, namely the final theta-join result.

For example, a [3.66,5.33) partition in the joint partition would divide 4 and 5 in data stream 1 into one region, divide 4 in data stream 2 into one region, and perform Cartesian product operations on both regions to yield a set of results satisfying the theta condition of 5>4, whereas a Cartesian product of 4 and 4 would not yield a valid theta-join result because 4 is not greater than 4.
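The final step for one surviving partition pair can be sketched as follows, using the values from the example above. The function name `join_partitions` is hypothetical.

```python
import operator
from itertools import product


def join_partitions(part1, part2, theta=operator.gt):
    """Cartesian product of two partitions, filtered by the theta condition."""
    return [(x, y) for x, y in product(part1, part2) if theta(x, y)]


result = join_partitions([4, 5], [4])
print(result)
```

Two Cartesian product operations are performed here, (4, 4) and (5, 4), and only (5, 4) satisfies the "greater than" condition.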

It is worth noting that, as can be seen from the histogram of Cartesian product counts in Fig. 2(f), the present application greatly reduces the number of Cartesian products that need to be performed compared with the other methods. In addition, since there are 27 data pairs in the data that satisfy the θ condition, any optimization requires at least 27 Cartesian product operations. The present application performs 29 Cartesian product operations, which is close to the optimal solution.

Fig. 3 is a diagram showing a hardware architecture of the data flow θ connection optimization system 10 according to the embodiment of the present application. The system comprises: a pre-filtering module 101, a joint partitioning module 102, a re-partitioning module 103, and a processing module 104.

The pre-filtering module 101 is configured to pre-filter two data streams for a first time according to a θ condition. Specifically, the method comprises the following steps:

The data volume participating in the partition is reduced due to the first pre-filtering, so that the subsequent partitions are more refined. Therefore, the first pre-filtering is more beneficial to subsequent filtering, and the number of unnecessary Cartesian products is reduced.

The joint partitioning module 102 is configured to calculate partition marks for the two data streams after the first prefiltering according to the partition number and the maximum and minimum values, respectively, combine the partition marks of the two data streams to obtain a joint partition mark, and perform joint partitioning according to the joint partition mark. The method specifically comprises the following steps:

In this embodiment, each of the two data streams is divided into three partitions. The maximum value and the minimum value of the remaining values in data stream 1 (vertical) are 9 and 1, so the span of each partition is 8/3 ≈ 2.66, and the partition ranges are [1,3.66), [3.66,6.33) and [6.33,9]. Similarly, the maximum and minimum values of data stream 2 (horizontal) are 8 and 0, respectively, so the partition span is 8/3 ≈ 2.66, and the partition ranges are [0,2.66), [2.66,5.33) and [5.33,8]. Because the first pre-filtering reduces the data volume in the data streams, the span of the partitions is correspondingly reduced and the partitions are finer.

As shown in Fig. 2, the partitions of data stream 1 are [1,3.66), [3.66,6.33) and [6.33,9]. The partitions of data stream 2 are [0,2.66), [2.66,5.33) and [5.33,8]. The joint partitions generated for data stream 1 and data stream 2 in this embodiment are [0,1), [1,2.66), [2.66,3.66), [3.66,5.33), [5.33,6.33), [6.33,8) and [8,9]. Taking the partitioning of data stream 1 by the joint partitions as an example, 1 and 2 in data stream 1 fall into one partition, 3 falls into one partition, 4 and 5 fall into one partition, and 6, 7 and 9 each fall into a partition of their own.

If the partition marks of the two data streams are not merged, then when θ-condition filtering is performed between the [2.66,5.33) partition in data stream 1 and the [3.66,6.33) partition in data stream 2, the pair cannot be filtered out: 5 in data stream 1 is greater than 4 in data stream 2, so the two partitions can produce a result pair that meets the θ condition. The two partitions are therefore passed in their entirety to the subsequent Cartesian product operation. However, 6 in data stream 2 is redundant in this case, because it cannot form a data pair satisfying the "greater than" condition with any value in the [2.66,5.33) partition of data stream 1. Partitioning the two data streams in isolation, rather than jointly, therefore impairs the subsequent filtering and results in more meaningless, time-consuming Cartesian product operations.

The pre-filtering module 101 is further configured to perform a second pre-filtering on the two jointly partitioned data streams according to the θ condition, and further filter all partitions in which a data group meeting the θ condition is unlikely to exist, so as to reduce cartesian product operations that need to be performed, so that most of the performed cartesian product operations are meaningful, and thereby improve the operation efficiency. Specifically, the method comprises the following steps:

in this embodiment, when the [1,2.66) partition in the data stream 1 performs the filtering with the [8,9] partition in the data stream 2, in which the θ condition is greater than (>), since the maximum value in the partition in the data stream 1 is also smaller than the minimum value in the data stream 2, there is no possibility that a data set satisfying the θ condition exists in the two partitions, and the two partitions do not need to perform the cartesian product operation, so that the number of cartesian product operations is reduced, and the operation efficiency is improved. Similarly, similar filtering is performed for the condition theta when the condition theta is other operators (less than, less than or equal to, and greater than or equal to), and the Cartesian product operation is not performed for the partitions which cannot generate the data set meeting the condition theta.

The re-partition module 103 is configured to calculate an average partition size of the data stream after the second pre-filtering, and re-partition partitions whose partition sizes are larger than the average partition size, so as to avoid load imbalance and ensure balanced load as much as possible. Specifically, the method comprises the following steps:

The average partition size and the number of re-partitions are calculated as follows:

$$\text{average\_partition\_size} = \frac{\text{number\_of\_elements\_in\_stream}}{\text{number\_of\_partitions}}$$

$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$

where repartition_number is the number of sub-partitions into which an overloaded partition is re-partitioned, overloaded_partition_size is the size of the overloaded partition, average_partition_size is the average partition size, and the upper brackets ⌈ ⌉ denote rounding up. For example, when the average partition size is 2 and there is a heavily overloaded partition of size 9, the number of re-partitions is ⌈9/2⌉ = ⌈4.5⌉ = 5, i.e. the partition is re-partitioned into 5 parts. Re-partitioning proceeds in the same way as before: the partition ranges are determined from the maximum and minimum values of the partition, but no joint partitioning is needed when re-partitioning.

The processing module 104 is configured to perform cartesian product operation on the two partitioned data streams to obtain a result of θ connection of the data streams. The method specifically comprises the following steps:

After the two pre-filtering operations, most of the remaining Cartesian product operations are meaningful, i.e., they yield valid theta-join results. The joint partitioning and the re-partitioning further improve the efficiency of the algorithm. Finally, the data pairs meeting the θ condition are selected from the Cartesian product results according to the θ condition, which gives the final theta-join result.

By means of more powerful data stream filtering and more reasonable and effective data stream partitioning, the number of Cartesian products to be carried out is reduced, and the efficiency of theta-join operation is improved, so that the method and the device can be more efficiently applied to big data application.

Please refer to fig. 4, which is a schematic diagram of a terminal structure according to an embodiment of the present application. The terminal 50 comprises a processor 51, a memory 52 coupled to the processor 51.

The memory 52 stores program instructions for implementing the above-described method for data stream θ connection optimization.

The processor 51 is configured to execute the program instructions stored in the memory 52 to control the data stream θ connection optimization.

The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip having signal processing capabilities. The processor 51 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

Fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present disclosure. The storage medium of the embodiment of the present application stores a program file 61 capable of implementing all the methods described above, where the program file 61 may be stored in the storage medium in the form of a software product, and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices such as a computer, a server, a mobile phone, and a tablet.

Although the present invention has been described with reference to the presently preferred embodiments, it will be understood by those skilled in the art that the foregoing description is illustrative only and is not intended to limit the scope of the invention, as claimed.

Claims

1. A data stream θ connection optimization method, comprising the following steps:

a. performing a first pre-filtering on the two data streams according to a θ condition, wherein the pre-filtering refers to filtering out partitions of the data streams that cannot satisfy the θ condition;

b. calculating partition marks for each of the two data streams after the first pre-filtering according to the partition number and the maximum and minimum values, combining the partition marks of the two data streams to obtain a joint partition mark, and performing joint partitioning according to the joint partition mark;

c. performing a second pre-filtering on the two jointly partitioned data streams according to the θ condition, wherein the second pre-filtering refers to filtering out partitions of the data streams that cannot satisfy the θ condition;

d. calculating the average partition size of each data stream after the second pre-filtering, and repartitioning the partitions whose size is larger than the average partition size;

e. performing a Cartesian product operation on the two repartitioned data streams to obtain a data stream θ connection result.

2. The method of claim 1, wherein step b comprises:

3. The method of claim 2, wherein the average partition size is calculated by the following equation:

4. The method of claim 3, wherein the number of repartitions is calculated by the following equation:

$$\text{repartition\_number} = \left\lceil \frac{\text{overloaded\_partition\_size}}{\text{average\_partition\_size}} \right\rceil$$

wherein repartition_number is the number of partitions into which an overloaded partition is repartitioned, overloaded_partition_size is the size of the overloaded partition, average_partition_size is the average partition size, and ⌈·⌉ denotes rounding up.
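
As an illustrative calculation only, assume that the average partition size is the arithmetic mean of the partition sizes (an assumption, since the equation of claim 3 is not reproduced here) and take hypothetical partition sizes of 1, 2, 3 and 14:

$$
\text{average\_partition\_size} = \frac{1 + 2 + 3 + 14}{4} = 5, \qquad \text{repartition\_number} = \left\lceil \frac{14}{5} \right\rceil = 3
$$

Under these assumptions, only the partition of size 14 exceeds the average, so it would be split into 3 sub-partitions while the other three partitions are left unchanged.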

5. A data stream θ connection optimization system, comprising a pre-filtering module, a joint partitioning module, a repartitioning module, and a processing module, wherein:

the pre-filtering module is used for pre-filtering two data streams for the first time according to a theta condition; the pre-filtering refers to filtering out partitions of the data stream which cannot meet a theta condition;

the joint partitioning module is used for respectively calculating partition marks of the two data streams after the first pre-filtering according to the partition number and the maximum and minimum values, combining the partition marks of the two data streams to obtain a joint partition mark, and performing joint partitioning according to the joint partition mark;

the pre-filtering module is also used for carrying out secondary pre-filtering on the two data streams subjected to the joint partition according to a theta condition;

the re-partition module is used for respectively calculating the average partition size of the data stream after the second pre-filtering and re-partitioning the partitions of which the partition size is larger than the average partition size;

and the processing module is used for performing a Cartesian product operation on the two repartitioned data streams to obtain a data stream θ connection result.

6. The system of claim 5, wherein the joint partitioning module is specifically configured to:

after the partition marks of the two data streams are calculated respectively, combine the partition marks calculated for the two data streams to obtain a joint partition mark, and use the joint partition mark to partition each of the two data streams more finely.

7. The system of claim 6, wherein the average partition size is calculated by the following equation:

8. The system of claim 7, wherein the number of repartitions is calculated by the following equation:

9. A terminal comprising a processor, a memory coupled to the processor, wherein,

10. A storage medium having stored thereon program instructions executable by a processor to perform the data stream θ connection optimization method of any one of claims 1 to 4.