CN113032446B

CN113032446B - Data processing method and device of distributed query system

Info

Publication number: CN113032446B
Application number: CN201911351500.9A
Authority: CN
Inventors: 李韬
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2024-07-09
Anticipated expiration: 2039-12-24
Also published as: CN113032446A; WO2021129498A1

Abstract

An embodiment of the application provides a data processing method and device for a distributed query system. The method obtains first data from a first data source and distribution information of second data in a second data source, configures the first data in each first node according to the distribution information of the second data, determines the data range of the first data in each first node, and extracts the second data from a plurality of second nodes in parallel according to the data range. Because the first data is configured in each first node according to the distribution information of the second data, the distribution of the first data on the first nodes is consistent with the distribution of the second data on the second nodes, so the second data does not need to be redistributed within the system. Because the second data is extracted according to the data range, full extraction of the data is avoided and the running efficiency of the system is improved.

Description

Data processing method and device of distributed query system

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method of a distributed query system and a data processing device of the distributed query system.

Background

At present, cross-database query provides real-time join query services for online heterogeneous data sources in different environments. Whether the databases are MySQL, SQL Server, PostgreSQL, or Redis, a join query across them can be accomplished with a single SQL statement, regardless of the Region in which each database instance is deployed.

The variety of data sources accessed by the cross-database query service keeps increasing. When data must be fetched from these sources for a Join, reading it over a single JDBC (Java Database Connectivity) connection consumes a great deal of time. In addition, the Join operation requires a Shuffle operation on the data, which incurs a large amount of network overhead; the Join can run very slowly, and an excessive data volume may even cause a memory overflow.

Disclosure of Invention

The technical problem to be solved by the embodiment of the application is to provide a data query method of a distributed query system, so as to solve the problem of low system performance caused by full-quantity reading and Shuffle of large-table data in the data processing process in the prior art.

Correspondingly, the embodiment of the application also provides a data query device of the distributed query system, which is used for ensuring the realization and the application of the method.

In order to solve the above problems, the present application discloses a data processing method of a distributed query system, where the distributed query system includes a plurality of first nodes and is communicatively connected to at least two data sources, and the method includes:

Acquiring first data of a first data source and acquiring distribution information of second data in a second data source;

according to the distribution information of the second data, the first data are configured in a plurality of first nodes;

Determining a data range of the first data in the first node;

and extracting the second data from a plurality of second nodes of the second data source in parallel according to the data range.

Optionally, the extracting the second data from the plurality of second nodes of the second data source in parallel according to the data range includes:

Determining Join connection conditions between the first data and the second data;

And extracting second data corresponding to the data range from the plurality of second nodes in parallel according to the Join connection condition.

Optionally, the configuring the first data in a plurality of the first nodes according to the distribution information of the second data includes:

respectively calculating the distribution information to generate a first configuration identifier aiming at the first data;

respectively adopting the first configuration identifier and the Join connection condition to determine a second configuration identifier aiming at the first node;

according to the sequence of the second configuration identifier, the first data are respectively configured in the first nodes corresponding to the second configuration identifier;

The distribution information comprises at least one of a slicing strategy, a partitioning strategy, and a bucketing strategy.

Optionally, the first data is small table data, and the configuring the first data in the first nodes corresponding to the second configuration identifiers according to the sequence of the second configuration identifiers includes:

Splitting the small table data into first sub-table data corresponding to each second configuration identifier;

Determining target nodes corresponding to the second configuration identifiers from the first nodes respectively;

and configuring the first sub-table data in the target node according to the sequence of the second configuration identifier.

Optionally, the data range includes a value range corresponding to each of the first sub-table data, the second data is large-table data, the large-table data includes second sub-table data corresponding to each of the second nodes, and the acquiring, in parallel, second data corresponding to the data range from the plurality of second nodes according to the Join connection condition includes:

And acquiring second sub-table data corresponding to the value range from the plurality of second nodes in parallel by adopting the Join connection condition.

Optionally, the calculating the distribution information respectively to generate a first configuration identifier for the first data includes:

acquiring first quantity information of the first node and second quantity information of the second node;

when the second quantity information is smaller than or equal to the first quantity information, calculating a third configuration identifier aiming at the first data by adopting the first quantity information and the distribution information;

And when the second quantity information is larger than the first quantity information, calculating a fourth configuration identifier aiming at the first data by adopting the second quantity information and the distribution information.

Optionally, the method further comprises:

Respectively connecting the first sub-table data and the second sub-table data in each first node to generate a node connection result;

And combining node connection results corresponding to the first nodes to generate query results aiming at the small table data and the large table data.

The embodiment of the application also discloses a data query method of the distributed query system, wherein the distributed query system is in communication connection with at least one second data source; the distributed query system comprises a plurality of first nodes, wherein the first nodes comprise first data configured by adopting the distribution information of the second data sources; the second data source includes a plurality of second nodes; the method comprises the following steps:

Determining a data range of the first data in the first node;

The plurality of first nodes send data query requests containing the data range to the plurality of second nodes in parallel;

And connecting second data corresponding to the data query requests sent by the plurality of second nodes with the first data to generate query results for the first data and the second data.

Optionally, the plurality of first nodes send data query requests including the data range to the plurality of second nodes in parallel, including:

And the plurality of first nodes send the data query requests containing the data range to the plurality of second nodes in parallel according to the Join connection condition.

Optionally, the method further comprises:

Acquiring distribution information of second data in the second data source;

and configuring the first data in a plurality of first nodes according to the distribution information of the second data.

Optionally, the data range includes a value range corresponding to each of the first sub-table data, the second data is large-table data, the large-table data includes second sub-table data corresponding to each of the second nodes, and the plurality of first nodes send, in parallel, a data query request including the data range to the plurality of second nodes according to the Join connection condition, including:

Optionally, the connecting the second data corresponding to the data query request sent by the second node with the first data, and generating a query result for the first data and the second data includes:

The embodiment of the application also discloses a data processing device of the distributed query system, where the distributed query system comprises a plurality of first nodes and is communicatively connected to at least two data sources, and the device comprises:

The data and information acquisition module is used for acquiring first data of the first data source and acquiring distribution information of second data in the second data source;

the data configuration module is used for configuring the first data in a plurality of first nodes according to the distribution information of the second data;

A data range determining module, configured to determine a data range of the first data in the first node;

And the data extraction module is used for extracting the second data from a plurality of second nodes of the second data source in parallel according to the data range.

Optionally, the data extraction module includes:

A connection condition determination submodule for determining Join connection conditions between the first data and the second data;

and the data extraction sub-module is used for extracting second data corresponding to the data range from the plurality of second nodes in parallel according to the Join connection condition.

Optionally, the data configuration module includes:

The first identifier generation sub-module is used for respectively calculating the distribution information and generating a first configuration identifier aiming at the first data;

A second identifier generating sub-module, configured to determine a second configuration identifier for the first node by using the first configuration identifier and the Join connection condition respectively;

The data configuration sub-module is used for respectively configuring the first data in the first nodes corresponding to the second configuration identifiers according to the sequences of the second configuration identifiers;

Optionally, the first data is small table data, and the data configuration submodule is specifically configured to:

Optionally, the data range includes a value range corresponding to each of the first sub-table data, the second data is large table data, the large table data includes second sub-table data corresponding to each of the second nodes, and the data extraction sub-module is specifically configured to:

Optionally, the first identifier generation submodule is specifically configured to:

Optionally, the device further comprises:

The sub-table data operation module is used for respectively connecting the first sub-table data and the second sub-table data in each first node to generate a node connection result;

And the connection result combination module is used for combining node connection results corresponding to the first nodes to generate query results aiming at the small table data and the large table data.

The embodiment of the application also discloses a data query device of the distributed query system, which is in communication connection with at least one second data source; the distributed query system comprises a plurality of first nodes, wherein the first nodes comprise first data configured by adopting the distribution information of the second data sources; the second data source includes a plurality of second nodes; the device comprises:

The query request sending module is used for sending the data query requests containing the data range to the plurality of second nodes in parallel by the plurality of first nodes;

And the query result generation module is used for connecting second data corresponding to the data query requests sent by the plurality of second nodes with the first data and generating query results aiming at the first data and the second data.

Optionally, the query request sending module includes:

And the query request sending sub-module is used for a plurality of first nodes to send the data query requests containing the data range to a plurality of second nodes in parallel according to the Join connection condition.

Optionally, the device further comprises:

the distribution information acquisition module is used for acquiring the distribution information of the second data in the second data source;

And the data configuration module is used for configuring the first data in a plurality of first nodes according to the distribution information of the second data.

Optionally, the data configuration module includes:

Optionally, the data range includes a value range corresponding to each first sub-table data, the second data is large table data, the large table data includes second sub-table data corresponding to each second node, and the query request sending submodule is specifically configured to:

Optionally, the query result generation module includes:

The node connection sub-module is used for respectively connecting the first sub-table data and the second sub-table data in each first node to generate a node connection result;

and the system connection sub-module is used for combining the node connection results corresponding to the first nodes to generate query results aiming at the small table data and the large table data.

The embodiment of the application also discloses a device, which comprises:

One or more processors; and

One or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform one or more methods as described above.

One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause the processors to perform one or more of the methods described above are also disclosed.

The embodiment of the application has the following advantages:

In the embodiment of the application, the first data of the first data source and the distribution information of the second data in the second data source are acquired, the first data is configured in each first node according to the distribution information of the second data, the data range of the first data in each first node is determined, and the second data is extracted from a plurality of second nodes in parallel according to the data range. Because the first data is configured in each first node according to the distribution information of the second data, the distribution of the first data on the first nodes is consistent with the distribution of the second data on the second nodes, so the second data does not need to be redistributed within the system. Because the second data is extracted according to the data range, full extraction of the data is avoided and the running efficiency of the system is improved.

Drawings

FIG. 1 is a flowchart illustrating steps of a first embodiment of a data processing method of a distributed query system according to the present application;

FIG. 2 is a flowchart illustrating steps of a second embodiment of a data processing method of a distributed query system according to the present application;

FIG. 3 is a schematic diagram of cross-source data processing in an embodiment of the application;

FIG. 4 is a flow chart of steps of a third embodiment of a data query method of a distributed query system according to the present application;

FIG. 5 is a block diagram illustrating a first embodiment of a data processing apparatus of a distributed query system according to the present application;

FIG. 6 is a block diagram of a second embodiment of a data query device of a distributed query system according to the present application.

Detailed Description

In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description.

Referring to FIG. 1, a flowchart of the steps of a first embodiment of a data processing method of a distributed query system of the present application is shown, the distributed query system including a plurality of first nodes and being communicatively connected to at least two data sources.

With the development of big data technology, the variety of data sources is more and more, and the demands of users for data acquisition, analysis, prediction and the like are increasing. Moreover, the user often needs to connect, combine, analyze, etc. different types of data and different data amounts of data, so as to complete corresponding business services according to the processing result of the data.

As an example, a user may own a first data source instance, such as an RDS (Relational Database Service) MySQL database instance, and a second data source instance, such as an ADB (Analytic Database) instance. The former may support the user's online service system, may be a stand-alone deployment, and stores a relatively small amount of first data. The latter may be used for offline decision analysis; the second data source may be a distributed database that is deployed on multiple machines and stores a large amount of data, and the data on the ADB is typically distributed across the second nodes by slices, partitions, buckets, and the like. When the user needs to join offline and online data, this can be done by a distributed query system.

Specifically, the distributed query system (hereinafter, the query system) may include a plurality of first nodes (Node 1, Node 2, ..., Node N), where N is the number of nodes in the distributed query system, and the query system may be communicatively connected to at least one first data source and at least one second data source by JDBC, ODBC (Open Database Connectivity), or the like.

In a conventional approach, the query system is communicatively connected to the first data source and the second data source, and reads the first data on the first data source and the second data on the second data source over a single JDBC (or ODBC) connection. The query system then performs a Shuffle operation on the first data and the second data so that data with the same Join Key is placed on the same computing node, and each computing node runs a stand-alone Join algorithm, such as Hash Join or Sort Merge Join, to produce the Join result. However, in cross-source data processing the second data is large, so reading it over a single JDBC connection takes a long time, and the query system is still reading the second data after the task of reading the smaller first data has completed. The query system must also perform a Shuffle operation on the second data, which incurs huge network overhead, further increases the data processing time, and greatly reduces the data processing efficiency. The data processing method of the distributed query system in the embodiment of the application can effectively solve these problems, and may specifically include the following steps:

step 101, acquiring first data of a first data source and acquiring distribution information of second data in a second data source;

In the embodiment of the application, the query system can acquire the first data from the first data source and acquire the distribution information of the second data in the second data source.

In a specific implementation, the first data may be small table data (hereinafter, the small table) or an intermediate query result, and the query system may obtain the first data from the first data source. The second data may be large table data (hereinafter, the large table), and the large table may be distributed in a specific manner among a plurality of second nodes of the second data source, for example by horizontal slicing, bucketing, or partitioning, so that the large table is spread over a plurality of machines (Server 1, Server 2, ..., Server M) in the second data source, where M is the number of machines of the second data source. Because the second data is distributed in the second data source, corresponding distribution information can be obtained, and the distribution information may be at least one of a slicing strategy, a bucketing strategy, and a partitioning strategy.

Step 102, configuring the first data in a plurality of first nodes according to the distribution information of the second data;

In a specific implementation, after the query system obtains the distribution information of the large table, a Shuffle strategy for the small table can be generated according to the distribution information, and the small table is configured in a plurality of first nodes of the query system by adopting the Shuffle strategy, so that the Shuffle operation is performed on the small table according to the distribution mode of the large table in the second data source, the distribution mode of the small table in the query system is consistent with the distribution mode of the large table in the second data source, the large table is prevented from being subjected to the Shuffle operation in the process of obtaining the large table by the query system, the data processing time is shortened, and the data processing efficiency is improved.
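
As a minimal sketch of this idea, assume the large table is hash-distributed on its join column over M second nodes. Placing the small table with the same function means rows with equal join keys share an index, so the large table never has to move. The names `bucket_of` and `shuffle_small_table`, the CRC32-based hash, and the sample rows are all hypothetical; a real system would have to reuse whatever function the second data source actually applies.

```python
import zlib
from collections import defaultdict


def bucket_of(value, num_buckets):
    """Map a join-key value to a bucket index (assumed hash: CRC32 of its text form)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def shuffle_small_table(rows, key, num_buckets):
    """Group small-table rows by the bucket their join key falls into."""
    placement = defaultdict(list)
    for row in rows:
        placement[bucket_of(row[key], num_buckets)].append(row)
    return dict(placement)


# Hypothetical small table T1 with join column C1; the large table is
# assumed to be spread over 3 second nodes by bucket_of(C2, 3).
small_table = [{"C1": k, "name": f"user{k}"} for k in (1, 2, 3, 4, 5, 6)]
placement = shuffle_small_table(small_table, "C1", 3)
```

Every row in `placement[i]` has a join key that `bucket_of` maps to `i`, which is the same index the assumed large-table distribution would give to matching rows.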

Step 103, determining a data range of the first data in the first node;

in a specific implementation, after the distribution information of the large table is adopted to configure the small table in each first node, each first node can acquire the data range of the local small table. The data range may be a data column value range of a small table in the node, and/or a data row value range, etc.
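
One simple reading of "data range" is the minimum and maximum join-key value held locally. The sketch below assumes that reading; `data_range` and the sample rows are hypothetical names and data, not part of the described system.

```python
def data_range(rows, key):
    """Return (min, max) of the join-key column held locally, or None if empty."""
    values = [row[key] for row in rows if row[key] is not None]
    if not values:
        return None
    return min(values), max(values)


# Hypothetical sub-table held by one first node; NULL keys are skipped.
local_rows = [{"C1": 17}, {"C1": 4}, {"C1": 29}, {"C1": None}]
local_range = data_range(local_rows, "C1")
```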

Step 104, extracting the second data from a plurality of second nodes of the second data source in parallel according to the data range.

In a specific implementation, each first node in the query system can generate a filtering condition for the large table according to the data range, and then each node extracts the data of the large table from a plurality of second nodes in the second data source in parallel according to the filtering condition.
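
A filtering condition of this kind could, for instance, be a range predicate on the large table's join column. The sketch below assumes a (min, max) data range and uses an in-memory SQLite database purely as a stand-in for one second node; the helper `build_filter_query`, the table name `T2`, and the sample rows are hypothetical.

```python
import sqlite3


def build_filter_query(table, column, value_range):
    """Build a parameterized range query; identifiers are assumed to be trusted."""
    low, high = value_range
    sql = f"SELECT * FROM {table} WHERE {column} BETWEEN ? AND ?"
    return sql, (low, high)


# An in-memory SQLite database stands in for one second node here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T2 (C2 INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO T2 VALUES (?, ?)",
    [(1, 10), (5, 50), (20, 200), (40, 400)],
)

# Only rows whose C2 lies inside the first node's data range are fetched.
sql, params = build_filter_query("T2", "C2", (4, 29))
fetched = conn.execute(sql, params).fetchall()
conn.close()
```

Pushing the predicate down to the data source is what keeps the first node from reading the whole large table.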

Specifically, the data of the small table has already been read into the query system and shuffled there according to the distribution of the large table in the second data source. Therefore, when a given portion of the small table is configured at a particular first node of the query system, the matching portion of the large table is held by the corresponding machine of the second data source, and each first node can read only the large-table data that satisfies its filtering condition from the corresponding second node. This avoids a Shuffle operation on the large-table data during data processing, further reduces the data processing time, and improves the data processing efficiency.

In the embodiment of the application, the first data of the first data source and the distribution information of the second data in the second data source are acquired, the first data is configured in each first node according to the distribution information of the second data, the data range of the first data in each first node is determined, and the second data is extracted from a plurality of second nodes in parallel according to the data range. Because the first data is configured in each first node according to the distribution information of the second data, the distribution of the first data on the first nodes is consistent with the distribution of the second data on the second nodes, so the second data does not need to be redistributed within the system and data skew during data processing is greatly relieved. Because the second data is extracted according to the data range, full extraction of the data is avoided and the running efficiency of the system is improved.

Referring to FIG. 2, a flowchart of the steps of a second embodiment of a data processing method of a distributed query system according to the present application is shown. The distributed query system is communicatively connected to at least two data sources, and the method may specifically include the following steps:

Step 201, obtaining first data of a first data source and obtaining distribution information of second data in a second data source;

in a specific implementation, the query system may acquire the small table data T1 or a certain intermediate query result from the first data source S1, and acquire the distribution information of the large table data T2 in the second data source S2, where the distribution information may include at least one of a slicing policy, a partitioning policy, and a bucket policy.

Step 202, configuring the first data in a plurality of first nodes according to the distribution information of the second data;

In an alternative embodiment of the present application, step 202 may comprise the following sub-steps:

step S11, respectively calculating the distribution information to generate a first configuration identifier aiming at the first data;

In a specific implementation, when performing data processing, the user may input a Join connection condition into the query system, for example T1.C1 = T2.C2, where T1 represents the small table in the first data source or an intermediate query result, T2 represents the large table in the second data source, C1 may be the second configuration identifier, such as the Join Key, and C2 may be the first configuration identifier, such as a slicing policy, a partitioning policy, or a bucketing policy.

After the query system obtains the distribution information of the large table, the distribution information can be calculated to generate a first configuration identifier of the small table. Specifically, the query system may obtain the first number of information of the first node and the second number of information of the second node; when the second quantity information is smaller than or equal to the first quantity information, calculating a first configuration identifier aiming at the first data by adopting the first quantity information and the distribution information; and when the second quantity information is larger than the first quantity information, calculating a first configuration identifier aiming at the first data by adopting the second quantity information and the distribution information.

In an example of the embodiment of the present application, if the number of first nodes in the query system is N and the number of second nodes in the second data source is M, then after the query system obtains the distribution information of the large table, it may use the distribution information and the numerical relationship between the first nodes and the second nodes to calculate the first configuration identifier. Specifically, the modulus is chosen from the first quantity information N and the second quantity information M. When the number of first nodes is greater than the number of second nodes, M is taken: a Hash operation is performed on the data in a certain column of the large table, followed by a modulo operation, that is, Hash(C2) Mod M. When the number of first nodes is smaller than the number of second nodes, N is taken: a Hash operation is performed on the data in that column, followed by a modulo operation, that is, Hash(C2) Mod N. The configuration identifier for the small table is thereby obtained.

It should be noted that the embodiments of the present application include, but are not limited to, the above examples. It is understood that, under the guidance of the idea of the embodiments of the present application, those skilled in the art may also use a Range operation to calculate the configuration identifier, which is not limited in the present application.
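
The Hash-then-Mod example above can be sketched as follows. This is only an illustration under stated assumptions: `choose_modulus` and `config_id` are hypothetical names, CRC32 is a stand-in for whatever hash the second data source really uses, and the values of N and M are made up.

```python
import zlib


def choose_modulus(num_first_nodes, num_second_nodes):
    """Follow the example above: take M when N > M, otherwise take N."""
    return min(num_first_nodes, num_second_nodes)


def config_id(value, modulus):
    """Hash(value) Mod modulus, with CRC32 as a stand-in hash function."""
    return zlib.crc32(str(value).encode("utf-8")) % modulus


n_first, m_second = 8, 4  # hypothetical N and M
modulus = choose_modulus(n_first, m_second)
ids = [config_id(v, modulus) for v in (101, 102, 103)]
```

Each identifier falls in the range 0 to `modulus - 1`, so it can index either a first node or a second node whichever side has fewer of them.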

Step S12, determining a second configuration identifier aiming at the first node by adopting the first configuration identifier and the Join connection condition respectively;

In a specific implementation, since the Join connection condition is T1.C1 = T2.C2, the user may write the Join Key, that is, C1, in the input query statement, and the query system may then determine the second configuration identifier for the first node by using the first configuration identifier C2 and the Join connection condition.

Specifically, according to the Join connection condition, when the first quantity information N of the first nodes in the query system is greater than the second quantity information M of the second nodes in the second data source, Hash(C2) Mod M corresponds to Hash(C1) Mod M; when N is smaller than M, Hash(C2) Mod N corresponds to Hash(C1) Mod N, so that the distribution of the first data in the query system is consistent with the distribution of the second data in the second data source.

Step S13, respectively configuring the first data in the first nodes corresponding to the second configuration identifiers according to the sequence of the second configuration identifiers.

In a specific implementation, the query system may configure the small table or a certain intermediate query result in the node corresponding to C1 according to the sequence of the second configuration identifier C1, so that the first data is redistributed in each node of the query system according to the value of the column C1, and the distribution of the first data is kept consistent with the distribution of the second data in the second data source.

Specifically, the query system may split the small table data into first sub-table data corresponding to each second configuration identifier, then determine, from the first nodes, the target nodes corresponding to each second configuration identifier, and then configure each first sub-table data in the target nodes according to the sequence of the second configuration identifiers.
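
The split-and-assign step might look like the sketch below. It assumes the identifiers are sortable and pairs them with first nodes in ascending order; the round-robin fallback when there are fewer nodes than identifiers is an assumption of this sketch, and `split_by_identifier`, `assign_targets`, and the node names are hypothetical.

```python
from collections import defaultdict


def split_by_identifier(rows, identifier_of):
    """Split the small table into sub-tables keyed by configuration identifier."""
    sub_tables = defaultdict(list)
    for row in rows:
        sub_tables[identifier_of(row)].append(row)
    return dict(sub_tables)


def assign_targets(sub_tables, first_nodes):
    """Pair identifiers, in ascending order, with first nodes (round-robin if fewer nodes)."""
    return {
        ident: first_nodes[i % len(first_nodes)]
        for i, ident in enumerate(sorted(sub_tables))
    }


# Hypothetical data: the identifier is simply C1 mod 3 for readability.
rows = [{"C1": k} for k in range(7)]
sub_tables = split_by_identifier(rows, lambda r: r["C1"] % 3)
targets = assign_targets(sub_tables, ["Node 1", "Node 2", "Node 3"])
```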

Step 203, determining a data range of the first data in the first node;

In a specific implementation, after the distribution information of the large table is adopted to configure the small table or the intermediate query result in each first node, each first node can acquire the data range of the local small table or the intermediate query result. The data range may be a data column value range of a small table or an intermediate query result in a node, and/or a data row value range.

Step 204, extracting the second data from a plurality of second nodes of the second data source in parallel according to the data range;

In a specific implementation, the query system may determine the Join connection condition between the first data and the second data, and then, according to the Join connection condition, extract in parallel the second data corresponding to the data range from the plurality of second nodes.

Specifically, the data of the small table or the intermediate query result has been read into the query system, and a Shuffle operation has been performed on it in the query system according to the distribution mode of the large table in the second data source. Thus, when data of the small table or intermediate query result is configured on node one of the query system, the corresponding data of the large table is located on machine one of the second data source.

The data range may include a value range corresponding to each first sub-table data. The large table is divided by sharding, partitioning, bucketing, or the like into a plurality of second sub-table data distributed across a plurality of second nodes of the second data source. The query system may use the Join connection condition to extract, in parallel, the second sub-table data corresponding to the value range from the second nodes corresponding to the first nodes, so that each first node reads from its corresponding second node only the large-table data that satisfies the filtering condition. On the one hand, this avoids the time spent performing a Shuffle operation on the large-table data during data processing, which improves data processing efficiency; on the other hand, it avoids full extraction of the second data, which improves extraction efficiency.
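The parallel, range-filtered extraction can be sketched in Python as follows. In-memory lists stand in for the second nodes, and `fetch_from_second_node` and `extract_in_parallel` are hypothetical names; a real system would issue filtered remote queries instead of scanning local lists.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_from_second_node(shard_rows, column, low, high):
    """Stand-in for a filtered remote query: keep rows with low <= value <= high."""
    return [row for row in shard_rows if low <= row[column] <= high]


def extract_in_parallel(second_nodes, column, low, high):
    """Query every second node concurrently and concatenate the filtered rows."""
    workers = max(1, len(second_nodes))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(fetch_from_second_node, rows, column, low, high)
            for rows in second_nodes.values()
        ]
        # Results are collected in submission order, one list per second node.
        return [row for future in futures for row in future.result()]


second_nodes = {
    "server_1": [{"C2": 10}, {"C2": 40}],
    "server_2": [{"C2": 11}, {"C2": 90}],
}
extracted = extract_in_parallel(second_nodes, "C2", 10, 12)
```

Only the rows inside the requested range (C2 = 10 and C2 = 11 here) are returned, so the rest of the large table never leaves the second data source.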

Step 205, connecting the first data with the second data, and generating a connection result for the first data and the second data.

In the embodiment of the application, once a first node in the query system holds both the first data and the second data, a local Join operation can be performed on the first data and the second data, thereby generating a connection result for the first data and the second data.

In a specific implementation, when the first data is a small table, the query system may connect the first sub-table data and the second sub-table data in each first node respectively to generate a node connection result, and then combine the node connection results corresponding to each first node to generate query results for the small table data and the large table data. It should be noted that, when the first data is the intermediate query result, the Join operation process is the same as the Join operation process of the small table, and will not be described herein.

In one example of an embodiment of the present application, fig. 3 shows a schematic diagram of cross-source data processing, where the query system includes N nodes (Node 1, Node 2, ..., Node N) and the second data source includes a plurality of machine servers. The T1 table may be small table data from the first data source or an intermediate query result (a small table is used as the example below). The T2 table may be horizontally sharded (or partitioned, bucketed, etc.), for example by Hash or Range, so that the T2 table data is distributed across multiple machines of the second data source, Server1, Server2, ..., ServerM, where M is the number of machines of the second data source. In addition, the present embodiment considers the scenario in which the Join Key is the same as the key used by the sharding policy of the T2 table data and the Join condition is an Equi-Join, that is, the Join connection condition is T1.C1 = T2.C2.

First, the query system may obtain the T1 table from the first data source by means of JDBC, ODBC, etc., and obtain the Sharding mode (i.e., distribution information) of the T2 table from the second data source. It then generates a Shuffle strategy for the T1 table in the query system according to the Sharding mode of the T2 table, and redistributes the data of the T1 table to the N nodes of the query system according to the Shuffle strategy and the values of the C1 column, so that the distribution of the C1 column data is consistent with the distribution of the C2 column data in the second data source. For example, if data with C2 = 10 is distributed on Server1 of the second data source, then data with C1 = 10 should also be distributed on Node1 of the cross-source query engine.

Then, each node Node_i of the query system can generate a filtering condition on the C2 column according to the value range of its local C1 column, and then use the filtering condition to extract data of the T2 table from the second data source. Specifically, a node Node_i configured with T1 table data can initiate M query requests to the second data source in parallel, so that each machine Server that receives a query request returns the table data matching the filtering condition. Each node Node_i acquires only a part of the data of the T2 table, and the data acquired by all nodes are combined to form the complete data of T2. This avoids performing a Shuffle operation on the large-table data during data processing, which shortens data processing time and improves data processing efficiency; it also avoids full extraction of the second data, which improves extraction efficiency.
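One possible way to turn the local C1 values into a filtering condition on C2 is sketched below in Python. The function `build_filter_query` is hypothetical, the `IN (...)` form is just one choice (a `BETWEEN` range would also fit the description), and an in-memory SQLite table stands in for a single Server of the second data source.

```python
import sqlite3


def build_filter_query(table, column, local_values):
    """Build a parameterized query selecting rows whose column matches local values.

    The table and column names are assumed to be trusted identifiers.
    """
    distinct = sorted(set(local_values))
    if not distinct:
        raise ValueError("no local values to filter on")
    placeholders = ", ".join("?" for _ in distinct)
    sql = f"SELECT * FROM {table} WHERE {column} IN ({placeholders})"
    return sql, distinct


# An in-memory table plays the role of one shard of T2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T2 (C2 INTEGER, payload TEXT)")
conn.executemany("INSERT INTO T2 VALUES (?, ?)", [(10, "x"), (20, "y"), (30, "z")])

# Local C1 values on this node: 30, 10, 10 (duplicates are collapsed).
sql, params = build_filter_query("T2", "C2", [30, 10, 10])
rows = conn.execute(sql, params).fetchall()
```

Passing the values as bound parameters rather than formatting them into the SQL text keeps the generated filter safe even when the join keys are strings.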

Each node Node_i of the query system then performs a local Join operation on its local part of the T1 table data and its local part of the T2 table data, using an algorithm such as Hash Join or Sort Merge Join, and all nodes execute their local Joins in parallel. Because the Shuffle strategy of the T1 table is consistent with the Sharding strategy of the T2 table, rows of the two tables with the same Key are located on the same node, so only a local Join is needed and the large table T2 does not need to be shuffled. After the Join operation on each node is finished, the local Join results of all nodes can be combined to generate the complete Join result. In this way, data skew during data processing is greatly relieved, the second data is extracted through the filtering condition, full extraction of the data is avoided, and system operation efficiency is improved.
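The local Join and the final merge can be illustrated with a small Python hash join. The function names and the row layout are assumptions for the example, and the nodes are processed one after another here for simplicity, whereas the text describes them as running in parallel.

```python
def local_hash_join(t1_rows, t2_rows, left_key="C1", right_key="C2"):
    """Equi-join two local row lists: build a hash table on T1, probe with T2."""
    build = {}
    for row in t1_rows:
        build.setdefault(row[left_key], []).append(row)
    joined = []
    for row in t2_rows:
        for match in build.get(row[right_key], []):
            joined.append({**match, **row})
    return joined


def merge_node_results(per_node_inputs):
    """Run the local join for each node's (T1 part, T2 part) and concatenate."""
    result = []
    for t1_part, t2_part in per_node_inputs:
        result.extend(local_hash_join(t1_part, t2_part))
    return result


node_1 = ([{"C1": 10, "a": "x"}], [{"C2": 10, "b": "p"}, {"C2": 12, "b": "q"}])
node_2 = ([{"C1": 11, "a": "y"}], [{"C2": 11, "b": "r"}])
full_result = merge_node_results([node_1, node_2])
```

Building the hash table on the smaller T1 part keeps memory use low, and since matching keys are already co-located, no row has to cross nodes before the merge.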

Referring to FIG. 4, there is shown a flow chart of the steps of a third embodiment of a data query method of a distributed query system of the present application, the distributed query system being communicatively coupled to at least one second data source; the distributed query system comprises a plurality of first nodes, wherein the first nodes comprise first data configured by adopting the distribution information of the second data sources; the second data source includes a plurality of second nodes; the method may specifically include the following steps:

Step 401, determining a data range of the first data in the first node;

In an embodiment of the application, the distributed query system comprises a plurality of first nodes, wherein the first nodes comprise first data configured by adopting the distribution information of the second data sources. When the first data is not yet configured in the first nodes, the distribution information of the second data in the second data source can be acquired, and the first data is then configured in the plurality of first nodes. The first data may be small table data or an intermediate query result, and the second data may be large table data; for convenience of description and understanding, a small table and a large table are used as the example below.

In an example of the embodiment of the present application, when the first nodes have stored therein a small table configured with the distribution information of the second data source, each first node may determine a data range of the local small table and generate a filtering condition for the large table.

In another example of the embodiment of the present application, when the first node has stored therein the small table and the distribution information of the second data source, then the distribution information may be used to generate a Shuffle policy for the small table, and then the small table is configured in the plurality of first nodes according to the Shuffle policy.

In another example of the embodiment of the present application, when a small table is not stored in a first node, the query system may first obtain distribution information of a large table in a second data source, then calculate each piece of the distribution information to generate a first configuration identifier for the small table, then determine a second configuration identifier for the first node by using the first configuration identifier and the Join connection condition, and then configure the small table in the first node corresponding to the second configuration identifier according to the sequence of the second configuration identifier; the distribution information comprises at least one of a sharding policy, a partitioning policy, and a bucketing policy.

In a specific implementation, the query system may obtain first number information of the first node and second number information of the second node; when the second quantity information is smaller than or equal to the first quantity information, calculating a third configuration identifier aiming at the first data by adopting the first quantity information and the distribution information; and when the second quantity information is larger than the first quantity information, calculating a fourth configuration identifier aiming at the first data by adopting the second quantity information and the distribution information.

Specifically, the query system may split the small table data into first sub-table data corresponding to each of the second configuration identifiers; determining target nodes corresponding to the second configuration identifiers from the first nodes respectively; and configuring the first sub-table data in the target node according to the sequence of the second configuration identifier.

When the first data is configured in the plurality of first nodes, each first node may determine a data range of the local first data and generate a filtering condition for the large table.

Step 402, a plurality of the first nodes send data query requests containing the data range to a plurality of the second nodes in parallel;

In a specific implementation, after first data is configured on a first node by using distribution information of a large table, join connection conditions between a small table and the large table can be determined, and then a plurality of first nodes send data query requests containing data ranges to a plurality of second nodes in parallel according to the Join connection conditions so as to acquire the large table stored in the second nodes.

Specifically, the data range includes a value range corresponding to each first sub-table data, the second data is large table data, and the large table data includes second sub-table data corresponding to each second node. The plurality of first nodes can then adopt the Join connection condition to obtain, in parallel, the second sub-table data corresponding to the value range from the plurality of second nodes, so that the query system avoids scanning all second nodes or scanning the whole data, and data acquisition efficiency is greatly improved.

Step 403, connecting the second data returned by the plurality of second nodes in response to the data query requests with the first data, and generating query results for the first data and the second data.

In a specific implementation, the query system may connect the first sub-table data and the second sub-table data in each first node respectively to generate a node connection result, and then combine the node connection results corresponding to each first node to generate query results for the small table data and the large table data.

In the embodiment of the application, the query system determines the data range of the first data in the first nodes, and the plurality of first nodes then send data query requests containing the data range to the plurality of second nodes in parallel. The second data returned by the plurality of second nodes in response to the data query requests is connected with the first data to generate query results for the first data and the second data. This avoids redistributing the second data within the system during data processing, which greatly reduces network overhead; and because the second data is extracted according to the data range, full extraction of the data is avoided and the efficiency of system operation is improved.

It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the application.

Referring to fig. 5, there is shown a block diagram of a first embodiment of a data processing apparatus of a distributed query system of the present application, where the distributed query system includes a plurality of first nodes and is communicatively connected to at least two data sources, and may specifically include the following modules:

A data and information acquisition module 501, configured to acquire first data of a first data source and acquire distribution information of second data in a second data source;

A data configuration module 502, configured to configure the first data in a plurality of the first nodes according to the distribution information of the second data;

A data range determining module 503, configured to determine a data range of the first data in the first node;

And the data extraction module 504 is configured to extract the second data from a plurality of second nodes of the second data source in parallel according to the data range.

In an alternative embodiment of the present application, the data extraction module 504 includes:

In an alternative embodiment of the present application, the data configuration module 502 includes:

In an optional embodiment of the present application, the first data is small table data, and the data configuration submodule is specifically configured to:

In an optional embodiment of the present application, the data range includes a value range corresponding to each of the first sub-table data, the second data is large table data, the large table data includes second sub-table data corresponding to each of the second nodes, and the data extraction sub-module is specifically configured to:

In an optional embodiment of the present application, the first identifier generating submodule is specifically configured to:

In an alternative embodiment of the present application, the method further includes:

Referring to FIG. 6, there is shown a block diagram of a second embodiment of a data query device of a distributed query system of the present application, the distributed query system being communicatively connected to at least one second data source; the distributed query system comprises a plurality of first nodes, wherein the first nodes comprise first data configured by adopting the distribution information of the second data sources; the second data source includes a plurality of second nodes; the device may specifically include the following modules:

A data range determining module 601, configured to determine a data range of the first data in the first node;

A query request sending module 602, configured to send, by a plurality of the first nodes, a data query request including the data range to the plurality of second nodes in parallel;

And a query result generating module 603, configured to connect second data corresponding to the data query requests sent by the plurality of second nodes with the first data, and generate query results for the first data and the second data.

In an alternative embodiment of the present application, the query request sending module 602 includes:

In an alternative embodiment of the present application, the data configuration module includes:

In an optional embodiment of the present application, the data range includes a value range corresponding to each of the first sub-table data, the second data is large table data, the large table data includes second sub-table data corresponding to each of the second nodes, and the query request sending sub-module is specifically configured to:

In an alternative embodiment of the present application, the query result generating module 603 includes:

For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.

The embodiment of the application also provides a device, which comprises:

One or more processors; and

One or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the apparatus to perform the method described by the embodiments of the present application.

Embodiments of the application also provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the methods described in embodiments of the application.

In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, embodiments of the application may take the form of a computer program product embodied on one or more machine-readable media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the application.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal device that comprises the element.

The data processing method of the distributed query system and the data processing device of the distributed query system provided by the application have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the application, and the description of the above examples is only intended to help understand the method and core ideas of the application. Meanwhile, those skilled in the art may, in accordance with the ideas of the present application, make changes to the specific embodiments and the scope of application. In view of the above, the contents of this description should not be construed as limiting the present application.

Claims

1. A data processing method of a distributed query system, the distributed query system comprising a plurality of first nodes and being communicatively coupled to at least two data sources, the method comprising:

Determining a data range of the first data in the first node;

Extracting the second data from a plurality of second nodes of the second data source in parallel according to the data range;

wherein the configuring the first data in the plurality of first nodes according to the distribution information of the second data includes:

When the second quantity information is smaller than or equal to the first quantity information, calculating a first configuration identifier aiming at the first data by adopting the first quantity information and the distribution information;

When the second quantity information is larger than the first quantity information, calculating a first configuration identifier aiming at the first data by adopting the second quantity information and the distribution information;

respectively adopting the first configuration identifier and Join connection conditions to determine a second configuration identifier aiming at the first node;

And respectively configuring the first data in the first nodes corresponding to the second configuration identifiers according to the sequence of the second configuration identifiers.

2. The method of claim 1, wherein extracting the second data from a plurality of second nodes of the second data source in parallel according to the data range comprises:

3. The method of claim 2, wherein the distribution information includes at least one of a sharding policy, a partitioning policy, and a bucketing policy.

4. The method of claim 3, wherein the first data is small table data, and the configuring the first data in the first node corresponding to the second configuration identifier according to the sequence of the second configuration identifier includes:

5. The method of claim 4, wherein the data range includes a value range corresponding to each of the first sub-table data, the second data is large table data, the large table data includes second sub-table data corresponding to each of the second nodes, and the acquiring, in parallel, second data corresponding to the data range from the plurality of second nodes according to the Join connection condition includes:

6. The method as recited in claim 5, further comprising:

7. A data query method of a distributed query system, wherein the distributed query system is in communication connection with at least one second data source; the distributed query system comprises a plurality of first nodes, wherein the first nodes comprise first data configured by adopting the distribution information of the second data sources; the second data source includes a plurality of second nodes; the method comprises the following steps:

Determining a data range of the first data in the first node;

Connecting second data corresponding to the data query requests sent by the plurality of second nodes with the first data to generate query results for the first data and the second data;

The first data is configured in the first node by the following manner:

8. The method of claim 7, wherein the plurality of first nodes send data query requests containing the data range in parallel to the plurality of second nodes, comprising:

9. The method as recited in claim 8, further comprising:

Acquiring distribution information of second data in the second data source;

10. The method of claim 9, wherein the distribution information includes at least one of a sharding policy, a partitioning policy, and a bucketing policy.

11. The method according to claim 10, wherein the first data is small table data, and the configuring the first data in the first node corresponding to the second configuration identifier according to the sequence of the second configuration identifier includes:

12. The method of claim 11, wherein the data range includes a value range corresponding to each of the first sub-table data, the second data is large table data, the large table data includes second sub-table data corresponding to each of the second nodes, and the plurality of first nodes send data query requests including the data range to the plurality of second nodes in parallel according to the Join connection condition, including:

13. The method of claim 10, wherein the computing each of the distribution information to generate a first configuration identifier for the first data comprises:

14. The method of claim 12, wherein the connecting the second data corresponding to the data query request sent by the plurality of second nodes with the first data, and generating the query result for the first data and the second data, comprises:

15. A data processing apparatus of a distributed query system, the distributed query system comprising a plurality of first nodes and being communicatively coupled to at least two data sources, the apparatus comprising:

The data extraction module is used for extracting the second data from a plurality of second nodes of the second data source in parallel according to the data range;

Wherein, the data configuration module includes:

A first identifier generating sub-module, configured to obtain first number information of the first node and second number information of the second node, and calculate a first configuration identifier for the first data using the first number information and the distribution information when the second number information is less than or equal to the first number information; when the second quantity information is larger than the first quantity information, calculating a first configuration identifier aiming at the first data by adopting the second quantity information and the distribution information;

The second identifier generation sub-module is used for determining a second configuration identifier aiming at the first node by adopting the first configuration identifier and Join connection conditions respectively;

and the data configuration sub-module is used for respectively configuring the first data in the first nodes corresponding to the second configuration identifiers according to the sequence of the second configuration identifiers.

16. A data querying device of a distributed querying system, wherein the distributed querying system is communicatively coupled to at least one second data source; the distributed query system comprises a plurality of first nodes, wherein the first nodes comprise first data configured by adopting the distribution information of the second data sources; the second data source includes a plurality of second nodes; the device comprises:

The query result generation module is used for connecting second data corresponding to the data query requests sent by the plurality of second nodes with the first data to generate query results for the first data and the second data;

wherein, still include: means for performing the steps of:

17. An electronic device, comprising:

One or more processors; and

One or more machine readable media having instructions stored thereon, which when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-14.

18. A machine readable medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the method of any of claims 1-14.