CN107153643A - Data table join method and device - Google Patents
- Publication number
- CN107153643A (application number CN201610118167.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- tables
- data record
- node
- condition
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a data table join method and device. The method includes: receiving a data table join task, which indicates that a join operation is to be performed on a first data table and a second data table according to a join condition; loading, according to the join condition, the data records of the second data table onto at least two nodes in a distributed system; reading a data record from the first data table as the current data record, determining a target node from the at least two nodes according to the join condition corresponding to the current data record, and reading the data record of the second data table stored on the target node as the target data record; and performing the join operation on the current data record and the target data record. The application can reduce the computing resources consumed by data table join operations.
Description
【Technical field】
The application relates to the field of database technology, and in particular to a data table join method and device.
【Background】
With the development of the Internet, data volumes are growing explosively, data structures are diversifying, and data carries more and more information. In this context, data warehouses play a major role. With the arrival of the big data era, data warehouses have moved to distributed architectures to meet explosively growing demands for computation and storage. A distributed data warehouse typically uses columnar storage and stores data as files, so it can improve storage and computing performance for big data.
Queries in a distributed data warehouse often require join (Join) computations between data tables. In the prior art, a Join between data tables is typically processed by first shuffle-sorting all the tables to be joined using MapReduce and then merging the sorted tables on the Reducer side. Shuffle sorting is the process in which the Map side partitions each table to be joined according to the Join condition and assigns the partitions to different Reducers.
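The prior-art flow described above can be pictured with a small single-process sketch. It is only an illustration under stated assumptions: rows are tuples, `stable_hash`, `shuffle`, `merge_join`, and `sort_merge_join` are hypothetical names, and real MapReduce would run the partitions on separate machines.

```python
import zlib
from collections import defaultdict

def stable_hash(key):
    """Process-independent hash (Python's built-in str hash is salted per run)."""
    return zlib.crc32(str(key).encode("utf-8"))

def shuffle(rows, key_index, num_reducers):
    """Map side: partition rows by join key, then sort each partition by that key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[stable_hash(row[key_index]) % num_reducers].append(row)
    for part in partitions.values():
        part.sort(key=lambda r: r[key_index])
    return partitions

def merge_join(left, right, lk, rk):
    """Reducer side: merge two partitions that are already sorted by join key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        a, b = left[i][lk], right[j][rk]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][rk] == a:
                j_end += 1
            while i < len(left) and left[i][lk] == a:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out

def sort_merge_join(left, right, lk, rk, num_reducers=2):
    """Both tables are shuffled, which is the cost the application aims to avoid."""
    lp = shuffle(left, lk, num_reducers)
    rp = shuffle(right, rk, num_reducers)
    result = []
    for reducer in range(num_reducers):
        result.extend(merge_join(lp.get(reducer, []), rp.get(reducer, []), lk, rk))
    return result
```

Note that both inputs pass through `shuffle`, so the large table is partitioned and sorted once for every table it is joined with.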
In a typical "star" Join scenario, assume that the tables to be joined include one main table and n auxiliary tables, and that the main table contains M data records. When the main table is joined with the n auxiliary tables, the total amount of data the shuffle sort must process consists of the data processed in shuffling the main table, which is M*n, plus the data processed in shuffling the n auxiliary tables. This consumes a large amount of computing resources.
【Summary of the invention】
Various aspects of the application provide a data table join method and device to reduce the computing resources consumed by data table join operations.
In one aspect, the application provides a data table join method, including:
receiving a data table join task, where the data table join task indicates that a join operation is to be performed on a first data table and a second data table according to a join condition;
loading, according to the join condition, the data records of the second data table onto at least two nodes in a distributed system;
reading a data record from the first data table as the current data record, determining a target node from the at least two nodes according to the join condition corresponding to the current data record, and reading the data record of the second data table stored on the target node as the target data record; and
performing the join operation on the current data record and the target data record.
In another aspect, the application provides a data table join device, including:
a receiving module, configured to receive a data table join task, where the data table join task indicates that a join operation is to be performed on a first data table and a second data table according to a join condition;
a loading module, configured to load, according to the join condition, the data records of the second data table onto at least two nodes in a distributed system;
a reading module, configured to read a data record from the first data table as the current data record, determine a target node from the at least two nodes according to the join condition corresponding to the current data record, and read the data record of the second data table stored on the target node as the target data record; and
a join module, configured to perform the join operation on the current data record and the target data record.
In the application, when a data table join task is processed, the data records of the second data table are first loaded onto at least two nodes in the distributed system according to the join condition in the task. The data records of the first data table can then be read directly; according to the join condition corresponding to each record read from the first data table, the required data records of the second data table are read from the corresponding node, and the join operation is performed on the records read from the two tables. Thus the application only needs to distribute the second data table across different nodes according to the join condition and does not need to distribute the first data table across different nodes, which reduces the amount of data the shuffle sort must process and helps reduce the computing resources consumed by the join operation.
【Brief description of the drawings】
To describe the technical solutions in the embodiments of the application or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a data table join method provided by an embodiment of the application;
Fig. 2 is a schematic architecture diagram of a distributed system provided by another embodiment of the application;
Fig. 3 is a schematic structural diagram of a data table join device provided by yet another embodiment of the application;
Fig. 4 is a schematic structural diagram of a data table join device provided by yet another embodiment of the application.
【Detailed description】
To make the objectives, technical solutions, and advantages of the embodiments of the application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the application without creative effort fall within the protection scope of the application.
Queries in a distributed data warehouse often require join (Join) computations between data tables. In the prior art, because the tables to be joined are relatively large, a Join is typically processed by first shuffle-sorting all the tables to be joined using MapReduce and then merging the sorted tables on the Reducer side. Shuffle sorting is the process in which the Map side partitions each table to be joined according to the Join condition and assigns the partitions to different Reducers. Because every table to be joined must be shuffle-sorted, a large amount of computing resources is consumed.
To address this technical problem, the application provides a solution: the second data table is stored in a distributed manner on multiple nodes, where it serves as a distributed cache. When the first data table is processed, the data records of the second data table stored on remote nodes are obtained over the network, so that a distributed hash map join (Hash map Join) is performed. The main table therefore does not need to be shuffle-sorted, which saves the computing resources that shuffle-sorting the first data table would consume.
Fig. 1 is a schematic flowchart of a data table join method provided by an embodiment of the application. As shown in Fig. 1, the method includes:
101. Receive a data table join task, where the data table join task indicates that a join operation is to be performed on a first data table and a second data table according to a join condition.
102. According to the join condition, load the data records of the second data table onto at least two nodes in a distributed system.
103. Read a data record from the first data table as the current data record, determine a target node from the at least two nodes according to the join condition corresponding to the current data record, and read the data record of the second data table stored on the target node as the target data record.
104. Perform the join operation on the current data record and the target data record.
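Steps 101 to 104 can be sketched in a single process as follows. This is a minimal illustration, not the patented implementation: each node is modeled as an in-memory dictionary, rows are tuples, the task is a plain dictionary, and all names (`node_for`, `load_second_table`, `distributed_hash_join`, `run_join_task`, and the task fields) are assumptions made for the example.

```python
import zlib
from collections import defaultdict

def node_for(key, num_nodes):
    """Map a join key to a node index with a process-independent hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes

def load_second_table(second_rows, key_index, num_nodes):
    """Step 102: spread second-table records over the nodes by join key."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for row in second_rows:
        key = row[key_index]
        nodes[node_for(key, num_nodes)][key].append(row)
    return nodes

def distributed_hash_join(first_rows, first_key, nodes):
    """Steps 103-104: stream the first table unsorted and probe the owning node."""
    for current in first_rows:                    # current data record
        key = current[first_key]
        target_node = nodes[node_for(key, len(nodes))]
        for target in target_node.get(key, []):   # target data records
            yield current + target                # joined row

def run_join_task(task):
    """Step 101: accept a task (hypothetical dict layout) and run the join."""
    nodes = load_second_table(task["second_rows"], task["second_key"], task["num_nodes"])
    return list(distributed_hash_join(task["first_rows"], task["first_key"], nodes))
```

Only the second table is partitioned here; the first table is read once, in whatever order it is stored.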
This embodiment provides a data table join method that can be performed by a data table join device to carry out Join operations between data tables while minimizing the computing resources consumed. The method is applicable to a distributed system, in which each machine can serve as a node. This embodiment does not limit the framework used to implement the distributed system; for example, it may be, but is not limited to, the MapReduce framework.
When a Join operation between data tables is needed, a data table join task can be sent to the data table join device, which receives it. The task indicates that Join processing is to be performed on the first data table and the second data table according to the join condition. The first data table and the second data table are the tables to be joined.
In a specific implementation, the data table join task carries information such as the join condition, the identifier of the first data table, the identifier of the second data table, the storage location of the first data table, and the storage location of the second data table. The data table join device can parse the task to obtain this information, determine from the two identifiers which tables require the Join operation, and read the first data table and the second data table from their respective storage locations.
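One way to picture the parsing step is the sketch below. The source does not specify a task format, so the JSON encoding and every field name here (`join_keys`, `first_table_id`, and so on) are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class JoinTask:
    """Information the description says a join task carries (field names are hypothetical)."""
    join_keys: tuple
    first_table_id: str
    second_table_id: str
    first_table_location: str
    second_table_location: str

def parse_join_task(payload: str) -> JoinTask:
    """Parse a JSON-encoded task and fail loudly if any required field is absent."""
    raw = json.loads(payload)
    missing = [f.name for f in fields(JoinTask) if f.name not in raw]
    if missing:
        raise ValueError(f"join task is missing fields: {missing}")
    return JoinTask(
        join_keys=tuple(raw["join_keys"]),
        first_table_id=raw["first_table_id"],
        second_table_id=raw["second_table_id"],
        first_table_location=raw["first_table_location"],
        second_table_location=raw["second_table_location"],
    )
```

Validating the task up front keeps a malformed request from reaching the loading stage, where a failure would be more expensive.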
In one practical application, the first data table can be the main table and the second data table can be an auxiliary table. There may be one or more auxiliary tables.
After receiving the data table join task, the data table join device learns from the join condition that a Join operation must be performed on the first data table and the second data table. Before performing the Join operation, it first loads the data records of the second data table onto at least two nodes in the distributed system according to the join condition, achieving distributed storage.
Preferably, on each of the at least two nodes, the amount of second-table data is less than the memory limit of a single node; that is, the data records of the second data table distributed to each node all fit in that node's storage space (preferably memory).
In an optional implementation, the join condition includes at least one target key required for the join; a target key here is the key in a key-value pair. On this basis, the data table join device can perform a hash operation on each of the at least one target key to obtain the hash value of each target key; determine the node corresponding to each target key according to that hash value and the number of the at least two nodes used to store the data records of the second data table; and load the data records of the second data table that correspond to each target key onto the node corresponding to that key.
Further, the data table join device can take the hash value of each target key modulo the number of the at least two nodes used to store the data records of the second data table, and determine the node corresponding to each target key from the result. Specifically, the node denoted by the modulo result can be used as the node corresponding to the target key. Alternatively, the data table join device can divide the target keys evenly among the nodes according to the number of those nodes and the number of target keys. During this division, target keys with close hash values can be assigned to the same node according to the hash value of each target key. Here, hash values being close can mean that their difference is less than a preset threshold, but is not limited to this.
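The two key-to-node assignment strategies described above might look like the following sketch. The function names and the use of CRC32 as the hash are assumptions; the second function interprets "close hash values go to the same node" as cutting the hash-ordered key list into contiguous, equally sized chunks, which is one possible reading rather than the only one.

```python
import math
import zlib

def stable_hash(key):
    """Process-independent hash of a key."""
    return zlib.crc32(str(key).encode("utf-8"))

def assign_by_modulo(keys, num_nodes):
    """Strategy 1: node index = hash(key) mod number of nodes."""
    return {k: stable_hash(k) % num_nodes for k in keys}

def assign_evenly(keys, num_nodes):
    """Strategy 2: order keys by hash and give each node one contiguous chunk,
    so keys with neighboring hash values share a node and chunk sizes are balanced."""
    ordered = sorted(set(keys), key=stable_hash)
    chunk = math.ceil(len(ordered) / num_nodes) if ordered else 1
    return {k: i // chunk for i, k in enumerate(ordered)}
```

The modulo strategy needs no knowledge of the full key set, so a reader can locate any key on its own; the even split balances key counts but requires the assignment table to be shared with whoever probes the nodes.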
Further, when the data records of the second data table are loaded onto the at least two nodes, they can specifically be loaded into the memory of those nodes. Records held in node memory can be read at any time and quickly, which helps improve the efficiency of the Join operation.
It is worth noting that the data records of the second data table are preferably loaded into the memory of the at least two nodes, but they are not limited to memory; they can also be loaded onto a node's solid-state drive (Solid State Drive, SSD) or other storage media.
In an optional implementation, before the data records of the second data table are loaded onto at least two nodes in the distributed system according to the join condition, it can be determined whether the data volume of the second data table exceeds the memory limit of a single node. If so, the data records of the second data table cannot all be held in the memory of a single node, so they can be loaded onto at least two nodes according to the join condition such that the records distributed to each node all fit in that node's memory, achieving distributed storage. In short, the amount of second-table data distributed to each node is less than the memory limit of a single node.
If not, that is, if the data volume of the second data table is less than or equal to the memory limit of a single node, all data records of the second data table can be held in the memory of a single node. In this case it is preferable to load all of them into the memory of a single node, which avoids shuffle-sorting the data records of the second data table and saves computing resources.
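The size check can be sketched as a small planning function. This is an illustration only: `plan_placement` is a hypothetical name, sizes are plain byte counts, and the node count assumes records spread evenly across nodes, so a skewed key distribution would need extra headroom that the sketch does not model.

```python
import math

def plan_placement(table_bytes, node_memory_limit_bytes, available_nodes):
    """Return how many nodes should hold the second table.

    1 means the table fits on a single node; otherwise at least 2 nodes are
    used so that each node's share stays within the single-node memory limit
    (assuming an even spread of records).
    """
    if node_memory_limit_bytes <= 0:
        raise ValueError("memory limit must be positive")
    if table_bytes <= node_memory_limit_bytes:
        return 1
    needed = max(2, math.ceil(table_bytes / node_memory_limit_bytes))
    if needed > available_nodes:
        raise ValueError("second table does not fit in the memory of the available nodes")
    return needed
```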
Loading the data records of the second data table onto at least two nodes in the distributed system is equivalent to turning the second data table into multiple small tables, each of which fits entirely in the memory of its node, forming a distributed key-value (KV) store. This makes a distributed Hash map Join possible, with no need for a sort merge join (Sort Merge Join). With a distributed Hash map Join, the data records of the first data table do not need to be sorted: they can be read directly, the required data records of the second data table are read from the corresponding node according to the join condition corresponding to each record read, and the Join operation is then performed on the records read from the two tables.
The distributed Hash map Join in this embodiment differs from an existing Hash map Join as follows: when the first data table is processed, the data records of the second data table are not looked up in local memory but are obtained over the network from the remote nodes on which they are stored.
Specifically, after the data records of the second data table have been loaded onto at least two nodes in the distributed system, the data table join device can read the data records of the first data table from its storage location and take each record read as the current data record. According to the join condition corresponding to the current data record, it determines a target node from the at least two nodes; the target node is the node holding the data records of the second data table needed for the Join operation with the current data record. It then reads the data record of the second data table stored on the target node as the target data record; the target data record is the record of the second data table needed for the Join operation with the current data record.
After the current data record and the target data record needed for its Join operation have been read, the Join operation is performed on them. How the Join operation is performed on the current data record and the target data record is not the focus of the application and is not detailed here; refer to the prior-art processing flow for Join operations.
In an optional implementation, the target data record needed for the Join operation with the current data record may already exist in the local cache of the data table join device. Therefore, before determining the target node from the at least two nodes and reading the target data record from it, the device can determine, according to the join condition corresponding to the current data record, whether the target data record exists in the local cache. If not, it performs the operation of determining the target node from the at least two nodes according to the join condition corresponding to the current data record and reading the data record of the second data table stored on the target node as the target data record. If so, it obtains the target data record from the local cache. This obtains the target data record more quickly, saves the network resources that fetching it would consume, and improves the efficiency of the Join operation.
Further, the join condition corresponding to the current data record can be a target key. One implementation of determining the target node from the at least two nodes according to the join condition corresponding to the current data record then includes: performing a hash operation on the target key corresponding to the current data record to obtain its hash value; and determining, according to that hash value and the number of the at least two nodes, the node corresponding to the target key as the target node.
Further, when target keys are used to determine the nodes from which target data is fetched, multiple target keys can be handled in a batch operation. This makes full use of the advantages of the distributed system and improves processing performance.
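A batch lookup of the kind mentioned above could be organized as in this sketch, which groups keys by owning node and issues one request per node. The names `node_for`, `fetch_many`, and `batch_lookup` are hypothetical, and `fetch_many` merely stands in for a network round trip; the source does not describe the batching protocol.

```python
import zlib
from collections import defaultdict

def node_for(key, num_nodes):
    """Map a target key to the index of the node that stores it."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes

def fetch_many(node, keys):
    """Stand-in for a single network round trip that fetches several keys from one node."""
    return {k: node[k] for k in keys if k in node}

def batch_lookup(keys, nodes):
    """Group distinct keys by owning node, then send one request per node."""
    by_node = defaultdict(list)
    for key in set(keys):
        by_node[node_for(key, len(nodes))].append(key)
    found = {}
    for node_id, node_keys in by_node.items():
        found.update(fetch_many(nodes[node_id], node_keys))
    return found
```

With this grouping, the number of round trips is bounded by the number of nodes rather than by the number of keys, and the per-node requests could also be issued concurrently.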
As the above analysis shows, when processing a data table join task, this embodiment first loads the data records of the second data table onto at least two nodes according to the join condition in the task, which is equivalent to building a distributed KV store (a distributed hash table). A Sort Merge Join is therefore unnecessary and a distributed Hash map Join can be performed: the data records of the first data table need not be sorted but can be read directly, the required data records of the second data table are read from the corresponding node according to the join condition corresponding to each record read, and the Join operation is then performed on the records read from the two tables. Thus this embodiment only needs to distribute the second data table across different nodes according to the join condition and does not need to distribute the first data table across different nodes, which reduces the amount of data the shuffle sort must process and helps reduce the computing resources consumed by the join operation.
The advantages of the technical solution of the application are illustrated below by comparing the computing resources consumed by a Sort Merge Join and by a distributed Hash map Join.
Assume that the main table is A, with a data size of 100T, and that there are two auxiliary tables, B and C, with data sizes of 10G and 100G respectively.
With an existing Sort Merge Join, the shuffle sort stage must perform one sort pass over main table A and auxiliary table B and another sort pass over main table A and auxiliary table C. Each sort pass reads the tables over network I/O and sorts them on the CPU, so the resource consumption of each pass consists of the CPU used for sorting and the network I/O used for reading the tables. For ease of description, resource consumption is expressed as the amount of data processed. Because the amount of data sorted by the CPU equals the amount of data read over network I/O, the resource consumption of each sort pass is expressed as a single data volume. The total resource consumption of the shuffle sort stage is then: (100T+10G)+(100T+100G) = 2*100T+10G+100G.
With the distributed Hash map Join of the application, the shuffle sort stage must distribute auxiliary table B across different nodes and also distribute auxiliary table C across different nodes. Each distribution of a table across nodes reads the table over network I/O and sorts it on the CPU, so its resource consumption likewise consists of the CPU used for sorting and the network I/O used for reading the table. For ease of description, resource consumption is again expressed as the amount of data processed, and because the amount of data sorted by the CPU equals the amount of data read over network I/O, each pass is expressed as a single data volume. The total resource consumption of the shuffle sort stage is then: 10G+100G.
As the above shows, because the technical solution of the application only needs to distribute the second data table across different nodes according to the join condition and does not need to distribute the first data table across different nodes, it reduces the amount of data the shuffle sort must process and helps reduce the computing resources consumed by the join operation.
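The comparison above can be reproduced with a few lines of arithmetic. The sketch assumes binary units (1T = 1024G), which the source does not state, and the function names are hypothetical.

```python
GB = 1
TB = 1024 * GB  # assumption: binary units; the source only writes "T" and "G"

def sort_merge_shuffle_volume(main, auxiliaries):
    """The main table is shuffled once per auxiliary table, plus each auxiliary table once."""
    return len(auxiliaries) * main + sum(auxiliaries)

def hash_map_shuffle_volume(auxiliaries):
    """Only the auxiliary tables are distributed; the main table is read in place."""
    return sum(auxiliaries)

main_a = 100 * TB
aux = [10 * GB, 100 * GB]  # auxiliary tables B and C

sort_merge_total = sort_merge_shuffle_volume(main_a, aux)  # 2*100T + 10G + 100G = 204910 G
hash_map_total = hash_map_shuffle_volume(aux)              # 10G + 100G = 110 G
print(sort_merge_total, hash_map_total)
```

Under these assumptions the shuffle stage handles 204910G with a Sort Merge Join and 110G with the distributed Hash map Join, so the saving is dominated by the main table no longer being shuffled.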
Fig. 2 is a schematic architecture diagram of a distributed system provided by another embodiment of the application. As shown in Fig. 2, the distributed system includes a control node 21, a scheduling node 22, and at least two compute nodes 23.
Further, as shown in Fig. 2, each compute node 23 includes at least a cache module and a processing module.
It is worth noting that the distributed system shown in Fig. 2 is only an example and is not limiting; for example, the scheduling node 22 in Fig. 2 can be omitted to obtain a simpler distributed system.
The technical solution of the application is described in detail below based on the distributed system shown in Fig. 2.
The control node 21 is responsible for receiving the data table join task and learns from it that a join operation must be performed on the first data table and the second data table according to the join condition.
The control node 21 can send a scheduling instruction to the scheduling node 22 according to the data table join task, directing the scheduling node 22 to schedule the available compute nodes 23 in the distributed system. The scheduling node 22 receives the scheduling instruction from the control node 21 and schedules the compute nodes 23 in the distributed system accordingly.
While the compute nodes 23 are being scheduled, the control node 21 provides, through the scheduling node 22, each compute node 23 with the configuration file needed for subsequently loading the data records of the second data table. The configuration file records information such as the identifier and storage location of the second data table and the identification information of the data records to be loaded.
A loading process is deployed on each compute node 23 in the distributed system. The loading process loads data records of the second data table into the cache module according to the configuration file. Specifically, the scheduling node 22 activates the loading process on each compute node 23; each loading process reads its own share of the second data table's data records from the corresponding storage location according to the configuration file and loads the records it reads into the cache module.
It is worth noting that the second data table can be stored in storage outside the distributed system, but is not limited to this.
When the loading processes on all compute nodes 23 have finished loading, that is, have entered the port-listening state, a loading-complete instruction is returned to the control node 21 through the scheduling node 22. From this instruction, the control node 21 learns that each compute node 23 has loaded the data records of the second data table into its cache module.
The control node 21 sends an activation instruction to the scheduling node 22 so that the scheduling node 22 activates the processing process on each compute node 23. A processing process is deployed on each compute node 23. It reads a data record from the first data table as the current data record, determines, according to the key corresponding to the current data record, the compute node 23 that holds the second-table data records corresponding to that key, reads the data record of the second data table stored on the determined compute node 23 as the target data record, and performs the Join operation on the current data record and the target data record. It is worth noting that the first data table can be stored in storage outside the distributed system, but is not limited to this.
Optionally, in one specific implementation, each compute node 23 can be implemented in a server/client manner. For example, the cache modules of the compute nodes 23 can be implemented as a cache service (CacheService), which also includes a cache manager (CacheManager); each cache module corresponds to one cache node (CacheNode). Correspondingly, the processing module of each compute node 23 is implemented as a cache client (CacheClient).
Specifically, the CacheManager coordinates and manages all CacheNodes. A CacheNode is responsible for loading data into memory and serving it. Optionally, the second data table can be stored and managed in the form of shard files. The purpose of using shard files is to simplify failover handling: once a CacheNode restarts, it only needs to read its shard files again.
The CacheClient accesses the CacheService by performing a hash calculation on the key and, according to the calculation result, reading the data from a particular CacheNode. In addition, the CacheClient should have some local cache. A caching algorithm such as Least Recently Used (LRU) is usually used to keep part of the data that has already been read in the local cache, so that the CacheClient can preferentially read the required data from the local cache. If the required data is found in the local cache, the operation of reading the data from the CacheNode over the network can be skipped, which helps improve efficiency and save resources.
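As a minimal sketch of the lookup path described above, the following Python fragment models a client that hashes a key to choose a node and keeps recently read entries in a small LRU local cache. The class name `CacheClient` follows the text, but its methods, the `node_index` helper, the use of CRC32 with a modulo, and the plain dictionaries standing in for CacheNodes are all illustrative assumptions rather than details taken from the application.

```python
import zlib
from collections import OrderedDict


def node_index(key, num_nodes):
    """Assumed routing rule: stable CRC32 hash of the key, modulo the node count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


class CacheClient:
    """Hypothetical client with an LRU local cache in front of the cache nodes."""

    def __init__(self, cache_nodes, local_capacity=2):
        self.cache_nodes = cache_nodes      # list of dicts standing in for CacheNodes
        self.local_capacity = local_capacity
        self.local = OrderedDict()          # key -> records, least recently used first
        self.remote_reads = 0               # counts simulated network reads

    def get(self, key):
        if key in self.local:
            self.local.move_to_end(key)     # mark the entry as most recently used
            return self.local[key]
        # Local miss: read from the node selected by the hash of the key.
        self.remote_reads += 1
        node = self.cache_nodes[node_index(key, len(self.cache_nodes))]
        records = node.get(key, [])
        self.local[key] = records
        if len(self.local) > self.local_capacity:
            self.local.popitem(last=False)  # evict the least recently used entry
        return records


# Usage: place three keys on three simulated nodes, then read through the client.
nodes = [{}, {}, {}]
for k, recs in [("a", ["A"]), ("b", ["B"]), ("c", ["C"])]:
    nodes[node_index(k, len(nodes))][k] = recs
client = CacheClient(nodes, local_capacity=2)
client.get("a")
client.get("a")                             # served from the local cache
print(client.remote_reads)
```

The `remote_reads` counter is only there to make the saving visible: a repeated key costs one simulated network read instead of two.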
As can be seen from the above analysis, when processing a data table connection task, this embodiment first loads the data records in the second data table onto at least two nodes according to the connection condition in the task, thereby achieving distributed storage. This makes it possible to read the data records in the first data table directly, to read the required data records of the second data table from the corresponding node according to the connection condition corresponding to each data record read from the first data table, and then to perform the Join operation on the data records read from the two data tables, thereby implementing a distributed Hash map Join. It follows that this embodiment only needs to distribute the second data table to different nodes according to the connection condition and does not need to distribute the first data table to different nodes, which reduces the amount of data that has to be shuffled and sorted and helps reduce the computing resources consumed by the connection operation.
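The overall flow summarized above can be illustrated with a small single-process sketch. It is an illustration under stated assumptions, not the implementation of the application: the function names, the CRC32-modulo routing rule, the use of in-memory dictionaries as "nodes", and the inner-join semantics are all hypothetical choices made for the example.

```python
import zlib


def node_index(key, num_nodes):
    """Assumed routing rule: stable CRC32 hash of the key, modulo the node count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def load_second_table(second_table, key_field, num_nodes):
    """Spread the second table over the nodes by join key (each node is a dict here)."""
    nodes = [dict() for _ in range(num_nodes)]
    for record in second_table:
        key = record[key_field]
        nodes[node_index(key, num_nodes)].setdefault(key, []).append(record)
    return nodes


def hash_map_join(first_table, nodes, key_field):
    """Stream the first table as-is (no sorting, no redistribution) and probe the nodes."""
    for current in first_table:
        key = current[key_field]
        target_node = nodes[node_index(key, len(nodes))]
        for target in target_node.get(key, []):
            # Inner join: emit one merged record per matching pair.
            yield {**target, **current}


orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 2, "item": "pen"},
    {"user_id": 9, "item": "ink"},   # no matching user, so it produces no output
]
users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bo"}]

nodes = load_second_table(users, "user_id", num_nodes=2)
joined = list(hash_map_join(orders, nodes, "user_id"))
```

Only `users` (the second table) is partitioned; `orders` (the first table) is consumed in whatever order it arrives, which is the property the paragraph above relies on.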
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the application is not limited by the described order of actions, because according to the application some steps may be performed in another order or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Fig. 3 is a schematic structural diagram of a data table connection device provided by another embodiment of the application. As shown in Fig. 3, the device includes: a receiving module 31, a loading module 32, a reading module 33 and a connecting module 34.
The receiving module 31 is configured to receive a data table connection task, where the data table connection task indicates that a connection operation is to be performed on a first data table and a second data table according to a connection condition.
The loading module 32 is configured to load the data records in the second data table onto at least two nodes in the distributed system according to the connection condition.
The reading module 33 is configured to read a data record in the first data table as the current data record, determine a target node from the at least two nodes according to the connection condition corresponding to the current data record, and read the data record in the second data table stored on the target node as the target data record.
The connecting module 34 is configured to perform the connection operation on the current data record and the target data record.
Preferably, the data volume of the second data table's data records distributed on each of the at least two nodes is less than the memory limit of a single node. In other words, the data records of the second data table distributed to each of the at least two nodes can all be placed in the storage space (preferably the memory) of the corresponding node.
Further, as shown in Fig. 4, the device also includes a first judging module 35.
The first judging module 35 is configured to judge whether the data volume of the second data table is greater than the memory limit of a single node and, when the judgment result is yes, to trigger the loading module 32 to perform the operation of loading the data records in the second data table onto at least two nodes in the distributed system according to the connection condition.
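The check performed by the first judging module can be sketched as a simple size comparison. The function names and byte-based units below are assumptions made for illustration. The node-count estimate is also an assumption: it takes the preference stated earlier (each node's share should be less than the single-node memory limit) and supposes a perfectly even spread, which real key distributions may not give.

```python
def needs_distributed_load(second_table_bytes, node_memory_limit_bytes):
    """True when the second table exceeds the memory limit of a single node."""
    return second_table_bytes > node_memory_limit_bytes


def min_nodes_for_even_spread(second_table_bytes, node_memory_limit_bytes):
    """Smallest node count (at least two) whose even share is strictly below the limit.

    Assumes records spread evenly across nodes; skewed keys would need more headroom.
    """
    return max(2, second_table_bytes // node_memory_limit_bytes + 1)


GIB = 1024 ** 3
table_size = 10 * GIB      # hypothetical size of the second data table
node_limit = 4 * GIB       # hypothetical per-node memory limit

if needs_distributed_load(table_size, node_limit):
    node_count = min_nodes_for_even_spread(table_size, node_limit)
```

With these hypothetical figures the even share per node is 10 GiB divided by 3 nodes, which stays under the 4 GiB limit.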
Further, as shown in Fig. 4, the device also includes a second judging module 36.
The second judging module 36 is configured to judge, according to the connection condition corresponding to the current data record, whether the target data record exists in the local cache and, when the judgment result is no, to trigger the reading module 33 to perform the operation of determining a target node from the at least two nodes according to the connection condition corresponding to the current data record and reading the data record in the second data table stored on the target node as the target data record.
In an optional embodiment, the connection condition includes at least one target key required for the connection. The target key here is in fact the key of a key-value pair.
Based on the above, the loading module 32 is specifically configured to:
perform a hash operation on each target key of the at least one target key, to obtain the hash value of each target key;
determine the node corresponding to each target key according to the hash value of each target key and the number of the at least two nodes;
load the data records of the second data table corresponding to each target key onto the node corresponding to that target key.
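The three loading steps listed above can be written out one function per step. This is a sketch under assumptions: the text only says that the node is determined from the hash value and the number of nodes, so the modulo rule, the choice of SHA-256, and every function name here are illustrative rather than taken from the application.

```python
import hashlib
from collections import defaultdict


def hash_key(target_key):
    """Step 1: hash the target key (SHA-256 is an arbitrary stable choice)."""
    digest = hashlib.sha256(str(target_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for_hash(hash_value, num_nodes):
    """Step 2: combine the hash value with the node count (modulo is assumed)."""
    return hash_value % num_nodes


def plan_load(second_table, key_field, num_nodes):
    """Step 3: group the records of the second table by the node that owns their key."""
    placement = defaultdict(list)
    for record in second_table:
        node = node_for_hash(hash_key(record[key_field]), num_nodes)
        placement[node].append(record)
    return dict(placement)


second_table = [{"k": i % 5, "v": i} for i in range(20)]
plan = plan_load(second_table, "k", num_nodes=3)
```

Because the hash is deterministic, every record sharing a key lands on the same node, and a reader can later recompute the same node from the key alone.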
In an optional embodiment, the loading module 32 is specifically configured to:
load the data records in the second data table into the memory of the at least two nodes according to the connection condition. Data records of the second data table stored in node memory can be read at any time and read quickly, which helps improve the efficiency of the Join operation.
It is worth noting that the data records in the second data table are preferably loaded into the memory of the at least two nodes, but the storage is not limited to memory; they may also be loaded into an SSD or another storage medium of the node.
When processing a data table connection task, the data table connection device provided by this embodiment first loads the data records in the second data table onto at least two nodes in the distributed system according to the connection condition in the task. This is equivalent to turning the second data table into a distributed KV store, so that a Sort Merge Join is not needed and a distributed Hash map Join can be performed instead. That is, the data records in the first data table do not need to be sorted; they can be read directly, the required data records of the second data table can be read from the corresponding node according to the connection condition corresponding to each data record read from the first data table, and the connection operation is then performed on the data records read from the two data tables. It follows that, with the data table connection device provided by this embodiment, only the second data table needs to be distributed to different nodes according to the connection condition and the first data table does not, which reduces the amount of data that has to be shuffled and sorted and helps reduce the computing resources consumed by the connection operation.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in the application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to perform some of the steps of the methods described in the embodiments of the application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the application.
Claims (10)
1. A data table connection method, characterized by comprising:
receiving a data table connection task, wherein the data table connection task indicates that a connection operation is to be performed on a first data table and a second data table according to a connection condition;
loading the data records in the second data table onto at least two nodes in a distributed system according to the connection condition;
reading a data record in the first data table as a current data record, determining a target node from the at least two nodes according to the connection condition corresponding to the current data record, and reading the data record in the second data table stored on the target node as a target data record;
performing the connection operation on the current data record and the target data record.
2. The method according to claim 1, characterized in that, before the loading of the data records in the second data table onto at least two nodes in the distributed system according to the connection condition, the method comprises:
judging whether the data volume of the second data table is greater than the memory limit of a single node;
if the judgment result is yes, performing the operation of loading the data records in the second data table onto at least two nodes in the distributed system according to the connection condition.
3. The method according to claim 1, characterized in that, before the determining of a target node from the at least two nodes according to the connection condition corresponding to the current data record and the reading of the data record in the second data table stored on the target node as a target data record, the method comprises:
judging, according to the connection condition corresponding to the current data record, whether the target data record exists in a local cache;
if the judgment result is no, performing the operation of determining a target node from the at least two nodes according to the connection condition corresponding to the current data record and reading the data record in the second data table stored on the target node as the target data record.
4. The method according to claim 1, characterized in that the connection condition includes at least one target key required for the connection;
the loading of the data records in the second data table onto at least two nodes in the distributed system according to the connection condition comprises:
performing a hash operation on each target key of the at least one target key, to obtain the hash value of each target key;
determining the node corresponding to each target key according to the hash value of each target key and the number of the at least two nodes;
loading the data records of the second data table corresponding to each target key onto the node corresponding to that target key.
5. The method according to any one of claims 1-4, characterized in that the loading of the data records in the second data table onto at least two nodes in the distributed system according to the connection condition comprises:
loading the data records in the second data table into the memory of the at least two nodes according to the connection condition.
6. A data table connection device, characterized by comprising:
a receiving module, configured to receive a data table connection task, wherein the data table connection task indicates that a connection operation is to be performed on a first data table and a second data table according to a connection condition;
a loading module, configured to load the data records in the second data table onto at least two nodes in a distributed system according to the connection condition;
a reading module, configured to read a data record in the first data table as a current data record, determine a target node from the at least two nodes according to the connection condition corresponding to the current data record, and read the data record in the second data table stored on the target node as a target data record;
a connecting module, configured to perform the connection operation on the current data record and the target data record.
7. The device according to claim 6, characterized by further comprising:
a first judging module, configured to judge whether the data volume of the second data table is greater than the memory limit of a single node and, when the judgment result is yes, to trigger the loading module to perform the operation of loading the data records in the second data table onto at least two nodes in the distributed system according to the connection condition.
8. The device according to claim 6, characterized by further comprising:
a second judging module, configured to judge, according to the connection condition corresponding to the current data record, whether the target data record exists in a local cache and, when the judgment result is no, to trigger the reading module to perform the operation of determining a target node from the at least two nodes according to the connection condition corresponding to the current data record and reading the data record in the second data table stored on the target node as the target data record.
9. The device according to claim 6, characterized in that the connection condition includes at least one target key required for the connection;
the loading module is specifically configured to:
perform a hash operation on each target key of the at least one target key, to obtain the hash value of each target key;
determine the node corresponding to each target key according to the hash value of each target key and the number of the at least two nodes;
load the data records of the second data table corresponding to each target key onto the node corresponding to that target key.
10. The device according to any one of claims 6-9, characterized in that the loading module is specifically configured to:
load the data records in the second data table into the memory of the at least two nodes according to the connection condition.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610118167.7A CN107153643B (en) | 2016-03-02 | 2016-03-02 | Data table connection method and device |
TW106104646A TWI746511B (en) | 2016-03-02 | 2017-02-13 | Data table connection method and device |
PCT/CN2017/074177 WO2017148297A1 (en) | 2016-03-02 | 2017-02-20 | Method and device for joining tables |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610118167.7A CN107153643B (en) | 2016-03-02 | 2016-03-02 | Data table connection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107153643A (en) | 2017-09-12 |
CN107153643B (en) | 2021-02-19 |
Family ID=59742547
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201610118167.7A | Active | CN107153643B (en) | 2016-03-02 | 2016-03-02 | Data table connection method and device |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN107153643B (en) |
TW (1) | TWI746511B (en) |
WO (1) | WO2017148297A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710643A (en) * | 2018-12-20 | 2019-05-03 | 上海达梦数据库有限公司 | External connection management method, device, server and storage medium |
CN110413670A (en) * | 2019-06-28 | 2019-11-05 | 阿里巴巴集团控股有限公司 | Data export method, device and equipment based on MapReduce |
CN111506670A (en) * | 2019-01-31 | 2020-08-07 | 阿里巴巴集团控股有限公司 | Data processing method, device and equipment |
CN112597148A (en) * | 2020-11-25 | 2021-04-02 | 联想(北京)有限公司 | Data table connection method and device |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11520738B2 (en) * | 2019-09-20 | 2022-12-06 | Samsung Electronics Co., Ltd. | Internal key hash directory in table |
CN111752972B (en) * | 2020-07-01 | 2024-06-25 | 浪潮云信息技术股份公司 | Data association query method and system based on RocksDB key-value storage mode |
CN112732715B (en) * | 2020-12-31 | 2023-08-25 | 星环信息科技(上海)股份有限公司 | Data table association method, device and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102323947A (en) * | 2011-09-05 | 2012-01-18 | 东北大学 | Generation method of pre-join table on ring-shaped schema database |
CN102467570A (en) * | 2010-11-17 | 2012-05-23 | 日电(中国)有限公司 | Connection query system and method for distributed data warehouse |
CN103186651A (en) * | 2011-12-31 | 2013-07-03 | ***通信集团公司 | Distributed relational database as well as method and device for building and querying same |
CN104424240A (en) * | 2013-08-27 | 2015-03-18 | 腾讯科技(深圳)有限公司 | Multi-table correlation method and system, main service node and computing node |
CN104504114A (en) * | 2014-12-30 | 2015-04-08 | 杭州华为数字技术有限公司 | Multi-hash table-based relational operation optimization method, device and system |
CN105045871A (en) * | 2015-07-15 | 2015-11-11 | 国家超级计算深圳中心(深圳云计算中心) | Data aggregation query method and apparatus |
US20160055212A1 (en) * | 2014-08-22 | 2016-02-25 | Attivio, Inc. | Automatic joining of data sets based on statistics of field values in the data sets |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7085769B1 (en) * | 2001-04-26 | 2006-08-01 | Ncr Corporation | Method and apparatus for performing hash join |
US7177874B2 (en) * | 2003-01-16 | 2007-02-13 | Jardin Cary A | System and method for generating and processing results data in a distributed system |
CN102214176B (en) * | 2010-04-02 | 2014-02-05 | 中国人民解放军国防科学技术大学 | Method for splitting and join of huge dimension table |
CN104391957A (en) * | 2014-12-01 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Data interaction analysis method for hybrid big data processing system |
TWI522827B (en) * | 2015-01-09 | 2016-02-21 | Chunghwa Telecom Co Ltd | Real-time storage and real-time reading of huge amounts of data for non-related databases |
CN105183880A (en) * | 2015-09-22 | 2015-12-23 | 浪潮集团有限公司 | Hash join method and device |
- 2016-03-02: CN application CN201610118167.7A, patent CN107153643B (Active)
- 2017-02-13: TW application TW106104646A, patent TWI746511B (active)
- 2017-02-20: WO application PCT/CN2017/074177, publication WO2017148297A1 (Application Filing)
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710643A (en) * | 2018-12-20 | 2019-05-03 | 上海达梦数据库有限公司 | External connection management method, device, server and storage medium |
CN109710643B (en) * | 2018-12-20 | 2020-11-13 | 上海达梦数据库有限公司 | External connection management method, device, server and storage medium |
CN111506670A (en) * | 2019-01-31 | 2020-08-07 | 阿里巴巴集团控股有限公司 | Data processing method, device and equipment |
CN111506670B (en) * | 2019-01-31 | 2023-07-18 | 阿里巴巴集团控股有限公司 | Data processing method, device and equipment |
CN110413670A (en) * | 2019-06-28 | 2019-11-05 | 阿里巴巴集团控股有限公司 | Data export method, device and equipment based on MapReduce |
CN110413670B (en) * | 2019-06-28 | 2023-07-14 | 创新先进技术有限公司 | Data export method, device and equipment based on MapReduce |
CN112597148A (en) * | 2020-11-25 | 2021-04-02 | 联想(北京)有限公司 | Data table connection method and device |
Also Published As
Publication number | Publication date |
---|---|
TWI746511B (en) | 2021-11-21 |
TW201738781A (en) | 2017-11-01 |
CN107153643B (en) | 2021-02-19 |
WO2017148297A1 (en) | 2017-09-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||