CN102541858A - Data equality processing method, device and system based on mapping and protocol - Google Patents

Info

Publication number
CN102541858A
Authority
CN
China
Prior art keywords
partition
fine-grained
reduce
intermediate result
result data
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105856138A
Other languages
Chinese (zh)
Other versions
CN102541858B (en
Inventor
蔡斌
田万鹏
万乐
史晓峰
邱翔虎
刘奕慧
肖桂菊
宫振飞
张文郁
韩欣
崔小丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201010585613.8A
Publication of CN102541858A
Application granted
Publication of CN102541858B
Legal status: Active

Abstract

The invention discloses a load-balanced data processing method, device and system based on map and reduce (MapReduce). The method comprises: obtaining data submitted by a client; dividing the obtained data into initial partitions according to a preset number of mappers; performing map processing on the data in each initial partition to obtain intermediate result data; invoking a partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, wherein the preset number of fine-grained partitions is greater than the number of reduce partitions; outputting the intermediate result data volume information of each fine-grained partition to a job server; receiving the correspondence between fine-grained partitions and reduce partitions returned by the job server; merging the intermediate result data in the fine-grained partitions that belongs to the same reduce partition and outputting it to the corresponding reduce partition; and performing reduce processing on the intermediate result data in the reduce partitions to obtain the corresponding data processing result. By applying the invention, the data load can be balanced and the efficiency of data processing improved.

Description

Load-balanced data processing method, apparatus and system based on map and reduce
Technical field
The present invention relates to distributed data computing technology, and in particular to a load-balanced data processing method, apparatus and system based on map and reduce (MapReduce).
Background art
MapReduce is an existing system architecture for large-scale data processing. As a programming model, it is widely used for parallel computation over large data sets (for example, data sets larger than 1 TB), such as large-scale distributed filtering, large-scale distributed sorting, reversal of the Web link graph, Web access log analysis, inverted index construction, document clustering, machine learning and statistics-based machine translation. In the MapReduce architecture, data processing is divided into two phases. The first phase is the map (Map) phase: the data to be processed is divided into initial partitions, each element in an initial partition is computed, and the results are output to reduce partitions, wherein results with the same key are stored in the same reduce partition. The second phase is the reduce (Reduce) phase: the map results with the same key are merged into a list, and the elements of the list are then suitably merged to obtain the final result.
MapReduce makes it very convenient for programmers who are not familiar with distributed parallel programming to run their own programs on a distributed system.
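The two phases described above can be illustrated with a minimal single-process sketch. It is not Hadoop code; the word-count task and the names `map_phase` and `reduce_phase` are assumptions chosen only for illustration.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (key, value) pair for every word in one input split."""
    return [(word, 1) for line in split for word in line.split()]

def reduce_phase(pairs):
    """Reduce: merge all values that share a key into a list, then sum the list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

splits = [["a b a"], ["b c"]]   # two hypothetical initial partitions
intermediate = [pair for s in splits for pair in map_phase(s)]
result = reduce_phase(intermediate)
```

In a real deployment the map calls would run on different task servers and the grouping would happen across the network; the sketch only shows the data flow.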
Fig. 1 is a schematic structural diagram of an existing MapReduce-based load-balanced data processing system. Taking Hadoop, a framework for running applications on large clusters of inexpensive hardware, as an example, and referring to Fig. 1, the system comprises a client (Client), a job server (JobTracker) and one or more task servers (TaskTracker), wherein:
In this application, one MapReduce computation request is called a job. The client submits the job to the JobTracker through a client program (Client Program). The job is coordinated by the JobTracker: the Map phase is executed first, shown as M1, M2 and M3 in Fig. 1, followed by the Reduce phase, shown as R1 and R2 in Fig. 1. The data processing performed in the Map phase and the Reduce phase is monitored by the TaskTracker and runs in processes separate from the TaskTracker.
Specifically, the client submits the job, that is, the input data (Input Data), to the JobTracker through the Client Program. In the Map phase, the TaskTracker first divides the input data into initial partitions. In this application, the input data is divided into 5 non-overlapping initial partitions, input split 1 (Input Split 1) to input split 5 (Input Split 5). Map reads the input data through a call to the input format (InputFormat), and the data is then processed by 5 mappers (Mapper). TaskTracker 1 and TaskTracker 3 each contain two Mappers, and TaskTracker 2 contains one Mapper. The data in initial partitions 1 and 4 is input to the Mappers in TaskTracker 1, the data in initial partitions 3 and 5 is input to the Mappers in TaskTracker 3, and the data in initial partition 2 is input to the Mapper in TaskTracker 2. The data input to a Mapper has the format <key, value>, referred to below as key1 and value1.
After processing key1 and value1, the Mapper produces intermediate results (intermediate data) in the form <key, value>, referred to below as key2 and value2, and stores them in random access memory (RAM). The TaskTracker can combine (Combine) the intermediate data stored in RAM. By calling the partitioner (Partitioner) function for each intermediate result, the Reduce partition (Partition) of that intermediate result is specified, and the intermediate result is output to the buffer corresponding to that Reduce partition, such as Region 1 and Region 2 in Fig. 1.
The Partitioner function is defined as follows:
int getPartition(K2 key, V2 value, int numPartitions)
This function takes the parameters key and value, which are the key2 and value2 output by the Mapper, together with the number of partitions (numPartitions), that is, the specified number of Reduce partitions. The example shown in Fig. 1 has two reducers (Reducer), so the specified number of Reduce partitions is 2, with one Reduce partition per Reducer.
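As an illustration of what such a function might do, the following is a minimal sketch of a hash-based partitioner. It is an assumption for illustration, not the patent's or Hadoop's implementation; `get_partition` is a hypothetical name, and `zlib.crc32` stands in for whatever hash the real system uses.

```python
import zlib

def get_partition(key: str, value, num_partitions: int) -> int:
    """Return the reduce partition index for one intermediate <key2, value2> pair."""
    # crc32 is deterministic across processes, unlike Python's built-in hash() for str,
    # so every mapper sends the same key to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Two reducers, as in Fig. 1: every key is sent to partition 0 or 1.
partitions = [get_partition(k, None, 2) for k in ["apple", "pear", "apple"]]
```

Because the partition depends only on the key, all intermediate results with the same key end up in the same Reduce partition, which is the property the Reduce phase relies on.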
After the Map phase has finished, the Reduce phase begins. It comprises 3 steps: shuffle (Shuffle), sort (Sort) and Reduce. In the shuffle step, the Hadoop MapReduce system uses the keys in the Map results to transfer the related results, that is, the Mapper output that has been combined and partitioned and is stored as intermediate data in a buffer (Region 1 or Region 2), to a particular Reducer task. In other words, intermediate results with the same key, produced by multiple Mappers distributed across different TaskTrackers, are transferred to the TaskTracker of the Reducer that processes that key. For example, all intermediate data belonging to Region 1 in Fig. 1 is transferred to one Reducer task, and all intermediate data belonging to Region 2 in Fig. 1 is transferred to another Reducer task. Sorting proceeds at the same time as shuffling: <key2, value2> pairs with the same key from different Mappers are merged to form <key2, list of value2>, which serves as the input to the Reducer in the TaskTracker. The Reducer processes the received <key2, list of value2> to form the final result <key, value>, referred to below as <key3, value3>, and outputs it as output data (Output Data).
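The shuffle and sort steps can be pictured with the following sketch, which merges the output of several mappers into <key2, list of value2> entries for one reducer. The function name `shuffle_and_sort` and the sample data are hypothetical, and the network transfer is omitted.

```python
from collections import defaultdict

def shuffle_and_sort(mapper_outputs):
    """Merge (key2, value2) pairs from several mappers into sorted (key2, [value2, ...])."""
    grouped = defaultdict(list)
    for output in mapper_outputs:          # one list per mapper
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())         # sorted by key2

mapper_outputs = [[("b", 1), ("a", 1)], [("a", 2)]]
reducer_input = shuffle_and_sort(mapper_outputs)
```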
In the above example, the TaskTracker that performs the Map phase and the TaskTracker that performs the Reduce phase may be the same TaskTracker or different TaskTrackers.
As can be seen from the above, when the existing MapReduce-based data processing method merges the intermediate data, it has no means of effectively measuring data volume at the granularity of fine-grained partitions. As a result, when data is output in the manner of the specified Reduce partitions, the data volumes in the buffers corresponding to the Reduce partitions can be unbalanced; that is, the data volume in buffer Region 1 may differ greatly from the data volume in buffer Region 2. When the intermediate data in the buffers is then transferred to the corresponding Reducer tasks, the intermediate data input to the Reducers in the TaskTrackers may be uneven: some Reducers receive far more intermediate data than others.
Fig. 2 is a schematic diagram of the data volumes in the Reduce partitions after existing MapReduce-based load-balanced data processing. Referring to Fig. 2, there are two Reduce partitions. Suppose that, in Fig. 1, the data in Region 1 of TaskTracker 1 to TaskTracker 3 is transferred to Reduce partition 1 (Reducer 1), and the data in Region 2 of TaskTracker 1 to TaskTracker 3 is transferred to Reduce partition 2 (Reducer 2). The data volume input to each Reduce partition is the area under the corresponding curve. As the figure shows, because the data volume in Region 1 of TaskTracker 1 to TaskTracker 3 is far greater than the data volume in Region 2 of TaskTracker 1 to TaskTracker 3, the data volume that undergoes Reduce processing in Reduce partition 1 (performed by Reducer 1) greatly exceeds the data volume that undergoes Reduce processing in Reduce partition 2 (performed by Reducer 2). The time needed to process the data in Reduce partition 1 is therefore much longer than the time needed to process the data in Reduce partition 2. In practice, the time needed for data processing (the time at which the task ends) is determined by Reducer 1, which has the longer processing time. The unbalanced Reducer load therefore prolongs the time the system takes to process the data. For example, Reducer 1 is still processing data while Reducer 2 is already idle, so TaskTracker resources are not used effectively and the efficiency of data processing is reduced.
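The effect of this imbalance on completion time can be shown with a small worked example. The record counts and the processing rate below are invented for illustration, and the model assumes every reducer processes records at the same constant rate.

```python
def job_time(reducer_loads, records_per_second):
    """The job ends only when the most heavily loaded reducer finishes."""
    return max(reducer_loads) / records_per_second

# Same total of 1000 records, distributed two different ways over two reducers.
skewed = job_time([900, 100], records_per_second=10)    # Reducer 1 dominates
balanced = job_time([500, 500], records_per_second=10)  # even split
```

Under these assumptions the skewed distribution takes 90 seconds and the balanced one 50 seconds, although the total amount of work is identical.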
Summary of the invention
In view of this, the main object of the present invention is to provide a load-balanced data processing method based on map and reduce that balances the data load and improves the efficiency of data processing.
Another object of the present invention is to provide a load-balanced data processing apparatus based on map and reduce that balances the data load and improves the efficiency of data processing.
A further object of the present invention is to provide a load-balanced data processing system based on map and reduce that balances the data load and improves the efficiency of data processing.
To achieve the above objects, the invention provides a load-balanced data processing method based on map and reduce, the method comprising:
A. obtaining the data submitted by a client, and dividing the obtained data into initial partitions according to a preset number of mappers;
B. performing map processing on the data in each initial partition to obtain intermediate result data;
C. calling the partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions;
D. outputting the intermediate result data volume information of each fine-grained partition to the job server; receiving the correspondence between fine-grained partitions and reduce partitions returned by the job server; and merging the intermediate result data in the fine-grained partitions that belongs to the same reduce partition and outputting it to the corresponding reduce partition;
E. performing reduce processing on the intermediate result data in the reduce partitions to obtain the corresponding data processing result.
Between step C and step D, the method further comprises:
judging whether the progress of the map processing has reached a preset progress threshold and, if so, performing step D.
The preset progress threshold is a preset percentage of the data volume in the initial partition for which the mapper has completed map processing.
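A minimal sketch of such a progress check follows. The function name `should_report`, the use of record counts as the measure of data volume, and the 80% default are all assumptions; the source does not give a concrete threshold value.

```python
def should_report(records_mapped: int, records_in_split: int, threshold: float = 0.8) -> bool:
    """True once the mapper has processed at least `threshold` of its initial partition."""
    if records_in_split == 0:
        return True  # nothing to map, so the statistics are already final
    return records_mapped / records_in_split >= threshold
```

Reporting before the mapper has finished lets the job server start computing the correspondence early, at the cost of working from partial statistics.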
In step D, outputting the intermediate result data volume information of each fine-grained partition to the job server and receiving the correspondence between fine-grained partitions and reduce partitions returned by the job server specifically comprises:
the job server aggregates the intermediate result data volume information of the corresponding fine-grained partitions reported by each mapper to obtain the total intermediate result data volume; calculates, according to the number of reduce partitions, the intermediate result data volume that each reduce partition needs to process; determines, according to that calculated volume, the reduce partition to which each fine-grained partition corresponds, such that the sum of the intermediate result data volumes of the selected fine-grained partitions is equal or approximately equal to the intermediate result data volume that the corresponding reduce partition needs to process; and then outputs the correspondence information between fine-grained partitions and reduce partitions to the task server.
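The aggregation and target calculation performed by the job server can be sketched as follows. The function name `aggregate_reports` and the sample reports are hypothetical; each report is assumed to be a list with one volume per fine-grained partition, in the same order for every mapper.

```python
def aggregate_reports(reports):
    """Sum the per-fine-partition volumes reported by every mapper."""
    totals = [0] * len(reports[0])
    for report in reports:
        for index, volume in enumerate(report):
            totals[index] += volume
    return totals

reports = [[4, 1, 0, 3], [2, 1, 5, 0]]   # two mappers, four fine-grained partitions
totals = aggregate_reports(reports)       # total volume per fine-grained partition
num_reduce = 2
target = sum(totals) / num_reduce         # volume each reduce partition should handle
```

Here the totals are 6, 2, 5 and 3, so each of the two reduce partitions should receive about 8 units of data.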
The method further comprises: setting in advance a fine-grained partition sequence flag that identifies the order of each fine-grained partition within the group of fine-grained partitions. Selecting the fine-grained partitions corresponding to a reduce partition, such that the sum of the intermediate result data volumes of the selected fine-grained partitions is equal or approximately equal to the intermediate result data volume that the corresponding reduce partition needs to process, specifically comprises:
selecting in sequence the fine-grained partitions corresponding to the reduce partition, such that the sum of the intermediate result data volumes of the sequentially selected fine-grained partitions is equal or approximately equal to the intermediate result data volume that the corresponding reduce partition needs to process.
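One possible reading of this sequential selection is a greedy pass over the fine-grained partitions in their flagged order, sketched below. The source does not specify the exact rule for "approximately equal", so the closeness test used here, and the name `assign_sequentially`, are assumptions.

```python
def assign_sequentially(volumes, num_reduce):
    """Map each fine-grained partition (by position) to a reduce partition index."""
    target = sum(volumes) / num_reduce
    mapping, current, load = [], 0, 0
    for volume in volumes:
        # Close the current reduce partition if stopping here is at least as near
        # to the target as adding this fine-grained partition would be.
        if (current < num_reduce - 1 and load > 0
                and abs(load - target) <= abs(load + volume - target)):
            current, load = current + 1, 0
        mapping.append(current)
        load += volume
    return mapping

mapping = assign_sequentially([6, 2, 5, 3], num_reduce=2)
```

With the volumes above, the first two fine-grained partitions go to reduce partition 0 and the last two to reduce partition 1, giving each a load of 8. Because the selection is sequential, each reduce partition receives a contiguous run of fine-grained partitions, which keeps the correspondence simple to transmit.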
A load-balanced data processing apparatus based on map and reduce, the apparatus comprising a receiving unit, a balancing policy computing unit and a sending unit, wherein:
the receiving unit is configured to receive the data submitted by the client and output it to the sending unit, and to receive the intermediate result data volume information of each fine-grained partition sent by an external task server and output it to the balancing policy computing unit;
the balancing policy computing unit is configured to calculate, according to the intermediate result data volume information of each fine-grained partition output by the receiving unit and the preset number of reduce partitions, and following a preset balancing policy, the number of fine-grained partitions allocated to each reduce partition, and to output the correspondence information between fine-grained partitions and reduce partitions to the sending unit;
the sending unit is configured to output to the external task server the client-submitted data output by the receiving unit and the correspondence information between fine-grained partitions and reduce partitions output by the balancing policy computing unit.
A load-balanced data processing apparatus based on map and reduce, the apparatus comprising a receiving unit, an initial partition unit, a map processing unit, a fine-grained partition unit, a reduce partition unit and a reduce processing unit, wherein:
the receiving unit is configured to receive data from an external job server and output it to the initial partition unit, and to receive the correspondence information between fine-grained partitions and reduce partitions output by the external job server and output it to the reduce partition unit;
the initial partition unit is configured to divide the received data into initial partitions according to a preset number of mappers and output them to the corresponding map processing units;
the map processing unit is configured to perform map processing on the data output by the initial partition unit to obtain intermediate result data, and to output it to the fine-grained partition unit;
the fine-grained partition unit is configured to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions, and to output the intermediate result data volume information to the external job server; and to receive the correspondence information between fine-grained partitions and reduce partitions output by the receiving unit and, according to the correspondence, output the intermediate result data in the fine-grained partitions that belongs to the same reduce partition to the corresponding reduce partition;
the reduce partition unit is configured to output the received intermediate result data to the corresponding reduce processing unit;
the reduce processing unit is configured to perform reduce processing on the input merged intermediate result data to obtain the corresponding data processing result.
The apparatus further comprises a judging unit and a sending unit, wherein:
the judging unit is configured to judge whether the progress of the map processing unit has reached a preset progress threshold and, if so, to trigger output of the intermediate result data volume information in the fine-grained partition unit to the sending unit;
the sending unit is configured to output the received intermediate result data volume information to the external job server.
A load-balanced data processing system based on map and reduce, the system comprising a job server and one or more task servers, wherein:
the job server is configured to receive the data submitted by the client and output it to the task server; and, according to the received intermediate result data volume information of each fine-grained partition and the preset number of reduce partitions, to calculate, following a preset balancing policy, the number of fine-grained partitions allocated to each reduce partition, and output the correspondence information between fine-grained partitions and reduce partitions to the task server;
the task server is configured to divide the received data into initial partitions according to a preset number of mappers; perform map processing on the data in each initial partition to obtain intermediate result data; call the partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions; output the intermediate result data volume information of each fine-grained partition to the job server; receive the correspondence between fine-grained partitions and reduce partitions returned by the job server; merge the intermediate result data in the fine-grained partitions that belongs to the same reduce partition and output it to the reduce partition given by the correspondence; and perform reduce processing on the intermediate result data input to the reduce partitions to obtain the corresponding data processing result.
The job server comprises a receiving unit, a balancing policy computing unit and a sending unit, wherein:
the receiving unit is configured to receive the data submitted by the client and output it to the sending unit, and to receive the intermediate result data volume information of each fine-grained partition sent by an external task server and output it to the balancing policy computing unit;
the balancing policy computing unit is configured to calculate, according to the intermediate result data volume information of each fine-grained partition output by the receiving unit and the preset number of reduce partitions, and following a preset balancing policy, the number of fine-grained partitions allocated to each reduce partition, and to output the correspondence information between fine-grained partitions and reduce partitions to the sending unit;
the sending unit is configured to output to the external task server the client-submitted data output by the receiving unit and the correspondence information between fine-grained partitions and reduce partitions output by the balancing policy computing unit.
The task server comprises a receiving unit, an initial partition unit, a map processing unit, a fine-grained partition unit, a reduce partition unit and a reduce processing unit, wherein:
the receiving unit is configured to receive data from an external job server and output it to the initial partition unit, and to receive the correspondence information between fine-grained partitions and reduce partitions output by the external job server and output it to the fine-grained partition unit;
the initial partition unit is configured to divide the received data into initial partitions according to a preset number of mappers and output them to the corresponding map processing units;
the map processing unit is configured to perform map processing on the data output by the initial partition unit to obtain intermediate result data, and to output it to the fine-grained partition unit;
the fine-grained partition unit is configured to divide the input combined intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions, and to output the intermediate result data volume information to the external job server; and to receive the correspondence information between fine-grained partitions and reduce partitions output by the receiving unit and, according to the correspondence, output the intermediate result data in the fine-grained partitions that belongs to the same reduce partition to the corresponding reduce partition;
the reduce partition unit is configured to output the received merged intermediate result data to the corresponding reduce processing unit;
the reduce processing unit is configured to perform reduce processing on the input merged intermediate result data to obtain the corresponding data processing result.
The task server further comprises a judging unit and a sending unit, wherein:
the judging unit is configured to judge whether the progress of the map processing unit has reached a preset progress threshold and, if so, to trigger output of the intermediate result data volume information in the fine-grained partition unit to the sending unit;
the sending unit is configured to output the received intermediate result data volume information to the external job server.
As can be seen from the above technical solution, the load-balanced data processing method, apparatus and system based on map and reduce provided by the invention obtain the data submitted by the client and divide it into initial partitions according to a preset number of mappers; perform map processing on the data in each initial partition to obtain intermediate result data; call the partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions; output the intermediate result data volume information of each fine-grained partition to the job server; receive the correspondence between fine-grained partitions and reduce partitions returned by the job server; merge the intermediate result data in the fine-grained partitions that belongs to the same reduce partition and output it to the corresponding reduce partition; and perform reduce processing on the intermediate result data in the reduce partitions to obtain the corresponding data processing result. In this way, the Mapper output is divided into a large number of fine-grained partitions, which are then merged to form relatively uniform Reduce partitions. This balances the data load across the Reduce partitions, reduces Reducer load imbalance, makes effective use of TaskTracker resources, shortens the total time needed to complete the job and improves the efficiency of data processing.
Description of drawings
Fig. 1 is a schematic structural diagram of an existing MapReduce-based load-balanced data processing system.
Fig. 2 is a schematic diagram of the data volumes in the Reduce partitions after existing MapReduce-based load-balanced data processing.
Fig. 3 is a schematic flowchart of the MapReduce-based load-balanced data processing method according to an embodiment of the invention.
Fig. 4 is a schematic diagram of the intermediate result data structure after fine-grained partitioning according to an embodiment of the invention.
Fig. 5 is a schematic structural diagram of the MapReduce-based load-balanced data processing system according to an embodiment of the invention.
Fig. 6 is another schematic structural diagram of the MapReduce-based load-balanced data processing system according to an embodiment of the invention.
Fig. 7 is a schematic structural diagram of the job server according to an embodiment of the invention.
Fig. 8 is a schematic structural diagram of the task server according to an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
In the prior art, the intermediate results produced by the Mapper are stored in RAM, and when the intermediate result data stored in RAM is combined and partitioned, it is processed in the manner of the specified Reduce partitions and output to the corresponding buffers, so that the intermediate result data load across the buffers is unbalanced.
In the embodiment of the invention, before the intermediate result data stored in RAM is partitioned, the intermediate result data is pre-processed by fine-grained partitioning to obtain the distribution of the intermediate result data across the fine-grained partitions. Based on the distribution obtained, the intermediate result data of the fine-grained partitions is merged in the manner of the specified Reduce partitions according to a preset balancing policy and output to the corresponding buffers, so that the data load across the buffers is balanced.
Specifically, in the MapReduce framework, the intermediate result data of Map is grouped in two stages. The first stage performs fine-grained partitioning, which can use the original Partitioner function. Unlike the existing number of reduce partitions (N_r), the number of fine-grained partitions in this first stage (N_f) is much larger; that is, the number of partitions passed to the Partitioner function is much larger than the existing number of Reducers (N_r), and the Mapper divides the intermediate result data into fine-grained partitions based on N_f.
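The first stage can be sketched as an ordinary hash partitioner that is simply called with N_f instead of N_r. The values `N_R = 2` and `N_F = 16`, the crc32-based hash and the function names are assumptions for illustration; the source only requires that N_f be much larger than N_r.

```python
import zlib

N_R = 2    # number of reduce partitions (equals the number of reducers)
N_F = 16   # hypothetical fine-grained partition count, chosen so that N_F > N_R

def fine_partition(key: str) -> int:
    """Hash partitioner called with N_F rather than N_R."""
    return zlib.crc32(key.encode("utf-8")) % N_F

def partition_intermediate(pairs):
    """Place each (key2, value2) pair into one of N_F fine-grained partitions."""
    fine = [[] for _ in range(N_F)]
    for key, value in pairs:
        fine[fine_partition(key)].append((key, value))
    return fine

fine = partition_intermediate([("a", 1), ("b", 1), ("a", 1)])
volumes = [len(p) for p in fine]   # per-partition data volume a mapper could report
```

Pairs with the same key still land in the same fine-grained partition, so merging whole fine-grained partitions later keeps each key within a single reduce partition.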
When the Mapper reaches a predefined progress, the data volume information of each fine-grained partition from the first stage is collected and reported to the JobTracker. Based on the data volume information of each fine-grained partition reported by the Mappers, and according to a preset balancing policy, the intermediate result data of the fine-grained partitions is merged in the manner of the specified Reduce partitions and output to the corresponding buffers, thereby producing Reduce partitions of relatively uniform data volume. The number of specified Reduce partitions equals the number of Reducers. Each Reducer subsequently fetches its corresponding Reduce partition and performs the reduce, as in the prior art. In this way, the two-stage partitioning avoids the situation in the MapReduce framework where some Reducers have far more data to compute than others, thereby reducing the execution time of the task.
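The second stage, merging fine-grained partitions into Reduce partitions, can be sketched as follows. The mapping is assumed to arrive as a list in which position i holds the reduce partition index for fine-grained partition i; that representation, the function name `merge_fine_partitions` and the sample data are hypothetical.

```python
def merge_fine_partitions(fine_partitions, mapping, num_reduce):
    """Concatenate fine-grained partitions into the reduce partitions named by `mapping`."""
    reduce_partitions = [[] for _ in range(num_reduce)]
    for fine_index, records in enumerate(fine_partitions):
        reduce_partitions[mapping[fine_index]].extend(records)
    return reduce_partitions

fine_partitions = [[("a", 1)] * 6, [("b", 1)] * 2, [("c", 1)] * 5, [("d", 1)] * 3]
mapping = [0, 0, 1, 1]   # hypothetical correspondence returned by the job server
merged = merge_fine_partitions(fine_partitions, mapping, num_reduce=2)
```

In this example both reduce partitions end up with 8 records, whereas sending each fine-grained partition straight to a reducer by hash could have produced a 6/2/5/3 style imbalance.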
Fig. 3 is a schematic flowchart of the MapReduce-based load-balanced data processing method according to an embodiment of the invention. Referring to Fig. 3, the flow comprises:
Step 301: obtain the data submitted by the client;
In this step, the client submits the user's input data to the JobTracker through the client program; the JobTracker coordinates the job and outputs the data to the TaskTracker, and the TaskTracker obtains the data submitted by the client.
Step 302: divide the obtained data into initial partitions according to the preset number of mappers;
In this step, the preset number of mappers is the number of mappers that the invention uses to perform Map processing on the data. For example, if the number of mappers is 5, the obtained data is divided into 5 initial partitions.
Step 303 is carried out Map to the data in the primary partition respectively and is handled, and obtains the intermediate result data;
In this step, each Mapper reads the data in the corresponding primary partition, and the data of input are < key, value >, in describing below, is referred to as key1 and value1.
Mapper produces the intermediate result that exists with < key, value>form, and deposits RAM in after key1 and value1 are handled, and in describing below, is referred to as key2 and value2.
Step 301~step 303 is identical with prior art.
Step 304: call the Partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, where the preset number of fine-grained partitions is greater than the number of Reduce partitions.
In this step, so that the Reduce partitions fed to the Reducers carry a more balanced data load, the embodiment of the invention further divides the intermediate result data into fine-grained partitions. The intermediate result data in each fine-grained partition is thus finer in granularity, which helps the subsequent load balancing.
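As an illustration of step 304, the following is a minimal Python sketch of fine-grained partitioning with an ordinary hash partitioner. It is not the patent's implementation: the function names, the CRC32 hash and the sample records are assumptions made for this example. The only point it shows is that the same partitioner is called with N_f instead of N_r.

```python
import zlib
from collections import defaultdict

N_R = 2  # number of Reduce partitions (equals the number of Reducers)
N_F = 8  # number of fine-grained partitions; must satisfy N_F > N_R


def partition(key: str, num_partitions: int) -> int:
    """Hash partitioner (assumed): map a key to an index in [0, num_partitions)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def fine_grained_partition(records, n_f):
    """Group <key2, value2> records into at most n_f fine-grained partitions."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition(key, n_f)].append((key, value))
    return partitions


# Hypothetical Map output for one Mapper.
records = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
fine = fine_grained_partition(records, N_F)

# Per-partition data volume, here simply a record count.
sizes = {pid: len(rows) for pid, rows in fine.items()}
```

Because the partition index depends only on the key, all records with the same key land in the same fine-grained partition, so merging whole fine-grained partitions later never splits a key across Reducers.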
Step 305: merge (combine) the intermediate result data.
In this step, merging the intermediate result data is an optimization strategy used by MapReduce: identical intermediate result data can be merged so that a local reduce is performed on the intermediate result data output by a single Map task. This helps reduce the volume of data transferred from the Map stage to the Reduce stage. This step is optional.
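The optional merge in step 305 can be sketched as a local pre-reduce over one Map task's output. The example below assumes a word-count style job in which values for the same key are summed; the `combine` function and the sample data are hypothetical.

```python
from collections import defaultdict


def combine(records):
    """Locally merge one Map task's output by summing the values of each key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    # Sort by key so the output order is deterministic.
    return sorted(totals.items())


# Hypothetical output of a single Map task.
map_output = [("a", 1), ("b", 1), ("a", 1)]
combined = combine(map_output)
```

Three records shrink to two here; for skewed keys the reduction in data sent to the Reduce stage can be much larger.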
Fig. 4 is a diagram of the intermediate result data structure after fine-grained partitioning according to an embodiment of the invention. Referring to Fig. 4, in this embodiment the preset number of fine-grained partitions is 8, that is, the intermediate result data is divided into 8 fine-grained partitions, so that the intermediate result data in each fine-grained partition is finer in granularity.
Step 306: judge whether the progress of Map processing has reached a preset progress threshold; if so, execute step 307; otherwise, continue Map processing.
In this step, the Mapper calculates the progress of Map processing from the data volume in its initial partition and the data volume that has already been mapped (the intermediate result data volume stored in RAM).
The preset progress threshold may be a preset percentage, such as 80% or 100%, of the data in the corresponding initial partition that the Mapper has finished mapping. It can be chosen according to actual needs, provided that the data already mapped reflects the data distribution.
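The check in step 306 amounts to comparing a completion ratio with the preset threshold. The sketch below assumes progress is measured in bytes of the initial partition already mapped; the function names and the 0.8 default are illustrative only.

```python
def map_progress(processed_bytes: int, split_bytes: int) -> float:
    """Fraction of the initial partition that has been mapped, capped at 1.0."""
    if split_bytes <= 0:
        return 1.0  # an empty partition counts as finished
    return min(processed_bytes / split_bytes, 1.0)


def should_report(processed_bytes: int, split_bytes: int,
                  threshold: float = 0.8) -> bool:
    """True once Map progress reaches the preset progress threshold."""
    return map_progress(processed_bytes, split_bytes) >= threshold
```

A Mapper would call `should_report` periodically and, on the first `True`, send its per-partition data volumes to the JobTracker as described in step 307.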
Step 307: output the intermediate result data volume information of each fine-grained partition to the JobTracker; receive the correspondence between fine-grained partitions and Reduce partitions returned by the JobTracker; merge the intermediate result data of the fine-grained partitions that belong to the same Reduce partition and output it to the corresponding Reduce partition.
In this step, the JobTracker receives the intermediate result data volume information of the fine-grained partitions reported by each Mapper. Based on this information and the number of Reduce partitions (that is, the number of Reducers), it computes, according to the preset balancing policy, the fine-grained partitions to be assigned to each Reduce partition. Specifically, the JobTracker sums the reported intermediate result data volumes to obtain the total intermediate result data volume, and from the number of Reduce partitions calculates the intermediate result data volume each Reduce partition should process. It then determines the Reduce partition for each fine-grained partition so that the sum of the intermediate result data volumes of the fine-grained partitions selected for a Reduce partition equals or approximately equals the volume that Reduce partition should process. Finally, it outputs the correspondence between fine-grained partitions and Reduce partitions to the task server, so that the data volume assigned to each Reduce partition is relatively balanced.
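The text leaves the balancing policy open. As one possible policy, the sketch below uses a greedy largest-first heuristic (a swapped-in technique, not something the patent prescribes): fine-grained partitions are taken in decreasing order of volume and each is assigned to the currently least-loaded Reduce partition. The function names and the reported volumes are hypothetical.

```python
import heapq


def aggregate(reports):
    """Sum the per-fine-partition volumes reported by all Mappers."""
    total = {}
    for report in reports:
        for fid, size in report.items():
            total[fid] = total.get(fid, 0) + size
    return total


def assign_balanced(sizes, n_r):
    """Return {fine_partition_id: reduce_partition_index} using greedy largest-first."""
    heap = [(0, r) for r in range(n_r)]  # (current load, reduce partition index)
    heapq.heapify(heap)
    mapping = {}
    for fid, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, r = heapq.heappop(heap)  # least-loaded Reduce partition so far
        mapping[fid] = r
        heapq.heappush(heap, (load + size, r))
    return mapping


# Hypothetical reports from two Mappers: fine partition id -> data volume.
reports = [{1: 40, 2: 10, 3: 5}, {1: 10, 4: 20, 5: 15}]
sizes = aggregate(reports)
mapping = assign_balanced(sizes, 2)

loads = [0, 0]
for fid, r in mapping.items():
    loads[r] += sizes[fid]
```

With these made-up volumes the total is 100, the target per Reduce partition is 50, and the heuristic reaches it exactly; in general it only approximates the target, which matches the "equals or approximately equals" wording of the step.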
In the embodiment of the invention, the preset balancing policy can be determined according to actual needs; the general principle is simply to keep the data volume assigned to each Reduce partition relatively balanced. For example, for the intermediate result data after fine-grained partitioning shown in Fig. 4, suppose the number of Reduce partitions is 2. Based on the reported intermediate result data volume information of fine-grained partitions 1 to 8 and the number of Reduce partitions, 2, the JobTracker performs the balancing calculation with the preset balancing policy and decides that the intermediate result data in fine-grained partitions 1, 3 and 4 is output to Reduce partition 1, and the intermediate result data in the remaining fine-grained partitions 2, 5, 6, 7 and 8 is output to Reduce partition 2. The data loads of Reduce partition 1 and Reduce partition 2 are then relatively balanced.
Of course, in practice, before the fine-grained partitions are merged, a fine-grained partition sequence flag may also be set to identify the order of each fine-grained partition within the group of fine-grained partitions. For example, for the intermediate result data after fine-grained partitioning shown in Fig. 4, the sequence flags of fine-grained partitions 1 to 8 may be set to 1 to 8 respectively, where a flag of 1 denotes the first position. This preserves the order of the fine-grained partitions, which is needed by MapReduce applications that sort data and rely on partition order to work correctly. With the sequence flag set, fine-grained partitions are merged in flag order. Taking Fig. 4 as an example, according to the intermediate result data volume information of each fine-grained partition, the number of Reduce partitions and the preset balancing policy, the intermediate result data in fine-grained partitions 1, 2 and 3 can be output to Reduce partition 1, and the intermediate result data in fine-grained partitions 4, 5, 6, 7 and 8 can be output to Reduce partition 2. Although the balance may not be as good as when fine-grained partitions are merged without regard to order, the order of the fine-grained partitions is preserved, application scenarios that require sorted data are supported, and the data loads of Reduce partition 1 and Reduce partition 2 remain relatively balanced.
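An order-preserving merge can be sketched as cutting the flag-ordered sequence of fine-grained partitions into contiguous runs. The heuristic below, which advances to the next Reduce partition once the cumulative volume reaches that partition's share of the total, is an assumption chosen for illustration, and the eight volumes are made up (the patent does not give the sizes behind Fig. 4).

```python
def assign_in_order(sizes, n_r):
    """sizes: volumes listed in sequence-flag order.

    Returns one Reduce partition index per fine-grained partition; indices
    never decrease, so partition order is preserved.
    """
    total = sum(sizes)
    mapping = []
    r = 0
    cumulative = 0
    for size in sizes:
        mapping.append(r)
        cumulative += size
        # Boundary of the r-th Reduce partition's share of the total volume.
        boundary = total * (r + 1) / n_r
        if r < n_r - 1 and cumulative >= boundary:
            r += 1
    return mapping


# Hypothetical volumes for fine-grained partitions with flags 1..8.
sizes = [30, 10, 20, 5, 10, 10, 10, 5]
mapping = assign_in_order(sizes, 2)
```

With these volumes, partitions 1 to 3 go to the first Reduce partition (load 60) and partitions 4 to 8 to the second (load 40): less even than an unordered assignment could achieve, but contiguous. A very skewed input can leave a Reduce partition empty under this simple heuristic.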
Step 308: perform reduce processing on the intermediate result data input to each Reduce partition to obtain the corresponding data processing result.
In this step, the Reduce stage begins. Each Reducer follows the original flow: it obtains the merged partition and performs the corresponding computation. This is the same as in the prior art and is not described further here.
Fig. 5 is a structural diagram of the MapReduce-based data balancing processing system according to an embodiment of the invention. Referring to Fig. 5, the system comprises a job server and one or more task servers, wherein:
The job server is used to receive the data submitted by the client and output it to the task server; and, based on the received intermediate result data volume information of each fine-grained partition and the preset number of Reduce partitions, to compute according to the preset balancing policy the fine-grained partitions assigned to each Reduce partition, and to output the correspondence between fine-grained partitions and Reduce partitions to the task server.
The task server is used to divide the received data into initial partitions according to the preset number of mappers; perform Map processing on the data in each initial partition to obtain intermediate result data; call the Partitioner function to divide the intermediate result data into fine-grained partitions according to the preset number of fine-grained partitions, which is greater than the number of Reducers; judge whether the progress of Map processing has reached the preset progress threshold and, if so, output the intermediate result data volume information of each fine-grained partition to the job server; receive the correspondence between fine-grained partitions and Reduce partitions returned by the job server; merge the intermediate result data of the fine-grained partitions that belong to the same Reduce partition and output it to the Reduce partition given by the correspondence; and perform reduce processing on the intermediate result data input to each Reduce partition to obtain the corresponding data processing result.
Fig. 6 is another structural diagram of the MapReduce-based data balancing processing system according to an embodiment of the invention. Referring to Fig. 6, the system comprises a client, a job server and one or more task servers. The main flow of the system comprises task submission, Mapper execution and Reduce processing, wherein:
Task submission: the MapReduce developer writes the Mapper, Reducer and Combiner according to the requirements of the new framework, and provides the number of fine-grained partitions N_f and the number of Reduce partitions N_r, where N_f > N_r.
Mapper execution: the Mapper written by the user is invoked by the system to perform the computation. The Mapper's output <key2, value2> and the number of fine-grained partitions N_f are passed as parameters to the Partitioner function to obtain the fine-grained partition of that output, and the system stores the output in the corresponding fine-grained partition. At some point during execution, for example when 80% of its input has been processed, the Mapper reports to the JobTracker the count of Mapper output data in each current fine-grained partition (the intermediate result data volume information). This information reflects, to some degree, the distribution of the Reducer input data. After collecting the reports of all or most of the Mappers, the JobTracker takes the partition sequence flag into account, generates a more uniform Reduce partition scheme and notifies the Mappers.
According to the received Reduce partition scheme, the Mapper merges the N_f fine-grained partitions (Region.1 to Region.n) into N_r Reduce partitions.
Reduce processing: each Reduce partition obtains its input data and performs the reduce operation.
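The Mapper-side merge described in the flow above can be sketched as follows. It assumes the Reduce partition scheme arrives as a dictionary from fine-grained partition id to Reduce partition index; that representation, the function name and the sample data are assumptions of this example, not details given in the text.

```python
def merge_fine_partitions(fine_partitions, mapping, n_r):
    """Merge N_f fine-grained partitions into n_r Reduce partition buffers.

    fine_partitions: {fine_id: [(key2, value2), ...]}
    mapping:         {fine_id: reduce_index} as returned by the JobTracker
    """
    buffers = [[] for _ in range(n_r)]
    # Visit fine-grained partitions in id order so relative order is kept.
    for fid in sorted(fine_partitions):
        buffers[mapping[fid]].extend(fine_partitions[fid])
    return buffers


# Hypothetical data: four fine-grained partitions merged into two buffers.
fine_partitions = {1: [("a", 1)], 2: [("b", 2)], 3: [("c", 3)], 4: [("d", 4)]}
mapping = {1: 0, 2: 1, 3: 0, 4: 1}
buffers = merge_fine_partitions(fine_partitions, mapping, 2)
```

Each buffer then plays the role of one Reduce partition's input (Region1 and Region2 in Fig. 6).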
The above flow can be described in more detail as follows:
The client submits a job, that is, the input data (Input Data), through the Client Program and the JobTracker. In the Map stage, the TaskTracker first divides the input data into initial partitions. In this application, the input data is divided into 5 non-overlapping initial partitions, Input Split 1 to Input Split 5. Each Mapper is invoked through the input format (InputFormat) and reads its data. In this embodiment, the data is processed by 5 mappers (Mapper): TaskTracker 1 and TaskTracker 3 each contain two Mappers, and TaskTracker 2 contains one Mapper. The data in Input Split 1 and Input Split 4 is input to the Mappers in TaskTracker 1, the data in Input Split 3 and Input Split 5 is input to the Mappers in TaskTracker 3, and the data in Input Split 2 is input to the Mapper in TaskTracker 2. The data input to a Mapper has the form <key, value>, referred to below as key1 and value1.
After processing key1 and value1, the Mapper produces intermediate results in <key, value> form, referred to below as key2 and value2, and stores them in random access memory (RAM). The TaskTracker merges the intermediate data stored in RAM for each initial partition and calls the partitioner (Partitioner) function for each output. The processing up to this point is the same as in Fig. 1.
The difference from Fig. 1 lies in the processing performed after this merge, namely the Map output partitioning and buffer management part of the TaskTracker and the balancing part of the JobTracker. In this embodiment, the partitioning of the original <key2, value2> output is enhanced: the original <key2, value2> data is partitioned according to the preset number of fine-grained partitions N_f, that is, divided into N_f fine-grained partitions. The TaskTracker also communicates with the external JobTracker: once it judges that the progress of Map processing has reached the preset progress threshold, it reports the intermediate result data volume information of each fine-grained partition to the JobTracker. The JobTracker collects the Mappers' statistics on their output to the fine-grained partitions (the intermediate result data volume information), may take the partition sequence flag into account, generates a more uniform Reduce partition scheme according to the balancing policy, and notifies the TaskTracker. The TaskTracker receives the JobTracker's instruction, that is, the correspondence between fine-grained partitions and Reduce partitions. According to the number of Reduce partitions (the number of Reducers), it merges the <key2, value2> data in the fine-grained partitions by this correspondence into the designated Reduce partitions and outputs the data to the buffer of each Reduce partition, such as Region1 and Region2 in Fig. 6. Each Reduce partition performs reduce processing on the <key2, value2> data input from its buffer to obtain the corresponding data processing result.
Fig. 7 is a structural diagram of the job server according to an embodiment of the invention. Referring to Fig. 7, the job server comprises a receiving unit, a balancing policy calculation unit and a sending unit, wherein:
The receiving unit is used to receive the data submitted by the client and output it to the sending unit; and to receive the intermediate result data volume information of each fine-grained partition sent by the external task server and output it to the balancing policy calculation unit.
The balancing policy calculation unit is used to compute, from the intermediate result data volume information of each fine-grained partition output by the receiving unit and according to the preset balancing policy, the fine-grained partitions assigned to each of the preset number of Reduce partitions, and to output the correspondence between fine-grained partitions and Reduce partitions to the sending unit.
The sending unit is used to output to the external task server the client-submitted data output by the receiving unit and the correspondence between fine-grained partitions and Reduce partitions output by the balancing policy calculation unit.
Fig. 8 is a structural diagram of the task server according to an embodiment of the invention. Referring to Fig. 8, the task server comprises a receiving unit, an initial partitioning unit, a Map processing unit, a fine-grained partitioning unit, a judging unit, a sending unit, a Reduce partitioning unit and a reduce processing unit, wherein:
The receiving unit is used to receive data from the external job server and output it to the initial partitioning unit; and to receive the correspondence between fine-grained partitions and Reduce partitions output by the external job server and output it to the fine-grained partitioning unit.
The initial partitioning unit is used to divide the received data into initial partitions according to the preset number of mappers and output them to the corresponding Map processing units.
The Map processing unit is used to perform Map processing on the data output by the initial partitioning unit to obtain intermediate result data, and to output it to the fine-grained partitioning unit.
The fine-grained partitioning unit is used to divide the merged intermediate result data it receives into fine-grained partitions according to the preset number of fine-grained partitions, which is greater than the number of Reduce partitions; and to receive the correspondence between fine-grained partitions and Reduce partitions output by the receiving unit and, according to that correspondence, output the intermediate result data of the fine-grained partitions that belong to the same Reduce partition to the corresponding Reduce partition.
The judging unit is used to judge whether the progress of the Map processing unit has reached the preset progress threshold and, if so, to output the intermediate result data volume information in the fine-grained partitioning unit to the sending unit.
The sending unit is used to output the received intermediate result data volume information to the external job server.
The Reduce partitioning unit is used to output the received intermediate result data to the corresponding reduce processing unit.
The reduce processing unit is used to perform reduce processing on the input intermediate result data to obtain the corresponding data processing result.
As can be seen from the above, in the MapReduce-based data balancing processing method, device and system of the embodiment of the invention, after the task server performs Map processing on the data and obtains intermediate result data, it further divides the intermediate result data into fine-grained partitions according to the preset number of fine-grained partitions and outputs the intermediate result data volume information of each fine-grained partition to the JobTracker. Based on this information and the number of Reduce partitions, the JobTracker computes according to the preset balancing policy the fine-grained partitions assigned to each Reduce partition, and outputs the correspondence between fine-grained partitions and Reduce partitions to the task server. According to the received correspondence, the task server outputs the intermediate result data of the fine-grained partitions that belong to the same Reduce partition to the corresponding Reduce partition for reduce processing. In this way, the Mapper output is divided into a large number of fine-grained partitions, which are then merged into relatively uniform Reduce partitions. This balances the data load across the Reduce partitions, reduces load imbalance among the Reducers, makes effective use of TaskTracker resources, shortens the total time needed to complete the job and improves data processing efficiency.
Further, the output of the intermediate result data volume information of each fine-grained partition to the JobTracker can be triggered when the progress of Map processing is judged to have reached the preset progress threshold. Provided that the intermediate result data volume information already reflects the data distribution, this effectively reduces the amount of data exchanged between the TaskTracker and the JobTracker and the time needed for data balancing. Moreover, setting the fine-grained partition sequence flag supports application scenarios in which the data must be sorted.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (12)

1. A data balancing processing method based on map and reduce, characterized in that the method comprises:
A. obtaining the data submitted by a client, and dividing the obtained data into initial partitions according to a preset number of mappers;
B. performing map processing on the data in each initial partition to obtain intermediate result data;
C. calling a partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions;
D. outputting the intermediate result data volume information of each fine-grained partition to a job server; receiving the correspondence between fine-grained partitions and reduce partitions returned by the job server; merging the intermediate result data of the fine-grained partitions that belong to the same reduce partition and outputting it to the corresponding reduce partition;
E. performing reduce processing on the intermediate result data in each reduce partition to obtain the corresponding data processing result.
2. The method of claim 1, characterized in that, between step C and step D, the method further comprises:
judging whether the progress of map processing has reached a preset progress threshold and, if so, executing step D.
3. The method of claim 2, characterized in that the preset progress threshold is a preset percentage of the data in the initial partition that the mapper has finished mapping.
4. The method of claim 3, characterized in that outputting the intermediate result data volume information of each fine-grained partition to the job server and receiving the correspondence between fine-grained partitions and reduce partitions returned by the job server in step D specifically comprises:
the job server sums the intermediate result data volume information of the fine-grained partitions reported by each mapper to obtain the total intermediate result data volume; calculates, from the number of reduce partitions, the intermediate result data volume each reduce partition needs to process; determines, from the calculated volume each reduce partition needs to process, the reduce partition corresponding to each fine-grained partition, so that the sum of the intermediate result data volumes of the selected fine-grained partitions equals or approximately equals the intermediate result data volume the corresponding reduce partition needs to process; and then outputs the correspondence between fine-grained partitions and reduce partitions to the task server.
5. The method of claim 4, characterized by further comprising: setting in advance a fine-grained partition sequence flag that identifies the order of each fine-grained partition within the group of fine-grained partitions; wherein selecting the fine-grained partitions corresponding to a reduce partition so that the sum of the intermediate result data volumes of the selected fine-grained partitions equals or approximately equals the intermediate result data volume the corresponding reduce partition needs to process specifically comprises:
selecting, in order, the fine-grained partitions corresponding to each reduce partition, so that the sum of the intermediate result data volumes of the fine-grained partitions selected in order equals or approximately equals the intermediate result data volume the corresponding reduce partition needs to process.
6. A data balancing processing device based on map and reduce, characterized in that the device comprises a receiving unit, a balancing policy calculation unit and a sending unit, wherein:
the receiving unit is used to receive the data submitted by a client and output it to the sending unit; and to receive the intermediate result data volume information of each fine-grained partition sent by an external task server and output it to the balancing policy calculation unit;
the balancing policy calculation unit is used to compute, from the intermediate result data volume information of each fine-grained partition output by the receiving unit and according to a preset balancing policy, the fine-grained partitions assigned to each of a preset number of reduce partitions, and to output the correspondence between fine-grained partitions and reduce partitions to the sending unit;
the sending unit is used to output to the external task server the client-submitted data output by the receiving unit and the correspondence between fine-grained partitions and reduce partitions output by the balancing policy calculation unit.
7. A data balancing processing device based on map and reduce, characterized in that the device comprises a receiving unit, an initial partitioning unit, a map processing unit, a fine-grained partitioning unit, a reduce partitioning unit and a reduce processing unit, wherein:
the receiving unit is used to receive data from an external job server and output it to the initial partitioning unit; and to receive the correspondence between fine-grained partitions and reduce partitions output by the external job server and output it to the reduce partitioning unit;
the initial partitioning unit is used to divide the received data into initial partitions according to a preset number of mappers and output them to the corresponding map processing units;
the map processing unit is used to perform map processing on the data output by the initial partitioning unit to obtain intermediate result data, and to output it to the fine-grained partitioning unit;
the fine-grained partitioning unit is used to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions, and to output the intermediate result data volume information to the external job server; and to receive the correspondence between fine-grained partitions and reduce partitions output by the receiving unit and, according to that correspondence, output the intermediate result data of the fine-grained partitions that belong to the same reduce partition to the corresponding reduce partition;
the reduce partitioning unit is used to output the received intermediate result data to the corresponding reduce processing unit;
the reduce processing unit is used to perform reduce processing on the merged intermediate result data it receives to obtain the corresponding data processing result.
8. The device of claim 7, characterized by further comprising a judging unit and a sending unit, wherein:
the judging unit is used to judge whether the progress of the map processing unit has reached a preset progress threshold and, if so, to trigger output of the intermediate result data volume information in the fine-grained partitioning unit to the sending unit;
the sending unit is used to output the received intermediate result data volume information to the external job server.
9. A data balancing processing system based on map and reduce, characterized in that the system comprises a job server and one or more task servers, wherein:
the job server is used to receive the data submitted by a client and output it to the task server; and, based on the received intermediate result data volume information of each fine-grained partition and a preset number of reduce partitions, to compute according to a preset balancing policy the fine-grained partitions assigned to each reduce partition, and to output the correspondence between fine-grained partitions and reduce partitions to the task server;
the task server is used to divide the received data into initial partitions according to a preset number of mappers; perform map processing on the data in each initial partition to obtain intermediate result data; call a partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions; output the intermediate result data volume information of each fine-grained partition to the job server; receive the correspondence between fine-grained partitions and reduce partitions returned by the job server; merge the intermediate result data of the fine-grained partitions that belong to the same reduce partition and output it to the reduce partition given by the correspondence; and perform reduce processing on the intermediate result data input to each reduce partition to obtain the corresponding data processing result.
10. The system as claimed in claim 9, characterized in that the job server comprises a receiving unit, a balancing policy computing unit and a sending unit, wherein:
The receiving unit is configured to receive the data submitted by the client and output it to the sending unit, and to receive the intermediate result data volume information for each fine-grained partition sent by the external task server and output it to the balancing policy computing unit;
The balancing policy computing unit is configured to compute, according to the intermediate result data volume information for each fine-grained partition output by the receiving unit and according to the preset balancing policy, the number of fine-grained partitions allocated to each reduce partition among the preset number of reduce partitions, and to output the correspondence information between fine-grained partitions and reduce partitions to the sending unit;
The sending unit is configured to output, to the external task server, the client-submitted data output by the receiving unit and the correspondence information between fine-grained partitions and reduce partitions output by the balancing policy computing unit.
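The claim leaves the balancing policy open. As one possible illustration, the sketch below uses a greedy largest-first heuristic: fine-grained partitions are visited in descending volume and each is given to the currently lightest reduce partition. The heuristic and the name `assign_fine_to_reduce` are assumptions, not the policy of the patent.

```python
import heapq


def assign_fine_to_reduce(volumes, num_reduce):
    """Return a dict mapping fine-grained partition id -> reduce partition id.

    Assumed policy: greedy largest-first assignment. `volumes[i]` is the
    reported data volume of fine-grained partition i.
    """
    if num_reduce <= 0 or len(volumes) <= num_reduce:
        # The claims require more fine-grained partitions than reduce partitions.
        raise ValueError("need more fine-grained partitions than reduce partitions")
    heap = [(0, r) for r in range(num_reduce)]  # (current load, reduce id)
    heapq.heapify(heap)
    mapping = {}
    order = sorted(range(len(volumes)), key=lambda i: volumes[i], reverse=True)
    for fine_id in order:
        load, reduce_id = heapq.heappop(heap)  # lightest reduce partition so far
        mapping[fine_id] = reduce_id
        heapq.heappush(heap, (load + volumes[fine_id], reduce_id))
    return mapping


# Usage: one heavy fine-grained partition and five light ones, two reducers.
mapping = assign_fine_to_reduce([90, 10, 10, 10, 30, 30], num_reduce=2)
```

In this usage example the heavy partition ends up alone on one reduce partition while the five lighter ones share the other, so both carry the same total volume.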
11. The system as claimed in claim 9, characterized in that the task server comprises a receiving unit, an initial partitioning unit, a map processing unit, a fine-grained partitioning unit, a reduce partitioning unit and a reduce processing unit, wherein:
The receiving unit is configured to receive data from the external job server and output it to the initial partitioning unit, and to receive the correspondence information between fine-grained partitions and reduce partitions output by the external job server and output it to the fine-grained partitioning unit;
The initial partitioning unit is configured to initially partition the received data according to the preset number of mappers and output it to the corresponding map processing unit;
The map processing unit is configured to perform map processing on the data output by the initial partitioning unit to obtain intermediate result data, and output it to the fine-grained partitioning unit;
The fine-grained partitioning unit is configured to divide the input merged intermediate result data into fine-grained partitions according to the preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions, and to output the intermediate result data volume information to the external job server; and to receive the correspondence information between fine-grained partitions and reduce partitions output by the receiving unit and, according to the correspondence, output the intermediate result data in the fine-grained partitions that belong to the same reduce partition to the corresponding reduce partition;
The reduce partitioning unit is configured to output the received merged intermediate result data to the corresponding reduce processing unit;
The reduce processing unit is configured to perform reduce processing on the input merged intermediate result data to obtain the corresponding data processing result.
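The merge and reduce steps carried out by these units can be sketched as follows, assuming the correspondence arrives as a dictionary from fine-grained partition id to reduce partition id and taking a per-key sum as a stand-in reduce function. Both `merge_by_mapping` and `reduce_sum` are hypothetical names.

```python
from collections import defaultdict


def merge_by_mapping(fine_parts, mapping):
    """Merge fine-grained partitions that belong to the same reduce partition.

    `fine_parts` maps fine-grained partition id -> list of (key, value);
    `mapping` maps fine-grained partition id -> reduce partition id
    (assumed format of the correspondence returned by the job server).
    """
    reduce_inputs = defaultdict(list)
    for fine_id, records in fine_parts.items():
        reduce_inputs[mapping[fine_id]].extend(records)
    return dict(reduce_inputs)


def reduce_sum(records):
    """Stand-in reduce function: sum the values of each key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


# Usage: three fine-grained partitions regrouped into two reduce partitions.
fine_parts = {0: [("a", 1), ("a", 1)], 1: [("b", 1)], 2: [("c", 1), ("c", 1)]}
merged = merge_by_mapping(fine_parts, {0: 0, 1: 1, 2: 1})
results = {r: reduce_sum(recs) for r, recs in merged.items()}
```

Since every key lives in exactly one fine-grained partition, regrouping whole fine-grained partitions never splits a key across reduce partitions.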
12. The system as claimed in claim 11, characterized in that the task server further comprises a judging unit and a sending unit, wherein:
The judging unit is configured to judge whether the progress of the map processing unit has reached a preset progress threshold and, if so, to trigger output of the intermediate result data volume information in the fine-grained partitioning unit to the sending unit;
The sending unit is configured to output the received intermediate result data volume information to the external job server.
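The threshold check of the judging unit can be pictured with a small sketch. It assumes that progress is a fraction between 0 and 1 and that the volume information is sent only once; the claim states neither, and the class name `VolumeReporter` and its callback are hypothetical.

```python
class VolumeReporter:
    """Hypothetical judging unit: report volumes once map progress hits a threshold."""

    def __init__(self, threshold, send):
        self.threshold = threshold  # preset progress threshold, assumed in [0, 1]
        self.send = send            # callback standing in for the sending unit
        self.reported = False       # one-shot reporting is an assumption

    def on_progress(self, progress, volumes):
        """Call whenever map progress changes; returns True if a report was sent."""
        if not self.reported and progress >= self.threshold:
            self.send(list(volumes))
            self.reported = True
            return True
        return False


# Usage: collect what would be sent to the job server.
sent = []
reporter = VolumeReporter(threshold=0.8, send=sent.append)
reporter.on_progress(0.5, [3, 1, 0, 2])  # below threshold, nothing sent
reporter.on_progress(0.9, [5, 2, 1, 4])  # threshold reached, volumes sent
```

Reporting before the map phase finishes lets the job server compute the correspondence early, at the cost of working from partial volumes.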
CN201010585613.8A 2010-12-07 2010-12-07 Data balancing processing method, apparatus and system based on map and reduce Active CN102541858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010585613.8A CN102541858B (en) 2010-12-07 2010-12-07 Based on mapping and the data balancing processing method of stipulations, Apparatus and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010585613.8A CN102541858B (en) 2010-12-07 2010-12-07 Based on mapping and the data balancing processing method of stipulations, Apparatus and system

Publications (2)

Publication Number Publication Date
CN102541858A true CN102541858A (en) 2012-07-04
CN102541858B CN102541858B (en) 2016-06-15

Family

ID=46348781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010585613.8A Active CN102541858B (en) 2010-12-07 2010-12-07 Data balancing processing method, apparatus and system based on map and reduce

Country Status (1)

Country Link
CN (1) CN102541858B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023805A (en) * 2012-11-22 2013-04-03 北京航空航天大学 MapReduce system
CN103942197A (en) * 2013-01-17 2014-07-23 阿里巴巴集团控股有限公司 Data monitoring processing method and device
CN104008012A (en) * 2014-05-30 2014-08-27 长沙麓云信息科技有限公司 High-performance MapReduce realization mechanism based on dynamic migration of virtual machine
CN104252338A (en) * 2013-06-25 2014-12-31 华为技术有限公司 Data processing method and data processing equipment
CN104391748A (en) * 2014-11-21 2015-03-04 浪潮电子信息产业股份有限公司 Mapreduce computation process optimization method
CN105760215A (en) * 2014-12-17 2016-07-13 南京绿云信息技术有限公司 Map-reduce model based job running method for distributed file system
CN106502790A (en) * 2016-10-12 2017-03-15 山东浪潮云服务信息科技有限公司 Task distribution optimization method based on data distribution
CN106933935A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 Task storage method and device
CN108595268A (en) * 2018-04-24 2018-09-28 咪咕文化科技有限公司 Data distribution method and device based on MapReduce and computer-readable storage medium
CN109408559A (en) * 2018-10-09 2019-03-01 北京易观智库网络科技有限公司 Retention analysis method, apparatus and storage medium
CN109947559A (en) * 2019-02-03 2019-06-28 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for optimizing MapReduce calculation
CN110309177A (en) * 2018-03-23 2019-10-08 腾讯科技(深圳)有限公司 Data processing method and related device
CN110417910A (en) * 2019-08-07 2019-11-05 北京达佳互联信息技术有限公司 Notification message sending method, device, server and storage medium
CN113297188A (en) * 2021-02-01 2021-08-24 淘宝(中国)软件有限公司 Data processing method and device
CN108415912B (en) * 2017-02-09 2021-11-09 阿里巴巴集团控股有限公司 Data processing method and device based on MapReduce model
CN113778657A (en) * 2020-09-24 2021-12-10 北京沃东天骏信息技术有限公司 Data processing method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080086442A1 (en) * 2006-10-05 2008-04-10 Yahoo! Inc. Mapreduce for distributed database processing
CN101764835A (en) * 2008-12-25 2010-06-30 华为技术有限公司 Task allocation method and device based on MapReduce programming framework
CN101770402A (en) * 2008-12-29 2010-07-07 ***通信集团公司 Map task scheduling method, equipment and system in MapReduce system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yongchul Kwon et al.: "Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions", 《SoCC'10》, 11 June 2010 (2010-06-11), pages 1 - 12 *
Zhou Min: "Anthill: a MapReduce-based distributed DBMS", 《China Master's Theses Full-text Database, Information Science and Technology》, no. 10, 15 October 2010 (2010-10-15) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023805A (en) * 2012-11-22 2013-04-03 北京航空航天大学 MapReduce system
CN103942197A (en) * 2013-01-17 2014-07-23 阿里巴巴集团控股有限公司 Data monitoring processing method and device
CN104252338A (en) * 2013-06-25 2014-12-31 华为技术有限公司 Data processing method and data processing equipment
CN104008012A (en) * 2014-05-30 2014-08-27 长沙麓云信息科技有限公司 High-performance MapReduce realization mechanism based on dynamic migration of virtual machine
CN104008012B (en) * 2014-05-30 2017-10-20 长沙麓云信息科技有限公司 High-performance MapReduce implementation method based on dynamic migration of virtual machines
CN104391748A (en) * 2014-11-21 2015-03-04 浪潮电子信息产业股份有限公司 Mapreduce computation process optimization method
CN105760215A (en) * 2014-12-17 2016-07-13 南京绿云信息技术有限公司 Map-reduce model based job running method for distributed file system
CN106933935B (en) * 2015-12-31 2019-12-10 北京国双科技有限公司 Task storage method and device
CN106933935A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 Task storage method and device
CN106502790A (en) * 2016-10-12 2017-03-15 山东浪潮云服务信息科技有限公司 Task distribution optimization method based on data distribution
CN108415912B (en) * 2017-02-09 2021-11-09 阿里巴巴集团控股有限公司 Data processing method and device based on MapReduce model
CN110309177B (en) * 2018-03-23 2023-11-03 腾讯科技(深圳)有限公司 Data processing method and related device
CN110309177A (en) * 2018-03-23 2019-10-08 腾讯科技(深圳)有限公司 Data processing method and related device
CN108595268B (en) * 2018-04-24 2021-03-09 咪咕文化科技有限公司 Data distribution method and device based on MapReduce and computer-readable storage medium
CN108595268A (en) * 2018-04-24 2018-09-28 咪咕文化科技有限公司 Data distribution method and device based on MapReduce and computer-readable storage medium
CN109408559A (en) * 2018-10-09 2019-03-01 北京易观智库网络科技有限公司 Retention analysis method, apparatus and storage medium
CN109947559A (en) * 2019-02-03 2019-06-28 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for optimizing MapReduce calculation
CN109947559B (en) * 2019-02-03 2021-11-23 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for optimizing MapReduce calculation
CN110417910A (en) * 2019-08-07 2019-11-05 北京达佳互联信息技术有限公司 Notification message sending method, device, server and storage medium
CN113778657A (en) * 2020-09-24 2021-12-10 北京沃东天骏信息技术有限公司 Data processing method and device
CN113778657B (en) * 2020-09-24 2024-04-16 北京沃东天骏信息技术有限公司 Data processing method and device
CN113297188A (en) * 2021-02-01 2021-08-24 淘宝(中国)软件有限公司 Data processing method and device

Also Published As

Publication number Publication date
CN102541858B (en) 2016-06-15

Similar Documents

Publication Publication Date Title
CN102541858A (en) Data equality processing method, device and system based on mapping and protocol
Xu et al. An IoT-oriented data placement method with privacy preservation in cloud environment
Hu et al. Time-and cost-efficient task scheduling across geo-distributed data centers
Liu et al. Resource preprocessing and optimal task scheduling in cloud computing environments
CN110321223A (en) Data flow partitioning method and device aware of Coflow collaborative job flow scheduling
CN104657220A (en) Model and method for scheduling for mixed cloud based on deadline and cost constraints
Liu et al. SP-Partitioner: A novel partition method to handle intermediate data skew in spark streaming
CN104809130A (en) Method, equipment and system for data query
CN110209494A (en) Big-data-oriented distributed task scheduling method and Hadoop cluster
Mahato et al. On scheduling transactions in a grid processing system considering load through ant colony optimization
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
CN108197486A (en) Big data desensitization method, system, computer-readable medium and equipment
CN104937544A (en) Computing regression models
CN104516773A (en) Data distribution method and data distribution device for physical machine
Li et al. An effective scheduling strategy based on hypergraph partition in geographically distributed datacenters
Jangiti et al. Scalable and direct vector bin-packing heuristic based on residual resource ratios for virtual machine placement in cloud data centers
Grosu et al. Cooperative load balancing in distributed systems
CN109614227A (en) Task resource concocting method, device, electronic equipment and computer-readable medium
US20220300323A1 (en) Job Scheduling Method and Job Scheduling Apparatus
CN102831102A (en) Method and system for carrying out matrix product operation on computer cluster
Kaur et al. Latency and network aware placement for cloud-native 5G/6G services
Arifuzzaman et al. Fast parallel conversion of edge list to adjacency list for large-scale graphs
Peng et al. Research on cloud computing resources provisioning based on reinforcement learning
Konovalov et al. Job control in heterogeneous computing systems
Yassir et al. Graph-based model and algorithm for minimising big data movement in a cloud environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180914

Address after: Room 309, West Section, 3rd Floor, No. 49 Zhichun Road, Haidian District, Beijing 100090

Patentee after: Tencent Cloud Computing (Beijing) Co., Ltd.

Address before: Room 403, East, Building 2, SEG Science and Technology Park, Zhenxing Road, Futian District, Shenzhen, Guangdong 518044, China

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.