CN110222018A

CN110222018A - Data summarization execution method and device

Info

Publication number: CN110222018A
Application number: CN201910397774.5A
Authority: CN
Inventors: 张惠亮; 李贲; 刘胜; 吴锋海
Original assignee: Union Mobile Pay Co Ltd
Current assignee: Union Mobile Pay Co Ltd
Priority date: 2019-05-14
Filing date: 2019-05-14
Publication date: 2019-09-10

Abstract

An embodiment of the present invention provides a data summarization execution method and device. The method is applied to a terminal that supports at least two aggregation processing modules and comprises: obtaining the file attribute of an input file of a target aggregation processing module; matching the file attribute against the property parameters corresponding to the target aggregation processing module to obtain a matching result; and, if the matching result is yes, processing the input file according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result. Different summarization tasks therefore do not each need a separate MapReduce application; each is handled by its own built-in functional module, which reduces development difficulty and workload and simplifies execution.

Description

Data summarization execution method and device

Technical field

The present invention relates to the technical field of data processing, and in particular to a data summarization execution method and device.

Background art

With the widespread use of big data processing technology, and in particular the growing maturity of the open-source Hadoop system (Hadoop is a distributed system infrastructure developed by the Apache Foundation), Hadoop has become an important piece of infrastructure in data warehouse construction. A Hadoop system is divided into data storage, HDFS (a distributed file system), and data computation, MapReduce. MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB).

In data warehouse construction, basic data generally uses the Hive table format. The Hive table format is similar to that of an ordinary relational database, except that underneath it exists in the HDFS file format HFile.

In the usual processing scheme, a separate MapReduce program is written for each computing task. Each program specifies its own underlying Hive input files and implements its own map and reduce logic to generate the corresponding result table.

Consequently, executing multiple computing tasks requires writing multiple MapReduce programs, and even when different MapReduce programs read the same table files, those files must be read repeatedly. Whether the programs run sequentially or in parallel, they consume a large amount of system resources and time. Creating a new computing task means writing and submitting another MapReduce program, which increases processing complexity.

Summary of the invention

In view of the problems in the prior art, embodiments of the present invention provide a data summarization execution method and device.

An embodiment of the present invention provides a data summarization execution method. The method is applied to a terminal that supports at least two aggregation processing modules, and comprises:

obtaining the file attribute of an input file of a target aggregation processing module;

matching the file attribute against the property parameters corresponding to the target aggregation processing module to obtain a matching result;

if the matching result is yes, processing the input file according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result.

An embodiment of the present invention provides a data summarization execution device. The device is applied to a terminal that supports at least two aggregation processing modules, and comprises:

an acquiring unit, configured to obtain the file attribute of an input file of a target aggregation processing module;

a matching unit, configured to match the file attribute against the property parameters corresponding to the target aggregation processing module to obtain a matching result;

a processing unit, configured to, when the matching result is yes, process the input file according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result.

An embodiment of the present invention provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above data summarization execution method when executing the program.

An embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above data summarization execution method.

In the data summarization execution method and device provided by embodiments of the present invention, at least two aggregation processing modules are supported, each executing a different summarization task. The file attribute of the input file of the target aggregation processing module is obtained and matched against the property parameters; after a successful match, the input file is processed according to the corresponding summarization task parameters to obtain a processing result. Different summarization tasks therefore do not each need a separate MapReduce application; each is handled by its own built-in functional module, which reduces development difficulty and workload and simplifies execution.

Description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of an embodiment of the data summarization execution method of the present invention;

Fig. 2 is a flowchart of an embodiment of the data summarization execution method of the present invention;

Fig. 3 is a structural diagram of an embodiment of the data summarization execution device of the present invention;

Fig. 4 is a structural diagram of an embodiment of the data summarization execution device of the present invention;

Fig. 5 is a schematic structural diagram of an embodiment of the electronic device of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Fig. 1 shows a data summarization execution method provided by an embodiment of the present invention. The method is applied to a terminal that supports at least two aggregation processing modules, and comprises:

S11, obtaining the file attribute of an input file of a target aggregation processing module;

S12, matching the file attribute against the property parameters corresponding to the target aggregation processing module to obtain a matching result;

S13, if the matching result is yes, processing the input file according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result.

Regarding steps S11 to S13, it should be noted that, in embodiments of the present invention, the method is applied to a terminal that supports at least two aggregation processing modules, and the terminal has a built-in standalone MapReduce application that performs the data summarization processing. In current data summarization processing, each summarization task is controlled by its own independent MapReduce application and must follow the MapReduce framework. MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB) and comprises two processing stages, "Map" and "Reduce". An implementation specifies a Map function that maps one set of key-value pairs to a new set of key-value pairs, and a Reduce function that completes the processing. In embodiments of the present invention, by contrast, a summarization task no longer requires its own MapReduce application. Instead, each summarization task corresponds to a separate independent module, namely an aggregation processing module, inside one summarizing MapReduce application. Each module only needs a configuration file and an execution file that conform to a common interface, which reduces development difficulty and workload. During summarization, all summarization tasks run as one MapReduce program, and a single run produces the results required by all aggregation processing modules. Modules can be developed in a customized way without considering priorities among the different summarization computing tasks, so the scheme is flexible to apply.

Embodiments of the present invention involve multiple aggregation processing modules, which need to be configured. A configuration request entered by the user is therefore obtained; the request includes the number of aggregation processing modules and the aggregation processing module identifiers. The number determines how many modules are set up, and the identifiers distinguish the modules from one another.

After the aggregation processing modules have been set up, their parameters need to be configured to define which summarization task each module handles, what resource allocation the task requires, and so on.

To this end, the configuration request entered by the user further includes a configuration file and an execution file. The configuration file includes: the names of the Hive base data tables to be read (Hive is a Hadoop-based data warehouse tool that can map structured data files to database tables), the output file directory, the table indexes to be read, the number of Reduce tasks, and the resource information of each Map/Reduce task (parameters such as CPU, memory, and JVM limits). The execution file includes the specific Map-stage task and Reduce-stage task that the processing module needs to execute.
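As an illustration only, the configuration file contents listed above could be modeled as a small record type. The class name, field names, and sample values below (including the output directory) are hypothetical; the patent does not specify a concrete file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModuleConfig:
    """Hypothetical per-module configuration; all field names are illustrative."""
    module_id: str                 # aggregation processing module identifier
    hive_tables: List[str]         # Hive base data table names to read
    output_dir: str                # output file directory
    read_index: Dict[str, str]     # index name -> index value to read
    reduce_tasks: int              # number of Reduce tasks requested
    resources: Dict[str, int] = field(default_factory=dict)  # per-task limits


# Sample values are made up for the sketch.
config_a = ModuleConfig(
    module_id="ModuleA",
    hive_tables=["table_base"],
    output_dir="/output/module_a",
    read_index={"index1": "value1"},
    reduce_tasks=4,
    resources={"memory_mb": 2048, "cpu_cores": 1},
)
```

Keeping the Map-stage and Reduce-stage logic out of this record mirrors the split between the configuration file and the execution file.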

After the configuration file and execution file are obtained, the property parameters of each aggregation processing module are set according to the configuration file, and the summarization task parameters of each aggregation processing module are set according to the execution file. The property parameters and the summarization task parameters thus correspond, respectively, to the configuration file contents and the execution file contents described above.

After configuration, the identifier, property parameters, and summarization task parameters of each aggregation processing module are integrated into configuration information and stored in a configuration directory.

In embodiments of the present invention, processing the files to be summarized requires executing a Map-stage task and a Reduce-stage task. The summarization task parameters of each aggregation processing module include a MapRun function and a ReduceRun function.

Before execution, the Reduce task numbers in the property parameters of all aggregation processing modules are summed, and for the resource information of the Map/Reduce tasks in the property parameters of all aggregation processing modules, the maximum occupancy value is taken. The summarizing MapReduce application thereby knows and satisfies the operating requirements of all modules, which benefits system efficiency and saves resources.

Specifically, the property parameters of each aggregation processing module are read; they include the Reduce task number and the resource information of the Map/Reduce tasks. For additive resource parameters, such as the Reduce task number, an addition is performed: the Reduce task number of the summarizing MapReduce application is the sum of the Reduce task numbers set by all aggregation processing modules. For resource-limit parameters, such as CPU, memory, and JVM limits, a Max operation is performed: the resource occupancy value of the Map/Reduce tasks of the summarizing MapReduce application is the maximum of the values set by all aggregation processing modules.
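The sum-and-max rule can be sketched as follows. This is a minimal sketch; the function name and the dictionary keys (`reduce_tasks`, `resources`, `memory_mb`, `cpu_cores`) are assumptions made for the example, not names from the patent.

```python
def merge_job_settings(module_settings):
    """Combine per-module settings into settings for the single summarizing job.

    Additive parameters (the Reduce task number) are summed; resource-limit
    parameters are merged by taking the maximum across modules.
    """
    total_reduce_tasks = sum(s["reduce_tasks"] for s in module_settings)
    merged_resources = {}
    for settings in module_settings:
        for name, value in settings["resources"].items():
            # Keep the largest limit requested by any module.
            merged_resources[name] = max(merged_resources.get(name, value), value)
    return total_reduce_tasks, merged_resources


# Hypothetical settings for two modules.
settings = [
    {"reduce_tasks": 4, "resources": {"memory_mb": 2048, "cpu_cores": 1}},
    {"reduce_tasks": 2, "resources": {"memory_mb": 4096, "cpu_cores": 2}},
]
total, limits = merge_job_settings(settings)
```

Summing the Reduce task numbers gives each module as many reducers overall as it asked for, while taking the maximum of each limit ensures that every task container is large enough for the most demanding module.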

The input files of each aggregation processing module are read in sequence, and the file attribute of each input file is obtained. The file attribute is then matched against the property parameters corresponding to the target aggregation processing module to obtain a matching result, by judging whether the preset attribute information in the file attribute exists in the property parameters. If it does, the match succeeds; otherwise, the match fails.

A successful match indicates that the target aggregation processing module can execute its Map-stage and Reduce-stage tasks on the input file.

Because the summarization task parameters of each aggregation processing module include a MapRun function and a ReduceRun function, when the matching result is yes, the input file is processed according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result.

In the data summarization execution method provided by embodiments of the present invention, at least two aggregation processing modules are supported, each executing a different summarization task. The file attribute of the input file of the target aggregation processing module is obtained and matched against the property parameters; after a successful match, the input file is processed according to the corresponding summarization task parameters to obtain a processing result. Different summarization tasks therefore do not each need a separate MapReduce application; each is handled by its own built-in functional module, which reduces development difficulty and workload and simplifies execution.

Fig. 2 shows a data summarization execution method provided by an embodiment of the present invention. The method is applied to a terminal that supports at least two aggregation processing modules, and comprises:

S21, obtaining the file attribute of an input file of a target aggregation processing module, the file attribute including a Hive base data table name;

S22, judging whether the Hive base data table name in the file attribute matches the property parameters corresponding to the target aggregation processing module to obtain a matching result, the property parameters including a Hive base data table name;

S23, when the matching result is yes, executing the MapRun function corresponding to the target aggregation processing module to perform mapping on the input file and obtain an intermediate file, the attribute information of the intermediate file including the module identifier of the target aggregation processing module;

S24, reading the intermediate file, determining the target aggregation processing module from the module identifier in the attribute information of the intermediate file, and executing the ReduceRun function corresponding to the target aggregation processing module to perform reduction on the intermediate file and obtain a processing result.

Regarding steps S21 to S24, it should be noted that, in embodiments of the present invention, the file attribute of each input file includes a Hive base data table name and an index name, from which the corresponding file directory can be generated. For example, for summarization computing task A, if the base data table to be read is named table_base and the value of the main index index1 is value1, the file path to be read is: /warehouse/hive/db/table_base/index1=value1/***.
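A minimal sketch of how such a directory could be derived from the table name and index follows. The function name is hypothetical, and the warehouse root is simply taken from the example path; the elided file part of the path (`***`) is not reconstructed.

```python
def input_dir(table_name, index_name, index_value, warehouse_root="/warehouse/hive/db"):
    """Build the input directory for one Hive base table and one index value.

    The layout <root>/<table>/<index>=<value> follows the example path in the
    text; the default root is taken from that example and is an assumption.
    """
    return f"{warehouse_root.rstrip('/')}/{table_name}/{index_name}={index_value}"


path_a = input_dir("table_base", "index1", "value1")
```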

After the file attribute of the input file of the target aggregation processing module is obtained, the input file is added to a preset set of already-read files, so that no input file is read in more than once.

For example, if summarization computing task B also reads table_base with the value of the main index index1 equal to value1, the file does not need to be read in again.
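The already-read file set can be sketched with an ordinary set that filters duplicate paths. The function name and the second sample path are assumptions for illustration.

```python
def collect_inputs(requested_paths):
    """Return the paths to read, in first-seen order, skipping repeats.

    `read_set` plays the role of the preset set of already-read files.
    """
    read_set = set()
    to_read = []
    for path in requested_paths:
        if path in read_set:
            continue  # another module already requested this file
        read_set.add(path)
        to_read.append(path)
    return to_read


# Tasks A and B both request the same table partition; C requests another one.
requests = [
    "/warehouse/hive/db/table_base/index1=value1",  # task A
    "/warehouse/hive/db/table_base/index1=value1",  # task B (duplicate)
    "/warehouse/hive/db/table_other/index1=value1", # task C (hypothetical table)
]
unique_inputs = collect_inputs(requests)
```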

In the subsequent matching of the file attribute against the property parameters, however, only the Hive base data table name is extracted from the file path, and the property parameters are checked for a corresponding table name.
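One way to sketch this check: take the path component that follows the warehouse root as the table name, then test whether the module's configured tables contain it. The function names, the default root, and the file name `part-0` are assumptions for the example.

```python
def table_name_from_path(path, warehouse_root="/warehouse/hive/db"):
    """Extract the Hive base table name, i.e. the first component after the root.

    Returns None when the path does not lie under the warehouse root.
    """
    prefix = warehouse_root.rstrip("/") + "/"
    if not path.startswith(prefix):
        return None
    return path[len(prefix):].split("/", 1)[0]


def matches(path, module_tables):
    """True if the file at `path` belongs to a table that the module reads."""
    name = table_name_from_path(path)
    return name is not None and name in module_tables


sample = "/warehouse/hive/db/table_base/index1=value1/part-0"  # hypothetical file
```

Only the table name takes part in the match here; the index value in the path is ignored, as the text describes.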

If the matching result is yes, the input file is processed according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result. The processing comprises a Map stage and a Reduce stage, as follows:

The Map stage:

The configuration files of all aggregation processing modules are loaded, and an execution instance of each aggregation processing module (Module) is created from its name. Because all aggregation processing modules implement the same common interface, this is easy and efficient to implement in software. Once the instance has been created, the mapRun function of the aggregation processing module can be executed. The following operations are then performed on every record of the input file:

All aggregation processing modules are traversed, and for each one it is judged whether the file at the input file's path needs to be processed by that module. For example, if the file path is /warehouse/hive/db/table_base/index1=value1/*** and the tables read by Module A do not include table_base, the mapRun function of Module A is not executed; otherwise, the mapRun function of the Module is executed.

After the mapRun function of Module A has been executed, its output is written to the intermediate file in <Key, Value> form. The prefix of the Key is set to the name of Module A, so the complete Key is: ModuleName + business primary key ServiceKey. This guarantees that each intermediate file can be matched to its Module; the intermediate file name prefix is identical for the same Module.
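The Map stage described above might look like the following sketch. It is not the patent's implementation: the `Module` base class, the `|` separator, the record layout, and the counting logic in `ModuleA` are all hypothetical, and a real job would run inside a MapReduce mapper rather than a plain function.

```python
SEPARATOR = "|"  # hypothetical delimiter between module name and ServiceKey


class Module:
    """Hypothetical common interface implemented by every aggregation module."""
    name = "Base"
    tables = frozenset()

    def map_run(self, record):
        """Yield (service_key, value) pairs for one input record."""
        raise NotImplementedError


class ModuleA(Module):
    name = "ModuleA"
    tables = frozenset({"table_base"})

    def map_run(self, record):
        # Illustrative logic only: emit one count per merchant.
        yield record["merchant"], 1


def map_record(record, table_name, modules):
    """Run every module that reads `table_name`, prefixing keys with its name."""
    output = []
    for module in modules:
        if table_name not in module.tables:
            continue  # the module does not read this table, so skip its map_run
        for service_key, value in module.map_run(record):
            output.append((f"{module.name}{SEPARATOR}{service_key}", value))
    return output


pairs = map_record({"merchant": "m1"}, "table_base", [ModuleA()])
```

Because each input record is read once and offered to every module, several summarization tasks share a single pass over the same table.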

The Reduce stage:

The configuration files of all Modules are loaded, and an execution instance of each Module is created from its name. Because all Modules implement the same common interface, this is easy and efficient to implement in software. Once the instance has been created, the reduceRun function of the Module can be executed. The following operations are then performed on every record of the input file:

The Module to which the prefix of the record's Key belongs is determined. Once it has been determined, the business primary key ServiceKey is extracted from the Key, and the reduceRun function of the corresponding Module is executed to obtain the processing result.
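The prefix-based dispatch in the Reduce stage can be sketched as below. The separator, the function names, and the summing reducer are assumptions for illustration; the sketch also assumes that module names never contain the separator.

```python
SEPARATOR = "|"  # hypothetical delimiter; must match the one used in the Map stage


def reduce_sum(service_key, values):
    """Illustrative reduceRun logic: total the values for one business key."""
    return service_key, sum(values)


def reduce_record(key, values, reducers):
    """Split the key into module name and ServiceKey, then dispatch.

    `reducers` maps a module name to that module's reduce function.
    """
    module_name, service_key = key.split(SEPARATOR, 1)
    reduce_run = reducers[module_name]
    return module_name, reduce_run(service_key, values)


result = reduce_record("ModuleA|m1", [1, 1, 1], {"ModuleA": reduce_sum})
```

Splitting only on the first separator keeps any separator characters inside the ServiceKey intact.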

In addition, because the file attribute includes the output file directory, after the processing result is obtained the output file directory is read and the processing result is written to it.

A complete worked example follows:

Summarization computing task A is obtained. The base data table it reads is named table_base, the value of the main index index1 is value1, and the file path to be read is: /warehouse/hive/db/table_base/index1=value1/***.

The path of the file for summarization computing task A is judged to be /warehouse/hive/db/table_base/index1=value1/***, and the tables read by Module A include table_base, so the mapRun function of the Module is executed.

After the mapRun function of Module A has been executed, summarization computing task A writes its output to the intermediate file in <Key, Value> form. The prefix of the Key is set to the name of Module A, so the complete Key is: Module A + business primary key ServiceKey.

When the prefix of a record's Key is judged to belong to Module A, the business primary key ServiceKey is extracted from the Key, and the reduceRun function of Module A is executed to obtain the processing result.

The output file directory is read from the configuration file of Module A, and the processing result is written to that output file directory.

Fig. 3 shows a data summarization execution device provided by an embodiment of the present invention. The device is applied to a terminal that supports at least two aggregation processing modules, and comprises an acquiring unit 31, a matching unit 32, and a processing unit 33, wherein:

the acquiring unit 31 is configured to obtain the file attribute of an input file of a target aggregation processing module;

the matching unit 32 is configured to match the file attribute against the property parameters corresponding to the target aggregation processing module to obtain a matching result;

the processing unit 33 is configured to, when the matching result is yes, process the input file according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result.

Because the device of this embodiment works on the same principle as the method of the above embodiments, the more detailed explanation is not repeated here.

It should be noted that, in embodiments of the present invention, the related functions may be implemented by a hardware processor.

In the data summarization execution device provided by embodiments of the present invention, at least two aggregation processing modules are supported, each executing a different summarization task. The file attribute of the input file of the target aggregation processing module is obtained and matched against the property parameters; after a successful match, the input file is processed according to the corresponding summarization task parameters to obtain a processing result. Different summarization tasks therefore do not each need a separate MapReduce application; each is handled by its own built-in functional module, which reduces development difficulty and workload and simplifies execution.

Fig. 4 shows a data summarization execution device provided by an embodiment of the present invention. The device is applied to a terminal that supports at least two aggregation processing modules, and comprises an acquiring unit 31, a matching unit 32, a processing unit 33, and a storage unit 41, wherein:

the matching unit 32 is configured to judge whether the Hive base data table name in the file attribute matches the Hive base data table name in the property parameters to obtain a matching result;

the processing unit 33 is configured to, when the matching result is yes, execute the MapRun function corresponding to the target aggregation processing module to perform mapping on the input file and obtain an intermediate file, the attribute information of the intermediate file including the module identifier of the target aggregation processing module;

and to read the intermediate file, determine the target aggregation processing module from the module identifier in the attribute information of the intermediate file, and execute the ReduceRun function corresponding to the target aggregation processing module to perform reduction on the intermediate file and obtain a processing result;

the storage unit 41 is configured to, after the processing result is obtained, complete storage according to the output file directory.

Fig. 5 is a schematic diagram of the physical structure of a server. As shown in Fig. 5, the server may include a processor 510, a communications interface 520, a memory 530, and a communication bus 540, where the processor 510, the communications interface 520, and the memory 530 communicate with one another over the communication bus 540. The processor 510 may call logic instructions in the memory 530 to execute the following method: obtaining the file attribute of an input file of a target aggregation processing module; matching the file attribute against the property parameters corresponding to the target aggregation processing module to obtain a matching result; and, if the matching result is yes, processing the input file according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result.

In addition, when the logic instructions in the memory 530 are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.

From the above description of the embodiments, a person skilled in the art can clearly understand that each embodiment may be implemented by software plus a necessary general-purpose hardware platform, or, of course, by hardware. Based on this understanding, the above technical solution in essence, or the part that contributes to the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data summarization execution method, characterized in that the method is applied to a terminal that supports at least two aggregation processing modules, and the method comprises:

obtaining the file attribute of an input file of a target aggregation processing module;

matching the file attribute against the property parameters corresponding to the target aggregation processing module to obtain a matching result;

if the matching result is yes, processing the input file according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result.

2. The data summarization execution method according to claim 1, characterized in that the file attribute includes a Hive base data table name, and the property parameters include a Hive base data table name;

matching the file attribute against the property parameters corresponding to the target aggregation processing module to obtain a matching result specifically comprises:

judging whether the Hive base data table name in the file attribute matches the Hive base data table name in the property parameters to obtain a matching result.

3. The data summarization execution method according to claim 2, characterized in that the summarization task parameters include a MapRun function and a ReduceRun function;

when the matching result is yes, processing the input file according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result specifically comprises:

when the matching result is yes, executing the MapRun function corresponding to the target aggregation processing module to perform mapping on the input file and obtain an intermediate file, the attribute information of the intermediate file including the module identifier of the target aggregation processing module;

reading the intermediate file, determining the target aggregation processing module from the module identifier in the attribute information of the intermediate file, and executing the ReduceRun function corresponding to the target aggregation processing module to perform reduction on the intermediate file and obtain a processing result.

4. The data summarization execution method according to claim 3, characterized in that the file attribute further includes an output file directory;

the method further comprises: after the processing result is obtained, completing storage according to the output file directory.

5. The data summarization execution method according to claim 1, characterized by further comprising:

after the file attribute of the input file of the target aggregation processing module is obtained, adding the input file of the target aggregation processing module to a preset set of already-read files.

6. The data summarization execution method according to claim 1, characterized in that, before the file attribute of the input file of the target aggregation processing module is obtained, the method further comprises:

the property parameters include a Reduce task number;

summing the Reduce task numbers in the property parameters of all aggregation processing modules;

the property parameters include resource information of Map/Reduce tasks;

taking the maximum occupancy value of the resource information of the Map/Reduce tasks in the property parameters of all aggregation processing modules.

7. A data summarization execution device, characterized in that the device is applied to a terminal that supports at least two aggregation processing modules, and the device comprises:

an acquiring unit, configured to obtain the file attribute of an input file of a target aggregation processing module;

a matching unit, configured to match the file attribute against the property parameters corresponding to the target aggregation processing module to obtain a matching result;

a processing unit, configured to, when the matching result is yes, process the input file according to the summarization task parameters corresponding to the target aggregation processing module to obtain a processing result.

8. The data summarization execution device according to claim 7, characterized in that the file attribute includes a Hive base data table name, and the property parameters include a Hive base data table name;

the matching unit is specifically configured to:

9. The data summarization execution device according to claim 7, characterized in that the summarization task parameters include a MapRun function and a ReduceRun function;

the processing unit is specifically configured to:

10. The data summarization execution device according to claim 7, characterized in that the file attribute further includes an output file directory;

the device further comprises a storage unit, configured to complete storage according to the output file directory after the processing result is obtained.

11. The data summarization execution device according to claim 7, characterized by further comprising an already-read storage module, configured to:

12. The data summarization execution device according to claim 7, characterized by further comprising a detection module, configured to:

the property parameters include a Reduce task number;

before the file attribute of the input file of the target aggregation processing module is obtained, sum the Reduce task numbers in the property parameters of all aggregation processing modules;

the property parameters include resource information of Map/Reduce tasks;

before the file attribute of the input file of the target aggregation processing module is obtained, take the maximum occupancy value of the resource information of the Map/Reduce tasks in the property parameters of all aggregation processing modules.

13. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the data summarization execution method according to any one of claims 1 to 6.

14. A non-transitory computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the data summarization execution method according to any one of claims 1 to 6.