CN103714511B - GPU-based branch processing method and device - Google Patents

GPU-based branch processing method and device

Info

Publication number
CN103714511B
Authority
CN
China
Prior art keywords
information node
branch
data
pending
pending data
Prior art date
2013-12-17
Legal status
Active
Application number
CN201310695410.8A
Other languages
Chinese (zh)
Other versions
CN103714511A (en)
Inventor
殷罗英
朱坤
吴钊源
陈剑军
Current Assignee
Bengbu Hongjing Technology Co.,Ltd.
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201310695410.8A
Publication of CN103714511A
Application granted
Publication of CN103714511B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a GPU-based branch processing method and device, relating to the technical field of data processing, which improve branch execution efficiency while preserving the code logic. In an embodiment of the invention, after a message node corresponding to a branch to be processed is obtained, and when the message node satisfies a preset condition, the pending data in the message node is obtained and processed. The technical solution provided by the embodiments of the invention is mainly applied to data processing flows.

Description

GPU-based branch processing method and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a branch processing method and device based on a GPU (graphics processing unit).
Background technology
At present, GPUs have parallel processing capability and programmable pipelines, and can process non-graphical data. Their performance is particularly strong when the SIMD (single instruction, multiple data) model is used and the arithmetic workload is much larger than the cost of data scheduling and transfer, so GPUs are widely used in fields such as supercomputing, scientific computing, finance, and chemistry. Specifically, the SIMD model adopted by a GPU is a technique that achieves spatial parallelism by using one controller to drive multiple processors, each of which simultaneously executes the same operation on one element of a group of data (also called a "data vector"). This technique exploits the advantage of multiple processors when a batch of data is processed according to the same instruction, but when the instructions differ or the amount of concurrent data is insufficient, the processors must execute the different instructions in separate batches, which lowers both data-processing efficiency and instruction-execution efficiency.
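As an illustration only (this example is not part of the patent text; the kernel and variable names are assumptions), the following minimal CUDA kernel shows the kind of divergent branch described above: threads of one warp that take different sides of the if/else are serialized by the SIMD hardware, so the two instruction paths are executed one after the other instead of concurrently.

    __global__ void divergent_process(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;
        /* Threads of the same warp that disagree on this condition are
           executed in separate batches, one side after the other. */
        if (in[i] > 0.0f)
            out[i] = in[i] + 1.0f;   /* branch 1: add */
        else
            out[i] = in[i] * 2.0f;   /* branch 2: multiply */
    }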
To solve the above problem that executing different instructions in batches lowers data-processing efficiency, a delayed-iteration technique is used. Specifically, when a thread contains branches within one iteration, delayed iteration ensures that in each loop iteration, for a multi-branch scenario, all threads are delayed so that only a single path is taken and no operation is performed in the other branches. In the course of implementing the prior art, the inventors found that although delayed iteration improves branch execution efficiency on the GPU to some extent, it destroys the logical order in which the code in a thread executes and changes the implementation of the original service.
Summary of the invention
Embodiments of the present invention provide a GPU-based branch processing method and device that improve branch execution efficiency while preserving the code logic.
To achieve the above objective, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, a GPU-based branch processing method is provided, including:
obtaining a message node corresponding to a branch currently to be processed, where the message node includes at least one or more items of pending data;
when the message node satisfies a preset condition, processing the pending data in the message node;
where the preset condition is used to enable the largest possible amount of the pending data to be processed at the current time.
In a first possible implementation of the first aspect, before obtaining the message node corresponding to the branch task currently to be processed, the method further includes:
performing branch classification on the pending data according to a preset strategy, where the pending data in the same branch executes the same instruction.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the preset condition includes: the quantity of the pending data in the message node reaches a preset threshold; and/or,
a timer corresponding to the message node has expired.
With reference to the first aspect, the first possible implementation of the first aspect, or the second possible implementation of the first aspect, in a third possible implementation of the first aspect, when the quantity of the pending data in the message node reaches the preset threshold, the pending data in the message node is obtained and processed.
With reference to the first aspect, the first possible implementation of the first aspect, or the second possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the method further includes:
when the timer corresponding to the message node expires, obtaining and processing the pending data in the message node.
With reference to any one or more of the first aspect and its first, second, third, or fourth possible implementations, in a fifth possible implementation of the first aspect, the method further includes:
when the quantity of the pending data in the message node does not reach the preset threshold, setting a timer for the message node.
In a second aspect, a GPU-based branch processing device is provided, including:
an acquiring unit, configured to obtain a message node corresponding to a branch currently to be processed, where the message node includes at least one or more items of pending data;
a processing unit, configured to process the pending data in the message node when the message node satisfies a preset condition;
where the preset condition is used to enable the largest possible amount of the pending data to be processed at the current time.
In a first possible implementation of the second aspect, the processing unit is further configured to perform branch classification on the pending data according to a preset strategy before the acquiring unit obtains the message node corresponding to the branch task currently to be processed, where the pending data in the same branch executes the same instruction.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the preset condition used by the processing unit includes: the quantity of the pending data in the message node reaches a preset threshold; and/or,
a timer corresponding to the message node has expired.
With reference to the second aspect, the first possible implementation of the second aspect, or the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the processing unit is specifically configured to obtain and process the pending data in the message node when the quantity of the pending data in the message node reaches the preset threshold.
With reference to the second aspect, the first possible implementation of the second aspect, or the second possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the processing unit is specifically configured to obtain and process the pending data in the message node when the timer corresponding to the message node expires.
With reference to any one or more of the second aspect and its first, second, third, or fourth possible implementations, in a fifth possible implementation of the second aspect, the device further includes:
a setting unit, configured to set a timer for the message node when the quantity of the pending data in the message node does not reach the preset threshold.
With the GPU-based branch processing method and device provided by the embodiments of the present invention, after the message node corresponding to a branch to be processed is obtained, and when the message node satisfies the preset condition, the pending data in the message node is obtained and processed. Compared with the prior art, in which branch execution efficiency is improved during branch-based data processing by changing the code logic used to process the data, the embodiments of the present invention process the largest possible amount of data in a single pass that executes the same instruction while preserving the code logic, thereby improving branch execution efficiency.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.
Fig. 1 is a flowchart of a GPU-based branch processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a GPU-based branch processing method according to another embodiment of the present invention;
Fig. 3 is a schematic diagram of the processing framework to which a GPU-based branch processing method according to another embodiment of the present invention is applied;
Fig. 4 is a flowchart of a branch processing method executed on the processing framework shown in Fig. 3 according to another embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a GPU-based branch processing device according to another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a GPU-based branch processing device according to another embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a GPU-based branch processing device according to another embodiment of the present invention.
Description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a GPU-based branch processing method. As shown in Fig. 1, the method includes:
101. Obtain the message node corresponding to the branch currently to be processed.
The message node includes at least one or more items of pending data.
It should be noted that, when one or more groups of data exist, the data is divided into corresponding branches according to the different processing conditions the data satisfies; that is, each processing condition corresponds to a branch. For example, when two numbers are processed, if the two numbers satisfy condition 1 they are added, and if they satisfy condition 2 they are multiplied; condition 1 then corresponds to one branch and condition 2 corresponds to another branch.
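A minimal host-side sketch of this classification (illustrative only; the predicate and the bucket names are assumptions, not taken from the patent) places each item in the branch whose processing condition it satisfies, so that every item within one branch later executes the same instruction:

    #include <vector>

    /* Condition 1 (here assumed to be x > 0) selects the "add" branch,
       condition 2 selects the "multiply" branch. */
    static void classify(const std::vector<float> &data,
                         std::vector<float> &add_branch,
                         std::vector<float> &mul_branch)
    {
        for (float x : data) {
            if (x > 0.0f)
                add_branch.push_back(x);
            else
                mul_branch.push_back(x);
        }
    }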
102. When the message node satisfies the preset condition, obtain and process the pending data in the message node.
It should be noted that, in order to process as much data as possible in one pass that executes the same instruction, the preset condition is set by the system or by the user, and the preset condition is used to enable the largest possible amount of pending data to be processed at the current time.
With the GPU-based branch processing method provided by this embodiment of the present invention, after the message node corresponding to the branch to be processed is obtained, and when the message node satisfies the preset condition, the pending data in the message node is obtained and processed. Compared with the prior art, in which branch execution efficiency is improved during branch-based data processing by changing the code logic used to process the data, this embodiment of the present invention processes the largest possible amount of data in a single pass that executes the same instruction while preserving the code logic, thereby improving branch execution efficiency.
Another embodiment of the present invention provides a GPU-based branch processing method. In connection with the description of the previous embodiment, before step 101 is executed:
First, the pending data needs to be divided according to the preset strategy so that the pending data in the same branch executes the same instruction.
Preferably, the preset strategy includes, but is not limited to, execution instructions set by the user or by the system according to, for example, the format or the expression value of the data; the preset strategy includes the correspondence between the pending data and the instructions. The pending data includes one or more groups of data that need to execute the same instruction, and the data is processed by two or more threads, which may specifically include one main thread and one or more auxiliary threads.
Further, the preset condition described in step 102 includes: the quantity of the pending data in the message node reaches a preset threshold; and/or, the timer corresponding to the message node has expired.
Specifically, taking the case where the pending data consists of data groups as an example, and in connection with the above description of processing data with main and auxiliary threads, when 32 parallel threads are configured on the GPU, the preset threshold is preferably set to an integer multiple of 32. That is, when the number of pending data groups stored in the message node reaches or exceeds the preset threshold, the main thread and the auxiliary threads obtain the data groups to be processed and execute the corresponding logic code to complete the data processing.
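A hedged sketch of this threshold check (illustrative only; the kernel, the device buffers d_in/d_out, and the example threshold of 128 = 4 x 32 are assumptions): pending items accumulate in a node, and the branch kernel is launched only once at least an integer multiple of the 32-wide warp is available, so the launched warps all execute the same instruction over the accumulated batch.

    #include <cuda_runtime.h>
    #include <vector>

    __global__ void add_branch_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] + 1.0f;       /* every thread runs the same instruction */
    }

    /* d_in and d_out are device buffers assumed to be preallocated and large enough.
       Returns true only if the node held at least `threshold` items (a multiple of 32). */
    static bool maybe_dispatch(const std::vector<float> &pending,
                               float *d_in, float *d_out, int threshold /* e.g. 128 */)
    {
        int n = static_cast<int>(pending.size());
        if (n < threshold)
            return false;                /* not enough data yet: keep accumulating */
        cudaMemcpy(d_in, pending.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        add_branch_kernel<<<(n + 31) / 32, 32>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        return true;
    }

If the threshold is not reached, the data simply keeps accumulating; the timer described below guarantees that it is still processed in time.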
Depending on the form of the preset condition, the detailed process of step 102, as shown in Fig. 2, includes:
1021. When the quantity of the pending data in the message node reaches the preset threshold, obtain and process the pending data in the message node.
1022. When the timer corresponding to the message node expires, obtain and process the pending data in the message node.
It should be noted that the preset threshold is set so that as much data as possible can be processed in one pass. However, there are also cases in which the pending data cannot reach the threshold, and a timer is then needed to guarantee that the data is still processed in time; that is, when the quantity of the pending data in the message node does not reach the preset threshold, a timer is set for the message node.
Another embodiment of the present invention provides a GPU-based branch processing method. The method may be applied to the processing framework shown in Fig. 3. The framework includes a message cache ring 101, a main thread 102, auxiliary threads 103, and an aging linked list 104. The message cache ring 101 includes multiple message nodes, the aging linked list 104 includes multiple aging-timer nodes, and the message nodes in the message cache ring 101 correspond one-to-one with the aging-timer nodes in the aging linked list 104. The main thread 102 and the auxiliary threads 103 can obtain pending data from the message nodes. The main thread 102 includes four modules: task acquisition (get event), instruction synchronization, message acquisition (get msg), and service processing; an auxiliary thread 103 includes four modules: no-operation (nop), instruction synchronization, get msg, and service processing. The service processing modules of the main thread 102 and the auxiliary threads 103 execute the corresponding service code for the information in the message nodes obtained from the message cache ring 101. In this processing structure, the auxiliary thread 103 may represent a set of auxiliary threads.
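The relationships described above can be summarized with the following host-side struct sketch (illustrative only; all type and field names are assumptions): a ring of message nodes, each paired with an aging-timer entry, from which the main thread and the set of auxiliary threads pull pending data.

    #include <chrono>
    #include <vector>

    struct MessageNode {                     /* one node of the message cache ring (101) */
        int branch_id = -1;                  /* branch task this node belongs to */
        std::vector<float> pending;          /* pending data accumulated for the branch */
        bool timer_armed = false;            /* true once the node is on the aging list */
        std::chrono::steady_clock::time_point deadline;   /* its aging timer (104) */
    };

    struct MessageRing {                     /* message cache ring (101) */
        std::vector<MessageNode> nodes;      /* one node per branch task */
        std::size_t cursor = 0;              /* position of the main thread's traversal */
    };

    struct AgingList {                       /* aging linked list (104) */
        std::vector<MessageNode *> timed_nodes;  /* nodes whose timer is currently armed */
    };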
Based on the above framework, as shown in Fig. 4, the method includes:
401. Divide the pending data according to the preset strategy, and merge the pending data belonging to the same branch.
Further, the pending data is stored in the message nodes of the message cache ring 101, and the branch-logic execution order is deployed in the main thread 102, so that the main thread 102 can complete thread scheduling and process the data stored in the message nodes.
It should be noted that, in this embodiment, executing the same instruction on the data constitutes a branch task, and the content stored in a message node of the message cache ring 101 also includes messages exchanged between the branch task and the outside, messages representing internal processing related to the branch task, and so on.
402. The main thread 102 checks whether the message node of the message cache ring 101 corresponding to the first branch task reaches the preset threshold.
Specifically, when it is determined that the message node does not reach the preset threshold, the current time is recorded and the message node is inserted into the aging linked list 104.
When it is determined that the message node reaches the preset threshold, the following step 403 is executed.
403. The main thread 102 obtains the message node corresponding to the current branch task.
Further, after it is determined that the instructions of the main thread 102 and the auxiliary threads 103 are synchronized, the following step 404 is executed.
404. The main thread 102 and the auxiliary threads 103 obtain and process the pending data through their service processing modules.
Further, after the processing is completed, the aging timer corresponding to the current message node is refreshed.
Further, when it is determined that the data obtained from the message node needs further processing, the data is written back into the message node of the message cache ring 101 corresponding to the next branch task.
Optionally, in another implementation of this embodiment of the present invention, when it is determined that the message node of the branch task currently being executed does not reach the preset threshold but has timed out, the following process is executed:
a. The main thread 102 obtains the timed-out message node.
b. The main thread 102 and the auxiliary threads 103 process the branch task in parallel.
c. After the processing is completed, the aging timer corresponding to the current node is refreshed.
Further, when it is determined that the data obtained from the message node needs further processing, the data is written back into the message node of the message cache ring 101 corresponding to the next branch task.
Optionally, in another implementation of this embodiment of the present invention, when it is determined that the obtained current message node neither reaches the preset threshold nor has timed out, the main thread 102 obtains the next message node for processing according to the stored execution order of the branch tasks.
Optionally, in another implementation of this embodiment of the present invention, while traversing the message cache ring 101, the main thread 102 may also traverse the aging linked list 104; the specific process is as follows:
a1. The main thread determines the branch task currently to be executed and traverses the message cache ring 101 to find the message node corresponding to this branch task.
a2. Check whether the aging linked list 104 is empty.
Further, when the aging linked list 104 is not empty, the following step a3 is executed; when the aging linked list 104 is empty, it is only checked whether the message node on the message cache ring 101 reaches the preset threshold. This case is implemented in the same way as steps 401-404 described above and is not repeated here.
a3. The main thread 102 obtains the timed-out node from the aging linked list 104.
a4. The main thread 102 and the auxiliary threads 103 execute, in parallel, the branch task corresponding to the timed-out node.
Further, after the processing is completed, the processed timed-out node is refreshed in the aging linked list 104.
It should be noted that after the main thread 102 has traversed and processed the aging linked list 104, it continues the traversal of the message cache ring 101 from the corresponding position.
For example, if the main thread 102 turned to the aging linked list 104 while at the third branch task, then after the aging linked list has been processed it continues with the branch tasks after the third branch.
In addition, the message cache ring 101 involved in this embodiment aggregates messages, which ensures that the multiple threads executing synchronized instructions process different data of the same type, maximizing thread execution efficiency.
It should be noted that when the main thread 102 and the auxiliary threads 103 have finished processing the last branch task, or the next processing step needs to be handled by another branch task, the main thread 102 sends the message directly to the next branch task.
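Putting steps 401-404 and the timeout handling together, the following host-side sketch outlines one pass of the main thread's loop (illustrative only; dispatch_to_gpu is a stand-in for the parallel service processing performed by the main and auxiliary threads, and the types follow the struct sketch given after the framework description above).

    #include <algorithm>
    #include <chrono>

    /* Placeholder for steps 403-404: copy the node's pending data to the device and
       let the main and auxiliary threads execute the branch's service code. */
    static void dispatch_to_gpu(MessageNode &node)
    {
        (void)node;
    }

    static void main_thread_step(MessageRing &ring, AgingList &aging,
                                 std::size_t threshold, std::chrono::milliseconds timeout)
    {
        auto now = std::chrono::steady_clock::now();

        /* 402: check whether the node of the current branch task reaches the threshold. */
        MessageNode &node = ring.nodes[ring.cursor];
        if (node.pending.size() >= threshold) {
            dispatch_to_gpu(node);                 /* 403-404: obtain and process the data */
            node.pending.clear();
            node.timer_armed = false;              /* refresh the node's aging timer */
        } else if (!node.timer_armed) {
            node.timer_armed = true;               /* below threshold: record the time and */
            node.deadline = now + timeout;         /* insert the node into the aging list  */
            aging.timed_nodes.push_back(&node);
        }

        /* a2-a4: if the aging list is not empty, process any node whose timer expired. */
        for (MessageNode *timed : aging.timed_nodes) {
            if (timed->timer_armed && now >= timed->deadline) {
                dispatch_to_gpu(*timed);
                timed->pending.clear();
                timed->timer_armed = false;
            }
        }
        aging.timed_nodes.erase(
            std::remove_if(aging.timed_nodes.begin(), aging.timed_nodes.end(),
                           [](MessageNode *m) { return !m->timer_armed; }),
            aging.timed_nodes.end());

        ring.cursor = (ring.cursor + 1) % ring.nodes.size();   /* next branch task */
    }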
Another embodiment of the present invention provides a GPU-based branch processing device. As shown in Fig. 5, the device includes an acquiring unit 51 and a processing unit 52.
The acquiring unit 51 is configured to obtain the message node corresponding to the branch currently to be processed.
The message node includes at least one or more items of pending data.
The processing unit 52 is configured to process the pending data in the message node when the message node satisfies the preset condition.
Preferably, the preset condition includes: the quantity of the pending data in the message node reaches a preset threshold; and/or, the timer corresponding to the message node has expired.
The preset condition is used to enable the largest possible amount of pending data to be processed at the current time.
Optionally, the processing unit 52 is further configured to perform branch classification on the pending data according to the preset strategy before the acquiring unit 51 obtains the message node corresponding to the branch task currently to be processed.
It should be noted that the pending data in the same branch executes the same instruction.
Specifically, the processing unit 52 is further configured to obtain and process the pending data in the message node when the quantity of the pending data in the message node reaches the preset threshold, and to obtain and process the pending data in the message node when the timer corresponding to the message node expires.
Optionally, as shown in Fig. 6, the device further includes a setting unit 53.
The setting unit 53 is configured to set a timer for the message node when the quantity of the pending data in the message node does not reach the preset threshold.
With the GPU-based branch processing device provided by this embodiment of the present invention, after the acquiring unit obtains the message node corresponding to the branch to be processed, and when the message node satisfies the preset condition, the processing unit processes the pending data in the message node. Compared with the prior art, in which branch execution efficiency is improved during branch-based data processing by changing the code logic used to process the data, this embodiment of the present invention processes the largest possible amount of data in a single pass that executes the same instruction while preserving the code logic, thereby improving branch execution efficiency.
Another embodiment of the present invention provides a GPU-based branch processing device. As shown in Fig. 7, the device includes a memory 71, a processor 72, and a bus 73, where the memory 71 and the processor 72 communicate through the bus 73.
The memory 71 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 71 may store an operating system and other application programs. When the technical solutions provided by the embodiments of the present invention are implemented in software or firmware, the program code implementing the technical solutions provided by the embodiments of the present invention is stored in the memory 71 and executed by the processor 72.
The processor 72 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute the related programs to implement the technical solutions provided by the embodiments of the present invention.
The bus 73 may include a path that transfers information between the components of the device (for example, the memory 71 and the processor 72).
It should be noted that although the hardware shown in Fig. 7 includes only the memory 71, the processor 72, and the bus 73, in a specific implementation a person skilled in the art should understand that the terminal also includes other devices necessary for normal operation, and, depending on specific needs, may also include hardware devices implementing other functions.
Specifically, the device shown in Fig. 7 is configured to implement the method flows shown in Figs. 1 to 4.
The processor 72 is configured to obtain the message node corresponding to the branch currently to be processed, and is further configured to process the pending data in the message node when the message node satisfies the preset condition.
The message node includes at least one or more items of pending data. The preset condition is used to enable the largest possible amount of pending data to be processed at the current time; preferably, the preset condition includes: the quantity of the pending data in the message node reaches a preset threshold; and/or, the timer corresponding to the message node has expired.
The memory 71 is configured to store the pending data in the message node.
Optionally, the processor 72 is further configured to perform branch classification on the pending data according to the preset strategy before obtaining the message node corresponding to the branch task currently to be processed.
It should be noted that the pending data in the same branch executes the same instruction.
The processor 72 is specifically configured to obtain and process the pending data in the message node when the quantity of the pending data in the message node reaches the preset threshold, and to obtain and process the pending data in the message node when the timer corresponding to the message node expires.
Optionally, the processor 72 is further configured to set a timer for the message node when the quantity of the pending data in the message node does not reach the preset threshold.
The memory 71 is further configured to store the preset threshold and the data processing instructions.
With the GPU-based branch processing device provided by this embodiment of the present invention, after the processor obtains the message node corresponding to the branch to be processed, and when the message node satisfies the preset condition, the pending data in the message node is processed. Compared with the prior art, in which branch execution efficiency is improved during branch-based data processing by changing the code logic used to process the data, this embodiment of the present invention processes the largest possible amount of data in a single pass that executes the same instruction while preserving the code logic, thereby improving branch execution efficiency.
From the description of the above embodiments, a person skilled in the art can clearly understand that the present invention may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented by hardware, but in many cases the former is the better implementation. Based on such an understanding, the part of the technical solutions of the present invention that essentially contributes to the prior art may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, hard disk, or optical disc of a computer, and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The foregoing is merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that can be readily conceived by a person familiar with the technical field within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A branch processing method based on a graphics processing unit (GPU), characterized by comprising:
obtaining a message node corresponding to a branch currently to be processed, wherein the message node comprises at least one or more items of pending data;
when the message node satisfies a preset condition, processing the pending data in the message node;
wherein the preset condition is used to enable the largest possible amount of the pending data to be processed at the current time;
before the obtaining of the message node corresponding to the branch currently to be processed, the method further comprising:
performing branch classification on the pending data according to a preset strategy, wherein the pending data in the same branch executes the same instruction.
2. The method according to claim 1, characterized in that the preset condition comprises: the quantity of the pending data in the message node reaches a preset threshold; and/or,
the timer corresponding to the message node has expired.
3. The method according to claim 2, characterized in that
when the quantity of the pending data in the message node reaches the preset threshold, the pending data in the message node is obtained and processed.
4. The method according to claim 2, characterized in that
when the timer corresponding to the message node expires, the pending data in the message node is obtained and processed.
5. The method according to any one of claims 2 to 4, characterized by comprising:
when the quantity of the pending data in the message node does not reach the preset threshold, setting a timer for the message node.
6. A GPU-based branch processing device, characterized by comprising:
an acquiring unit, configured to obtain a message node corresponding to a branch currently to be processed, wherein the message node comprises at least one or more items of pending data;
a processing unit, configured to process the pending data in the message node when the message node satisfies a preset condition;
wherein the preset condition is used to enable the largest possible amount of the pending data to be processed at the current time;
and the processing unit is further configured to perform branch classification on the pending data according to a preset strategy before the acquiring unit obtains the message node corresponding to the branch currently to be processed, wherein the pending data in the same branch executes the same instruction.
7. The device according to claim 6, characterized in that
the preset condition comprises: the quantity of the pending data in the message node reaches a preset threshold; and/or,
the timer corresponding to the message node has expired.
8. The device according to claim 7, characterized in that
the processing unit is specifically configured to obtain and process the pending data in the message node when the quantity of the pending data in the message node reaches the preset threshold.
9. The device according to claim 7, characterized in that
the processing unit is specifically configured to obtain and process the pending data in the message node when the timer corresponding to the message node expires.
10. The device according to any one of claims 7 to 9, characterized in that the device further comprises:
a setting unit, configured to set a timer for the message node when the quantity of the pending data in the message node does not reach the preset threshold.
CN201310695410.8A 2013-12-17 2013-12-17 GPU-based branch processing method and device Active CN103714511B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310695410.8A CN103714511B (en) 2013-12-17 2013-12-17 GPU-based branch processing method and device

Publications (2)

Publication Number Publication Date
CN103714511A (en) 2014-04-09
CN103714511B (en) 2017-01-18

Family

ID=50407456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310695410.8A Active CN103714511B (en) 2013-12-17 2013-12-17 GPU-based branch processing method and device

Country Status (1)

Country Link
CN (1) CN103714511B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107124286B (en) * 2016-02-24 2020-05-26 Shenzhen Zhiqiong Technology Co., Ltd. System and method for high-speed processing and interaction of mass data
CN111095197B (en) * 2017-10-27 2021-10-15 Huawei Technologies Co., Ltd. Code processing method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315607A (en) * 2007-05-31 2008-12-03 SAP AG Process model control flow with multiple synchronizations
CN102831577A (en) * 2012-08-29 2012-12-19 University of Electronic Science and Technology of China Method for fast zooming two-dimensional seismic image based on GPU (graphic processing unit)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8788570B2 (en) * 2009-06-22 2014-07-22 Citrix Systems, Inc. Systems and methods for retaining source IP in a load balancing multi-core environment

Also Published As

Publication number Publication date
CN103714511A (en) 2014-04-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20201229

Address after: 233000 No.10, building 32, Zone 8, Guangcai market, bengshan District, Bengbu City, Anhui Province

Patentee after: Bengbu Hongjing Technology Co.,Ltd.

Address before: 518000 Baoan District, Xin'an Street, Shenzhen, Guangdong, No. 625, Nuo platinum Plaza

Patentee before: SHENZHEN SHANGGE INTELLECTUAL PROPERTY SERVICE Co.,Ltd.

Effective date of registration: 20201229

Address after: 518000 Baoan District, Xin'an Street, Shenzhen, Guangdong, No. 625, Nuo platinum Plaza

Patentee after: SHENZHEN SHANGGE INTELLECTUAL PROPERTY SERVICE Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.

TR01 Transfer of patent right