CN105468567A

CN105468567A - Discrete memory access optimization method for heterogeneous many-core processors

Info

Publication number: CN105468567A
Application number: CN201510830202.3A
Authority: CN
Inventors: 袁欣辉; 潘治; 林蓉芬; 王礼生
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2015-11-24
Filing date: 2015-11-24
Publication date: 2016-04-06
Anticipated expiration: 2035-11-24
Also published as: CN105468567B

Abstract

The invention provides a discrete memory access optimization method for heterogeneous many-core processors, comprising: step one, dividing an overall task into a plurality of task fragments; step two, establishing a counting variable in a memory space accessible to both the master core and the slave cores; step three, determining whether the value of the counting variable is smaller than the number of fragments of the overall task, and proceeding to step four if it is; step four, the master core and each slave core dynamically take task fragments from a task pool, perform an atomic add-1 operation on the counting variable, and complete the memory access operations for the fragments taken; processing then returns to step three.

Description

A discrete memory access optimization method for heterogeneous many-core processors

Technical field

The present invention relates to the field of computer technology and, more particularly, to a discrete memory access optimization method for heterogeneous many-core processors.

Background technology

A heterogeneous many-core processor is a novel processor architecture that packages a general-purpose processor (the master core) together with a cluster of accelerator cores (the slave cores) to deliver very high computing performance. However, the master core and the slave-core cluster of this architecture share the processor's memory bandwidth, and when data locality is poor the memory bandwidth actually achieved is significantly lower than the bandwidth achieved when accessing contiguous data. For data-intensive applications, the memory access bottleneck is pronounced. Data locality comprises temporal locality and spatial locality. Temporal locality means that data accessed recently is likely to be accessed again within a short period of time; spatial locality means that the data a program accesses within a short period is concentrated in a small region of memory, so that data near the data just accessed is likely to be accessed next.

The existing architecture closest to this new one is the hybrid CPU (Central Processing Unit) + GPU (Graphics Processing Unit) architecture. To improve the performance of discrete memory access (memory access by a program whose data locality is poor) on that architecture, the CPU and the GPU access memory simultaneously: tasks are first assigned to the CPU and the GPU in a fixed ratio according to their respective memory bandwidths, and the tasks assigned to the GPU are then divided evenly among its threads, so as to achieve good overall performance for the problem.

Under the hybrid CPU+GPU architecture, the memory access paths of the CPU and the GPU are independent and do not affect each other, so tasks can easily be assigned in a fixed ratio according to the respective bandwidths. This approach, however, is difficult to apply directly to a heterogeneous many-core processor. The master core and the slave cores share memory bandwidth, so the memory accesses of one core may conflict with and degrade the memory access performance of other cores; moreover, the memory access instructions issued by the cores are entirely unordered, random, and unpredictable, so the ratio of memory bandwidth between the master core and the slave-core cluster is not fixed. In addition, because the actual amount of memory access differs from task to task, the load of each task differs, and dividing tasks evenly within the slave-core cluster leads to load imbalance in which heavily loaded cores drag down the performance of the whole problem. Dividing tasks by a fixed ratio is therefore not feasible either.
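The load-imbalance argument can be illustrated with a small model. The following sketch is not part of the patent: the cost list, the worker count, and the helper names static_makespan and dynamic_makespan are hypothetical, and all workers are assumed to run at the same speed. It compares a fixed, even pre-assignment of tasks with an assignment in which each idle worker takes the next task.

```python
import heapq

def static_makespan(costs, workers):
    """Finish time when tasks are pre-assigned in equal contiguous blocks."""
    block = -(-len(costs) // workers)  # ceiling division
    loads = [sum(costs[i:i + block]) for i in range(0, len(costs), block)]
    return max(loads)

def dynamic_makespan(costs, workers):
    """Finish time when each idle worker takes the next task in order."""
    finish = [0] * workers  # min-heap of per-worker finish times
    heapq.heapify(finish)
    for cost in costs:
        earliest = heapq.heappop(finish)  # the worker that becomes idle first
        heapq.heappush(finish, earliest + cost)
    return max(finish)

# Hypothetical per-task memory access costs: the heavy tasks are clustered.
costs = [8, 8, 8, 8, 1, 1, 1, 1, 1, 1, 1, 1]
print(static_makespan(costs, 4), dynamic_makespan(costs, 4))
```

In this toy model the even split leaves one worker with three heavy tasks while the others finish early, which is the situation the paragraph above describes as heavily loaded cores dragging down the whole problem.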

Summary of the invention

The technical problem to be solved by the present invention is to address the above defects of the prior art by providing a discrete memory access optimization method for heterogeneous many-core processors that exploits the hardware characteristics of such processors to improve the performance of discrete memory access and thereby the performance of data-intensive applications.

To achieve the above technical purpose, the present invention provides a discrete memory access optimization method for heterogeneous many-core processors, characterized by comprising:

First step: dividing an overall task into a plurality of task fragments;

Second step: establishing a counting variable in a memory space accessible to both the master core and the slave cores;

Third step: determining whether the value of the counting variable is smaller than the number of fragments of the overall task, and performing the fourth step if it is;

Fourth step: the master core and each slave core dynamically take task fragments from a task pool, perform an atomic add-1 operation on the counting variable, and complete the memory access operations for the fragments taken; processing then returns to the third step.

Preferably, in the third step, if the value of the counting variable is determined to equal the number of fragments of the overall task, the task is judged to be complete and processing stops; otherwise the master core and each slave core in the slave-core cluster dynamically take memory access tasks from the task pool and process them.

Preferably, in the first step, the overall task is divided into a predetermined number of task fragments; in the fourth step, the master core and each slave core dynamically request a memory access task fragment and process it.

Preferably, in the first step, the overall task is divided into task fragments of a predetermined size; in the fourth step, the master core and the slave cores complete the memory access tasks simultaneously, with tasks assigned dynamically.

Preferably, the predetermined size of the task fragments is adjustable; the master core and the slave cores complete the memory access tasks simultaneously, with tasks assigned dynamically.

With the method of the present invention, the master core and the slave-core cluster access memory simultaneously, so that the memory bandwidth of the chip is fully used. The method also exploits the master core's Cache: when the Cache hits, the master core completes its work without consuming memory bandwidth, so the discrete memory access bandwidth actually achieved may exceed the total bandwidth. Furthermore, dynamic task division effectively solves two problems: that the task division ratio between the master core and the slave-core cluster is hard to determine, and that uneven load among the cores drags down the performance of the problem. The invention thus provides a dynamic task division method that makes full use of the memory bandwidth of a heterogeneous many-core processor, divides the task volume flexibly and dynamically, according to actual run-time conditions, between the master core and the slave-core cluster and among the cores within the slave-core cluster, and effectively alleviates the discrete memory access bottleneck on heterogeneous many-core processors.

Brief description of the drawings

A more complete understanding of the present invention, and of its attendant advantages and features, will be gained more readily by reference to the following detailed description taken in conjunction with the accompanying drawings, in which:

Fig. 1 schematically shows a flowchart of the discrete memory access optimization method for heterogeneous many-core processors according to a preferred embodiment of the present invention.

It should be noted that the drawings serve to illustrate the present invention and not to limit it. Drawings that represent structures may not be drawn to scale. In the drawings, identical or similar elements are denoted by identical or similar reference numerals.

Detailed description of the embodiments

To make the content of the present invention clearer and easier to understand, it is described in detail below with reference to specific embodiments and the accompanying drawings.

In a heterogeneous many-core processor, the master core and the slave-core cluster share memory bandwidth. If only the master core accesses memory, the memory bandwidth cannot be fully used; if only the slave-core cluster accesses memory, the benefit that the master core's data Cache brings to memory access is wasted. When the master and slave cores access memory cooperatively, the effect of the Cache makes it possible for a program to obtain memory access performance exceeding the total discrete memory access bandwidth.

Because of the structural characteristics of heterogeneous many-core processors, the memory access capability of the master and slave cores and the memory bandwidth they actually use are difficult to quantify precisely. An improper task division may leave cooperative memory access with no benefit or even make performance worse; dividing the task volume by a fixed ratio generalizes poorly and may cause more pronounced load imbalance.

The present invention is proposed to solve this problem. Preferred embodiments of the invention are described in detail below with reference to the drawings.

As shown in Fig. 1, the discrete memory access optimization method for heterogeneous many-core processors according to a preferred embodiment of the present invention comprises:

First step S1: dividing an overall task into a plurality of task fragments;

Preferably, in the first step S1, the overall task may be divided into a predetermined number of task fragments. Alternatively, and also preferably, in the first step S1 the overall task may be divided into task fragments of a predetermined size. Further preferably, the user or operator can adjust the predetermined size of the task fragments, in particular so that the dynamic adjustment capability is retained without increasing the coordination overhead.
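The two ways of dividing the overall task in step S1 can be sketched as follows. This is only an illustration under assumptions: the patent does not say how a task is represented, so the sketch treats the overall task as the index range 0..total_items and each fragment as a half-open (start, end) pair; the function names split_by_count and split_by_size are hypothetical.

```python
def split_by_count(total_items, num_fragments):
    """Split range(total_items) into num_fragments nearly equal (start, end) pairs."""
    base, extra = divmod(total_items, num_fragments)
    fragments, start = [], 0
    for i in range(num_fragments):
        # The first `extra` fragments take one additional item each.
        end = start + base + (1 if i < extra else 0)
        fragments.append((start, end))
        start = end
    return fragments

def split_by_size(total_items, fragment_size):
    """Split range(total_items) into (start, end) pairs of at most fragment_size items."""
    return [(s, min(s + fragment_size, total_items))
            for s in range(0, total_items, fragment_size)]

print(split_by_count(10, 3))
print(split_by_size(10, 4))
```

With a fixed fragment size, the size parameter is the tuning knob mentioned above: smaller fragments give finer-grained balancing at the cost of more updates to the shared counting variable, while larger fragments do the opposite.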

Second step S2: establishing a counting variable in a memory space accessible to both the master core and the slave cores, and setting its initial value to 0; the counting variable records the progress of the overall task.

Third step S3: determining whether the value of the counting variable is smaller than the number of fragments of the overall task (that is, whether the overall task has been fully processed); if the value is smaller than the number of fragments, performing the fourth step S4; if the value is not smaller than (that is, equals) the number of fragments, the task is judged to be complete and processing stops;

Fourth step S4: the master core and each slave core dynamically take task fragments from a task pool, perform an atomic add-1 operation on the counting variable, and complete the memory access operations for the fragments taken; processing then returns to the third step S3.
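Steps S1 to S4 can be sketched as a shared-counter work loop. The sketch below rests on several assumptions that are not in the patent: Python threads stand in for the master core and the slave cores, a lock stands in for the hardware atomic add, the "memory access operation" is modeled as an indexed gather, and the names SharedCounter and run_gather are hypothetical. The sketch also folds the S3 test into the value returned by the atomic add, so that the check and the claim cannot be separated by another worker; as a consequence the counter may end slightly above the fragment count.

```python
import threading

class SharedCounter:
    """Stand-in for the counting variable in memory visible to all cores."""
    def __init__(self):
        self._value = 0                  # S2: initial value 0
        self._lock = threading.Lock()    # assumption: emulates an atomic add

    def fetch_add(self, n=1):
        """Atomically add n and return the value held before the addition."""
        with self._lock:
            old = self._value
            self._value += n
            return old

def run_gather(data, indices, fragment_size, num_workers):
    """Compute out[i] = data[indices[i]], with fragments claimed dynamically."""
    num_fragments = -(-len(indices) // fragment_size)   # S1: fixed-size fragments
    counter = SharedCounter()                            # S2
    out = [None] * len(indices)
    claimed = [0] * num_workers   # fragments processed by each worker

    def worker(wid):
        while True:
            frag = counter.fetch_add(1)      # S4: claim a fragment, atomic add 1
            if frag >= num_fragments:        # S3: no fragments left, stop
                return
            start = frag * fragment_size
            end = min(start + fragment_size, len(indices))
            for i in range(start, end):
                out[i] = data[indices[i]]    # the discrete memory access
            claimed[wid] += 1

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out, claimed

data = list(range(100, 200))
indices = [(i * 37) % 100 for i in range(50)]   # scattered access pattern
result, claimed = run_gather(data, indices, fragment_size=8, num_workers=4)
print(claimed)
```

How many fragments each worker ends up with is not fixed in advance; it depends on how quickly each worker finishes, which is the point of the dynamic division.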

Here, an atomic operation is defined as follows: an operation or a sequence of operations is atomic if, until it has finished executing, it is independent, indivisible, and uninterruptible. When several cores access the same resource, atomic operations guarantee that the cores operate on that resource at different times. Common atomic operations include atomic add, atomic subtract, and atomic compare-and-swap.
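The relationship between two of the operations named above can be sketched by building an atomic add out of compare-and-swap. This is an illustration under assumptions, not the patent's implementation: the AtomicInt class is hypothetical, and a lock emulates the indivisibility that real hardware instructions would provide.

```python
import threading

class AtomicInt:
    """Lock-based emulation of an atomic integer (the lock is an assumption)."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def fetch_add(self, n):
        """Atomic add built from a compare-and-swap retry loop."""
        while True:
            old = self.load()
            # If another thread changed the value in between, retry.
            if self.compare_and_swap(old, old + n):
                return old

counter = AtomicInt(0)

def bump(times):
    for _ in range(times):
        counter.fetch_add(1)

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())
```

Because every successful compare-and-swap installs exactly one increment, no update is lost even when several threads race on the counter.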

Preferably, to obtain a better result, the data may be preprocessed, for example by reordering it or removing invalid data, to improve data locality and let the Cache contribute more.
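One possible reading of this preprocessing step is sketched below. The patent does not define "invalid data" or a specific reordering, so the sketch makes assumptions: an index is treated as invalid when it falls outside the data array, and reordering means visiting the remaining indices in ascending address order so that neighboring accesses tend to share cache lines. The names prepare_indices and gather_sorted are hypothetical.

```python
def prepare_indices(indices, data_len):
    """Drop out-of-range indices and sort the rest by address.

    Returns (position, index) pairs so that results can be written back
    to their original positions.
    """
    valid = [(pos, idx) for pos, idx in enumerate(indices)
             if 0 <= idx < data_len]
    valid.sort(key=lambda pair: pair[1])   # ascending address order
    return valid

def gather_sorted(data, indices):
    """Gather data[indices[i]] in address order; invalid entries stay None."""
    out = [None] * len(indices)
    for pos, idx in prepare_indices(indices, len(data)):
        out[pos] = data[idx]
    return out

data = [10, 20, 30, 40]
indices = [3, -1, 0, 7, 3, 1]
print(prepare_indices(indices, len(data)))
print(gather_sorted(data, indices))
```

The result is the same as an unsorted gather for every valid index; only the order in which memory is touched changes.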

The present invention is applicable to processors with a heterogeneous many-core architecture. Its advantages are: 1. the master core and the slave-core cluster perform memory access operations simultaneously, so the memory bandwidth of the chip is fully used; and because the master core has a Cache, when the master core's Cache hits, this approach can obtain discrete memory access performance higher than the total discrete memory access bandwidth; 2. dynamic task division effectively solves the problems that the task division ratio between the master core and the slave-core cluster is hard to determine and that uneven load among the cores drags down the performance of the problem. The method is flexible, has low performance overhead, and yields a large overall benefit.

In addition, it should be noted that, unless otherwise stated or indicated, terms such as "first", "second", and "third" in the specification serve only to distinguish the components, elements, steps, and the like in the specification, and do not indicate a logical or sequential relationship among them.

It will be understood that although the present invention has been disclosed above by way of preferred embodiments, those embodiments are not intended to limit it. Any person of ordinary skill in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make many possible variations and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent variation, or refinement made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. A discrete memory access optimization method for heterogeneous many-core processors, characterized by comprising:

First step: dividing an overall task into a plurality of task fragments;

2. The discrete memory access optimization method for heterogeneous many-core processors according to claim 1, characterized in that, in the third step, if the value of the counting variable is determined to equal the number of fragments of the overall task, the task is judged to be complete and processing stops; otherwise the master core and each slave core in the slave-core cluster dynamically take tasks from the task pool and process them.

3. The discrete memory access optimization method for heterogeneous many-core processors according to claim 1 or 2, characterized in that, in the first step, the overall task is divided into a predetermined number of task fragments; in the fourth step, the master core and each slave core dynamically request a memory access task fragment and process it.

4. The discrete memory access optimization method for heterogeneous many-core processors according to claim 1 or 2, characterized in that, in the first step, the overall task is divided into task fragments of a predetermined size; in the fourth step, the master core and the slave cores complete the memory access tasks simultaneously, with tasks assigned dynamically.

5. The discrete memory access optimization method for heterogeneous many-core processors according to claim 1 or 2, characterized in that the predetermined size of the task fragments is adjustable, and the master core and the slave cores complete the memory access tasks simultaneously, with tasks assigned dynamically.