CN101777007A

CN101777007A - Parallel function simulation system for on-chip multi-core processor and method thereof

Info

Publication number: CN101777007A
Application number: CN 201010103887
Authority: CN
Inventors: 吴俊敏; 尹巍; 隋秀峰; 赵小雨; 唐轶轩; 朱小东
Original assignee: Suzhou Institute for Advanced Study USTC
Current assignee: Suzhou Institute for Advanced Study USTC
Priority date: 2010-01-28
Filing date: 2010-01-28
Publication date: 2010-07-14
Anticipated expiration: 2030-01-28
Also published as: CN101777007B

Abstract

The invention discloses a parallel function simulation system for an on-chip multi-core processor and a method thereof. The system comprises a system input module and a system output module, and is characterized in that a simulation kernel module is arranged between the system input module and the system output module. The simulation kernel module receives, from the system input module, information about the workload that runs on the target system, dynamically creates multiple threads according to the type of the workload to parallelize the simulation of that workload, and outputs the result through the system output module. The invention solves the problem in serial functional simulation that performance degrades as the number of cores in the target system increases. The system achieves a higher speedup and relatively high overall performance.

Description

Parallel function simulation system for an on-chip multi-core processor and method thereof

Technical field

The invention belongs to the field of processor simulation for information processing systems, and specifically relates to a parallel function simulation system for an on-chip multi-core processor and a method thereof.

Background technology

Computer simulation uses software to model the behavior of a computer system. With simulation software, researchers can analyze the performance and behavior of a new architecture without building a prototype system, which greatly reduces the time and cost of research. Over the past decade, industry and academia have applied simulation widely in the research and development of computer hardware and software architectures. With the arrival of the multi-core era, simulation will become increasingly important in the design of multi-core processors.

At present, most multi-core simulators are serial simulators that run only on a single main thread. As the number of cores in the target system increases, the performance of such a simulator becomes worse and worse. In the near future, Moore's Law will shift from a doubling of the number of on-chip transistors every 18 months to a doubling of the number of on-chip hardware threads every 18 months.

However, as the number of on-chip cores increases, the amount of state and the code footprint of the simulation also grow, which increases simulation time. This may also cause a considerable increase in L2 cache misses, and therefore in the number of cycles needed for simulation. As the core count of the target system grows, how to simulate a multi-core target system on a multi-core host processor will therefore become more and more important.

The functional simulator, also called the simulation kernel, is an important tool in computer system simulation. It is usually designed as a highly self-contained library that provides interfaces to the other parts of the simulator. In general, the simulation kernel provides functions for creating and destroying software contexts, loading programs, tracking the contexts that currently exist, querying context state, executing machine instructions, and handling speculative behavior. In a serial simulation kernel, only one thread runs on the operating system kernel. The simulation kernel typically reads the context content from the context configuration file and then assigns each trace to a context. The main thread then executes the instructions in these contexts. The serial simulator has an inherent defect, however: as the number of cores in the target system increases, the overall performance of the simulator declines.

Summary of the invention

The object of the invention is to provide a parallel function simulation system for an on-chip multi-core processor that solves the problem, found in prior-art serial functional simulation, of performance degrading as the number of cores in the target system increases.

To solve these problems of the prior art, the technical solution provided by the invention is as follows:

A parallel function simulation system for an on-chip multi-core processor comprises a system input module and a system output module, and is characterized in that the system further comprises a simulation kernel module. The simulation kernel module receives, from the system input module, information about the workload that runs on the target system, dynamically creates multiple threads according to the type of the workload to parallelize the simulation of that workload, and outputs the result through the system output module.

Preferably, the simulation kernel module comprises a multiprogrammed workload processing module and a multithreaded workload processing module.

Preferably, the multiprogrammed workload processing module allocates a context to each application program according to the configuration file, organizes the contexts into a context linked list, and then creates a thread for each context.

Preferably, the multithreaded workload processing module creates a main thread that runs at the top level of the target system, performs initialization, and then invokes system calls to dynamically create a number of child threads according to the context information inserted into the context linked list.

Preferably, the simulation system further comprises a shared variable protection module. When multiple threads access a shared variable concurrently, the shared variable protection module uses operating-system-level mutex locks and barrier operations to serialize the concurrent accesses.

Preferably, the simulation system further comprises a thread-local storage module, which gives each thread its own copy of the global variables when the thread is created.

Preferably, the simulation system further comprises a data padding module, which uses padding to place data accessed by different host processor cores in different cache lines.

Another object of the invention is to provide a parallel function simulation method for an on-chip multi-core processor, characterized in that the method comprises the following steps:

(1) the simulation kernel module receives, from the system input module, information about the workload that runs on the target system, and loads context information from the context configuration file of the workload;

(2) the simulation kernel module creates the corresponding number of threads according to the loaded context information;

(3) the threads dynamically created by the simulation kernel module run in parallel with the main thread and execute the instructions in their corresponding contexts, after which the system output is produced and the simulation is complete.

Preferably, in the method, when the threads dynamically created by the simulation kernel module synchronize with the main thread, access to shared variables is synchronized by means of operating-system-level mutex locks and barrier operations.

Preferably, in the method, the threads created by the simulation kernel module privatize global variables through the thread-local storage module, and the data padding module uses padding to ensure that data belonging to different host processor cores is allocated in different cache lines.

Through long-term research, the inventors developed a parallelized functional simulator on the basis of an existing serial simulator. The parallel function simulator of the invention starts from the code of the serial functional simulator and uses parallel programming techniques to parallelize the relevant code. Parallelization provides a direct and effective way to accelerate simulation.

The parallelized functional simulator works as follows: the simulation kernel loads context information from the context configuration file; it creates the corresponding number of threads according to the loaded context information; the created threads run in parallel with the main thread and execute the instructions in their corresponding contexts until the simulation is finished.

However, simply parallelizing the serial simulator directly does not yield a working parallel functional simulator. During the parallelization, the inventors encountered the following problems that urgently needed to be solved:

First, parallelization of the simulation kernel: while the simulator runs, the number of threads to create is usually determined by the type of workload running on the target system. The simulation kernel is therefore parallelized differently depending on the type of simulated workload, covering both multiprogrammed workloads and multithreaded workloads. Second, protection and synchronization of shared variables: parallelization requires synchronization between threads, and these synchronization operations degrade performance, so both the protection of shared resources and inter-thread synchronization must be implemented. Third, privatization of global variables: some state variables are global in the serial simulator, but in the parallel simulator they may belong to a single core, so these global variables must be privatized. Fourth, false sharing: false sharing arises when the simulation kernel is parallelized, and it greatly affects the performance of the parallel simulator.

Through long-term research, the inventors found solutions to the above problems. The specific schemes are as follows:

(1) Parallelization of the simulation kernel:

In general, the serial functional simulation kernel configures the execution paths of a multiprogrammed workload by loading context information from a configuration file. Before execution, the kernel allocates a context to each application program and then organizes these contexts directly into a context linked list. Because threads in a multiprogrammed workload rarely synchronize with one another and each context usually corresponds to one thread, the corresponding threads can be created once the context linked list has been formed.

A multithreaded workload initially has only one context (commonly called the main thread). At initialization this thread runs at the top level of the target system, but while running it can invoke system calls to create many child threads. The contexts of these child threads must obviously be inserted into the context linked list dynamically. Therefore, not only must a corresponding thread be created whenever a child thread is created, but the kernel must also be extended so that it supports dynamic thread creation.
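As an illustration only, the dynamic case might be sketched in C as follows. This is a minimal sketch and not the simulator's actual code: `struct sim_ctx`, `ctx_spawn`, `ctx_join_all` and `ctx_worker` are hypothetical names, and the worker merely marks its context as executed instead of simulating instructions.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical context record; the real simulator's context is far richer. */
struct sim_ctx {
    int id;
    int executed;            /* set by the host thread once it has run */
    pthread_t thread;        /* host thread bound to this context */
    struct sim_ctx *next;
};

static struct sim_ctx *ctx_list = NULL;
static pthread_mutex_t ctx_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the per-context simulation loop. */
static void *ctx_worker(void *arg)
{
    struct sim_ctx *ctx = arg;
    ctx->executed = 1;
    return NULL;
}

/* Called at the point where the simulated program requests a new thread:
 * bind a host thread to a fresh context and insert it into the shared list.
 * The host thread is started first so that a failed pthread_create leaves
 * the list untouched. */
struct sim_ctx *ctx_spawn(int id)
{
    struct sim_ctx *ctx = calloc(1, sizeof *ctx);
    if (ctx == NULL)
        return NULL;
    ctx->id = id;
    if (pthread_create(&ctx->thread, NULL, ctx_worker, ctx) != 0) {
        free(ctx);
        return NULL;
    }
    pthread_mutex_lock(&ctx_list_lock);   /* the list is shared: serialize inserts */
    ctx->next = ctx_list;
    ctx_list = ctx;
    pthread_mutex_unlock(&ctx_list_lock);
    return ctx;
}

/* Detach the whole list, join every host thread, and report how many
 * contexts were executed. */
int ctx_join_all(void)
{
    int done = 0;
    pthread_mutex_lock(&ctx_list_lock);
    struct sim_ctx *ctx = ctx_list;
    ctx_list = NULL;
    pthread_mutex_unlock(&ctx_list_lock);
    while (ctx != NULL) {
        struct sim_ctx *next = ctx->next;
        pthread_join(ctx->thread, NULL);
        done += ctx->executed;
        free(ctx);
        ctx = next;
    }
    return done;
}
```

In this sketch, a simulated thread-creation system call would call `ctx_spawn` once per new child context, and the main thread would call `ctx_join_all` at the end of the simulation.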

(2) Protection and synchronization of shared variables:

When a serial program is parallelized, protecting the shared data that concurrent operations access is extremely important. Locks are commonly used to serialize these concurrent accesses. The implementation of the parallel function simulator contains many kinds of shared variables, such as hash tables, the shared memory space and the context linked list. When multiple threads access a shared variable concurrently, operating-system-level mutex locks usually provide the necessary protection. When using locks, however, attention must be paid to their granularity and number in order to avoid deadlock and achieve satisfactory performance.
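The lock-granularity point can be illustrated with a small C sketch. It is an assumption-laden example rather than the simulator's code: the table, `NBUCKETS`, `hammer` and `run_locked_updates` are hypothetical, and error handling is omitted for brevity. One mutex guards each bucket instead of a single global lock, and a thread never holds more than one lock at a time, so no deadlock can arise.

```c
#include <pthread.h>

#define NBUCKETS 16
#define NTHREADS 4
#define NITERS   10000

/* Hypothetical shared table with one lock per bucket (fine-grained locking). */
static struct {
    pthread_mutex_t lock;
    long count;
} table[NBUCKETS];

/* Each thread updates buckets in turn, holding only that bucket's lock. */
static void *hammer(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        int b = i % NBUCKETS;
        pthread_mutex_lock(&table[b].lock);
        table[b].count++;
        pthread_mutex_unlock(&table[b].lock);
    }
    return NULL;
}

/* Runs NTHREADS concurrent updaters and returns the total number of updates. */
long run_locked_updates(void)
{
    pthread_t tid[NTHREADS];
    long total = 0;
    for (int b = 0; b < NBUCKETS; b++) {
        pthread_mutex_init(&table[b].lock, NULL);
        table[b].count = 0;
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, hammer, NULL);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    for (int b = 0; b < NBUCKETS; b++) {
        total += table[b].count;
        pthread_mutex_destroy(&table[b].lock);
    }
    return total;
}
```

With the mutexes in place no update is lost, so `run_locked_updates()` returns `NTHREADS * NITERS`. Fewer, coarser locks would be simpler but would serialize threads that touch unrelated buckets.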

Moreover, synchronization operations are frequently used between the different thread contexts of a multithreaded program. In the serial simulator, a synchronization operation only requires placing the corresponding context on the appropriate suspend list. In the parallel simulator, however, when a synchronization operation occurs, the operating system thread and its corresponding context must be suspended together. This can be achieved with operating-system-level locks and barrier operations.
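A barrier of this kind might look as follows in C, assuming a Linux host where POSIX barriers (`pthread_barrier_t`) are available. The names `ctx_thread`, `run_barrier_demo`, `stage1` and `stage2` are hypothetical; each thread stands in for the host thread of one simulated context.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>

#define NCTX 4

static pthread_barrier_t phase_barrier;
static int stage1[NCTX];
static int stage2[NCTX];

static void *ctx_thread(void *arg)
{
    int id = (int)(intptr_t)arg;
    stage1[id] = id + 1;                     /* phase 1: publish own value */
    pthread_barrier_wait(&phase_barrier);    /* host thread blocks until all arrive */
    stage2[id] = stage1[(id + 1) % NCTX];    /* phase 2: neighbor's value is now valid */
    return NULL;
}

/* Runs NCTX threads through the two phases and returns the sum of stage2. */
int run_barrier_demo(void)
{
    pthread_t tid[NCTX];
    int sum = 0;
    pthread_barrier_init(&phase_barrier, NULL, NCTX);
    for (intptr_t i = 0; i < NCTX; i++)
        pthread_create(&tid[i], NULL, ctx_thread, (void *)i);
    for (int i = 0; i < NCTX; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    for (int i = 0; i < NCTX; i++)
        sum += stage2[i];
    return sum;
}
```

Because every thread waits at the barrier before reading, each `stage2` entry sees a fully written `stage1` entry; without the barrier a thread could read its neighbor's slot before it was written.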

(3) Privatization of global variables:

In the serial simulator, much of the simulator's state is shared in the form of global variables. In the parallel simulator, however, these variables need multiple copies to reflect the state of the different cores. Simply turning these variables into vectors would increase the overall complexity. In parallelization, this problem is usually solved with thread-local storage.

Thread-local storage is implemented as follows: in gcc, to indicate that every thread should have its own copy of a variable when the thread is created, the __thread keyword is placed before the declaration of the global or static variable.
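A minimal C sketch of this use of `__thread` is given below, assuming gcc (or a compatible compiler) on a POSIX host. The variable `private_cycles` and the functions `count_cycles` and `tls_copies_are_independent` are hypothetical stand-ins for per-context simulator state.

```c
#include <pthread.h>
#include <stdint.h>

#define NTHREADS 4

/* One private copy per thread, standing in for a formerly global state variable. */
static __thread long private_cycles = 0;
static long results[NTHREADS];

static void *count_cycles(void *arg)
{
    long id = (long)(intptr_t)arg;
    for (long i = 0; i < 1000 * (id + 1); i++)
        private_cycles++;            /* no lock needed: each thread has its own copy */
    results[id] = private_cycles;
    return NULL;
}

/* Returns 1 if every thread saw only its own updates and the caller's copy
 * of the variable was left untouched. */
int tls_copies_are_independent(void)
{
    pthread_t tid[NTHREADS];
    for (intptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, count_cycles, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        if (results[i] != 1000L * (i + 1))
            return 0;
    return private_cycles == 0;
}
```

The only source change relative to a plain global is the `__thread` qualifier, which is why this approach avoids rewriting every use of the variable as an array access.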

(4) Solution to the false sharing problem:

In serial simulators of multi-core processors, many data structures take the form of arrays of structures. If each core owns one element of such an array, parallelization may cause false sharing. To solve this problem, padding can be used to ensure that the data of different cores is allocated in different cache lines.
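The padding idea can be sketched in C as follows. The 64-byte line size is an assumption about the host CPU, `__attribute__((aligned))` is a gcc extension, and `struct core_state_padded`, `per_core` and `elements_on_distinct_lines` are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size; check the actual host CPU */
#define NCORES     8

/* Per-core state padded and aligned so that each element fills one cache line. */
struct core_state_padded {
    long cycles;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

static struct core_state_padded per_core[NCORES];

/* Returns 1 if no two adjacent elements start in the same cache line. */
int elements_on_distinct_lines(void)
{
    for (int i = 0; i + 1 < NCORES; i++) {
        uintptr_t a = (uintptr_t)&per_core[i] / CACHE_LINE;
        uintptr_t b = (uintptr_t)&per_core[i + 1] / CACHE_LINE;
        if (a == b)
            return 0;
    }
    return 1;
}
```

Without the `pad` member, eight `long` counters would fit in one 64-byte line, and a write by one host core would invalidate that line in the caches of all the others. The cost of padding is memory: each element grows to a full cache line.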

Compared with the prior-art solutions, the advantages of the invention are:

Simulation experiments confirm that, compared with the serial simulator, the parallel function simulation system of the invention achieves a higher speedup and therefore higher overall performance.

Description of drawings

The invention is further described below with reference to the drawings and embodiments:

Fig. 1 is a schematic diagram of the execution model of the parallel function simulator;

Fig. 2 shows the speedup obtained by the parallel function simulator when running multiprogrammed workloads;

Fig. 3 shows the average speedup obtained by the parallel function simulator when running multiprogrammed workloads;

Fig. 4 shows the speedup obtained by the parallel function simulator when running multithreaded workloads;

Fig. 5 shows the average speedup obtained by the parallel function simulator when running multithreaded workloads.

Embodiment

The above scheme is further described below with reference to a specific embodiment. It should be understood that the embodiment is intended to illustrate the invention and not to limit its scope. The implementation conditions used in the embodiment may be further adjusted according to the conditions of a specific manufacturer, and implementation conditions that are not specified are generally those of routine experiments.

Embodiment: implementation and testing of the parallel function simulation system

This embodiment implements the parallel function simulator on the basis of the serial simulator Multi2sim-2.1. The server used throughout was a Dawning Tianyan EP850-GF, configured with eight quad-core AMD Opteron 8346 HE 1.8 GHz CPUs, 32 GB of DDR2 ECC memory and four 146 GB SAS hard disks. The server ran Debian Linux (x86-64).

Experiments show that when the serial simulator Multi2sim-2.1 simulates an 8-core target system, the simulation slowdown reaches 18.24; when it simulates a 16-core target system, the slowdown reaches as much as 165.24. As the number of target cores increases, the slowdown of the serial simulator increases sharply. The simulator must therefore be accelerated through parallelization to improve its overall performance.

In this embodiment, the serial simulation engine first loads the multiprogrammed or multithreaded workload and then creates the contexts. The way contexts are created depends on the workload type. For a multiprogrammed workload, a context is created as each application program in the workload is loaded into guest memory. For a multithreaded workload, only a main context is created at load time; the remaining contexts are generated dynamically when the main context executes the corresponding thread-creation primitive. The simulation engine maintains a set of state linked lists (active, suspended, and so on) for all created contexts and executes the instructions of each active context in turn, possibly changing context states along the way. Functional simulation ends when all contexts have executed all of their instructions. As this description shows, multi-core functional simulation has natural symmetry and isolation. Note also that functional simulation does not involve simulating microarchitectural components such as pipelines and caches. Both properties greatly simplify the design and implementation of the parallel function simulator.

The basic idea of this embodiment is to parallelize Multi2sim-2.1 with the POSIX multithreaded programming model, that is, to simulate each context with its own Pthread thread. When determining the structure of the parallel function simulator (chiefly the point at which the Pthread threads are created), a divide-and-conquer strategy was adopted: separate thread creation schemes were designed and implemented for the different simulation flows of multiprogrammed and multithreaded workloads, and after debugging and optimization the two were merged to form the final parallel function simulator. For a multiprogrammed workload, contexts are created as each application program in the workload is loaded into guest memory, so once loading is complete, as many threads as there are contexts can be created together and simulation can begin at the same time. For a multithreaded workload, a child context is generated dynamically when the main context executes the context-creation primitive, so the thread bound to it can only be created at that moment. For this reason, thread creation was added to the system call in the simulator that implements the context-creation primitive.

For a multiprogrammed workload, thread creation in the parallel function simulator can be implemented with code similar to the following:

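#include <pthread.h>

/* Fragment: ke, pid, i, ctxnum1, psim_cycle, max_cycles and the ctx_* helpers
   are simulator variables and functions declared elsewhere. */
void *ke_execute(void *args);

int main()
{
    /* Create POSIX threads according to ctx: one for every context except the last. */
    for (current_ctx = ke->contx_list; current_ctx->contx_next; current_ctx = current_ctx->contx_next)
    {
        pthread_create(&pid[i++], NULL, ke_execute, (void *)current_ctx);
    }
    /* The main thread simulates the last context itself. */
    ke_execute((void *)current_ctx);
    /* Wait for all created threads to finish. */
    for (context_number = 0; context_number < ctxnum1; context_number++)
    {
        pthread_join(pid[context_number], NULL);
    }
    return 0;
}

void *ke_execute(void *args)
{
    struct ctx_t *ctx = (struct ctx_t *)args;
    while (psim_cycle < max_cycles) {
        if (!ctx_get_status(ctx, ctx_running))
            break;
        /* Run an instruction from a dedicated context */
        ke_run((void *)ctx);
        psim_cycle++;
    }
    return NULL;
}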

For a multithreaded workload, thread creation in the parallel function simulator can be implemented with code similar to the following:

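void syscall_do()
{
    /* Fragment: the enclosing dispatch on the system-call code is not shown in the source. */
    case syscall_code_clone:
    {
        pid_array_index++;
        if (pid_array_index == cores_num - 1)
        {
            struct ctx_t *current_ctx = NULL;
            /* Walk back through the context list and bind a host thread to each context. */
            for (current_ctx = isa_ctx->contx_prev; current_ctx;
                 current_ctx = current_ctx->contx_prev) {
                pthread_t sub_pid;
                memcpy(current_ctx->mem, isa_ctx->mem, sizeof(struct mem_t));
                pthread_create(&sub_pid, NULL, ke_execute, (void *)current_ctx);
                pid_array[pid_array_index] = sub_pid;
            }
        }
    }
}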

After thread creation is complete, the execution model of the parallel simulator is as shown in Fig. 1. At this point the system contains as many threads as there are contexts. The operating system schedules each thread onto a different host processor core, and each thread performs the functional simulation of its own context concurrently.

The following describes the mechanisms used in this embodiment to protect and synchronize shared variables, to implement thread-local variable storage, and to eliminate false sharing.

Protecting concurrent accesses to shared resources is a key factor affecting the correctness and performance of a parallel program, and locking is one of the common techniques for serializing access to shared resources. Multi2sim-2.1 contains a large number of shared resources, such as the context state linked lists, hash tables and guest memory. In the parallel simulator, when the core threads access these resources, operating-system-level mutex locks are used to protect them, and the granularity and number of locks are arranged sensibly to maximize the speedup of the parallel simulator.

In addition, the contexts of a multithreaded workload often use a large number of synchronization primitives (such as locks and barriers). When the serial simulator executes a synchronization primitive for some context, it only needs to insert that context into the appropriate context state linked list in the corresponding system call; for example, a context that cannot obtain the lock it requested is inserted into the suspended list. In the parallel simulator, however, when a context is inserted into the appropriate state linked list, the operating system thread that hosts it must also be switched to the appropriate state. For this purpose, the system calls in the simulator that implement synchronization primitives were extended with operating-system-level synchronization operations of corresponding function.
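One way to suspend a context together with its host thread is a condition variable, sketched below in C. This is an assumption about how such an extension could look, not the code of the embodiment: `struct sim_lock`, `struct sim_ctx`, `sim_lock_acquire`, `sim_lock_release` and the demo functions are hypothetical.

```c
#include <pthread.h>

enum ctx_state { CTX_RUNNING, CTX_SUSPENDED };

struct sim_ctx { enum ctx_state state; };

/* Hypothetical simulated (guest) lock backed by host-level primitives. */
struct sim_lock {
    pthread_mutex_t mtx;
    pthread_cond_t  released;
    int held;                    /* 1 while some context owns the simulated lock */
};

void sim_lock_acquire(struct sim_lock *l, struct sim_ctx *ctx)
{
    pthread_mutex_lock(&l->mtx);
    while (l->held) {
        ctx->state = CTX_SUSPENDED;                /* context is marked suspended ... */
        pthread_cond_wait(&l->released, &l->mtx);  /* ... and its host thread sleeps too */
    }
    ctx->state = CTX_RUNNING;
    l->held = 1;
    pthread_mutex_unlock(&l->mtx);
}

void sim_lock_release(struct sim_lock *l)
{
    pthread_mutex_lock(&l->mtx);
    l->held = 0;
    pthread_cond_signal(&l->released);             /* wake one suspended host thread */
    pthread_mutex_unlock(&l->mtx);
}

static struct sim_lock guest_lock = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };
static long guest_counter;

static void *guest_thread(void *arg)
{
    struct sim_ctx ctx = { CTX_RUNNING };
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        sim_lock_acquire(&guest_lock, &ctx);
        guest_counter++;                           /* protected by the simulated lock */
        sim_lock_release(&guest_lock);
    }
    return NULL;
}

/* Four contexts contend for one simulated lock; returns the final counter. */
long run_guest_lock_demo(void)
{
    pthread_t tid[4];
    guest_counter = 0;
    for (int t = 0; t < 4; t++)
        pthread_create(&tid[t], NULL, guest_thread, NULL);
    for (int t = 0; t < 4; t++)
        pthread_join(tid[t], NULL);
    return guest_counter;
}
```

The design choice illustrated here is that a blocked context consumes no host CPU time: its host thread sleeps inside `pthread_cond_wait` until the simulated lock is released, instead of being polled by a scheduler loop as in the serial simulator.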

Like many serial programs, the serial multi-core simulator Multi2sim-2.1 makes prominent use of global variables to control the execution state of the simulator. For example, in order to execute the instructions of different contexts through a uniform interface function, Multi2sim-2.1 introduces a series of global variables such as isa_ctx, isa_regs and isa_mem. In the parallel simulator, most such variables need a separate copy for each context. The straightforward approach is to identify these variables accurately, extend each into an array (vector), and index the array by context id. This approach clearly involves a large amount of work and complicates writing and debugging the parallel simulator code. The thread-local storage (TLS) language construct is therefore used to resolve the difficulty of parallelizing this class of variables.

In gcc, TLS is obtained by placing the __thread keyword before the declaration of a global variable or static local variable, which means that a separate copy of the variable is generated automatically when a thread is created. In this way, variables that represent the attributes of a single core thread can be separated by hand from variables shared by multiple threads, which improves the performance of the parallel simulator.

In the serial multi-core simulator, many data structures are organized as arrays, with one array element per processor core. From a coding point of view this is convenient and causes no problems, but in the parallel simulator, multiple threads each referencing their own array element can cause false sharing. To mitigate this problem, padding is used to ensure that data accessed by different host processor cores is allocated in different cache lines.

Because Multi2sim-2.1 was designed and implemented entirely as a serial program, it contains a large number of variables of the kinds described above, and these variables have a large impact on the performance of the parallel simulator. A great deal of time was therefore spent locating such variables during parallelization, and the speed of the simulator eventually reached a satisfactory level.

The performance of the parallel function simulator was evaluated with multiprogrammed and multithreaded workloads. The multiprogrammed workloads are combinations of applications from SPEC2006; Table 1 lists all multiprogrammed workload combinations used in the tests. The multithreaded test programs come from Splash2; the workloads used in the tests are FFT, LU (c), RADIX and LU (n).

Table 1. Multiprogrammed workload combinations

Fig. 2 and Fig. 3 show the speedups obtained when the simulator runs the multiprogrammed test workloads. With a 2-core target system, the maximum speedup is 1.914 and the minimum is 1.478; with 4 cores, the maximum is 3.814 and the minimum is 3.366; with 8 cores, the maximum is 7.618 and the minimum is 7.143; with 16 cores, the maximum is 15.827 and the minimum is 15.321. Fig. 3 also shows that the average speedups for 2, 4, 8 and 16 cores are 1.748, 3.644, 7.372 and 15.628 respectively.

Fig. 4 and Fig. 5 show the speedups obtained when the simulator runs the multithreaded test workloads. With a 2-core target system, the maximum speedup is 1.819 and the minimum is only 1.378; with 4 cores, the maximum is 3.131 and the minimum is only 1.838; with 8 cores, the maximum is 4.852 and the minimum is only 2.434; with 16 cores, the maximum is 6.372 and the minimum is only 4.821. Fig. 5 also shows that the average speedups for 2, 4, 8 and 16 cores are 1.692, 2.760, 3.833 and 5.292 respectively.

When running multithreaded workloads, the parallel function simulator achieves a lower speedup than when running multiprogrammed workloads. This is mainly caused by communication between threads. In a multiprogrammed workload the threads hardly communicate at all, so a near-ideal speedup can be obtained. Between the threads of a multithreaded workload, however, there is a large amount of inter-thread communication. As the number of cores increases, this overhead becomes very large and lowers the overall performance of the parallel function simulator.
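The gap between the two workload types can be made concrete by converting the reported average speedups into parallel efficiency, that is, speedup divided by core count. The small C sketch below does this arithmetic; the array and function names are hypothetical, and the numbers are the averages quoted above for Fig. 3 and Fig. 5.

```c
/* Average speedups reported for 2, 4, 8 and 16 target cores. */
static const int    core_counts[4]      = {2, 4, 8, 16};
static const double avg_multiprogram[4] = {1.748, 3.644, 7.372, 15.628};
static const double avg_multithread[4]  = {1.692, 2.760, 3.833, 5.292};

/* Parallel efficiency: speedup divided by the number of cores (1.0 is ideal). */
double parallel_efficiency(double speedup, int ncores)
{
    return speedup / ncores;
}

/* Efficiency of the idx-th configuration (0..3) for multiprogrammed workloads. */
double multiprogram_efficiency(int idx)
{
    return parallel_efficiency(avg_multiprogram[idx], core_counts[idx]);
}

/* Efficiency of the idx-th configuration (0..3) for multithreaded workloads. */
double multithread_efficiency(int idx)
{
    return parallel_efficiency(avg_multithread[idx], core_counts[idx]);
}
```

At 16 cores this gives an efficiency of roughly 0.98 for the multiprogrammed average (15.628 / 16) and roughly 0.33 for the multithreaded average (5.292 / 16), which is consistent with the explanation that inter-thread communication dominates as the core count grows.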

The above example merely illustrates the technical concept and features of the invention. Its purpose is to enable those familiar with the art to understand and implement the invention; it does not limit the scope of protection of the invention. All equivalent changes or modifications made in accordance with the spirit of the invention shall fall within the scope of protection of the invention.

Claims

1. A parallel function simulation system for an on-chip multi-core processor, comprising a system input module and a system output module, characterized in that the system further comprises a simulation kernel module; the simulation kernel module receives, from the system input module, information about the workload that runs on the target system, dynamically creates multiple threads according to the type of the workload to parallelize the simulation of that workload, and outputs the result through the system output module.

2. The parallel function simulation system for an on-chip multi-core processor according to claim 1, characterized in that the simulation kernel module comprises a multiprogrammed workload processing module and a multithreaded workload processing module.

3. The parallel function simulation system for an on-chip multi-core processor according to claim 2, characterized in that the multiprogrammed workload processing module allocates a context to each application program according to the configuration file, organizes the contexts into a context linked list, and then creates a thread for each context.

4. The parallel function simulation system for an on-chip multi-core processor according to claim 2, characterized in that the multithreaded workload processing module creates a main thread that runs at the top level of the target system, performs initialization, and then invokes system calls to dynamically create a number of child threads according to the context information inserted into the context linked list.

5. The parallel function simulation system for an on-chip multi-core processor according to claim 1, characterized in that the simulation system further comprises a shared variable protection module; when multiple threads access a shared variable concurrently, the shared variable protection module uses operating-system-level mutex locks and barrier operations to serialize the concurrent accesses.

6. The parallel function simulation system for an on-chip multi-core processor according to claim 1, characterized in that the simulation system further comprises a thread-local storage module, which gives each thread its own copy of the global variables when the thread is created.

7. The parallel function simulation system for an on-chip multi-core processor according to claim 1, characterized in that the simulation system further comprises a data padding module, which uses padding to place data accessed by different host processor cores in different cache lines.

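8. A parallel function simulation method for an on-chip multi-core processor, characterized in that the method comprises the following steps:

(1) the simulation kernel module receives, from the system input module, information about the workload that runs on the target system, and loads context information from the context configuration file of the workload;

(2) the simulation kernel module creates the corresponding number of threads according to the loaded context information;

(3) the threads dynamically created by the simulation kernel module run in parallel with the main thread and execute the instructions in their corresponding contexts, after which the system output is produced and the simulation is complete.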

9. The method according to claim 8, characterized in that when the threads dynamically created by the simulation kernel module synchronize with the main thread, access to shared variables is synchronized by means of operating-system-level mutex locks and barrier operations.

10. The method according to claim 8, characterized in that the threads created by the simulation kernel module privatize global variables through the thread-local storage module, and the data padding module uses padding to ensure that data belonging to different host processor cores is allocated in different cache lines.