EP4042277A1 - Verfahren zur reproduzierbaren parallelsimulation auf elektronischer systemebene, die mittels eines multicore-simulationsrechnersystems mit ereignisorientierter simulation implementiert ist - Google Patents

Method for reproducible parallel simulation at electronic system level, implemented by means of a multi-core simulation computer system with discrete-event simulation

Info

Publication number
EP4042277A1
EP4042277A1
Authority
EP
European Patent Office
Prior art keywords
simulation
address
processes
evaluation
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20786583.3A
Other languages
English (en)
French (fr)
Inventor
Gabriel BUSNOT
Tanguy SASSOLAS
Nicolas Ventroux
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Commissariat à l'Énergie Atomique et aux Énergies Alternatives (CEA)
Original Assignee
Commissariat à l'Énergie Atomique (CEA)
Commissariat à l'Énergie Atomique et aux Énergies Alternatives (CEA)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Commissariat à l'Énergie Atomique (CEA) and Commissariat à l'Énergie Atomique et aux Énergies Alternatives (CEA)
Publication of EP4042277A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/4887Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues involving deadlines, e.g. rate based, periodic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/28Error detection; Error correction; Monitoring by checking the correct order of processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/52Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • G06F9/524Deadlock detection or avoidance

Definitions

  • the invention relates to a reproducible parallel simulation method of electronic system level implemented by means of a multi-core computer simulation system with discrete events.
  • the invention relates to the field of systems-on-a-chip design tools and methodologies, and aims to increase the execution speed of virtual prototyping tools in order to accelerate the early phases of systems-on-a-chip design.
  • a system on a chip can be broken down into two components: hardware and software.
  • Software, which represents an increasing share of systems-on-a-chip development effort, needs to be validated as early as possible. In particular, it is not possible to wait for the manufacture of the first hardware prototype, for reasons of cost and time to market.
  • high-level modeling tools have been developed. These tools allow the description of a high-level virtual prototype of the hardware platform. The software intended for the system being designed can then be executed and validated on this virtual prototype.
  • the invention provides a parallel SystemC simulation kernel supporting all types of models (such as RTL, for "Register Transfer Level", and TLM, for "Transaction Level Modeling").
  • a first technique aims to prevent errors related to parallelization by means of a static code analysis such as in [SCHM18].
  • a specialized compiler for SystemC programs makes it possible to analyze the source code of a model. It focuses on transitions, that is, the portions of code executed between two calls to the wait() synchronization primitive. Since these portions must be evaluated atomically, the compiler looks for dependencies between these transitions in order to determine whether they can be evaluated in parallel. This technique refines the analysis by distinguishing between modules and ports in order to limit false positive detections. A static ordering of the processes can then be computed. However, in the context of a TLM model, all the processes accessing, for example, the same memory will be scheduled sequentially, making this approach inefficient.
  • Process zones are also used in [SCHU13].
  • the set of processes and the associated resources that these processes can access is called a process zone.
  • the processes of the same zone are executed sequentially, guaranteeing their atomicity.
  • the processes of different zones are executed in parallel.
  • In order to preserve atomicity when a process in a zone tries to access resources belonging to another zone (variables or functions belonging to a module located in another zone), it is interrupted, its context is migrated to the targeted zone, and it is then restarted sequentially with respect to the other processes of its new zone.
  • However, this technique does not guarantee the atomicity of the processes in all cases. Consider the following scenario:
  • a process P_a modifies a state S_a of its own zone before changing zone to modify a state S_b;
  • a process P_b modifies S_b before changing zone to modify S_a;
  • each process will then see the changes made by the other process during the current evaluation phase, violating the evaluation atomicity of the processes.
  • all the processes would be sequentialized when accessing this memory, thus presenting performance close to that of a fully sequential simulation.
  • the fork(2) function allows the duplication of a process.
  • temporal decoupling refers here to a technique used in "loosely-timed" TLM modeling, which consists in allowing a process to run ahead of the global simulation time and to synchronize only at time intervals of a constant duration called the quantum. This greatly increases simulation speed but introduces timing errors. For example, a process can receive at its local date t_0 an event sent by another process whose local date was t_1, with t_1 < t_0, violating the principle of causality. In order to improve the precision of these models using temporal decoupling, [JUNG19] implements a rollback technique based on fork(2).
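The quantum mechanism described above can be sketched as follows. This is a minimal illustration; the `QuantumKeeper` name and interface are hypothetical simplifications loosely inspired by TLM-2.0's quantum keeper, not the patent's implementation:

```cpp
#include <cstdint>

// Hypothetical sketch of "loosely-timed" temporal decoupling: a process
// keeps a local time offset ahead of the global simulation time and only
// synchronizes (calls wait()) once the offset exceeds a fixed quantum.
struct QuantumKeeper {
    uint64_t global_time = 0;   // kernel-maintained simulation time
    uint64_t local_offset = 0;  // how far this process has run ahead
    uint64_t quantum;           // constant synchronization interval

    explicit QuantumKeeper(uint64_t q) : quantum(q) {}

    // Account for simulated delay without yielding to the kernel.
    void inc(uint64_t delay) { local_offset += delay; }

    // The process must yield once it has consumed its quantum.
    bool need_sync() const { return local_offset >= quantum; }

    // Synchronize: fold the local offset back into the global time.
    void sync() { global_time += local_offset; local_offset = 0; }
};
```

Running ahead between synchronization points is what makes the approach fast, and also what allows the causality violations (t_1 < t_0) described above.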
  • [JUNG19] uses process level rollback to correct simulation timing errors.
  • simulation speed is still limited by the single-core performance of the host machine.
  • fork(2) no longer allows saving the state of the simulation, because threads are not duplicated by fork(2), making this approach inapplicable in the case of the invention.
  • correcting the timing errors of a model using a quantum constitutes, strictly speaking, a violation of the atomicity of the processes, the latter being interrupted by the simulation kernel without having called the wait() primitive. This functionality may be desired by some but is incompatible with the aim of respecting the SystemC standard.
  • [VENU 6] uses a method in which the concurrent processes of a SystemC simulation are executed in parallel execution queues, each associated with a specific logical core of the host machine. A process for analyzing dependencies between the processes is implemented in order to guarantee their atomicity. [VENU 6] relies on the manual declaration of shared memory areas to guarantee a valid simulation. However, it is often impossible to know these areas a priori in the case of dynamic memory allocation or virtualized memory, as is often the case under an operating system. [VENU 6] uses a parallel phase and an optional sequential phase in the event of processes being preempted for forbidden access to a shared memory during the parallel phase. Any parallelism is prevented during this sequential phase, causing a significant slowdown.
  • [VENU 6] establishes dependencies through multiple graphs constructed during the evaluation phase. This requires heavy synchronization mechanisms to guarantee the integrity of the graphs, which greatly slow down the simulation. [VENU 6] also requires that the overall dependency graph be completed and analyzed at the end of each parallel phase, further slowing down the simulation. [VENU 6] handles execution queues monolithically, that is, if one process in the simulation is sequentialized, all processes in the same execution queue will be sequentialized as well.
  • [VENU 6] proposes to reproduce a simulation from a linearization of the dependency graph of each evaluation phase stored in a trace. This forces the sequential evaluation of processes which may in fact be independent: for example, the graph with edges (1→2, 1→3) would be linearized as (1, 2, 3), whereas processes 2 and 3, not depending on each other, could be executed in parallel.
  • An object of the invention is to overcome the problems mentioned above, and in particular to speed up the simulation while keeping it reproducible.
  • a reproducible parallel discrete event simulation method at electronic system level implemented by means of a multi-core computer system, said simulation method comprising a succession of evaluation phases, implemented by a simulation kernel executed by said computer system, comprising the following steps:
  • Such a method allows the parallel simulation of SystemC models in compliance with the standard.
  • this method allows identical reproduction of a simulation, facilitating debugging. It supports "loosely-timed" TLM simulation models using temporal decoupling through the use of a simulation quantum and direct memory access (DMI), which is very useful for achieving high simulation speeds.
  • the parallel scheduling of processes uses process queues, the processes of the same queue being executed sequentially by a system task associated with a logical core.
  • process queues can be populated manually or automatically, for example, it is possible to bring together processes that may have dependencies or rebalance the load on each core by migrating processes from one queue to another.
  • the rollback uses backups of simulation states during the simulation made by the simulation kernel.
  • the state machine of an address of the shared memory comprises the following four states:
  • Read_exclusive, when the address has been accessed exclusively in read mode by a single process, said process then being defined as the owner of the address;
  • Read_shared, when the address has been accessed exclusively in read mode by at least two processes, without any process being defined as the owner of the address.
  • the preemption of a process by the kernel is determined when:
  • a write access is requested on an address of the shared memory by a process which is not the owner recorded in the state machine of the address, and the current state is different from "no access";
  • the state machine of an address of the shared memory comprises the following four states:
  • Read_exclusive, when the address has been accessed exclusively in read mode by a single process queue, said process queue then being defined as the owner of the address;
  • Read_shared, when the address has been accessed exclusively in read mode by at least two process queues, without any process queue being defined as the owner of the address.
  • the preemption of a process by the kernel is determined when:
  • a write access is requested on an address of the shared memory by a process queue which does not own the address in the state machine, and the current state is different from "no access";
  • all the state machines of the addresses of the shared memory are regularly reset to the "no access" state.
  • all the state machines of the addresses of the shared memory are reset to the "no access" state during the evaluation phase following the preemption of a process.
  • the preemption of a process may turn out to be characteristic of a change in the use of an address by the simulated program, and it is preferable to maximize parallelism by freeing the states of the addresses observed in previous quanta.
  • the verification of access conflicts to shared memory addresses during each evaluation phase is performed asynchronously, during the execution of the subsequent evaluation phases.
  • the execution trace allowing the subsequent identical reproduction of the simulation comprises a list of numbers representative of evaluation phases, associated with a partial order of evaluation of the processes defined by the inter-process dependency relationships of each evaluation phase.
  • a rollback, upon detection of at least one conflict, restores a past state of the simulation, then reproduces the simulation identically up to the evaluation phase having produced the conflict, and then executes its processes sequentially.
  • a rollback, upon detection of at least one conflict, restores a past state of the simulation, then reproduces the simulation identically up to the evaluation phase having produced the conflict, and then executes its processes according to a partial order deduced from the dependency graph of the evaluation phase which produced the conflict, after having removed one arc per cycle.
  • a state of the simulation is saved at regular intervals of evaluation phases.
  • a state of the simulation is saved at evaluation phase intervals increasing in the absence of conflict detection and decreasing following conflict detection.
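The adaptive save interval above can be sketched as a simple policy. This is a hedged illustration: the doubling/halving factors and the bounds are arbitrary choices for the example, not specified by the patent:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical policy for the adaptive checkpoint interval: the interval
// between simulation-state saves grows while no conflict is detected and
// shrinks after a conflict, within illustrative bounds.
struct CheckpointPolicy {
    uint64_t interval;       // evaluation phases between two saves
    uint64_t min_interval;
    uint64_t max_interval;

    void on_no_conflict() { interval = std::min(interval * 2, max_interval); }
    void on_conflict()    { interval = std::max(interval / 2, min_interval); }
};
```

A longer interval reduces checkpoint overhead in conflict-free runs; a shorter one reduces the amount of re-simulation needed after a rollback.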
  • a computer program product comprising computer code executable by computer, stored on a computer readable medium and adapted to implement a method as described above.
  • FIG. 1 schematically illustrates the phases of a system simulation, according to the state of the art
  • FIG. 2 schematically illustrates an embodiment of the reproducible parallel simulation method at electronic system level implemented by means of a multi-core computer simulation system with discrete events, according to one aspect of the invention
  • FIG. 3 schematically illustrates a parallel scheduling of processes, according to one aspect of the invention
  • FIG. 4 schematically illustrates a state machine associated with a shared memory address, according to one aspect of the invention
  • FIG. 5 schematically illustrates a data structure allowing the recording of a trace of the memory accesses carried out by each of the execution queues of the simulation, according to one aspect of the invention
  • FIG. 6 schematically illustrates an algorithm making it possible to extract a partial process execution order from an interprocess dependency graph, according to one aspect of the invention
  • FIG. 7 schematically illustrates the rollback procedure in the event of detection of an error during the simulation, according to one aspect of the invention
  • FIG. 8 schematically illustrates a trace allowing identical reproduction of a simulation, according to one aspect of the invention.
  • the invention is based on the monitoring of memory accesses, combined with a method for detecting shared addresses, a system making it possible to restore a previous state of the simulation, and a simulation reproduction system.
  • modeling techniques are based on increasingly high-level abstractions. This makes it possible to take advantage of the trade-off between speed and precision: a less detailed model requires less computation to simulate a given action, which increases the number of actions that can be simulated in a given time. However, it is becoming increasingly difficult to raise the level of abstraction of models without compromising the validity of simulation results. As simulation results that are too imprecise inevitably lead to costly design errors downstream, it is important to maintain a sufficient level of precision.
  • the present invention proposes to resort to parallelism to accelerate the simulation of systems on a chip.
  • a technique of parallel simulation of SystemC models is used.
  • a SystemC simulation is broken down into three phases, as illustrated in Figure 1: the elaboration, during which the different modules of the model are initialized; the evaluation, during which the new state of the model is computed from its current state through the execution of the various processes of the model; and the update, during which the results of the evaluation phase are propagated into the model for the next evaluation phase.
  • the evaluation phase is triggered by three types of notifications: instantaneous, delta, and temporal.
  • An instantaneous notification has the effect of scheduling additional processes to run directly during the current evaluation phase.
  • a delta notification schedules the execution of a process in a new evaluation phase taking place on the same date (simulation time).
  • a temporal notification schedules the execution of a process at a later date. It is this type of notification that causes the simulated time to advance.
  • the evaluation phase requires significantly more computing time than the other two. It is therefore the acceleration of this phase which provides the greatest gain and which is the subject of the invention.
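As a hedged illustration of the phases and notification types described above, the following minimal kernel model distinguishes delta notifications (new evaluation phase at the same date) from temporal notifications (advance of simulated time). Names such as `MiniKernel` and `notify_at` are invented for this sketch, and the update phase is elided:

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Illustrative model of the evaluation / notification cycle. Processes
// are plain callbacks; the update phase (committing signal values) and
// sensitivity lists of a real SystemC kernel are omitted.
struct MiniKernel {
    using Process = std::function<void(MiniKernel&)>;
    std::vector<Process> runnable;          // current evaluation phase
    std::vector<Process> next_delta;        // delta notifications
    std::multimap<uint64_t, Process> timed; // temporal notifications
    uint64_t now = 0;                       // simulated time

    void notify_delta(Process p) { next_delta.push_back(std::move(p)); }
    void notify_at(uint64_t t, Process p) { timed.emplace(t, std::move(p)); }

    void run() {
        while (!runnable.empty()) {
            // Evaluation phase: execute every triggered process.
            auto batch = std::move(runnable);
            runnable.clear();
            for (auto& p : batch) p(*this);
            if (!runnable.empty()) continue;   // instantaneous notifications
            if (!next_delta.empty()) {         // delta: same date, new phase
                runnable = std::move(next_delta);
                next_delta.clear();
            } else if (!timed.empty()) {       // temporal: advance time
                now = timed.begin()->first;
                auto range = timed.equal_range(now);
                for (auto it = range.first; it != range.second; ++it)
                    runnable.push_back(std::move(it->second));
                timed.erase(range.first, range.second);
            }
        }
    }
};
```

Only the temporal branch advances `now`, matching the statement above that temporal notifications are what cause simulated time to advance.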
  • the SystemC standard requires that a simulation be reproducible, that is to say that it always produces the same result from one execution to the next in the presence of the same inputs. It is therefore required that the various processes scheduled to run during a given evaluation phase be executed in accordance with coroutine semantics, and therefore in an atomic manner. This makes it possible to obtain an identical simulation result between two executions with the same input conditions.
  • Atomicity is a property used in concurrent programming to designate an operation, or a set of operations, of a program that is executed entirely, without being interruptible before the end of its execution, and without any intermediate state of the atomic operation being observable.
  • the invention presents a mechanism ensuring the atomicity of the processes which interact via shared memory only. It is also possible to reproduce a past simulation from a trace stored in a file.
  • FIG. 2 schematically represents six distinct interacting components of the invention, allowing the parallel simulation of SystemC models:
  • Parallel scheduling 1 of processes for example by process queues, the processes of the same queue being allocated to the same logical core.
  • parallel scheduling can also use a distribution of processes by global sharing, that is to say that each evaluation task executes a pending process taken from the global queue of processes to be evaluated during the current evaluation phase;
  • rollback 5, upon detection of at least one conflict, to restore a past state of the simulation after determining an order of execution of the processes of the conflicting evaluation phase (the phase during which the conflict is detected), determined from the inter-process dependency graph, so as to avoid the detected conflict in a new simulation that is identical up to, but excluding, the conflicting evaluation phase;
  • Parallel scheduling makes it possible to run the concurrent processes of a simulation in parallel, for example by execution queue, in which case each execution queue is assigned to a logical core of the host machine.
  • An evaluation phase then consists of a succession of parallel sub-phases, the number of which depends on the existence of pre-empted processes during each evaluation sub-phase. Running processes in parallel requires precautions to preserve their atomicity. To do this, memory accesses, which represent the most frequent form of interaction, are instrumented.
  • each memory access must be instrumented by a prior call to a specific function.
  • the instrumentation function will determine any inter-process dependencies generated by the instrumented action. If necessary, the process that initiated the action can be preempted; it then resumes its execution alongside the other preempted processes in a new parallel evaluation sub-phase. These parallel evaluation sub-phases follow one another until all the processes are fully evaluated.
  • each address is associated with a state machine indicating whether this address is accessible in read-only mode by all the processes, or in read and write mode by a single process, according to the previous accesses to this address. Depending on the state of the address and the access being instrumented, the access is authorized or the process is preempted.
  • This mechanism aims to avoid process evaluation atomicity violations, also called conflicts, but does not guarantee their absence. It is therefore necessary to monitor the absence of conflicts at the end of each evaluation phase.
  • It is verified at the end of each evaluation phase that no conflict exists, as detailed below in the description.
  • the memory accesses likely to generate a dependency have also been stored in a dedicated structure during the quantum evaluation.
  • the latter is used by an independent system thread to build an inter-process dependency graph and to verify that no conflict, materialized by a cycle in the graph, exists. This check takes place while the simulation continues.
  • the simulation kernel retrieves the results in parallel with a subsequent evaluation phase. In the event of a conflict, a rollback system makes it possible to return to a past state of the simulation before the conflict.
  • the cause of the error is analyzed using the dependency relationships between processes and the simulation is resumed at the last save point before the conflict.
  • the scheduling to be applied to avoid a reproduction of the conflict is transmitted to the simulation before it resumes.
  • the simulation also resumes in "simulation reproduction" mode, detailed in the remainder of the description, which makes it possible to guarantee an identical simulation result from one simulation to the next. This prevents the point of conflict from being displaced, due to the non-determinism of the parallel simulation, and from occurring again.
  • Simulation reproduction uses a trace generated during a past simulation to reproduce the same result.
  • This trace essentially represents a partial order in which the processes must be executed during each evaluation phase. It is stored in a file or any other means of storage persistent between two simulations.
  • A partial order is an order which is not total, that is to say an order which does not make it possible to rank all the elements relative to one another.
  • the processes between which no order relation is defined can be executed in parallel.
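A partial order can be exploited by grouping processes into successive "waves" whose members are mutually unordered and may therefore run in parallel. The sketch below is one possible extraction (longest-path layering over an acyclic dependency graph); the function name and graph representation are illustrative, not the patent's algorithm, and the graph is assumed conflict-free, hence cycle-free:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Group processes into parallelizable "waves" from a dependency graph.
// deps[i] lists the processes that must run before process i. With
// edges 0->1 and 0->2, the waves are {0} then {1, 2}.
std::vector<std::vector<std::size_t>>
parallel_waves(const std::vector<std::vector<std::size_t>>& deps) {
    std::size_t n = deps.size();
    std::vector<std::size_t> level(n, 0);
    bool changed = true;                 // longest-path layering (acyclic graph)
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t d : deps[i])
                if (level[i] < level[d] + 1) {
                    level[i] = level[d] + 1;
                    changed = true;
                }
    }
    std::size_t max_level = 0;
    for (std::size_t l : level) max_level = std::max(max_level, l);
    std::vector<std::vector<std::size_t>> waves(max_level + 1);
    for (std::size_t i = 0; i < n; ++i) waves[level[i]].push_back(i);
    return waves;
}
```

This contrasts with the full linearization criticized above: processes in the same wave keep no order relation and need not be sequentialized.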
  • the invention does not require prior knowledge of the shared or read-only addresses in order to operate, which allows greater flexibility of use. Any conflicts are then managed by a simulation rollback solution. It also presents a higher level of parallelism than similar solutions.
  • FIG. 3 schematically illustrates the parallel scheduling of processes, with the use of process queues.
  • Instead of using process queues, it is possible to use a distribution of the processes by global sharing, that is to say that each evaluation task executes a pending process taken from the global queue of processes to be evaluated during the current evaluation phase.
  • the parallel execution of a discrete event simulation relies on parallel scheduling of processes.
  • the scheduling proposed in the present invention makes it possible to evaluate the concurrent processes of each evaluation phase in parallel. To do this, the processes are assigned to different execution queues. The processes of each execution queue are then executed in turn. However, the execution queues are executed in parallel with each other by different system tasks called evaluation tasks.
  • One embodiment offering the best performance is to let the user statically associate each process of the simulation with an execution queue, and each execution queue with a logical core of the simulation platform. However, it is possible to perform this distribution automatically at the start of the simulation, or even dynamically using a load-balancing algorithm such as work stealing.
  • An execution queue can be implemented using three queues, the detailed use of which will be described in the remainder of the description: the main queue, containing the processes to be evaluated during the current evaluation sub-phase; the reserve queue, containing the processes to be evaluated during the next evaluation sub-phase; and the completed-process queue, containing the processes whose evaluation has been completed.
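A minimal sketch of this three-queue layout follows; the struct name and the use of plain integer process ids are illustrative assumptions:

```cpp
#include <deque>
#include <utility>

// Illustrative three-queue execution queue: "main_q" holds processes for
// the current evaluation sub-phase, "reserve_q" holds preempted processes
// for the next sub-phase, "done_q" collects fully evaluated processes.
struct ExecQueue {
    std::deque<int> main_q, reserve_q, done_q;

    // At the start of a sub-phase, the reserve becomes the main queue.
    void begin_subphase() { std::swap(main_q, reserve_q); }

    // Returns false when the current sub-phase has no process left.
    bool pop(int& pid) {
        if (main_q.empty()) return false;
        pid = main_q.front();
        main_q.pop_front();
        return true;
    }
    void on_wait(int pid)    { done_q.push_back(pid); }     // reached wait()
    void on_preempt(int pid) { reserve_q.push_back(pid); }  // retry next sub-phase
};
```

Swapping reserve and main at each sub-phase boundary matches the procedure described for the evaluation tasks below.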
  • the scheduling of the tasks is then carried out in a distributed manner between the simulation kernel and the various execution queues, in accordance with FIG. 3, each of which has a dedicated system task and, preferably, a dedicated logical core.
  • the evaluation phase begins at the end of one of the three possible notification phases (instantaneous, delta or temporal).
  • the processes ready to be executed are placed in the various reserve execution queues of each evaluation task.
  • the kernel then wakes up all the evaluation tasks, which then begin the first evaluation sub-phase.
  • Each of these tasks swaps its reserve queue with its main queue, and consumes the latter's processes one by one (the order does not matter).
  • a process can terminate in two ways: either it reaches a call to the wait() primitive, or it is preempted due to a memory access introducing a dependency with a process of another evaluation queue.
  • In the first case, the process is removed from the main execution queue and placed in the list of completed processes.
  • In the second case, it is transferred to the reserve execution queue.
  • Once all the main queues are empty, the first parallel evaluation sub-phase is complete. If no process has been preempted, the evaluation phase is complete. If at least one process has been preempted, a new parallel evaluation sub-phase is started: all the tasks executing the execution queues are woken up again and repeat the same procedure. The parallel evaluation sub-phases are thus repeated until all the processes are terminated (i.e. until they reach a call to wait()).
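The chaining of sub-phases described above can be condensed into the following sketch. It is a simplification: evaluation is modeled as a callback reporting whether the process reached wait() or was preempted, and the queues, which really run in parallel on separate cores, are iterated sequentially here:

```cpp
#include <utility>
#include <vector>

// Repeat evaluation sub-phases while at least one process was preempted.
// evaluate(p) returns true if process p ran to its wait() call, false if
// it was preempted; preempted processes retry in the next sub-phase.
// Returns the number of sub-phases executed.
template <class Eval>
int run_evaluation_phase(std::vector<int> ready, Eval evaluate) {
    int subphases = 0;
    while (!ready.empty()) {
        ++subphases;
        std::vector<int> preempted;
        for (int p : ready)                       // conceptually in parallel
            if (!evaluate(p)) preempted.push_back(p);
        ready = std::move(preempted);             // next sub-phase input
    }
    return subphases;
}
```

With no preemption the phase completes in a single sub-phase; each round of preemptions adds one more.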
  • the invention is based on the control of interactions by access to shared memory produced by all of the processes evaluated in parallel.
  • the goal is to ensure that the interleaving of memory accesses resulting from parallel evaluation of execution queues is equivalent to atomic evaluation of processes. Otherwise, there is a conflict. Only accesses to shared addresses can cause conflicts, the other accesses being independent of each other.
  • the invention includes dynamic detection of shared addresses, which requires no prior information from the user. It is thus possible to preempt the processes accessing shared memory areas and therefore risking causing conflicts.
  • the technique presented here is based on the instrumentation of all memory accesses. This instrumentation relies on the identifier (ID) of the process performing an access, on the evaluation task executing it, on the type of access (read or write) and on the addresses accessed. This information is processed using the state machine of Figure 4, instantiated once per memory address accessible on the simulated system. Each address can thus be in one of the following four states:
  • "read exclusive" (Read_exclusive), when the address has been accessed exclusively in read mode by a single process, said process then being defined as the owner of the address;
  • the preemption of a process by the kernel is determined when:
  • a write access is requested on an address of the shared memory by a process which is not the owner recorded in the state machine of the address, and the current state is different from "no access";
  • each address can be in one of the following four states: - "no access" (No_access), when the state machine has been reinitialized, without a process queue defined as the owner of the address;
  • the preemption of a process by the kernel is determined when:
  • a write access is requested on an address of the shared memory by a process queue which does not own the address in the state machine, and the current state is different from "no access";
  • the owners are evaluation tasks (and not individual SystemC processes), that is to say the system task in charge of evaluating the processes listed in its evaluation queue. This prevents the processes in the same evaluation queue from blocking each other, while ensuring that they cannot run simultaneously.
  • without the Read_exclusive state, a first read by a task T would immediately lead to a transition to the "owned" state. Thanks to the "exclusively read" state Read_exclusive, it is possible to wait for a read by a task other than the owner, or else a write by the owner, to decide more reliably on the nature of the address considered.
  • a process is preempted whenever it attempts an access that would make the shared address anything other than "read-only" since the last state-machine reset. This corresponds either to a write to an address by a process whose evaluation task is not the owner Owner (except in the "no access" state No_access), or to a read access to an address in the "owned" state Owned whose owner Owner is another evaluation task.
  • These preemption rules ensure that, between two resets, it is impossible for an evaluation task to read (respectively write) an address previously written (respectively written or read) by another evaluation task. This guarantees the absence of memory-access dependencies between the processes of two separate evaluation queues between two resets.
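The four states and the preemption rules above can be sketched as a small transition function. This is an illustrative reconstruction, not the patented implementation: the state and owner names follow the description, but the exact transitions (for example, what a repeated access by the owner does in each state) are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the per-address state machine described above.
enum class State : uint8_t { No_access, Read_exclusive, Read_shared, Owned };

struct AddrState {
    State   state = State::No_access;
    uint8_t owner = 0;   // ID of the owning evaluation task
};

// Applies one access to the state machine.
// Returns true if the accessing process must be preempted.
bool step(AddrState& s, uint8_t task, bool write) {
    switch (s.state) {
    case State::No_access:                  // first access claims ownership
        s.owner = task;
        s.state = write ? State::Owned : State::Read_exclusive;
        return false;
    case State::Read_exclusive:
        if (!write) {                       // read by another task: now shared
            if (task != s.owner) s.state = State::Read_shared;
            return false;
        }
        if (task == s.owner) { s.state = State::Owned; return false; }
        return true;                        // write by a non-owner: preempt
    case State::Read_shared:
        return write;                       // any write to a shared address: preempt
    case State::Owned:
        return task != s.owner;             // any access by another task: preempt
    }
    return true;                            // unreachable
}
```

Note how the self-looping transitions (for example, a repeated read by the owner in Read_exclusive) leave the state unchanged, which is what the commutativity optimization described later exploits.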
  • a function RegisterMemoryAccess, taking as arguments the address of an access, its size and its type (read or write), is made available to the user, who must call this function before each memory access.
  • This function retrieves the identifiers of the calling process and of its evaluation task, then updates the instance of the state machine associated with the accessed address. Depending on the transition made, the process can either continue and perform the instrumented memory access, or be preempted and resume in the next parallel sub-phase.
  • the state machines are stored in an associative container whose keys are the addresses and whose values are the instances of the state machine shown in Figure 3. This container must support concurrent access and modification.
  • the transition to be made is determined on the basis of the current state and the characteristics of the access during instrumentation.
  • the transition must be calculated and applied atomically using, for example, an atomic compare-and-swap instruction.
  • all the fields making up the state of an address must be representable within the largest number of bits that can be handled atomically (128 bits on AMD64), the smaller the better. In our case, these fields are one byte for the state of the address, one byte for the identifier ID of the evaluation task owning the address and two bytes for the reset counter, detailed later in the description, for a total of 32 bits.
  • if the atomic update fails because the state was changed concurrently, the update function of the state machine is called again to retry the update. This is repeated until the state machine is successfully updated.
  • a performance optimization consists in not carrying out the atomic compare-and-swap when the transition taken loops back to the same state. This is possible because the accesses causing such a self-looping transition commute with all the other accesses of the same evaluation sub-phase: the order in which these accesses are recorded relative to their immediate neighbors in time has no influence on the final state of the state machine and does not change which processes are preempted.
  • the update function of the state machine of the accessed address finally indicates whether the calling process must be preempted, by returning for example a boolean.
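The packed 32-bit state and the compare-and-swap retry loop described above can be sketched as follows. The field layout matches the byte counts given in the description (one byte of state, one byte of owner ID, two bytes of reset counter); the `encode`, `decode` and `update` helper names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical packing of the per-address state into 32 bits so that it
// can be updated with a single compare-and-swap.
struct Packed {
    uint8_t  state;    // state of the address
    uint8_t  owner;    // ID of the owning evaluation task
    uint16_t counter;  // reset counter C
};

inline uint32_t encode(Packed p) {
    return uint32_t(p.state) | uint32_t(p.owner) << 8 | uint32_t(p.counter) << 16;
}
inline Packed decode(uint32_t w) {
    return { uint8_t(w & 0xFF), uint8_t((w >> 8) & 0xFF), uint16_t(w >> 16) };
}

// Retry loop: compute the transition from a snapshot of the state and try
// to apply it atomically; on failure, reload and recompute until success.
template <typename F>
Packed update(std::atomic<uint32_t>& cell, F transition) {
    uint32_t old = cell.load();
    for (;;) {
        Packed next = transition(decode(old));
        // compare_exchange_weak refreshes 'old' when it fails.
        if (cell.compare_exchange_weak(old, encode(next))) return next;
    }
}
```

`transition` stands for the state-machine step of Figure 4; any function from old state to new state can be plugged in.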
  • State machines are used to determine the nature of the various addresses and to authorize or not certain accesses depending on the state of these addresses.
  • some addresses can change usage.
  • a buffer memory, or "buffer", can be used to store an image which is then processed by several threads.
  • the SystemC process simulating this task is then the owner of the addresses contained in the buffer memory.
  • multiple processes then access this image in parallel. If the result of the image processing is not placed directly into the buffer, the buffer should then pass entirely to the "shared read" state Read_shared.
  • One embodiment of the reset policy is as follows, but others can be implemented: when a process accesses a shared address and is preempted, all of the state machines are reset during the next parallel evaluation sub-phase. This is justified by the following observation: an access to a shared address is often symptomatic of the situation described above, that is to say that a set of addresses first accessed by a given process are then only read by a set of processes, or accessed exclusively by another process (the data migrates from one task to another). The state machines of these addresses must then be reinitialized to reach a new, more suitable state. However, it is difficult to anticipate exactly which addresses should change state. The option chosen is therefore to reset the entire address space, relying on the fact that the addresses which did not need to be reset will quickly return to their previous state.
  • This reset involves a counter C stored with the state machine of each address.
  • the value of a global counter Cg external to the state machine is passed as an additional argument. If the value of Cg differs from that of C, the state machine must be reset before making the transition, and C is updated to the value of Cg. Thus, to trigger the reinitialization of all the state machines, it suffices to increment Cg.
  • Counter C must be updated with the state of the state machine and the possible owner of the address atomically.
  • C uses two bytes. This means that if Cg is incremented exactly 65,536 times between two accesses to a given address, C and Cg remain equal and the reinitialization does not take place; this very rarely leads to unnecessary preemptions but does not compromise the validity of the technique.
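The lazy reset driven by the global counter Cg can be illustrated as below. `AddrState` and `before_transition` are hypothetical names, and the atomicity of the combined update (handled by the compare-and-swap in the description) is omitted for clarity.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the lazy reset: each address stores a 16-bit
// counter C alongside its state; incrementing a global counter Cg requests
// a reset of all state machines, applied on the next access to each address.
struct AddrState {
    uint8_t  state   = 0;   // 0 stands for "no access"
    uint8_t  owner   = 0;
    uint16_t counter = 0;   // C, compared against the global Cg
};

// Called before computing a transition: if the address's epoch C is stale,
// reinitialize the state machine and catch up with Cg.
AddrState before_transition(AddrState s, uint16_t global_counter) {
    if (s.counter != global_counter) {
        s = AddrState{};            // back to "no access", no owner
        s.counter = global_counter;
    }
    return s;
}
```

Incrementing the single global counter therefore resets every address in O(1), at the price of the two-byte wrap-around case mentioned above.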
  • the "AccessRecord” recording structure is therefore composed for each sub-phase of a vector per execution queue as shown in Figure 5. Any ordered data structure can be used in place of the vector.
  • if the calling process is not preempted, the "RegisterMemoryAccess()" memory access recording function inserts the characteristics of the instrumented memory access into the vector of its execution queue: address, number of bytes accessed, access type and process ID.
  • the simulation kernel entrusts the verification of the absence of conflict to a dedicated system task.
  • a grouping, or "pool” in English, of tasks is used. If no task is available, a new task is added to it.
  • the verification of the evaluation phase is then carried out asynchronously while the simulation continues.
  • Another "AccessRecord" access record structure, itself taken from a pool, is used for the following evaluation phase.
  • the verification task then enumerates the accesses contained in the access record structure "AccessRecord" from the first to the last evaluation sub-phase.
  • within each sub-phase of the "AccessRecord" access record structure, the vectors must be processed one after the other, but their order does not matter.
  • a read at a given address introduces a dependency on the last writer of that address, and a write introduces a dependency on the previous writer and on all the readers since. This rule does not apply to dependencies of a process on itself.
  • An inter-process dependency graph is thus constructed. Once complete, the graph has as vertices all the processes involved in a dependency, the dependencies themselves being represented by directed arcs.
  • a search for cycles is then made in the graph in order to detect a possible circular dependency between processes, symptomatic of a conflict.
  • step 1: group the processes without a predecessor and those not appearing in the graph;
  • step 2: remove the already-grouped processes from the graph;
  • step 3: if processes remain, group the processes without a predecessor, otherwise terminate;
  • step 4: go back to step 2.
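The four grouping steps above can be sketched as a Kahn-style layering of the dependency graph. For simplicity this sketch only handles the processes present in the graph (step 1 additionally includes those absent from it); an empty result signals a cycle, i.e. a circular dependency symptomatic of a conflict. The `Graph` and `group` names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// process ID -> set of predecessor process IDs (its dependencies)
using Graph = std::map<int, std::set<int>>;

// Repeatedly extracts the processes without predecessors, as in steps 1-4.
// Returns the ordered groups, or an empty result if a cycle is detected.
std::vector<std::set<int>> group(Graph g) {
    std::vector<std::set<int>> groups;
    while (!g.empty()) {
        std::set<int> ready;
        for (auto& [p, preds] : g)          // step 1/3: no predecessor
            if (preds.empty()) ready.insert(p);
        if (ready.empty()) return {};       // cycle: circular dependency
        for (int p : ready) g.erase(p);     // step 2: drop grouped processes
        for (auto& [p, preds] : g)          // ...and the arcs pointing at them
            for (int r : ready) preds.erase(r);
        groups.push_back(ready);
    }
    return groups;
}
```

Each returned group corresponds to a set of processes that can be evaluated in parallel, the groups themselves being executed in sequential sub-phases.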
  • the instrumentation of memory accesses using the "RegisterMemoryAccess()" memory access recording function aims, on the one hand, to avoid the appearance of conflicts and, on the other hand, to verify a posteriori that the accesses carried out during a given evaluation phase indeed correspond to a conflict-free execution.
  • the observed order of calls to the "RegisterMemoryAccess()" function may differ from the actual order of the subsequent writes. This order reversal could invalidate the method: if the recorded order of two writes is reversed with respect to their real order, then the recorded dependency is reversed with respect to the real dependency and conflicts could go unnoticed.
  • Any rollback method could be used.
  • the embodiment shown here relies on a system process level rollback technique.
  • the CRIU ("Checkpoint/Restore In Userspace") tool available on Linux can be used. It makes it possible to write to files the state of a complete process at a given time, including in particular an image of the process memory space as well as the state of the processor registers at the time of the backup. From these files, the saved process can then be restarted from the save point.
  • CRIU also enables incremental process backups, which consist of writing to disk only those memory pages that have changed since the last backup and bring a significant speed gain.
  • CRIU can be controlled via an RPC interface based on the Protobuf library.
  • the general principle of the rollback system is shown schematically in FIG. 7.
  • the simulation process is immediately duplicated using the fork(2) system call. It is imperative that this duplication occur before the creation of additional tasks, because these are not duplicated by the call to fork(2).
  • the resulting child process, hereafter called the simulation process, is the one that performs the actual simulation.
  • when a conflict is detected, the simulation process transmits to the parent process the information relating to this conflict, in particular the number of the evaluation phase in which the conflict occurred and the information needed to reproduce the simulation up to the point of conflict, as described in the remainder of the description.
  • the execution order to be applied in order to avoid the conflict can then be transmitted.
  • the parent process then waits for the simulation process to complete before restarting it using CRIU. After the simulation process is restored to a state before the error, the parent process returns information about the conflict that caused the rollback to the simulation process. The simulation can then resume and the conflict can be avoided. Once the conflicting evaluation phase has passed, a new backup is made.
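The parent/child split around fork(2) can be sketched as below. This toy supervisor models a rollback as a full restart of the child, whereas the actual method restores a CRIU checkpoint; the conflict information exchanged between the two processes is reduced here to an exit code.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical sketch: the process duplicates itself before creating any
// extra threads; the child runs the simulation, and its exit status tells
// the parent whether the run completed (0) or hit a conflict (non-zero).
// Returns the number of restarts that were needed.
int supervise(int (*simulate)(int attempt)) {
    for (int attempt = 0;; ++attempt) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(simulate(attempt));       // child: run the simulation
        int status = 0;
        waitpid(pid, &status, 0);           // parent: wait for completion
        if (WEXITSTATUS(status) == 0)
            return attempt;                 // simulation finished cleanly
        // non-zero: conflict reported; restart (stand-in for CRIU restore)
    }
}
```

Duplicating before thread creation matters because fork(2) only clones the calling thread, which is why the description insists the split happens before the evaluation tasks are spawned.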
  • the effectiveness of the invention relies on an appropriate backup policy.
  • the spacing of the backups must be chosen so as to limit their number as much as possible while preventing a possible rollback from referring to a backup that is too old.
  • the first backup policy is to back up only at the very start of the simulation and then wait for the first conflict, if it occurs. This is well suited to simulations that cause little or no conflict.
  • Another policy is to save the simulation at regular intervals, for example every 1000 evaluation phases. It is also possible to vary this backup interval by increasing it in the absence of conflict and by reducing it following a conflict, for example. When a save point is reached, the simulation kernel begins by waiting for all the conflict checks from the previous evaluation phases to be completed. If no conflict has arisen, a new backup is made.
  • the proposed SystemC simulation kernel can operate in simulation reproduction mode.
  • This operating mode uses a trace generated by the simulation to be reproduced.
  • This trace then makes it possible to control the execution of the processes in order to guarantee a simulation result identical to that of the simulation which produced the trace, thus respecting the requirements of the SystemC standard.
  • the trace used by the invention is made up of the list of the numbers of the evaluation phases during which dependencies between processes have appeared, with which are associated the orders in which these processes must be executed during each of these evaluation phases in order to reproduce the simulation.
  • An example is given in the table of figure 8, in which, for each phase listed, each group of processes (inner parentheses) is executable in parallel but the groups must be executed in separate sequential sub-phases.
  • This trace is stored in a file (for example by serialization), or in any other means of persistent storage, between two simulations following the end of the simulation process.
  • the simulation reproduction uses two containers: one, named Tw (for "Trace write"), used to store the trace of the current simulation; the other, named Tr (for "Trace read"), containing the trace of a previous simulation, passed as a parameter of the simulation if simulation reproduction is activated.
  • Tr is initialized at the start of the simulation using the trace of a simulation passed as an argument of the program. At the start of each evaluation phase, it is then checked whether its number is among the elements of Tr. If so, the list associated with this phase number in Tr is used to schedule the evaluation phase. To do this, the list of processes to be executed in the next parallel evaluation sub-phase is passed to the evaluation threads. When they wake up, they check before starting the evaluation of each process that it is on the list. Otherwise, the process is immediately placed in the reserve execution queue for later evaluation.
  • Tr can be implemented using an associative container with the evaluation phase numbers as keys, but it is more efficient to use a sequential container of vector type in which the pairs (phase number; process order) are stored in descending order of evaluation phase number (each row of the table in Figure 8 is one pair of the vector).
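The descending-order vector form of Tr can be sketched as follows; `Trace` and `schedule_for` are hypothetical names. Because phases are consumed in increasing order, the next recorded phase is always at the back of the vector, so lookup and removal are constant-time instead of a map search.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of the Tr container: the process order of a phase is
// a sequence of groups, each group being evaluable in parallel but the
// groups executed in separate sequential sub-phases (as in Figure 8).
using Schedule = std::vector<std::vector<int>>;              // groups of process IDs
using Trace    = std::vector<std::pair<unsigned, Schedule>>; // descending phase numbers

// Returns the recorded schedule for 'phase' if it is the next recorded
// phase, or nullptr if this phase has no recorded dependency. The caller
// pops the back entry once the phase has been scheduled.
const Schedule* schedule_for(const Trace& tr, unsigned phase) {
    if (!tr.empty() && tr.back().first == phase) return &tr.back().second;
    return nullptr;
}
```

A nullptr result means the evaluation phase can run with the default parallel scheduling, which is the common case.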
  • if the simulation reproduction mode is not activated, conflicts may arise, followed by a rollback of the simulation. The simulation reproduction mode is then activated between the return point and the point where the conflict occurred.
  • Tw is then transmitted through the rollback system in order to initialize Tr.
  • items corresponding to evaluation phases prior to the rollback point must be removed from Tr.
  • Simulation reproduction can be deactivated once the conflict point has been passed.
  • a performance optimization consists in deactivating the shared-address detection and conflict-checking systems when simulation reproduction is activated. Indeed, reproduction mode guarantees that the new instance of the simulation provides a result identical to the reproduced simulation, and the trace obtained from the latter makes it possible to avoid all the conflicts that could arise. In the case of a rollback, however, it is important to deactivate the simulation reproduction mode after the point of conflict if this optimization is used.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)
EP20786583.3A 2019-10-11 2020-10-08 Verfahren zur reproduzierbaren parallelsimulation auf elektronischer systemebene, die mittels eines multicore-simulationsrechnersystems mit ereignisorientierter simulation implementiert ist Pending EP4042277A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR1911332A FR3101987B1 (fr) 2019-10-11 2019-10-11 Procédé de simulation parallèle reproductible de niveau système électronique mis en œuvre au moyen d'un système informatique multi-cœurs de simulation à événements discrets
PCT/EP2020/078339 WO2021069626A1 (fr) 2019-10-11 2020-10-08 Procédé de simulation parallèle reproductible de niveau système électronique mis en oeuvre au moyen d'un système informatique multi-coeurs de simulation à événements discrets

Publications (1)

Publication Number Publication Date
EP4042277A1 true EP4042277A1 (de) 2022-08-17

Family

ID=69173021

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20786583.3A Pending EP4042277A1 (de) 2019-10-11 2020-10-08 Verfahren zur reproduzierbaren parallelsimulation auf elektronischer systemebene, die mittels eines multicore-simulationsrechnersystems mit ereignisorientierter simulation implementiert ist

Country Status (4)

Country Link
US (1) US20230342198A1 (de)
EP (1) EP4042277A1 (de)
FR (1) FR3101987B1 (de)
WO (1) WO2021069626A1 (de)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3116625A1 (fr) * 2020-11-25 2022-05-27 Commissariat A L'energie Atomique Et Aux Energies Alternatives Procédé de simulation parallèle reproductible de niveau système électronique mis en œuvre au moyen d'un système informatique multi-cœurs de simulation à événements discrets.
CN113590363B (zh) * 2021-09-26 2022-02-25 北京鲸鲮信息***技术有限公司 数据发送方法、装置、电子设备及存储介质
CN114168200B (zh) * 2022-02-14 2022-04-22 北京微核芯科技有限公司 多核处理器访存一致性的验证***及方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3043222B1 (fr) * 2015-11-04 2018-11-16 Commissariat A L'energie Atomique Et Aux Energies Alternatives Procede de simulation parallele de niveau systeme electronique avec detection des conflits d'acces a une memoire partagee

Also Published As

Publication number Publication date
FR3101987B1 (fr) 2021-10-01
WO2021069626A1 (fr) 2021-04-15
FR3101987A1 (fr) 2021-04-16
US20230342198A1 (en) 2023-10-26

Similar Documents

Publication Publication Date Title
EP3371719B1 (de) Paralleles simulationsverfahren einer elektronischen systemebene mit detektion von konflikten beim zugang zu einem gemeinsamen speicher
WO2021069626A1 (fr) Procédé de simulation parallèle reproductible de niveau système électronique mis en oeuvre au moyen d'un système informatique multi-coeurs de simulation à événements discrets
US9063766B2 (en) System and method of manipulating virtual machine recordings for high-level execution and replay
Davis et al. Node. fz: Fuzzing the server-side event-driven architecture
US10296442B2 (en) Distributed time-travel trace recording and replay
EP3662372B1 (de) Provisorische codeausführung bei einem debugger
EP4006730A1 (de) Verfahren zur parallelen reproduzierbaren simulation auf der ebene eines elektronischen systems mit hilfe eines ereignisdiskreten multi-core-simulationscomputersystems
Durán et al. Robust and reliable reconfiguration of cloud applications
Murillo et al. Automatic detection of concurrency bugs through event ordering constraints
EP2956874B1 (de) Vorrichtung und verfahren zur beschleunigung der aktualisierungsphase eines simulationskerns
Busnot et al. Standard-compliant parallel SystemC simulation of loosely-timed transaction level models
Bouajjani et al. Formalizing and checking multilevel consistency
US10579441B2 (en) Detecting deadlocks involving inter-processor interrupts
Zhang et al. Model‐checking‐driven explorative testing of CRDT designs and implementations
Nagar et al. Semantics, specification, and bounded verification of concurrent libraries in replicated systems
FR2995705A1 (fr) Procede de preparation d'une sequence d'execution d'un programme partitionne spatialement et temporellement utilisant un processeur muni d'une memoire cache.
Lukavsky Building Big Data Pipelines with Apache Beam: Use a single programming model for both batch and stream data processing
Yost Finding flaky tests in JavaScript applications using stress and test suite reordering
Busnot Parallel Standard-Compliant SystemC Simulation of Loosely-Timed Transaction Level Models
Seidl et al. Proving absence of starvation by means of abstract interpretation and model checking
Schaeli et al. Dynamic testing of flow graph based parallel applications
Rogin et al. Isolating Failure Causes
FR3103595A1 (fr) Simulateur rapide d'un calculateur et d'un logiciel mis en œuvre par ce calculateur
Veeraraghavan Uniparallel execution and its uses
Agosta et al. Fault Tolerance

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220420

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)