WO2002027479A1 - Computer instructions - Google Patents

Computer instructions

Info

Publication number
WO2002027479A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
instructions
dependency
data manipulation
depend
Prior art date
Application number
PCT/GB2001/004299
Other languages
French (fr)
Inventor
Nigel Paul Smart
Michael David May
Hendrik Lambertus Muller
Original Assignee
University Of Bristol
Priority date
Filing date
Publication date
Application filed by University Of Bristol filed Critical University Of Bristol
Priority to AU2001290112A priority Critical patent/AU2001290112A1/en
Publication of WO2002027479A1 publication Critical patent/WO2002027479A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3838Dependency mechanisms, e.g. register scoreboarding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/002Countermeasures against attacks on cryptographic mechanisms
    • H04L9/003Countermeasures against attacks on cryptographic mechanisms for power analysis, e.g. differential power analysis [DPA] or simple power analysis [SPA]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/72Indexing scheme relating to groups G06F7/72 - G06F7/729
    • G06F2207/7219Countermeasures against side channel or fault attacks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/12Details relating to cryptographic hardware or logic circuitry
    • H04L2209/125Parallelization or pipelining, e.g. for accelerating processing of cryptographic operations

Definitions

  • This invention relates to computer instructions, and particularly to a method of executing a computer program and to a processor, the program including an ordered sequence of computer instructions.
  • DPA provides the most powerful attack using very cheap resources. Many people have started to examine this problem and S. Chari et al provides a worrying analysis regarding the weakness of AES (Advanced Encryption Standard) algorithms on Smart cards, see the article entitled “A Cautionary Note Regarding the Evaluation of AES Candidates on Smart-Cards" in the Second Advanced Encryption Standard Conference, Rome, March 1999.
  • AES Advanced Encryption Standard
  • the present invention seeks to improve tamper resistance according to the third approach, that is, by decorrelating the timing of power traces on successive program executions.
  • Kocher et al also describe two ways of producing the required temporal misalignment: i) introducing random clock signals, and ii) introducing randomness into the execution order.
  • Kocher et al in "Differential Power Analysis” mention that randomising execution order can help defeat DPA, but can lead to other problems if not done carefully.
  • One randomising approach uses the idea of randomised multi-threading at an instruction level using a set of essentially "shadow" registers. This allows auxiliary threads to execute random encryptions, hence hoping to mask the correct encryption operation.
  • the disadvantage is that additional computational tasks are again required and this requires a more complex processor architecture having separate banks of registers, one for each thread.
  • the aim of the present invention is to increase the non-deterministic nature of a processor but without unnecessarily impacting the performance.
  • a method of executing a computer program comprising an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order, the method comprising: reading each instruction in the ordered sequence, checking its dependency with respect to its adjacent instructions and storing an associated dependency bit mask; responsive to detection of the ignore instruction, ignoring the dependency bit masks for subsequent data manipulation instructions issued up to detection of the depend instruction, whereby data manipulation instructions can be issued in an arbitrary order.
  • the ordered sequence of instructions can include a depend instruction after the set of data manipulation instructions which causes dependency bit masks for subsequent instructions in a sequence to be utilised.
  • the dependency bit masks associated with data manipulation instructions which have not yet issued are used to delay issue of the depend instruction until all data manipulation instructions in the set have issued.
  • the dependency bit masks associated with data manipulation instructions which have not yet issued can be combined to ensure that all data manipulation instructions in the set issue prior to instructions subsequent to the depend instruction.
  • the ignore instruction specifies at least one operand, and the dependency bit masks are ignored only for the set of data manipulation instructions which define said operand.
  • the ignore instruction defines no operand and the dependencies for all subsequent data manipulation instructions are ignored.
  • a processor comprising: a program memory holding a computer program which comprises an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order followed by a depend instruction; a decode unit for decoding each instruction and arranged to detect ignore and depend instructions prior to their execution; a dependency checker for checking the dependency of each instruction with respect to adjacent instructions and for generating an associated dependency bit mask; a store for holding said dependency bit masks; and means responsive to detection of the ignore instruction to cause the dependency bit masks held in the store to be ignored for data manipulation instructions up to detection of the depend instruction.
  • a computer program product comprising program code means including an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order followed by a depend instruction wherein, when the program product is loaded into a computer and executed, on detection of the ignore instruction dependency bit masks associated with each instruction in the ordered sequence are ignored for subsequent data manipulation instructions issued up to detection of the depend instruction.
  • the dependency checker need only operate when necessary, so that a higher degree of non-determinism can be achieved without affecting the performance of a processor.
  • Figure 1 shows a block diagram of a generic CPU architecture
  • Figure 2 shows a non-deterministic processor executing two instructions compared to other processors
  • Figure 3 shows an embodiment of the random issue unit
  • Figure 4 shows a flow chart explaining how instructions are issued at random
  • Figure 5 shows an example of two input random selection unit
  • Figures 6A and 6B show a generic model and a 16 input random selection unit
  • Figure 7 shows a flow chart describing a method for choosing which random instruction in the issue buffer to execute.
  • Figure 8 shows how the dependency checking mechanism 33 is affected upon detection of an IGNORE or DEPEND instruction.
  • FIG. 1 is a block diagram illustrating the standard functional units that make up a pipelined computer system.
  • a program memory 2 contains program instructions, which are addressable at different memory locations.
  • An ADDRESS bus 6 and a DATA bus 4 transfer information to and from the various elements that make up the processor 8.
  • the system contains an instruction fetch unit 10 having a program counter 12 that stores the address of the next instruction to be fetched. For sequential execution of instructions the program counter will normally be incremented by a single addressing unit. However, if a branch instruction is encountered, the program flow is broken and the program counter needs to be loaded with the address of a target instruction (that is, the first instruction of the branch sequence).
  • the instructions are fetched from the program memory and stored in an instruction issue buffer 14.
  • the program counter referred to herein is used to control instruction fetches from memory. There may also be an execution counter which is used by the execution unit 18 to specify which instruction is currently being executed.
  • the instructions are decoded and supplied to relevant execution units. In this example, only one execution unit 18 or pipeline is shown, however the present invention is intended to be used in conjunction with modern processors which may have several execution units allowing parallel execution paths. Encryption algorithms need a substantial level of computational power and modern processor architectures such as superscalar, VLIW (Very Long Instruction Word) and SIMD (Single Instruction Multiple Data) are ideally suited to the present invention.
  • the results of the operations are written back by a result write stage 22 into temporary registers of a register file 20, which is used to load and store data in and out of main memory.
  • the present invention is concerned mainly with the block of functionality denoted by the reference numeral 24.
  • the present invention deals with a modified issue buffer 14 which will be described in more detail later.
  • the issue buffer generates an instruction fetch signal 13 to control which instructions are supplied from the fetch unit 10.
  • part of the decode circuitry may be used to decode the instruction dependencies. This will also be described in more detail at a later stage.
  • Non-deterministic processing as described herein means that for successive runs of the program, although the result will be the same the order of execution of the instructions will be random. This reduces the impact of a DPA-type attack in that the power traces resulting from successive program runs will be different.
  • Figure 2 serves to highlight the differences between a non-deterministic processor and other known processors when executing a simple program consisting of the following two lines of code:
  • the execution flow on the left of Figure 2 represents a standard processor having a single execution pipeline where the two instructions are executed sequentially, i.e. the ADD instruction is executed in cycle 1 followed by the XOR instruction in cycle 2.
  • the middle execution flow represents a modern Pentium processor having a plurality of execution paths, which execute independent instructions in parallel.
  • the execution flow on the right of Figure 2 represents a non-deterministic processor having a single pipeline.
  • the non-deterministic processor allows the instructions to be executed in any order provided that it has been established that the instructions are independent. So in the first cycle either the ADD or the XOR instruction can be carried out and in the second cycle the other instruction will be executed.
  • the standard processor executes instructions sequentially and although there is a little "out of order" execution to help with branch prediction, this occurs on a small scale. In any event, in such a processor each time a program is run containing a certain sequence of instructions, the execution sequence will be identical.
  • the Pentium processor has a plurality of execution units (A) and (B), which execute the independent instructions in parallel. The processor is still deterministic in that the ADD and the XOR instructions are always executed concurrently in pipes (A) and (B).
  • the purpose of I1 is to LOAD a value addressed by register R2 into register R9.
  • the intention of the code sequence is to add the loaded value in R9 to the value in R8. Therefore, if the ADD instruction I2 is carried out before I1, the old value of R9 will be added to R8, yielding an incorrect value for the resulting summation in R10.
  • the present invention makes use of the fact that in many code sequences a number of instructions are independent and thus can, in theory, be executed in any order. The invention exploits this by executing the instructions in a random order at run time. This causes the access patterns to memory for either data or instructions to be uncorrelated for successive program executions, and thus causes the power trace to be different each time.
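  • This order-independence can be illustrated with a minimal Python sketch (the instruction encoding and register values here are invented for illustration): two instructions that touch disjoint registers are independent, so a non-deterministic processor may issue them in either order and still reach the same final state.

```python
import random

# Minimal sketch (encoding and register values invented here): two
# instructions that touch disjoint registers are independent, so a
# non-deterministic processor may issue them in either order.
PROGRAM = [("ADD", "R10", "R8", "R9"),   # R10 := R8 + R9
           ("XOR", "R3", "R1", "R2")]    # R3  := R1 xor R2

def run(order):
    regs = {"R1": 5, "R2": 3, "R3": 0, "R8": 7, "R9": 11, "R10": 0}
    for i in order:
        op, dst, a, b = PROGRAM[i]
        regs[dst] = regs[a] + regs[b] if op == "ADD" else regs[a] ^ regs[b]
    return regs

# Both issue orders leave the register file in the same final state,
# while the power trace of each run would differ.
assert run([0, 1]) == run([1, 0])
random_order = random.sample(range(2), 2)   # run-time random choice
assert run(random_order) == run([0, 1])
```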
  • FIG. 3 shows an example of the implementation of a random issue unit.
  • the random issue unit comprises an instruction table 32 with an associated dependency matrix table 30. Instructions are prefetched into the instruction table 32 using conventional instruction fetch circuitry.
  • the dependency matrix table has slots and columns, where the slots represent bit-masks associated with each instruction in the instruction table 32.
  • the bit-masks or dependency bits are an indication as to whether an instruction has a dependency on another instruction. Broadly speaking there are two types of dependencies that need to be considered for an instruction:
  • a particular instruction will be decoded and the mask bits will be set accordingly in a Used Registers table 34 and a Defined Registers table 36.
  • the Used and Defined Register tables 34, 36 shown in Figure 3 each comprise a number of rows and columns. Each row corresponds to a register (or operand) and each column corresponds to a particular slot (or instruction) in the instruction issue table 32. Each register comprises a plurality of slots corresponding to the number of instructions in the instruction table 32 and is the so-called bit-mask for a register.
  • the bit mask for a register is a binary stream where a "1" indicates which instruction has a dependency on that register.
  • each table has five rows corresponding to registers R1 to R5, i.e. R1 corresponds to the top row and R5 to the bottom row.
  • the processor performs a logical OR operation 38 of the bit mask of the Used Registers table 34 and the Defined Registers table 36 thereby creating a new bit-mask stored in a free slot of the dependency matrix 30.
  • a test can be performed by OR-ing with OR gates 40 each of the dependency bits of a slot of the dependency matrix. If all the dependency bits of a slot associated with a particular instruction are set to zero, then the instruction can be executed and a FIRE signal 42 is generated to the Random Selection Unit 44. Given the result of the OR for each row of the table, a number of zeros (indicating instructions to be executed) and a number of ones (indicating instructions that are blocked) are obtained. The random selection unit 44 selects one of the slots which is indicated at value zero, at random, and causes that instruction to be executed next. In the described embodiment, the dependency bits are overwritten with new values when the dependencies of the next instruction are loaded into the matrix.
  • the random issue unit supplies an instruction to be executed from the instruction table 32 along instruction supply path 50 and loads an instruction into the instruction table 32 along instruction load path 52 at the same time.
  • Figure 4 is a flow chart indicating how the instructions in the instruction issue buffer 14 are issued for execution and loaded concurrently.
  • the load operations are represented by the left branch flow (C), while the issue operations are represented by the right branch flow (D).
  • the left branch flow (C) of figure 4 relates to an instruction load operation starting at step S1 where the next instruction, specified by the program counter 12, is loaded into the instruction table 32 of the issue buffer 14.
  • the load operation will firstly be described in general terms, and then more specifically in relation to one example.
  • Each instruction defines two source operands 54 and a destination operand 56. These will nearly always be defined as registers although that is not necessary. Direct addresses or immediates are possible.
  • the source and destination operands 54,56 are simultaneously decoded.
  • the decoded information is translated into bit-masks that are set in the Used Registers and Defined Registers tables 34,36. These bit-masks are OR-ed by OR gate 38 ( Figure 3) to create dependency bits indicating on which instructions the loaded instruction depends.
  • the empty slot E associated with the loaded instruction is then selected for replacement by setting the InValid flag 58 to zero.
  • the dependency bits are loaded into the selected slot E of the dependency matrix.
  • the bit-masks in column E of the Used Registers and Defined Registers tables 34,36 are set to "1" along path 62 for the corresponding rows of these tables to ensure that future instructions that use those registers are going to wait for the instruction to finish.
  • the Used and Defined Register tables 34, 36 are set-up during the instruction fetch or LOAD sequence, as already indicated.
  • the fetched instruction is decoded and the bit-masks associated with each of the registers specified in the instruction are checked for dependencies with other instructions. For example, assume the instruction: ADD R2, R3, R4 is fetched.
  • the bit masks associated with the registers R2 and R3 in the Used Registers table 34 (i.e. the source registers) and the bit mask associated with register R4 in the Defined Registers table 36 (i.e. the destination register) are sent to the OR gate 38.
  • each bit mask has N slots where each slot corresponds to a particular instruction.
  • the OR gate 38 receives the bit-masks and performs a bit-wise logical OR operation for each slot simultaneously. For example, assume the following bit-masks exist:
  • the first step includes simultaneously performing a second OR operation 40 across all the dependency bits for each slot of the dependency matrix 30 to determine which instructions have no dependencies. For the example, a "1" set in the third bit of the dependency mask for the instruction in question means that the OR'ed result will be a "1". Therefore this instruction still has dependencies and cannot be fired at the random selection unit 44.
  • the final step is to set the appropriate bit masks associated with the currently loaded instruction.
  • the appropriate bit-masks being the registers that cannot be used by future instructions until the current instruction has been issued.
  • register R4 in the Used Registers table 34 is set to "1" in the present instruction's column to inform all future instructions that R4 cannot be used as a source register (i.e. read from), because the present instruction uses this as a destination register (i.e. write to).
  • registers R2 and R3 are source registers for the present instruction and thus these registers are set to "1" in the Defined Registers table 36 to indicate that these registers cannot be written to until the present instruction has completed.
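  • The load-path bookkeeping above can be sketched as follows (the table size, register set, and function names are assumptions for illustration): a new instruction's dependency mask is the OR of the Used-table masks of its sources and the Defined-table mask of its destination, after which the instruction publishes its own footprint into both tables.

```python
N = 8  # slots in the instruction table (illustrative size)

# used[r][s] = 1    -> slot s writes r (future readers of r must wait)
# defined[r][s] = 1 -> slot s reads r (future writers of r must wait)
# (naming follows the patent's Used/Defined Registers tables)
used = {r: [0] * N for r in ("R2", "R3", "R4")}
defined = {r: [0] * N for r in ("R2", "R3", "R4")}

def load(slot, dest, srcs):
    # OR gate 38: RAW hazards on the sources, WAR hazard on the dest.
    masks = [used[s] for s in srcs] + [defined[dest]]
    dep = [int(any(bits)) for bits in zip(*masks)]
    # Publish this instruction's footprint for later loads (path 62).
    used[dest][slot] = 1
    for s in srcs:
        defined[s][slot] = 1
    return dep

# ADD R2, R3, R4 from the example: R2 and R3 sources, R4 destination.
d0 = load(0, "R4", ("R2", "R3"))
assert d0 == [0] * N               # nothing outstanding: no dependencies
d1 = load(1, "R2", ("R4", "R3"))   # reads R4, written by slot 0
assert d1[0] == 1 and sum(d1) == 1 # depends only on slot 0
```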
  • the right branch flow (D) of Figure 4 relates to random instruction issue starting at S1 where the dependency bits associated with each instruction are checked using an OR operation via OR gate 40. Then all of the independent instructions are flagged as ready for issue and appropriate fire signals are sent to the Random Selection Unit.
  • the Random Selection Unit 44 selects one of the instructions 46 for example the instruction X, which is issued along instruction supply path 50 to the relevant execution unit.
  • column X is then cleared (i.e. its bits are set to zero) from the dependency matrix 30 as well as from the Used Registers and Defined Registers tables 34, 36. Also, the InValid flag is set (i.e. to 1).
  • step S4 a pointer E is initialised for the next iteration.
  • E is a pointer that points to an empty slot which is available in the issue table. After every instruction has been loaded, E must point to another free slot. One could, for example, use the instruction previously executed to initialise E. In that way, the pointer E would follow the executed instructions around the table.
  • Figure 5 represents a two input example of how a random selection unit 44 may be implemented.
  • the truth table for the random selector is shown below:
  • Figure 5 shows two inputs 70 and 72 for the random selection unit 44. It should be apparent from Figure 3 that each input I0 or I1 will be either a '0' or a '1'. More generally, a '0' will appear if all of the dependency bits of the relevant slot are '0'. Thus, a '0' indicates an independent instruction, which can be selected by the Random Selection Unit 44. An inspection of truth table 2 reveals that if one of the inputs is a '1', the output 46 of the random selector will always take the logical value of the other input. Input I1 is shown coupled to an AND gate 76 through an inverting element 75. The AND gate 76 accepts two other inputs, a random signal R 80 and an enable signal E 78. The output of the AND gate is OR-ed 74 with input I0 to produce the selected output 46 of the random selection unit 44.
  • the random signal R does not have to be truly random. It could typically be generated using a pseudo-random generator that is reseeded regularly with some entropy.
  • the enable signal 78 allows random issue to be disabled, i.e. non-determinism can be turned off, for example to allow a programmer to debug code by stepping through the instructions.
  • Figures 6A and 6B show a slightly more complex example of a random selection unit having 16 inputs. As shown a 16 input random issue unit can be provided by adapting the simple two input structure shown in Figure 5 and connecting it in a cascaded structure.
  • Figure 6A shows a generalised stage of one of the random selection units. The inputs run from I0 to I(2^(K+1)−1). The generalised stage can be applied to the 16 input random selector shown in Figure 6B.
  • the 16 inputs are divided in half, with the even inputs I0, I2, ..., I14 being input to a first multiplexer 82 and the odd inputs I1, I3, ..., I15 being input to a second multiplexer 84.
  • Each multiplexer selects one output from 2^k inputs (i.e. 8:1 in the final stage) and each multiplexer accepts control signals A0...A(k−1) from the lower stages (i.e. A0, A1, A2 in the final stage). This is confirmed by the diagram on the right, which shows the selected signals from the lower stages being fed back into the higher stages. The relevant stage then behaves as the two-input model shown in Figure 5.
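  • A behavioural sketch of this cascade (a recursive model, not a gate-accurate one; the helper name and the use of Python's random source are assumptions) resolves each half of the inputs independently and then applies the Figure 5 cell to choose a side:

```python
import random

# Behavioural sketch of the cascaded selector of Figures 6A/6B: split
# the inputs into even and odd halves, resolve each half recursively,
# then use the two-input cell of Figure 5 to pick a side.
# A '0' input marks a ready instruction; the returned value is the
# index of the selected input.
def pick(inputs, rng=random):
    if len(inputs) == 1:
        return 0
    even, odd = inputs[0::2], inputs[1::2]
    e_idx, o_idx = pick(even, rng), pick(odd, rng)
    i0, i1 = even[e_idx], odd[o_idx]
    bit = i0 | ((1 - i1) & rng.getrandbits(1))  # two-input cell
    # bit == 1 -> odd side chosen; map back to the original index.
    return 2 * o_idx + 1 if bit else 2 * e_idx

ready = [1] * 16
ready[5] = 0                 # only input I5 is ready
assert pick(ready) == 5      # the cascade must find it
assert 0 <= pick([0] * 16) < 16   # all ready: any index is valid
```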
  • Figure 7 is a flow chart illustrating a method to choose which instruction in the instruction buffer to execute.
  • the issue buffer is assigned the symbol B.
  • If the buffer B contains only one instruction, step S13a issues this instruction to the relevant execution unit and the program sequence is completed, i.e. EXIT. If, however, there is more than one instruction in the buffer, step S13b involves dividing the buffer into two sets of roughly equal size, assigned the symbols L and R respectively. Then at S14, the instructions within the L buffer are examined to see if any independent instructions can be issued. If not, step S15b sets the active issue buffer B to look at buffer R and the process is repeated from step S12.
  • If L does contain instructions ready for issue, at step S15a the R buffer is examined to see if it also contains any instructions ready for issue. If not, step S16b sets the active buffer B to be buffer L and the process is repeated from step S12. If both L and R contain instructions that are ready for issue, the flow proceeds to step S16a where a random bit is generated. If the random bit is '1' the process moves to step S16b, where the L buffer is selected; if the bit is '0' the process moves to step S15b, where the R buffer is selected. In either case, the process is repeated until there is only one instruction in one of the buffers, at which point step S13a is invoked and the program sequence is completed.
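  • The Figure 7 procedure can be sketched as the following loop (function and variable names are illustrative); it repeatedly halves the active buffer and, when both halves contain a ready instruction, keeps one half at random:

```python
import random

# Sketch of the Figure 7 procedure: repeatedly halve the active buffer
# B, keeping a half at random when both halves contain a ready
# instruction.  Assumes at least one instruction in the buffer is ready.
def choose(buffer, ready, rng=random):
    b = list(range(len(buffer)))        # indices of the active buffer B
    while len(b) > 1:
        mid = len(b) // 2
        left, right = b[:mid], b[mid:]  # S13b: split into L and R
        l_ok = any(ready[i] for i in left)
        r_ok = any(ready[i] for i in right)
        if l_ok and r_ok:
            b = left if rng.getrandbits(1) else right   # S16a: random bit
        elif l_ok:
            b = left
        else:
            b = right
    return b[0]                         # S13a: single instruction left

ready = [False, False, True, False]
assert choose(["i0", "i1", "i2", "i3"], ready) == 2
```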
  • Such instruction sequences generally include associative and/or commutative operations where the execution order does not affect the end result.
  • the instructions that form such an instruction sequence will be referred to herein as data manipulation instructions.
  • a dependency check at run time is not needed immediately. That is, there may, prima facie, be a dependency on source or destination registers but it is one which can be ignored because the result will be the same whatever the order of execution of the instructions.
  • a compiler can identify such instruction sequences and introduce two extra instructions which are called herein IGNORE and DEPEND that demarcate the section of code containing such a set of data manipulation instructions.
  • the effect of the IGNORE R1 instruction at run-time is to cause all dependencies on R1 to be ignored for all the subsequent data manipulation instructions until the DEPEND R1 instruction is detected.
  • the DEPEND instruction causes the system to return to the default case where the dependencies on subsequent instructions having the specified operand R1 are checked.
  • the IGNORE/DEPEND pair allows the dependencies that exist on register R1 between the data manipulation instructions I1, I2, I3 to be ignored. This means that these instructions are ready for issue and can be selected by the random selection unit 44. Therefore, the data manipulation instructions I1, I2, I3 can be executed in a random order which increases the level of non-determinism exhibited by the processor.
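  • Why ignoring these dependencies is safe can be shown with a short sketch (the instruction encoding and register values are invented here): accumulating ADDs on the same register commute, so every issue order between the IGNORE R1 / DEPEND R1 pair produces the same result.

```python
import itertools

# Illustrative model: between IGNORE R1 and DEPEND R1, three
# accumulating ADDs on R1 commute, so their mutual R1 dependencies
# can safely be ignored by the random issue unit.
PROGRAM = [
    ("ADD", "R1", "R1", "R2"),   # I1: R1 := R1 + R2
    ("ADD", "R1", "R1", "R3"),   # I2: R1 := R1 + R3
    ("ADD", "R1", "R1", "R4"),   # I3: R1 := R1 + R4
]

def run(order):
    regs = {"R1": 1, "R2": 10, "R3": 100, "R4": 1000}
    for i in order:
        _, dst, a, b = PROGRAM[i]
        regs[dst] = regs[a] + regs[b]
    return regs["R1"]

# Addition is associative and commutative: every issue order of
# I1, I2, I3 yields the same R1, so the end result is order-independent.
assert {run(p) for p in itertools.permutations(range(3))} == {1111}
```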
  • Figure 8 shows an alternative embodiment to that of Figure 3 of the dependency checking mechanism 33' for use in the presence of the IGNORE and DEPEND instructions.
  • the Defined Registers table 36 has two bit masks for each register, allowing each register dependency to be stored in one of two categories: a dependency of crucial importance and a dependency that can be delayed.
  • reference numeral 90 denotes a bit-mask for a crucial dependency
  • numeral 92 denotes a bit-mask for a deferrable dependency.
  • each register is associated with two such bit-masks.
  • the dependency checker 33 of Figure 3 provides for only a single dependency category for each instruction, i.e. crucial.
  • Each register of the dependency checker 33' of Figure 8 also has a flag associated with it, the so-called IGNORE flag (95-99). If the IGNORE flag is set it indicates that dependencies on that register can be ignored for as long as the flag is set.
  • Figure 8 shows the case where the Defined Registers table 36 has IGNORE flags 95, 96, 97, 98, 99 associated with the row pairs holding bit masks for registers R1, R2, R3, R4, R5 respectively.
  • the IGNORE flags are set and reset responsive to detection of the IGNORE and DEPEND instructions at the decode stage 16 by a detect unit 16a, that is prior to execution of the instructions. When executed, the IGNORE and DEPEND instructions are executed as NO OPS.
  • When the Defined Register table is loaded, as in the process described above with reference to Figure 4, which one of the categories of dependency is set depends on the status of the IGNORE flag associated with each pair of bit-masks 90, 92 for each register.
  • the IGNORE flag 96 has been set to "1" indicating that the dependencies on the associated register R2 can be temporarily delayed. An IGNORE R2 instruction would have issued to set this flag.
  • the IGNORE flag 96 may be reset to "0" upon detection of a DEPEND R2 instruction.
  • any subsequent data manipulation instructions that specify register R2 will set the second category of bit mask 92 indicating that the instructions have dependencies that can be delayed.
  • the relevant bit-masks 90 are set in the first category defining a crucial dependency.
  • any instructions which have a prima facie dependency on register R2 will be treated as though there is no dependency for the purposes of selection of instructions by the random issue unit. That is, the bit mask sent to the dependency matrix table 30 will indicate there is no dependency. Instead these dependencies are stored in the second bit mask 92 associated with the register R2.
  • the DEPEND instruction like the other instructions in the instruction issue table 32, has an associated slot in the dependency matrix table 30. However, that slot does not contain the dependencies worked out as for the "ordinary" instructions explained above with reference to Figure 3. Instead, it takes the bit mask of deferred dependencies associated with the register defined in the DEPEND instruction. This constitutes the dependencies for instructions whose dependencies have been deferred but which have yet to be selected for issue by the random selection unit 44. Thus, two important events occur when a DEPEND instruction is loaded:
  • a selector switch 110 selects the second category of bit-mask 92, associated with the register defined in the DEPEND instruction, as the dependency mask for the slot of the dependency matrix 30 corresponding to the DEPEND instruction.
  • IGNORE R2 (a first IGNORE instruction specifying a single operand)
  • IGNORE R4 (a second IGNORE instruction specifying a single operand)
  • An IGNORE instruction can define more than one operand, in which case IGNORE flags are simultaneously set against multiple defined registers.
  • a single DEPEND instruction defining more than one register can follow, or multiple DEPEND instructions each defining a single operand.
  • the bit masks associated with the multiple defined registers need to be ORed before being loaded into the slot associated with the DEPEND instruction.
  • data manipulation instructions will issue normally and can occur between an IGNORE/DEPEND instruction pair provided that they do not act on the same operand specified by the IGNORE instruction.
  • an additional instruction such as ADD R7, R8, R9 is inserted between the data manipulation instructions I1 and I2 in the code sequence example of Table 3.
  • the IGNORE instruction specifies the register R1, while this ADD instruction specifies different registers R7, R8, R9. So this ADD instruction will be executed normally, meaning that if this instruction is dependent on another then the default dependency checking mechanism of Figure 3 is used and the dependency will be considered crucial.
  • IGNORE and DEPEND instructions can be defined without an operand. In this case, by default all the IGNORE flags are automatically set to "1". In such a situation the dependencies on all the data manipulation instructions that exist between the IGNORE/DEPEND instruction pair will be ignored regardless of the operands specified.
  • the non-deterministic properties of a processor can be exploited without necessarily impacting its performance.
  • the IGNORE/DEPEND instruction pair allows certain types of instructions to be executed in a random order by ignoring their dependencies.
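The safety of the IGNORE/DEPEND pair can be illustrated in software. The following Python sketch (the four operand values are invented for illustration) shows why the dependencies of a commutative accumulation can be deferred: every issue order of the XOR instructions leaves the same value in the accumulator register R1.

```python
import itertools

# Hypothetical sequence guarded by an IGNORE R1 / DEPEND R1 pair:
# each data manipulation instruction XORs one value into register R1.
blocks = [0x3A, 0xC5, 0x7E, 0x19]  # invented operand values

def run(order):
    r1 = 0  # accumulator register R1
    for i in order:
        r1 ^= blocks[i]  # In: XOR R1, R1, blocks[i]
    return r1

# Every permutation of the four XORs leaves the same value in R1,
# which is why their mutual dependencies can safely be deferred.
results = {run(p) for p in itertools.permutations(range(len(blocks)))}
assert results == {0x3A ^ 0xC5 ^ 0x7E ^ 0x19}
```

Because all 24 orderings agree, a random issue unit is free to pick any of them on each program run.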


Abstract

A method of executing a computer program comprising an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order, the method comprising reading each instruction in the ordered sequence, checking its dependency with respect to its adjacent instructions and storing an associated dependency bit mask, and, responsive to detection of the ignore instruction, ignoring the dependency bit masks for subsequent data manipulation instructions issued up to detection of a depend instruction, whereby data manipulation instructions can be issued in an arbitrary order.

Description

COMPUTER INSTRUCTIONS
This invention relates to computer instructions, and particularly to a method of executing a computer program and to a processor, the program including an ordered sequence of computer instructions.
The era of digital communications has brought about many technological advancements which make our lives easier, but at the same time pose a new set of problems that need attention. A particular area of concern is data security, where businesses and customers alike have their own security requirements of the services which they supply or receive. Businesses see computer hackers as a hazard to attracting new e-commerce customers, since customers must be assured that their transactions will be secure. Many encryption schemes have been suggested in an attempt to overcome 'eavesdropping' on private or personal digital communications, such as confidential email messages or television broadcasts which have not been paid for, i.e. pay-TV.
Modern cryptography is about ensuring the integrity, confidentiality and authenticity of digital communications. Secret keys are used to encrypt and decrypt the data and it is essential that these keys remain secure. Whereas in the past secret keys were stored in centralised secure vaults, today's network-aware devices have embedded keys making the hardware an attractive target for hackers. A great deal of research has gone into algorithm design and hackers are more prone to concentrate their efforts on the hardware in which the cryptographic unit is housed.
One such attack is performed by taking physical measurements of the cryptographic unit, as described by P. Kocher et al in the two articles entitled "Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS and other systems" and "Differential Power Analysis", both in Advances in Cryptology, CRYPTO '96 pages 104-113 (1996) and CRYPTO '99 pages 388-397 (1999) respectively. By taking measurements of power consumption, computing time or EMF radiation over a large number of encryption operations and using known statistical techniques, it is possible to discover the identity of the secret keys. Kocher goes on to describe three main techniques: i) timing attacks, ii) Simple Power Analysis (SPA) and iii) Differential Power Analysis (DPA).
DPA provides the most powerful attack using very cheap resources. Many people have started to examine this problem and S. Chari et al provides a worrying analysis regarding the weakness of AES (Advanced Encryption Standard) algorithms on Smart cards, see the article entitled "A Cautionary Note Regarding the Evaluation of AES Candidates on Smart-Cards" in the Second Advanced Encryption Standard Conference, Rome, March 1999.
L. Goubin et al proposes three general strategies to combat Differential Power Analysis attacks in his article entitled "DES and Differential Power Analysis, The Duplication Method" in Cryptographic Hardware and Embedded Systems pages 158-172, 1999. These are:
i) Make algorithmic changes to the cryptographic primitives under consideration.
ii) Replace critical assembler instructions with ones whose signature is hard to analyse, or re-engineer the crucial circuitry that performs arithmetic operations or memory transfers.
iii) Introduce random timing shifts so as to decorrelate the output traces on individual runs.
The first approach has been attempted before. For example, Goubin et al suggests splitting the operands into two and duplicating the workload. This however means at least doubling the required computer resources. Similarly, Chari proposes masking the internal bits by splitting them up and processing the bit shares in a certain way so that once recombined the correct result is obtained. Kocher et al have attempted the second approach by balancing the Hamming weights of the operands, physical shielding or adding noise circuitry, as discussed for example in "Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS and other systems"
The present invention seeks to improve tamper resistance according to the third approach, that is, by decorrelating the timing of power traces on successive program executions.
Kocher et al also describe two ways of producing the required temporal misalignment: i) introducing random clock signals, and ii) introducing randomness into the execution order. Kocher et al, in "Differential Power Analysis", mention that randomising execution order can help defeat DPA, but can lead to other problems if not done carefully. One randomising approach uses the idea of randomised multi-threading at an instruction level using a set of essentially "shadow" registers. This allows auxiliary threads to execute random encryptions, in the hope of masking the correct encryption operation. The disadvantage is that additional computational tasks are again required, and this requires a more complex processor architecture having separate banks of registers, one for each thread.
In particular, Chari et al in an article entitled "Towards Sound Approaches to Counteract Power-Analysis Attacks" in Advances in Cryptology, CRYPTO '99, pages 398-412, show that for a randomised execution sequence to be effective the randomisation needs to be done extensively. However, no mechanism is disclosed in Chari to enable extensive randomised execution. For example, if only the XOR instruction in each DES (Data Encryption Standard) round is randomised then DPA is still possible by taking around 8 times as much data. DES is the most widely used encryption algorithm and is known as a "block cipher", which operates on plaintext blocks of a given size (64 bits) and returns ciphertext blocks of the same size. DES operates on the 64-bit blocks using a key size of 56 bits. The keys are actually stored as being 64 bits long, but every 8th bit in the key is not used (i.e. bits numbered 7, 15, 23, 31, 39, 47, 55, and 63).
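As a quick Python check of the unused key-bit positions listed above:

```python
# Every 8th bit of the stored 64-bit DES key is a parity bit and is not
# used by the cipher (bit positions numbered from 0, as in the text).
unused_bits = [8 * i - 1 for i in range(1, 9)]
assert unused_bits == [7, 15, 23, 31, 39, 47, 55, 63]

# 64 stored bits minus the 8 parity bits leaves the 56-bit effective key.
assert 64 - len(unused_bits) == 56
```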
Hence for randomised execution order to work it needs to be done in a highly aggressive manner which would preclude the type of local randomisation implied by the descriptions above. In addition this cannot be achieved in software since a software randomiser would work at too high a level of abstraction. The randomised multi-threading idea is close to a solution but suffers from increased CPU time and requires a more complex processor with separate banks of registers, one for each thread.
The aim of the present invention is to increase the non-deterministic nature of a processor but without unnecessarily impacting the performance.
According to one aspect of the present invention there is provided a method of executing a computer program comprising an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order, the method comprising: reading each instruction in the ordered sequence, checking its dependency with respect to its adjacent instructions and storing an associated dependency bit mask; responsive to detection of the ignore instruction, ignoring the dependency bit masks for subsequent data manipulation instructions issued up to detection of the depend instruction, whereby data manipulation instructions can be issued in an arbitrary order.
The ordered sequence of instructions can include a depend instruction after the set of data manipulation instructions which causes dependency bit masks for subsequent instructions in a sequence to be utilised.
It is possible to then arrange that, responsive to the depend instruction, the dependency bit masks associated with data manipulation instructions which have not yet issued are used to delay issue of the depend instruction until all data manipulation instructions in the set have issued. Alternatively, the dependency bit masks associated with data manipulation instructions which have not yet issued can be combined to ensure that all data manipulation instructions in the set issue prior to instructions subsequent to the depend instruction.
Preferably, the ignore instruction specifies at least one operand, and the dependency bit masks are ignored only for the set of data manipulation instructions which define said operand.
Alternatively, the ignore instruction defines no operand and the dependencies for all subsequent data manipulation instructions are ignored.
According to another aspect of the present invention there is provided a processor comprising: a program memory holding a computer program which comprises an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order followed by a depend instruction; a decode unit for decoding each instruction and arranged to detect ignore and depend instructions prior to their execution; a dependency checker for checking the dependency of each instruction with respect to adjacent instructions and for generating an associated dependency bit mask; a store for holding said dependency bit masks; and means responsive to detection of the ignore instruction to cause the dependency bit masks held in the store to be ignored for data manipulation instructions up to detection of the depend instruction.
According to yet a further aspect of the present invention there is provided a computer program product comprising program code means including an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order followed by a depend instruction wherein, when the program product is loaded into a computer and executed, on detection of the ignore instruction dependency bit masks associated with each instruction in the ordered sequence are ignored for subsequent data manipulation instructions issued up to detection of the depend instruction.
By providing two new instructions IGNORE and DEPEND, the dependency checker need only operate when necessary, so that a higher degree of non-determinism can be achieved without affecting the performance of a processor.
The present invention will now be described by way of an example with reference to the accompanying drawings, in which:-
Figure 1 shows a block diagram of a generic CPU architecture; Figure 2 shows a non-deterministic processor executing two instructions compared to other processors;
Figure 3 shows an embodiment of the random issue unit; Figure 4 shows a flow chart explaining how instructions are issued at random;
Figure 5 shows an example of two input random selection unit; Figures 6A and 6B show a generic model and a 16 input random selection unit;
Figure 7 shows a flow chart describing a method for choosing which random instruction in the issue buffer to execute; and
Figure 8 shows how the dependency checking mechanism 33 is affected upon detection of an IGNORE or DEPEND instruction.
Figure 1 is a block diagram illustrating the standard functional units that make up a pipelined computer system. A program memory 2 contains program instructions, which are addressable at different memory locations. An ADDRESS bus 6 and a DATA bus 4 transfer information to and from the various elements that make up the processor 8. The system contains an instruction fetch unit 10 having a program counter 12 that stores the address of the next instruction to be fetched. For sequential execution of instructions the program counter will normally be incremented by a single addressing unit. However, if a branch instruction is encountered, the program flow is broken and the program counter needs to be loaded with the address of a target instruction (that is, the first instruction of the branch sequence). The instructions are fetched from the program memory and stored in an instruction issue buffer 14. It is worth noting that the program counter referred to herein is used to control instruction fetches from memory. There may also be an execution counter which is used by the execution unit 18 to specify which instruction is currently being executed. Next, the instructions are decoded and supplied to relevant execution units. In this example, only one execution unit 18 or pipeline is shown, however the present invention is intended to be used in conjunction with modern processors which may have several execution units allowing parallel execution paths. Encryption algorithms need a substantial level of computational power and modern processor architectures such as superscalar, VLIW (Very Long Instruction Word) and SIMD (Single Instruction Multiple Data) are ideally suited to the present invention. Finally, the results of the operations are written back by a result write stage 22 into temporary registers of a register file 20, which is used to load and store data in and out of main memory.
The present invention is concerned mainly with the block of functionality denoted by the reference numeral 24. In particular, the present invention deals with a modified issue buffer 14 which will be described in more detail later. The issue buffer generates an instruction fetch signal 13 to control which instructions are supplied from the fetch unit 10. Also, part of the decode circuitry may be used to decode the instruction dependencies. This will also be described in more detail at a later stage.
The present invention is concerned with a non-deterministic processor. Non- deterministic processing as described herein means that for successive runs of the program, although the result will be the same the order of execution of the instructions will be random. This reduces the impact of a DPA-type attack in that the power traces resulting from successive program runs will be different. Figure 2 serves to highlight the differences between a non-deterministic processor and other known processors when executing a simple program consisting of the following two lines of code:
ADD a, b
XOR c, d
The execution flow on the left of Figure 2 represents a standard processor having a single execution pipeline where the two instructions are executed sequentially, i.e. the ADD instruction is executed in cycle 1 followed by the XOR instruction in cycle 2. The middle execution flow represents a modern Pentium processor having a plurality of execution paths, which execute independent instructions in parallel. The execution flow on the right of Figure 2 represents a non-deterministic processor having a single pipeline.
The important point to note is that the non-deterministic processor allows the instructions to be executed in any order provided that it has been established that the instructions are independent. So in the first cycle either the ADD or the XOR instruction can be carried out and in the second cycle the other instruction will be executed. In contrast, the standard processor executes instructions sequentially and although there is a little "out of order" execution to help with branch prediction, this occurs on a small scale. In any event, in such a processor each time a program is run containing a certain sequence of instructions, the execution sequence will be identical. Although the Pentium processor has a plurality of execution units (A) and (B), which execute the independent instructions in parallel the processor is still deterministic in that the ADD and the XOR instructions are executed concurrently in pipes (A) and (B).
A slightly more complex code sequence comprising eight instructions is shown in Table 1.
[Table 1 - image not reproduced in the text: a code sequence of eight instructions I0 to I7.]
Table 1
It is apparent from the code listing above that the sequential execution of these eight instructions I0, I1 ... I7 is merely one way that the code sequence may be correctly executed. There are in fact 80 different code sequences, i.e. instruction orderings, for executing these eight instructions which will all give the right answer. For example, the LOAD instruction I0 reads the value of register R1 holding a memory address, and the value stored at this address is written into the register R8. It can be seen that the LOAD instructions I0, I1, I3 and I4 are all independent instructions, in that none of them is dependent on the results of execution of another, and an equally valid execution sequence could be, for example, I1, I0, I3, I4, I5, I2, I6, I7. However, an incorrect result occurs if the ADD instruction I2 is executed before the LOAD instruction I1. That is, the purpose of I1 is to LOAD a value addressed by register R2 into register R9. The intention of the code sequence is to add the loaded value from R8 to the value in R9. Therefore, if the ADD instruction I2 is carried out before I1, the old value of R9 will be added to R8, yielding an incorrect value for the resulting summation R10. We say that there is a dependency between the ADD instruction I2 and the LOAD instruction I1. The present invention makes use of the fact that in many code sequences a number of instructions are independent and thus can, in theory, be executed in any order. The invention exploits this by executing the instructions in a random order at run time. This causes the access patterns to memory for either data or instructions to be uncorrelated for successive program executions, and thus causes the power trace to be different each time.
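The R8/R9/R10 dependency can be made concrete with a small simulation. This Python sketch uses invented memory contents and considers only the three instructions discussed above, not the full Table 1 sequence: swapping the two independent LOADs is harmless, but issuing the ADD before the second LOAD picks up a stale R9.

```python
# Toy memory and register file for the fragment discussed above
# (invented addresses and data):
#   I0: LOAD R8, [R1]    I1: LOAD R9, [R2]    I2: ADD R10, R8, R9
mem = {0x100: 5, 0x200: 7}

def run(order):
    regs = {"R1": 0x100, "R2": 0x200, "R8": 0, "R9": 0, "R10": 0}
    ops = {
        "I0": lambda r: r.update(R8=mem[r["R1"]]),
        "I1": lambda r: r.update(R9=mem[r["R2"]]),
        "I2": lambda r: r.update(R10=r["R8"] + r["R9"]),
    }
    for name in order:
        ops[name](regs)
    return regs["R10"]

assert run(["I0", "I1", "I2"]) == 12  # program order: correct sum
assert run(["I1", "I0", "I2"]) == 12  # swapping independent LOADs is safe
assert run(["I0", "I2", "I1"]) == 5   # I2 before I1 adds a stale R9
```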
Figure 3 shows an example of the implementation of a random issue unit. The random issue unit comprises an instruction table 32 with an associated dependency matrix table 30. Instructions are prefetched into the instruction table 32 using conventional instruction fetch circuitry. The dependency matrix table has slots and columns, where the slots represent bit-masks associated with each instruction in the instruction table 32. The bit-masks or dependency bits are an indication as to whether an instruction has a dependency on another instruction. Broadly speaking there are two types of dependencies that need to be considered for an instruction:
1) Use dependencies - which are the dependencies of the source registers that an instruction uses to read data from. 2) Defined dependencies - which are the dependencies of the destination registers that an instruction defines to write data to.
In Figure 3, a particular instruction will be decoded and the mask bits will be set accordingly in a Used Registers table 34 and a Defined Registers table 36. The Used and Defined Register tables 34, 36 shown in Figure 3 each comprise a number of rows and columns. Each row corresponds to a register (or operand) and each column corresponds to a particular slot (or instruction) in the instruction issue table 32. Each register comprises a plurality of slots corresponding to the number of instructions in the instruction table 32 and is the so-called bit-mask for a register. The bit mask for a register is a binary stream where a "1" indicates which instruction has a dependency on that register. As an example, consider the Used and Defined Register tables 34, 36 of Figure 3 where each table has five rows corresponding to registers R1 to R5, i.e. R1 corresponds to the top row and R5 to the bottom row.
At run-time the processor performs a logical OR operation 38 of the bit mask of the Used Registers table 34 and the Defined Registers table 36 thereby creating a new bit-mask stored in a free slot of the dependency matrix 30.
A test can be performed by OR-ing, with OR gates 40, each of the dependency bits of a slot of the dependency matrix. If all the dependency bits of a slot associated with a particular instruction are set to zero, then the instruction can be executed and a FIRE signal 42 is generated to the Random Selection Unit 44. Given the result of the OR for each row of the table, a number of zeros (indicating instructions ready to be executed) and a number of ones (indicating instructions that are blocked) are obtained. The random selection unit 44 selects at random one of the slots which is indicated at value zero, and causes that instruction to be executed next. In the described embodiment, the dependency bits are overwritten with new values when the dependencies of the next instruction are loaded into the matrix.
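In software terms the OR gates 40 reduce to a zero test on each slot's mask. A minimal sketch, with invented mask values:

```python
# Dependency matrix: one bit-mask per slot; bit j set means "this
# instruction still depends on the instruction in slot j".
matrix = [0b00000000, 0b00000001, 0b00000000, 0b00000110]  # invented

# The OR across all bits of a slot reduces to "mask != 0"; a zero mask
# raises that slot's FIRE signal towards the random selection unit.
fire = [mask == 0 for mask in matrix]
assert fire == [True, False, True, False]
```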
All the instructions that have no dependencies are thus identified by fire signals 42 to the random selection unit 44. For purposes of clarity we will assume a single execution pipeline where, for each execution cycle, the random selection unit selects by selection signal 46 only one of the fired instructions. However it should be appreciated that, for example, in a superscalar architecture having parallel execution pipelines, a number of instructions could be issued in parallel under the control of the Random Selection Unit 44. The selection signal 46 of the Random Selection Unit 44 points to an instruction to be executed, while at the same time a feedback signal 48 is issued to "free up" future instructions that may have been dependent on the instruction currently being executed.
The random issue unit supplies an instruction to be executed from the instruction table 32 along instruction supply path 50 and loads an instruction into the instruction table 32 along instruction load path 52 at the same time. Figure 4 is a flow chart indicating how the instructions in the instruction issue buffer 14 are issued for execution and loaded concurrently. The load operations are represented by the left branch flow (C), while the issue operations are represented by the right branch flow (D).
The left branch flow (C) of Figure 4 relates to an instruction load operation starting at step S1, where the next instruction, specified by the program counter 12, is loaded into the instruction table 32 of the issue buffer 14. The load operation will firstly be described in general terms, and then more specifically in relation to one example. Each instruction defines two source operands 54 and a destination operand 56. These will nearly always be defined as registers, although that is not necessary: direct addresses or immediates are possible. The source and destination operands 54, 56 are simultaneously decoded. At S2, the decoded information is translated into bit-masks that are set in the Used Registers and Defined Registers tables 34, 36. These bit-masks are OR-ed by OR gate 38 (Figure 3) to create dependency bits indicating on which instructions the loaded instruction depends. At S3, the empty slot E associated with the loaded instruction is then selected for replacement by setting the InValid flag 58 to zero. The dependency bits are loaded into the selected slot E of the dependency matrix. At S4, the bit-masks in column E of the Used Registers and Defined Registers tables 34, 36 are set to "1" along path 62 for the corresponding rows of these tables to ensure that future instructions that use those registers are going to wait for the instruction to finish.
A specific example of the load operation will now be described.
The Used and Defined Register tables 34, 36 are set up during the instruction fetch or LOAD sequence, as already indicated. The fetched instruction is decoded and the bit-masks associated with each of the registers specified in the instruction are checked for dependencies with other instructions. For example, assume the instruction ADD R2, R3, R4 is fetched. The bit masks associated with the registers R2 and R3 in the Used Registers table 34 (i.e. the source registers) are sent to OR gate 38. Also, the bit mask associated with register R4 in the Defined Registers table (i.e. the destination register) is sent to the OR gate 38. Assuming there are N instructions in the instruction table 32, each bit mask has N slots, where each slot corresponds to a particular instruction.
The OR gate 38 receives the bit-masks and performs a bit-wise logical OR operation for each slot simultaneously. For example, assume the following bit- masks exist:
[Image not reproduced in the text: example bit-masks from the Used Registers table (R2, R3) and the Defined Registers table (R4).]
The resulting set of dependency bits (or dependency mask) is shown as 00100000, which is then sent from the OR gate 38 to a horizontal slot in the dependency matrix 30 that is associated with the corresponding instruction of the instruction table 32. During the execution stage (which is discussed more fully below with reference to the right branch of Figure 4), the first step includes simultaneously performing a second OR operation 40 across all the dependency bits for each slot of the dependency matrix 30 to determine which instructions have no dependencies. In the example, a "1" set in the third bit of the dependency mask for the instruction in question means that the OR'ed result will be a "1". Therefore this instruction still has dependencies and cannot be fired at the random selection unit 44.
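For the ADD R2, R3, R4 example, the OR gate 38 amounts to a bitwise OR of the three register masks. In this sketch the individual mask values are assumptions chosen so that the result matches the 00100000 mask discussed above:

```python
# Assumed 8-slot masks read from the Used/Defined Registers tables
# when ADD R2, R3, R4 is loaded (illustrative values only).
used_R2    = 0b00100000  # some in-flight instruction conflicts on R2
used_R3    = 0b00000000
defined_R4 = 0b00000000

dependency_mask = used_R2 | used_R3 | defined_R4  # OR gate 38
assert format(dependency_mask, "08b") == "00100000"
assert dependency_mask != 0  # the OR across the slot is 1: cannot fire yet
```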
Returning to the load operation (i.e. left branch of Figure 4), the final step is to set the appropriate bit masks associated with the currently loaded instruction. The appropriate bit-masks are those of the registers that cannot be used by future instructions until the current instruction has been issued. Thus, for the example instruction (i.e. ADD R2, R3, R4), register R4 in the Used Registers table 34 is set to "1" in the column for the present instruction, to inform all future instructions that R4 cannot be used as a source register (i.e. read from), because the present instruction uses it as a destination register (i.e. written to). Similarly, registers R2 and R3 are source registers for the present instruction and thus these registers are set to "1" in the Defined Registers table 36 to indicate that these registers cannot be written to until the present instruction has completed.
The right branch flow (D) of Figure 4 relates to random instruction issue, starting at S1 where the dependency bits associated with each instruction are checked using an OR operation via OR gate 40. Then all of the independent instructions are flagged as ready for issue and appropriate fire signals are sent to the Random Selection Unit. At step S2, the Random Selection Unit 44 selects one of the instructions 46, for example the instruction X, which is issued along instruction supply path 50 to the relevant execution unit. At S3, column X is then cleared (i.e. bits are set to zero) from the dependency matrix 30 as well as from the Used Registers and Defined Registers tables 34, 36. Also, the InValid flag is set (i.e. to 1). Thus, the dependency column for the instruction currently being executed is erased, indicating that any instruction waiting for this instruction can now be executed. According to step S4, a pointer E is initialised for the next iteration. E is a pointer that points to an empty slot which is available in the issue table. After every instruction has been loaded, E must point to another free slot. One could, for example, use the instruction previously executed to initialise E. In that way, the pointer E would follow the executed instructions around the table.
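The load/issue loop can be sketched as a whole. In this Python sketch the dependencies are kept as sets rather than bit-masks, and the dependence graph is invented; clearing column X becomes discarding the issued slot from every remaining set:

```python
import random

# Minimal sketch of the random issue loop of Figure 4. 'deps[i]' is the
# set of slots instruction i must wait for (an invented dependence graph).
deps = {0: set(), 1: set(), 2: {0, 1}, 3: set(), 4: {2, 3}}

def random_issue(deps, rng):
    deps = {i: set(d) for i, d in deps.items()}  # work on a copy
    order = []
    while deps:
        ready = [i for i, d in deps.items() if not d]  # FIRE signals 42
        pick = rng.choice(ready)                       # random selection 44
        order.append(pick)
        del deps[pick]
        for d in deps.values():
            d.discard(pick)  # feedback 48: clear column 'pick'
    return order

order = random_issue(deps, random.Random(1))
# Dependencies are always respected, whatever the random choices:
assert order.index(2) > max(order.index(0), order.index(1))
assert order.index(4) > max(order.index(2), order.index(3))
```

Successive runs with different random seeds produce different valid orderings, which is exactly the decorrelation of power traces the invention seeks.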
Figure 5 represents a two input example of how a random selection unit 44 may be implemented. The truth table for the random selector is shown below:
I0  I1  R  E  |  A
0   0   0  0  |  0
0   0   1  0  |  0
0   1   0  0  |  0
0   1   1  0  |  0
1   0   0  0  |  1
1   0   1  0  |  1
0   0   0  1  |  0
0   0   1  1  |  1
0   1   0  1  |  0
0   1   1  1  |  0
1   0   0  1  |  1
1   0   1  1  |  1
Table 2
Figure 5 shows two inputs 70 and 72 for the random selection unit 44. It should be apparent from Figure 3 that each input I0 or I1 will either be a '0' or a '1'. More generally, a '0' will appear if all of the dependency bits of the relevant slot are '0'. Thus, a '0' indicates an independent instruction, which can be selected by the Random Selection Unit 44. An inspection of truth table 2 reveals that if input I1 is a '1', then the output 46 of the random selector will always take the logical value of the other input. Input I1 is shown coupled to an AND gate 76 through an inverting element 75. The AND gate 76 accepts two other inputs, i.e. a random signal R 80 and an enable signal E 78. The output of the AND gate is OR-ed 74 with input I0 to produce the selected output 46 of the random selection unit 44.
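Reading the gates of Figure 5 as A = I0 OR (NOT I1 AND R AND E) reproduces Table 2. A quick Python check (the two rows with I0 = I1 = 1 do not appear in the table and are not tested):

```python
# Gate structure of Figure 5: OR of I0 with (inverted I1 AND R AND E).
def select(i0, i1, r, e):
    return i0 | ((1 - i1) & r & e)

# Rows of Table 2 as (I0, I1, R, E, A):
table2 = [
    (0, 0, 0, 0, 0), (0, 0, 1, 0, 0), (0, 1, 0, 0, 0), (0, 1, 1, 0, 0),
    (1, 0, 0, 0, 1), (1, 0, 1, 0, 1), (0, 0, 0, 1, 0), (0, 0, 1, 1, 1),
    (0, 1, 0, 1, 0), (0, 1, 1, 1, 0), (1, 0, 0, 1, 1), (1, 0, 1, 1, 1),
]
for i0, i1, r, e, a in table2:
    assert select(i0, i1, r, e) == a
```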
The random signal R does not have to be truly random; it would typically be generated using a pseudo-random generator that is reseeded regularly with some entropy. The enable signal 78 allows random issue to be disabled, i.e. non-determinism can be turned off, for example to allow a programmer to debug code by stepping through the instructions. Figures 6A and 6B show a slightly more complex example of a random selection unit having 16 inputs. As shown, a 16 input random issue unit can be provided by adapting the simple two input structure shown in Figure 5 and connecting it in a cascaded structure. Figure 6A shows a generalised stage of one of the random selection units. The inputs run from I0 to I(2^(k+1)-1). The generalised stage can be applied to the 16 input random selector shown in Figure 6B.
Sixteen inputs means the selector has inputs I0 to I15, and from the generalised case we can say:
2^(k+1) - 1 = 15
2^(k+1) = 16
Therefore, k = 3
Therefore in the final stage (i.e. R-box 3), the 16 inputs are divided in half, with the even inputs I0, I2 ... I14 being input to a first multiplexer 82 and the odd inputs I1, I3 ... I15 being input to a second multiplexer 84. Each multiplexer selects one output from 2^k inputs (i.e. 8:1 in the final stage) and each multiplexer accepts control signals A0 ... A(k-1) from the lower stages (i.e. A0, A1, A2 in the final stage). This is confirmed by the diagram on the right, which shows the selected signals from the lower stages being fed back into the higher stages. Each stage then behaves as the two input model shown in Figure 5.
Figure 7 is a flow chart illustrating a method of choosing which instruction in the instruction buffer to execute. At S11, the issue buffer is assigned the symbol B. At S12, the number of instructions remaining in the issue buffer 14 is examined; if the buffer contains only one instruction then step S13a issues this instruction to the relevant execution unit and the program sequence is completed, i.e. EXIT. If, however, there is more than one instruction in the buffer, step S13b divides the buffer into two sets of roughly equal size, assigned the symbols L and R respectively. Then at S14, the instructions within the L buffer are examined to see if any independent instructions can be issued. If not, step S15b sets the active issue buffer B to look at buffer R and the process is repeated from step S12. If, however, buffer L does contain instructions that are ready for issue, then at step S15a the R buffer is examined to see if it contains any instructions ready for issue. If not, step S16b sets the active buffer B to be buffer L and the process is repeated from step S12. If both L and R contain instructions that are ready for issue, the flow proceeds to step S16a where a random bit is generated. If the random bit is '1' then the process moves to step S16b where the L buffer is selected; if the bit is '0' then the process moves to step S15b where the R buffer is selected. In either case, the process is repeated until there is only one instruction in one of the buffers, in which case step S13a is invoked and the program sequence is completed.
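The flow chart of Figure 7 can be sketched as a recursive procedure. The sketch below is illustrative only: the buffer representation (a list of name/ready pairs) and the function name are our own, and the buffer is assumed to contain at least one ready instruction.

```python
import random

def choose_instruction(buffer):
    """Pick one ready instruction from `buffer`, following Figure 7.
    `buffer` is a list of (name, ready) pairs; at least one entry is
    assumed to be ready for issue."""
    if len(buffer) == 1:                      # S13a: issue and exit
        return buffer[0][0]
    mid = len(buffer) // 2                    # S13b: split into L and R
    left, right = buffer[:mid], buffer[mid:]
    left_ready = any(ready for _, ready in left)    # S14
    if not left_ready:
        return choose_instruction(right)      # S15b: use buffer R
    right_ready = any(ready for _, ready in right)  # S15a
    if not right_ready:
        return choose_instruction(left)       # S16b: use buffer L
    # S16a: both halves contain ready instructions -- flip a random bit
    return choose_instruction(left if random.getrandbits(1) else right)

buf = [("I0", False), ("I1", True), ("I2", True), ("I3", False)]
assert choose_instruction(buf) in ("I1", "I2")
```

Repeated calls with the same buffer will return I1 on some runs and I2 on others, which is exactly the non-deterministic issue behaviour the hardware cascade of Figures 6A and 6B provides.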
There are sequences of instructions that exist which at first glance appear to be dependent on one another, but in fact may be executed in any order without changing the end result. Such instruction sequences generally include associative and/or commutative operations, for which the execution order does not affect the end result. The instructions that form such an instruction sequence will be referred to herein as data manipulation instructions. Thus, for these sequences, a dependency check at run time is not needed immediately. That is, there may, prima facie, be a dependency on source or destination registers, but it is one which can be ignored because the result will be the same whatever the order of execution of the instructions. A compiler can identify such instruction sequences and introduce two extra instructions, which are called herein IGNORE and DEPEND, that demarcate the section of code containing such a set of data manipulation instructions.
As an example, consider the following instruction sequence:
I0 IGNORE R1
I1 ADD R1 R2 R1
I2 ADD R1 R3 R1
I3 ADD R1 R4 R1
I4 DEPEND R1
I5 STORE R1 [ANS]
Table 3
If the data manipulation instructions shown in Table 3 are executed in the sequence I1, I2, I3, the result obtained is:

[ANS] = R1 + R2 + R3 + R4

However, the same result is obtained if the execution order is changed to I3, I1, I2, i.e.

[ANS] = R1 + R4 + R2 + R3
So, regardless of the order of execution of the three instructions, the result remains the same. The reason for this is that the ADD operation is both commutative and associative. Other operations that have these properties include the OR, XOR and AND operations.
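The order-independence of the Table 3 block can be checked exhaustively in a few lines. The register values below are arbitrary illustrative numbers, not taken from the text.

```python
from itertools import permutations

# Arbitrary example values for the registers used in Table 3.
R1, R2, R3, R4 = 10, 20, 30, 40

results = set()
for order in permutations([R2, R3, R4]):  # every order of I1, I2, I3
    acc = R1
    for value in order:
        acc = acc + value                 # ADD R1 Rn R1
    results.add(acc)

# All six execution orders of the three ADDs yield the same [ANS].
assert results == {R1 + R2 + R3 + R4}
```

The same exhaustive check passes if `+` is replaced by `|`, `^` or `&`, reflecting the observation that OR, XOR and AND share the commutative and associative properties of ADD.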
In the example of Table 3, the effect of the IGNORE R1 instruction at run-time is to cause all dependencies on R1 to be ignored for all the subsequent data manipulation instructions until the DEPEND R1 instruction is detected. The DEPEND instruction causes the system to return to the default case, where the dependencies of subsequent instructions having the specified operand R1 are checked. The IGNORE/DEPEND pair allows the dependencies that exist on register R1 between the data manipulation instructions I1, I2, I3 to be ignored. This means that these instructions are ready for issue and can be selected by the random selection unit 44. Therefore, the data manipulation instructions I1, I2, I3 can be executed in a random order, which increases the level of non-determinism exhibited by the processor. Figure 8 shows an alternative embodiment to that of Figure 3 of the dependency checking mechanism 33' for use in the presence of the IGNORE and DEPEND instructions. In this implementation, the Defined Registers table 36 has two bit masks for each register, allowing each register dependency to be stored in one of two categories: a dependency of crucial importance and a dependency that can be delayed. For each register, reference numeral 90 denotes a bit-mask for a crucial dependency and numeral 92 denotes a bit-mask for a deferrable dependency. Thus, each register is associated with two such bit-masks. In comparison, the dependency checker 33 of Figure 3 provides only a single dependency category for each instruction, i.e. crucial.
Each register of the dependency checker 33' of Figure 8 also has a flag associated with it, the so-called IGNORE flag (95-99). If the IGNORE flag is set, it indicates that dependencies on that register can be ignored for as long as the flag is set. For example, Figure 8 shows the case where the Defined Registers table 36 has IGNORE flags 95, 96, 97, 98, 99 associated with the row pairs holding bit masks for registers R1, R2, R3, R4, R5 respectively. The IGNORE flags are set and reset responsive to detection of the IGNORE and DEPEND instructions at the decode stage 16 by a detect unit 16a, i.e. prior to execution of the instructions. When they reach the execution stage, the IGNORE and DEPEND instructions are executed as NO-OPs.
When the Defined Registers table is loaded, as in the process described above with reference to Figure 4, which one of the two categories of dependency is set depends on the status of the IGNORE flag associated with the pair of bit-masks 90, 92 for each register. In the example of Figure 8, the IGNORE flag 96 has been set to "1", indicating that the dependencies on the associated register R2 can be temporarily delayed. An IGNORE R2 instruction would already have issued to set this flag. The IGNORE flag 96 may be reset to "0" upon detection of a DEPEND R2 instruction.
Thus, for the load operation (see left-hand branch of Figure 4), any subsequent data manipulation instruction that specifies register R2 will set the second category of bit mask 92, indicating that the instruction has dependencies that can be delayed. For data manipulation instructions that do not specify register R2, the relevant bit-masks 90 of the first category, defining a crucial dependency, are set.
For the example, assume the instruction ADD R2, R3, R4 is loaded and that an instruction IGNORE R2 has already set the IGNORE flag 96 for register R2. The ADD instruction is decoded, and the second category of bit-mask 92 is set instead of the first category 90. However, it can be seen from Figure 8 of the alternative embodiment that the mask bits corresponding to the first category 90 are still fed to the OR gate 38, as for the basic embodiment of Figure 3. This means that the OR gate 38 performs an OR operation on the mask bits from the Defined and Used Register Tables 34, 36 as before, which includes the bit-mask of the first category 90 from the Defined Registers Table 36. However, the bit-mask of the first category 90 will not have any of its dependency bits set for register R2. Therefore, any instructions which have a prima facie dependency on register R2 will be treated as though there is no dependency for the purposes of selection of instructions by the random issue unit. That is, the bit mask sent to the dependency matrix table 30 will indicate there is no dependency. Instead these dependencies are stored in the second bit mask 92 associated with the register R2.
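The two-category bookkeeping can be sketched as follows. This is a toy software model, not the hardware: the class and method names are our own, and the slot numbering is hypothetical.

```python
class DefinedRegisters:
    """Toy model of the two-category Defined Registers table of Figure 8."""

    def __init__(self, registers):
        self.crucial = {r: 0 for r in registers}     # bit-masks 90
        self.deferrable = {r: 0 for r in registers}  # bit-masks 92
        self.ignore = {r: False for r in registers}  # IGNORE flags 95-99

    def load(self, reg, slot):
        """Record that the instruction in `slot` defines `reg`; the
        category receiving the bit depends on the IGNORE flag."""
        bit = 1 << slot
        if self.ignore[reg]:
            self.deferrable[reg] |= bit   # dependency can be delayed
        else:
            self.crucial[reg] |= bit      # dependency is crucial

    def visible_mask(self, reg):
        """Only crucial bits are fed to the OR gate 38 and hence to
        the dependency matrix table 30."""
        return self.crucial[reg]

t = DefinedRegisters(["R1", "R2", "R3", "R4"])
t.ignore["R2"] = True           # effect of a prior IGNORE R2
t.load("R2", slot=3)            # e.g. ADD R2, R3, R4 sitting in slot 3
t.load("R1", slot=4)
assert t.visible_mask("R2") == 0        # R2 dependency hidden from issue
assert t.deferrable["R2"] == 0b1000     # ...but remembered for DEPEND
assert t.visible_mask("R1") == 0b10000  # crucial dependency still seen
```

The assertions mirror the text: the ADD's definition of R2 is invisible to the random issue unit, while a definition of an un-ignored register such as R1 remains a crucial dependency.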
By the time the DEPEND R2 instruction is fetched, some, but not necessarily all, of the deferrable-dependency instructions will have been issued. The DEPEND instruction, like the other instructions in the instruction issue table 32, has an associated slot in the dependency matrix table 30. However, that slot does not contain dependencies worked out as for the "ordinary" instructions explained above with reference to Figure 3. Instead, it takes the bit mask of deferred dependencies associated with the register defined in the DEPEND instruction. This constitutes the dependencies on those data manipulation instructions whose dependencies have been deferred but which have yet to be selected for issue by the random selection unit 44. Thus, two important events occur when a DEPEND instruction is loaded:
i) the relevant IGNORE flag is reset (as explained before);
ii) a selector switch 110 selects the second category of bit-mask 92, associated with the register defined in the DEPEND instruction, as the dependency mask for the slot of the dependency matrix table 30 corresponding to the DEPEND instruction.
Thus the DEPEND instruction will not be issued to the random selection unit 44, and therefore completed, until all the delayed dependencies have been resolved.
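The resulting issue order can be illustrated with a small simulation. Again this is a sketch under our own naming: instructions I1-I3 drain in random order, and the DEPEND instruction becomes ready only once its inherited deferred-dependency mask is empty.

```python
import random

def run_block(deferred):
    """Issue the deferred data manipulation instructions in a random
    order; DEPEND may issue only once every deferred dependency in its
    inherited mask has resolved (toy model, names hypothetical)."""
    pending = set(deferred)   # stands in for the deferred bit-mask 92
    trace = []
    while pending:
        pick = random.choice(sorted(pending))  # random selection unit
        trace.append(pick)
        pending.discard(pick)                  # dependency resolved
    trace.append("DEPEND R1")  # mask drained: DEPEND is now issuable
    return trace

trace = run_block(["I1", "I2", "I3"])
assert trace[-1] == "DEPEND R1"                # DEPEND always issues last
assert set(trace[:-1]) == {"I1", "I2", "I3"}   # in some random order
```

Successive runs produce different orderings of I1, I2, I3, but DEPEND R1, and hence the STORE that follows it in Table 3, is always gated behind the whole set.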
It will be understood that an arbitrary number of IGNORE instructions can be outstanding, i.e. up to one per register in the architecture. The example shown in Figure 8 only considers the case where a single IGNORE instruction, i.e. IGNORE R2, has been issued. However, a second IGNORE instruction, e.g. IGNORE R4, could be issued, thereby setting the IGNORE flag 98 corresponding to R4 in the Defined Registers table 36.
An IGNORE instruction can define more than one operand, in which case IGNORE flags are simultaneously set against multiple defined registers. In that case, a single DEPEND instruction defining more than one register can follow, or multiple DEPEND instructions each defining a single operand. In the first case, the bit masks associated with the multiple defined registers need to be ORed before loading into the slot associated with the DEPEND instruction.
It should be appreciated that data manipulation instructions will issue normally and can occur between an IGNORE/DEPEND instruction pair, provided that they do not act on the operand specified by the IGNORE instruction. For example, assume that an additional instruction such as ADD R7, R8, R9 is inserted between the data manipulation instructions I1 and I2 in the code sequence example of Table 3. The IGNORE instruction specifies the register R1, while this ADD instruction specifies different registers R7, R8, R9. So this ADD instruction will be executed normally, meaning that if this instruction is dependent on another then the default dependency checking mechanism of Figure 3 is used and the dependency will be considered crucial.
It should also be appreciated that the IGNORE and DEPEND instructions can be defined without an operand. In this case, by default all the IGNORE flags are automatically set to "1". In such a situation the dependencies on all the data manipulation instructions that exist between the IGNORE/DEPEND instruction pair will be ignored regardless of the operands specified.
Therefore, by introducing two new instructions, IGNORE and DEPEND, into a processor's instruction set, the non-deterministic properties of the processor can be exploited without necessarily impacting its performance. In general terms, the IGNORE/DEPEND instruction pair allows certain types of instructions to be executed in a random order by ignoring their dependencies.
Although the specific example outlined in the invention is directed at cryptography, it should be understood that this invention may be equally applied to any situation where it is desired to keep the environmental impact of the processor non-deterministic, for example reducing resonances in small computing devices. Furthermore, it should be appreciated that the random selection unit described herein is only an example of a possible implementation. The random selection unit which has been described operates on a pseudo-random basis. It would of course be possible to use a random selection unit which operated on a truly random basis. The present invention may include any features disclosed herein either implicitly or explicitly or any generalisation thereof, irrespective of whether it relates to the presently claimed invention. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

CLAIMS:
1. A method of executing a computer program comprising an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order, the method comprising: reading each instruction in the ordered sequence, checking its dependency with respect to its adjacent instructions and storing an associated dependency bit mask; responsive to detection of the ignore instruction, ignoring the dependency bit masks for subsequent data manipulation instructions issued up to detection of the depend instruction, whereby data manipulation instructions can be issued in an arbitrary order.
2. A method according to claim 1, wherein the sequence of instructions includes a depend instruction after the set of data manipulation instructions which causes dependency bit masks for subsequent instructions in the sequence to be utilised.
3. A method according to claim 2, which comprises, responsive to the depend instruction, using the dependency bit masks associated with data manipulation instructions which have not yet issued to delay issue of the depend instruction until all data manipulation instructions in the set have issued.
4. A method according to claim 1, 2 or 3, wherein the ignore instruction specifies at least one operand, and the dependency bit masks are ignored only for the set of data manipulation instructions which define said operand.
5. A method according to claim 1, 2 or 3, wherein said ignore instruction does not specify an operand so that the dependency bit masks of all said subsequent data manipulation instructions regardless of their specified operands are ignored up to detection of said depend instruction.
6. A method according to any preceding claim, wherein the step of checking the dependency comprises: decoding each instruction to identify its operands; and comparing the operands of the decoded instruction with the operands of each of the preceding instructions in the sequence.
7. A method according to claim 6, wherein said step of comparing the operands is performed by a logical OR operation executed on corresponding bits of the instruction operands.
8. A method according to any preceding claim, which comprises selecting instructions for execution including the step of checking for each instruction whether there is a dependency bit set to indicate a dependency between instructions, and selecting said instruction if no dependency bit is set.
9. A method according to claim 8, wherein instructions are selected for execution on a random basis.
10. A method according to claim 6 or 7, wherein each data manipulation instruction defines at least one source register from which the instruction intends to read data and the decoding step comprises decoding the source register(s) and setting corresponding bit masks in a used registers table.
11. A method according to claim 6, 7 or 10, wherein each data manipulation instruction defines a destination register into which the instruction intends to write data and the decoding step comprises decoding the destination register and setting corresponding bit masks in a defined registers table.
12. A processor comprising: a program memory holding a computer program which comprises an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order followed by a depend instruction; a decode unit for decoding each instruction and arranged to detect ignore and depend instructions prior to their execution; a dependency checker for checking the dependency of each instruction with respect to adjacent instructions and for generating an associated dependency bit mask; a store for holding said dependency bit masks; and means responsive to detection of the ignore instruction to cause the dependency bit masks held in the store to be ignored for data manipulation instructions up to detection of the depend instruction.
13. A processor according to claim 12, which comprises means responsive to detection of the depend instruction to use the dependency bit masks associated with data manipulation instructions which have not yet issued to delay issue of the depend instruction until all data manipulation instructions in the set have issued.
14. A processor according to claim 12 or 13, which comprises a used registers table for holding bit masks relating to source registers defined in the data manipulation instructions.
15. A processor according to claim 12, 13 or 14, which comprises a defined registers table for holding bit masks associated with destination registers defined by the data manipulation instructions.
16. A processor according to claim 15, wherein the defined registers table includes for each register a first slot for holding a bit mask defining crucial dependencies associated with that register and a second slot for holding a bit mask defining deferrable dependencies associated with that register.
17. A computer program product comprising program code means including an ordered sequence of instructions including an ignore instruction followed by a set of data manipulation instructions which can be executed in an arbitrary order followed by a depend instruction wherein, when the program product is loaded into a computer and executed, on detection of the ignore instruction dependency bit masks associated with each instruction in the ordered sequence are ignored for subsequent data manipulation instructions issued up to detection of the depend instruction.
Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB0023696.8A GB0023696D0 (en) 2000-09-27 2000-09-27 Computer instructions
GB0023696.8 2000-09-27

Publications (1)

Publication Number Publication Date
WO2002027479A1 true WO2002027479A1 (en) 2002-04-04



Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5745726A (en) * 1995-03-03 1998-04-28 Fujitsu, Ltd Method and apparatus for selecting the oldest queued instructions without data dependencies
US5881308A (en) * 1991-06-13 1999-03-09 International Business Machines Corporation Computer organization for multiple and out-of-order execution of condition code testing and setting instructions out-of-order
EP0924603A2 (en) * 1997-12-16 1999-06-23 Lucent Technologies Inc. Compiler controlled dynamic scheduling of program instructions


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FORREST ET AL: "Building diverse computer systems", OPERATING SYSTEMS, 1997., THE SIXTH WORKSHOP ON HOT TOPICS IN CAPE COD, MA, USA 5-6 MAY 1997, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 5 May 1997 (1997-05-05), pages 67 - 72, XP010226847, ISBN: 0-8186-7834-8 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016111778A1 (en) * 2015-01-07 2016-07-14 Qualcomm Incorporated Devices and methods implementing operations for selective enforcement of task dependencies
US9678790B2 (en) 2015-01-07 2017-06-13 Qualcomm Incorporated Devices and methods implementing operations for selective enforcement of task dependencies

Also Published As

Publication number Publication date
GB0023696D0 (en) 2000-11-08
AU2001290112A1 (en) 2002-04-08

