CN112230995A - Instruction generation method and device and electronic equipment - Google Patents


Info

Publication number: CN112230995A (application CN202011093577.3A)
Authority: CN (China)
Prior art keywords: instruction, data, instructions, simd, generating
Legal status: Granted; currently active
Other languages: Chinese (zh)
Other versions: CN112230995B (granted publication)
Inventors: 孙继芬, 陈钦树, 刘玉佳, 廖述京
Current and original assignee: Guangdong Communications and Networks Institute
Application filed by Guangdong Communications and Networks Institute with priority to CN202011093577.3A; published as CN112230995A and, upon grant, as CN112230995B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3885: Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F 9/3887: Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The application discloses an instruction generation method and apparatus, and an electronic device. The method is applied to a compiler connected to a DSP processing system and comprises the following steps: when source code is received, determining a vectorizable loop body in the source code and vectorizing the instructions of the loop body to generate SIMD instructions; splicing the SIMD instructions into an execution instruction set according to a very long instruction word (VLIW) architecture; and sending the execution instruction set to the DSP processing system so that the DSP processing system performs test processing on data. The method can identify data that can be processed in parallel, vectorize that data to generate a plurality of SIMD instructions, and splice those SIMD instructions into a VLIW bundle that can be processed in parallel, so that the DSP processing system receives the multiple SIMD instructions at once through the VLIW architecture and responds to them in parallel, improving data processing efficiency.

Description

Instruction generation method and device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating an instruction, and an electronic device.
Background
A DSP processor is a chip that implements digital signal processing techniques. SIMD (Single Instruction, Multiple Data) instructions replicate multiple operands and pack them into wide registers so that multiple data elements can be processed at once. In operation, a DSP processor can pipeline data according to a SIMD instruction set and thereby process various digital signals quickly.
However, with the continuing development of artificial intelligence, big data, and multimedia technologies, a large amount of high-precision, homogeneous, and mutually independent data must be processed. If SIMD instructions alone are used for pipelined operation, then although each SIMD instruction can operate on multiple data elements, it can perform the fixed-point operation of only one function at a time; the processing function is single and the processing efficiency is low, making it difficult to meet current processing requirements.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is that, when SIMD instructions are adopted, only an instruction with a single operation function can be responded to for fixed-point operation, so the processing function is single and the processing efficiency is low.
To solve the above problem, an embodiment of the present application provides an instruction generation method. The method is applied to a compiler connected to a DSP processing system, is suitable for execution in a computing device, and comprises at least the following steps:
when source code is received, determining a vectorizable loop body in the source code and vectorizing the instructions of the loop body to generate SIMD instructions;
splicing the SIMD instructions into an execution instruction set according to a very long instruction word (VLIW) architecture;
and sending the execution instruction set to the DSP processing system so that the DSP processing system performs test processing on data.
Further, determining a vectorizable loop body in the source code and vectorizing the instructions of the loop body to generate SIMD instructions includes:
finding the dependency number and the loop count corresponding to the loop body in the source code, wherein the dependency number is the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no mutual dependences, determining the loop body to be a vectorizable inner loop;
and vectorizing each scalar instruction in the inner loop to generate SIMD instructions.
Further, determining a vectorizable loop body in the source code and vectorizing the instructions of the loop body to generate SIMD instructions includes:
finding a loop body in the source code and unrolling it to obtain an unrolled loop body;
obtaining the plurality of scalar instructions corresponding to the unrolled loop body;
and, when the scalar instructions have no mutual dependences, packing them into a vector instruction to generate a SIMD instruction.
Further, the system also includes a global configuration register connected to the DSP processing system and to the compiler respectively;
the generating of SIMD instructions comprises:
receiving a register bit-width value sent by the global configuration register, wherein the register bit-width value is calculated by the global configuration register from the bit width of each datum and the loop count;
determining the allocated bit width of a preset register according to the register bit-width value;
and compiling to generate a SIMD instruction according to the allocated bit width.
An embodiment of the present application further provides an instruction generation apparatus applied to a compiler connected to a DSP processing system. The apparatus includes:
a generating module, configured to determine, when source code is received, a vectorizable loop body in the source code and to vectorize the instructions of the loop body to generate SIMD instructions;
a splicing module, configured to splice the SIMD instructions into an execution instruction set according to a very long instruction word (VLIW) architecture;
and a sending module, configured to send the execution instruction set to the DSP processing system so that the DSP processing system performs test processing on data.
Further, the generating module is further configured to:
find the dependency number and the loop count corresponding to the loop body in the source code, wherein the dependency number is the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no mutual dependences, determine the loop body to be a vectorizable inner loop;
and vectorize each scalar instruction in the inner loop to generate SIMD instructions.
Further, the generating module is further configured to:
find a loop body in the source code and unroll it to obtain an unrolled loop body;
obtain the plurality of scalar instructions of the unrolled loop body;
and, when the scalar instructions have no mutual dependences, pack them into a vector instruction to generate a SIMD instruction.
Further, the system also includes a global configuration register connected to the DSP processing system and to the compiler respectively;
the generating module is further configured to:
receive a register bit-width value sent by the global configuration register, wherein the register bit-width value is calculated by the global configuration register from the bit width of each datum and the loop count;
determine the allocated bit width of a preset register according to the register bit-width value;
and compile to generate a SIMD instruction according to the allocated bit width.
Further, an embodiment of the present application provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the instruction generation method described in the above embodiments.
Further, the present application also provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the instruction generation method of the foregoing embodiments.
Compared with the prior art, this embodiment can find a loop body in source code, vectorize the instructions operating on the data in the loop body to obtain a plurality of vector instructions, generate a plurality of SIMD instructions from those vector instructions, and splice the SIMD instructions into a VLIW bundle that can be processed in parallel. The DSP processing system can therefore receive the multiple SIMD instructions at once through the VLIW architecture and respond to them in parallel, improving its processing capability. Because the SIMD instructions are mutually independent and may have different operation modes, the DSP processing system can perform different operations after responding to them in parallel, which improves both its data processing capability and its efficiency. Meanwhile, the data corresponding to each SIMD instruction can be stored in different registers, so that the DSP processing system can conveniently fetch the data from those registers, further improving data processing efficiency.
Drawings
FIG. 1 is a diagram of an application environment of an instruction generation method in one embodiment;
FIG. 2 is a flowchart of a first embodiment of an instruction generation method;
FIG. 3 is a flowchart of the operation of an instruction generation method in one embodiment;
FIG. 4 is a flowchart of a second embodiment of an instruction generation method;
FIG. 5 is a flowchart of a third embodiment of an instruction generation method;
FIG. 6 is a flowchart of a fourth embodiment of an instruction generation method;
FIG. 7 is a register bit-width allocation diagram of an instruction generation method in one embodiment;
FIG. 8 is a block diagram of the structure of an instruction generation apparatus in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example one
With the continuing development of artificial intelligence, big data, and multimedia technologies, the volume of data to be processed keeps growing and the requirements keep rising. When a large amount of high-precision, homogeneous, and mutually independent data must be processed, a processor that uses a SIMD instruction set for pipelined processing can execute the fixed-point operation of only one function at a time; the processing function is single, the processing efficiency is low, and current processing requirements are difficult to meet.
In order to solve the above problem, a method for generating an instruction provided by the embodiment of the present application will be described and explained in detail by the following specific embodiments.
Referring to fig. 1, an application environment diagram of an instruction generation method is provided. The method may be applied to a compiler 110 communicatively connected to a DSP processing system 120. The main workflow of the compiler 110 may be: source code → preprocessor → compiler → object code → linker → executable program. The DSP processing system 120 may include a central processing unit, a main memory, registers, and input-output interfaces.
In an alternative embodiment, the central processing unit in the DSP processing system 120 may be a DSP processor. The DSP processor may have a Harvard architecture with separate program and data stores, along with a dedicated hardware multiplier, and may perform different pipelined operations, so that various digital signal processing algorithms can be implemented quickly.
As shown in fig. 2, this embodiment provides an instruction generation method, illustrated here as applied to a compiler. The compiler may specifically be the compiler 110 of fig. 1 described above.
Referring to fig. 2, the instruction generation method specifically includes the following steps:
S11: when source code is received, determine a vectorizable loop body in the source code and vectorize the instructions of the loop body to generate SIMD instructions.
The source code may be code written by a user, specifically source code written in a language supported by the user's development tool: a set of explicit rules for representing information in discrete form with characters, symbols, or signals. In alternative embodiments, the user may write code in different programming languages, such as the C language.
Through the source code, the compiler can accurately determine the data operations the user requires, generate the corresponding instructions from the user's source code, and thereby enable the DSP processing system to execute the corresponding operations.
The loop body may be the repeatedly executed step of a loop structure in the source code or in an algorithm. Because the repeated step of a loop body executes conditionally, the data in the loop can be pipelined only when the condition of the repeated step is satisfied, which is why the loop body must be found in the source code.
A vector architecture is required for parallel operation on data. Because a vector architecture spends a load or store operation once per vector rather than once per element, the frequency of pipeline interlocks can be reduced. In a vector architecture, each vector instruction needs at most one pipeline stall, stalls caused by dependences among many scalar instructions do not occur, and the required instruction bandwidth is greatly reduced. Moreover, whereas a scalar architecture needs one instruction per data operation, a vector architecture needs only one vector instruction per vector operation, greatly improving processing efficiency.
S12: splice the SIMD instructions into an execution instruction set according to the very long instruction word (VLIW) architecture.
A Very Long Instruction Word (VLIW) architecture issues an extremely long instruction combination: several instructions connected together and fetched as one bundle.
In this embodiment, the compiler may splice a plurality of SIMD instructions into an execution instruction set in the VLIW format and store it in the corresponding instruction register; when the DSP processing system needs to process data, it can fetch the execution instruction set from that instruction register.
S13: send the execution instruction set to the DSP processing system so that the DSP processing system performs test processing on data.
The test processing may be processing of multimedia data, for example audio data, image data, or video data.
During test processing, the DSP processor of the DSP processing system can respond to the execution instruction set by fetching a plurality of SIMD instructions simultaneously and responding to them in parallel, thereby processing multiple groups of data in parallel. Because each SIMD instruction corresponds to one group of data, pipelined operation on multiple groups of data can begin once the SIMD instructions have been responded to, and the loading, storing, and computation of the data can proceed in pipelined fashion.
In actual operation, as shown in fig. 3, the DSP processor receives an execution instruction set in VLIW format and decodes it, obtaining the address codes of the several SIMD instructions it contains; the DSP processor then fetches those SIMD instructions simultaneously according to the address codes. The SIMD instructions can then be responded to in parallel, yielding the groups of data addresses corresponding to each SIMD instruction, i.e., the registered addresses of the data required by each instruction's operation. The DSP processor can then fetch the groups of data from different vector registers according to those addresses and perform the corresponding processing, each vector register storing the data for the execution of one SIMD instruction.
In this embodiment, a loop body is found in the source code and the instructions on the data in the loop body are vectorized to obtain a plurality of vector instructions; the vector instructions generate a plurality of SIMD instructions, which are then spliced into a VLIW bundle that can be processed in parallel. The DSP processing system can thus receive the multiple SIMD instructions at once through the VLIW architecture and respond to them in parallel, improving its processing capability. Because the SIMD instructions are mutually independent and may have different operation modes, the DSP processing system can perform different operations after responding to them in parallel, improving both its data processing capability and its efficiency. Meanwhile, the data corresponding to each SIMD instruction can be stored in different registers, so that the DSP processing system can conveniently fetch the data from those registers, further improving data processing efficiency.
Example two
In actual operation, when the data in a loop body needs to be operated on in parallel, the loop can be broken apart for parallel processing; but if data that requires loop processing is broken apart, the original loop is destroyed, the data must be readjusted during processing, and processing efficiency drops.
To solve the above problem, this embodiment provides an instruction generation method, shown in fig. 4 and illustrated here as applied to a compiler. The compiler may specifically be the compiler 110 of fig. 1 described above.
Referring to fig. 4, the instruction generation method specifically includes the following steps:
S21: when source code is received, find the dependency number and the loop count corresponding to the loop body in the source code, wherein the dependency number is the preset instruction width divided by the width of the data to be processed.
In this embodiment, the dependency number is the width of the instruction that processes the data divided by the width of the data to be processed. For example, when the instruction width is 128 bits and the data to be processed is 32 bits, 128/32 = 4, so the dependency number is 4.
The loop count is the number of times the loop body performs its repeated operation.
And S22, when the loop frequency is determined to be more than the dependency number and the data in the loop body are not dependent, determining the loop body as an inner loop capable of vectorizing processing.
In an optional embodiment, the compiler may use loop-level automatic vectorization to find the dependency number and loop count of the loop body, determine whether the loop count is greater than the dependency number, and determine whether the data in the loop body depend on one another across iterations. The compiler can decide whether a loop can be vectorized by searching the source code for the innermost vectorizable loop, in particular by searching for factors that prevent vectorization.
Specifically, the dependency number, which is the vectorization factor, and the loop count are found. The vectorization factor is the width of the instruction that processes the data divided by the width of the data to be processed.
For example:

    for (i = 0; i < 1024; i++)
        a[i+1] = a[i] + b[i];

When the instruction width is 128 bits and the data to be processed is 32 bits, the vectorization factor is 128/32 = 4. Although the loop count of 1024 is greater than the dependency number, a[i+1] and a[i] have a true loop-carried data dependence, so the loop body cannot be operated on in parallel and cannot be vectorized.
Another example:

    for (i = 0; i < 3; i++)
        a[i] = b[i] + c[i];

When the vectorization factor is 4, the loop count of 3 is less than 4; although there is no true loop-carried dependence between the data, the iteration count is too short (smaller than the vectorization factor), so automatic loop vectorization and the corresponding function call cannot be performed.
As another example:

    for (i = 0; i < 32; i++)
        a[i] = b[i] + c[i];

When the vectorization factor is 4, the loop count of 32 is greater than 4, and a[i], b[i], and c[i] have no dependences on one another, so automatic loop vectorization and the corresponding function call can be performed.
In this embodiment, the compiler may determine whether a dependency relationship exists between data through semantic analysis of data dependency.
When it is determined that the data in the loop body have no dependences and the loop count is greater than the dependency number, parallel processing may be performed: the data in the loop body can be divided into groups of mutually independent data, and each group processed in parallel. Repeating these steps, the several loop bodies in the source code can be judged, several vectorizable loop bodies determined, and several groups of independent data obtained.
Because the data within each group are mutually independent, they can be processed in parallel and operated on in pipelined fashion; and because the groups are independent of one another, the groups can likewise be processed in parallel.
S23: vectorize each scalar instruction in the inner loop to generate SIMD instructions.
In this embodiment, when the compiler determines that the loop body is vectorizable, i.e., that the inner loop can be vectorized, it may recompile to generate a new loop whose trip count is the original count divided by the vectorization factor, and replace each scalar instruction in the loop body with the corresponding vector instruction.
For example, the original specification shows (as images) a listing in which the scalar loop is rewritten into its vectorized form.
In this embodiment, the data processed by a SIMD instruction is placed in a vector register, whereas the data processed by a scalar instruction is placed in an ordinary register. By replacing the scalar instruction for each datum with a vector instruction, the original data is vectorized and can be processed in parallel.
S24: splice the SIMD instructions into an execution instruction set according to the very long instruction word (VLIW) architecture.
This step is the same as in the above embodiment; for detailed analysis, refer to the above embodiment. It is not repeated here.
S25: send the execution instruction set to the DSP processing system so that the DSP processing system performs calculation processing on data.
This step is the same as in the above embodiment; for detailed analysis, refer to the above embodiment. It is not repeated here.
In this embodiment, by finding the loop count and the dependency number of a loop body in the source code, and by determining that the loop count is greater than the dependency number and that the data in the loop body have no mutual dependences, it can be established that the data in the loop body can be operated on in parallel. This avoids the situation in which the data must be readjusted when the loop is broken apart for processing, while ensuring that the loop body executes smoothly, thereby further improving data processing efficiency.
EXAMPLE III
In actual operation, not every data-processing task needs many or large loop operations; a large amount of data may need only simple, loop-free calculation, with no dependences among the data items.
To solve the above problem, this embodiment provides an instruction generation method, shown in fig. 5 and illustrated here as applied to a compiler. The compiler may specifically be the compiler 110 of fig. 1 described above.
Referring to fig. 5, the instruction generation method specifically includes the following steps:
S31: when source code is received, find a loop body in the source code and unroll it to obtain an unrolled loop body.
In this embodiment, opening the loop body refers to loop unrolling (loop unwinding), an optimization that may trade program size for execution speed. It can be completed automatically by the compiler.
Specifically, loop unrolling is accomplished by copying the loop-body code several times. Unrolling enlarges the instruction-scheduling window and reduces the overhead of loop branch instructions, which in turn enables better data prefetching.
S32: obtain the plurality of scalar instructions corresponding to the unrolled loop body.
After the loop body is unrolled, its scalar instructions may be fetched; each scalar instruction corresponds to the processing of the data of one original iteration.
S33: when the scalar instructions have no mutual dependences, pack them into a vector instruction to generate a SIMD instruction.
In this embodiment, since each scalar instruction processes one datum, whether the scalar instructions depend on one another can be determined by checking whether the data they touch depend on one another.
Specifically, when the scalar instructions have no mutual dependences, they are packed into a vector instruction.
An example of this packing is given in the code listings shown in the accompanying figures.
S34, splicing the SIMD instructions into an execution instruction set according to a very long instruction word (VLIW) architecture.
This step is the same as in the above embodiment; for detailed analysis, refer to the above embodiment. It is not repeated here.
S35, sending the execution instruction set to the DSP processing system so that the DSP processing system performs calculation processing on the data.
This step is the same as in the above embodiment; for detailed analysis, refer to the above embodiment. It is not repeated here.
In this embodiment, unrolling the loop body increases the instruction scheduling space and reduces the overhead of loop branch instructions, so that data prefetching is better achieved. Whether a dependency relationship exists among multiple data items can then be determined; when no dependency exists, the instructions for the multiple data items can be packed into vector instructions to generate SIMD instructions, and the chip can process multiple independent data items in parallel through the SIMD instructions, thereby shortening the data processing time and improving the data processing efficiency.
Embodiment four
In actual operation, the calculation modes used to process data differ, and the bit width of each datum differs, so the register capacity required for processing the data also differs. To ensure smooth data transmission, the maximum bit width of the register is generally used to hold the data, so that the transmission requirement of the data can always be satisfied. However, some workloads process a large number of homogeneous, mutually independent data items of small bit width. Image data is a typical example: common image data types such as RGB565, RGBA8888 or YUV422 represent one component of one pixel with data of 8 bits or fewer. If a 512-bit-wide processor is used to process a large amount of 8-bit data, a large amount of register resources is wasted and the power consumption of the processor increases. Therefore, when the compiler compiles and generates instructions, the bit width required for processing the data needs to be determined, so that the register bit width can be allocated according to the data bit width and waste of register resources is avoided.
In order to solve the above problem, the present embodiment provides a method for generating an instruction, and as shown in fig. 6, the technical features of the present embodiment may be applied to embodiment two or embodiment three, specifically to the SIMD instruction generating step of step S23 of embodiment two or of step S33 of embodiment three.
Referring to fig. 6, the step of generating SIMD instructions may specifically include the following steps:
S41, receiving the register bit width value sent by the global configuration register, wherein the register bit width value is calculated by the global configuration register according to the bit width of each datum and the loop count.
In this embodiment, a globally configurable register (VPE_SEL) may also be involved. It is connected to the DSP processing system and the compiler respectively, and is visible to both the DSP processing system and the compiler.
In a specific implementation, a DSP processor of the DSP processing system may be provided with one or more vector issue slots, each of which is an issue slot of the DSP processor. One vector issue slot may correspond to one SIMD instruction, and the number of vector issue slots may be the same as the number of SIMD instructions. For example, if the execution instruction set of the very long instruction word architecture can run 5 SIMD instructions in parallel, 5 vector issue slots may be set. Each vector issue slot may be provided with 4 Vector Processing Elements (VPEs), and each vector processing element includes a 640-bit vector unit and a 640-bit vector register. Each vector issue slot therefore corresponds to a total bit width of 2560 bits.
When the compiler obtains the bit width of each datum and the loop count required for processing the data, it may send them to the globally configurable register. The globally configurable register may then calculate the register bit width value that needs to be occupied according to the bit width of each datum and the loop count, and send the register bit width value required for processing the data back to the compiler.
In actual practice, the compiler may need to compile multiple loop bodies, for example:
Define int DM[N], TM[N], A[N], E[N]; int B, C, D, x, y, z;
1. For (i = 0; i < 20; i++)
{
    // calculate:
    A[i] = DM[i] * TM[i] + A[i];
}
2. B = C + D; x = y * z;
3. For (i = 0; i < 20; i++)
{
    // calculate:
    E[i] = A[i] + E[i];
}
Assuming the vector register supports a length N of 20 elements and the required length does not exceed 20, the above operations can each be performed in parallel using one vector processing unit.
Operation 1 and operation 2 have no data dependency, and their data are independent of each other, so they can be executed in the execution instruction set of the same very long instruction word architecture. Operation 3, however, has a data dependency on operation 1, so the two cannot be executed in parallel at the same time. After passing through the compiler, the code may become instructions like the following:
Op1 A,DM,TM;Op2 B,C,D;Op3 x,y,z
---- Op1 represents a multiply-accumulate operation; Op2 represents an addition operation; Op3 represents an integer multiplication. These three instructions may be executed in parallel.
Op4 E,A,E
---- Op4 represents a vector addition. Since it has a data dependency on Op1, it is issued in the cycle following the first 3 instructions.
If the vector register length N is 20 and the user needs to improve the data processing accuracy, the data bit widths of DM and TM increase as the accuracy of the processed data increases. For DM[40], DM[60] or DM[80], that is, N = 40, N = 60 or even N = 80, the instruction needs to be transmitted in two or more cycles, and more vector processing units need to be used for the processing.
For better parallel computation, the globally configurable register may take the values 0x1, 0x3, 0x7 and 0xf, respectively representing the use of 1, 2, 3 or 4 vector processing units. It determines whether the register bit width value required to process the data is 640 bits, 1280 bits, 1920 bits or 2560 bits, and transmits that value to the compiler.
S42, determining the allocation bit width of the preset register according to the register bit width value.
In this embodiment, the compiler may allocate the bit width of the register according to the register bit width value that the globally configurable register reports is needed for processing the data, so that the register can process the data according to the allocated bit width.
As shown in fig. 7, assuming the register is 128 bits and the required bit width value is 128 bits: when the data to be processed are 32 bits each, the register may be divided into four 32-bit lanes; when the data to be processed are 8 bits each, the register may be divided into sixteen 8-bit lanes.
The preset register may be a vector register of the DSP processing system.
S43, compiling according to the allocated bit width to generate the SIMD instruction.
In this embodiment, the compiler may compile the instruction for each loop body operation using the allocated bit width to obtain the SIMD instruction.
In this embodiment, the bit width required for processing data is determined through the globally configurable register, so that the compiler can compile instructions more reasonably, generate a plurality of SIMD instructions, and make reasonable use of register resources. Provided the processing requirement is met, waste of register resources is avoided and the power consumption of the register is reduced.
Embodiment five
In one embodiment, as shown in fig. 8, there is further provided an instruction generating apparatus applied to a compiler, where the compiler is connected to a DSP processing system, the apparatus including:
a generating module 501, configured to, when a source code is received, determine a loop body capable of being vectorized from the source code, and perform vectorization processing on an instruction of the loop body to generate an SIMD instruction;
a splicing module 502, configured to splice the SIMD instructions into an execution instruction set according to a very long instruction word architecture;
a sending module 503, configured to send the execution instruction set to the DSP processing system, so that the DSP processing system performs calculation processing on the data.
Further, the generating module is further configured to:
searching the source code for a dependency number and a loop count corresponding to the loop body, wherein the dependency number is smaller than or equal to the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no dependency on one another, determining the loop body to be a vectorizable inner-layer loop;
and vectorizing each scalar instruction in the inner-layer loop to generate a SIMD instruction.
Further, the generating module is further configured to:
searching the source code for a loop body and unrolling it to obtain an unrolled loop body;
respectively acquiring a plurality of scalar instructions of the unrolled loop body;
and when the plurality of scalar instructions do not have a dependency relationship with each other, packing the plurality of scalar instructions into a vector instruction to generate a SIMD instruction.
Further, the system also involves a global configuration register, which is connected to the DSP processing system and the compiler respectively;
the generating module is further configured to:
receive the register bit width value sent by the global configuration register, wherein the register bit width value is calculated by the global configuration register according to the bit width of each datum and the loop count;
determine the allocation bit width of a preset register according to the register bit width value;
and compile and generate a SIMD instruction according to the allocated bit width.
In one embodiment, there is provided an electronic device including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the instruction generation method of each of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, which stores computer-executable instructions for causing a computer to perform the steps of the instruction generation method of each of the above embodiments.
The foregoing is a preferred embodiment of the present application. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present application, and these modifications and improvements are also regarded as falling within the protection scope of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Claims (10)

1. A method for generating an instruction, applied to a compiler, the compiler being connected to a DSP processing system, the method comprising:
when source code is received, determining a vectorizable loop body from the source code, and performing vectorization processing on the instructions of the loop body to generate a SIMD instruction;
splicing the SIMD instructions into an execution instruction set according to a very long instruction word architecture;
and sending the execution instruction set to the DSP processing system to enable the DSP processing system to perform calculation processing on data.
2. The method for generating an instruction according to claim 1, wherein determining a loop body capable of vectorization from the source code, and performing vectorization processing on the instruction of the loop body to generate a SIMD instruction comprises:
searching the source code for a dependency number and a loop count corresponding to the loop body, wherein the dependency number is smaller than or equal to the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no dependency on one another, determining the loop body to be a vectorizable inner-layer loop;
and vectorizing each scalar instruction in the inner-layer loop to generate a SIMD instruction.
3. The method for generating an instruction according to claim 1, wherein determining a loop body capable of vectorization from the source code, and performing vectorization processing on the instruction of the loop body to generate a SIMD instruction comprises:
searching the source code for a loop body and unrolling it to obtain an unrolled loop body;
respectively acquiring a plurality of scalar instructions corresponding to the unrolled loop body;
and when the plurality of scalar instructions do not have a dependency relationship with each other, packing the plurality of scalar instructions into a vector instruction to generate a SIMD instruction.
4. The method for generating instructions according to claim 2 or 3, further involving a global configuration register, the global configuration register being connected to the DSP processing system and the compiler respectively;
the generating a SIMD instruction comprises:
receiving the register bit width value sent by the global configuration register, wherein the register bit width value is calculated by the global configuration register according to the bit width of each datum and the loop count;
determining the allocation bit width of a preset register according to the register bit width value;
and compiling and generating a SIMD instruction according to the allocated bit width.
5. An instruction generating apparatus, applied to a compiler, the compiler being connected to a DSP processing system, the apparatus comprising:
the generating module is used for determining a loop body capable of being vectorized from a source code when the source code is received, and carrying out vectorization processing on an instruction of the loop body to generate a SIMD instruction;
the splicing module is used for splicing the SIMD instructions into an execution instruction set according to a very long instruction word architecture;
and the sending module is used for sending the execution instruction set to the DSP processing system to enable the DSP processing system to perform calculation processing on data.
6. The apparatus for generating instructions according to claim 5, wherein the generating module is further configured to:
searching the source code for a dependency number and a loop count corresponding to the loop body, wherein the dependency number is smaller than or equal to the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no dependency on one another, determining the loop body to be a vectorizable inner-layer loop;
and vectorizing each scalar instruction in the inner-layer loop to generate a SIMD instruction.
7. The apparatus for generating instructions according to claim 5, wherein the generating module is further configured to:
searching the source code for a loop body and unrolling it to obtain an unrolled loop body;
respectively acquiring a plurality of scalar instructions of the unrolled loop body;
and when the plurality of scalar instructions do not have a dependency relationship with each other, packing the plurality of scalar instructions into a vector instruction to generate a SIMD instruction.
8. The apparatus of claim 6 or 7, further comprising a global configuration register, the global configuration register being coupled to the DSP processing system and the compiler respectively;
the generating module is further configured to:
receive the register bit width value sent by the global configuration register, wherein the register bit width value is calculated by the global configuration register according to the bit width of each datum and the loop count;
determine the allocation bit width of a preset register according to the register bit width value;
and compile and generate a SIMD instruction according to the allocated bit width.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of generating instructions according to any of claims 1 to 4 when executing the program.
10. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform a method of generating instructions of any one of claims 1 to 4.
CN202011093577.3A 2020-10-13 2020-10-13 Instruction generation method and device and electronic equipment Active CN112230995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011093577.3A CN112230995B (en) 2020-10-13 2020-10-13 Instruction generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011093577.3A CN112230995B (en) 2020-10-13 2020-10-13 Instruction generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112230995A true CN112230995A (en) 2021-01-15
CN112230995B CN112230995B (en) 2024-04-09

Family

ID=74113468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011093577.3A Active CN112230995B (en) 2020-10-13 2020-10-13 Instruction generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112230995B (en)


Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1174353A (en) * 1996-08-19 1998-02-25 三星电子株式会社 Single-instruction-multiple-data processing using multiple banks of vector registers
CN1278342A (en) * 1997-11-07 2000-12-27 博普斯公司 Method and apparatus for efficient synchronous MIMD operation with ivLIM PE-to-PE communication
CN102375805A (en) * 2011-10-31 2012-03-14 中国人民解放军国防科学技术大学 Vector processor-oriented FFT (Fast Fourier Transform) parallel computation method based on SIMD (Single Instruction Multiple Data)
CN102750133A (en) * 2012-06-20 2012-10-24 中国电子科技集团公司第五十八研究所 32-Bit triple-emission digital signal processor supporting SIMD
CN103279327A (en) * 2013-04-28 2013-09-04 中国人民解放军信息工程大学 Automatic vectorizing method for heterogeneous SIMD expansion components
US20140215183A1 (en) * 2013-01-29 2014-07-31 Advanced Micro Devices, Inc. Hardware and software solutions to divergent branches in a parallel pipeline
CN104025033A (en) * 2011-12-30 2014-09-03 英特尔公司 Simd variable shift and rotate using control manipulation
CN104641351A (en) * 2012-10-25 2015-05-20 英特尔公司 Partial vectorization compilation system
US20170177342A1 (en) * 2015-12-22 2017-06-22 Intel IP Corporation Instructions and Logic for Vector Bit Field Compression and Expansion
CN107992330A (en) * 2012-12-31 2018-05-04 英特尔公司 Processor, method, processing system and the machine readable media for carrying out vectorization are circulated to condition
CN107992376A (en) * 2017-11-24 2018-05-04 西安微电子技术研究所 Dsp processor data storage Active Fault Tolerant method and apparatus
CN108268283A (en) * 2016-12-31 2018-07-10 英特尔公司 For operating the computing engines framework data parallel to be supported to recycle using yojan


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YANGZHAO YANG等: "An Approach to Enhance Loop Performance for Multicluster VLIW DSP Processor", IEEE, pages 1 - 8 *
XU Huaye: "Research on Vectorization-Related Compilation Techniques for Multi-Cluster VLIW DSP", China Master's Theses Electronic Journal, Information Science and Technology, no. 10, pages 29 - 57 *
WANG Min; WANG Hongmei; ZHANG Tiejun; SHAN Rui; WANG Donghui: "Design and Implementation of a Compiler for the VLIW DSP Architecture", Microcomputer Applications, no. 07, pages 49 - 54 *
HUANG Shengbing et al.: "SIMD Compilation Optimization Supporting Single/Double-Word Mode Selection on Clustered VLIW DSP", Journal of Computer Applications, vol. 35, no. 8, pages 2371 - 2374 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719559A (en) * 2022-07-20 2023-09-08 广州众远智慧科技有限公司 Method and device for infrared scanning
CN116719559B (en) * 2022-07-20 2024-06-11 广州众远智慧科技有限公司 Method and device for infrared scanning

Also Published As

Publication number Publication date
CN112230995B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN110689138B (en) Operation method, device and related product
US10372429B2 (en) Method and system for generating accelerator program
WO2021000970A1 (en) Deep learning algorithm compiling method, device, and related product.
JP3896087B2 (en) Compiler device and compiling method
EP2876555B1 (en) Method of scheduling loops for processor having a plurality of functional units
US20020095666A1 (en) Program optimization method, and compiler using the same
WO2021000971A1 (en) Method and device for generating operation data and related product
CN111104120A (en) Neural network compiling method and system and corresponding heterogeneous computing platform
US6934938B2 (en) Method of programming linear graphs for streaming vector computation
CN112269581B (en) Memory coupling compiling method and system for reconfigurable chip
US20230093393A1 (en) Processor, processing method, and related device
US12039305B2 (en) Method for compilation, electronic device and storage medium
CN112230995A (en) Instruction generation method and device and electronic equipment
CN114565102A (en) Method, electronic device and computer program product for deploying machine learning model
CN116594682A (en) Automatic testing method and device based on SIMD library
CN116861359A (en) Operator fusion method and system for deep learning reasoning task compiler
US11573777B2 (en) Method and apparatus for enabling autonomous acceleration of dataflow AI applications
JP2008250838A (en) Software generation device, method and program
CN115951936B (en) Chip adaptation method, device, equipment and medium of vectorization compiler
JP2004021890A (en) Data processor
CN117331541B (en) Compiling and operating method and device for dynamic graph frame and heterogeneous chip
JP4158239B2 (en) Information processing apparatus and method, and recording medium
WO2021000638A1 (en) Compiling method and device for deep learning algorithm, and related product
CN117492836A (en) Data flow analysis method for variable length vector system structure
CN114237711A (en) Vector instruction processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant