CN112230995A - Instruction generation method and device and electronic equipment - Google Patents


Info

Publication number: CN112230995A (application CN202011093577.3A)
Authority: CN (China)
Prior art keywords: instruction, data, instructions, simd, generating
Legal status: Granted; currently active
Other languages: Chinese (zh)
Other versions: CN112230995B (granted publication)
Inventors: 孙继芬, 陈钦树, 刘玉佳, 廖述京
Current and original assignee: Guangdong Communications and Networks Institute
Application filed by Guangdong Communications and Networks Institute with priority to CN202011093577.3A; published as CN112230995A and, upon grant, as CN112230995B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3885: Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F 9/3887: Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The application discloses an instruction generation method and apparatus, and an electronic device. The method is applied to a compiler connected to a DSP processing system and comprises the following steps: when source code is received, determining a vectorizable loop body in the source code and vectorizing the instructions of the loop body to generate SIMD instructions; splicing the SIMD instructions into an execution instruction set according to a very long instruction word (VLIW) architecture; and sending the execution instruction set to the DSP processing system so that the DSP processing system performs test processing on data. The method can identify data that can be processed in parallel, vectorize that data to generate a plurality of SIMD instructions, and splice those SIMD instructions into a VLIW bundle that can be processed in parallel, so that the DSP processing system receives the multiple SIMD instructions at once through the VLIW architecture and responds to them in parallel, improving data processing efficiency.

Description

Instruction generation method and device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating an instruction, and an electronic device.
Background
A DSP processor is a chip that implements digital signal processing techniques. SIMD (Single Instruction, Multiple Data) instructions replicate multiple operands and pack them into wide registers so that multiple data elements can be processed at once. In operation, a DSP processor can pipeline data according to a SIMD instruction set and thereby process various digital signals quickly.
However, with the continuing development of artificial intelligence, big data, and multimedia technologies, a large amount of high-precision, homogeneous, and mutually independent data must be processed. If SIMD instructions alone are used for pipelined operation, then although each SIMD instruction can operate on multiple data elements, it can perform the fixed-point operation of only one function at a time; the processing function is single and the processing efficiency is low, making it difficult to meet current processing requirements.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is that, when SIMD instructions are adopted, only an instruction with a single operation function can be responded to for fixed-point operation, so the processing function is single and the processing efficiency is low.
To solve the above problem, an embodiment of the present application provides an instruction generation method. The method is applied to a compiler connected to a DSP processing system, is suitable for execution in a computing device, and comprises at least the following steps:
when source code is received, determining a vectorizable loop body in the source code and vectorizing the instructions of the loop body to generate SIMD instructions;
splicing the SIMD instructions into an execution instruction set according to a very long instruction word (VLIW) architecture;
and sending the execution instruction set to the DSP processing system so that the DSP processing system performs test processing on data.
Further, determining a vectorizable loop body in the source code and vectorizing the instructions of the loop body to generate SIMD instructions includes:
finding the dependency number and the loop count corresponding to the loop body in the source code, wherein the dependency number is the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no mutual dependences, determining the loop body to be a vectorizable inner loop;
and vectorizing each scalar instruction in the inner loop to generate SIMD instructions.
Further, determining a vectorizable loop body in the source code and vectorizing the instructions of the loop body to generate SIMD instructions includes:
finding a loop body in the source code and unrolling it to obtain an unrolled loop body;
obtaining the plurality of scalar instructions corresponding to the unrolled loop body;
and, when the scalar instructions have no mutual dependences, packing them into a vector instruction to generate a SIMD instruction.
Further, the system also includes a global configuration register connected to the DSP processing system and to the compiler respectively;
the generating of SIMD instructions comprises:
receiving a register bit-width value sent by the global configuration register, wherein the register bit-width value is calculated by the global configuration register from the bit width of each datum and the loop count;
determining the allocated bit width of a preset register according to the register bit-width value;
and compiling to generate a SIMD instruction according to the allocated bit width.
An embodiment of the present application further provides an instruction generation apparatus applied to a compiler connected to a DSP processing system. The apparatus includes:
a generating module, configured to determine, when source code is received, a vectorizable loop body in the source code and to vectorize the instructions of the loop body to generate SIMD instructions;
a splicing module, configured to splice the SIMD instructions into an execution instruction set according to a very long instruction word (VLIW) architecture;
and a sending module, configured to send the execution instruction set to the DSP processing system so that the DSP processing system performs test processing on data.
Further, the generating module is further configured to:
find the dependency number and the loop count corresponding to the loop body in the source code, wherein the dependency number is the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no mutual dependences, determine the loop body to be a vectorizable inner loop;
and vectorize each scalar instruction in the inner loop to generate SIMD instructions.
Further, the generating module is further configured to:
find a loop body in the source code and unroll it to obtain an unrolled loop body;
obtain the plurality of scalar instructions of the unrolled loop body;
and, when the scalar instructions have no mutual dependences, pack them into a vector instruction to generate a SIMD instruction.
Further, the system also includes a global configuration register connected to the DSP processing system and to the compiler respectively;
the generating module is further configured to:
receive a register bit-width value sent by the global configuration register, wherein the register bit-width value is calculated by the global configuration register from the bit width of each datum and the loop count;
determine the allocated bit width of a preset register according to the register bit-width value;
and compile to generate a SIMD instruction according to the allocated bit width.
Further, an embodiment of the present application provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the instruction generation method described in the above embodiments.
Further, the present application also provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the instruction generation method of the foregoing embodiments.
Compared with the prior art, this embodiment can find a loop body in source code, vectorize the instructions operating on the data in the loop body to obtain a plurality of vector instructions, generate a plurality of SIMD instructions from those vector instructions, and splice the SIMD instructions into a VLIW bundle that can be processed in parallel. The DSP processing system can therefore receive the multiple SIMD instructions at once through the VLIW architecture and respond to them in parallel, improving its processing capability. Because the SIMD instructions are mutually independent and may have different operation modes, the DSP processing system can perform different operations after responding to them in parallel, which improves both its data processing capability and its efficiency. Meanwhile, the data corresponding to each SIMD instruction can be stored in different registers, so that the DSP processing system can conveniently fetch the data from those registers, further improving data processing efficiency.
Drawings
FIG. 1 is a diagram of an application environment of an instruction generation method in one embodiment;
FIG. 2 is a flowchart of a first embodiment of an instruction generation method;
FIG. 3 is a flowchart of the operation of an instruction generation method in one embodiment;
FIG. 4 is a flowchart of a second embodiment of an instruction generation method;
FIG. 5 is a flowchart of a third embodiment of an instruction generation method;
FIG. 6 is a flowchart of a fourth embodiment of an instruction generation method;
FIG. 7 is a register bit-width allocation diagram of an instruction generation method in one embodiment;
FIG. 8 is a block diagram of the structure of an instruction generation apparatus in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example one
With the continuing development of artificial intelligence, big data, and multimedia technologies, the volume of data to be processed keeps growing and the requirements keep rising. When a large amount of high-precision, homogeneous, and mutually independent data must be processed, a processor that uses a SIMD instruction set for pipelined processing can execute the fixed-point operation of only one function at a time; the processing function is single, the processing efficiency is low, and current processing requirements are difficult to meet.
In order to solve the above problem, a method for generating an instruction provided by the embodiment of the present application will be described and explained in detail by the following specific embodiments.
Referring to fig. 1, an application environment diagram of an instruction generation method is provided. The method may be applied to a compiler 110 communicatively connected to a DSP processing system 120. The main workflow of the compiler 110 may be: source code → preprocessor → compiler → object code → linker → executable program. The DSP processing system 120 may include a central processing unit, a main memory, registers, and input-output interfaces.
In an alternative embodiment, the central processing unit in the DSP processing system 120 may be a DSP processor. The DSP processor may have a Harvard architecture with separate program and data stores, along with a dedicated hardware multiplier, and may perform different pipelined operations, so that various digital signal processing algorithms can be implemented quickly.
As shown in fig. 2, this embodiment provides an instruction generation method, illustrated here as applied to a compiler. The compiler may specifically be the compiler 110 of fig. 1 described above.
Referring to fig. 2, the instruction generation method specifically includes the following steps:
S11: when source code is received, determine a vectorizable loop body in the source code and vectorize the instructions of the loop body to generate SIMD instructions.
The source code may be code written by a user, specifically source code written in a language supported by the user's development tool: a set of explicit rules for representing information in discrete form with characters, symbols, or signals. In alternative embodiments, the user may write code in different programming languages, such as the C language.
Through the source code, the compiler can accurately determine the data operations the user requires, generate the corresponding instructions from the user's source code, and thereby enable the DSP processing system to execute the corresponding operations.
The loop body may be the repeatedly executed step of a loop structure in the source code or in an algorithm. Because the repeated step of a loop body executes conditionally, the data in the loop can be pipelined only when the condition of the repeated step is satisfied, which is why the loop body must be found in the source code.
A vector architecture is required for parallel operation on data. Because a vector architecture spends a load or store operation once per vector rather than once per element, the frequency of pipeline interlocks can be reduced. In a vector architecture, each vector instruction needs at most one pipeline stall, stalls caused by dependences among many scalar instructions do not occur, and the required instruction bandwidth is greatly reduced. Moreover, whereas a scalar architecture needs one instruction per data operation, a vector architecture needs only one vector instruction per vector operation, greatly improving processing efficiency.
S12: splice the SIMD instructions into an execution instruction set according to the very long instruction word (VLIW) architecture.
A Very Long Instruction Word (VLIW) architecture issues an extremely long instruction combination: several instructions connected together and fetched as one bundle.
In this embodiment, the compiler may splice a plurality of SIMD instructions into an execution instruction set in the VLIW format and store it in the corresponding instruction register; when the DSP processing system needs to process data, it can fetch the execution instruction set from that instruction register.
S13: send the execution instruction set to the DSP processing system so that the DSP processing system performs test processing on data.
The test processing may be processing of multimedia data, for example audio data, image data, or video data.
During test processing, the DSP processor of the DSP processing system can respond to the execution instruction set by fetching a plurality of SIMD instructions simultaneously and responding to them in parallel, thereby processing multiple groups of data in parallel. Because each SIMD instruction corresponds to one group of data, pipelined operation on multiple groups of data can begin once the SIMD instructions have been responded to, and the loading, storing, and computation of the data can proceed in pipelined fashion.
In actual operation, as shown in fig. 3, the DSP processor receives an execution instruction set in VLIW format and decodes it, obtaining the address codes of the several SIMD instructions it contains; the DSP processor then fetches those SIMD instructions simultaneously according to the address codes. The SIMD instructions can then be responded to in parallel, yielding the groups of data addresses corresponding to each SIMD instruction, i.e., the registered addresses of the data required by each instruction's operation. The DSP processor can then fetch the groups of data from different vector registers according to those addresses and perform the corresponding processing, each vector register storing the data for the execution of one SIMD instruction.
In this embodiment, a loop body is found in the source code and the instructions on the data in the loop body are vectorized to obtain a plurality of vector instructions; the vector instructions generate a plurality of SIMD instructions, which are then spliced into a VLIW bundle that can be processed in parallel. The DSP processing system can thus receive the multiple SIMD instructions at once through the VLIW architecture and respond to them in parallel, improving its processing capability. Because the SIMD instructions are mutually independent and may have different operation modes, the DSP processing system can perform different operations after responding to them in parallel, improving both its data processing capability and its efficiency. Meanwhile, the data corresponding to each SIMD instruction can be stored in different registers, so that the DSP processing system can conveniently fetch the data from those registers, further improving data processing efficiency.
Example two
In actual operation, when the data in a loop body needs to be operated on in parallel, the loop can be broken apart for parallel processing; but if data that requires loop processing is broken apart, the original loop is destroyed, the data must be readjusted during processing, and processing efficiency drops.
To solve the above problem, this embodiment provides an instruction generation method, shown in fig. 4 and illustrated here as applied to a compiler. The compiler may specifically be the compiler 110 of fig. 1 described above.
Referring to fig. 4, the instruction generation method specifically includes the following steps:
S21: when source code is received, find the dependency number and the loop count corresponding to the loop body in the source code, wherein the dependency number is the preset instruction width divided by the width of the data to be processed.
In this embodiment, the dependency number is the width of the instruction that processes the data divided by the width of the data to be processed. For example, when the instruction width is 128 bits and the data to be processed is 32 bits, 128/32 = 4, so the dependency number is 4.
The loop count is the number of times the loop body performs its repeated operation.
And S22, when the loop frequency is determined to be more than the dependency number and the data in the loop body are not dependent, determining the loop body as an inner loop capable of vectorizing processing.
In an optional embodiment, the compiler may use loop-level automatic vectorization to find the dependency number and loop count of the loop body, determine whether the loop count is greater than the dependency number, and determine whether the data in the loop body depend on one another across iterations. The compiler can decide whether a loop can be vectorized by searching the source code for the innermost vectorizable loop, in particular by searching for factors that prevent vectorization.
Specifically, the dependency number, which is the vectorization factor, and the loop count are found. The vectorization factor is the width of the instruction that processes the data divided by the width of the data to be processed.
For example:

    for (i = 0; i < 1024; i++)
        a[i+1] = a[i] + b[i];

When the instruction width is 128 bits and the data to be processed is 32 bits, the vectorization factor is 128/32 = 4. Although the loop count of 1024 is greater than the dependency number, a[i+1] and a[i] have a true loop-carried data dependence, so the loop body cannot be operated on in parallel and cannot be vectorized.
Another example:

    for (i = 0; i < 3; i++)
        a[i] = b[i] + c[i];

When the vectorization factor is 4, the loop count of 3 is less than 4; although there is no true loop-carried dependence between the data, the iteration count is too short (smaller than the vectorization factor), so automatic loop vectorization and the corresponding function call cannot be performed.
As another example:

    for (i = 0; i < 32; i++)
        a[i] = b[i] + c[i];

When the vectorization factor is 4, the loop count of 32 is greater than 4, and a[i], b[i], and c[i] have no dependences on one another, so automatic loop vectorization and the corresponding function call can be performed.
In this embodiment, the compiler may determine whether a dependency relationship exists between data through semantic analysis of data dependency.
When it is determined that the data in the loop body have no dependences and the loop count is greater than the dependency number, parallel processing may be performed: the data in the loop body can be divided into groups of mutually independent data, and each group processed in parallel. Repeating these steps, the several loop bodies in the source code can be judged, several vectorizable loop bodies determined, and several groups of independent data obtained.
Because the data within each group are mutually independent, they can be processed in parallel and operated on in pipelined fashion; and because the groups are independent of one another, the groups can likewise be processed in parallel.
S23: vectorize each scalar instruction in the inner loop to generate SIMD instructions.
In this embodiment, when the compiler determines that the loop body is vectorizable, i.e., that the inner loop can be vectorized, it may recompile to generate a new loop whose trip count is the original count divided by the vectorization factor, and replace each scalar instruction in the loop body with the corresponding vector instruction.
For example, the original specification shows (as images) a listing in which the scalar loop is rewritten into its vectorized form.
In this embodiment, the data processed by a SIMD instruction is placed in a vector register, whereas the data processed by a scalar instruction is placed in an ordinary register. By replacing the scalar instruction for each datum with a vector instruction, the original data is vectorized and can be processed in parallel.
S24: splice the SIMD instructions into an execution instruction set according to the very long instruction word (VLIW) architecture.
This step is the same as in the above embodiment; for detailed analysis, refer to the above embodiment. It is not repeated here.
S25: send the execution instruction set to the DSP processing system so that the DSP processing system performs calculation processing on data.
This step is the same as in the above embodiment; for detailed analysis, refer to the above embodiment. It is not repeated here.
In this embodiment, by finding the loop count and the dependency number of a loop body in the source code, and by determining that the loop count is greater than the dependency number and that the data in the loop body have no mutual dependences, it can be established that the data in the loop body can be operated on in parallel. This avoids the situation in which the data must be readjusted when the loop is broken apart for processing, while ensuring that the loop body executes smoothly, thereby further improving data processing efficiency.
EXAMPLE III
In actual operation, not every data-processing task needs many or large loop operations; a large amount of data may need only simple, loop-free calculation, with no dependences among the data items.
To solve the above problem, this embodiment provides an instruction generation method, shown in fig. 5 and illustrated here as applied to a compiler. The compiler may specifically be the compiler 110 of fig. 1 described above.
Referring to fig. 5, the instruction generation method specifically includes the following steps:
S31: when source code is received, find a loop body in the source code and unroll it to obtain an unrolled loop body.
In this embodiment, opening the loop body refers to loop unrolling (loop unwinding), an optimization that may trade program size for execution speed. It can be completed automatically by the compiler.
Specifically, loop unrolling is accomplished by copying the loop-body code several times. Unrolling enlarges the instruction-scheduling window and reduces the overhead of loop branch instructions, which in turn enables better data prefetching.
S32: obtain the plurality of scalar instructions corresponding to the unrolled loop body.
After the loop body is unrolled, its scalar instructions may be fetched; each scalar instruction corresponds to the processing of the data of one original iteration.
S33: when the scalar instructions have no mutual dependences, pack them into a vector instruction to generate a SIMD instruction.
In this embodiment, since each scalar instruction processes one datum, whether the scalar instructions depend on one another can be determined by checking whether the data they touch depend on one another.
Specifically, when the scalar instructions have no mutual dependences, they are packed into a vector instruction.
An example of this packing is given in the code listings shown in the accompanying figures.
S34, splicing the SIMD instructions into an execution instruction set according to a very long instruction word (VLIW) architecture.
This step is the same as in the above embodiment; for detailed analysis, refer to the above embodiment. It is not repeated here.
S35, sending the execution instruction set to the DSP processing system so that the DSP processing system performs calculation processing on the data.
This step is the same as in the above embodiment; for detailed analysis, refer to the above embodiment. It is not repeated here.
In this embodiment, unrolling the loop body increases the instruction scheduling space and reduces the overhead of loop branch instructions, so that data prefetching is better achieved. Whether a dependency relationship exists among multiple data items can then be determined; when no dependency exists, the instructions for the multiple data items can be packed into vector instructions to generate SIMD instructions, and the chip can process multiple independent data items in parallel through the SIMD instructions, thereby shortening the data processing time and improving the data processing efficiency.
Embodiment four
In actual operation, the calculation modes used to process data differ, and the bit width of each datum differs, so the register capacity required for processing the data also differs. To ensure smooth data transmission, the maximum bit width of the register is generally used to hold the data, so that the transmission requirement of the data can always be satisfied. However, some workloads process a large number of homogeneous, mutually independent data items of small bit width. Image data is a typical example: common image data types such as RGB565, RGBA8888 or YUV422 represent one component of one pixel with data of 8 bits or fewer. If a 512-bit-wide processor is used to process a large amount of 8-bit data, a large amount of register resources is wasted and the power consumption of the processor increases. Therefore, when the compiler compiles and generates instructions, the bit width required for processing the data needs to be determined, so that the register bit width can be allocated according to the data bit width and waste of register resources is avoided.
In order to solve the above problem, the present embodiment provides a method for generating an instruction, and as shown in fig. 6, the technical features of the present embodiment may be applied to embodiment two or embodiment three, specifically to the SIMD instruction generating step of step S23 of embodiment two or of step S33 of embodiment three.
Referring to fig. 6, the step of generating SIMD instructions may specifically include the following steps:
S41, receiving the register bit width value sent by the global configuration register, wherein the register bit width value is calculated by the global configuration register according to the bit width of each datum and the loop count.
In this embodiment, a globally configurable register (VPE_SEL) may also be involved. It is connected to the DSP processing system and the compiler respectively, and is visible to both the DSP processing system and the compiler.
In a specific implementation, a DSP processor of the DSP processing system may be provided with one or more vector issue slots, each of which is an issue slot of the DSP processor. One vector issue slot may correspond to one SIMD instruction, and the number of vector issue slots may be the same as the number of SIMD instructions. For example, if the execution instruction set of the very long instruction word architecture can run 5 SIMD instructions in parallel, 5 vector issue slots may be set. Each vector issue slot may be provided with 4 Vector Processing Elements (VPEs), and each vector processing element includes a 640-bit vector unit and a 640-bit vector register. Each vector issue slot therefore corresponds to a total bit width of 2560 bits.
When the compiler obtains the bit width of each datum and the loop count required for processing the data, it may send them to the globally configurable register. The globally configurable register may then calculate the register bit width value that needs to be occupied according to the bit width of each datum and the loop count, and send the register bit width value required for processing the data back to the compiler.
In actual practice, the compiler may need to compile multiple loop bodies, for example:
Define int DM[N], TM[N], A[N], E[N]; int B, C, D, x, y, z;
1. For (i = 0; i < 20; i++)
{
    // calculate:
    A[i] = DM[i] * TM[i] + A[i];
}
2. B = C + D; x = y * z;
3. For (i = 0; i < 20; i++)
{
    // calculate:
    E[i] = A[i] + E[i];
}
Assuming the vector register supports a length N of 20 elements and the required length does not exceed 20, the above operations can each be performed in parallel using one vector processing unit.
Operation 1 and operation 2 have no data dependency, and their data are independent of each other, so they can be executed in the execution instruction set of the same very long instruction word architecture. Operation 3, however, has a data dependency on operation 1, so the two cannot be executed in parallel at the same time. After passing through the compiler, the code may become instructions like the following:
Op1 A,DM,TM;Op2 B,C,D;Op3 x,y,z
---- Op1 represents a multiply-accumulate operation; Op2 represents an addition operation; Op3 represents an integer multiplication. These three instructions may be executed in parallel.
Op4 E,A,E
---- Op4 represents a vector addition. Since it has a data dependency on Op1, it is issued in the cycle following the first 3 instructions.
If the vector register length N is 20 and the user needs to improve the data processing accuracy, the data bit widths of DM and TM increase as the accuracy of the processed data increases. For DM[40], DM[60] or DM[80], that is, N = 40, N = 60 or even N = 80, the instruction needs to be transmitted in two or more cycles, and more vector processing units need to be used for the processing.
For better parallel computation, the globally configurable register may take the values 0x1, 0x3, 0x7 and 0xf, respectively representing the use of 1, 2, 3 or 4 vector processing units. It determines whether the register bit width value required to process the data is 640 bits, 1280 bits, 1920 bits or 2560 bits, and transmits that value to the compiler.
S42, determining the allocation bit width of the preset register according to the register bit width value.
In this embodiment, the compiler may allocate the bit width of the register according to the register bit width value that the globally configurable register reports is needed for processing the data, so that the register can process the data according to the allocated bit width.
As shown in fig. 7, assuming the register is 128 bits and the required bit width value is 128 bits: when the data to be processed are 32 bits each, the register may be divided into four 32-bit lanes; when the data to be processed are 8 bits each, the register may be divided into sixteen 8-bit lanes.
The preset register may be a vector register of the DSP processing system.
S43, compiling according to the allocated bit width to generate the SIMD instruction.
In this embodiment, the compiler may compile the instruction for each loop body operation using the allocated bit width to obtain the SIMD instruction.
In this embodiment, the bit width required for processing data is determined through the globally configurable register, so that the compiler can compile instructions more reasonably, generate a plurality of SIMD instructions, and make reasonable use of register resources. Provided the processing requirement is met, waste of register resources is avoided and the power consumption of the register is reduced.
Embodiment five
In one embodiment, as shown in fig. 8, there is further provided an instruction generating apparatus applied to a compiler, where the compiler is connected to a DSP processing system, the apparatus including:
a generating module 501, configured to, when a source code is received, determine a loop body capable of being vectorized from the source code, and perform vectorization processing on an instruction of the loop body to generate an SIMD instruction;
a splicing module 502, configured to splice the SIMD instructions into an execution instruction set according to a very long instruction word architecture;
a sending module 503, configured to send the execution instruction set to the DSP processing system, so that the DSP processing system performs calculation processing on the data.
Further, the generating module is further configured to:
searching the source code for a dependency number and a loop count corresponding to the loop body, wherein the dependency number is smaller than or equal to the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no dependency on one another, determining the loop body to be a vectorizable inner-layer loop;
and vectorizing each scalar instruction in the inner-layer loop to generate a SIMD instruction.
Further, the generating module is further configured to:
searching the source code for a loop body and unrolling it to obtain an unrolled loop body;
respectively acquiring a plurality of scalar instructions of the unrolled loop body;
and when the plurality of scalar instructions do not have a dependency relationship with each other, packing the plurality of scalar instructions into a vector instruction to generate a SIMD instruction.
Further, the system also involves a global configuration register, which is connected to the DSP processing system and the compiler respectively;
the generating module is further configured to:
receive the register bit width value sent by the global configuration register, wherein the register bit width value is calculated by the global configuration register according to the bit width of each datum and the loop count;
determine the allocation bit width of a preset register according to the register bit width value;
and compile and generate a SIMD instruction according to the allocated bit width.
In one embodiment, there is provided an electronic device including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the instruction generation method of each of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, which stores computer-executable instructions for causing a computer to perform the steps of the instruction generation method of each of the above embodiments.
The foregoing is a preferred embodiment of the present application. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present application, and these modifications and improvements are also regarded as falling within the protection scope of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Claims (10)

1. A method for generating an instruction, applied to a compiler, the compiler being connected to a DSP processing system, the method comprising:
when source code is received, determining a vectorizable loop body from the source code, and performing vectorization processing on the instructions of the loop body to generate a SIMD instruction;
splicing the SIMD instructions into an execution instruction set according to a very long instruction word architecture;
and sending the execution instruction set to the DSP processing system to enable the DSP processing system to perform calculation processing on data.
2. The method for generating an instruction according to claim 1, wherein determining a loop body capable of vectorization from the source code, and performing vectorization processing on the instruction of the loop body to generate a SIMD instruction comprises:
searching the source code for a dependency number and a loop count corresponding to the loop body, wherein the dependency number is smaller than or equal to the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no dependency on one another, determining the loop body to be a vectorizable inner-layer loop;
and vectorizing each scalar instruction in the inner-layer loop to generate a SIMD instruction.
3. The method for generating an instruction according to claim 1, wherein determining a loop body capable of vectorization from the source code, and performing vectorization processing on the instruction of the loop body to generate a SIMD instruction comprises:
searching the source code for a loop body and unrolling it to obtain an unrolled loop body;
respectively acquiring a plurality of scalar instructions corresponding to the unrolled loop body;
and when the plurality of scalar instructions do not have a dependency relationship with each other, packing the plurality of scalar instructions into a vector instruction to generate a SIMD instruction.
4. The method for generating instructions according to claim 2 or 3, further involving a global configuration register, the global configuration register being connected to the DSP processing system and the compiler respectively;
the generating a SIMD instruction comprises:
receiving the register bit width value sent by the global configuration register, wherein the register bit width value is calculated by the global configuration register according to the bit width of each datum and the loop count;
determining the allocation bit width of a preset register according to the register bit width value;
and compiling and generating a SIMD instruction according to the allocated bit width.
5. An instruction generating apparatus, applied to a compiler, the compiler being connected to a DSP processing system, the apparatus comprising:
the generating module is used for determining a loop body capable of being vectorized from a source code when the source code is received, and carrying out vectorization processing on an instruction of the loop body to generate a SIMD instruction;
the splicing module is used for splicing the SIMD instructions into an execution instruction set according to a very long instruction word architecture;
and the sending module is used for sending the execution instruction set to the DSP processing system to enable the DSP processing system to perform calculation processing on data.
6. The apparatus for generating instructions according to claim 5, wherein the generating module is further configured to:
searching the source code for a dependency number and a loop count corresponding to the loop body, wherein the dependency number is smaller than or equal to the preset instruction width divided by the width of the data to be processed;
when the loop count is determined to be greater than the dependency number and the data in the loop body have no dependency on one another, determining the loop body to be a vectorizable inner-layer loop;
and vectorizing each scalar instruction in the inner-layer loop to generate a SIMD instruction.
7. The apparatus for generating instructions according to claim 5, wherein the generating module is further configured to:
searching the source code for a loop body and unrolling it to obtain an unrolled loop body;
respectively acquiring a plurality of scalar instructions of the unrolled loop body;
and when the plurality of scalar instructions do not have a dependency relationship with each other, packing the plurality of scalar instructions into a vector instruction to generate a SIMD instruction.
8. The apparatus of claim 6 or 7, further comprising a global configuration register, the global configuration register being coupled to the DSP processing system and the compiler respectively;
the generating module is further configured to:
receive the register bit width value sent by the global configuration register, wherein the register bit width value is calculated by the global configuration register according to the bit width of each datum and the loop count;
determine the allocation bit width of a preset register according to the register bit width value;
and compile and generate a SIMD instruction according to the allocated bit width.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of generating instructions according to any of claims 1 to 4 when executing the program.
10. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform a method of generating instructions of any one of claims 1 to 4.
CN202011093577.3A 2020-10-13 2020-10-13 Instruction generation method and device and electronic equipment Active CN112230995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011093577.3A CN112230995B (en) 2020-10-13 2020-10-13 Instruction generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011093577.3A CN112230995B (en) 2020-10-13 2020-10-13 Instruction generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112230995A true CN112230995A (en) 2021-01-15
CN112230995B CN112230995B (en) 2024-04-09

Family

ID=74113468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011093577.3A Active CN112230995B (en) 2020-10-13 2020-10-13 Instruction generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112230995B (en)


Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1174353A (en) * 1996-08-19 1998-02-25 三星电子株式会社 Single-instruction-multiple-data processing using multiple banks of vector registers
CN1278342A (en) * 1997-11-07 2000-12-27 博普斯公司 Method and apparatus for efficient synchronous MIMD operation with ivLIM PE-to-PE communication
CN102375805A (en) * 2011-10-31 2012-03-14 中国人民解放军国防科学技术大学 Vector processor-oriented FFT (Fast Fourier Transform) parallel computation method based on SIMD (Single Instruction Multiple Data)
CN102750133A (en) * 2012-06-20 2012-10-24 中国电子科技集团公司第五十八研究所 32-Bit triple-emission digital signal processor supporting SIMD
CN103279327A (en) * 2013-04-28 2013-09-04 中国人民解放军信息工程大学 Automatic vectorizing method for heterogeneous SIMD expansion components
US20140215183A1 (en) * 2013-01-29 2014-07-31 Advanced Micro Devices, Inc. Hardware and software solutions to divergent branches in a parallel pipeline
CN104025033A (en) * 2011-12-30 2014-09-03 英特尔公司 Simd variable shift and rotate using control manipulation
CN104641351A (en) * 2012-10-25 2015-05-20 英特尔公司 Partial vectorization compilation system
US20170177342A1 (en) * 2015-12-22 2017-06-22 Intel IP Corporation Instructions and Logic for Vector Bit Field Compression and Expansion
CN107992330A (en) * 2012-12-31 2018-05-04 英特尔公司 Processor, method, processing system and the machine readable media for carrying out vectorization are circulated to condition
CN107992376A (en) * 2017-11-24 2018-05-04 西安微电子技术研究所 Dsp processor data storage Active Fault Tolerant method and apparatus
CN108268283A (en) * 2016-12-31 2018-07-10 英特尔公司 For operating the computing engines framework data parallel to be supported to recycle using yojan


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YANGZHAO YANG等: "An Approach to Enhance Loop Performance for Multicluster VLIW DSP Processor", IEEE, pages 1 - 8 *
XU Huaye: "Research on Vectorization-Related Compilation Techniques for Multi-Cluster VLIW DSP", China Master's Theses Electronic Journal, Information Science and Technology, no. 10, pages 29 - 57 *
WANG Min; WANG Hongmei; ZHANG Tiejun; SHAN Rui; WANG Donghui: "Design and Implementation of a Compiler for the VLIW DSP Architecture", Microcomputer Applications, no. 07, pages 49 - 54 *
HUANG Shengbing et al.: "SIMD Compilation Optimization Supporting Single/Double-Word Mode Selection on Clustered VLIW DSP", Journal of Computer Applications, vol. 35, no. 8, pages 2371 - 2374 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719559A (en) * 2022-07-20 2023-09-08 广州众远智慧科技有限公司 Method and device for infrared scanning
CN116719559B (en) * 2022-07-20 2024-06-11 广州众远智慧科技有限公司 Method and device for infrared scanning

Also Published As

Publication number Publication date
CN112230995B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN110689138B (en) Operation method, device and related product
US10372429B2 (en) Method and system for generating accelerator program
WO2021000970A1 (en) Deep learning algorithm compiling method, device, and related product.
JP3896087B2 (en) Compiler device and compiling method
EP2876555B1 (en) Method of scheduling loops for processor having a plurality of functional units
US20020095666A1 (en) Program optimization method, and compiler using the same
WO2021000971A1 (en) Method and device for generating operation data and related product
CN111104120A (en) Neural network compiling method and system and corresponding heterogeneous computing platform
US6934938B2 (en) Method of programming linear graphs for streaming vector computation
CN112269581B (en) Memory coupling compiling method and system for reconfigurable chip
US20230093393A1 (en) Processor, processing method, and related device
US12039305B2 (en) Method for compilation, electronic device and storage medium
CN112230995A (en) Instruction generation method and device and electronic equipment
CN114565102A (en) Method, electronic device and computer program product for deploying machine learning model
CN116594682A (en) Automatic testing method and device based on SIMD library
CN116861359A (en) Operator fusion method and system for deep learning reasoning task compiler
US11573777B2 (en) Method and apparatus for enabling autonomous acceleration of dataflow AI applications
JP2008250838A (en) Software generation device, method and program
CN115951936B (en) Chip adaptation method, device, equipment and medium of vectorization compiler
JP2004021890A (en) Data processor
CN117331541B (en) Compiling and operating method and device for dynamic graph frame and heterogeneous chip
JP4158239B2 (en) Information processing apparatus and method, and recording medium
WO2021000638A1 (en) Compiling method and device for deep learning algorithm, and related product
CN117492836A (en) Data flow analysis method for variable length vector system structure
CN114237711A (en) Vector instruction processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant