CN112328401B - 3DES acceleration method based on OpenCL and FPGA - Google Patents


Info

Publication number: CN112328401B
Application number: CN202011302847.7A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN112328401A
Prior art keywords: data, opencl, 3des, kernel, fpga
Inventors: 柴志雷 (Chai Zhilei), 吴健凤 (Wu Jianfeng)
Assignee (original and current): Jiangnan University
Application filed by Jiangnan University; priority to CN202011302847.7A
Legal status: Active (granted)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 — Allocation of resources to service a request
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 — Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 — Protecting data
    • G06F 21/602 — Providing cryptographic facilities or services


Abstract

The invention discloses a 3DES acceleration method based on OpenCL and FPGA, comprising a host side and a device side. The host side schedules and manages the kernel via OpenCL and exchanges 3DES encryption and decryption data with the device side; the device side is implemented on an FPGA and encrypts and decrypts the data with 3DES. The device side comprises a plaintext data input buffer module, a 3DES encryption computation module, and a ciphertext data output buffer module: the plaintext input buffer module reads plaintext from global memory using data storage adjustment and data bit-width improvement, the 3DES encryption computation module applies instruction-stream optimization to form a pipelined parallel architecture, and the ciphertext output buffer module transfers the data from the FPGA chip to external DDR. By adopting data storage adjustment, data bit-width improvement, and instruction-stream optimization, the invention improves the kernel's effective bandwidth utilization and computation speed; performance is further improved by a kernel vectorization strategy and a compute-unit replication strategy.

Description

3DES acceleration method based on OpenCL and FPGA
Technical Field
The invention relates to the technical field of encryption and decryption acceleration on heterogeneous platforms, and in particular to a 3DES acceleration method based on OpenCL and FPGA.
Background
Encryption and decryption technologies are now widely applied in fields such as digital currency, blockchain, and cloud data encryption. To meet their demand for high-intensity computing, current servers often incorporate heterogeneous computing platforms to boost the performance of specific workloads, although this also raises the maintenance cost of the whole system. OpenCL (Open Computing Language) is an open framework for heterogeneous platforms: kernel programs (kernels) can be compiled for execution on multicore CPUs as well as GPUs, FPGAs, and DSPs. Besides using ASICs or GPUs to process data in bulk, current servers can also deploy FPGAs at scale for reasons such as energy efficiency. The FPGA (Field Programmable Gate Array) evolved from programmable devices such as PAL and GAL; as a semi-custom circuit in the application-specific integrated circuit domain, it overcomes both the inflexibility of fully custom circuits and the limited gate count of earlier programmable devices. It is therefore worthwhile to research accelerating encryption and decryption technology based on OpenCL and FPGA.
DES (Data Encryption Standard, a block cipher using a secret key) is a common encryption technique, but as computing power has grown, the short DES key has become vulnerable to brute-force attack. 3DES (Triple DES) was therefore introduced: it lengthens the effective key by applying DES three times to each data block, avoiding such attacks. However, when encrypting and decrypting data, a 3DES implementation based on OpenCL and FPGA suffers from low kernel bandwidth utilization and low computation speed.
Disclosure of Invention
The object of the invention is to provide a 3DES acceleration method based on OpenCL and FPGA that can effectively utilize the kernel's available bandwidth and improve computation speed.
To solve the above technical problems, the invention provides a 3DES acceleration method based on OpenCL and FPGA, comprising a host side and a device side.
The host side schedules and manages the kernel based on OpenCL and exchanges 3DES encryption and decryption data with the device side; the device side is implemented on an FPGA and uses 3DES to encrypt and decrypt the data.
The device side comprises a plaintext data input buffer module, a 3DES encryption computation module, and a ciphertext data output buffer module; the plaintext data input buffer module reads plaintext data from global memory using data storage adjustment and data bit-width improvement, the 3DES encryption computation module applies instruction-stream optimization to the data to form a pipelined parallel architecture, and the ciphertext data output buffer module transfers the data from the FPGA chip to external DDR.
Further, the host side schedules and manages the kernel based on OpenCL; specifically, it interacts with the OpenCL device side through the OpenCL platform API and runtime API. The platform API defines the functions a host program uses to discover OpenCL devices and their capabilities, while the runtime API manages the context, creates command queues, and handles other operations that occur at runtime.
Further, the plaintext data input buffer module and the ciphertext data output buffer module are located in a global memory area of the device side, and intermediate data of the 3DES encryption calculation module is stored in a private memory area of the device side.
Further, the data storage adjustment is specifically as follows: data transferred from the host side is stored in off-chip DDR; constant memory resides in the on-chip cache unit; the physical location of local memory is on-chip RAM; and the physical location of private memory is on-chip registers. Variables participating in the 3DES computation are stored in private memory, and each work item accesses its corresponding plaintext data block in private memory and completes the 3DES encryption. The S-boxes of the subkey expansion module and the E-box of the f-function computation module are stored in constant memory, whose physical location is on-chip ROM, giving fast access while avoiding access conflicts.
Further, the data bit-width improvement specifically defines the behavior of a single work item as: transfer 8 bytes of data from global memory to private memory, perform the 3DES encryption computation on those 8 bytes, and transfer the result from private memory back to global memory.
Further, the instruction-stream optimization specifically uses loop unrolling and loop pipelining to increase the program's parallelism: loop unrolling guides the offline compiler to convert the OpenCL kernel into a hardware image that forms an efficient pipeline, and the pipelined architecture shortens the overall execution time.
Further, the device side also adopts a kernel vectorization strategy to form wider vector computation lanes and thereby improve memory-access efficiency.
Further, the kernel vectorization strategy specifically allows multiple work items to execute instances of the kernel program in SIMD fashion, which improves memory-access efficiency.
Further, the device side adopts a compute-unit replication strategy to improve the performance of kernels with regular memory-access patterns and thereby increase computational throughput.
Further, under the compute-unit replication strategy the FPGA compiler generates multiple compute units for the kernel, and each compute unit executes multiple work groups simultaneously, improving the kernel's throughput.
The invention has the following beneficial effects. Considering the OpenCL memory model and the differing access characteristics of global and private memory, the invention adopts data storage adjustment and data bit-width improvement, raising the kernel's effective bandwidth utilization; instruction-stream optimization forms a pipelined parallel architecture, raising the computation speed. On top of data storage adjustment, data bit-width improvement, and instruction-stream optimization, the kernel vectorization and compute-unit replication strategies further improve system performance.
The foregoing is merely an overview of the technical solution of the invention. To make it more clearly understood and implementable in accordance with the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic diagram of the overall architecture of the present invention.
FIG. 2 is a flowchart of a host-side process according to the present invention.
FIG. 3 is a pipelined architecture of kernel computing in accordance with the present invention.
FIG. 4 is a schematic diagram of a kernel vectorization strategy in the present invention.
Fig. 5 is a schematic diagram of a computing unit replication strategy in the present invention.
Detailed Description
The invention will be further described below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and practice it; the embodiments are not intended to limit the invention.
In the description of the invention, it should be understood that the term "comprising" indicates a non-exclusive inclusion: a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those steps or elements, and may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical description of the invention is as follows:
1、3DES
3DES is built on DES, strengthening it by performing DES encryption three times to guarantee its security. DES consists of 16 rounds of iteration with a 56-bit effective key; 3DES comprises 48 rounds with a 168-bit effective key. DES partitions the plaintext into 64-bit blocks, and the key involved in the computation is fixed at 64 bits (56 significant bits). The encryption flow comprises three parts: the initial permutation, 16 rounds of iteration, and the inverse initial permutation. The subkeys used in the 16 rounds of iteration are expanded from the 56-bit key.
After the initial permutation, the 64-bit input plaintext is split into two halves, L0 and R0; 16 identical rounds of iteration follow, and the inverse initial permutation finally yields the 64-bit output ciphertext. Each round comprises an XOR operation and the f-function. The inputs to the f-function in round i are Ri-1 and the 48-bit subkey Ki: Ri-1 is expanded by the E-box and XORed with Ki, and the result passes through the S-box substitution and P-box permutation to produce a 32-bit output. Each S-box maps a 6-bit input to a 4-bit output and is the only nonlinear transformation in the process, which greatly improves security. The S-box substitution is implemented with lookup tables, and the contents of the 8 S-boxes are stored in on-chip ROM, which effectively improves computation efficiency. The DES subkey generation module takes the 56-bit key as input and, through 16 rounds of iteration, produces the 16 subkeys Ki used by the 16-round iteration module. In the subkey generation module, each round comprises a circular left shift and a key permutation operation.
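The round structure above can be sketched as follows. This is an illustrative sketch, not the claimed implementation: the round function f below is a hypothetical stand-in for the real E-box expansion, subkey XOR, S-box substitution, and P-box permutation, but the Feistel inversion property it demonstrates holds regardless of what f computes.

```python
def f(right: int, subkey: int) -> int:
    # placeholder mixing; real DES uses the E, S, and P tables here
    return ((right * 2654435761) ^ subkey) & 0xFFFFFFFF

def feistel_round(left: int, right: int, subkey: int):
    # L_i = R_{i-1};  R_i = L_{i-1} XOR f(R_{i-1}, K_i)
    return right, left ^ f(right, subkey)

def feistel_round_inv(left: int, right: int, subkey: int):
    # recovers (L_{i-1}, R_{i-1}); works for any f, which is why
    # DES decryption reuses the same hardware with reversed subkeys
    return right ^ f(left, subkey), left

def feistel_encrypt(left: int, right: int, subkeys):
    for k in subkeys:                # 16 rounds in DES
        left, right = feistel_round(left, right, k)
    return left, right

def feistel_decrypt(left: int, right: int, subkeys):
    for k in reversed(subkeys):      # same rounds, subkeys in reverse order
        left, right = feistel_round_inv(left, right, k)
    return left, right
```

Running the 16 rounds forward and then backward returns the original halves, mirroring how DES decryption inverts encryption.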
3DES was developed from DES; its input is a 64-bit plaintext block and its output a 64-bit ciphertext block. Unlike DES, 3DES uses a 192-bit key (effective length 168 bits). Let EKi(I) and DKi(I) denote DES encryption and decryption, respectively, of data block I with DES key Ki. The 3DES encryption operation is O = EK3(DK2(EK1(I))), converting the 64-bit input block I into the 64-bit output block O; the 3DES decryption operation is O = DK1(EK2(DK3(I))), converting the 64-bit input block I into the 64-bit output block O.
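The encrypt-decrypt-encrypt (EDE) composition can be sketched as follows, using a hypothetical toy cipher E/D in place of real DES (an illustrative sketch only). It also shows a known property of EDE: with K1 = K2 = K3 the construction degenerates to a single encryption, which is what gives 3DES backward compatibility with DES.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def E(k: int, block: int) -> int:
    # toy stand-in for DES encryption: XOR with key, then rotate left 8 bits in 64
    x = (block ^ k) & MASK64
    return ((x << 8) | (x >> 56)) & MASK64

def D(k: int, block: int) -> int:
    # exact inverse of E: rotate right 8 bits, then XOR with key
    x = ((block >> 8) | (block << 56)) & MASK64
    return x ^ k

def tdes_encrypt(k1, k2, k3, block):
    return E(k3, D(k2, E(k1, block)))   # O = EK3(DK2(EK1(I)))

def tdes_decrypt(k1, k2, k3, block):
    return D(k1, E(k2, D(k3, block)))   # O = DK1(EK2(DK3(I)))
```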
2、OpenCL
OpenCL provides developers with an abstract memory hierarchy so that efficient code can be generated to match the memory hierarchy of a target device. The OpenCL memory model comprises four types: global memory, constant memory, local memory, and private memory. Work items run on processing elements (PEs) and can access their corresponding private memory; work groups run on compute units (CUs), and work items in the same work group share a common local memory.
An OpenCL program comprises a host program and a kernel program. At kernel launch the runtime creates an integer index space; each work-item is an instance executing at a point of that index space, and a work-group is a collection of work-items. Work-items in the same work-group share local memory and can synchronize within the group. A work-item's coordinates in the global index space are its global ID, and its coordinates within its work-group are its local ID.
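The ID relationships above can be sketched for a 1-D index space (values here are hypothetical): a work-item's global ID is derived from its work-group ID, the work-group size, and its local ID.

```python
def global_id(group_id: int, group_size: int, local_id: int) -> int:
    # OpenCL 1-D mapping: global ID = group ID * group size + local ID
    return group_id * group_size + local_id

# enumerate a global size of 8 split into 2 work-groups of 4 work-items,
# the way the runtime would lay out the index space
index_space = [(g, l, global_id(g, 4, l)) for g in range(2) for l in range(4)]
```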
3、Intel FPGA SDK
The invention uses the Intel FPGA SDK for OpenCL to implement the 3DES accelerator. Before building the system on the FPGA, the SDK supports emulating the OpenCL application on a CPU: software emulation uses the CPU to mimic the FPGA hardware behavior and is typically used for functional verification. Currently, Intel's tool chain does not support hardware emulation. The SDK contains an offline compiler that compiles the OpenCL kernel into an optimized hardware image: it first converts the kernel code into intermediate Verilog, which the Quartus II software then compiles into a binary image that can be loaded onto the FPGA device when the program runs. Because applying the appropriate optimizations and generating the hardware image takes several hours, compilation is performed offline, and the host program only loads the hardware image at run time. After the build completes, a host executable and a binary file are created to run the target program on the FPGA.
Referring to fig. 1, a schematic diagram of the overall architecture of the invention, an embodiment of the 3DES acceleration method based on OpenCL and FPGA comprises a host side and a device side. The host side schedules and manages the kernel based on OpenCL, exchanges 3DES encryption and decryption data with the device side, and handles reading, initializing, and storing the plaintext data required for encryption; the device side is implemented on an FPGA and encrypts and decrypts the data with 3DES. The device side comprises a plaintext data input buffer module, a 3DES encryption computation module, and a ciphertext data output buffer module. The plaintext data input buffer module reads plaintext from global memory using data storage adjustment and data bit-width improvement, raising effective bandwidth utilization. The 3DES encryption computation module performs the FPGA-based 3DES encryption, applying instruction-stream optimization to form a pipelined parallel architecture that improves the loop iterations in the method and increases computational parallelism. The ciphertext data output buffer module transfers the data from the FPGA chip to external DDR. The plaintext input and ciphertext output buffer modules reside in the device-side global memory region, and the intermediate data of the 3DES encryption computation module is stored in the device-side private memory region.
The ciphertext data output buffer module is essentially the same as the plaintext data input module and adopts the same optimizations; the differences are the data content and the direction of movement: the ciphertext output buffer module transfers ciphertext, moving data from on-chip storage units on the FPGA to external global memory (physically DDR). In fig. 1, the processing elements execute work items; each processing element has a corresponding private memory for the intermediate data of its computation, and one processing element completes the 3DES encryption of one plaintext block.
As shown in the host-side program flow of fig. 2, the host side schedules and manages the kernel based on OpenCL; specifically, it interacts with the OpenCL device side (here, the FPGA) through the OpenCL platform API and runtime API. The platform API defines the functions a host program uses to discover OpenCL devices and their capabilities; the runtime API manages the context, creates command queues, and handles other operations occurring at runtime. Both APIs are provided by the Intel FPGA SDK for OpenCL, Intel's software support package for OpenCL development on the FPGA. After reading and initializing the plaintext data, the host queries the device, creates a context and a command queue, loads the kernel source and creates a kernel object, sets the kernel arguments, writes the plaintext into a memory object and executes the kernel, and finally reads back the encryption result from the device and releases the resources once the data exchange with the device is complete.
In this embodiment, the data storage adjustment is specifically as follows: data transferred from the host is stored in off-chip DDR, with data type __global; constant memory resides in the on-chip cache unit, with data type __constant; local memory physically maps to on-chip RAM resources, with data type __local; and private memory physically maps to on-chip registers, with data type __private. Because the different on-chip resources differ in size, latency, and throughput, allocating data storage sensibly has a large impact on performance. Global memory has the greatest throughput and capacity but also larger access latency; the host-transferred data is stored in global memory, and raising the effective memory bandwidth utilization is important for improving system performance. Local memory is visible to all work items in a work group and, compared with private memory, offers higher throughput and larger capacity at comparable access latency; however, work items in the same work group must use a barrier to guarantee data consistency after execution, which adds some delay.
Therefore, the variables participating in the 3DES computation are stored in private memory, and each work item accesses its corresponding plaintext block in private memory and completes the 3DES encryption. The values of the S-boxes in the subkey expansion module and the E-box in the f-function computation module are frequently accessed and remain unchanged throughout the computation, so they are stored in constant memory, whose physical location is on-chip ROM, giving fast access while avoiding memory access conflicts.
Regarding the data bit-width improvement: if the data width processed by a work item is not fixed, the compiler reserves extra resources to cover every possible width, and its ability to optimize the program is limited. Given the 64-bit input and 64-bit output of 3DES, the data length processed by a single work item is set to 8 bytes. If it were set to 4 bytes, two work items would be needed to encrypt one plaintext block, and their data would have to be synchronized for consistency, adding overhead; if it were set to 16 bytes, a single work item would process twice the data, and the kernel execution time would in theory double. Therefore, in this embodiment the behavior of a single work item is defined as: transfer 8 bytes from global memory to private memory, perform the 3DES encryption on those 8 bytes, and transfer the result from private memory back to global memory. By obtaining the work item's global ID, a one-to-one correspondence between work items and plaintext blocks is achieved, avoiding synchronization between work items.
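The one-work-item-per-8-byte-block mapping can be sketched as follows (an illustrative sketch: encrypt_block is a hypothetical placeholder for the 3DES kernel computation, and the sequential loop stands in for the parallel launch on the FPGA). Because work-item i touches only plaintext[8*i : 8*i + 8], no inter-item synchronization is needed.

```python
def encrypt_block(block: bytes) -> bytes:
    # placeholder transform, not real 3DES
    return bytes(b ^ 0xAA for b in block)

def kernel_instance(gid: int, plaintext: bytes, ciphertext: bytearray):
    off = 8 * gid                                  # byte offset from the global ID
    ciphertext[off:off + 8] = encrypt_block(plaintext[off:off + 8])

plaintext = bytes(range(32))                       # 4 blocks -> 4 work-items
ciphertext = bytearray(len(plaintext))
for gid in range(len(plaintext) // 8):             # launched in parallel on the device
    kernel_instance(gid, plaintext, ciphertext)
```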
In this embodiment, the instruction-stream optimization specifically uses loop unrolling and loop pipelining to increase the program's parallelism; loop unrolling guides the offline compiler, as it converts the OpenCL kernel into a hardware image, to form an effective pipeline, and the pipelined architecture shortens the overall execution time. In the pipelined kernel architecture of fig. 3, RD denotes data read, CM data computation, and ST data store. Assuming each operation takes 1 clock cycle, without a pipelined design the kernel waits 3 clock cycles before the next computation, whereas with pipelining it waits only 1. In the kernel design, the 8-byte plaintext transfer module uses a loop of 8 iterations, which unrolling reduces to a single compiler iteration; for the 8 load operations on global variables, unrolling lets the compiler coalesce memory accesses and raise effective bandwidth utilization. Within the 3DES encryption module, the 16 circular left shifts and key permutations of the subkey generation module are unrolled; the 16-round iteration module of a single DES computation is unrolled; and the 8-byte ciphertext transfer module is unrolled 8 ways to form coalesced memory accesses. At the cost of some hardware resources, loop unrolling reduces the compiler's iteration count and forms a pipelined architecture, improving parallelism.
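The cycle counts above can be sketched with a back-of-the-envelope model of the pipeline in fig. 3: each block needs the three stages RD, CM, ST at 1 cycle each, so without pipelining a new block starts every 3 cycles, while the pipelined design issues one block per cycle after the pipeline fills. This is a simplified model, not a timing guarantee.

```python
def cycles_unpipelined(n_blocks: int, stages: int = 3) -> int:
    # each block occupies the hardware for all `stages` cycles
    return n_blocks * stages

def cycles_pipelined(n_blocks: int, stages: int = 3) -> int:
    # fill the pipeline once, then retire 1 block per cycle
    return stages + (n_blocks - 1)
```

For large block counts the model predicts roughly a 3x reduction in cycles, consistent with the 3-cycle versus 1-cycle initiation interval described above.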
In this embodiment, the device side adopts a kernel vectorization strategy to form wider vector computation lanes and improve memory-access efficiency. The strategy allows multiple work items to execute instances of the kernel program in SIMD fashion, as shown in the kernel vectorization diagram of fig. 4, improving memory-access efficiency. When kernel vectorization is used, the work-group size must also be specified, and the vectorization factor must divide the work-group size; the maximum vectorization factor set in this embodiment is 16. With a work-group size of 512 and a vectorization factor of 16, the work items of each work group are distributed across 16 SIMD vector lanes. Once the compiler implements the 16 SIMD lanes, each work item does 16 times the original work, and the global work size shrinks by a factor of 16. As shown in fig. 4, the compiler coalesces memory accesses under kernel vectorization, merging multiple load operations on global memory into one wider vector load.
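The SIMD-16 configuration above can be sketched arithmetically (an illustrative model, with names of my own choosing): with a work-group size of 512 and a vectorization factor of 16, the work-items are batched into 16-wide vector lanes, and each 8-byte block load widens into a single 128-byte vector load.

```python
def vectorize(work_group_size: int, simd: int):
    assert work_group_size % simd == 0   # SIMD factor must divide the group size
    batches = work_group_size // simd    # vectorized kernel instances per group
    vector_load_bytes = 8 * simd         # coalesced 8-byte block loads per lane group
    return batches, vector_load_bytes
```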
In this embodiment, the device side further adopts a compute-unit replication strategy to improve the performance of kernels with regular memory-access patterns and increase computational throughput. Compute-unit replication improves kernel performance when the memory-access pattern is regular. The Intel FPGA SDK compiler supports generating multiple compute units for a kernel, and each compute unit can typically execute multiple work groups simultaneously, increasing the kernel's throughput. With compute-unit replication, a hardware scheduler in the FPGA dispatches work groups to the other available compute units; as long as a compute unit has not reached its maximum capacity, it can receive work groups. As shown in the compute-unit replication diagram of fig. 5, this embodiment combines 2 compute units with kernel vectorization of factor 16; the FPGA hardware scheduler distributes the work groups to the 2 compute units for execution, which can cut the system's running time in half.
On top of the combination of data storage adjustment, data bit-width improvement, and instruction-stream optimization, the two optimization methods of kernel vectorization and compute-unit replication trade a larger circuit area for higher throughput, further improving system performance.
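The scheduler behavior in fig. 5 can be sketched with a simplified makespan model (an assumption for illustration, not a measured result): work groups go to whichever replicated compute unit is free, so with equal-cost groups the ideal running time scales as ceil(groups / compute units).

```python
import math

def makespan_ms(num_groups: int, group_time_ms: float, num_cus: int) -> float:
    # ideal makespan with a greedy hardware scheduler and equal-cost work groups
    return math.ceil(num_groups / num_cus) * group_time_ms
```

With 2 compute units and an even number of work groups, the model halves the running time, matching the description above.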
To further illustrate the benefits of the five methods (data storage adjustment, data bit-width improvement, instruction-stream optimization, kernel vectorization, and compute-unit replication) for memory bandwidth and computation speed, the method was implemented in a software environment of CentOS Linux release 7.7.1908, GCC v4.8.5, and Intel FPGA SDK for OpenCL version 19.3, and the FPGA-side results under different optimization strategies were compared on hardware with an Intel Xeon E5-2650 v2 CPU and an Intel Stratix 10 GX 2800 FPGA. The FPGA contains 1,866,240 ALUTs and has a memory bandwidth of 34 GB/s. Table 1 shows the results under the different optimization strategies: the memory bandwidth column reports the achieved bandwidth (a larger value means higher bandwidth utilization), and a smaller kernel execution time means a faster kernel.
Table 1: Kernel results under different optimization strategies

  Scheme                           Memory bandwidth (MB/s)   Kernel execution time (ms)
  Not optimized                    1916.9                    1349.181
  Instruction stream optimization  5779.7                    46.620
  SIMD8                            23274.5                   11.132
  SIMD16                           27534.1                   9.425
  SIMD16+CU2                       28102.3                   9.243
"Instruction stream optimization" is the result of using data storage adjustment, data bit-width improvement, and instruction-stream optimization. As the data in Table 1 show, compared with the unoptimized case the memory bandwidth utilization is improved, the kernel execution time is shortened, and the computation speed is increased.
SIMD8 is the result of using data store adjustment, data bit width improvement, instruction stream optimization, and setting the kernel vectorization parameter to 8. As can be seen from the data in table 1, compared with the data storage adjustment, the data bit width improvement and the instruction stream optimization, the memory bandwidth utilization rate is greatly improved, and the calculation speed is improved.
SIMD16 is the result of using data store adjustment, data bit width improvement, instruction stream optimization, and setting the kernel vectorization parameter to 16. As can be seen from the data of table 1, the memory bandwidth utilization and computation speed are further improved compared to using data store adjustment, data bit width improvement, instruction stream optimization, and setting the kernel vectorization parameter to 8.
SIMD16+CU2 is the result of using data storage adjustment, data bit-width improvement, instruction-stream optimization, setting the kernel vectorization parameter to 16, and replicating 2 compute units. As the data in Table 1 show, the memory bandwidth utilization and computation speed are further improved compared with data storage adjustment, data bit-width improvement, instruction-stream optimization, and a kernel vectorization parameter of 16 alone.
Therefore, SIMD16+CU2, which combines all five methods of the invention (data storage adjustment, data bit-width improvement, instruction-stream optimization, kernel vectorization, and compute-unit replication), performs best, further illustrating the beneficial effects of the invention.
The above-described embodiments are merely preferred embodiments for fully explaining the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions and modifications will occur to those skilled in the art based on the present invention, and are intended to be within the scope of the present invention. The protection scope of the invention is subject to the claims.

Claims (9)

1. A 3DES acceleration method based on OpenCL and FPGA, characterized by comprising a host side and a device side, wherein:
The host side schedules and manages the kernel based on OpenCL and exchanges 3DES encryption and decryption data with the device side; the device side is implemented on the FPGA and uses 3DES to encrypt and decrypt the data;
The device side comprises a plaintext data input buffer module, a 3DES encryption computing module, and a ciphertext data output buffer module; the plaintext data input buffer module reads plaintext data from the global memory using data storage adjustment and data bit-width improvement, the 3DES encryption computing module forms a pipelined parallel architecture through instruction stream optimization, and the ciphertext data output buffer module transfers the data from the FPGA chip to the external DDR;
The data storage adjustment is specifically as follows: data transferred by the host side is stored in off-chip DDR, the constant memory resides in an on-chip cache unit, the physical location of the local memory is on-chip RAM, and the physical location of the private memory is on-chip registers; variables participating in the 3DES computation are stored in private memory, and each work item accesses its corresponding plaintext data block in private memory and completes the 3DES encryption; the S-boxes of the subkey expansion module and the E-box of the f-function computation module are transformed and stored in constant memory, whose physical location is on-chip ROM, so access is fast and access conflicts are avoided.
2. The 3DES acceleration method based on OpenCL and FPGA of claim 1, wherein: the host side schedules and manages the kernel based on OpenCL, specifically by interacting with the OpenCL device side through the OpenCL platform API and runtime API; the platform API defines the functions a host program uses to discover OpenCL devices and their capabilities, and the runtime API manages contexts, creates command queues, and handles other operations that occur at runtime.
3. The 3DES acceleration method based on OpenCL and FPGA of claim 1, wherein: the plaintext data input buffer module and the ciphertext data output buffer module are located in the global memory region of the device side, and the intermediate data of the 3DES encryption computing module is stored in the private memory region of the device side.
4. The 3DES acceleration method based on OpenCL and FPGA of claim 1, wherein: the data bit-width improvement specifically defines the behavior of a single work item as follows: 8 bytes of data are transferred from the global memory to the private memory, 3DES encryption is performed on those 8 bytes, and the result is then transferred from the private memory back to the global memory.
5. The 3DES acceleration method based on OpenCL and FPGA of claim 1, wherein: the instruction stream optimization specifically uses loop unrolling and loop pipelining to improve the parallelism of the program; loop unrolling guides the offline compiler, when converting the OpenCL kernel into a hardware image, to form an efficient pipeline, and the pipelined architecture shortens the overall execution time.
6. The 3DES acceleration method based on OpenCL and FPGA of claim 1, wherein: the device side also adopts a kernel vectorization strategy to improve memory access efficiency.
7. The OpenCL and FPGA based 3DES acceleration method of claim 6, wherein: the kernel vectorization strategy specifically allows multiple work items to execute instances of the kernel program in SIMD fashion, thereby improving memory access efficiency.
8. The OpenCL and FPGA based 3DES acceleration method of any one of claims 1-7, wherein: the device side adopts a compute unit replication strategy, which improves the performance of kernels with regular memory access patterns, so as to increase computational throughput.
9. The OpenCL and FPGA based 3DES acceleration method of claim 8, wherein: the compute unit replication strategy specifically has the FPGA compiler generate multiple compute units for the kernel, each compute unit executing multiple work groups simultaneously, in order to improve kernel throughput.
CN202011302847.7A 2020-11-19 2020-11-19 3DES acceleration method based on OpenCL and FPGA Active CN112328401B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011302847.7A CN112328401B (en) 2020-11-19 2020-11-19 3DES acceleration method based on OpenCL and FPGA


Publications (2)

Publication Number Publication Date
CN112328401A CN112328401A (en) 2021-02-05
CN112328401B true CN112328401B (en) 2024-05-24

Family

ID=74321303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011302847.7A Active CN112328401B (en) 2020-11-19 2020-11-19 3DES acceleration method based on OpenCL and FPGA

Country Status (1)

Country Link
CN (1) CN112328401B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339256B (en) * 2022-01-07 2023-11-07 华南师范大学 Real-time video encryption method and device based on OpenCL, electronic equipment and storage medium
CN116431217B * 2023-04-06 2023-10-17 中伏能源嘉兴股份有限公司 High-speed 3DES algorithm based on special registers of a 51-series single-chip microcontroller

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102932135A (en) * 2012-10-25 2013-02-13 福建升腾资讯有限公司 3DES (triple data encrypt standard) encryption method
CN107491317A (en) * 2017-10-10 2017-12-19 郑州云海信息技术有限公司 A kind of symmetrical encryption and decryption method and systems of AES for accelerating platform based on isomery
CN107566113A (en) * 2017-09-29 2018-01-09 郑州云海信息技术有限公司 The symmetrical encipher-decipher methods of 3DES, system and computer-readable recording medium
KR101923210B1 (en) * 2016-01-14 2018-11-28 서울대학교산학협력단 Apparatus for cryptographic computation on heterogeneous multicore processors and method thereof


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Design of an FPGA-based PCI encryption card; Yan Yingjian; Xu Jinfu; Xu Jinsong; Li Wei; Application of Electronic Technique (06); full text *
3DES encryption algorithm based on state-machine and pipeline techniques and its FPGA design; Shao Jinxiang, Chen Lixue; Application of Electronic Technique (01); full text *
Design and performance exploration of a reconfigurable computing system for domestic CPUs; Peng Fulai; Yu Zhilou; Chen Naikuo; Geng Shihua; Li Kaiyi; Computer Engineering and Applications (23); full text *
Design and implementation of a high-speed 3-DES algorithm IP core; Dang Zhijun; Niu Guanghui; Modern Electronics Technique (22); full text *


Similar Documents

Publication Publication Date Title
Samardzic et al. F1: A fast and programmable accelerator for fully homomorphic encryption
Iwai et al. Acceleration of AES encryption on CUDA GPU
Manavski CUDA compatible GPU as an efficient hardware accelerator for AES cryptography
CN106575215B (en) System, device, method, processor, medium, and electronic device for processing instructions
Husted et al. GPU and CPU parallelization of honest-but-curious secure two-party computation
Mei et al. CUDA-based AES parallelization with fine-tuned GPU memory utilization
TWI575388B (en) Method and apparatus for efficiently executing hash operations
CN104536937B (en) Big data all-in-one machine realization method based on CPU GPU isomeric groups
CN103973431B (en) A kind of AES parallelization implementation methods based on OpenCL
CN112328401B (en) 3DES acceleration method based on OpenCL and FPGA
Bos et al. Fast implementations of AES on various platforms
CN104468412B (en) BlueDrama packet delivery method and system based on RSS
Tran et al. Parallel execution of AES-CTR algorithm using extended block size
Kashino et al. Multi-hetero Acceleration by GPU and FPGA for Astrophysics Simulation on oneAPI Environment
Xing et al. Accelerating DES and AES algorithms for a heterogeneous many-core processor
Ortega et al. Parallelizing AES on multicores and GPUs
CN105933111B (en) A kind of Fast implementation of the Bitslicing-KLEIN based on OpenCL
Chen et al. Implementation and optimization of AES algorithm on the Sunway TaihuLight
KR101923210B1 (en) Apparatus for cryptographic computation on heterogeneous multicore processors and method thereof
Park et al. Pipsea: A practical ipsec gateway on embedded apus
Saini et al. Early multi-node performance evaluation of a knights corner (KNC) based NASA supercomputer
Chu et al. Dynamic kernel fusion for bulk non-contiguous data transfer on GPU clusters
Agosta et al. Fast disk encryption through GPGPU acceleration
Cui et al. High-speed elliptic curve cryptography on the NVIDIA GT200 graphics processing unit
Maistri et al. Implementation of the advanced encryption standard on gpus with the nvidia cuda framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant