CN112328401A - 3DES acceleration method based on OpenCL and FPGA - Google Patents


Info

Publication number
CN112328401A
CN112328401A (application CN202011302847.7A)
Authority
CN
China
Prior art keywords
data
3des
opencl
fpga
kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011302847.7A
Other languages
Chinese (zh)
Other versions
CN112328401B (en)
Inventor
柴志雷
吴健凤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202011302847.7A priority Critical patent/CN112328401B/en
Publication of CN112328401A publication Critical patent/CN112328401A/en
Application granted granted Critical
Publication of CN112328401B publication Critical patent/CN112328401B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/602 Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses a 3DES (Triple Data Encryption Standard) acceleration method based on OpenCL and FPGA (field-programmable gate array) that comprises a host end and a device end. The host end schedules and manages the kernel based on OpenCL and exchanges 3DES encryption/decryption data with the device end; the device end is implemented on an FPGA and performs the 3DES encryption and decryption. The device end comprises a plaintext data input cache module, a 3DES encryption calculation module, and a ciphertext data output cache module: the plaintext data input cache module reads plaintext data from global memory using data storage adjustment and data bit width improvement; the 3DES encryption calculation module forms a pipelined parallel architecture through instruction stream optimization of the data; and the ciphertext data output cache module transfers the data from the FPGA chip to the external DDR. By adopting data storage adjustment, data bit width improvement, and instruction stream optimization, the invention raises the kernel's actual bandwidth utilization and computation speed; a kernel vectorization strategy and a compute unit replication strategy improve performance further.

Description

3DES acceleration method based on OpenCL and FPGA
Technical Field
The invention relates to the technical field of encryption and decryption acceleration of heterogeneous platforms, in particular to a 3DES acceleration method based on OpenCL and FPGA.
Background
At present, encryption and decryption technologies are widely applied in fields such as digital currency, blockchain, and cloud data encryption. To meet their demand for high-intensity computing, current servers increasingly incorporate heterogeneous computing platforms to boost the performance of specific workloads, although this also raises the maintenance cost of the overall system. OpenCL (Open Computing Language) is an open framework for heterogeneous platforms whose kernel programs (Kernels) can be compiled and executed on multi-core CPUs, GPUs, FPGAs, and DSPs. Besides ASICs and GPUs for processing massive data, FPGAs can be deployed at large scale thanks to factors such as energy efficiency. The FPGA (field-programmable gate array) evolved from programmable devices such as PAL and GAL; as a semi-custom circuit in the application-specific integrated circuit field, it avoids the inflexibility of fully custom circuits while overcoming the limited gate count of earlier programmable devices. There is therefore research on accelerating encryption and decryption technology based on OpenCL and FPGA.
DES (Data Encryption Standard) is a common symmetric block cipher, but as computing power has grown, its key length has become vulnerable to brute-force attack. 3DES (Triple DES) resists such attacks by increasing the key length: it is equivalent to applying DES three times to each data block. However, when encrypting and decrypting data, a 3DES implementation based on OpenCL and FPGA suffers from low kernel bandwidth utilization and slow computation.
Disclosure of Invention
The invention aims to provide a 3DES acceleration method based on OpenCL and FPGA, which can effectively utilize the actual bandwidth of a kernel and improve the calculation speed.
In order to solve the above technical problems, the invention provides a 3DES (Triple DES) acceleration method based on OpenCL and FPGA (field-programmable gate array), comprising a host end and a device end,
the host end implements kernel scheduling and management based on OpenCL and exchanges 3DES encryption/decryption data with the device end; the device end is implemented on an FPGA and uses 3DES to encrypt and decrypt data;
the device end comprises a plaintext data input cache module, a 3DES encryption calculation module, and a ciphertext data output cache module; the plaintext data input cache module reads plaintext data from global memory using data storage adjustment and data bit width improvement, the 3DES encryption calculation module forms a pipelined parallel architecture by applying instruction stream optimization to the data, and the ciphertext data output cache module transfers the data from the FPGA chip to the external DDR.
Further, the host end implements kernel scheduling and management based on OpenCL; specifically, it interacts with the OpenCL device through the OpenCL platform API and runtime API. The platform API defines the functions a host program uses to discover OpenCL devices and query their capabilities, while the runtime API manages the context, creates command queues, and handles other operations that occur at runtime.
Furthermore, the plaintext data input cache module and the ciphertext data output cache module are located in a global memory area of the device end, and intermediate data of the 3DES encryption calculation module is stored in a private memory area of the device end.
Further, the data storage adjustment is as follows: data transmitted by the host end is stored in the off-chip DDR; the constant memory resides in the on-chip cache unit; the physical address of the local memory is on-chip RAM; and the physical address of the private memory is on-chip registers. Variables participating in the 3DES calculation are stored in private memory, where each work item accesses its corresponding plaintext block and completes the 3DES encryption. The transformation tables of the subkey expansion module's S box and the f-function calculation module's E box are stored in constant memory, whose physical address is on-chip ROM, providing fast access while avoiding access conflicts.
Further, the data bit width improvement specifically defines the behavior of a single work item as: transfer 8 bytes of data from global memory to private memory, perform the 3DES encryption calculation on those 8 bytes, and transfer the result from private memory back to global memory.
Further, the instruction stream optimization specifically uses loop unrolling and pipelining to increase program parallelism: loop unrolling guides the offline compiler to convert the OpenCL kernel into a hardware image that forms an effective pipeline, and this pipelined architecture shortens the overall execution time.
Further, the device end also adopts a kernel vectorization strategy, which forms wider vector computation channels to improve memory access efficiency.
Further, the kernel vectorization strategy specifically allows multiple work items to execute instances of the kernel program in SIMD fashion, which improves memory access efficiency.
Further, the device end adopts a compute unit replication strategy, which improves the performance of kernels with regular memory access patterns and thereby raises computational throughput.
Further, the compute unit replication strategy specifically has the FPGA compiler generate multiple compute units for the kernel, each of which executes multiple work groups simultaneously, improving kernel throughput.
The invention has the beneficial effects that: considering the memory model of OpenCL and the differing access characteristics of global and private memory, the invention adopts data storage adjustment and data bit width improvement to raise the kernel's actual bandwidth utilization; during the 3DES computation, instruction stream optimization forms a pipelined parallel architecture that raises the computation speed. On top of data storage adjustment, data bit width improvement, and instruction stream optimization, a kernel vectorization strategy and a compute unit replication strategy further improve system performance.
The foregoing summarizes the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with this specification, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a general architecture diagram of the present invention.
FIG. 2 is a flowchart of a host-side process of the present invention.
FIG. 3 is a pipelined architecture of kernel computation in the present invention.
Fig. 4 is a schematic diagram of the kernel vectorization strategy in the present invention.
FIG. 5 is a schematic diagram of a compute unit copy policy in accordance with the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
In the description of the present invention, it should be understood that the term "comprises/comprising" is intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Technical description in the invention:
1. 3DES
3DES is built on DES; security is strengthened by performing DES encryption three times to increase complexity. DES uses a 56-bit key over 16 rounds of iteration; 3DES uses a 168-bit key over 48 rounds. DES processes plaintext in 64-bit blocks, and the key involved in the calculation is fixed at 64 bits (56 effective bits). The encryption process consists of three parts: initial permutation, 16 rounds of iteration, and inverse initial permutation. The subkeys used in the 16 rounds of iteration are expanded from the 56-bit key.
The 64-bit input plaintext is initially permuted into halves L0 and R0, which undergo 16 identical rounds of iteration; a final inverse initial permutation yields the 64-bit output ciphertext. Each round comprises an XOR operation and an f-function operation. The inputs to the f function are the round's R_{i-1} and the round's subkey key_i: R_{i-1} is expanded by the E box and XORed with the 48-bit subkey key_i, after which S-box substitution and P-box permutation produce a 32-bit output. Each S box maps a 6-bit input to a 4-bit output and is the only nonlinear transformation in the process, which greatly improves security. The S-box substitution is implemented with lookup tables, and the 8 S boxes are stored in on-chip ROM, which effectively improves calculation efficiency. The subkey generation module of DES takes the 56-bit key as input and produces 16 subkeys key_i over 16 rounds of iteration, one for each of the 16 DES rounds. Each iteration of the subkey generation module comprises a circular left shift and a key permutation operation.
3DES was developed on the basis of DES; its input is a 64-bit plaintext block and its output a 64-bit ciphertext block. Unlike DES, 3DES uses a 192-bit key (effective length 168 bits). Let E_Ki(I) and D_Ki(I) denote DES encryption and decryption of data block I with DES key Ki, respectively. The 3DES encryption operation is O = E_K3(D_K2(E_K1(I))), converting a 64-bit input block I into a 64-bit output block O. The 3DES decryption operation is O = D_K1(E_K2(D_K3(I))), likewise converting a 64-bit input block I into a 64-bit output block O.
2. OpenCL
OpenCL provides developers with an abstract memory hierarchy that allows efficient code generation matched to the memory hierarchy of the target device. The OpenCL memory structure has four types: global memory, constant memory, local memory, and private memory. Work items run on processing elements (PEs) and can access their corresponding private memory; work groups run on compute units (CUs), and the work items in the same work group share a common local memory.
An OpenCL program comprises a host program and a kernel program. At kernel run time the system establishes an integer index space; each work item corresponds to one point of that index space, and work groups are collections of work items. Work items in the same work group share local memory and can synchronize within the group. A work item's coordinate in the global index space is its global ID, and its coordinate within its work group is its local ID.
3. Intel FPGA SDK
The invention uses the Intel FPGA SDK to implement the 3DES accelerator. The SDK supports emulating an OpenCL application on the CPU before building the system on the FPGA; software emulation uses the CPU to simulate FPGA hardware behavior and is typically used for functional verification. Currently, Intel's toolchain does not support hardware emulation. The Intel FPGA SDK contains an offline compiler that compiles the OpenCL kernel into an optimized hardware image: it translates the kernel code into intermediate Verilog, which the Quartus II software then compiles into a binary image that can be loaded onto the FPGA device when the program runs. Because applying the appropriate optimizations and building the hardware image takes hours, compilation is performed offline and the host program merely loads the image at run time. After the build completes, a host executable and a binary file are created to run the target program on the FPGA.
Referring to fig. 1, a schematic diagram of the general architecture of the invention: an embodiment of the OpenCL- and FPGA-based 3DES acceleration method comprises a host end and a device end. The host end implements kernel scheduling and management based on OpenCL, exchanges 3DES encryption/decryption data with the device end, and handles the reading, initialization, and storage of the plaintext data required for encryption. The device end is implemented on an FPGA and performs the 3DES encryption and decryption. The device end comprises a plaintext data input cache module, a 3DES encryption calculation module, and a ciphertext data output cache module. The plaintext data input cache module reads plaintext data from global memory using data storage adjustment and data bit width improvement, which raises the actual bandwidth utilization. The 3DES encryption calculation module completes the 3DES encryption on the FPGA; instruction stream optimization of the data forms a pipelined parallel architecture, improving the method's loop iterations and increasing computational parallelism. The ciphertext data output cache module transfers data from the FPGA chip to the external DDR. The plaintext input and ciphertext output cache modules reside in the device end's global memory area, while the intermediate data of the 3DES encryption calculation module is stored in the device end's private memory area.
The ciphertext data output cache module is essentially the mirror of the plaintext data input module and uses the same optimizations; the differences are the data content transferred and the direction of movement: the output module transfers ciphertext, moving data from the FPGA on-chip storage unit to the external global memory (whose physical address is DDR). The processing units in fig. 1 each belong to one work item; each has a corresponding private memory storing the intermediate data of its operation, and one processing unit completes the 3DES encryption of one plaintext block.
As shown in the host-end program flow of fig. 2, the host end implements kernel scheduling and management based on OpenCL. Specifically, it interacts with the OpenCL device end (here, the FPGA) through the OpenCL platform API and runtime API: the platform API defines the functions a host program uses to discover OpenCL devices and query their capabilities, while the runtime API manages the context, creates command queues, and handles other operations that occur at runtime. Both APIs are provided by the Intel FPGA SDK for OpenCL (the software support package Intel provides for OpenCL development on FPGAs). After reading and initializing the plaintext data, the host end queries the device end, then creates a context and command queue; after loading the kernel source it creates the kernel object and sets the kernel arguments, writes the plaintext into a memory object and executes the kernel, and finally reads back the device end's encryption result, releasing resources once the data interaction is finished.
In this embodiment, the data storage adjustment is specifically: data transmitted by the host is stored in the off-chip DDR, with data type __global; the constant memory resides in the on-chip cache unit, with data type __constant; the physical address of the local memory is on-chip RAM, with data type __local; and the physical address of the private memory is on-chip registers, with data type __private. Because these on-chip resources differ in size, latency, and throughput, sensible placement of data has a large impact on performance. The global memory type has the largest throughput and capacity, but also larger access latency; since data transmitted by the host end is stored in global memory, improving the actual utilization of memory bandwidth is effective for improving system performance. Compared with private memory, local memory offers higher throughput and larger capacity at comparable access latency, but barriers are needed after the work items of a work group execute in order to guarantee data consistency, which adds a certain amount of delay.
Therefore, the variables participating in the 3DES calculation are stored in private memory, where each work item accesses its corresponding plaintext block and completes the 3DES encryption. The transformation tables of the subkey expansion module's S box and the f-function calculation module's E box are accessed frequently and their values never change during the calculation, so they are stored in constant memory, whose physical address is on-chip ROM; this gives fast access while avoiding access conflicts.
Regarding the data bit width improvement: if the data width processed by a work item is not fixed, the compiler uses extra resources to cover every possible width, and the program's optimizing compilation is constrained. Since the 3DES input and output are both 64 bits, the data length processed by a single work item is set to 8 bytes. If the length were 4 bytes, two work items would be needed to encrypt one plaintext block, and data would have to be synchronized between them to ensure consistency, adding time overhead; if it were 16 bytes, a single work item would process twice the data, and the kernel execution time would in theory double. Therefore, in this embodiment the behavior of a single work item is defined as: transfer 8 bytes from global memory to private memory, perform the 3DES encryption calculation on them, and transfer the result from private memory back to global memory. A one-to-one correspondence between work items and plaintext blocks is obtained from each work item's global ID, which avoids synchronization between work items.
The instruction stream optimization of this embodiment uses loop unrolling and pipelining to increase program parallelism: loop unrolling instructs the offline compiler, while converting the OpenCL kernel into a hardware image, to form an effective pipeline, and the pipelined architecture shortens overall execution time. In the pipelined architecture of the kernel computation shown in fig. 3, RD denotes data read, CM data compute, and ST data store. Assuming each operation needs 1 clock cycle, without a pipeline the kernel waits 3 clock cycles before the next calculation; with the pipeline it waits only 1. In the kernel design, the 8-byte plaintext transfer module is unrolled 8 times, reducing the compiler's iteration count to 1; for the 8 load operations on the global variable, unrolling lets the compiler coalesce memory accesses and so raise the actual bandwidth utilization. In the 3DES encryption module, the 16 circular left shifts and key permutations of the subkey generation module are unrolled, as is the 16-round iterative calculation module inside each single-DES calculation; the 8-byte ciphertext transfer module is likewise unrolled 8 times to form coalesced memory accesses. At the cost of some hardware resources, loop unrolling reduces the compiler's iteration count and forms a pipelined architecture, improving parallelism.
In this embodiment the device end adopts a kernel vectorization strategy, forming wider vector computation channels to improve memory access efficiency. The strategy lets multiple work items execute instances of the kernel program in SIMD fashion, as shown in the kernel vectorization diagram of fig. 4. When it is used, the work-group size must be specified and must be evenly divisible by the vectorization parameter; the maximum vectorization parameter set in this embodiment is 16. Here the work-group size is 512 and the kernel vectorization parameter is 16, so the work items of each group are distributed across 16 SIMD vector lanes. Once the compiler implements the 16 SIMD lanes, each work item performs 16 times the original computation, and the global work size shrinks by the same factor of 16. As shown in fig. 4, after kernel vectorization the compiler coalesces memory accesses, merging multiple load operations on global memory into one wider vector load.
In this embodiment the device end further adopts a compute unit replication strategy, which improves the performance of kernels with regular memory access patterns and thereby raises computational throughput. The Intel FPGA SDK compiler supports generating multiple compute units for a kernel, and typically each compute unit can execute multiple work groups simultaneously, improving kernel throughput. With compute unit replication, a hardware scheduler in the FPGA dispatches work groups to the other available compute units; a compute unit can accept work groups as long as it has not reached its maximum capacity. As shown in the compute unit replication diagram of fig. 5, this embodiment implements 2 compute units combined with kernel vectorization of parameter 16; the FPGA hardware scheduler distributes work groups to the 2 compute units for execution, which can cut the system's running time in half.
On top of data storage adjustment, data bit width improvement, and instruction stream optimization, the two further optimizations, kernel vectorization and compute unit replication, enlarge the circuit area in exchange for higher throughput, further improving system performance.
To further demonstrate the benefits of the five methods of this invention (data storage adjustment, data bit width improvement, instruction stream optimization, kernel vectorization, and compute unit replication) for memory bandwidth and computation speed, the FPGA end was compared under different optimization strategies in a software environment of CentOS Linux release 7.7.1908, GCC 4.8.5, and Intel FPGA SDK for OpenCL 19.3, on hardware with an Intel Xeon E5-2650 v2 CPU and an Intel Stratix 10 GX 2800 FPGA. The FPGA contains 1,866,240 ALUTs and its memory bandwidth is 34 GB/s. Table 1 shows the results under the different optimization strategies: the larger the memory bandwidth value, the higher the bandwidth utilization, and the smaller the kernel execution time, the faster the kernel computation.
Scheme                            Memory bandwidth (MB/s)    Kernel execution time (ms)
Not optimized                     1916.9                     1349.181
Instruction stream optimization   5779.7                     46.620
SIMD8                             23274.5                    11.132
SIMD16                            27534.1                    9.425
SIMD16+CU2                        28102.3                    9.243
TABLE 1 Kernel results under different optimization strategies
The "instruction stream optimization" row is the result of using data storage adjustment, data bit width improvement, and instruction stream optimization. As the data in table 1 show, compared with the unoptimized case the memory bandwidth utilization is improved, the kernel execution time is shortened, and the calculation speed is raised.
SIMD8 adds a kernel vectorization parameter of 8 on top of data storage adjustment, data bit width improvement, and instruction stream optimization; table 1 shows a large further gain in memory bandwidth utilization and computation speed over those three methods alone.
SIMD16 raises the kernel vectorization parameter to 16; compared with SIMD8, memory bandwidth utilization and computation speed improve further.
SIMD16+CU2 additionally sets the compute unit replication count to 2; compared with SIMD16, memory bandwidth utilization and computation speed improve further still.
Therefore, the SIMD16+ CU2 which comprehensively uses the five methods of data storage adjustment, data bit width improvement, instruction stream optimization, kernel vectorization and computing unit replication in the invention is the best result, and further illustrates the beneficial effects of the invention.
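The best-performing SIMD16+CU2 configuration can be sketched with Intel FPGA SDK for OpenCL kernel attributes. This is a minimal illustration under stated assumptions, not the patent's actual source: the kernel name, argument names, work-group size, and the `des3_encrypt_block` helper are all hypothetical.

```c
/* Hypothetical sketch: kernel vectorization (SIMD16) and compute unit
 * replication (CU2) expressed as Intel FPGA SDK for OpenCL attributes. */
ulong des3_encrypt_block(ulong block);  /* hypothetical 3DES EDE helper */

__attribute__((num_simd_work_items(16)))      /* SIMD width 16            */
__attribute__((num_compute_units(2)))         /* two compute unit copies  */
__attribute__((reqd_work_group_size(64, 1, 1)))  /* must divide by SIMD width */
__kernel void des3_kernel(__global const ulong *restrict plaintext,
                          __global ulong *restrict ciphertext)
{
    size_t gid = get_global_id(0);
    /* Each work item moves one 64-bit block: global -> private,
     * encrypt, private -> global.  Inside des3_encrypt_block the 16
     * DES rounds would be unrolled for pipelining, e.g.:
     *     #pragma unroll
     *     for (int r = 0; r < 16; r++) { ... }                      */
    ciphertext[gid] = des3_encrypt_block(plaintext[gid]);
}
```

Note that `num_simd_work_items` requires a `reqd_work_group_size` whose first dimension is divisible by the SIMD width, which is why a work-group size of 64 is assumed here.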
The embodiments described above are merely preferred embodiments intended to fully illustrate the present invention, and the scope of the invention is not limited thereto. Equivalent substitutions or modifications made by those skilled in the art on the basis of the invention all fall within the protection scope of the invention, which is defined by the claims.

Claims (10)

1. A 3DES acceleration method based on OpenCL and FPGA, characterized in that it comprises a host end and a device end;
the host end implements scheduling and management of the kernel based on OpenCL and exchanges 3DES encryption/decryption data with the device end; the device end is implemented on an FPGA and performs 3DES encryption and decryption of the data;
the device end comprises a plaintext data input cache module, a 3DES encryption calculation module and a ciphertext data output cache module; the plaintext data input cache module reads plaintext data from global memory using data storage adjustment and data bit width improvement, the 3DES encryption calculation module forms a pipelined parallel architecture through instruction stream optimization of the data, and the ciphertext data output cache module transfers the data from the FPGA chip to the external DDR.
2. The OpenCL and FPGA-based 3DES acceleration method of claim 1, wherein: the host end implements scheduling and management of the kernel based on OpenCL, specifically by interacting with the OpenCL device end through the OpenCL platform API and runtime API; the platform API defines the functions a host program uses to discover OpenCL devices and their capabilities, and the runtime API is used to manage the context, create command queues, and perform other operations that arise at runtime.
3. The OpenCL and FPGA-based 3DES acceleration method of claim 1, wherein: the plaintext data input cache module and the ciphertext data output cache module are located in the global memory area of the device end, and the intermediate data of the 3DES encryption calculation module is stored in the private memory area of the device end.
4. The OpenCL and FPGA-based 3DES acceleration method of claim 1, wherein: the data storage adjustment means that data transferred from the host end is stored in off-chip DDR, the constant memory is located in the on-chip cache unit, the physical address of the local memory is on-chip RAM resources, and the physical address of the private memory is on-chip register resources; variables participating in the 3DES calculation are stored in private memory, and each work item accesses its corresponding plaintext data block in private memory and completes the 3DES encryption; the S-box and E-box transformation tables of the f-function calculation module and the subkey expansion module are stored in constant memory, whose corresponding physical address is on-chip ROM, providing fast access while avoiding access conflicts.
5. The OpenCL and FPGA-based 3DES acceleration method of claim 1, wherein: the data bit width improvement specifically defines the behavior of a single work item as carrying 8 bytes of data from global memory to private memory, performing the 3DES encryption calculation on the 8 bytes of data, and carrying the calculation result from private memory back to global memory.
6. The OpenCL and FPGA-based 3DES acceleration method of claim 1, wherein: the instruction stream optimization specifically uses loop unrolling and pipelining to improve the parallelism of the program; loop unrolling guides the offline compiler, when converting the OpenCL kernel into a hardware image, to form an effective pipeline, and the pipelined architecture shortens the overall execution time.
7. The OpenCL and FPGA-based 3DES acceleration method of claim 1, wherein: the device end further adopts a kernel vectorization strategy, which forms wider vector computation channels to improve memory access efficiency.
8. The OpenCL and FPGA-based 3DES acceleration method of claim 7, wherein: the kernel vectorization strategy specifically allows multiple work items to execute one instance of the kernel program in SIMD fashion; using the kernel vectorization strategy improves memory access efficiency.
9. The OpenCL and FPGA-based 3DES acceleration method according to any one of claims 1-8, wherein: the device end adopts a compute unit replication strategy for improving the performance of kernels with regular memory access patterns, thereby increasing computational throughput.
10. The OpenCL and FPGA-based 3DES acceleration method according to claim 9, wherein: the compute unit replication strategy specifically means that the FPGA compiler generates multiple compute units for the kernel, and each compute unit executes multiple work groups simultaneously, thereby improving kernel throughput.
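The host-end scheduling and data interaction described in claim 2 can be sketched with standard OpenCL 1.2 host API calls. This is a hedged illustration only: error handling is omitted, the binary file name and kernel name are hypothetical, and on FPGA platforms the kernel is loaded as a precompiled image via clCreateProgramWithBinary rather than built from source.

```c
#include <CL/cl.h>
#include <stdio.h>

/* Hypothetical host-end sketch: discover the device (platform API),
 * then create the context, command queue, and buffers and launch the
 * kernel (runtime API).  Error checks omitted for brevity. */
int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);                    /* platform API */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Load the precompiled FPGA image (file/kernel names hypothetical):
     *   cl_program prog = clCreateProgramWithBinary(ctx, 1, &device,
     *                         &binary_len, &binary, NULL, NULL);
     *   cl_kernel  k    = clCreateKernel(prog, "des3_kernel", NULL);   */

    size_t nbytes = 1 << 20;                 /* e.g. 1 MiB of plaintext */
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  nbytes, NULL, NULL);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, nbytes, NULL, NULL);

    /* 3DES data interaction with the device end:
     *   clEnqueueWriteBuffer(q, in, ...);   clSetKernelArg(k, ...);
     *   clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
     *   clEnqueueReadBuffer(q, out, ...);                               */
    (void)q; (void)in; (void)out;
    return 0;
}
```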
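The memory mapping in claim 4 can likewise be sketched with OpenCL address-space qualifiers. A minimal, hedged sketch (kernel name ours; the lookup tables are abbreviated to their first few standard DES entries):

```c
/* Hypothetical sketch of the claimed memory mapping:
 * __constant tables -> on-chip ROM, __global buffers -> off-chip DDR,
 * __private variables -> on-chip registers.  Tables abbreviated. */
__constant uchar SBOX1[64] = {14, 4, 13, 1 /* ... remaining entries ... */};
__constant uchar EBOX[48]  = {32, 1, 2, 3, 4, 5 /* ... remaining entries ... */};

__kernel void des3_kernel_mem(__global const ulong *restrict in,  /* off-chip DDR */
                              __global ulong *restrict out)
{
    size_t gid = get_global_id(0);
    __private ulong block = in[gid];  /* private memory: on-chip registers */
    /* ... 16 DES rounds x 3 passes, indexing SBOX1/EBOX in constant
     *     memory (on-chip ROM, conflict-free lookups) ... */
    out[gid] = block;
}
```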
CN202011302847.7A 2020-11-19 2020-11-19 3DES acceleration method based on OpenCL and FPGA Active CN112328401B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011302847.7A CN112328401B (en) 2020-11-19 2020-11-19 3DES acceleration method based on OpenCL and FPGA


Publications (2)

Publication Number Publication Date
CN112328401A true CN112328401A (en) 2021-02-05
CN112328401B CN112328401B (en) 2024-05-24

Family

ID=74321303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011302847.7A Active CN112328401B (en) 2020-11-19 2020-11-19 3DES acceleration method based on OpenCL and FPGA

Country Status (1)

Country Link
CN (1) CN112328401B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102932135A (en) * 2012-10-25 2013-02-13 福建升腾资讯有限公司 3DES (triple data encrypt standard) encryption method
KR101923210B1 (en) * 2016-01-14 2018-11-28 서울대학교산학협력단 Apparatus for cryptographic computation on heterogeneous multicore processors and method thereof
CN107566113A (en) * 2017-09-29 2018-01-09 郑州云海信息技术有限公司 The symmetrical encipher-decipher methods of 3DES, system and computer-readable recording medium
CN107491317A (en) * 2017-10-10 2017-12-19 郑州云海信息技术有限公司 A kind of symmetrical encryption and decryption method and systems of AES for accelerating platform based on isomery

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
严迎建; 徐金甫; 徐劲松; 李伟: "Design of an FPGA-based PCI encryption card", Application of Electronic Technique, no. 06 *
党志军; 牛光辉: "Design and implementation of a high-speed 3-DES IP core", Modern Electronics Technique, no. 22 *
彭福来; 于治楼; 陈乃阔; 耿士华; 李凯一: "Design and performance exploration of a reconfigurable computing system for domestic CPUs", Computer Engineering and Applications, no. 23 *
邵金祥, 陈利学: "A 3DES encryption algorithm based on state-machine and pipeline techniques and its FPGA design", Application of Electronic Technique, no. 01 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114339256A (en) * 2022-01-07 2022-04-12 华南师范大学 Real-time video encryption method and device based on OpenCL, electronic equipment and storage medium
CN114339256B (en) * 2022-01-07 2023-11-07 华南师范大学 Real-time video encryption method and device based on OpenCL, electronic equipment and storage medium
CN116431217A (en) * 2023-04-06 2023-07-14 中伏能源嘉兴股份有限公司 High-speed 3DES algorithm based on 51-series singlechip special register
CN116431217B (en) * 2023-04-06 2023-10-17 中伏能源嘉兴股份有限公司 High-speed 3DES algorithm based on 51-series singlechip special register

Also Published As

Publication number Publication date
CN112328401B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
Iwai et al. Acceleration of AES encryption on CUDA GPU
EP3958160B1 (en) Encoded inline capabilities
Manavski CUDA compatible GPU as an efficient hardware accelerator for AES cryptography
CN106575215B (en) System, device, method, processor, medium, and electronic device for processing instructions
EP3550764B1 (en) Hardware accelerators and methods for high-performance authenticated encryption
Husted et al. GPU and CPU parallelization of honest-but-curious secure two-party computation
TWI575388B (en) Method and apparatus for efficiently executing hash operations
Mei et al. CUDA-based AES parallelization with fine-tuned GPU memory utilization
CN103973431B (en) A kind of AES parallelization implementation methods based on OpenCL
US11121856B2 (en) Unified AES-SMS4—Camellia symmetric key block cipher acceleration
US8106916B1 (en) Cryptographic computations on general purpose graphics processing units
CN112328401B (en) 3DES acceleration method based on OpenCL and FPGA
US10606765B2 (en) Composite field scaled affine transforms-based hardware accelerator
CN104468412B (en) BlueDrama packet delivery method and system based on RSS
Lee et al. Fast implementation of block ciphers and PRNGs in Maxwell GPU architecture
Fujita et al. Accelerating space radiative transfer on fpga using opencl
Tran et al. Parallel execution of AES-CTR algorithm using extended block size
Ortega et al. Parallelizing AES on multicores and GPUs
Han et al. cuGimli: optimized implementation of the Gimli authenticated encryption and hash function on GPU for IoT applications
Chen et al. Implementation and optimization of AES algorithm on the sunway taihulight
KR101923210B1 (en) Apparatus for cryptographic computation on heterogeneous multicore processors and method thereof
Saini et al. Early multi-node performance evaluation of a knights corner (KNC) based NASA supercomputer
Li et al. Efficient AES implementation on Sunway TaihuLight supercomputer: A systematic approach
Agosta et al. Fast disk encryption through GPGPU acceleration
Yuan et al. Acceleration of AES encryption with OpenCL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant