CN105739951B - A GPU-based fast solution method for L1 minimization problems - Google Patents
A GPU-based fast solution method for L1 minimization problems
- Publication number
- CN105739951B CN105739951B CN201610116008.3A CN201610116008A CN105739951B CN 105739951 B CN105739951 B CN 105739951B CN 201610116008 A CN201610116008 A CN 201610116008A CN 105739951 B CN105739951 B CN 105739951B
- Authority
- CN
- China
- Prior art keywords
- vector
- parallel
- thread
- gpu
- minimization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
Abstract
A GPU-based fast solution method for L1 minimization problems. On NVIDIA GPU devices of the Maxwell architecture, using the CUDA parallel computing model together with new GPU features and kernel-fusion optimization techniques, a fast solver for L1 minimization problems is provided. The solution method comprises adaptively optimized vector operations and parallel designs for both the non-transposed and the transposed matrix-vector product; simply by changing the CUDA thread-distribution settings, it solves either a single L1 minimization problem or multiple concurrent ones in parallel. Test results show that the proposed solution method is effective, with a high degree of parallelism and adaptability, and that its performance substantially exceeds that of existing parallel solution methods.
Description
Technical field
The present invention relates to the fields of signal processing and face recognition, and more specifically to a GPU-based fast solution method for L1 minimization problems.
Background technology
An L1 minimization problem is min ||x||1 subject to the constraint Ax = b, where A ∈ R^(m×n) (m ≪ n) is a full-rank dense matrix, b ∈ R^m is a given vector, and x ∈ R^n is the unknown solution. The solution of an L1 minimization problem, also called a sparse representation, has been widely applied in many fields, such as signal processing, machine learning, and statistical inference. To solve L1 minimization problems, researchers have designed many effective algorithms, for example gradient projection, block Newton interior-point methods, homotopy methods, iterative shrinkage-thresholding, and augmented Lagrangian methods. In practice b often contains noise, so a variant of this problem, known as the unconstrained basis-pursuit denoising problem (BPDN problem) or Lasso problem, is solved instead:

min_x (1/2)||Ax − b||2^2 + λ||x||1

where λ is a scalar weight.
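As a concrete illustration of the objective above, the following NumPy sketch (not part of the patent; the function name is ours) evaluates the BPDN/Lasso objective for a small system:

```python
import numpy as np

def lasso_objective(A, x, b, lam):
    """BPDN/Lasso objective: 0.5*||Ax - b||_2^2 + lam*||x||_1."""
    r = A @ x - b
    return 0.5 * r @ r + lam * np.abs(x).sum()

# Tiny example: x = [1, 0] gives zero residual, so the objective is lam*||x||_1.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 0.0, 1.0])
x = np.array([1.0, 0.0])
obj = lasso_objective(A, x, b, lam=0.1)  # 0.5*0 + 0.1*1 = 0.1
```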
As problem sizes grow, the efficiency of these algorithms drops sharply. One effective way to improve their efficiency is to port them onto distributed or many-core architectures, such as the currently popular graphics processing unit (GPU). Since NVIDIA introduced the CUDA programming model in 2007, GPU-accelerated data processing has become a research hotspot.
Most L1 minimization algorithms consist mainly of dense matrix-vector products and vector operations. Because the CUBLAS library contains efficient implementations of these operations, existing GPU-accelerated L1 minimization algorithms are based primarily on CUBLAS. Testing shows, however, that the performance of the CUBLAS matrix-vector product becomes inconsistent as the number of matrix rows or columns grows, with a significant gap between best and worst cases. CUBLAS does not support kernel fusion, so when facing multiple concurrent L1 minimization problems it can neither exploit the new features of current GPUs nor configure the computing resources of the whole GPU optimally, and it incurs considerable overhead. The present invention therefore builds on the fast iterative shrinkage-thresholding algorithm and, by fully exploiting the hardware resources and computing power of Maxwell-architecture GPU devices, provides an efficient parallel solution method for L1 minimization problems.
Summary of the invention
The purpose of the present invention is to address the shortcomings of existing methods and, by exploiting GPU hardware resources and computing power, to provide an efficient parallel solution method for L1 minimization problems. The invention provides two solvers: a parallel solver for a single L1 minimization problem and a parallel solver for concurrent multiple L1 minimization problems, comprising an adaptively optimized parallel design for the non-transposed matrix-vector product, a parallel design for the transposed matrix-vector product, and a streaming parallel design.
To achieve the above purpose, the present invention adopts the following technical scheme.
The fast iterative shrinkage-thresholding algorithm (FISTA) is an iterative shrinkage-thresholding method that solves the unconstrained basis-pursuit denoising problem; it involves mainly matrix-vector products and vector operations and is therefore easy to parallelize. The present invention is accordingly based on FISTA and uses the CUDA parallel computing model on NVIDIA Maxwell-architecture GPU devices to accelerate the solution of L1 minimization problems in parallel. Adaptively optimized vector operations and non-transposed and transposed matrix-vector products are designed, and through a reasonable distribution of CUDA threads a parallel solver for a single L1 minimization problem and a parallel solver for concurrent multiple L1 minimization problems are realized.
The solution method comprises the following concrete steps:
1) according to the dimensions of the data dictionary and the computing resources of the GPU device, complete the warp-distribution and thread-distribution settings;
2) store the data dictionary as a 0-indexed row-major matrix padded to 32-byte alignment, and transfer the data dictionary and the vectors from the host to the GPU device;
3) meanwhile, on the host, asynchronously compute the input parameters of FISTA;
4) according to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for concurrent multiple L1 minimization problems on the GPU device;
5) on the GPU device, compute the non-transposed matrix-vector product in FISTA using the adaptively optimized non-transposed matrix-vector-product parallel design;
6) on the GPU device, compute the transposed matrix-vector product in FISTA using the adaptively optimized transposed matrix-vector-product parallel design;
7) on the GPU device, fuse the remaining vector operations and compute them in fused fashion using the streaming parallel design;
8) meanwhile, on the host, asynchronously compute the scalar values;
9) if the convergence condition is reached, stop iterating and transfer the sparse representation from the GPU device back to the host; otherwise return to step 5) and continue iterating.
FISTA involves mainly vector operations and matrix-vector products, with no matrix inversion or matrix factorization. It is therefore not only well suited to parallelization but also scales easily to large high-dimensional data. Vector operations and matrix-vector products are both low-arithmetic-intensity, bandwidth-bound operations, so the present invention uses kernel fusion, merging multiple vector operations and the matrix-vector product into one kernel; this eliminates global-memory accesses for intermediate results and exploits data locality. Furthermore, a matrix-vector product consists of multiple inner products, each row of the matrix forming an inner product with the vector, and through shared memory the vector is reused. Meanwhile, an adaptively optimized allocation strategy is used that makes full use of the memory hierarchy of GPU devices of compute capability 5.0 and above, combining data locality to realize multi-level cache-controlled optimization and reduce global-memory traffic.
The non-transposed matrix-vector-product parallel design of step 5) assigns, according to the warp-distribution setting and with adaptive optimization, one warp or several warps to compute each inner product, and uses the sparsity of the solution to reduce the amount of computation. The design comprises the following two-stage reduction:
1) in the first stage, all threads in each thread block first cooperatively load a contiguous segment of the vector into shared memory in parallel; each thread then completes its partial reduction, the warp-shuffle instructions complete the reduction within each warp, and the results are stored into contiguous shared memory; this repeats until the whole vector has been loaded;
2) in the second stage, shuffle instructions reduce the shared-memory data stored by the first stage, yielding the corresponding inner-product results.
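The two-stage reduction can be sketched in plain Python, simulating what the warp-shuffle instructions compute on the GPU (the chunk sizes and warp count here are illustrative assumptions, not the patent's exact kernel):

```python
import numpy as np

def warp_tree_reduce(vals):
    """Tree reduction over a power-of-two list, as done lane-wise with
    __shfl_down_sync inside one warp."""
    vals = list(vals)
    offset = len(vals) // 2
    while offset > 0:
        for lane in range(offset):
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]

def inner_product_two_stage(row, x, warps=4, warp_size=32):
    """Stage 1: each warp reduces its strided share of the dot product;
    stage 2: reduce the per-warp partial sums."""
    n = len(row)
    partials = []
    for w in range(warps):
        lane_sums = [0.0] * warp_size
        for lane in range(warp_size):
            # each thread strides over the vector, accumulating its share
            i = w * warp_size + lane
            while i < n:
                lane_sums[lane] += row[i] * x[i]
                i += warps * warp_size
        partials.append(warp_tree_reduce(lane_sums))
    # stage 2: pad the per-warp partials to a power of two, then tree-reduce
    while len(partials) & (len(partials) - 1):
        partials.append(0.0)
    return warp_tree_reduce(partials)

rng = np.random.default_rng(0)
row, x = rng.standard_normal(500), rng.standard_normal(500)
val = inner_product_two_stage(row.tolist(), x.tolist())
```

The strided indexing in stage 1 mirrors coalesced access: adjacent lanes touch adjacent elements of the row.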
In this parallel design, a warp contains 32 threads. To obtain the optimal number of warps per inner product, the following adaptive allocation strategy is proposed:

min w = sm × 2048 / k / 32, subject to m ≤ w

where w is the number of warp groups produced by the allocation (each group consists of k warps), k is the number of warps allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors in the GPU device, and m is the number of rows of the data-dictionary matrix. When k = 1, only the first-stage reduction is needed; when k = 32, the vector can be loaded directly into registers.
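A minimal sketch of the two allocation strategies, assuming 2048 resident threads per multiprocessor and simple integer rounding and clamping (the patent does not spell out the rounding), might look like:

```python
def warps_per_inner_product(m, sm, max_k=32, warp_size=32, resident=2048):
    """Non-transposed case: pick the largest k (warps per inner product)
    such that w = sm*resident/(warp_size*k) warp groups still cover m rows."""
    k = (sm * resident) // (warp_size * m)
    return max(1, min(max_k, k))

def threads_per_inner_product(n, sm, resident=2048):
    """Transposed case: t = sm*resident/k thread groups must cover n columns."""
    k = (sm * resident) // n
    return max(1, k)

# GTX 980-like device with sm = 16 streaming multiprocessors (assumption):
k_rows = warps_per_inner_product(256, 16)    # 16*2048/(32*256) = 4 warps
k_cols = threads_per_inner_product(1024, 16) # 16*2048/1024 = 32 threads
```

For very tall or wide dictionaries the quotient falls below 1 and the clamp returns k = 1, matching the "taken as 1 if less than 1" rule in the text.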
The transposed matrix-vector-product parallel design of step 6) assigns, according to the thread-distribution setting and with adaptive optimization, one thread or several threads to compute each inner product.
This design also comprises a two-stage reduction:
1) in the first stage, all threads in each thread block first cooperatively load a contiguous segment of the vector into shared memory in parallel; each thread then completes its partial reduction and stores the result into contiguous shared memory;
2) in the second stage, the shared-memory data produced by the first stage are reduced, yielding the corresponding inner-product results.
The design uses the following adaptive thread-allocation strategy:

min t = sm × 2048 / k, subject to n ≤ t

where t is the number of thread groups produced by the allocation (each group consists of k threads), k is the number of threads allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors in the GPU device, and n is the number of columns of the data-dictionary matrix. When k = 1, only the first stage is needed.
The streaming parallel design of step 7) processes each element of the vector operations in a streaming load fashion, including the soft-thresholding operator, which can also be vectorized; branches are eliminated by using the built-in functions provided by CUDA.
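The branch-free soft-thresholding operator described here can be sketched in NumPy, with `np.maximum` playing the role of CUDA's `fmax()` (an illustrative analogue, not the patent's kernel code):

```python
import numpy as np

def soft_threshold(u, a):
    """soft(u, a) = sign(u) * max(|u| - a, 0), written without branches,
    as the CUDA fmax()-based element-wise kernel would compute it."""
    return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

# Values within [-a, a] are zeroed; the rest shrink toward zero by a.
y = soft_threshold(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), 1.0)
```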
When the single-problem parallel solver setting of step 4) is enabled, one GPU device solves only one L1 minimization problem. The non-transposed matrix-vector-product parallel design, the transposed matrix-vector-product parallel design, and the streaming parallel design that fuses the vector operations are realized by three separate CUDA kernel functions.
When the parallel solver for concurrent multiple L1 minimization problems of step 4) is enabled, one GPU device solves multiple L1 minimization problems concurrently. The solution of each L1 minimization problem is completed by one or more thread blocks, and the non-transposed matrix-vector-product parallel design, the transposed matrix-vector-product parallel design, and the streaming parallel design that fuses the vector operations are realized together in a single CUDA kernel function. In addition, the built-in function provided by CUDA is used to cache accesses to the data-dictionary matrix in the read-only data cache, improving access efficiency.
The parallel solution method for L1 minimization problems proposed by the present invention fully exploits GPU hardware resources and computing power and has a high degree of parallelism and adaptability.
Brief description of the drawings
Fig. 1 is the memory-hierarchy diagram of a GPU device of compute capability 5.0 and above.
Fig. 2 is a schematic diagram of the matrix storage format used in the present invention.
Fig. 3 is a schematic diagram of the matrix padding with 32-byte alignment used in the present invention.
Fig. 4 is a schematic diagram of the kernel fusion of FISTA in the present invention.
Fig. 5 is a schematic diagram of the performance comparison between the single-problem parallel solver of the present invention on the GPU and on the CPU.
Fig. 6 is a schematic diagram of the performance comparison between the concurrent-multiple-problems parallel solver of the present invention and the single-problem version.
Fig. 7 is a flow chart of the method of the present invention.
Embodiments
In the following description, the present invention is explained in further detail with reference to Figs. 1-7 and specific embodiments.
The fast iterative shrinkage-thresholding algorithm is an accelerated iterative shrinkage-thresholding algorithm obtained by incorporating Nesterov's optimal gradient scheme, and possesses the non-asymptotic convergence rate O(1/k²). The algorithm adds a new sequence {y_k, k = 1, 2, ...}, with the following iteration:

x_k = soft(y_k − ∇f(y_k)/L_f, λ/L_f)
t_{k+1} = (1 + sqrt(1 + 4·t_k²)) / 2
y_{k+1} = x_k + ((t_k − 1)/t_{k+1})·(x_k − x_{k−1})

where λ is the scalar weight, soft(u, a) = sign(u)·max{|u| − a, 0} is the soft-thresholding operator, y_1 = x_0, t_1 = 1, L_f is the Lipschitz constant of ∇f(·), which can be obtained by computing the spectral norm of A^T A (i.e., ||A^T A||2), and ∇f(y_k) = A^T(Ay_k − b).
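The iteration above can be sketched end-to-end in NumPy. This is a plain sequential reference of FISTA, not the GPU implementation, and the test problem below is our own:

```python
import numpy as np

def fista(A, b, lam, iters=500):
    """Reference FISTA for min 0.5*||Ax-b||_2^2 + lam*||x||_1
    (a sequential NumPy sketch of the iteration the patent parallelizes)."""
    m, n = A.shape
    Lf = np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of grad f
    x = np.zeros(n)
    y, t = x.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)         # transposed mat-vec of the residual
        z = y - grad / Lf
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam / Lf, 0.0)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

# Recover a 3-sparse vector from 40 Gaussian measurements (illustrative setup).
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x0 = np.zeros(100)
x0[[3, 17, 42]] = [1.5, -2.0, 1.0]
b = A @ x0
x_hat = fista(A, b, lam=1e-3)
```

With a small λ the minimizer is close to the true sparse vector, so the recovered support matches the planted one.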
The present invention solves L1 minimization problems using the fast iterative shrinkage-thresholding algorithm, which involves mainly vector operations and matrix-vector products. On NVIDIA Maxwell-architecture GPU devices, based on the CUDA parallel computing model, the invention accelerates the fast iterative shrinkage-thresholding algorithm in parallel.
The invention proposes adaptively optimized parallel designs for the non-transposed matrix-vector product, the transposed matrix-vector product, and streaming vector operations. Using these parallel designs, and through a reasonable distribution of CUDA threads, a parallel solver for a single L1 minimization problem and a parallel solver for concurrent multiple L1 minimization problems are realized.
The concrete steps of the parallel solution method are as follows:
1) read the GPU device information, including the compute capability and the number of streaming multiprocessors; according to the dimensions of the data dictionary and the GPU device information, complete the warp-distribution and thread-distribution settings;
2) store the data dictionary A as a 0-indexed row-major matrix padded to 32-byte alignment; transfer the data dictionary A, the data item b, and the sparse representation x from the host to the GPU device;
3) meanwhile, on the host, asynchronously compute the input parameters of the fast iterative shrinkage-thresholding algorithm (such as L_f);
4) according to the number of L1 minimization problems to be solved: if a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for concurrent multiple L1 minimization problems on the GPU device;
5) on the GPU device, compute the non-transposed matrix-vector product Ay_k − b of the fast iterative shrinkage-thresholding algorithm using the adaptively optimized non-transposed matrix-vector-product parallel design;
6) on the GPU device, compute the transposed matrix-vector product A^T(*) of the fast iterative shrinkage-thresholding algorithm using the adaptively optimized transposed matrix-vector-product parallel design;
7) on the GPU device, compute the remaining vector operations of the fast iterative shrinkage-thresholding algorithm in fused fashion using the streaming parallel design;
8) meanwhile, on the host, asynchronously compute the value t_{k+1};
9) if the number of iterations or the sparsity of the solution satisfies the set condition, stop iterating and transfer the sparse representation from the GPU device to the host; otherwise return to step 5) and continue iterating.
When the single-problem parallel solver setting of step 4) above is enabled, one GPU device solves only one L1 minimization problem, and three CUDA kernel functions realize, respectively, the non-transposed matrix-vector product, the transposed matrix-vector product, and all the vector operations of the fast iterative shrinkage-thresholding algorithm; the concrete flow of FISTA is shown in Algorithm 1. The first kernel function realizes Ay_k − b using the non-transposed matrix-vector-product parallel design; the second kernel realizes A^T(*) using the transposed matrix-vector-product parallel design; the remaining vector operations are then fused into the third kernel using the streaming parallel design, as shown in Fig. 4. The dimensions of the input and output objects of each kernel function may differ, i.e., different launch configurations (the configuration of the thread grid and thread blocks, etc.) are used.
When the parallel solver for concurrent multiple L1 minimization problems of step 4) above is enabled, one GPU device solves multiple L1 minimization problems concurrently. The solution of each L1 minimization problem is completed by one or more thread blocks, and only one CUDA kernel function, combining the non-transposed matrix-vector-product parallel design, the transposed matrix-vector-product parallel design, and the streaming parallel design that fuses the vector operations, realizes the fast iterative shrinkage-thresholding algorithm. The __ldg() function is used to cache accesses to the data-dictionary matrix in the read-only data cache, improving access efficiency; in addition, the value t_{k+1} is no longer computed asynchronously on the host.
The non-transposed matrix-vector product is defined as Ax (A ∈ R^(m×n), x ∈ R^n) and consists of m inner products (each row of A with x), each of which can be computed independently. The non-transposed matrix-vector-product parallel design of step 5) above assigns one warp or several warps to compute each inner product of Ax, computes multiple inner products simultaneously, and assigns the warps to the inner products cyclically. For different matrix sizes and different GPU devices (with different computing-resource scales), an adaptive warp-allocation strategy is proposed that automatically chooses an optimal number k of warps per inner product, so that more CUDA cores and other functional units participate in the computation. Moreover, this design uses the sparsity of the solution to reduce the amount of computation of the kernel.
The design caches the vector x in shared memory and comprises the following two-stage reduction:
The first stage comprises the following steps:
1) x-load step: all threads in each thread block first cooperatively load a contiguous segment of the vector x into the shared-memory buffer xP in parallel, then perform the partial-reduction step. In this manner the accesses to the vector x are coalesced, and by sharing the segment of x the number of accesses is reduced.
2) partial-reduction step: each thread of a thread block performs a reduction over the part of the vector x already loaded into shared memory, with the formula:

bVAL += xP_i × A_rj

where bVAL is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_rj is the element of the matrix A corresponding to xP_i. If the vector x has not been fully loaded, return to the x-load step; otherwise perform the warp-reduction step. Evidently each thread may need to perform the reduction several times, and the accesses to the matrix A in global memory are also coalesced.
3) warp-reduction step: within each warp, each thread completes the partial reduction it is responsible for, and then the shuffle instructions provided by CUDA complete the final reduction.
In the second stage, multiple warps read this contiguous shared memory and complete the intra-warp reduction using shuffle instructions, yielding the corresponding inner-product results; if the inner-product computation is not finished, return to the first stage and continue with the next group of inner products.
The design uses the following adaptive warp-allocation strategy:

min w = sm × 2048 / k / 32, subject to m ≤ w

where w is the number of warp groups produced by the allocation (each group consists of k warps), k is the number of warps allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors in the GPU device, and m is the number of rows of the data-dictionary matrix. When k = 1, only the first-stage reduction is needed; when k = 32, the vector is loaded directly into registers.
The transposed matrix-vector product is defined as A^T x (A ∈ R^(m×n), x ∈ R^m) and consists of n inner products (each column of A with x), each of which can be computed independently. The transposed matrix-vector-product parallel design of step 6) above assigns one thread or several threads to compute each inner product of the transposed matrix-vector product, computes multiple inner products simultaneously, and assigns the threads to the inner products cyclically. For different matrix sizes and different GPU devices (with different computing-resource scales), an adaptive thread-allocation strategy is proposed that automatically chooses an optimal number k of threads per inner product, so that more CUDA cores and other functional units participate in the computation.
The design caches the vector x in shared memory and comprises the following two-stage reduction:
The first stage comprises the following steps:
1) x-load step: all threads in each thread block first cooperatively load a contiguous segment of the vector x into shared memory, then perform the partial-reduction step.
2) partial-reduction step: each thread of a thread block performs a reduction over the part of the vector x already loaded into shared memory, with the formula:

bVAL += xP_i × A_jc

where bVAL is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_jc is the element of the matrix A corresponding to xP_i. If x has not been fully loaded, return to the x-load step; if x has been fully loaded, perform the second stage. In the partial-reduction step, because the matrix is stored row-major and indexed from 0, the accesses to the matrix A in global memory would be uncoalesced if the thread groups (k threads forming one group) were organized unreasonably. Therefore the thread groups are created according to the following definition, so as to guarantee coalesced accesses.
Definition 1: suppose the thread-block size is s, h threads are jointly assigned to one dot product of A^T x, and z = s/h. Then the thread groups are organized as follows: {0, z, .., (h−1)·z}, {1, z+1, .., (h−1)·z+1}, …, {z−1, 2·z−1, .., h·z−1}.
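Definition 1 can be reproduced with a few lines of Python: consecutive thread IDs land in different groups (different columns) at the same offset, which is what makes the accesses to the row-major matrix coalesced (an illustrative sketch; the names are ours):

```python
def thread_groups(s, h):
    """Definition 1: block size s, h threads per dot product, z = s // h
    groups; group g is {g, g+z, ..., g+(h-1)*z}."""
    z = s // h
    return [[g + j * z for j in range(h)] for g in range(z)]

# Block of 8 threads, 2 threads per dot product -> 4 groups of stride-4 IDs.
groups = thread_groups(8, 2)
```

Note that threads 0, 1, 2, 3 belong to four different groups but share the same position within their group, so at each step they read consecutive matrix elements.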
In the second stage, multiple warps read this contiguous shared memory and perform the reduction, yielding the corresponding inner-product results.
The design uses the following adaptive thread-allocation strategy:

min t = sm × 2048 / k, subject to n ≤ t

where t is the number of thread groups produced by the allocation (each group consists of k threads), k is the number of threads allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors in the GPU device, and n is the number of columns of the data-dictionary matrix. When k = 1, only the first-stage reduction is needed.
The streaming parallel design of step 7) above processes each element of the vector operations in a streaming load fashion, i.e., each thread computes one element; this includes the soft-thresholding operator, which can also be vectorized, and the built-in function fmax() provided by CUDA is used to eliminate branches.
To demonstrate the effect, the method of the present invention was tested with single-precision matrices. The test environment is a computer with an Intel Xeon dual-core CPU and an NVIDIA GTX980 graphics card; the compilation and run environment is CUDA 6.5. Figs. 5 and 6 show the performance of the two concurrent solvers proposed by the invention, where CFISTA denotes the FISTA implementation based on CUBLAS, GFISTA denotes the single-problem parallel solver of the present invention, and MFISTASOL denotes the concurrent-multiple-problems parallel solver of the present invention. It can be seen that, compared with the CUBLAS-based solution method, the single-problem parallel solver of the present invention delivers a large performance improvement, and compared with the single-problem parallel solver, the concurrent-multiple-problems parallel solver of the present invention delivers a further gain of some magnitude.
Referring to Fig. 1, the memory hierarchy of an NVIDIA GPU device of compute capability 5.0 and above is multi-level: each thread can access the shared memory shared within its thread block; the L2 cache automatically caches global memory (located in dynamic random access memory); and the read-only data cache (L1 cache) can be controlled by the program to cache global memory.
Referring to Figs. 2 and 3, the data dictionary is stored as a 0-indexed row-major matrix padded to 32-byte alignment; this optimizes global-memory access performance and reduces the number of memory transactions.
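The 32-byte-aligned, row-major, 0-indexed storage can be sketched as padding the row stride so every row starts on an aligned boundary (the float32 element type here is an assumption for illustration):

```python
import numpy as np

def padded_row_stride(n, dtype=np.float32, align=32):
    """Round the row stride up so each row starts on an `align`-byte boundary."""
    elems = align // np.dtype(dtype).itemsize  # elements per aligned chunk
    return ((n + elems - 1) // elems) * elems

def store_row_major_padded(A, align=32):
    """Pack A row-major with zero padding at the end of each row."""
    m, n = A.shape
    ld = padded_row_stride(n, A.dtype, align)
    buf = np.zeros((m, ld), dtype=A.dtype)
    buf[:, :n] = A
    return buf.ravel(order="C"), ld

# A 3x5 float32 matrix: 5 columns pad up to a stride of 8 (32 bytes / 4 bytes).
A = np.ones((3, 5), dtype=np.float32)
flat, ld = store_row_major_padded(A)
```

Each padded row occupies a whole number of 32-byte memory transactions, which is the point of Fig. 3.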
Referring to Fig. 4, the kernel fusion of the single-problem parallel solver: the first kernel function fuses the non-transposed matrix-vector product with the vector subtraction to realize Ay_k − b; the second kernel realizes A^T(*); the remaining vector operations are fused into the third kernel.
Referring to Fig. 5, in each test case the initial x_0 always contains 1024 nonzero elements and b = Ax_0; each run terminates after 50 iterations. The execution times of all algorithms are listed in the figure, in seconds. Compared with CFISTA, GFISTA achieves speedups ranging from 37.68x to 53.66x, with an average speedup of 48.22x; the performance improvement is significant.
Referring to Fig. 6, the solver MFISTASOL for concurrent multiple L1 minimization problems. The test configuration is the same as in Fig. 5; for each case, 128 L1 minimization problems are solved concurrently. Compared with sequentially executing the single-L1-minimization-problem parallel solver GFISTA, MFISTASOL achieves an average speedup of more than 3.0x.
Claims (4)
1. A GPU-based fast solving method for L1 minimization problems, characterized in that it is based on the fast iterative shrinkage-thresholding algorithm (FISTA) and uses the CUDA parallel computing model on NVIDIA Maxwell-architecture GPU devices to accelerate the solution of L1 minimization problems in parallel; adaptively optimized vector operations, a non-transposed matrix-vector multiply and a transposed matrix-vector multiply are designed, and through rational allocation of CUDA threads a parallel solver for a single L1 minimization problem and a parallel solver for concurrent multiple L1 minimization problems are realized;
The solving method comprises the following steps:
1) according to the dimensions of the data dictionary and the computing resources of the GPU device, complete the warp allocation setting and the thread allocation setting;
2) store the data dictionary as a row-major matrix with 0-based indexing, padded to 32-byte alignment; transmit the data dictionary and the vectors from the host side to the GPU device side, the vectors being the data item b and the sparse representation x;
3) meanwhile, on the host side, asynchronously compute the input parameters of FISTA;
4) according to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is solved, launch the single-L1-minimization-problem parallel solver on the GPU device side; if concurrent multiple L1 minimization problems are solved, launch the parallel solver for concurrent multiple L1 minimization problems on the GPU device side;
The specific method for enabling the single-L1-minimization-problem parallel solver is: one GPU device solves only one L1 minimization problem, and the non-transposed matrix-vector multiply parallel design, the transposed matrix-vector multiply parallel design and the streaming parallel design fusing the vector operations are realized by three CUDA kernel functions respectively.
The specific method for enabling the parallel solver for concurrent multiple L1 minimization problems is: one GPU device can solve multiple L1 minimization problems concurrently; the solution of each L1 minimization problem is completed by one or more thread blocks, and the non-transposed matrix-vector multiply parallel design, the transposed matrix-vector multiply parallel design and the streaming parallel design fusing the vector operations are realized by a single CUDA kernel function; in addition, the intrinsic functions provided by CUDA are used to cache accesses to the data dictionary matrix in the read-only data cache, improving access efficiency.
5) on the GPU device side, use the adaptively optimized non-transposed matrix-vector multiply parallel design to realize the non-transposed matrix-vector multiply in FISTA;
6) on the GPU device side, use the adaptively optimized transposed matrix-vector multiply parallel design to realize the transposed matrix-vector multiply in FISTA;
7) on the GPU device side, fuse the remaining vector operations in FISTA and compute them in a fused manner using the streaming parallel design;
8) meanwhile, on the host side, asynchronously compute the scalar values;
9) if the convergence condition is reached, stop iterating and transmit the sparse representation from the GPU device side to the host side; otherwise return to step 5) and continue iterating.
2. The GPU-based fast solving method for L1 minimization problems according to claim 1, characterized in that the non-transposed matrix-vector multiply parallel design in step 5) is specifically: according to the warp allocation setting, adaptively and optimally allocate one warp or multiple warps to compute one inner product, and exploit the sparsity of the solution to reduce the amount of computation;
The parallel design comprises the following two-stage reduction:
1) in the first stage, all threads in each thread block first cooperatively read a continuous segment of the vector into shared memory in parallel; each thread in the same warp then completes its corresponding partial reduction; the warp-shuffle instruction is then used to complete the reduction within the warp, and the result is stored into contiguous shared memory; this repeats until the whole vector has been loaded;
2) in the second stage, the shuffle instruction is used to reduce the shared-memory data stored in the first stage, yielding the corresponding inner-product result;
In this parallel design, the number of threads a warp contains is 32; to obtain the optimal number of warps for computing one inner product, the following adaptive allocation strategy is proposed:
min w = sm × 2048 / k / 32, subject to m ≤ w
where w is the number of warp groups produced by the allocation, a warp group consists of k warps, k is the number of warps allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors the GPU device contains, and m is the number of rows of the data dictionary matrix; when k = 1, the strategy needs only the first-stage reduction; when k = 32, the vector can be loaded directly into registers.
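Under one reading of the allocation rule above (minimizing w is equivalent to taking the largest k whose resulting warp-group count w still covers the m matrix rows), the strategy can be sketched in Python; the function name, the brute-force search loop, and the cap of 32 warps are illustrative assumptions:

```python
def warps_per_inner_product(sm, m, max_warps=32):
    """Hypothetical sketch of the adaptive warp allocation:
    w = sm * 2048 / (32 * k) warp groups when k warps share one
    inner product; pick the largest k with w >= m (clamped to >= 1)."""
    best_k = 1
    for k in range(1, max_warps + 1):
        w = sm * 2048 // (32 * k)  # warp groups available at this k
        if w >= m:
            best_k = k
    return best_k
```

For example, on a 16-SM device (1024 resident warps total), a dictionary with 100 rows would get k = 10 warps per inner product, since 1024 // 10 = 102 ≥ 100 but 1024 // 11 = 93 < 100.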
3. The GPU-based fast solving method for L1 minimization problems according to claim 1, characterized in that the transposed matrix-vector multiply parallel design in step 6) is specifically: according to the thread allocation setting, adaptively and optimally allocate one thread or multiple threads to compute one inner product;
This parallel design also comprises a two-stage reduction:
1) in the first stage, all threads in each thread block first cooperatively read a continuous segment of the vector into shared memory in parallel; each thread then completes its corresponding partial reduction and stores the result into contiguous shared memory;
2) in the second stage, the shared-memory data obtained in the first stage is reduced, yielding the corresponding inner-product result;
The parallel design uses the following adaptive thread allocation strategy:
min t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation, a thread group consists of k threads, k is the number of threads allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors the GPU device contains, and n is the number of columns of the data dictionary matrix; when k = 1, only the first stage is needed.
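The thread allocation strategy above admits the same reading as the warp strategy of claim 2: take the largest k whose thread-group count t still covers the n columns. A Python sketch follows; the function name, the search loop, and the cap of 1024 threads per inner product are illustrative assumptions:

```python
def threads_per_inner_product(sm, n, max_k=1024):
    """Hypothetical sketch of the adaptive thread allocation:
    t = sm * 2048 / k thread groups when k threads share one inner
    product; pick the largest k with t >= n (clamped to >= 1)."""
    best_k = 1
    for k in range(1, max_k + 1):
        t = sm * 2048 // k  # thread groups available at this k
        if t >= n:
            best_k = k
    return best_k
```

On a 16-SM device (32768 resident threads total), a dictionary with 4096 columns would get k = 8 threads per inner product, since 32768 // 8 = 4096 exactly covers the columns.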
4. The GPU-based fast solving method for L1 minimization problems according to claim 1, characterized in that the streaming parallel design in step 7) is specifically: each element of the vector operations, including the soft-threshold operator, is processed in a streaming load fashion, and the intrinsic functions provided by CUDA are used to eliminate branches.
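The soft-threshold operator itself can be written branch-free, by analogy with CUDA's `fmaxf`/`copysignf` intrinsics; the following Python sketch is an illustrative assumption, not the patent's kernel code:

```python
import math

def soft_threshold(v, tau):
    """Branch-free soft threshold: sign(v) * max(|v| - tau, 0).
    math.copysign and max stand in for the CUDA copysignf/fmaxf
    intrinsics that would eliminate the branch on the GPU."""
    return math.copysign(max(abs(v) - tau, 0.0), v)
```

Because no `if` on the sign or magnitude of `v` is needed, all threads in a warp follow the same instruction path, avoiding divergence.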
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610116008.3A CN105739951B (en) | 2016-03-01 | 2016-03-01 | A kind of L1 minimization problem fast solution methods based on GPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105739951A CN105739951A (en) | 2016-07-06 |
CN105739951B true CN105739951B (en) | 2018-05-08 |
Family
ID=56248952
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610116008.3A Active CN105739951B (en) | 2016-03-01 | 2016-03-01 | A kind of L1 minimization problem fast solution methods based on GPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105739951B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502771B (en) * | 2016-09-09 | 2019-08-02 | 中国农业大学 | Time overhead model building method and system based on kernel function |
CN110088730B (en) * | 2017-06-30 | 2021-05-18 | 华为技术有限公司 | Task processing method, device, medium and equipment |
CN107886519A (en) * | 2017-10-17 | 2018-04-06 | 杭州电子科技大学 | Multichannel chromatogram three-dimensional image fast partition method based on CUDA |
CN109709547A (en) * | 2019-01-21 | 2019-05-03 | 电子科技大学 | A kind of reality beam scanning radar acceleration super-resolution imaging method |
CN112487740B (en) * | 2020-12-23 | 2024-06-18 | 深圳国微芯科技有限公司 | Boolean satisfiability problem solving method and system |
FR3122753B1 (en) * | 2021-05-10 | 2024-03-15 | Commissariat Energie Atomique | METHOD FOR EXECUTING A BINARY CODE BY A MICROPROCESSOR |
CN114943194B (en) * | 2022-05-16 | 2023-04-28 | 水利部交通运输部国家能源局南京水利科学研究院 | River pollution tracing method based on geostatistics |
CN117785480B (en) * | 2024-02-07 | 2024-04-26 | 北京壁仞科技开发有限公司 | Processor, reduction calculation method and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103505206A (en) * | 2012-06-18 | 2014-01-15 | 山东大学威海分校 | Fast and parallel dynamic MRI method based on compressive sensing technology |
US9118347B1 (en) * | 2011-08-30 | 2015-08-25 | Marvell International Ltd. | Method and apparatus for OFDM encoding and decoding |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20120025233A (en) * | 2010-09-07 | 2012-03-15 | 삼성전자주식회사 | Method and apparatus of reconstructing polychromatic image and medical image system enabling the method |
2016
- 2016-03-01 CN CN201610116008.3A patent/CN105739951B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9118347B1 (en) * | 2011-08-30 | 2015-08-25 | Marvell International Ltd. | Method and apparatus for OFDM encoding and decoding |
CN103505206A (en) * | 2012-06-18 | 2014-01-15 | 山东大学威海分校 | Fast and parallel dynamic MRI method based on compressive sensing technology |
Non-Patent Citations (2)
Title |
---|
Accelerated proximal algorithms for L1-minimization problem; Xiao Ya Zhang et al.; Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 2014 11th International Computer Conference on; 2015-04-02; pp. 139-143 *
Performance Analysis and Comparison of Fast L1-Norm Minimization Algorithms (快速L1范数最小化算法的性能分析和比较); Liu Jie et al.; Computer Knowledge and Technology (电脑知识与技术); July 2011; Vol. 7, No. 19; pp. 4641-4643 *
Also Published As
Publication number | Publication date |
---|---|
CN105739951A (en) | 2016-07-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105739951B (en) | A kind of L1 minimization problem fast solution methods based on GPU | |
US20220012593A1 (en) | Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization | |
Alwani et al. | Fused-layer CNN accelerators | |
Keuper et al. | Distributed training of deep neural networks: Theoretical and practical limits of parallel scalability | |
US20220391678A1 (en) | Neural network model processing method and apparatus, computer device, and storage medium | |
Lauterbach et al. | Fast BVH construction on GPUs | |
US20180157969A1 (en) | Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network | |
CN106709441B (en) | A kind of face verification accelerated method based on convolution theorem | |
Martín et al. | Algorithmic strategies for optimizing the parallel reduction primitive in CUDA | |
US11797855B2 (en) | System and method of accelerating execution of a neural network | |
CN106875013A (en) | The system and method for optimizing Recognition with Recurrent Neural Network for multinuclear | |
CN103049241B (en) | A kind of method improving CPU+GPU isomery device calculated performance | |
CN104765589B (en) | Grid parallel computation preprocess method based on MPI | |
KR20180123846A (en) | Logical-3d array reconfigurable accelerator for convolutional neural networks | |
CN112084038A (en) | Memory allocation method and device of neural network | |
CN103177414A (en) | Structure-based dependency graph node similarity concurrent computation method | |
US20230409885A1 (en) | Hardware Environment-Based Data Operation Method, Apparatus and Device, and Storage Medium | |
CN114461978B (en) | Data processing method and device, electronic equipment and readable storage medium | |
CN108875914B (en) | Method and device for preprocessing and post-processing neural network data | |
CN110414672B (en) | Convolution operation method, device and system | |
US20240160689A1 (en) | Method for optimizing convolution operation of system on chip and related product | |
CN106484532B (en) | GPGPU parallel calculating method towards SPH fluid simulation | |
Niu et al. | SPEC2: Spectral sparse CNN accelerator on FPGAs | |
Zayer et al. | Sparse matrix assembly on the GPU through multiplication patterns | |
CN107305486A (en) | A kind of neutral net maxout layers of computing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |