CN105739951B - A GPU-based fast solution method for L1 minimization problems - Google Patents

A GPU-based fast solution method for L1 minimization problems

Info

Publication number
CN105739951B
CN105739951B (application CN201610116008.3A)
Authority
CN
China
Prior art keywords
vector
parallel
thread
gpu
minimization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610116008.3A
Other languages
Chinese (zh)
Other versions
CN105739951A (en)
Inventor
高家全
李泽界
王宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201610116008.3A priority Critical patent/CN105739951B/en
Publication of CN105739951A publication Critical patent/CN105739951A/en
Application granted granted Critical
Publication of CN105739951B publication Critical patent/CN105739951B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)
  • Multi Processors (AREA)

Abstract

A GPU-based fast solution method for L1 minimization problems. On NVIDIA Maxwell-architecture GPU devices, using the CUDA parallel computing model, new GPU features, and kernel fusion and optimization techniques, a fast solution method for L1 minimization problems is provided. The method includes adaptively optimized vector operations and designs for non-transposed and transposed matrix-vector multiplication, and, simply through different CUDA thread allocation settings, it can solve a single L1 minimization problem or multiple concurrent L1 minimization problems in parallel. Test results show that the proposed method is effective and has high parallelism and adaptability; compared with existing parallel solution methods, its performance improves substantially.

Description

A GPU-based fast solution method for L1 minimization problems
Technical field
The present invention relates to the fields of signal processing and face recognition, and more specifically to a GPU-based fast solution method for L1 minimization problems.
Background art
The L1 minimization problem is min ‖x‖₁ subject to Ax = b, where A ∈ R^(m×n) (m << n) is a full-rank dense matrix, b ∈ R^m is a given vector, and x ∈ R^n is the unknown solution. The solution of the L1 minimization problem, also called a sparse representation, has been widely applied in many fields, such as signal processing, machine learning, and statistical inference. To solve the L1 minimization problem, researchers have designed many effective algorithms, for example gradient projection methods, truncated Newton interior-point methods, homotopy methods, iterative shrinkage-thresholding methods, and augmented Lagrangian methods. In practice, b often contains noise, so a variant of this problem, known as the unconstrained basis pursuit denoising problem (BPDN) or the Lasso problem, is considered:
min ½‖Ax − b‖₂² + λ‖x‖₁
where λ is a scalar weight.
As the problem scale grows, the efficiency of these algorithms degrades considerably. An effective way to improve efficiency is to port the algorithms to distributed or many-core architectures, such as the currently popular graphics processing unit (GPU). Since NVIDIA introduced the CUDA programming model in 2007, GPU-accelerated data processing has become a research hotspot.
Most L1 minimization algorithms consist mainly of dense matrix-vector multiplications and vector operations. Because the CUBLAS library contains efficient implementations of these operations, existing GPU-accelerated L1 minimization algorithms are mostly built on CUBLAS. However, experiments show that the matrix-vector multiplication in CUBLAS exhibits inconsistent performance as the number of matrix rows or columns grows, with a significant gap between best and worst cases. CUBLAS does not support kernel fusion; when facing multiple concurrent L1 minimization problems, it can neither fully exploit the new features of current GPUs nor optimally configure the computing resources of the whole GPU, incurring considerable overhead. Therefore, based on the fast iterative shrinkage-thresholding algorithm and on Maxwell-architecture GPU devices, the present invention fully exploits GPU hardware resources and computing capability to provide an efficient parallel solution method for L1 minimization problems.
Summary of the invention
The purpose of the present invention is, in view of the deficiencies of existing methods, to provide an efficient parallel solution method for L1 minimization problems by fully exploiting GPU hardware resources and computing capability. The present invention provides two solvers, a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems, including an adaptively optimized non-transposed matrix-vector multiplication parallel design, a transposed matrix-vector multiplication parallel design, and a streaming parallel design.
To achieve the above purpose, the present invention adopts the following technical scheme.
The fast iterative shrinkage-thresholding algorithm (FISTA) is an iterative shrinkage-thresholding method that can solve the unconstrained basis pursuit denoising problem; it mainly involves matrix-vector multiplications and vector operations and is easy to parallelize. The present invention is therefore based on FISTA and uses the CUDA parallel computing model on NVIDIA Maxwell-architecture GPU devices to accelerate the solution of L1 minimization problems in parallel. Adaptively optimized vector operations, a non-transposed matrix-vector multiplication, and a transposed matrix-vector multiplication are designed, and the parallel solver for a single L1 minimization problem and the parallel solver for multiple concurrent L1 minimization problems are realized through appropriate CUDA thread allocation.
The specific steps of this solution method are as follows:
1) Complete the warp allocation setting and the thread allocation setting according to the dimensions of the data dictionary and the computing resources of the GPU device;
2) Store the data dictionary as a 0-indexed, row-major matrix padded to 32-byte alignment, and transfer the data dictionary and the vectors from the host to the GPU device;
3) Meanwhile, on the host, asynchronously compute the input parameters of FISTA;
4) According to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is solved, enable the setting that invokes the single-problem parallel solver on the GPU device; if multiple concurrent L1 minimization problems are solved, enable the setting that invokes the parallel solver for multiple concurrent L1 minimization problems on the GPU device;
5) On the GPU device, use the adaptively optimized non-transposed matrix-vector multiplication design to compute the non-transposed matrix-vector product in FISTA;
6) On the GPU device, use the adaptively optimized transposed matrix-vector multiplication design to compute the transposed matrix-vector product in FISTA;
7) On the GPU device, fuse the remaining vector operations and compute them in a fused manner using the streaming parallel design;
8) Meanwhile, on the host, asynchronously compute the scalar value;
9) If the convergence condition is met, stop iterating and transfer the sparse representation from the GPU device to the host; otherwise return to step 5) and continue iterating.
FISTA mainly involves vector operations and matrix-vector multiplications, with no matrix inversion or matrix factorization. It is therefore not only well suited to parallelization but also scales easily to large, high-dimensional data. Both vector operations and matrix-vector multiplication are low-arithmetic-intensity, bandwidth-bound operations. Hence the present invention uses kernel fusion, combining several vector operations and a matrix-vector multiplication into one kernel, eliminating global-memory accesses for intermediate results and exploiting data locality. In addition, a matrix-vector multiplication consists of many inner products, each row of the matrix forming an inner product with the vector, and through shared memory the vector is reused. Meanwhile, an adaptively optimized allocation strategy is used to make full use of the memory hierarchy of GPU devices of compute capability 5.0 and above, combined with data locality, realizing multi-level cache control and optimization and reducing global-memory accesses.
In the non-transposed matrix-vector multiplication parallel design of step 5), according to the warp allocation setting, one warp or several warps are adaptively assigned to compute one inner product, and the sparsity of the solution is used to reduce the amount of computation.
This parallel design includes the following two-stage reduction:
1) In the first stage, all threads of each thread block first cooperatively load a contiguous segment of the vector into shared memory in parallel; each thread then performs its partial reduction; next, the shuffle instructions are used to complete the reduction within each warp, and the result is stored into contiguous shared memory; this repeats until the whole vector has been loaded;
2) In the second stage, the shuffle instructions are used to reduce the shared-memory data stored in the first stage, yielding the corresponding inner-product results.
In this parallel design, a warp contains 32 threads. To obtain the optimal number of warps for computing one inner product, the following self-adaptive allocation strategy is proposed:
min w = sm × 2048 / k / 32, subject to m ≤ w
where w is the number of warp groups produced by the allocation (each group consisting of k warps), k is the number of warps assigned to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors of the GPU device, and m is the number of rows of the data-dictionary matrix. When k = 1, the design only needs the first-stage reduction; when k = 32, the vector can be loaded directly into registers.
In the transposed matrix-vector multiplication parallel design of step 6), according to the thread allocation setting, one thread or several threads are adaptively assigned to compute one inner product.
This parallel design also includes a two-stage reduction:
1) In the first stage, all threads of each thread block first cooperatively load a contiguous segment of the vector into shared memory in parallel; each thread then performs its partial reduction and stores the result into contiguous shared memory;
2) In the second stage, the shared-memory data obtained in the first stage are reduced, yielding the corresponding inner-product results.
This parallel design uses the following self-adaptive thread allocation strategy:
min t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation (each group consisting of k threads), k is the number of threads assigned to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors of the GPU device, and n is the number of columns of the data dictionary. When k = 1, only the first stage is needed.
In the streaming parallel design of step 7), each element of the vector operations is processed in streaming-load fashion, including the soft-threshold operator, which can likewise be vectorized; the built-in functions provided by CUDA are used to eliminate branches.
When the setting for the single-problem parallel solver is enabled in step 4), one GPU device solves only one L1 minimization problem. The non-transposed matrix-vector multiplication design, the transposed matrix-vector multiplication design, and the streaming design that fuses the vector operations are each realized by one of three CUDA kernels.
When the setting for the parallel solver of multiple concurrent L1 minimization problems is enabled in step 4), one GPU device solves several L1 minimization problems concurrently. The solution of each L1 minimization problem is completed by one or more thread blocks, and the non-transposed matrix-vector multiplication design, the transposed matrix-vector multiplication design, and the streaming design that fuses the vector operations are all realized by a single CUDA kernel. In addition, the built-in function provided by CUDA is used to cache accesses to the data-dictionary matrix in the read-only data cache, improving access efficiency.
The parallel solution method for L1 minimization problems proposed by the present invention fully exploits GPU hardware resources and computing capability, and has high parallelism and adaptability.
Brief description of the drawings
Fig. 1 is the memory hierarchy of a GPU device of compute capability 5.0 and above.
Fig. 2 is a schematic diagram of the matrix storage format used in the present invention.
Fig. 3 is a schematic diagram of the 32-byte-aligned matrix padding used in the present invention.
Fig. 4 is a schematic diagram of the kernel fusion of FISTA in the present invention.
Fig. 5 shows the performance comparison of the parallel solver for a single L1 minimization problem in the present invention on the GPU and the CPU.
Fig. 6 shows the performance comparison of the parallel solver for multiple concurrent L1 minimization problems in the present invention against the single-problem version.
Fig. 7 is the flow chart of the method of the present invention.
Detailed description of the embodiments
In the following, the present invention is further explained in detail with reference to Figs. 1-7 and specific embodiments.
The fast iterative shrinkage-thresholding algorithm (FISTA) is an iterative shrinkage-thresholding algorithm, an accelerated variant obtained by incorporating Nesterov's optimal gradient method, with a non-asymptotic convergence rate of O(1/k²). The algorithm adds a new sequence {y_k, k = 1, 2, ...}, and its iteration is as follows:
x_k = soft(y_k − (1/L_f)·∇f(y_k), λ/L_f)
t_{k+1} = (1 + √(1 + 4·t_k²)) / 2
y_{k+1} = x_k + ((t_k − 1)/t_{k+1})·(x_k − x_{k−1})
where λ is a scalar weight, soft(u, a) = sign(u)·max{|u| − a, 0} is the soft-threshold operator, y_1 = x_0, t_1 = 1, L_f is the Lipschitz constant associated with ∇f(·), which can be obtained from the spectral norm of AᵀA (i.e. ‖AᵀA‖₂), and ∇f(y_k) = Aᵀ(Ay_k − b).
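To make the iteration above concrete, the following is a minimal single-threaded host-side reference of one FISTA step (plain C++, compilable in a CUDA source file); function and variable names are illustrative and not part of the patent. It computes the two matrix-vector products Ay_k − b and Aᵀ(·) that steps 5) and 6) below parallelize on the GPU.

#include <cmath>
#include <vector>

// soft(u, a) = sign(u) * max(|u| - a, 0)
static float soft(float u, float a) {
    return std::copysign(std::fmax(std::fabs(u) - a, 0.0f), u);
}

// One FISTA step for min 0.5*||Ax - b||^2 + lambda*||x||_1,
// with A stored 0-indexed and row-major (m x n), as in step 2) below.
void fista_step(const std::vector<float>& A, const std::vector<float>& b,
                std::vector<float>& x, std::vector<float>& y, float& t,
                float lambda, float Lf, int m, int n) {
    std::vector<float> r(m);                       // r = A*y - b (non-transposed product)
    for (int i = 0; i < m; ++i) {
        float s = 0.0f;
        for (int j = 0; j < n; ++j) s += A[(size_t)i * n + j] * y[j];
        r[i] = s - b[i];
    }
    std::vector<float> xNew(n);                    // gradient = A^T * r, then soft-threshold
    for (int j = 0; j < n; ++j) {
        float g = 0.0f;
        for (int i = 0; i < m; ++i) g += A[(size_t)i * n + j] * r[i];
        xNew[j] = soft(y[j] - g / Lf, lambda / Lf);
    }
    float tNext = 0.5f * (1.0f + std::sqrt(1.0f + 4.0f * t * t));
    for (int j = 0; j < n; ++j)                    // y_{k+1} = x_k + ((t_k-1)/t_{k+1})(x_k - x_{k-1})
        y[j] = xNew[j] + ((t - 1.0f) / tNext) * (xNew[j] - x[j]);
    x = xNew;
    t = tNext;
}

Starting from y_1 = x_0 and t_1 = 1, calling fista_step in a loop until the stopping condition of step 9) below is met reproduces the iteration above.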
The present invention relates to solving L1 minimization problems and employs the fast iterative shrinkage-thresholding algorithm, which mainly involves vector operations and matrix-vector multiplications. On NVIDIA Maxwell-architecture GPU devices, based on the CUDA parallel computing model, the present invention accelerates the fast iterative shrinkage-thresholding algorithm in parallel.
The present invention proposes an adaptively optimized non-transposed matrix-vector multiplication parallel design, a transposed matrix-vector multiplication parallel design, and a streaming parallel design. Using these designs, and through appropriate CUDA thread allocation, the parallel solver for a single L1 minimization problem and the parallel solver for multiple concurrent L1 minimization problems are realized.
The parallel solution method comprises the following specific steps:
1) Read the GPU device information, including the compute capability and the number of streaming multiprocessors; according to the dimensions of the data dictionary and the GPU device information, complete the warp allocation setting and the thread allocation setting;
2) Store the data dictionary A as a 0-indexed, row-major matrix padded to 32-byte alignment (a padding sketch is given after this list); transfer the data dictionary A, the data item b, and the sparse representation x from the host to the GPU device;
3) Meanwhile, on the host, asynchronously compute the input parameters of the fast iterative shrinkage-thresholding algorithm (such as L_f);
4) According to the number of L1 minimization problems to be solved: if a single L1 minimization problem is solved, enable the setting that invokes the single-problem parallel solver on the GPU device; if multiple concurrent L1 minimization problems are solved, enable the setting that invokes the parallel solver for multiple concurrent L1 minimization problems on the GPU device;
5) On the GPU device, use the adaptively optimized non-transposed matrix-vector multiplication design to compute the non-transposed matrix-vector product Ay_k − b of the fast iterative shrinkage-thresholding algorithm;
6) On the GPU device, use the adaptively optimized transposed matrix-vector multiplication design to compute the transposed matrix-vector product Aᵀ(·) of the fast iterative shrinkage-thresholding algorithm;
7) On the GPU device, use the streaming parallel design to compute the remaining vector operations of the fast iterative shrinkage-thresholding algorithm in a fused manner;
8) Meanwhile, on the host, asynchronously compute the value of t_{k+1};
9) If the iteration count or the sparsity of the solution satisfies the stopping condition, stop iterating and transfer the sparse representation from the GPU device to the host; otherwise return to step 5) and continue iterating.
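As referenced in step 2), the following is a minimal sketch (illustrative names, not the patent's code) of padding each row of the row-major dictionary A to a 32-byte boundary before copying it to the GPU, so that every row starts at an aligned address as in Fig. 3.

#include <cuda_runtime.h>
#include <vector>

// Round the leading dimension up to a multiple of 8 floats (8 * 4 bytes = 32 bytes).
inline int padded_ld(int n) { return (n + 7) / 8 * 8; }

float* upload_padded(const std::vector<float>& A, int m, int n) {
    int ld = padded_ld(n);
    std::vector<float> h((size_t)ld * m, 0.0f);            // zero-filled padding
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            h[(size_t)i * ld + j] = A[(size_t)i * n + j];  // 0-based, row-major
    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;                                              // device pointer with leading dimension ld
}

The kernels then index A with the padded leading dimension ld instead of n.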
When the setting for the single-problem parallel solver is enabled in step 4) above, one GPU device solves only one L1 minimization problem, and three CUDA kernels are invoked to realize, respectively, the non-transposed matrix-vector multiplication, the transposed matrix-vector multiplication, and all vector operations of the fast iterative shrinkage-thresholding algorithm; the overall flow of FISTA is shown in Algorithm 1. The first kernel uses the non-transposed matrix-vector multiplication design to compute Ay_k − b; the second kernel uses the transposed matrix-vector multiplication design to compute Aᵀ(·); the remaining vector operations are then fused into the third kernel using the streaming parallel design, as shown in Fig. 4. The dimensions of the inputs and outputs of each kernel can differ, i.e. different launch configurations (grid and thread-block configurations, etc.) are used.
When the setting for the parallel solver of multiple concurrent L1 minimization problems is enabled in step 4) above, one GPU device solves several L1 minimization problems concurrently. The solution of each L1 minimization problem is completed by one or more thread blocks, and the fast iterative shrinkage-thresholding algorithm is realized by only one CUDA kernel that combines the non-transposed matrix-vector multiplication design, the transposed matrix-vector multiplication design, and the streaming design that fuses the vector operations. The __ldg() function is used to cache accesses to the data-dictionary matrix in the read-only data cache, improving access efficiency; in addition, t_{k+1} is no longer computed asynchronously on the host.
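A minimal sketch of routing dictionary reads through the read-only data cache with __ldg() (available on devices of compute capability 3.5 and above); the function name and loop structure are illustrative, not the patent's exact code.

__device__ float row_dot(const float* __restrict__ A_row, const float* __restrict__ x, int n) {
    float s = 0.0f;
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        s += __ldg(&A_row[j]) * x[j];   // dictionary reads go through the read-only cache
    return s;                           // per-thread partial sum; reduce across the block afterwards
}

Only the dictionary matrix, which is never written during the iteration, is read through __ldg(); the vectors stay in ordinary global or shared memory.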
The non-transposed matrix-vector product is defined as Ax (A ∈ R^(m×n), x ∈ R^n); it consists of m inner products (each row of A with x), and each inner product can be computed independently. In the parallel design of the non-transposed product Ax in step 5) above, one warp or several warps are assigned to compute one inner product of Ax, several inner products are computed simultaneously, and the warps are assigned to the inner products cyclically. For different matrix sizes and different GPU devices (with different amounts of computing resources), a self-adaptive warp allocation strategy is proposed that automatically chooses the optimal number k of warps to compute one inner product, so that more CUDA cores and other functional units participate in the computation. Moreover, this design also uses the sparsity of the solution to reduce the amount of computation of this kernel.
This parallel design caches the vector x in shared memory and includes the following two-stage reduction:
The first stage includes the following steps:
1) x-load step: all threads of each thread block first cooperatively load a contiguous segment of the vector x into the shared-memory buffer xP, and then perform the partial-reduction step. In this way the accesses to the vector x are coalesced, and by sharing the segment of x the number of accesses is reduced.
2) partial-reduction step: each thread of a thread block performs a reduction over the portion of x already loaded into shared memory, according to
bVal += xP_i × A_(r,j)
where bVal is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_(r,j) is the element of the matrix A corresponding to xP_i. If the vector x has not yet been fully loaded, return to the x-load step; otherwise perform the warp-reduction step. Clearly, each thread may perform the reduction several times, and the accesses to the matrix A in global memory are also coalesced.
3) warp-reduction step: within each warp, each thread completes the partial reduction it is responsible for, and the shuffle instructions provided by CUDA are then executed to complete the final reduction, with the result stored into contiguous shared memory.
In the second stage, several warps read this contiguous shared memory and complete the intra-warp reduction with the shuffle instructions, yielding the corresponding inner-product results; if not all inner products have been computed, return to the first stage and continue with the next group of inner products.
This parallel design uses the following self-adaptive warp allocation strategy:
min w = sm × 2048 / k / 32, subject to m ≤ w
where w is the number of warp groups produced by the allocation (each group consisting of k warps), k is the number of warps assigned to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors of the GPU device, and m is the number of rows of the data-dictionary matrix. When k = 1, only the first-stage reduction is needed; when k = 32, the vector is loaded directly into registers.
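The following is a simplified sketch of the non-transposed product fused with the vector subtraction, r = Ay_k − b, for the k = 1 case (one warp per inner product); the x-load, partial-reduction, and warp-reduction steps appear as described above, while the multi-warp (k > 1) case, the sparsity-based skipping, and the exact tile size are omitted. Kernel and variable names are illustrative, and the modern __shfl_down_sync intrinsic is used (CUDA 6.5-era code would use __shfl_down).

#define TILE 256
#define WARP_SIZE 32

__global__ void matvec_minus_b(const float* __restrict__ A,   // m x n, row-major, leading dimension ld
                               const float* __restrict__ y,
                               const float* __restrict__ b,
                               float* r, int m, int n, int ld) {
    __shared__ float yP[TILE];
    int warpsPerBlock = blockDim.x / WARP_SIZE;
    int warpId = blockIdx.x * warpsPerBlock + threadIdx.x / WARP_SIZE;   // global warp index
    int lane   = threadIdx.x % WARP_SIZE;
    int nWarps = gridDim.x * warpsPerBlock;

    int rowIters = (m + nWarps - 1) / nWarps;                 // uniform loop count for all warps
    for (int it = 0; it < rowIters; ++it) {
        int row = warpId + it * nWarps;
        float bVal = 0.0f;
        for (int base = 0; base < n; base += TILE) {
            __syncthreads();                                   // previous tile no longer in use
            for (int j = threadIdx.x; j < TILE && base + j < n; j += blockDim.x)
                yP[j] = y[base + j];                           // x-load step: cooperative, coalesced
            __syncthreads();
            if (row < m) {
                int tileLen = min(TILE, n - base);
                for (int j = lane; j < tileLen; j += WARP_SIZE)
                    bVal += yP[j] * A[(size_t)row * ld + base + j];   // partial-reduction step
            }
        }
        if (row < m) {
            for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1)
                bVal += __shfl_down_sync(0xffffffffu, bVal, offset);  // warp-reduction step
            if (lane == 0) r[row] = bVal - b[row];             // fused vector subtraction
        }
    }
}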
The transposed matrix-vector product is defined as Aᵀx (A ∈ R^(m×n), x ∈ R^m); it consists of n inner products (each column of A with x), and each inner product can be computed independently. In the parallel design of the transposed matrix-vector product in step 6) above, one thread or several threads are assigned to compute one inner product, several inner products are computed simultaneously, and the threads are assigned to the inner products cyclically. For different matrix sizes and different GPU devices (with different amounts of computing resources), a self-adaptive thread allocation strategy is proposed that automatically chooses the optimal number k of threads to compute one inner product, so that more CUDA cores and other functional units participate in the computation.
This parallel design caches the vector x in shared memory and includes the following two-stage reduction:
The first stage includes the following steps:
1) x-load step: all threads of each thread block first cooperatively load a contiguous segment of the vector x into shared memory, and then perform the partial-reduction step.
2) partial-reduction step: each thread of a thread block performs a reduction over the portion of x already loaded into shared memory, according to
bVal += xP_i × A_(j,c)
where bVal is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_(j,c) is the element of the matrix A corresponding to xP_i. If x has not yet been fully loaded, return to the x-load step; if x has been fully loaded, perform the second stage. In the partial-reduction step, because the matrix is stored row-major with 0-based indexing, the accesses to the matrix A in global memory are not coalesced if the thread groups (k threads form one group) are organized unreasonably. Therefore, the thread groups are created according to the following definition to guarantee coalesced accesses.
Definition 1: assume the thread-block size is s, h threads are jointly assigned to one inner product of Aᵀx, and z = s / h. Then the thread groups are organized as follows: {0, z, .., (h−1)·z}, {1, z+1, .., (h−1)·z+1}, …, {z−1, 2·z−1, .., (h−1)·z+z−1}.
In the second stage, several warps read this contiguous shared memory and perform the reduction, yielding the corresponding inner-product results.
This parallel design uses the following self-adaptive thread allocation strategy:
min t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation (each group consisting of k threads), k is the number of threads assigned to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors of the GPU device, and n is the number of columns of the data dictionary. When k = 1, only the first-stage reduction is needed.
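Both allocation strategies pick the largest k whose group count still covers all inner products (minimizing w or t subject to the constraint). The following is a small host-side sketch, restricted for simplicity to power-of-two values of k up to 32 (the invention itself need not restrict k this way); names are illustrative.

#include <cuda_runtime.h>

// Number of streaming multiprocessors of the device (the "sm" in the formulas).
int stream_multiprocessors(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return prop.multiProcessorCount;
}

// Non-transposed A*y: w = sm*2048/k/32 warp groups must satisfy m <= w.
int warps_per_inner_product(int sm, int m) {
    int k = 1;
    while (k < 32 && m <= sm * 2048 / (2 * k) / 32) k *= 2;   // largest k still covering all m rows
    return k;
}

// Transposed A^T*r: t = sm*2048/k thread groups must satisfy n <= t.
int threads_per_inner_product(int sm, int n) {
    int k = 1;
    while (k < 32 && n <= sm * 2048 / (2 * k)) k *= 2;        // largest k still covering all n columns
    return k;
}

When even k = 1 cannot satisfy the constraint (more inner products than resident warps or threads), the groups each process several inner products cyclically, as described above.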
In the streaming parallel design of step 7) above, each element of the vector operations is processed in streaming-load fashion, i.e. each thread computes one element, including the soft-threshold operator, which can likewise be vectorized; the built-in function fmax() provided by CUDA is used to eliminate the branch.
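A sketch of the streaming fused vector kernel of the single-problem solver (one thread per element), assuming the gradient g = Aᵀ(Ay_k − b) was produced by the previous kernel; fmaxf() and copysignf() replace the branch of the soft-threshold operator. Parameter names (invLf = 1/L_f, thresh = λ/L_f, momentum = (t_k − 1)/t_{k+1}) are illustrative, not the patent's exact interface.

__global__ void fused_vector_update(float* x, float* y, const float* __restrict__ g,
                                    float invLf, float thresh, float momentum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xOld = x[i];
    float u = y[i] - invLf * g[i];
    // soft(u, thresh) = sign(u) * max(|u| - thresh, 0), branch-free via built-ins
    float xNew = copysignf(fmaxf(fabsf(u) - thresh, 0.0f), u);
    x[i] = xNew;
    y[i] = xNew + momentum * (xNew - xOld);    // y_{k+1} = x_k + ((t_k-1)/t_{k+1})(x_k - x_{k-1})
}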
To demonstrate the effect, the solution method of the present invention is tested with single-precision matrices. The test environment is a computer with an Intel Xeon dual-core CPU and an NVIDIA GTX980 graphics card; the compilation and runtime environment is CUDA 6.5. Figs. 5 and 6 show the performance of the two parallel solvers proposed by the invention, where CFISTA denotes the CUBLAS-based FISTA implementation, GFISTA denotes the single-problem parallel solver of the present invention, and MFISTASOL denotes the parallel solver of the present invention for multiple concurrent L1 minimization problems. Compared with the CUBLAS-based solution method, the single-problem parallel solver of the present invention shows a substantial performance improvement; compared with the single-problem parallel solver, the concurrent multi-problem parallel solver of the present invention achieves a further improvement.
Referring to Fig. 1, the memory hierarchy of NVIDIA GPU devices of compute capability 5.0 and above is multi-level: each thread can access the shared memory shared within its thread block; the L2 cache automatically caches global memory (which resides in DRAM); and the read-only data cache (at the L1 level) can be controlled by the program and used to cache global memory.
Referring to Figs. 2 and 3, the data dictionary is stored as a 0-indexed, row-major matrix padded to 32-byte alignment; this optimizes global-memory access performance and reduces the number of memory transactions.
Referring to Fig. 4, the kernel fusion of the single-problem parallel solver: the first kernel fuses the non-transposed matrix-vector multiplication with the vector subtraction to compute Ay_k − b; the second kernel computes Aᵀ(·); the remaining vector operations are fused into the third kernel.
Referring to Fig. 5, for each test case the initial x_0 always contains 1024 nonzero elements and b = Ax_0; each run terminates after 50 iterations; the execution times of all algorithms are listed in the figure, in seconds. Compared with CFISTA, GFISTA achieves speedups ranging from 37.68× to 53.66×, with an average speedup of 48.22×, a significant performance improvement.
Referring to Fig. 6, the solver MFISTASOL for multiple concurrent L1 minimization problems: the test configuration is the same as in Fig. 5, and for each test case 128 L1 minimization problems are solved concurrently. Compared with executing the single-problem parallel solver GFISTA sequentially, MFISTASOL achieves an average speedup of more than 3.0×.

Claims (4)

1. A GPU-based fast solution method for L1 minimization problems, characterized in that: based on the fast iterative shrinkage-thresholding algorithm (FISTA), the CUDA parallel computing model is used on NVIDIA Maxwell-architecture GPU devices to accelerate the solution of L1 minimization problems in parallel; adaptively optimized vector operations, a non-transposed matrix-vector multiplication, and a transposed matrix-vector multiplication are designed, and a parallel solver for a single L1 minimization problem and a parallel solver for multiple concurrent L1 minimization problems are realized through appropriate CUDA thread allocation;
The solution method comprises the following specific steps:
1) Complete the warp allocation setting and the thread allocation setting according to the dimensions of the data dictionary and the computing resources of the GPU device;
2) Store the data dictionary as a 0-indexed, row-major matrix padded to 32-byte alignment; transfer the data dictionary and the vectors from the host to the GPU device, the vectors being the data item b and the sparse representation x;
3) Meanwhile, on the host, asynchronously compute the input parameters of FISTA;
4) According to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is solved, enable the setting that invokes the single-problem parallel solver on the GPU device; if multiple concurrent L1 minimization problems are solved, enable the setting that invokes the parallel solver for multiple concurrent L1 minimization problems on the GPU device;
When the setting for the single-problem parallel solver is enabled, specifically: one GPU device solves only one L1 minimization problem, and the non-transposed matrix-vector multiplication design, the transposed matrix-vector multiplication design, and the streaming design that fuses the vector operations are each realized by one of three CUDA kernels;
When the setting for the parallel solver of multiple concurrent L1 minimization problems is enabled, specifically: one GPU device solves several L1 minimization problems concurrently; the solution of each L1 minimization problem is completed by one or more thread blocks, and the non-transposed matrix-vector multiplication design, the transposed matrix-vector multiplication design, and the streaming design that fuses the vector operations are all realized by a single CUDA kernel; in addition, the built-in function provided by CUDA is used to cache accesses to the data-dictionary matrix in the read-only data cache, improving access efficiency;
5) On the GPU device, use the adaptively optimized non-transposed matrix-vector multiplication design to compute the non-transposed matrix-vector product in FISTA;
6) On the GPU device, use the adaptively optimized transposed matrix-vector multiplication design to compute the transposed matrix-vector product in FISTA;
7) On the GPU device, fuse the remaining vector operations and compute them in a fused manner using the streaming parallel design;
8) Meanwhile, on the host, asynchronously compute the scalar value;
9) If the convergence condition is met, stop iterating and transfer the sparse representation from the GPU device to the host; otherwise return to step 5) and continue iterating.
2. The GPU-based fast solution method for L1 minimization problems according to claim 1, characterized in that the non-transposed matrix-vector multiplication parallel design in step 5) is specifically: according to the warp allocation setting, one warp or several warps are adaptively assigned to compute one inner product, and the sparsity of the solution is used to reduce the amount of computation;
This parallel design includes the following two-stage reduction:
1) In the first stage, all threads of each thread block first cooperatively load a contiguous segment of the vector into shared memory in parallel; each thread then performs its partial reduction; next, the shuffle instructions are used to complete the reduction within each warp, and the result is stored into contiguous shared memory; this repeats until the whole vector has been loaded;
2) In the second stage, the shuffle instructions are used to reduce the shared-memory data stored in the first stage, yielding the corresponding inner-product results;
In this parallel design, a warp contains 32 threads; to obtain the optimal number of warps for computing one inner product, the following self-adaptive allocation strategy is proposed:
min w = sm × 2048 / k / 32, subject to m ≤ w
where w is the number of warp groups produced by the allocation, each warp group consisting of k warps; k is the number of warps assigned to one inner product (taken as 1 if less than 1); sm is the number of streaming multiprocessors of the GPU device; and m is the number of rows of the data-dictionary matrix; when k = 1, the design only needs the first-stage reduction; when k = 32, the vector can be loaded directly into registers.
3. The GPU-based fast solution method for L1 minimization problems according to claim 1, characterized in that the transposed matrix-vector multiplication parallel design in step 6) is specifically: according to the thread allocation setting, one thread or several threads are adaptively assigned to compute one inner product;
This parallel design also includes a two-stage reduction:
1) In the first stage, all threads of each thread block first cooperatively load a contiguous segment of the vector into shared memory in parallel; each thread then performs its partial reduction and stores the result into contiguous shared memory;
2) In the second stage, the shared-memory data obtained in the first stage are reduced, yielding the corresponding inner-product results;
This parallel design uses the following self-adaptive thread allocation strategy:
min t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation, each thread group consisting of k threads; k is the number of threads assigned to one inner product (taken as 1 if less than 1); sm is the number of streaming multiprocessors of the GPU device; and n is the number of columns of the data dictionary; when k = 1, only the first stage is needed.
4. The GPU-based fast solution method for L1 minimization problems according to claim 1, characterized in that the streaming parallel design in step 7) is specifically: each element of the vector operations is processed in streaming-load fashion, including the soft-threshold operator, and the built-in functions provided by CUDA are used to eliminate branches.
CN201610116008.3A 2016-03-01 2016-03-01 A kind of L1 minimization problem fast solution methods based on GPU Active CN105739951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610116008.3A CN105739951B (en) 2016-03-01 2016-03-01 A kind of L1 minimization problem fast solution methods based on GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610116008.3A CN105739951B (en) 2016-03-01 2016-03-01 A kind of L1 minimization problem fast solution methods based on GPU

Publications (2)

Publication Number Publication Date
CN105739951A CN105739951A (en) 2016-07-06
CN105739951B true CN105739951B (en) 2018-05-08

Family

ID=56248952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610116008.3A Active CN105739951B (en) 2016-03-01 2016-03-01 A kind of L1 minimization problem fast solution methods based on GPU

Country Status (1)

Country Link
CN (1) CN105739951B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502771B (en) * 2016-09-09 2019-08-02 中国农业大学 Time overhead model building method and system based on kernel function
CN110088730B (en) * 2017-06-30 2021-05-18 华为技术有限公司 Task processing method, device, medium and equipment
CN107886519A (en) * 2017-10-17 2018-04-06 杭州电子科技大学 Multichannel chromatogram three-dimensional image fast partition method based on CUDA
CN109709547A (en) * 2019-01-21 2019-05-03 电子科技大学 A kind of reality beam scanning radar acceleration super-resolution imaging method
CN112487740B (en) * 2020-12-23 2024-06-18 深圳国微芯科技有限公司 Boolean satisfiability problem solving method and system
FR3122753B1 (en) * 2021-05-10 2024-03-15 Commissariat Energie Atomique METHOD FOR EXECUTING A BINARY CODE BY A MICROPROCESSOR
CN114943194B (en) * 2022-05-16 2023-04-28 水利部交通运输部国家能源局南京水利科学研究院 River pollution tracing method based on geostatistics
CN117785480B (en) * 2024-02-07 2024-04-26 北京壁仞科技开发有限公司 Processor, reduction calculation method and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103505206A (en) * 2012-06-18 2014-01-15 山东大学威海分校 Fast and parallel dynamic MRI method based on compressive sensing technology
US9118347B1 (en) * 2011-08-30 2015-08-25 Marvell International Ltd. Method and apparatus for OFDM encoding and decoding

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120025233A (en) * 2010-09-07 2012-03-15 삼성전자주식회사 Method and apparatus of reconstructing polychromatic image and medical image system enabling the method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9118347B1 (en) * 2011-08-30 2015-08-25 Marvell International Ltd. Method and apparatus for OFDM encoding and decoding
CN103505206A (en) * 2012-06-18 2014-01-15 山东大学威海分校 Fast and parallel dynamic MRI method based on compressive sensing technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Accelerated proximal algorithms for L1-minimization problem; Xiao Ya Zhang et al.; Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 2014 11th International Computer Conference on; 2015-04-02; 139-143 *
Performance analysis and comparison of fast L1-norm minimization algorithms; Liu Jie et al.; Computer Knowledge and Technology; 2011-07; Vol. 7, No. 19; 4641-4643 *

Also Published As

Publication number Publication date
CN105739951A (en) 2016-07-06

Similar Documents

Publication Publication Date Title
CN105739951B (en) A kind of L1 minimization problem fast solution methods based on GPU
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
Alwani et al. Fused-layer CNN accelerators
Keuper et al. Distributed training of deep neural networks: Theoretical and practical limits of parallel scalability
US20220391678A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
Lauterbach et al. Fast BVH construction on GPUs
US20180157969A1 (en) Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN106709441B (en) A kind of face verification accelerated method based on convolution theorem
Martín et al. Algorithmic strategies for optimizing the parallel reduction primitive in CUDA
US11797855B2 (en) System and method of accelerating execution of a neural network
CN106875013A (en) The system and method for optimizing Recognition with Recurrent Neural Network for multinuclear
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
CN104765589B (en) Grid parallel computation preprocess method based on MPI
KR20180123846A (en) Logical-3d array reconfigurable accelerator for convolutional neural networks
CN112084038A (en) Memory allocation method and device of neural network
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
US20230409885A1 (en) Hardware Environment-Based Data Operation Method, Apparatus and Device, and Storage Medium
CN114461978B (en) Data processing method and device, electronic equipment and readable storage medium
CN108875914B (en) Method and device for preprocessing and post-processing neural network data
CN110414672B (en) Convolution operation method, device and system
US20240160689A1 (en) Method for optimizing convolution operation of system on chip and related product
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
Niu et al. SPEC2: Spectral sparse CNN accelerator on FPGAs
Zayer et al. Sparse matrix assembly on the GPU through multiplication patterns
CN107305486A (en) A kind of neutral net maxout layers of computing device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant