CN105739951B - A GPU-based fast solution method for L1 minimization problems - Google Patents
A GPU-based fast solution method for L1 minimization problems
- Publication number
- CN105739951B CN105739951B CN201610116008.3A CN201610116008A CN105739951B CN 105739951 B CN105739951 B CN 105739951B CN 201610116008 A CN201610116008 A CN 201610116008A CN 105739951 B CN105739951 B CN 105739951B
- Authority
- CN
- China
- Prior art keywords
- vector
- parallel
- thread
- gpu
- minimization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
Abstract
A GPU-based fast solution method for L1 minimization problems. On NVIDIA GPU devices of the Maxwell architecture, using the CUDA parallel computing model together with new GPU features and kernel-fusion optimization techniques, a fast solver for L1 minimization problems is provided. The solution method comprises adaptively optimized vector operations and parallel designs for both the non-transposed and the transposed matrix-vector product; simply by changing the CUDA thread-distribution settings, it solves either a single L1 minimization problem or multiple concurrent ones in parallel. Test results show that the proposed solution method is effective, with a high degree of parallelism and adaptability, and that its performance substantially exceeds that of existing parallel solution methods.
Description
Technical field
The present invention relates to the fields of signal processing and face recognition, and more specifically to a GPU-based fast solution method for L1 minimization problems.
Background technology
An L1 minimization problem is min ||x||1 subject to the constraint Ax = b, where A ∈ R^(m×n) (m ≪ n) is a full-rank dense matrix, b ∈ R^m is a given vector, and x ∈ R^n is the unknown solution. The solution of an L1 minimization problem, also called a sparse representation, has been widely applied in many fields, such as signal processing, machine learning, and statistical inference. To solve L1 minimization problems, researchers have designed many effective algorithms, for example gradient projection, block Newton interior-point methods, homotopy methods, iterative shrinkage-thresholding, and augmented Lagrangian methods. In practice b often contains noise, so a variant of this problem, known as the unconstrained basis-pursuit denoising problem (BPDN problem) or Lasso problem, is solved instead:

min_x (1/2)||Ax − b||2^2 + λ||x||1

where λ is a scalar weight.
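As a concrete illustration of the objective above, the following NumPy sketch (not part of the patent; the function name is ours) evaluates the BPDN/Lasso objective for a small system:

```python
import numpy as np

def lasso_objective(A, x, b, lam):
    """BPDN/Lasso objective: 0.5*||Ax - b||_2^2 + lam*||x||_1."""
    r = A @ x - b
    return 0.5 * r @ r + lam * np.abs(x).sum()

# Tiny example: x = [1, 0] gives zero residual, so the objective is lam*||x||_1.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 0.0, 1.0])
x = np.array([1.0, 0.0])
obj = lasso_objective(A, x, b, lam=0.1)  # 0.5*0 + 0.1*1 = 0.1
```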
As problem sizes grow, the efficiency of these algorithms drops sharply. One effective way to improve their efficiency is to port them onto distributed or many-core architectures, such as the currently popular graphics processing unit (GPU). Since NVIDIA introduced the CUDA programming model in 2007, GPU-accelerated data processing has become a research hotspot.
Most L1 minimization algorithms consist mainly of dense matrix-vector products and vector operations. Because the CUBLAS library contains efficient implementations of these operations, existing GPU-accelerated L1 minimization algorithms are based primarily on CUBLAS. Testing shows, however, that the performance of the CUBLAS matrix-vector product becomes inconsistent as the number of matrix rows or columns grows, with a significant gap between best and worst cases. CUBLAS does not support kernel fusion, so when facing multiple concurrent L1 minimization problems it can neither exploit the new features of current GPUs nor configure the computing resources of the whole GPU optimally, and it incurs considerable overhead. The present invention therefore builds on the fast iterative shrinkage-thresholding algorithm and, by fully exploiting the hardware resources and computing power of Maxwell-architecture GPU devices, provides an efficient parallel solution method for L1 minimization problems.
Summary of the invention
The purpose of the present invention is to address the shortcomings of existing methods and, by exploiting GPU hardware resources and computing power, to provide an efficient parallel solution method for L1 minimization problems. The invention provides two solvers: a parallel solver for a single L1 minimization problem and a parallel solver for concurrent multiple L1 minimization problems, comprising an adaptively optimized parallel design for the non-transposed matrix-vector product, a parallel design for the transposed matrix-vector product, and a streaming parallel design.
To achieve the above purpose, the present invention adopts the following technical scheme.
The fast iterative shrinkage-thresholding algorithm (FISTA) is an iterative shrinkage-thresholding method that solves the unconstrained basis-pursuit denoising problem; it involves mainly matrix-vector products and vector operations and is therefore easy to parallelize. The present invention is accordingly based on FISTA and uses the CUDA parallel computing model on NVIDIA Maxwell-architecture GPU devices to accelerate the solution of L1 minimization problems in parallel. Adaptively optimized vector operations and non-transposed and transposed matrix-vector products are designed, and through a reasonable distribution of CUDA threads a parallel solver for a single L1 minimization problem and a parallel solver for concurrent multiple L1 minimization problems are realized.
The solution method comprises the following concrete steps:
1) according to the dimensions of the data dictionary and the computing resources of the GPU device, complete the warp-distribution and thread-distribution settings;
2) store the data dictionary as a 0-indexed row-major matrix padded to 32-byte alignment, and transfer the data dictionary and the vectors from the host to the GPU device;
3) meanwhile, on the host, asynchronously compute the input parameters of FISTA;
4) according to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for concurrent multiple L1 minimization problems on the GPU device;
5) on the GPU device, compute the non-transposed matrix-vector product in FISTA using the adaptively optimized non-transposed matrix-vector-product parallel design;
6) on the GPU device, compute the transposed matrix-vector product in FISTA using the adaptively optimized transposed matrix-vector-product parallel design;
7) on the GPU device, fuse the remaining vector operations and compute them in fused fashion using the streaming parallel design;
8) meanwhile, on the host, asynchronously compute the scalar values;
9) if the convergence condition is reached, stop iterating and transfer the sparse representation from the GPU device back to the host; otherwise return to step 5) and continue iterating.
FISTA involves mainly vector operations and matrix-vector products, with no matrix inversion or matrix factorization. It is therefore not only well suited to parallelization but also scales easily to large high-dimensional data. Vector operations and matrix-vector products are both low-arithmetic-intensity, bandwidth-bound operations, so the present invention uses kernel fusion, merging multiple vector operations and the matrix-vector product into one kernel; this eliminates global-memory accesses for intermediate results and exploits data locality. Furthermore, a matrix-vector product consists of multiple inner products, each row of the matrix forming an inner product with the vector, and through shared memory the vector is reused. Meanwhile, an adaptively optimized allocation strategy is used that makes full use of the memory hierarchy of GPU devices of compute capability 5.0 and above, combining data locality to realize multi-level cache-controlled optimization and reduce global-memory traffic.
The non-transposed matrix-vector-product parallel design of step 5) assigns, according to the warp-distribution setting and with adaptive optimization, one warp or several warps to compute each inner product, and uses the sparsity of the solution to reduce the amount of computation. The design comprises the following two-stage reduction:
1) in the first stage, all threads in each thread block first cooperatively load a contiguous segment of the vector into shared memory in parallel; each thread then completes its partial reduction, the warp-shuffle instructions complete the reduction within each warp, and the results are stored into contiguous shared memory; this repeats until the whole vector has been loaded;
2) in the second stage, shuffle instructions reduce the shared-memory data stored by the first stage, yielding the corresponding inner-product results.
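The two-stage reduction can be sketched in plain Python, simulating what the warp-shuffle instructions compute on the GPU (the chunk sizes and warp count here are illustrative assumptions, not the patent's exact kernel):

```python
import numpy as np

def warp_tree_reduce(vals):
    """Tree reduction over a power-of-two list, as done lane-wise with
    __shfl_down_sync inside one warp."""
    vals = list(vals)
    offset = len(vals) // 2
    while offset > 0:
        for lane in range(offset):
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]

def inner_product_two_stage(row, x, warps=4, warp_size=32):
    """Stage 1: each warp reduces its strided share of the dot product;
    stage 2: reduce the per-warp partial sums."""
    n = len(row)
    partials = []
    for w in range(warps):
        lane_sums = [0.0] * warp_size
        for lane in range(warp_size):
            # each thread strides over the vector, accumulating its share
            i = w * warp_size + lane
            while i < n:
                lane_sums[lane] += row[i] * x[i]
                i += warps * warp_size
        partials.append(warp_tree_reduce(lane_sums))
    # stage 2: pad the per-warp partials to a power of two, then tree-reduce
    while len(partials) & (len(partials) - 1):
        partials.append(0.0)
    return warp_tree_reduce(partials)

rng = np.random.default_rng(0)
row, x = rng.standard_normal(500), rng.standard_normal(500)
val = inner_product_two_stage(row.tolist(), x.tolist())
```

The strided indexing in stage 1 mirrors coalesced access: adjacent lanes touch adjacent elements of the row.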
In this parallel design, a warp contains 32 threads. To obtain the optimal number of warps per inner product, the following adaptive allocation strategy is proposed:

min w = sm × 2048 / k / 32, subject to m ≤ w

where w is the number of warp groups produced by the allocation (each group consists of k warps), k is the number of warps allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors in the GPU device, and m is the number of rows of the data-dictionary matrix. When k = 1, only the first-stage reduction is needed; when k = 32, the vector can be loaded directly into registers.
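A minimal sketch of the two allocation strategies, assuming 2048 resident threads per multiprocessor and simple integer rounding and clamping (the patent does not spell out the rounding), might look like:

```python
def warps_per_inner_product(m, sm, max_k=32, warp_size=32, resident=2048):
    """Non-transposed case: pick the largest k (warps per inner product)
    such that w = sm*resident/(warp_size*k) warp groups still cover m rows."""
    k = (sm * resident) // (warp_size * m)
    return max(1, min(max_k, k))

def threads_per_inner_product(n, sm, resident=2048):
    """Transposed case: t = sm*resident/k thread groups must cover n columns."""
    k = (sm * resident) // n
    return max(1, k)

# GTX 980-like device with sm = 16 streaming multiprocessors (assumption):
k_rows = warps_per_inner_product(256, 16)    # 16*2048/(32*256) = 4 warps
k_cols = threads_per_inner_product(1024, 16) # 16*2048/1024 = 32 threads
```

For very tall or wide dictionaries the quotient falls below 1 and the clamp returns k = 1, matching the "taken as 1 if less than 1" rule in the text.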
The transposed matrix-vector-product parallel design of step 6) assigns, according to the thread-distribution setting and with adaptive optimization, one thread or several threads to compute each inner product.
This design also comprises a two-stage reduction:
1) in the first stage, all threads in each thread block first cooperatively load a contiguous segment of the vector into shared memory in parallel; each thread then completes its partial reduction and stores the result into contiguous shared memory;
2) in the second stage, the shared-memory data produced by the first stage are reduced, yielding the corresponding inner-product results.
The design uses the following adaptive thread-allocation strategy:

min t = sm × 2048 / k, subject to n ≤ t

where t is the number of thread groups produced by the allocation (each group consists of k threads), k is the number of threads allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors in the GPU device, and n is the number of columns of the data-dictionary matrix. When k = 1, only the first stage is needed.
The streaming parallel design of step 7) processes each element of the vector operations in a streaming load fashion, including the soft-thresholding operator, which can also be vectorized; branches are eliminated by using the built-in functions provided by CUDA.
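The branch-free soft-thresholding operator described here can be sketched in NumPy, with `np.maximum` playing the role of CUDA's `fmax()` (an illustrative analogue, not the patent's kernel code):

```python
import numpy as np

def soft_threshold(u, a):
    """soft(u, a) = sign(u) * max(|u| - a, 0), written without branches,
    as the CUDA fmax()-based element-wise kernel would compute it."""
    return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

# Values within [-a, a] are zeroed; the rest shrink toward zero by a.
y = soft_threshold(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), 1.0)
```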
When the single-problem parallel solver setting of step 4) is enabled, one GPU device solves only one L1 minimization problem. The non-transposed matrix-vector-product parallel design, the transposed matrix-vector-product parallel design, and the streaming parallel design that fuses the vector operations are realized by three separate CUDA kernel functions.
When the parallel solver for concurrent multiple L1 minimization problems of step 4) is enabled, one GPU device solves multiple L1 minimization problems concurrently. The solution of each L1 minimization problem is completed by one or more thread blocks, and the non-transposed matrix-vector-product parallel design, the transposed matrix-vector-product parallel design, and the streaming parallel design that fuses the vector operations are realized together in a single CUDA kernel function. In addition, the built-in function provided by CUDA is used to cache accesses to the data-dictionary matrix in the read-only data cache, improving access efficiency.
The parallel solution method for L1 minimization problems proposed by the present invention fully exploits GPU hardware resources and computing power and has a high degree of parallelism and adaptability.
Brief description of the drawings
Fig. 1 is the memory-hierarchy diagram of a GPU device of compute capability 5.0 and above.
Fig. 2 is a schematic diagram of the matrix storage format used in the present invention.
Fig. 3 is a schematic diagram of the matrix padding with 32-byte alignment used in the present invention.
Fig. 4 is a schematic diagram of the kernel fusion of FISTA in the present invention.
Fig. 5 is a schematic diagram of the performance comparison between the single-problem parallel solver of the present invention on the GPU and on the CPU.
Fig. 6 is a schematic diagram of the performance comparison between the concurrent-multiple-problems parallel solver of the present invention and the single-problem version.
Fig. 7 is a flow chart of the method of the present invention.
Embodiments
In the following description, the present invention is explained in further detail with reference to Figs. 1-7 and specific embodiments.
The fast iterative shrinkage-thresholding algorithm is an accelerated iterative shrinkage-thresholding algorithm obtained by incorporating Nesterov's optimal gradient scheme, and possesses the non-asymptotic convergence rate O(1/k²). The algorithm adds a new sequence {y_k, k = 1, 2, ...}, with the following iteration:

x_k = soft(y_k − ∇f(y_k)/L_f, λ/L_f)
t_{k+1} = (1 + sqrt(1 + 4·t_k²)) / 2
y_{k+1} = x_k + ((t_k − 1)/t_{k+1})·(x_k − x_{k−1})

where λ is the scalar weight, soft(u, a) = sign(u)·max{|u| − a, 0} is the soft-thresholding operator, y_1 = x_0, t_1 = 1, L_f is the Lipschitz constant of ∇f(·), which can be obtained by computing the spectral norm of A^T A (i.e., ||A^T A||2), and ∇f(y_k) = A^T(Ay_k − b).
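The iteration above can be sketched end-to-end in NumPy. This is a plain sequential reference of FISTA, not the GPU implementation, and the test problem below is our own:

```python
import numpy as np

def fista(A, b, lam, iters=500):
    """Reference FISTA for min 0.5*||Ax-b||_2^2 + lam*||x||_1
    (a sequential NumPy sketch of the iteration the patent parallelizes)."""
    m, n = A.shape
    Lf = np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of grad f
    x = np.zeros(n)
    y, t = x.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)         # transposed mat-vec of the residual
        z = y - grad / Lf
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam / Lf, 0.0)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

# Recover a 3-sparse vector from 40 Gaussian measurements (illustrative setup).
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x0 = np.zeros(100)
x0[[3, 17, 42]] = [1.5, -2.0, 1.0]
b = A @ x0
x_hat = fista(A, b, lam=1e-3)
```

With a small λ the minimizer is close to the true sparse vector, so the recovered support matches the planted one.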
The present invention solves L1 minimization problems using the fast iterative shrinkage-thresholding algorithm, which involves mainly vector operations and matrix-vector products. On NVIDIA Maxwell-architecture GPU devices, based on the CUDA parallel computing model, the invention accelerates the fast iterative shrinkage-thresholding algorithm in parallel.
The invention proposes adaptively optimized parallel designs for the non-transposed matrix-vector product, the transposed matrix-vector product, and streaming vector operations. Using these parallel designs, and through a reasonable distribution of CUDA threads, a parallel solver for a single L1 minimization problem and a parallel solver for concurrent multiple L1 minimization problems are realized.
The concrete steps of the parallel solution method are as follows:
1) read the GPU device information, including the compute capability and the number of streaming multiprocessors; according to the dimensions of the data dictionary and the GPU device information, complete the warp-distribution and thread-distribution settings;
2) store the data dictionary A as a 0-indexed row-major matrix padded to 32-byte alignment; transfer the data dictionary A, the data item b, and the sparse representation x from the host to the GPU device;
3) meanwhile, on the host, asynchronously compute the input parameters of the fast iterative shrinkage-thresholding algorithm (such as L_f);
4) according to the number of L1 minimization problems to be solved: if a single L1 minimization problem is to be solved, launch the parallel solver for a single L1 minimization problem on the GPU device; if multiple concurrent L1 minimization problems are to be solved, launch the parallel solver for concurrent multiple L1 minimization problems on the GPU device;
5) on the GPU device, compute the non-transposed matrix-vector product Ay_k − b of the fast iterative shrinkage-thresholding algorithm using the adaptively optimized non-transposed matrix-vector-product parallel design;
6) on the GPU device, compute the transposed matrix-vector product A^T(*) of the fast iterative shrinkage-thresholding algorithm using the adaptively optimized transposed matrix-vector-product parallel design;
7) on the GPU device, compute the remaining vector operations of the fast iterative shrinkage-thresholding algorithm in fused fashion using the streaming parallel design;
8) meanwhile, on the host, asynchronously compute the value t_{k+1};
9) if the number of iterations or the sparsity of the solution satisfies the set condition, stop iterating and transfer the sparse representation from the GPU device to the host; otherwise return to step 5) and continue iterating.
When the single-problem parallel solver setting of step 4) above is enabled, one GPU device solves only one L1 minimization problem, and three CUDA kernel functions realize, respectively, the non-transposed matrix-vector product, the transposed matrix-vector product, and all the vector operations of the fast iterative shrinkage-thresholding algorithm; the concrete flow of FISTA is shown in Algorithm 1. The first kernel function realizes Ay_k − b using the non-transposed matrix-vector-product parallel design; the second kernel realizes A^T(*) using the transposed matrix-vector-product parallel design; the remaining vector operations are then fused into the third kernel using the streaming parallel design, as shown in Fig. 4. The dimensions of the input and output objects of each kernel function may differ, i.e., different launch configurations (the configuration of the thread grid and thread blocks, etc.) are used.
When the parallel solver for concurrent multiple L1 minimization problems of step 4) above is enabled, one GPU device solves multiple L1 minimization problems concurrently. The solution of each L1 minimization problem is completed by one or more thread blocks, and only one CUDA kernel function, combining the non-transposed matrix-vector-product parallel design, the transposed matrix-vector-product parallel design, and the streaming parallel design that fuses the vector operations, realizes the fast iterative shrinkage-thresholding algorithm. The __ldg() function is used to cache accesses to the data-dictionary matrix in the read-only data cache, improving access efficiency; in addition, the value t_{k+1} is no longer computed asynchronously on the host.
The non-transposed matrix-vector product is defined as Ax (A ∈ R^(m×n), x ∈ R^n) and consists of m inner products (each row of A with x), each of which can be computed independently. The non-transposed matrix-vector-product parallel design of step 5) above assigns one warp or several warps to compute each inner product of Ax, computes multiple inner products simultaneously, and assigns the warps to the inner products cyclically. For different matrix sizes and different GPU devices (with different computing-resource scales), an adaptive warp-allocation strategy is proposed that automatically chooses an optimal number k of warps per inner product, so that more CUDA cores and other functional units participate in the computation. Moreover, this design uses the sparsity of the solution to reduce the amount of computation of the kernel.
The design caches the vector x in shared memory and comprises the following two-stage reduction:
The first stage comprises the following steps:
1) x-load step: all threads in each thread block first cooperatively load a contiguous segment of the vector x into the shared-memory buffer xP in parallel, then perform the partial-reduction step. In this manner the accesses to the vector x are coalesced, and by sharing the segment of x the number of accesses is reduced.
2) partial-reduction step: each thread of a thread block performs a reduction over the part of the vector x already loaded into shared memory, with the formula:

bVAL += xP_i × A_rj

where bVAL is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_rj is the element of the matrix A corresponding to xP_i. If the vector x has not been fully loaded, return to the x-load step; otherwise perform the warp-reduction step. Evidently each thread may need to perform the reduction several times, and the accesses to the matrix A in global memory are also coalesced.
3) warp-reduction step: within each warp, each thread completes the partial reduction it is responsible for, and then the shuffle instructions provided by CUDA complete the final reduction.
In the second stage, multiple warps read this contiguous shared memory and complete the intra-warp reduction using shuffle instructions, yielding the corresponding inner-product results; if the inner-product computation is not finished, return to the first stage and continue with the next group of inner products.
The design uses the following adaptive warp-allocation strategy:

min w = sm × 2048 / k / 32, subject to m ≤ w

where w is the number of warp groups produced by the allocation (each group consists of k warps), k is the number of warps allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors in the GPU device, and m is the number of rows of the data-dictionary matrix. When k = 1, only the first-stage reduction is needed; when k = 32, the vector is loaded directly into registers.
The transposed matrix-vector product is defined as A^T x (A ∈ R^(m×n), x ∈ R^m) and consists of n inner products (each column of A with x), each of which can be computed independently. The transposed matrix-vector-product parallel design of step 6) above assigns one thread or several threads to compute each inner product of the transposed matrix-vector product, computes multiple inner products simultaneously, and assigns the threads to the inner products cyclically. For different matrix sizes and different GPU devices (with different computing-resource scales), an adaptive thread-allocation strategy is proposed that automatically chooses an optimal number k of threads per inner product, so that more CUDA cores and other functional units participate in the computation.
The design caches the vector x in shared memory and comprises the following two-stage reduction:
The first stage comprises the following steps:
1) x-load step: all threads in each thread block first cooperatively load a contiguous segment of the vector x into shared memory, then perform the partial-reduction step.
2) partial-reduction step: each thread of a thread block performs a reduction over the part of the vector x already loaded into shared memory, with the formula:

bVAL += xP_i × A_jc

where bVAL is the partial reduction value a thread is responsible for, xP_i is the i-th element of the segment of x loaded into shared memory, and A_jc is the element of the matrix A corresponding to xP_i. If x has not been fully loaded, return to the x-load step; if x has been fully loaded, perform the second stage. In the partial-reduction step, because the matrix is stored row-major and indexed from 0, the accesses to the matrix A in global memory would be uncoalesced if the thread groups (k threads forming one group) were organized unreasonably. Therefore the thread groups are created according to the following definition, so as to guarantee coalesced accesses.
Definition 1: suppose the thread-block size is s, h threads are jointly assigned to one dot product of A^T x, and z = s/h. Then the thread groups are organized as follows: {0, z, .., (h−1)·z}, {1, z+1, .., (h−1)·z+1}, …, {z−1, 2·z−1, .., h·z−1}.
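Definition 1 can be reproduced with a few lines of Python: consecutive thread IDs land in different groups (different columns) at the same offset, which is what makes the accesses to the row-major matrix coalesced (an illustrative sketch; the names are ours):

```python
def thread_groups(s, h):
    """Definition 1: block size s, h threads per dot product, z = s // h
    groups; group g is {g, g+z, ..., g+(h-1)*z}."""
    z = s // h
    return [[g + j * z for j in range(h)] for g in range(z)]

# Block of 8 threads, 2 threads per dot product -> 4 groups of stride-4 IDs.
groups = thread_groups(8, 2)
```

Note that threads 0, 1, 2, 3 belong to four different groups but share the same position within their group, so at each step they read consecutive matrix elements.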
In the second stage, multiple warps read this contiguous shared memory and perform the reduction, yielding the corresponding inner-product results.
The design uses the following adaptive thread-allocation strategy:

min t = sm × 2048 / k, subject to n ≤ t

where t is the number of thread groups produced by the allocation (each group consists of k threads), k is the number of threads allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors in the GPU device, and n is the number of columns of the data-dictionary matrix. When k = 1, only the first-stage reduction is needed.
The streaming parallel design of step 7) above processes each element of the vector operations in a streaming load fashion, i.e., each thread computes one element; this includes the soft-thresholding operator, which can also be vectorized, and the built-in function fmax() provided by CUDA is used to eliminate branches.
To demonstrate the effect, the method of the present invention was tested with single-precision matrices. The test environment is a computer with an Intel Xeon dual-core CPU and an NVIDIA GTX980 graphics card; the compilation and run environment is CUDA 6.5. Figs. 5 and 6 show the performance of the two concurrent solvers proposed by the invention, where CFISTA denotes the FISTA implementation based on CUBLAS, GFISTA denotes the single-problem parallel solver of the present invention, and MFISTASOL denotes the concurrent-multiple-problems parallel solver of the present invention. It can be seen that, compared with the CUBLAS-based solution method, the single-problem parallel solver of the present invention delivers a large performance improvement, and compared with the single-problem parallel solver, the concurrent-multiple-problems parallel solver of the present invention delivers a further gain of some magnitude.
Referring to Fig. 1, the memory hierarchy of an NVIDIA GPU device of compute capability 5.0 and above is multi-level: each thread can access the shared memory shared within its thread block; the L2 cache automatically caches global memory (located in dynamic random access memory); and the read-only data cache (L1 cache) can be controlled by the program to cache global memory.
Referring to Figs. 2 and 3, the data dictionary is stored as a 0-indexed row-major matrix padded to 32-byte alignment; this optimizes global-memory access performance and reduces the number of memory transactions.
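The 32-byte-aligned, row-major, 0-indexed storage can be sketched as padding the row stride so every row starts on an aligned boundary (the float32 element type here is an assumption for illustration):

```python
import numpy as np

def padded_row_stride(n, dtype=np.float32, align=32):
    """Round the row stride up so each row starts on an `align`-byte boundary."""
    elems = align // np.dtype(dtype).itemsize  # elements per aligned chunk
    return ((n + elems - 1) // elems) * elems

def store_row_major_padded(A, align=32):
    """Pack A row-major with zero padding at the end of each row."""
    m, n = A.shape
    ld = padded_row_stride(n, A.dtype, align)
    buf = np.zeros((m, ld), dtype=A.dtype)
    buf[:, :n] = A
    return buf.ravel(order="C"), ld

# A 3x5 float32 matrix: 5 columns pad up to a stride of 8 (32 bytes / 4 bytes).
A = np.ones((3, 5), dtype=np.float32)
flat, ld = store_row_major_padded(A)
```

Each padded row occupies a whole number of 32-byte memory transactions, which is the point of Fig. 3.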
Referring to Fig. 4, the kernel fusion of the single-problem parallel solver: the first kernel function fuses the non-transposed matrix-vector product with the vector subtraction to realize Ay_k − b; the second kernel realizes A^T(*); the remaining vector operations are fused into the third kernel.
Referring to Fig. 5, in each test case the initial x_0 always contains 1024 nonzero elements and b = Ax_0; each run terminates after 50 iterations. The execution times of all algorithms are listed in the figure, in seconds. Compared with CFISTA, GFISTA achieves speedups ranging from 37.68x to 53.66x, with an average speedup of 48.22x; the performance improvement is significant.
Referring to Fig. 6, the solver MFISTASOL for concurrent multiple L1 minimization problems. The test configuration is the same as in Fig. 5; for each case, 128 L1 minimization problems are solved concurrently. Compared with sequentially executing the single-L1-minimization-problem parallel solver GFISTA, MFISTASOL achieves an average speedup of more than 3.0x.
Claims (4)
1. A GPU-based fast solving method for L1 minimization problems, characterized in that it is based on the fast iterative shrinkage-thresholding algorithm (FISTA) and uses the CUDA parallel computing model on NVIDIA Maxwell-architecture GPU devices to accelerate the solution of L1 minimization problems in parallel; adaptively optimized vector operations, a non-transposed matrix-vector multiply and a transposed matrix-vector multiply are designed, and through rational allocation of CUDA threads a parallel solver for a single L1 minimization problem and a parallel solver for concurrent multiple L1 minimization problems are realized;
The solving method comprises the following steps:
1) according to the dimensions of the data dictionary and the computing resources of the GPU device, complete the warp allocation setting and the thread allocation setting;
2) store the data dictionary as a row-major matrix with 0-based indexing, padded to 32-byte alignment; transmit the data dictionary and the vectors from the host side to the GPU device side, the vectors being the data item b and the sparse representation x;
3) meanwhile, on the host side, asynchronously compute the input parameters of FISTA;
4) according to the number of L1 minimization problems to be solved: if only a single L1 minimization problem is solved, launch the single-L1-minimization-problem parallel solver on the GPU device side; if concurrent multiple L1 minimization problems are solved, launch the parallel solver for concurrent multiple L1 minimization problems on the GPU device side;
The specific method for enabling the single-L1-minimization-problem parallel solver is: one GPU device solves only one L1 minimization problem, and the non-transposed matrix-vector multiply parallel design, the transposed matrix-vector multiply parallel design and the streaming parallel design fusing the vector operations are realized by three CUDA kernel functions respectively.
The specific method for enabling the parallel solver for concurrent multiple L1 minimization problems is: one GPU device can solve multiple L1 minimization problems concurrently; the solution of each L1 minimization problem is completed by one or more thread blocks, and the non-transposed matrix-vector multiply parallel design, the transposed matrix-vector multiply parallel design and the streaming parallel design fusing the vector operations are realized by a single CUDA kernel function; in addition, the intrinsic functions provided by CUDA are used to cache accesses to the data dictionary matrix in the read-only data cache, improving access efficiency.
5) on the GPU device side, use the adaptively optimized non-transposed matrix-vector multiply parallel design to realize the non-transposed matrix-vector multiply in FISTA;
6) on the GPU device side, use the adaptively optimized transposed matrix-vector multiply parallel design to realize the transposed matrix-vector multiply in FISTA;
7) on the GPU device side, fuse the remaining vector operations in FISTA and compute them in a fused manner using the streaming parallel design;
8) meanwhile, on the host side, asynchronously compute the scalar values;
9) if the convergence condition is reached, stop iterating and transmit the sparse representation from the GPU device side to the host side; otherwise return to step 5) and continue iterating.
2. The GPU-based fast solving method for L1 minimization problems according to claim 1, characterized in that the non-transposed matrix-vector multiply parallel design in step 5) is specifically: according to the warp allocation setting, adaptively and optimally allocate one warp or multiple warps to compute one inner product, and exploit the sparsity of the solution to reduce the amount of computation;
The parallel design comprises the following two-stage reduction:
1) in the first stage, all threads in each thread block first cooperatively read a continuous segment of the vector into shared memory in parallel; each thread in the same warp then completes its corresponding partial reduction; the warp-shuffle instruction is then used to complete the reduction within the warp, and the result is stored into contiguous shared memory; this repeats until the whole vector has been loaded;
2) in the second stage, the shuffle instruction is used to reduce the shared-memory data stored in the first stage, yielding the corresponding inner-product result;
In this parallel design, the number of threads a warp contains is 32; to obtain the optimal number of warps for computing one inner product, the following adaptive allocation strategy is proposed:
min w = sm × 2048 / k / 32, subject to m ≤ w
where w is the number of warp groups produced by the allocation, a warp group consists of k warps, k is the number of warps allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors the GPU device contains, and m is the number of rows of the data dictionary matrix; when k = 1, the strategy needs only the first-stage reduction; when k = 32, the vector can be loaded directly into registers.
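Under one reading of the allocation rule above (minimizing w is equivalent to taking the largest k whose resulting warp-group count w still covers the m matrix rows), the strategy can be sketched in Python; the function name, the brute-force search loop, and the cap of 32 warps are illustrative assumptions:

```python
def warps_per_inner_product(sm, m, max_warps=32):
    """Hypothetical sketch of the adaptive warp allocation:
    w = sm * 2048 / (32 * k) warp groups when k warps share one
    inner product; pick the largest k with w >= m (clamped to >= 1)."""
    best_k = 1
    for k in range(1, max_warps + 1):
        w = sm * 2048 // (32 * k)  # warp groups available at this k
        if w >= m:
            best_k = k
    return best_k
```

For example, on a 16-SM device (1024 resident warps total), a dictionary with 100 rows would get k = 10 warps per inner product, since 1024 // 10 = 102 ≥ 100 but 1024 // 11 = 93 < 100.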
3. The GPU-based fast solving method for L1 minimization problems according to claim 1, characterized in that the transposed matrix-vector multiply parallel design in step 6) is specifically: according to the thread allocation setting, adaptively and optimally allocate one thread or multiple threads to compute one inner product;
This parallel design also comprises a two-stage reduction:
1) in the first stage, all threads in each thread block first cooperatively read a continuous segment of the vector into shared memory in parallel; each thread then completes its corresponding partial reduction and stores the result into contiguous shared memory;
2) in the second stage, the shared-memory data obtained in the first stage is reduced, yielding the corresponding inner-product result;
The parallel design uses the following adaptive thread allocation strategy:
min t = sm × 2048 / k, subject to n ≤ t
where t is the number of thread groups produced by the allocation, a thread group consists of k threads, k is the number of threads allocated to one inner product (taken as 1 if less than 1), sm is the number of streaming multiprocessors the GPU device contains, and n is the number of columns of the data dictionary matrix; when k = 1, only the first stage is needed.
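The thread allocation strategy above admits the same reading as the warp strategy of claim 2: take the largest k whose thread-group count t still covers the n columns. A Python sketch follows; the function name, the search loop, and the cap of 1024 threads per inner product are illustrative assumptions:

```python
def threads_per_inner_product(sm, n, max_k=1024):
    """Hypothetical sketch of the adaptive thread allocation:
    t = sm * 2048 / k thread groups when k threads share one inner
    product; pick the largest k with t >= n (clamped to >= 1)."""
    best_k = 1
    for k in range(1, max_k + 1):
        t = sm * 2048 // k  # thread groups available at this k
        if t >= n:
            best_k = k
    return best_k
```

On a 16-SM device (32768 resident threads total), a dictionary with 4096 columns would get k = 8 threads per inner product, since 32768 // 8 = 4096 exactly covers the columns.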
4. The GPU-based fast solving method for L1 minimization problems according to claim 1, characterized in that the streaming parallel design in step 7) is specifically: each element of the vector operations, including the soft-threshold operator, is processed in a streaming load fashion, and the intrinsic functions provided by CUDA are used to eliminate branches.
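The soft-threshold operator itself can be written branch-free, by analogy with CUDA's `fmaxf`/`copysignf` intrinsics; the following Python sketch is an illustrative assumption, not the patent's kernel code:

```python
import math

def soft_threshold(v, tau):
    """Branch-free soft threshold: sign(v) * max(|v| - tau, 0).
    math.copysign and max stand in for the CUDA copysignf/fmaxf
    intrinsics that would eliminate the branch on the GPU."""
    return math.copysign(max(abs(v) - tau, 0.0), v)
```

Because no `if` on the sign or magnitude of `v` is needed, all threads in a warp follow the same instruction path, avoiding divergence.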
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610116008.3A CN105739951B (en) | 2016-03-01 | 2016-03-01 | A kind of L1 minimization problem fast solution methods based on GPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105739951A CN105739951A (en) | 2016-07-06 |
CN105739951B true CN105739951B (en) | 2018-05-08 |
Family
ID=56248952
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610116008.3A Active CN105739951B (en) | 2016-03-01 | 2016-03-01 | A kind of L1 minimization problem fast solution methods based on GPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105739951B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502771B (en) * | 2016-09-09 | 2019-08-02 | 中国农业大学 | Time overhead model building method and system based on kernel function |
CN110088730B (en) * | 2017-06-30 | 2021-05-18 | 华为技术有限公司 | Task processing method, device, medium and equipment |
CN107886519A (en) * | 2017-10-17 | 2018-04-06 | 杭州电子科技大学 | Multichannel chromatogram three-dimensional image fast partition method based on CUDA |
CN109709547A (en) * | 2019-01-21 | 2019-05-03 | 电子科技大学 | A kind of reality beam scanning radar acceleration super-resolution imaging method |
CN112487740B (en) * | 2020-12-23 | 2024-06-18 | 深圳国微芯科技有限公司 | Boolean satisfiability problem solving method and system |
FR3122753B1 (en) * | 2021-05-10 | 2024-03-15 | Commissariat Energie Atomique | METHOD FOR EXECUTING A BINARY CODE BY A MICROPROCESSOR |
CN114943194B (en) * | 2022-05-16 | 2023-04-28 | 水利部交通运输部国家能源局南京水利科学研究院 | River pollution tracing method based on geostatistics |
CN117785480B (en) * | 2024-02-07 | 2024-04-26 | 北京壁仞科技开发有限公司 | Processor, reduction calculation method and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103505206A (en) * | 2012-06-18 | 2014-01-15 | 山东大学威海分校 | Fast and parallel dynamic MRI method based on compressive sensing technology |
US9118347B1 (en) * | 2011-08-30 | 2015-08-25 | Marvell International Ltd. | Method and apparatus for OFDM encoding and decoding |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20120025233A (en) * | 2010-09-07 | 2012-03-15 | 삼성전자주식회사 | Method and apparatus of reconstructing polychromatic image and medical image system enabling the method |
2016
- 2016-03-01 CN CN201610116008.3A patent/CN105739951B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9118347B1 (en) * | 2011-08-30 | 2015-08-25 | Marvell International Ltd. | Method and apparatus for OFDM encoding and decoding |
CN103505206A (en) * | 2012-06-18 | 2014-01-15 | 山东大学威海分校 | Fast and parallel dynamic MRI method based on compressive sensing technology |
Non-Patent Citations (2)
Title |
---|
Accelerated proximal algorithms for L1-minimization problem; Xiao Ya Zhang et al.; Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 2014 11th International Computer Conference on; 2015-04-02; pp. 139-143 *
Performance Analysis and Comparison of Fast L1-Norm Minimization Algorithms (快速L1范数最小化算法的性能分析和比较); Liu Jie et al.; Computer Knowledge and Technology (电脑知识与技术); July 2011; Vol. 7, No. 19; pp. 4641-4643 *
Also Published As
Publication number | Publication date |
---|---|
CN105739951A (en) | 2016-07-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105739951B (en) | A kind of L1 minimization problem fast solution methods based on GPU | |
US20220012593A1 (en) | Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization | |
Alwani et al. | Fused-layer CNN accelerators | |
Keuper et al. | Distributed training of deep neural networks: Theoretical and practical limits of parallel scalability | |
US20220391678A1 (en) | Neural network model processing method and apparatus, computer device, and storage medium | |
Lauterbach et al. | Fast BVH construction on GPUs | |
US20180157969A1 (en) | Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network | |
CN106709441B (en) | A kind of face verification accelerated method based on convolution theorem | |
Martín et al. | Algorithmic strategies for optimizing the parallel reduction primitive in CUDA | |
US11797855B2 (en) | System and method of accelerating execution of a neural network | |
CN106875013A (en) | The system and method for optimizing Recognition with Recurrent Neural Network for multinuclear | |
CN103049241B (en) | A kind of method improving CPU+GPU isomery device calculated performance | |
CN104765589B (en) | Grid parallel computation preprocess method based on MPI | |
KR20180123846A (en) | Logical-3d array reconfigurable accelerator for convolutional neural networks | |
CN112084038A (en) | Memory allocation method and device of neural network | |
CN103177414A (en) | Structure-based dependency graph node similarity concurrent computation method | |
US20230409885A1 (en) | Hardware Environment-Based Data Operation Method, Apparatus and Device, and Storage Medium | |
CN114461978B (en) | Data processing method and device, electronic equipment and readable storage medium | |
CN108875914B (en) | Method and device for preprocessing and post-processing neural network data | |
CN110414672B (en) | Convolution operation method, device and system | |
US20240160689A1 (en) | Method for optimizing convolution operation of system on chip and related product | |
CN106484532B (en) | GPGPU parallel calculating method towards SPH fluid simulation | |
Niu et al. | SPEC2: Spectral sparse CNN accelerator on FPGAs | |
Zayer et al. | Sparse matrix assembly on the GPU through multiplication patterns | |
CN107305486A (en) | A kind of neutral net maxout layers of computing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |