CN112750101A

CN112750101A - FFT (fast Fourier transform)-based GPU-parallel algorithm for detecting OCA defects in ultra-large images

Info

Publication number: CN112750101A
Application number: CN202011256011.8A
Authority: CN
Inventors: Not disclosed (inventor requested non-publication of name)
Original assignee: Beijing Pingheng Intelligent Technology Co ltd
Current assignee: Beijing Pingheng Intelligent Technology Co ltd
Priority date: 2020-11-11
Filing date: 2020-11-11
Publication date: 2021-05-04

Abstract

The invention provides a GPU-parallel detection algorithm based on the radix-2 FFT, which mainly offers a fast computation scheme for inspecting large-scale images. The three nested loops of the original FFT butterfly algorithm are handled as follows: the outermost loop is computed serially, and the inner two loops are unified into a single formula and then computed in parallel. The number of outermost iterations is the base-2 logarithm of the amount of data, similar to the depth of a binary tree, so the serial work in the outer loop is small. The inner two loops, which carry most of the computation, are evaluated with the unified formula by a GPU parallel method. In this way the FFT is computed quickly.

Description

FFT (fast Fourier transform)-based GPU-parallel algorithm for detecting OCA defects in ultra-large images

Technical Field

The invention mainly relates to industrial-grade, high-precision, real-time inspection, and in particular to an image processing technique that addresses the high-precision detection of small defects in ultra-large images.

Background

Modern industrial products are manufactured with increasing refinement and at high speed, and the demands on product inspection keep growing. Early manual inspection can no longer meet current requirements, and over time it falls further behind in both inspection speed and inspection accuracy. With the development of image processing and object detection, automatic image-based inspection has gradually entered industrial inspection and is replacing manual inspection. It mainly handles dimensional measurement, various kinds of defect detection, and the like.

To meet the requirements of industrial image processing, noise points and similar interference in the image information must be removed, which calls for filtering in the Fourier frequency domain. The Fourier transform decomposes a signal into trigonometric components, converting it from the time domain to the frequency domain. The relevant information is then processed and filtered in the frequency domain and converted back to the time domain, which removes the interference and supports an effective search for defects.

Existing discrete Fourier transform implementations generally use the fast Fourier transform for acceleration. However, a preparatory optimization step is required before the accelerated computation, and this step takes a long time, so in normal use the real-time inspection requirements of a factory cannot be met. The performance of cuFFT in CUDA is also not ideal for this purpose.

Industrial products require the detection of small defects (smaller than 0.01 mm²) in ultra-large images (for example 8K × 13K or 16K × 30K) together with real-time performance (300 ms). Under such requirements, detection speed is a very important factor limiting the performance of the algorithm.

Disclosure of Invention

Therefore, to address the large computation load of the FFT and the demand for high speed, the present invention designs a GPU parallel algorithm on the basis of the FFT to meet the above needs.

The technical scheme adopted by the invention is as follows.

Based on the radix-2 FFT, data whose length is not a power of two (2ⁿ) is extended to the nearest power of two, 2ⁿ, that can hold it.
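The extension step can be illustrated with a minimal Python sketch. The helper names `next_pow2` and `pad_to_pow2` are hypothetical, and zero is assumed as the fill value (the detailed description mentions filling with 0 pixels); the patent does not give code for this step.

```python
def next_pow2(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def pad_to_pow2(data):
    """Zero-pad a sequence so that its length becomes a power of two."""
    target = next_pow2(len(data))
    return list(data) + [0] * (target - len(data))
```

A sequence whose length is already a power of two is returned unchanged in length.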

The one-dimensional FFT is then parallelized on the GPU. First, the FFT transform coefficients W are stored in an array and transferred to GPU shared memory.
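As an illustration of the coefficient array, the following sketch precomputes the table on the host side. The function name `twiddle_table` and the sign convention W[j] = exp(-2πij/n) are assumptions; the patent only states that the coefficients W are stored in an array.

```python
import cmath


def twiddle_table(n):
    """Precompute W[j] = exp(-2*pi*i*j/n) for j = 0 .. n/2 - 1.

    n is assumed to be a power of two with n >= 2. In the patent's scheme
    such a table would be built once and then copied to the GPU.
    """
    return [cmath.exp(-2j * cmath.pi * j / n) for j in range(n // 2)]
```

Only n/2 coefficients are needed because a radix-2 butterfly never uses an exponent of n/2 or more.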

The odd-type and even-type results inside the nested loops are distinguished by a branch condition whose outcome is stored in a bool-type array flag. The flag array is computed with branches on the CPU and then transferred to GPU shared memory for use.

The inner two loops (i, j) of the three nested loops (k, i, j) of the FFT algorithm are merged and parallelized, and the outermost loop is retained. The number of outermost iterations is log₂ m, where m is the amount of data, similar to the depth of a binary tree, so the work in the outer loop is small.

The merged inner two loops contain odd-type terms and even-type terms in parallel. The two kinds of terms are distinguished by a condition that uses the flag array from step [0010], and the parallel subscript into the branch array flag is [(tid % nNum) + k * (1 << r)].
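One possible way to fill such a flag array on the CPU is sketched below. Several points are assumptions rather than statements from the patent: `nNum` is taken to be the row length n = 2^r, the subscript is read as (tid % nNum) + k * (1 << r), the butterfly layout is decimation in frequency, and `True` is chosen to mark the addition branch. The names `build_flags` and `flag_index` are hypothetical.

```python
def build_flags(r):
    """Build one bool per (stage k, position p) for a transform of length 1 << r.

    Assumption: True marks the upper half of a butterfly (addition branch),
    False marks the lower half (subtract-and-multiply branch).
    """
    n = 1 << r
    flag = [False] * (r * n)
    for k in range(r):                 # one block of n flags per stage
        half = 1 << (r - k - 1)        # butterfly half-span at stage k
        for p in range(n):
            flag[p + k * n] = (p & half) == 0
    return flag


def flag_index(tid, k, r):
    """Subscript into the flag array for thread id tid at stage k."""
    n_num = 1 << r                     # assumed meaning of nNum
    return (tid % n_num) + k * (1 << r)
```

With this layout a thread needs a single table lookup per stage instead of evaluating the parity condition itself.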

For odd-type terms, two elements of the array f are added, and the subscripts of the two elements are arranged as f[tid] and f[tid + (1 << (r-k-1))].

For even-type terms, two elements of the array are subtracted and the difference is then multiplied by a transform coefficient W. The subscripts of the two subtracted elements are arranged as f[tid] and f[tid + (1 << (r-k-1))], and [(tid * (1 << k)) % (1 << (r-1))] is used as the subscript of the transform coefficient W.

After each iteration of the outer loop, the computed odd-type and even-type results are assigned to the input array of the next iteration.

After the outer loop has finished, the subscripts are reordered to obtain the transformed array in the normal order of the transform points.

Steps [0008] to [0016] complete the one-dimensional fast Fourier parallel algorithm.
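The one-dimensional procedure can be simulated on the CPU with a short Python sketch, in which the loop over `tid` stands in for the GPU threads. This is an interpretation under stated assumptions, not the patented kernel: a decimation-in-frequency layout is assumed, the coefficient sign is exp(-2πij/n), the branch is tested inline instead of through a precomputed flag array, and the subtraction branch reads f[tid - h] and f[tid], which is one reading of the index arrangement. The function name `fft_dif_merged` is hypothetical.

```python
import cmath


def fft_dif_merged(x):
    """Radix-2 FFT with a serial stage loop and one flat loop over tid."""
    n = len(x)
    r = n.bit_length() - 1
    assert n == 1 << r, "length must be a power of two"
    # Coefficient table W[j] = exp(-2*pi*i*j/n), j = 0 .. n/2 - 1.
    w = [cmath.exp(-2j * cmath.pi * j / n) for j in range(max(n // 2, 1))]
    f = [complex(v) for v in x]
    for k in range(r):                       # serial outer loop: log2(n) stages
        h = 1 << (r - k - 1)                 # butterfly half-span
        g = [0j] * n                         # output buffer for this stage
        for tid in range(n):                 # each tid would be one GPU thread
            if (tid & h) == 0:               # addition branch
                g[tid] = f[tid] + f[tid + h]
            else:                            # subtract-and-multiply branch
                widx = (tid * (1 << k)) % (1 << (r - 1))
                g[tid] = (f[tid - h] - f[tid]) * w[widx]
        f = g                                # result becomes the next input
    # Reorder subscripts (bit reversal) to restore the natural output order.
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, "0{}b".format(r))[::-1], 2) if r else 0
        out[rev] = f[i]
    return out
```

Writing each stage into a separate buffer `g` avoids read-after-write hazards between threads; an in-place GPU version would need a synchronization barrier instead.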

The method is extended from one dimension to two dimensions. One direction is selected first, for example the height direction; each line is processed by the one-dimensional Fourier parallel algorithm, so that all lines together can be computed in parallel.

After every row in the height direction has been computed, the computation of each column begins. When all rows and columns have been computed, the Fourier transform of one image is complete.
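The row-then-column structure can be sketched as follows. To keep the example self-contained, `dft1d` is a naive O(n²) stand-in for the parallel one-dimensional FFT, and the column pass is expressed as transpose, row pass, transpose; the function names are hypothetical.

```python
import cmath


def dft1d(row):
    """Naive 1-D DFT, used here only as a stand-in for the parallel 1-D FFT."""
    n = len(row)
    return [sum(row[t] * cmath.exp(-2j * cmath.pi * q * t / n) for t in range(n))
            for q in range(n)]


def transpose(m):
    """Exchange rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def fft2d(img):
    """2-D transform: 1-D transform of every row, then of every column."""
    rows_done = [dft1d(r) for r in img]                    # rows are independent
    cols_done = [dft1d(c) for c in transpose(rows_done)]   # columns as rows
    return transpose(cols_done)
```

Because every row (and later every column) is independent, each 1-D transform can be assigned to its own group of threads.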

Drawings

The following drawings help in reading the description of the invention and in understanding its application scenario.

FIG. 1 is a logic flow diagram of a one-dimensional parallel algorithm of the present invention.

Figure 1 shows the algorithm implementation steps and the parallel conditions in detail.

FIG. 2 is a general flow chart of the application steps of the present invention using a parallel algorithm, explicitly identifying the input data, the calculation steps and the output data of the present invention.

Detailed Description

Import a picture and obtain its width and height. Determine whether the width and the height are each a power of 2. If so, keep them unchanged and proceed to the next step; otherwise, pad the picture with zero-valued pixels toward the lower right until both the width and the height are powers of 2.
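A minimal sketch of this padding step for an image stored as a list of rows is given below. The function name `pad_image_pow2` and the list-of-lists representation are assumptions made for illustration.

```python
def pad_image_pow2(img):
    """Zero-pad an image to the right and bottom up to power-of-two sides."""
    height, width = len(img), len(img[0])
    new_h = 1 << (height - 1).bit_length()   # smallest power of two >= height
    new_w = 1 << (width - 1).bit_length()    # smallest power of two >= width
    padded = [list(row) + [0] * (new_w - width) for row in img]
    padded += [[0] * new_w for _ in range(new_h - height)]
    return padded
```

The original pixels stay in the upper-left corner, so the result can be cropped back to the original size after the inverse transform.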

Allocate storage for the image data on the GPU, flatten the image data on the CPU into a one-dimensional array, and transfer the array to the GPU.

Expand the one-dimensional real array on the GPU into a complex array.

Set the Grid and Block parameters, and pass the one-dimensional complex array from step [0009] into a kernel function to perform the GPU-parallel one-dimensional FFT.

Allocate two-dimensional threads through Grid and Block, and perform the transform along the image width direction. The threads within a thread block must be synchronized with the __syncthreads method so that the data are computed correctly.

When one loop iteration finishes, assign its result to the input of the next iteration.

After all loop iterations have finished, reorder the subscripts using the reverse-order (bit-reversal) sorting method.

Transpose the image data, exchanging rows and columns, and then perform the GPU-parallel one-dimensional FFT again in the same manner.

After the above steps are completed, the Fourier transform of the image is obtained.

Obtain a central filter convolution kernel using a Gaussian function, then obtain the peripheral filter through four branch conditions, and remove part of the frequency information on the peripheral filter to obtain the required filter.

Then compute the Fourier-transformed image from [0032] with the filter to obtain the filtered image.
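The patent does not give the filter formula, so the following is only a hedged illustration of frequency-domain filtering in general. It assumes a Gaussian low-pass mask on an unshifted spectrum (zero frequency at index [0][0]), folds the four quadrants with `min()` rather than with four explicit branches, and applies the mask by element-wise multiplication. The parameter `sigma` and both function names are hypothetical.

```python
import math


def gaussian_lowpass_mask(height, width, sigma):
    """Gaussian weight per frequency bin, measured from the zero frequency."""
    mask = []
    for u in range(height):
        du = min(u, height - u)          # wrap-around distance along rows
        row = []
        for v in range(width):
            dv = min(v, width - v)       # wrap-around distance along columns
            row.append(math.exp(-(du * du + dv * dv) / (2.0 * sigma * sigma)))
        mask.append(row)
    return mask


def apply_filter(spectrum, mask):
    """Multiply every spectrum coefficient by the matching mask weight."""
    return [[s * m for s, m in zip(s_row, m_row)]
            for s_row, m_row in zip(spectrum, mask)]
```

Each output coefficient depends on a single input coefficient, so this step maps directly onto one thread per frequency bin.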

Perform a one-dimensional inverse Fourier transform on the filtered image along the row direction.

Transpose the image obtained in step [0035], and then perform the one-dimensional inverse Fourier transform along the row direction again.

Transpose the image obtained in step [0036] to obtain the amplitude image, which is the filtered image.
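The inverse path (row-wise inverse transform, transpose, row-wise inverse transform, transpose, amplitude) can be sketched as follows. As an assumption for brevity, `idft1d` is a naive O(n²) inverse DFT standing in for the parallel inverse FFT, and the 1/n scaling is applied in each one-dimensional pass; all names are hypothetical.

```python
import cmath


def idft1d(row):
    """Naive 1-D inverse DFT with 1/n scaling (stand-in for the inverse FFT)."""
    n = len(row)
    return [sum(row[q] * cmath.exp(2j * cmath.pi * q * t / n) for q in range(n)) / n
            for t in range(n)]


def transpose(m):
    """Exchange rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def ifft2d_magnitude(spectrum):
    """Inverse 2-D transform followed by the amplitude of every pixel."""
    rows_done = [idft1d(r) for r in spectrum]              # inverse along rows
    cols_done = [idft1d(r) for r in transpose(rows_done)]  # inverse along columns
    restored = transpose(cols_done)                        # back to row-major
    return [[abs(v) for v in row] for row in restored]
```

For an unfiltered spectrum of a non-negative image, the amplitude image reproduces the original pixel values up to rounding error.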

Transfer the transformed image from the GPU side to the CPU side.

The desired result is obtained.

Claims

1. An FFT-based GPU-parallel algorithm for detecting OCA defects in ultra-large images, mainly directed at visual inspection tasks on industrial products that require high precision and real-time performance on ultra-large images, which improves parallelism on the basis of the fast Fourier transform to achieve fast computation, and has the following main innovation points.

2. The algorithm of claim 1, wherein the inner nested loops are modified to expose more parallelism, the subscripts being expressed with shift operations so that the arrays can be indexed with a uniform pattern.

3. The algorithm of claim 2, wherein the odd/even patterns in the nested loops are replaced with branch-condition statements to facilitate GPU parallelism. The outcome of each conditional branch is recorded in a bool-type array flag. Because branch evaluation performs worse on the GPU than on the CPU, the flag array is computed and assigned on the CPU and then transferred into GPU shared memory to improve access efficiency. During computation the GPU can query the flag array directly to obtain the result quickly, which makes it convenient to unify the subscripts and to compute with multiple threads in parallel.

4. The algorithm of claim 1, wherein the Fourier transform coefficients are precomputed on the CPU and then transferred to GPU shared memory once, which improves computational efficiency and reduces the transfer time per computation.