CN114897665A

CN114897665A - Configurable real-time parallax point cloud computing device and method

Info

Publication number: CN114897665A
Application number: CN202210348784.1A
Authority: CN
Inventors: 孟照腾; 蒿杰; 胡文庆; 孙亚强; 舒琳; 历宁; 范秋香
Original assignee: Institute of Automation of Chinese Academy of Science; Guangdong Institute of Artificial Intelligence and Advanced Computing
Current assignee: Institute of Automation of Chinese Academy of Science; Guangdong Institute of Artificial Intelligence and Advanced Computing
Priority date: 2022-04-01
Filing date: 2022-04-01
Publication date: 2022-08-12
Also published as: WO2023184754A1

Abstract

The invention provides a configurable real-time parallax point cloud computing device and method. The device comprises an image cache unit, a cache controller, a PE array, a result shaping module, a minimum value searching module and a configuration analysis module. The image cache unit outputs image window data according to a specified window size and sliding window sequence; the cache controller controls the image cache unit to output the image window data and distributes the image window data to the PEs in the PE array; the PE array generates a plurality of PUs with specified structures and obtains SAD matching cost calculation results; the result shaping module adds data fields to the matching costs; the minimum value searching module searches the matching costs step by step for the minimum value to obtain a parallax value; and the configuration analysis module analyzes the received configuration information, generates corresponding control signals and inputs the control signals to the other modules respectively. The parallax point cloud can thus be calculated in real time, and the matching parameters can be configured without reconstruction.

Description

Configurable real-time parallax point cloud computing device and method

Technical Field

The invention relates to the technical field of microelectronics, in particular to a configurable real-time parallax point cloud computing device and method.

Background

The stereo matching is a key link in binocular stereo vision, and the stereo matching algorithm searches corresponding points of the left image and the right image according to the similarity of pixel information so as to determine the parallax. Corresponding point searching is carried out on pixel points of the whole image, so that parallax point cloud of the whole image can be generated, and the parallax point cloud is further used for tasks such as distance measurement or three-dimensional reconstruction. The stereo matching algorithm may be deployed on different platforms such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), and the like.

The CPU and the GPU have better programmability, can adapt to different matching parameters to the greatest extent, can meet the three-dimensional matching tasks of different scenes, but have poor real-time performance and cannot meet the application requirement of high real-time performance; the ASIC has high energy efficiency and real-time performance, but it has poor flexibility and cannot adapt to different matching parameters. The FPGA can effectively accelerate computation-intensive tasks, but in the prior art, different matching parameters can only be adapted through reconstruction, and the time cost of reconstruction design is large.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a configurable real-time parallax point cloud computing device and method.

In a first aspect, the present invention provides a configurable real-time parallax point cloud computing device, comprising:

an image cache unit, a cache controller, a processing element (PE) array, a result shaping module, a minimum value searching module and a configuration analysis module;

the image cache unit is connected with the cache controller and used for shaping cached binocular image data according to the size of a specified window and the sequence of sliding windows under the control of the cache controller and outputting image window data to the cache controller;

the cache controller is respectively connected with the configuration analysis module and the PE array and is used for controlling the image cache unit to output image window data according to the control signal transmitted by the configuration analysis module, and the image window data is distributed to the PE in the PE array through the cache controller;

the PE array is respectively connected with the configuration analysis module and the result shaping module and is used for generating a plurality of PUs with specified structures according to control signals transmitted by the configuration analysis module, processing input image window data based on the PUs with the specified structures, obtaining SAD matching cost calculation results and outputting the SAD matching cost calculation results to the result shaping module;

the result shaping module is respectively connected with the configuration analysis module and the minimum value searching module and is used for performing field addition on the input SAD matching cost calculation result according to the control signal transmitted by the configuration analysis module and outputting the result to the minimum value searching module;

the minimum value searching module is connected with the configuration analysis module and used for searching the minimum value of the input SAD matching cost calculation result step by step according to the control signal and the minimum value searching algorithm transmitted by the configuration analysis module and outputting a parallax value corresponding to the minimum matching cost;

the configuration analysis module is used for analyzing the received configuration information, generating corresponding control signals and respectively inputting the control signals to the cache controller, the PE array, the result shaping module and the minimum value searching module.

Optionally, the PEs in the PE array are interconnected vertically and horizontally, and the intermediate result is transferred in the vertical direction, and the operand and the matching cost are transferred in the horizontal direction.

Optionally, one or more of the following types of PEs are included in the PE array:

the Ultra PE, used for calculating the absolute value of the difference of two operands and for accumulating partial sums in the SAD matching cost calculation process;

the Standard PE, used for calculating the absolute value of the difference of two operands and for accumulating partial sums in the SAD matching cost calculation process;

the Lite PE, used for calculating the absolute value of the difference of two operands in the SAD matching cost calculation process;

wherein the computing resources of the Ultra PE are greater than those of the Standard PE.

Optionally, each column in the PE array may be configured into one or more PUs for performing a SAD matching cost calculation operation for a specified window size.

Optionally, in the PU, the PE in the first row is the Ultra PE or the Standard PE.

In a second aspect, the present invention further provides a configurable real-time parallax point cloud computing method, including:

the configuration analysis module analyzes the received configuration information to generate corresponding control signals, and the control signals are respectively input to the cache controller, the PE array, the result shaping module and the minimum value searching module;

the cache controller controls the image cache unit to output image window data corresponding to one path or multiple paths of binocular image data according to the control signal transmitted by the configuration analysis module and according to the size of a specified window and the sequence of sliding windows;

the PE array generates a plurality of PUs of a specified structure according to the control signal transmitted by the configuration analysis module, and processes input image window data based on the plurality of PUs of the specified structure to obtain an SAD matching cost calculation result corresponding to the image window data;

the result shaping module adds fields to the result of SAD matching cost calculation output by the PE array according to the control signal transmitted by the configuration analysis module;

and the minimum value searching module searches the SAD matching cost calculation result after the field is added for a minimum value step by step according to the control signal transmitted by the configuration analysis module and a minimum value searching algorithm, and outputs a parallax value corresponding to the minimum matching cost.

Optionally, the configuration information includes:

the image resolution, the size of a matching window, the parallax search depth, the number of paths of binocular image data and the PE working mode.

Optionally, the determining manner of the configuration information includes:

determining that the performance indexes of single data stream or multiple data streams are met;

allocating computing resources according to the number of unit computing resources allocable in the PE array, the parallax searching depth corresponding to each data stream and the video frame rate corresponding to each data stream;

configuration information is generated based on the allocated computing resources.

Optionally, when the number of data streams is two, the allocating the computing resources according to the number of unit computing resources allocable in the PE array, the disparity search depth corresponding to each data stream, and the video frame rate corresponding to each data stream includes:

determining remaining allocatable computational resources in the PE array after allocating a unit computational resource for each data stream;

if the remaining allocable computing resources can provide at least one unit computing resource for each data stream and meet a first condition, continuing to allocate one unit computing resource for each data stream;

if the remaining allocable computing resources can provide at least one unit computing resource for each data stream but do not meet the first condition, allocating the unit computing resources for each data stream individually according to the magnitude relation between the first numerical value and the second numerical value;

the first numerical value is determined according to the parallax search depth and the video frame rate corresponding to each data stream, the second numerical value is determined according to the number of unit computing resources currently allocated to each data stream, and the first condition is determined according to the first numerical value, the second numerical value and a preset threshold.

Optionally, the method further comprises:

and if the residual allocable computing resources can only provide unit computing resources for the target data stream, allocating all the residual allocable computing resources to the target data stream.

According to the configurable real-time parallax point cloud computing device and method provided by the invention, different matching parameters are adapted through the configuration analysis module without reconstructing the FPGA (field programmable gate array), and the SAD matching cost calculation is completed by the parallel pipeline structure of the PE array, which meets the requirement of high real-time performance. Adaptation to different matching parameters and high real-time performance are thereby achieved at the same time.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is an overall block diagram of a configurable real-time parallax point cloud generating system based on FPGA provided by the present invention;

FIG. 2 is a schematic structural diagram of a configurable real-time parallax point cloud computing apparatus provided in the present invention;

FIG. 3 is a schematic diagram illustrating a resolution compatible method of an image buffer unit according to the present invention;

FIG. 4 is a diagram illustrating PSAD and PSUM definitions provided by the present invention;

FIG. 5 is a schematic diagram of SAD matching cost calculation pipeline design provided by the present invention;

FIG. 6 is a schematic diagram of a fully parallel pipelined SAD computing array provided by the present invention;

FIG. 7 is an internal structure diagram of an Ultra PE provided by the present invention;

FIG. 8 is a schematic flow chart of a configurable real-time parallax point cloud computing method according to the present invention;

FIG. 9 is a schematic diagram of a resource-aware configuration generation process provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Binocular stereo vision is a common means for acquiring scene depth information, has good reliability and robustness, and is widely applied to the fields of mobile robots, automatic driving, industrial automation, automatic monitoring and the like. Stereo matching is a key link in binocular stereo vision. The stereo matching algorithm searches corresponding points of the left and right images according to the similarity of the pixel information, so as to determine the parallax. Corresponding point searching is carried out on pixel points of the whole image, so that parallax point cloud of the whole image can be generated, and the parallax point cloud is further used for tasks such as distance measurement or three-dimensional reconstruction. The current stereo matching algorithm can be divided into a local matching algorithm, a global matching algorithm and a semi-global matching algorithm. Due to its unique real-time property, the local matching algorithm is widely applied to high real-time applications.

The stereo matching algorithm can be deployed on different platforms such as CPU, GPU, FPGA, ASIC, etc. The CPU and the GPU have better programmability, can adapt to different matching parameters (such as the size of a matching window, parallax search depth, image resolution and the like) to the maximum extent, and can meet the stereo matching tasks of different scenes, but they have poor real-time performance and cannot meet the application requirement of high real-time performance. The ASIC has high energy efficiency and real-time performance, but it has poor flexibility and cannot adapt to different matching parameters. The FPGA can effectively accelerate computation-intensive tasks, can adapt to different matching parameters through reconstruction, and can balance real-time performance and flexibility, and has therefore become a mainstream scheme for accelerating stereo matching.

At present, a plurality of stereo matching acceleration platforms based on the FPGA are available, and deployment of local matching algorithms such as Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD), and Census Transform is realized. However, these platforms can only adapt to different matching parameters through the reconstruction design, and the time cost of the reconstruction design is large. Therefore, the invention provides a stereo matching algorithm solution based on FPGA, which can realize stereo matching task and simultaneously meet the requirements of real-time performance, flexible adaptation to different matching parameters and no need of reconstruction.

The core idea of the invention is as follows: different matching parameters are adapted through a configuration analysis module without reconstructing the FPGA (field programmable gate array), and the SAD matching cost calculation is completed through the pipeline structure of the PE array, which guarantees high real-time performance.

Fig. 1 is an overall block diagram of the FPGA-based configurable real-time parallax point cloud generating system. As can be seen from fig. 1, image data acquired by the left and right camera lenses and the acquisition chip is transmitted into the FPGA chip through a high-speed interface, such as a Mobile Industry Processor Interface (MIPI), a Universal Serial Bus (USB), a High Definition Multimedia Interface (HDMI), or a DisplayPort (DP) interface. After passing through the basic image processing module, the data enters the image distortion correction module for image distortion correction. The corrected pixel data is buffered into an external memory frame by frame in a ping-pong manner under the scheduling of a buffer controller, and the image data cached in the external memory is sent to the configurable parallax point cloud computing module for the stereo matching calculation. The stereo matching adopts a local matching algorithm and measures the matching cost using SAD. The obtained parallax data is output via a high-speed interface (e.g., Peripheral Component Interconnect Express (PCIe)).

The main control unit (for example, a CPU built into the FPGA) generates configuration information and sends it to each configurable module to configure the matching parameters. The camera and the acquisition chip can be adjusted to different resolutions and frame rates through register configuration; the basic image processing module includes demosaicing, gray level correction, image color format conversion, resolution cropping and the like, and can crop from high resolution to low resolution through register configuration; and the image distortion correction module is compatible with different resolutions through register configuration.

The configurable real-time parallax point cloud generating system based on the FPGA can realize the configuration of different resolutions, matching window widths and parallax searching depths in a register configuration mode, and can support the real-time processing of multiple groups of binocular data.

Fig. 2 is a schematic structural diagram of the configurable real-time parallax point cloud computing device provided by the present invention, and as can be seen from fig. 2, the device can be applied to a configurable real-time parallax point cloud generating system based on an FPGA, and the device includes an image caching unit 200, a caching controller 210, a Processing Element (PE) array 220, a result shaping module 230, a minimum search module 240, and a configuration parsing module 250.

The image buffer unit 200 is connected to the buffer controller 210, and is configured to shape the buffered binocular image data according to the size of the designated window and the sliding window sequence under the control of the buffer controller 210, and output the image window data to the buffer controller 210;

the cache controller 210 is connected to the configuration parsing module 250 and the PE array 220, respectively, and is configured to control the image cache unit 200 to output image window data according to a control signal transmitted by the configuration parsing module 250, where the image window data is distributed to the PEs in the PE array 220 through the cache controller 210;

the PE array 220 is connected to the configuration analysis module 250 and the result shaping module 230, respectively, and is configured to generate a plurality of Processing Units (PUs) with a specified structure according to the control signal transmitted by the configuration analysis module 250, and process input image window data based on the plurality of PUs with the specified structure, so as to obtain an SAD matching cost calculation result, and output the SAD matching cost calculation result to the result shaping module 230;

the PE array adopts a mesh topology in which each PE is interconnected with its four neighboring PEs above, below, to the left and to the right. A PU with a specified structure is an algorithm processing unit consisting of one column of PEs spanning a plurality of rows, and is used for completing the SAD matching cost calculation of a specified window to obtain a calculation result. The number of rows of a PU can be determined according to the image window size in the configuration information; for example, if the image window size is 3 × 3, one PU comprises 3 PEs arranged in 1 column and 3 rows;

SAD matching cost calculation refers to the cumulative sum of the absolute values of the differences of corresponding pixel values for two windows (e.g., left and right eye image windows).
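For reference, this definition can be written as the formula below. The notation is introduced here for illustration and does not appear in the original text: I_L and I_R denote the gray values of the left and right images, (x, y) is the center of the left window, the window size is (2N + 1) × (2N + 1), and d is the candidate parallax by which the right window is shifted to the left.

```latex
\mathrm{SAD}(x, y, d) \;=\; \sum_{i=-N}^{N} \sum_{j=-N}^{N}
  \bigl|\, I_L(x + j,\; y + i) \;-\; I_R(x + j - d,\; y + i) \,\bigr|
```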

The result shaping module 230 is respectively connected to the configuration parsing module 250 and the minimum value searching module 240, and is configured to add a data field to the input SAD matching cost calculation result according to the control signal transmitted by the configuration parsing module 250 and output the result to the minimum value searching module 240;

the minimum value searching module 240 is connected to the configuration parsing module 250, and configured to search the input SAD matching cost calculation result for a minimum value step by step based on a minimum value search tree, and output a disparity value corresponding to the minimum matching cost;

the configuration parsing module 250 is configured to parse the received configuration information, generate corresponding control signals, and input the control signals to the cache controller, the PE array, the result shaping module, and the minimum search module, respectively.

As can be seen from fig. 2, the image cache unit 200 has a storage size of 48.64 KB and includes 38 Banks. Each Bank consists of a Block Random Access Memory (BRAM) of 1280 Bytes and can store 1280 Bytes of pixel data, for example, one line of pixel data for each of the left and right images at 480p resolution (i.e., 640 × 480), 1/2 line of pixel data for each of the left and right images at 720p resolution (i.e., 1280 × 720), or 1/3 line of pixel data for each of the left and right images at 1080p resolution (i.e., 1920 × 1080). It should be understood by those skilled in the art that the storage size, the number of Banks, and the Bank size of the image cache unit 200 are not limited thereto and can be flexibly adjusted as needed.
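The figures quoted above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only; it assumes one byte per pixel (8-bit gray values), which the text does not state explicitly, and the function names are hypothetical.

```python
from fractions import Fraction

BANK_BYTES = 1280      # size of one Bank, from the text
NUM_BANKS = 38         # number of Banks, from the text
BYTES_PER_PIXEL = 1    # assumption: 8-bit gray pixels (not stated in the text)


def total_capacity_bytes():
    """Total storage of the image cache unit in bytes."""
    return BANK_BYTES * NUM_BANKS


def line_fraction_per_bank(width):
    """Fraction of one image line (for each of the left and right images)
    that fits in a single Bank shared by both images."""
    pixels_per_image = BANK_BYTES // (2 * BYTES_PER_PIXEL)  # 640 pixels each
    return Fraction(pixels_per_image, width)


if __name__ == "__main__":
    print(total_capacity_bytes() / 1000, "KB")
    for width in (640, 1280, 1920):
        print(width, line_fraction_per_bank(width))
```

Under this assumption, 38 × 1280 Bytes gives the stated 48.64 KB, and one Bank holds 1, 1/2 and 1/3 of a line per image at 480p, 720p and 1080p respectively.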

FIG. 3 is a schematic diagram of the resolution-compatible method of the image cache unit provided by the present invention. As can be seen from the figure, the image cache unit 200 can perform a shaping operation on images of different resolutions, that is, take out or rearrange pixel data according to a fixed window size and the sliding window sequence. The matching window size after shaping is 2N + 1 (N ∈ ℕ*, N ≤ 9); at 1920 × 1080 resolution, the matching window size after shaping is 2N + 1 (N ∈ ℕ*, N ≤ 6). For a single image, each row of the shaping buffer can buffer 640 pixels, so the shaping operation can be completed for any picture whose number of pixels per row is 640 × n (n being a positive integer). For example, images are buffered line by line in pixel coordinate order, starting from Bank0. When the input image has 480p resolution, each row of pixels occupies one Bank, and during data shaping, data is read simultaneously from every Bank to complete the shaping; when the input image has 720p resolution, each row of pixels occupies two Banks, and during data shaping, data is read simultaneously from the (2n − 1)-th Banks (n being a positive integer) to complete the shaping; when the input image has 1080p resolution, each row of pixels occupies 3 Banks, and during data shaping, data is read simultaneously from the (3n − 1)-th Banks (n being a positive integer) to complete the shaping. When all Banks are full, incoming data overwrites the historical data starting from Bank0 and continues to be buffered line by line.

The cache controller 210 may generate a read/write control signal for the image cache unit and read/write data in the BRAM (for example, read from an external memory, write to the BRAM, read from the BRAM, distribute to the PE array, etc.) according to the control signal transmitted by the configuration analysis module 250.

The PE array 220 performs SAD-based matching cost calculation. Each PE in the PE array can be configured to operate in different modes according to the control signal transmitted by the configuration analysis module 250. In different working modes (different window widths, parallax search depths, resolutions and numbers of video streams), each PE has different calculation tasks and requires different FPGA resources. A configuration space is therefore constructed for each PE according to all calculation parameters supported by the architecture, and three types of PE structures are designed according to the configuration space: Ultra PE, Standard PE and Lite PE. The three types of PEs are distributed in different rows of the array according to their calculation tasks: the most complex calculations are completed by Ultra PEs, less complex calculations by Standard PEs, and simple calculations by Lite PEs, which saves FPGA resources to the greatest extent. Meanwhile, the PEs in the PE array are interconnected vertically and horizontally; operands and matching costs are transmitted in the horizontal direction, and intermediate results are transmitted in the vertical direction, thereby realizing a multi-stage pipeline design that accelerates the SAD matching cost calculation. Each column in the PE array can be configured to include one or more PUs, enabling matching cost calculation for different search depths.

The result shaping module 230 adds fields to the SAD calculation results generated by the Ultra PE and Standard PE rows. The SAD matching cost calculation results generated by the Ultra PE and Standard PE rows are transmitted to the result shaping module through the interconnected data channels in the horizontal direction. Based on the control signal transmitted by the configuration analysis module, the result shaping module adds to each SAD result a position field corresponding to the position of its candidate parallax (the difference between the horizontal coordinates of the center points of the left and right windows); that is, an 8-bit binary code is added in front of each result to indicate the candidate parallax for which the SAD value is the matching cost. The matching costs with the added fields are then sent to the minimum value searching module.
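The field addition described above can be illustrated with a small bit-packing sketch in Python. Only the 8-bit position field comes from the text; the 16-bit cost width and the function names are assumptions made for this example, and the actual hardware word layout may differ.

```python
FIELD_BITS = 8    # 8-bit position field, from the text
COST_BITS = 16    # assumption: width of the SAD cost (not given in the text)


def add_field(disparity, cost):
    """Place the candidate-parallax code in front of (above) the SAD cost."""
    if not 0 <= disparity < (1 << FIELD_BITS):
        raise ValueError("disparity does not fit in the position field")
    if not 0 <= cost < (1 << COST_BITS):
        raise ValueError("cost does not fit in the assumed cost width")
    return (disparity << COST_BITS) | cost


def split_field(word):
    """Recover (disparity, cost) from a shaped word."""
    return word >> COST_BITS, word & ((1 << COST_BITS) - 1)


if __name__ == "__main__":
    word = add_field(3, 500)
    print(hex(word), split_field(word))
```

Because the position field sits in the high bits, a later minimum search has to compare only the cost part of each word, while the field travels along with the cost so that the winning parallax can be read out directly.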

The minimum search module 240 searches the minimum value of the matching costs step by using a minimum search tree, and outputs a disparity value corresponding to the minimum matching cost.
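A step-by-step minimum search of this kind can be sketched in Python as a pairwise reduction tree. This is only an illustration under stated assumptions: candidates are indexed by parallax starting from 0, ties are resolved in favor of the smaller parallax, and the function name `min_search_tree` is hypothetical. The text does not specify these details.

```python
def min_search_tree(costs):
    """Reduce a list of matching costs pairwise, stage by stage.

    Returns ((disparity, cost), stages), where stages is the number of
    comparison levels used.
    """
    if not costs:
        raise ValueError("at least one matching cost is required")
    level = list(enumerate(costs))  # (candidate parallax, cost) pairs
    stages = 0
    while len(level) > 1:
        nxt = []
        for k in range(0, len(level) - 1, 2):
            a, b = level[k], level[k + 1]
            # Assumption: on a tie, keep the entry with the smaller parallax.
            nxt.append(a if a[1] <= b[1] else b)
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through unchanged
        level = nxt
        stages += 1
    return level[0], stages


if __name__ == "__main__":
    print(min_search_tree([9, 4, 7, 4]))
```

Each stage halves the number of candidates, so a search depth of D needs about log2(D) comparison levels, which maps naturally onto pipeline stages.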

The configuration analysis module 250 receives the configuration information sent by the CPU, including the number of video paths to be processed, the resolution, the search depth and the matching window size. The configuration analysis module 250 parses the received configuration information, generates corresponding flag signals that the cache controller 210 uses to generate read/write addresses and read/write enables, and sends control signals to the PE array 220, the result shaping module 230 and the minimum search module 240.
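As a rough software analogy of this parsing step, the sketch below validates a configuration record and derives two values mentioned elsewhere in the description: the number of Banks per image row (640 pixels per Bank row) and the number of PE rows in a PU (equal to the window size). The class name, field names and the dictionary of derived values are hypothetical; the real module produces hardware control signals whose encoding is not given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchConfig:
    width: int          # image width in pixels
    height: int         # image height in pixels
    window: int         # matching window size, 2N + 1
    search_depth: int   # parallax search depth
    streams: int        # number of binocular video paths


def parse_config(cfg):
    """Check a configuration and derive illustrative control values."""
    if cfg.window % 2 == 0 or not 3 <= cfg.window <= 19:
        raise ValueError("window must be 2N + 1 with 1 <= N <= 9")
    if cfg.width <= 0 or cfg.width % 640 != 0:
        raise ValueError("row width must be a multiple of 640 pixels")
    if cfg.search_depth < 1 or cfg.streams < 1:
        raise ValueError("search depth and stream count must be positive")
    return {
        "banks_per_row": cfg.width // 640,  # 1, 2 or 3 for 480p/720p/1080p
        "pu_rows": cfg.window,              # one PE row per window row
        "search_depth": cfg.search_depth,
        "streams": cfg.streams,
    }


if __name__ == "__main__":
    print(parse_config(MatchConfig(1280, 720, 3, 64, 1)))
```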

The configuration analysis module 250 adapts to different matching parameters without reconstructing the FPGA, and the pipeline structure of the PE array completes the SAD matching cost calculation and guarantees real-time performance. This overcomes the defect of the prior art, which cannot achieve real-time performance and adaptation to different matching parameters at the same time.

Optionally, the PEs in the PE array are interconnected vertically and horizontally, and the intermediate result is transferred in the vertical direction, and the operand and the final matching cost are transferred in the horizontal direction.

Specifically, in the calculation process of the SAD matching cost, the PEs in the PE array are interconnected vertically and horizontally, so that operands and matching costs can be transferred in the horizontal direction and intermediate results can be transferred in the vertical direction. In the horizontal direction, the data reuse characteristic of the stereo matching pipeline is fully utilized: operands and the final SAD matching cost are passed along within the PE array, which avoids remote data access. In the vertical direction, the transfer and accumulation of partial sums during the matching cost calculation of the left and right windows are completed.

Fig. 4 is a diagram illustrating the definitions of PSAD and PSUM provided by the present invention. As can be seen from fig. 4, PSAD denotes taking the absolute differences of the pixel values at corresponding positions in the same column of the left and right images and adding the absolute differences of that column.

Fig. 5 is a schematic diagram of the SAD matching cost calculation pipeline design provided by the present invention. The SAD determination process is explained first. For two input images (a left image and a right image), each pixel point of the left image (called an anchor point) is scanned in sequence, and the following operations are performed for each anchor point: constructing a matching window of fixed size (such as 3 × 3, 5 × 5, ...) centered on the anchor point, and selecting all pixel points in the window coverage area; covering the corresponding position of the right image with the window, and selecting all pixel points of the coverage area; calculating the absolute values of the differences between the gray values of corresponding pixels in the left image coverage area and the right image coverage area, and adding the absolute values to obtain the SAD value; moving the coverage area of the right image to the left with a step length of 1, taking out all pixel points of the coverage area, and calculating the SAD value again; repeating the previous step until the center position of the coverage area of the right image exceeds the parallax search range; and finding the window corresponding to the minimum SAD value in the range, wherein the center point of that window is the corresponding point of the left image anchor point, and the difference between the abscissa of the left image anchor point and the abscissa of its corresponding point in the right image is the parallax of the anchor point.
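The search procedure described above can be sketched in software as follows. This Python example is a plain reference version for a single anchor point, not the pipelined hardware design; it assumes images stored as lists of rows of gray values, a search that starts at parallax 0 and shifts the right window to the left, and hypothetical function names.

```python
def sad(left, right, cx_left, cx_right, cy, n):
    """Sum of absolute differences of two (2n+1) x (2n+1) windows."""
    total = 0
    for dy in range(-n, n + 1):
        for dx in range(-n, n + 1):
            total += abs(left[cy + dy][cx_left + dx]
                         - right[cy + dy][cx_right + dx])
    return total


def disparity_at(left, right, cx, cy, n, search_depth):
    """Return (parallax, minimum SAD) for the anchor point (cx, cy)."""
    best_d, best_cost = 0, None
    for d in range(search_depth):
        if cx - d - n < 0:
            break  # the right window would leave the image
        cost = sad(left, right, cx, cx - d, cy, n)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


if __name__ == "__main__":
    # Synthetic pair: the left image is the right image shifted by 2 pixels.
    right = [[x * x + 3 * y for x in range(8)] for y in range(3)]
    left = [[right[y][x - 2] if x >= 2 else 0 for x in range(8)]
            for y in range(3)]
    print(disparity_at(left, right, cx=5, cy=1, n=1, search_depth=4))
```

In this version every candidate window is read in full, so neighboring candidates fetch largely the same right image pixels again.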

Part (a) of fig. 5 shows the parallax search calculation process for the anchor point with a left image pixel value of 98 (the center point of a 3 × 3 window), where the window size is 3 × 3 and the search range is 4. During the search, the SAD values of the left image window data and right image windows 1, 2, 3, and 4 are calculated in sequence, and the minimum SAD value is found to obtain the parallax. Part (b) of fig. 5 shows how the SAD computation pipeline is implemented in the computing architecture proposed by the present invention. The SAD computation is parallelized as follows: the SAD computation between two windows of size 3 × 3 can be divided into 3 sub-processes, each named a PSAD process, whose result is defined as a PSUM; the SAD result is then obtained by adding the PSUMs.

The calculation process shown in part (a) of fig. 5 takes one full SAD calculation as its granularity, which entails repeated access to the same data. To achieve efficient pipelined calculation, the SAD calculation process is adjusted to the form shown in part (b) of fig. 5, which takes one PSAD as its granularity and implements the SAD calculation by accumulating PSUMs. Specifically, at time t1, left-image sub-window 1 and the 4 sub-windows of right-image sub-window set 1 execute 4 PSAD processes in parallel, producing 4 PSUMs, which from right to left are PSUM1_1, PSUM1_2, PSUM1_3, and PSUM1_4 (PSUMn_m: the PSAD result of left-image sub-window n and the m-th sub-window, counted from right to left, of right-image sub-window set n); at time t2, left-image sub-window 2 and the 4 sub-windows of right-image sub-window set 2 execute 4 PSAD processes in parallel, producing 4 PSUMs, which from right to left are PSUM2_1, PSUM2_2, PSUM2_3, and PSUM2_4; and so on. In this manner, the SAD of window 1 shown in part (a) of fig. 5 is obtained by adding PSUM1_1, PSUM2_1, and PSUM3_1, and is denoted SUM. This pipelining fully exploits the data reusability of the right image. When calculating the matching costs of the two windows centered on 98 and 90 in the left image, redundant calculation is avoided by subtracting PSUM1_1 from SUM and adding PSUM4_1 (the PSAD result of left-image sub-window 4 and the first sub-window, counted from right to left, of right-image sub-window set 4 in part (b) of fig. 5). This calculation requires local caching of historical PSUMs.
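The reuse of historical PSUMs can be illustrated with a short software sketch. It assumes a fixed parallax `d` and a window that slides one column at a time along a band of rows; the names `column_psad` and `sliding_sad` are hypothetical, and the `deque` merely stands in for the local PSUM cache.

```python
from collections import deque


def column_psad(left, right, lcol, rcol, top, height):
    """PSAD of one left-image column against one right-image column."""
    return sum(abs(left[top + i][lcol] - right[top + i][rcol])
               for i in range(height))


def sliding_sad(left, right, top, size, d):
    """SAD at parallax d for every size x size window in the row band.

    Each new window adds the PSUM of the column entering on the right
    and subtracts the cached PSUM of the column leaving on the left,
    so no column is processed twice.
    """
    width = len(left[0])
    history = deque()          # cached PSUMs of the current window
    total = 0
    results = []
    for col in range(d, width):
        psum = column_psad(left, right, col, col - d, top, size)
        history.append(psum)
        total += psum
        if len(history) > size:
            total -= history.popleft()   # drop the oldest PSUM
        if len(history) == size:
            results.append(total)
    return results
```

In this sketch each window after the first costs one PSAD plus one addition and one subtraction, instead of `size` PSADs.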

Exemplarily, fig. 6 is a schematic diagram of the fully parallel pipelined SAD computation array provided by the present invention. The array is composed of three rows and four columns of PEs, can be configured as four PUs (PU1, PU2, PU3, and PU4), and completes the full-parallax parallel computation for a window size of 3 × 3 and a search range of four. Each left-image sub-window in part (b) of fig. 5 needs to undergo PSAD calculation with four right-image sub-windows; each PSAD calculation is performed by one column of PEs (i.e. 1 PU), and the number of PEs in each column is determined by the window size, so four columns of PEs are needed for the full-parallax parallel calculation. Data can be transferred between PEs in the up, down, left and right directions. The entire SAD computation is divided into two stages, which perform operand filling and pipeline computation, respectively.

(1) Operand filling stage:

clk 1-clk 4: as shown in part (b) of fig. 5, the cache controller fetches the three values of left-image sub-window 1 and multicasts them to PE00-PE03, PE10-PE13, and PE20-PE23 in fig. 6 (43 is multicast to PE00-PE03, 87 to PE10-PE13, and 34 to PE20-PE23); the cache controller then fetches the data of the four sub-windows of right-image sub-window set 1 in sequence and sends them to the 12 PEs (88 to PE03, 59 to PE13, 88 to PE23, 1 to PE02, 45 to PE12, 6 to PE22, 42 to PE01, 58 to PE11, 14 to PE21, 69 to PE00, 72 to PE10, and 0 to PE20) through the horizontal data transfer path of the array.

(2) Pipeline computing stage:

clk 5: each PE calculates the difference of the two operands in its registers and takes the absolute value. Taking PE00 as an example, the AD value (26) of its two operands (43 and 69) is calculated and registered.

clk 6: the operands in the PE registers are updated, the AD values of the two updated operands are calculated, and the AD values generated at clk5 are added column by column. Specifically, the PE03 operand is updated to 1, the PE13 operand to 45, the PE23 operand to 6, the PE02 operand to 42, the PE12 operand to 58, the PE22 operand to 14, the PE01 operand to 69, the PE11 operand to 72, the PE21 operand to 0, the PE00 operand to 55, the PE10 operand to 80, and the PE20 operand to 87. The four sub-windows of right-image sub-window set 2 share reusable values with the four sub-windows of right-image sub-window set 1, so only the values of the rightmost sub-window, namely 55, 80 and 87, are taken out from the buffer (PE internal structure) and sent to PE00, PE10 and PE20. The operands of the other three sub-windows are passed horizontally from each PE to its right-hand neighbor. Taking PE00 as an example, the AD value (43) of the two operands (98 and 55) in its registers is calculated and registered. The addition of the AD values is done in two steps: taking the first column as an example, the first step adds the AD value of PE00 to the AD value of PE10 and produces an intermediate result p, and the second step adds the AD value of PE20 to the intermediate result p. clk6 completes the first step: the AD value (15) of PE10 is passed to PE00 and added to the AD value (26) of PE00, and the intermediate result (41) is temporarily stored. The remaining columns are handled in the same way as the first column.

clk 7: the operands in the PE registers are updated, the AD values of the two updated operands are calculated, the second step of adding the AD values generated at clk5 is completed, and the first step of adding the AD values generated at clk6 is completed. The updating of PE operands and the calculation of their AD values are similar to the process described above. For the AD value addition, taking the first column as an example, PE20 passes the AD value (34) of the operands (34 and 0) generated at clk5 to PE00, where it is added to the intermediate result (41) from clk6, yielding PSUM1_1 (75). PE10 passes the AD value (18) of the operands (98 and 80) generated at clk6 to PE00, where it is added to the AD value (43) of the operands (98 and 55) generated by PE00 at clk6, producing an intermediate result (61). The remaining columns are handled in the same way as the first column.

clk 8: the operands in the PE registers are updated, the AD values of the two updated operands are calculated, the second step of adding the AD values generated at clk6 is completed, and the first step of adding the AD values generated at clk7 is completed. The updating of PE operands and the calculation of their AD values are similar to the process described above. For the AD value addition, taking the first column as an example, PE20 passes the AD value (48) of the operands (39 and 87) generated at clk6 to PE00, where it is added to the intermediate result (61) from clk7, yielding PSUM2_1 (109). PE10 passes the AD value (89) of the operands (90 and 1) generated at clk7 to PE00, where it is added to the AD value (12) of the operands (44 and 56) generated by PE00 at clk7, producing an intermediate result (101). The remaining columns are handled in the same way as the first column.

clk 9: the second step of adding the AD values generated at clk7 is completed, and in each column m (m = 1 to 4) PSUM1_m and PSUM2_m are added. Specifically, taking the first column as an example, PE20 passes the AD value (0) of the operands (45 and 45) generated at clk7 to PE00, where it is added to the intermediate result (101) from clk8, yielding PSUM3_1 (101). Also in PE00, PSUM1_1 (75) and PSUM2_1 (109) are added to obtain the intermediate result q (184). The remaining columns are handled in the same way as the first column.

clk 10: the intermediate result q is added to PSUM3_m in each column, yielding the final SUM value. Specifically, taking the first column as an example, PE00 adds PSUM3_1 (101) to the intermediate result q (184) to obtain the final SUM value (285). The remaining columns are handled in the same way as the first column.
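The arithmetic of the first column in the walkthrough above can be checked with a small clock-by-clock sketch. The operand values are those quoted for PE00, PE10 and PE20 at clk5 to clk7; the function name `simulate_column` and the dictionary layout are hypothetical, and the sketch models only the timing of the two-step addition, not the hardware registers.

```python
# Column-1 operands per clock, taken from the walkthrough:
# (left-image values, right-image values), listed as PE00, PE10, PE20.
OPERANDS = {
    5: ((43, 87, 34), (69, 72, 0)),
    6: ((98, 98, 39), (55, 80, 87)),
    7: ((44, 90, 45), (56, 1, 45)),
}


def simulate_column():
    """Replay clk5..clk10 for one column and return every partial result."""
    ad = {}        # clock -> AD values computed by PE00, PE10, PE20
    partial = {}   # clock -> intermediate result p (step one finished)
    psum = {}      # clock -> PSUM (step two finished)
    for clk in range(5, 11):
        if clk in OPERANDS:
            left, right = OPERANDS[clk]
            ad[clk] = [abs(l - r) for l, r in zip(left, right)]
        if clk - 1 in ad:   # step one: AD of PE00 plus AD of PE10
            partial[clk] = ad[clk - 1][0] + ad[clk - 1][1]
        if clk - 2 in ad:   # step two: add the AD of PE20
            psum[clk] = partial[clk - 1] + ad[clk - 2][2]
    q = psum[7] + psum[8]   # PSUM1_1 + PSUM2_1, formed at clk9
    total = q + psum[9]     # plus PSUM3_1, formed at clk10
    return partial, psum, q, total
```

Running `simulate_column()` reproduces the intermediate results 41, 61 and 101, the partial sums 75, 109 and 101, the intermediate result q of 184, and the final SUM of 285 stated in the text.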

PSUM values are passed vertically upward among the PEs, while right-image data and the SAD calculation results generated by the Ultra PE and Standard PE rows are passed horizontally among the PEs, which accelerates the calculation of the SAD matching cost and reduces the time wasted on repeatedly reading data. This PE interconnection facilitates the multi-stage pipeline design of the SAD matching cost and meets the real-time requirements of data transmission and calculation.

Specifically, when the PE array is arranged, the PE array includes one or more of three types of PEs, for example, the PE array may be generated entirely by Ultra PE; the PE array can also be generated by an Ultra PE and a Standard PE; and the PE array can also be generated by three types of Ultra PE, Standard PE and Lite PE together.

FIG. 7 is an internal structure diagram of the Ultra PE provided by the present invention. As can be seen from fig. 7, the Ultra PE internally comprises the following components: AD value accumulation units 701-705, which accumulate the intermediate results (AD values) passed by the PEs in the vertical direction, generate PSUMs, and pass the PSUMs to other PEs through the data interconnection; an AD calculation unit 706, which calculates the difference of two operands and takes the absolute value (hereinafter referred to as the AD value); a PSUM cache unit 707 for locally caching historical PSUMs during time division multiplexing; a PSUM accumulation unit 708; a matching cost cache unit 709; and an operand buffer unit 710 for temporarily storing operands during time division multiplexing.

It should be noted that the Ultra PE has the most computing resources, the Standard PE has the next most, and the Lite PE has the least. The Standard PE comprises the AD value accumulation units 701-703 in FIG. 7, the AD calculation unit 706, the PSUM cache unit 707, the PSUM accumulation unit 708, the matching cost cache unit 709, and the operand cache unit 710. The Lite PE contains only the AD calculation unit 706 and the operand cache unit 710 in fig. 7.

Specifically, each of the AD value accumulation units 701 to 705 includes an adder, two registers (A and B), a MUX, and a DEMUX. Operand 1 of the adder comes from the vertical data bus and is passed in by other PEs; operand 2 is either the historical accumulation result or 0; after the accumulation of one PSAD is completed, operand 2 is set to zero for the next PSAD calculation; the accumulated intermediate value is registered in A and B, which form a FIFO with a depth of 2; the MUX can select the accumulated value from A or B (in the figure, the upper register is B and the lower register is A); taking the accumulated intermediate value from B delays it by one beat (one preset clock period); the outputs of the 5 AD value accumulation units are connected to a MUX that gates the valid PSUM and feeds it into the PSUM cache unit 707 and the PSUM accumulation unit 708, respectively.
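A behavioral sketch of one AD value accumulation unit may help in reading the description above. It is a software model under stated assumptions, not the circuit: the class name `ADAccumulator` and its methods are hypothetical, register A is assumed to hold the newest accumulated value and register B the previous one, and AD values are fed in one per clock.

```python
class ADAccumulator:
    """Adder plus a two-deep register FIFO (A newest, B one beat older)."""

    def __init__(self):
        self.a = 0   # newest accumulated value
        self.b = 0   # value A held one clock earlier

    def step(self, bus_value, restart=False):
        """Accumulate one AD value arriving on the vertical data bus.

        restart=True forces operand 2 to zero, which begins a new PSAD.
        """
        operand2 = 0 if restart else self.a
        self.b = self.a                  # shift the FIFO by one place
        self.a = bus_value + operand2

    def read(self, delayed=False):
        """MUX: take the value from A, or from B for a one-beat delay."""
        return self.b if delayed else self.a
```

Feeding the first-column AD values 26, 15 and 34 from the earlier walkthrough (with `restart=True` on the first) leaves 75 in A and 41 in B, matching PSUM1_1 and the intermediate result that preceded it.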

The AD calculation unit 706 takes left and right image pixel data as input and outputs the result through a FIFO composed of three registers; a MUX can select the AD value from any of the three registers, so that the output can be delayed by one beat (one preset clock period) or by two beats.

The PSUM accumulation unit 708 has three operands: the current PSUM value, the historical PSUM value, and the historical matching cost. The current PSUM value is generated by the AD value accumulation units 701 to 705 and selected by MUX gating, the historical PSUM value is cached in the PSUM cache unit 707, and the historical matching cost is cached in the matching cost cache unit 709.

It should be understood by those skilled in the art that the number of the AD value accumulation units, AD calculation units, etc. and the number of registers in the figure are not restrictive, and they may be adjusted as needed.

According to different requirements of SAD multi-stage flow calculation on calculation resources, different PE structures are designed, the array utilization rate is effectively improved, FPGA resources are saved, and FPGA power consumption is reduced.

Optionally, each column in the PE array may be configured to include one or more PUs for performing SAD matching cost calculation operations for a specified window size.

Specifically, in order to compute parallax in parallel as far as possible in space, and also to accommodate different matching window sizes and different parallax search depths, each column of the PE array may be configured as one or more PUs of a given size. For example, if each column of the PE array has 10 PEs, the column may be configured as 3 PUs that each calculate a 3 × 3 window matching cost, performing three SAD matching cost computations in parallel; as 2 PUs that each calculate a 5 × 5 window matching cost, performing two SAD matching cost computations in parallel; or as 1 PU that calculates a 9 × 9 window matching cost, performing one SAD matching cost computation.

Alternatively, the PE array has 19 PEs per column, and these PEs can be configured as a combination of 22 kinds of computing units as shown in table 1, and compatibility of different window sizes is achieved by the following configuration combinations. It should be noted that the numbers in the table indicate the number of computing units capable of computing the left and right window matching costs of the corresponding window size, and taking combination 1 as an example, it means that each column of PE can be configured with 5 computing units for computing the left and right window matching costs with a size of 3 × 3.

TABLE 1 Configuration combinations of a single column of PEs in the PE array

On this basis, combining a time division multiplexing method allows larger parallax search depths to be supported. For example, if the PE array has 19 PEs per column and 25 columns in total, and each column of PEs can be configured as 5 PUs for calculating the matching cost of left and right windows of size 3 × 3, then the 25 columns of PEs can perform 125 matching cost computations for 3 × 3 windows in parallel. When the parallax search depth is smaller than or equal to 125, the array can perform full-parallax parallel calculation; when the parallax search depth is larger than 125, the calculation is performed in a time division multiplexing mode.

The embodiment of the invention provides 3 time-division multiplexing configurations in total, namely time-division multiplexing for 2 times, 4 times and 8 times, and table 2 shows the maximum search depth which can be achieved by the array under different time-division configurations.

TABLE 2 theoretical maximum search depth for different window widths and multiplexing times

According to the operation, the device can be compatible with different window sizes and different search depths, the window sizes and the search depths can be configured, meanwhile, the parallel design is furthest realized, and the device can meet the requirement of high real-time performance.
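The choice between full-parallax parallel calculation and time division multiplexing can be sketched as below. The multiplexing factors 2, 4 and 8 come from the text; the assumption that the reachable search depth grows linearly with the factor is ours, since the values of table 2 are not reproduced here, and the names `parallel_lanes` and `tdm_factor` are hypothetical.

```python
TDM_FACTORS = (1, 2, 4, 8)   # 1 means fully parallel, no multiplexing


def parallel_lanes(columns, pus_per_column):
    """Number of matching costs the array can compute at the same time."""
    return columns * pus_per_column


def tdm_factor(search_depth, lanes):
    """Smallest multiplexing factor that covers the search depth.

    Assumes a factor f lets the array cover lanes * f parallax values.
    """
    for factor in TDM_FACTORS:
        if search_depth <= lanes * factor:
            return factor
    raise ValueError("search depth exceeds the largest multiplexing factor")
```

For the 25-column example with 5 PUs per column, `parallel_lanes(25, 5)` gives 125, so a search depth of 125 needs no multiplexing while a depth of 126 would fall back to a factor of 2 under this assumption.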

Optionally, in the PU, the PE in the first row is an Ultra PE or a Standard PE.

Specifically, in the process of calculating the matching cost, the intermediate result is passed upwards in the vertical direction, so that each PU performs the accumulation of the final result in the first row, and thus the first row may adopt an Ultra PE or a Standard PE.

Furthermore, when SAD is computed over the whole PE array, resource consumption and compatibility with different window sizes are considered together: every row in which adders are needed should consist of Ultra PEs or Standard PEs, and where more adders are needed (for example, 5), only Ultra PEs can be used; rows that only perform AD value calculation can use Lite PEs.

Based on this design idea, for example, in a PE array of 19 rows and 25 columns, each column of PEs can be configured as one or more PUs for performing SAD calculation. Ultra PEs are arranged in rows 0 and 9, Standard PEs are arranged in rows 2, 4, 6, 8, 11, 13 and 15, and Lite PEs are arranged in rows 1, 3, 5, 7, 10, 12, 14, 16, 17 and 18, so that the configuration achieves the optimal resource allocation without causing design redundancy.
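The row assignment in this example can be written down as a small lookup and checked for completeness. The row numbers are those given in the text; the names `ROW_LAYOUT` and `row_types` are hypothetical.

```python
ROW_LAYOUT = {
    "Ultra": (0, 9),
    "Standard": (2, 4, 6, 8, 11, 13, 15),
    "Lite": (1, 3, 5, 7, 10, 12, 14, 16, 17, 18),
}


def row_types(layout, n_rows):
    """Return the PE type of each row, checking that every row is
    assigned exactly once."""
    types = [None] * n_rows
    for kind, rows in layout.items():
        for row in rows:
            if types[row] is not None:
                raise ValueError(f"row {row} assigned twice")
            types[row] = kind
    if None in types:
        raise ValueError("some rows have no PE type")
    return types
```

Calling `row_types(ROW_LAYOUT, 19)` confirms that the three lists cover all 19 rows without overlap: 2 Ultra rows, 7 Standard rows and 10 Lite rows.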

And an Ultra PE or a Standard PE is distributed at the top end of the PU, so that the advantages of less resource consumption and less device power consumption can be realized.

The method and the device provided by the embodiments of the application are based on the same application concept, and because the principles of solving the problems of the method and the device are similar, the method and the device can be implemented by mutually referring, and repeated parts are not repeated.

Fig. 8 is a schematic flow chart of the configurable real-time parallax point cloud computing method provided by the present invention. As shown in fig. 8, the method includes the following steps:

Step 800: the configuration analysis module analyzes the received configuration information, generates corresponding control signals, and inputs the control signals to the cache controller, the PE array, the result shaping module and the minimum value searching module, respectively.

Step 801: the cache controller, according to the control signal transmitted by the configuration analysis module, controls the image cache unit to output image window data corresponding to one or more paths of binocular image data according to the designated window size and the sliding window sequence.

Step 802: the PE array generates a plurality of PUs of a specified structure according to the control signal transmitted by the configuration analysis module, and processes the input image window data based on the PUs of the specified structure to obtain the SAD matching cost calculation results corresponding to the image window data.

Step 803: the result shaping module adds a field to the SAD matching cost calculation results output by the PE array according to the control signal transmitted by the configuration analysis module.

Step 804: the minimum value searching module, according to the control signal transmitted by the configuration analysis module and the minimum value searching algorithm, searches the field-extended SAD matching cost calculation results step by step for the minimum value, and outputs the parallax value corresponding to the minimum matching cost.
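Steps 803 and 804 can be illustrated with a short sketch. Two assumptions are made that the text does not state: the field added by the result shaping module is taken to be the parallax index of each matching cost, and the step-by-step minimum search is modelled as a pairwise comparison tree. The names `shape` and `stepwise_min` are hypothetical, and the cost list is assumed to be non-empty.

```python
def shape(costs):
    """Attach a parallax index field to every matching cost."""
    return [(cost, d) for d, cost in enumerate(costs)]


def stepwise_min(tagged):
    """Reduce (cost, parallax) pairs level by level to the minimum.

    Each level compares neighbouring pairs and keeps the cheaper one;
    on a tie the entry with the smaller index position is kept.
    """
    level = list(tagged)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            nxt.append(a if a[0] <= b[0] else b)
        if len(level) % 2:           # odd element passes through
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

With illustrative costs `[285, 120, 300, 97, 150]` (only 285 appears in the text; the rest are made up), `stepwise_min(shape(...))` returns the pair `(97, 3)`, that is, the minimum cost and the parallax at which it occurs.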

Different matching parameters are accommodated through the configuration analysis module, and the SAD matching cost is calculated through the pipelined structure of the PE array, which guarantees high real-time performance; this overcomes the defect of the prior art, which cannot achieve high real-time performance and adapt to different types of matching parameters at the same time.

Optionally, the configuration information of the configurable real-time parallax point cloud computing method includes image resolution, matching window size, parallax search depth, path number of binocular image data, and PE working mode.

Specifically, the image resolution supports 480p, 720p, 1080p, and the like; the matching window size includes 3 × 3, 5 × 5, 13 × 13, etc.; the parallax search depth is determined according to the application scene and is specified by a user; the binocular image data comprises data streams of a left image and a right image, which are called left and right data stream pairs for short, and the number of paths of the binocular image data refers to the number of the left and right data stream pairs; the PE working modes comprise Ultra PE, Standard PE and Lite PE. In the SAD matching cost parallel computing process, the rows in the PE array which need to use the adders all use Ultra PE and Standard PE (the Ultra PE is used when the adders are more), and the rows which only need to use AD value computation use Lite PE.

The configuration information can be transmitted and analyzed through the configuration analysis module, so that the function of flexibly adapting to different matching parameters is realized.

Optionally, the determining manner of the configuration information includes:

allocating computing resources according to the number of unit computing resources allocatable in the PE array, the parallax search depth corresponding to each data stream and the video frame rate corresponding to each data stream;

Specifically, the data stream refers to one pair of the left-right-view data streams, and the multiple data streams are a plurality of pairs of the left-right-view data streams.

The unit computing resource refers to a unit composed of a plurality of PEs and capable of processing matching cost of a window with a certain size, for example, for a window with a size of 3 × 3, the computing unit is a PE matrix with 3 rows and 1 column, and includes 3 PEs.

Each of the modules of fig. 2 may be configured in a multiple data stream sharing mode. The image cache unit 200 can support shaping of multiple data streams with different resolutions, and the PE array is spatially partitioned among the different data streams, according to the data stream parameters and the frame rate requirements, by a resource-aware configuration information generation method, thereby improving the utilization rate of computing resources.

When computing units are allocated to data streams, the configuration generation requirement is checked first, i.e. whether there are multiple data streams. In the multi-data-stream case, the multi-data-stream configuration space of the design is used to check whether the computing power of the architecture meets the performance indexes of the multiple data streams (i.e. whether the frame rate requirement is met or nearly met under the specified window width and parallax search depth). If the performance indexes are met, the configuration generation flow is entered; if not, the single-data-stream configuration flow is entered or an error is reported.

In the process of determining resource allocation, computing resources are allocated according to the number of unit computing resources allocable in the PE array, the parallax search depth corresponding to each data stream, and the frame rate requirement corresponding to each data stream. For example, after resources from the allocable resource space have been allocated to a data stream, the resources allocated to each data stream are dynamically adjusted according to the difference between the proportion of computing resources already allocated to that data stream and the proportion of computing resources it requires, where the computing resources required by a data stream are characterized by its parallax search depth and video frame rate.

The method can allocate the computing resources to each data stream according to the availability of the array computing resources and the index requirement of each data stream, and can achieve the effects of fully utilizing the computing resources and balancing the performance indexes of different data streams.

Optionally, when the number of the data streams is two, allocating the computing resources according to the number of unit computing resources allocable in the PE array, the disparity search depth corresponding to each data stream, and the video frame rate corresponding to each data stream includes:

if the remaining allocable computing resources can provide at least one unit computing resource for each data stream and meet the first condition, continuing to allocate one unit computing resource for each data stream;

the first numerical value is determined according to the parallax search depth and the video frame rate corresponding to each data stream, the second numerical value is determined according to the number of unit computing resources which are currently allocated to each data stream, and the first condition is determined according to the first numerical value, the second numerical value and a preset threshold value.

Specifically, fig. 9 is a schematic diagram of the resource-aware configuration generation process provided by the present invention. As can be seen from the figure, the allocation of resources to two data streams A and B is taken as an example. In the first condition (formula image not reproduced in this text), δ is an empirically set small quantity, for example 0.2 or 0.5; the condition relates the first value and the second value (formula images not reproduced), where I represents the number of allocated unit computing resources, D represents the search depth required by the data stream, and F represents the frame rate required by the data stream.

The configuration generation flow is as follows. First, one unit computing resource is allocated to each data stream. The array is then checked for computing resources that can still be allocated to both data streams A and B; if such resources exist, the first condition is checked. If the first condition is satisfied, a unit computing resource continues to be allocated to each of A and B. If it is not satisfied, the magnitudes of the two quantities compared in fig. 9 (formula images not reproduced) are checked: if the quantity for A is smaller, a unit computing resource is allocated to A alone; otherwise, a unit computing resource is allocated to B alone. The check is then repeated until the configuration generation is finished, and the configuration information is output.
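Because the formulas of fig. 9 are not legible in this text, the following sketch substitutes an assumed form for them and should be read only as an illustration of the control flow. It assumes the first value is the ratio of required resources (D_A·F_A)/(D_B·F_B), the second value is the ratio of allocated units I_A/I_B, the first condition is that the two differ by at most δ, and the fallback comparison is between I_A/(D_A·F_A) and I_B/(D_B·F_B). It further assumes both streams use unit computing resources of the same size. The function name `allocate_two_streams` is hypothetical.

```python
def allocate_two_streams(total_units, demand_a, demand_b, delta):
    """Split unit computing resources between streams A and B.

    demand_a and demand_b stand for D*F of each stream (assumed form).
    Returns (units for A, units for B, units left unallocated).
    """
    i_a = i_b = 1                       # one unit each to start with
    remaining = total_units - 2
    target = demand_a / demand_b        # assumed "first value"
    while remaining >= 2:               # both streams can still get a unit
        ratio = i_a / i_b               # assumed "second value"
        if abs(ratio - target) <= delta:
            i_a += 1                    # balanced: one more unit each
            i_b += 1
            remaining -= 2
        elif i_a / demand_a < i_b / demand_b:
            i_a += 1                    # A is under-served
            remaining -= 1
        else:
            i_b += 1                    # B is under-served
            remaining -= 1
    return i_a, i_b, remaining
```

With 10 units, a threshold of 0.2, and stream A requiring twice the search depth of stream B at the same frame rate, this sketch settles on 6 units for A and 3 for B with 1 unit left over.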

By using the method, the allocation resources of each data stream can be dynamically adjusted, so that the resource allocation is better and the utilization rate is high.

Optionally, the resource allocation method further includes:

if the residual allocable computing resources can only provide the unit computing resources for the target data stream, the residual allocable computing resources are all allocated to the target data stream.

Specifically, when resources are allocated to two data streams, one unit computing resource is first allocated to each data stream, and the array is then checked for computing resources that can still be allocated to both data streams A and B. If that condition is not met, the array is checked for remaining unit computing resources that can be allocated to data stream A; if any remain, data stream A is the target data stream and all remaining computing resources are allocated to it. If no unit computing resource in the array can be allocated to data stream A, the array is checked for remaining unit computing resources that can be allocated to data stream B; if any remain, data stream B is the target data stream and all remaining computing resources are allocated to it, after which configuration ends and the configuration information is output. If no remaining computing resource can be allocated to either data stream, configuration ends and the configuration information is output.
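The handling of leftover resources can be sketched as follows. Here resources are counted in PEs, and each stream's unit computing resource is assumed to need a fixed number of PEs (for example, 3 PEs for a 3 × 3 window); the function name `assign_leftover` and its parameters are hypothetical, and the function is meant to be called only once the array can no longer serve both streams at the same time.

```python
def assign_leftover(remaining_pes, unit_a, unit_b, i_a, i_b):
    """Give all leftover PEs to the one stream that can still use them.

    unit_a and unit_b are the PEs per unit computing resource of
    streams A and B. Stream A is checked first, then stream B.
    Returns (units for A, units for B, PEs still unused).
    """
    if remaining_pes >= unit_a:          # A is the target data stream
        extra = remaining_pes // unit_a
        return i_a + extra, i_b, remaining_pes - extra * unit_a
    if remaining_pes >= unit_b:          # B is the target data stream
        extra = remaining_pes // unit_b
        return i_a, i_b + extra, remaining_pes - extra * unit_b
    return i_a, i_b, remaining_pes       # nothing more can be allocated
```

For instance, with 4 PEs left, a 3-PE unit for A and a 5-PE unit for B, only A can be served, so A receives one more unit and 1 PE stays idle.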

By the allocation method, resources can be allocated to a single data stream under the condition that computing resources cannot satisfy the condition of simultaneously allocating two data streams but can satisfy the condition of allocating one data stream, and the resource utilization maximization is realized.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer-executable instructions. These computer-executable instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These processor-executable instructions may also be stored in a processor-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the processor-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These processor-executable instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A configurable real-time parallax point cloud computing device, comprising:

the PE array is respectively connected with the configuration analysis module and the result shaping module and is used for generating a plurality of algorithm Processing Units (PU) of a specified structure according to the control signals transmitted by the configuration analysis module, processing the input image window data based on the plurality of PUs of the specified structure, obtaining SAD matching cost calculation results and outputting the SAD matching cost calculation results to the result shaping module;

2. The configurable real-time parallax point cloud computing apparatus according to claim 1, wherein the PEs in the PE array are interconnected vertically and horizontally, and the intermediate result is transferred vertically and the operand and the final matching cost are transferred horizontally.

3. The configurable real-time parallax point cloud computing device of claim 2, wherein the array of PEs comprises one or more of the following types of PEs:

4. The configurable real-time parallax point cloud computing device of claim 3, wherein each column of the PE array is configurable to include one or more PUs for performing SAD matching cost computation operations for a specified window size.

5. The configurable real-time parallax point cloud computing device of claim 4, wherein in the PUs, the PE of the first row is the Ultra PE or the Standard PE.

6. A configurable real-time parallax point cloud computing method performed based on the configurable real-time parallax point cloud computing device of any one of claims 1 to 5, the method comprising:

7. The method of claim 6, wherein the configuration information comprises:

image resolution, matching window size, parallax search depth, path number of binocular image data and PE working mode.

8. The method of claim 7, wherein the configuration information is determined in a manner comprising:

determining performance indexes meeting single data stream or multiple data streams;

9. The method of claim 8, wherein the allocating computing resources according to the number of unit computing resources allocable in the PE array, the disparity search depth corresponding to each data stream, and the video frame rate corresponding to each data stream comprises:

10. The configurable real-time parallax point cloud computing method of claim 9, wherein the method further comprises: