CN1916959A

CN1916959A - Scaleable large-scale 2D convolution circuit

Info

Publication number: CN1916959A
Application number: CN 200610105061
Authority: CN
Inventors: 黄士坦; 刘红侠
Original assignee: China Aerospace Times Electronics Corp
Current assignee: China Aerospace Times Electronics Corp
Priority date: 2006-08-29
Filing date: 2006-08-29
Publication date: 2007-02-21
Anticipated expiration: 2026-08-29
Also published as: CN100409259C

Abstract

A scalable large-scale 2D convolution circuit comprises: a reference image pixel register; a real-time image pixel register group; a multiplier group of 128 array multipliers; a product register group of 128 P registers; an adder group of 128 adders; an intermediate-result register group of 128 S registers; an output circuit that provides tri-state output of the calculation result; and a control circuit that generates the clock, read/write, chip-select, and clear signals.

Description

Scalable large-scale 2D convolution circuit

Technical field

The invention relates to high-speed auxiliary processing components for embedded computers, and in particular to a scalable large-scale 2D convolution circuit that substantially increases the computation speed of an embedded computer performing image matching.

Background technology

In the prior art, image matching is computed on a microprocessor (including DSP microprocessors). Because the computational load is large, a single microprocessor or DSP cannot meet real-time requirements. Using several microprocessors or DSPs in parallel speeds up the computation, but it increases volume and power consumption and reduces reliability, so it does not meet the requirements of embedded applications.

Summary of the invention

In view of the shortcomings of the prior art described above, the object of the invention is to provide a scalable large-scale 2D convolution circuit that substantially increases processing speed under embedded conditions, improves real-time performance while maintaining reliability, and has a wide range of application.

To achieve this object, the invention adopts the following technical solution:

A scalable large-scale 2D convolution circuit, characterized in that it is designed around the algorithm: it fully exploits the parallelism in the algorithm, applies resource repetition and time overlap techniques, and performs the calculation directly in hardware; in addition, the calculation scale can be scaled according to changes in the computing environment. The circuit comprises:

A reference image pixel register Y, with a data width of 8 bits;

A real-time image pixel register group X, with a data width of 8 bits, in which 128 eight-bit registers x₀ to x₁₂₇ form a shift register. The output of register x₀ is brought off-chip for connection when cascading, and X values are shifted into the register group serially;

A multiplier group of 128 array multipliers M₀, M₁, ..., M₁₂₇. The two inputs of each multiplier Mᵢ come from the reference image pixel register Y and the corresponding real-time image pixel register xᵢ;

A product register group of 128 registers p₀, p₁, ..., p₁₂₇, with a data width of 16 bits. The input of register pᵢ is connected to the output of the corresponding multiplier Mᵢ;

An adder group of 128 adders A₀, A₁, ..., A₁₂₇. The two inputs of each adder Aᵢ come from the corresponding product register pᵢ and the intermediate-result register Sᵢ₋₁;

An intermediate-result register group of 128 registers S₀, S₁, ..., S₁₂₇, with data widths of 16 to 26 bits. Each register Sᵢ holds the sum produced by the corresponding adder Aᵢ;

An output circuit that provides tri-state output of the calculation result so that it can be connected to the CPU bus;

A control circuit that generates the clock, read/write, chip-select, and clear signals.

The scalable large-scale 2D convolution circuit of the invention can perform the multiply-accumulate of 128 pairs of pixel values in a single clock cycle, that is, it completes the calculation:

R = Σ_{i = 0}^{7} Σ_{j = 0}^{15} x_{ij} y_{ij}

Computed on a microprocessor, this requires 128 multiplications and 127 additions, 255 operations in total; with the convolution circuit of the invention, once the pipeline is filled, a single operation suffices. In use, the convolution circuit performs the computation-heavy, highly regular parts of the algorithm, while the microprocessor stores the acquired image data and carries out the other, less parallel and more irregular calculations. This combines the flexibility of the microprocessor with the speed of the hardware circuit, giving flexibility, high adaptability, and high real-time performance.
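As a rough illustration of the operation count above, the following Python sketch evaluates the 8 × 16 double sum sequentially and counts the scalar multiplications and additions a processor would perform. The function name `mac_2d` and the sample data are illustrative assumptions and do not come from the patent.

```python
def mac_2d(x, y):
    """Sequential multiply-accumulate over two equally sized 2D blocks.

    Returns (R, multiplications, additions). Hypothetical helper used only
    to illustrate the operation count; the patent does this in hardware.
    """
    products = [xv * yv
                for x_row, y_row in zip(x, y)
                for xv, yv in zip(x_row, y_row)]
    total = products[0]
    additions = 0
    for p in products[1:]:
        total += p
        additions += 1
    return total, len(products), additions


rows, cols = 8, 16
# Arbitrary 8-bit sample data (assumed for the example).
x = [[(i * cols + j) % 256 for j in range(cols)] for i in range(rows)]
y = [[255 - v for v in row] for row in x]

r, mults, adds = mac_2d(x, y)
print(mults, adds, mults + adds)
```

For an 8 × 16 block the counts come out as 128 multiplications and 127 additions, which is the 255-operation figure used in the comparison.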

Description of drawings

Fig. 1 is the circuit structure diagram of the large-scale 2D convolver of the invention;

Fig. 2 is the circuit diagram of the convolution unit;

Fig. 3 is the programming control diagram;

Fig. 4 is the structure of a signal processing unit that uses the convolver for signal processing.

The invention is described in further detail below with reference to the drawings and the embodiments provided by the inventors.

Embodiment

Image processing algorithms such as image matching frequently require the following calculation:

R = Σ_{i = 0}^{M - 1} Σ_{j = 0}^{N - 1} x_{ij} y_{ij}

When M and N are large, the amount of computation is very large. However, because image processing algorithms such as image matching are highly regular and highly parallel, they can be implemented directly in hardware, which removes the program execution overhead incurred when computing on a microprocessor and so raises processing speed.

The convolution circuit fully exploits the pipelining and parallelism inherent in the algorithm and makes full use of time overlap and resource repetition techniques, giving the circuit both pipelined and parallel computing capability. To widen its range of use, so that it can also perform calculations such as smoothing and filtering, the circuit's scale can be adjusted under program control.

(1) Circuit structure

The circuit structure is shown in Fig. 1. It consists of:

a) A reference image pixel register Y, with a data width of 8 bits;

b) A real-time image pixel register group X, with a data width of 8 bits, in which 128 eight-bit registers x₀ to x₁₂₇ form a shift register. The output of register x₀ is brought off-chip for connection when cascading. X values are shifted into the register group serially;

c) A multiplier group of 128 array multipliers M₀, M₁, ..., M₁₂₇. The two inputs of each multiplier Mᵢ come from the reference image pixel register Y and the corresponding real-time image pixel register xᵢ;

d) A product register group of 128 registers p₀, p₁, ..., p₁₂₇, with a data width of 16 bits. The input of register pᵢ is connected to the output of the corresponding multiplier Mᵢ;

e) An adder group of 128 adders A₀, A₁, ..., A₁₂₇. The two inputs of each adder Aᵢ come from the corresponding product register pᵢ and the intermediate-result register Sᵢ₋₁;

f) An intermediate-result register group of 128 registers S₀, S₁, ..., S₁₂₇, with data widths of 16 to 26 bits. Each register Sᵢ holds the sum produced by the corresponding adder Aᵢ;

g) An output circuit: tri-state output;

h) A control circuit that generates the clock (CLK), read/write (R/W), chip-select (CS), and clear (RESET) signals.
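The 16 to 26 bit range given for the S registers can be compared with worst-case sums. The sketch below assumes unsigned 8-bit pixels (the text does not state signedness) and computes how many bits are needed to hold the largest possible running sum after a given number of terms. The helper name `accumulator_bits` is hypothetical.

```python
def accumulator_bits(num_terms, pixel_bits=8):
    """Bits needed for the largest sum of `num_terms` unsigned products.

    Assumes unsigned pixels of `pixel_bits` bits each (an assumption made
    for this example, not a statement from the patent).
    """
    max_pixel = (1 << pixel_bits) - 1
    max_sum = num_terms * max_pixel * max_pixel
    return max_sum.bit_length()


for terms in (1, 128, 1024):
    print(terms, accumulator_bits(terms))
```

Under that assumption a single product needs 16 bits and a full 128-term sum needs 23 bits, while 26 bits would be enough for a 1024-term sum. Reading the 26-bit upper bound as headroom for cascaded chips is one possible interpretation; the text itself does not explain it.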

As the circuit structure shows, the whole circuit is essentially a series chain of basic convolution units. Each basic convolution unit consists of one multiplier, one real-time image pixel register, one product register, one adder, and one intermediate-result register. As shown in Fig. 2, each basic unit computes Sᵢ₋₁ + xᵢyᵢ. The 128-pair convolution circuit is formed by connecting 128 basic units directly in series and adding the Y register, the control circuit, and the tri-state gates. The structure is regular and simple, and is easy to design and implement.
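A minimal functional sketch of this chain: each basic unit adds its product to the partial sum arriving from the previous unit, so cascading amounts to feeding the last partial sum of one block into the first unit of the next. The sketch ignores clocking and pairs each x value with its own y value, as in the per-unit formula. The `carry_in` argument is a hypothetical stand-in for the cascade connection, which the text does not detail.

```python
def chain(x, y, carry_in=0):
    """Series chain of basic units, each computing S_i = S_{i-1} + x_i * y_i.

    `carry_in` plays the role of S_{-1}; treating it as the cascade input
    is an assumption made for this illustration.
    """
    s = carry_in
    for xi, yi in zip(x, y):
        s = s + xi * yi  # one basic convolution unit
    return s


# Two 128-unit blocks cascaded to cover 256 pixel pairs (sample data assumed).
x = [v % 256 for v in range(256)]
y = [(3 * v + 1) % 256 for v in range(256)]

first_block = chain(x[:128], y[:128])
cascaded = chain(x[128:], y[128:], carry_in=first_block)
print(cascaded)
```

The cascaded result equals the plain 256-term sum of products, which is the sense in which adding blocks increases the calculation scale.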

(2) Circuit operation

1) Reset all registers;

2) First shift the 128 X values x₀ to x₁₂₇ serially into the X register group, then shift the Y values into the Y register one after another;

3) After the first Y value has been shifted in, the 130th pulse places the first convolution result in S₁₂₇, giving:

s_{127} = Σ_{i = 0}^{127} x_{i} y_{i + u}

u = 0, 1, …, m

Thereafter, each time a Y value is shifted in, a new convolution result is placed in S₁₂₇; that is, every clock cycle yields the multiply-accumulate result of 128 pairs of pixel values, and these results correspond in order to u = 0, 1, ..., m.
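The steps above can be sketched as a cycle-by-cycle software model. The register timing used here is an assumption, since the text does not spell it out: on each clock edge the Y register loads the next input, every product register captures xᵢ times the current Y, and every S register captures Sᵢ₋₁ plus its product register, all from the values held before the edge. The names `simulate_chain` and `direct` are hypothetical.

```python
def simulate_chain(x, y_stream):
    """Cycle-level model of the chain; returns S[n-1] after every clock edge.

    Assumed timing: Y register -> product registers -> S registers, with
    all registers updating simultaneously from their pre-edge values.
    """
    n = len(x)
    y_reg = 0
    p = [0] * n
    s = [0] * n
    outputs = []
    # Two extra zero inputs flush the Y -> p -> S register stages.
    for y_in in list(y_stream) + [0, 0]:
        new_s = [(s[i - 1] if i > 0 else 0) + p[i] for i in range(n)]
        new_p = [xi * y_reg for xi in x]
        s, p, y_reg = new_s, new_p, y_in
        outputs.append(s[-1])
    return outputs


def direct(x, y_stream):
    """Reference result: sum_i x[i] * y[i + u] for every full window u."""
    n = len(x)
    return [sum(x[i] * y_stream[i + u] for i in range(n))
            for u in range(len(y_stream) - n + 1)]


x = [1, 2, 3]
y = [4, 5, 6, 7, 8]
results = simulate_chain(x, y)[len(x) + 1:]  # skip the pipeline fill
print(results, direct(x, y))
```

In this model the first full window appears on clock edge n + 2, which is the 130th edge for n = 128 and so agrees with the pulse count in step 3; that agreement depends on the assumed register timing. After the fill, one new result appears per edge.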

(3) Circuit characteristics

1) Regular structure: the circuit is a series chain of basic units, which makes it easy to design and implement;

2) Convenient, flexible cascading: circuits can be cascaded as needed to increase the calculation scale and raise computation speed;

3) Large convolution scale and high computation speed;

4) The circuit uses several parallelism techniques:

Resource repetition: 128 identical multipliers, 128 identical adders, 128 product registers, and 128 intermediate-result registers operate simultaneously.

Time overlap: multiplication and addition, and the successive stages of addition, overlap in time and operate as a pipeline.

(4) Programmable calculation scale

To make the circuit scalable and so adapt it to changes in the computing environment, the calculation scale can be enlarged by cascading and varied under program control, for example for the 3 × 3, 5 × 5, and 7 × 7 templates used in filtering and smoothing. The programming control is shown in Fig. 3.

The correspondence between the control code and the calculation scale is as follows:

a	b	c	d	e	Calculation scale
1	1	1	1	1	8 × 16 = 128 pixel pairs
0	1	1	1	1	8 × 8 = 64 pixel pairs
0	0	1	1	1	7 × 7 = 49 pixel pairs
0	0	0	1	1	4 × 8 = 32 pixel pairs
0	0	0	0	1	5 × 5 = 25 pixel pairs
0	0	0	0	0	3 × 3 = 9 pixel pairs
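The code-to-scale correspondence can be captured as a simple lookup. This sketch only records the table; how the bits a to e actually gate the hardware is shown in Fig. 3 and is not modelled here. The names `SCALE_BY_CODE` and `pixel_pairs` are hypothetical.

```python
# Control code (a, b, c, d, e) -> (rows, cols) of the calculation scale,
# transcribed from the table; the names are illustrative only.
SCALE_BY_CODE = {
    (1, 1, 1, 1, 1): (8, 16),
    (0, 1, 1, 1, 1): (8, 8),
    (0, 0, 1, 1, 1): (7, 7),
    (0, 0, 0, 1, 1): (4, 8),
    (0, 0, 0, 0, 1): (5, 5),
    (0, 0, 0, 0, 0): (3, 3),
}


def pixel_pairs(code):
    """Number of pixel pairs processed for a given control code."""
    rows, cols = SCALE_BY_CODE[tuple(code)]
    return rows * cols


print(pixel_pairs((0, 0, 1, 1, 1)))
```

A code that is not listed in the table raises `KeyError`, since the text defines no behaviour for other bit patterns.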

(5) Implementation: the circuit is designed as an IP core and implemented in an FPGA.

The technical effects of the invention are:

1. The multiply-accumulate of 128 pairs of pixel values can be performed in a single clock cycle, that is, the circuit completes the calculation:

R = Σ_{i = 0}^{7} Σ_{j = 0}^{15} x_{ij} y_{ij}

Computed on a microprocessor, this requires 128 multiplications and 127 additions, 255 operations in total; with the convolver of the invention, once the pipeline is filled, a single operation suffices.

2. Estimated processing speed of a signal processing unit that uses a DSP as the main processor and the scalable large-scale 2D convolver as a fast auxiliary processing component:

The signal processing unit consists of a TMS320C6701 DSP and the hardware algorithm circuit. Because image matching is computationally very demanding, it is used here to estimate the processing speed of the signal processing unit. Multiply-accumulate operations account for more than 80% of the total computation in image matching, so the estimate can be based on them. The multiply-accumulate of 128 pairs of pixel values serves as the example.

The multiply-accumulate of 128 pairs of pixel values requires 128 multiplications and 127 additions, 255 operations in total. On the TMS320C6701, each operation is estimated to take four instructions on average, so the DSP must execute L₁ = 255 × 4 = 1020 instructions to complete the calculation.

When the signal processing unit performs the calculation, the hardware convolver computes under the control of the DSP. Once the pipeline is filled, the DSP issues a read signal that loads one pixel value from the reference image memory into the Y register and, at the same time, reads and saves the multiply-accumulate result of 128 pairs of pixel values; this repeats in a loop. Only three instructions are needed (read, write, and conditional branch), but each memory access incurs one wait cycle, so the three instructions take 6 instruction cycles, equivalent to 6 single-cycle instructions, denoted L₂.

The processing speed of the signal processing unit for multiply-accumulate operations is therefore much higher than that of the TMS320C6701 alone. The improvement factor is:

M = \frac{L_{1}}{L_{2}} = \frac{1020}{6} = 170

The average processing speed of the TMS320C6701 is about 600 MIPS, so the processing speed of the signal processing unit for multiply-accumulate operations is V₁ = 170 × 600 MIPS = 102000 MIPS.

Scaling by the 80% share that multiply-accumulate operations take of the whole matching computation, the processing capability of the signal processing unit for image matching is V = V₁ × 80% = 81600 MIPS.
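The estimate above can be reproduced step by step. The sketch below simply restates the arithmetic from the text in integer form; the function name `estimate_speed` and its parameter names are hypothetical, and the result is an equivalent-instruction figure rather than a measured throughput.

```python
def estimate_speed(pairs=128, instr_per_op=4, loop_cycles=6,
                   dsp_mips=600, mac_share_percent=80):
    """Reproduce the speed estimate from the text using integer arithmetic.

    All default values are the figures quoted in the text; the function
    itself is an illustrative helper, not part of the patent.
    """
    operations = pairs + (pairs - 1)       # 128 multiplies + 127 adds
    l1 = operations * instr_per_op         # DSP-only instruction count
    l2 = loop_cycles                       # cycles per result with convolver
    speedup = l1 // l2
    v1 = speedup * dsp_mips                # equivalent MIPS for MAC work
    v = v1 * mac_share_percent // 100      # scaled by the 80% MAC share
    return operations, l1, speedup, v1, v


print(estimate_speed())
```

With the quoted figures this yields 255 operations, L₁ = 1020, a factor of 170, V₁ = 102000 MIPS, and V = 81600 MIPS.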

Embodiment:

The signal processing unit uses a DSP + convolver structure, as shown in Fig. 4. In this structure the FPGA-implemented convolver is attached to the DSP bus and is driven and controlled by the DSP; acting as a high-speed coprocessing component, it relieves the DSP's load and speeds up execution. The computation-heavy, highly regular parts of the algorithm (such as smoothing, filtering, and matching) are performed by the convolver. The DSP stores the acquired image data, runs the other, less parallel and more irregular calculations (such as histogram computation, correction, and fitting), evaluates the convolver's results as a whole, and outputs the control parameters. This combines the flexibility of the DSP with the speed and efficiency of the hardware algorithm circuit, achieving high flexibility, high adaptability, and high real-time performance.

To raise computation speed and make full use of the convolver's parallelism, once the pipeline is filled during a convolution, writing data to the convolver's Y register (see Fig. 1) and reading the result from the convolver take place at the same time, which would cause the written and read data to conflict. To solve this, an isolation circuit is used. When the DSP writes image data to the memory, the isolator is opened and the DSP writes the data to the reference image memory over the data bus. During convolution the isolator is closed, disconnecting the DSP data bus from the memory data bus, so that the data loaded into the convolver from memory is isolated from the data read out of the convolver and no conflict occurs. Loading data into Y and reading the result can then proceed simultaneously, making full use of the pipelined, parallel nature of the convolution circuit and raising computation speed.

Claims

1. A scalable large-scale 2D convolution circuit, characterized in that the circuit comprises:

A reference image pixel register Y, with a data width of 8 bits;