CN1916959A - Scaleable large-scale 2D convolution circuit - Google Patents
- Publication number
- CN1916959A CN 200610105061 CN200610105061A
- Authority
- CN
- China
- Prior art keywords
- register
- circuit
- group
- multiplier
- adder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Complex Calculations (AREA)
Abstract
A scalable large-scale 2D convolution circuit consists of: a reference-image pixel register; a real-time-image pixel register group; a multiplier group formed by 128 array multipliers; a product register group formed by 128 P registers; an adder group formed by 128 adders; an intermediate-result register group formed by 128 S registers; an output circuit providing tri-state output of the calculation result; and a control circuit that generates the clock, read/write, chip-select and clear signals.
Description
Technical field
The invention belongs to the field of high-speed auxiliary processing components for embedded computers and relates to a scalable large-scale 2D convolution circuit, used to significantly increase the computing speed of an embedded computer when it performs image matching.
Background technology
In the prior art, image matching calculations are all carried out on microprocessors (including DSP microprocessors). Because the amount of calculation is large, a single microprocessor (DSP) cannot meet real-time requirements. To speed up the calculation, several microprocessors (DSPs) are used in parallel, but this increases volume and power consumption and reduces reliability, and so does not satisfy the requirements of embedded applications.
Summary of the invention
In view of the shortcomings of the prior art described above, the object of the invention is to provide a scalable large-scale 2D convolution circuit that significantly increases processing speed under embedded conditions, improving real-time performance while preserving reliability, and that has a wide range of application.
To achieve this, the invention adopts the following technical solution:
A scalable large-scale 2D convolution circuit, characterized in that: it is designed around the algorithm, fully exploits the parallelism in the algorithm, applies resource-replication and time-overlap techniques, and performs the calculation directly in hardware; at the same time, the calculation scale can be scaled according to changes in the computing environment. The circuit comprises:
A reference-image pixel register Y, with a data width of 8 bits;
A real-time-image pixel register group X, with a data width of 8 bits, in which 128 eight-bit registers $x_0$ to $x_{127}$ form a shift register; the output of register $x_0$ is brought off-chip for connection when cascading, and X values are shifted into the register group serially;
A multiplier group consisting of 128 array multipliers $M_0, M_1, \ldots, M_{127}$; the two inputs of each multiplier $M_i$ come from the reference-image pixel register Y and the corresponding real-time-image pixel register $x_i$;
A product register group consisting of 128 registers $p_0, p_1, \ldots, p_{127}$ with a data width of 16 bits; the input of register $p_i$ is connected to the output of the corresponding multiplier $M_i$;
An adder group consisting of 128 adders $A_0, A_1, \ldots, A_{127}$; the two inputs of each adder $A_i$ come from the corresponding product register $p_i$ and the intermediate-result register $S_{i-1}$;
An intermediate-result register group consisting of 128 registers $S_0, S_1, \ldots, S_{127}$ with data widths of 16 to 26 bits; each intermediate-result register $S_i$ holds the sum produced by the corresponding adder $A_i$;
An output circuit providing tri-state output of the calculation result, so that it can be connected to the CPU bus;
A control circuit that generates the clock, read/write, chip-select and clear signals.
The scalable large-scale 2D convolution circuit of the invention can perform the multiply-accumulate of 128 pairs of pixel values in a single clock cycle, i.e. it completes the calculation:

$$\sum_{i=0}^{127} x_i y_i$$
If a microprocessor performs this calculation, it must carry out 128 multiplications and 127 additions, 255 operations in total; with the convolution circuit of the invention, once the pipeline has been filled, a single operation suffices. In use, the calculations in the algorithm that are computationally heavy and highly regular are performed by the convolution circuit, while the microprocessor stores the acquired image data and performs the other calculations that have poor parallelism or are irregular. The flexibility of the microprocessor is thus combined with the speed of the hardware circuit, achieving flexibility, high adaptability and high real-time performance.
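The operation count quoted above can be checked with a short sketch. The function name `mac_with_op_count` and the sample data are illustrative assumptions, not part of the patent; the code simply performs the sequential multiply-accumulate a microprocessor would and counts the arithmetic operations.

```python
def mac_with_op_count(x, y):
    """Sequential multiply-accumulate that also counts arithmetic operations."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    multiplies = 1          # first product, no addition needed yet
    additions = 0
    total = x[0] * y[0]
    for xi, yi in zip(x[1:], y[1:]):
        total += xi * yi    # one multiplication and one addition per further pair
        multiplies += 1
        additions += 1
    return total, multiplies, additions


# Illustrative 8-bit pixel values (assumed data, not from the patent).
x = list(range(128))
y = [255 - v for v in x]
total, muls, adds = mac_with_op_count(x, y)
print(muls, adds, muls + adds)  # 128 127 255
```

For 128 pixel pairs the loop performs 128 multiplications and 127 additions, matching the 255 operations stated in the text.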
Description of drawings
Fig. 1 is the circuit structure diagram of the large-scale 2D convolver of the invention;
Fig. 2 is the circuit diagram of the convolution unit;
Fig. 3 is the programming control diagram;
Fig. 4 is the structure of a signal processing unit that uses the convolver for signal processing.
The invention is described in further detail below with reference to the drawings and the embodiment provided by the inventor.
Embodiment
In image processing algorithms such as image matching, the following formula must frequently be computed:
When M and N are large, the amount of calculation is very large. However, image processing algorithms such as image matching are highly regular and highly parallel, so they can be implemented directly in hardware, removing the program-execution overhead incurred when calculating with a microprocessor and thereby increasing processing speed.
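The formula itself is not reproduced in this text, so the following is only a sketch under an assumed form: a sum of products between an M × N reference template and the M × N window of the real-time image at offset (u, v). The function names and sample data are hypothetical. The sketch also shows how a 2D window can be flattened row by row into the one-dimensional list of pixel pairs that a multiply-accumulate chain consumes.

```python
def flatten_window(image, top, left, rows, cols):
    """Row-major list of the rows x cols window whose top-left corner is (top, left)."""
    return [image[top + r][left + c] for r in range(rows) for c in range(cols)]


def match_score(reference, realtime, top, left):
    """Assumed matching measure: sum of products over one window of the real-time image."""
    rows, cols = len(reference), len(reference[0])
    ref_flat = flatten_window(reference, 0, 0, rows, cols)
    win_flat = flatten_window(realtime, top, left, rows, cols)
    return sum(a * b for a, b in zip(ref_flat, win_flat))


# Tiny illustrative images (assumed data): a 2 x 2 template on a 3 x 3 image.
reference = [[1, 2], [3, 4]]
realtime = [[1, 0, 2], [0, 1, 0], [2, 0, 1]]
scores = [[match_score(reference, realtime, u, v) for v in range(2)] for u in range(2)]
print(scores)  # [[5, 7], [8, 5]]
```

Each score costs M × N multiplications, and there is one score per offset, which is why the workload grows quickly with M and N.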
The convolution circuit fully exploits the pipelined and parallel nature of the algorithm and makes full use of time-overlap and resource-replication techniques, giving the circuit pipelined and parallel computing capability. To widen its range of use, so that it can also perform calculations such as smoothing and filtering, the scale of the circuit can be changed under program control.
(1) Circuit structure
The circuit structure is shown in Fig. 1. It consists of:
A. A reference-image pixel register Y, with a data width of 8 bits;
B. A real-time-image pixel register group X, with a data width of 8 bits. 128 eight-bit registers $x_0$ to $x_{127}$ form a shift register. The output of register $x_0$ is brought off-chip for connection when cascading. X values are shifted into the register group serially;
C. A multiplier group consisting of 128 array multipliers $M_0, M_1, \ldots, M_{127}$; the two inputs of each multiplier $M_i$ come from the reference-image pixel register Y and the corresponding real-time-image pixel register $x_i$;
D. A product register group consisting of 128 registers $p_0, p_1, \ldots, p_{127}$ with a data width of 16 bits; the input of register $p_i$ is connected to the output of the corresponding multiplier $M_i$;
E. An adder group consisting of 128 adders $A_0, A_1, \ldots, A_{127}$; the two inputs of each adder $A_i$ come from the corresponding product register $p_i$ and the intermediate-result register $S_{i-1}$;
F. An intermediate-result register group consisting of 128 registers $S_0, S_1, \ldots, S_{127}$ with data widths of 16 to 26 bits; each intermediate-result register $S_i$ holds the sum produced by the corresponding adder $A_i$;
G. An output circuit: tri-state output;
H. A control circuit: generates the clock (CLK), read/write (R/W), chip-select (CS) and clear (RESET) signals.
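The register widths in the list can be sanity-checked with a short calculation. This is a sketch that assumes unsigned 8-bit pixels; the function name `accumulator_bits` is hypothetical, and the text does not say how the 16 to 26 bit range was derived.

```python
def accumulator_bits(num_products, pixel_bits=8):
    """Bits needed to hold the sum of num_products products of unsigned pixel values."""
    max_pixel = (1 << pixel_bits) - 1          # 255 for 8-bit pixels
    max_sum = num_products * max_pixel * max_pixel
    return max_sum.bit_length()


for n in (1, 128, 1024):
    print(n, accumulator_bits(n))
# 1 16
# 128 23
# 1024 26
```

Under this assumption a single product fits the 16-bit product register, and a full 128-term sum needs 23 bits. A 26-bit register would be enough for 1024 terms, which is one reading consistent with leaving headroom for cascading, although the text does not state this.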
As the circuit structure shows, the whole circuit is essentially a series connection of basic convolution units. Each basic convolution unit consists of a real-time-image pixel register, a multiplier, a product register, an adder and an intermediate-result register. As shown in Fig. 2, each basic unit computes $S_{i-1} + x_i y_i$. The 128-point convolution circuit is formed by connecting 128 basic units directly in series and adding the Y register, the control circuit and the tri-state gates. The whole circuit is regular and simple in structure and easy to design and implement.
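The way the basic units compose can be sketched in software. This is a purely functional illustration that ignores clocking and register timing; `conv_cell` and `chain` are hypothetical names.

```python
from functools import reduce


def conv_cell(s_prev, x_i, y_i):
    """One basic unit: add the product x_i * y_i to the incoming partial sum."""
    return s_prev + x_i * y_i


def chain(x, y):
    """Connect one cell per pixel pair in series, starting from a partial sum of 0."""
    return reduce(lambda s, pair: conv_cell(s, pair[0], pair[1]), zip(x, y), 0)


print(chain([1, 2, 3], [4, 5, 6]))  # 32
```

Feeding the output of each cell into the next is what turns the per-unit expression $S_{i-1} + x_i y_i$ into the full sum of products at the end of the chain.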
(2) Circuit operation
1) Reset all registers;
2) First shift the 128 X values $x_0$ to $x_{127}$ serially into the X register group, then shift the Y values into the Y register one after another;
3) After the first Y value has been shifted in, the 130th pulse places the first convolution result in $S_{127}$. This realizes:
After this, each time a Y value is shifted in, a convolution result is placed in $S_{127}$; that is, every clock cycle yields the multiply-accumulate result of 128 pairs of pixel values. These results correspond in order to u = 0, 1, ..., m.
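The streaming behaviour described in these steps can be illustrated with a small cycle-level model. It is a sketch under assumptions: every product register and intermediate-result register updates on the same clock edge from the previous state, the input to the first adder is 0, and X is already loaded. The name `simulate_chain` is hypothetical, and the model does not attempt to reproduce the exact pulse count given in the text.

```python
def simulate_chain(x, y_stream):
    """Clock the chain once per Y value and record the last S register after each clock."""
    n = len(x)
    p = [0] * n   # product registers p_0 .. p_{n-1}
    s = [0] * n   # intermediate-result registers S_0 .. S_{n-1}
    outputs = []
    for y in y_stream:
        # Both register groups load simultaneously from the old state.
        new_s = [(s[i - 1] if i > 0 else 0) + p[i] for i in range(n)]
        new_p = [x[i] * y for i in range(n)]
        s, p = new_s, new_p
        outputs.append(s[n - 1])
    return outputs


x = [1, 2, 3]
y = [4, 5, 6, 7, 8]
out = simulate_chain(x, y + [0])      # one extra clock flushes the last result
n = len(x)
results = out[n:]                     # entries before index n are pipeline fill
direct = [sum(x[i] * y[u + i] for i in range(n)) for u in range(len(y) - n + 1)]
print(results, direct)  # [32, 38, 44] [32, 38, 44]
```

In this model, once the pipeline is full, each further clock delivers one complete sum of products for the next offset u, which is the one-result-per-clock behaviour the steps describe.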
(3) Circuit characteristics
1) Regular structure: the circuit is a series connection of basic units, which makes it easy to design and implement;
2) Convenient, flexible cascading: circuits can be cascaded as required to increase the calculation scale and the computing speed;
3) Large convolution scale and high computing speed;
4) The circuit uses several parallel techniques:
Resource replication: 128 identical multipliers, 128 identical adders, 128 product registers and 128 intermediate-result registers work simultaneously.
Time overlap: the multiply and add stages, and the successive add stages, overlap in time and operate as a pipeline.
(4) Programmable calculation scale
To make the circuit scale adjustable and so adapt to changes in the computing environment, the calculation scale can be enlarged by cascading and changed under program control, for example for the 3 × 3, 5 × 5 and 7 × 7 templates used in filtering and smoothing. The programming control is shown in Fig. 3.
The correspondence between the code and the calculation scale is as follows:
a | b | c | d | e | Calculation scale |
---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 8 × 16 = 128 pixel pairs |
0 | 1 | 1 | 1 | 1 | 8 × 8 = 64 pixel pairs |
0 | 0 | 1 | 1 | 1 | 7 × 7 = 49 pixel pairs |
0 | 0 | 0 | 1 | 1 | 4 × 8 = 32 pixel pairs |
0 | 0 | 0 | 0 | 1 | 5 × 5 = 25 pixel pairs |
0 | 0 | 0 | 0 | 0 | 3 × 3 = 9 pixel pairs |
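The code table can be expressed as a simple lookup. This is a sketch of how driver software might interpret the five control bits; the names `SCALE_BY_CODE` and `pixel_pairs` are hypothetical, and how the bits act inside the circuit (Fig. 3) is not modelled.

```python
# Control bits (a, b, c, d, e) mapped to the template size (rows, cols) from the table.
SCALE_BY_CODE = {
    (1, 1, 1, 1, 1): (8, 16),
    (0, 1, 1, 1, 1): (8, 8),
    (0, 0, 1, 1, 1): (7, 7),
    (0, 0, 0, 1, 1): (4, 8),
    (0, 0, 0, 0, 1): (5, 5),
    (0, 0, 0, 0, 0): (3, 3),
}


def pixel_pairs(code):
    """Number of pixel pairs multiplied and accumulated for a given control code."""
    rows, cols = SCALE_BY_CODE[tuple(code)]
    return rows * cols


print(pixel_pairs((0, 0, 1, 1, 1)))  # 49
```

In the table, each additional bit set to 1 selects a larger calculation scale, from 9 pixel pairs up to the full 128.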
(5) Implementation: the circuit is designed as an IP core and implemented on an FPGA.
The technical effects of the invention are:
1. The multiply-accumulate of 128 pairs of pixel values can be performed in a single clock cycle, i.e. the following calculation is completed:

$$\sum_{i=0}^{127} x_i y_i$$

If a microprocessor performs this calculation, it must carry out 128 multiplications and 127 additions, 255 operations in total; with the convolver of the invention, once the pipeline has been filled, a single operation suffices.
2. Estimated processing speed of a signal processing unit that uses a DSP as the main processor and the scalable large-scale 2D convolver as a fast auxiliary processing component:
The signal processing unit consists of a TMS320C6701 DSP and the hardware algorithm circuit. Because image matching is computationally very heavy, the processing speed of the signal processing unit is estimated using an image matching calculation. Multiply-accumulate operations account for more than 80% of the total computation in image matching, so the estimate can be based on them. The multiply-accumulate of 128 pairs of pixel values is taken as the example.
The multiply-accumulate of 128 pairs of pixel values requires 128 multiplications and 127 additions, 255 operations in total. On the TMS320C6701, each operation is estimated to take four instructions on average, so the number of instructions the DSP must execute to complete the whole calculation is $L_1 = 255 \times 4 = 1020$.
When the signal processing unit performs the calculation, the hardware convolver computes under the control of the DSP. Once the pipeline has been filled, the DSP issues a read signal that loads one pixel value from the reference-image memory into the Y register and, at the same time, reads and stores the multiply-accumulate result of 128 pairs of pixel values. This repeats in a loop, so only three instructions are needed: a read, a write and a conditional branch. Each memory access incurs one wait cycle, however, so the three instructions take 6 instruction cycles, equivalent to 6 single-cycle instructions, denoted $L_2$.
The processing speed of the signal processing unit for multiply-accumulate operations is therefore much higher than that of the TMS320C6701 alone. The improvement factor is:

$$\frac{L_1}{L_2} = \frac{1020}{6} = 170$$

The average processing speed of the TMS320C6701 is about 600 MIPS, so the processing speed of the signal processing unit for multiply-accumulate operations is $V_1 = 170 \times 600\ \text{MIPS} = 102000\ \text{MIPS}$.
Scaling by the 80% share that multiply-accumulate operations take of the whole matching computation, the processing capability of the signal processing unit for image matching is $V = V_1 \times 80\% = 81600\ \text{MIPS}$.
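The estimate can be reproduced step by step. The sketch below only restates the patent's own figures and arithmetic; the variable names are illustrative, and no claim is made about the accuracy of the underlying assumptions (four instructions per operation, 600 MIPS, the 80% scaling).

```python
multiplies = 128
additions = 127
instructions_per_op = 4                 # the patent's estimate for the TMS320C6701

l1 = (multiplies + additions) * instructions_per_op   # DSP alone
l2 = 3 * 2                              # read, write, branch, each taking 2 cycles

speedup = l1 // l2
dsp_mips = 600                          # average DSP speed quoted in the text
v1 = speedup * dsp_mips                 # equivalent speed on multiply-accumulate work
v = v1 * 80 // 100                      # scaled by the 80% share, as in the text

print(l1, l2, speedup, v1, v)  # 1020 6 170 102000 81600
```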
Embodiment:
The signal processing unit uses a DSP plus convolver structure, as shown in Fig. 4. In this structure, the convolver, implemented on an FPGA, is attached to the DSP bus and is driven and controlled by the DSP. It acts as a high-speed coprocessing component of the DSP, relieving the DSP's load and speeding up operation. The calculations in the algorithm that are computationally heavy and highly regular (such as smoothing, filtering and matching) are performed by the convolver. The DSP stores the acquired image data, runs the other calculations that have poor parallelism or are irregular (such as histogram computation, correction and fitting), evaluates the convolver's results as a whole and outputs the control parameters. The flexibility of the DSP is thus combined with the high speed and efficiency of the hardware algorithm circuit, achieving high flexibility, high adaptability and high real-time performance.
To increase computing speed and make full use of the convolver's parallelism, once the pipeline has been filled during a convolution, data is written to the convolver's Y register (see Fig. 1) at the same time as a result is read from the convolver. This would cause the written data and the read data to conflict. To solve this, an isolation circuit is used. When the DSP writes image data to memory, the isolator is open and the DSP writes the data to the reference-image memory over the data bus. During convolution the isolator is closed, disconnecting the DSP data bus from the memory data bus, so that the data loaded into the convolver from memory is isolated from the data read out of the convolver and no conflict occurs. Loading data into Y and reading the result can then take place simultaneously, making full use of the pipelined, parallel nature of the convolution circuit and increasing computing speed.
Claims (1)
1. A scalable large-scale 2D convolution circuit, characterized in that the circuit comprises:
A reference-image pixel register Y, with a data width of 8 bits;
A real-time-image pixel register group X, with a data width of 8 bits, in which 128 eight-bit registers $x_0$ to $x_{127}$ form a shift register; the output of register $x_0$ is brought off-chip for connection when cascading, and X values are shifted into the register group serially;
A multiplier group consisting of 128 array multipliers $M_0, M_1, \ldots, M_{127}$; the two inputs of each multiplier $M_i$ come from the reference-image pixel register Y and the corresponding real-time-image pixel register $x_i$;
A product register group consisting of 128 registers $p_0, p_1, \ldots, p_{127}$ with a data width of 16 bits; the input of register $p_i$ is connected to the output of the corresponding multiplier $M_i$;
An adder group consisting of 128 adders $A_0, A_1, \ldots, A_{127}$; the two inputs of each adder $A_i$ come from the corresponding product register $p_i$ and the intermediate-result register $S_{i-1}$;
An intermediate-result register group consisting of 128 registers $S_0, S_1, \ldots, S_{127}$ with data widths of 16 to 26 bits; each intermediate-result register $S_i$ holds the sum produced by the corresponding adder $A_i$;
An output circuit providing tri-state output of the calculation result, so that it can be connected to the CPU bus;
A control circuit that generates the clock, read/write, chip-select and clear signals.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB200610105061XA CN100409259C (en) | 2006-08-29 | 2006-08-29 | Scaleable large-scale 2D convolution circuit |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB200610105061XA CN100409259C (en) | 2006-08-29 | 2006-08-29 | Scaleable large-scale 2D convolution circuit |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1916959A true CN1916959A (en) | 2007-02-21 |
CN100409259C CN100409259C (en) | 2008-08-06 |
Family
ID=37737951
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB200610105061XA Expired - Fee Related CN100409259C (en) | 2006-08-29 | 2006-08-29 | Scaleable large-scale 2D convolution circuit |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100409259C (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101309476B (en) * | 2007-05-15 | 2011-05-04 | 鸿富锦精密工业(深圳)有限公司 | Mobile apparatus and method for changing image size |
CN102420931A (en) * | 2011-07-26 | 2012-04-18 | 西安费斯达自动化工程有限公司 | Full-frame-rate image processing method based on FPGA (Field Programmable Gate Array) |
CN104035750A (en) * | 2014-06-11 | 2014-09-10 | 西安电子科技大学 | Field programmable gate array (FPGA)-based real-time template convolution implementing method |
CN104318534A (en) * | 2014-11-18 | 2015-01-28 | 中国电子科技集团公司第三研究所 | Real-time two-dimensional convolution digital filtering system |
CN106530210A (en) * | 2016-10-31 | 2017-03-22 | 北京大学 | Equipment and method for realizing parallel convolution calculation based on resistive random access memory array |
CN107133908A (en) * | 2016-02-26 | 2017-09-05 | 谷歌公司 | Compiler for image processor manages memory |
CN108513042A (en) * | 2017-02-24 | 2018-09-07 | 清华大学 | Device for image procossing |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2634084A1 (en) * | 1988-07-08 | 1990-01-12 | Labo Electronique Physique | INTEGRATED CIRCUIT AND IMAGE PROCESSING DEVICE |
EP0626661A1 (en) * | 1993-05-24 | 1994-11-30 | Societe D'applications Generales D'electricite Et De Mecanique Sagem | Digital image processing circuitry |
JP3251421B2 (en) * | 1994-04-11 | 2002-01-28 | 株式会社日立製作所 | Semiconductor integrated circuit |
TW525078B (en) * | 1998-05-20 | 2003-03-21 | Sony Computer Entertainment Inc | Image processing device, method and providing media |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101309476B (en) * | 2007-05-15 | 2011-05-04 | 鸿富锦精密工业(深圳)有限公司 | Mobile apparatus and method for changing image size |
CN102420931A (en) * | 2011-07-26 | 2012-04-18 | 西安费斯达自动化工程有限公司 | Full-frame-rate image processing method based on FPGA (Field Programmable Gate Array) |
CN102420931B (en) * | 2011-07-26 | 2013-08-21 | 西安费斯达自动化工程有限公司 | Full-frame-rate image processing method based on FPGA (Field Programmable Gate Array) |
CN104035750A (en) * | 2014-06-11 | 2014-09-10 | 西安电子科技大学 | Field programmable gate array (FPGA)-based real-time template convolution implementing method |
CN104318534A (en) * | 2014-11-18 | 2015-01-28 | 中国电子科技集团公司第三研究所 | Real-time two-dimensional convolution digital filtering system |
CN104318534B (en) * | 2014-11-18 | 2017-06-06 | 中国电子科技集团公司第三研究所 | A kind of Real-time Two-dimensional convolutional digital filtering system |
CN107133908A (en) * | 2016-02-26 | 2017-09-05 | 谷歌公司 | Compiler for image processor manages memory |
US10685422B2 (en) | 2016-02-26 | 2020-06-16 | Google Llc | Compiler managed memory for image processor |
CN106530210A (en) * | 2016-10-31 | 2017-03-22 | 北京大学 | Equipment and method for realizing parallel convolution calculation based on resistive random access memory array |
CN106530210B (en) * | 2016-10-31 | 2019-09-06 | 北京大学 | The device and method that parallel-convolution calculates are realized based on resistive memory array |
CN108513042A (en) * | 2017-02-24 | 2018-09-07 | 清华大学 | Device for image procossing |
US10827102B2 (en) | 2017-02-24 | 2020-11-03 | Huawei Technologies Co., Ltd | Image processing apparatus |
Also Published As
Publication number | Publication date |
---|---|
CN100409259C (en) | 2008-08-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20080806 Termination date: 20160829 |