CN1289212A - Hierarchy programmable parallel video signal processor structure for motion estimation algorithm


Info

Publication number: CN1289212A
Authority: CN (China)
Legal status: Granted
Application number: CN 00130074
Other languages: Chinese (zh)
Other versions: CN1127264C (en)
Inventors: 何芸, 龚大年
Current assignee: Tsinghua University
Original assignee: Tsinghua University
Application filed by Tsinghua University
Priority to CN 00130074 (patent CN1127264C)
Publication of CN1289212A
Application granted
Publication of CN1127264C
Current status: Expired - Fee Related

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to the field of video image coding. The structure comprises six parts: a low-level instruction unit, a parallel arithmetic unit, a data routing unit, a memory and address generation unit, a high-level instruction unit and an external memory interface unit. The low-level instruction unit is connected to the high-level instruction unit and to the parallel arithmetic unit by control signal lines, and the data routing unit is connected to the parallel arithmetic unit and to the memory and address generation unit by data buses. The invention can implement multiple block-matching algorithms on a single structure, reduces the hardware cost of a video coding system, and can also support other video coding algorithms.

Description

Hierarchical programmable parallel video signal processor structure for motion estimation algorithms
The invention belongs to the field of video image coding, and in particular concerns the design of a hierarchical programmable parallel video signal processor.
Motion estimation is used by all international video compression coding standards to remove inter-frame correlation. However, the motion estimation algorithm is an open part that these standards do not specify: different coding systems may use their own motion estimation algorithms as long as the bitstream syntax is respected. The full-search block-matching algorithm has the highest search precision of all motion estimation algorithms, but its computational load is so large that traditional general-purpose processors cannot meet the requirement. Existing work attacks this problem from two directions: first, fast block-matching search algorithms that reduce the number of search points; second, efficient parallel block-matching integrated-circuit structures that implement the full-search or fast-search algorithms.
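To make the computational load concrete, the following Python sketch implements exhaustive (full-search) block matching using the sum of absolute differences (SAD) as the matching error. The function names, the [-7, 7] default window and the absence of frame-boundary handling are illustrative assumptions, not details taken from the patent.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by the candidate motion vector (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n=16, p=7):
    """Evaluate every displacement in [-p, p] x [-p, p].

    Returns (best_sad, (dx, dy), points_evaluated). Assumes the whole search
    window lies inside `ref`; boundary handling is omitted for brevity.
    """
    best_cost, best_mv, points = None, None, 0
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            cost = sad(cur, ref, bx, by, dx, dy, n)
            points += 1
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_cost, best_mv, points
```

With p = 7 the loops visit 15 x 15 = 225 candidate points, and each point costs 16 x 16 = 256 absolute differences for a macroblock, i.e. 57,600 absolute differences per macroblock.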
A representative reference on fast block-matching search algorithms is T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion compensated interframe coding for video conferencing," Proc. Nat. Telecommunications Conf. 81, New Orleans, LA, Nov. 1981, pp. G5.3.1-G5.3.5. This algorithm, known as the three-step search, divides the motion vector search within a window of [-7, 7] horizontally and vertically into 3 steps; each step searches 8 points around the current centre, which together with the initial centre point gives 25 search points in total. Its computational load is therefore 11.1% of that of the full-search algorithm (225 search points).
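The three-step search can be sketched as below. The `cost` callback stands in for a block-matching error such as SAD; this is a generic rendering of the method described in the cited paper, with step sizes 4, 2 and 1 assumed from the [-7, 7] window (4 + 2 + 1 = 7).

```python
def three_step_search(cost, p=7):
    """Three-step search over [-p, p] x [-p, p].

    `cost(dx, dy)` returns the matching error of a candidate vector.
    Returns (best_cost, (dx, dy), points_evaluated).
    """
    cx, cy = 0, 0
    best = cost(0, 0)          # initial centre point
    points = 1
    step = (p + 1) // 2        # 4 for p = 7, then 2, then 1
    while step >= 1:
        nx, ny = cx, cy
        # 8 neighbours of the current centre at distance `step`
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue
                c = cost(cx + dx, cy + dy)
                points += 1
                if c < best:
                    best, nx, ny = c, cx + dx, cy + dy
        cx, cy = nx, ny        # re-centre on the best point found so far
        step //= 2
    return best, (cx, cy), points
```

The returned point count reproduces the arithmetic in the text: 1 + 3 x 8 = 25 evaluations, against 225 for the full search.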
Parallel block-matching integrated-circuit structures fall into two types: structures based on array processors and structures based on tree adders. A representative reference for array-processor structures is T. Komarek and P. Pirsch, "Array architectures for block matching algorithms," IEEE Trans. on Circuits and Systems, vol. 36, no. 10, pp. 1301-1308, Oct. 1989. Figure 1 shows the array-processor structure for a block size of 3. It comprises 9 absolute-difference-and-add units (AD), 3 adder units (A) and one minimum unit (M). The advantage of this structure is its low memory bandwidth requirement, but it scales poorly and its efficiency is low.
A representative reference for tree-adder structures is Y.-S. Jehng, L.-G. Chen, and T.-D. Chiueh, "An efficient and simple VLSI tree architecture for motion estimation algorithms," IEEE Transactions on Signal Processing, vol. 41, no. 2, pp. 889-900, Feb. 1993. Figure 2 shows the structure of the tree adder: circles denote adders, rectangles denote registers, and the adders are connected as a tree. This structure uses a tree with multiple pipeline stages. Because the operating efficiency of a tree structure decreases as the number of pipeline stages grows, the multi-stage pipelining prevents the tree from delivering its full computational performance.
To improve matching precision, that is, the accuracy of the motion vector search, motion estimation algorithms use large search ranges, different prediction modes, several match block sizes and different search strategies. The combination of these factors makes motion estimation algorithms highly diverse. A programmable parallel structure can better satisfy both this diversity and the large computational load. A representative reference for programmable parallel structures is H.-D. Lin, A. Anesko, and B. Petryna, "A 14-GOPS programmable motion estimator for H.26X video coding," IEEE Journal of Solid-State Circuits, vol. 31, no. 11, Nov. 1996. That design is a programmable structure based on array processors: it uses 64 array processors to perform block matching, so its hardware is large, and because the data flow of the array processors must be tailored to specific algorithms, its flexibility is still limited.
The object of the invention is to overcome the shortcomings of the prior art by proposing a hierarchical programmable parallel video signal processor structure for motion estimation algorithms (Programmable Video Signal Processor, PVSP). With the programmable approach of the invention, multiple block-matching algorithms can be implemented on a single structure, the hardware cost of the video coding system is reduced, and other video coding algorithms can also be supported.
The programmable parallel video signal processor structure for motion estimation algorithms proposed by the invention is characterized in that it comprises six parts: a low-level instruction unit, a parallel arithmetic unit, a data routing unit, a memory and address generation unit, a high-level instruction unit and an external memory interface unit, wherein: the high-level instruction unit is connected to the low-level instruction unit by control signal lines; the low-level instruction unit is connected to the parallel arithmetic unit by data and control signal lines; the parallel arithmetic unit is connected to the data routing unit by three data buses; the data routing unit is connected to the memory and address generation unit by six data buses; the initialization instruction signal and the move instruction signal of the high-level instruction unit are connected to the memory and address generation unit through the data routing unit; the data routing unit is connected to the external memory interface unit by a data bus; and the high-level instruction unit is connected to the external memory interface unit by control signals.
The invention works as follows. The high-level instruction unit sends control signals to the low-level instruction unit, which begins executing the low-level program. The low-level instruction decode unit sends control signals to the parallel arithmetic unit, the data routing unit, and the memory and address generation unit. The data routing unit selects two of the three signals output by the memory and address generation unit and passes them to the parallel arithmetic unit, and the results of the parallel arithmetic unit are written back through the data routing unit to the memory and address generation unit. The high-level instruction unit reads the results from the parallel arithmetic unit over the data bus and reads the execution status from the low-level instruction unit over the control bus. The high-level instruction unit also sends control signals to the external memory interface unit, which reads data from external memory and outputs them to the data routing circuit; the data routing circuit forwards the data from the external memory interface unit to the memory and address generation unit.
Main features of the invention:
1) The parallel structure uses a regular, low-latency tree accumulation structure consisting of multi-input tree adders and accumulators. Its hardware complexity is much lower than that of existing programmable motion estimation structures based on array processors. Because the regular tree adder has low latency, the sum of many inputs can be computed at high speed without pipeline registers inside the tree, so the efficiency of the tree adder is fully exploited. The tree accumulator supports 16x16, 16x8 and 8x8 block-matching operations simultaneously, which gives greater flexibility.
2) The two-dimensional parallel memory uses byte-aligned, cyclic addressing. Its inputs are the horizontal and vertical addresses, and it outputs one row of 16 data items.
3) The programmable structure is realized by the high-level instruction unit and the low-level instruction unit. The high-level instruction unit handles the parts of the motion estimation algorithm that involve many branches and decisions; it contains a 16-bit reduced instruction set computer (RISC) processor. The low-level program handles the loop computations, mainly block-matching operations. Both units use 16-bit instruction formats, but with different instruction encodings.
4) The PVSP can support multiple fast motion estimation algorithms. Its internal programmable parallel arithmetic unit also supports half-pixel motion search and motion compensation, which further improves flexibility: no dedicated hardware has to be designed for these algorithms, making single-chip integration of a video coding system possible.
Brief Description of the Drawings:
Figure 1 is a schematic diagram of an existing array-processor-based motion estimation architecture.
Figure 2 is a schematic diagram of an existing tree-adder-based motion estimation architecture.
Figure 3 is a schematic diagram of the general structure of the hierarchical programmable parallel video signal processor of the invention.
Figure 4 is a schematic diagram of the structure of the low-level instruction unit of the invention.
Figure 5 is a schematic diagram of an embodiment of the tree accumulator of the invention.
Figure 6 is a schematic diagram of an embodiment of the 8-input tree adder of the invention.
Figure 7 is a schematic diagram of an embodiment of the minimum unit MIN0 of the invention.
Figure 8 is a schematic diagram of an embodiment of the minimum unit MIN1 of the invention.
Figure 9 is a schematic diagram of an embodiment of the minimum unit MIN2 of the invention.
Figure 10 is a schematic diagram of an embodiment of the two-dimensional parallel memory of the invention.
Figure 11 is a schematic diagram of an embodiment of the address mapping module of the two-dimensional parallel memory of the invention.
Figure 12 is a schematic diagram of an embodiment of the address generation module ADG0 of the two-dimensional parallel memory of the invention.
Figure 13 is a schematic diagram of an embodiment of the address generation module ADG1 of the 8-bit one-dimensional parallel memory of the invention.
Figure 14 is a schematic diagram of an embodiment of the address generation module ADG2 of the 9-bit one-dimensional parallel memory of the invention.
Figure 15 is a schematic diagram of an embodiment of the 16-bit reduced instruction set (RISC) processor of the invention.
Figure 16 is a schematic diagram of an embodiment of the instruction fetch unit of the invention.
Figure 17 is a schematic diagram of an embodiment of the instruction execution unit of the invention.
An embodiment of the hierarchical programmable parallel video signal processor (PVSP) structure for motion estimation algorithms designed according to the invention is described in detail below with reference to the drawings.
The general structure of the PVSP of the invention is shown in Figure 3. It comprises six parts: the low-level instruction unit, the parallel arithmetic unit, the data routing unit, the memory and address generation unit, the high-level instruction unit and the external memory interface unit. The parts are connected as follows: the high-level instruction unit is connected to the low-level instruction unit by control signal lines; the low-level instruction unit is connected to the parallel arithmetic unit by data and control signal lines; the parallel arithmetic unit is connected to the data routing unit by three data buses; the data routing unit is connected to the memory and address generation unit by six data buses; the initialization instruction signal and the move instruction signal of the high-level instruction unit are connected to the memory and address generation unit through the data routing unit; the data routing unit is connected to the external memory interface unit by a data bus; and the high-level instruction unit is connected to the external memory interface unit by control signals.
The specific structure and working process of each unit in this embodiment are described below with reference to the drawings.
(1) Low-level instruction unit
(1) Structure of the low-level instruction unit
The structure of the low-level instruction unit in this embodiment is shown in Figure 4. It comprises a program address register, a low-level instruction memory, a low-level instruction decode module, a selector, a loop-count register and a subtractor. The connections are as follows: the program entry address signal entry output by the high-level instruction unit is connected to the program address register, and the set-entry signal set_entry output by the high-level instruction unit is connected to the enable input of the program address register; the program address register is connected to the low-level instruction memory, which is connected to the low-level instruction decode module. The loop-count signal cnt output by the high-level instruction unit is connected to the upper input of the selector, and the subtractor output is connected to the lower input of the selector. The set-count signal set_cnt output by the high-level instruction unit is connected to the select input of the selector. The selector output is connected to the loop-count register, whose output is connected to the upper input of the subtractor; the constant 1 is connected to the lower input of the subtractor. The carry output of the subtractor drives the end-of-loop signal done.
The low-level decode module consists of AND/OR logic circuits. Its outputs drive the initialization instruction signal and the move instruction signal.
The operation of the low-level instruction unit is controlled by the instructions of the low-level instruction set.
(2) Working process of the low-level instruction unit
The low-level instruction unit works as follows. When the high-level instruction unit issues the command to set the loop-count register, the selector selects the external loop-count input cnt, which is latched into the loop-count register. The register output is then fed to the subtractor, which decrements it by one. When the subtractor output reaches 0, the end-of-loop signal done is issued. When the high-level instruction unit issues the command to set the program address register, the external program entry address entry is latched into the program address register.
(3) Low-level instruction set
(a) Instruction coding format
The low-level instructions executed by the low-level instruction unit are 16 bits long.
Table 1 gives the bit definitions of a low-level instruction: the type code field occupies 4 bits, source operand 1 and source operand 2 occupy 3 bits each, the destination operand occupies 2 bits, and the shift immediate occupies 4 bits (4 + 3 + 3 + 2 + 4 = 16).
Table 1 Bit definitions of the low-level instruction
Bits  | 15-12     | 11-9             | 8-6              | 5-4                 | 3-0
Field | Type code | Source operand 1 | Source operand 2 | Destination operand | Shift immediate
(b) Low-level instructions
The low-level instruction set comprises six types of instruction. Table 2 gives the format, operation type and description of each. PNOP is a no-operation. PADD performs parallel addition with a right shift, PSUB performs parallel subtraction, PADDS performs parallel saturating addition, PMOV performs parallel data moves, and PSAD computes parallel absolute differences.
Table 2 Low-level instruction set (#imm denotes an immediate)
Type code | Mnemonic | Format                     | Operation type               | Description
0         | PNOP     | PNOP                       | No operation                 | No operation
1         | PADD     | PADD dst, src1, src2, #imm | Parallel addition            | dst = (src1 + src2) >> #imm
2         | PSUB     | PSUB dst, src1, src2       | Parallel subtraction         | dst = src1 - src2
3         | PADDS    | PADDS dst, src1, src2      | Parallel saturating addition | dst = clip(src1 + src2)
4         | PMOV     | PMOV dst, src              | Parallel data move           | dst = src
5         | PSAD     | PSAD src1, src2            | Parallel absolute difference | abs(src1 - src2)
6-15      | Reserved | Reserved                   | Reserved                     | Reserved
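The lane-wise semantics of Table 2 can be illustrated with plain Python lists, one element per 9-bit processor. The saturation range of PADDS (taken here as 0 to 255, i.e. 8-bit pixels) and the absence of a rounding term in the PADD shift are assumptions; the table only says clip() and >> #imm.

```python
def padd(src1, src2, imm=0):
    """PADD dst, src1, src2, #imm: dst = (src1 + src2) >> #imm in every lane."""
    return [(a + b) >> imm for a, b in zip(src1, src2)]


def psub(src1, src2):
    """PSUB dst, src1, src2: dst = src1 - src2 in every lane."""
    return [a - b for a, b in zip(src1, src2)]


def padds(src1, src2, lo=0, hi=255):
    """PADDS dst, src1, src2: dst = clip(src1 + src2); [lo, hi] is an assumed range."""
    return [min(max(a + b, lo), hi) for a, b in zip(src1, src2)]


def pmov(src):
    """PMOV dst, src: lane-wise copy."""
    return list(src)


def psad(src1, src2):
    """PSAD src1, src2: abs(src1 - src2) per lane, the input to the tree accumulator."""
    return [abs(a - b) for a, b in zip(src1, src2)]
```

PADD with #imm = 1 averages two pixel rows, which is the kind of operation half-pixel interpolation needs; whether the hardware adds a rounding term before the shift is not stated in the text.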
(2) Parallel arithmetic unit
(1) Structure of the parallel arithmetic unit
The parallel arithmetic unit in this embodiment consists of a parallel arithmetic logic module and a tree accumulator, as shown in Figure 3. The output of the parallel arithmetic logic module is connected to the input of the tree accumulator. It works as follows: the low-level instruction decode unit sends control signals to the parallel arithmetic unit; the data routing unit outputs two data streams to the parallel arithmetic logic module, and the results of that module are returned to the data routing unit. The high-level instruction unit reads the results from the parallel arithmetic unit over the data bus. The results comprise: the macroblock matching error signal sad0, the first block matching error signal sad1, the second block matching error signal sad2, the macroblock optimal motion vector signal opMV0, the first block optimal motion vector signal opMV1, the second block optimal motion vector signal opMV2, the macroblock minimum matching error signal min0, the first block minimum matching error signal min1 and the second block minimum matching error signal min2.
(2) Parallel arithmetic logic module
The parallel arithmetic logic module in this embodiment comprises N 9-bit processors organized as a single-instruction, multiple-data (SIMD) structure.
(3) Tree accumulator module
The structure of the tree accumulator module in this embodiment is shown in Figure 5. It comprises two 8-input tree adders, one 11-bit adder, three accumulators (ACC0, ACC1, ACC2) and three minimum units (MIN0, MIN1, MIN2). The connections are as follows: the output of the left 8-input tree adder is connected to the 11-bit adder and to accumulator ACC1; the output of the right 8-input tree adder is connected to the 11-bit adder and to accumulator ACC2; the output of the 11-bit adder is connected to accumulator ACC0; accumulators ACC0, ACC1 and ACC2 are connected to minimum units MIN0, MIN1 and MIN2 respectively; ACC0 drives the macroblock matching error signal sad0, ACC1 the first block matching error signal sad1, and ACC2 the second block matching error signal sad2. The output of MIN0 drives the macroblock minimum matching error signal min0 and the macroblock optimal motion vector signal opMV0; the output of MIN1 drives the first block minimum matching error signal min1 and the first block optimal motion vector signal opMV1; the output of MIN2 drives the second block minimum matching error signal min2 and the second block optimal motion vector signal opMV2. The inputs of MIN0 are connected to sad0, the end-of-loop signal done and the motion vector signal MV; the inputs of MIN1 to sad1, done and MV; and the inputs of MIN2 to sad2, done and MV.
The tree accumulator works as follows. The 16-bit accumulator ACC0 accumulates the output of the 11-bit adder; after 16 cycles, ACC0 holds the matching error of a 16x16 macroblock. The 12-bit accumulators ACC1 and ACC2 accumulate the 11-bit outputs of the left and right 8-input tree adders respectively; after 8 cycles, ACC1 and ACC2 hold the matching errors of two 8x8 blocks.
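The cycle behaviour of the tree accumulator can be sketched as follows, assuming each cycle delivers one 16-pixel row of absolute differences (for example, from PSAD) with the left 8 lanes feeding the left tree adder. Register widths are not modelled, and the function name is hypothetical.

```python
def tree_accumulate(rows):
    """Feed one 16-wide row of absolute differences per cycle.

    Returns (acc0, acc1, acc2): ACC0 sums all 16 lanes, ACC1 the left 8 lanes
    and ACC2 the right 8 lanes.
    """
    acc0 = acc1 = acc2 = 0
    for row in rows:
        if len(row) != 16:
            raise ValueError("each cycle supplies exactly 16 absolute differences")
        left = sum(row[:8])    # left 8-input tree adder
        right = sum(row[8:])   # right 8-input tree adder
        acc0 += left + right   # 11-bit adder output accumulated in ACC0
        acc1 += left           # ACC1: left 8x8 block
        acc2 += right          # ACC2: right 8x8 block
    return acc0, acc1, acc2
```

Feeding 16 rows gives the 16x16 macroblock SAD in acc0, while 8 rows give the SADs of two horizontally adjacent 8x8 blocks in acc1 and acc2 during the same pass, which is how one structure serves several block sizes.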
(a) 8-input tree adder
The structure of the 8-input tree adder in this embodiment is shown in Figure 6. It comprises four 8-bit adders (ADDER8), two 9-bit adders (ADDER9) and one 10-bit adder (ADDER10). The connections are: the outputs of the two left 8-bit adders are connected to the left 9-bit adder, the outputs of the two right 8-bit adders are connected to the right 9-bit adder, and the outputs of these two 9-bit adders are connected to the inputs of the 10-bit adder.
(b) Minimum unit MIN0
The structure of minimum unit MIN0 in this embodiment is shown in Figure 7. It comprises a 16-bit subtractor, an AND gate, a 16-bit register and a 12-bit register. The connections are: the left input of the 16-bit subtractor is connected to the output of the 16-bit register, and its right input to the external macroblock matching error signal sad0; the carry signal of the subtractor is connected to the upper input of the AND gate; the external input sad0 is connected to the input of the 16-bit register, whose output is the macroblock minimum matching error min0; the lower input of the AND gate is connected to the external end-of-loop signal done; the input of the 12-bit register is connected to the external motion vector signal MV; and the enable inputs of the 12-bit and 16-bit registers are connected to the output of the AND gate. It works as follows: the subtractor outputs its carry to the AND gate, which ANDs it with the end-of-loop signal done and outputs the resulting enable signal to the 16-bit and 12-bit registers. The 16-bit register holds the macroblock minimum matching error min0, and the 12-bit register holds the horizontal and vertical motion vector values. When the enable signal is active, the 16-bit register latches sad0 and the 12-bit register latches MV.
(c) Minimum unit MIN1
The structure of minimum unit MIN1 in this embodiment is shown in Figure 8. It comprises a 16-bit subtractor, an AND gate, a 16-bit register and a 12-bit register. The connections are: the left input of the 16-bit subtractor is connected to the output of the 16-bit register, and its right input to the external first block matching error signal sad1; the carry signal of the subtractor is connected to the upper input of the AND gate; the output of the 16-bit register is the first block minimum matching error min1; the lower input of the AND gate is connected to the external end-of-loop signal done; the input of the 12-bit register is connected to the external motion vector signal MV; and the enable inputs of the 12-bit and 16-bit registers are connected to the output of the AND gate. It works as follows: the subtractor outputs its carry to the AND gate, which ANDs it with done and outputs the resulting enable signal to the 16-bit and 12-bit registers. The 16-bit register holds the first block minimum matching error min1, and the 12-bit register holds the horizontal and vertical motion vector values. When the enable signal is active, the 16-bit register latches the first block matching error sad1 and the 12-bit register latches MV.
(d) Minimum unit MIN2
The structure of minimum unit MIN2 in this embodiment is shown in Figure 9. It comprises a 16-bit subtractor, an AND gate, a 16-bit register and a 12-bit register. The connections are: the left input of the 16-bit subtractor is connected to the output of the 16-bit register, and its right input to the external second block matching error signal sad2; the carry signal of the subtractor is connected to the upper input of the AND gate; the external input sad2 is connected to the input of the 16-bit register, whose output is the second block minimum matching error min2; the lower input of the AND gate is connected to the external end-of-loop signal done; the input of the 12-bit register is connected to the external motion vector signal MV; and the enable inputs of the 12-bit and 16-bit registers are connected to the output of the AND gate. It works as follows: the subtractor outputs its carry to the AND gate, which ANDs it with done and outputs the resulting enable signal to the 16-bit and 12-bit registers. The 16-bit register holds the second block minimum matching error min2, and the 12-bit register holds the horizontal and vertical motion vector values. When the enable signal is active, the 16-bit register latches the second block matching error sad2 and the 12-bit register latches MV.
(3) Data routing unit
The data routing unit in this embodiment consists of selectors. It works as follows: the low-level instruction decode unit sends control signals to the data routing unit, which selects two of the three signals output by the memory and address generation unit and passes them to the parallel arithmetic logic module; the results of the parallel arithmetic logic module are returned through the data routing unit to the memory and address generation unit.
(4) Memory and address generation unit
The memory and address generation unit of the invention, shown in Figure 3, consists of the two-dimensional parallel memory and its address generation module ADG0, the 8-bit one-dimensional parallel memory and its address generation module ADG1, and the 9-bit one-dimensional parallel memory and its address generation module ADG2. Internally, the two-dimensional parallel memory is connected to ADG0 by an address bus, the 8-bit one-dimensional parallel memory to ADG1 by an address bus, and the 9-bit one-dimensional parallel memory to ADG2 by an address bus.
(1) Two-dimensional parallel memory
The structure of the two-dimensional parallel memory in this embodiment is shown in Figure 10. It comprises an address mapping module, an N-way comparator, a priority encoder, N 2-to-1 selectors (M0, M1, ..., MN-1), N data memories and a cyclic shifter. The connections are: the address mapping module receives the external horizontal memory address signal Lx and vertical memory address signal Ly; the mapping module output b0 is connected to the left inputs of the N-way comparator; the constants 0, 1, ..., N-1 are connected to the right inputs of the comparator respectively; the comparator outputs are connected to the inputs of the priority encoder; the outputs of the priority encoder are connected to the select inputs S0, S1, ..., SN-1 of the N selectors; the data inputs of the N selectors are connected to the address mapping module; the outputs of the N selectors are connected to the N data memories; and the outputs of the N data memories are connected to the cyclic shifter.
(a) Address mapping module
The internal connections of the address mapping module are shown in Figure 11. The left input of the 2-bit adder is the constant 1, and its right input is bits 4 and 5 of the horizontal memory address Lx. Bits 7 and 6 of output A1 are connected to bits 5 and 4 of the vertical memory address Ly; bits 5 and 4 of A1 are connected to the output of the 2-bit adder; bits 3 to 0 of A1 are connected to bits 3 to 0 of Ly. Bits 7 and 6 of output A0 are connected to bits 5 and 4 of Ly; bits 5 and 4 of A0 are connected to bits 5 and 4 of Lx; bits 3 to 0 of A0 are connected to bits 3 to 0 of Ly. Output b0 is connected to bits 3 to 0 of Lx.
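(b) Priority encoder
The priority encoder is implemented with AND/OR logic. Its logic equation appears in a figure of the original document that is not reproduced here; in it, J = min{ j | t_j = 1, j = 0, 1, ..., N-1 }, where t_j denotes the encoder's j-th input (the output of the j-th comparator).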
The cyclic shifter rotates the data read from the N memories to the left so that the data from memory b0 moves to the most significant position.
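The read path above can be modelled in a few lines. The following Python sketch assumes N = 16 and a 64 × 64 search window, which is what 8-bit module addresses built from Lx[5:0] and Ly[5:0] suggest; the names `map_pixel`, `row_addresses` and `TwoDimParallelMemory` are hypothetical. Because the priority-encoder equation is not reproduced, the selector choice is modelled directly as "module j uses A0 when j >= b0, otherwise A1", an assumption consistent with the address mapping rather than the patent's own logic.

```python
N = 16  # number of memory modules (N = 16 in the embodiment)


def map_pixel(x: int, y: int) -> tuple[int, int]:
    """Assumed storage rule: module x[3:0], address {y[5:4], x[5:4], y[3:0]}."""
    module = x & 0xF
    addr = ((y >> 4) & 0x3) << 6 | ((x >> 4) & 0x3) << 4 | (y & 0xF)
    return module, addr


def row_addresses(lx: int, ly: int) -> tuple[int, int, int]:
    """Address mapping module: return b0, A0 and A1 for a read at (Lx, Ly)."""
    b0 = lx & 0xF                  # b0 = Lx[3:0]
    xblk = (lx >> 4) & 0x3         # Lx[5:4]
    a0 = ((ly >> 4) & 0x3) << 6 | xblk << 4 | (ly & 0xF)
    # 2-bit adder output Lx[5:4] + 1, carry dropped
    a1 = ((ly >> 4) & 0x3) << 6 | ((xblk + 1) & 0x3) << 4 | (ly & 0xF)
    return b0, a0, a1


class TwoDimParallelMemory:
    def __init__(self) -> None:
        # N modules x 256 entries each (assumed depth)
        self.banks = [[0] * 256 for _ in range(N)]

    def write_pixel(self, x: int, y: int, value: int) -> None:
        module, addr = map_pixel(x, y)
        self.banks[module][addr] = value

    def read_row(self, lx: int, ly: int) -> list[int]:
        """Read N horizontally consecutive pixels starting at (lx, ly) in one access."""
        b0, a0, a1 = row_addresses(lx, ly)
        # Selector j: modules at or right of b0 hold the current 16-pixel block (A0);
        # modules left of b0 hold the pixels that spill into the next block (A1).
        data = [self.banks[j][a0 if j >= b0 else a1] for j in range(N)]
        # Cyclic shifter: rotate so that module b0's data comes first.
        return data[b0:] + data[:b0]
```

For example, after filling the window with `write_pixel`, `read_row(5, 9)` returns pixels x = 5 to 20 of row 9 in order, even though they straddle two 16-pixel blocks.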
(2) 8-bit one-dimensional parallel memory
The 8-bit one-dimensional (1D) parallel memory consists of N 8-bit memory modules. The externally supplied address addr_d1m drives the address inputs of all N 8-bit memories, and the N memories together output 8N bits of data to the outside.
(3) 9-bit one-dimensional parallel memory
The 9-bit 1D parallel memory consists of N 9-bit memory modules. The externally supplied address addr_dm9 drives the address inputs of all N 9-bit memories, and the N memories together output 9N bits of data to the outside.
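Both one-dimensional memories amount to N equal banks addressed in lockstep. The sketch below models that under stated assumptions: a depth of 256 words per bank and bank 0 placed in the most significant position of the 8N-bit or 9N-bit output word are illustrative choices, and `OneDimParallelMemory` is a hypothetical name.

```python
class OneDimParallelMemory:
    """N banks of `width`-bit words sharing a single address (addr_d1m or addr_dm9)."""

    def __init__(self, n_banks: int = 16, width: int = 8, depth: int = 256) -> None:
        self.width = width
        self.mask = (1 << width) - 1
        self.banks = [[0] * depth for _ in range(n_banks)]  # depth is assumed

    def write(self, addr: int, words: list[int]) -> None:
        """Store one word per bank at the shared address."""
        for bank, word in zip(self.banks, words):
            bank[addr] = word & self.mask

    def read(self, addr: int) -> int:
        """Return the N bank outputs concatenated into one (N * width)-bit value."""
        value = 0
        for bank in self.banks:  # bank 0 ends up most significant (assumed order)
            value = (value << self.width) | bank[addr]
        return value
```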
(4) Address generation module of the 2D parallel memory
The address generation module ADG0 of the 2D parallel memory, shown in Figure 12, consists of adder 0, adder 1, selectors 0 to 3, register 0 and register 1. Connections: selector 0 has register 0 on its left input and the externally supplied vertical start address starty on its right input; selector 1 has the externally supplied 2D address increment step_d2m on its left input and the high 6 bits of the motion vector, MV[11:6], on its right input; the outputs of selectors 0 and 1 feed adder 0, whose output is latched in register 0; register 0 outputs the vertical memory address Ly. Likewise, selector 2 has register 1 on its left input and the externally supplied horizontal start address startx on its right input; selector 3 has the constant 0 on its left input and the low 6 bits of the motion vector, MV[5:0], on its right input; the outputs of selectors 2 and 3 feed adder 1, whose output is latched in register 1; register 1 outputs the horizontal memory address Lx.
Operation: when the low-layer instruction unit issues a start instruction, selectors 0 and 1 select their right inputs, so adder 0 computes starty + MV[11:6] and the result is latched in register 0; at the same time, selectors 2 and 3 select their right inputs, so adder 1 computes startx + MV[5:0] and the result is latched in register 1. Registers 0 and 1 then hold the vertical and horizontal start addresses. When the low-layer instruction unit issues a move instruction, selectors 0 and 1 select their left inputs, so adder 0 adds step_d2m to the value held in register 0 and the result is latched back into register 0; at the same time, selectors 2 and 3 select their left inputs, so adder 1 adds 0 to the value held in register 1 and the result is latched back into register 1. Registers 0 and 1 output the vertical memory address Ly and the horizontal memory address Lx, respectively.
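A cycle-level Python sketch of ADG0 follows the connections listed above: register 0 produces Ly and register 1 produces Lx. The 6-bit register width (matching the Lx[5:0] and Ly[5:0] bits used by the address mapping module), wrap-around on overflow, and the class name are assumptions. With wrap-around, a 6-bit two's-complement motion-vector component adds correctly without special handling.

```python
class TwoDimAddressGenerator:
    """Hypothetical model of ADG0: register 0 -> Ly, register 1 -> Lx."""

    WIDTH = 6  # assumed register width

    def __init__(self) -> None:
        self.reg0 = 0  # vertical memory address Ly
        self.reg1 = 0  # horizontal memory address Lx

    def _wrap(self, value: int) -> int:
        return value & ((1 << self.WIDTH) - 1)

    def start(self, startx: int, starty: int, mv: int) -> None:
        """Start instruction: all selectors pass their right-hand inputs."""
        self.reg0 = self._wrap(starty + ((mv >> 6) & 0x3F))  # starty + MV[11:6]
        self.reg1 = self._wrap(startx + (mv & 0x3F))         # startx + MV[5:0]

    def move(self, step_d2m: int) -> None:
        """Move instruction: all selectors pass their left-hand inputs."""
        self.reg0 = self._wrap(self.reg0 + step_d2m)  # Ly advances by the step
        self.reg1 = self._wrap(self.reg1 + 0)         # selector 3 feeds 0, so Lx holds

    @property
    def ly(self) -> int:
        return self.reg0

    @property
    def lx(self) -> int:
        return self.reg1
```

In this model a block is scanned row by row: `start` positions the window at the candidate displacement, and each `move` steps Ly while Lx stays fixed.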
(5) Address generation module of the 8-bit 1D parallel memory
The address generation module ADG1 of the 8-bit 1D parallel memory, shown in Figure 13, consists of two selectors, an adder and a register. Connections: selector 0 has the output of register 0 on its left input and the 8-bit 1D start address start_d1m on its right input; selector 1 has the 8-bit 1D address increment step_d1m on its left input and the constant 0 on its right input; the outputs of selectors 0 and 1 feed the two inputs of adder 0; register 0 outputs the address addr_d1m of the 8-bit 1D parallel memory. Operation: when the low-layer instruction unit issues a start instruction, selectors 0 and 1 select their right inputs, so start_d1m and 0 are passed to the adder and the sum is latched in the register. When the low-layer instruction unit issues a move instruction, selectors 0 and 1 select their left inputs, so the adder adds step_d1m to the value held in the register and the sum is latched back into the register.
(6) Address generation module of the 9-bit 1D parallel memory
The address generation module ADG2 of the 9-bit 1D parallel memory, shown in Figure 14, consists of two selectors, an adder and a register. Connections: selector 0 has the output of register 0 on its left input and the 9-bit 1D start address start_dm9 on its right input; selector 1 has the 9-bit 1D address increment step_dm9 on its left input and the constant 0 on its right input; the outputs of selectors 0 and 1 feed the two inputs of adder 0; register 0 outputs the address addr_dm9 of the 9-bit 1D parallel memory. Operation: when the low-layer instruction unit issues a start instruction, selectors 0 and 1 select their right inputs, so start_dm9 and 0 are passed to the adder and the sum is latched in the register. When the low-layer instruction unit issues a move instruction, selectors 0 and 1 select their left inputs, so the adder adds step_dm9 to the value held in the register and the sum is latched back into the register.
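ADG1 and ADG2 differ only in which signals they use, so one hypothetical class can model either. The 8-bit address width is an assumption; the patent does not state how wide addr_d1m or addr_dm9 is.

```python
class OneDimAddressGenerator:
    """Hypothetical model of ADG1 (addr_d1m) or ADG2 (addr_dm9)."""

    def __init__(self, width: int = 8) -> None:
        self.mask = (1 << width) - 1  # assumed address width
        self.reg0 = 0

    def start(self, start_addr: int) -> None:
        """Start instruction: the adder sums the start address and the constant 0."""
        self.reg0 = (start_addr + 0) & self.mask

    def move(self, step: int) -> None:
        """Move instruction: the adder sums the register and the increment."""
        self.reg0 = (self.reg0 + step) & self.mask

    @property
    def addr(self) -> int:
        return self.reg0
```

For example, `start(10)` followed by two `move(16)` calls yields the addresses 10, 26 and 42.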
(5) High-layer instruction unit
(1) Structure of the high-layer instruction unit
The high-layer instruction unit of the present invention consists of a 16-bit reduced-instruction-set (RISC) processor and a special register array of 32 registers, as shown in Figure 3. Connections: the 16-bit RISC processor drives the set-program-entry-address signal set_entry and the set-loop-count signal set_cnt, and is connected by control signals to the special register array and to the outside. The high-layer instruction unit is controlled by instructions of the high-layer instruction set.
(2) 16-bit RISC processor
As shown in Figure 15, the 16-bit RISC processor comprises four parts: an instruction fetch unit, an instruction decode unit, an instruction execution unit and a register array. Connections: the fetch unit and the decode unit are connected by the branch address ba, the instruction signal d_ir and the branch control signal next; the decode unit and the execution unit are connected by the opcode d_op, the execute control signal exec, the first source operand d_src1, the second source operand d_src2 and the status signal e_flags; the execution unit and the register array are connected by the write-register signal we and the result signal e_res; the decode unit and the register array are connected by the first source operand address d_a1, the second source operand address d_a2 and the register array outputs d_r1 and d_r2.
Operation: the fetch unit sends the instruction d_ir to the decode unit. The decode unit sends the branch control signal next and the branch address ba to the fetch unit. The decode unit takes as inputs the bidirectional data signal g_d, the status signal e_flags from the execution unit, and the register array outputs d_r1 and d_r2. It sends d_op, exec, d_src1 and d_src2 to the execution unit and d_a1 and d_a2 to the register array; d_a1 also passes through a register to become the destination address e_a. The execution unit sends the write-register signal we and the result e_res to the register array. The decode unit's outputs g_a, g_r, g_w, set_cnt (set loop-count register) and set_entry (set program entry address register) serve as the external control signals of the 16-bit RISC processor. g_d is bidirectional: when g_r is high, g_d is an input; when g_w is high, g_d is an output.
(a) Instruction fetch unit
As shown in Figure 16, the instruction fetch unit comprises an adder, a current address register, a selector, the high-layer instruction memory and an instruction register. Connections: the adder's upper input is the constant 1 and its lower input is the output of the current address register; the adder output drives the input of the current address register. The selector's upper input is the branch address ba, its lower input is the output of the current address register, its select input is the branch control signal next, and its output drives the address input of the high-layer instruction memory. The memory output drives the instruction register, whose output goes to the decode unit as the instruction signal d_ir. Operation: the adder adds 1 to the output of the address selector, and the sum is stored in the current address register. The current address register output and the externally supplied branch address ba are fed to the selector: when next is high the selector outputs ba, and when next is low it outputs the current address register. The selector output is used as the address of the high-layer instruction memory, which outputs the instruction at that address; the instruction is latched in the instruction register, which outputs it as d_ir.
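The fetch loop can be sketched as follows. This Python model follows the operating sequence described above, in which the adder increments the selector output so that execution continues sequentially from ba + 1 after a taken branch; the class name and the list-based instruction memory are illustrative assumptions.

```python
class FetchUnit:
    """Hypothetical cycle model of the high-layer instruction fetch unit."""

    def __init__(self, program: list[int]) -> None:
        self.memory = list(program)  # high-layer instruction memory
        self.current = 0             # current address register
        self.d_ir = 0                # instruction register

    def clock(self, next_: bool = False, ba: int = 0) -> int:
        """One fetch; `next_` and `ba` come from the decode unit."""
        addr = ba if next_ else self.current  # selector
        self.d_ir = self.memory[addr]         # read memory, latch into the instruction register
        self.current = addr + 1               # adder: selector output + 1
        return self.d_ir
```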
(b) Instruction decode unit
The instruction decode unit is implemented with AND/OR logic. As shown in Figure 15, it connects to the other components as follows: to the fetch unit through the branch address ba, the instruction d_ir and the branch control signal next; to the execution unit through the opcode d_op, the execute control signal exec, the first source operand d_src1, the second source operand d_src2 and the status signal e_flags; and to the register array through the source operand addresses d_a1 and d_a2 and the register array outputs d_r1 and d_r2. Operation: the fetch unit sends d_ir to the decode unit, which returns next and ba to the fetch unit. The decode unit sends the source operand addresses d_a1 and d_a2 to the register array, with d_a1 also passing through a register to become e_a, and sends d_op, exec, d_src1 and d_src2 to the execution unit. Its inputs are the bidirectional data signal g_d, the status signal e_flags from the execution unit, and the register array outputs d_r1 and d_r2.
(c) Instruction execution unit
As shown in Figure 17, the instruction execution unit comprises registers 1 to 4, a status register and an arithmetic logic unit (ALU). Connections: d_src1, d_src2, d_op and exec drive the inputs of registers 1, 2, 3 and 4, respectively; the outputs of registers 1, 2 and 3 drive the ALU; the ALU drives the status register through the carry, zero and overflow (ovflow) flags, and the status register also receives the lowest bit d_src1[0]; register 4 outputs we to the outside, and the status register outputs e_flags to the outside. Operation: d_src1, d_src2, d_op and exec are latched by registers 1, 2, 3 and 4, which output the first source operand e_src1, the second source operand e_src2, the operation type e_op and the write-register signal we, respectively. e_src1, e_src2 and e_op are the ALU inputs; the ALU outputs the result e_res and the status flags carry, zero and ovflow. These three flags and d_src1[0] are latched in the status register, which outputs e_flags. Table 3 gives the ALU function for each operation type e_op.
Table 3. ALU function for each operation type e_op
e_op   ALU function
0      Assignment: e_res = e_src2
1      Addition: e_res = e_src1 + e_src2
2      Subtraction: e_res = e_src1 - e_src2
3      OR: e_res = e_src1 | e_src2
4      AND: e_res = e_src1 & e_src2
5      XOR: e_res = e_src1 ^ e_src2
6      Shift: if e_src2[4] = 1, e_res = e_src1 >> e_src2; if e_src2[4] = 0, e_res = e_src1 << e_src2
7      NOT: e_res = ~e_src2
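A Python model of Table 3 helps pin down the arithmetic. The 16-bit width follows from the 16-bit processor, but the flag definitions (carry as the unsigned carry-out of an addition or the borrow of a subtraction, ovflow as two's-complement overflow) and the use of e_src2[3:0] as the shift amount are assumptions, since the patent does not spell them out. The function name `alu` is hypothetical.

```python
MASK16 = 0xFFFF


def alu(e_op: int, e_src1: int, e_src2: int) -> tuple[int, int, int, int]:
    """Return (e_res, carry, zero, ovflow) for a 16-bit ALU following Table 3."""
    a, b = e_src1 & MASK16, e_src2 & MASK16
    carry = ovflow = 0
    if e_op == 0:
        raw = b
    elif e_op == 1:
        raw = a + b
        carry = raw >> 16  # assumed: unsigned carry-out
        ovflow = int((a ^ raw) & (b ^ raw) & 0x8000 != 0)  # signed overflow
    elif e_op == 2:
        raw = a - b
        carry = int(raw < 0)  # assumed: borrow
        ovflow = int((a ^ b) & (a ^ raw) & 0x8000 != 0)
    elif e_op == 3:
        raw = a | b
    elif e_op == 4:
        raw = a & b
    elif e_op == 5:
        raw = a ^ b
    elif e_op == 6:
        amount = b & 0xF  # assumed: shift amount in e_src2[3:0]
        raw = a >> amount if (b >> 4) & 1 else a << amount
    elif e_op == 7:
        raw = ~b
    else:
        raise ValueError("e_op must be 0-7")
    e_res = raw & MASK16
    return e_res, carry, int(e_res == 0), ovflow
```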
(d) Register array
The register array consists of 32 general-purpose registers. Referring to Figure 15, it is connected to the execution unit through the write-register signal we and the result e_res, and to the decode unit through the source operand addresses d_a1 and d_a2 and the outputs d_r1 and d_r2; the decode unit's output d_a1 also reaches the register array, after passing through a register, as the destination address e_a. Operation: the register array takes the two source register addresses d_a1 and d_a2, the destination register address e_a and the write-register signal we. It outputs the two source operands d_r1 and d_r2 selected by d_a1 and d_a2 and, at the same time, writes the result e_res from the execution unit into the register selected by e_a.
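The register array behaves as a 32-entry file with two read ports and one write port. In this hypothetical sketch, the reads return the values present before a same-cycle write (read-before-write) and registers are 16 bits wide; both are assumptions, as is the class name.

```python
class RegisterArray:
    """Hypothetical 32 x 16-bit register file: two read ports, one write port."""

    def __init__(self, n_regs: int = 32, width: int = 16) -> None:
        self.regs = [0] * n_regs
        self.mask = (1 << width) - 1

    def cycle(self, d_a1: int, d_a2: int, e_a: int = 0,
              we: bool = False, e_res: int = 0) -> tuple[int, int]:
        """Read d_r1/d_r2 at d_a1/d_a2, then write e_res to e_a when we is set."""
        d_r1, d_r2 = self.regs[d_a1], self.regs[d_a2]  # assumed read-before-write
        if we:
            self.regs[e_a] = e_res & self.mask
        return d_r1, d_r2
```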
(3) High-layer instruction set
The high-layer instruction set of the 16-bit RISC processor has four instruction types: no-op, assignment, branch and arithmetic/logic instructions. Bits 15 and 14 of an instruction give its type, encoded as 00, 01, 10 and 11, respectively.
(a) No-op instruction
Table 4 gives the bit definition of the no-op instruction: all 16 bits are 0.
Table 4. Bit definition of the no-op instruction
Bit    15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Value   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
(b) Assignment instructions
Table 5 gives the formats of the assignment instructions. The first format has a 4-bit subtype field, a 5-bit destination register field and a 5-bit source register field. The second format has a 4-bit subtype field, a 5-bit destination register field and a 5-bit immediate field. The third format has a 4-bit subtype field and a 5-bit destination register field, with the lowest 5 bits set to 0.
Table 6 lists the seven assignment instructions, which assign values to special registers or general-purpose registers.
Table 5. Bit definition of the assignment instructions
Format  Bits 15-14  Bits 13-10  Bits 9-5              Bits 4-0
1       01          Subtype     Destination register  Source register
2       01          Subtype     Destination register  Immediate
3       01          Subtype     Destination register  00000
Table 6. Assignment instruction formats and functions
Subtype  Instruction      Function
1        Lmovr r, #imm    General-register assignment: the source is a long immediate (two instruction cycles); the destination is general register r.
2        Lmovg g, #imm    Special-register assignment: the source is a long immediate (two instruction cycles); the destination is special register g.
3        Movg g, r        Special-register assignment: the source is general register r; the destination is special register g.
4        Movr r, g        General-register assignment: the source is special register g; the destination is general register r.
5        Imovr r, #imm    General-register assignment: the source is a short immediate; the destination is general register r.
6        Imovg g, #imm    Special-register assignment: the source is a short immediate; the destination is special register g.
7        Movpc r          General-register assignment: the source is the program counter; the destination is general register r.
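The field layout of Table 5 can be decoded with shifts and masks. The hypothetical decoder below extracts the subtype, destination register and low 5-bit field and names the instruction from Table 6. How the low field is interpreted (source register, short immediate or zero) depends on the format, and the extra word that carries a long immediate is not modelled.

```python
ASSIGN_MNEMONICS = {
    1: "lmovr", 2: "lmovg", 3: "movg", 4: "movr",
    5: "imovr", 6: "imovg", 7: "movpc",
}


def decode_assignment(word: int) -> tuple[str, int, int]:
    """Split a 16-bit assignment instruction into (mnemonic, destination, low field)."""
    if ((word >> 14) & 0x3) != 0b01:  # bits 15-14 must be 01
        raise ValueError("not an assignment instruction")
    subtype = (word >> 10) & 0xF      # bits 13-10
    dest = (word >> 5) & 0x1F         # bits 9-5
    low = word & 0x1F                 # bits 4-0: source register, immediate or 0
    return ASSIGN_MNEMONICS.get(subtype, "reserved"), dest, low
```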
(c) Branch instructions
Table 7 gives the bit definitions of the branch instructions. The first format has a 4-bit subtype field, a 5-bit condition field and a 5-bit destination address register field. The second format has a 2-bit subtype field and a 12-bit immediate address field.
Table 8 gives the formats and functions of the branch instructions.
Table 7. Bit definition of the branch instructions
Format  Bits 15-14  Bits 13-10  Bits 9-5   Bits 4-0
1       10          Subtype     Condition  Destination address register
2       10          Subtype (bits 13-12); immediate address (bits 11-0)
Table 8. Branch instruction formats and functions
Subtype  Condition code  Instruction  Function
0        00010           Bc r         Branch if carry
0        00011           Bnc r        Branch if no carry
0        00100           Bz r         Branch if zero
0        00101           Bnz r        Branch if not zero
0        01000           Bv r         Branch if overflow
0        01001           Bnv r        Branch if no overflow
0        10000           Bl r         Branch if the lowest bit is 1
0        10001           Bnl r        Branch if the lowest bit is 0
1        00000           Jmpr r       Unconditional jump
2        00000           Callr r      Indirect procedure call
3        00000           Ret          Procedure return
4-7      Reserved        Reserved     Reserved
8        00000           Call #imm    Direct procedure call
9-11     Reserved        Reserved     Reserved
12       00000           Jmp #imm     Direct unconditional jump
13-15    Reserved        Reserved     Reserved
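Evaluating a format-1 conditional branch amounts to looking up which latched status flag the condition code tests. The dictionary below transcribes Table 8; the bit positions (condition in bits 9-5, target register in bits 4-0) follow Table 7, while the function name and the flag dictionary (with "lsb" standing for the latched d_src1[0]) are illustrative assumptions.

```python
# Table 8 condition codes: code -> (status flag tested, flag value that takes the branch)
BRANCH_CONDITIONS = {
    0b00010: ("carry", 1),   # Bc
    0b00011: ("carry", 0),   # Bnc
    0b00100: ("zero", 1),    # Bz
    0b00101: ("zero", 0),    # Bnz
    0b01000: ("ovflow", 1),  # Bv
    0b01001: ("ovflow", 0),  # Bnv
    0b10000: ("lsb", 1),     # Bl: latched d_src1[0] is 1
    0b10001: ("lsb", 0),     # Bnl: latched d_src1[0] is 0
}


def branch_taken(word: int, flags: dict[str, int]) -> tuple[bool, int]:
    """Return (taken, target register) for a subtype-0 conditional branch."""
    if ((word >> 14) & 0x3) != 0b10 or ((word >> 10) & 0xF) != 0:
        raise ValueError("not a conditional branch")
    flag, expected = BRANCH_CONDITIONS[(word >> 5) & 0x1F]  # bits 9-5
    return flags[flag] == expected, word & 0x1F             # bits 4-0
```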
(d) Arithmetic/logic instructions
Table 9 gives the two formats of the arithmetic/logic instructions. In the first format, the subtype field is 4 bits, the destination/first source register is 5 bits and the second source register is 5 bits. In the second format, the subtype field is 4 bits, the destination/first source register is 5 bits and the immediate is 5 bits.
Table 10 gives the formats and functions of the arithmetic/logic instructions. The left-shift and right-shift instructions share a subtype: if bit 4 (bit 0 being the least significant bit and bit 15 the most significant) is 1, the instruction is a right shift (shr); otherwise it is a left shift (shl).
Table 9. Bit definition of the arithmetic/logic instructions
Format  Bits 15-14  Bits 13-10  Bits 9-5                           Bits 4-0
1       11          Subtype     Destination/first source register  Second source register
2       11          Subtype     Destination/first source register  5-bit immediate
Table 10. Arithmetic/logic instructions
Subtype  Instruction      Function
0        Mov rd, rs       rd = rs
1        Add rd, rs       rd = rd + rs
2        Sub rd, rs       rd = rd - rs
3        Or rd, rs        rd = rd | rs
4        And rd, rs       rd = rd & rs
5        Xor rd, rs       rd = rd ^ rs
6        Shr rd, rs       rd = rd >> rs
6        Shl rd, rs       rd = rd << rs
7        Not rd, rs       rd = ~rs
8        Reserved         Reserved
9        Iadd rd, #imm    rd = rd + #imm
10       Isub rd, #imm    rd = rd - #imm
11       Ior rd, #imm     rd = rd | #imm
12       Iand rd, #imm    rd = rd & #imm
13       Ixor rd, #imm    rd = rd ^ #imm
14       Ishr rd, #imm    rd = rd >> #imm
14       Ishl rd, #imm    rd = rd << #imm
15       Inot rd, #imm    rd = ~#imm
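Table 10 decodes mechanically, with one wrinkle: subtypes 6 and 14 need bit 4 to distinguish right from left shifts. The hypothetical decoder below returns the mnemonic, the destination/first-source register and the raw bits 4-0 (second source register or immediate); it leaves that operand field untouched, including the direction bit for shifts.

```python
REG_OPS = {0: "mov", 1: "add", 2: "sub", 3: "or", 4: "and", 5: "xor", 7: "not"}
IMM_OPS = {9: "iadd", 10: "isub", 11: "ior", 12: "iand", 13: "ixor", 15: "inot"}


def decode_alu(word: int) -> tuple[str, int, int]:
    """Split a 16-bit arithmetic/logic instruction into (mnemonic, rd, operand field)."""
    if ((word >> 14) & 0x3) != 0b11:  # bits 15-14 must be 11
        raise ValueError("not an arithmetic/logic instruction")
    subtype = (word >> 10) & 0xF      # bits 13-10
    rd = (word >> 5) & 0x1F           # bits 9-5: destination/first source
    operand = word & 0x1F             # bits 4-0: second source register or immediate
    if subtype == 6:
        name = "shr" if operand & 0x10 else "shl"    # bit 4 selects the direction
    elif subtype == 14:
        name = "ishr" if operand & 0x10 else "ishl"
    else:
        name = REG_OPS.get(subtype) or IMM_OPS.get(subtype, "reserved")
    return name, rd, operand
```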
(4) Special register array
The special register array of the high-layer instruction unit of the present invention consists of 32 16-bit registers. Table 11 lists the special registers and their functions. Registers g0-g11 are written by the 16-bit RISC processor. Registers g16-g25 are written by the corresponding modules outside the processor and can be read by it. Registers g12-g15 and g26-g31 are reserved. The special register array outputs the loop count (cnt), the program entry address (entry), the horizontal start address of the 2D parallel memory (startx), the vertical start address of the 2D parallel memory (starty), the 2D parallel memory address increment (step_d2m), the 8-bit 1D parallel memory start address (start_d1m), the 8-bit 1D parallel memory address increment (step_d1m), the 9-bit 1D parallel memory start address (start_dm9), the 9-bit 1D parallel memory address increment (step_dm9) and the motion vector (MV). Its inputs are the end-of-run signal (done) and the tree accumulator outputs: the macroblock matching error sad0, the first block matching error sad1, the second block matching error sad2, the macroblock optimal motion vector opMV0, the first block optimal motion vector opMV1, the second block optimal motion vector opMV2, the macroblock minimum matching error min0, the first block minimum matching error min1 and the second block minimum matching error min2.
Table 11. Special registers and their functions
Register  Function
g0        Loop-count register: outputs the loop count cnt to the low-layer instruction unit
g1        Low-layer program entry address register: outputs the program entry address entry to the low-layer instruction unit
g2        2D parallel memory horizontal start address register: outputs startx
g3        2D parallel memory vertical start address register: outputs starty
g4        2D parallel memory address increment register: outputs step_d2m
g5        8-bit 1D parallel memory start address register: outputs start_d1m
g6        8-bit 1D parallel memory address increment register: outputs step_d1m
g7        9-bit 1D parallel memory start address register: outputs start_dm9
g8        9-bit 1D parallel memory address increment register: outputs step_dm9
g9        Motion vector register: outputs MV; the high 6 bits are the vertical component and the low 6 bits the horizontal component
g10-g15   Reserved
g16       Low-layer instruction unit status register: latches the end-of-run signal done from the low-layer instruction unit
g17       Parallel arithmetic unit result register 0: latches the tree accumulator's macroblock matching error sad0
g18       Parallel arithmetic unit result register 1: latches the tree accumulator's first block matching error sad1
g19       Parallel arithmetic unit result register 2: latches the tree accumulator's second block matching error sad2
g20       Parallel arithmetic unit result register 3: latches the tree accumulator's macroblock optimal motion vector opMV0
g21       Parallel arithmetic unit result register 4: latches the tree accumulator's first block optimal motion vector opMV1
g22       Parallel arithmetic unit result register 5: latches the tree accumulator's second block optimal motion vector opMV2
g23       Parallel arithmetic unit result register 6: latches the tree accumulator's macroblock minimum matching error min0
g24       Parallel arithmetic unit result register 7: latches the tree accumulator's first block minimum matching error min1
g25       Parallel arithmetic unit result register 8: latches the tree accumulator's second block minimum matching error min2
g26-g31   Reserved
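Register g9 packs both motion-vector components into 12 bits. Assuming each 6-bit component is two's complement (range -32 to 31), which the patent does not state, packing and unpacking look like this; the function names are hypothetical.

```python
def pack_mv(dx: int, dy: int) -> int:
    """Pack into the g9 layout: MV[11:6] = vertical dy, MV[5:0] = horizontal dx."""
    for v in (dx, dy):
        if not -32 <= v <= 31:
            raise ValueError("component out of 6-bit two's-complement range")
    return ((dy & 0x3F) << 6) | (dx & 0x3F)


def unpack_mv(mv: int) -> tuple[int, int]:
    """Return (dx, dy), sign-extending each assumed two's-complement 6-bit field."""
    def sign6(v: int) -> int:
        return v - 64 if v & 0x20 else v
    return sign6(mv & 0x3F), sign6((mv >> 6) & 0x3F)
```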
(6) External memory interface unit
The external memory interface unit of the present invention is shown in Figure 3.
Connections: the high-layer instruction unit is connected to the external memory interface unit by control signals, and the data routing unit is connected to the external memory interface unit by a data bus.
Operation: the high-layer instruction unit sends control signals to the external memory interface unit, which reads data from the external memory and outputs it to the data routing circuit.
In this embodiment, N = 16. Specifically, the 2D parallel memory comprises an address mapping module, 16 comparators, a priority encoder, 16 two-input selectors, 16 data memories and a cyclic shifter. The 8-bit 1D parallel memory consists of 16 8-bit memory modules, and the 9-bit 1D parallel memory consists of 16 9-bit memory modules. The parallel arithmetic logic module comprises 16 9-bit processors organized as a single-instruction multiple-data (SIMD) structure.
This embodiment was implemented in Verilog HDL, functionally verified with the Verilog-XL simulator, and then synthesized with the Synopsys Design Compiler. Using a 0.25 µm technology library, the design amounts to 28K gates plus 40 kb of on-chip static memory (SRAM). Several block-matching algorithms have been implemented on the PVSP, including full search based on spiral scanning, the three-step search and nearest-neighbor search, as well as motion compensation and half-pixel search.

Claims (32)

1. A hierarchical programmable parallel video signal processor structure for motion estimation algorithms, characterized in that it comprises six parts: a low-layer instruction unit, a parallel arithmetic unit, a data routing unit, a memory and address generation unit, a high-layer instruction unit and an external memory interface unit; wherein the high-layer instruction unit is connected to the low-layer instruction unit by control signal lines; the low-layer instruction unit is connected to the parallel arithmetic unit by data and control signal lines; the parallel arithmetic unit is connected to the data routing unit by 3 data buses; the data routing unit is connected to the memory and address generation unit by 6 data buses; the start-instruction signal and move-instruction signal of the high-layer instruction unit are connected to the memory and address generation unit through the data routing unit; the data routing unit is connected to the external memory interface unit by a data bus; and the high-layer instruction unit is connected to the external memory interface unit by control signals.
2. The hierarchical programmable parallel video signal processor structure according to claim 1, characterized in that the low-layer instruction unit comprises a program address register, a low-layer instruction memory, a low-layer instruction decode module, a selector, a loop-count register and a subtracter, wherein: the program entry address entry output by the high-layer instruction unit is connected to the program address register, and the set-program-entry-address signal set_entry output by the high-layer instruction unit is connected to the enable input of the program address register; the program address register is connected to the low-layer instruction memory, which is connected to the low-layer instruction decode module; the loop count cnt output by the high-layer instruction unit is connected to the upper input of the selector, and the subtracter output is connected to the lower input of the selector; the set-loop-count signal set_cnt output by the high-layer instruction unit is connected to the select input of the selector; the selector output is connected to the loop-count register; the loop-count register output is connected to the upper input of the subtracter, and the constant 1 is connected to the lower input of the subtracter; the carry output of the subtracter is connected to the end-of-run signal done; and the low-layer instruction unit operates under the control of low-layer instructions from the low-layer instruction set.
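The loop-count path of claim 2 behaves like a down-counter whose underflow raises done. In this Python sketch the register width is assumed to be 16 bits and the subtracter's carry output is interpreted as the borrow out of reg - 1, so done is asserted while the register holds 0; that interpretation, the class name and the method names are assumptions.

```python
class LoopCounter:
    """Hypothetical model of the selector, loop-count register and subtracter."""

    WIDTH = 16  # assumed register width

    def __init__(self) -> None:
        self.reg = 0

    def load(self, cnt: int) -> None:
        """set_cnt high: the selector passes cnt into the loop-count register."""
        self.reg = cnt & ((1 << self.WIDTH) - 1)

    def step(self) -> None:
        """set_cnt low: the selector passes the subtracter output reg - 1."""
        self.reg = (self.reg - 1) & ((1 << self.WIDTH) - 1)

    @property
    def done(self) -> bool:
        """Assumed borrow out of reg - 1, i.e. true when the register is 0."""
        return self.reg == 0
```

With cnt = 3, done rises after three decrements in this model.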
3. The hierarchical programmable parallel video signal processor structure according to claim 2, characterized in that the low-layer instructions of the low-layer instruction set are 16 bits long and consist, from the most significant to the least significant bits, of a 4-bit type code field, a 3-bit source operand 1, a 3-bit source operand 2, a 2-bit destination operand and a 4-bit shift immediate; #imm denotes an immediate:
(1) Type code 0, instruction PNOP: no operation;
(2) Type code 1, instruction PADD dst, src1, src2, #imm: parallel addition, dst = (src1 + src2) >> #imm;
(3) Type code 2, instruction PSUB dst, src1, src2: parallel subtraction, dst = src1 - src2;
(4) Type code 3, instruction PADDS dst, src1, src2: parallel saturating addition, dst = clip(src1 + src2);
(5) Type code 4, instruction PMOV dst, src: parallel data move, dst = src;
(6) Type code 5, instruction PSAD src1, src2: parallel absolute difference, abs(src1 - src2);
Type codes 6-15 are reserved.
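The lane-wise semantics of the low-layer instructions can be sketched on Python lists, one element per 9-bit processor. The clip range [0, 255] for PADDS and the absence of 9-bit wrap-around are assumptions; the field extractor follows the bit order listed in claim 3, and all names are hypothetical. PADD with #imm = 1 yields the truncated average (a + b) >> 1 of each pixel pair.

```python
def decode_low_layer(word: int) -> dict[str, int]:
    """Split a 16-bit low-layer instruction, MSB first: type(4) src1(3) src2(3) dst(2) imm(4)."""
    return {
        "type": (word >> 12) & 0xF,
        "src1": (word >> 9) & 0x7,
        "src2": (word >> 6) & 0x7,
        "dst": (word >> 4) & 0x3,
        "imm": word & 0xF,
    }


def padd(src1: list[int], src2: list[int], imm: int) -> list[int]:
    """PADD: lane-wise (src1 + src2) >> imm."""
    return [(a + b) >> imm for a, b in zip(src1, src2)]


def psub(src1: list[int], src2: list[int]) -> list[int]:
    """PSUB: lane-wise src1 - src2."""
    return [a - b for a, b in zip(src1, src2)]


def padds(src1: list[int], src2: list[int], lo: int = 0, hi: int = 255) -> list[int]:
    """PADDS: lane-wise saturating add; the [lo, hi] range is an assumption."""
    return [min(max(a + b, lo), hi) for a, b in zip(src1, src2)]


def pmov(src: list[int]) -> list[int]:
    """PMOV: lane-wise copy."""
    return list(src)


def psad(src1: list[int], src2: list[int]) -> list[int]:
    """PSAD: lane-wise absolute difference, the input to the tree accumulator."""
    return [abs(a - b) for a, b in zip(src1, src2)]
```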
4. The hierarchical programmable parallel video signal processor structure according to claim 1, characterized in that the parallel arithmetic unit consists of a parallel arithmetic logic module and a tree accumulator, the output of the parallel arithmetic logic module being connected to the input of the tree accumulator.
5. The hierarchical programmable parallel video signal processor structure according to claim 4, characterized in that the parallel arithmetic logic module comprises N 9-bit processors organized as a single-instruction multiple-data (SIMD) structure.
6. The hierarchical programmable parallel video signal processor structure according to claim 4, characterized in that the tree accumulator module comprises two 8-input tree adders, an 11-bit adder, three accumulators (ACC0, ACC1, ACC2) and three minimum units (MIN0, MIN1, MIN2), connected as follows: the output of the left 8-input tree adder is connected to the 11-bit adder and to accumulator ACC1; the output of the right 8-input tree adder is connected to the 11-bit adder and to accumulator ACC2; the output of the 11-bit adder is connected to accumulator ACC0; accumulators ACC0, ACC1 and ACC2 are connected to minimum units MIN0, MIN1 and MIN2 respectively; accumulator ACC0 drives the macroblock matching error signal sad0; accumulator ACC1 drives the first block matching error signal sad1; accumulator ACC2 drives the second block matching error signal sad2; the output of MIN0 drives the macroblock minimum matching error signal min0 and the macroblock optimal motion vector signal opMV0, the output of MIN1 drives the first block minimum matching error signal min1 and the first block optimal motion vector signal opMV1, and the output of MIN2 drives the second block minimum matching error signal min2 and the second block optimal motion vector signal opMV2; the inputs of MIN0 are connected to sad0, to the end-of-run signal done and to the motion vector signal MV, the inputs of MIN1 are connected to sad1, done and MV, and the inputs of MIN2 are connected to sad2, done and MV.
7. The 8-input tree adder according to claim 6, comprising four 8-bit adders, two 9-bit adders and one 10-bit adder, wherein the outputs of the first and second 8-bit adders are connected to the first 9-bit adder, the outputs of the third and fourth 8-bit adders are connected to the second 9-bit adder, and the outputs of the two 9-bit adders are connected to the inputs of the 10-bit adder.
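The adder tree of claims 6 and 7 and its three accumulators can be modeled as follows, assuming 16 lanes (two 8-input trees) that each deliver one absolute difference per cycle. Under that assumption sad1 and sad2 are the errors of the left and right 8-pixel-wide halves of the accumulated rows, and sad0 is their sum. Each tree sum is at most 8 x 255 = 2040, which fits in 11 bits, and 16 rows of full 16-pixel sums (at most 65280) fit the 16-bit registers of the minimum units.

```python
from typing import Sequence, Tuple


def tree_add8(x: Sequence[int]) -> int:
    """8-input tree adder (claim 7): 4 x 8-bit, then 2 x 9-bit, then 1 x 10-bit adder."""
    if len(x) != 8:
        raise ValueError("expected 8 inputs")
    level1 = [x[0] + x[1], x[2] + x[3], x[4] + x[5], x[6] + x[7]]  # four 8-bit adders
    level2 = [level1[0] + level1[1], level1[2] + level1[3]]        # two 9-bit adders
    return level2[0] + level2[1]                                   # one 10-bit adder


def tree_accumulate(rows: Sequence[Sequence[int]]) -> Tuple[int, int, int]:
    """Return (sad0, sad1, sad2); one 16-lane row of |differences| per cycle is assumed."""
    acc0 = acc1 = acc2 = 0
    for diffs in rows:
        if len(diffs) != 16:
            raise ValueError("expected 16 lanes per row")
        left = tree_add8(diffs[:8])    # left tree -> ACC1 and the 11-bit adder
        right = tree_add8(diffs[8:])   # right tree -> ACC2 and the 11-bit adder
        acc1 += left
        acc2 += right
        acc0 += left + right           # 11-bit adder output -> ACC0
    return acc0, acc1, acc2
```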
8. The minimum unit MIN0 according to claim 6, comprising a 16-bit subtracter, an AND gate, a 16-bit register and a 12-bit register, connected as follows: the left input of the 16-bit subtracter is connected to the output of the 16-bit register, the right input of the 16-bit subtracter is connected to the external macroblock matching error input sad0, and the carry signal of the 16-bit subtracter is connected to the upper input of the AND gate; the external input sad0 is also connected to the input of the 16-bit register, whose output drives the macroblock minimum matching error value min0; the lower input of the AND gate is connected to the external end-of-run input done; the input of the 12-bit register is connected to the external motion vector signal MV; and the enable inputs of the 12-bit register and of the 16-bit register are both connected to the output of the AND gate.
9. The minimum unit MIN1 according to claim 6, comprising a 16-bit subtracter, an AND gate, a 16-bit register and a 12-bit register, connected as follows: the left input of the 16-bit subtracter is connected to the output of the 16-bit register, the right input of the 16-bit subtracter is connected to the external first block matching error input sad1, and the carry signal of the 16-bit subtracter is connected to the upper input of the AND gate; the output of the 16-bit register drives the first block minimum matching error value min1; the lower input of the AND gate is connected to the external end-of-run input done; the input of the 12-bit register is connected to the external motion vector signal MV; and the enable inputs of the 12-bit register and of the 16-bit register are both connected to the output of the AND gate.
10. The minimum unit MIN2 according to claim 6, comprising a 16-bit subtracter, an AND gate, a 16-bit register and a 12-bit register, connected as follows: the left input of the 16-bit subtracter is connected to the output of the 16-bit register, the right input of the 16-bit subtracter is connected to the external second block matching error input sad2, and the carry signal of the 16-bit subtracter is connected to the upper input of the AND gate; the external input sad2 is also connected to the input of the 16-bit register, whose output drives the second block minimum matching error value min2; the lower input of the AND gate is connected to the external end-of-run input done; the input of the 12-bit register is connected to the external motion vector signal MV; and the enable inputs of the 12-bit register and of the 16-bit register are both connected to the output of the AND gate.
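A behavioral sketch of one minimum unit (claims 8 to 10) follows. The claims state neither the reset value of the 16-bit register nor exactly when the subtracter carry is asserted; the sketch assumes a reset value of 0xFFFF and a carry that means the new error is strictly smaller than the stored minimum. The class name `MinUnit` is hypothetical.

```python
class MinUnit:
    """Sketch of MIN0/MIN1/MIN2: tracks the smallest SAD and its motion vector."""

    def __init__(self) -> None:
        self.min_sad = 0xFFFF  # assumed reset value; the claims give none
        self.best_mv = 0       # 12-bit register (drives opMV)

    def clock(self, sad: int, mv: int, done: bool) -> None:
        carry = sad < self.min_sad   # assumed meaning of the subtracter carry
        if carry and done:           # the AND gate output enables both registers
            self.min_sad = sad & 0xFFFF
            self.best_mv = mv & 0xFFF
```

Because both registers share one enable, the stored motion vector always belongs to the stored minimum error.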
11. The hierarchical programmable parallel video signal processor structure according to claim 1, characterized in that the memory and address generation unit consists of a two-dimensional parallel memory together with its address generation module, connected to it by an address bus; an 8-bit one-dimensional parallel memory together with its address generation module, connected to it by an address bus; and a 9-bit one-dimensional parallel memory together with its address generation module, connected to it by an address bus.
12. The hierarchical programmable parallel video signal processor structure according to claim 11, characterized in that the two-dimensional parallel memory comprises an address mapping module, N comparators, a priority encoder, N two-to-one selectors, N data memories and a cyclic shifter, wherein the outputs of the address mapping module are connected to the inputs of the N comparators; the comparator outputs are connected to the input of the priority encoder; the outputs of the priority encoder are connected to the select inputs S0, S1, ..., S(N-1) of the N selectors; the data inputs of the selectors are connected to the address mapping module; the outputs of the N selectors are connected to the N data memories respectively; and the outputs of the N data memories are all connected to the cyclic shifter.
13. The hierarchical programmable parallel video signal processor structure according to claim 12, characterized in that in the address mapping module the left input of a 2-bit adder is the constant 1 and its right input is bits 5:4 of the horizontal memory address signal Lx. Bits 7:6 of output address A1 are connected to bits 5:4 of the vertical memory address signal Ly, bits 5:4 of A1 are connected to the output of the 2-bit adder, and bits 3:0 of A1 are connected to bits 3:0 of Ly. Bits 7:6 of output address A0 are connected to bits 5:4 of Ly, bits 5:4 of A0 are connected to bits 5:4 of Lx, and bits 3:0 of A0 are connected to bits 3:0 of Ly. Output b0 is connected to bits 3:0 of Lx. In concatenation notation: A0 = {Ly[5:4], Lx[5:4], Ly[3:0]}, A1 = {Ly[5:4], Lx[5:4] + 1, Ly[3:0]}, b0 = Lx[3:0].
14. The hierarchical programmable parallel video signal processor structure according to claim 12, characterized in that the priority encoder is implemented with AND and OR logic circuits and computes J = min{ j | t_j = 1, j = 0, 1, ..., N-1 }; the cyclic shifter rotates the data so that the word read from data memory number b0 is moved left to the most significant position.
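The following sketch models claims 12 to 14 for N = 16 banks (b0 = Lx[3:0] is 4 bits) and a 64 x 64 pixel window (6-bit Lx and Ly, matching the bit ranges of claim 13). Bank j holds the pixels whose column satisfies x mod 16 = j. The comparator outputs are garbled in the source, so the sketch assumes t_j = 1 when j >= b0: banks at or right of b0 then read A0 (the current 16-column group) and the others read A1 (the next group), and rotating by the priority encoder's J = b0 yields 16 consecutive pixels. The class and method names are hypothetical.

```python
from typing import List

N = 16  # number of banks


def priority_encode(t: List[int]) -> int:
    """J = min{ j | t_j = 1 } (claim 14)."""
    return min(j for j, bit in enumerate(t) if bit)


class TwoDParallelMemory:
    def __init__(self) -> None:
        self.banks = [[0] * 256 for _ in range(N)]  # 8-bit bank addresses A0/A1

    @staticmethod
    def bank_address(ly: int, group: int) -> int:
        # {Ly[5:4], group[1:0], Ly[3:0]}; masking group models the wrapping 2-bit adder
        return (((ly >> 4) & 3) << 6) | ((group & 3) << 4) | (ly & 15)

    def write(self, x: int, y: int, value: int) -> None:
        self.banks[x & 15][self.bank_address(y, x >> 4)] = value

    def read_row(self, lx: int, ly: int) -> List[int]:
        b0 = lx & 15
        a0 = self.bank_address(ly, lx >> 4)
        a1 = self.bank_address(ly, (lx >> 4) + 1)
        t = [int(j >= b0) for j in range(N)]          # assumed comparator outputs
        data = [self.banks[j][a0 if t[j] else a1] for j in range(N)]
        shift = priority_encode(t)                     # equals b0 under the assumption
        return data[shift:] + data[:shift]             # cyclic shifter
```

With this layout any 16 horizontally consecutive pixels, even ones that straddle a 16-column boundary, come from 16 different banks and can be read in one access.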
15. The hierarchical programmable parallel video signal processor structure according to claim 11, characterized in that the 8-bit one-dimensional parallel memory consists of N 8-bit memory modules, wherein the externally supplied address signal addr_d1m of the 8-bit one-dimensional parallel memory is connected to the address inputs of all N 8-bit memories, and the N 8-bit memories output 8N bits of data to the outside.
16. The hierarchical programmable parallel video signal processor structure according to claim 11, characterized in that the 9-bit one-dimensional parallel memory consists of N 9-bit memory modules, wherein the externally supplied address signal addr_dm9 of the 9-bit one-dimensional parallel memory is connected to the address inputs of all N 9-bit memories, and the N 9-bit memories output 9N bits of data to the outside.
17. The hierarchical programmable parallel video signal processor structure according to claim 11, characterized in that the address generation module of the two-dimensional parallel memory consists of adders 0 and 1, selectors 0, 1, 2 and 3, and registers 0 and 1, connected as follows: the left input of selector 0 is connected to register 0 and its right input to signal starty; the left input of selector 1 is connected to signal step_d2m and its right input to the upper 6 bits of signal MV (MV[11:6]); the outputs of selectors 0 and 1 are connected to the inputs of adder 0; the output of adder 0 is connected to register 0; the output of register 0 is the vertical memory address signal Ly; the left input of selector 2 is connected to register 1 and its right input to signal startx; the left input of selector 3 is connected to the constant 0 and its right input to the lower 6 bits of MV (MV[5:0]); the outputs of selectors 2 and 3 are connected to the inputs of adder 1; the output of adder 1 is connected to register 1; and the output of register 1 is the horizontal memory address signal Lx.
18. The hierarchical programmable parallel video signal processor structure according to claim 11, characterized in that the address generation module of the 8-bit one-dimensional parallel memory consists of two selectors, an adder and a register, connected as follows: the left input of selector 0 is connected to the output of register 0 and its right input to signal start_d1m; the left input of selector 1 is connected to signal step_d1m and its right input to the constant 0; the outputs of selectors 0 and 1 are connected to the two inputs of adder 0 respectively; and register 0 is connected to the outside through the address signal addr_d1m of the 8-bit one-dimensional parallel memory.
19. The hierarchical programmable parallel video signal processor structure according to claim 11, characterized in that the address generation module of the 9-bit one-dimensional parallel memory consists of two selectors, an adder and a register, connected as follows: the left input of selector 0 is connected to the output of register 0 and its right input to signal start_dm9; the left input of selector 1 is connected to signal step_dm9 and its right input to the constant 0; the outputs of selectors 0 and 1 are connected to the two inputs of adder 0 respectively; and register 0 is connected to the outside through the address signal addr_dm9 of the 9-bit one-dimensional parallel memory.
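The address generators of claims 17 to 19 share one pattern: when the selectors take their right inputs the register is loaded with a start value (plus the motion vector, in the 2-D case), and when they take their left inputs the register is incremented by a step. The sketch below assumes 6-bit Lx and Ly (as in claim 13), 6-bit two's-complement MV fields, and a 16-bit width for the one-dimensional addresses; the class and method names are hypothetical.

```python
class AddressGenerator2D:
    """Claim 17 sketch: produces Ly (register 0) and Lx (register 1)."""

    MASK = 0x3F  # assumed 6-bit Lx/Ly

    def __init__(self, startx: int, starty: int, step_d2m: int) -> None:
        self.startx, self.starty, self.step_d2m = startx, starty, step_d2m
        self.lx = self.ly = 0

    def load(self, mv: int) -> None:
        # Right selector inputs: Ly = starty + MV[11:6], Lx = startx + MV[5:0]
        self.ly = (self.starty + ((mv >> 6) & 0x3F)) & self.MASK
        self.lx = (self.startx + (mv & 0x3F)) & self.MASK

    def advance(self) -> None:
        # Left selector inputs: Ly = Ly + step_d2m, Lx = Lx + 0
        self.ly = (self.ly + self.step_d2m) & self.MASK
        self.lx = (self.lx + 0) & self.MASK


class AddressGenerator1D:
    """Claims 18-19 sketch: produces addr_d1m or addr_dm9 (width assumed)."""

    def __init__(self, start: int, step: int, width: int = 16) -> None:
        self.start, self.step, self.mask = start, step, (1 << width) - 1
        self.addr = 0

    def load(self) -> None:
        self.addr = (self.start + 0) & self.mask          # right inputs: start + 0

    def advance(self) -> None:
        self.addr = (self.addr + self.step) & self.mask   # left inputs: addr + step
```

Adding a 6-bit two's-complement field modulo 64 gives the same result as a signed addition modulo 64, so negative motion vector components need no special handling in this sketch.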
20. The hierarchical programmable parallel video signal processor structure according to claim 1, characterized in that the data routing unit consists of selectors.
21. The hierarchical programmable parallel video signal processor structure according to claim 1, characterized in that the high-layer instruction unit consists of a 16-bit reduced instruction set (RISC) processor and a special register array of 32 registers, wherein the 16-bit RISC processor is connected to the outside through the set-program-entry-address signal set_entry and the set-loop-count signal set_cnt, and is connected to the special register array by control signals; the high-layer instruction unit is controlled by instructions from the high-layer instruction set.
22. The hierarchical programmable parallel video signal processor structure according to claim 21, characterized in that the 16-bit RISC processor comprises an instruction fetch unit, an instruction decode unit, an instruction execution unit and a register array, connected as follows: the instruction fetch unit and the instruction decode unit are connected by the jump address ba, the instruction signal d_ir and the branch control signal next; the instruction decode unit and the instruction execution unit are connected by the opcode d_op, the execute control signal exec, the first source operand d_src1, the second source operand d_src2 and the status signal eflags; the instruction execution unit and the register array are connected by the register write signal we and the result signal e_res; and the instruction decode unit and the register array are connected by the first source operand address d_a1, the second source operand address d_a2, and the register array output signals d_r1 and d_r2.
23. The hierarchical programmable parallel video signal processor structure according to claim 22, characterized in that the instruction fetch unit comprises an adder, a current address register, a selector, a high-layer instruction memory and an instruction register, wherein the upper input of the adder is connected to the constant 1 and its lower input to the output of the current address register; the adder output is connected to the input of the current address register; the upper input of the selector is connected to the jump address ba, the lower input of the selector is connected to the output of the current address register, the select input of the selector is connected to the branch control signal next, and the output of the selector is connected to the address input of the high-layer instruction memory; the output of the high-layer instruction memory is connected to the input of the instruction register; and the output of the instruction register is connected to the instruction decode unit through the instruction signal d_ir.
24. The hierarchical programmable parallel video signal processor structure according to claim 22, characterized in that the instruction decode unit consists of AND and OR logic circuits; it is connected to the instruction fetch unit by the jump address ba, the instruction signal d_ir and the branch control signal next; to the instruction execution unit by the opcode d_op, the execute control signal exec, the first source operand d_src1, the second source operand d_src2 and the status signal eflags; and to the register array by the first source operand address d_a1, the second source operand address d_a2, and the register array output signals d_r1 and d_r2.
25. The hierarchical programmable parallel video signal processor structure according to claim 22, characterized in that the instruction execution unit comprises register 1, register 2, register 3, register 4, a status register and an arithmetic logic unit (ALU), connected as follows: the first source operand d_src1, the second source operand d_src2, the opcode d_op and the execute control signal exec are connected to the inputs of registers 1, 2, 3 and 4 respectively; the outputs of registers 1, 2 and 3 are connected to the ALU; the ALU and the status register are connected by the carry flag carry, the zero flag zero, the overflow flag ovflow and the lowest bit of d_src1 (d_src1[0]); the output we of register 4 is connected to the outside; and the output eflags of the status register is connected to the outside. For each value of the operation type signal e_op, the ALU operation is defined as follows:
(1) Operation type 0, assignment: e_r = e_src2
(2) Operation type 1, addition: e_r = e_src1 + e_src2
(3) Operation type 2, subtraction: e_r = e_src1 - e_src2
(4) Operation type 3, OR: e_r = e_src1 | e_src2
(5) Operation type 4, AND: e_r = e_src1 & e_src2
(6) Operation type 5, XOR: e_r = e_src1 ^ e_src2
(7) Operation type 6, shift: when e_src2[4] = 1, e_r = e_src1 >> e_src2;
when e_src2[4] = 0, e_r = e_src1 << e_src2
(8) Operation type 7, NOT: e_r = ~e_src2
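The ALU table above can be sketched as a single function. Several details are assumptions not given in the claim: 16-bit operands, a carry flag that means carry-out for addition and borrow for subtraction, signed (two's-complement) overflow, logical shifts, and a shift distance taken from e_src2[3:0].

```python
from typing import Dict, Tuple

MASK = 0xFFFF  # assumed 16-bit datapath


def alu(e_op: int, e_src1: int, e_src2: int) -> Tuple[int, Dict[str, int]]:
    """Return (e_r, flags) for operation types 0-7 of claim 25."""
    a, b = e_src1 & MASK, e_src2 & MASK
    carry = ovflow = False
    if e_op == 0:
        r = b                                          # assignment
    elif e_op == 1:
        full = a + b
        r = full & MASK
        carry = full > MASK                            # carry out of bit 15
        ovflow = bool((a ^ r) & (b ^ r) & 0x8000)      # signed overflow
    elif e_op == 2:
        full = a - b
        r = full & MASK
        carry = full < 0                               # borrow (assumed convention)
        ovflow = bool((a ^ b) & (a ^ r) & 0x8000)
    elif e_op == 3:
        r = a | b
    elif e_op == 4:
        r = a & b
    elif e_op == 5:
        r = a ^ b
    elif e_op == 6:
        amount = b & 0xF                               # assumed shift distance field
        r = (a >> amount) if b & 0x10 else (a << amount) & MASK
    elif e_op == 7:
        r = ~b & MASK
    else:
        raise ValueError(f"unknown operation type {e_op}")
    flags = {"carry": int(carry), "zero": int(r == 0), "ovflow": int(ovflow), "lsb": a & 1}
    return r, flags
```

The `lsb` entry stands for d_src1[0], which the status register receives for the Bl/Bnl branches.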
26. The hierarchical programmable parallel video signal processor structure according to claim 22, characterized in that the register array consists of 32 general-purpose registers; the register array and the instruction execution unit are connected by the register write signal we and the result signal e_res; the register array and the instruction decode unit are connected by the first source operand address d_a1, the second source operand address d_a2, and the register array output signals d_r1 and d_r2; and the output d_a1 of the instruction decode unit is connected to the register array through a register whose output is e_a.
27. The hierarchical programmable parallel video signal processor structure according to claim 21, characterized in that the high-layer instruction set has 4 instruction types: no-operation, assignment, branch and arithmetic/logic instructions; bits 15 and 14 of an instruction indicate its type, and the 4 types are encoded as 00, 01, 10 and 11 respectively.
28. The hierarchical programmable parallel video signal processor structure according to claim 27, characterized in that the no-operation instruction is a 16-bit instruction whose bits 0 through 15 are all 0.
29. The hierarchical programmable parallel video signal processor structure according to claim 27, characterized in that the assignment instructions come in three formats: the first format has a 4-bit subtype field, a 5-bit destination register field and a 5-bit source register field; the second format has a 4-bit subtype field, a 5-bit destination register field and a 5-bit immediate field; the third format has a 4-bit subtype field and a 5-bit destination register field, and its lowest 5 bits are 0. The assignment instructions are as follows (the fields of each instruction are arranged from the most significant to the least significant bit):
(1) Subtype 1, instruction Lmovr r, #imm: general register assignment; the source operand #imm is a long immediate and the instruction takes two instruction cycles; the destination is general register r;
(2) Subtype 2, instruction Lmovg g, #imm: special register assignment; the source operand is a long immediate and the instruction takes two instruction cycles; the destination is special register g;
(3) Subtype 3, instruction Movg g, r: special register assignment; the source operand is general register r and the destination is special register g;
(4) Subtype 4, instruction Movr r, g: general register assignment; the source operand is special register g and the destination is general register r;
(5) Subtype 5, instruction Imovr r, #imm: general register assignment; the source operand is a short immediate and the destination is general register r;
(6) Subtype 6, instruction Imovg g, #imm: special register assignment; the source operand is a short immediate and the destination is special register g;
(7) Subtype 7, instruction Movpc r: general register assignment; the source operand is the program counter and the destination is general register r.
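Claims 27 to 29 together fix the top-level layout of a high-layer instruction. The decoder below is a sketch assuming the fields run type(2), subtype(4), destination(5), source/immediate(5) from bit 15 down to bit 0; it does not model the second word of the two-cycle long-immediate forms. The function name `decode_high` is hypothetical.

```python
from typing import Optional, Tuple

TYPES = {0b00: "nop", 0b01: "assign", 0b10: "branch", 0b11: "alu"}
ASSIGN = {1: "Lmovr", 2: "Lmovg", 3: "Movg", 4: "Movr", 5: "Imovr", 6: "Imovg", 7: "Movpc"}


def decode_high(word: int) -> Tuple[str, Optional[Tuple[str, int, int]]]:
    """Classify a 16-bit word by bits 15:14; unpack assignment fields if applicable."""
    word &= 0xFFFF
    kind = TYPES[word >> 14]
    if kind != "assign":
        return kind, None
    sub = (word >> 10) & 0xF     # 4-bit subtype
    dst = (word >> 5) & 0x1F     # 5-bit destination register (r or g)
    low5 = word & 0x1F           # source register, short immediate, or 0
    return kind, (ASSIGN.get(sub, "reserved"), dst, low5)
```

For example, Imovr with destination register 3 and short immediate 9 decodes to `("assign", ("Imovr", 3, 9))`.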
30. The hierarchical programmable parallel video signal processor structure according to claim 27, characterized in that the branch instructions have 2 formats: the first format has a 4-bit subtype field, a 5-bit condition field and a 5-bit destination address register field; the second format has a 2-bit subtype field and a 12-bit immediate address field. The branch instructions are as follows (the fields of each instruction are arranged from the most significant to the least significant bit):
(1) Subtype 0, condition code 00010, instruction Bc r: branch if carry;
(2) Subtype 0, condition code 00011, instruction Bnc r: branch if no carry;
(3) Subtype 0, condition code 00100, instruction Bz r: branch if zero;
(4) Subtype 0, condition code 00101, instruction Bnz r: branch if not zero;
(5) Subtype 0, condition code 01000, instruction Bv r: branch on overflow;
(6) Subtype 0, condition code 01001, instruction Bnv r: branch if no overflow;
(7) Subtype 0, condition code 10000, instruction Bl r: branch if the lowest bit is 1;
(8) Subtype 0, condition code 10001, instruction Bnl r: branch if the lowest bit is 0;
(9) Subtype 1, condition code 00000, instruction Jmpr r: unconditional jump;
(10) Subtype 2, condition code 00000, instruction Callr r: indirect procedure call;
(11) Subtype 3, condition code 00000, instruction Ret: procedure return;
(12) Subtypes 4-7: reserved;
(13) Subtype 8, condition code 00000, instruction Call #imm: direct procedure call;
(14) Subtypes 9-11: reserved;
(15) Subtype 12, condition code 00000, instruction Jmp #imm: direct unconditional jump;
(16) Subtypes 13-15: reserved.
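The subtype-0 condition codes above follow a regular pattern: bits 4 to 1 select one flag (00010 carry, 00100 zero, 01000 overflow, 10000 lowest bit) and bit 0 negates it. That reading is inferred from the table rather than stated in the claim; the sketch below evaluates a condition under it, with flag names chosen to match the status signals carry, zero, ovflow and d_src1[0].

```python
from typing import Dict

# Inferred: cond[4:1] is a one-hot flag select, cond[0] inverts the test
FLAG_SELECT = {0b0001: "carry", 0b0010: "zero", 0b0100: "ovflow", 0b1000: "lsb"}


def branch_taken(cond: int, flags: Dict[str, int]) -> bool:
    """Evaluate a 5-bit subtype-0 condition code against the status flags."""
    flag = FLAG_SELECT.get((cond >> 1) & 0xF)
    if flag is None:
        raise ValueError(f"condition code {cond:05b} is not listed in claim 30")
    return bool(flags[flag]) != bool(cond & 1)
```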
31. The hierarchical programmable parallel video signal processor structure according to claim 27, characterized in that the arithmetic/logic instructions have 2 formats: in the first format, the instruction type field has 4 bits, the destination/first source register field has 5 bits and the second source register field has 5 bits; in the second format, the instruction type field has 4 bits, the destination/first source register field has 5 bits and the immediate field has 5 bits. The arithmetic/logic instructions are as follows (the fields of each instruction are arranged from the most significant to the least significant bit):
(1) Subtype 0, instruction Mov rd, rs, function: rd = rs
(2) Subtype 1, instruction Add rd, rs, function: rd = rd + rs
(3) Subtype 2, instruction Sub rd, rs, function: rd = rd - rs
(4) Subtype 3, instruction Or rd, rs, function: rd = rd | rs
(5) Subtype 4, instruction And rd, rs, function: rd = rd & rs
(6) Subtype 5, instruction Xor rd, rs, function: rd = rd ^ rs
(7) Subtype 6, instruction Ishr rd, rs, function: rd = rd >> rs
(8) Subtype 6, instruction Ishl rd, rs, function: rd = rd << rs
(9) Subtype 7, instruction Not rd, rs, function: rd = ~rs
(10) Subtype 8: reserved
(11) Subtype 9, instruction Iadd rd, #imm, function: rd = rd + #imm
(12) Subtype 10, instruction Isub rd, #imm, function: rd = rd - #imm
(13) Subtype 11, instruction Ior rd, #imm, function: rd = rd | #imm
(14) Subtype 12, instruction Iand rd, #imm, function: rd = rd & #imm
(15) Subtype 13, instruction Ixor rd, #imm, function: rd = rd ^ #imm
(16) Subtype 14, instruction Ishr rd, #imm, function: rd = rd >> #imm
(17) Subtype 14, instruction Ishl rd, #imm, function: rd = rd << #imm
(18) Subtype 15, instruction Inot rd, #imm, function: rd = ~#imm
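Comparing this table with the ALU operation types of claim 25, subtypes 0-7 line up with e_op 0-7 and subtypes 9-15 are their immediate forms, so the low three subtype bits appear to select e_op and bit 3 the immediate format; bit 4 of the 5-bit immediate would then choose the shift direction, as e_src2[4] does in claim 25. The decoder below is a sketch under that inference, with fields assumed to be type(2), subtype(4), rd(5), rs/imm(5) from bit 15 down. The function name `decode_alu` is hypothetical.

```python
MNEMONIC = {0: "Mov", 1: "Add", 2: "Sub", 3: "Or", 4: "And", 5: "Xor",
            6: "Ishr/Ishl",  # register form: direction follows bit 4 of rs's value at run time
            7: "Not", 9: "Iadd", 10: "Isub", 11: "Ior", 12: "Iand", 13: "Ixor",
            14: "Ishr/Ishl", 15: "Inot"}


def decode_alu(word: int) -> dict:
    """Decode a 16-bit arithmetic/logic instruction (type bits 15:14 = 11)."""
    if ((word >> 14) & 0b11) != 0b11:
        raise ValueError("not an arithmetic/logic instruction")
    sub = (word >> 10) & 0xF
    if sub == 8:
        raise ValueError("subtype 8 is reserved")
    rd, low5 = (word >> 5) & 0x1F, word & 0x1F
    name = MNEMONIC[sub]
    if sub == 14:                          # immediate shift: imm[4] picks the direction
        name = "Ishr" if low5 & 0x10 else "Ishl"
    return {"mnemonic": name, "e_op": sub & 0b111, "immediate": sub >= 9,
            "rd": rd, "operand": low5}
```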
32. The hierarchical programmable parallel video signal processor structure according to claim 21, characterized in that the special register array consists of 32 16-bit registers, defined as follows:
(1) Special register g0: loop-count register; outputs the loop-count signal cnt to the low-layer instruction unit;
(2) Special register g1: low-layer program entry address register; outputs the program entry address signal entry to the low-layer instruction unit;
(3) Special register g2: horizontal start address register of the two-dimensional parallel memory; outputs the horizontal start address signal startx of the two-dimensional parallel memory;
(4) Special register g3: vertical start address register of the two-dimensional parallel memory; outputs the vertical start address signal starty of the two-dimensional parallel memory;
(5) Special register g4: address increment register of the two-dimensional parallel memory; outputs the address increment signal step_d2m of the two-dimensional parallel memory;
(6) Special register g5: start address register of the 8-bit one-dimensional parallel memory; outputs the start address signal start_d1m;
(7) Special register g6: address increment register of the 8-bit one-dimensional parallel memory; outputs the address increment signal step_d1m;
(8) Special register g7: start address register of the 9-bit one-dimensional parallel memory; outputs the start address signal start_dm9;
(9) Special register g8: address increment register of the 9-bit one-dimensional parallel memory; outputs the address increment signal step_dm9;
(10) Special register g9: motion vector register; outputs the motion vector signal MV, whose upper 6 bits are the vertical motion vector and whose lower 6 bits are the horizontal motion vector;
(11) Special registers g10-g15: reserved;
(12) Special register g16: low-layer instruction unit status register; latches the end-of-run output (done) of the low-layer instruction unit;
(13) Special register g17: parallel arithmetic unit result register 0; latches the macroblock matching error signal sad0 output by the tree accumulator;
(14) Special register g18: parallel arithmetic unit result register 1; latches the first block matching error signal sad1 output by the tree accumulator;
(15) Special register g19: parallel arithmetic unit result register 2; latches the second block matching error signal sad2 output by the tree accumulator;
(16) Special register g20: parallel arithmetic unit result register 3; latches the macroblock optimal motion vector signal opMV0 output by the tree accumulator;
(17) Special register g21: parallel arithmetic unit result register 4; latches the first block optimal motion vector signal opMV1 output by the tree accumulator;
(18) Special register g22: parallel arithmetic unit result register 5; latches the second block optimal motion vector signal opMV2 output by the tree accumulator;
(19) Special register g23: parallel arithmetic unit result register 6; latches the macroblock minimum matching error signal min0 output by the tree accumulator;
(20) Special register g24: parallel arithmetic unit result register 7; latches the first block minimum matching error signal min1 output by the tree accumulator;
(21) Special register g25: parallel arithmetic unit result register 8; latches the second block minimum matching error signal min2 output by the tree accumulator;
(22) Special registers g26-g31: reserved.
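The register map above can be captured as a lookup table, together with helpers for the 12-bit MV layout of g9 (upper 6 bits vertical, lower 6 bits horizontal). Treating each 6-bit component as a signed two's-complement value in the range -32..31 is an assumption; the claims give only the bit split. The names `SPECIAL`, `pack_mv` and `unpack_mv` are hypothetical.

```python
from typing import Tuple

# Special register index -> signal it drives or latches (claim 32); reserved entries omitted
SPECIAL = {0: "cnt", 1: "entry", 2: "startx", 3: "starty", 4: "step_d2m",
           5: "start_d1m", 6: "step_d1m", 7: "start_dm9", 8: "step_dm9", 9: "MV",
           16: "done", 17: "sad0", 18: "sad1", 19: "sad2",
           20: "opMV0", 21: "opMV1", 22: "opMV2",
           23: "min0", 24: "min1", 25: "min2"}


def pack_mv(dy: int, dx: int) -> int:
    """Pack vertical/horizontal components into MV[11:6] and MV[5:0]."""
    for v in (dy, dx):
        if not -32 <= v <= 31:
            raise ValueError(f"component {v} does not fit in 6 signed bits")
    return ((dy & 0x3F) << 6) | (dx & 0x3F)


def unpack_mv(mv: int) -> Tuple[int, int]:
    """Inverse of pack_mv, sign-extending each 6-bit field (assumed signed)."""
    def s6(v: int) -> int:
        return v - 64 if v & 0x20 else v
    return s6((mv >> 6) & 0x3F), s6(mv & 0x3F)
```

The same 12-bit layout fits the 12-bit motion vector registers inside the minimum units, so opMV0 to opMV2 can be unpacked the same way.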
CN 00130074 2000-10-27 2000-10-27 Hierarchy programmable parallel video signal processor structure for motion estimation algorithm Expired - Fee Related CN1127264C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 00130074 CN1127264C (en) 2000-10-27 2000-10-27 Hierarchy programmable parallel video signal processor structure for motion estimation algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 00130074 CN1127264C (en) 2000-10-27 2000-10-27 Hierarchy programmable parallel video signal processor structure for motion estimation algorithm

Publications (2)

Publication Number Publication Date
CN1289212A true CN1289212A (en) 2001-03-28
CN1127264C CN1127264C (en) 2003-11-05

Family

ID=4593946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 00130074 Expired - Fee Related CN1127264C (en) 2000-10-27 2000-10-27 Hierarchy programmable parallel video signal processor structure for motion estimation algorithm

Country Status (1)

Country Link
CN (1) CN1127264C (en)

Also Published As

Publication number Publication date
CN1127264C (en) 2003-11-05

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20031105

Termination date: 20091127