JP3575991B2 - Orthogonal transform circuit

Info

Publication number: JP3575991B2
Application number: JP19755498A
Authority: JP
Inventors: 格北原; 正晃兵頭
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1998-07-13
Filing date: 1998-07-13
Publication date: 2004-10-13
Anticipated expiration: 2018-07-13
Also published as: JP2000029863A

Description

【０００１】
【発明の属する技術分野】
本発明は、画像情報，音声情報の高能率符号化技術の１つである直交変換に関し、より詳細には、直交変換を用いて時間領域の信号を周波数領域の信号に変換するための直交変換回路、及び時間領域の信号と周波数領域の信号との間の相互変換を行う直交変換回路に関するものである。
【０００２】
【従来の技術】
近年になり、画像情報，音声情報の高能率符号化技術が注目を集めている。中でも直交変換を実現する回路は高能率符号化方式実現のための重要な要素であり、その小規模化，高速化を目指して多くの研究，開発が行われている。
直交変換装置の１つである１次元ＤＣＴプロセッサ回路は、バタフライ演算回路と、乗算器を用いずにベクトル内積の積和を求める分布演算技術（ＤＡ法）を用いる演算回路とで実現できる。
かかる１次元ＤＣＴの演算例として、８点１次元ＤＣＴの演算式を式（１）及び式（２）に示す。
【０００３】
【数１】

【０００４】
ＤＣＴプロセッサにおけるバタフライ演算では、８個のデータｘ０，ｘ１，ｘ２，ｘ３，ｘ４，ｘ５，ｘ６，ｘ７に対して（ｘ０，ｘ７），（ｘ１，ｘ６），（ｘ２，ｘ５），（ｘ３，ｘ４）の４組をつくり、各組においてｘ０＋ｘ７，ｘ０−ｘ７，ｘ１＋ｘ６，ｘ１−ｘ６，ｘ２＋ｘ５，ｘ２−ｘ５，ｘ３＋ｘ４，ｘ３−ｘ４の加減算を行う。
【０００５】
次に、このバタフライ演算の出力を用いて、式（１），式（２）にもとづき１次元ＤＣＴ係数を算出する。
一例を示すと、式（１）のＤＣＴ係数Ｘ２は、積和演算により、
Ｘ２＝（１／２）×｛Ｂ×（ｘ０＋ｘ７）＋Ｃ×（ｘ１＋ｘ６）＋（−Ｃ）×（ｘ２＋ｘ５）＋（−Ｂ）×（ｘ３＋ｘ４）｝
と求めることができる。
ＤＡ法とは、式（１），式（２）における行列の積和演算を、図３（Ｄ）に示すような同じ桁のビットデータを集めたビットスライスという単位で行う手法である。ＤＡ法の一例を下記の式（５），式（６），式（７）及び表１に示し、これに従い説明を行う。
【０００６】
【数２】

【０００７】
【数３】

【０００８】
【表１】

【０００９】
上記式（５）に示すように、式（２）におけるバタフライ演算結果（ｘ０＋ｘ７，ｘ１＋ｘ６，ｘ２＋ｘ５，ｘ３＋ｘ４）をビット単位（［ｂ_０ｎ，…，ｂ_０１，ｂ_００］、［ｂ_１ｎ，…，ｂ_１１，ｂ_１０］、［ｂ_２ｎ，…，ｂ_２１，ｂ_２０］、［ｂ_３ｎ，…，ｂ_３１，ｂ_３０］：ｂ_３ｎ，…，ｂ_００はビットを表す）に分解する。これらの同一桁に注目すると式（５）の第２行のＤＣＴ係数Ｘ２を求める積和演算は、式（６）のように分解することができる。
ここで、式（６）におけるｂ_０ｎ，ｂ_１ｎ，ｂ_２ｎ，ｂ_３ｎの組をビットスライスと呼んでいる。
（ｎ＋１）桁目のビットスライスの積和演算は、上記式（７）で表わされ、ｂ_００〜ｂ_３ｎの取り得る値は０又は１なので、その結果は、（１／２）×（２のｎ乗）を掛け算すると、
（１／２）×（２のｎ乗）×（（Ｂ×ｂ_０ｎ）＋（Ｃ×ｂ_１ｎ）＋（−Ｃ×ｂ_２ｎ）＋（−Ｂ×ｂ_３ｎ））
となる。
従って、その結果は、表１に示すような、ビットスライスの組合せに対応する１６種類の部分和のテーブルを持てば積和演算値を求めることができる。
即ち、上式の中の（２のｎ乗）はシフト演算なので、式（６）の演算は、各ビットスライスの値をもとにテーブルを参照し得られた１６種類の値のいずれかをビットシフトした後、順次、加算することによって実現される。同様の処理を式（５）の各行で行うことで、ＤＡ法による行列の積和演算が実現され、１次元ＤＣＴ係数（Ｘ０，Ｘ２，Ｘ４，Ｘ６）を求めることができる。
【００１０】
一方、８点１次元ＩＤＣＴの演算式は、式（３）及び式（４）に示す式であり、入力としては、８個のデータ（Ｘ０，Ｘ１，Ｘ２，Ｘ３，Ｘ４，Ｘ５，Ｘ６，Ｘ７）が順次入力される。
まず入力データを式（３），式（４）に示すような（｛Ｘ０，Ｘ２，Ｘ４，Ｘ６｝と｛Ｘ１，Ｘ３，Ｘ５，Ｘ７｝）の二組にわけ、各組において入力データをビット単位に分解し、ビットスライスを構成する。
これらのビットスライスと、各ビットスライスの値の組合わせに対応したテーブルを用いてＤＡ法による行列の積和演算を行い、式（３），式（４）それぞれの右辺第１項及び第２項を求める。
次にこの２項の加減算を行うことでｘ０〜ｘ７が算出される。ＩＤＣＴにおけるバタフライ演算とはこの加減算を意味する。
【００１１】
特開平７−２３４８６４号公報には、バタフライ演算とＤＡ法を用いた小規模の１次元ＤＣＴ／ＩＤＣＴプロセッサの回路構成が記載されている。特開平７−２３４８６４号公報記載の回路は８×８画素に対して１次元ＤＣＴ／ＩＤＣＴを行う回路であり、入力要素をメモリに記載しておき、必要な順にデータを並び替えてバタフライ演算器に渡すことで、レジスタ数を減少させ回路の小規模化を実現するものである。
【００１２】
図２１に示す特開平７−２３４８６４号公報記載のＤＣＴ／ＩＤＣＴプロセッサにおいて、１８００はアドレス発生器であり、１８０１は８×８ワードの１６ビット幅のメモリであり、１８０２は１６ビット幅のパイプラインレジスタであり、１８０３は１６ビット幅の入力と３４ビット幅の出力を持つ積和演算器であり、１８０４は３４ビット幅のパイプラインレジスタであり、１８０５は１６ビット幅のパイプラインレジスタであり、１８０６は８×８ワードの１６ビット幅のメモリであり、１８０７は３４ビット幅の入力と１６ビット幅の出力を持つバタフライ演算器であり、１８０８は１６ビット幅のパイプラインレジスタであり、１８０９はアドレス発生器である。
【００１３】
ＤＣＴ変換の場合このプロセッサには、図１８に示すような８×８（Ｍビット／個）個の要素データの１行または１列からなる８個のデータｘ０，ｘ１，ｘ２，ｘ３，ｘ４，ｘ５，ｘ６，ｘ７がこの順序で入力され、そのデータをメモリ１８０１に一時記憶しておく。このメモリ１８０１から、アドレス発生器１８００によってバタフライ演算を行うときに必要になる順序、即ちｘ７，ｘ０，ｘ６，ｘ１，ｘ５，ｘ２，ｘ４，ｘ３の順にデータを読み出し、パイプラインレジスタ１８０２を介してバタフライ演算器１８０７に供給する。
バタフライ演算器１８０７は、供給されたデータを加減算するための３４ビット幅の第１と第２入力と、１６ビット幅の出力を持つ並列加算器を備えている。バタフライ演算器１８０７にはｘ７，ｘ０，ｘ６，ｘ１，ｘ５，ｘ２，ｘ４，ｘ３の順にデータが入力され、（ｘ０＋ｘ７），（ｘ０−ｘ７），（ｘ１＋ｘ６），（ｘ１−ｘ６），（ｘ２＋ｘ５），（ｘ２−ｘ５），（ｘ３＋ｘ４），（ｘ３−ｘ４）の順で演算が行われ、パイプラインレジスタ１８０８を介して積和演算器１８０３に出力される。
【００１４】
積和演算器１８０３は、シフトレジスタとビットスライスに対応する部分和のＲＯＭと、ＲＯＭから出力される値をシフトしながら加算する累算回路と３４ビット幅の第１入力と１６ビットの第２，第３入力と３４ビット幅の出力を持つ並列加算器を備えている。
積和演算器１８０３の動作を説明すると、入力されたバタフライ演算結果は、シフトレジスタによってビットスライスに分解される。そのビットスライスを入力アドレスとしたＲＯＭの出力値を求め、その値を累算回路に入力しビットスライス単位での累和演算を行うことで、ＤＡ法による積和演算を実現し、１次元ＤＣＴ係数Ｘ０，Ｘ２，Ｘ４，Ｘ６，Ｘ１，Ｘ３，Ｘ５，Ｘ７が求まる。
演算結果はパイプラインレジスタ１８０４，１８０５を介してメモリ１８０６に渡される。このメモリ１８０６から、アドレス発生器１８０９によって適切な順序，即ちＸ０，Ｘ１，Ｘ２，Ｘ３，Ｘ４，Ｘ５，Ｘ６，Ｘ７に並び替えられ外部に出力される。
【００１５】
【発明が解決しようとする課題】
上記のとおり、特開平７−２３４８６４号公報のＤＣＴ／ＩＤＣＴプロセッサは、入出力データの並びかえを行うためのメモリやアドレス発生器が必要であり、回路規模が大きくなる。例えば入力データが１６ビットの場合、８×８ワードで各ワードが１６ビット幅であれば、１Ｋビットの容量が必要になる。
また、バタフライ演算器１８０７において、入力データの加減算をデータ単位で行うため、並列加算器が大規模になるという問題がある。従来例ではＤＣＴ演算とＩＤＣＴ演算で用いる並列加算器を共有しており、ＩＤＣＴ演算のうちの積和演算結果が３４ビットの場合、並列加算器では３４ビットの加算を行う必要があるということになる。
本発明は、このような従来技術の問題点に鑑みてなされたものであり、バタフライ演算及び積和演算を行う手段を従来よりも小規模とする直交変換回路及び双方向の動作を行うことを可能とする当該直交変換回路を提供することをその目的とする。
【００１６】
【課題を解決するための手段】
上記目的を達成するために、本発明の直交変換回路では、バタフライ演算部においては、Ｎ個の要素を格納するためのＮ個のパラレルレジスタとＮ個のシフトレジスタとＮ個のＫビット並列加算器を備える。パラレルレジスタは、入力要素データの並び替えを行いシフトレジスタに渡す。シフトレジスタは、渡された要素データを２個１組にしてＫビットずつ並列加算器に出力する。並列加算器は、Ｋビットに分解されて渡された要素データの加算または減算を行う。これにより一時記憶用のメモリやそのメモリからの読み出し順序を制御するアドレス発生器の必要がなくなり、バタフライ演算部で、要素データの並び替え及び加減算をビット単位で行う（入出力がビット幅）ことにより小規模な回路でバタフライ演算を可能にし、直交変換回路の小規模化が可能になる。
そして、各請求項記載の発明は次の技術手段により構成される。
【００１７】
請求項１記載の発明は、入力されるＮ個の要素データに対する直交変換処理をバタフライ演算部と積和演算部を備えるプロセッサにより行う直交変換回路において、前記バタフライ演算部は、入力されるＮ個の要素データを要素データ毎に用意したレジスタに格納するパラレルレジスタ群と該パラレルレジスタ群の各レジスタに接続し該レジスタからのＮ個の要素データを要素データ毎に用意したレジスタに格納するシリアルレジスタ群と該シリアルレジスタ群の各レジスタに接続し該レジスタからのデータを加算するＮ個のＫビット加算器で構成され、前記パラレルレジスタ群は、入力されるＮ個の要素データの順序を変換し、前記シリアルレジスタ群は、前記パラレルレジスタ群によって順序が変換された要素データについて下位ビットから順にＫビットずつシリアルにＮ個の要素データ各々を同時に出力し、前記Ｋビット加算器は、前記シリアルレジスタ群から出力されるＫビット毎の前記Ｎ／２組の要素データの加算及び減算を行い、この結果得られるキャリーを保存しこの保存したキャリーを用いて、下位から上位へ順次該加算及び減算を行うものとし、前記積和演算部は、前記バタフライ演算部で求められた加算及び減算結果として下位ビットから順に同時に出力されるＮ／２組の要素データの各組における同一桁のビットスライスを入力とし、該入力ビットスライスの各桁毎の部分和を加算するものとしたことを特徴とする。
【００１９】
請求項２記載の発明は、請求項１に記載のプロセッサをＤＣＴプロセッサとしたことを特徴とする。
【００２０】
請求項３記載の発明は、請求項１または２に記載のプロセッサを各次元に対応させることにより多次元の直交変換処理を行うことを特徴とする。
【００２１】
請求項４記載の発明は、入力されるＮ個の要素データに対する直交変換処理及び／又は逆直交変換処理を、第１のバタフライ演算部と積和演算部と第２のバタフライ演算部を備えるプロセッサにより行う直交変換回路において、前記第１のバタフライ演算部は、入力されるＮ個の要素データを要素データ毎に用意したレジスタに格納するパラレルレジスタ群と該パラレルレジスタ群の各レジスタに接続し該レジスタからのＮ個の要素データを要素データ毎に用意したレジスタに格納するシリアルレジスタ群と該シリアルレジスタ群の各レジスタに接続し該レジスタからのデータを加算するＮ個のＫビット加算器で構成するものであって、前記パラレルレジスタ群は、入力されるＮ個の要素データの順序を変換し、前記シリアルレジスタ群は、前記パラレルレジスタ群によって順序が変換された要素データについて下位ビットから順にＫビットずつシリアルにＮ個の要素データ各々を同時に出力し、前記Ｋビット加算器は、直交変換処理を行う場合には前記シリアルレジスタ群から出力されるＫビット毎の前記Ｎ／２組の要素データの加算及び減算を行い、この結果得られるキャリーを保存しこの保存したキャリーを用いて、下位から上位へ順次該加算及び減算を行い、逆直交変換処理を行う場合には該Ｋビット加算器を機能させずに前記シリアルレジスタ群の出力を直接出力するものとし、前記積和演算部は、直交変換処理を行う場合には前記第１のバタフライ演算部で求められた加算及び減算結果として下位ビットから順に同時に出力されるＮ／２組の要素データの各組における同一桁のビットスライスの下位ビットから順に同時に出力されるＮ／２組の要素データの各組における同一桁のビットスライスを入力とし、逆直交変換処理を行う場合にはＮ個の要素データを２組に分け得た各組に含まれるＮ／２個の要素データ各々の下位ビットから順に同時に出力される同一桁のビットスライスを入力とし、該入力ビットスライスの各桁毎の部分和を加算するものとし、前記第２のバタフライ演算部は、加算器を備え、該加算器は逆直交変換処理を行う場合には前記積和演算部における加算結果である前記入力ビットスライスの各桁毎の部分和を入力としてその加算及び減算を時分割で行うことを特徴とする。
【００２３】
請求項５記載の発明は、請求項４に記載のプロセッサをＤＣＴ／ＩＤＣＴプロセッサとしたことを特徴とする。
【００２４】
請求項６記載の発明は、請求項４または５に記載のプロセッサを各次元に対応させることにより多次元の直交変換処理を行うことを特徴とする。
【００２５】
【発明の実施の形態】
本発明の実施の形態について添付図を参照し、以下に説明する。
図１に、本発明による直交変換回路の一実施形態である２次元ＤＣＴプロセッサの概要をブロック図にて示し、そのプロセッサの具体的な動作を図２乃至図７及び図１８に基づき以下に説明する。
図１において、直列に接続されるバタフライ演算部１２，１６と積和演算部１３，１７、四捨五入部１４，１８それぞれで１次元ＤＣＴを行い、二つの１次元ＤＣＴ回路で２次元のＤＣＴを実現する。
図１に示す２次元ＤＣＴプロセッサの特徴は、バタフライ演算部１２，１６における入力画素データをバタフライ演算部１２，１６の内部に設けたレジスタに格納し、順序変換を行い、変換後そこからデータを２ビットずつ出力することで、加減算を２ビット単位で行うことである。この処理により、データ入力時にメモリやアドレス発生器を必要とせず、また、従来例ではデータビット幅分の入力が必要であった並列加減算の規模も入出力が２ビット幅と小さくおさえることが可能になる。
【００２６】
以下、実施形態の動作を説明する。図１の２次元ＤＣＴプロセッサのバタフライ演算部１２は、並び替え部１００と加減算部１０１で構成される。並び替え部１００では、入力された画素データの順序を入れ換え、各画素データの下位から２ビットずつを加減算部１０１に渡す。
加減算部１０１では、渡された２ビット単位の画素データについて加算及び減算を行うことで、バタフライ演算を実現する。その演算結果をビットスライス単位で積和演算部１３に渡す。積和演算部１３では水平方向の１次元ＤＣＴを行い、求まったＤＣＴ係数の中間結果が、四捨五入部１４に渡される。
四捨五入部１４では、式（１），式（２）中の右辺の（１／２）の乗算を四捨五入を用いて行い、演算結果をＲＡＭ１５に渡す。ＲＡＭ１５は１５ビット幅の入出力を持つデータ転置用のＲＡＭであり、入力されたデータの転置を行いバタフライ演算部１６にデータを渡す。
バタフライ演算部１６はバタフライ演算部１２と同様にデータの並べ替え及び２ビット毎の加減算を行い、ビットスライスを積和演算部１７に出力する。
以降、積和演算部１７，四捨五入部１８は各々積和演算部１３，四捨五入部１４と同じ処理を行う。ＲＡＭ１５をはさんで、まず水平方向の１次元ＤＣＴを行い、次に垂直方向の１次元ＤＣＴを行うことで、２次元ＤＣＴを完了する。
【００２７】
図１における２次元ＤＣＴプロセッサの各ブロックの動作を更に詳しく説明する。バタフライ演算部１２への入力画素データは図１８に示すような８×８マトリックスで、各画素データは９ビットであるとする。
ここでは、下記式（１），式（２）に基づき、第１行の入力画素データｘ０，ｘ１，ｘ２，ｘ３，ｘ４，ｘ５，ｘ６，ｘ７から、ＤＣＴ係数Ｘ０，Ｘ１，Ｘ２，Ｘ３，Ｘ４，Ｘ５，Ｘ６，Ｘ７を求める１次元ＤＣＴの処理手順について述べる。
【００２８】
【数４】

【００２９】
図２にバタフライ演算部１２の構成を示す。バタフライ演算部１２は、大きくは、並び替え部１００と加減算部１０１にわかれている。並び替え部１００は、９ビット幅の画素データ入力と、８個の画素データを２ビット単位に分割した１６ビット幅の出力（８画素データ×２ビット）を持ち、加減算部１０１は、並び替え部１００から渡される１６ビット幅の入力と入力データの加算結果の加算ビットスライス２４１（２ビット×４加算）と、減算結果の減算ビットスライス２４２（２ビット×４減算）の計１６ビット幅の出力を持つ。
【００３０】
並び替え部１００は、９ビットのレジスタ（Ｒ）２００〜２０７と図３（Ａ）に示すような９ビットのパラレルイン、２ビット毎のシリアルアウトのシフトレジスタＳＲ２１０〜２１７とで構成される。
加減算部１０１は１ビットのレジスタ２２０〜２２７と全加算器２３０〜２３７を備え、その単位要素を図３（Ｂ）に示すように、上記シフトレジスタ２１０，２１１からの２ビット幅の第１，第２入力と全加算器２３０の下位からのキャリー・アウトを入力とする１ビットのレジスタ２２０の１ビットの桁上がりをキャリー・インとする第３入力とを加算し、加算結果を出力する２ビット幅の第１出力と、キャリー・アウトを出力する１ビットの第２出力を持つ全加算器２３０〜２３７とで構成する。
【００３１】
図６は、バタフライ演算部１２の動作を示すタイミング図で、入出力及び演算部の各要素におけるデータの時間的変化を示す。バタフライ演算部１２に画素データｘ０，ｘ１，ｘ２，ｘ３，ｘ４，ｘ５，ｘ６，ｘ７が、この順序で１クロック毎に読み込まれ（図６（ａ））、クロックタイミングＴ１〜Ｔ４の期間は１クロック毎に入力画素データをレジスタ２０６から入力しレジスタ２０４，２０２，２００に順送りし、保存する（図６（ｈ），図６（ｆ），図６（ｄ），図６（ｂ））。
次にＴ５〜Ｔ８の期間は入力画素データをレジスタ２０１から入力し、レジスタ２０３，２０５，２０７に順送りし、保持する（図６（ｃ），図６（ｅ），図６（ｇ），図６（ｉ））。この動作を繰り返すことで８クロック毎にレジスタ２００〜２０７全てが画素データで満たされる。
【００３２】
レジスタ２００〜２０７全てがデータで満たされると、以降８クロック毎に、レジスタ２００〜２０７に保持されたデータを各々シフトレジスタ２１０〜２１７に受け渡す（図６（ｊ）〜図６（ｑ））。
シフトレジスタ２１０からは、１クロック毎に格納されているデータの下位から２ビットずつが出力され、全加算器２３０，２３１に渡される。全加算器２３０はバタフライ演算における加算を、全加算器２３１はバタフライ演算における減算を行う。
シフトレジスタ２１１からも１クロック毎に格納されているデータの下位から２ビットずつが出力され、該データが全加算器２３０に、該データを反転したものが全加算器２３１に渡される。
レジスタ２２０，２２１は、最下位２ビットの演算前に初期値がセットされ、それ以降は演算結果のキャリーを保持する。レジスタ２２０の初期値は０、レジスタ２２１の初期値は１である。
ここで全加算器２３１について、シフトレジスタ２１１からの入力を反転し、かつレジスタ２２１に初期値として１を設定するのは、シフトレジスタ２１１からの入力を２の補数表現で負の数とし、全加算器２３１を減算器として動作させるためである。
全加算器２３０，２３１では２ビットずつ全加算が行われ、（ｘ０＋ｘ７）と（ｘ０−ｘ７）が下位から順に２ビットずつ求められる。
【００３３】
（ｘ０＋ｘ７）について図３（Ｂ），図３（Ｃ），図３（Ｄ）を用いて説明する。図３（Ｂ）は図２のバタフライ演算部のうち、（ｘ０＋ｘ７）の演算に対応する部分を抜き出したものである。図６に示すように、Ｔ９〜Ｔ１６の期間、シフトレジスタ２１０にはｘ０が、シフトレジスタ２１１にはｘ７が格納される。ここで一例としてｘ０＝１１０１０１０００，ｘ７＝１１１００１１１１とする。
シフトレジスタ２１０，２１１に格納されているｘ０，ｘ７は、各々２ビットずつシフトを行いながら、２ビットずつ出力を全加算器２３０に与える。全加算器２３０はキャリーレジスタ２２０を用いながら２ビットの加算を繰り返す。（ｘ０＋ｘ７）は１０ビットであるから加算は計５回の繰り返しで終了する。
【００３４】
この様子を図３（Ｃ）に示す。タイミングＴ９からＴ１３の５回の計算で（ｘ０＋ｘ７）の結果が下位から順に２ビットずつ出力されることになる。これは図２における出力２６０に相当する。他のバタフライ演算の加算結果２６１〜２６３，及び減算結果２５０〜２５３についても同様の演算が行われる。
図２の出力２６０〜２６２に注目すると、図３（Ｄ）のようになる。バタフライ演算の加算部分の計算結果は、図３（Ｄ）に示すように、２６０〜２６３の下位から順に２ビットずつビットスライスで出力される。またバタフライ演算の減算部分の計算結果は２の補数表現で負の数とするための反転処理があること以外は、上記と同様の処理を行い、２５０〜２５３の下位から順に２ビットずつのビットスライスで結果が出力されることになる。
このタイミングを図６（ｒ）〜図６（ｕ）に示す。まず下位２ビットに対応する加算ビットに対応するビットスライスがＴ９からＴ１３の期間出力される。その後Ｔ１４〜Ｔ１６の期間，ダミーデータをはさんで、次の入力データ組の出力に移るサイクルを繰り返す。
【００３５】
図４に示す前記積和演算部１３（なお、積和演算部１７についても同様である）は２つの４ビット幅の入力と、各々１６ビット幅の第１，第２出力を持ち、ＤＣＴ係数の部分和を出力するＲＯＭを含み、４ビット幅の第１，第２入力と演算結果出力として１６ビット幅の出力を持つ累積加算部４００〜４０７と、１６ビット幅の第１，第２，第３，第４入力と出力を持つマルチプレクサ４１２，４１３と各々二つの４ビット幅の入出力を持ち、２クロックのディレイを生じさせる遅延部４２０〜４２５とで構成される。
【００３６】
積和演算部１３の中では、バタフライ演算部１２から渡される１６ビットのビットスライスのうち、（ｘ０＋ｘ７，ｘ１＋ｘ６，ｘ２＋ｘ５，ｘ３＋ｘ４）の加算ビットスライス２４１は累積加算部４００〜４０３に、（ｘ０−ｘ７，ｘ１−ｘ６，ｘ２−ｘ５，ｘ３−ｘ４）の減算ビットスライス２４２は累積加算部４０４〜４０７に入力される。ここで加算，減算ビットスライス各々を下位ビットスライスと上位ビットスライスに分けて表現する。下位上位各々のビットスライスは図３（Ｄ）に示すように４ビットからなる。
【００３７】
積和演算部を構成する累積加算部の単位要素部を図５に示す。
図５に示す累積加算部４０１において、ＲＯＭ５００とＲＯＭ５０１は同一内容のデータを保持している。ＲＯＭ５００は下位ビットスライスを４ビット幅のアドレスとして入力し、ＲＯＭ５０１は上位ビットスライスを４ビット幅のアドレスとして入力する。
これらのＲＯＭ５００及びＲＯＭ５０１各々には、表１に示した、ＤＣＴ演算のためのビットスライス値に対応したコサイン行列に基づく部分和１６種が記憶されている。５０２は１６ビット幅の第１入力５２０と１７ビット幅の第２入力５２１を加算して１８ビット幅の出力をする加算器であり、５０３は１８ビットの第１入力と、１６ビット幅の第２入力を加算して１８ビット幅の出力をする加算器であり、５０４は１８ビット幅の第１，第２入力とそれらの上位１６ビットを出力するマルチプレクサであり、５１１は１８ビット幅のレジスタであり、５１２は１６ビット幅のレジスタであり、５１３はＲＯＭ５０１から出力される部分和を１ビット左にシフトする回路である。
【００３８】
また、図４の累積加算部４００，４０２〜４０７も累積加算部４０１とはＲＯＭに保持されるデータ内容が異なるだけで、その他は全く同一の回路である。累積加算部４００では式（１）のＤＣＴ係数Ｘ０に対応する演算を行い、以下累積加算部４０１はＸ２，４０２はＸ４，４０３はＸ６の演算を、累積加算部４０４では上記式（２）のＤＣＴ係数Ｘ１に対応する演算を行い、以下４０５はＸ３，４０６はＸ５，４０７はＸ７の演算を行う。
また、遅延器４２０〜４２５で入力ビットスライスを遅延させるのは、演算結果を１クロックあたり１係数とするためである。
【００３９】
累積加算部の一例として累積加算部４０１の動作について図５，図７及び下記式（５），式（６）を用いて説明する。
累積加算部４０１では、下記式（５），下記式（６）に示すＤＣＴ係数Ｘ２を算出するために、式（５）の右辺の４×４マトリクスの第２行と４×１マトリクスの積和演算をビットスライス単位の加算で実現する。
【００４０】
【数５】

【００４１】
まず、Ｔ９〜Ｔ１３の期間中、１クロック毎にビットスライスが２ビットずつ入力される（図７（ａ））。下位ビットスライス｛ｂ_００，ｂ_１０，ｂ_２０，ｂ_３０｝はＲＯＭ５００に入力され、このビットスライスに対応した、式（５）右辺の４×４マトリクスの第２行と４×１マトリクスの積和演算の部分和５２０が順次索引される（表１参照）。
次に上位のビットスライス｛ｂ_０１，ｂ_１１，ｂ_２１，ｂ_３１｝はＲＯＭ５０１に入力され、このビットスライスに対応した、式（５）右辺の４×４マトリクスの第２行と４×１マトリクスの積和演算の部分和５２１が順次索引される（表１参照）。
そして、部分和５２１は部分和５２０より１ビット上位にあるため、部分和５２１を左へ１ビットシフトした値、つまり２の乗算を行った値を部分和５２０と加算する（部分和５２０＋部分和５２１×２）。
以上の演算を全ビットスライスに対して行うことで、ＤＣＴ係数Ｘ２を求める。ただし最上位の符号ビットのビットスライスに対応する部分和が入力されたときは、２の補数表現の変換手順に従い、左へ１ビットシフトした値の符号を反転させて加算する（部分和５２０＋（−（部分和５２１×２）））。
【００４２】
加算器５０２による加算結果は、レジスタ５１１を介して加算器５０３によって、既に計算されたより下位にあるビットスライスの結果と累積加算する。この累積加算結果のビット数を出力として要求されているビット精度に落とす。
本実施例の場合では、上位１６ビットがレジスタ５１２に格納された後、１６ビットが出力される（図７（ｂ））。また同時に次の上位ビットの加算結果と累積加算するために、加算器５０３に入力する。
この結果、ビットスライス入力が完了した２クロック後に演算結果が求められ、最上位ビットスライスの入力の２クロック後には、上記式（１）の右辺の４×４マトリクスの第２行と４×１マトリクスの積和演算結果（式（１）左辺のＸ２に対応）が出力される（図７（ｃ））。
【００４３】
同様の演算を累積加算部４００，４０２，４０３，４０４，４０５，４０６，４０７で行う。各々、累積加算部４００で式（１）の右辺の４×４マトリクスの第１行と４×１マトリクスの積和演算結果（式（１）左辺のＸ０に対応）を、同４０２で式（１）の右辺の４×４マトリクスの第３行と４×１マトリクスの積和演算結果（式（１）左辺のＸ４に対応）を、同４０３で式（１）の右辺の４×４マトリクスの第４行と４×１マトリクスの積和演算結果（式（１）左辺のＸ６に対応）を求め、２クロック毎にマルチプレクサ４１２によって、タイミングＴ１５〜Ｔ１６の期間は第１行，Ｔ１７〜Ｔ１８の期間は第２行，Ｔ１９〜Ｔ２０の期間は第３行、Ｔ２１〜Ｔ２２の期間は第４行の順に選択され、積和４３１として四捨五入部１４（図１参照）に渡す（図７（ｃ））。
【００４４】
同様にして累積加算部４０４で式（２）の右辺の４×４マトリクスの第１行と４×１マトリクスの積和演算結果（式（２）左辺のＸ１に対応）を、同４０５で式（２）の右辺の４×４マトリクスの第２行と４×１マトリクスの積和演算結果（式（２）左辺のＸ３に対応）を、同４０６で式（２）の右辺の４×４マトリクスの第３行と４×１マトリクスの積和演算結果（式（２）左辺のＸ５に対応）を、同４０７で式（２）の右辺の４×４マトリクスの第４行と４×１マトリクスの積和演算結果（式（２）左辺のＸ７に対応）を求め、タイミングＴ１５〜Ｔ１６の期間は第１行，Ｔ１７〜Ｔ１８の期間は第２行、Ｔ１９〜Ｔ２０の期間は第３行、Ｔ２１〜Ｔ２２の期間は第４行を選択するように２クロック毎にマルチプレクサ４１３を切替えて、積和４３２として四捨五入部１４に渡す（図７（ｃ））。
【００４５】
図７（ｃ）に示すように、累積加算部４０１，４０５には、累積加算部４００，４０４よりも２クロック遅延してビットスライスが入力されるので、演算結果が求まるのも２クロック後となる。同様に累積加算部４０２と４０６、４０３と４０７でも各々２クロック遅延して演算結果が出力される。
前記四捨五入部１４において、式（１），（２）の右辺の（１／２）の乗算を行い結果を四捨五入する。
以上の処理を行うことで、水平方向の１次元ＤＣＴが完了する。
【００４６】
ＲＡＭ１５では、図２０に示すように、横向きラスター順に入力されたデータを縦向きラスター順に出力する。すなわち、書き込みと読み出しの順を横から縦に変更し、バタフライ演算部１６に渡す。
垂直方向のＤＣＴは水平方向のＤＣＴと基本的に同じ動作を行う。ただし本実施例では、バタフライ演算部１６に入力されるデータは、前述の積和演算部１３の出力ビット精度を１６ビットとし、それをさらに四捨五入部１４で１／２にした関係上、１５ビットデータになり、水平方向のＤＣＴの９ビット幅の入力とは異なる。また出力データも、ＤＣＴ係数のビット精度を１２ビットとしているので１２ビット幅となり、水平方向のＤＣＴとは異なる。
以上の処理を行うことにより２次元ＤＣＴを実現する。
【００４７】
次に、本発明による直交変換回路の一実施形態であるＤＣＴ／ＩＤＣＴプロセッサの概要ブロック図を図８にて示し、そのプロセッサの具体的な動作を図９乃至図１７，図１９に基づき以下説明する。
ここに、図８の２次元ＤＣＴ／ＩＤＣＴプロセッサは図１の２次元ＤＣＴプロセッサにＩＤＣＴのための回路を付加したものであり、ＤＣＴモードの場合は、前記ＤＣＴプロセッサと同じ処理が行われるため、上述したものと同一符号を付しその説明は省略し、ＩＤＣＴモードの場合の説明のみを行う。
図８に示すＩＤＣＴプロセッサの特徴は図１９に示す入力ＤＣＴ係数データＸ０，Ｘ１，Ｘ２，Ｘ３，Ｘ４，Ｘ５，Ｘ６，Ｘ７を並び替え部８２０に設けたレジスタに格納し、それらからデータを２ビットずつ出力する処理により、メモリやアドレス発生器を必要としないデータ入力を実現することである。また、後段のバタフライ演算部８０４においても、下位ビットから順次加算を行うため、従来例に比べて小さい加算器を備える点に特徴がある。
【００４８】
図８に示す双方向ＤＣＴ／ＩＤＣＴプロセッサでは、下記式（３），（４）に示す演算も行われる。
この演算動作について説明すると、並び替え部８２０では、入力データのビットスライス化のみを行い、このビットスライスが積和演算部８０３に渡される。積和演算部８０３では、垂直方向の１次元ＩＤＣＴを行い、それによって求まったＩＤＣＴの中間結果が、バタフライ演算部８０４に渡される。バタフライ演算部８０４においては、式（３）及び（４）の右辺の加算及び減算と、（１／２）の乗算を四捨五入を用いて行い、演算結果をＲＡＭ１５に渡す。ＲＡＭ１５では、垂直方向ラスター順に入力されたデータの転置を行い水平方向ラスター順に並び替え部８２２にデータを渡す。以降、バタフライ演算部８０８において最終的に画素データとして出力するときの順序を並び替えること以外は、垂直方向の１次元ＩＤＣＴと同じ動作を行い、バタフライ演算部８０８から画素データが出力される。ＲＡＭ１５をはさんで、まず垂直方向の１次元ＩＤＣＴを行い、次に水平方向の１次元ＩＤＣＴを行うことで、２次元ＩＤＣＴを実行する。
【００４９】
図８に示す２次元ＤＣＴ／ＩＤＣＴプロセッサの各ブロックの動作をさらに詳しく述べる。
ＩＤＣＴモードの場合、入力要素データは、図１９に示す８×８マトリクスで各係数は１２ビットであり、該マトリクスの第１列がＸ０，Ｘ１，Ｘ２，Ｘ３，Ｘ４，Ｘ５，Ｘ６，Ｘ７であるとする。ここでは、下記式（３），式（４）に基づき、入力データＸ０，Ｘ１，Ｘ２，Ｘ３，Ｘ４，Ｘ５，Ｘ６，Ｘ７から、画素値ｘ０，ｘ１，ｘ２，ｘ３，ｘ４，ｘ５，ｘ６，ｘ７を求めることとし、その処理手順について述べる。
【００５０】
【数６】

【００５１】
図９に示すバタフライ演算部８０２の並べ替え部８２０は、図２で示した並び替え部１００に９ビット幅の第１入力と１２ビット幅の第２入力のうちどちらかを選択するマルチプレクサ９００〜９０７を付加した構造となり、加減算部８２１は、図２で示した加減算部１０１に、２ビット幅の第１，第２入力からどちらかを選択／出力するマルチプレクサ９１０〜９１７を付加した構造となる。
【００５２】
図１５は、ＩＤＣＴモードの場合のバタフライ演算部８０２の動作を示すタイミング図である。バタフライ演算部８０２にＸ０，Ｘ１，Ｘ２，Ｘ３，Ｘ４，Ｘ５，Ｘ６，Ｘ７が、この順序で１クロック毎に読み込まれ（図１５（ａ））、タイミングＴ０〜Ｔ７の期間は１クロック毎に入力画素データをレジスタ２０７から入力し、レジスタ２０６，２０５，２０４，２０３，２０２，２０１，２００に順送りする（図１５（ｉ）〜図１５（ｂ））。
レジスタ２００〜２０７全てがデータで満たされると、以降８クロック毎にレジスタ２００〜２０７に保持されたデータを各々シフトレジスタ２１０〜２１７に受け渡す（図１５（ｊ）〜図１５（ｑ））。各シフトレジスタ２１０〜２１７は１クロック毎に格納しているデータの下位から２ビットずつを加減算部８２１に渡す。
【００５３】
加減算部８２１では、ＩＤＣＴモードの場合、加算器２３０〜２３７は用いない。つまり、マルチプレクサ９１０〜９１７によって並び替え部８２０から渡されたＸ０〜Ｘ７のビットスライスを選択し、Ｘ０，Ｘ２，Ｘ４，Ｘ６を下位から２ビットずつまとめたビットスライスＡと、Ｘ１，Ｘ３，Ｘ５，Ｘ７を下位から２ビットずつまとめたビットスライスＢとを積和演算部８０３に出力する（図１５（ｒ）〜図１５（ｕ））。
【００５４】
図１０及び図１１に示す積和演算部８０３は図４で示した積和演算部１３の累積加算部４００〜４０７に一部変更を加えた累積加算部１０００〜１００７と、さらに２ビット幅の第１，第２入力と１ビット幅の第３，第４出力を持つ加減算器１００８〜１０１５と、１ビット幅の第１，第２，第３，第４入力と出力を持つマルチプレクサ１０４２，１０４３を付加した構造となる。
該積和演算部８０３の中では、並び替え部８２０から渡されるビットスライスのうち｛Ｘ０，Ｘ２，Ｘ４，Ｘ６｝のビットスライスＡは累積加算部１０００，１００１，１００２，１００３に、｛Ｘ１，Ｘ３，Ｘ５，Ｘ７｝のビットスライスＢは累積加算部１００４，１００５，１００６，１００７に入力される。
【００５５】
図１２に示す累積加算部１０００は、上述した実施形態におけるＤＣＴの場合で説明した累積加算部４０１と基本的には同じ動作を行う。
図１６に示すタイミングチャートを用いてこの動作の説明をする。まず、タイミングＴ９〜Ｔ１４の期間中、１クロック毎に２ビットずつビットスライスが入力される（図１６（ａ））。２ビットのうち下位ビットスライスがＲＯＭ１１００に、上位ビットスライスがＲＯＭ１１０１に入力され、各々のビットスライスに対応した部分和５２０，部分和５２１が順次索引される。
各部分和を加算し、加算器５０３でより下位のビットスライスの結果と累積加算し累積加算中間結果を求めるまでは、ＤＣＴの場合と同じである。ただし、レジスタ１１１２において累積加算中間結果のビット精度を落とす際、その切り落としたビット（本実施例の場合は下位２ビット）を上記式（３）右辺の加算を行うために加算器１００８に、上記式（４）右辺の減算を行うために加算器１０１２に出力する点が異なる（図１６（ｂ））。
【００５６】
同様の演算を累積加算部１００１，１００２，１００３で行い、式（３），式（４）右辺の第１項の各行を下位から２ビットずつ求め、累積加算部１００４，１００５，１００６，１００７によって式（３），式（４）右辺の第２項の各行を下位から２ビットずつ求める。
累積加算部１０００と累積加算器１００４の計算結果より、加算器１００８では式（３）右辺の加算を、１０１２では式（４）右辺の減算を下位から２ビットずつ行い、そのキャリーを加算キャリーまたは減算キャリーとして出力する。
このキャリーが有効となるタイミングは、最上位のビットスライスによる累積加算結果が累積加算部１０００及び累積加算部１００４から出力される時、即ちＴ１６〜Ｔ１７である（図１６（ｃ））。
同様にして、加算器１００９〜１０１１で式（３）右辺の加算を行った時の各行の加算キャリー１０３２を、加算器１０１３〜１０１５で式（４）右辺の減算を行ったときの各行の減算キャリー１０３３を求め、式（３），式（４）の第１行〜第４行の順でバタフライ演算部８０４に渡す（図１６（ｃ））。
【００５７】
図１３に示すバタフライ演算部８０４は、式（３）右辺の加算、式（４）の右辺の減算を行い、各々１６ビット幅の第１，第２入力と１ビット幅の第３，第４入力と１６ビット幅の出力を持つ。
バタフライ演算部８０４において、１２００は１６ビット幅の第１，第２入力を持ち、それらを加算して１７ビット幅の出力をする加算器であり、１２０１は１７ビット幅の第１入力と、１ビット幅の第２入力を持ち、それらを加算して１６ビットの出力をする加算器であり、１２０４は１７ビット幅、１２０５は１６ビット幅のそれぞれ入出力を持つレジスタであり、１２０２は１ビット幅の第１，第２入力から１ビット幅の出力を選択し、１２０３は１６ビット幅の第１，第２入力から１６ビット幅の出力を選択するマルチプレクサである。
【００５８】
バタフライ演算部８０４の中では、１段目の加算器１２００で、積和演算部８０３から入力された１６ビット幅の２入力と１ビット幅の加算もしくは減算キャリーを用いて式（３），式（４）右辺の第１項と第２項の加算及び減算を交互に行い、１７ビット幅のデータｘ０，ｘ１，ｘ２，ｘ３，ｘ４，ｘ５，ｘ６，ｘ７を算出し、出力する。
加算の場合は右辺第１項の積和１０３０と、マルチプレクサ１２０３によって選択された右辺第２項の積和１０３１と、マルチプレクサ１２０２によって選択された加算キャリー１０３２を用いて、第１行から順に加算を行い、減算の場合は積和１０３０と、マルチプレクサ１２０３によって選択された積和１０３１を反転したデータと、マルチプレクサ１２０２によって選択された減算キャリー１０３３を用いて、第１行から順に減算を行う。
図１６（ｃ）で示した通り、期間Ｔ１６，Ｔ１７に累積加算部１０００，１００４の出力が有効になり、２出力の加算で式（３）のｘ０，減算で式（４）のｘ７が求められる。同様にして順にｘ１，ｘ６，ｘ２，ｘ５，ｘ３，ｘ４が求められる。
【００５９】
加算及び減算により求められた、１７ビット幅のデータｘ０，ｘ１，ｘ２，ｘ３，ｘ４，ｘ５，ｘ６，ｘ７はレジスタ１２０４を介して加算器１２０１に渡される。２段目の加算器１２０１では式（３），（４）右辺の（１／２）の乗算を四捨五入を用いて行い、上位１６ビットがレジスタ１２０５に格納され、１６ビット幅のデータとしてＲＡＭ１５に出力される。
以上の処理を行うことで、垂直方向の一次元ＩＤＣＴが完了する。
【００６０】
ＲＡＭ１５の中では、図２０に示すように、垂直方向ラスター順に入力されたデータの転置を行い、水平方向ラスター順にバタフライ演算部８０６にデータを渡す。バタフライ演算部８０６の中では、バタフライ演算部８０２と同様の演算が行われ、Ｘ０，Ｘ２，Ｘ４，Ｘ６を下位から２ビットずつまとめたビットスライスＡと、Ｘ１，Ｘ３，Ｘ５，Ｘ７を下位から２ビットずつまとめたビットスライスＢとを積和演算部８０７に出力する（図１５（ｒ）〜図１５（ｕ））。
【００６１】
水平方向のＩＤＣＴは垂直方向のＩＤＣＴと基本的に同じ動作を行う。ただし図１４に示すバタフライ演算部８０８において、２段目の加算器１３００では、出力の画素データのビット精度を９ビットとしているので、１３ビット幅の入力を四捨五入して９ビット幅にする点が異なる。
また、加算器１３００の出力がバタフライ演算部８０４の加算器１２０１の出力と同様に、ｘ０，ｘ７，ｘ１，ｘ６，ｘ２，ｘ５，ｘ３，ｘ４順になるのをレジスタ１３０２，レジスタ１３０３，レジスタ１３０４，レジスタ１３０５及びマルチプレクサ１３０６を用いて出力画素データをｘ０，ｘ１，…，ｘ６，ｘ７の順序に並び替える動作が付加されている。
【００６２】
図１７に示すバタフライ演算部８０８のタイミングチャートを用いてその動作を説明する。レジスタ１３０２には、ｘ０，ｘ７，ｘ１，ｘ６，ｘ２，ｘ５，ｘ３，ｘ４が、この順序で、１クロック毎に加算器１３００から渡される（図１７（ａ））。
レジスタ１３０３には、タイミングＴ１９〜Ｔ２４の期間中は、ｘ０，ｘ７，ｘ１，ｘ６，ｘ２，ｘ５が、この順序で、１クロック毎にレジスタ１３０２から渡され、Ｔ２４〜Ｔ２６の期間中は、ｘ５が保持される（図１７（ｂ））。
レジスタ１３０４には、Ｔ２０〜Ｔ２３の期間中は、ｘ０，ｘ７，ｘ１，ｘ６が、この順序で、１クロック毎にレジスタ１３０３から渡され、Ｔ２３〜Ｔ２７の期間中は、ｘ６が保持される（図１７（ｃ））。
レジスタ１３０５には、Ｔ２１，Ｔ２２の期間中は、ｘ０，ｘ７が、この順序で、１クロック毎にレジスタ１３０４から渡され、Ｔ２２〜Ｔ２８の期間中は、ｘ７が保持される（図１７（ｄ））。
出力画素データの並べ替えは、マルチプレクサ１３０６によってレジスタ１３０５，１３０４，１３０３，１３０２，１３０３，１３０４，１３０５の順に１クロック毎に出力を切替えることで実現する（図１７（ｅ））。
【００６３】
また、本実施形態では、並び替え部８２２に入力されるデータは、前述の積和演算部８０３の出力ビット精度を１６ビットとした関係上、１６ビットデータになり、垂直方向ＩＤＣＴのビット幅の入力とは異なる。また出力データも、画素データのビット精度を９ビットとしているので９ビット幅となり、垂直方向ＩＤＣＴの場合とは異なる。以上の処理を行うことにより二次元ＩＤＣＴを実現する。
なお、本実施形態で示した具体化例は一例であり、本発明はこの例以外にも適用可能である。例えば、この例では画素データは９ビットとしているが、９ビット以外でも構わない。また四捨五入の方法も一例であり、具体的に限定するものではない。データの形式も実施形態では画素，ＤＣＴ係数としているが、具体的に画像データ，ＤＣＴ係数に限定するものではない。
【００６４】
【発明の効果】
以上説明してきたとおり、本発明によれば、従来の直交変換回路が備えていたメモリやアドレス発生器をなくした小規模の直交変換回路の実現が可能になる。従って本回路を集積化した場合にはチップ面積を小さくすることが可能になり本発明の実施化による有効性は大きい。
また、本発明によれば、直交変換回路に複数のマルチプレクサや加算器を付加するだけで、直交変換及び逆直交変換を行う双方向の直交変換回路とすることができ、双方向の直交変換回路においてもメモリやアドレス発生器は不要であり、小規模の回路構成とすることができる。
また、プロセッサをＤＣＴプロセッサとし、さらに各次元に対応させることにより多次元の当該直交変換回路を構成することが可能となり、直交変換の実用化手段として有効な回路を提供する。
【図面の簡単な説明】
【図１】本発明によるＤＣＴプロセッサの構成を示すブロック図である。
【図２】本発明によるＤＣＴプロセッサ中のバタフライ演算部の構成を示すブロック図である。
【図３】本発明によるＤＣＴプロセッサ中のパラレルレジスタ，全加算器の構成及び加算の概念図である。
【図４】本発明によるＤＣＴプロセッサ中の積和演算部の構成を示すブロック図である。
【図５】図４に示される積和演算部の累積加算部の構成を示すブロック図である。
【図６】本発明によるＤＣＴプロセッサ中のバタフライ演算部の動作を説明するためのタイミングチャートである。
【図７】本発明によるＤＣＴプロセッサ中の積和演算部の動作を説明するためのタイミングチャートである。
【図８】本発明による双方向ＤＣＴ／ＩＤＣＴプロセッサの構成を示すブロック図である。
【図９】本発明による双方向ＤＣＴ／ＩＤＣＴプロセッサ中のバタフライ演算部の構成を示すブロック図である。
【図１０】本発明による双方向ＤＣＴ／ＩＤＣＴプロセッサ中の積和演算部の構成を示すブロック図（その１）である。
【図１１】本発明による双方向ＤＣＴ／ＩＤＣＴプロセッサ中の積和演算部の構成を示すブロック図（その２）である。
【図１２】本発明による双方向ＤＣＴ／ＩＤＣＴプロセッサ中の累積加算部の構成を示すブロック図である。
【図１３】本発明による双方向ＤＣＴ／ＩＤＣＴプロセッサ中のバタフライ演算部の構成を示すブロック図である。
【図１４】本発明による双方向ＤＣＴ／ＩＤＣＴプロセッサ中のバタフライ演算部の構成を示すブロック図である。
【図１５】本発明による双方向ＤＣＴ／ＩＤＣＴプロセッサ中のバタフライ演算部のＩＤＣＴモードの動作を説明するためのタイミングチャートである。
【図１６】図１０及び図１１に示される積和演算部の動作を説明するためのタイミングチャートである。
【図１７】図１４に示すバタフライ演算部の動作を説明するための図で、レジスタ及びマルチプレクサのタイミングチャートである。
【図１８】ＤＣＴ／ＩＤＣＴプロセッサにおけるＤＣＴモードの場合の入力マトリクスを示す図である。
【図１９】ＤＣＴ／ＩＤＣＴプロセッサにおけるＩＤＣＴモードの場合の入力マトリクスを示す図である。
【図２０】メモリへのデータの保持状態を転換することによる行列における転置の例を示す図である。
【図２１】従来例のＤＣＴ／ＩＤＣＴプロセッサを示すブロック図である。
【符号の説明】
１２，１６…バタフライ演算部、１３，１７…積和演算部、１４，１８…四捨五入部、１５…ＲＡＭ、
１００，１０２…並び替え部、１０１，１０３…加減算部、
２００〜２０７…レジスタ、２１０〜２１７…シフトレジスタ、２２０〜２２７…レジスタ、２３０〜２３７…全加算器、２４１…加算ビットスライス、２４２…減算ビットスライス、２５０〜２５３…加減算結果、２６０〜２６３…加減算結果、
４００〜４０７…累積加算部、４１２，４１３…マルチプレクサ、４２０〜４２５…遅延器、４３１，４３２…積和演算結果出力、
５００，５０１…ＲＯＭ、５０２，５０３…加算器、５０４…マルチプレクサ、５１１，５１２…レジスタ、５１３…シフト回路、
８０３，８０７…積和演算部、８０２，８０４，８０６，８０８…バタフライ演算部、８０９…マルチプレクサ、８２０，８２２…並び替え部、８２１，８２３…加減算部、
９００〜９０７，９１０〜９１７…マルチプレクサ、
１０００〜１００７…累積加算部、１００８〜１０１５…加減算器、１０３０，１０３１…積和演算結果出力、１０３２…加算キャリー出力、１０３３…減算キャリー出力、１０４０〜１０４３…マルチプレクサ、
１１００，１１０１…ＲＯＭ、１１１２…レジスタ、１１１３…シフト回路、
１２００，１２０１…加算器、１２０２，１２０３…マルチプレクサ、１２０４，１２０５…レジスタ、
１３００…加算器、１３０１〜１３０５…レジスタ、１３０６…マルチプレクサ、
１８００，１８０９…アドレス発生器、１８０１，１８０６…メモリ、１８０２，１８０４，１８０５，１８０８…パイプラインレジスタ、１８０３…積和演算器、１８０７…バタフライ演算器。
[0001]
[Technical field of the invention]
The present invention relates to orthogonal transformation, one of the high-efficiency coding techniques for image and audio information. More specifically, it relates to an orthogonal transform circuit that converts a time-domain signal into a frequency-domain signal by orthogonal transformation, and to an orthogonal transform circuit that converts in both directions between time-domain and frequency-domain signals.
[0002]
[Prior art]
In recent years, high-efficiency coding techniques for image and audio information have attracted attention. Circuits that implement orthogonal transforms are a key element of such coding schemes, and much research and development aims to make them smaller and faster.
A one-dimensional DCT processor circuit, one kind of orthogonal transform device, can be built from a butterfly operation circuit and an arithmetic circuit that uses distributed arithmetic (the DA method), which computes the sum of products of a vector inner product without a multiplier.
As an example of a one-dimensional DCT, equations (1) and (2) give the 8-point one-dimensional DCT.
[0003]
(Equation 1)

[0004]
In the butterfly operation of the DCT processor, the eight data x0, x1, x2, x3, x4, x5, x6, x7 are grouped into four pairs, (x0, x7), (x1, x6), (x2, x5), and (x3, x4), and the additions and subtractions x0 + x7, x0 - x7, x1 + x6, x1 - x6, x2 + x5, x2 - x5, x3 + x4, and x3 - x4 are performed on each pair.
[0005]
Next, the one-dimensional DCT coefficients are calculated from the butterfly outputs according to equations (1) and (2).
For example, the DCT coefficient X2 in equation (1) is obtained by the product-sum operation
X2 = (1/2) × {B × (x0 + x7) + C × (x1 + x6) + (− C) × (x2 + x5) + (− B) × (x3 + x4)}
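The butterfly stage and the X2 product-sum described above can be sketched in Python as follows. This is only an illustration, not the circuit of this publication: the function names are hypothetical, and the values B = cos(π/8) and C = cos(3π/8) are the usual 8-point DCT constants, assumed here because equations (1) and (2) are not reproduced in this text.

```python
import math

# Assumed constants: the usual 8-point DCT cosines (not stated in the text).
B = math.cos(math.pi / 8)
C = math.cos(3 * math.pi / 8)


def butterfly(x):
    """Pair (x0,x7), (x1,x6), (x2,x5), (x3,x4); return (sums, differences)."""
    if len(x) != 8:
        raise ValueError("expected 8 samples")
    sums = [x[i] + x[7 - i] for i in range(4)]
    diffs = [x[i] - x[7 - i] for i in range(4)]
    return sums, diffs


def dct_x2(x):
    """X2 = (1/2) * {B*(x0+x7) + C*(x1+x6) - C*(x2+x5) - B*(x3+x4)}."""
    s, _ = butterfly(x)
    return 0.5 * (B * s[0] + C * s[1] - C * s[2] - B * s[3])
```

Only the sums feed X2; the differences returned by `butterfly` would feed the odd-numbered coefficients of equation (2).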
The DA method performs the matrix product-sum operations of equations (1) and (2) in units called bit slices, each of which collects the bits of the same digit position, as shown in FIG. 3(D). An example of the DA method is given in equations (5), (6), and (7) and Table 1 below and is used in the following description.
[0006]
(Equation 2)

[0007]
(Equation 3)

[0008]
[Table 1]

[0009]
As shown in equation (5) above, the butterfly results (x0 + x7, x1 + x6, x2 + x5, x3 + x4) in equation (2) are decomposed into bits ([b_{0,n}, ..., b_{0,1}, b_{0,0}], [b_{1,n}, ..., b_{1,1}, b_{1,0}], [b_{2,n}, ..., b_{2,1}, b_{2,0}], [b_{3,n}, ..., b_{3,1}, b_{3,0}], where b_{3,n}, ..., b_{0,0} each denote one bit). Looking at the bits of the same digit position, the product-sum operation that yields the DCT coefficient X2 in the second row of equation (5) can be decomposed as in equation (6).
Here, the set b_{0,n}, b_{1,n}, b_{2,n}, b_{3,n} in equation (6) is called a bit slice.
The product-sum operation for the bit slice of the (n + 1)-th digit is given by equation (7) above. Because each of b_{0,0} to b_{3,n} is either 0 or 1, multiplying the result by (1/2) × 2^n gives
(1/2) × 2^n × ((B × b_{0,n}) + (C × b_{1,n}) + (− C × b_{2,n}) + (− B × b_{3,n}))
The product-sum value can therefore be obtained from a table of the 16 partial sums that correspond to the possible bit-slice combinations, as shown in Table 1.
Because the factor 2^n is a shift, equation (6) is evaluated by looking up one of the 16 table values for each bit slice, shifting it, and accumulating the shifted values in sequence. Applying the same procedure to each row of equation (5) realizes the matrix product-sum operation by the DA method and yields the one-dimensional DCT coefficients (X0, X2, X4, X6).
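The table-lookup, shift, and accumulate procedure can be sketched as follows. This is a minimal software model under stated assumptions, not the circuit itself: the names `build_table` and `da_dot` are hypothetical, the table layout is one possible choice because Table 1 is not reproduced here, the constants B and C are the usual DCT cosines, and floating-point values stand in for the fixed-point ROM contents. The inputs are treated as two's-complement words, so the slice holding the sign bits is subtracted rather than added.

```python
import math

B = math.cos(math.pi / 8)        # assumed DCT constants
C = math.cos(3 * math.pi / 8)
COEFFS = (B, C, -C, -B)          # matrix row that produces X2


def build_table(coeffs):
    """16 partial sums; bit k of the address selects coeffs[k]."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(16)]


def da_dot(values, coeffs, width):
    """(1/2) * sum(coeffs[k] * values[k]) using lookup, shift and add.

    `values` are signed integers that fit in `width` bits.
    """
    table = build_table(coeffs)
    mask = (1 << width) - 1
    words = [v & mask for v in values]        # two's-complement bit patterns
    acc = 0.0
    for n in range(width):                    # least significant slice first
        addr = sum(((w >> n) & 1) << k for k, w in enumerate(words))
        term = table[addr] * (1 << n)         # multiply by 2^n, i.e. a shift
        # The top slice holds the sign bits, so its partial sum is subtracted.
        acc += -term if n == width - 1 else term
    return 0.5 * acc
```

No multiplication by a data value occurs inside the loop: each step is one table read, one shift, and one addition, which is what lets the hardware avoid a multiplier.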
[0010]
The 8-point one-dimensional IDCT, on the other hand, is given by equations (3) and (4); the eight input data (X0, X1, X2, X3, X4, X5, X6, X7) are supplied sequentially.
First, the input data are divided into two sets, {X0, X2, X4, X6} and {X1, X3, X5, X7}, as shown in equations (3) and (4), and the data in each set are decomposed into bits to form bit slices.
Using these bit slices and a table indexed by the bit-slice values, the matrix product-sum operation is performed by the DA method to obtain the first and second terms on the right-hand side of each of equations (3) and (4).
Next, x0 to x7 are calculated by adding and subtracting these two terms. In the IDCT, the butterfly operation refers to this addition and subtraction.
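The final addition and subtraction step of the IDCT can be sketched as below. The pairing used here (the sum gives x_i and the difference gives x_(7-i)) and the factor 1/2 follow the description of the embodiment later in this document; the function name is hypothetical, and the two 4-element terms are taken as given because equations (3) and (4) are not reproduced in this text.

```python
def idct_butterfly(even_terms, odd_terms):
    """Combine the two 4-element product-sum terms into x0..x7."""
    if len(even_terms) != 4 or len(odd_terms) != 4:
        raise ValueError("expected two 4-element terms")
    x = [0.0] * 8
    for i in range(4):
        x[i] = 0.5 * (even_terms[i] + odd_terms[i])       # addition -> x_i
        x[7 - i] = 0.5 * (even_terms[i] - odd_terms[i])   # subtraction -> x_(7-i)
    return x
```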
[0011]
JP-A-7-234864 describes the circuit configuration of a small-scale one-dimensional DCT/IDCT processor that uses butterfly operations and the DA method. The circuit performs a one-dimensional DCT/IDCT on 8 × 8 pixels: it stores the input elements in a memory and passes them to the butterfly operation unit rearranged into the required order, which reduces the number of registers and the circuit size.
[0012]
In the DCT/IDCT processor of JP-A-7-234864 shown in FIG. 21, 1800 is an address generator, 1801 is an 8 × 8 word memory with a 16-bit width, 1802 is a 16-bit pipeline register, 1803 is a product-sum operation unit with a 16-bit input and a 34-bit output, 1804 is a 34-bit pipeline register, 1805 is a 16-bit pipeline register, 1806 is an 8 × 8 word memory with a 16-bit width, 1807 is a butterfly operation unit with a 34-bit input and a 16-bit output, 1808 is a 16-bit pipeline register, and 1809 is an address generator.
[0013]
For the DCT, eight data x0, x1, x2, x3, x4, x5, x6, x7 forming one row or one column of the 8 × 8 element data (M bits each) shown in FIG. 18 are input to this processor in this order and temporarily stored in the memory 1801. The address generator 1800 reads the data from the memory 1801 in the order needed for the butterfly operation, namely x7, x0, x6, x1, x5, x2, x4, x3, and supplies them to the butterfly operation unit 1807 via the pipeline register 1802.
The butterfly operation unit 1807 includes a parallel adder with 34-bit first and second inputs and a 16-bit output for adding and subtracting the supplied data. The data enter the butterfly operation unit 1807 in the order x7, x0, x6, x1, x5, x2, x4, x3; the operations are performed in the order (x0 + x7), (x0 - x7), (x1 + x6), (x1 - x6), (x2 + x5), (x2 - x5), (x3 + x4), (x3 - x4), and the results are output to the product-sum operation unit 1803 via the pipeline register 1808.
[0014]
The product-sum operation unit 1803 includes a shift register, a ROM holding the partial sums that correspond to the bit slices, an accumulator circuit that adds the ROM output values while shifting them, and a parallel adder with a 34-bit first input, 16-bit second and third inputs, and a 34-bit output.
The product-sum operation unit 1803 operates as follows. The shift register decomposes the incoming butterfly results into bit slices. Each bit slice is used as the ROM input address, and the ROM output is fed to the accumulator circuit, which accumulates it slice by slice. This realizes the product-sum operation by the DA method and yields the one-dimensional DCT coefficients X0, X2, X4, X6, X1, X3, X5, X7.
The operation result is passed to the memory 1806 via the pipeline registers 1804 and 1805. The address generator 1809 reads the data from the memory 1806 rearranged into the proper order, namely X0, X1, X2, X3, X4, X5, X6, X7, and outputs them externally.
[0015]
[Problems to be solved by the invention]
As described above, the DCT/IDCT processor of JP-A-7-234864 needs memories and address generators to rearrange the input and output data, which increases the circuit scale. For example, with 16-bit input data, a memory of 8 × 8 words of 16 bits each requires a capacity of 1 Kbit.
In addition, because the butterfly operation unit 1807 adds and subtracts whole data words, the parallel adder becomes large. The conventional example shares one parallel adder between the DCT and IDCT operations, so if the product-sum result in the IDCT operation is 34 bits wide, the parallel adder must perform 34-bit addition.
The present invention has been made in view of these problems of the prior art. Its object is to provide an orthogonal transform circuit in which the means for performing the butterfly operation and the product-sum operation are smaller than in the prior art, and to provide such an orthogonal transform circuit capable of bidirectional operation.
[0016]
[Means for Solving the Problems]
To achieve the above object, the butterfly operation unit of the orthogonal transform circuit according to the present invention includes N parallel registers for storing N elements, N shift registers, and N K-bit parallel adders. The parallel registers rearrange the input element data and pass them to the shift registers. The shift registers pair up the received element data and output them to the parallel adders K bits at a time. The parallel adders add or subtract the element data received in K-bit pieces. This eliminates the need for a temporary-storage memory and for an address generator that controls the read order from that memory. Because the butterfly operation unit rearranges, adds, and subtracts the element data in units of bits (its inputs and outputs are only bits wide), the butterfly operation can be performed by a small circuit, and the orthogonal transform circuit can be made smaller.
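The K-bit serial addition and subtraction with a saved carry can be modelled in software as in the sketch below. It is an illustration under stated assumptions rather than a description of the actual circuit: the function names are hypothetical, K defaults to 2 as in the embodiment, and the operands are treated as unsigned values zero-extended to `width` bits. For subtraction the second operand is inverted and the carry starts at 1, which is the two's-complement technique described for the embodiment.

```python
def serial_add_sub(a, b, width, k=2, subtract=False):
    """Add or subtract two `width`-bit words K bits per step, LSB first.

    Returns (list of K-bit result chunks, final carry).
    """
    mask = (1 << k) - 1
    carry = 1 if subtract else 0          # initial value of the carry register
    chunks = []
    for shift in range(0, width, k):
        a_k = (a >> shift) & mask
        b_k = (b >> shift) & mask
        if subtract:
            b_k ^= mask                   # invert the K bits of the subtrahend
        total = a_k + b_k + carry
        chunks.append(total & mask)       # K-bit result for this step
        carry = total >> k                # saved for the next, higher step
    return chunks, carry


def assemble(chunks, k=2):
    """Rebuild an integer from LSB-first K-bit chunks."""
    return sum(c << (i * k) for i, c in enumerate(chunks))
```

With the example operands given for the embodiment, x0 = 110101000 and x7 = 111001111 in binary, `serial_add_sub(0b110101000, 0b111001111, width=10)` takes five 2-bit steps and the chunks reassemble to 887, the 10-bit sum. Each step only needs a K-bit adder and a one-bit carry register, which is why the adder stays small regardless of the data width.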
The invention described in each claim is constituted by the following technical means.
[0017]
The invention according to claim 1 is an orthogonal transform circuit in which orthogonal transform processing of N input element data is performed by a processor including a butterfly operation unit and a product-sum operation unit, characterized as follows. The butterfly operation unit comprises a parallel register group that stores the N input element data in registers provided for each element data; a serial register group that is connected to the registers of the parallel register group and stores the N element data from those registers in registers provided for each element data; and N K-bit adders that are connected to the registers of the serial register group and add the data from those registers. The parallel register group converts the order of the N input element data. The serial register group outputs all N element data whose order has been converted by the parallel register group simultaneously and serially, K bits at a time, starting from the lower bits. The K-bit adders add and subtract the N/2 pairs of element data for each K bits output from the serial register group, save the resulting carry, and use the saved carry to perform the addition and subtraction successively from lower to higher digits. The product-sum operation unit takes as input the bit slices of the same digit in each of the N/2 pairs of element data that are output simultaneously, starting from the lower bits, as the addition and subtraction results obtained by the butterfly operation unit, and adds the partial sums for each digit of the input bit slices.
[0019]
In the invention according to claim 2, the processor described in claim 1 is a DCT processor.
[0020]
In the invention according to claim 3, multi-dimensional orthogonal transform processing is performed by providing the processor described in claim 1 or 2 for each dimension.
[0021]
The invention according to claim 4 is an orthogonal transform circuit that performs orthogonal transform processing and/or inverse orthogonal transform processing on N input element data by a processor including a first butterfly operation unit, a product-sum operation unit, and a second butterfly operation unit. The first butterfly operation unit includes: a parallel register group that stores the N input element data in registers provided for each element data; a serial register group, connected to the registers of the parallel register group, that stores the N element data from those registers in registers provided for each element data; and N K-bit adders, connected to the registers of the serial register group, that add the data from those registers. The parallel register group converts the order of the N input element data, and the serial register group outputs the N reordered element data simultaneously and serially, K bits at a time, starting from the lower bits. When orthogonal transform processing is performed, the K-bit adders add and subtract the N/2 pairs of element data for each K bits output from the serial register group, store the resulting carry, and use the stored carry to perform the addition and subtraction sequentially from the lower bits to the upper bits. When inverse orthogonal transform processing is performed, the output of the serial register group is output directly without operating the K-bit adders.
When orthogonal transform processing is performed, the product-sum operation unit receives as input the bit slices of the same digit across the N/2 pairs of element data, which are output simultaneously and sequentially from the lower bits as the addition and subtraction results of the first butterfly operation unit. When inverse orthogonal transform processing is performed, it receives as input the bit slices of the same digit that are output simultaneously and sequentially from the lower bits of the N/2 element data contained in each of the two groups into which the N element data are divided. In either case it adds the partial sums for each digit of the input bit slices. The second butterfly operation unit includes an adder which, when inverse orthogonal transform processing is performed, takes as input the results of adding the partial sums for each digit of the input bit slices in the product-sum operation unit and performs addition and subtraction on them in a time-shared manner.
[0023]
In the invention according to claim 5, the processor described in claim 4 is a DCT/IDCT processor.
[0024]
In the invention according to claim 6, multi-dimensional orthogonal transform processing is performed by providing the processor described in claim 4 or 5 for each dimension.
[0025]
BEST MODE FOR CARRYING OUT THE INVENTION
Embodiments of the present invention will be described below with reference to the accompanying drawings.
FIG. 1 is a block diagram outlining a two-dimensional DCT processor that is an embodiment of the orthogonal transform circuit according to the present invention. The specific operation of the processor is described below with reference to FIGS. 2 to 7 and related drawings.
In FIG. 1, each one-dimensional DCT is performed by a series connection of a butterfly operation unit (12, 16), a product-sum operation unit (13, 17), and a round-off unit (14, 18), and the two-dimensional DCT is realized by two such one-dimensional DCT circuits.
A feature of the two-dimensional DCT processor shown in FIG. 1 is that, in the butterfly operation units 12 and 16, the input pixel data is stored in registers provided inside those units, its order is converted, and the reordered data is then output two bits at a time so that addition and subtraction are performed in units of two bits. This processing eliminates the need for a memory and an address generator at the time of data input. In addition, the parallel adders and subtractors, which in the conventional example required inputs as wide as the data bit width, can be reduced to 2-bit inputs and outputs.
[0026]
Hereinafter, the operation of the embodiment will be described. The butterfly operation unit 12 of the two-dimensional DCT processor in FIG. 1 includes a rearrangement unit 100 and an addition / subtraction unit 101. The rearranging unit 100 rearranges the order of the input pixel data, and passes the lower two bits of each pixel data to the addition / subtraction unit 101.
The addition / subtraction unit 101 implements a butterfly operation by performing addition and subtraction on the passed 2-bit pixel data. The calculation result is passed to the product-sum calculation unit 13 in bit slice units. The product-sum operation unit 13 performs one-dimensional DCT in the horizontal direction, and the intermediate result of the obtained DCT coefficients is passed to the rounding unit 14.
The rounding unit 14 performs the multiplication of (1/2) on the right side of the equations (1) and (2) using the rounding, and passes the calculation result to the RAM 15. The RAM 15 is a data transposition RAM having a 15-bit width input / output, transposes input data, and passes the data to the butterfly operation unit 16.
The butterfly operation unit 16 performs data rearrangement and addition / subtraction on a 2-bit basis similarly to the butterfly operation unit 12, and outputs a bit slice to the product-sum operation unit 17.
Thereafter, the product-sum operation unit 17 and the round-off unit 18 perform the same processing as the product-sum operation unit 13 and the round-off unit 14, respectively. Two-dimensional DCT is completed by first performing horizontal one-dimensional DCT and then performing vertical one-dimensional DCT with the RAM 15 interposed therebetween.
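The row-column structure described above (horizontal one-dimensional DCT, transposition, vertical one-dimensional DCT) can be sketched in software as follows. This is only an illustrative model: the function names are hypothetical, `dct_1d` uses a standard floating-point DCT-II with orthonormal scaling as an assumed stand-in for equations (1) and (2), and `transpose` merely plays the role of the transposition RAM 15. It does not model the bit-serial hardware.

```python
import math

def dct_1d(v):
    """Reference 1-D DCT-II with orthonormal scaling (assumed stand-in for eqs. (1), (2))."""
    n = len(v)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(v[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                               for i in range(n)))
    return out

def transpose(block):
    """Plays the role of the transposition RAM: rows in, columns out."""
    return [list(col) for col in zip(*block)]

def dct_2d_row_column(block):
    """Horizontal 1-D DCT on every row, transpose, then 1-D DCT on every column."""
    rows_done = [dct_1d(row) for row in block]                  # horizontal pass
    cols_done = [dct_1d(col) for col in transpose(rows_done)]   # vertical pass
    return transpose(cols_done)   # index as [vertical frequency][horizontal frequency]
```

For a constant 8 × 8 block only the DC term is non-zero, which gives a quick sanity check of the two passes.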
[0027]
The operation of each block of the two-dimensional DCT processor in FIG. 1 will be described in more detail. It is assumed that input pixel data to the butterfly operation unit 12 is an 8 × 8 matrix as shown in FIG. 17 and each pixel data is 9 bits.
Here, the processing procedure of the one-dimensional DCT that obtains the DCT coefficients X0, X1, X2, X3, X4, X5, X6, and X7 from the input pixel data x0, x1, x2, x3, x4, x5, x6, x7 of the first row, based on the following equations (1) and (2), will be described.
[0028]
(Equation 4)

[0029]
FIG. 2 shows the configuration of the butterfly operation unit 12. The butterfly operation unit 12 is roughly divided into a rearrangement unit 100 and an addition/subtraction unit 101. The rearrangement unit 100 has a 9-bit-wide pixel data input and a 16-bit-wide output (8 pixel data × 2 bits) that delivers the eight pieces of pixel data in units of 2 bits. The addition/subtraction unit 101 has a 16-bit-wide input that receives the data passed from the rearrangement unit 100, and an output that is 16 bits wide in total, consisting of the addition bit slice 241 of the addition results (2 bits × 4 additions) and the subtraction bit slice 242 of the subtraction results (2 bits × 4 subtractions).
[0030]
The reordering unit 100 is composed of 9-bit registers (R) 200 to 207 and shift registers (SR) 210 to 217 with 9-bit parallel input and 2-bit serial output, as shown in the drawing.
The addition/subtraction unit 101 includes 1-bit registers 220 to 227 and full adders 230 to 237. Each unit element consists of a 1-bit register (for example, 220) that receives as input the carry-out of the lower-order addition from its full adder (for example, 230), and a full adder that adds its first and second inputs, each 2 bits wide, together with a 1-bit third input that carries in the carry held in the 1-bit register. The full adder has a 2-bit-wide first output for the addition result and a 1-bit second output for the carry-out.
[0031]
FIG. 6 is a timing chart showing the operation of the butterfly operation unit 12; it shows the change over time of the data at the inputs, outputs, and internal elements of the unit. The pixel data x0, x1, x2, x3, x4, x5, x6, and x7 are read into the butterfly operation unit 12 in this order, one per clock (FIG. 6(a)). During clock timings T1 to T4, the input pixel data enters at the register 206 at each clock and is sequentially shifted to the registers 204, 202, and 200, where it is stored (FIGS. 6(h), 6(f), 6(d), 6(b)).
Next, during the period from T5 to T8, the input pixel data enters at the register 201 and is sequentially shifted to the registers 203, 205, and 207, where it is held (FIGS. 6(c), 6(e), 6(g), and 6(i)). By repeating this operation, all the registers 200 to 207 are filled with pixel data every eight clocks.
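The loading sequence of T1 to T8 can be traced with a small simulation. The sketch below assumes each register chain behaves as a simple shift chain that advances by one position per clock, which is how the text above reads; the function and variable names are hypothetical.

```python
def load_parallel_registers(pixels):
    """Simulate clocks T1..T8 of the reordering unit for eight inputs.

    Returns a dict mapping register number (200..207) to the value it holds
    after the eighth clock.
    """
    assert len(pixels) == 8
    regs = {n: None for n in range(200, 208)}
    first_chain = [206, 204, 202, 200]    # used during T1..T4, entry register first
    second_chain = [201, 203, 205, 207]   # used during T5..T8, entry register first
    for t, value in enumerate(pixels):
        chain = first_chain if t < 4 else second_chain
        # Shift from the far end of the chain so no value is overwritten early.
        for dst, src in zip(reversed(chain), reversed(chain[:-1])):
            regs[dst] = regs[src]
        regs[chain[0]] = value
    return regs

regs = load_parallel_registers(["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7"])
pairs = [(regs[200 + 2 * i], regs[201 + 2 * i]) for i in range(4)]
```

Under this assumption, adjacent registers end up holding the butterfly pairs (x0, x7), (x1, x6), (x2, x5), (x3, x4), so no memory or address generator is needed to form them.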
[0032]
When all the registers 200 to 207 are filled with data, the data held in the registers 200 to 207 is transferred to the shift registers 210 to 217, and this transfer is repeated every eight clocks thereafter (FIGS. 6(j) to 6(q)).
At each clock, the shift register 210 outputs its stored data two bits at a time, starting from the lower bits, and passes them to the full adders 230 and 231. The full adder 230 performs the addition of the butterfly operation, and the full adder 231 performs the subtraction.
The shift register 211 likewise outputs its stored data two bits at a time from the lower bits at each clock; the data is passed unchanged to the full adder 230 and inverted to the full adder 231.
The registers 220 and 221 are set to initial values before the operation on the least significant two bits, and thereafter hold the carry of each operation result. The initial value of the register 220 is 0, and the initial value of the register 221 is 1.
For the full adder 231, inverting the input from the shift register 211 and setting the register 221 to 1 as its initial value makes the full adder 231 operate as a subtractor.
The full adders 230 and 231 perform full addition two bits at a time, so that (x0 + x7) and (x0 − x7) are obtained two bits at a time from the lower bits.
[0033]
The computation of (x0 + x7) will be described with reference to FIGS. 3(B), 3(C), and 3(D). FIG. 3(B) shows the portion of the butterfly operation unit of FIG. 2 that corresponds to the operation of (x0 + x7). As shown in FIG. 6, during the period from T9 to T16, x0 is stored in the shift register 210 and x7 is stored in the shift register 211. Here, as an example, it is assumed that x0 = 1110101000 and x7 = 1110011111.
The x0 and x7 stored in the shift registers 210 and 211 give an output to the full adder 230 two bits at a time while shifting by two bits each. Full adder 230 repeats 2-bit addition using carry register 220. Since (x0 + x7) has 10 bits, the addition ends in a total of five repetitions.
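The five repetitions can be traced in software. The following is a minimal sketch, not the circuit itself: `serial_add_sub` is a hypothetical name, the 10-bit operand strings are taken exactly as quoted above, and the subtraction path follows the scheme described for the full adder 231 (inverted operand, carry register initialised to 1).

```python
def serial_add_sub(a, b, width=10, k=2, subtract=False):
    """Add or subtract two width-bit two's-complement words, k bits per step.

    Returns (chunks, result): the k-bit outputs in the order produced
    (lowest chunk first) and the assembled width-bit result.
    """
    chunk_mask = (1 << k) - 1
    word_mask = (1 << width) - 1
    a &= word_mask
    b &= word_mask
    if subtract:
        b = ~b & word_mask          # inverted operand, as fed to the subtracting adder
    carry = 1 if subtract else 0    # initial value of the 1-bit carry register
    chunks = []
    for step in range(width // k):
        total = ((a >> (k * step)) & chunk_mask) + ((b >> (k * step)) & chunk_mask) + carry
        chunks.append(total & chunk_mask)   # k-bit sum output of this step
        carry = total >> k                  # carry-out kept for the next step
    result = sum(c << (k * i) for i, c in enumerate(chunks))
    return chunks, result

x0 = 0b1110101000
x7 = 0b1110011111
add_chunks, add_result = serial_add_sub(x0, x7)
sub_chunks, sub_result = serial_add_sub(x0, x7, subtract=True)
```

With these operands the addition yields the 2-bit chunks 11, 01, 00, 01, 11 (lowest first), that is 1101000111, and the subtraction yields 0000001001. The final carry-out is discarded, as is usual for fixed-width two's-complement arithmetic.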
[0034]
This state is shown in FIG. In the five calculations from timing T9 to T13, the result of (x0 + x7) is output two bits at a time from the lower order. This corresponds to the output 260 in FIG. Similar calculations are performed for the addition results 261 to 263 and the subtraction results 250 to 253 of other butterfly calculations.
Paying attention to the outputs 260 to 262 in FIG. 2, the calculation result of the addition portion of the butterfly operation is, as shown in FIG. 3(D), output as bit slices two bits at a time in order from the lower bits of 260 to 263. The subtraction portion of the butterfly operation works in the same way, except for the inversion step that forms a negative number in two's complement representation, and its result is likewise output as bit slices.
This timing is shown in FIGS. 6(r) to 6(u). First, during the period from T9 to T13, the bit slices of the addition results are output two bits at a time, starting from the lower two bits. Then, during the period from T14 to T16, dummy data is output, after which the output moves on to the next input data set; this cycle is repeated.
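The way four butterfly results become a stream of bit slices can be sketched as below. The function name is hypothetical, and the packing order inside a slice (word 0 in the least significant position) is an assumption made for illustration; the source does not state the bit order.

```python
def bit_slices(words, width=10, k=2):
    """Yield, for each clock, a tuple of k bit slices (lower digit first).

    Each slice packs the bits at one digit position from every word;
    bit i of the slice comes from words[i] (assumed packing order).
    """
    for step in range(width // k):
        clock = []
        for j in range(k):
            digit = step * k + j
            packed = 0
            for idx, w in enumerate(words):
                packed |= ((w >> digit) & 1) << idx
            clock.append(packed)
        yield tuple(clock)

# Four 2-bit words produce one clock carrying a lower and an upper slice.
example = list(bit_slices([0b01, 0b10, 0b11, 0b00], width=2, k=2))
```

With 10-bit results and two bits per clock, five clocks of slices are produced, which matches the T9 to T13 window described above.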
[0035]
The product-sum operation unit 13 shown in FIG. 4 (the same applies to the product-sum operation unit 17) has two 4-bit-wide inputs and first and second outputs, each 16 bits wide. It is composed of cumulative addition units 400 to 407, each having 4-bit-wide first and second inputs and a 16-bit-wide output that delivers the operation result as a DCT coefficient; multiplexers 412 and 413, each having 16-bit-wide first, second, third, and fourth inputs and an output; and delay units 420 to 425, each having two 4-bit-wide inputs and outputs and generating a delay of two clocks.
[0036]
In the product-sum operation unit 13, among the 16-bit bit slices passed from the butterfly operation unit 12, the addition bit slice 241 of (x0 + x7, x1 + x6, x2 + x5, x3 + x4) is input to the cumulative addition units 400 to 403, and the subtraction bit slice 242 of (x0 − x7, x1 − x6, x2 − x5, x3 − x4) is input to the cumulative addition units 404 to 407. Here, each of the addition and subtraction bit slices is divided into a lower bit slice and an upper bit slice, each composed of 4 bits as shown in the drawing.
[0037]
FIG. 5 shows a unit element part of the accumulative addition unit constituting the product-sum operation unit.
In the cumulative addition unit 401 shown in FIG. 5, the ROM 500 and the ROM 501 hold data having the same contents. The ROM 500 inputs the lower bit slice as a 4-bit width address, and the ROM 501 inputs the upper bit slice as a 4-bit width address.
Each of the ROM 500 and the ROM 501 stores the 16 kinds of partial sums, based on the cosine matrix, that correspond to the bit slice values for the DCT operation shown in Table 1. Reference numeral 502 denotes an adder that adds the 16-bit-wide first input 520 and the 17-bit-wide second input 521 and outputs an 18-bit-wide result; 503 denotes an adder that adds an 18-bit-wide first input and a 16-bit-wide second input and outputs an 18-bit-wide result; 504 denotes a multiplexer that has 18-bit-wide first and second inputs and outputs the upper 16 bits; 511 denotes an 18-bit-wide register; 512 denotes a 16-bit-wide register; and 513 denotes a circuit that shifts the partial sum output from the ROM 501 to the left by one bit.
[0038]
Further, the cumulative addition units 400 and 402 to 407 of FIG. 4 are identical to the cumulative addition unit 401 except that the data held in the ROMs differs. The cumulative addition unit 400 performs the operation corresponding to the DCT coefficient X0 of equation (1); likewise, 401 computes X2, 402 computes X4, and 403 computes X6. The cumulative addition unit 404 performs the operation corresponding to the DCT coefficient X1 of equation (2); likewise, 405 computes X3, 406 computes X5, and 407 computes X7.
The reason why the input bit slices are delayed by the delay units 420 to 425 is to make the operation result one coefficient per clock.
[0039]
The operation of the cumulative addition unit 401 as an example of the cumulative addition unit will be described with reference to FIGS. 5 and 7 and the following equations (5) and (6).
The cumulative addition unit 401 realizes, by additions in units of bit slices, the product-sum operation of the second row of the 4 × 4 matrix on the right side of equation (5) and the 4 × 1 matrix, and thereby calculates the DCT coefficient X2 shown in the following equations (5) and (6).
[0040]
(Equation 5)

[0041]
First, during the period from T9 to T13, bit slices are input two bits per clock (FIG. 7(a)). The lower bit slice {b₀₀, b₁₀, b₂₀, b₃₀} is input to the ROM 500, and the partial sum 520 of the product-sum operation of the second row of the 4 × 4 matrix on the right side of equation (5) and the 4 × 1 matrix, corresponding to this bit slice, is looked up in turn (see Table 1).
Next, the upper bit slice {b₀₁, b₁₁, b₂₁, b₃₁} is input to the ROM 501, and the partial sum 521 of the same product-sum operation corresponding to this bit slice is looked up in turn (see Table 1).
Since the partial sum 521 is one digit higher than the partial sum 520, the partial sum 521 shifted to the left by one bit, that is, multiplied by 2, is added to the partial sum 520 (partial sum 520 + partial sum 521 × 2).
The DCT coefficient X2 is obtained by performing the above operation on all the bit slices. However, when the partial sum corresponding to the bit slice of the most significant (sign) bit is input, the sign of the value shifted one bit to the left is inverted before the addition, in accordance with the two's complement representation (partial sum 520 + (−(partial sum 521 × 2))).
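The lookup-and-accumulate procedure can be sketched as follows. This is an illustrative model under stated assumptions: the function names are hypothetical, the table is built in software rather than read from Table 1, the address bit order (bit i selects coefficient i) is assumed, and the accumulation uses explicit weights of 2^n instead of the register and shifter arrangement of FIG. 5. The factor (1/2) is left out because the rounding unit applies it later.

```python
def build_rom(coeffs):
    """All 2**len(coeffs) partial sums; address bit i selects coeffs[i] (assumed order)."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]

def slice_address(words, digit):
    """Bit slice at one digit position, packed as a ROM address."""
    return sum(((w >> digit) & 1) << i for i, w in enumerate(words))

def da_dot(coeffs, words, width):
    """Dot product of coeffs with width-bit two's-complement words, two digits per step."""
    assert width % 2 == 0
    rom = build_rom(coeffs)
    acc = 0
    for low_digit in range(0, width, 2):
        partial_low = rom[slice_address(words, low_digit)]        # role of partial sum 520
        partial_high = rom[slice_address(words, low_digit + 1)]   # role of partial sum 521
        if low_digit + 1 == width - 1:
            pair = partial_low - partial_high * 2   # sign-bit slice enters negatively
        else:
            pair = partial_low + partial_high * 2
        acc += pair * (1 << low_digit)
    return acc

# Hypothetical integer coefficients standing in for (B, C, -C, -B).
coeffs = [3, 1, -1, -3]
words = [-185, 9, 100, -7]
result = da_dot(coeffs, words, width=10)
```

Because the table holds every possible partial sum, no multiplier is needed: each clock costs two lookups and a few additions regardless of the coefficient values.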
[0042]
The result of the addition by the adder 502 is cumulatively added by the adder 503 via the register 511 to the result of the lower bit slice already calculated. The number of bits of the cumulative addition result is reduced to the required bit precision as an output.
In this embodiment, the upper 16 bits are stored in the register 512 and then output as 16 bits (FIG. 7(b)). At the same time, they are input to the adder 503 for cumulative addition with the addition result of the next higher bits.
As a result, the operation result is obtained two clocks after the bit slice input is completed. That is, two clocks after the most significant bit slice is input, the product-sum operation result of the second row of the 4 × 4 matrix on the right side of equation (1) and the 4 × 1 matrix (corresponding to X2 on the left side of equation (1)) is output (FIG. 7(c)).
[0043]
Similar operations are performed by the cumulative addition units 400, 402, 403, 404, 405, 406, and 407. The cumulative addition unit 400 obtains the product-sum operation result of the first row of the 4 × 4 matrix on the right side of equation (1) and the 4 × 1 matrix (corresponding to X0 on the left side of equation (1)); 402 likewise obtains the result for the third row (corresponding to X4); and 403 likewise obtains the result for the fourth row (corresponding to X6). The multiplexer 412 is switched every two clocks so as to select the first row during the period from T15 to T16, the second row during T17 to T18, the third row during T19 to T20, and the fourth row during T21 to T22, and the selected result is passed to the rounding unit 14 (see FIG. 1) as the product sum 431 (FIG. 7(c)).
[0044]
Similarly, the cumulative addition unit 404 obtains the product-sum operation result of the first row of the 4 × 4 matrix on the right side of equation (2) and the 4 × 1 matrix (corresponding to X1 on the left side of equation (2)); 405 obtains the result for the second row (corresponding to X3); 406 obtains the result for the third row (corresponding to X5); and 407 obtains the result for the fourth row (corresponding to X7). The multiplexer 413 is switched every two clocks so as to select the first row during the period from T15 to T16, the second row during T17 to T18, the third row during T19 to T20, and the fourth row during T21 to T22, and the selected result is passed to the rounding unit 14 as the product sum 432 (FIG. 7(c)).
[0045]
As shown in FIG. 7(c), since the bit slices are input to the cumulative addition units 401 and 405 two clocks later than to the cumulative addition units 400 and 404, their operation results are also obtained two clocks later. Similarly, the cumulative addition units 402 and 406, and 403 and 407, each output their operation results with a further delay of two clocks.
The rounding unit 14 multiplies (1/2) on the right side of the equations (1) and (2) and rounds the result.
By performing the above processing, one-dimensional DCT in the horizontal direction is completed.
[0046]
As shown in FIG. 20, the RAM 15 outputs data input in the horizontal raster order in the vertical raster order. That is, the order of writing and reading is changed from horizontal to vertical and is passed to the butterfly operation unit 16.
The vertical DCT performs basically the same operation as the horizontal DCT. However, in the present embodiment, since the output bit precision of the product-sum operation unit 13 is 16 bits and the rounding unit 14 further halves the value, the data input to the butterfly operation unit 16 is 15 bits wide, unlike the 9-bit-wide input of the horizontal DCT. The output data is 12 bits wide because the bit precision of the DCT coefficients is 12 bits, which also differs from the horizontal DCT.
By performing the above processing, two-dimensional DCT is realized.
[0047]
Next, FIG. 8 shows a schematic block diagram of a DCT/IDCT processor that is an embodiment of the orthogonal transform circuit according to the present invention. The specific operation of the processor is described below with reference to FIGS. 9 to 17 and related drawings.
The two-dimensional DCT/IDCT processor of FIG. 8 is obtained by adding a circuit for IDCT to the two-dimensional DCT processor of FIG. 1. In the DCT mode it performs the same processing as that DCT processor. Parts that are the same as those described above are given the same reference numerals and their description is omitted; only the IDCT mode is described below.
A feature of the IDCT processing in the processor shown in FIG. 8 is that the input DCT coefficient data X0, X1, X2, X3, X4, X5, X6, and X7 are output in bit units, which realizes data input that requires neither a memory nor an address generator. In addition, because the butterfly operation unit 804 in the subsequent stage performs its addition sequentially from the lower bits, it needs a smaller adder than the conventional example.
[0048]
The bidirectional DCT / IDCT processor shown in FIG. 8 also performs the operations shown in the following equations (3) and (4).
In this operation, the reordering unit 820 performs only the bit slicing of the input data, and the bit slices are passed to the product-sum operation unit 803. The product-sum operation unit 803 performs the one-dimensional IDCT in the vertical direction, and the intermediate result of the IDCT obtained thereby is passed to the butterfly operation unit 804. The butterfly operation unit 804 performs the addition and subtraction on the right side of equations (3) and (4) and the multiplication by (1/2) with rounding, and passes the operation result to the RAM 15. The RAM 15 transposes the data input in vertical raster order and passes it to the rearrangement unit 822 in horizontal raster order. Thereafter, the same operation as the one-dimensional IDCT in the vertical direction is performed, except that the butterfly operation unit 808 rearranges the order of the final output as pixel data, and the pixel data is output from the butterfly operation unit 808. The two-dimensional IDCT is thus performed by first carrying out the one-dimensional IDCT in the vertical direction and then the one-dimensional IDCT in the horizontal direction, with the RAM 15 interposed between them.
[0049]
The operation of each block of the two-dimensional DCT / IDCT processor shown in FIG. 8 will be described in further detail.
In the IDCT mode, the input element data is the 8 × 8 matrix shown in FIG. 19, each coefficient is 12 bits, and the first column of the matrix is assumed to be X0, X1, X2, X3, X4, X5, X6, X7. Here, the processing procedure for obtaining the pixel values x0, x1, x2, x3, x4, x5, x6, and x7 from the input data X0, X1, X2, X3, X4, X5, X6, and X7, based on the following equations (3) and (4), will be described.
[0050]
(Equation 6)

[0051]
The rearrangement unit 820 of the butterfly operation unit 802 shown in FIG. 9 is the rearrangement unit 100 shown in FIG. 2 with added multiplexers 900 to 907, each of which selects either a 9-bit-wide first input or a 12-bit-wide second input. The adder/subtractor 821 is the adder/subtractor 101 shown in FIG. 2 with added multiplexers 910 to 917, each of which selects and outputs either of its 2-bit-wide first and second inputs.
[0052]
FIG. 15 is a timing chart showing the operation of the butterfly operation unit 802 in the IDCT mode. X0, X1, X2, X3, X4, X5, X6, and X7 are read into the butterfly operation unit 802 in this order, one per clock (FIG. 15(a)). The input data enters at the register 207 and is sequentially shifted to the registers 206, 205, 204, 203, 202, 201, and 200 (FIGS. 15(i) and 15(b)).
When all the registers 200 to 207 are filled with data, the data held in the registers 200 to 207 are transferred to the shift registers 210 to 217 every eight clocks thereafter (FIGS. 15 (j) to 15 (q)). Each of the shift registers 210 to 217 passes the lower two bits of the data stored for each clock to the addition / subtraction unit 821.
[0053]
The adder/subtractor 821 does not use the adders 230 to 237 in the IDCT mode. That is, the bit slices of X0 to X7 passed from the rearranging unit 820 are selected by the multiplexers 910 to 917, and a bit slice A, which groups X0, X2, X4, and X6 two bits at a time from the lower bits, and a bit slice B, which groups X1, X3, X5, and X7 two bits at a time from the lower bits, are output to the product-sum operation unit 803 (FIGS. 15(r) to 15(u)).
[0054]
The product-sum operation unit 803 shown in FIGS. 10 and 11 is the product-sum operation unit 13 shown in FIG. 4 in which the cumulative addition units 400 to 407 are partially modified into cumulative addition units 1000 to 1007, and to which are added adders/subtractors 1008 to 1015, each having first and second inputs and 1-bit-wide third and fourth outputs, and multiplexers 1042 and 1043, each having 1-bit-wide first, second, third, and fourth inputs and an output.
In the product-sum operation unit 803, among the bit slices passed from the rearranging unit 820, the bit slice A of {X0, X2, X4, X6} is input to the cumulative addition units 1000, 1001, 1002, and 1003, and the bit slice B of {X1, X3, X5, X7} is input to the cumulative addition units 1004, 1005, 1006, and 1007.
[0055]
The cumulative addition unit 1000 illustrated in FIG. 12 basically performs the same operation as the cumulative addition unit 401 described in the case of the DCT in the above-described embodiment.
This operation will be described with reference to the timing chart shown in FIG. 16. First, during the period from timing T9 to T14, bit slices are input two bits per clock (FIG. 16(a)). Of the two bits, the lower bit slice is input to the ROM 1100 and the upper bit slice is input to the ROM 1101, and the partial sums 520 and 521 corresponding to each bit slice are looked up in turn.
The processing is the same as for the DCT up to the point where the partial sums are added and the adder 503 cumulatively adds the result to that of the lower bit slices to obtain the intermediate cumulative result. It differs in that, when the bit precision of the intermediate cumulative result is reduced at the register 1112, the truncated bits (the lower two bits in the present embodiment) are output to the adder 1008 in order to perform the addition on the right side of equation (3), and to the adder 1012 in order to perform the subtraction on the right side of equation (4) (FIG. 16(b)).
[0056]
The same operation is performed in the cumulative addition units 1001, 1002, and 1003, so that each row of the first term on the right side of equations (3) and (4) is obtained two bits at a time from the lower bits; the cumulative addition units 1004, 1005, 1006, and 1007 obtain each row of the second term on the right side of equations (3) and (4) two bits at a time from the lower bits.
Based on the calculation results of the cumulative addition units 1000 and 1004, the adder 1008 performs the addition on the right side of equation (3) and the adder 1012 performs the subtraction on the right side of equation (4), two bits at a time from the lower bits, and they output the results as the addition carry and the subtraction carry, respectively.
The carry becomes valid when the cumulative addition result for the highest bit slice is output from the cumulative addition units 1000 and 1004, that is, from T16 to T17 (FIG. 16(c)).
Similarly, the adders 1009 to 1011 obtain the addition carry 1032 of each row for the addition on the right side of equation (3), and the adders 1013 to 1015 obtain the subtraction carry 1033 of each row for the subtraction on the right side of equation (4). These are passed to the butterfly operation unit 804 in the order of the first to fourth rows of equations (3) and (4) (FIG. 16(c)).
[0057]
The butterfly operation unit 804 shown in FIG. 13 performs the addition on the right side of equation (3) and the subtraction on the right side of equation (4). It has 16-bit-wide first and second inputs, 1-bit-wide third and fourth inputs, and a 16-bit-wide output.
In the butterfly operation unit 804, reference numeral 1200 denotes an adder that has 16-bit-wide first and second inputs and adds them to produce a 17-bit-wide output; 1201 denotes an adder that adds its first and second inputs to produce a 16-bit-wide output; 1204 denotes a 17-bit-wide register; 1205 denotes a register with a 16-bit-wide input and output; 1202 denotes a multiplexer that selects a 1-bit-wide output from 1-bit-wide first and second inputs; and 1203 denotes a multiplexer that selects a 16-bit-wide output from 16-bit-wide first and second inputs.
[0058]
In the butterfly operation unit 804, the first-stage adder 1200 uses the two 16-bit-wide inputs from the product-sum operation unit 803 and the 1-bit-wide addition or subtraction carry to perform, alternately, the addition and the subtraction of the first and second terms on the right side of equations (3) and (4), and outputs the 17-bit-wide data x0, x1, x2, x3, x4, x5, x6, and x7.
In the case of addition, using the product sum 1030 of the first term on the right side, the product sum 1031 of the second term on the right side selected by the multiplexer 1203, and the addition carry 1032 selected by the multiplexer 1202, the addition is performed in order from the first row. In the case of subtraction, subtraction is performed in order from the first row using the product-sum 1030, data obtained by inverting the product-sum 1031 selected by the multiplexer 1203, and the subtraction carry 1033 selected by the multiplexer 1202.
As shown in FIG. 16(c), the outputs of the cumulative addition units 1000 and 1004 become valid in the periods T16 and T17; x0 of equation (3) is obtained by adding the two outputs, and x7 of equation (4) is obtained by subtracting them. Similarly, x1, x6, x2, x5, x3, and x4 are obtained in sequence.
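The time-shared addition and subtraction can be sketched as follows. The sketch assumes, as the text above implies, that row r of the first and second right-hand terms yields x_r by addition and x_(7−r) by subtraction; the function name and the labelled-tuple output are hypothetical, and word widths and carries are not modelled.

```python
def idct_butterfly(first_terms, second_terms):
    """Combine the per-row partial results into outputs, in emission order.

    first_terms[r] and second_terms[r] stand for row r of the first and
    second terms on the right side of equations (3) and (4).
    """
    n = len(first_terms)
    out = []
    for r in range(n):
        out.append((f"x{r}", first_terms[r] + second_terms[r]))              # addition
        out.append((f"x{2 * n - 1 - r}", first_terms[r] - second_terms[r]))  # subtraction
    return out

emitted = idct_butterfly([10, 20, 30, 40], [1, 2, 3, 4])
```

A single adder alternating between the two operations therefore emits the results as x0, x7, x1, x6, x2, x5, x3, x4 rather than in natural order.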
[0059]
The 17-bit-wide data x0, x1, x2, x3, x4, x5, x6, and x7 obtained by the addition and subtraction are passed to the adder 1201 via the register 1204. The second-stage adder 1201 performs the multiplication by (1/2) on the right sides of Expressions (3) and (4) by rounding; the upper 16 bits are stored in the register 1205 and output to the RAM 15 as 16-bit-wide data.
By performing the above processing, the one-dimensional IDCT in the vertical direction is completed.
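The multiplication by (1/2) with rounding can be illustrated as follows. This is a sketch under an assumption: it adds 1 at the least significant bit and then keeps the upper bits, which is one common rounding rule. The text does not specify the constant used by the adder 1201, and the function name is hypothetical.

```python
def halve_with_rounding(value):
    """Multiply by 1/2, rounding halves toward positive infinity.

    Adding 1 before discarding the least significant bit is an assumed
    rounding rule; Python's >> is an arithmetic shift, so negative
    values are handled as in two's-complement hardware.
    """
    return (value + 1) >> 1
```

With this rule, 5 becomes 3, 4 becomes 2, and -5 becomes -2.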
[0060]
In the RAM 15, as shown in FIG. 20, the data input in vertical raster order are transposed and transferred to the butterfly operation unit 806 in horizontal raster order. The butterfly operation unit 806 operates in the same way as the butterfly operation unit 802: it outputs to the product-sum operation unit 807 a bit slice A, which groups X0, X2, X4, and X6 two bits at a time from the lower-order bits, and a bit slice B, which groups X1, X3, X5, and X7 two bits at a time from the lower-order bits (FIGS. 15(r) to 15(u)).
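The grouping of four coefficients two bits at a time from the lower-order bits can be sketched as below. The function names and the sample values are hypothetical; the 16-bit width and the 2-bit step follow the description, and the reassembly function is included only to show that no information is lost.

```python
def bit_slices(values, width=16, k=2):
    """Return a list of slices, least significant bits first.

    Each slice holds one k-bit field per input value, all taken from
    the same bit position.
    """
    mask = (1 << k) - 1
    return [tuple((v >> shift) & mask for v in values)
            for shift in range(0, width, k)]


def reassemble(slices, k=2):
    """Rebuild the original words from their slices (check only)."""
    words = [0] * len(slices[0])
    for step, fields in enumerate(slices):
        for i, field in enumerate(fields):
            words[i] |= field << (step * k)
    return words


# Hypothetical 16-bit patterns standing in for X0, X2, X4, X6
even_coefficients = [0x1234, 0x00FF, 0xABCD, 0x8001]
slice_a = bit_slices(even_coefficients)
```

Here `slice_a` contains eight slices, one per clock, and `reassemble(slice_a)` returns the original four words.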
[0061]
The horizontal IDCT operates basically in the same way as the vertical IDCT. It differs in that, in the butterfly operation unit 808 shown in FIG. 14, the second-stage adder 1300 rounds its 13-bit input to 9 bits, because the bit precision of the output pixel data is 9 bits.
As with the output of the adder 1201 of the butterfly operation unit 804, the output of the adder 1300 is in the order x0, x7, x1, x6, x2, x5, x3, x4. An operation is therefore added that rearranges the output pixel data into the order x0, x1, ..., x6, x7 using the registers 1302, 1303, 1304, and 1305 and the multiplexer 1306.
[0062]
The operation will be described with reference to the timing chart of the butterfly operation unit 808. The register 1302 receives x0, x7, x1, x6, x2, x5, x3, and x4 in this order from the adder 1300, one per clock (FIG. 17(a)).
During the period from T19 to T24, x0, x7, x1, x6, x2, and x5 are passed in this order from the register 1302 to the register 1303, one per clock, and x5 is held during the period from T24 to T26 (FIG. 17(b)).
During the period from T20 to T23, x0, x7, x1, and x6 are passed in this order from the register 1303 to the register 1304, one per clock, and x6 is held during the period from T23 to T27 (FIG. 17(c)).
During the periods T21 and T22, x0 and x7 are passed in this order from the register 1304 to the register 1305, one per clock, and x7 is held during the period from T22 to T28 (FIG. 17(d)).
The rearrangement of the output pixel data is realized by having the multiplexer 1306 switch its output among the registers in the order 1305, 1304, 1303, 1302, 1303, 1304, and 1305 every clock (FIG. 17(e)).
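The rearrangement can be checked with a small simulation of the register chain. This is a sketch under stated assumptions rather than a model of the actual circuit: it assumes that the register 1302 holds x0 in period T18, that each register loads the previous value of its predecessor only during the transfer periods given above, and that the multiplexer 1306 selects the register 1302 for two consecutive periods (T24 and T25) so that eight values are output. The table and function names are hypothetical.

```python
ADDER_ORDER = ["x0", "x7", "x1", "x6", "x2", "x5", "x3", "x4"]

# Periods in which each register loads from its predecessor; outside
# these periods the register holds its value.
LOAD_PERIODS = {1303: range(19, 25), 1304: range(20, 24), 1305: range(21, 23)}

# Register selected by multiplexer 1306 in each period (assumption: 1302
# is selected twice in a row, in T24 and T25).
MUX_SELECT = {21: 1305, 22: 1304, 23: 1303, 24: 1302,
              25: 1302, 26: 1303, 27: 1304, 28: 1305}


def simulate_reordering():
    """Clock the register chain from T18 to T28 and collect the mux output."""
    regs = {1302: None, 1303: None, 1304: None, 1305: None}
    output = []
    for t in range(18, 29):
        previous = dict(regs)  # register values before this clock edge
        index = t - 18
        if index < len(ADDER_ORDER):
            regs[1302] = ADDER_ORDER[index]  # 1302 loads every clock
        for reg, source in ((1303, 1302), (1304, 1303), (1305, 1304)):
            if t in LOAD_PERIODS[reg]:
                regs[reg] = previous[source]
        if t in MUX_SELECT:
            output.append(regs[MUX_SELECT[t]])
    return output
```

Under these assumptions `simulate_reordering()` returns the pixel data in the order x0, x1, x2, x3, x4, x5, x6, x7, and the held values (x5 in 1303, x6 in 1304, x7 in 1305) match the hold periods described above.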
[0063]
Also, in the present embodiment, the data input to the rearrangement unit 822 are 16 bits wide because the output bit precision of the product-sum operation unit 803 is 16 bits, which differs from the input bit width of the vertical IDCT. Likewise, the output data are 9 bits wide because the bit precision of the pixel data is 9 bits, which also differs from the vertical IDCT. The two-dimensional DCT is realized by performing the above processing.
Note that the specific example shown in the present embodiment is only an example, and the present invention can be applied to other cases. For example, the pixel data are 9 bits wide in this example, but may have a width other than 9 bits. The rounding method is likewise only an example and is not specifically limited. Although the data handled in the embodiment are pixel data and DCT coefficients, the data are not limited to image data and DCT coefficients.
[0064]
[Effects of the invention]
As described above, according to the present invention, a small-scale orthogonal transform circuit can be realized without the memory and address generator provided in a conventional orthogonal transform circuit. Therefore, when the circuit is integrated, the chip area can be reduced, which makes the present invention highly effective.
Further, according to the present invention, a bidirectional orthogonal transform circuit that performs both orthogonal transform and inverse orthogonal transform can be obtained simply by adding a plurality of multiplexers and adders to the orthogonal transform circuit. In this case as well, no memory or address generator is required, and a small-scale circuit configuration can be realized.
Further, by making the processor a DCT processor and providing one for each dimension, a multidimensional orthogonal transform circuit can be configured, providing a circuit that is effective as a practical means of orthogonal transform.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a DCT processor according to the present invention.
FIG. 2 is a block diagram showing a configuration of a butterfly operation unit in the DCT processor according to the present invention.
FIG. 3 is a conceptual diagram of the configuration of the parallel registers and full adders, and of the addition, in the DCT processor according to the present invention.
FIG. 4 is a block diagram showing a configuration of a product-sum operation unit in the DCT processor according to the present invention.
FIG. 5 is a block diagram illustrating a configuration of an accumulative addition unit of the product-sum operation unit illustrated in FIG. 4;
FIG. 6 is a timing chart for explaining the operation of the butterfly operation unit in the DCT processor according to the present invention.
FIG. 7 is a timing chart for explaining the operation of the product-sum operation unit in the DCT processor according to the present invention.
FIG. 8 is a block diagram showing a configuration of a bidirectional DCT / IDCT processor according to the present invention.
FIG. 9 is a block diagram showing a configuration of a butterfly operation unit in the bidirectional DCT / IDCT processor according to the present invention.
FIG. 10 is a block diagram (part 1) illustrating a configuration of a product-sum operation unit in a bidirectional DCT / IDCT processor according to the present invention.
FIG. 11 is a block diagram (No. 2) showing the configuration of the product-sum operation unit in the bidirectional DCT / IDCT processor according to the present invention.
FIG. 12 is a block diagram showing a configuration of an accumulative adder in a bidirectional DCT / IDCT processor according to the present invention.
FIG. 13 is a block diagram showing a configuration of a butterfly operation unit in the bidirectional DCT / IDCT processor according to the present invention.
FIG. 14 is a block diagram showing a configuration of a butterfly operation unit in a bidirectional DCT / IDCT processor according to the present invention.
FIG. 15 is a timing chart for explaining the operation of the butterfly operation unit in the bidirectional DCT / IDCT processor according to the present invention in the IDCT mode.
FIG. 16 is a timing chart for explaining the operation of the product-sum operation unit shown in FIGS. 10 and 11;
FIG. 17 is a diagram for explaining the operation of the butterfly operation unit shown in FIG. 14, and is a timing chart of a register and a multiplexer.
FIG. 18 is a diagram illustrating an input matrix in a DCT mode in the DCT / IDCT processor.
FIG. 19 is a diagram illustrating an input matrix in an IDCT mode in a DCT / IDCT processor.
FIG. 20 is a diagram illustrating an example of transposition in a matrix by changing a holding state of data in a memory.
FIG. 21 is a block diagram showing a conventional DCT / IDCT processor.
[Explanation of symbols]
12, 16 ... butterfly operation unit, 13, 17 ... product-sum operation unit, 14, 18 ... round-off unit, 15 ... RAM,
100, 102 ... rearranging section, 101, 103 ... addition / subtraction section,
200 to 207 register, 210 to 217 shift register, 220 to 227 register, 230 to 237 full adder, 241 addition bit slice, 242 subtraction bit slice, 250 to 253 addition and subtraction result, 260 to 263 Addition and subtraction results,
400 to 407: cumulative addition unit, 412, 413: multiplexer, 420 to 425: delay unit, 431, 432: product-sum operation result output,
500, 501 ROM, 502, 503 Adder, 504 Multiplexer, 511, 512 Register, 513 Shift circuit,
803, 807: product-sum operation unit, 802, 804, 806, 808: butterfly operation unit, 809: multiplexer, 820, 822: rearrangement unit, 821, 823: addition / subtraction unit,
900 to 907, 910 to 917 ... multiplexer,
1000 to 1007 cumulative adder, 1008 to 1015 adder / subtractor, 1030, 1031 product-sum operation result output, 1032 addition carry output, 1033 subtraction carry output, 1040 to 1043 multiplexer
1100, 1101 ROM, 1112 registers, 1113 shift circuit,
1200, 1201 ... adder, 1202, 1203 ... multiplexer, 1204, 1205 ... register,
1300 adder, 1301-1305 register, 1306 multiplexer
1800, 1809: address generator, 1801, 1806: memory, 1802, 1804, 1805, 1808: pipeline register, 1803: product-sum operation unit, 1807: butterfly operation unit.

Claims

An orthogonal transform circuit in which orthogonal transform processing on N input element data is performed by a processor comprising a butterfly operation unit and a product-sum operation unit, wherein: the butterfly operation unit comprises a parallel register group that stores the N input element data in registers provided for each element data, a serial register group that is connected to the registers of the parallel register group and stores the N element data from those registers in registers provided for each element data, and N K-bit adders that are connected to the registers of the serial register group and add data from those registers; the parallel register group converts the order of the N input element data; the serial register group simultaneously outputs each of the N element data, whose order has been converted by the parallel register group, serially K bits at a time in order from the lower-order bits; the K-bit adders perform addition and subtraction on the N/2 pairs of element data for each K bits output from the serial register group, store the resulting carry, and use the stored carry to perform the addition and subtraction successively from the lower-order to the higher-order bits; and the product-sum operation unit receives as input bit slices of the same digit position in each of the N/2 sets of element data that are output simultaneously in order from the lower-order bits as the addition and subtraction results obtained by the butterfly operation unit, and adds the partial sums for each digit position of the input bit slices.

An orthogonal transform circuit, characterized in that the processor according to claim 1 is a DCT processor.

An orthogonal transform circuit, characterized in that multidimensional orthogonal transform processing is performed by providing the processor according to claim 1 or 2 for each dimension.

An orthogonal transform circuit in which orthogonal transform processing and/or inverse orthogonal transform processing on N input element data is performed by a processor comprising a first butterfly operation unit, a product-sum operation unit, and a second butterfly operation unit, wherein: the first butterfly operation unit comprises a parallel register group that stores the N input element data in registers provided for each element data, a serial register group that is connected to the registers of the parallel register group and stores the N element data from those registers in registers provided for each element data, and N K-bit adders that are connected to the registers of the serial register group and add data from those registers; the parallel register group converts the order of the N input element data; the serial register group simultaneously outputs each of the N element data, whose order has been converted by the parallel register group, serially K bits at a time in order from the lower-order bits; the K-bit adders, when orthogonal transform processing is performed, perform addition and subtraction on the N/2 pairs of element data for each K bits output from the serial register group, store the resulting carry, and use the stored carry to perform the addition and subtraction successively from the lower-order to the higher-order bits, and, when inverse orthogonal transform processing is performed, are not operated, so that the output of the serial register group is output directly; the product-sum operation unit, when orthogonal transform processing is performed, receives as input bit slices of the same digit position in each of the N/2 sets of element data that are output simultaneously in order from the lower-order bits as the addition and subtraction results obtained by the first butterfly operation unit, and, when inverse orthogonal transform processing is performed, receives as input bit slices of the same digit position that are output simultaneously in order from the lower-order bits of each of the N/2 element data included in each of two groups into which the N element data are divided, and adds the partial sums for each digit position of the input bit slices; and the second butterfly operation unit comprises an adder which, when inverse orthogonal transform processing is performed, receives as input the partial sums for each digit position of the input bit slices, which are the addition results of the product-sum operation unit, and performs their addition and subtraction in a time-division manner.

An orthogonal transform circuit, characterized in that the processor according to claim 4 is a DCT/IDCT processor.

An orthogonal transform circuit, characterized in that multidimensional orthogonal transform processing is performed by providing the processor according to claim 4 or 5 for each dimension.