JP2008507039A - Programmable processor architecture

Info

Publication number: JP2008507039A
Application number: JP2007521614A
Authority: JP
Inventors: Amit Ramchandran; John Reid Hauser, Jr.
Original assignee: 3Plus1 Technology, Inc.
Priority date: 2004-07-13
Filing date: 2005-07-12
Publication date: 2008-03-06
Also published as: WO2006017339A3; EP1779256A2; EP1779256A4; KR20070055487A; CA2572954A1; WO2006017339A2

Abstract

One embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor (74, 76) capable of processing W bits in parallel, W being an integer value, and at least one N-type sub-processor (78, 80) capable of processing N bits in parallel, N being an integer value smaller than W by a factor of two. The processor further includes a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor, and a memory (312) coupled to and shared by the at least one W-type sub-processor and the at least one N-type sub-processor, wherein the W-type sub-processor rearranges memory to accommodate the execution of applications, enabling high-speed operation.

Description

(Cross-Reference to Related Applications)
This application claims the benefit of U.S. Provisional Patent Application No. 60/598,691, filed on July 13, 2004 and entitled "Quasi-Adiabatic Programmable or COOL Processors Architecture", and of U.S. Provisional Patent Application No. 60/598,417, filed on August 2, 2004 and entitled "Quasi-Adiabatic Programmable Processor Architecture".

(Field of the Invention)
The present invention relates generally to the field of processors, and more particularly to flexible and scalable processors with low power consumption, high performance, and low die area that are used in communication and multimedia applications.

(Description of the Prior Art)
With the growing popularity of consumer devices such as cell phones or mobile phones, digital cameras, iPods, and personal digital assistants (PDAs), the industry has widely adopted many new standards for communication using these devices. These standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3, and Security. However, the use of different standards governing communication between different devices has created a new problem: it requires a tremendous development effort. One reason for this problem is that the processors or sub-processors currently on the market are not easily programmable for all digital devices and do not readily conform to the various mandated standards. As new trends in consumer electronics emerge, it is only a matter of time before this problem grows, all the more so with the additional standards the industry will adopt in the future.

One of the emerging, or current, requirements of processors is low power consumption combined with the ability to execute enough code to handle multiple applications. Current power consumption is somewhat under a few hundred milliwatts per application, whereas the goal is to stay under a few hundred milliwatts while executing many applications. Another requirement of processors is low cost. Because processors are so widely used in consumer products, they must be inexpensive to manufacture; otherwise, their use in the most common consumer electronics is not practical.

Specific examples of the problems with current processors include problems associated with the RISC processors used in some consumer products, the microprocessors used in other consumer products, the digital signal processors (DSPs) used in still other consumer products, the application-specific integrated circuits (ASICs) used in yet other consumer products, and some of the other known processors. Each presents its own problems, which are briefly described below. For each approach, a "Pros" part describes the advantages of using it and a "Cons" part describes the disadvantages.

(A. RISC/Superscalar Processors)
RISC and superscalar processors are the most widely accepted architectural solution for all general-purpose computing. They are often enhanced with application-specific accelerators to solve certain specialized problems within the context of a general solution.

Examples: ARM series, ARC series, StrongARM series, and MIPS series.

Pros:
• Wide industry acceptance has led to a more mature tool chain and a wide range of software choices.
• A robust programming model has resulted from the highly efficient automatic code generators used to produce binaries from high-level languages such as C.
• Processors in this category are very good general-purpose solutions.
• They can exploit Moore's law efficiently to improve performance.

Cons:
• The general-purpose nature of the architecture does not exploit the common or specific characteristics of a set or subset of applications to improve price, power, and performance.
• They consume moderate to high amounts of power relative to the amount of computation provided.
• Performance gains are achieved mainly at the expense of pipeline latency, which adversely affects some multimedia and communication algorithms.
• Complex hardware schedulers, sophisticated control mechanisms, and significantly relaxed restrictions, all intended to make automatic code generation for general algorithms more efficient, have made the area efficiency of this category of solutions poor.

(B. Very Long Instruction Word (VLIW) Processors and DSPs)
The VLIW architecture eliminated some of the inefficiencies found in RISC and superscalar processor architectures and so produced a very popular solution in the digital signal processing space. Parallelism increased significantly. The scheduling burden was moved from hardware to software in order to save area.
Examples: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.

Pros:
• Restricting the solution to the signal processing space improved 3P (price, power, and performance) relative to RISC and superscalar architectures.
• The VLIW architecture provides a higher level of parallelism than RISC and superscalar architectures.
• An efficient tool chain was created quickly, and wide industry acceptance followed rapidly.
• Automatic code generation and programmability show significant improvement, since many processors designed for signal processing fall into this category.

Cons:
• Although the problem-solving scope is narrowed to the digital signal processing space, it is still too broad for a general solution such as a VLIW machine to achieve efficient 3P.
• Control is expensive and power-consuming, especially for the primitive control code found in many multimedia and communication applications.
• Several power-inefficient and area-inefficient techniques were adopted to ease automatic code generation. The software community's strong dependence on these techniques carries the inefficiency forward from generation to generation.
• The VLIW architecture is not well suited to processing serial code.

(C. Reconfigurable Computing)
Over the past decade, various efforts in industry and academia have concentrated on building flexible solutions with ASIC-like price, power, and performance characteristics. Many of them challenged existing and mature laws and design paradigms, with little success in the industry. Most of the attempts aimed at creating solutions based on coarse-grained, FPGA-like architectures.

Pros:
• Some designs that are restricted to a specific application, and that provide the flexibility required within that application, have proven to be competitive in price, power, and performance.
• Research has shown that such restricted yet flexible solutions can be created to address the hot spots of many applications.

Cons:
• Several designs in this space did not offer an efficient and easy programming solution and were therefore not widely accepted by the community familiar with DSP programming.
• Automatic code generation from high-level languages such as C was practically impossible or extremely inefficient for many of these designs.
• The 3P advantage was lost when attempting to integrate heterogeneous applications using a single type of interconnect and a single level of granularity. Utilization of the parallelism provided was greatly sacrificed.
• Reconfiguration overhead had a significant impact on 3P for most designs.
• In many cases the external interface was complex, because the proprietary reconfigurable fabric did not fit industry-standard system design methodology.
• Reconfigurable machines are uni-processors and rely heavily on a tightly integrated RISC, even for processing primitive control.

(D. Array of Processors)
Some recent efforts have concentrated on making reconfigurable systems better suited to processing heterogeneous applications. Solutions in this direction connect multiple processors, each optimized for one application or one set of applications, to create an array-of-processors structure.

Pros:
• Different processors optimized for different sets of applications, when connected together through an efficient fabric, can help solve a wide range of problems.
• A uniform scaling model allows a number of processors to be connected together as performance requirements increase.
• Complex algorithms can be partitioned efficiently.

Cons:
• Performance requirements may be adequately met, but the power and price inefficiencies are too high.
• The programming model differs from processor to processor, which makes the application developer's job considerably harder.
• Uniform scaling across a large number of processors is a very expensive and power-hungry resource. It has also been shown to exhibit some non-determinism that can adversely affect overall system performance.
• The system-level programming model has no shared memory resources and therefore suffers from the complexity of communicating data, code, and control information. This is because shared memory is not uniformly scalable.
• The extensive and repetitive glue logic required to connect different types of processors into a heterogeneous network adds to area inefficiency, increases power consumption, and adds latency.

In light of the foregoing, there is a need for a heterogeneous processor that is low-power, inexpensive, efficient, high-performance, and flexibly programmable, so that one or more multimedia applications can be executed simultaneously.

Briefly, one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor comprising at least one W-type sub-processor capable of processing W or more bits in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W. The processor further comprises a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor, and a memory coupled to and shared by the at least one W-type sub-processor and the at least one N-type sub-processor. The W-type sub-processor rearranges the bytes moving into and out of the memory to accommodate the execution of applications, thereby enabling high-speed operation.

Referring first to FIG. 1, an application 10 is shown for a digital product 12 that includes an embodiment of the present invention. FIG. 1 is intended to give the reader a picture of some, though not necessarily all, of the advantages of a product that includes an embodiment of the present invention over what is commercially available.

Accordingly, the product 12 is a convergence product in that it incorporates all of the applications that need to be executed by a modern mobile phone device 14, a digital camera device 16, a digital recording or music device 18, and a PDA device 20. The product 12 can execute one or more of the functions of the devices 14-20 simultaneously, yet with low power consumption.

The product 12 is typically battery-operated and therefore consumes little power, even when executing several of the applications executed by the devices 14-20. The product 12 can also execute code to operate in conformance with multiple applications including, but not limited to, H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3, and Security.

FIG. 2 shows an exemplary integrated circuit 20 that includes a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24, in accordance with an embodiment of the present invention. In FIG. 2, the processor 22 is also shown coupled to interface circuit 26 through a general-purpose bus 30 and to interface circuit 28 through a general-purpose bus 31, and further coupled, through buses 30 and 31, to a general-purpose processor 32. The circuit 20 is further shown to include a clock, reset, and power management block 34, which generates a clock used by the remaining circuits of circuit 10, a reset signal used in the same way, and circuitry for managing power. The circuit 20 further includes a Joint Test Action Group (JTAG) circuit 36. JTAG is used as a standard for testing chips.

The interface circuit 26, shown coupled to bus 30, and the interface circuit 28, shown coupled to bus 31, include blocks 40-66, which are used by current processors and are generally known to those skilled in the art.

The processor 22, which is a heterogeneous multi-processor, is shown to include shared data memory 70, shared data memory 72, a CoolW sub-processor (or block) 74, a CoolW sub-processor (or block) 76, a CoolN sub-processor (or block) 78, and a CoolN sub-processor (or block) 80. Each of the blocks 74-80 has an associated instruction memory: CoolW block 74 is associated with instruction memory 82, CoolW block 76 with instruction memory 84, CoolN block 78 with instruction memory 86, and CoolN block 80 with instruction memory 88. Similarly, each of the blocks 74-80 has an associated control block: block 74 is associated with control block 90, block 76 with control block 92, block 78 with control block 94, and block 80 with control circuit 96. Blocks 74 and 76 are designed to operate efficiently, in general, for 16-, 24-, 32-, and 64-bit operations or applications, whereas blocks 78 and 80 are designed to operate efficiently, in general, for 1-, 4-, or 8-bit operations or applications.

The blocks 74-80 are essentially sub-processors. CoolW blocks 74 and 76 are wide (or W) type blocks, whereas CoolN blocks 78 and 80 are narrow (or N) type blocks. Wide and narrow refer to the relative number of parallel bits processed or routed within a sub-processor, and this difference gives the processor 22 its heterogeneous character. Furthermore, the circuit 24 is coupled directly to one of the sub-processors, that is, to one of the blocks 74-80, which provides a low-latency path through the sub-processor to which it is coupled. In FIG. 2, the circuit 24 is shown coupled directly to block 76, but it may be coupled to any of blocks 74, 78, or 80. High-priority agents or tasks may be assigned to the block that is coupled directly to the circuit 24.

Although four blocks 74-80 are shown, other numbers of blocks may be used. It should be noted, however, that using additional blocks increases die area and therefore manufacturing cost.

Complex applications requiring great processing power are not scattered across the circuit 20; rather, they are grouped in, or confined to, a particular sub-processor or block. This inherently improves power consumption by eliminating, or at least reducing, wire (metal) or routing lengths and thereby reducing wire capacitance. In addition, higher utilization combined with lower activity contributes to lower power consumption.
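The link between shorter wires, lower activity, and lower power can be illustrated with the standard first-order CMOS switching-power model, P = a · C · V² · f. The source does not state this formula; it is assumed here purely for illustration, and the function name and all numeric values are hypothetical.

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS switching power in watts: P = a * C * V^2 * f.

    This model is an assumption made for illustration; the source text
    only states that less wire capacitance and less activity save power.
    """
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical numbers: the same kernel scattered across the chip
# (long wires, more switching) versus confined to one block.
scattered = dynamic_power(activity=0.20, capacitance_f=2.0e-9,
                          voltage_v=1.2, frequency_hz=200e6)
confined = dynamic_power(activity=0.10, capacitance_f=1.0e-9,
                         voltage_v=1.2, frequency_hz=200e6)

print(f"scattered: {scattered * 1e3:.1f} mW, confined: {confined * 1e3:.1f} mW")
```

Under this model, halving both the switched capacitance and the activity factor cuts the switching power to a quarter at the same voltage and clock frequency.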

The circuit 20 is an example of a silicon-on-chip (or SoC) that provides quasi-adiabatic programmable sub-processors for multimedia and communication applications. As described above, two types of sub-processor are provided: W-type and N-type. The W-type, or wide-type, processor is designed for high power, price, and performance efficiency in applications requiring 16-, 24-, 32-, and 64-bit processing. The N-type, or narrow-type, processor is designed for high efficiency in applications requiring 8-, 4-, and 1-bit processing. Although these bit widths are used in the figures and description of the embodiments of the present invention, other bit widths may readily be employed.
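The width-based split between the two sub-processor types can be sketched as a simple lookup. The bit widths come from the text; the function name and the choice to reject unlisted widths are assumptions made for this sketch, since the text notes that other widths may be used.

```python
# Bit widths the text associates with each sub-processor type.
W_WIDTHS = {16, 24, 32, 64}  # wide (CoolW) blocks, e.g. blocks 74 and 76
N_WIDTHS = {1, 4, 8}         # narrow (CoolN) blocks, e.g. blocks 78 and 80


def select_subprocessor_type(bit_width):
    """Return "W" or "N" for a given operand width (hypothetical helper)."""
    if bit_width in W_WIDTHS:
        return "W"
    if bit_width in N_WIDTHS:
        return "N"
    # The text allows other widths; this sketch simply does not map them.
    raise ValueError(f"no sub-processor type assumed for {bit_width}-bit work")


print(select_subprocessor_type(32))  # wide-type work
print(select_subprocessor_type(1))   # narrow-type work
```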

Because different applications require different performance or processing capabilities, different applications are executed by different types of blocks or sub-processors. For example, applications typically executed by DSPs are generally processed by W-type sub-processors, such as block 74 or 76 of FIG. 2, because they characteristically contain commonly occurring DSP kernels. Such kernels include, but are not limited to, fast Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite impulse response (FIR) filters, discrete cosine transform (DCT) or inverse DCT (IDCT), real/complex FIR filters, IIR filters, resistor-capacitor root raised cosine (RRC) filters, color space converters, 3D bilinear texture mapping, Gouraud shading, Golay correlation, bilinear interpolation, median/row/column filters, alpha blending, higher-order surface tessellation, vertex shading (transform/lighting), triangle setup, full-screen anti-aliasing, and quantization.

Other commonly occurring DSP kernels can be executed by N-type sub-processors, such as blocks 78 and 80. These include, but are not limited to, variable-length codecs, Viterbi codecs, turbo codecs, cyclic redundancy checks, Walsh code generators, interleavers/de-interleavers, LFSRs, scramblers, de-spreaders, convolutional encoders, Reed-Solomon codecs, scrambling code generators, and puncturing/de-puncturing.
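To make the notion of a narrow, bit-level kernel concrete, the following is a minimal sketch of one kernel from the list above, a linear feedback shift register (LFSR). It uses a commonly cited 16-bit Fibonacci LFSR with taps 16, 14, 13, and 11; the source does not specify any particular polynomial, so the tap choice, seed, and function names are assumptions.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one bit (taps 16, 14, 13, 11)."""
    # The feedback bit is the XOR of four single-bit taps: 1-bit work.
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (feedback << 15)


def lfsr16_bits(seed, count):
    """Yield `count` output bits; each is the bit shifted out at that step."""
    if not 0 < seed < 0x10000:
        raise ValueError("seed must be a nonzero 16-bit value")
    state = seed
    for _ in range(count):
        yield state & 1
        state = lfsr16_step(state)


def lfsr16_period(seed):
    """Count the steps until the register returns to its seed state."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps


print(list(lfsr16_bits(0xACE1, 8)))
print(lfsr16_period(0xACE1))
```

Every operation here acts on single bits, which is the kind of workload the text assigns to N-type blocks rather than to the 16-bit to 64-bit datapaths of W-type blocks.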

Compared with existing architectural approaches such as RISC, reconfigurable, superscalar, VLIW, and multi-processor approaches, both the W-type and the N-type sub-processors can keep net activity, and the resulting energy per transition, low while maintaining high performance at increased utilization. The sub-processor architecture of the processor 22 reduces die size, resulting in an optimal processing solution, and includes a novel architecture referred to as the "Quasi-Adiabatic" or "COOL" architecture. Programmable processors in accordance with it are referred to as Quasi-Adiabatic Programmable or COOL processors.

Quasi-Adiabatic Programmable or COOL processors optimize the data path, control, memory, and functional-unit granularity to match a finite subset of applications, as described above. The manner in which this is accomplished will become apparent from the figures and the accompanying description, presented below, of the different units, blocks, or circuits of the processor 22 and their interoperation.

These are "Quasi-Adiabatic Programmable" processors, or Concurrent Applications of HeterOgeneous intercOnnect and functionaL unit (COOL) processors. From a thermodynamic point of view, an adiabatic processor wastes no heat and converts all of the energy it uses into useful work. Because of the non-adiabatic nature of existing standard processors, circuit designs, and logic cell library design techniques, it has not so far been possible to build an adiabatic processor. Among the different processor architectures that could feasibly be implemented, however, some come closer to adiabatic than others. The various embodiments of the present invention present a class of processor architectures that are significantly closer to adiabatic than prior-art architectures, yet remain programmable. They are referred to as "quasi-adiabatic programmable processors".

The integrated circuit 20 allows as many applications as the resources within the processor 22 can support to be executed together, or concurrently, and that number far exceeds the number supported by current processors. Examples of applications that the integrated circuit 20 can execute simultaneously or concurrently include, but are not limited to, downloading an application from a wireless device while decoding a received movie; the movie itself can likewise be downloaded and decoded at the same time. Because applications are executed simultaneously on an integrated circuit 20 whose die size, or silicon area, is small relative to the number of applications it supports, the cost of manufacturing the integrated circuit is significantly lower than the cost of the multiple devices of FIG. 1. In addition, the processor 22 offers the user a single programmable framework for executing a number of functions, such as complex multimedia applications. This capability of the integrated circuit 20, or processor 22, is of significant value in supporting the future standards adopted by the industry, which are expected to be even more complex than current ones.

Each of the blocks 74-80 can execute only one sequence (or stream) of a program at a given time. A program sequence is defined by a function associated with a particular application. For example, an FFT is one type of sequence. Different sequences may nevertheless depend on one another. For example, an FFT program, once completed, may store its results in the memory 70, and the next sequence may then use the stored results. Different sequences that share information in this way, or that depend on one another in this way, are referred to as a "stream flow".
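A stream flow of this kind can be sketched as two sequences that run one after the other and communicate only through shared memory. In the sketch below a dictionary stands in for shared data memory 70, a naive discrete Fourier transform stands in for the FFT, and all names are hypothetical.

```python
import cmath

# Stand-in for shared data memory 70 (an assumption for this sketch).
shared_memory = {}


def fft_sequence(samples):
    """First sequence: transform the samples and store the result.

    A naive O(n^2) DFT is used in place of a real FFT to keep the
    sketch short; the output values are the same.
    """
    n = len(samples)
    shared_memory["spectrum"] = [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples))
        for k in range(n)
    ]


def magnitude_sequence():
    """Second sequence: consume the result the first sequence stored."""
    shared_memory["magnitudes"] = [abs(c) for c in shared_memory["spectrum"]]


def run_stream_flow(sequences):
    """Run the sequences in order: one sequence at a time per block."""
    for sequence in sequences:
        sequence()


run_stream_flow([lambda: fft_sequence([1, 0, 0, 0]), magnitude_sequence])
print(shared_memory["magnitudes"])
```

The second sequence never receives the spectrum as an argument; the dependency exists only through the stored result, which is what makes the pair a stream flow in the sense described above.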

In FIG. 2, the memories 70 and 72 each comprise eight blocks of 16 kilobytes of memory, although memories of different sizes may be used in other embodiments.
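The sizing works out to 8 × 16 KB = 128 KB per shared memory. The sketch below also splits a byte address into a block index and an offset, assuming the eight blocks are mapped contiguously; the source does not describe the address mapping, so that layout and the helper name are assumptions.

```python
BLOCKS_PER_MEMORY = 8
BLOCK_SIZE_BYTES = 16 * 1024
MEMORY_SIZE_BYTES = BLOCKS_PER_MEMORY * BLOCK_SIZE_BYTES  # 128 KB per memory


def locate(address):
    """Split a byte address into (block index, offset within the block).

    Assumes a contiguous, non-interleaved layout, which is a guess made
    for illustration and not something the source specifies.
    """
    if not 0 <= address < MEMORY_SIZE_BYTES:
        raise ValueError("address outside the shared memory")
    return divmod(address, BLOCK_SIZE_BYTES)


print(MEMORY_SIZE_BYTES // 1024, "KB per shared memory")
print(locate(0x4000))
```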

The instruction memories 82, 84, 86, and 88 store the instructions to be executed by the blocks 74, 76, 78, and 80, respectively.

FIG. 3 shows further details of the processor 20 in accordance with an embodiment of the present invention. In FIG. 3, the processor 20 is shown to include the sub-processors 74-80, which include instruction caches 302-308, respectively, for storing the instructions to be processed by each sub-processor. The processor 20 is further shown to include an arbitration block 310, a data memory 312, a general-purpose input/output (GPIO) block 314, a shared SoC bus block 316, a radio frequency (RF) interface with DMA block 318, a DMA controller block 320, and a memory controller block 322, coupled as shown in FIG. 3. The data memory 312 stores the data used by the sub-processors and the other blocks, under the direction of the arbitration block 310, which directs the operation of, and the data traffic among, the various structures and blocks shown in FIG. 3. The block 314 coordinates input and output traffic into and out of the processor 22; the block 320 controls the DMA operations performed by the processor 22 through the bus 316; the block 322 controls operations on the memory 312 through the bus 316; and the block 318 includes circuitry for handling DMA operations and can receive and/or transmit RF signals coupled through the signal 324.

Optionally, shared registers 326 and 328 provide direct communication between the two types of sub-processors. For example, in FIG. 3, register 326 is connected to blocks 74 and 78 and stores information to be shared by these blocks, which facilitates applications that use multiple sub-processors to speed up execution. Similarly, register 328 is shown connected to blocks 80 and 76 and serves the same function as register 326.

FIG. 4 shows a high-level block diagram of the blocks or structures included in one of the W-type blocks, such as block 74 or 76, according to an embodiment of the present invention. Block 74 is used as the example in FIG. 4. In FIG. 4 and throughout this specification, functional units or macroblocks are presented with a specific interconnection structure among components such as adders, multipliers, registers, and multiplexers. These macroblocks are called "macro functional units" or "MFUs". An MFU represents an efficient, programmable subset of one or more commonly occurring operations in a finite set of multimedia and communication applications. The high efficiency of the macro functional units comes from replacing critical groups of atomic operations found in the target applications with a set of derived operations that offer better performance and power characteristics. In some cases, commonly occurring operations have been combined in unique ways to reuse hardware efficiently.

In FIG. 4, block 74 is shown to include a load/store MFU block 402, a vector X MFU block 404, a scalar arithmetic logic unit (ALU) and multiply-accumulate (ACC) MFUs block 406, a vector ALU and multiply-accumulate (ACC) MFUs block 408, and a local memory 410, connected together as shown in FIG. 4. Block 402 generates memory addresses and places them on memory address bus 412. Memory data is carried on memory data bus 414, which is connected bidirectionally to blocks 404 and 406. The vector store mask is generated by block 404 and placed on vector store mask bus 416. Further details of each block are presented and described with respect to the figures that follow. Before that, the general function of block 74 and some of its blocks are described below.

Blocks 406 and 408 perform most of the actual computation on data. Load/store MFU block 402 calculates addresses for accesses to and from memory 312 and memory 410. Vector X MFU block 404 rearranges vector data on its way between memory 312 and block 408. Vector X MFU block 404 is also used to generate the vector store mask used when storing a vector to memory 312. Block 406 operates on only one data item at a time, whereas blocks 404 and 408 operate on data in the form of vectors. Block 402 provides addresses for memory accesses. Some computation is performed by block 402, but it is essentially overhead computation.

The machine instruction encoding specifies, where needed, separate operations for the various MFU blocks as well as operations that move data between MFU blocks. All operations in a single instruction are executed in parallel. Vector X MFU block 404 rearranges vector data and generates vector store masks under the control of operations encoded separately in the instruction. Local memory 410 stores information locally, which avoids having to access information outside block 74 on every instruction. Bus 412 is connected to memory 312, to which the memory address is supplied.

Block 402 is shown connected to block 404 via bus 424, to block 406 via bus 426, and to block 410 via bus 428. Blocks 404, 408, and 410 are shown interconnected via vector bus 420, and blocks 406, 404, 408, and 410 are shown interconnected via scalar bus 422. A bus is generally a collection of wires, each carrying one signal; because the wires run in parallel, the signals can be carried in parallel. The number of wires in a bus defines its width in binary bits and is a characteristic of the bus. In FIG. 4, vector bus 420 is wider than scalar bus 422; that is, bus 420 contains more bits or wires and can carry more signals in parallel than bus 422. An example ratio between the widths of bus 420 and bus 422 is four: if bus 422 is 32 bits wide, bus 420 is 128 bits wide, four times 32 bits.

Block 404 also provides the vector store mask, which is placed on bus 416.

Memory data is passed from block 402 to block 406 for computation, whereas vector data is first provided to block 404. It is important to note that block 404 significantly improves performance by providing the ability to rearrange data from memory to suit what the computation unit, block 408, requires.

図５は、本発明の実施形態にしたがい、ブロック４０２に含まれる回路ブロックのブロック図を示す。ブロック４０２は、図５に示されるように共に接続される、アドレスブロック５０２、サーキュラバッファレジスタブロック５０４、アドレスジェネレータブロック５０８、アドレスジェネレータブロック５０６、マルチプレクサ（ｍｕｘ）５１０、およびｍｕｘ５１２を含む。 FIG. 5 shows a block diagram of circuit blocks included in block 402 in accordance with an embodiment of the present invention. Block 402 includes an address block 502, a circular buffer register block 504, an address generator block 508, an address generator block 506, a multiplexer (mux) 510, and a mux 512 connected together as shown in FIG.

Block 502 is connected to the other blocks of block 402 shown in FIG. 4 and stores addresses. Block 504 stores circular buffer ranges, each in one of its circular buffer registers. Blocks 506 and 508 perform address calculations that wrap within a circular buffer range when the program requests it. The arrows pointing into block 504 indicate that its registers can be loaded. Block 506 serves to modify an address generated by block 504, an address received from block 406, or an address generated by block 502, while block 508 serves to modify an address received from block 502 and/or block 406, or from block 504.

ブロック４０２のアドレスレジスタおよびブロック４０４のサーキュラバッファレジスタは、入力をブロック５０６および５０８のアドレスジェネレータに提供する。ブロック４０２のアドレスレジスタの場合、それらの入力は前に保存されたアドレスであり、一方、ブロック４０４のサーキュラバッファレジスタについては、それらの入力はサーキュラバッファに関する情報である。 The address register of block 402 and the circular buffer register of block 404 provide inputs to the address generators of blocks 506 and 508. In the case of the block 402 address register, the inputs are previously stored addresses, while for the block 404 circular buffer register, the inputs are information about the circular buffer.

ブロック５０６および５０８は、アドレスを改変する役割を果たす。すなわち、ブロック５０６は、ブロック５０４によって生成されるアドレス、またはブロック４０６から受信されるアドレス、さらにブロック５０２から生成されるアドレスを改変する役割を果たし、一方、ブロック５０８は、ブロック５０２および／またはブロック４０６、さらにブロック５０４から受信されるアドレスを修正する役割を果たす。ブロック５０６の出力は、次に、ｍｕｘ５１２への入力として提供され、そのｍｕｘ５１２は、ブロック５０２によって生成されるアドレスを入力として受信もする。ｍｕｘ５１２は、次に、その入力のうちの１つを選択し、図４に示されるブロック７４のその他のブロックによる受信のために、選択されたものをバス５２０に接続する。同様に、ブロック５０８の出力は、ｍｕｘ５１０への入力として提供され、そのｍｕｘ５１０は、ブロック５０２によって生成されるアドレスを入力として受信もする。ｍｕｘ５１０は、次に、その入力のうちの１つを選択し、図４に示されるブロック７４のメモリによる受信のために、選択されたものをバス５２２に接続する。 Blocks 506 and 508 serve to modify the address. That is, block 506 serves to modify the address generated by block 504, or the address received from block 406, and further the address generated from block 502, while block 508 includes block 502 and / or block 406, and further serves to modify the address received from block 504. The output of block 506 is then provided as an input to mux 512, which also receives the address generated by block 502 as an input. The mux 512 then selects one of its inputs and connects the selected to the bus 520 for reception by the other blocks of the block 74 shown in FIG. Similarly, the output of block 508 is provided as an input to mux 510, which also receives the address generated by block 502 as an input. The mux 510 then selects one of its inputs and connects the selected to the bus 522 for reception by the memory of block 74 shown in FIG.

In this way, the load/store MFU can generate two addresses in parallel. An address is calculated by combining an address register with either a constant or a value from the scalar ALU MFU. The calculated address can optionally be wrapped within the range of a circular buffer. Calculated addresses are used primarily for accessing memory, but they can also be assigned to an address register or a circular buffer register, or used as an input to other MFUs.
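The address calculation described above can be illustrated with a minimal sketch. It assumes that a circular buffer is described by a start address and a length; the function name `next_address` and its parameters are hypothetical and are not taken from the specification.

```python
def next_address(addr_reg, offset, buf_start=None, buf_len=None):
    """Combine an address register value with a constant or scalar offset.

    If a circular buffer range is given (assumed here to be a start
    address and a length), the result is wrapped into
    [buf_start, buf_start + buf_len).
    """
    addr = addr_reg + offset
    if buf_start is not None and buf_len is not None:
        # Python's % always returns a non-negative result for a positive
        # modulus, so negative offsets wrap correctly as well.
        addr = buf_start + (addr - buf_start) % buf_len
    return addr


# Two addresses produced for the same instruction, one wrapped and one not.
addr_a = next_address(0x1000 + 14, 4, buf_start=0x1000, buf_len=16)
addr_b = next_address(0x2000, 8)
```

In this sketch the wrapped address falls back to the start of the buffer once the offset carries it past the end, while the second address is a plain sum.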

FIG. 6 shows in more detail the general structure used for the register files, with forwarding, of the macro functional units, in particular blocks 402, 404, 406, and 408. In FIG. 6, a plurality of registers 602, a plurality of muxes 604, a crossbar 606, a register block 608, a plurality of relay registers 610, a plurality of functional units 612, and a plurality of muxes 614 are shown according to an embodiment of the present invention. Registers 602 are shown connected to muxes 604, which in turn are shown connected to crossbar 606. Crossbar 606 is shown connected to registers 610, which in turn are connected to functional units 612, and functional units 612 are shown connected to muxes 614. In general, the function of a mux is to select one of the inputs provided to it and to output the selected input. The output of crossbar 606 is also provided to the other blocks of FIG. 4. Although particular numbers of units, muxes, and/or registers are shown in FIG. 6, other numbers of these structures may be used.

The structures of FIG. 6 are connected together as shown in FIG. 6. Muxes 604 are shown to receive additional inputs from the other blocks of FIG. 4, at least two such inputs, as well as the outputs of muxes 614.

The registers and feedback paths of FIG. 6, connected as shown, provide a distinctive organization that optimizes the trade-off among area, energy, and performance. This organization has the following three main characteristics:

・A register file that is visible to assembly language and has more than a few registers is divided into two subsets. A few registers are implemented with full accessibility, while the remaining registers are implemented with more limited accessibility. In most cases, the first four registers (numbered 0 through 3) are the fully accessible ones. For machine operations involving this register file, any of the fully accessible registers can be selected simultaneously as sources and destinations of operations. In contrast, the registers with limited accessibility share only a small number of read and write ports among them: at most two read and write ports and one write port. This arrangement provides most of the benefits of a register file with many read/write ports without requiring more than one or two read/write ports for most of the registers in the set.

・There is a "relay register" at the input of each functional unit. Before a functional unit can be used in a clock cycle, the relay registers at its inputs must be set to the appropriate input values at the end of the previous clock cycle. Functional units that cannot be used at the same time can be grouped to share the same relay registers, which reduces the total number of registers. When the functional units sharing a relay register are not needed in a clock cycle, the register holds its previous value, which reduces switching power consumption in those functional units for that cycle.

・Forwarding between functional units is performed in two stages. In the first stage, the next values of the fully accessible registers are selected through multiplexers, along with the value or values, if any, to be written to the registers with limited accessibility. In the second stage, the next values of the fully accessible registers and the values from the read ports of the registers with limited accessibility are fed together into a crossbar, which selects the values to be written to the relay registers at the end of the clock cycle (for use by the functional units in the next clock cycle). This organization may increase delay because signals pass through multiple stages rather than one, but it minimizes the number of inputs to the crossbar, which strongly affects its size.

Forwarding may or may not be implemented between the write and read ports of the registers with limited accessibility. If it is not implemented there, an additional cycle of latency naturally occurs between an operation that writes one of these registers and a subsequent operation that reads it.

FIG. 7 shows further details of block 408 in high-level block diagram form, according to an embodiment of the present invention. In FIG. 7, vector register block 702 is shown connected to N ALUs block 704, vector element shifter block 706, vector element selector block 708, 2N and N bit converter block 710, N ALUs block 712, and 2N multiplier block 714. Block 408 is further shown to include vector register block 716, which is connected to N adder block 718, N shifter block 720, vector sum block 722, N 3-input adder block 724, 2N and N bit converter 726, mux 723, and mux 732. The blocks and muxes of FIG. 7 are connected together as shown in FIG. 7. Block 702 is connected to the other blocks of FIG. 4 and is further connected to blocks 704 to 714. Block 716 is shown to receive input from block 406 as well as from mux 732, block 710, block 714, and the output of block 724. Block 702 is shown connected to mux 704, which is further connected to blocks 712 and 726. In general, the circuits or blocks of FIG. 7 operate in parallel on vector-type values, such as N values of M bits each, where M is an integer number of bits.

Mux 732 receives as inputs the outputs generated by blocks 718 and 720. Mux 730 receives as inputs the outputs generated by blocks 704 and 706 and in turn generates an output received by block 702. The outputs of blocks 708 and 722 are provided to block 406. As used herein, N is an integer value; for example, "N ALUs" denotes N ALU circuits.

Blocks 702 to 714 and mux 730 generally perform multiply-accumulate (MAC) functions, while blocks 716 to 726 and mux 732 perform ALU functions; the number of bits on which these MAC and ALU functions operate in parallel is generally N times the number of bits processed by block 406. Blocks 704 and 712 are segmentable; that is, they can selectively segment their addition operations. For example, when N 32-bit values are processed in parallel, each ALU block can perform 2N 16-bit additions or 4N 8-bit additions in addition to N 32-bit additions. Block 714 functions in the same manner as block 1110 of FIG. 11, which is described shortly. Blocks 710 and 726 serve to convert N 32-bit values to N 40-bit values, or 2N 16-bit values to 2N 40-bit values. In one example a 32-bit value is converted to a 40-bit value, and in another a 16-bit value is converted to a 40-bit value, thus providing a bit conversion capability.
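Segmented addition can be sketched in software as follows. This is only an illustration of the idea that one 32-bit adder can act as two 16-bit or four 8-bit adders by not propagating carries across lane boundaries; the function name `segmented_add` and the wrap-on-overflow behavior of each lane are assumptions, not details given in the specification.

```python
def segmented_add(a, b, lane_bits, width=32):
    """Add two width-bit words as independent lanes of lane_bits each."""
    if width % lane_bits != 0:
        raise ValueError("lane_bits must divide width")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, width, lane_bits):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # The carry out of a lane is dropped instead of being passed
        # into the neighboring lane.
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result


full = segmented_add(0x0000FFFF, 0x00000001, lane_bits=32)
halves = segmented_add(0x0000FFFF, 0x00000001, lane_bits=16)
```

With a single 32-bit lane the carry ripples into the upper half of the word, whereas with two 16-bit lanes the lower lane wraps to zero and the upper lane is unaffected.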

Block 706 shifts a vector value, that is, N values of M bits each, left or right by an integer amount. As an example of a vector shift, the following vector of eight values
<a0, a1, a2, a3, a4, a5, a6, a7>
can be shifted to yield
<a1, a2, a3, a4, a5, a6, a7, 0>
or
<0, 0, 0, a0, a1, a2, a3, a4>
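The two shifted vectors in the example above can be reproduced with a short sketch. The function name `shift_elements` and the sign convention (a positive count moves elements toward index 0) are assumptions made for illustration; the specification does not define which direction is "left".

```python
def shift_elements(vec, count):
    """Shift whole elements of a vector, filling vacated positions with 0.

    A positive count moves elements toward index 0; a negative count
    moves them toward the last index.
    """
    n = len(vec)
    if count >= 0:
        k = min(count, n)
        return vec[k:] + [0] * k
    k = min(-count, n)
    return [0] * k + vec[:n - k]


v = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
shifted_by_one = shift_elements(v, 1)
shifted_back_three = shift_elements(v, -3)
```

Elements move as whole units, which is consistent with the statement that these shifts are not interpreted as multiplication or division.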

これらの動作は通常、乗算または除算として解釈されない。ブロック７０８は、ベクトル値の単一の要素を選択することを可能にし、例えば、特定のバイト（８ビット）は、ベクトル値から選択され得る。 These operations are not normally interpreted as multiplication or division. Block 708 allows a single element of the vector value to be selected, for example, a particular byte (8 bits) may be selected from the vector value.

ブロック７２０はブロック７０６と同じように機能し、ブロック７２６はブロック７１０と同じように機能する。ブロック７１２および７２６の出力は、ｍｕｘ７０４を介して選択的にブロック７０２に提供され、ブロック７０６および７０４の出力は、ｍｕｘ７３０を介して選択的にブロック７０２に提供される。さらに、ブロック７２０および７１８の出力は、ｍｕｘ７３２を介してブロック７１６に選択的に提供される。 Block 720 functions in the same manner as block 706 and block 726 functions in the same manner as block 710. The outputs of blocks 712 and 726 are selectively provided to block 702 via mux 704, and the outputs of blocks 706 and 704 are selectively provided to block 702 via mux 730. Further, the outputs of blocks 720 and 718 are selectively provided to block 716 via mux 732.

ブロック７２２は、ベクトルベースで加算動作を実行し、一方、ブロック４０８のその他のブロックは、要素ベースで動作する。すなわち、ブロック７２２は、単一のベクトルの全ての要素を加算し、要素ベースで動作するブロックは、異なるベクトルの選択された対応する１つまたは複数の要素に演算を実行する。 Block 722 performs a vector-based addition operation, while the other blocks in block 408 operate on an element basis. That is, block 722 adds all elements of a single vector, and a block operating on an element basis performs an operation on a selected corresponding element or elements of a different vector.
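The distinction between a vector-based sum and an element-based operation can be sketched as follows. The function names are hypothetical, and plain Python integers stand in for the fixed-width hardware values.

```python
def elementwise_add(x, y):
    """Element-based operation: combine corresponding elements of two vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of elements")
    return [a + b for a, b in zip(x, y)]


def vector_sum(x):
    """Vector-based operation: add all elements of a single vector."""
    total = 0
    for element in x:
        total += element
    return total


pairwise = elementwise_add([1, 2, 3, 4], [10, 20, 30, 40])
reduced = vector_sum([1, 2, 3, 4])
```

The element-based form yields a vector of the same length as its inputs, while the vector-based form reduces one vector to a single scalar, which matches the output of block 722 being provided to scalar block 406.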

ブロック７１０および７２６は、それぞれ、Ｎまたは２Ｎからの変換を選択的に可能にする。図８にさらに示されるように、ブロック８０４の出力は、ブロック８０２の入力へフィードバックされる。 Blocks 710 and 726 selectively allow conversion from N or 2N, respectively. As further shown in FIG. 8, the output of block 804 is fed back to the input of block 802.

図８は、本発明の実施形態にしたがい、ブロック図式において、ブロック４０４のさらなる詳細を示す。図８において、ブロック４０４は、図８に示されるように共に接続される、マスク制御レジスタブロック８０２、マスクジェネレータブロック８０４、マスクレジスタブロック８０６、ベクトルレジスタブロック８０８、およびベクトルバイトマスク置換ブロック８１０を含むように示される。 FIG. 8 shows further details of block 404 in a block diagram, in accordance with an embodiment of the present invention. In FIG. 8, block 404 includes a mask control register block 802, a mask generator block 804, a mask register block 806, a vector register block 808, and a vector byte mask replacement block 810 connected together as shown in FIG. As shown.

ブロック８０２は、図４のその他のブロックからの入力を受信し、ブロック８０６に接続されて示されるブロック８０４への入力を生成するように示される。ブロック８０６は、ブロック８０１に接続されて示され、図４のその他のブロックの他に、メモリ３１２にもさらに接続される。ブロック８０８は、メモリ３１２および図４のその他のブロックに接続されて示される。ブロック８１０は、ブロック８０６および８０８からの入力を受信するように接続されて示される。 Block 802 is shown to receive input from the other blocks of FIG. 4 and generate an input to block 804 shown connected to block 806. Block 806 is shown connected to block 801 and is further connected to memory 312 in addition to the other blocks of FIG. Block 808 is shown connected to memory 312 and the other blocks of FIG. Block 810 is shown connected to receive inputs from blocks 806 and 808.

In one example, block 404 has block 808, which is a register file of vector registers of N*32 bits each, for the same N as in block 408. Block 806 of block 404 contains mask registers of N*4 bits each. Each bit of a mask register corresponds to one byte of a vector register. When an N*32-bit vector is stored to the external shared memory, an N*4-bit mask is supplied to indicate which bytes of the vector are actually written to memory (memory bytes corresponding to zero mask bits are left unchanged). The mask generator function computes a 4*N-bit mask based on the settings of the mask control registers.
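A masked vector store with one mask bit per byte can be sketched as follows. The function name `masked_store`, the use of a `bytearray` as memory, and the assumption that mask bit i governs vector byte i are all illustrative choices that the specification does not fix.

```python
def masked_store(memory, base, vector_bytes, mask_bits):
    """Write only the vector bytes whose mask bit is 1.

    Assumed bit order: bit i of mask_bits controls vector_bytes[i].
    Bytes whose mask bit is 0 leave memory unchanged.
    """
    for i, value in enumerate(vector_bytes):
        if (mask_bits >> i) & 1:
            memory[base + i] = value


mem = bytearray([0xEE] * 8)
# N = 1 here: a 32-bit vector (4 bytes) with a 4-bit mask.
masked_store(mem, 2, bytes([1, 2, 3, 4]), 0b0101)
```

After the call, only the first and third bytes of the vector have been written; the other two memory bytes keep their previous contents.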

Block 404 can permute the 8*N bytes of two vector registers to select 4*N bytes. In the general case, the specific permutation is controlled by the value of a third vector register. Certain "pre-coded" permutations do not require a control vector; these include all left and right funnel shifts of the two input vector registers. At the same time that the 8*N bytes of two vector registers are permuted, the 8*N bits of two mask registers can be permuted in the same way, which preserves the bit-to-byte correspondence between mask values and vector values.
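The selection of 4*N bytes out of the 8*N bytes of two registers can be sketched in software. In this sketch the control information is simply a list of source indices into the concatenation of the two inputs; that encoding, and the names `permute` and `funnel_shift_control`, are assumptions for illustration and are not the control vector format of the hardware.

```python
def permute(a, b, control):
    """Select len(control) items from the items of a followed by b."""
    pool = list(a) + list(b)
    return [pool[index] for index in control]


def funnel_shift_control(n_items, shift):
    """Index list for a funnel shift: a window of n_items starting at shift."""
    if not 0 <= shift <= n_items:
        raise ValueError("shift must be between 0 and n_items")
    return [shift + i for i in range(n_items)]


vec_a, vec_b = [0x10, 0x11, 0x12, 0x13], [0x20, 0x21, 0x22, 0x23]
mask_a, mask_b = [1, 0, 1, 1], [0, 1, 1, 0]

ctrl = funnel_shift_control(4, 1)
out_vec = permute(vec_a, vec_b, ctrl)
# Applying the same control to the mask bits keeps each mask bit
# paired with the byte it describes.
out_mask = permute(mask_a, mask_b, ctrl)
```

A funnel shift is thus a special case in which the selected indices form a contiguous window over the two inputs, which is why it needs no separately supplied control vector.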

The blocks of FIG. 8 operate on vector values. Block 810 allows vector values to be rearranged, as described above. This is done using the permutation described further with reference to FIGS. 9 and 10. Block 810 provides information about which permutation is expected. Similarly, the permuted masks from blocks 804 and 806 indicate which permuted mask is supplied. In general, there is one mask bit for each byte stored.

Blocks 802, 804, 806, and 810 of FIG. 8 provide the ability to rearrange addresses in memory to suit the particular application being executed. In the prior art, rearrangement is typically performed automatically, whereas in embodiments of the present invention the programmer can perform the desired rearrangement programmatically, according to the program or code. This allows a nearly unlimited set of rearrangements according to the programmer's needs, which the prior art does not offer; there, the ability to rearrange is predetermined and consists of a fixed set of possible rearrangements. Generating the mask according to the program being executed therefore provides additional flexibility in rearranging addresses in memory.

SIMD is an acronym for Single Instruction, Multiple Data, and MIMD is an acronym for Multiple Instruction, Multiple Data. These are standard terms in computer architecture and programming known to those skilled in the art.

FIGS. 9 and 10 show further details of the permutation circuit of block <number>, where <number> is the number of the "vector byte + mask permutation" box. Block 404 has a functional unit that permutes two vectors to produce a permuted result vector, as shown in FIGS. 9 and 10. The circuit used to perform the permutation can be described in a general way as taking input vectors A and B, each of N units, and producing an output vector Z, also of N units, where a unit is an arbitrary but fixed number of bits and N must be a power of two. Let K be the base-2 logarithm of N. As shown in the figures, the permutation circuit has K+1 stages, each comprising N switch boxes of a given type. There are three types of switch box in all, called "type A", "type B", and "type C". Type A switch boxes are used only in the first stage, type C switch boxes only in the last stage, and all intermediate stages use type B switch boxes. The connections supported by each type of switch box are shown separately. Between the switch boxes of each pair of adjacent stages there is a butterfly exchange, starting with an exchange at distance 1 and working up to an exchange at distance N/2.
The switch box settings are determined entirely by the "control vector", which is the third input to the permutation circuit. The setting of each type A and type C switch box requires only a single bit, and the setting of each type B switch box requires exactly two bits, so the complete control vector requires 2*K*N bits. The control vector may be implied entirely by the permutation instruction being executed or, in some cases, supplied in part or in whole by the program.
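The 2*K*N figure follows from the stage counts given above: one type A stage and one type C stage of N one-bit boxes each, plus K-1 type B stages of N two-bit boxes each, give N + N + 2N(K-1) = 2KN bits. A short sketch that checks this arithmetic and lists the butterfly distances is shown below; the function names are hypothetical, and the sketch models only the counting, not the switch boxes themselves.

```python
def butterfly_distances(n):
    """Exchange distances between adjacent stages: 1, 2, ..., n/2."""
    k = n.bit_length() - 1
    if n < 2 or (1 << k) != n:
        raise ValueError("n must be a power of two, at least 2")
    return [1 << i for i in range(k)]


def control_vector_bits(n):
    """Count control bits for a network of K+1 stages of n switch boxes."""
    k = len(butterfly_distances(n))      # K = log2(N)
    stages = k + 1
    type_a_bits = n * 1                  # first stage, 1 bit per box
    type_c_bits = n * 1                  # last stage, 1 bit per box
    type_b_bits = (stages - 2) * n * 2   # intermediate stages, 2 bits per box
    return type_a_bits + type_b_bits + type_c_bits


bits_for_16 = control_vector_bits(16)
```

For N = 16 units, K is 4, there are five stages, and the exchanges between them are at distances 1, 2, 4, and 8.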

図１１は、本発明の実施形態にしたがい、ブロック図形式において、ブロック４０６の構成要素のさらなる詳細を示す。図１１において、レジスタブロック１１０２は、ＡＬＵブロック１１０４、ビットコンバータブロック１１０６、ＡＬＵブロック１１０８、および乗算器ブロック１１１０に接続されて示される。ブロック４０６は、レジスタブロック１１１２、シフタブロック１１１４、加算器ブロック１１１６、およびビットコンバータブロック１１１８を含むようにさらに示される。Ｍｕｘｅｓ１１２２、１１２０、および１１２４も、図１１に示される。ｍｕｘおよびブロックは、図１１に示されるように、共に接続される。 FIG. 11 shows further details of the components of block 406 in block diagram form, in accordance with an embodiment of the present invention. In FIG. 11, a register block 1102 is shown connected to an ALU block 1104, a bit converter block 1106, an ALU block 1108, and a multiplier block 1110. Block 406 is further shown to include a register block 1112, a shifter block 1114, an adder block 1116, and a bit converter block 1118. Muxes 1122, 1120, and 1124 are also shown in FIG. The mux and block are connected together as shown in FIG.

ブロック１１０２は、図４のメモリ３１２およびその他のブロックに接続されて示され、ｍｕｘ１１２２およびｍｕｘ１１２０からの入力を受信する。シフタブロック１１１４は、ｍｕｘ１１２２の入力のうちの１つを提供し、ブロック１１０４は、その他の入力を提供する。ｍｕｘ１１２０は、ブロック１１１８および１１０８からその入力を受信する。ブロック１１１４は、ブロック１１０２に接続されてさらに示され、ｍｕｘ１１２４は、ブロック１１１２および１１０２から入力を受信し、ブロック１１１４への出力を生成するように示される。 Block 1102 is shown connected to memory 312 and other blocks of FIG. 4 and receives inputs from mux 1122 and mux 1120. Shifter block 1114 provides one of the inputs of mux 1122 and block 1104 provides the other input. The mux 1120 receives its input from blocks 1118 and 1108. Block 1114 is further shown connected to block 1102 and mux 1124 is shown to receive input from blocks 1112 and 1102 and generate output to block 1114.

Block 1112 is shown connected to block 1116, which generates an output that is provided as an input to block 1112. Block 1118 is shown connected to block 1112 and block 1106, and block 1110 is shown connected to block 1112.

Blocks 1102, 1104, 1106, 1108, and 1110 and mux 1122 perform ALU functions, and blocks 1112 through 1118 and mux 1124 perform multiply-accumulate (MAC) functions.

Blocks 1104 and 1108 are ALUs and perform ALU functions; their outputs are selectively provided as inputs (or feedback) to block 1102 via muxes 1122 and 1120. Two ALU operations can be performed per clock cycle. Block 1110 performs a multiply function and generates an output that is provided to block 1112, which can handle a greater number of bits than block 1102. For example, if block 1102 has 32-bit capability, block 1112 has 40-bit capability. Block 1112 is an accumulator register; that is, it serves to accumulate (add up) its inputs.

Block 1106 converts an N-bit value to an (N + X)-bit value, where X is an integer. For example, a 32-bit value can be converted to a 40-bit value. Block 1114 shifts a value by a predetermined number of bits and passes the result to block 1102 via mux 1122.

Block 1118 converts a larger number of bits to a smaller number of bits, for example 40 bits to 32 bits. The block is connected to block 408. Block 406 can perform two ALU operations in parallel on values from block 1102. In place of the first ALU operation, an N-bit shift may be performed, or an N-bit value may be converted to an X-bit value and stored in block 1112. In place of the second ALU operation, a multiplication may be performed by block 1110, with the result stored in one of the registers of block 1112.

Block 406 can perform, in parallel, a 40-bit shift, a 40-bit addition/subtraction, and a conversion from a 40-bit value to a 32-bit value, with the result stored in one of the 32-bit registers of the scalar ALU MMFU.
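The text does not say how the bit converters of blocks 1106 and 1118 treat the extra bits. The sketch below assumes two's-complement sign extension when widening and simple truncation when narrowing; both choices, and the function names, are illustrative assumptions rather than the behavior described in the specification.

```python
def widen(value: int, n_bits: int = 32, extra_bits: int = 8) -> int:
    """Sign-extend an n_bits two's-complement value to n_bits + extra_bits.

    Assumption: sign extension. The source only says N bits become N + X bits.
    """
    value &= (1 << n_bits) - 1
    if value >> (n_bits - 1):                       # sign bit set
        value |= ((1 << extra_bits) - 1) << n_bits  # fill the guard bits
    return value


def narrow(value: int, n_bits: int = 32) -> int:
    """Keep the low n_bits of a wider value.

    Assumption: truncation. Saturation or rounding would be other options.
    """
    return value & ((1 << n_bits) - 1)
```

With the defaults, a 32-bit register value is widened to 40 bits for the accumulator path and narrowed back to 32 bits before it is written to a 32-bit register.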

Further details of one of the N-type sub-processors, such as block 78, are described with reference to the figures that follow. Blocks 406 and 404 of FIG. 4, described for the W-type sub-processor, are common to N-type sub-processors such as block 78.

FIG. 12 shows a high-level block diagram of the details of block 78, in accordance with an embodiment of the present invention. In FIG. 12, block 78 is shown to include a data path unit (DPU) block 1202, a path to memory block 1204, and a controller, sequencer, and data address generator (DAG) block 1206. Blocks 1204 and 1206 are common to the blocks of the W-type sub-processor. Block 1206 is generally functionally the same as block 402.

FIG. 13 shows, in high-level block diagram form, further details of block 78, in accordance with an embodiment of the present invention. In FIG. 13, a store unit block 1302 is shown connected to an X unit block 1304, which in turn is shown connected to a load unit block 1306. Block 1304 is generally functionally the same as block 404, whose details are described above.

Block 1306 is further shown connected to a macro function block 1340, which in turn is shown connected to block 1302 via a macro function bus 1310. Block 1302 is shown to include a store buffer 1314, a store buffer 1312, and a bus interconnect block 1308. Block 1302 generates an output that is provided to a memory, such as memory 312, and is connected to that memory through block 1314 as appropriate. Block 1304 is shown receiving an input and is connected to a memory, such as memory 312. Block 1306 is shown to include a load buffer 1320, a load buffer 1318, and a bus interconnect block 1316, which is connected to block 1340.

Block 1340 is shown to include a Galois field MAC block 1322, a special ALU block 1324, a combiner block 1326, a memory 1328, a puncturing/depuncturing block 1330, an interleaver block 1332, and a Viterbi block 1334, each shown connected to bus 1310. Blocks 1322 through 1332 are each shown receiving input from, and connected to, block 1316. Block 1334 is connected to receive input from block 1332 and to exchange data with that block.

The flow of data is as described above: data or information flows from and through block 1306 to block 1340, then to block 1302, and then out to memory. This introduces a pipelining effect, in which multiple operations overlap and are processed concurrently. For example, information is loaded by block 1306 while other information is stored to memory by block 1302. After data is received from memory by block 1304, it is stored in blocks 1320 and 1318 of block 1306 and then provided to block 1340, where it is processed; the details of that processing are described shortly with reference to the figures that follow.

After processing by block 1340 is complete, the processed data is provided to block 1302 via bus 1310 and is held in blocks 1312 and 1314 until it is ready to be received by memory. The buffers of blocks 1314, 1312, 1318, and 1320 have a predetermined width, that is, a number of bits in parallel. In one example, each of these buffers is 256 bits wide, although other widths may be used.

Values or data that may have been processed by block 1340 can be moved from block 1302 to block 1306 for reuse. In addition, data can be received from memory by block 1304 and then moved to block 1306 for processing. Further details of each of the blocks within block 1340 are presented below. Blocks 1314 and 1312 provide a double-buffering effect that helps reduce the “stalling” that commonly occurs in pipelined operation, and the same is true of blocks 1318 and 1320. Stalling is caused by simultaneous accesses to memory by blocks 1302 and 1306. In another embodiment, blocks 1314 and 1312 may be a single block, and blocks 1318 and 1320 may be a single block.

Latency may be associated with the operations, or there may be a pipelining effect. Latency may be introduced by each of the blocks within block 1340.

FIG. 14 shows further details of block 1322, in accordance with an embodiment of the present invention. In FIG. 14, a Galois field block 1402 is shown connected to an XOR/Clr circuit 1404, which in turn is shown connected to an accumulator register block 1406. Block 1402 is shown generating a Galois field output signal 1408, which serves as an input to a Galois field mux 1410. Mux 1410 receives a further input, generated by the output of block 1406 and referred to as the accumulator register block output signal 1412. Signals 1408 and 1412 serve as inputs to mux 1410 for selectively generating a Galois field MAC output signal 1416, which is connected to bus 1310 of FIG. 13. A select signal 1414, which serves as another input to mux 1410, selects one of signals 1408 and 1412 for generating signal 1416. Thus, either the output of block 1402, which is in effect the result of a Galois field operation, or the result of a Galois field MAC operation is provided as the output of block 1322.

The output of block 1406 is shown connected to circuit 1404 as another of its inputs. The output of block 1404 is provided to block 1406, and this connection accomplishes the MAC portion of the Galois field MAC operation. Block 1404 effectively performs the XOR multiply operation normally used in Galois field MAC operations.

Block 1402 is shown to include a register block 1420 and a register block 1422, which are shown connected to an XOR tree block 1424. Block 1420 is further shown to include a register block 1426, a Galois field multiplication iteration 1 block 1428, a register block 1430, a Galois field multiplication iteration 1 block 1432, a register block 1434, and a register block 1436. Although not illustrated in FIG. 14, additional register blocks like blocks 1434 and 1436 are provided and connected in series between blocks 1434 and 1436.

Block 1424 is shown connected to block 1426, which in turn is shown connected to block 1428, which in turn is shown connected to block 1430, which in turn is shown connected to block 1432, which in turn is shown connected to block 1434. Block 1434 is connected to block 1436 or to one or more register blocks located between block 1434 and block 1436.

In FIG. 14, blocks 1420 and 1422 receive input from block 1306 and, in another embodiment, may be combined into a single block. Block 1402 generally performs Galois field processing known to those skilled in the art, and the remaining blocks of FIG. 14 carry out the MAC operation. Blocks 1426, 1430, 1434, and 1436 serve as different iterations of the Galois tree; in the worst-case scenario the number of iterations has been found to be eight, requiring eight register blocks. The multiply portion of the MAC operation is generally carried out by the XOR operation performed by circuit 1404, and block 1406 serves as the accumulator. Circuit 1404 receives its input from the final iteration of the Galois field operation performed by block 1402, which in FIG. 14 is block 1436.

In operation, block 1322 operates on N-bit values or data, such as 8-bit values, and generates an N-bit value or data from them by shifting the original value eight ways based on another N-bit value. The shifted values are then XORed by block 1404 until the result is reduced to N bits using a reduction constant, and the result is selectively added to the contents of an N-bit accumulator register, such as that of block 1406. A “clear” operation may also be performed by block 1406. Examples of applications using the Galois field MAC operation, that is, block 1322, include, but are not limited to, cyclic redundancy code (CRC) computation, convolutional encoder operations, scrambling code generator operations, and others.
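The shift, XOR, and reduce sequence described above can be sketched in software for N = 8. The reduction constant is not given in the text; the sketch assumes the polynomial 0x11B (the one used by AES) purely for illustration, and the function names are hypothetical.

```python
def gf_mul(a: int, b: int, poly: int = 0x11B, n_bits: int = 8) -> int:
    """Multiply two n_bits values in GF(2^n_bits).

    poly is an assumed reduction polynomial (0x11B, as used by AES);
    the source does not state which constant the hardware uses.
    """
    product = 0
    # Shift the first operand up to n_bits ways, one per set bit of b,
    # and combine the shifted copies with XOR (carry-less addition).
    for i in range(n_bits):
        if (b >> i) & 1:
            product ^= a << i
    # Reduce the (2 * n_bits - 1)-bit product back to n_bits.
    for bit in range(2 * n_bits - 2, n_bits - 1, -1):
        if (product >> bit) & 1:
            product ^= poly << (bit - n_bits)
    return product


def gf_mac(acc: int, a: int, b: int) -> int:
    """Galois field multiply-accumulate: addition in GF(2^n) is XOR."""
    return acc ^ gf_mul(a, b)
```

Because accumulation is an XOR, accumulating the same product twice returns the accumulator to its previous value.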

FIG. 15 shows, in high-level block diagram form, further details of the circuitry included in block 1324, in accordance with an embodiment of the present invention. In FIG. 15, muxes 1504 and 1502 are shown connected to an A register block 1508 and a B register block 1506, respectively. Block 1508 stores a value referred to as A, and block 1506 stores a value referred to as B, where A and B are the data operated on by block 1324. A and B are each N bits wide.

Blocks 1508 and 1506 are shown generating inputs to a conditional register block 1512 and are further shown connected to generate inputs to an add/subtract/abs/difference/conditional add-subtract/multiply (AGU) block 1510, which in turn generates the input to an output register block 1514. Block 1514 is shown connected to a mux 1516, which in turn is shown connected to an adder 1518. Adder 1518 is shown connected to an accumulator register block 1520, whose output is shown serving as another input to adder 1518. Another output of block 1520 is shown serving as an input to a mux 1522, which receives the output of block 1514 as its other input. Mux 1522 generates an output 1530 that is connected to bus 1310. Some of the inputs to muxes 1504 and 1502 are received from block 1316.

Each of muxes 1504 and 1502 is shown receiving four inputs. One input of mux 1504, dp, is received from block 1306, as is the dp input of mux 1502. Another input of mux 1504 comes from the least significant bits of the output of block 1514, as does one of the inputs of mux 1502. Another input of mux 1504 comes from the most significant bits of the same output of block 1514. The remaining input of mux 1504 is the value “0”. One of the inputs of mux 1502 is the value “1”, and another is the value “-1”. The values “0”, “1”, and “-1” are provided to speed up the operations performed by block 1324: because these values are used repeatedly in various operations, having them on hand improves system performance. It should be noted that there may be more than one block 1510, which improves performance. Block 1324 is organized as shown in FIG. 15 so that many of the operations it performs can be completed in a single clock cycle.

In operation, blocks 1510 and 1512 operate on the A and B values provided by blocks 1508 and 1506, respectively. Two other inputs to mux 1516 are generated by a reduction block (not shown in FIG. 15) within block 1520, which is described shortly. These two inputs are referred to herein as the “neighbor accumulator register” and the “reduction accumulator register”, and each is 2N bits wide.

Block 1512 is a 2N-bit-wide register that enables the conditional add or conditional subtract operation performed by block 1510, for use in despreading operations. Block 1512 in effect modifies the A and B values for use by block 1510.

Mux 1522 in effect allows the output of block 1510, after being stored by block 1514, to be selectively provided to block 1302 via signal 1530, as determined by a select signal provided as yet another input to mux 1522. Otherwise, the result of block 1510 undergoes an accumulate-add operation through blocks 1518 and 1520, and the final result is stored in block 1520 before being provided to block 1302.

Block 1324 is an N-layer ALU comprising one or more ALUs that support the following operations:
- N add/subtract operations, in which two N-bit values are operated on to produce their sum or difference
- N-bit XOR of two input values
- Max/min operation on two N-bit input values
- Max* operation on two N-bit input values, the result of which is computed as max(a, b) + constant (from memory or a pre-built lookup table)
- Conditional add-subtract: this function, generally provided through the use of block 1512, conditionally adds or subtracts a stream of N-bit values as determined by an input code. The input code is preloaded into a control register. A “1” in the input code results in a subtract operation and a “0” results in an add operation. The output is available in a 16-bit accumulator register. There is also support for a “collect” operation from the other special ALUs that support this function.
- SAD using the same accumulator as the conditional add-subtract operation
- N x N multiplication.
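Two of the listed operations, conditional add-subtract and SAD, can be sketched as follows. This is a software illustration only, assuming signed two's-complement wraparound in the 16-bit accumulator; the function names and the wraparound behavior are assumptions not stated in the text.

```python
def conditional_add_subtract(values, code_bits, acc_bits: int = 16) -> int:
    """Accumulate a stream of values, subtracting where the code bit is 1.

    A code bit of 1 selects subtraction and 0 selects addition, as in the
    text. Wrapping to a signed acc_bits accumulator is an assumption.
    """
    acc = 0
    for value, bit in zip(values, code_bits):
        acc += -value if bit else value
    acc &= (1 << acc_bits) - 1                 # keep acc_bits bits
    if acc >> (acc_bits - 1):                  # reinterpret as signed
        acc -= 1 << acc_bits
    return acc


def sum_of_absolute_differences(a_values, b_values) -> int:
    """SAD: accumulate |a - b| over paired inputs."""
    return sum(abs(a - b) for a, b in zip(a_values, b_values))
```

In a despreading use, the code bits would be the chips of the spreading code and the values the received samples, so that the accumulator holds the correlation over one symbol.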

Block 1510 is common to the W-type sub-processor, in which each block 1510 can read at least 128 bits; that is, two blocks can read at least 256 bits of data per clock cycle when there is no contention for memory.

FIG. 16 shows a block diagram of a reduction circuit block 1602 contained within block 1520, in accordance with an embodiment of the present invention. FIG. 16 shows M stages of accumulator register circuits, with the details of each accumulator register circuit shown in accumulator register block 1610. For example, accumulator register circuit block 1602 includes four of blocks 1610, connected as shown in FIG. 16. Similarly, each of accumulator register circuit blocks 1604 through 1608 includes four stages of accumulator register circuits like block 1610. The output or result of each stage within each of blocks 1602 through 1608 is used as the input to the next stage and is added to achieve accumulation. Although blocks 1602 through 1608 are shown as including four stages, or four blocks like block 1610, other numbers of blocks or stages may be used.

The result of each of blocks 1602 through 1608 is made available to the other blocks. For example, the result of block 1602 serves as an input to block 1604, the result or output of block 1604 serves as an input to the final accumulator register block within block 1608, and the result or output of block 1606 serves as an input to block 1608. Because the results of the blocks are fed forward and the stages accumulate concurrently, only seven cycles are needed to perform the reduction operation when four-stage accumulator register blocks are used.

Block 1610 is composed of a mux connected to an accumulator. The mux is a 2:1 mux that selects one of two inputs to be provided to the accumulator. One of the two inputs of the mux of block 1610 is provided by the output of block 1514, and the other input is the result of the accumulator register block of the previous stage. The reduction function of FIG. 16 is thus flexible in how it manipulates data. Each of the inputs taken from the output of the immediately preceding stage is referred to as a “neighbor” signal 1616, which generates the neighbor accumulator sequence to mux 1516. Some of the outputs within the stages generate the reduction accumulator seg to mux 1516 and are referred to as “reduction” signals 1618. The output of the final accumulator block of block 1608 generates an output 1620 that is connected to mux 1530. The reduction circuit of FIG. 16 performs the reduction operation in a minimal number of clock cycles, which saves power.

FIG. 17 shows, in high-level block diagram form, further details of the circuitry included in block 1326, in accordance with an embodiment of the present invention. In FIG. 17, block 1326 is shown to include shifters 1702 through 1712 for shifting the data input received from block 1306. In one embodiment, input 1700 is 128 bits wide, although other widths may be used. The output of each of shifters 1702 through 1712 is shown connected to a register bank block 1714. Shifters 1702 through 1712 generate different combinations of the bits of input 1700.

Block 1714 comprises a plurality of registers, including registers 1716 through 1746, that are used to generate combinations of the outputs of shifters 1702 through 1712. For example, the lowest 8 bits of each of the shifter outputs may be routed through a mux to select which of those bits are ultimately produced. Each of the registers of block 1714 can therefore be loaded arbitrarily from the “interesting positions” of the shifted bits. The interesting positions are determined by the output of each of shifters 1702 through 1712. The output of block 1714 is provided to bus 1310.

Thus, in an embodiment of the present invention, block 1326 comprises four 20-bit and two 24-bit input registers. It includes eight 16-bit registers, in which combinations of 32, 16, 8, and 4 bits from the input registers are randomly generated and stored. Block 1326 may be used in one of three modes: 1) using two particular 20-bit registers for output generation; 2) using four 20-bit registers for output generation; or 3) using all seven registers for output generation. Shifters 1702 through 1712 include the input registers, which are not shown because the structure and function of shifters are known to those skilled in the art.

To reduce the amount of hardware, or the number of blocks or circuits, needed to perform the combining function of block 1326, each bit in the 32-bit output register can be filled, in the first mode, from the 8 least significant bits of two 20-bit registers; in the second mode, from the 4 least significant bits of four 20-bit registers; and in the third mode, from the 2 least significant bits of four 20-bit registers and the 4 least significant bits of the 24-bit registers. Random combination from the input registers is a two-step process. The first step involves shifting the “interesting” bits to the least significant positions; from those positions, the output register can then be filled arbitrarily, as the mode allows. In the example used herein in connection with FIG. 17, block 1326 can generate 16 combined bits per cycle when pipelined with the shift operations on the input registers that bring the interesting bits to the least significant positions. Some output combinations may require multiple clock cycles.

Memory 1328 is a conventional random access memory, so it is not described in further detail. Suffice it to say that the size of the memory depends on the application in which the N-type sub-processor is used.

FIG. 18 shows, in high-level block diagram form, further details of the circuitry included in block 1330, in an embodiment of the present invention. In FIG. 18, a one-word register 1802 is shown to include 8 bit positions, and each bit position 1804 can be modified by a bit select circuit 1806. The modifications include, but are not limited to, inserting a “0”, inserting a “1”, NOTing the bit (that is, inverting it), and a “NOP”, or no-operation, which leaves the bit unmodified. The one-word register is replicated; that is, word registers 1810 through 1820 each store and modify a word as register 1802 does. Thus, in the example of 16-bit words and eight words, modification of eight 16-bit words is performed in one clock cycle, unlike conventional DSPs, which require many cycles to do the same. The modification, or puncturing/depuncturing, of each bit of a word is controlled by a mux 1824 and a flip-flop 1826, which are connected to each other and to register 1802 as shown in FIG. 18. Registers 1810 through 1822 are similarly connected to other mux and flip-flop circuits. The mode select bits, which are generated from the instruction code, select which of the four inputs of the mux is chosen. Two of the inputs 1828 to mux 1824 also come from the instruction code, while the other two mux inputs come from memory, one of which may be an inverted version of the other, as shown in FIG. 18.
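The per-bit modification can be illustrated with a short software model. The enumeration and function names below are hypothetical, and the sketch processes one word sequentially, whereas the circuit of FIG. 18 is described as modifying every bit of eight words in a single clock cycle.

```python
from enum import Enum


class BitOp(Enum):
    """The four per-bit choices named in the text (encoding is assumed)."""
    NOP = 0        # leave the bit unmodified
    INSERT_0 = 1   # force the bit to 0
    INSERT_1 = 2   # force the bit to 1
    INVERT = 3     # NOT the bit


def modify_word(word: int, ops: list[BitOp]) -> int:
    """Apply one BitOp per bit position; ops[0] controls the least significant bit."""
    result = 0
    for position, op in enumerate(ops):
        bit = (word >> position) & 1
        if op is BitOp.INSERT_0:
            bit = 0
        elif op is BitOp.INSERT_1:
            bit = 1
        elif op is BitOp.INVERT:
            bit ^= 1
        result |= bit << position
    return result
```

A list of sixteen `BitOp` values plays the role of the control word for one 16-bit word.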

The input to the circuit of block 1330 is generated by block 1332, which, as described shortly, generates fully interleaved, partially interleaved, or non-interleaved N-bit words for block 1330. In one example, the operation is on a 256-bit word, in which case block 1330 operates on 16 bits at a time. A prefetched control word is used to determine which bits within a 16-bit word are to be inverted. Besides inversion, a value of “0” or “1” may optionally be inserted at a particular bit position.

FIG. 19 shows, in high-level block diagram form, further details of the circuitry provided in block 1332, in accordance with an embodiment of the present invention. In FIG. 19, a memory array 1902 is shown to receive input 104 from an input device via bus 1316 and a read enable input 1906 via bus 1316, and to further receive input from a control row-column address generation block 1908, in order to generate an output device signal 1910 that is provided to block 1302. In one example, block 1902 includes a memory array made up of 128 × 16 bits. Data can be written to or read from block 1902 on a row basis or on a column basis; that is, either a row or a column of the memory array of block 1902 can be read. Furthermore, data can be written on a row basis and read on a column basis, and vice versa.
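Writing the array by rows and reading it by columns is the classic block-interleaver access pattern. The following sketch models that pattern in software under stated assumptions: the function names are hypothetical, the default 128 × 16 geometry is taken from the example above, and the address generation performed by block 1908 is not modeled.

```python
def block_interleave(bits, rows=128, cols=16):
    """Write `bits` into a rows x cols array row by row, read it column by column."""
    if len(bits) != rows * cols:
        raise ValueError("input must fill the array exactly")
    array = [bits[r * cols:(r + 1) * cols] for r in range(rows)]
    return [array[r][c] for c in range(cols) for r in range(rows)]

def block_deinterleave(bits, rows=128, cols=16):
    """Inverse access pattern: write by columns, read by rows.

    Swapping the two dimensions turns the interleaver into its own inverse.
    """
    return block_interleave(bits, rows=cols, cols=rows)
```

With a small 2 × 3 array, `block_interleave([0, 1, 2, 3, 4, 5], rows=2, cols=3)` returns `[0, 3, 1, 4, 2, 5]`, and `block_deinterleave` restores the original order.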

FIG. 20 shows, in high-level block diagram form, further details of the circuitry provided in block 1334, in accordance with an embodiment of the present invention. In FIG. 20, a branch metric unit 2002 is shown receiving input from block 1332 and connected to an add/compare/select block. The add/compare/select block is shown connected to a survivor memory block 2012, which in turn is shown connected to mux 2020, and mux 2020 generates an output 2022 that is connected to bus 1310. Mux 2020 is further shown to receive another input from the output of an accumulator 2018, which receives its input from mux 2016. Optionally, a sum of absolute differences (SAD) block 2008 and a despreader block 2010 (for despreading) are used to generate the inputs to mux 2016. If blocks 2008 and 2010 are not present, mux 2016, block 2018, and mux 2020 may be used. Local memory 2006 is shown connected to block 2004. Block 2002 performs branch metric calculations, which are known to those familiar with Viterbi coding/decoding. Survivor paths, likewise known to those familiar with Viterbi coding/decoding, are stored in block 2012.

Block 1334 can perform turbo decoder, SAD, and despreading functions. In one example, 32 to 256 add-compare-select operations can be performed in parallel by block 2004 on 16-bit branch and path metric values supplied by local memory 2006. In one example, the size of the local memory 2006 is 1 kilobit and 16 kilobits.

Block 1334 may include a plurality of blocks 2004, each of which may include an 8-bit signed adder. Each may further include a compare-and-select block that returns a winning path and a decision bit. The add-compare-select operation yields the winning path and the decision bit. The winning path can be shared with neighboring blocks 2004 using a “multicast” interconnect scheme in order to propagate through the trellis. The decision bits, together with the winning branch and path metric values, are saved for backtracking.
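A single add-compare-select step can be sketched as follows. This is an illustrative model only. It assumes that the smaller metric wins (a distance-style convention), that ties go to the first candidate, and that plain Python integers stand in for the fixed-width adders; none of these choices is stated in the text, and the function names are hypothetical.

```python
def add_compare_select(path_metric_0, branch_metric_0,
                       path_metric_1, branch_metric_1):
    """Return (winning_metric, decision_bit) for two competing trellis paths."""
    candidate_0 = path_metric_0 + branch_metric_0  # add
    candidate_1 = path_metric_1 + branch_metric_1  # add
    if candidate_1 < candidate_0:                  # compare (assumed: smaller wins)
        return candidate_1, 1                      # select path 1
    return candidate_0, 0                          # select path 0 (also on ties)

def acs_many(candidates):
    """Apply the ACS step to a list of (pm0, bm0, pm1, bm1) tuples.

    The hardware is described as doing many of these in parallel; this list
    comprehension only models the results, not the parallelism.
    """
    return [add_compare_select(*c) for c in candidates]
```

The decision bits returned for each trellis stage are what a traceback routine would later read from survivor memory.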

Block 2008 uses four 8-bit ALUs and, in one example, can compute four absolute differences per cycle. A reduction tree is incorporated in block 2004 to accumulate the absolute differences into a 16-bit accumulator. A multicast network can be used to send these values on for further reduction. A total of 128 8-bit (64 16-bit) blocks 2008 are possible every clock cycle. However, once all of the overhead is taken into account, a smaller number is considered to be the efficient utilization.
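The sum-of-absolute-differences computation can be modeled in a few lines. The sketch below is an assumption-laden illustration: it treats inputs as unsigned 8-bit values, processes four pairs per modeled cycle as in the example above, and wraps the accumulator at 16 bits (the text does not say whether the hardware wraps or saturates). The names `sad_cycle` and `sad` are hypothetical.

```python
def sad_cycle(a4, b4, acc=0):
    """One modeled cycle: four 8-bit absolute differences added to the accumulator."""
    if len(a4) != 4 or len(b4) != 4:
        raise ValueError("each cycle consumes exactly four value pairs")
    diffs = [abs((x & 0xFF) - (y & 0xFF)) for x, y in zip(a4, b4)]
    return (acc + sum(diffs)) & 0xFFFF  # assumed wrap-around at 16 bits

def sad(block_a, block_b):
    """Sum of absolute differences over two equal-length blocks of 8-bit values."""
    if len(block_a) != len(block_b) or len(block_a) % 4:
        raise ValueError("blocks must match and hold a multiple of four values")
    acc = 0
    for i in range(0, len(block_a), 4):
        acc = sad_cycle(block_a[i:i + 4], block_b[i:i + 4], acc)
    return acc
```

For example, `sad([10, 20, 30, 40], [12, 18, 30, 45])` returns `9`.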

The ALU performs the same conditional add-subtract function that the special ALU block performs, as described above. The control bits needed for despreading must be loaded into local memory, from where they are fetched and stored in registers. The results are accumulated in a 16-bit accumulator, from where they can be transferred to other blocks 2004 for reduction operations. Despreading can, in one example, perform 128 conditional add-subtracts simultaneously in a single cycle. The energy per transition in this unit is higher than that of the special ALU, which also serves several general functions other than despreading and SAD. For a smaller number of fingers or lower motion detection rates, the special ALU is the more power-efficient choice.
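Despreading by conditional add-subtract can be sketched as follows. The sketch assumes that a control bit of 1 means "subtract" and 0 means "add" (the text does not fix the polarity), and it uses an unbounded Python integer in place of the 16-bit accumulator. The function name is hypothetical.

```python
def despread(samples, control_bits):
    """Accumulate samples, adding or subtracting each one per its control bit."""
    if len(samples) != len(control_bits):
        raise ValueError("need one control bit per sample")
    acc = 0
    for sample, bit in zip(samples, control_bits):
        # Assumed polarity: bit 1 -> subtract, bit 0 -> add.
        acc += -sample if bit else sample
    return acc
```

When the control bits match the sign pattern of the spreading code, the contributions add coherently: `despread([3, -3, 3, -3], [0, 1, 0, 1])` returns `12`.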

FIG. 21 illustrates an example of the programming flow and tools for use with processor 22, in accordance with an embodiment of the present invention. FIG. 22 shows an example of the scalability of an embodiment of the present invention. For example, FIG. 22 shows clusters 2202 of W-type and N-type sub-processors interconnected using bus 2204. Each cluster 2202 includes two or four sub-processors. Bus 2204 is, in one example, a standard SoC bus. Interconnectivity is addressed by maintaining a hierarchical design methodology.

Scaling the processor 20 results in clusters of four sub-processors with a separate bus for each cluster; alternatively, the four sub-processors may share a single memory. Scalability in processors has generally been achieved by increasing the number of processors or by increasing processor frequency or speed. Complex applications, however, require more scaling than has conventionally been done. In the present invention, the W-type and N-type sub-processors are modified so that four such sub-processors forming the process can process a single application.

Thus, processor 22 is able to execute the control and sequential DSP code found in the target applications more efficiently than RISC and superscalar processors, based directly on compilation from C code. At the same time, processor 22 is designed to take advantage of the automatic code generation techniques used for RISC and superscalar processors, for legacy and light applications. In addition, processor 22 works with mature, industry-standard software tools, such as Simulink, for application mapping and development. Moore's law can be used to improve the performance of processor 22. Processor 22 is not only a highly parallel machine but also a heterogeneous multiprocessor. It is a proven fact in both industry and academia that parallel, heterogeneous multiprocessors are needed to address demanding multimedia and communications applications. Processor 22 allows many of the automatic code generation techniques used for VLIW to be used, without resorting to power- and area-inefficient techniques. Processor 22 is optimized to exploit repetitive patterns, based on compilation of control code from C. This greatly reduces control power and allows compiled serial code to be executed efficiently. In addition, the programming model of processor 22 is designed to accommodate the large community of DSP programmers, using tools familiar to those programmers, such as Simulink. The development flow provides a means for efficient C compilation of control and sequential DSP code. An extensive set of libraries of highly efficient communications and multimedia kernels is also provided. Examples include parameterized libraries for FFT, IDCT, RRC, Viterbi, VLC, 2D/3D graphics, turbo codes, and descramblers.

The data path design of processor 22 successfully unifies a variety of interconnect structures that connect functional units of various granularities, in order to efficiently address a mix of notable and highly advantageous applications.

The scalability of processor 22 is designed, based on a standard SoC bus, to fit all applications in a single block (time-shared) with nearest-neighbor connections within the block. Because multiple blocks can be used to handle multiple applications without dedicated communication between the blocks, inefficiency is greatly reduced and all system-level non-determinism is reduced.

FIG. 23 shows charts illustrating some of the scalability benefits of the present invention.

Although the present invention has been described in terms of specific embodiments, it will be understood that alternatives and modifications thereof will be apparent to those skilled in the art. Accordingly, the following claims are intended to be construed as covering all such alternatives and modifications as fall within the true spirit and scope of the invention.

FIG. 1 shows an application 10 for a digital product 12 that includes an embodiment of the present invention.
FIGS. 2i and 2ii show an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 connected to a memory controller and direct memory access (DMA) circuit 24, in accordance with an embodiment of the present invention.
FIG. 3 shows further details of the processor 20, in accordance with an embodiment of the present invention.
FIG. 4 shows a high-level block diagram of the blocks or structures provided in one of the W-type blocks, such as block 74 or 76, in accordance with an embodiment of the present invention.
FIG. 5 shows a block diagram of the circuit blocks provided in block 402, in accordance with an embodiment of the present invention.
FIG. 6 shows in more detail the general structure used within the macro functional units, in particular for the register files that forward to blocks 402, 404, 406, and 408.
FIG. 7 shows, in high-level block diagram form, further details of block 408, in accordance with an embodiment of the present invention.
FIG. 8 shows, in block diagram form, further details of block 404, in accordance with an embodiment of the present invention.
FIGS. 9i, 9ii, and 10 show further details of block 404, in particular with respect to performing permutations.
FIG. 11 shows, in block diagram form, further details of the components of block 406, in accordance with an embodiment of the present invention.
FIG. 12 shows a high-level block diagram of details of block 78, in accordance with an embodiment of the present invention.
FIG. 13 shows, in high-level block diagram form, further details of block 78, in accordance with an embodiment of the present invention.
FIG. 14 shows further details of block 1322, in accordance with an embodiment of the present invention.
FIG. 15 shows, in high-level block diagram form, further details of the circuitry provided in block 1324, in accordance with an embodiment of the present invention.
FIGS. 16i and 16ii show block diagrams of a reduction circuit block 1602 provided within block 1520, in accordance with an embodiment of the present invention.
FIG. 17 shows, in high-level block diagram form, further details of the circuitry provided in block 1326, in accordance with an embodiment of the present invention.
FIG. 18 shows, in high-level block diagram form, further details of the circuitry provided in block 1330, in accordance with an embodiment of the present invention.
FIG. 19 shows, in high-level block diagram form, further details of the circuitry provided in block 1332, in accordance with an embodiment of the present invention.
FIG. 20 shows, in high-level block diagram form, further details of the circuitry provided in block 1334, in accordance with an embodiment of the present invention.
FIGS. 21i through 21vi illustrate an example of the programming flow and tools for use with processor 22, in accordance with an embodiment of the present invention.
FIG. 22 shows an example of the scalability of an embodiment of the present invention.
FIGS. 23i and 23ii show charts illustrating some of the scalability benefits of the present invention.

Claims

A heterogeneous, high-performance, expandable processor, comprising:
at least one W-type sub-processor capable of processing W or more bits in parallel, where W is an integer value;
at least one N-type sub-processor capable of processing N bits in parallel, where N is an integer value smaller than W;
a shared bus connecting the at least one W-type sub-processor and the at least one N-type sub-processor; and
a memory connected to and shared by the at least one W-type sub-processor and the at least one N-type sub-processor,
wherein the W-type sub-processor rearranges bytes going into and out of the memory to accommodate execution of an application, thereby enabling high-speed operation.

The heterogeneous, high-performance, expandable processor of claim 1, wherein the processor is expandable.

The heterogeneous, high-performance, expandable processor of claim 1, comprising two of the at least one W-type sub-processor and two of the at least one N-type sub-processor.

The heterogeneous, high-performance, expandable processor of claim 2, wherein the at least one W-type sub-processor and the at least one N-type sub-processor execute programs for multimedia applications.

The heterogeneous, high-performance, expandable processor of claim 4, wherein each of the at least one W-type sub-processor includes a plurality of macro functional units.

The heterogeneous, high-performance, expandable processor of claim 5, wherein the plurality of macro functional units includes a load-store block for generating memory addresses for use by others of the plurality of macro functional units.

The heterogeneous, high-performance, expandable processor of claim 6, wherein the plurality of macro functional units includes a scalar arithmetic logic unit (ALU) and multiply-add block connected to the load-store block, the scalar ALU and multiply-add block performing scalar arithmetic logic operations and multiplication operations on data received from the load-store block.

The heterogeneous, high-performance, expandable processor of claim 7, wherein the plurality of macro functional units includes a vector X block that is connected to the load-store block, the scalar ALU, and a plurality of multiply-add blocks, that performs vector operations on data from the load-store block, and that generates vector data.

The heterogeneous, high-performance, expandable processor of claim 8, wherein the plurality of macro functional units includes a vector ALU and multiply-add block that is connected to the scalar ALU and multiply-add block and to the vector X block, and that performs vector ALU operations and multiply-add operations on vector data received from the vector X block.

The heterogeneous, high-performance, expandable processor of claim 2, wherein the at least one N-type sub-processor includes a store unit block, a macro function block, and a load unit block, the macro function block being connected to the load unit block and further connected to a macro function bus for connecting the macro function block to the store unit block.

The heterogeneous, high-performance, expandable processor of claim 10, wherein the at least one N-type sub-processor includes a data path unit (DPU) block, a controller, a sequencer, and a data address generator (DAG) block, which are shared by at least one of a plurality of W-type sub-processors.

The heterogeneous, high-performance, expandable processor of claim 10, wherein the macro function block includes a Galois field multiply-add (MAC) block that is connected to the macro function bus and to the load unit block 1306 and that performs Galois field operations.

The heterogeneous, high-performance, expandable processor of claim 12, wherein the macro function block includes a special ALU that is connected to the load unit block and the load unit block and that performs special ALU operations.

The heterogeneous, high-performance, expandable processor of claim 13, wherein the macro function block includes a puncturing/depuncturing block that is connected to the load unit block and the load unit block and that performs puncturing/depuncturing operations.

The heterogeneous, high-performance, expandable processor of claim 14, wherein the macro function block includes an interleaver block that is connected to the load unit block and the load unit block and that performs interleaver operations.

The heterogeneous, high-performance, expandable processor of claim 15, wherein the macro function block includes a Viterbi block that is connected to the store unit block and the interleaver block and that performs Viterbi operations.

The heterogeneous, high-performance, expandable processor of claim 16, wherein the macro function block includes a combiner block that is connected to the load unit block and the load unit block and that performs combining operations.

The heterogeneous, high-performance, expandable processor of claim 16, wherein the at least one N-type sub-processor includes an X unit block connected between the store unit block and the load unit block.

The heterogeneous, high-performance, expandable processor of claim 16, including a shared register connected between the at least one W-type sub-processor and the at least one N-type sub-processor for direct communication between the at least one W-type sub-processor and the at least one N-type sub-processor.

A method of processing information using a heterogeneous, high-performance, expandable processor, the method comprising:
processing data using at least one W-type sub-processor capable of processing W bits in parallel, where W is an integer value;
simultaneously processing data using at least one N-type sub-processor capable of processing N bits in parallel, where N is an integer value that is 1/2 times smaller than W; and
providing fast execution of multimedia applications while maintaining low power consumption and ease of programmability.