KR20070055487A

KR20070055487A - Programmable processor architecture

Info

Publication number: KR20070055487A
Application number: KR1020077000909A
Authority: KR
Inventors: Amit Ramchandran; John Reid Hauser, Jr.
Original assignee: 3Plus1 Technology, Inc.
Priority date: 2004-07-13
Filing date: 2005-07-12
Publication date: 2007-05-30
Also published as: EP1779256A2; WO2006017339A2; EP1779256A4; JP2008507039A; WO2006017339A3; CA2572954A1

Abstract

One embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W or more bits in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N or more bits in parallel, N being an integer value smaller than W by a factor of two. The processor further includes a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor, and memory coupled to and shared by the at least one W-type sub-processor and the at least one N-type sub-processor, wherein the W-type sub-processor rearranges bytes transferred to or from the memory to accommodate the execution of applications, thereby enabling fast operations.

Processors, Scalable Processors, Memory

Description

Programmable Processor Architecture

Reference to Related Applications

This application claims the benefit of U.S. Provisional Patent Application No. 60/598,691, filed July 13, 2004, entitled "Quasi-Adiabatic Programmable or COOL Processors Architecture," and of U.S. Provisional Patent Application No. 60/598,417, filed August 2, 2004, entitled "Qasi-Adiabatic Programmable Processor Architecture."

Field of the Invention

The present invention relates generally to processors and, in particular, to a processor that is flexible and scalable for use in multimedia and communications applications while having low power consumption, high performance and low die area.

With the growing popularity of consumer gadgets such as cell or mobile phones, digital cameras, iPods and personal data assistants (PDAs), many new standards for communication with these gadgets have been adopted industry-wide. Some of these standards include H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and security. An emerging problem, however, is the use of different standards among the different devices and for governing their communication, which requires tremendous development effort. One reason for this problem is that no processor or sub-processor currently available on the market can be easily programmed for use by all digital devices while conforming to the various specified standards. Because new trends in consumer electronics make the adoption of many more industry standards in the future evident, it is only a matter of time before this problem worsens.

One of the emerging, if not current, requirements of processors is low power consumption combined with the ability to execute enough code to process multiple applications. Current power consumption is on the order of hundreds of milliwatts per application, whereas the goal is to stay under hundreds of milliwatts while executing multiple applications. Another requirement of processors is low cost. Because processors are used so widely in consumer products, a processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is impractical.

To provide specific examples of current processor problems, the problems associated with RISCs, which are used in some consumer products, microprocessors used in other consumer products, digital signal processors (DSPs) used in yet other consumer products, application specific integrated circuits (ASICs) used in still other consumer products, and some other well-known processors, each exhibiting its own problems, are briefly discussed below. These problems, together with the advantages of using each type, are outlined below in a "Disadvantages" section describing the drawbacks and an "Advantages" section describing the benefits.

A. RISC / Superscalar Processors

RISC and superscalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often augmented with application-specific accelerators to solve certain specialized problems within the scope of the general solution.

Examples include the ARM series, ARC series, StrongARM series and MIPS series.

Advantages:

ㆍ Industry-wide acceptance has led to more mature tool chains and a wide choice of software.

ㆍ A robust programming model results from the very efficient automatic code generators used to generate binaries from high-level languages such as C.

ㆍ Processors in this class are very good general purpose solutions.

ㆍ Moore's law can be used effectively to increase performance.

Disadvantages:

ㆍ The general purpose nature of the architecture does not exploit the common or specific characteristics of a set or subset of applications for better price, power and performance.

ㆍ These processors consume moderate to high amounts of power relative to the amount of computation provided.

ㆍ Performance increases are obtained mostly at the expense of pipeline latency, which adversely affects many multimedia and communication algorithms.

ㆍ Complex hardware schedulers, sophisticated control mechanisms and significantly relaxed restrictions, adopted for more efficient automatic code generation for general algorithms, make solutions in this class area-inefficient.

B. Very Long Instruction Word (VLIW) Processors and DSPs

VLIW architectures removed some of the inefficiencies found in RISC and superscalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The burden of scheduling was shifted from hardware to software to save area.

Examples include the TI 64xx, TI 55xx, StarCore SC140 and ADI SHARC series.

Advantages:

ㆍ Restricting the solution to the signal processing space improved the 3P (price, power and performance) compared to RISC and superscalar architectures.

ㆍ VLIW architectures provide a higher level of parallelism than RISC and superscalar architectures.

ㆍ Efficient tool chains and industry-wide acceptance developed very rapidly.

ㆍ Automatic code generation and programmability show significant improvement as more processors designed for signal processing fall into this class.

Disadvantages:

ㆍ Although the problem-solving scope has been narrowed to the digital signal processing space, it is still too broad for a general solution such as a VLIW machine to have efficient 3P.

ㆍ Control is expensive and power-hungry, especially for the primitive control code found in many multimedia and communication applications.

ㆍ Several power- and area-inefficient techniques are used to make automatic code generation easier. The software community's dependence on these techniques perpetuates the inefficiency from one generation to the next.

ㆍ VLIW architectures are poorly suited to processing serial code.

C. Reconfigurable Computing

Over the past decade or so, many efforts in industry and academia have focused on building solutions that are flexible in their price, power and performance characteristics. Many have challenged existing and mature rules and design paradigms, but with little industrial success. Most attempts have been directed at creating solutions based on coarser-grain, FPGA-like architectures.

Advantages:

ㆍ Certain designs that are restricted to a specific application, and that provide the flexibility needed within that application, have proven to be competitive in price, power and performance.

ㆍ Research has shown that such restricted yet flexible solutions can be created to address many application difficulties.

Disadvantages:

ㆍ Many designs in this space do not provide an efficient and easy programming solution and therefore have not been widely accepted by the community skilled in programming DSPs.

ㆍ Automatic code generation from higher-level languages such as C is either virtually impossible or highly inefficient for many of the designs.

ㆍ The 3P advantage is lost when an attempt is made to combine heterogeneous applications using one level of granularity and one type of interconnect. Utilization of the provided parallelism is very poor.

ㆍ Reconfiguration overhead is significant in the 3P of most designs.

ㆍ In many cases the external interface is complicated, because the proprietary reconfigurable fabric does not match standard system design methodologies.

ㆍ Reconfigurable machines are uni-processors and rely heavily on a tightly integrated RISC even for processing primitive control.

D. Array of Processors

Some recent approaches have focused on implementing reconfigurable systems that are better suited to processing heterogeneous applications. Solutions in this direction connect multiple processors, each optimized for an application or a set of applications, to create a processor array fabric.

Advantages:

ㆍ Different processors optimized for different sets of applications, when connected together using an efficient fabric, can help solve a wide range of problems.

ㆍ A uniform scaling model allows a number of processors to be connected together as performance requirements increase.

ㆍ Complex algorithms can be partitioned efficiently.

Disadvantages:

ㆍ Although performance requirements may be adequately met, power and price inefficiencies are too high.

ㆍ The programming model varies from processor to processor, which makes the application developer's job much harder.

ㆍ Uniform scaling of multiple processors is very costly and power consuming. It has also been shown to exhibit a degree of non-determinism that can be detrimental to the performance of the overall system.

ㆍ The system-level programming model suffers from the complexity of communicating data, code and control information without any shared memory resources, since shared memory cannot be scaled uniformly.

ㆍ The expensive and repetitive glue logic required to connect different types of processors to a homogeneous network adds area inefficiency, increases power and adds latency.

In view of the foregoing, a need exists for a low-power, inexpensive, efficient, high-performance, flexibly programmable, heterogeneous processor that allows one or more multimedia applications to execute simultaneously.

Briefly, one embodiment of the present invention includes a heterogeneous, high-performance, scalable processor having at least one W-type sub-processor capable of processing W or more bits in parallel, W being an integer value, and at least one N-type sub-processor capable of processing N or more bits in parallel, N being an integer value smaller than W. The processor further includes a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor, and memory coupled to and shared by the at least one W-type sub-processor and the at least one N-type sub-processor, wherein the W-type sub-processor rearranges bytes transferred to or from the memory to accommodate the execution of applications, thereby enabling fast operations.
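The byte rearrangement mentioned above can be pictured with a minimal software sketch. This is an illustration only, not the circuit described in the patent: the function name `permute_bytes`, the index-pattern encoding, and the restriction to a strict permutation are all assumptions made for the example.

```python
from typing import Sequence


def permute_bytes(data: bytes, pattern: Sequence[int]) -> bytes:
    """Return a copy of data in which output byte i is data[pattern[i]].

    Hypothetical helper: the pattern encoding is an assumption for
    illustration, not a format taken from the patent.
    """
    if sorted(pattern) != list(range(len(data))):
        raise ValueError("pattern must be a permutation of the byte indices")
    return bytes(data[src] for src in pattern)


# Example: reverse the byte order of a 4-byte word (an endianness swap).
word = bytes([0x11, 0x22, 0x33, 0x44])
swapped = permute_bytes(word, [3, 2, 1, 0])
```

Rearranging bytes while they move between memory and the sub-processor means a kernel can receive its operands already in the order it consumes them, which is one plausible reading of how the rearrangement enables fast operations.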

FIG. 1 shows an application 10 with reference to a digital product 12 that includes an embodiment of the present invention.

FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24, in accordance with an embodiment of the present invention.

FIG. 3 shows further details of the processor 20, in accordance with an embodiment of the present invention.

FIG. 4 shows a high-level block diagram of the blocks or structures included within one of the W-type blocks, such as block 74 or 76, in accordance with an embodiment of the present invention.

FIG. 5 shows a block diagram of the circuit blocks included in block 402, in accordance with an embodiment of the present invention.

FIG. 6 shows further details of the general structure used for register files and forwarding within the macro functional units, in particular blocks 402, 404, 406 and 408.

FIG. 7 shows a further detailed high-level block diagram of block 408, in accordance with an embodiment of the present invention.

FIG. 8 shows a further detailed block diagram of block 404, in accordance with an embodiment of the present invention.

FIGS. 9 and 10 show further details of block 404, in particular with regard to performing permutations.

FIG. 11 shows a further detailed block diagram of the components of block 406, in accordance with an embodiment of the present invention.

FIG. 12 shows a detailed high-level block diagram of block 78, in accordance with an embodiment of the present invention.

FIG. 13 shows another detailed high-level block diagram of block 78, in accordance with an embodiment of the present invention.

FIG. 14 shows further details of block 1322, in accordance with an embodiment of the present invention.

FIG. 15 shows a further detailed high-level block diagram of the circuitry included in block 1324, in accordance with an embodiment of the present invention.

FIG. 16 shows a block diagram of a reduction circuit block 1602 included in block 1520, in accordance with an embodiment of the present invention.

FIG. 17 shows a further detailed high-level block diagram of the circuitry included in block 1326, in accordance with an embodiment of the present invention.

FIG. 18 shows a further detailed high-level block diagram of the circuitry included in block 1330, in accordance with an embodiment of the present invention.

FIG. 19 shows a further detailed high-level block diagram of the circuitry included in block 1332, in accordance with an embodiment of the present invention.

FIG. 20 shows a further detailed high-level block diagram of the circuitry included in block 1334, in accordance with an embodiment of the present invention.

FIG. 21 illustrates the tools and programming flow for use with the processor 22, in accordance with an embodiment of the present invention.

FIG. 22 shows an example of the scalability of an embodiment of the present invention.

FIG. 23 is a chart showing some of the benefits of the scalability of the present invention.

Referring now to FIG. 1, an application 10 is shown with reference to a digital product 12 that includes an embodiment of the present invention. FIG. 1 is intended to give the reader a perspective on some, but not all, of the advantages that a product including an embodiment of the present invention has over those available in the marketplace.

Accordingly, the product 12 is a converged product incorporating all of the applications that need to be executed by today's mobile phone device 14, digital camera device 16, digital recording or music device 18 and PDA device 20. The product 12 is capable of executing one or more of the functions of the devices 14-20 simultaneously, while using less power.

The product 12 is typically battery operated and therefore consumes little power even when executing many of the applications run by the devices 14-20. It is also capable of executing code to perform operations in conformance with a number of applications including, but not limited to, H264, MPEG4, UWB, Bluetooth, 2G/2.5G/3G/4G, GPS, MP3 and security.

FIG. 2 shows an exemplary integrated circuit 20 including a heterogeneous, high-performance, scalable processor 22 coupled to a memory controller and direct memory access (DMA) circuit 24, in accordance with an embodiment of the present invention. Also in FIG. 2, the processor 22 is shown coupled to an interface circuit 26 via a general purpose bus 30, to an interface circuit 28 via a general purpose bus 31, and further to a general purpose processor 32 via the bus 30 and the bus 31. The circuit 20 is also shown to include a clock reset and power management block 34, which generates a clock used by the remaining circuitry of the circuit 20 and a reset signal used in the same manner, and which includes circuitry for managing power. The circuit 20 further includes a Joint Test Action Group (JTAG) circuit 36. JTAG is used as a standard for testing chips.

The interface circuit 26, shown coupled to the bus 30, and the interface circuit 28, shown coupled to the bus 31, include blocks 40-66, which are used by current processors and are well known to those skilled in the art.

The processor 22, which is a heterogeneous multi-processor, is shown to include shared data memory 70, shared data memory 72, a CoolW sub-processor (or block) 74, a CoolW sub-processor (or block) 76, a CoolN sub-processor (or block) 78 and a CoolN sub-processor (or block) 80. Each of the blocks 74-80 has an associated instruction memory: the CoolW block 74 has instruction memory 82, the CoolW block 76 has instruction memory 84, the CoolN block 78 has instruction memory 86 and the CoolN block 80 has instruction memory 88. Similarly, each of the blocks 74-80 has an associated control block: block 74 has control block 90, block 76 has control block 92, block 78 has control block 94 and block 80 has control block 96. The blocks 74 and 76 are generally designed to operate efficiently for 16, 24, 32 and 64-bit operations or applications, whereas the blocks 78 and 80 are generally designed to operate efficiently for 1, 4 or 8-bit operations or applications.
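The block associations and bit widths listed above can be summarized in a small lookup table. The following is a rough sketch for orientation only: the dictionary layout and the `candidate_blocks` helper are hypothetical names introduced here, and the patent does not describe task placement as a software lookup.

```python
# Reference numerals are those given in the description of FIG. 2.
# The data layout itself is an assumption made for illustration.
BLOCKS = {
    74: {"type": "W", "instruction_memory": 82, "control_block": 90},
    76: {"type": "W", "instruction_memory": 84, "control_block": 92},
    78: {"type": "N", "instruction_memory": 86, "control_block": 94},
    80: {"type": "N", "instruction_memory": 88, "control_block": 96},
}

# Operand widths each sub-processor type is said to handle efficiently.
WIDTHS = {"W": (16, 24, 32, 64), "N": (1, 4, 8)}


def candidate_blocks(bit_width: int) -> list:
    """Return the block numerals whose type suits the given operand width."""
    for block_type, widths in WIDTHS.items():
        if bit_width in widths:
            return sorted(n for n, b in BLOCKS.items() if b["type"] == block_type)
    raise ValueError(f"no sub-processor type listed for {bit_width}-bit operations")
```

For instance, `candidate_blocks(32)` selects the two CoolW blocks and `candidate_blocks(8)` the two CoolN blocks.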

The blocks 74-80 are essentially sub-processors; the CoolW blocks 74 and 76 are wide (or W) type blocks, whereas the CoolN blocks 78 and 80 are narrow (or N) type blocks. Wide and narrow refer to the relative number of parallel bits processed or routed within a sub-processor, and this difference gives the processor 22 its heterogeneous character. Furthermore, the circuit 24 is coupled directly to one of the sub-processors, that is, to one of the blocks 74-80, resulting in the lowest-latency path through the sub-processor to which it is coupled. In FIG. 2, the circuit 24 is shown coupled directly to the block 76, although it may be coupled to any of the blocks 74, 78 or 80. Higher-priority agents or tasks may be assigned to the block that is coupled directly to the circuit 24.

While four blocks 74-80 are shown, other numbers of blocks may be used; however, using additional blocks clearly results in additional die area and higher manufacturing costs.

Complex applications requiring great processing power are not scattered across the circuit 20; rather, they are grouped or confined to a particular sub-processor or block for processing, which substantially improves power consumption by eliminating or at least reducing wire (metal) or routing lengths and thereby reducing wire capacitance. Additionally, utilization is increased and activity is reduced, contributing to lower power consumption.

The circuit 20 is an example of a silicon on chip (or SoC) offering quasi-adiabatic programmable sub-processors for multimedia and communications applications and, as noted above, includes two types of sub-processors, W-type and N-type. The W-type, or wide-type, processor is designed for high power, price and performance efficiency in applications requiring 16, 24, 32 and 64-bit processing. The N-type, or narrow-type, processor is designed for high efficiency in applications requiring 8, 4 and 1-bit processing. While these bit counts are used in the embodiments of the present invention shown in the figures and described herein, other bit counts may readily be employed.

Different applications require different performance or processing capabilities and are therefore executed by different types of blocks or sub-processors. Tasks such as applications typically executed by DSPs are generally processed by W-type sub-processors, such as blocks 74 and 76 of FIG. 2, because they characteristically include commonly occurring DSP kernels. Such applications include, but are not limited to, fast Fourier transform (FFT) or inverse FFT (IFFT), adaptive finite impulse response (FIR) filters, discrete cosine transform (DCT) or inverse DCT (IDCT), real/complex FIR filter, IIR filter, resistor capacitor root raised cosine (RRC) filter, color space converter, 3D bilinear texture mapping, Gouraud shading, Golay correlation, bilinear interpolation, median/row/column filter, alpha blending, higher-order surface tessellation, vertex shading (transform/lighting), triangle setup, full-screen anti-aliasing and quantization.
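As a concrete instance of the wide-operand kernels named above, the following sketch shows a direct-form FIR filter in plain Python. It is a generic textbook formulation, not code from the patent; the function name `fir_filter` and the use of unbounded Python integers in place of fixed 16- to 64-bit accumulators are assumptions made for clarity.

```python
from typing import List, Sequence


def fir_filter(samples: Sequence[int], taps: Sequence[int]) -> List[int]:
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k].

    Samples before the start of the input are treated as zero.
    """
    output = []
    for n in range(len(samples)):
        acc = 0  # a hardware datapath would use a wide accumulator here
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        output.append(acc)
    return output


# A two-tap moving sum applied to a short ramp.
ramp_response = fir_filter([1, 2, 3, 4], [1, 1])
```

Each output sample is a sum of multi-bit products, which is the kind of multiply-accumulate work the description assigns to the W-type blocks.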

Other commonly occurring DSP kernels may be executed by N-type sub-processors, such as blocks 78 and 80, and include, but are not limited to, variable length codec, Viterbi codec, turbo codec, cyclic redundancy check, Walsh code generator, interleaver/de-interleaver, LFSR, scrambler, de-spreader, convolutional encoder, Reed-Solomon codec, scrambling code generator and puncturing/de-puncturing.
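To show why such kernels suit narrow datapaths, here is a minimal sketch of one of them, a linear feedback shift register (LFSR). The 4-bit register size, the tap positions and the function name `lfsr4_states` are choices made for this example and are not taken from the patent.

```python
from typing import List


def lfsr4_states(seed: int = 0b0001) -> List[int]:
    """List the states of a 4-bit Fibonacci LFSR until it returns to seed.

    Feedback is the XOR of the two lowest bits, shifted in at the top.
    This tap choice is an assumption for illustration; it cycles through
    all 15 non-zero states.
    """
    if not 0 < seed < 16:
        raise ValueError("seed must be a non-zero 4-bit value")
    state = seed
    states = []
    while True:
        states.append(state)
        feedback = (state ^ (state >> 1)) & 1  # bit 0 XOR bit 1
        state = (state >> 1) | (feedback << 3)
        if state == seed:
            return states


sequence = lfsr4_states()
```

Every step touches single bits only, so a 1-, 4- or 8-bit datapath does this work without the idle width a 32- or 64-bit unit would carry.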

Both the W-type and N-type sub-processors are capable of keeping net activity and the resulting energy per transition low while maintaining high performance with increased utilization, compared to existing architectural approaches such as RISC, reconfigurable, superscalar, VLIW and multi-processor approaches. The sub-processor architecture of the processor 22 reduces die size, resulting in an optimal processing solution, and includes a novel architecture referred to as the "quasi-adiabatic" or "COOL" architecture. Programmable processors in accordance therewith are referred to as quasi-adiabatic programmable or COOL processors.

Quasi-adiabatic programmable or COOL processors optimize data path, control, memory and functional unit granularity to match a finite subset of applications, as described previously. The manner in which this is accomplished will become apparent from the figures and the discussion that follow, which relate to the other units, blocks or circuits of the processor 22 and their inter-operations.

"Quasi-Adiabatic Programmable" or Concurrent Applications of heterOgeneous intercOnnect and functionaL units (COOL) Processors. In thermodynamics, an adiabatic process is one that wastes no heat and transfers all of the energy used into performing useful work. Because of the non-adiabatic nature of existing standard processes, circuit design and logic cell library design techniques, adiabatic processors cannot be built. However, among the possible processor architectures, some come closer to adiabatic than others. The various embodiments of the present invention show a class of processor architectures that are significantly closer to adiabatic than prior-art architectures while nevertheless remaining programmable. They are referred to as "quasi-adiabatic programmable processors."

Integrated circuit 20 allows a large number of applications, as many as can be supported by the resources within processor 22, to be executed together or concurrently, and the number of such applications significantly exceeds that supported by current processors. Examples of applications that can be executed simultaneously or concurrently by integrated circuit 20 include, without limitation, downloading an application from a wireless device while decoding a received movie, so that a movie can be downloaded and decoded at the same time. Because concurrent application execution is achieved on an integrated circuit 20 whose die size or silicon area is small relative to the number of applications it supports, the manufacturing cost of the integrated circuit is significantly lower than that required for the multiple devices of FIG. 1. Additionally, processor 22 provides the user with a single programmable framework for implementing multiple functions, such as multimedia complex applications. Of significant value is the ability of integrated circuit 20, that is, of processor 22, to support future standards adopted by the industry, which are expected to be significantly more complex than today's standards.

Each of blocks 74-80 can execute only one sequence (or stream) of programs at a given time. A sequence of programs refers to a function associated with a particular application. For example, an FFT is one type of sequence. Different sequences may, however, depend on one another. For example, an FFT program, upon completion, may store its results in memory 70, and the next sequence may then use the stored results. Different sequences that share information in this manner, or that depend on one another in this manner, are referred to as a "stream flow".

In FIG. 2, each of memories 70 and 72 includes eight blocks of 16 kilobytes of memory, although in other embodiments a different memory size may be used.

Instruction memories 82, 84, 86, and 88 are used to store the instructions executed by blocks 74-80, respectively.

FIG. 3 shows further details of processor 20 in accordance with an embodiment of the present invention. In FIG. 3, processor 20 is shown to include sub-processors 74-80, each including a respective one of instruction caches 302-308 for storing the instructions processed by that sub-processor. Processor 20 is further shown to include an arbitration block 310, a data memory 312, a general purpose input/output (GPIO) block 314, a shared SoC bus block 316, a radio frequency (RF) interface with DMA block 318, a DMA controller block 320, and a memory controller block 322, coupled together in the manner shown in FIG. 3. Data memory 312 serves as storage for data used by the sub-processors and by other blocks under the supervision of arbitration block 310, which oversees the operations and data traffic of the various structures or blocks shown in FIG. 3. Block 314 regulates input and output traffic to processor 22; block 320 controls DMA operations performed by processor 22 through bus 316; block 322 controls operations on memory 312 through bus 316; and block 318 includes circuitry for handling DMA operations and can transmit and/or receive RF signals coupled through signal(s) 324.

Optionally, shared registers 326 and 328 provide direct communication between the two types of sub-processors. For example, in FIG. 3, register 326 is shown coupled to blocks 74-80 to store information shared by those blocks, which in turn facilitates the execution of applications that utilize more than one sub-processor. Similarly, register 328 is shown coupled to blocks 80 and 76 for the same function as that of register 326.

FIG. 4 shows a high-level block diagram of the blocks or structures included in one of the W-type blocks, such as block 74 or 76, in accordance with an embodiment of the present invention. Block 74 is used as the example in FIG. 4. In FIG. 4, and throughout this document, functional units or macroblocks having a very specific interconnect structure among components such as adders, multipliers, registers, and multiplexers are shown. These macroblocks are referred to as "macro functional units" or "MFUs". An MFU represents an efficiently programmable subset of one or more commonly occurring operations within a finite set of multimedia and communication applications. The high efficiency of the macro functional units results from replacing critical groups of micro-operations found in the target applications with a set of derived operations that exhibit very good performance and power characteristics. In some cases, commonly occurring operations are combined in unique ways so as to reuse hardware efficiently.

In FIG. 4, block 74 is shown to include a load/store MFU block 402, a scalar arithmetic logic unit (ALU) and multiply-accumulate (ACC) MFUs block 406, a vector x MFU block 404, a vector ALU and multiply-accumulate MFU block 408, and a local memory 410, coupled together in the manner shown in FIG. 4. Block 402 generates memory addresses and couples them onto memory addresses bus 412. Memory data is coupled onto memory data bus 414, which is bidirectionally coupled to blocks 404 and 406. A vector store mask is generated by block 404 and coupled onto vector store mask bus 416. Further details of each block are shown and described with reference to subsequent figures. Before that presentation, the general features and blocks of block 74 are described as follows.

Blocks 406 and 408 perform most of the actual computation on data. Load/store MFU block 402 computes the addresses for accesses to and from memory 410 and memory 312. Vector x MFU block 404 rearranges vector data on its way between memory 312 and block 408. Vector x MFU block 404 is also used to generate the vector store masks for vector stores to memory 312. Block 406 operates on only one piece of data at a given time, whereas blocks 404 and 408 operate on data in the form of vectors. Block 402 provides the addresses for memory accesses. Some computation is performed by block 402, but it is essentially overhead computation.

A machine instruction encodes separate operations for the various MFU blocks (as needed), in addition to operations for moving data between MFU blocks. All operations of a single instruction execute in parallel. Vector x MFU block 404 rearranges vector data and generates vector store masks under the control of the individually encoded operations of the instructions. Local memory 410 is used to store information locally, avoiding the need to access information outside block 74 for every instruction. Bus 412, through which memory addresses are provided, is coupled to memory 312.

Block 402 is shown coupled to block 44 through bus 424, to block 406 through bus 426, and to block 410 through bus 428. Blocks 404, 408, and 410 are shown coupled to one another through vector bus 420, and blocks 406, 404, 408, and 410 are shown coupled to one another through scalar bus 422. A bus is generally a group of wires, each carrying a signal, that run in parallel with one another and can therefore carry signals in parallel. In FIG. 4, vector bus 420 is wider than scalar bus 422. That is, bus 420 includes more bits or wires than bus 422 and can carry more signals in parallel. An example of the ratio of the bit width of bus 420 to that of bus 422 is a factor of four, as when bus 422 is 32 bits wide and bus 420 is 4 x 32, or 128, bits wide.

Block 404 also provides the vector store mask, which is coupled onto bus 416.

Memory data is coupled from block 402 to block 406 for computational operations, whereas vector data is first provided to block 404. It is important to note that block 404 provides the ability to organize data from memory to match what is needed by the computation unit, i.e., block 408, thereby providing a significant increase in performance.

FIG. 5 shows a block diagram of the circuit blocks included in block 402, in accordance with an embodiment of the present invention. Block 402 includes an address block 502, a circular buffer register block 504, an address generator block 508, an address generator block 506, a multiplexer (mux) 510, and a mux 512, coupled together in the manner shown in FIG. 5.

Block 502 is coupled to the other blocks of block 402, as shown in FIG. 4, and stores addresses. Block 504 functions to store a circular buffer range in one of the circular buffer registers of block 504. Blocks 506 and 508 cause an address computation to wrap around within a circular buffer when requested by the program. The arrow leading to block 504 allows these registers to be loaded. Block 506 functions to modify the addresses generated by block 504, an address received from block 406, or even the addresses generated by block 502, and block 508 functions to modify the addresses received from block 502 and/or block 406 and even block 504.

The address registers of block 402 and the circular buffer registers of block 404 provide inputs to the address generators of blocks 506 and 508. In the case of the address registers of block 402, these inputs are previously stored addresses; in the case of the circular buffer registers of block 404, these inputs are information about the circular buffers.

Blocks 506 and 508 function to modify addresses. That is, block 506 functions to modify the addresses generated by block 504, an address received from block 406, or even the addresses generated by block 502, and block 508 functions to modify the addresses received from block 502 and/or block 406 and even block 504. The output of block 506 is provided as an input to mux 512, which also receives, as an input, the addresses generated by block 502. Mux 512 then selects one of its inputs and couples it onto bus 520 for receipt by the other blocks of block 74, as shown in FIG. 4. Similarly, the output of block 508 is provided as an input to mux 510, which also receives, as an input, the addresses generated by block 502. Mux 510 then selects one of its inputs and couples it onto bus 522 for receipt by the memories of block 74, as shown in FIG. 4.

Accordingly, the load/store MFU can generate two addresses in parallel. An address is computed by combining an address register with either a value from the scalar ALU MFU or a constant. The computed address can optionally be wrapped around within the bounds of a circular buffer. Computed addresses are intended primarily for use in accessing memories, but they may also be used as inputs to other MFUs or be assigned to address registers or circular buffer registers.
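The address computation described above can be sketched in software as follows. This is only an illustrative model, not the circuit of FIG. 5: the function name `compute_address`, the representation of a circular buffer as a `(base, length)` pair, and the use of simple addition to combine the address register with the offset are all assumptions made for the example.

```python
def compute_address(addr_reg, offset, circ=None):
    """Combine an address register with an offset (a constant or a scalar value).

    circ is a hypothetical (base, length) pair describing a circular buffer;
    when given, the result is wrapped around inside [base, base + length).
    """
    addr = addr_reg + offset  # assumed combination: plain addition
    if circ is not None:
        base, length = circ
        addr = base + (addr - base) % length  # wrap around within the buffer
    return addr


def compute_two_addresses(request_a, request_b):
    """Model the two address generators by evaluating two requests together."""
    return compute_address(*request_a), compute_address(*request_b)
```

For instance, with a 16-byte circular buffer starting at 0x100, advancing the address 0x10E by 4 wraps to 0x102, whereas the same request without a circular buffer yields 0x112.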

FIG. 6 shows, in further detail, the general structure used for register files and forwarding within the macro functional units, in particular blocks 402, 404, 406, and 408. In FIG. 6, a number of registers 602, a number of muxes 604, a crossbar 606, a register block 608, a number of staging registers 610, a number of functional units 612, and a number of muxes 614 are shown in accordance with an embodiment of the present invention. Registers 602 are shown coupled to muxes 604, which are in turn shown coupled to crossbar 606. Crossbar 606 is shown coupled to registers 610, which are in turn shown coupled to functional units 612, and functional units 612 are shown coupled to muxes 614. In general, the function of a mux is to select among the inputs provided to it and to output the selected input. The output of crossbar 606 is also provided to other blocks of FIG. 4. Although a specific number of units, muxes, and/or registers is shown in FIG. 6, other numbers of these structures may be used.

The structures of FIG. 6 are coupled together in the manner shown therein. Muxes 604 are shown to receive additional inputs from the other blocks of FIG. 4, at least two such inputs, as well as the outputs of muxes 614.

The registers and feedback paths (connections) of FIG. 6 provide a unique organization that optimizes the trade-off among area, energy, and performance. This organization has three main characteristics.

■ Register files that are visible to the assembly language and have more than a few registers are divided into two subsets: a few registers are implemented with full accessibility, and the remaining registers are implemented with more limited accessibility. In most cases, only the first four registers (numbered 0 through 3) support full accessibility. For machine operations involving such a register file, any or all of the fully accessible registers can be selected at the same time as sources and destinations of operations. In contrast, the registers with limited accessibility share only a small number of read and write ports among them. In most cases, the registers with limited accessibility share two read ports and one write port. This arrangement provides most of the benefits of a register file with many read and write ports without requiring more than one or two read/write ports for most of the registers in the set.

■ "Staging registers" are present at the inputs of every functional unit. Before a functional unit can be used in a given clock cycle, its input staging registers must be set, at the end of the previous clock cycle, with the appropriate input values. Functional units that cannot be used at the same time may be grouped together to share the same staging registers, reducing the total number of registers. If none of the functional units sharing a set of staging registers is needed in a clock cycle, the previous values of the registers are retained, which eliminates transition power consumption within those functional units for that cycle.

■ Forwarding between functional units is implemented in two stages. In the first stage, the next values of the fully accessible registers are selected through multiplexers, together with the value or values, if any, to be written to the registers with limited accessibility. In the second stage, the next values of the fully accessible registers and the values from the read ports of the registers with limited accessibility are fed together into the crossbar, which selects the values to be written into the staging registers at the end of the clock cycle (and hence to be used by the functional units in the next clock cycle). This organization minimizes the number of inputs to the crossbar, which strongly affects its size, at the cost of the increased delay that may result from passing through two multiplexing stages rather than one.

Forwarding between the write and read ports of the registers with limited accessibility may or may not be implemented. If forwarding is not implemented there, one extra cycle of delay exists between an operation that writes one of these registers and a subsequent operation that reads it.

FIG. 7 shows further details of block 408, in high-level block diagram form, in accordance with an embodiment of the present invention. In FIG. 7, vector register block 702 is shown coupled to N ALUs block 704, vector element shifter block 706, vector element selector block 708, 2N and N bit converter block 710, N ALUs block 712, and 2N multiplier block 714. Block 408 is further shown to include vector registers block 716 coupled to N adders block 718, N shifters block 720, vector sum block 722, N 3-input adders block 724, 2N and N bit converter 726, mux 723, and mux 732. The blocks and muxes of FIG. 7 are coupled together in the manner shown in FIG. 7. Block 702 is coupled to the other blocks of FIG. 4 and is further coupled to blocks 704-714. Block 716 is shown to receive input from block 406 and from the outputs of mux 732, block 710, and block 714. Generally, the circuits or blocks of FIG. 7 operate in parallel on a vector-type value, such as N M-bit values, where M is an integer number of bits.

Mux 732 receives, as inputs, the outputs generated by blocks 718 and 720, and mux 730 receives the outputs generated by blocks 704 and 706 and generates an output that is received by block 702. The outputs of blocks 708 and 722 are provided to block 406. As used herein, N is an integer value; for example, N ALUs are N ALU circuits.

Blocks 702-714 and mux 730 generally perform a multiply-accumulate (MAC) function, whereas blocks 716-726 and mux 732 perform an ALU function; the number of bits on which these MAC and ALU functions are performed in parallel is generally N times greater than the number of bits processed by block 406. Blocks 704 and 712 are segmentable; that is, they can optionally segment an addition operation. For example, where N 32-bit values are processed in parallel, each ALU block can perform 2N 16-bit addition operations or 4N 8-bit addition operations, in addition to N 32-bit addition operations. Block 714 functions in the same manner as block 1110 of FIG. 11, which is discussed shortly. Blocks 710 and 726 function to convert N 40 bit values or 2N 16 bit values to 2N 40 bit values. In one example, a 32-bit value is converted to a 40-bit value, and in another example, a 16-bit value is converted to a 40-bit value, thereby providing a bit conversion function.
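The segmentable addition described above can be illustrated for a single 32-bit lane. The sketch below is a behavioral model only; the function name `segmented_add` is hypothetical, and the choice of wrap-around (modular) arithmetic within each segment is an assumption, since the text does not state how overflow of a segment is handled.

```python
def segmented_add(a, b, seg_bits, width=32):
    """Add two width-bit words as independent seg_bits-wide segments.

    Carries are not propagated across segment boundaries; each segment
    wraps around on overflow (an assumption made for this sketch).
    """
    mask = (1 << seg_bits) - 1
    result = 0
    for shift in range(0, width, seg_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # drop the carry out of the segment
    return result
```

With `seg_bits` set to 32, 16, or 8, the same pair of 32-bit words is treated as one 32-bit addition, two 16-bit additions, or four 8-bit additions, which corresponds to the N, 2N, and 4N cases when N such lanes operate in parallel.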

Block 706 shifts a vector value, i.e., N M-bit values, right or left by an integer amount. An example of a vector shift is to take the vector

<a0, a1, a2, a3, a4, a5, a6, a7>

which in this case has eight values, and to return the vector

<a1, a2, a3, a4, a5, a6, a7, 0>

or, possibly,

<0, 0, 0, a0, a1, a2, a3, a4>.

These operations are generally not to be interpreted as any kind of multiplication or division. Block 708 allows a single element of a vector value to be selected; for example, a particular byte (8 bits) may be selected from the vector value.
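The element shift of block 706 moves whole elements rather than bits, which is why it is not a multiplication or division. A minimal sketch follows; the function name `shift_elements` and the sign convention for the shift count are assumptions made for illustration.

```python
def shift_elements(vec, count):
    """Shift the elements of vec by |count| positions, filling with zeros.

    A positive count moves elements toward index 0; a negative count moves
    them toward the last index (sign convention assumed for this sketch).
    """
    n = len(vec)
    if count >= 0:
        return vec[count:] + [0] * min(count, n)
    count = -count
    return [0] * min(count, n) + vec[:max(n - count, 0)]


def select_element(vec, index):
    """Model of the element selector: pick one element of the vector."""
    return vec[index]
```

Applied to the eight-element vector of the example, a count of 1 gives `<a1, ..., a7, 0>` and a count of -3 gives `<0, 0, 0, a0, ..., a4>`.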

Block 720 functions in a manner similar to block 706, and block 726 functions in a manner similar to block 710. The outputs of blocks 712 and 726 are selectively provided to block 702 through mux 704, and the outputs of blocks 706 and 704 are selectively provided to block 702 through mux 730. In addition, the outputs of blocks 720 and 718 are selectively provided to block 716 through mux 732.

Block 722 performs addition on a vector basis, whereas the other blocks of block 408 operate on an element basis. That is, block 722 adds all of the elements of a single vector together, while the blocks operating on an element basis perform an operation on one or more selected, corresponding elements of different vectors.
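The distinction between the two modes can be shown with a short sketch. The function names are hypothetical, and plain Python lists stand in for vector registers.

```python
def elementwise_add(x, y):
    """Element basis: combine corresponding elements of two vectors."""
    return [a + b for a, b in zip(x, y)]


def vector_sum(x):
    """Vector basis: add all elements of a single vector into one scalar."""
    return sum(x)
```

An element-wise operation on two N-element vectors yields another N-element vector, whereas the vector sum reduces one N-element vector to a single value, which is consistent with the output of block 722 being provided to the scalar block 406.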

Blocks 710 and 726 each optionally enable conversion from N to 2N. FIG. 8 further shows the output of block 804 fed back to the input of block 802.

FIG. 8 shows further details of block 404, in block diagram form, in accordance with an embodiment of the present invention. In FIG. 8, block 404 is shown to include a mask control registers block 802, a mask generator block 804, a mask registers block 806, a vector registers block 808, and a vector byte mask permutation block 810, coupled together in the manner shown in FIG. 8.

Block 802 is shown to receive input from the other blocks of FIG. 4 and to generate input to block 804, and block 804 is shown coupled to block 806. Block 806 is shown coupled to block 801 and is further coupled to the other blocks of FIG. 4 and to memory 312. Block 808 is shown coupled to memory 312 and to the other blocks of FIG. 4. Block 810 is shown coupled to receive input from blocks 806 and 808.

In one embodiment, block 404 has a register file, block 808, made up of N*32-bit vector registers, for the same N as in block 408. Block 806 of block 404 contains N*4-bit mask registers. Each bit of a mask register corresponds to one byte of a vector register. When an N*32-bit vector is stored to the external shared memory, an N*4-bit mask can be supplied to indicate which bytes of the vector are actually written to memory. (Memory bytes corresponding to 0 bits in the mask remain unchanged.) The mask generator function computes a 4*N-bit mask based on the settings of a mask control register.
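The effect of a vector store mask can be modeled as a byte-granular conditional write. In the sketch below, the function name `masked_store` and the mapping of mask bit i to vector byte i are assumptions made for illustration; the text specifies only that there is one mask bit per vector byte and that bytes with a 0 mask bit are left unchanged.

```python
def masked_store(memory, addr, vector_bytes, mask_bits):
    """Write vector_bytes to memory at addr, one byte per set mask bit.

    memory is a mutable bytearray; bit i of mask_bits is assumed to guard
    byte i of the vector. Bytes whose mask bit is 0 are left untouched.
    """
    for i, value in enumerate(vector_bytes):
        if (mask_bits >> i) & 1:
            memory[addr + i] = value
```

For N = 1 the vector is 4 bytes and the mask is 4 bits; storing the bytes 1, 2, 3, 4 with mask 0b0101 writes only the first and third bytes.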

Block 404 can permute the 8*N bytes of two vector registers to select 4*N bytes. In the general case, the particular permutation is controlled by the value of a third vector register. Certain "precoded" permutations do not require a control vector; these include all funnel shifts, left and right, of the two input vector registers. At the same time that the 8*N bytes of the two vector registers are permuted, the 8*N bits of two mask registers are permuted identically, maintaining the bit-for-byte correspondence between mask and vector values.
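A behavioral sketch of this two-input permutation follows. It models the control as a plain list of source byte indices into the concatenation of the two inputs, which is an assumption made for readability; the hardware control vector described in the specification is encoded differently. The names `permute_with_mask` and `funnel_shift_control` are hypothetical.

```python
def permute_with_mask(a, b, mask_a, mask_b, control):
    """Select len(control) bytes from the pool a+b.

    The mask bits are permuted with the same indices, so each output
    mask bit still describes its own output byte.
    """
    data_pool = list(a) + list(b)
    mask_pool = list(mask_a) + list(mask_b)
    z = bytes(data_pool[i] for i in control)
    mask_z = [mask_pool[i] for i in control]
    return z, mask_z


def funnel_shift_control(width: int, shift: int) -> list[int]:
    """Index list for a funnel shift: a window of `width` bytes
    starting `shift` bytes into the concatenated inputs."""
    return [shift + i for i in range(width)]


a, b = bytes([0, 1, 2, 3]), bytes([4, 5, 6, 7])
mask_a, mask_b = [1, 1, 0, 0], [0, 1, 1, 1]
z, mask_z = permute_with_mask(a, b, mask_a, mask_b, funnel_shift_control(4, 2))
```

Here `z` holds bytes 2 through 5 of the concatenated inputs, and `mask_z` holds the mask bits that belonged to exactly those bytes.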

The blocks of FIG. 8 operate on a vector basis. Block 810 allows vector values to be rearranged as described above. This is done using the permutations further described with reference to FIGS. 9 and 10. Block 810 provides information about which permutation is expected. Similarly, the permuted mask from blocks 804 and 806 indicates which permuted mask is to be provided. In general, one mask bit is stored for every byte.

Blocks 802, 804, 806, and 810 of FIG. 8 provide the capability to rearrange addresses in memory to suit the particular application being executed. In prior art techniques, rearrangement is typically performed automatically, whereas in embodiments of the present invention the programmer can perform the rearrangement programmably, as needed, through the program or code. This allows a nearly unlimited set of rearrangements according to the programmer's needs, which prior art techniques simply do not offer, because in them the rearrangement function is predetermined and covers only a predetermined set of rearrangement possibilities.

SIMD is an acronym for _single instruction, multiple data_, and MIMD is an acronym for _multiple instruction, multiple data_. These are standard terms of computer architecture and programming known in the art.

FIGS. 9 and 10 show further details of the permute circuit of block <number>, where <number> is the number of the "vector byte + mask permute" box. Block 404 has a functional unit that performs a permutation of two vectors to produce a permuted result vector, as shown in FIGS. 9 and 10. The circuit used to perform the permutation can be described in a general way as taking two input vectors A and B of N units each and producing an output vector Z, also of N units, where a unit may be any arbitrary but uniform number of bits and N must be a power of two. Let K be the base-2 logarithm of N. As shown in the figures, the permute circuit has K+1 stages, each with N switch boxes of a particular type. In all, there are three types of switch boxes, called "type A", "type B", and "type C". Switch box type A is used only in the first stage, type C only in the last stage, and all intermediate stages use only type B.

The connections supported by each type of switch box are shown separately. Between the switch boxes of each pair of adjacent stages is a butterfly exchange, starting with an exchange of distance 1 and working up to an exchange of distance N/2. The settings of the switch boxes are all determined independently by a "control vector" that is a third input to the permute circuit. Because the setting of each type A and type C switch box needs only a single bit to specify it, and each type B switch box needs exactly two bits, a complete control vector requires 2*K*N bits. The control vector may be implied entirely by the permute instruction being executed, or it may be supplied partly or wholly by the program in some manner.
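The 2*K*N figure follows from counting stages: one type A stage and one type C stage contribute N bits each, and the K-1 type B stages contribute 2N bits each, so the total is 2N + 2(K-1)N = 2KN. The short Python sketch below tabulates this for a given N; the function name `permute_network_shape` and the returned dictionary layout are illustrative assumptions, not part of the specification.

```python
def permute_network_shape(n: int) -> dict:
    """Tabulate stage types, butterfly distances, and control-vector
    width for a permute network over N units (N a power of two, N >= 2)."""
    if n < 2 or n & (n - 1):
        raise ValueError("N must be a power of two and at least 2")
    k = n.bit_length() - 1                      # K = log2(N)
    stage_types = ["A"] + ["B"] * (k - 1) + ["C"]   # K+1 stages in total
    bits_per_box = {"A": 1, "B": 2, "C": 1}
    control_bits = sum(n * bits_per_box[t] for t in stage_types)
    # One butterfly exchange between each pair of adjacent stages,
    # with distances 1, 2, ..., N/2.
    distances = [1 << s for s in range(k)]
    return {
        "K": k,
        "stage_types": stage_types,
        "butterfly_distances": distances,
        "control_bits": control_bits,
    }


shape = permute_network_shape(16)
```

For N = 16 this gives K = 4, five stages (A, B, B, B, C), butterfly distances 1, 2, 4, 8, and a 128-bit control vector, which matches 2*K*N.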

FIG. 11 shows, in block diagram form, further details of the components of block 406, in accordance with the present invention. In FIG. 11, a registers block 1102 is shown coupled to an ALU block 1104, a bit converter block 1106, an ALU block 1108, and a multiplier block 1110. Block 406 is further shown to include a register block 1112, a shifter block 1114, an adder block 1116, and a bit converter block 1118. Muxes 1122, 1120, and 1124 are also shown in FIG. 11. The muxes and blocks of FIG. 11 are coupled together in the manner shown there.

Block 1102 is shown coupled to memory 312 and to the other blocks of FIG. 4, and receives input from mux 1122 and mux 1120. Shifter block 1114 provides one of the inputs of mux 1122, and block 1104 provides the other. Mux 1120 receives its inputs from blocks 1118 and 1108. Block 1114 is further shown coupled to block 1102, and mux 1124 is shown to receive inputs from blocks 1112 and 1102 and to generate an output to block 1114.

Block 1112 is shown coupled to block 1116, which generates an output that is provided as input to block 1112. Block 1118 is shown coupled to block 1112, and blocks 1106 and 1110 are shown coupled to block 1112.

Blocks 1102, 1104, 1106, 1108, and 1110 and mux 1122 allow ALU functions to be performed, and blocks 1112-1118 and mux 1124 allow a multiply-accumulate (MAC) function to be performed.

Blocks 1104 and 1108 are ALUs and perform such functions; their outputs are selectively provided as input (or feedback) to block 1102 through muxes 1122 and 1120. Two ALU operations can be performed every clock cycle. Block 1110 performs a multiply function and generates an output that is provided to block 1112, which can process a higher number of bits in parallel than block 1102. For example, if block 1102 is 32 bits wide, block 1112 is 40 bits wide. Block 1112 functions as an accumulator register, that is, it accumulates its inputs by addition.

Block 1106 converts an N-bit value to N+X bits, where X is an integer value. For example, a 32-bit value may be converted to a 40-bit value. Block 1114 shifts a value by a predetermined number of bits and passes the result to block 1102 through mux 1122.

Block 1118 converts from a higher number of bits to a lower number of bits, such as from 40 bits to 32 bits. This block is coupled to block 408. Block 406 can execute two ALU operations on values from block 1102. In the first ALU operation, an N-bit shift may be performed, or an N-bit value may be converted to an X-bit value to be stored in block 1112. In the second ALU operation, a multiplication may be performed by block 1110, and the result may be stored in one of the registers of block 1112.
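The benefit of a 40-bit accumulator next to 32-bit registers can be illustrated with a small Python model. This is a sketch under assumptions: the text gives only the 32-bit and 40-bit widths, so the two's-complement wrap in the accumulator and the saturating 40-to-32-bit conversion used here are hypothetical choices, as are the names `wrap_signed`, `mac`, and `to_reg`.

```python
REG_BITS = 32   # width of the general registers (block 1102 in the example)
ACC_BITS = 40   # width of the accumulator registers (block 1112 in the example)


def wrap_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc: int, x: int, y: int) -> int:
    """Multiply-accumulate into a 40-bit accumulator (assumed to wrap)."""
    return wrap_signed(acc + x * y, ACC_BITS)


def to_reg(acc: int) -> int:
    """Convert a 40-bit value to 32 bits; saturation is an assumption."""
    hi = (1 << (REG_BITS - 1)) - 1
    lo = -(1 << (REG_BITS - 1))
    return max(lo, min(hi, acc))


acc = 0
for x, y in [(30000, 30000)] * 3:
    acc = mac(acc, x, y)
result = to_reg(acc)
```

The running sum of 2,700,000,000 no longer fits in a signed 32-bit register but is held exactly in the 40-bit accumulator; the extra guard bits let the narrowing step be deferred until the accumulation is finished.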

Block 406 can perform, in parallel, a 40-bit shift, a 40-bit add/subtract, and a conversion of a 40-bit value to a 32-bit value to be stored in one of the 32-bit registers of the scalar ALU MFU.

Further details of one of the N-type sub-processors, such as block 78, are now described with reference to the figures. It should be noted that blocks 406 and 404 of FIG. 4, described for the W-type sub-processor, are common to N-type sub-processors such as block 78.

FIG. 12 shows a high-level block diagram of the details of block 78, in accordance with an embodiment of the present invention. In FIG. 12, block 78 is shown to include a data path unit (DPU) block 1202, a path-to-memory block 1204, and a controller, sequencer, and data address generator (DAG) block 1206. Blocks 1204 and 1206 are common to, and are also found in, the W-type sub-processors. Block 1206 is generally the same, functionally, as block 402.

FIG. 13 shows, in block diagram form, further details of block 78, in accordance with an embodiment of the present invention. In FIG. 13, a store unit block 1302 is shown coupled to an X unit block 1304, which in turn is shown coupled to a load unit block 1306. Block 1304 is generally the same, functionally, as block 404, which has been described in detail above.

Block 1306 is further shown coupled to macro function blocks 1340, which in turn are shown coupled to block 1302 through a macro function bus 1310. Block 1302 is shown to include a store buffer 1314, a store buffer 1312, and a bus interconnect block 1308. Block 1302 generates an output that is provided to a memory, such as memory 312, and is coupled to it accordingly through block 1314. Block 1304 is shown coupled to, or receiving input from, a memory such as memory 312, and block 1306 is shown to include a load buffer 1320, a load buffer 1318, and a bus interconnect block 1316 coupled to blocks 1340.

Blocks 1340 are shown to include a Galois field MAC block 1322, a special ALU block 1324, a combiner block 1326, a memory 1328, a puncturing/depuncturing block 1330, an interleaver block 1332, and a Viterbi block 1334, each of which is shown coupled to bus 1310. Blocks 1322-1332 are each shown to receive input from, or to be coupled to, block 1316. Block 1334 receives input from block 1332 and is coupled to receive and generate data for it.

Data or information flows from and through block 1306 to blocks 1340, then to block 1302, and then to memory. This introduces a pipelining effect, in which multiple operations overlap and are processed concurrently in pipeline fashion. For example, information can be loaded by block 1306 while, at the same time, information is stored in memory by block 1302. After data is received from memory by block 1304, it can be stored in blocks 1320 and 1318 of block 1306 and subsequently provided to blocks 1340 for processing, the details of which are described shortly with reference to subsequent figures.

Upon completion of processing by blocks 1340, the processed data is provided to block 1302 through bus 1310 and stored in blocks 1312 and 1314, where it is held until it is ready to be received by memory. The buffers of blocks 1314, 1312, 1318, and 1320 are each a predetermined number of bits wide, in parallel. For example, each of these buffers is 256 bits wide, although other numbers of bits may be used.

Values or data that can be processed by blocks 1340 may be moved from block 1302 to block 1306 for reuse. Data may also be received from memory by block 1304 and then moved to block 1306 for processing. Further details of each of blocks 1340 are now described. Blocks 1314 and 1312 provide a double-buffering effect, as do blocks 1318 and 1320, which helps reduce the "stalls" commonly experienced in pipelined operation. Stalls result from simultaneous access to memory by blocks 1302 and 1306. In another embodiment, blocks 1314 and 1312 may be a single block, and blocks 1318 and 1320 may be a single block.

A delay may be associated with an operation, or there may be a pipeline effect. A delay may result from each of the blocks within blocks 1340.

FIG. 14 shows further details of block 1322, in accordance with an embodiment of the present invention. In FIG. 14, a Galois field block 1402 is shown coupled to an XOR/Clr circuit 1404, which in turn is shown coupled to an accumulator register block 1406. Block 1402 is shown to generate a Galois field output signal 1408 that serves as one input to a Galois field mux 1410; mux 1410 also receives another input, referred to as accumulator register block output signals 1412, generated at the output of block 1406. Signals 1408 and 1412 serve as inputs to mux 1410 for selectively generating a Galois field MAC output signal 1416, which is coupled onto bus 1310 of FIG. 13. A select signal 1414, which serves as a further input to mux 1410, selects one of signals 1408 and 1412 for the generation of signal 1416. Thus, in effect, either the output of block 1402, which is the result of a Galois field operation, or the result of a Galois field MAC operation is provided as the output of block 1322.

The output of block 1406 is shown coupled to circuit 1404 as its other input. The output of circuit 1404 is provided to block 1406, and this coupling carries out the MAC portion of the Galois field MAC operation. Circuit 1404 in effect performs the XOR multiply operation commonly used in Galois field MAC operations.

Block 1402 is shown to include a register block 1420 and a register block 1422, which are shown coupled to an Xor tree block 1424. Block 1420 is further shown to include a register block 1426, a Galois field multiply iteration 1 (1428), a register block 1430, a Galois field multiply iteration 1 (1432), a register block 1434, and a register block 1436. Although not shown in FIG. 14, an additional number of register blocks, such as blocks 1434 and 1436, may be included between blocks 1434 and 1436 and coupled in series.

Block 1424 is shown coupled to block 1426, which in turn is shown coupled to block 1428, which in turn is shown coupled to block 1430, which in turn is shown coupled to block 1432, which in turn is shown coupled to block 1434. Block 1434 is coupled either to block 1436 or to one or more register blocks located directly between blocks 1434 and 1436.

In FIG. 14, blocks 1420 and 1422 receive input from block 1306 and, in another embodiment, may be combined into one block. Block 1402 generally performs Galois field processing known to those skilled in the art, and the remaining blocks of FIG. 14 carry out the MAC operation. Blocks 1426, 1430, 1434, and 1436 serve as successive iterations of the Galois tree; experience has shown that, in the worst-case scenario, the number of iterations is eight, so that eight register blocks are included. The multiply portion of the MAC operation is generally performed by the XOR operation carried out by circuit 1404, and block 1406 serves as the accumulator. Circuit 1404 receives its input from block 1402, namely from the final iteration of the Galois field operation, which in the case of FIG. 14 is performed by block 1436.

In operation, block 1322 operates on N-bit values or data, such as 8-bit values, and generates N-bit values or data by shifting the original value along eight paths according to another N-bit value. The N-bit values are then XORed by block 1404 until the result is reduced to N bits using a reduction constant, and the result is optionally added to the contents of an N-bit accumulator register, such as the value in block 1406. A "clear" operation may also be performed by block 1406. Examples of uses of Galois field MAC operations, and thus of block 1322, include, without limitation, cyclic redundancy code (CRC) operations, convolutional encoder operations, scrambling code generator operations, and other operations.
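The shift, XOR, and reduce sequence described above corresponds to the usual software formulation of multiplication in GF(2^N), with accumulation done by XOR. The Python sketch below shows that formulation for N = 8. It is not the circuit of FIG. 14: the reduction constant 0x11B (the polynomial used by AES) is an assumption chosen only so the result can be checked against a well-known value, and the names `gf_mul` and `gf_mac` are hypothetical.

```python
def gf_mul(a: int, b: int, poly: int = 0x11B, n: int = 8) -> int:
    """Multiply a and b in GF(2^n).

    Shifted copies of `a` are XORed together under control of the bits
    of `b`, then the wide product is reduced back to n bits with `poly`.
    """
    product = 0
    for i in range(n):
        if (b >> i) & 1:
            product ^= a << i               # one shifted path per bit of b
    for bit in range(2 * n - 2, n - 1, -1):
        if (product >> bit) & 1:
            product ^= poly << (bit - n)    # cancel the high-order bit
    return product


def gf_mac(acc: int, a: int, b: int, clear: bool = False) -> int:
    """Galois field multiply-accumulate; addition in GF(2^n) is XOR.

    With clear=True the accumulator is reset before accumulating.
    """
    return gf_mul(a, b) ^ (0 if clear else acc)


p = gf_mul(0x57, 0x83)
acc = gf_mac(0, 0x57, 0x83, clear=True)
acc = gf_mac(acc, 0x57, 0x83)
```

With the assumed polynomial, `gf_mul(0x57, 0x83)` gives `0xC1`, and accumulating the same product twice returns the accumulator to zero because XOR is its own inverse.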

FIG. 15 shows, in block diagram form, further details of the circuitry included in block 1324, in accordance with an embodiment of the present invention. In FIG. 15, muxes 1504 and 1502 are shown coupled to an A register block 1508 and a B register block 1506, respectively. Block 1508 stores a value referred to as A, and block 1506 stores a value referred to as B; the A and B values are the data to be operated on by block 1324. The A and B values are each N bits wide.

Blocks 1508 and 1506 are shown to generate inputs to a conditional register block 1512 and are further shown coupled to generate input to an add/sub/abs/diff/conditional add-subtract/multiply (AGU) block 1510, which generates input to an output register block 1514. Block 1514 is shown coupled to a mux 1516, which in turn is shown coupled to an adder 1518. Adder 1518 is shown coupled to an accumulator register block 1520, whose output is shown to serve as another input of adder 1518. Another output of block 1520 is shown to serve as an input to a mux 1522, which receives the output of block 1514 as its other input. Mux 1522 generates an output 1530 that is coupled to bus 1310. Some of the inputs to muxes 1504 and 1502 are received from block 1316.

Muxes 1504 and 1502 are each shown to receive four inputs. One input of mux 1504, like one input of mux 1502, is received from block 1306 in the course of data processing. Another input of mux 1504, like one of the inputs of mux 1502, is taken from the lowest-order bits of the output of block 1514. A further input of mux 1504 is taken from the highest-order bits of the same output of block 1514. Yet another input of mux 1504 is the value '0'. One of the inputs of mux 1502 is the value '1', and another is the value '-1'. The values '0', '1', and '-1' are provided to speed up the operations performed by block 1324; they are used repeatedly in a variety of operations, and their availability has been found to increase system performance. It should be noted that multiple blocks 1510 may be present for increased performance. As shown in FIG. 15, block 1324 is organized so that multiple operations can be performed in a single clock cycle.

In operation, blocks 1510 and 1512 operate on the A and B values provided by blocks 1508 and 1506, respectively. Two other inputs to mux 1516 are generated by a reduction block within block 1520 (not shown in FIG. 15), which is described shortly. These two inputs are referred to as 'neighbor-acc-reg' and 'reduction-acc-reg', and each is 2N bits wide.

Block 1512 is a 2N-bit-wide register that allows conditional add or conditional subtract operations to be performed by block 1510 for use in inverse operations. Block 1512 in effect modifies the A and B values for use by block 1510.

Mux 1522 allows the output of block 1510, as stored by block 1514, to be selectively provided to block 1302 by way of signal 1530, as determined by a select signal provided as another input to mux 1522. Alternatively, the result of block 1510 undergoes an accumulate operation through blocks 1518 and 1520, and the final result is stored in block 1520 before being provided to block 1302.

Block 1324 is an N-layer ALU comprising one or more ALUs that support the following operations:

- N add/subtract operations, in which two N-bit values are operated on to produce their sum or difference

- N-bit XOR on two values

- Maximum/minimum operation on two N-bit input values

- Max* operation on two N-bit input values, whose result is computed as max(a,b) + constant (from memory or from a small preloaded lookup table)

- Conditional add-subtract: this function, which generally results from the use of block 1512, conditionally adds or subtracts a stream of N-bit values depending on an input code. The input code is preloaded into a control register. An input code of '1' results in a subtract operation and '0' in an add operation. The output is available in a 16-bit accumulator register. There is also support for a 'gather' operation from other special ALUs that support this operation.

- SAD, using the same accumulator as the conditional add-subtract operation

- N x N multiplication
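Three of the listed operations can be sketched in Python as follows. These are behavioral illustrations only and rest on several assumptions not stated in the text: SAD is read as sum of absolute differences, the max* correction constant is looked up by the absolute difference of the operands (a common choice in log-domain decoders), and overflow of the 16-bit accumulator is not modeled. All function names are hypothetical.

```python
def max_star(a: int, b: int, table: list[int]) -> int:
    """max* operation: max(a, b) plus a constant from a small table.

    Assumption: the table is indexed by |a - b|, clamped to its last entry.
    """
    index = min(abs(a - b), len(table) - 1)
    return max(a, b) + table[index]


def conditional_add_sub(values: list[int], code: list[int]) -> int:
    """Accumulate a stream of values: code bit 1 subtracts, 0 adds."""
    acc = 0
    for value, bit in zip(values, code):
        acc += -value if bit == 1 else value
    return acc


def sad(xs: list[int], ys: list[int]) -> int:
    """Sum of absolute differences, accumulated the same way."""
    return sum(abs(x - y) for x, y in zip(xs, ys))


total = conditional_add_sub([5, 3, 2], [0, 1, 0])   # 5 - 3 + 2
distance = sad([1, 5, 9], [4, 5, 6])                # 3 + 0 + 3
best = max_star(4, 4, [2, 1, 0])                    # 4 + table[0]
```

The conditional add-subtract loop is the kind of kernel used when a received sample stream is correlated against a binary code, where each code bit decides the sign of the contribution.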

Block 1510 is common to the W-type sub-processor. Each block 1510 can read at least 128 bits, so two such blocks can read at least 256 bits of data every clock cycle when there is no contention in memory.

FIG. 16 shows a block diagram of the reduction circuit block 1602 included in block 1520, in accordance with an embodiment of the present invention. FIG. 16 shows an M-stage accumulator register circuit, with the details of each of the accumulator register circuits shown in acc-reg block 1610. For example, acc-reg circuit block 1602 includes four blocks 1610 coupled in the manner shown in FIG. 16. Similarly, each of acc-reg circuit blocks 1604-1608 includes a four-stage acc-reg circuit built from blocks such as block 1610. The output, or result, of each stage within each of blocks 1602-1608 is used as input to the next stage and is thus added in to achieve accumulation. Although blocks 1602-1608 are each shown to include four stages, or four blocks such as block 1610, other numbers of blocks or stages may be used.

The result of each of blocks 1602-1608 is made available to another block. For example, the result of block 1602 serves as input to block 1604, the result or output of block 1604 serves as input to the final acc-reg block within block 1608, and the result or output of block 1606 serves as input to block 1608. Because the results of the blocks are fed forward at the same time as the stages within each block accumulate, only seven cycles are needed to perform a reduction operation when four-stage acc-reg blocks are used.

Block 1610 consists of a mux coupled to an accumulator. The mux is a 2:1 mux that selects one of two inputs to be provided to the accumulator. One of the two inputs of the mux of block 1610 is provided by the output of block 1514, and the other input is the result of the previous-stage acc-reg block. In this way, the reduction function of FIG. 16 is flexible in how it manipulates data. The inputs taken from the output of the immediately preceding stage are referred to as 'neighbor' signals 1616, which produce the neighbor-acc-seq input to mux 1516. The outputs of some of the stages produce the reduction-acc-seg input to mux 1516 and are referred to as 'reduction' signals 1618. The output of the last acc-reg block of block 1608 produces output 1620, which is coupled to mux 1530. The reduction circuit of FIG. 16 performs the reduction operation in a minimal number of clock cycles while reducing power consumption.
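By way of illustration only, the staged accumulation described for FIG. 16 can be modeled in a few lines of Python. The function names and the cycle accounting below are assumptions made for this sketch: they are chosen so that four four-stage blocks finish in seven cycles, which is consistent with the figure given above, but the actual schedule of the circuit may differ.

```python
def reduce_block(values):
    """Model one acc-reg block: returns (block_sum, cycles).

    Assumed schedule: in the first cycle every stage's 2:1 mux selects the
    fresh input (the block 1514 side); in each later cycle one stage selects
    the 'neighbor' input, i.e. the result of the previous stage.
    """
    accs = list(values)            # cycle 1: all stages load fresh inputs
    cycles = 1
    for i in range(1, len(accs)):
        accs[i] += accs[i - 1]     # mux selects the previous-stage result
        cycles += 1
    return accs[-1], cycles


def reduce_all(blocks):
    """Model several acc-reg blocks working concurrently, then fed forward."""
    partials = []
    block_cycles = 0
    for values in blocks:
        block_sum, cycles = reduce_block(values)
        partials.append(block_sum)
        block_cycles = max(block_cycles, cycles)   # blocks run in parallel
    total = partials[0]
    cycles = block_cycles
    for partial in partials[1:]:
        total += partial           # one assumed cycle per forwarded result
        cycles += 1
    return total, cycles


# Four blocks of four stages each, holding the values 1..16.
blocks = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
total, cycles = reduce_all(blocks)   # total == 136, cycles == 7
```

Under these assumptions the intra-block accumulation costs four cycles and combining the four block results costs three more, which is how the sketch arrives at seven.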

FIG. 17 shows, in high-level block diagram form, further details of the circuitry included in block 1326, in accordance with an embodiment of the present invention. In FIG. 17, block 1326 is shown to include shifters 1702-1712 for shifting the data input received from block 1306. In one embodiment, input 1700 is 128 bits wide, although other numbers of bits may be used. The output of each of shifters 1702-1712 is shown coupled to register bank block 1714. Shifters 1702-1712 generate different combinations of the bits of input 1700.

Block 1714 includes a number of registers, including registers 1716 through 1746, that are used to generate combinations of the outputs of shifters 1702-1712. By way of example, the lower eight bits of each shifter 1702-1712 output may be passed through a mux to select which lower eight bits are ultimately produced. Thus, each of the registers of block 1714 can select arbitrarily from among the "positions of interest" of the shifted bits. The positions of interest are determined by the output of each of shifters 1702-1712. The output of block 1714 is provided to bus 1310.

Thus, in one embodiment of the present invention, block 1326 includes four 20-bit and two 24-bit input registers. It also includes eight 16-bit registers in which arbitrary 32-, 16-, 8-, and 4-bit combinations of bits from the input registers are generated and stored. Block 1326 can be used in three modes: 1) using two specific 20-bit registers to generate the output, 2) using the four 20-bit registers to generate the output, or 3) using all six input registers to generate the output. Shifters 1702-1712 include the input registers, which are not shown because the structure and function of a shifter are known to those skilled in the art.

To reduce the hardware, or the number of circuits or blocks, needed to perform the combining function of block 1326, each bit in a 32-bit output register is filled, in the first mode, only from the eight least significant bits of two 20-bit registers; in the second mode, from the four least significant bits of the four 20-bit registers; and in the third mode, from the four least significant bits of the 24-bit registers and the two least significant bits of the four 20-bit registers. Forming an arbitrary combination from the input registers is a two-step process, in which the first step involves shifting the bits "of interest" to the least significant positions from which arbitrary filling of the output register is permitted in the current mode. In the example used here with respect to FIG. 17, block 1326 can generate 16 combined bits every cycle when pipelined with the shift operations on the input registers that move the bits of interest to the least significant positions. Certain output combinations may take multiple clock cycles.
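The two-step combining process described above (shift the bits of interest down, then fill each output bit from a permitted low-order position) can be sketched as follows. This is a minimal software model under stated assumptions, not the circuit itself: the function names `shift_to_low` and `combine`, the selector format, and the `window` parameter are hypothetical and were introduced only for this example.

```python
def shift_to_low(value, start):
    """Step 1: shift so that the bit at position `start` lands at bit 0."""
    return value >> start


def combine(inputs, selectors, window):
    """Step 2: build an output word one bit at a time.

    inputs    -- input register values, already shifted by step 1
    selectors -- one (register_index, bit_index) pair per output bit,
                 least significant output bit first (assumed format)
    window    -- how many low-order bits of each register may be selected
                 in the current mode (8, 4, or 2 in the modes described)
    """
    out = 0
    for pos, (reg, bit) in enumerate(selectors):
        if not 0 <= bit < window:
            raise ValueError("bit %d is outside the selectable window" % bit)
        out |= ((inputs[reg] >> bit) & 1) << pos
    return out


# First mode: two registers, eight selectable low-order bits each.
reg_a = shift_to_low(0b10110010_0000, 4)   # bits of interest moved to bits 7..0
reg_b = 0b00001111
word = combine([reg_a, reg_b], [(0, 1), (1, 0), (0, 7), (1, 7)], window=8)
# word == 0b0111
```

Restricting the selectable window is what keeps the selection muxes small: each output bit chooses among at most 16 candidate bits in every mode described above (2 x 8, 4 x 4, or 2 x 4 + 4 x 2).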

Memory 1326 is a general random access memory and is therefore not described in further detail. Suffice it to say that the size of the memory depends on the applications for which the N-type sub-processor is used.

FIG. 18 shows, in high-level block diagram form, further details of the circuitry included in block 1330, in accordance with an embodiment of the present invention. In FIG. 18, one-word register 1802 is shown to include eight bit positions, and each bit position 1804 can be modified by bit select circuit 1806. Such modifications include, without limitation, inserting a '0', inserting a '1', leaving the bit unchanged (equivalent to a "NOP", or no operation), and NOTing the bit (that is, inverting it). The one-word register is replicated; that is, word registers 1810-1820 each store and modify a word in the same way as register 1802. Thus, in the example of 16-bit words and eight words, the modification of eight 16-bit words is performed in one clock cycle, unlike conventional DSPs, which require multiple cycles to do so. The modification, or puncturing/depuncturing, of each bit of the words is controlled by mux 1824 and flip-flop 1826, which are coupled to register 1802 and to each other in the manner shown in FIG. 18. Registers 1810-1822 are similarly coupled to other mux and flip-flop circuits. Mode select bits, which are taken from the instruction code, select which of the four inputs of the mux is chosen. Two of the inputs 1828 to mux 1824 are also obtained from the instruction code, while the other two mux inputs are obtained from memory, one of which may be an inverted version of the other, as shown in FIG. 18.

The input to the circuits of block 1330 is generated by block 1332, which is described shortly; for now, it suffices to note that this input consists of fully interleaved, partially interleaved, or non-interleaved N-bit words. In one example, the operation is carried out on 256 bits, in which case block 1330 operates on 16 bits at a given time. A prefetched control word is used to determine which bits within the 16-bit word are to be inverted. Optionally, a '0' or '1' value is inserted at particular bit positions in addition to inversion.
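The per-bit modifications performed by block 1330 (insert '0', insert '1', no operation, or invert) can be illustrated with a short Python model. The operation codes `NOP`, `ZERO`, `ONE`, and `NOT` and the function name `modify_word` are assumptions for this sketch; the text does not specify how the mode select bits are encoded.

```python
# Hypothetical encodings for the four per-bit operations.
NOP, ZERO, ONE, NOT = range(4)


def modify_word(word, ops, width=16):
    """Apply one operation per bit position, least significant bit first."""
    if len(ops) != width:
        raise ValueError("one operation is required per bit position")
    out = 0
    for pos in range(width):
        bit = (word >> pos) & 1
        op = ops[pos]
        if op == ZERO:
            bit = 0
        elif op == ONE:
            bit = 1
        elif op == NOT:
            bit ^= 1
        elif op != NOP:
            raise ValueError("unknown operation code: %r" % (op,))
        out |= bit << pos
    return out


# A 4-bit example: invert bit 0, keep bit 1, force bit 2 to 1, force bit 3 to 0.
result = modify_word(0b1010, [NOT, NOP, ONE, ZERO], width=4)   # 0b0111

# Eight words modified with the same control pattern; in the hardware
# described above this happens in parallel, in one clock cycle.
words = [0x0000, 0xFFFF, 0x1234, 0xABCD, 0x0F0F, 0xF0F0, 0x5555, 0xAAAA]
inverted = [modify_word(w, [NOT] * 16) for w in words]
```

In software the eight words are processed one after another; the point of the replicated word registers is that the hardware applies the same kind of per-bit selection to all of them at once.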

FIG. 19 shows, in high-level block diagram form, further details of the circuitry included in block 1332, in accordance with an embodiment of the present invention. In FIG. 19, memory array 1902 is shown receiving input 1904 from an input device over bus 1316, receiving read enable input 1906 over bus 1316, and further receiving an input from control row-column address generation block 1908, in order to generate output device signal 1910, which is provided to block 1302. In one example, block 1902 includes a memory array of 128 x 16 bits. Data can be written to or read from block 1902 on either a row basis or a column basis. That is, a row of the memory array of block 1902 can be read, or a column of the memory array of block 1902 can be read. In addition, data can be written on a row basis but read on a column basis, and vice versa.
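Writing by row and reading by column, as described for memory array 1902, is the access pattern of a block interleaver. The following Python class is a minimal sketch of that pattern; the class and method names are hypothetical, and the default dimensions simply reuse the 128 x 16 example from the text.

```python
class RowColumnMemory:
    """Bit array that can be written or read along rows or columns."""

    def __init__(self, rows=128, cols=16):
        self.rows = rows
        self.cols = cols
        self.cells = [[0] * cols for _ in range(rows)]

    def write_row(self, r, bits):
        if len(bits) != self.cols:
            raise ValueError("a row holds exactly %d bits" % self.cols)
        self.cells[r] = list(bits)

    def write_column(self, c, bits):
        if len(bits) != self.rows:
            raise ValueError("a column holds exactly %d bits" % self.rows)
        for r, bit in enumerate(bits):
            self.cells[r][c] = bit

    def read_row(self, r):
        return list(self.cells[r])

    def read_column(self, c):
        return [row[c] for row in self.cells]


# A small 2 x 3 array: write two rows, then read the data back by column.
mem = RowColumnMemory(rows=2, cols=3)
mem.write_row(0, [1, 0, 1])
mem.write_row(1, [0, 1, 1])
interleaved = [mem.read_column(c) for c in range(mem.cols)]
# interleaved == [[1, 0], [0, 1], [1, 1]]
```

Reversing the roles (writing by column and reading by row) restores the original order, which is why the same array can serve for both interleaving and de-interleaving.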

FIG. 20 shows, in high-level block diagram form, further details of the circuitry included in block 1334, in accordance with an embodiment of the present invention. In FIG. 20, branch metric unit 2002 is shown receiving an input from block 1332 and coupled to an add/compare/select block, which is shown coupled to survivor memory block 2012, which in turn is shown coupled to mux 2020, which produces output 2022 coupled to bus 1310. Mux 2020 is further shown to receive another input from the output of accumulator 2018, which receives its input from mux 2016. Optionally, a sum of absolute differences (SAD) block 2008 and a despreader block 2010 (for despreading) are used to generate the inputs to mux 2016. When blocks 2008 and 2010 are absent, mux 2016, block 2018, and mux 2020 are not used. Local memory 2006 is shown coupled to block 2004. Block 2002 performs the branch metric computation known to those skilled in the art who are familiar with Viterbi coding/decoding. Survivor paths, likewise known to those familiar with Viterbi coding/decoding, are stored in block 2012.

Block 1334 can execute turbo decoder, SAD, and despreading functions. In one example, 32 to 256 add-compare-select operations can be performed in parallel by blocks 2004 on 16-bit branch and path metric values supplied by local memory 2006. In one example, the size of local memory 2006 is between 1 kilobyte and 16 kilobytes.

There may be multiple blocks 2004 within block 1334, each of which may include 8-bit signed adders. In addition, each may include a compare-and-select block that returns a winning path and a decision bit. The add-compare-select operations yield winning paths and decision bits. A winning path can be shared with neighboring blocks 2004 using a "multi-cast" interconnect scheme that runs back and forth down the trellis. The decision bits, together with the winning branch and path metric values, are stored for traceback.
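For readers less familiar with Viterbi decoding, a single add-compare-select step can be sketched as below. The function name and the convention that the smaller metric wins (as with a distance metric) are assumptions for this illustration; the text does not state which convention blocks 2004 use, and the fixed adder width is not modeled.

```python
def add_compare_select(path_metric_0, branch_metric_0,
                       path_metric_1, branch_metric_1):
    """One add-compare-select step for a trellis state with two predecessors.

    Returns (winning_path_metric, decision_bit). The decision bit records
    which predecessor won, so that traceback can later reconstruct the
    survivor path. Assumes that a smaller metric is better.
    """
    candidate_0 = path_metric_0 + branch_metric_0   # add
    candidate_1 = path_metric_1 + branch_metric_1   # add
    if candidate_1 < candidate_0:                   # compare
        return candidate_1, 1                       # select predecessor 1
    return candidate_0, 0                           # select predecessor 0


# Predecessor 0 arrives with 3 + 2 = 5, predecessor 1 with 1 + 5 = 6.
metric, decision = add_compare_select(3, 2, 1, 5)   # (5, 0)
```

Each trellis state needs one such step per received symbol, and the steps for different states are independent of one another, which is what makes it possible to run many of them in parallel as described above.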

Block 2008 includes four 8-bit ALUs, so that, in one example, four absolute differences can be computed every cycle. A reduction tree is built into block 2004 to accumulate the absolute differences into a 16-bit accumulator. The multi-cast network can be used to transfer these values across blocks for further reduction. A total of 128 8-bit (64 16-bit) SAD operations (block 2008) per clock cycle is possible. However, the effective utilization, once all overheads are taken into account, is believed to be lower.
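The sum-of-absolute-differences computation can be modeled in software as follows. This is a sketch under assumptions: the function name, the `lanes` parameter (four, matching the four 8-bit ALUs), and the wraparound behavior of the 16-bit accumulator are illustrative choices; the text does not say whether the hardware accumulator wraps or saturates.

```python
def sum_of_absolute_differences(block_a, block_b, lanes=4, acc_bits=16):
    """Return (sad, cycles) for two equally sized sequences of samples.

    Each modeled cycle computes `lanes` absolute differences and adds them
    to an accumulator that is `acc_bits` wide (wraparound is assumed).
    """
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same length")
    mask = (1 << acc_bits) - 1
    acc = 0
    cycles = 0
    for i in range(0, len(block_a), lanes):
        diffs = [abs(a - b)
                 for a, b in zip(block_a[i:i + lanes], block_b[i:i + lanes])]
        acc = (acc + sum(diffs)) & mask
        cycles += 1
    return acc, cycles


current = [10, 20, 30, 40, 50, 60, 70, 80]
reference = [12, 18, 33, 40, 45, 66, 70, 90]
sad, cycles = sum_of_absolute_differences(current, reference)   # (28, 2)
```

In motion estimation, this value is computed for many candidate reference blocks and the candidate with the smallest sum is chosen, which is why throughput in absolute differences per cycle matters.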

The ALUs implement the same conditional add-subtract function as the special ALU block described above. The control bits needed for despreading must be loaded into local memory, from which they are fetched and stored in a register. The results are accumulated in a 16-bit accumulator, from which they can be passed to other blocks 2004 for further reduction. Using despreading, in one example, it is possible to perform 128 simultaneous conditional add-subtracts in a single cycle. The energy per transition of this unit is higher than that of the special ALU, which performs certain general-purpose functions in addition to despreading and SAD. For a smaller number of fingers, or for a slower motion estimation rate, the special ALU is the more power-efficient option.
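The conditional add-subtract used for despreading can be sketched as follows. The function name, the convention that a control bit of 1 means subtract, and the two's-complement wraparound of the 16-bit accumulator are assumptions made for this example only.

```python
def despread(samples, control_bits, acc_bits=16):
    """Accumulate samples, adding or subtracting each one per control bit.

    Assumed convention: control bit 0 adds the sample, control bit 1
    subtracts it. The result is wrapped to a signed `acc_bits`-wide value.
    """
    if len(samples) != len(control_bits):
        raise ValueError("one control bit is required per sample")
    acc = 0
    for sample, bit in zip(samples, control_bits):
        acc += -sample if bit else sample           # conditional add-subtract
    half = 1 << (acc_bits - 1)
    return ((acc + half) % (1 << acc_bits)) - half  # signed wraparound


# 3 - (-2) + 5 - 1 = 9
symbol = despread([3, -2, 5, 1], [0, 1, 0, 1])      # 9
```

Because the control bits take the place of multiplications by +1 or -1, no multiplier is needed, which is consistent with the function being mapped onto simple ALUs.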

FIG. 21 shows an example of a programming flow and tools using processor 22, in accordance with an embodiment of the present invention. FIG. 22 shows an example of the scalability of embodiments of the present invention. By way of example, in FIG. 22, clusters 2202 of N-type and W-type sub-processors are shown interconnected using bus 2204. Each cluster 2202 includes two or four sub-processors. Bus 2204 is, in one example, a standard SoC bus. Interconnectivity is addressed by maintaining a hierarchical design methodology.

Scaling of processor 20 yields clusters of four sub-processors with a separate bus for each cluster; otherwise, four sub-processors may share a single memory. Scalability in processors has generally been achieved by increasing the frequency, or speed, of the processor or by increasing the number of processors. Complex applications, however, require scaling beyond what has previously been achieved. In the present invention, the W-type and N-type sub-processors are adapted such that four such sub-processors, together forming a processing unit, can handle a single application.

Thus, processor 22 is able to run the control and sequential DSP code found in target applications more efficiently than RISC and superscalar processors that rely directly on compilation from C code. At the same time, it is designed to take advantage of the automatic code generation techniques used in RISC and superscalar processors for legacy and lightweight applications. Processor 22 also works with mature, industry-standard software tools, such as Simulink, for application mapping and development. Moore's law can be exploited to improve the performance of processor 22. Processor 22 is not only a highly parallel machine but also a heterogeneous multiprocessor. It has been established, in both industry and academia, that parallel heterogeneous multiprocessors are needed to meet the demands of multimedia and communications applications. Processor 22 enables the use of many of the automatic code generation techniques used in VLIW without resorting to any power- or area-inefficient techniques. It is optimized to take advantage of repeating patterns in control code compiled from C. This significantly reduces control power and allows compiled serial code to run efficiently. In addition, the programming model of processor 22 is designed to suit the large community of DSP programmers who use tools familiar to them, such as Simulink. The development flow provides a means for efficient C compilation of control and sequential DSP code. A large library of highly efficient communications and multimedia kernels is also provided. Examples include parameterized libraries for FFT, IDCT, RRC, Viterbi, VLC, 2D/3D graphics, turbo codec, and descrambler.

The data path design of processor 22 successfully integrates variable interconnect structures that connect functional units of varying granularity, in order to address a focused but highly valuable mix of applications efficiently.

The scalability of processor 22 is designed so that all applications fit within a single (time-multiplexed) block, with nearest-neighbor connections within the block based on a standard SoC bus. Because multiple blocks can be used to handle multiple applications without any dedicated communication between them, a significant amount of inefficiency, and all system-level non-determinism, is reduced.

FIG. 23 shows a chart illustrating some of the benefits of the scalability of the present invention.

Although the present invention has been described in terms of specific embodiments, it is anticipated that alterations and modifications thereof will no doubt become apparent to those skilled in the art. Accordingly, the following claims are to be interpreted as covering all such alterations and modifications as fall within the true spirit and scope of the invention.

Claims

A heterogeneous high-performance scalable processor, comprising:

at least one W-type sub-processor capable of processing W or more bits in parallel, W being an integer value;

at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W;

a shared bus coupling the at least one W-type sub-processor and the at least one N-type sub-processor; and

memory coupled to and shared by the at least one W-type sub-processor and the at least one N-type sub-processor,

wherein the W-type sub-processor rearranges bytes transferred from or to the memory to accommodate the execution of applications, thereby enabling fast operations.

The heterogeneous high-performance scalable processor of claim 1, wherein the processor is scalable.

The heterogeneous high-performance scalable processor of claim 1, wherein the at least one W-type sub-processor is two W-type sub-processors and the at least one N-type sub-processor is two N-type sub-processors.

The heterogeneous high-performance scalable processor of claim 2, wherein the at least one W-type sub-processor and the at least one N-type sub-processor execute programs for multimedia applications.

The heterogeneous high-performance scalable processor of claim 4, wherein each of the at least one W-type sub-processors includes a plurality of macro function units.

The heterogeneous high-performance scalable processor of claim 5, wherein the plurality of macro function units includes a load store block for generating memory addresses for use by the remaining ones of the plurality of macro function units.

The heterogeneous high-performance scalable processor of claim 6, wherein the plurality of macro function units includes a scalar arithmetic logic unit (ALU) and a multiply-accumulate block, coupled to the load store block, for performing scalar arithmetic and logic operations and multiply operations on data received from the load store block.

The heterogeneous high-performance scalable processor of claim 7, wherein the plurality of macro function units includes a vector X block coupled to the load store block, the scalar ALU and the multiply-accumulate block perform vector operations on data from the load store block, and the vector X block generates vector data.

The heterogeneous high-performance scalable processor of claim 8, wherein the plurality of macro function units includes a vector ALU and a multiply-accumulate block coupled to the scalar ALU, the multiply-accumulate block and the vector X block being for performing vector ALU and multiply-accumulate operations on vector data received from the vector X block.

The heterogeneous high-performance scalable processor of claim 2, wherein the at least one N-type sub-processor includes a store unit block, macro function blocks, and a load unit block, the macro function blocks being coupled to the load unit block and further coupled to a macro function bus for coupling the macro function blocks to the store unit block.

The heterogeneous high-performance scalable processor of claim 10, wherein the at least one N-type sub-processor includes a data path unit (DPU) block and controller, and a sequencer and data address generator (DAG) block shared by at least one of the W-type sub-processors.

The heterogeneous high-performance scalable processor of claim 10, wherein the macro function blocks include a Galois field multiply-accumulate (MAC) block, coupled to the function bus and the load unit block (1306), for performing Galois field operations.

The heterogeneous high-performance scalable processor of claim 12, wherein the macro function blocks include a special ALU, coupled to the load unit block and the load unit block, for performing special ALU operations.

The heterogeneous high-performance scalable processor of claim 13, wherein the macro function blocks include a puncturing/depuncturing block, coupled to the load unit block and the load unit block, for performing puncturing/depuncturing operations.

The heterogeneous high-performance scalable processor of claim 14, wherein the macro function blocks include an interleaver block, coupled to the load unit block and the load unit block, for performing interleaving operations.

The heterogeneous high-performance scalable processor of claim 15, wherein the macro function blocks include a Viterbi block, coupled to the store unit block and the interleaver block, for performing Viterbi operations.

The heterogeneous high-performance scalable processor of claim 16, wherein the macro function blocks include a combiner block, coupled to the load unit block and the load unit block, for performing combining operations.

The heterogeneous high-performance scalable processor of claim 16, wherein the at least one N-type sub-processor includes an X unit block coupled between the store unit block and the load unit block.

The heterogeneous high-performance scalable processor of claim 16, further comprising a shared register, coupled between the at least one W-type sub-processor and the at least one N-type sub-processor, for direct communication between the at least one W-type sub-processor and the at least one N-type sub-processor.

A method of processing information

in a heterogeneous high-performance scalable processor, the method comprising:

processing data using at least one W-type sub-processor capable of processing W bits in parallel, W being an integer value;

concurrently processing data using at least one N-type sub-processor capable of processing N bits in parallel, N being an integer value smaller than W by a factor of two; and

causing rapid execution of multimedia applications while maintaining low power consumption and ease of programmability.