TW200407705A - Access to a wide memory - Google Patents

Access to a wide memory

Info

Publication number
TW200407705A
TW200407705A TW092113718A
Authority
TW
Taiwan
Prior art keywords
data
register
memory
vector
processing system
Application number
TW092113718A
Other languages
Chinese (zh)
Other versions
TWI291096B (en)
Inventor
Berkel Cornelis Hermanus Van
Patrick Peter Elizabeth Meuwissen
Original Assignee
Koninkl Philips Electronics Nv
Application filed by Koninkl Philips Electronics Nv
Publication of TW200407705A
Application granted
Publication of TWI291096B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30025 Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • G06F9/30032 Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G06F9/3834 Maintaining memory consistency
    • G06F9/3877 Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06F9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F9/3887 Parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G06F9/3893 Parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053 Vector processors
    • G06F15/8061 Details on data memory access

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)
  • Saccharide Compounds (AREA)
  • Static Random-Access Memory (AREA)
  • Advance Control (AREA)
  • Image Input (AREA)
  • Executing Machine-Instructions (AREA)
  • Multi Processors (AREA)

Abstract

A processing system includes a processor and a physical memory 500 with a single-size memory port 505 for accessing data in the memory. The processor is arranged to operate on data of at least a first data size and a smaller second data size, the first data size being equal to or smaller than the size of the memory port. The processing system includes at least one data register 514 of the first data size connected to the memory port 505, and at least one data port 525 of the second data size connected to the data register 514 and to the processor for enabling access to data elements of the second size.
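The abstract's mechanism lends itself to a small software model. The sketch below is illustrative only: the 32-bit port width, the byte-sized second data size, and all names (`WideMemory`, `NarrowReadPort`, `read_byte`) are assumptions for the sketch, not taken from the patent. A wide single-port memory is front-ended by one line-sized data register, and narrow (byte) reads are served from that register, with the address split into a word index (the register tag) and an element offset.

```python
# Toy model of the claimed access scheme: a wide single-port memory plus a
# line-sized data register serving narrow reads.  Sizes and names are
# illustrative assumptions, not taken from the patent text.

WORD_BYTES = 4  # first data size (memory port width) = 4 x second data size

class WideMemory:
    def __init__(self, n_words):
        self.words = [bytearray(WORD_BYTES) for _ in range(n_words)]
        self.accesses = 0  # counts physical (full-width) memory accesses

    def read_line(self, word_addr):
        self.accesses += 1
        return bytearray(self.words[word_addr])

class NarrowReadPort:
    """Data register + multiplexer: serves byte reads out of one line."""
    def __init__(self, mem):
        self.mem = mem
        self.tag = None    # word address currently held in the register
        self.line = None   # the data register (first data size)

    def read_byte(self, byte_addr):
        word_addr, offset = divmod(byte_addr, WORD_BYTES)
        if self.tag != word_addr:            # miss: reload register from memory
            self.line = self.mem.read_line(word_addr)
            self.tag = word_addr
        return self.line[offset]             # multiplexer selects the element

mem = WideMemory(4)
mem.words[0] = bytearray([10, 11, 12, 13])
mem.words[1] = bytearray([20, 21, 22, 23])
port = NarrowReadPort(mem)
data = [port.read_byte(a) for a in range(8)]  # 8 sequential byte reads
print(data, mem.accesses)                     # only 2 physical accesses
```

Sequential byte reads then cost one full-width memory access per four elements, which is the saving the register of at least twice the second data size is meant to provide.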

Description

200407705 玫、發明說明: 【發明所屬之技術領域】 本發明關於一種處理系統。 【先前技術】 第二代無線通信標準如UMTS_FDD,UMTS-TDD,IS2000 與TD-SCDMA以非常高的頻率作業。此等第三代移動式通 “標率的數據機(收發機)如UMTS需要的數位信號處理功 率比GSM多大約100倍。為了能夠處理不同的標準,以及能 夠有彈性適用於此等新標準,希望實行一收發機,以使用 可私式之架構之此等標準。一種改善效率的已知方法,其 儲存多種資料元件於一記憶體之一列,並一次對一個以上 之貝料元件作業。例如,此等系統已知的SIMD(單指令,多 種資料)或MIMD(多種指令,多種資料)。一向量處理器是一 SIMD處理器之範例。通常,一具有—埠之廣泛記憶體能夠 謂與寫存取該記憶體之至少一列之單元。於該範例中,該 圮fe體4列寬度能儲存一向量,一次可讀或寫一或更多的 向量。對於存取比一向量小的單元沒有特別的規定。為了 一記憶體之最佳使用,也希望能夠以_種有效率的方法存 取比具有一記憶體列之全寬度小的資料元件。通常,此一 小單元能被儲存於該記憶體之一列,其中該列有部分未被 使用,而增加存儲成本。此外,此等小單元以一種連鎖的 形式被儲存於-列,其中需要讀或寫―整列,而且$_ 取更多的處理指令與週期,或***一期望的較小單元,邢 成該整體記憶體列。上述會降低該效率。此等問題會變得 85370 -6- 200407705 更嚴重,其中該記憶體的寬度顯然超出該較小單_的 寸。例如’為了語音識別’此等儲存一語音特性向量之級 件的資料元件通常是8至16位元寬。就電話通訊而士,y r 編碼/調變資料元件往往是8位元(或2*8位元複合值)4 、 需要增加該尺寸。為了較新的電話通訊系統或增強語音= 別效率,希望能改善該處理速度。因此使用較廣泛的吃二 體以增加該處理速度,但不需要特別的方法,即能増加: 要儲存資料的記憶體量,或甚至減速如上面所描述= 形式的記憶體存取。 二 【發明内容】 本發明之-目的係提供—種處理器架構,其能夠提供— 種對廣泛記憶體之快速記憶體存取,也用以較小的資料元 件。 ” 為達成該目的,-處理系統具有—處理器與一實體記憶 體’該實體記憶體具有—用以存取該記憶體中之資料之單 尺寸記憶體埠’該處理器被配置以對至少—第—資料尺寸 與-較小《第二資料尺寸之資料作業;該第一资料尺寸等 於或小於該記憶料的尺寸;該處理系統包括該第一資料 、、v貝料暫存裔連接至該記憶體埠;而該第二資 料二寸之至少—料埠連接至該資料暫存器,及赋能存取 该矛―貧料尺寸之此等資料元件的該處理器。 :方式中使用具有一字大小之慣用記憶體。上述使該記 =::成ί保持下降。通常,該記憶體字的尺寸與該處理 、字的尺寸相匹配。為了存取較小的資料元件,使 85370 200407705 用一中間暫存器。一額外 哭由耔丨 八 阜被增加,以賦能存取該暫存 口口中車父小的資料元件。# Φ 以日存器的使用是非常顯而易 兄。對孩處理器核心盥該巷 邮^ ,、及秸式貝而言,看起來就像該記憶 :;有不同尺寸的埠。於此方式中,小資料元件能被快速 :二需要更多的指令附加’例如,為了存取為該大資 邵分小資料元件,大資料元件之選擇和/或移位。 如描迷於所屬之中請專利範園第2項,該記憶體琿尺寸至 =該第二料尺寸的兩倍。於此方式中,該暫存器能儲 y兩個小貝料儿件。尤其,於該等小資料元件被連績 存取的案财,該實體記憶體之—存取,能夠快速存取至 立”連"之小貝料兀件。上述會降低浪費在存取該實體記 fe體的時間。 如描述於所屬之申請專利範圍第3項,為了 一讀資料埠, 使用-多工器,根據-讀取位址的控制從該資料暫存器選 擇^取該第二資料尺寸之_資料元件。例如,該讀取位 址最重要的部分能被用於核對該資料元件是否已經存在該 暫存器(但如果未使用該部分,從該實體記憶體重新取得), 而孩最不重要的部分能被用於選擇該暫存器中的資料元 件0 如描述於所屬之申請專利範圍第4項,為了 一窝資料埠, 使用一解多工裔,根據一寫入位址之控制,在該資料暫存 器之一可選擇的位置***該第二資料尺寸之一資料元件。 該選擇的執行與所描述的讀取埠類似。 如描述於所屬之申請專利範圍第5項,該處理系統包括連 85370 200407705 接至該處理器之該第二資料尺寸之複數個資料埠,而且為 了該等資料璋之每一個,一相關的個別資料暫存器連接至 個別之資料埠與該實體記憶體之一埠。例如,如果一調準 處理兩連續之資料流,該等資料流之每一個可使用該等暫 存器之一與此等資料埠。每一資料流接著能存取至少兩連 續之資料元件,但僅使用其中之一存取該實體記憶體。 如描述於所屬之申請專利範圍第6項,為了因該等暫存器 包含該資料之一 
’’複本’’於該記憶體和/或一個以上之暫存器 所引發之可能一致的衝突,而執行一檢查。該系統維持更 新的資料於該暫存器,因此一小資料元件之更新通常不會 對該實體記憶體發生一寫作業。此外,也能直接從該暫存 器讀取該更新的小資料元件,更進一步節省記憶體存取次 數。對此些資料埠(因此該相關的暫存器允許寫存取)而言, 該處理器儲存於一更一致的暫存器,該資料的資訊被儲存 於該暫存器。該資料被用於檢查讀取的資料是否(從該實體 記憶體或該等暫存器之一)正在存取已被改變的資料(但可 能尚未更新該等實體記憶體或暫存器)。該識別資訊最好是 一用以存取該實體記憶體中之一字的實體位址,其中該字 的寬度與該記憶體琿一樣。此方式中,容易檢查直接存取 該實體記憶體是否可能與儲存於該等暫存器中的資料衝 突。 如描述於所屬之申請專利範圍第8項,該協調檢查器包括 一衝突解決器,為回應一可能一致的衝突,而取得此等修 正步騾。在此方式中,當設計該程式時已取得該方法以接 85370 200407705 替該程式員。取得此等修正步驟之一方法,如描述於所屬 之申請專利範圍第9項,將該資料暫存器標示為讀存取無 效,為回應該資料暫存器之一讀存取,導致從該記憶體重 新載入該資料暫存器之内容。 此外,如描述於所屬之申請專利範圍第10項,該協調檢 查器包括一協調暫存器,其用以每一個別的資料暫存器, 該資料暫存器儲存識別儲存於該個別之資料暫存器之資料 的資訊;以及配置該修正器,根據該識別資訊儲存相同之 資料,為回應寫存取該等資料暫存器之一和/或寫存取該實 體記憶體,因此將寫至該資料暫存器或該記憶體之内容複 製至所有其餘的資料暫存器和/或該記憶體之一位置。於該 實施例中,根據該協調暫存器應儲存同一資料,將更新之 資料複製至所有的暫存器。如果允許直接存取該實體記憶 體,上述包括複製該資料至該實體記憶體。為了該等大資 料元件,最好也經由一中間暫存器發生直接存取該實體記 憶體,於寫至一暫存器之案例中,不會自動迫使一寫存取 該記憶體。 如描述於所屬之申請專利範圍第11項,至少該等資料暫 存器之一(下面的’’讀暫存器’’)被連接至一讀資料埠,以及至 少該等資料暫存器之一(下面的’’寫暫存器’’)被連接至一寫 入資料埠;而且該處理器包括一旁通,其可選擇從該寫暫 存器提供資料給該讀資料埠;該協調檢查器包括一用以每 一個別資料暫存器之協調暫存器,以儲存識別儲存於該相 關資料暫存器之資料的資訊;配置該衝突解決器,因回應 85370 -10- 200407705 寫資料進入啟動該旁通路徑之寫暫存器,以連續讀存取該 讀暫存器,以取得此等修正步驟,如果該讀暫存器根據該 識別資訊儲存同一資料元件。因使用一旁通,一具有同一 内容之寫暫存器被更新時,一讀暫存器就不需要被重新載 入。接著卻直接從該更新寫暫存器讀取該資訊。在此方式 中,對該實體記憶體的存取保持最低水準。於存取寫暫存 器期間有時會發生一延遲。 該實體記憶體最好以高成本效益的單埠SRAM為基礎。為 了獲得一高成本效益的廣泛實體記憶體,因此以一記憶體 字儲存許多小資料元件,最好使用由複數個平行配置的 RAM儲存體所形成之實體記憶體。該記憶體最好被嵌於該 處理器。 該描述之架構被用於一純量/向量處理器是有利的,其中 該向量區段根據該第一資料尺寸之向量作業,而該純量區 段根據該第二資料尺寸之純量作業,其中該第一資料寬度 是該第二資料寬度的兩倍。於此一配置中,同一記憶體能 被用於儲存該等向量與純量。也使該配置容易在該向量元 件上執行純量作業。 【實施方式】 為最佳化信號處理,於一處理器中最好使用該位址產生 單元(AGU)與記憶體單元。此一處理器可以是一DSP或任何 其他適合的處理器/微控制器。其餘之描述希望將該等單元 使用於一高功率純量/向量處理器。此一處理器可被單獨使 用或與另一處理器結合。圖1顯示一使用該純量/向量處理 85370 -11 - 200407705 器之最佳組態。於該組態中,三個主要組件係由匯流排u〇 連接。連接此三個組件之匯流排110可以是任何適合的匯流 排,例如一 AMBA高速匯流排(AHB)。該等主要組件是: 包括此等功能單元與一區域資料記憶體(參考圖丨中之向 量記憶體)之可程式化的純量/向量處理器12〇 ; 一包括限制的晶載程式或資料記憶體之微控制器或D s p 子系統1 3 0 ; 一介面區塊140。 該純量/向量處理器12〇主要被用於正規的"重/功率,,處 理尤其疋内邵迴路處理。該純量/向量處理器包括向量處 理的功能。就其本身而論,為了執行該碼之可向量化°部:, 其提供大尺度的平行性。所有信號處理的絕大部分將由嗲 純量/向量處理器之向量部分執行。以一陣列,例如 完全相同之處理元件執行同一指令’其提供大規模的平行 
性。結合-32·字廣泛記憶體介面,以低成本及—般的功率 消耗量’導致前所未有的可程式化的效率水準。然而,备 許多演算法不足以表示該正確形式之資料平行性時,通; 芫全利用該平行性是不可行的。根據A—的定律 代碼之直接可向量化部分向量化之後,大部分的時間花; 孩剩餘的代碼。該剩餘的代碼可被分成四種: 用=:)的位址(例如’増加一指標到—循環緩街器,使 正規的向量作業(即,向量作業對應於該向量 迴路) 〜王鉻的王 85370 -12- 循環 非正規的向量作業 々:用j此等種類之每-種的代碼部分,高度依賴該演 、夕勺執行。例如,該G〇lay關聯器(用於搜尋)需要 許多有關此等指令的位址,㈣是其他演算法的案例,例 士曰根據本發明能夠藉由使用該AGU/記憶體單元,以 取佳化此等有關指令與循環之位址的效率。藉由一處理器 :緊湊的積分純量與向量處理,以最佳化該正規純量作業 ⑽之作業。由藏等創作者已揭露的所有與3G數據機相關 <磯算法疋研究,不正規純量作業部分非常受到限制。該 屬性使該純量/向量處理器12〇與該微控制器或歸13〇之間 的任務分開,其中該微控制器或Dspl3〇執行該等非正規的 任務,並同時控制該純量/向量處理器。於該最佳組態中, 該純量/向量處理器12〇作為一可程式化的共處理器(於該其 餘4分也稱為CVP,共向量處理器)。該純量/向量處理器12〇 與該微控制器130之間的介面處理通信(例如通過分享的記 憶體)與同步化(例如通過分享的記憶體與狀態信號)。該介 面最好是記憶體映對。 該介面區塊140使該等處理器與該系統的其餘部分互相 作用。於該最佳實施例中’該純量/向量處理器被使用作為 一 2G/3G移動式網路之軟體數據機(資料收發器)。關於此_ 軟體數據機的功能’该介面區塊14 0包括作為一前端與一主 要任務之專用硬體,以根據該微控制器130的控制,傳遞控 制與資料字給該向量處理器,例如DMA。接著由該純量/ 85370 -13- 200407705 向里處理备處理該向量記憶體中的資料。 對該匯流排110而言,該純量/向量處理器處12〇可以是一 «屬裝置,而邊彳政控制奈1 3 〇與該介面區塊1 (可包括一 DMA單元)可作為一主裝置。所有具有該cvp之通信,其程 式,貝料或控制最好是記憶體映對。該記憶體可以是一晶 片外的DRAM,而該DRAM也可被該純量/向量處理器使用 作為一(無)***的記憶體。 於該描述中,主要使用的該用語,,位址計算單元π或 ACU。就該描述的用途而言可視為與”位址產生單元,,或 AGU—樣。該描述集中於使用此類單元來計算資料位址。 熟悉此項技藝之人士也能夠將同一功能用於計算指令位址 (”迴路控制”)。 圖2顯示根據本發明之處理器的主結構。該處理器包括一 管線向量處理區段210。為了支援該向量區段的作業,該純 量/向量處理器包括一配置與該向量區段平行作業之純量 處理區段220。該純量處理區段最好也是管線。為了支援該 向量區段之之作業’至少該向量區段之一功能單元也提供 該純量區段對應部分的功能。例如,一移位功能單元之向 量區段可用作移位一向量,其中由(或傳遞)該移位功能單元 之純量區段提供一純量組件。就其本身而言,該移位功能 單元轉換該向量與該純量區段。因此,至少若干功能單元 不是只有一向量區段,並且同時有一純量區段,其中兮向 量區段與該純量區段因能交換純量資料而互相合作。一功 能單元之向量區段提供該原始處理功率,其中該對廣、的純 85370 -14- 200407705 量區段(即,同一功能單元的純量區段)藉由提供和/或消耗 純量資料,以支援該向量區段的作業。經由一向量管線提 供該等向量區段之向量資料。 於圖2之最佳實施例中,該純量/向量處理器包括下列七 個專門的功能單元。 指令分配單元(IDU 250)。該IDU包含該程式記憶體252, 以讀取連續的VLIW指令,並且將每一指令的7段分配給7個 功能單元。最好包含一支援不超過三個零負擔循環之巢套 階的迴路單元。於該最佳實施例中,沒支援分支、跳越或 中斷。由該限制描述符載入該初始程式計數器,下面將更 加詳細描述。 向量記憶體單元(VMU 260)。該VMU包含該向量記憶體 (未顯示於圖2)。於每一指令期間,該VMU能夠傳送該向量 記憶體之一列或一向量,或者接收一列放入該向量記憶 體。同一指令可規定另外一純量傳送作業和/或一接收作 業。該VMU是唯一連接到外界之功能單元,即連接到該外 部匯流排110。 該碼產生單元(CGU 262)。以有限的欄位算術特殊化該 CGU。例如,該CGU可被用於產生CDMA碼晶片的向量與相 關功能,例如頻道碼與CRC。 ALU-MAC單元(AMU 264)。以正整數與定點算術特殊化 
該AMU。其支援向量間之作業,其中以多向量元件的方式 執行算術。於一最佳實施例中,該AMU也提供若干内部向 量作業,其中以單一向量内之該等元件執行算術。 85370 -15- 200407705200407705 Description of the invention: [Technical field to which the invention belongs] The present invention relates to a processing system. [Previous Technology] The second generation wireless communication standards such as UMTS_FDD, UMTS-TDD, IS2000 and TD-SCDMA operate at very high frequencies. These third-generation mobile communication standard-rate modems (transceivers) such as UMTS require about 100 times more digital signal processing power than GSM. In order to be able to handle different standards, and to be flexible to adapt to these new standards It is hoped to implement a standard for a transceiver to use these standards of a private architecture. A known method for improving efficiency is to store multiple data elements in a row of memory and operate on more than one shell element at a time. For example, these systems are known as SIMD (single instruction, multiple data) or MIMD (multiple instructions, multiple data). A vector processor is an example of a SIMD processor. Generally, a wide memory with a port can be described as And write access to at least one column of the memory. In this example, the 4 column width of the memory can store a vector, which can read or write one or more vectors at a time. For accesses smaller than a vector There are no special rules for the unit. For the best use of a memory, it is also desirable to be able to access data elements that are smaller than the full width of a memory row in an efficient manner. Generally, A small unit can be stored in a row of the memory, some of which are unused, which increases the storage cost. In addition, these small units are stored in a chained form in a column, which needs to be read or written- The entire column, and $ _ take more processing instructions and cycles, or insert a desired smaller unit, Xingcheng the overall memory column. The above will reduce the efficiency. 
These problems will become 85370 -6- 200407705 more serious , Where the width of the memory obviously exceeds the size of the smaller order. For example, 'for speech recognition', the data elements that store a speech feature vector are usually 8 to 16 bits wide. For phone communications The yr encoding / modulation data component is often 8-bit (or 2 * 8-bit composite value) 4. This size needs to be increased. For newer telephone communication systems or enhanced voice = different efficiency, I hope to improve the processing speed .Therefore, a wide range of eating two bodies is used to increase the processing speed, but no special method is needed, that is, it can increase: the amount of memory to store data, or even slow down as described above = form of memory storage [Summary of the Invention] The purpose of the present invention is to provide a processor architecture that can provide fast memory access to a wide range of memory, and also use smaller data elements. "To achieve this purpose The processing system has a processor and a physical memory. The physical memory has a single-size memory port for accessing data in the memory. The processor is configured to match at least the first data size. And-smaller "data operation of the second data size; the first data size is equal to or smaller than the size of the memory material; the processing system includes the first data material, the V material material temporarily connected to the memory port; At least the material port of the second data is connected to the data register, and the processor is enabled to access the data elements of the spear-lean material size. : Method uses conventional memory with a word size. The above makes the record = :: Cheng keeps falling. Generally, the size of the memory word matches the size of the process and word. In order to access smaller data elements, 85370 200407705 uses an intermediate register. 
An extra cry has been added to enable access to the car's small data element in the temporary port. # Φ The use of the day register is very obvious and easy. To the core of the processor, it looks like the memory: there are different sizes of ports. In this way, small data elements can be quickly added. Second, more instructions are needed. For example, in order to access small data elements for the large capital, the selection and / or shift of large data elements. If you are obsessed with the item No. 2 of the patent, the size of the memory is equal to twice the size of the second material. In this way, the register can store two pieces of baby materials. In particular, in the case where these small data components are accessed consecutively, the physical memory-access, can quickly access to the "connected" Beckham materials. The above will reduce waste in accessing the The entity records the time of the fe. As described in item 3 of the scope of the patent application to which it belongs, for the first reading of the data port, a -multiplexer is used, and the data register is selected from the data register according to the control of the -read address. _Data element of two data sizes. For example, the most important part of the read address can be used to check whether the data element already exists in the register (but if the part is not used, retrieve it from the physical memory) The least important part of the child can be used to select the data element in the register. As described in item 4 of the scope of the patent application to which it belongs, a solution multiplex is used for a nest of data ports. Control of the access address, inserting a data element of the second data size in a selectable position of the data register. The selection is performed similarly to the read port described. 
As described in the scope of the patent application to which it belongs Item 5, where The system includes a plurality of data ports connected to the second data size of 85370 200407705 to the processor, and for each of these data frames, an associated individual data register is connected to the individual data port and the physical memory For example, if an alignment processes two consecutive data streams, each of these data streams can use one of these registers and these data ports. Each data stream can then access at least two Continuous data elements, but only one of them is used to access the physical memory. As described in item 6 of the scope of the patent application to which it belongs, in order that these registers contain a `` copy '' of the data in the memory And / or more than one register may cause a consistent conflict, and a check is performed. The system maintains updated data in the register, so the update of a small data component usually does not update the physical memory A write operation occurs. In addition, the updated small data component can also be read directly from the register, further saving the number of memory accesses. For these data ports (the relevant temporary Register allows write access), the processor is stored in a more consistent register, and the data information is stored in the register. The data is used to check whether the read data (from the entity Memory or one of the registers) is accessing changed data (but the physical memory or registers may not have been updated). The identifying information is preferably one used to access the physical memory The physical address of one of the words, where the width of the word is the same as that of the memory. In this way, it is easy to check whether direct access to the physical memory may conflict with the data stored in these registers. 
Described in item 8 of the scope of the patent application to which it belongs, the coordination checker includes a conflict resolver to obtain these corrective steps in response to a potentially consistent conflict. In this way, the design has been obtained when the program was designed The method replaces the programmer with 85370 200407705. One method of obtaining these correction steps, as described in item 9 of the scope of the patent application to which it belongs, is to mark the data register as read access invalid, in response to read access to one of the data registers, resulting in The memory reloads the contents of the data register. In addition, as described in item 10 of the scope of the applied patent, the coordination checker includes a coordination register for each individual data register, and the data register stores and identifies the data stored in the individual Information about the data in the register; and configuring the modifier to store the same data based on the identification information, in response to write access to one of these data registers and / or write access to the physical memory, so write Copy the contents of the data register or the memory to all the remaining data registers and / or one of the memories. In this embodiment, according to the coordination register, the same data should be stored, and the updated data is copied to all the registers. If direct access to the physical memory is allowed, the above includes copying the data to the physical memory. For these large data elements, it is also best to directly access the physical memory via an intermediate register. In the case of writing to a register, a write is not automatically forced to access the memory. 
As described in item 11 of the scope of the patent application to which it belongs, at least one of the data registers (the `` read register '' below) is connected to a read data port, and at least one of the data registers One ("write register" below) is connected to a write data port; and the processor includes a bypass, which can optionally provide data from the write register to the read data port; the coordination check The register includes a coordinating register for each individual data register to store information identifying the data stored in the relevant data register; configure the conflict resolver in response to 85370 -10- 200407705 write data entry The write register of the bypass path is activated to continuously read and access the read register to obtain these correction steps. If the read register stores the same data element according to the identification information. Due to the use of a bypass, when a write register with the same content is updated, a read register does not need to be reloaded. The information is then read directly from the update write register. In this way, access to the physical memory is kept to a minimum. A delay sometimes occurs during access to the write scratchpad. The physical memory is preferably based on a cost-effective port SRAM. In order to obtain a wide range of cost-effective physical memory, it is better to use a physical memory formed by a plurality of parallel RAM memories to store many small data elements in one memory word. The memory is preferably embedded in the processor. It is advantageous for the described architecture to be used in a scalar / vector processor, where the vector segment operates on a vector of the first data size and the scalar segment operates on a scalar of the second data size, The first data width is twice the second data width. In this configuration, the same memory can be used to store the vectors and scalars. 
This configuration also makes it easy to perform scalar jobs on the vector element. [Embodiment] To optimize signal processing, it is best to use the address generation unit (AGU) and memory unit in a processor. This processor may be a DSP or any other suitable processor / microcontroller. The rest of the description hopes to use these units in a high-power scalar / vector processor. This processor can be used alone or in combination with another processor. Figure 1 shows an optimal configuration using the scalar / vector processing 85370 -11-200407705. In this configuration, the three main components are connected by a bus u0. The bus 110 connecting the three components may be any suitable bus, such as an AMBA high-speed bus (AHB). The main components are: a programmable scalar / vector processor including these functional units and a regional data memory (refer to the vector memory in Figure 丨) 12; a wafer-based program or data including restrictions The memory microcontroller or the D sp subsystem 1 3 0; an interface block 140. The scalar / vector processor 120 is mainly used for regular " weight / power " processing, especially for internal loop processing. The scalar / vector processor includes vector processing functions. For its part, in order to perform the vectorizable part of the code :, it provides large-scale parallelism. The vast majority of all signal processing will be performed by the vector portion of the 处理器 scalar / vector processor. Executing the same instruction 'in an array, e.g., identical processing elements, provides massive parallelism. Combined with the -32-word wide-memory interface, low cost and average power consumption 'have led to unprecedented levels of programmable efficiency. However, when many algorithms are insufficient to represent the parallelism of the data in the correct form, it is not feasible to make full use of the parallelism. According to A's law, the code is directly vectorizable. 
After partial vectorization, most of the time is spent; the remaining code. The remaining code can be divided into four types: using the address of = :) (for example, '増 adds an index to the-loop slower, so that the regular vector operation (that is, the vector operation corresponds to the vector loop) ~ Wang 85370 -12- Looping non-normal vector operations 々: use each and every type of code of these types, which are highly dependent on the performance and implementation. For example, the Golay correlator (for searching) requires many Regarding the addresses of these instructions, ㈣ is a case of other algorithms. For example, according to the present invention, the efficiency of the addresses of these related instructions and loops can be optimized by using the AGU / memory unit. A processor: compact integral scalar and vector processing to optimize the work of the normal scalar operation. All studies related to 3G modems by Tibetan and other creators have been studied, not formal The scalar job part is very limited. This attribute separates the tasks between the scalar / vector processor 120 and the microcontroller or 130, where the microcontroller or Dspl30 performs these informal tasks And control the scalar / Vector processor. In the optimal configuration, the scalar / vector processor 12 is used as a programmable coprocessor (the remaining 4 points are also called CVP, common vector processor). The scalar The interface between the vector processor 12 and the microcontroller 130 handles communication (e.g., via shared memory) and synchronization (e.g., via shared memory and status signals). The interface is preferably a memory mapping The interface block 140 enables the processors to interact with the rest of the system. In the preferred embodiment, the scalar / vector processor is used as a software modem for a 2G / 3G mobile network. (Data Transceiver). 
Regarding the function of the software modem, the interface block 140 includes dedicated hardware as a front end and a main task to transfer control and data words to the microcontroller 130 according to the control. The vector processor, such as DMA. The scalar / 85370 -13- 200407705 is then processed inward to process the data in the vector memory. For the bus 110, the scalar / vector processor is 12 °. Can be a «genuine device, Frontier government control Nai 130 and the interface block 1 (which can include a DMA unit) can be used as a master device. All communication with the cvp, its program, data or control is preferably a memory mapping. The memory can be an off-chip DRAM, and the DRAM can also be used by the scalar / vector processor as a (none) inserted memory. In this description, the term is mainly used, the address calculation unit π or ACU. For the purpose of this description, it can be regarded as "address generation unit," or AGU. The description focuses on the use of such units to calculate data addresses. Those skilled in the art can also use the same The function is used to calculate the instruction address ("loop control"). Figure 2 shows the main structure of a processor according to the present invention. The processor includes a pipeline vector processing section 210. To support the operation of the vector section, the scalar / vector processor includes a scalar processing section 220 configured to operate in parallel with the vector section. The scalar processing section is also preferably a pipeline. In order to support the operation of the vector section, at least one functional unit of the vector section also provides the function of the corresponding part of the scalar section. For example, the vector section of a shift function unit can be used as a shift vector, where a scalar component is provided (or passed) by the scalar section of the shift function unit. For its part, the shift function converts the vector and the scalar section. 
Hence, at least some functional units have not only a vector section but also a scalar section, where the vector section and the scalar section co-operate by exchanging scalar data. The vector section of a functional unit provides the raw processing power, while the corresponding scalar section (i.e. the scalar section of the same functional unit) supplies and/or consumes scalar data to support the operations of the vector section. The vector data for the vector sections are supplied via a vector pipeline. In the preferred embodiment of Figure 2, the scalar/vector processor includes the following seven specialized functional units.

Instruction Distribution Unit (IDU 250). The IDU contains the program memory 252, reads successive VLIW instructions, and distributes the seven segments of each instruction to the seven functional units. It preferably contains a loop unit that supports up to three nested levels of zero-overhead looping. In the preferred embodiment it does not support branches, jumps or interrupts; the initial program counter is loaded from the task descriptor, described in more detail below.

Vector Memory Unit (VMU 260). The VMU contains the vector memory (not shown in Figure 2). During each instruction it can send a line or a vector from the vector memory, or receive a line and store it in the vector memory. The same instruction may additionally specify a scalar send operation and/or a scalar receive operation. The VMU is the only functional unit connected to the outside world, i.e. to the external bus 110.

Code Generation Unit (CGU 262). The CGU is specialized in finite-field arithmetic. For example, the CGU can be used to generate vectors of CDMA code chips and related functions, such as channel codes and CRCs.

ALU-MAC Unit (AMU 264). The AMU is specialized in regular integer and fixed-point arithmetic.
It supports inter-vector operations, in which arithmetic is performed element-wise on multiple vectors. In a preferred embodiment, the AMU also provides a number of intra-vector operations, in which arithmetic is performed on the elements within a single vector.
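The distinction between the AMU's two operation classes can be illustrated with a small sketch. This is a toy model for exposition only, not the patent's implementation; the function names are invented, and the hardware performs the element operations in parallel rather than in a Python loop.

```python
# Illustrative model of the two AMU operation classes: an inter-vector
# operation combines corresponding elements of several vectors, while an
# intra-vector operation works on the elements within a single vector.

def inter_vector_mul(a, b):
    """Element-wise multiply: conceptually one SIMD instruction."""
    assert len(a) == len(b)
    return [x * y for x, y in zip(a, b)]

def intra_vector_add(a):
    """Reduce the elements of a single vector to one scalar."""
    total = 0
    for x in a:
        total += x
    return total

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(inter_vector_mul(a, b))                     # [5, 12, 21, 32]
print(intra_vector_add(inter_vector_mul(a, b)))   # 70 -- a dot product
```

Chaining the two, as in the last line, corresponds to the common MAC pattern of vector DSP code: element-wise multiplies followed by an intra-vector accumulation.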

Shuffle Unit (SFU 266). The SFU can rearrange the elements of a vector according to a specified shuffle pattern.

Shift-Left Unit (SLU 268). The SLU can shift the elements of a vector to the left by one unit, i.e. by a word, a double word or a quad word. The scalar produced at the boundary is delivered to the SLU's own scalar section. Depending on the type of SLU vector operation issued, the scalar consumed at the other boundary is either zero or taken from its scalar section.

Shift-Right Unit (SRU 270). The SRU is similar to the SLU, but shifts to the right. In addition, the SRU can merge consecutive results of intra-vector operations of the AMU.

The table below shows that all FUs have a functional vector section 210, but that some have no control section 230 or scalar section 220.

Functional unit | Control | Scalar | Vector
IDU | sequencing, looping | — | instruction distribution
VMU | address computation | scalar I/O | vector I/O
CGU | — | — | code-vector generation
AMU | indexing, segmentation | broadcast | inter-vector: ALU, MAC, mul, …; intra-vector: add, max
SFU | — | — | vector shuffle
SLU | — | scalar I/O | vector shift
SRU | — | scalar I/O | vector shift

The scalar/vector processor according to the invention applies instruction-level parallelism in two major ways:

vector processing, in which a single instruction operates on vectors of (scalar) data — this approach is also known as single-instruction, multiple-data, or SIMD; and

parallel processing by the multiple functional units, each operating on such vectors — this can be seen as a (restricted) form of VLIW instruction-level parallelism.

Note that these two forms of instruction-level parallelism are independent, and that their effects are cumulative.

Inter-FU communication

The functional units (FUs) operate in parallel. Each FU can receive and send vector data.
Many FUs can also receive and send scalar data. Each of the target FUs — the VMU, CGU, SFU, SLU and SRU with one vector input each, and the AMU with two — can receive a vector from any of the other FUs as a source; the network is almost fully connected, only links that serve no purpose being omitted.

All functional units operate in parallel. Upon receipt of their segment of an instruction, they input, process and output data, both vector data and, where applicable, scalar data. Communication among the FUs takes place strictly among the scalar sections or among the vector sections (inter-FU communication): the vector sections of all FUs except the IDU are connected by a pipeline. In a preferred embodiment, this pipeline is configurable on an instruction-by-instruction basis. For this purpose, the FUs are preferably interconnected by an interconnect network that, in principle, allows each vector section to receive a vector from any of the other vector sections in each cycle. This feature enables, among other things, the creation of arbitrary pipelines of the FUs (except the IDU). In each clock cycle, each of the six functional units contributing to the vector path can output a vector and send it to other units in parallel, and can likewise receive a vector from another unit. As shown in Figure 2, the network is preferably formed by each FU being connected as a signal source to one network path (indicated by a dot) and being connected to all other paths as a signal sink (indicated by a triangle). The VLIW segment for an FU indicates from which path it should consume a vector; in this way the pipeline is configured on an instruction basis. Each path can transfer a full vector, e.g. using 256 parallel wires. Similarly, at least some of the scalar sections of the FUs are connected by a separate pipeline, preferably also configurable on an instruction basis.
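The instruction-configured routing described above can be modelled in a few lines. This is an illustrative sketch only; the FU names and the dictionary-based interface are invented for exposition, and in hardware every path carries its vector simultaneously on dedicated wires.

```python
# Toy model of the instruction-configured vector interconnect: each cycle,
# every FU drives one broadcast path with its output vector, and each FU's
# VLIW segment names the path (i.e. the source FU) it reads from.

def route_vectors(outputs, read_select):
    """outputs: {fu: vector driven on that FU's path this cycle}.
    read_select: {fu: source FU named in its VLIW segment, or None}.
    Returns the vector each FU receives this cycle."""
    received = {}
    for fu, src in read_select.items():
        received[fu] = outputs[src] if src is not None else None
    return received

outputs = {"vmu": [1, 2, 3, 4], "amu": [9, 9, 9, 9], "sfu": [0, 0, 0, 0]}
# This instruction: the AMU reads the VMU's path, the SFU reads the AMU's.
sel = {"amu": "vmu", "sfu": "amu", "vmu": None}
print(route_vectors(outputs, sel))
# {'amu': [1, 2, 3, 4], 'sfu': [9, 9, 9, 9], 'vmu': None}
```

A different VLIW instruction simply supplies a different `read_select`, which is what makes the pipeline configurable per instruction.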
The interconnect network among the scalar sections of the FUs is incomplete in the sense that the scalar sections of at least some FUs cannot send or receive scalars; as a result, fewer pipeline orderings can be configured. The scalar and vector pipelines can be configured independently: for example, the relevant VLIW segments indicate, for a functional unit, from where it reads the scalar pipeline and from where it reads the vector pipeline. The connections among the control sections of the different functional units are not described here; these sections receive their segment of the VLIW instruction from the IDU, update their own state, and control their respective scalar and vector sections.

Intra-FU communication

Within an FU there is tight interaction between the sections (intra-FU communication); this interaction is an integral part of the FU's operation. Examples are the SLU and the SRU, where the scalar produced or consumed is delivered to, or taken from, the scalar section of the corresponding FU. Instructions are typically executed in a single cycle; exceptions are caused by congestion at the vector memory and by manifest stall cycles.

Data widths

In the preferred embodiment, the scalar/vector processor supports several data widths and data types, as shown in Figure 3. The basic unit of memory addressing is a word, also called a single word. A memory access can be a single word (W), a double word (DW, or 2W = 16 bits), or a quad word (QW, or 4W = 32 bits); the size of a word is W = 8 bits. Scalars preferably come in three sizes: (single) words, double words and quad words. A vector has a fixed size of PQ quad words, and is preferably structured in one of the following three formats:

PQ elements of quad-word size,
PD = 2PQ elements of double-word size, or

PS = 2PD = 4PQ elements of (single-)word size.
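The three formats are three views of the same bits. A minimal sketch, assuming the preferred parameters PQ = 8 and W = 8 bits (so one vector is PS = 32 words = 256 bits), reinterprets one vector's bytes as words, double words and quad words; little-endian packing is assumed here purely for illustration.

```python
# One 256-bit vector viewed as 32 single words, 16 double words, or
# 8 quad words (assumed: PQ = 8, W = 8 bits, little-endian layout).
import struct

PQ = 8
raw = bytes(range(4 * PQ))          # 32 bytes = one vector

words  = list(raw)                                      # 32 x  8-bit
dwords = [x[0] for x in struct.iter_unpack("<H", raw)]  # 16 x 16-bit
qwords = [x[0] for x in struct.iter_unpack("<I", raw)]  #  8 x 32-bit

assert len(words) == 32 and len(dwords) == 16 and len(qwords) == 8
print(dwords[0], qwords[0])   # 256 50462976
```

Because each double word spans two consecutive words and each quad word spans four, double words sit at even word positions and quad words at word positions that are multiples of four, which matches the element-index rule the description states for these formats.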

PQ is a power of two. In the preferred embodiment, the width of the data path and the width of the memory are also PS words.

Instructions

The index range of the vector elements is [0 .. PS); hence, double words have even indices, and the indices of quad words are multiples of four. Figure 3 gives an overview of these data sizes.
The architecture is fully scalable and can be instantiated for any vector size PQ that is a power of two. In most cases, however, PQ = 8 is the preferred choice, implying vectors of 32 words. A CVP instruction is either a control instruction or a VLIW instruction. Control instructions are, for example, zero-overhead loop initializations; there are no branches, jumps or subroutines. A VLIW instruction is divided into segments, where each instruction segment specifies the operation(s) to be performed by the corresponding functional unit. A segment can be further subdivided into a part for the vector section and a part for the scalar section (if present). The segment also includes, for both parts, information on which network part to receive data from (one or more vectors for the vector section, and one or more scalars for the scalar section).

The state of the scalar/vector processor

The state of the CVP is the combined state of its functional units. In the preferred embodiment it comprises:

the vector memory (part of the VMU);
the program memory (part of the IDU);
vector registers (all functional units);
scalar registers (most functional units);
control registers, including the program counter and address-offset registers.

In addition to the registers visible to the programmer, a CVP realization typically contains more registers (vector, scalar and control) for pipelining and caching; these are not part of the CVP instruction-set architecture.

Some of the (vector, scalar and control) registers are so-called configuration registers. The contents of a configuration register can only be loaded from the vector memory; there is no other way to change its value. A configuration register supports the configuration of a functional unit and typically defines a function parameter. By storing these "semi-constant" function parameters in configuration registers, both the instruction width and the program-memory traffic are reduced significantly.
An overview of the components of the CVP state is presented in the table below (entries give the number of registers for data and configuration in the control, scalar and vector paths of each FU).

FU | Control path | Scalar path | Vector path
vmu | offsets 5; addresses 8 | — | data memory 2048
cgu | counters 3; codes 3 | — | state 6; masks 2; polynomials 2
amu | 1 | receive 1; segment size 1 | register file 16
sfu | — | — | register 1; shuffle patterns 4
slu | — | receive 1 | register file 2
sru | — | receive 1 | register file 2
idu | pc 1; loop cu 2 | — | program memory 2048

All registers visible to the programmer can be loaded from the vector memory, and all registers except the configuration registers can be saved to the vector memory. By saving the CVP registers at appropriate points and restoring them at a later time, the CVP can resume a particular task as if no other task had been executed in between. These save and restore operations are optional, may be partial, and must be programmed explicitly.

The memory unit

Figure 4 shows a block diagram of the memory unit (VMU 400), in which the memory arrangement according to the invention is used. In the preferred embodiment described below, the memory unit is used in a vector processor in combination with a wide physical memory that can store an entire vector. It will be appreciated that the same concept can also be applied to scalar processors, such as conventional DSPs. The VMU contains and controls the vector memory 410, which provides a large data bandwidth to the other functional units. The physical vector memory 410 is preferably based on a single-ported SRAM. Since embedded SRAMs of width PS*W are normally not available, one or more banks of wide random-access memories (RAMs) are arranged in parallel to form the physical memory. The scalar data are preferably stored in the same memory as the vector data; in such a system, scalars can be intermixed with the vectors to which they relate.
For cost-efficiency of the memory and optimum access time, the memory preferably only allows reads and writes of full vector lines; as such, the physical memory is logically organized as lines of one vector size each. To support reads and writes of scalars, additional hardware is used (the line caches 430 and the scalar-section support 440) to access the vector-wide physical memory in a scalar fashion.

Figure 5 shows the arrangement in more detail. The physical memory 500 has a full-width port 505 (in the example, of vector width). Only the read path is shown in the figure; persons skilled in the art will easily be able to devise a similar arrangement for writing data. The arrangement includes at least one register of the same width as the physical-memory port 505; four registers 510, 512, 514 and 516 are shown. All registers are selectively connectable to the read port 505 for receiving data. In the figure, the register 514 is used for reading smaller data elements, in this case scalars; preferably, at least two of these smaller data elements fit in the register. The data register 514 is associated with a scalar read port 525 coupled to a processing unit (or, more generally, a data sink). A multiplexer 520 is preferably coupled to the register 514 to select the relevant scalar data. The multiplexer is controlled by the number of the scalar within the register, as given by the least significant bits of the scalar's address (for example, with a 256-bit vector holding 32 eight-bit words). Such multiplexers are well known and are not described further. The register is connected to the read port 505 of the physical memory for receiving the (full-width) data. In general, there may be Nr scalar read ports, each connected to a vector-wide register; these may be separate registers or even parts of one register. The registers are part of the cache 430 of Figure 4.
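The scalar-read path just described — a wide line register refilled from the single wide port, with the low address bits steering a multiplexer — can be sketched as follows. This is an illustrative model under assumed sizes (32 one-byte words per 256-bit line); the dictionary-based register is an exposition device, not the patent's hardware.

```python
# Scalar reads from a vector-wide memory: the upper address bits select a
# line, the 5 least significant bits select a word within the cached line.

LINE_WORDS = 32                    # 32 x 8-bit words per 256-bit line

def split_address(addr):
    """Return (line_address, word_within_line) for a word address."""
    return addr // LINE_WORDS, addr % LINE_WORDS

def scalar_read(memory_lines, addr, line_reg):
    """line_reg plays the role of the vector-wide register 514: it is
    refilled over the wide port only when the wanted line is not held."""
    line_no, word_no = split_address(addr)
    if line_reg.get("tag") != line_no:        # miss: one wide memory read
        line_reg["tag"] = line_no
        line_reg["data"] = memory_lines[line_no]
    return line_reg["data"][word_no]          # the multiplexer 520

mem = [list(range(i * 32, i * 32 + 32)) for i in range(4)]
reg = {}
print(scalar_read(mem, 37, reg))   # line 1, word 5 -> 37 (wide read)
print(scalar_read(mem, 38, reg))   # hit in the same cached line -> 38
```

The second access illustrates why, with spatial locality, the wide port is exercised far less often than the scalar port, which is the point made in the text below about access frequencies.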
The multiplexers are part of the scalar-section block 440. Although not shown in the same manner, there may similarly be Nw scalar write ports with Nw vector-wide registers in the cache 430. For each scalar write port, the corresponding register in the cache 430 is connected to a vector-wide Nw-input demultiplexer that selects which cached line is written back to the physical memory. When a VMU instruction requires several cached lines to be written back, the write-backs are performed sequentially, stalling all other functional units until all writes have completed. Accesses via the different write ports, even within the same instruction, are not allowed to address the same line of the physical memory. Assuming spatial locality of successive scalar accesses (e.g. successive scalars belonging to one processing loop and stored contiguously in the physical memory 410), the frequency of physical-memory accesses for loading/storing these registers is clearly much lower than the frequency of scalar accesses to the registers themselves.

In the preferred embodiment, a vector need not be aligned at vector boundaries in the memory; a vector consisting of PS words can thus have an arbitrary memory address. A memory line has the same size, but its start address is by definition a multiple of PS (for line accesses, the 2log PS least significant bits of the address are ignored). By allowing arbitrary alignment of vectors (typically alignment at the smallest word boundary), the memory can be utilized better, with less empty space. These measures can be seen as enabling the scalar/vector processor to read/write individual vectors, where such a vector may be stored across two consecutive lines of the physical memory. For this purpose, an alignment unit is used for vector send operations. The alignment unit is shown as part of block 440 in Figure 4.
This is shown in more detail in Figure 5. The alignment unit 530 is connected to two line caches 510 and 512 (i.e. two vector-wide registers), which together hold the two lines spanned by the requested vector. When consecutive vectors are accessed, only one new line has to be fetched from the physical memory, since the other line is still present in one of the line caches. The parts of the two cached lines that form the requested vector are combined in a network of multiplexers 530, and then stored in a vector-wide pipeline register. The pipeline register receives the data via a vector read port 535; from the pipeline register, the value can be transmitted on the VMU broadcast bus.

Figure 5 also shows a further vector-wide register 516 with an associated vector read port 540 that can read a line directly from the memory, the register again acting as a cache.

The caches around the memory are preferably hidden from the programmer: since these caches are intended to approximate, with a single-ported SRAM, the behaviour of a multi-ported vector memory, the programmer should be able to assume a coherent vector memory. Because each register can hold a (possibly modified) copy of the same data as held in the physical memory, coherency is then maintained automatically instead of being guarded by the programmer. For this purpose, a check for conflicting addresses is performed, i.e. a write to a register is detected when it concerns a line address for which the same line is also held in one of the other registers. Since it suffices to perform this check per line, the line address of the cached data is stored with each register. If a possible conflict is detected, a corrective measure can be taken; for example, when a write occurs to a line that is also held in a read register, the read register is marked invalid.
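The two-line composition performed by the alignment unit 530 can be sketched as follows. This is an illustrative model with a deliberately small assumed line size; the slicing stands in for the multiplexer network, and in hardware the two source lines come from the line caches 510 and 512 rather than directly from memory.

```python
# Unaligned vector read: a vector of LINE words starting at word address
# `addr` spans two consecutive physical lines; the alignment network takes
# the tail of the first line followed by the head of the next.

LINE = 8   # words per line (small for readability; the text uses 32)

def read_unaligned_vector(memory_lines, addr):
    line_no, offset = divmod(addr, LINE)
    low = memory_lines[line_no]        # first cached line (register 510)
    high = memory_lines[line_no + 1]   # second cached line (register 512)
    return low[offset:] + high[:offset]   # multiplexer network 530

mem = [list(range(i * LINE, i * LINE + LINE)) for i in range(4)]
print(read_unaligned_vector(mem, 5))   # [5, 6, 7, 8, 9, 10, 11, 12]
```

Note that reading the next consecutive vector (address 13) reuses the line already cached and fetches only one new line, which is the streaming behaviour described above.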
The read register is then not used further until it has been re-read from the physical memory (after the write register has first been written back to the memory). Alternatively, upon a write to the write register, the contents of the write register can be copied to all read registers holding the same line. A third possibility is to share registers between read and write ports; this approach requires more vector-wide multiplexers and is therefore more costly, but offers a performance advantage: in effect a bypass channel is established, in which a read register connected to a read port is bypassed and the data are actually read through the read port from a write register. All such corrective measures take place in a function collectively referred to as the "coherency checker". To determine which registers hold possibly stale copies of the same data, each data register is associated with a coherency register holding bookkeeping information on the stored contents; the coherency register preferably stores the physical address of the data held in the corresponding data register. The same coherency check and measures can also be applied to vector reads, where the vector is (partly) held in a register associated with a write port, instead of only to scalar accesses. Reading or writing a line of the physical memory, i.e. a single access of the physical memory 510, is preferably performed in a single clock cycle.

In a single VMU instruction, the vector memory unit can support up to four concurrent "sub-operations":

send a vector, or send a line, or receive a line, from/to a VM location;
send a scalar from a VM location;
receive a scalar into a VM location;
modify the state/output of an address computation unit.
vmu_cmd = (vopc, aid_v, ainc_v, sopc, aid_s, ainc_s, size, srcv, aid_r, ainc_r, aopc, aid_a, imm_addr)

vopc = NOP | SENDL | SENDV | RCVL_CGU | RCVL_AMU | RCVL_SFU | RCVL_SLU | RCVL_SRU
aid_v = {0, …, 7}
ainc_v = NOP | INC
sopc = NOP | SEND
aid_s = {0, …, 7}
ainc_s = NOP | INC
size = WORD | DWORD | QWORD
srcv = NONE | VMU | AMU | SLU | SRU
aid_r = {0, …, 7}
ainc_r = NOP | INC
aopc = NOP | IMM | LDBASE | LDOFFS | LDINCR | LDBOUND
aid_a = {0, …, 7}
imm_addr = {0.0, …, 524288.31} | {-262144.0, …, 262143.31}

A VMU instruction may take a varying number of clock cycles, depending on the number of sub-operations and on the continuity of the address sequence.

The VMU inputs/outputs are:

Input | Description
cmd | VMU command
rcv_amu | AMU vector receive bus
rcv_cgu | CGU vector receive bus
rcv_sfu | SFU vector receive bus
rcv_slu | SLU vector receive bus
rcv_sru | SRU vector receive bus
s_rcv_amu | AMU scalar receive bus
s_rcv_slu | SLU scalar receive bus
s_rcv_sru | SRU scalar receive bus

Output | Description
snd | VMU vector result
s_snd | VMU scalar result

In addition, two scalar ports (one send, one receive) are connected to the external bus; synchronizing accesses to this memory with CVP instructions is the task of the microcontroller 130.

The vector section of the VMU contains the physical vector memory 510:

Name | Description
mem[4096][32] | vector memory: 4096 lines of 32 words each

Note that the vector sub-operations cannot access the scalar memory; the most significant address bits of vector sub-operations are therefore ignored. The vector section of the VMU supports seven sub-operations, encoded in the VOPC field of the instruction: vector send (SENDV), line send (SENDL), and five line-receive sub-operations (RCVL_CGU, RCVL_AMU, RCVL_SFU, RCVL_SLU, RCVL_SRU). The functional unit acting as the source of a line receive is explicitly encoded in the corresponding line-receive sub-operation.
The read address or write address of each sub-operation is specified by a corresponding address computation unit. All vector sub-operations share the AID_V field, which identifies the ACU; the AINC_V field, passed to that ACU, specifies whether the affected address computation unit performs a post-increment operation.

Guard | Transition
vopc = NOP | none
vopc = SENDL | snd = mem.line[acu[aid_v].out]
vopc = SENDV | snd = mem.vector[acu[aid_v].out]
vopc = RCVL_CGU | mem.line[acu[aid_v].out] = rcv_cgu
vopc = RCVL_AMU | mem.line[acu[aid_v].out] = rcv_amu
vopc = RCVL_SFU | mem.line[acu[aid_v].out] = rcv_sfu
vopc = RCVL_SLU | mem.line[acu[aid_v].out] = rcv_slu
vopc = RCVL_SRU | mem.line[acu[aid_v].out] = rcv_sru

Note that these operations are formulated as send (or receive) actions, not as load (or store) actions involving a destination (or source); the latter are specified by the corresponding operations in the other functional units. A line send is functionally equivalent to a vector send with the same address. Line-send sub-operations are typically used to configure the functional units, or to restore task state held in the various registers. By providing a special mode for line sends, the access time of successive vector sends ("vector streaming") can be optimized through efficient use of the caches.

The scalar sub-operation of the VMU is encoded in the SOPC field of the instruction. Only one sub-operation is supported: scalar send (SEND). The read address is specified by the address computation unit identified by the AID_S field; the AINC_S field of the instruction specifies whether that address computation unit performs a post-increment operation. The operand size (WORD, DWORD or QWORD) of the scalar sub-operation is determined by the SIZE field of the instruction.
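The VOPC guard table above is, in effect, a dispatch on the sub-operation code. A minimal sketch of that dispatch follows; it is illustrative only (the string codes and dictionary buses are exposition devices), and it ignores the unaligned SENDV case handled by the alignment unit.

```python
# Dispatch model of the VOPC guard table: a vector sub-operation either
# drives the send bus from memory, or writes a received line into memory
# at the line address supplied by the selected ACU.

def vmu_vector_subop(vopc, mem, addr, rcv_buses):
    """mem: list of lines; addr: line address from acu[aid_v].out;
    rcv_buses: {"cgu": line, "amu": line, ...} receive buses."""
    if vopc == "NOP":
        return None
    if vopc in ("SENDL", "SENDV"):       # SENDV additionally aligns
        return mem[addr]
    if vopc.startswith("RCVL_"):
        mem[addr] = rcv_buses[vopc[5:].lower()]
        return None
    raise ValueError(vopc)

mem = [[0] * 4, [0] * 4]
vmu_vector_subop("RCVL_AMU", mem, 1, {"amu": [7, 7, 7, 7]})
print(mem[1])                                   # [7, 7, 7, 7]
print(vmu_vector_subop("SENDL", mem, 1, {}))    # [7, 7, 7, 7]
```

The example first receives a line from the AMU bus, then sends the same line back out, mirroring how a line receive followed by a line send restores or forwards stored state.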
Guard                        Transition
sopc = NOP                   none
sopc = SEND && size = WORD   s_snd = mem.word[acu[aid_s].out]
sopc = SEND && size = DWORD  s_snd = mem.dword[acu[aid_s].out]
sopc = SEND && size = QWORD  s_snd = mem.qword[acu[aid_s].out]

Guard          Transition
srcv = NONE    none
srcv = VMU     mem.scalar[acu[aid_r].out] = s_rcv_vmu
srcv = AMU     mem.scalar[acu[aid_r].out] = s_rcv_amu
srcv = SLU     mem.scalar[acu[aid_r].out] = s_rcv_slu
srcv = SRU     mem.scalar[acu[aid_r].out] = s_rcv_sru

A scalar receive sub-operation of the VMU is encoded in the SRCV field of the instruction. If its value is NONE, no scalar receive is performed; otherwise the SRCV field determines which functional unit is used as the source of the scalar receive. The address computation unit specified in the AID_R field supplies the write address. The AINC_R field of the instruction specifies whether that address computation unit should perform a post-increment operation. The operand size (WORD, DWORD or QWORD) of the receive is determined by the size of the scalar that is received.

The send and receive sub-operations can be combined into a scalar move operation from one VMU location to another, the address of each access being specified by a corresponding address computation unit.

The control section 550 of the VMU is mainly a set of address computation units (ACUs), also called address generation units (AGUs), which support addressing modes like those of conventional DSPs. Such a unit performs one or more address calculations per instruction without using the processor's main data path; for example, a scalar address can be post-incremented after each scalar read access. This allows address calculation to take place in parallel with arithmetic operations on the data, improving the performance of the processor. Depending on the set of addressing modes supported, such an ACU needs access to a number of registers.
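One such ACU register set (the base, offs, incr and bound registers described next) and its post-increment update rule can be sketched as follows; this is an illustrative model of the addressing behaviour, not the patent's implementation:

```python
# Sketch of the ACU addressing mode: each access uses address base + offs,
# after which offs is post-incremented modulo bound:
#     offs := (offs + incr) mod bound
# base, incr and bound change only infrequently (set up before a loop).

class ACU:
    def __init__(self, base, offs, incr, bound):
        self.base, self.offs, self.incr, self.bound = base, offs, incr, bound

    def next_address(self):
        addr = self.offs + self.base        # address of this access
        self.offs = (self.offs + self.incr) % self.bound  # post-increment
        return addr

# A circular buffer of 8 elements at base address 100, stride 3:
acu = ACU(base=100, offs=0, incr=3, bound=8)
addrs = [acu.next_address() for _ in range(4)]   # 100, 103, 106, 101
```

Note how the fourth address wraps: offs runs 0, 3, 6, then (6 + 3) mod 8 = 1, which is the modulo behaviour a conventional DSP uses for circular buffers.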
For example, relative addressing, i.e. addressing relative to a so-called base address, requires a base register base. The offset with respect to the base address is stored in an offset register offs; the offset is incremented before or after each access by a value stored in an increment register incr, and the address is wrapped modulo a value stored in a bound register bound.

This set of addressing modes supports the following: with an offset register offs, after each memory access (read or write) at address base + offs, the register offs is updated according to offs := (offs + incr) mod bound. Hence offs changes frequently (after every access), whereas the values stored in base, incr and bound change only infrequently; typically these three registers are initialized prior to a program loop. The operation of the ACUs is not described in further detail here.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The words "comprising" and "including" do not exclude the presence of elements or steps other than those listed in a claim.

[Brief Description of the Drawings]

These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter. In the drawings:

Figure 1 shows a preferred configuration in which the scalar/vector processor according to the invention may be used;
Figure 2 shows the main structure of the scalar/vector processor according to the invention;
Figure 3 shows the supported data widths and data types;
Figure 4 shows a block diagram of the vector memory unit; and
Figure 5 illustrates the intermediate registers and the two ports that use them.

[Explanation of Reference Numerals]

110        bus
120        scalar/vector processor
130        DSP or micro-controller
140        controller/interface
210        pipelined vector processing section
220        scalar processing section
230        control section
250        instruction distribution unit
252        program memory
260        vector memory unit
262        code generation unit
264        ALU-MAC unit
266        shuffle unit
268        shift-left unit
270        shift-right unit
W          word
DW         double word
QW         quad word
410        physical memory
430        line cache
440        vector alignment + scalar section
500        physical memory
505, 525   read ports
510        register
512        register
514        register
516        register
520        multiplexer
440        scalar section block
530        alignment unit
535, 540   read ports

Claims (1)

200407705

Patent Claims:

1. A processing system with a processor and a physical memory, the physical memory having a single-size memory port for accessing data in the memory; the processor being arranged to operate on data elements of at least a first data size and a smaller second data size, the first data size being equal to or smaller than the memory port size; the processing system including at least one data register of the first data size connected to the memory port, and a data port of the second data size connected to the data register and to the processor, enabling access to data elements of the second data size.

2. The processing system as claimed in claim 1, wherein the memory port size is at least twice the second data size.

3. The processing system as claimed in claim 2, wherein the data port is a read port, and the processing system includes a multiplexer for, under control of a read address, selecting and retrieving a data element of the second data size from the data register.

4. The processing system as claimed in claim 2, wherein the data port is a write port, and the processing system includes a de-multiplexer for, under control of a write address, inserting a data element of the second data size at a selectable position in the data register.

5. The processing system as claimed in claim 1 or 2, wherein the processing system includes a plurality of data ports of the first data size connected to the processor, and, for each of the data ports, an associated individual data register connected to the individual data port and to a port of the physical memory.

6. The processing system as claimed in claim 1, wherein the data port is a write port, and the processing system includes a coherency checker that includes, for the data register, an associated coherency register for storing information identifying the data stored in the data register; the coherency checker being operative to check whether data to be read from the memory corresponds to the data stored in the data register, by comparing a read address for accessing the memory with the identifying information stored in the coherency register.

7. The processing system as claimed in claim 6, wherein the identifying information includes a physical address for accessing a word of the physical memory, the width of the word being the same as that of the memory port.

8. The processing system as claimed in claim 6, wherein the coherency checker includes a conflict resolver for taking corrective steps in response to a possible coherency conflict.

9. The processing system as claimed in claim 8, wherein the conflict resolver is arranged to take the corrective steps by marking the data register as invalid for read access, causing the content of the data register to be reloaded from the memory in response to a read access of the data register.

10. The processing system as claimed in claim 5, 6 or 8, wherein the coherency checker includes a coherency register for each individual data register, each coherency register storing information identifying the data stored in the individual data register; the conflict resolver being arranged to respond to a write access to one of the data registers by copying the written data to all other data registers that, according to the identifying information, store the same data.

11. The processing system as claimed in claim 5, 6 or 8, wherein at least one of the data registers (hereinafter the "read register") is connected to a read data port, and at least one of the data registers (hereinafter the "write register") is connected to a write data port; the processor including a bypass for selectably supplying data from the write register to the read data port; the coherency checker including a coherency register for each individual data register, for storing information identifying the data stored in the associated data register; the conflict resolver being arranged to take the corrective steps, in response to data being written into the write register, by activating the bypass path for subsequent read accesses of the read register, if according to the identifying information the read register should store the same data element.

12. The processing system as claimed in claim 1, wherein the physical memory is based on a single-port SRAM.

13. The processing system as claimed in claim 12, wherein the physical memory is formed by a plurality of RAMs arranged in parallel.

14. The processing system as claimed in claim 1, wherein the processor is arranged to operate on vectors of the first data size and on scalars of the second data size, the first data width being at least twice the second data width.

15. The processing system as claimed in claim 1, wherein the memory is embedded in the processor.
TW092113718A 2002-05-24 2003-05-21 A processing system for accessing data in a memory TWI291096B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP02077034 2002-05-24
EP02078618 2002-09-04

Publications (2)

Publication Number Publication Date
TW200407705A true TW200407705A (en) 2004-05-16
TWI291096B TWI291096B (en) 2007-12-11

Family

ID=29585702

Family Applications (1)

Application Number Title Priority Date Filing Date
TW092113718A TWI291096B (en) 2002-05-24 2003-05-21 A processing system for accessing data in a memory

Country Status (9)

Country Link
US (1) US7430631B2 (en)
EP (1) EP1512068B1 (en)
JP (1) JP2005527035A (en)
CN (1) CN1656445B (en)
AT (1) ATE372542T1 (en)
AU (1) AU2003222411A1 (en)
DE (1) DE60316151T2 (en)
TW (1) TWI291096B (en)
WO (1) WO2003100599A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI584192B (en) * 2011-12-23 2017-05-21 英特爾公司 Instruction and logic to provide vector blend and permute functionality

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7725745B2 (en) * 2006-12-19 2010-05-25 Intel Corporation Power aware software pipelining for hardware accelerators
US8489825B2 (en) 2007-04-16 2013-07-16 St-Ericsson Sa Method of storing data, method of loading data and signal processor
US8140771B2 (en) * 2008-02-01 2012-03-20 International Business Machines Corporation Partial cache line storage-modifying operation based upon a hint
US8117401B2 (en) 2008-02-01 2012-02-14 International Business Machines Corporation Interconnect operation indicating acceptability of partial data delivery
US8255635B2 (en) * 2008-02-01 2012-08-28 International Business Machines Corporation Claiming coherency ownership of a partial cache line of data
US8024527B2 (en) * 2008-02-01 2011-09-20 International Business Machines Corporation Partial cache line accesses based on memory access patterns
US8108619B2 (en) * 2008-02-01 2012-01-31 International Business Machines Corporation Cache management for partial cache line operations
US7958309B2 (en) 2008-02-01 2011-06-07 International Business Machines Corporation Dynamic selection of a memory access size
US8250307B2 (en) * 2008-02-01 2012-08-21 International Business Machines Corporation Sourcing differing amounts of prefetch data in response to data prefetch requests
US20090198910A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Data processing system, processor and method that support a touch of a partial cache line of data
US8266381B2 (en) * 2008-02-01 2012-09-11 International Business Machines Corporation Varying an amount of data retrieved from memory based upon an instruction hint
US8035537B2 (en) * 2008-06-13 2011-10-11 Lsi Corporation Methods and apparatus for programmable decoding of a plurality of code types
US7895381B2 (en) * 2009-02-16 2011-02-22 Himax Media Solutions, Inc. Data accessing system
US8117390B2 (en) * 2009-04-15 2012-02-14 International Business Machines Corporation Updating partial cache lines in a data processing system
US8176254B2 (en) * 2009-04-16 2012-05-08 International Business Machines Corporation Specifying an access hint for prefetching limited use data in a cache hierarchy
US8140759B2 (en) * 2009-04-16 2012-03-20 International Business Machines Corporation Specifying an access hint for prefetching partial cache block data in a cache hierarchy
CN101986287B (en) * 2010-11-25 2012-10-17 中国人民解放军国防科学技术大学 Reform buffer for vector data streams
US8688957B2 (en) 2010-12-21 2014-04-01 Intel Corporation Mechanism for conflict detection using SIMD
CA2859999A1 (en) * 2011-01-25 2012-08-02 Cognivue Corporation Apparatus and method of vector unit sharing
US9342479B2 (en) 2012-08-23 2016-05-17 Qualcomm Incorporated Systems and methods of data extraction in a vector processor
US9411592B2 (en) 2012-12-29 2016-08-09 Intel Corporation Vector address conflict resolution with vector population count functionality
US9411584B2 (en) 2012-12-29 2016-08-09 Intel Corporation Methods, apparatus, instructions, and logic to provide vector address conflict detection functionality
US9424034B2 (en) * 2013-06-28 2016-08-23 Intel Corporation Multiple register memory access instructions, processors, methods, and systems
CN104679584B (en) * 2013-11-28 2017-10-24 中国航空工业集团公司第六三一研究所 Vector context switching method
EP3193254A4 (en) * 2014-10-09 2017-10-11 Huawei Technologies Co. Ltd. Asynchronous instruction execution apparatus and method
CN105337995B (en) * 2015-11-29 2019-06-21 恒宝股份有限公司 A kind of quick personalization method of smart card and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4949247A (en) * 1988-02-23 1990-08-14 Stellar Computer, Inc. System for transferring multiple vector data elements to and from vector memory in a single operation
US5379393A (en) * 1992-05-14 1995-01-03 The Board Of Governors For Higher Education, State Of Rhode Island And Providence Plantations Cache memory system for vector processing
US5426754A (en) * 1992-05-26 1995-06-20 International Business Machines Corporation Cross-interrogate method and means for combined scaler and vector processing system
US5537606A (en) * 1995-01-31 1996-07-16 International Business Machines Corporation Scalar pipeline replication for parallel vector element processing
US5689653A (en) * 1995-02-06 1997-11-18 Hewlett-Packard Company Vector memory operations
US6006315A (en) * 1996-10-18 1999-12-21 Samsung Electronics Co., Ltd. Computer methods for writing a scalar value to a vector
US5928350A (en) * 1997-04-11 1999-07-27 Raytheon Company Wide memory architecture vector processor using nxP bits wide memory bus for transferring P n-bit vector operands in one cycle


Also Published As

Publication number Publication date
EP1512068B1 (en) 2007-09-05
DE60316151D1 (en) 2007-10-18
WO2003100599A2 (en) 2003-12-04
US7430631B2 (en) 2008-09-30
US20050240729A1 (en) 2005-10-27
AU2003222411A1 (en) 2003-12-12
EP1512068A2 (en) 2005-03-09
TWI291096B (en) 2007-12-11
JP2005527035A (en) 2005-09-08
CN1656445B (en) 2010-05-05
AU2003222411A8 (en) 2003-12-12
ATE372542T1 (en) 2007-09-15
WO2003100599A3 (en) 2004-07-22
DE60316151T2 (en) 2009-10-22
CN1656445A (en) 2005-08-17

Similar Documents

Publication Publication Date Title
TW200407705A (en) Access to a wide memory
US7568086B2 (en) Cache for instruction set architecture using indexes to achieve compression
JP4339245B2 (en) Scalar / vector processor
US5659785A (en) Array processor communication architecture with broadcast processor instructions
JP4386636B2 (en) Processor architecture
US5968167A (en) Multi-threaded data processing management system
CN101061460B (en) Micro processor device and method for shuffle operations
US7383419B2 (en) Address generation unit for a processor
US6006315A (en) Computer methods for writing a scalar value to a vector
WO1999045472A1 (en) Multi-processor system with shared memory
JPH07141175A (en) Active memory and processing system
WO2001067235A2 (en) Processing architecture having sub-word shuffling and opcode modification
JP2004501470A (en) Generating memory addresses using scheme registers
JP2003501773A (en) Data processor with arithmetic logic unit and stack
US7340591B1 (en) Providing parallel operand functions using register file and extra path storage
JP4164371B2 (en) Data processing apparatus, data processing method, program, and storage medium
EP1238343A2 (en) Digital signal processor having a plurality of independent dedicated processors
JP2003005954A (en) Data processor and method for controlling the same
Van Berkel et al. Address generation unit for a processor
CN117807014A (en) Electronic device and data reading method
JPH07152680A (en) Apparatus, system and method for distributed processing
Krikelis VASP-4096: a very high performance programmable device for digital media processing applications

Legal Events

Date Code Title Description
MK4A Expiration of patent term of an invention patent