TW200806039A

TW200806039A - Method and apparatus for processing algorithm steps of multimedia data in parallel processing systems

Info

Publication number: TW200806039A
Application number: TW096101019A
Authority: TW
Inventors: Lazar Bivolarski; Bogdan Mitu
Original assignee: Brightscale Inc
Priority date: 2006-01-10
Filing date: 2007-01-10
Publication date: 2008-01-16
Also published as: KR20080085189A; WO2007082043A3; KR20080094006A; US20070188505A1; US20070189618A1; CN101371263A; US20100066748A1; KR20080094005A; WO2007082042A3; CN101371264A; WO2007082044A3; WO2007082042A2; TW200803464A; JP2009523291A; EP1971958A2; EP1971956A2; US20070162722A1; JP2009523292A; JP2009523293A; WO2007082043A2

Abstract

An efficient method and device for the parallel processing of data variables. A parallel processing array has computing elements configured to process data variables in parallel. An algorithm for a plurality of computing elements of a parallel processor is loaded. The algorithm includes a plurality of processing steps. Each of the plurality of computing elements is configured to process a data variable associated with the computing element. Selection codes for the plurality of computing elements of the parallel processor are loaded, wherein the selection codes identify which of the algorithm steps are to be applied by the computing elements to the data variables. The algorithm processing steps are applied to the data variables by the computing elements, wherein for each computing element, only those processing steps identified by the selection codes are applied to the data variable.

Description

200806039 九、發明說明：【發明戶斤屬之技術領域3 本申請案請求美國專利申請案第60/758,065號之利益’其係於2006年一月1〇日所提出，其之揭示内容在此合 5 併為參考文獻。發明領域本發明大致係關於平行處理。更具體地說，本發明係關於用來將平行處理系統中之多媒體資料之處理排程之方 B 法與裝置。 10 【ittf才支冬好】發明背景多媒體資料之越來越多的使用已導致越來越多對於更快速的和更有效率的處理這些資料及即時傳送其之方式之需求。具體地說，已有越來越多對於更快速和更有效率平 ‘ 15械處理多媒體資料之方式之需求，諸如影像和相關的音 φ 訊。平行處理之需求通常例如在諸如多媒體資料之壓縮和/ 或解壓縮之計算密集之程序期間所產生的，其需要相對大量的計算，其仍需要足夠快地完成，如此使得即時地傳送音訊和視訊。 20 目此，持續改進多媒體資料之平行處理上的努力是令人滿意的。特別令人滿意的是發展更快速和更有效率的; 法來平行處理這些資料。這些方法需處理區塊平行處理，子區塊平行處理’和雙線性過濾平行處理。【發明内容】 200806039 發明概要本發明可以許多不同方式來實現，包括做為一方法及一電腦可讀取之媒體。本發明之許多不同的實施例在下面討論。 5 在具有組態來平行地處理資料變數之計算元件之一平行處理陣列中，——方法包括載入對一平行處理器之多個計算元件之一演繹法，其中該演繹法包括多個處理步驟，且其中多個計算元件之每一個被組態來處理與該計算元件相關之一資料變數，載入平行處理器之多個計算元件之選擇 10 碼，其中該等選擇碼識別哪些演繹法步驟要由計算元件應用至資料變數，以及由計算元件應用演繹法處理步驟至資料變數，其中對每個計算元件來說，僅那些由選擇碼所識別之處理步驟被應用至資料變數。在其他觀點中，一電腦可讀取媒體具有電腦可執行之 15 指令於其上，其係供在一平行處理陣列中之一處理方法用，該陣列具有組態來平行地處理資料變數之計算元件，該方法包括載入一平行處理器之多個計算元件之一演繹法，其中該演繹法包括多個處理步驟，且其中多個計算元件之每一個被組態來處理與該計算元件相關之一資料變 20 數，載入平行處理器之多個計算元件之選擇碼，其中該等選擇碼識別哪些演繹法步驟要被計算元件應用至資料變數，並由計算元件應用演繹法處理步驟至資料變數，其中對每個計算元件來說，僅有由選擇碼識別的那些處理步驟被應用至資料變數。 6 200806039 本發明之其他目標與特徵將藉由觀看說明，申請專利範圍及所附圖式而變得明顯。圖式簡單說明第1圖觀念地說明一 1080i高晝質（HD)框架之巨集區 5 塊。第2A-2B圖說明在一影像框架内之諸如巨集區塊之區塊之配置。第3A-3C圖說明將巨集區塊從其在一影像内之配置映射至個別平行處理器。 10 第4 A - 4 E圖說明對許多不同的影像格式，將影像映射至個別之平行處理器。第5A-5B圖說明用以將影像之子劃分映射至個別平行處理器之16x8映射。第6A-6B圖說明用以將影像之子劃分映射至個別平行 15 處理器之16x4映射。第7A-7C圖說明根據本發明之一實施例將影像區塊映射至平行處理器之其他方法。第8A-8C圖說明一影像格式之資料結構之進一步之細節，包括亮度和色度資訊。 20 第9A-9C圖說明根據本發明之一實施例，用來映射多個影像區塊至平行處理器之許多不同的其他方法。第10A-10C圖說明根據本發明之一實施例之資料區塊資料位置，子區塊位置，子區塊旗標資料位置，和區塊型式資料。 7 200806039 第11A-11B圖說明演繹法處理步驟及用以識別哪些處理步驟應用至哪些資料變數之選擇碼。笫12圖5兒明一平行處理器。類似的翏考數字指稱圖式中之對應部份。 5 【方包】較佳實施例之詳細說明在此所描述之發明處理平行處理增強之三個主要領200806039 IX. Description of the invention: [Technical field of invention of the invention] 3 This application claims the benefit of US Patent Application No. 60/758,065, which was filed on January 1st, 2006, and its disclosure is here. 5 is a reference. FIELD OF THE INVENTION The present invention generally relates to parallel processing. More specifically, the present invention relates to a method and apparatus for scheduling processing of multimedia material in a parallel processing system. 10 [ITTF is good for winter] Background of the Invention The increasing use of multimedia materials has led to an increasing demand for faster and more efficient processing of such data and the way it is delivered. Specifically, there is a growing need for faster and more efficient ways to process multimedia materials, such as video and related audio. The need for parallel processing is typically generated, for example, during computationally intensive programs such as compression and/or decompression of multimedia data, which requires a relatively large amount of computation, which still needs to be completed quickly enough to allow instant transmission of audio and video. . 20 Therefore, efforts to continuously improve the parallel processing of multimedia materials are satisfactory. Particularly gratifying is the development of faster and more efficient; the law to process these materials in parallel. These methods need to deal with block parallel processing, sub-block parallel processing' and bilinear filtering parallel processing. SUMMARY OF THE INVENTION The present invention can be implemented in a number of different ways, including as a method and a computer readable medium. Many different embodiments of the invention are discussed below. 5 in a parallel processing array having one of the computational elements configured to process the data variables in parallel, the method comprising loading one of a plurality of computational elements of a parallel processor, wherein the deduction comprises a plurality of processing a step, and wherein each of the plurality of computing elements is configured to process a data variable associated with the computing component, the selection of a plurality of computing elements of the parallel processor, 10 codes, wherein the selection codes identify which deductions The steps are applied by the computing component to the data variables, and the computing component applies the deductive processing steps to the data variables, wherein for each computing component, only those processing steps identified by the selection code are applied to the data variables. In other aspects, a computer readable medium has computer-executable 15 instructions thereon for processing in a parallel processing array having a configuration to process data variables in parallel. An element, the method comprising one of a plurality of computing elements loaded into a parallel processor, wherein the deductive method comprises a plurality of processing steps, and wherein each of the plurality of computing elements is configured to process the computing element One of the data is changed to 20, and the selection codes of the plurality of computing elements of the parallel processor are loaded, wherein the selection codes identify which derivation steps are to be applied to the data variables by the computing component, and the derivation processing steps are applied by the computing component to Data variables, wherein for each computing element, only those processing steps identified by the selection code are applied to the data variables. 6 200806039 Other objects and features of the present invention will become apparent from the description, the appended claims, and the appended claims. Brief Description of the Drawings Figure 1 conceptually illustrates a macroblock of a 1080i high-quality (HD) frame. Figure 2A-2B illustrates the configuration of blocks such as macroblocks within an image frame. Figure 3A-3C illustrates mapping a macroblock from its configuration within an image to an individual parallel processor. 10 4A - 4E illustrates the mapping of images to individual parallel processors for many different image formats. Figures 5A-5B illustrate a 16x8 mapping used to map sub-partitions of an image to individual parallel processors. Figures 6A-6B illustrate a 16x4 mapping used to map sub-partitions of images to individual parallel 15 processors. 7A-7C illustrate other methods of mapping image patches to parallel processors in accordance with an embodiment of the present invention. Figures 8A-8C illustrate further details of the data structure of an image format, including luminance and chrominance information. 20 Figures 9A-9C illustrate many different other methods for mapping multiple image blocks to parallel processors in accordance with an embodiment of the present invention. 10A-10C illustrate data block data location, sub-block location, sub-block flag data location, and block type data in accordance with an embodiment of the present invention. 7 200806039 Figure 11A-11B illustrates the deductive processing steps and the selection codes used to identify which processing steps apply to which data variables.笫12 Figure 5 shows a parallel processor. A similar reference number refers to the corresponding part of the schema. 5 [Bags] Detailed Description of the Preferred Embodiments The invention described herein deals with the three main principles of parallel processing enhancement.

域：處理區境平行處理，子區塊平行處理，以及類似㈣繹法平行處理。〃 10 區後平行處理 π‘个％，不贫明係 …州々、川从丁 ^丁爽理多媒 15 20 資料之更有效率之方法。已知在許多不同的影像格式中影像被劃分為區塊’當—般以矩陣型式來觀看影像時，具有“較晚的’’區塊或_般落在影像巾其他區塊之下^及邊之那些區塊，視來自“較早，，區塊之資訊而定，即，那在較晚區塊上面和左邊之影像。較早的區塊必須在較晚之前加以處理，因為較晚者需要來自較早區塊 ^關性資料之資訊。因此，輯(或其之部份)以其相關性 ^之順序被傳运至許多不同的平行處理器。較早區塊先 ♦运給平行處理器，較晚區塊較晚傳送。區塊儲存於平㈣ϋ中之特定位置中’且視需要加以 :時，每倾塊具有以特定位置位於_特定組之較= =其之相關性貧料。以此方式，其相關性資料可以相 ^令來。即’較早的輯被位移如此使得可」 8 200806039 單一一組命令來區塊較晚區塊，該等命令指示每個處理器從特定位置取回其相關性資料。藉由允許每個平行處理器以相同的命令集合來處理其區塊，本發明之方法消除傳送分離命令給每個處理器之需要，代之允許一單一全域命令 5組被傳送。此得到較快速且較有效率之處理。第1圖觀念性地說明一般被觀看時和/或儲存於記憶體中時，一影像之示範性框架，為其之矩陣型式。在此範例中，一 1080i HD影像矩陣10被劃分成68行之12〇個巨集區塊，每一個標號12。一般來說，諸如此1〇8〇丨框架之影像係 1〇以個別的巨集區塊12來加以處理的。即，一或多個巨集區塊12由一平行處理陣列之每個計算元件（或處理器）來處理。然而，在本發明通常於巨集區塊12之處理之環境中討論之同時，應體會到本發明包括影像及其他資料至任何部份之劃分，其通常稱為區塊，其可被平行地加以處理者。 15 如上述，諸如1第1圖之1080i HD框架之影像之巨集區塊包括相關性資料’如第2A-2B圖中進一步說明的。根據諸如但不受限於h.264先進視訊編碼標準及vc-i MPEG-4標準之標準，一影像之區塊R之處理需要來自區塊a，d，b*ct 相關性資料(例如内插等所需之資料）。即，根據這些標準， 20 一影像之每個區塊之處理需要來自恰在左邊之區塊之相關性資料，以及來自對角線地恰左上之區塊，恰在上方之區塊以及對角線地恰在右上之區塊之相關性資料。因此區塊a 亦依賴來自區塊d和b之資訊而定，區塊b依賴來自區塊4之資訊等，同時區塊d不依賴來自任何其他區塊之資訊。因此 200806039 可看到這些區塊之平行處理需要對角線上之處理，區塊d先處理’之後為區塊a和b，因為它們依賴來自區塊仅資訊，然後區塊R*c，因為它們依賴來自區塊a，d和b之資訊等。然後參考第3A-3C圖，因此可看出對最佳化平行處理來 5說’區塊可映射至處理器，且依序加以處理，較早的區塊比較晚的區塊先處理。第3A圖說明當影像顯示給一觀看者時，-示範性影像之巨集區塊結構。如上述，第3A圖之區塊以保持其相關性資料給較晚區塊之順序來加以處理。第侧說明必須處理之對角線，以它們必須被處理以保留其 10相關性資料以供較晚區塊用之順序。每列說明一分離之對角線’每個對角線僅需來自其上方之列之相關性資料。例如，區塊()〇先被處理’因其位於影像之最左上角，且如此沒有相關性資料。區塊0〇接下來被處理，且如此一列中’因其僅需要來自區塊0。之相關性資料。接下來處 15理區塊1命1〇,且因此出現在隨後之列中，因為區塊n需要來自區塊〇0和〇〇之相關性資料，且區塊1〇需要來自區塊〇〇之相關性資料。因此可看出第3A圖中之區塊之每個對角線，其由虛線強調者，可映射至—平行處理陣列之列中，如第3B圖中所示的。. 20 炒第则中所示的，映射區塊至計算元件之列中保留在每，列上之所有需要的相關性資料，困難仍然存在。更具體地説，對每-個區塊之相關性資料仍通常位於相關於該區塊之不同位置中。例如，從第3A圖中，可看到區塊 41具有位於隨後之區塊中之相·資料，明時鐘之方 10 200806039 向：3^1(),2()和30。當如第3B圖中所示般映射至處理器中時，這些處理器如箭號所示般定位，處理器hjojo和3〇配置於區塊4!上之一“L”形狀中。對照來說，對區塊9S之相關性資料位於區塊83,82,72和62中，其如箭號所示般配置。這說明，Domain: Parallel processing of processing areas, parallel processing of sub-blocks, and parallel processing similar to (4) 绎. 〃 10 areas after parallel processing π ‘%%, not poorly ... 々々, Chuan ding Ding Shuang Shuangli multimedia 15 20 more efficient method of data. It is known that in many different image formats, images are divided into blocks. When the images are viewed in a matrix format, there is a "late" 'block or _ like falling under other blocks of the image towel. The blocks on the side, depending on the information from the earlier block, that is, the image above and to the left of the later block. Older blocks must be processed later, as later parties need information from earlier blocks. Therefore, the series (or parts thereof) are transported to many different parallel processors in the order of their relevance. Older blocks are shipped to parallel processors earlier, and later blocks are sent later. The block is stored in a specific position in the flat (4) and is required to be: when each block has a specific position in the _ specific group = = its associated poor material. In this way, the relevant information can be ordered. That is, the 'earlier series is shifted so that it can be made" 8 200806039 A single set of commands to block later blocks, the commands instructing each processor to retrieve its correlation data from a specific location. By allowing each parallel processor to process its blocks with the same set of commands, the method of the present invention eliminates the need to transfer separate commands to each processor, and instead allows a single global command 5 groups to be transmitted. This results in faster and more efficient processing. Figure 1 conceptually illustrates an exemplary frame of an image when viewed and/or stored in memory, in the form of a matrix. In this example, a 1080i HD image matrix 10 is divided into 12 macroblocks of 68 lines, each labeled 12. In general, image frames such as the 1〇8 frame are processed in individual macroblocks 12. That is, one or more macroblocks 12 are processed by each computing element (or processor) of a parallel processing array. However, while the present invention is generally discussed in the context of the processing of macroblocks 12, it should be appreciated that the present invention encompasses the division of images and other materials into any portion, which is commonly referred to as a block, which can be parallelized. Deal with it. 15 As described above, the macroblock of the image of the 1080i HD frame such as Fig. 1 includes correlation data as further explained in Figs. 2A-2B. According to standards such as but not limited to the h.264 advanced video coding standard and the vc-i MPEG-4 standard, the processing of the block R of an image requires the correlation data from the block a, d, b*ct (for example, Insert the required information). That is, according to these criteria, the processing of each block of the 20 images requires correlation data from the block just to the left, and the block from the diagonally upper left, just above the block and diagonally. Correlation data of the block just above the upper right. Therefore, block a also depends on information from blocks d and b, block b depends on information from block 4, etc., while block d does not depend on information from any other block. Therefore 200806039 can see that the parallel processing of these blocks requires processing on the diagonal, block d is processed first 'after blocks a and b, because they rely on only information from the block, then block R*c because they Rely on information from blocks a, d and b, etc. Referring then to Figures 3A-3C, it can be seen that the optimized parallel processing is to say that the 'blocks' can be mapped to the processor and processed sequentially, with the earlier blocks being processed earlier. Figure 3A illustrates the macroblock structure of the exemplary image when the image is displayed to a viewer. As mentioned above, the blocks of Figure 3A are processed in the order in which their correlation data is maintained for the later blocks. The first side describes the diagonals that must be processed so that they must be processed to retain their 10 correlation data for the order in which the later blocks are used. Each column illustrates a separate diagonal. Each diagonal requires only correlation data from the top of it. For example, a block () is processed first because it is located at the top left corner of the image, and there is no correlation data. Block 0〇 is then processed, and in this column 'because it only needs to come from block 0. Relevant information. The next block is 1 block 1 and therefore appears in the following column, because block n needs correlation data from block 〇0 and 〇〇, and block 1〇 needs to come from block 〇〇 Relevant information. It can thus be seen that each diagonal of the block in Figure 3A, which is highlighted by the dashed line, can be mapped into a column of parallel processing arrays, as shown in Figure 3B. In the 20th section of the speculation, the difficulty of remaining all the relevant correlation data in each column, from the mapping block to the computing component, still exists. More specifically, the correlation data for each block is still typically located in a different location associated with the block. For example, from Fig. 3A, it can be seen that the block 41 has phase data located in the subsequent block, and the clock side is 10 200806039 to: 3^1(), 2() and 30. When mapped to the processor as shown in Fig. 3B, the processors are positioned as indicated by the arrows, and the processors hjojo and 3〇 are arranged in one of the "L" shapes on the block 4!. In contrast, the correlation information for block 9S is located in blocks 83, 82, 72 and 62, which are configured as indicated by the arrows. this means,

5為使每個區塊在一處理陣列内所示之位置上處理，每個計算元件將需要其自己的命令，指示其取回相關性資料。換句話說，因為對每個區塊之相關性資料不同地對每個區塊配置(如區塊七和％所示的）’所以必須將分離的資料取回命令推入每個處理器中，減缓可處理影像之速度。在本發明之實施例中，藉由在處理該區塊之前位移對每個區塊之相關性資料來克服此問題。熟悉技藝之人士將體會到可以任何方式來位移相關性資料。然而，在第3(：圖中說明位移相關性資料之-方便方法’其中包含相關性資料之區塊被位移至上述之“L”形狀中。即，當處理區塊χ時， 15其需要來自區塊入①之相關性資料。在影像内，這些區塊分別直接位於X上，至恰左上方，恰在左邊，及恰右上。在平行處轉肋，則可分雜移這些區魅χ上之二個處理器位置，三個上方之處理器位置，—上方之處理器位置，以及恰在右上方之處理器位置。例如，在第犯圖中，為處理 2〇區塊93，每一個可位移包含區塊之列至右邊-個位置，將區塊H72和62至特性之“L，，形狀中。 “ τ，，错由在處理區塊χ之前位移所有這些相·資料至此立二狀^，可使關_命令集合錢縣悔塊X。此思“令集合僅於-單-載人操作中被載至平行處理器， 11 200806039 每個處理器載入分離命令集合。當處理影像、孰=間節省，特別對大的處理陣列。一^ 4之人士將體會到上述方法僅為本發明之一實二更具體地說’將體會到在資料可位移狀5 In order for each block to be processed at the location shown in a processing array, each computing component will need its own command to instruct it to retrieve the correlation data. In other words, because the correlation data for each block is configured differently for each block (as shown in blocks VII and %), the separate data retrieval commands must be pushed into each processor. , slow down the speed at which images can be processed. In an embodiment of the invention, this problem is overcome by shifting the correlation data for each block before processing the block. Those skilled in the art will appreciate that the relevant information can be shifted in any way. However, in the third (the figure illustrates the displacement correlation data - the convenient method), the block containing the correlation data is shifted into the above-mentioned "L" shape. That is, when the block is processed, 15 it is required Correlation data from the block into the 1. In the image, these blocks are located directly on the X, just to the upper left, just to the left, and just to the right. In the parallel, the ribs can be miscellaneous The top two processor positions, three upper processor positions, the upper processor position, and the processor position just above the top right. For example, in the first map, to process 2 blocks 93, Each of the displaceable blocks contains the block to the right-position, and blocks H72 and 62 to the characteristic "L,, in the shape. "τ,, the error is shifted by processing all of these phases before processing the block 至Standing two times ^, you can make the _ command set money county regret block X. This thought "let the set is only loaded into the parallel processor in the - single-manage operation, 11 200806039 each processor loads the separate command set. When dealing with images, 孰 = saving, especially for large processing arrays A person skilled in the art will appreciate that the above method is only one of the inventions. More specifically, it will be appreciated that the data can be displaced.

10 ==本發明並非受限於資料區塊至此組態之位移 ==發明包含相關性資料至任何組態，或特性位置之㈣2可—般用來供每個要被處理之區《用者。具體地同的影像格式可具有位於非第2A圖中所示的那些之區塊中之相關性資料，形成除了“L”形狀之外之其他特性位置或形狀，其更方便使用者。、10 == The invention is not limited by the displacement of the data block to this configuration == The invention contains the correlation data to any configuration, or the characteristic position (4) 2 can be used for each zone to be processed. . Specifically, the same image format may have correlation data located in those blocks not shown in Fig. 2A, forming other characteristic positions or shapes other than the "L" shape, which is more convenient for the user. ,

也悉技藝之人士亦將體會到在本發明至目前為止已於具有多個巨集區塊之叫_框架之環境中說明之同時，本發明包含可分成任何劃分之影像格式。即，本發明之方法可雜何框架之任何劃分來使用。第4請朗此點，其顯 15不許少型式之框架之對角線可如何被映射至不同數目之處理器列。在第4A圖中，如所示可將,框架之對角線映射至連續列之處理器，建立—梯形（或者為—長菱形，或可能甚至為二者之組合)佈局，其中使用257列之處理器，在一單歹J中使用表大61個處理器。較小的框架使用較少的 20列，和較少的處理器。例如，在第4B圖中，一CIF框架使用 59列處理器，在任何列中使用最多19列。類似地，在第圖中，當映射至一平行處理陣列中時，625 SD框架會佔117 列，每列最大36個處理器。類似地，在第4D圖中，當映射至相同的陣列中時，—SIF框架會佔.51列，且每列最大16個 12 200806039 處理斋。在第4E圖中，一525 SD框架會佔107列，且每列最個處理器。如可從這些範例中看出的，本發明可用來映射任何影像至一平行處理陣列中，其中可如上述在列内位移資料，允許以一單一命令或命令集合來處理區塊。It will also be appreciated by those skilled in the art that while the present invention has been described in the context of a frame having multiple macroblocks so far, the present invention encompasses image formats that can be divided into any partition. That is, the method of the present invention can be used in any division of any framework. For the fourth time, please let me know how the diagonal of the less-restricted frame can be mapped to a different number of logical columns. In Figure 4A, as shown, the diagonal of the frame can be mapped to a continuous column of processors, creating a trapezoidal (or - rhomboid, or possibly even a combination of the two) layout, using 257 columns The processor uses 61 large processors in a single J. Smaller frames use fewer 20 columns, and fewer processors. For example, in Figure 4B, a CIF framework uses 59 columns of processors, using up to 19 columns in any column. Similarly, in the figure, when mapped to a parallel processing array, the 625 SD frame will occupy 117 columns with a maximum of 36 processors per column. Similarly, in Figure 4D, when mapped to the same array, the -SIF frame will occupy .51 columns, and each column will have a maximum of 16 12 200806039 processed. In Figure 4E, a 525 SD frame will occupy 107 columns with the most processor per column. As can be seen from these examples, the present invention can be used to map any image into a parallel processing array in which the data can be shifted within the column as described above, allowing the block to be processed in a single command or set of commands.

亦應體會到本發明並不受限於嚴格的區塊和一平行處理陣列之計算元件間之1對1對應。即，本發明包含其中區塊之部份映射至計算元件之一部份之實施例，藉此增加處理廷些區塊之效率和速度。第5人_56圖說明一個這樣的實施 J /、中〜像之區塊被劃分為二。然後如上來處理這些 0劃分之每一個，除了每個劃分被映射至一處理器之一半及田共爽理之外。參考第5A圖部，如所示。即 15 20 ，左上方之區塊被劃分成二個子區塊，〇和 2/類似地’其之旁邊之區塊被劃分成子區塊1和3等。注意每個子區塊對相關性目的來說動作與一完整區塊相同： P子區鬼1僅而要來自區塊〇之相關性資料，最左邊的子區塊2需要來自區塊_之相關性資料等。參考第沾圖，块處理器之—半中，如所示，子區塊〇和、，弟列中，子區塊2和子區塊3映射至第二列等。然上述之相同方式來使用本發明之程序，子區塊沿者處理盗之列如需要般位移。以此方式，可看出比在先前實施例中在佔用更多處理H ’其允許使較多平行處理陣列，且^ 得到較快速之影像處理。具體地說，參考第音所使用之處理器之數目對每隔-列增加―：前二列每_ = 13 200806039 用一處理器，接下來二列每列使用二個處理器等。對照來說，第5B圖說明其實施例對每列增加所使用之處理器數目為1 :第一列使用一個處理器，第二列二個，以此類推。第 5A-5B圖之實施例如此一次使用較多的處理器，結果得到較 5 快的處理。第6A-6B圖說明其他這樣的實施例，其中一影像之區塊被劃分成四個子劃分。例如，一影像之左上區塊被劃分成子區塊0,2,4和6。這些子區塊然後被映射至一處理器之部份中，以其相關性資料所要求之順予。即，每個處理器可被 10 劃分成四個“子列”，每一個能夠處理一列子區塊。然後可將許多不同的子區塊映射至處理器之子列中，如所示。例如，0，1，2和3個子區塊可全部被映射至第一列中之二個處理器中（第一處理器處理子區塊0，1，一2子區塊和一3子區塊，而第二處理器處理另外的2和3子區塊），且據此加以 15 處理。注意本實施例在第一列中使用二個處理器而非一個，且處理之數目每列增加二，如此允許每列使用更多的處理器。本發明亦包含區塊和處理器劃分成16個子劃分。另外，本發明包括“肩並肩”處理多個區塊，即，每列處理多 20 個區塊。第7A-7C圖說明二個這些觀念。第7A圖說明一區塊劃分成16個子區塊()0-8()，如所示的。熟悉技藝之人士將體會到可分離地處理分離區塊，只要它們配置為使得他們的相關性資料可被正確地判斷。第7B圖說明不相關的區塊，即不需要來自彼此之相關性資料之區塊可平行地加以 14 200806039 處理之事實。每一個區塊如第7A圖中般劃分，為簡化起見，所示的子區塊沒有下標。在此，例如，第一區塊劃分為16 個子區塊，標記為0至9，如上述同時地處理類似的數字。只要在每一列中之區塊不需要來自彼此之相關性資料，它 5 們可被一起處理，在一相同的列中。因此，一群處理器可同時處理多個不相關之區塊。例如，第7B圖中之四個區塊之上方列（子區塊分別標記為0-9,10-19,20-29和30-39)可在一單一集合之處理器中加以處理。第7C圖，處理器之圖（沿著左手邊標號)及對應之載入 10 至其中之子區塊說明此點。在此，子區塊0-9可被載入至處理器之劃分0-9中（其中處理器沿著左手邊標記)以形成鑽石似的樣式，如所示。然後可將進一步之區塊載入至重疊之處理器集合中，子區塊10-19載入至處理器4-13等。以此方式，進一步之區塊之劃分，以及多個區塊至重疊之處理器 15 集合中之“鏈接”，允許更快速地伋畢更多處理器，得到更快速之處理。第7A-7C圖說明四乘四之處理。應了解到此相同之技術亦可於一八乘八之處理中實現。除了在不同的處理器中處理不同的區塊之外，亦應注 20 意到在相同區塊中之不同型式之資料可於不同的處理器中處理。具體地說，本發明包含分離處理來自相同區塊之強度資訊，亮度資訊和色訊資訊。即，來自一區塊之強度資訊可與來自該區塊之亮度資訊分離地處理，其可與該自該區塊之色訊資訊分離地處理。熟悉技藝之人士將觀察到亮 15 200806039 度和色訊貝訊可映射至處理器中並如上處理（即如所需般位移等）且亦可加以劃分，劃分映射至不同的處理器，以增 =處理之效率。第8A_8CS1說明此。在第8A圖中，亮度資料之一區塊可映射至一處理器，對應之“半區塊，，之色訊資 5 相同之處理器或不同者中。具體地說，注意強度， w度和色訊肓料可被映射至相鄰之處㈣集合中，或許在至）部份重疊之列之集合中，類似於第7]8圖。亦可將亮度和色訊貢訊劃分成子區塊，以供在個別計算元件之劃分中處理之用，如連結第5Α_5Β，和6Α_6Β圖連結描述的。具體 10地說，第8B-8C圖分別說明一框架之亮度和色訊資料至二和四個子區塊之劃分。然後第8Β圖之二個子區塊可於不同半部之處理器中處理，如連結第5A_5B圖描述的。類似地，可在不同四分之一部份之處理器中處理第8C圖之四個子區塊，類似在第6A-6B圖中所描述的。 15 在上述貫施例之一些包括以相同列或多個相同列之處理器來肩並肩處理不同之區塊之同時，亦應注意到本發明包括沿著相同行之處理器來處理不同的區塊，亦增加處理之效率和速度。第9A-9C圖，其觀念性地說明由許多不同區塊所佔據之處理器者，描述後者觀念之實施例。在此，處 2〇理器之列沿著垂直軸延伸，同時行沿著水平軸延伸。如此可看到當映射至一處理陣列之列中時，一典型區塊會佔據由區域100_ 104所4田述大致為梯形之形狀中之處理器。具體地說，注意區域104不佔據許多處理器，如此少處理陣列之總使用。此可至少部伤藉由處理恰在佔據區域100—104下之 16 200806039 資料之區塊來補救。此區塊可佔據區域1〇6_ιΐ2，其允許使用更多處理器，特別在隨後之區塊間之“轉換，，區域1〇4屬中以此方式’可更快速地完成處理，且陣列使用比使用者欲僅在完成區域刚·刚中之區塊之處理後處理區域 5 1〇6-112之區塊來得多。第9B-9C圖說明此觀念之進一步延伸。具體地說，注意 ^射，塊之此#直‘‘鏈接”可在二或更多區塊上持續，結果传到局得多的陣列使用。具體地說，區塊可一個接一個映射至相鄰之行中，區域116-120由一區塊佔據，區域122-126 10 由其他區塊佔據等。應注意到長菱形之形狀可取代或連結梯形形狀來加以使用。再者，任何不同行之映射之組合可由不同大小或組合之長曼形和/或梯形實現以協助同時地處理多個串流。、熟悉技藝之人士亦將觀察到上述之本發明之程序和方法可由許夕不同的平行處理器來加以執行。本發明考慮使何平行處理為，其具有多個能夠每一個處理影像資料之一區塊並將這些資料位移以保留相關性之計算元件。在許多這樣的平行處理器被考慮之同時，一個適當的範例描述於美國專利申請案第11/584,480號中，其標題為“積體處 2〇理斋陣歹丨]，指令定序器和I/O控制器”，其係於2006年十月 9曰所&出’其之揭示内容在此合併為參考文獻。 ill塊平扞虛第l〇A-l〇C圖說明相關於子區塊平行處理之發明。根據上述之視訊標準，每個巨集區塊12為16列乘以16行（i6x16) 17 200806039 資料位元（即像素)之矩陣，其分成4或更多子區塊2〇。具體地說’母個矩陣分成至少四個相等之四分之一部份子區塊 20，其大小為8x8。每個四分之一部份子區塊2〇可進一步分成具有8x4 ’ 4x8和4x4之大小之子區塊20。如此，任何已予 5區塊12可分成子區塊20,其具有8x8，4x8，8x4和4x4之大小。第10A圖說明一區塊12,其具有一8x8子區塊2〇a，二個 4x8子區塊2〇b，二個8x4子區塊20c，以及四個4x4子區塊 20d。每個大小的子區塊2〇之數目若有的話可改變，以及其在區塊12内之位置。再者，許多不同的大小之子區塊2〇之 10 數目和位置可每個區塊12每個區塊12不同。如此，為以平行方式處理具有子區塊之區塊12，必須先判斷子區塊之位置和大小。這是對每個區塊12做之耗時判斷，其增加明顯處理成本至區塊12之平行處理。其需要處理器分析區塊12兩次，一次判斷子區塊2〇之數目和位 15置，且然後再次以正確之順序來處理子區塊(記住一些子區塊20可能需要來自其他子區塊之相關性資料以供處理用，如上述，其就是為何必須先判斷許多不同的子區塊之位置和大小）。為減經此問題，本發明需要包括一特殊之型式資料之 20區塊，其識別在區塊12中之所有子區塊20之型式（即位置和大小），如此避免需要處理器做此判斷。第1〇B圖圖說明區塊12，並顯示十六個資料位置22，其可能形成對任何已予子區塊20之第一資料位置（第一意指子區塊2〇之最左上之項目）。對每個區塊12來說，這十六個位置22將包含旗標出 18 200806039 此資料位置是否構成一新的子區塊2〇之第一項目所需之資料。若旗標出該位置，則此位置被認為是一資料區塊20之起始點，且在其恰左邊之位置(若有的話)被認為是恰在左邊之子區塊20之最後一行，且恰在上方之位置（若有的話)被認 5為疋恰在上方之子區塊之最後一列。若其未被旗標的話，則此項目意義為一相同子區塊2〇之持續。如此，可看出這十六個旗標資料位置22包含判斷子區塊2〇之位置，和大小所必要之資料。第10C圖說明根據本發明之型式資料區塊，其中一型式 10資料之區塊24,其具有一 KM之大小者，係與每個區塊12 相關的。區塊24之四個列對應於在區塊12中之四個列，其包含旗標資料位置22。如此，藉由僅分析在每個型式資料區塊24之每一列中之第〗，第5,第9和第13個資料位置，子區塊20之位置和大小可加以判定。為此目的不再需要進一 b步之區塊12之分析。再者，在區塊2〇中之剩餘之資料位置可用來健存其他資料，諸如子區塊型式(1本地預測，p_以動作向量預測，以及B_雙向預測），區塊向量等。如此，如在第10C圖中所看到的，僅那些構成—新子區塊之開始之資料位置22被旗標出來，且在每個區_之列中之第ι，第$、， 20第9和第13個資料位置匹配該旗標。相似性演繹法土行處理其他平行處理最佳化之來源牽涉到同時處理相撼之演澤法(例如類似的計算)。電腦處理牵涉I個= 計异:數值計算和資_動。這些計算係藉由處理不是計算 19 200806039 數，計算就是移動（或複製)所要的資料至-新位置之演绎法來只現的。這樣的演繹法傳統上係使用一系列”任，，教$ 來處理’其巾若符合—特定鮮的話，則做-計算，同時若不是的話，則不是不做計算就是做_不_計=。藉由在夕個IF敘述中導銳，在每個資料中執行所要的總計算。然而’有對此度量方法之缺點。首先，其是耗時的且對平地里不可行。第二，其是浪費的，因為對每個圧敘述，會做-計算’不是轉到下一個計算就是做其他的計算。因 10 15 20 、匕對母個肩繹法透過ip敘述所做的路徑來說，多達一半的處理器功能(及可貴的晶圓空間）變成未使用的。第三，其需要發展-唯-的碼以對每個唯一之資料集合實現演绎法之每個排列。 … 解決方案為一演繹法之實現，其包含對許多分離計瞀或資料移動之所有計算，其中所有資料可能㈣演繹法; 之^個步驟_，料行地處闕有不同㈣料。然後使用選擇碼來判斷演繹法的哪些部份要應用至哪些資料。如此’相同的碼（演繹法）一般應用至所有資料，且只有只有選擇碼需對每筆資料加以修改關斷如何做每個計^在t 的優點為若多筆資料正在處理，其中許多處理步驟是相同 =則以共同的二計算及非共同之那些來應用—演绎法碼來簡化為。為應用此技術至_的轉法，可藉由看指令本身來發現相似性，或藉由以—較細單位之表示來表示指令且然後尋找相似性。第1A和11B圖說明上述觀念之一範例。此範例牵涉 20 200806039 ' 5 到用來在像素間產生中間值之雙線性濾波器，其中做特定數字計算(雖然此技術可對任何資料演繹法來使用）。演繹法需計算許多不同的值，使用相同基本集合之數字加法和資料位移步驟，但這些步驟之順序和標號視所做的計算為而定而不同。如此，在第11A圖中，1/2和3/4雙立方等式之第一計算為數字53，其需要做7個計算步驟。第二計算為數字18，其需要6個計算步驟，其之四個當它們在先前計算中發生時之相同四個步驟共同且小序相同。第一等式之 • 最後二個計算再次具有與頭兩個計算重疊之計算步驟。對 10 1/2雙立方等式，以及三個第11B圖之雙線性等式之額外計算全牽涉到相同計算步驟之許多不同的組合，且全具有四個計算要做。對每個等式來說，全部四個計算可使用一平行處理器 30加以執行，其具有四個處理元件32，每一個具有其自己 - 15 的記憶體34,如第12圖中所示的，連結一與每個演繹法之步驟相關之選擇碼。有一與每個步驟相關之選擇碼，其命令四個變數中的哪些受該步驟作用。例如，有九個說明於第11A和11B圖之計算中之演繹法步驟。對第11A圖之第一等式來說，第一步驟僅應用至第三和第四變數，其由與 20 該步驟相關之“0011”之選擇碼命令（其中若對該步驟和變數之碼為“1”的話，步驟應用至一特定變數，且若其為”0”的話不應用）。如此，一“0011”之選擇碼命令該步驟將僅應用至第三和第四變數，而非第一和第二變數。第二步驟僅應用至第二變數，如由選擇碼“0100”所命令的。相同的度量 21 200806039 方法對所有的步驟和所有等式之變數應用，其係使用所示之選擇碼。八使用選擇碼優點為取代產生二十個演绛碼來做說明於第11A和11B圖中之二十個不同的計算（或至少八個不同的 5演繹碼來做八個不同的數值計算），以及將那些演繹法之每 -：載入至四個處理元件之每一個中，僅需產生和載入一 =單㊉繹碼(載人至多個處理元件以供散佈之記憶體組悲用，或載入至-單一記憶體位置，其在所有處理元件間是共用的）。僅選擇碼需被產生並載入至不同的處理元件中二實現所要的計算，其是簡單得多的。因為演繹碼僅應用一次’選祕地且平行於财魏地，_增加了平行處理速度和效率。第11A和11B圖說明對一資料計算應用之選擇碼之使用’用以選擇性地命令哪些演繹法步驟應用至資料之選擇 15碼相同地可供用來移動資料之演繹法之用。前述描述，為說明之目的起見，使用特定的用語來提供本發明之徹底了解。然而，對於熟悉技藝之人士來說，為實施本發明不需要特定細節是明顯的。如此，本發明之特定實施例之前述描述係呈現來供說明和描述之目的。它 2〇們並非預定為辦盡的或限制本發明至所揭示之精確型式。許多修改和變化在觀看上述指導下是可能的。例如，本發明可用來處理任何影像格式之任何劃分。即，本發明可平行也處理任何格式之影像，無論它們是否為⑽⑴肋影像’ CIFf;像’ SIF影像或任何其他。這些影像亦可劃分成 22 200806039 任何劃分，無論它們為一影像之巨集區塊或任何其他。又，任何影像資料可如此處理，無論其為強度資訊，亮度資訊，色訊資訊或任何其他。實施例被選擇和描述為最佳地說明本發明及其實際應用之原理，以藉此使得熟悉技藝之人士 5袁仏地利用本發明，且連同許多不同之修改之許多不角因實施例適於所考慮之特定用途。本發明可實施於方法和用以實施那些方法之裝置之型式中。本發明亦可實施為程式碥之型式，其實施於有形媒體中者，諸如軟碟，CD-ROM，硬碟機，韌體或任何其他 1〇機器可讀取之儲存媒體中，其中當程式碼被载入至諸如一電知之機為中且由之執行時，機器變成一用以實施本發明之裝置。本發明亦可實施為程式碼之型式，例如無論儲存於一儲存媒體中，載入至和/或由一機器執行，或於某傳輸媒體上傳送，諸如在電線或電纜上，透過光纖或透過電 15磁輻射，其中當程式碼被載入至一諸如一電腦之機器中且由之執行時，機器變成一用以實施本發明之機器。當於一一般用途之電腦上實現時，程式碼區段結合處理器以提供類似地於特定邏輯電路操作之一唯一裝置。 L圖式簡單說明j 2〇第1圖觀念地說明一 1〇8〇i高畫質（HD)框架之巨集區塊。第2A-2B圖說明在一影像框架内之諸如巨集區塊之區塊之配置。第3A-3C圖說明將巨集區塊從其在一影像内之配置映 23 200806039 射至個別平行處理器。第4A-4E圖說明對許多不同的影像格式，將影像映射至個別之平行處理器。第5A-5B圖說明用以將影像之子劃分映射至個別平行 5 處理器之16x8映射。第6 A - 6 B圖說明用以將影像之子劃分映射至個別平行處理器之16x4映射。第7A-7C圖說明根據本發明之一實施例將影像區塊映射至平行處理器之其他方法。 10 第8A-8C圖說明一影像格式之資料結構之進一步之細節，包括亮度和色度資訊。第9 A - 9 C圖說明根據本發明之一實施例，用來映射多個影像區塊至平行處理器之許多不同的其他方法。第10A-10C圖說明根據本發明之一實施例之資料區塊 15 資料位置，子區塊位置，子區塊旗標資料位置，和區塊型式資料。第11A-11B圖說明演繹法處理步驟及用以識別哪些處理步驟應用至哪些資料變數之選擇碼。第12圖說明一平行處理器。 20 類似的參考數字指稱圖式中之對應部份。【主要元件符號說明】 10…影像矩陣 12…巨集區塊 100-112···區域 24It should also be appreciated that the present invention is not limited to a one-to-one correspondence between a computational component of a strict block and a parallel processing array. That is, the present invention includes embodiments in which portions of a block are mapped to a portion of a computing element, thereby increasing the efficiency and speed of processing the blocks. The fifth person _56 shows that one such implementation J /, medium ~ image block is divided into two. Each of these 0 partitions is then processed as above, except that each partition is mapped to one of the processors and the other. Refer to section 5A as shown. That is, 15 20 , the block at the upper left is divided into two sub-blocks, and the blocks next to the 〇 and 2/ are similarly divided into sub-blocks 1 and 3 and the like. Note that each sub-block has the same action as a complete block for the purpose of correlation: P sub-region ghost 1 only needs to be related to the block 〇, and the leftmost sub-block 2 needs to be related to block _ Sexual information, etc. Referring to the digest map, the half of the block processor, as shown, sub-blocks 〇 and , in the sub-column, sub-block 2 and sub-block 3 are mapped to the second column and the like. However, in the same manner as described above, the program of the present invention is used, and the sub-blocks are processed as if they were required to be displaced as needed. In this way, it can be seen that more processing H ‘ is allowed than in the previous embodiment, which allows more parallel processing of the array, and a faster image processing is obtained. Specifically, the number of processors used with reference to the first tone is increased by every - column - the first two columns per _ = 13 200806039 with one processor, the next two columns using two processors per column, and so on. In contrast, Figure 5B illustrates that the number of processors used in its embodiment for each column increase is 1: the first column uses one processor, the second column two, and so on. The implementation of Figures 5A-5B, for example, uses more processors this time, resulting in a faster processing than 5. Figures 6A-6B illustrate other such embodiments in which a block of an image is divided into four sub-divisions. For example, the upper left block of an image is divided into sub-blocks 0, 2, 4, and 6. These sub-blocks are then mapped into a portion of a processor, as required by their correlation data. That is, each processor can be divided into four "sub-columns" by 10, each capable of processing a column of sub-blocks. Many different sub-blocks can then be mapped into sub-columns of the processor as shown. For example, 0, 1, 2, and 3 sub-blocks may all be mapped to two processors in the first column (the first processor processes the sub-block 0, 1, the 2 sub-block, and the 3 sub-region The block, while the second processor processes the other 2 and 3 sub-blocks, and processes 15 accordingly. Note that this embodiment uses two processors instead of one in the first column, and the number of processing is increased by two per column, thus allowing more processors to be used per column. The invention also includes partitioning and processor partitioning into 16 sub-partitions. In addition, the present invention includes processing a plurality of blocks "side by side", i.e., processing 20 more blocks per column. Figures 7A-7C illustrate two of these concepts. Figure 7A illustrates a block divided into 16 sub-blocks () 0-8() as shown. Those skilled in the art will appreciate that detached blocks can be processed separately as long as they are configured such that their correlation data can be correctly determined. Figure 7B illustrates an unrelated block, i.e., the fact that blocks that do not require correlation data from each other can be processed in parallel. Each block is divided as shown in Figure 7A. For simplicity, the sub-blocks shown have no subscripts. Here, for example, the first block is divided into 16 sub-blocks, labeled 0 to 9, and similar numbers are processed simultaneously as described above. As long as the blocks in each column do not require correlation data from each other, they can be processed together in the same column. Therefore, a group of processors can process multiple unrelated blocks simultaneously. For example, the upper columns of the four blocks in Figure 7B (the sub-blocks are labeled 0-9, 10-19, 20-29, and 30-39, respectively) can be processed in a single set of processors. Figure 7C shows the diagram of the processor (labeled along the left hand side) and the corresponding sub-blocks loaded into it. Here, sub-blocks 0-9 can be loaded into partitions 0-9 of the processor (where the processor is labeled along the left hand side) to form a diamond-like pattern, as shown. Further blocks can then be loaded into the overlapping processor sets, and sub-blocks 10-19 loaded into processors 4-13 and the like. In this way, further partitioning, and the “links” in the collection of multiple blocks to overlapping processors 15 allow for faster processing of more processors for faster processing. Figure 7A-7C illustrates the processing of four times four. It should be understood that this same technology can also be implemented in the 18-by-8 process. In addition to processing different blocks in different processors, it should be noted that different types of data in the same block can be processed in different processors. Specifically, the present invention includes separating processing of intensity information, brightness information, and color information from the same block. That is, the intensity information from a block can be processed separately from the brightness information from the block, which can be processed separately from the color information from the block. Those familiar with the art will observe that the bright 15 200806039 degrees and the color information can be mapped into the processor and processed as above (ie, as required, etc.) and can also be divided, mapped to different processors, to increase = efficiency of processing. This is explained in 8A_8CS1. In Figure 8A, one block of luminance data can be mapped to a processor, corresponding to a "half block," a processor of the same color or a different one. Specifically, attention to intensity, w degrees And the color information can be mapped to the adjacent (4) set, perhaps in the set of partial overlaps, similar to the 7th 8th. The brightness and color information can also be divided into sub-areas. Blocks, for use in the processing of individual computing elements, as described in Sections 5Α5Β, and 6Α_6Β图. Specific 10, Figure 8B-8C illustrates the brightness and color information of a frame to the second and The division of the four sub-blocks. Then the two sub-blocks of Figure 8 can be processed in different half processors, as described in connection with Figure 5A_5B. Similarly, processors in different quarters can be used. The four sub-blocks of Figure 8C are processed, similar to those described in Figures 6A-6B. 15 Some of the above embodiments include processors in the same column or columns of the same column to handle different regions side by side. At the same time as the block, it should also be noted that the invention includes along the same line The processor handles different blocks and increases the efficiency and speed of processing. Figures 9A-9C, which conceptually illustrate the processor occupied by many different blocks, describe embodiments of the latter concept. The row of 2 processors extends along the vertical axis while the rows extend along the horizontal axis. It can be seen that when mapped to a column of a processing array, a typical block will occupy the area of the region 100_104. It is a processor in the shape of a trapezoid. Specifically, it is noted that the area 104 does not occupy a large number of processors, so the total use of the array is so little that this can be at least partially processed by processing 16 under the area 100-104. Block to remedy. This block can occupy the area 1〇6_ιΐ2, which allows more processors to be used, especially in the "conversion of the subsequent blocks, in the region 1〇4 genus in this way" can be faster The processing is completed, and the array uses much more than the block that the user wants to process the area 5 1 - 6 - 12 only after the processing of the block in the area just finished. Figure 9B-9C illustrates a further extension of this concept. Specifically, note that the #直''link" of the block can last on two or more blocks, and the result is passed to a much larger array. Specifically, the blocks can be mapped one after another to In adjacent rows, regions 116-120 are occupied by one block, regions 122-126 10 are occupied by other blocks, etc. It should be noted that the shape of the rhomboid can be used in place of or in conjunction with a trapezoidal shape. Again, any difference The combination of row mappings can be implemented by long sized and/or trapezoidal shapes of different sizes or combinations to assist in the simultaneous processing of multiple streams. Those skilled in the art will also observe that the procedures and methods of the present invention described above can be varied from Xu Xi. The parallel processor is implemented. The present invention contemplates parallel processing as having a plurality of computational elements capable of processing each of the image data and displacing the data to preserve correlation. In many such parallel processing At the same time as a device is being considered, a suitable example is described in U.S. Patent Application Serial No. 11/584,480, entitled "Integration 2", Command Sequencer and I/O Controller" , the disclosure of which is incorporated herein by reference in its entirety, the disclosure of which is incorporated herein by reference. In the video standard described above, each macroblock 12 is a matrix of 16 columns by 16 rows (i6x16) 17 200806039 data bits (ie, pixels) divided into 4 or more sub-blocks 2 具体地说. The mother matrix is divided into at least four equal quarters of partial sub-blocks 20, which are 8x8 in size. Each quarter-part sub-block 2〇 can be further divided into children having a size of 8x4 '4x8 and 4x4. Block 20. Thus, any of the 5 blocks 12 can be divided into sub-blocks 20 having sizes of 8x8, 4x8, 8x4, and 4x4. Figure 10A illustrates a block 12 having an 8x8 sub-block 2 a, two 4x8 sub-blocks 2〇b, two 8x4 sub-blocks 20c, and four 4x4 sub-blocks 20d. The number of sub-blocks of each size can be changed if any, and The location within block 12. Again, the number and location of a plurality of sub-blocks of different sizes can be different for each block 12 per block 12. In order to process the block 12 having the sub-blocks in a parallel manner, it is necessary to first determine the position and size of the sub-blocks. This is a time-consuming judgment for each block 12, which increases the apparent processing cost to the parallel of the block 12. Processing. It requires the processor to analyze block 12 twice, once to determine the number of sub-blocks 2 and bit 15 and then process the sub-blocks again in the correct order (remember that some sub-blocks 20 may need to come from The correlation data of other sub-blocks is used for processing, as described above, which is why it is necessary to first determine the position and size of many different sub-blocks. To reduce this problem, the present invention needs to include a special type of data. Block 20, which identifies the type (i.e., location and size) of all of the sub-blocks 20 in block 12, thus avoiding the need for the processor to make this determination. Figure 1B illustrates block 12 and displays sixteen data locations 22, which may form a first data location for any given sub-blocks 20 (first meaning sub-block 2, the top leftmost project). For each block 12, the sixteen locations 22 will contain the information required to flag 18 200806039 if this data location constitutes a new sub-block 2 of the first item. If the location is flagged, the location is considered to be the starting point of a data block 20, and its just left position (if any) is considered to be the last row of the sub-block 20 just to the left. And just above the position (if any) is recognized as the last column of the sub-block just above. If it is not flagged, the meaning of this item is the same as that of the same sub-block. Thus, it can be seen that the sixteen flag data locations 22 contain information necessary to determine the location of the sub-blocks 2, and size. Figure 10C illustrates a type of data block in accordance with the present invention in which a block of data 10 having a size of one KM is associated with each block 12. The four columns of block 24 correspond to four columns in block 12, which contain flag data locations 22. Thus, the position and size of sub-block 20 can be determined by analyzing only the ninth, fifth, ninth, and thirteenth data locations in each column of each type of data block 24. The analysis of block 12 of step b is no longer required for this purpose. Furthermore, the remaining data locations in block 2 can be used to store other data, such as sub-block patterns (1 local prediction, p_ motion vector prediction, and B_ bidirectional prediction), block vectors, and the like. Thus, as seen in FIG. 10C, only those data locations 22 that constitute the beginning of the new sub-block are flagged, and the first, $, and 20 in each of the regions are listed. The 9th and 13th data locations match the flag. Similarity deduction method The other sources of parallel processing optimization involve the simultaneous processing of the relative method (such as similar calculations). Computer processing involves I = calculation: numerical calculation and capitalization. These calculations are only made by processing the calculations that are not the calculations of the number of 2008 20083939, the calculation is to move (or copy) the required data to the new position. This kind of deductive method traditionally uses a series of "arbitrary, teach $ to deal with 'when its towel meets - if it is specific, then do - calculate, and if not, then do not do calculations or do _ no _ count = By performing the sharpening in the IF narration, the required total calculation is performed in each data. However, 'there are disadvantages of this metric. First, it is time consuming and not feasible on the ground. Second, It's a waste, because for each narrative, it will do - calculate 'not to go to the next calculation or to do other calculations. Because 10 15 20, 匕 to the parent shoulder method through the ip narrative path As much as half of the processor functions (and valuable wafer space) become unused. Third, they need to develop a -only-code to implement each permutation of each derivation of the data set. The scheme is the implementation of a deductive method, which includes all calculations for many separate calculations or data movements, where all data may be (4) deductive methods; ^ step _, where the material is located differently (four), then use the selection code To determine which parts of the deductive method The information to be applied to. So the same code (deductive method) is generally applied to all data, and only the selection code needs to be modified for each data. How to do each measurement ^ The advantage of t is if more than one The data is being processed, and many of the processing steps are the same = then the common two calculations and the non-common ones are applied - deductive code to simplify. To apply this technique to _, the method can be found by looking at the instruction itself. Similarity, or by representing the instruction in a representation of a thinner unit and then looking for similarity. Figures 1A and 11B illustrate an example of the above concept. This example involves 20 200806039 ' 5 to be used to generate intermediate values between pixels. A bilinear filter in which a specific numerical calculation is performed (although this technique can be used for any data deduction). The deductive method requires calculation of many different values, using the same basic set of digital addition and data displacement steps, but these steps The order and labeling are different depending on the calculations made. Thus, in Figure 11A, the first of the 1/2 and 3/4 bicubic equations is calculated as the number 53, which needs to be done. 7 calculation steps. The second calculation is the number 18, which requires 6 calculation steps, four of which are common to the same four steps when they occur in the previous calculation and the same order is the same. The first equation • the last two The calculations again have the computational steps of overlapping the first two calculations. The additional calculations for the 10 1/2 bicubic equation and the three bilinear equations of the 11B diagram all involve many different combinations of the same computational steps. And all have four calculations to be done. For each equation, all four calculations can be performed using a parallel processor 30 with four processing elements 32, each with its own - 15 memory 34. As shown in Fig. 12, a selection code associated with each of the steps of the deductive method is coupled. There is a selection code associated with each step that commands which of the four variables are affected by the step. For example, there are nine deductive steps that are illustrated in the calculations of Figures 11A and 11B. For the first equation of Figure 11A, the first step applies only to the third and fourth variables, which are selected by the "0011" selection code command associated with the step 20 (where the code for the step and the variable) If it is "1", the step is applied to a specific variable, and if it is "0", it is not applied). Thus, a "0011" selection code command will only be applied to the third and fourth variables, rather than the first and second variables. The second step applies only to the second variable, as commanded by the selection code "0100". The same metrics 21 200806039 The method applies to all steps and all equation variables using the selection code shown. Eight uses the option code advantage instead of generating twenty deductive codes to account for twenty different calculations in the 11A and 11B diagrams (or at least eight different 5 deduction codes to do eight different numerical calculations) And loading each of the deductive methods into each of the four processing elements, only need to generate and load a = single ten-digit code (management to multiple processing elements for the memory group to be scattered) , or loaded into a single memory location, which is common across all processing elements). Only the selection code needs to be generated and loaded into different processing elements. The implementation of the desired calculation is much simpler. Because the deductive code is applied only once, and it is parallel to the Weiweidi, _ increases the parallel processing speed and efficiency. Figures 11A and 11B illustrate the use of a selection code for a data computation application to selectively command which derivation steps are applied to the data selection. 15 codes are equally available for the deductive method of moving data. The foregoing description, for purposes of illustration However, it will be apparent to those skilled in the art that no particular details are required to practice the invention. As such, the foregoing description of the specific embodiments of the invention are set forth It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible under the guidance of the above. For example, the present invention can be used to process any partitioning of any image format. That is, the present invention can process images of any format in parallel, regardless of whether they are (10) (1) rib images 'CIFf; like 'SIF images or any other. These images can also be divided into 22 200806039 any division, whether they are a huge block of images or any other. Also, any image material can be processed as such, whether it is intensity information, brightness information, color information or any other. The embodiments were chosen and described in order to best explain the invention and the embodiments of the embodiments of the invention For the specific use considered. The invention may be embodied in a method and a device for carrying out those methods. The present invention can also be implemented in a program format, which is implemented in a tangible medium such as a floppy disk, a CD-ROM, a hard disk drive, a firmware or any other machine readable storage medium, wherein the program When the code is loaded into and executed by a machine such as a computer, the machine becomes a device for implementing the present invention. The invention may also be embodied in the form of a code, for example, stored in a storage medium, loaded into and/or executed by a machine, or transmitted over a transmission medium, such as on a wire or cable, through an optical fiber or through Electrical 15 magnetic radiation, wherein when the code is loaded into and executed by a machine such as a computer, the machine becomes a machine for implementing the present invention. When implemented on a general purpose computer, the code segment is coupled to the processor to provide a unique device that operates similarly to a particular logic circuit. A simple description of the L diagram j 2〇 Figure 1 conceptually illustrates a macroblock of the 1〇8〇i high-definition (HD) frame. Figure 2A-2B illustrates the configuration of blocks such as macroblocks within an image frame. Figure 3A-3C illustrates the shooting of a macroblock from its image in an image to an individual parallel processor. Figure 4A-4E illustrates the mapping of images to individual parallel processors for many different image formats. Figures 5A-5B illustrate a 16x8 mapping used to map sub-partitions of images to individual parallel 5 processors. Figure 6A-6B illustrates a 16x4 mapping used to map sub-partitions of images to individual parallel processors. 7A-7C illustrate other methods of mapping image patches to parallel processors in accordance with an embodiment of the present invention. 10 Figures 8A-8C illustrate further details of the data structure of an image format, including luminance and chrominance information. 9A-9C illustrate many different other methods for mapping multiple image blocks to parallel processors in accordance with an embodiment of the present invention. 10A-10C illustrate a data block 15 data location, sub-block location, sub-block flag data location, and block type data in accordance with an embodiment of the present invention. Figures 11A-11B illustrate the deductive processing steps and the selection codes used to identify which processing steps are applied to which data variables. Figure 12 illustrates a parallel processor. 20 Similar reference numbers refer to the corresponding parts of the schema. [Main component symbol description] 10...image matrix 12...macroblock block 100-112···region 24

Claims

200806039 十、申請專利範圍： 1· 一種在具有組態來平行地處理資料變數之計算元件之 - 一平行處理陣列中，該方法包含：載入一平行處理器之多個計算元件之一演繹法，其中該演繹法包括多個處理步驟，且其中該等多個計算元件之每一個被組態來處理與該計算元件相關之一資料變數； • 載入該平行處理器之該等多個計算元件之選擇碼，其中該等選擇碼識別該等演繹法步驟之哪些要由該等計算元件應用至該等資料變數；以及由該等計算元件應用該等演繹法處理步驟至該等資料變數，其中對每個計算元件來說，僅由該等選擇碼識別之那些處理步驟要被應用至該資料變數。 2. 如申請專利範圍第1項之方法，其中對該等計算元件之每一個來說：該等處理步驟之每一個具有一與之相關之選擇碼，其決定是否要應用該處理步驟至該資料變數。 3·如申請專利範圍第1項之方法，其中該等處理步驟之每一個具有一與之相關之選擇碼，其決定是否該等計算元件之任一應用該處理步驟至該等資料變數之任一。 4·如申請專利範圍第1項之方法，其中該等處理步驟包括算數加法和資料位移。 5·如申請專利範圍第1項之方法，其中該演繹法之載入包括載入該演繹法至一記憶體中，其係在該等多個計算元 25 200806039 件間共用的。 6. 如申請專利範圍第1項之方法，其中該演繹法之載入包括載入該演繹法至多個記憶體中，其中該等多個記憶體之每一個係與該等計算元件之一相關的。 7. —懂電腦可讀取之媒體，其具有電腦可執行之指令於其上，其係供在一平行處理陣列中之處理方法用的，該陣列具有組態來平行地處理資料變數之計算元件，該方法包含：載入一平行處理器之多個計算元件之一演繹法，其中該演繹法包括多個處理步驟，且其中該等多個計算元件之每一個被組態來處理與該計算元件相關之一資料變數；載入該平行處理器之該等多個計算元件之選擇碼，其中該等選擇碼識別該等演繹法步驟之哪一些要被該等計算元件應用至該等資料變數；以及由該等計算元件應用該等演譯法處理步驟至該等資料變數，其中對每個計算元件來說，僅有由該等選擇碼識別之那些處理步驟被應用至該資料變數。 8. 如申請專利範圍第1項之電腦可讀取之媒體，其中對該等計算元件之每一個來說：該等處理步驟之每一個具有一與之相關之選擇碼，其決定是否要應用該處理步驟至該資料變數。 9. 如申請專利範圍第1項之電腦可讀取之媒體，其中該等處理步驟之每一個具有一與之相關之選擇碼，其決定是 26 200806039 否該等計算元件之任一應用該處理步驟至該等資料變數之任一。 10. 如申請專利範圍第1項之電腦可讀取之媒體，其中該等處理步驟包括算術加法和資料位移。 11. 如申請專利範圍第1應之電腦可讀取之媒體，其中該演繹法之載入包括載入該演繹法至一記憶體中，其係在該等多個計算元件間共用。 12. 如申請專利範圍第1項之電腦可讀取之媒體，其中該演繹法之載入包括載入該演繹法至多個記憶體中，其中該等多個記憶體之每一個係與該等計算元件之一相關。 27200806039 X. Patent Application Range: 1. A method of processing a parallel computing array in a parallel processing array having a configuration to process data variables in parallel, the method comprising: loading one of a plurality of computing elements of a parallel processor Where the deductive method comprises a plurality of processing steps, and wherein each of the plurality of computing elements is configured to process a data variable associated with the computing element; • loading the plurality of calculations of the parallel processor a selection code of the component, wherein the selection codes identify which of the derivation steps are to be applied by the computing elements to the data variables; and the deriving processing steps are applied by the computing elements to the data variables, Wherein for each computing element, only those processing steps identified by the selection codes are applied to the data variable. 2. The method of claim 1, wherein each of the computing elements: each of the processing steps has a selection code associated therewith that determines whether the processing step is to be applied to the Data variables. 3. The method of claim 1, wherein each of the processing steps has a selection code associated therewith, which determines whether any of the computing elements applies the processing step to the data variables One. 4. The method of claim 1, wherein the processing steps include arithmetic addition and data displacement. 5. The method of claim 1, wherein the loading of the deductive method comprises loading the deductive method into a memory, which is shared among the plurality of computing elements 25 200806039. 6. The method of claim 1, wherein the loading of the deductive method comprises loading the deductive method into a plurality of memories, wherein each of the plurality of memories is associated with one of the computing elements of. 7. A computer readable medium having computer executable instructions thereon for processing in a parallel processing array having a configuration to process data variables in parallel The method, the method comprising: loading one of a plurality of computing elements of a parallel processor, wherein the deductive method includes a plurality of processing steps, and wherein each of the plurality of computing elements is configured to process Computing a data variable associated with the component; loading a selection code of the plurality of computing elements of the parallel processor, wherein the selection codes identify which of the derivation steps are to be applied to the data by the computing component Variables; and the processing steps are applied by the computing elements to the data variables, wherein for each computing element, only those processing steps identified by the selection codes are applied to the data variables. 8. The computer readable medium of claim 1 of the patent scope, wherein each of the computing elements: each of the processing steps has a selection code associated therewith, which determines whether to apply This processing step goes to the data variable. 9. The computer readable medium of claim 1, wherein each of the processing steps has a selection code associated therewith, the decision being 26 200806039 no application of any of the computing elements Step to any of these data variables. 10. A computer readable medium as claimed in claim 1 wherein the processing steps include arithmetic addition and data displacement. 11. A computer readable medium as claimed in claim 1 wherein the loading of the method includes loading the deductive method into a memory that is shared among the plurality of computing elements. 12. The computer readable medium of claim 1, wherein the loading of the deductive method comprises loading the deductive method into a plurality of memories, wherein each of the plurality of memories is compatible with the plurality of memories One of the computing elements is related. 27