CN118227515A - Performing separable operations on a two-dimensional array of values at a processing unit including a memory - Google Patents


Info

Publication number
CN118227515A
Authority
CN
China
Legal status
Pending
Application number
CN202311754998.XA
Other languages
Chinese (zh)
Inventor
S. Cséfalvay
Current Assignee
Imagination Technologies Ltd
Original Assignee
Imagination Technologies Ltd
Application filed by Imagination Technologies Ltd
Publication of CN118227515A


Abstract

The present application relates to performing separable operations on a two-dimensional array of values at a processing unit that includes a memory comprising a plurality of memory banks, where in each writing or reading step each memory bank can be written to or read from by only one respective thread. A computer-implemented method of performing a separable operation on a two-dimensional array of values comprises: dividing the array into sub-arrays; and, for each sub-array: performing an initial stage of the separable operation on the sub-array of values using a plurality of threads to generate a respective processed value for each value of the sub-array; writing a first plurality of processed values to the memory, the processed values corresponding to a one-dimensional sequence of values of the sub-array; reading a respective second plurality of processed values corresponding to a vertical one-dimensional sequence of values of the sub-array of values in a transposed position relative to that sub-array; and performing a later stage of the separable operation to generate an output value for each value of the sub-array of values in the transposed position. In at least one writing step a respective processed value is written to each of the memory banks, and in at least one reading step a respective processed value is read from each of the memory banks.

Description

Performing separable operations on a two-dimensional array of values at a processing unit including a memory
Cross Reference to Related Applications
The present application claims priority from british patent application 2219374.2 filed on month 21 of 2022, which is incorporated herein by reference in its entirety. The present application also claims priority from british patent application 2219375.9 filed on month 21 of 2022, which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to performing separable operations on a two-dimensional array of values at a processing unit including a memory.
Background
FIG. 1A illustrates an example Graphics Processing Unit (GPU) 100 and global memory 108. The work to be performed by the GPU may be arranged into "workgroups", "thread bundles" (warps) and "threads". A workgroup may include one or more thread bundles. A thread bundle may include multiple threads, which may be processed in parallel at a single core 102 of the GPU 100.
GPU 100 may perform operations on an array of values. FIG. 2 illustrates an example operation. In fig. 2, the operation is a one-dimensional Gaussian filtering operation, in which a value 204 is filtered according to a one-dimensional kernel 202 comprising the value 204 to be filtered and the three values on either side of it. The filtered output for value 204 is determined by performing a weighted summation of the values in kernel 202. The respective weight for each value in kernel 202 is determined from a Gaussian function 200 centered on the value 204 to be filtered.
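By way of illustration only (no code appears in the patent itself), the weighted summation of fig. 2 can be sketched as below. The function names, the sigma parameter and the normalisation are assumptions, and edge handling is ignored here (edge conditions are discussed later in the description).
```cuda
#include <math.h>

#define RADIUS 3   // three values on either side of the value to be filtered

__host__ __device__ inline float gaussian_weight(int offset, float sigma)
{
    // Weight taken from a Gaussian centered on the value being filtered.
    return expf(-(float)(offset * offset) / (2.0f * sigma * sigma));
}

__host__ __device__ inline float filter_value_1d(const float *seq, int i, float sigma)
{
    // Weighted sum over the value at index i and RADIUS values on either side.
    float sum = 0.0f, norm = 0.0f;
    for (int d = -RADIUS; d <= RADIUS; ++d) {
        float w = gaussian_weight(d, sigma);
        sum  += w * seq[i + d];   // the caller must guarantee i +/- RADIUS is in range
        norm += w;
    }
    return sum / norm;            // normalise so the weights sum to one
}
```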
A separable two-dimensional Gaussian filtering operation can be implemented by performing an initial stage in which a one-dimensional Gaussian filtering operation is performed horizontally on each row of the two-dimensional array of values, followed by a later stage in which the same one-dimensional Gaussian filtering operation is performed vertically along each column of the horizontally filtered two-dimensional array of values (or vice versa, i.e. vertically first, then horizontally).
In a simple approach, each phase of a separable two-dimensional Gaussian filtering operation on an array of values may be performed by assigning each value of the array of values to be filtered to a thread for processing. As an illustrative example, consider a separable two-dimensional Gaussian filtering operation to be performed on a two-dimensional array of 1024 values (e.g., an array of values arranged as 32×32). In this example, to perform the initial (e.g., horizontal) phase of the operation, each of the 1024 values may be assigned to a respective thread of 1024 threads. To filter its value, each thread may read from global memory 108 all of the values included in the one-dimensional filter kernel used to filter that value (e.g., the value to be filtered and the three values on either side of it). Each thread may then generate a filtered value by performing a weighted sum of the values in the kernel and write the filtered value back to global memory 108, such that global memory 108 stores a once-filtered (e.g., horizontally filtered) value for each value of the two-dimensional array of values. To perform the later (e.g., vertical) phase of the operation, each of the 1024 once-filtered values may be assigned to a respective thread of 1024 threads. To further filter its once-filtered value, each thread may read from global memory 108 all of the once-filtered values included in the one-dimensional filter kernel used to further filter that once-filtered value (e.g., the once-filtered value to be further filtered and the three once-filtered values on either side of it). The one-dimensional filter kernel used in the later (e.g., vertical) stage is perpendicular to the one-dimensional filter kernel used in the initial (e.g., horizontal) stage. Each thread may then generate a further-filtered value by performing a weighted sum of the once-filtered values in the kernel and write the further-filtered value back to global memory 108, such that global memory 108 stores a further-filtered (e.g., horizontally and vertically filtered) value for each value of the two-dimensional array of values.
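A hedged CUDA sketch of this simple approach (one thread per value, both stages going through global memory) is given below. The kernel names, the 32×32 size, the clamp-to-edge handling and the launch configuration are illustrative assumptions, not taken from the patent.
```cuda
#define W 32
#define H 32
#define R 3          // filter radius

__device__ inline int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

__global__ void horizontal_pass(const float *in, float *once_filtered,
                                const float *weights /* 2*R+1 entries */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per value
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    float sum = 0.0f;
    for (int d = -R; d <= R; ++d)                    // 7 reads per output value
        sum += weights[d + R] * in[y * W + clampi(x + d, 0, W - 1)];
    once_filtered[y * W + x] = sum;                  // write back to global memory
}

__global__ void vertical_pass(const float *once_filtered, float *out,
                              const float *weights)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    float sum = 0.0f;
    for (int d = -R; d <= R; ++d)
        sum += weights[d + R] * once_filtered[clampi(y + d, 0, H - 1) * W + x];
    out[y * W + x] = sum;
}
// Host side (illustrative): dim3 block(32, 32);
// horizontal_pass<<<1, block>>>(d_in, d_tmp, d_w);
// vertical_pass<<<1, block>>>(d_tmp, d_out, d_w);
// Note the full round trip through global memory between the two stages --
// exactly the cost the description above points out.
```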
Performing separable two-dimensional gaussian filtering operations on the graphics processing unit 100 in this manner is relatively slow, as this involves performing two sets of reads and two sets of writes to the global memory 108. Furthermore, in each stage of the separable filtering operation, each value is read from memory multiple times (e.g., seven times in the example given above). This is because, in each stage, each value is read from memory (e.g., global memory 108) by a thread assigned to filter the value, and each value is also read from memory (e.g., cache memory to which the value is written after being read from global memory 108) by each of the threads assigned to filter other values using a filter kernel that includes the value.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
According to a first aspect of the present invention there is provided a computer-implemented method of performing separable operations on a two-dimensional array of values at a processing unit comprising a memory, the memory comprising a plurality of memory banks, wherein in each writing or reading step each memory bank can be written to or read from by only one respective thread, the method comprising: dividing the two-dimensional value array into a plurality of two-dimensional value sub-arrays; for each of the plurality of subarrays: performing an initial stage of separable operations on the subarray of values using a plurality of threads to generate a respective processed value for each value of the subarray of values; each of the plurality of threads writing a respective first plurality of processed values to the memory through a plurality of writing steps, the first plurality of processed values corresponding to a one-dimensional sequence of values of the sub-array of values; each of the plurality of threads reads a respective second plurality of processed values from the memory through a plurality of read steps, the second plurality of processed values corresponding to a vertical one-dimensional sequence of values of the sub-array of values in a transposed position relative to the sub-array of values within the array of values; and performing a later stage of separable operations using the plurality of threads on the plurality of processed values read by the plurality of threads to generate respective output values for each value of the sub-array of values in the transpose position; wherein in at least one of the plurality of writing steps a respective processing value is written to each of the banks of the memory and in at least one of the plurality of reading steps a respective processing value is read from each of the banks of the memory.
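Purely as an illustration of the flow recited above (and not the patent's own code), the following CUDA sketch processes a single 32×32 sub-array with one 32-thread bundle, one thread per row. The thread-per-row mapping, the clamped edge handling and all names are assumptions; the +1 padding column plays the role of the fill elements described further below, so that writing a row and reading a column each touch every memory bank at most once per step.
```cuda
#define SUB 32   // sub-array is SUB x SUB values
#define R   3    // filter radius

__device__ inline int clamp_idx(int v) { return v < 0 ? 0 : (v > SUB - 1 ? SUB - 1 : v); }

__global__ void separable_subarray(const float *in, float *out,
                                   const float *w /* 2*R+1 weights */)
{
    // One padding column so row writes and column reads are bank-conflict free.
    __shared__ float once[SUB][SUB + 1];

    int row = threadIdx.x;                       // one thread per row (SUB threads)

    // Initial (e.g. horizontal) stage: filter this thread's row.
    for (int x = 0; x < SUB; ++x) {
        float sum = 0.0f;
        for (int d = -R; d <= R; ++d)
            sum += w[d + R] * in[row * SUB + clamp_idx(x + d)];  // clamp = simplified edge handling
        once[row][x] = sum;                      // write the processed row to the banked memory
    }
    __syncthreads();

    // Later (e.g. vertical) stage: the same thread now reads a COLUMN of the
    // once-filtered values -- a vertical sequence at a transposed position.
    for (int y = 0; y < SUB; ++y) {
        float sum = 0.0f;
        for (int d = -R; d <= R; ++d)
            sum += w[d + R] * once[clamp_idx(y + d)][row];
        out[y * SUB + row] = sum;                // output value for each value of the sub-array
    }
}
// Host side (illustrative): separable_subarray<<<1, SUB>>>(d_in, d_out, d_weights);
```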
The memory bank to which a processed value is to be written may be determined from a write buffer array having a greater number of elements than the number of elements in the two-dimensional array of values.
The write buffer array may include: value elements corresponding to the values of the two-dimensional array; and fill elements. The fill elements may correspond to padding in the memory.
The write buffer array may include multiple sets of consecutive value elements corresponding to values of the two-dimensional array, and fill elements interspersed between the multiple sets.
The number of value elements in each group may (i) be equal to the number of memory banks comprised by the memory, (ii) be a multiple of the number, or (iii) be a factor of the number.
The number of value elements in each group may be equal to or less than the number of threads for performing separable operations on the value array.
The write buffer array may be mapped to the memory such that processed values corresponding to the values of the two-dimensional array are written to the memory locations to which the value elements of the write buffer array are mapped, and no processed values corresponding to the values of the two-dimensional array are written to the memory locations to which the fill elements of the write buffer array are mapped.
The memory location to which a thread writes a processed value may be determined from a base memory address, a write offset and a write fill amount, where the write offset and the write fill amount may depend on the location, within the array of values, of the value to which the processed value corresponds.
The two-dimensional array of values may be divided into a plurality of two-dimensional sub-arrays of values represented by a multi-dimensional array [I][J][K][M], wherein I and J represent the number of sub-arrays of values within the array of values in each of the two dimensions, and K and M represent the number of values within each sub-array of values in each of the two dimensions, and wherein each value in the array of values has coordinates [i][j][k][m] which define the position of that value within the multi-dimensional array [I][J][K][M]. The write offset may be equal to (i × J × K × M) + (j × K × M) + (k × M) + m, and the write fill amount may be equal to the write offset divided by the fill frequency.
The memory location from which a thread reads a processed value may be determined from the base memory address, a read offset and a read fill amount, the read offset and the read fill amount being dependent on the location, within the array of values, of the value to which the processed value corresponds.
With the two-dimensional array of values divided into the plurality of two-dimensional sub-arrays of values represented by the multi-dimensional array [I][J][K][M], wherein I and J represent the number of sub-arrays of values within the array of values in each of the two dimensions and K and M represent the number of values within each sub-array of values in each of the two dimensions, and wherein each value in the array of values has coordinates [i][j][k][m] which define the position of that value within the multi-dimensional array [I][J][K][M], the read offset may be equal to (j × J × K × M) + (i × K × M) + (m × M) + k, or (j × J × K × M) + (i × K × M) + (k × M) + m, and the read fill amount may be equal to the read offset divided by the fill frequency.
The fill frequency may (i) be equal to the number of memory banks comprised by the memory, (ii) be a multiple of the number, or (iii) be a factor of the number.
The fill frequency may be equal to or less than the number of threads used to perform separable operations on the array of values.
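The addressing scheme above can be hard to follow in prose, so a hedged CUDA sketch of one possible interpretation follows: a row-major offset is computed from the value's coordinates, and one fill (padding) element is inserted after every PAD_FREQ value elements, so that the fill amount is the offset divided by the fill frequency. The constant PAD_FREQ, the helper names and the row-major reading of the offsets are assumptions, not taken verbatim from the patent.
```cuda
#define PAD_FREQ 32   // e.g. the number of memory banks (an assumption)

// Row-major linear offset of value [i][j][k][m] in an [I][J][K][M] array
// (J, K, M are the array extents; i, j, k, m are the value's coordinates).
__host__ __device__ inline int write_offset(int i, int j, int k, int m,
                                            int J, int K, int M)
{
    return ((i * J + j) * K + k) * M + m;
}

// Padded index: one fill element after every PAD_FREQ value elements, so the
// fill amount added to the offset is offset / PAD_FREQ.
__host__ __device__ inline int padded_index(int offset)
{
    return offset + offset / PAD_FREQ;
}

// Usage (illustrative):
// buffer[base + padded_index(write_offset(i, j, k, m, J, K, M))] = processed_value;
```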
The plurality of two-dimensional value sub-arrays may be non-overlapping. The two-dimensional array of values may be square, and each of the two-dimensional sub-arrays of values may be square.
The plurality of threads may be processed by processing logic comprised by a core of the processing unit; the processing unit may be implemented on a chip, and the memory may be physically located on the same chip as the processing logic.
The two-dimensional array of values may be a two-dimensional array of pixel values. The separable operations may be separable filtering operations or separable fast integral image calculations. The separable filtering operation may be a separable gaussian filtering operation or a separable box filtering operation.
The one-dimensional sequence of values of the sub-array of values may be a row of values of the sub-array of values, and the vertical one-dimensional sequence of values of the sub-array of values in the transposed position may be a column of values of the sub-array of values in the transposed position; or the one-dimensional sequence of values of the sub-array of values may be a column of values of the sub-array of values and the vertical one-dimensional sequence of values of the sub-array of values in the transposed position may be a row of values of the sub-array of values in the transposed position.
According to a second aspect of the present invention, there is provided a processing unit for performing separable operations on a two-dimensional array of values, the processing unit comprising: a memory comprising a plurality of memory banks, wherein the memory is configured such that in each writing or reading step, each memory bank can be written to or read by only one respective thread; and processing logic configured to: dividing the two-dimensional value array into a plurality of two-dimensional value sub-arrays; for each of the plurality of subarrays: performing an initial stage of separable operations on the subarray of values using a plurality of threads to generate a respective processed value for each value of the subarray of values; writing, using each of a plurality of threads, a respective first plurality of processed values to a memory through a plurality of writing steps, the first plurality of processed values corresponding to a one-dimensional sequence of values of the sub-array of values; using each of the plurality of threads, reading a respective second plurality of processed values from memory through a plurality of reading steps, the second plurality of processed values corresponding to a vertical one-dimensional sequence of values of the subarray of values in a transposed position within the array of values relative to the subarray of values; and performing a later stage of separable operations using the plurality of threads on the plurality of processed values read by the plurality of threads to generate respective output values for each value of the sub-array of values in the transpose position; wherein in at least one of the plurality of writing steps a respective processing value is written to each of the banks of the memory and in at least one of the plurality of reading steps a respective processing value is read from each of the banks of the memory.
There may also be provided a computer-implemented method of performing an operation on an array of values at a processing unit, the method comprising: to perform the phase of the operation: for each one of one or more one-dimensional sequences of values of the array of values: assigning a respective value segment of the one-dimensional sequence of values to each of a plurality of threads; and a first thread of the plurality of threads: determining at least one contribution from the value segment assigned to the first thread to the stage of the operation to be completed by a second thread of the plurality of threads for an adjacent value segment of the one-dimensional sequence of values; and writing the at least one contribution to memory; and the second thread of the plurality of threads: reading the at least one contribution from the memory; and completing the phase of the operation for the adjacent value segment allocated to the second thread in accordance with the at least one contribution read from the memory to generate a processed value segment.
A processing unit for performing operations on an array of values may also be provided, the processing unit comprising processing logic and a memory, the processing logic configured to: stage for performing the operation: for each one of one or more one-dimensional sequences of values of the array of values: assigning a respective value segment of the one-dimensional sequence of values to each of a plurality of threads; and by a first thread of the plurality of threads: determining at least one contribution from a value segment assigned to a first thread to a stage of an operation to be completed by a second thread of the plurality of threads for an adjacent value segment of the one-dimensional sequence of values; and writing at least one contribution to the memory; and by a second thread of the plurality of threads: reading at least one contribution from a memory; and completing a phase of an operation for a neighboring value segment allocated to the second thread based on the at least one contribution read from the memory to generate a processed value segment.
The processing units as described in any of the examples herein may be embodied in hardware on an integrated circuit. A method of manufacturing a processing unit as described in any of the examples herein at an integrated circuit manufacturing system may be provided. An integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, configures the system to manufacture a processing unit as described in any of the examples herein. A non-transitory computer-readable storage medium may be provided having stored thereon a computer-readable description of a processing unit as described in any of the examples herein, which when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the processing unit.
An integrated circuit manufacturing system may be provided, the integrated circuit manufacturing system comprising: a non-transitory computer-readable storage medium having stored thereon a computer-readable description of a processing unit as described in any of the examples herein; a layout processing system configured to process the computer-readable description to generate a circuit layout description of an integrated circuit embodying the processing unit; and an integrated circuit generation system configured to fabricate the processing unit according to the circuit layout description.
Computer program code for performing any of the methods described herein may be provided. A non-transitory computer-readable storage medium having stored thereon computer-readable instructions that, when executed at a computer system, cause the computer system to perform any of the methods described herein may be provided.
As will be apparent to those skilled in the art, the above features may be suitably combined and combined with any of the aspects of the examples described herein.
Drawings
Examples will now be described in detail with reference to the accompanying drawings, in which:
FIG. 1A illustrates an example Graphics Processing Unit (GPU) and memory.
FIG. 1B illustrates an example memory hierarchy accessible by processing logic.
FIG. 2 illustrates an example operation.
FIG. 3 illustrates an example image divided into a plurality of overlapping tiles.
FIG. 4 illustrates an example allocation of value segments to multiple threads.
Fig. 5 shows a one-dimensional sequence of values.
FIG. 6 illustrates an example filter kernel overlaid on a one-dimensional sequence of values.
Figs. 7a to 7c illustrate example contributions from the values of a value segment to the phase of an operation to be completed for an adjacent value segment.
Fig. 8 shows, in an example, value segments corresponding to the plurality of processed values allocated in a later stage of the operation.
Fig. 9 shows a vertical one-dimensional value sequence.
FIG. 10 illustrates a method of performing operations on an array of values at a processing unit in accordance with principles described herein.
FIG. 11 illustrates an example memory including multiple memory banks.
Figs. 12A and 12B show processed values in the memory.
FIG. 13 illustrates a method of performing separable operations on a two-dimensional array of values at a processing unit including a memory in accordance with principles described herein.
Fig. 14 shows a value array divided into a plurality of value sub-arrays.
Figs. 15A to 15D show processed values and padding in the memory.
FIG. 16 illustrates a computer system in which a processing unit is implemented; and
Fig. 17 illustrates an integrated circuit manufacturing system for generating an integrated circuit embodying a processing unit.
The figures illustrate various examples. Skilled artisans will appreciate that element boundaries (e.g., blocks, groups of blocks, or other shapes) illustrated in the figures represent one example of boundaries. In some examples, it may be the case that one element may be designed as a plurality of elements, or that a plurality of elements may be designed as one element. Where appropriate, common reference numerals have been used throughout the various figures to indicate like features.
Detailed Description
The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.
Embodiments will now be described by way of example only.
FIG. 1A illustrates an example Graphics Processing Unit (GPU) 100 and memory 108. Graphics processing unit 100 is described herein as an example of a processing unit capable of (e.g., configured to execute) parallel processing. It should be appreciated that the principles described herein may also be applied to any other suitable type of processing unit capable of (e.g., configured to perform) parallel processing, such as a digital signal processing unit (e.g., a DSP); or a Central Processing Unit (CPU) of a suitable type capable of parallel processing.
Graphics processing unit 100 may have any suitable architecture. Graphics processing unit 100 may operate to perform any kind of graphics, image or video processing, general purpose processing, and/or any other type of data processing, such as processing of general purpose computing tasks (particularly tasks that are easily parallelizable). Examples of general computing tasks include signal processing, audio processing, computer vision, physical simulation, statistical computation, neural networks, and cryptography.
Graphics processing units typically include one or more processing elements. In FIG. 1A, graphics processing unit 100 is shown to include three processing elements, labeled 102-1, 102-2, and 102-n. It should be appreciated that a processing unit configured in accordance with the principles described herein may include any suitable number of processing elements.
Each processing element 102 may be a different core of graphics processing unit 100. Each processing element 102 includes processing logic 104 and memory 106. That is, in fig. 1A: processing element 102-1 includes processing logic 104-1 and memory 106-1; processing element 102-2 includes processing logic 104-2 and memory 106-2; and processing element 102-n includes processing logic 104-n and memory 106-n. Each memory 106 is operable to store data exclusively by/for the processing logic 104 of the processing element 102 that includes that memory. Each memory 106 may be physically located on the same chip (e.g., on the same semiconductor die and/or in the same integrated circuit package) as the processing logic 104 of the processing element 102 that includes that memory. Thus, each memory 106 may be referred to as "local memory", "on-chip memory" or "internal memory". The processing logic 104 of each processing element 102 may be able to access its local memory 106 without consuming memory bandwidth of the memory 108. Each local memory 106 may have a smaller storage capacity, for example 60 kB (kilobytes), relative to the storage capacity of the memory 108.
The processing logic 104 of each processing element 102 may also access the memory 108, for example through a system bus. Graphics processing unit 100 may be implemented on a chip (e.g., a semiconductor die and/or an integrated circuit package), and memory 108 may not be physically located on the same chip (e.g., semiconductor die and/or integrated circuit package) as graphics processing unit 100. Thus, the memory 108 may be referred to as "off-chip memory" and/or "external memory". The memory 108 may also be used to store data for other processing units of the system in which the graphics processing unit is implemented, such as a central processing unit (CPU, not shown in fig. 1A), and thus may also be referred to as "system memory" and/or "global memory". The memory 108 may be a dynamic random access memory (e.g., DRAM). Global memory 108 may have a larger storage capacity, for example 10 GB (gigabytes), relative to the storage capacity of each local memory 106. However, the latency associated with processing logic 104 reading from and writing to global memory 108 may be greater (e.g., significantly greater) than the latency associated with processing logic 104 reading from and writing to its local memory 106.
As described in further detail herein, the processing logic 104 of each processing element 102 may also access other types of memory (e.g., caches, registers, or any other suitable type of memory—not shown in fig. 1A for ease of illustration).
The work performed by a processing unit capable of (e.g., configured to perform) parallel processing may be arranged into so-called "workgroups", "thread bundles" and "threads". A workgroup may include one or more thread bundles. A thread bundle may include multiple threads, where the multiple threads may be processed in parallel (e.g., at a single core of a graphics processing unit). In examples where a workgroup includes more than one thread bundle, each of those thread bundles may be processed serially at a single core of the graphics processing unit. Workgroups may be processed independently of each other (e.g., at different cores of the graphics processing unit, or serially at a single core of the graphics processing unit). Threads within the same workgroup (i.e., threads within the same thread bundle of the workgroup, and threads within different thread bundles of the same workgroup) may be able to share access, during their processing, to the memory of the processing element (e.g., core) of the processing unit that is dedicated to processing those threads (e.g., the local memory 106 of the processing logic 104 dedicated to processing those threads). That is, threads within the same thread bundle may be able to share access, during their processing, to the memory of the processing element (e.g., core) of the processing unit that is dedicated to processing those threads. Furthermore, thread bundles within the same workgroup may be able to share access, during their processing, to the memory of the processing element (e.g., core) of the processing unit that is dedicated to processing those thread bundles. In contrast, different workgroups may not be able to share access, during their processing, to memory dedicated to a particular processing element (e.g., core) of the processing unit.
The thread bundles may be arranged as an array of threads (e.g., a one-dimensional, two-dimensional, or three-dimensional array of threads). The number of threads included by a thread bundle may be limited. The limit on the number of threads included by a thread bundle may be due to hardware limitations (e.g., a limit on how many threads can be processed in parallel on available processing hardware). In an example, a thread bundle may include up to 128 threads. In this example, if more than 128 threads are to perform the same operation, more than one thread bundle will be associated with the operation. For example, if 2048 threads are to perform the same operation, sixteen thread bundles may be associated with the operation. The sixteen thread bundles may be included by the same workgroup or may be divided among multiple workgroups (e.g., up to sixteen different workgroups). It is to be understood that the terms "workgroup," "thread bundle," and "thread" as used herein are not intended to be limiting, and that another term may be used to describe the same concepts. For example, a "thread" as described herein may alternatively be referred to as a "call" or "work item," while a "workgroup" as described herein may alternatively be referred to as a "thread block" or "thread group.
FIG. 1B illustrates an example memory hierarchy accessible by processing logic 104 of processing element 102. In FIG. 1B, for ease of illustration, a memory hierarchy associated with a single processing element 102 of a processing unit (e.g., processing unit 100 of FIG. 1A) is shown. It should be appreciated that each processing element 102 of a processing unit (e.g., processing unit 100 of fig. 1A) may be associated with a memory hierarchy equivalent to that shown in fig. 1B.
In FIG. 1B, processing logic 104, local memory 106, and global memory 108 have the same characteristics as processing logic 104, local memory 106, and global memory 108 described with reference to FIG. 1A. Also shown in FIG. 1B is a register bank 110 included in processing element 102. The register bank 110 includes a plurality of registers (e.g., register memories). Each register of the register bank 110 has a smaller storage capacity, e.g., 32 bits, relative to the storage capacity of the local memory 106. Accordingly, the latency associated with processing logic 104 reading from and writing to the register bank 110 may be less than the latency associated with processing logic 104 reading from and writing to the local memory 106. The register bank 110 may be physically located on the same chip (e.g., on the same semiconductor die and/or in the same integrated circuit package) as the processing logic 104 of the processing element 102 that includes the register bank. The register bank 110 may be physically located on the same chip (e.g., on the same semiconductor die and/or in the same integrated circuit package) as the local memory 106 of the processing element 102 that includes the register bank.
When processing logic 104 processes a thread bundle that includes multiple threads, a corresponding one or more registers of register bank 110 may be dedicated to each thread of the thread bundle (e.g., exclusively accessible by the thread). That is, values processed according to (e.g., "by") a thread may be stored in one or more registers accessible to the thread. Other threads within the same thread bundle may not be able to access those values within one or more registers accessible to the thread.
For example, a thread bundle including a plurality of threads (e.g., 128 threads) may be processed at a processing element 102 (e.g., a core of a processing unit) to perform operations on an array of values (e.g., 1024 values). Referring to FIG. 1B, the array of values may initially be stored in global memory 108. Each thread of the thread bundle may be configured to perform an operation on a set of values (e.g., 8 values) of an array of values (e.g., 1024 values). The limitation on the number of values that can be handled by a thread may be caused by the amount of register memory in the register bank 110 that is accessible to each thread (e.g., dedicated to that thread). To perform this operation, for each thread of the thread bundle, processing logic 104 may cause a respective set of values (e.g., 8 values) to be processed by the thread to be read from global memory 108 into one or more registers dedicated to the thread. Values written to one or more registers dedicated to a thread may be processed by processing logic 104 in accordance with the thread. Thereafter, the processing values may be output from the processing element 102 by writing those processing values from the corresponding one or more registers dedicated to each thread into the global memory 108. Alternatively, if further processing is to be performed on those processing values at the processing element 102, those values may be written into the local memory 106 from a corresponding one or more registers dedicated to each thread. Multiple threads within a thread bundle may share access to local memory 106. That is, multiple threads within a thread bundle may access the processing values written into local memory 106. Thus, for each thread of a thread bundle, processing logic 104 may cause a corresponding set of processing values (e.g., 8 processing values) to be further processed by the thread to be read from local memory 106 into one or more registers dedicated to the thread. The processing values written into one or more registers dedicated to the thread may be further processed by processing logic 104 in accordance with the thread. Thereafter, the further processed values may be output from the processing element 102 by writing those further processed values from the corresponding one or more registers dedicated to each thread into the global memory 108. Alternatively, still further processing iterations may be performed at processing element 102 in the same manner.
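A hedged CUDA sketch of the register/local-memory flow just described is given below, for a thread bundle in which each thread owns eight values. The names, the sizes and the placeholder arithmetic in each pass are illustrative assumptions.
```cuda
#define THREADS_PER_BUNDLE 128
#define VALS_PER_THREAD 8

__global__ void two_stage_example(const float *global_in, float *global_out)
{
    __shared__ float local_mem[THREADS_PER_BUNDLE * VALS_PER_THREAD]; // stands in for local memory 106
    float regs[VALS_PER_THREAD];                                      // registers dedicated to this thread
    int t = threadIdx.x;

    // Global memory -> registers.
    for (int v = 0; v < VALS_PER_THREAD; ++v)
        regs[v] = global_in[t * VALS_PER_THREAD + v];

    // First processing pass in registers (placeholder operation).
    for (int v = 0; v < VALS_PER_THREAD; ++v)
        regs[v] *= 2.0f;

    // Registers -> local memory, so the other threads of the bundle can see the results.
    for (int v = 0; v < VALS_PER_THREAD; ++v)
        local_mem[t * VALS_PER_THREAD + v] = regs[v];
    __syncthreads();

    // Local memory -> registers (possibly a different set of values), second pass.
    for (int v = 0; v < VALS_PER_THREAD; ++v)
        regs[v] = local_mem[t * VALS_PER_THREAD + v] + 1.0f;

    // Registers -> global memory.
    for (int v = 0; v < VALS_PER_THREAD; ++v)
        global_out[t * VALS_PER_THREAD + v] = regs[v];
}
```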
A processing unit, such as graphics processing unit 100, may perform operations on an array of values (e.g., a one-dimensional or multi-dimensional array of values). In an example, the value of the array of values may be a pixel value, an audio sample of an audio signal, a signal sample of a transmission signal, or any other suitable type of value.
FIG. 2 illustrates an example operation. In fig. 2, the operation is a one-dimensional Gaussian filtering operation, in which the value 204 is filtered according to a one-dimensional kernel 202 of values that includes the value 204 to be filtered and one or more values on one or both sides of the value 204. The one-dimensional Gaussian filtering operation may use a filter kernel 202 that includes an odd number of values, with the center value 204 being the value to be filtered. In fig. 2, filter kernel 202 is symmetrical. In the particular example shown in fig. 2, filter kernel 202 includes seven values, comprising the value 204 to be filtered and three values on either side of the value 204. In this example, filter kernel 202 can be said to have a radius r of three values. That is, for a symmetric filter kernel, the radius r of the filter kernel 202 can be said to be the number of values on either side of the center value to be filtered. It should be appreciated that in other examples the filter kernel need not be symmetrical. That is, the filter kernel may include more values on one side of the value to be filtered than on the other side.
In a one-dimensional gaussian filter operation, the filtered output for the center value 204 may be determined by performing a weighted summation of the values in the kernel 202. Respective weights for each value in kernel 202 may be determined from gaussian function 200 centered around value 204 to be filtered. That is, as will be appreciated by those skilled in the art, in weighted summation, values farther from the value 204 to be filtered will be weighted lower (e.g., have values closer to 0) than values nearer to the value 204 to be filtered, according to the gaussian function 200.
The one-dimensional gaussian filter operation shown in fig. 2 may be performed for each value of a one-dimensional sequence of values. That is, each value in a one-dimensional sequence of values may be filtered according to a one-dimensional kernel 202 of values, in which the value is the value to be filtered. Incidentally, one skilled in the art will know of many different techniques for operating on values at the beginning and end of a sequence of values, such as when the value to be filtered is the first value in the sequence of values. In one example, this edge condition may be addressed by replicating a first value in the sequence of values such that the filter kernel includes the value to be filtered, a plurality of values from the sequence of values on one side of the value to be filtered, and a plurality of copies of the value to be filtered on the other side of the value to be filtered. Typically, the number of values at the beginning and end of a sequence of values during filtering, to which the "edge condition technique" is applied, is equal to the radius r of the filter kernel. Those skilled in the art will be aware of various other techniques for addressing such edge situations and, therefore, for the sake of brevity, these techniques will not be further discussed herein.
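As a sketch only (the patent does not prescribe this code), the replication-based edge handling described above can be expressed as a clamped sampling helper; the name sample_replicated is an assumption.
```cuda
// Replicate the first (or last) value of the sequence so that a kernel of
// radius R always has 2R+1 inputs, even at the ends of the sequence.
__host__ __device__ inline float sample_replicated(const float *seq, int n, int i)
{
    if (i < 0)  return seq[0];       // copies of the first value
    if (i >= n) return seq[n - 1];   // copies of the last value
    return seq[i];
}
// A radius-R filter then reads sample_replicated(seq, n, i + d) for d = -R..R.
```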
In examples where the array of values to be operated on is a one-dimensional array of values (e.g. a sequence of audio samples of an audio signal, or a sequence of signal samples of a transmission signal), the output of this filtering for each value in the one-dimensional sequence of values may be the output of a one-dimensional gaussian filtering operation. In other examples where the array of values to be operated on is a multi-dimensional array of values (e.g., a two-dimensional array of pixel values), the filtering of each value in a one-dimensional sequence of values may be the first stage of a separable multi-stage gaussian filtering operation. As will be appreciated by those skilled in the art, a separable filter operation is an operation that decomposes a multi-dimensional operation into a series of smaller-dimensional filter operations. For example, the separable two-dimensional gaussian filtering operation may be implemented by: performing an initial stage in which a one-dimensional gaussian filter operation (as shown in fig. 2) is performed horizontally on each row of the two-dimensional array of values; the latter stage is then performed, wherein the same one-dimensional gaussian filtering operation is performed vertically along each column of the horizontally filtered two-dimensional array of values, or vice versa (i.e., performed vertically, then horizontally). Separable filtering operations are typically used to process images (e.g., two-dimensional arrays of pixel values). For example, separable filtering operations may be used to blur an image.
Gaussian filtering operations are described herein as examples of operations that may be performed on an array of values (e.g., a one-dimensional or multi-dimensional array of values). It should be appreciated that the principles described herein may be applied to other types of operations that may be performed on one-dimensional or multi-dimensional arrays of values, such as box filter operations, convolution operations, fast integral computation operations, or any other suitable type of operation. As with the one-dimensional Gaussian filter operation, the one-dimensional box filter operation may be performed in one stage as a one-dimensional operation on a one-dimensional array of values, or in multiple stages, in order to achieve separable operations on a multi-dimensional array of values. In a one-dimensional box filter operation, values may be filtered according to a one-dimensional kernel of values that includes the value to be filtered and one or more other values positioned on one or both sides of the value. The filtered output for the value to be filtered may be determined by averaging the values in the kernel. The skilled artisan is well aware of box filter operations and, therefore, for brevity, box filter operations will not be discussed further herein. In a convolution operation, an array of activation values is convolved with (e.g., filtered by) an array of coefficients (e.g., filter weights), e.g., to implement a so-called "convolution layer" of a neural network. The skilled artisan is well aware of convolution operations and, therefore, for simplicity, convolution operations will not be discussed herein. The fast integral calculation operation will be discussed in further detail below.
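For illustration only (not from the patent), a one-dimensional box filter of radius R is simply the mean of the values in the kernel:
```cuda
__host__ __device__ inline float box_filter_1d(const float *seq, int i, int R)
{
    float sum = 0.0f;
    for (int d = -R; d <= R; ++d)
        sum += seq[i + d];               // the caller guarantees i +/- R is in range
    return sum / (float)(2 * R + 1);     // plain average of the 2R+1 values in the kernel
}
```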
In a simple approach, the phase of separable two-dimensional gaussian filtering operations on an array of values may be performed by assigning each value of the array of values to be filtered to a thread for processing. As an illustrative example, consider a separable two-dimensional gaussian filter operation to be performed on a two-dimensional array of values comprising 1024 values. In this example, to perform an initial (e.g., horizontal) phase of the operation, each of the 1024 values may be assigned to a respective thread of the 1024 threads. To filter each value, the processing logic 104 that processes each thread may read all values included in the value in the one-dimensional filter kernel used to filter the value (i.e., the value to be filtered and one or more values on one or both sides of the value) from the global memory 108 into one or more registers accessible to the thread. The processing logic 104 that processes each thread may then generate a filtered value for its value by performing a weighted summation on the values in the kernel in the manner described above, and write the filtered value back to the global memory 108, such that the global memory 108 stores the once-filtered (e.g., horizontally filtered) value for each value of the two-dimensional array of values. To perform the latter (e.g., vertical) phase of the operation, each of the 1024 once-filtered values may be assigned to a respective thread of the 1024 threads. To further filter each once-filtered value, processing logic 104 that processes each thread may read all once-filtered values (i.e., the once-filtered value to be further filtered and one or more once-filtered values on one or both sides of the value) included in the once-filtered values in the one-dimensional filter kernel used to further filter the once-filtered value from global memory 108 into one or more registers accessible to the thread. The one-dimensional filter kernel used in the latter (e.g. vertical) stage is perpendicular to the one-dimensional filter kernel used in the initial (e.g. horizontal) stage. The processing logic 104 that processes each thread may then generate a further filtered value for its once filtered value by performing a weighted summation on the values in the kernel in the manner described above, and write the further filtered value back to the global memory 108 such that the global memory 108 stores the further filtered (e.g., horizontally filtered and vertically filtered) values for each value of the two-dimensional array of values. Performing separable two-dimensional gaussian filtering operations in this manner is relatively slow because it involves performing two sets of reads and two sets of writes to global memory 108, each of which has a greater associated latency than reads from/writes to local memory 106, as described above. Furthermore, each value is read from memory multiple times. This is because, in each stage, each value is read from memory (e.g., global memory 108) into one or more registers accessible to each thread by processing logic 104 that processes each thread that is assigned to filter the value, and each value is also read from memory (e.g., cache memory to which the value is written after reading from global memory 108) into one or more registers accessible to each of those threads by processing logic 104 that processes each of the threads that is assigned to filter other values using a filter kernel that includes the value. 
In addition, this simple approach does not take advantage of the ability of threads within the same workgroup to share access to local memory 106 of processing logic 104 dedicated to processing those threads during their processing.
A computer-implemented method of performing operations on an array of values at processing unit 100 is described herein with reference to graphics processing unit 100 shown in fig. 1A and the flowchart shown in fig. 10. This approach may solve one or more problems with a simple approach as described in the previous paragraph. Hereinafter, the method is described mainly with reference to an illustrative example, in which the value array is a two-dimensional pixel value array, and the operation is a separable two-dimensional gaussian filtering operation.
In this illustrative example, the input for the method shown in fig. 10 is a two-dimensional array of pixel values, such as a two-dimensional image or a portion of a two-dimensional image (e.g., a tile or block). In an example, the two-dimensional image may represent: a still image; one frame of video; a computer-generated two-dimensional representation of a three-dimensional scene (e.g., an image such as rendered using path or ray tracing); or any other suitable type of image. The pixel values may represent one or more characteristics of their corresponding pixels within the image. For example, the pixel value may represent one or more of the following: brightness (luma), luminance (luma), chrominance (chrominance), brightness, luminance, hue, saturation, chroma, any color component (e.g., red, green, or blue components), or any other suitable characteristic of its corresponding pixel. In other examples, pixel values may represent characteristics associated with their respective pixels, such as characteristics indicated in a depth map, a normal map, or a surface texture map (e.g., an albedo map), or any combination of these maps at their respective pixel locations (e.g., an element-by-element product). The input may be stored in global memory 108. As will be appreciated by those skilled in the art, this input may be written to global memory 108 by an application (e.g., a process) running at a computer system (not shown in FIG. 1A) implementing graphics processing unit 100.
In some examples where the received two-dimensional image is sufficiently small, one thread bundle of threads may be able to perform the desired operation for all of the pixel values of the two-dimensional image. This means that one processing element 102 of the processing unit 100 can perform the desired operation on the two-dimensional image by processing the threads of the thread bundle in parallel. By way of non-limiting example, a thread bundle may include up to 128 threads, each capable of completing a desired operation of up to eight pixel values. In this case, if the received two-dimensional image includes 1024 or less pixel values, the received two-dimensional image may be regarded as sufficiently small. In these examples, the entire two-dimensional image may be input as a two-dimensional array of pixel values to a method as shown in fig. 10.
In other examples, the received two-dimensional image may optionally be divided into a plurality of overlapping tiles. FIG. 3 illustrates an example two-dimensional image 300 divided into a plurality of overlapping tiles A through P. The overlap between tile a and tile B is labeled 302. The overlap between tile A and tile E is labeled 304. Each tile of the plurality of overlapping tiles a-P includes a respective two-dimensional array of pixel values. A tile may be sized such that one thread bundle of threads is able to perform a desired operation for all of the pixel values included in the tile. This means that the processing elements 102 of the processing unit 100 may perform the desired operation on one of the overlapping tiles by processing the threads of the thread bundles in parallel. By way of non-limiting example, a thread bundle may include up to 128 threads, each capable of completing a desired operation of up to eight pixel values. In this case, the received two-dimensional image may be divided into a plurality of overlapping tiles, each tile including 1024 or less pixel values (e.g., each tile having a pixel size of 32 x 32 or less). In these examples, the two-dimensional array of pixel values for each overlapping tile may be input to the method shown in fig. 10. According to the method illustrated in fig. 10, the respective processing elements 102 of the processing unit 100 may independently process a two-dimensional array of pixel values for each suitably sized overlapping tile.
Preferably, but optionally, the width of the overlap between overlapping tiles is greater than or equal to twice the radius r of the filter kernel to be used in the filtering operation performed on each of those overlapping tiles. For example, in fig. 2, the filter kernel has a radius r of three values, and thus the width of the overlap between overlapping tiles to perform the operation shown in fig. 2 should preferably be six or more values. This is because, as described herein, the number of values at the beginning and end of a sequence of values to which the "edge condition technique" is applied during filtering is equal to the radius r of the filter kernel. Thus, the width of the overlap between overlapping tiles is preferably greater than or equal to twice the radius r of the filter kernel, so that the filtered image output by the method need not include any pixel values in the overlapping region of tiles that have been filtered using techniques that address the edge condition that occurs at the edges of the tiles. That is, continuing with the example that begins earlier in this paragraph, when overlapping tiles A and B are recombined after filtering to form a filtered image, the three values at the end of each row of values in tile A to which the "edge condition technique" was applied during filtering may be replaced by the fourth, fifth, and sixth values of each row of values in tile B, which are filtered without using the "edge condition technique". Similarly, the three values at the beginning of each row of values in tile B to which the "edge condition technique" is applied during filtering may be replaced by the fourth-, fifth- and sixth-from-last values of each row of values in tile A, which are filtered without using the "edge condition technique".
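By way of illustration only, a host-side sketch of dividing an image into overlapping tiles with an overlap of 2r follows. The function name, the Tile structure and the clipping of the edge tiles are assumptions rather than anything specified in the patent.
```cuda
#include <vector>

struct Tile { int x0, y0, w, h; };

std::vector<Tile> make_overlapping_tiles(int imgW, int imgH, int tile, int r)
{
    std::vector<Tile> tiles;
    int step = tile - 2 * r;                           // tiles advance by tile size minus the overlap
    for (int y = 0; y + 2 * r < imgH; y += step)
        for (int x = 0; x + 2 * r < imgW; x += step) {
            Tile t;
            t.x0 = x;
            t.y0 = y;
            t.w  = (x + tile <= imgW) ? tile : imgW - x;   // clip the last column of tiles
            t.h  = (y + tile <= imgH) ? tile : imgH - y;   // clip the last row of tiles
            tiles.push_back(t);
        }
    return tiles;
}
// e.g. make_overlapping_tiles(width, height, 32, 3) gives 32x32 tiles that
// overlap their neighbours by 6 values, matching the radius-3 example above.
```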
Returning to FIG. 10, the array of values includes one or more one-dimensional sequences of values. For example, a one-dimensional array of values may comprise a single one-dimensional sequence of values (e.g., a single row of values or a single column of values). As in the illustrative example, the two-dimensional array of values may include a plurality of horizontal sequences of values (e.g., rows) and a plurality of vertical sequences of values (e.g., columns).
In the illustrative example, each stage of the separable two-dimensional gaussian filter operation involves performing a one-dimensional gaussian filter operation. To perform the phase of the operation, in step S1002, for each one of one or more one-dimensional sequences of values of the array of values, a respective value segment of the one-dimensional sequence of values is assigned to each of the plurality of threads. The value segment may be a non-overlapping value segment. This step can be understood with reference to fig. 4 and 5, fig. 4 and 5 showing a step S1002 performed for each of one or more horizontal value sequences (e.g., rows) of a two-dimensional array of values. It should be understood that step S1002 may alternatively be performed for each of one or more vertical sequences of values (e.g., columns) of a two-dimensional array of values.
FIG. 4 illustrates an example allocation of value segments to multiple threads. In fig. 4, each one-dimensional sequence of values of the array of values 400 is a row of values of the array of values 400. In fig. 4, the value array includes 32 rows of values. Each row includes 32 values. In the examples described herein, each thread may be capable of completing the desired operation for up to eight values. This limitation may be caused by the amount of register memory in the register bank 110 that is accessible to each thread (e.g., dedicated to that thread), and it should be appreciated that in other examples, each thread may be able to complete the desired operation for a different number of values (e.g., any suitable number of values up to 100, or even more in some examples). In this example, each row of values is divided into four non-overlapping value segments, each value segment comprising eight values. Each of the value segments is assigned to a thread. In fig. 4, the two-dimensional value array is divided into 128 value segments (i.e., 32 rows by 4 segments per row), each value segment being assigned to a respective one of the 128 threads, labeled T1 through T128 in fig. 4. As described herein, in an example, a thread bundle may include up to 128 threads. Thus, in this example, one processing element 102 of processing unit 100 may process, in parallel, the 128 threads to which the values of the array of values have been assigned. That is, one thread bundle to be processed by the processing element 102 of the processing unit 100 may include the 128 threads (i.e., T1 through T128 shown in fig. 4).
Fig. 5 shows a one-dimensional sequence of values 500 in further detail. Specifically, FIG. 5 shows a first row of values 500 of the array of values 400. The row of values 500 has been divided into four sections, each section comprising eight values. Those value segments have been assigned to threads T1, T9, T17 and T25, respectively. The value segment assigned to thread T1 is adjacent to the value segment assigned to thread T9. The value segment assigned to thread T9 is adjacent to the value segment assigned to thread T1 and is also adjacent to the value segment assigned to thread T17. The value segment assigned to thread T17 is adjacent to the value segment assigned to thread T9 and is also adjacent to the value segment assigned to thread T25. The value segment assigned to thread T25 is adjacent to the value segment assigned to thread T17.
It should be understood that more than one value segment may alternatively be assigned to each thread. That is, a thread may be assigned a segment of values from each of more than one-dimensional sequence of values. For example, a thread may be able to complete a desired operation of up to eight values, and may be assigned two value segments from different one-dimensional sequences of values (e.g., rows), each value segment comprising four values. The one-dimensional sequence of values (e.g. rows) may be contiguous within the two-dimensional array of values, but this need not be the case.
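For illustration, the allocation of fig. 4 amounts to giving each of 128 threads one eight-value segment of one row. The sketch below uses a simple row-major thread-to-segment mapping purely to show the idea; the actual numbering in fig. 4 (where T1, T9, T17 and T25 share the first row) interleaves threads across rows, so the mapping here is an assumption, not the figure's exact scheme.
```cuda
#define ROW_LEN      32
#define SEG_LEN      8
#define SEGS_PER_ROW (ROW_LEN / SEG_LEN)   // 4 segments per row

// 0-based thread index t -> the row it works on and the first value of its segment.
__device__ inline void segment_for_thread(int t, int *row, int *first_col)
{
    *row       = t / SEGS_PER_ROW;
    *first_col = (t % SEGS_PER_ROW) * SEG_LEN;
}
// e.g. thread 0 -> row 0, values 0..7; thread 1 -> row 0, values 8..15; thread 4 -> row 1, values 0..7.
```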
In the methods described herein, each thread will complete the stage of the operation for each of the values of the value segment to which that thread has been assigned. To achieve this, first, each thread causes each of the values included in the value segment assigned to that thread to be read from global memory 108 into one or more registers in register bank 110 dedicated to that thread (e.g., "registers for that thread"). That is, referring to FIG. 5, thread T1 reads all eight values of its allocated segment from global memory 108 into its registers, thread T9 reads all eight values of its allocated segment from global memory 108 into its registers, thread T17 reads all eight values of its allocated segment from global memory 108 into its registers, and thread T25 reads all eight values of its allocated segment from global memory 108 into its registers. Where a thread is described herein as reading from/writing to memory, it will be appreciated that it is the thread that causes the processing logic 104 that processes the thread to perform the read from/write to the memory.
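To make the segment-to-thread allocation of step S1002 concrete, the following CUDA-style sketch shows one way a 128-thread bundle could perform the loads just described. It is an illustration only, not the implementation of the described processing unit: the names (ARRAY_DIM, SEG_LEN, load_segments), the row-major storage assumption and the index arithmetic (chosen to be consistent with the T1/T9/T17/T25 labelling of fig. 4, though it is only one possible numbering) are all assumptions of the sketch.

```cuda
// Sketch only: one way a 128-thread bundle could perform step S1002 for a 32x32 array.
// Assumptions: row-major storage in global memory 108; the thread labelling of fig. 4
// (e.g. T1, T9, T17, T25 share the first row) is reproduced by the index arithmetic below.
#define ARRAY_DIM 32          // 32 rows x 32 values per row
#define SEG_LEN   8           // eight values per value segment

__global__ void load_segments(const float* __restrict__ g_values /* global memory 108 */)
{
    int tid = threadIdx.x;                        // 0..127, i.e. T1..T128
    int row = (tid % 8) + 8 * (tid / 32);         // T1..T8 -> segment 0 of rows 1..8, etc.
    int seg = (tid % 32) / 8;                     // which horizontal segment of the row
    int base = row * ARRAY_DIM + seg * SEG_LEN;

    float r[SEG_LEN];                             // "registers of the thread" (register bank 110)
    for (int i = 0; i < SEG_LEN; ++i)
        r[i] = g_values[base + i];                // the thread reads all eight values of its segment

    // ... the thread can now complete the stage for r[0..7] without further global reads
    (void)r;
}
```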
As described herein, in an illustrative example, each stage of the separable two-dimensional gaussian filter operation involves performing a one-dimensional gaussian filter operation. As shown in fig. 2, the one-dimensional gaussian filtering operation uses a filter kernel that includes seven values, including the value 204 to be filtered and three values on either side of the value 204. Fig. 6 shows an example of the filter kernel overlaid on some values of the one-dimensional sequence of values shown in fig. 5. As will be appreciated with reference to fig. 6, for some of the values of the value segment assigned to each thread, the thread may be able to complete a one-dimensional gaussian filter operation for that value using only the values of the value segment that have been read into one or more registers accessible to the thread (e.g., the "registers of the thread"). For example, in FIG. 6, thread T1 may independently complete a one-dimensional Gaussian filter operation for value 604-1 because all of the values within filter kernel 602-1 have been read into the registers of the thread. In other words, as described herein, thread T1 may independently calculate a weighted sum of all of the values within filter kernel 602-1, since the thread has access to all of those values within the thread's registers. By contrast, thread T17 cannot independently complete a one-dimensional gaussian filter operation for value 604-2 because only a subset of the values within filter kernel 602-2 (i.e., four of the values) have been read into one or more registers accessible to thread T17. The other three values in filter kernel 602-2 have been read into one or more registers accessible to thread T9. In other words, as described herein, neither thread T9 nor thread T17 may independently calculate a weighted sum of all of the values within filter kernel 602-2, as neither of those threads has access to all of those values within its registers. Thus, in this example, threads T9 and T17 must cooperate to complete a one-dimensional Gaussian filter operation for value 604-2. The manner in which the two threads cooperate to complete a one-dimensional gaussian filter operation for certain values, such as value 604-2, is described below with reference to fig. 7a to 7c and fig. 10.
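As an illustration of the weighted sum just described, the sketch below shows a radius-3 (seven-tap) one-dimensional filter applied to a value whose three neighbours on each side are all available to the thread, as is the case for value 604-1 and thread T1. The weight array w and the function name filter_at are assumptions of this sketch and are not part of the described method.

```cuda
// Sketch of the seven-tap weighted sum used in each stage of the illustrative example.
// Assumption: vals[centre-3] .. vals[centre+3] are all accessible to the calling thread,
// and w[0..6] holds the (unspecified) Gaussian weights, w[3] being the centre weight.
__device__ float filter_at(const float* vals, int centre, const float w[7])
{
    float acc = 0.0f;
    for (int k = -3; k <= 3; ++k)
        acc += w[k + 3] * vals[centre + k];   // weighted sum over the seven-value window
    return acc;
}
```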
The steps of stage 1000 of the operation are shown in fig. 10. In step S1004, a first thread of the plurality of threads determines at least one contribution from a value segment assigned to the first thread to a stage of an operation to be completed by a second thread of the plurality of threads for an adjacent value segment of the one-dimensional sequence of values. In a first example, in step S1004, a first thread of the plurality of threads may perform at least a portion of a phase of an operation on at least one set of one or more values assigned to a value segment of the first thread to determine at least one contribution from the at least one set of one or more values to a phase of an operation to be completed by a second thread of the plurality of threads for an adjacent value segment of the one-dimensional sequence of values. In a second example, in step S1004, a first thread of the plurality of threads may determine one or more values assigned to a value segment of the first thread as at least one contribution from the value segment assigned to the first thread to a stage of an operation to be completed by a second thread of the plurality of threads for an adjacent value segment of the one-dimensional sequence of values.
The first example of step S1004 may be understood with reference to fig. 7a to 7c, which show example contributions from a set of one or more values of a value segment to a stage of a separable gaussian filter operation to be completed for an adjacent value segment. In fig. 7a to 7c, thread T9 is a "first thread", and thread T17 is a "second thread".
FIG. 7a shows how threads T9 and T17 from FIG. 6 may cooperate to complete the phase of the separable Gaussian filter operation for value 604-2. In fig. 7a, in a first example of step S1004, thread T9 performs a portion of a one-dimensional gaussian filter operation on a set of three values (e.g., the 6 th, 7 th, and 8 th values) of the value segment to which the thread has been assigned in order to determine contribution 708a. This portion of the gaussian filtering operation performed on the three values involves performing a weighted sum of the three values. Respective weights for each of those three values are determined from a portion 700a of a gaussian function centered around the value 604-2 to be filtered.
FIG. 7b shows how threads T9 and T17 may cooperate to complete the phase of the separable Gaussian filter operation for value 704-1. In fig. 7b, in a first example of step S1004, thread T9 performs a portion of a gaussian filter operation on the set of two values (e.g., the 7 th and 8 th values) of the value segment to which the thread has been assigned in order to determine contribution 708b. Performing this portion of the gaussian filter operation on the two values involves performing a weighted sum on the two values. Respective weights for each of those two values are determined from a portion 700b of a gaussian function centered around the value 704-1 to be filtered.
FIG. 7c shows how threads T9 and T17 may cooperate to complete the phase of the separable Gaussian filter operation for value 704-2. In fig. 7c, in a first example of step S1004, thread T9 performs a portion of a gaussian filter operation on a set of one value (e.g., the 8 th value) of the value segment to which the thread has been assigned in order to determine contribution 708c. Performing this portion of the gaussian filtering operation on this value involves weighting the value. The weight by which that value is multiplied is determined from a portion 700c of a gaussian function centered around the value 704-2 to be filtered.
Thus, in the first example of step S1004 of the illustrative example, thread T9 determines three contributions 708a, 708b, and 708c from three sets of one or more values (e.g., (i) the 6 th, 7 th, and 8 th values, (ii) the 7 th and 8 th values, and (iii) the 8 th value) of the value segment to which the thread has been assigned to the stage of the operation to be completed by thread T17 for an adjacent value segment of the one-dimensional sequence of values. In this illustrative example, the number of contributions determined by the first thread to the stage of the operation to be completed by the second thread in the first example of step S1004 may be equal to the radius r of the filter kernel.
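The three partial weighted sums of fig. 7a to 7c can be sketched as follows, assuming the thread's eight segment values are already in registers r[0..7] and that w[0..6] are the radius-3 kernel weights as in the earlier sketch; the function and array names are illustrative rather than part of the described method.

```cuda
// Sketch of the "first thread" side of the first example of step S1004: thread T9
// computes the three contributions 708a, 708b and 708c from the tail of its segment.
__device__ void right_contributions(const float r[8], const float w[7], float contrib[3])
{
    contrib[0] = w[0] * r[5] + w[1] * r[6] + w[2] * r[7]; // fig. 7a: 6 th, 7 th and 8 th values
    contrib[1] = w[0] * r[6] + w[1] * r[7];               // fig. 7b: 7 th and 8 th values
    contrib[2] = w[0] * r[7];                             // fig. 7c: 8 th value only
}
```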
It should be understood that in the first example of step S1004 of the illustrative example, in a similar manner as described with reference to fig. 7a to 7 c: thread T9 may also determine three contributions from three sets of one or more values (e.g., (i) 1 st value, (ii) 1 st and 2 nd values, and (iii) 1 st, 2 nd, and 3 rd values) of the value segment to which the thread has been assigned to (as shown in fig. 6) the stage of the operation to be completed by thread T1 (not shown in fig. 7 a-7 c); thread T17 may determine three contributions from three sets of one or more values (e.g., (i) 1 st value, (ii) 1 st and 2 nd values, and (iii) 1 st, 2 nd, and 3 rd values) of the value segment to which the thread has been assigned to the stage of the operation to be completed by thread T9 (not shown in fig. 7 a-7 c); and thread T17 may determine three contributions from three sets of one or more values (e.g., (i) 6 th, 7 th, and 8 th values, (ii) 7 th and 8 th values, and (iii) 8 th values) of the value segment to which the thread has been assigned to (as shown in fig. 6) the stage of the operation to be completed by thread T25 (not shown in fig. 7 a-7 c). In other words, each of the threads of the plurality of threads (e.g., each of the threads shown in fig. 4) may determine at least one contribution to a stage of an operation to be completed by at least one other thread by performing the action of the "first thread" described with reference to the first example of step S1004 of fig. 10.
The second example of step S1004 can also be understood with reference to fig. 7a to 7 c. As will be appreciated with reference to fig. 7a to 7 c: the 6 th, 7 th and 8 th values assigned to the value segment of thread T9 will contribute to the operation to be performed by thread T17 for value 604-2 (see FIG. 7 a); the 7 th and 8 th values assigned to the value segment of thread T9 will contribute to the operation to be performed by thread T17 for value 704-1 (see FIG. 7 b); and the 8 th value of the value segment assigned to thread T9 will contribute to the operation to be performed by thread T17 for value 704-2 (see fig. 7 c). Thus, in the second example of step S1004, the 6 th, 7 th, and 8 th values assigned to the value segment of thread T9 may be determined as contributions to the phase of the operation to be completed by thread T17 for the adjacent value segment. That is, in the second example, the contribution is an unprocessed value. In other words, in the second example, the thread T9 does not perform at least part of the phase of the operation in step S1004, but rather identifies one or more of the unprocessed values of the value segment to which the thread has been assigned as a contribution to the phase of the operation to be completed by the thread T17 for the adjacent value segment. The thread T9 may identify the number of values within the value segment to which it has been assigned that will contribute to the stage of the operation to be completed by the thread T17 based on the number of values on the "left hand side" of the values to be filtered in the filter kernel to be used during the operation (e.g., three values in this example). In a similar manner, reference is made to fig. 6: the 6 th, 7 th, and 8 th values assigned to the value segment of thread T1 may be determined as contributions to the phase of the operation to be completed by thread T9; the 1 st, 2 nd, and 3 rd values assigned to the value segment of thread T9 may be determined as contributions to the phase of the operation to be completed by thread T1; the 1 st, 2 nd, and 3 rd values assigned to the value segment of thread T17 may be determined as contributions to the phase of the operation to be completed by thread T9; the 6 th, 7 th, and 8 th values assigned to the value segment of thread T17 may be determined as contributions to the phase of the operation to be completed by thread T25; and the 1 st, 2 nd, and 3 rd values assigned to the value segment of thread T25 may be determined as contributions to the phase of the operation to be completed by thread T17. In other words, each of the threads of the plurality of threads (e.g., each of the threads shown in fig. 4) may determine at least one contribution to a stage of an operation to be completed by at least one other thread by performing the action of the "first thread" described with reference to the second example of step S1004 of fig. 10.
In step S1006, the first thread writes the at least one contribution that the thread has determined to memory. Specifically, the first thread may write at least one contribution to the local memory 106. For example, referring to fig. 7 a-7 c, in a first example, thread T9 writes contributions 708a, 708b, and 708c to local memory 106. In a second example, thread T9 writes the 6 th, 7 th, and 8 th values of the value segment to which the thread has been assigned to local memory 106. In step S1006, each of the other threads in the plurality of threads may also write any contribution determined by the thread in step S1004 to the local memory 106. As described herein, multiple threads within a thread bundle may share access to local memory 106.
Incidentally, it may be advantageous that the radius r of the filter kernel used in each stage of the operation is less than or equal to half the number of values in each value section. This may be advantageous because it means that the number of contributions generated by a thread in step S1004 does not exceed the number of values in the value segment allocated to that thread—this may limit the number of memory locations required to store contributions in the local memory 106 to not exceed the number of memory locations required to store values of the value array in the local memory 106. As described herein, each local memory 106 may have a smaller storage capacity, e.g., 60kB (kilobytes), relative to the storage capacity of the memory 108, and thus, it may be advantageous to limit the number of memory locations in the local memory 106 required to store the contribution.
In step S1008, the second thread reads from memory the at least one contribution determined by the first thread. Specifically, the second thread may read at least one contribution from the local memory 106. As described herein, multiple threads within a thread bundle may share access to local memory 106. The second thread may cause the at least one contribution to be read into one or more registers accessible to the second thread such that the registers include the at least one contribution and values assigned to adjacent value segments of the second thread. For example, referring to fig. 7 a-7 c, in a first example, thread T17 reads contributions 708a, 708b, and 708c from local memory 106 into the registers of the thread. In a second example, thread T17 reads the 6 th, 7 th, and 8 th values assigned to the value segment of thread T9 from local memory 106 into the thread's registers. In step S1008, each of the other threads in the plurality of threads may also read any contribution determined by the thread to which its adjacent value segment is assigned from the local memory 106 into the registers of that thread.
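Steps S1006 and S1008 amount to publishing the contributions in the memory shared by the thread bundle and then reading a neighbour's contributions back into registers. A minimal sketch follows, assuming that the local memory 106 can be treated like CUDA shared memory, that each thread stores at most three contributions, and that a block-wide barrier is an acceptable way to order the writes before the reads; all of these are assumptions of the sketch, not statements about the described hardware.

```cuda
// Sketch of steps S1006 (write contributions) and S1008 (read the left neighbour's
// contributions). The 128x3 layout and the barrier are illustrative choices only.
__device__ void exchange_contributions(int tid, int left_tid,
                                       const float my_contrib[3], float from_left[3])
{
    __shared__ float s_contrib[128][3];          // stands in for local memory 106

    for (int i = 0; i < 3; ++i)
        s_contrib[tid][i] = my_contrib[i];       // S1006: acting as the "first thread"
    __syncthreads();                             // make all writes visible to the bundle

    if (left_tid >= 0)                           // S1008: acting as the "second thread"
        for (int i = 0; i < 3; ++i)
            from_left[i] = s_contrib[left_tid][i];
}
```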
In step S1010, the second thread completes a phase of an operation on adjacent value segments allocated to the second thread according to at least one contribution read from memory, so as to generate a processed value segment.
In a first example, in step S1010, the stage of completing the operation for the adjacent value segment may include: the second thread performing at least a portion of a phase of the operation on at least one set of one or more values assigned to adjacent value segments of the second thread to determine at least one contribution from each of the one or more values of the adjacent value segments to the phase of the operation; and combining the at least one contribution determined by the second thread for the adjacent value segment with at least one contribution read from memory by the second thread. This first example of step S1010 can be understood with reference to fig. 7a to 7c, wherein thread T9 is a "first thread" and thread T17 is a "second thread".
In the first example of step S1010, in fig. 7a, thread T17 performs a portion of a one-dimensional gaussian filter operation on the set of four values (e.g., 1 st, 2 nd, 3 rd, and 4 th values) of the value segment to which the thread has been assigned to determine contribution 712a, the value segment being adjacent to the value segment assigned to thread T9. Performing this portion of the gaussian filter operation on these four values involves performing a weighted sum on these four values. Respective weights for each of those four values are determined from a portion 710a of a gaussian function centered around the value 604-2 to be filtered. Subsequently, to complete the stage of the Gaussian filter operation to generate the processed value for the value 604-2, the thread T17 combines (e.g., sums) the contribution 708a read by the thread from memory with the contribution 712a determined by the thread.
In fig. 7b, in a first example of step S1010, thread T17 performs a portion of a gaussian filter operation on a set of five values (e.g., 1 st, 2 nd, 3 rd, 4 th, and 5 th values) of the value segment to which the thread has been assigned to determine contribution 712b, the value segment being adjacent to the value segment assigned to thread T9. Performing this portion of the gaussian filter operation on these five values involves performing a weighted sum on these five values. Respective weights for each of those five values are determined from a portion 710b of a gaussian function centered around the value 704-1 to be filtered. Subsequently, to complete the stage of the Gaussian filter operation to generate a processed value for the value 704-1, the thread T17 combines (e.g., sums) the contribution 708b read by the thread from memory with the contribution 712b determined by the thread.
In fig. 7c, in a first example of step S1010, thread T17 performs a portion of a gaussian filter operation on a set of six values (e.g., 1 st, 2 nd, 3 rd, 4 th, 5 th, and 6 th values) of the value segment to which the thread has been assigned to determine contribution 712c, the value segment being adjacent to the value segment assigned to thread T9. Performing this portion of the gaussian filter operation on the six values involves performing a weighted sum on the six values. Respective weights for each of those six values are determined from a portion 710c of a gaussian function centered around the value 704-2 to be filtered. Subsequently, to complete the stage of the gaussian filter operation to generate a processed value for value 704-2, thread T17 combines (e.g., sums) contribution 708c read by the thread from memory with contribution 712c determined by the thread.
In a first example, in step S1010, thread T17 may independently complete (e.g., without cooperating with another thread) the stage of the gaussian filter operation to generate respective processed values for each of the 4 th and 5 th values in the value segment to which the thread has been assigned, because thread T17 has access, within the registers of the thread, to all of the values within the filter kernel that is used to filter the 4 th and 5 th values, in a similar manner to thread T1 and the value 604-1 described with reference to fig. 6. Thread T17 may complete the stage of the gaussian filter operation to generate respective processed values for each of the 6 th, 7 th and 8 th values in the value segment to which the thread has been assigned by cooperating with thread T25, in a similar manner to the cooperation of the thread with thread T9 described with reference to fig. 7a to 7c. Thus, using the methods described herein, in a first example, thread T17 may complete the stage of the gaussian filter operation for each of the values of the value segment to which the thread has been assigned in order to generate a processed value segment.
In a second example, in step S1010, completing the stage of the operation for the adjacent value segment may include: the second thread performing the stage of the operation on the values of the adjacent value segment assigned to the second thread, using the values of that adjacent value segment and the at least one contribution read from the memory, so as to generate a processed value segment. That is, in the second example, in step S1010, each thread will have access, within its registers, to all of the values in the filter kernel that filters each of the values within the value segment assigned to that thread. For example, referring to fig. 6, after performing step S1008, thread T17 will have access to the following within the registers of that thread: the 6 th, 7 th and 8 th values of the value segment assigned to thread T9; each of the values of the value segment assigned to thread T17; and the 1 st, 2 nd and 3 rd values of the value segment assigned to thread T25. Thus, in the second example of step S1010, the thread T17 may independently complete the stage of the gaussian filter operation to generate a respective processed value for each of the values in the value segment to which the thread has been assigned. In a similar manner, in the second example of step S1010, each of the threads of the plurality of threads (e.g., each of the threads shown in fig. 4) may independently complete the stage of the gaussian filter operation to generate a respective processed value for each of the values in the value segment to which the thread has been assigned.
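The second example, in which raw boundary values rather than partial sums are exchanged, can be sketched as below. The halo layout in shared memory, the use of neighbour offsets of plus or minus 8 (which follow the fig. 6 labelling, where T9 lies immediately to the left of T17 and T25 immediately to its right), and the omission of edge handling for the first and last segments of a row are all simplifications of this sketch and not statements about the described method.

```cuda
// Sketch of the second example of steps S1004-S1010: each thread publishes its three
// left-most and three right-most raw values, then filters its own eight values
// independently using a 14-value window. Edge segments would additionally need the
// boundary handling discussed for the filter-kernel edge condition.
__device__ void filter_segment_independently(int tid, const float r[8], const float w[7],
                                             float out[8])
{
    __shared__ float s_halo[128][6];             // stands in for local memory 106

    s_halo[tid][0] = r[0]; s_halo[tid][1] = r[1]; s_halo[tid][2] = r[2];  // left three values
    s_halo[tid][3] = r[5]; s_halo[tid][4] = r[6]; s_halo[tid][5] = r[7];  // right three values
    __syncthreads();

    // Gather: 3 values from the left neighbour, the 8 segment values, 3 from the right.
    // With the fig. 6 labelling, the left/right neighbours of an interior segment are
    // tid - 8 and tid + 8 (e.g. T9 and T25 either side of T17); edges are not handled here.
    float win[14];
    const float* left  = s_halo[tid - 8];
    const float* right = s_halo[tid + 8];
    win[0] = left[3]; win[1] = left[4]; win[2] = left[5];
    for (int i = 0; i < 8; ++i) win[3 + i] = r[i];
    win[11] = right[0]; win[12] = right[1]; win[13] = right[2];

    for (int i = 0; i < 8; ++i) {                // seven-tap weighted sum for each value
        float acc = 0.0f;
        for (int k = 0; k < 7; ++k)
            acc += w[k] * win[i + k];
        out[i] = acc;
    }
}
```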
It should be appreciated that, in order to perform stage 1000 of an operation, each of the threads of the plurality of threads (e.g., each of the threads shown in fig. 4) may perform an action of a "first thread" in steps S1004 through S1006 to determine at least one contribution to a stage of the operation to be completed by at least one other thread, and then perform an action of a "second thread" in steps S1008 through S1010 to complete the stage of the operation in accordance with the at least one contribution determined by the at least one other thread.
In step S1014, it is determined whether the operation is completed. In an example where the value array to be operated on is a one-dimensional value array (e.g., a sequence of audio samples of an audio signal, or a sequence of signal samples of a transmission signal) and the phase 1000 of the operation performed in steps S1004 to S1010 is a one-dimensional operation (e.g., a one-dimensional gaussian filter operation), it may be determined that the operation is completed after completing the single phase 1000 of steps S1004 to S1010. In this case, the second thread may write the processing value segment that the second thread has generated to the global memory 108. Each of the plurality of threads (e.g., each of the threads shown in fig. 4) may write a segment of processing values that the thread has generated to global memory 108 such that the processing value corresponding to each value of the array of values for which an operation is to be performed is written to global memory 108. The processed value written to global memory 108 corresponding to each value of the array of values may be the output of the method.
In the illustrative example where the array of values to be operated on is a two-dimensional array of pixel values, the operation is a separable two-dimensional gaussian filter operation, and the stage 1000 of the operation performed in steps S1004 to S1010 is a one-dimensional gaussian filter operation, it may be determined that the operation is not completed after completion of a single (e.g., initial) stage 1000 of steps S1004 to S1010. In this case, the second thread may write the processing value segment that the second thread has generated to the local memory 106. Each of the plurality of threads (e.g., each of the threads shown in fig. 4) may write a segment of processing values that the thread has generated to the local memory 106 such that the processing value corresponding to each value of the array of values for which an operation is to be performed is written to the local memory 106. As described herein, multiple threads within a thread bundle may share access to local memory 106. The method then proceeds to step S1016.
To perform the latter stage of the operation, in step S1016, for each of one or more vertical one-dimensional sequences of values of the array of values, a respective plurality of processed values in memory (e.g., local memory 106), corresponding to a one-dimensional value segment of that vertical one-dimensional sequence of values of the two-dimensional array of values, is assigned to each of the plurality of threads. The value segments may be non-overlapping value segments.
This step can be appreciated with reference to fig. 8, which shows an example of the value segments to which the pluralities of processing values allocated in the later stage of the operation correspond. In the example shown in fig. 8, step S1016 has been performed for each of one or more vertical sequences of values (e.g., columns) of the two-dimensional array of values, which are perpendicular to the one or more horizontal sequences of values (e.g., rows) from which value segments were assigned to the plurality of threads in step S1002. It should be appreciated that if step S1002 has been performed for one or more vertical sequences of values (e.g., columns) of a two-dimensional array of values, step S1016 may alternatively be performed for each of one or more horizontal sequences of values (e.g., rows) of the two-dimensional array of values.
In fig. 8, each vertical one-dimensional sequence of values of the array of values 800 is a column of values of the array of values 800. In fig. 8, the value array includes 32 columns of values. Each column includes 32 values. As described herein, each thread may be capable of completing the desired operation for up to eight pixel values. This limitation may be caused by the amount of register memory that each thread may access, and it should be understood that in other examples, each thread may be able to complete the desired operation for a different number of pixel values (e.g., any suitable number of pixel values up to 100, or even more in some examples). In this example, each column of values is divided into four non-overlapping value segments, each value segment comprising eight values. The plurality of processing values stored in the local memory 106 corresponding to each of the value segments is assigned to a respective thread. In fig. 8, the two-dimensional value array is divided into 128 value segments (i.e., 32 columns x 4 segments per column), and the plurality of processing values stored in the local memory 106 corresponding to each value segment is assigned to a respective one of the 128 threads, labeled T1 through T128 in fig. 8. As described herein, in an example, a thread bundle may include up to 128 threads. Thus, in this example, one processing element 102 of processing unit 100 may process, in parallel, the 128 threads that have been assigned the processing values corresponding to the values of the value array. That is, the thread bundle to be processed by the processing element 102 of the processing unit 100 may include the 128 threads (i.e., T1 through T128 shown in FIG. 8).
Fig. 9 shows a vertical one-dimensional sequence of values 900 in further detail. Specifically, FIG. 9 shows the first column of values 900 of the array of values 800. The column of values 900 has been divided into four sections, each section comprising eight values. The plurality of processing values stored in the local memory 106 corresponding to the values of each of those value segments have been assigned to threads T1, T9, T17 and T25, respectively. The plurality of processing values assigned to thread T1 corresponds to a value segment adjacent to the value segment corresponding to the plurality of processing values assigned to thread T9. The plurality of processing values assigned to thread T9 corresponds to a value segment adjacent to the value segment corresponding to the plurality of processing values assigned to thread T1, and also adjacent to the value segment corresponding to the plurality of processing values assigned to thread T17. The plurality of processing values assigned to thread T17 corresponds to a value segment adjacent to the value segment corresponding to the plurality of processing values assigned to thread T9, and also adjacent to the value segment corresponding to the plurality of processing values assigned to thread T25. The plurality of processing values assigned to thread T25 corresponds to a value segment adjacent to the value segment corresponding to the plurality of processing values assigned to thread T17.
In the examples described herein, the same plurality of threads (e.g., threads T1 through T128 shown in fig. 4 and 8) as used in the initial phase of the operation is used in the latter phase of the operation. Nevertheless, it should be appreciated that while it is convenient to perform the allocation as shown in fig. 4 and 8 such that the number of values allocated to each thread in the latter stage of the operation is equal to the number of values allocated to each thread in the initial stage of the operation, this need not be the case.
In an alternative example, step S1016 may include assigning the processing values generated by a thread in the initial stage of the operation back to that thread for further processing in the later stage of the operation. For example, as described herein, in some examples a thread may be assigned a value segment from each of more than one one-dimensional sequence of values. Thus, in an alternative example, in step S1002, each of the value segments labeled T1 through T8 in fig. 4 may be assigned to one thread (not shown in the figure). More generally, in this alternative example, in step S1002, a thread may be assigned a number (e.g., eight) of consecutive value segments in different one-dimensional value sequences that is equal to the number of values (e.g., eight) in each value segment to which the thread is assigned. In this alternative example, during the initial phase of the operation, the thread may cooperate in accordance with the principles described herein with one or more neighboring threads "to the right of the thread" (e.g., one or more neighboring threads that have been assigned a value segment labeled T9 through T16 in fig. 4) in order to generate a plurality of processing values. In this alternative example, in step S1016, the processing values generated by the thread in the initial phase of the operation may be assigned back to the thread for further processing in the later phase of the operation (e.g., the thread may be "assigned" the pluralities of processing values corresponding to each of the vertical value segments labeled T1 through T8 in FIG. 8). In this alternative example, the processing values need not be written to the local memory 106 between the initial stage and the subsequent stage, but may be retained within one or more registers accessible to the thread between those stages. In this alternative example, during the latter phase of the operation, the thread may cooperate in accordance with the principles described herein with one or more neighboring threads "below the thread" (e.g., one or more neighboring threads that have been assigned the value segments labeled T9 through T16 in fig. 8) in order to generate a plurality of output values.
Returning to FIG. 10, in the method described herein, each thread will complete a later stage of operation for each of the values of the plurality of processing values that the thread has been assigned. To achieve this, first, each thread causes each of the processing values assigned to that thread to be read from local memory 106 into one or more registers accessible to that thread.
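The hand-over between the two stages (steps S1014, S1016 and the start of the second pass) can be sketched as a transposed access into a shared staging tile, as below. The straightforward row-major layout shown here is the simple layout; the section on bank conflicts later in this description explains why a modified layout may be preferred. The tile dimensions, names and index arithmetic are assumptions matching the 32x32 example of fig. 4 and 8, not the implementation of the described processing unit.

```cuda
// Sketch of staging the processed values in local memory 106 between the horizontal
// and vertical stages: write the row segment produced in the initial stage, then read
// the column segment of fig. 8 assigned to the same thread for the later stage.
__device__ void stage_between_passes(int tid, const float processed[8], float col_vals[8])
{
    __shared__ float s_tile[32][32];                 // stands in for local memory 106

    int row = (tid % 8) + 8 * (tid / 32);            // fig. 4 labelling (horizontal stage)
    int seg = (tid % 32) / 8;
    for (int i = 0; i < 8; ++i)
        s_tile[row][seg * 8 + i] = processed[i];     // S1014: write the processed row segment
    __syncthreads();

    int col     = (tid % 8) + 8 * (tid / 32);        // fig. 8 labelling (vertical stage)
    int col_seg = (tid % 32) / 8;
    for (int i = 0; i < 8; ++i)
        col_vals[i] = s_tile[col_seg * 8 + i][col];  // S1016: read a column segment (transposed)
}
```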
The method then proceeds to the second pass of steps S1004 to S1010 in order to perform the latter phase 1000 of the operation. In the latter stage of the operation, the same one-dimensional gaussian filter operation as that performed in the initial stage of the operation for the value section allocated to each of the plurality of threads is performed for the plurality of processing values allocated to each of the plurality of threads.
In a later stage, in step S1004, a first thread of the plurality of threads determines at least one contribution from a plurality of processing values assigned to the first thread to a later stage of an operation to be completed by a second thread of the plurality of threads for the plurality of processing values, the plurality of processing values corresponding to a one-dimensional value segment of a vertical one-dimensional value sequence of the two-dimensional value array, the one-dimensional value segment being adjacent to a one-dimensional value segment of the vertical one-dimensional value sequence corresponding to the plurality of processing values assigned to the first thread. In a first example, a first thread of the plurality of threads may perform at least a portion of a subsequent stage of the separable operation on at least one set of one or more values of the plurality of processing values assigned to the first thread to determine at least one contribution from the at least one set of one or more values to the subsequent stage of the operation to be completed by a second thread of the plurality of threads for a plurality of processing values corresponding to a one-dimensional value segment of a vertical one-dimensional value sequence of the two-dimensional value array adjacent to a one-dimensional value segment of the vertical one-dimensional value sequence corresponding to the plurality of processing values assigned to the first thread. In a second example, a first thread of the plurality of threads may determine one or more values of a plurality of processing values assigned to the first thread as at least one contribution from the plurality of processing values assigned to the first thread to a later stage of an operation to be completed by a second thread of the plurality of threads for the plurality of processing values corresponding to a one-dimensional value segment of a vertical one-dimensional value sequence of the two-dimensional value array that is adjacent to the one-dimensional value segment of the vertical one-dimensional value sequence corresponding to the plurality of processing values assigned to the first thread. As described with reference to fig. 7a to 7c, step S1004 of the latter stage is performed in a similar manner to step S1004 of the initial stage. That is, for example, in a manner similar to the manner in which the threads T9 and T17 shown in fig. 4 cooperate with each other in the initial stage of the operation, the threads T9 and T17 shown in fig. 8 cooperate with each other in the latter stage of the operation.
In a later stage, in step S1006, a first thread of the plurality of threads writes at least one contribution that the first thread has determined to memory. Specifically, a first thread of the plurality of threads may write the at least one contribution that the first thread has determined to the local memory 106. As described herein, multiple threads within a thread bundle may share access to local memory 106.
In a later stage, in step S1008, a second thread of the plurality of threads reads from memory at least one contribution determined by the first thread. Specifically, a second thread of the plurality of threads may read from the local memory 106 at least one contribution determined by the first thread. The second thread may cause the at least one contribution to be read into one or more registers accessible to the second thread such that the registers include the at least one contribution and a value assigned to the second thread.
In a later stage, in step S1010, a second thread of the plurality of threads completes a later stage of separable operations for the plurality of processing values assigned to the second thread in accordance with at least one contribution determined by the first thread to generate the output value segment. As described with reference to fig. 7a to 7c, step S1010 of the latter stage is performed in a similar manner to step S1010 of the initial stage. That is, for example, in a manner similar to the manner in which the threads T9 and T17 shown in fig. 4 cooperate with each other in the initial stage of the operation, the threads T9 and T17 shown in fig. 8 cooperate with each other in the latter stage of the operation.
It should be appreciated that in order to perform a later stage of an operation, each of the threads of the plurality of threads (e.g., each of the threads shown in fig. 8) may perform an action of a "first thread" in steps S1004 through S1006 of the later stage to determine at least one contribution to the later stage of the operation to be completed by at least one other thread, and then perform an action of a "second thread" in steps S1008 through S1010 of the later stage to complete the later stage of the operation according to the at least one contribution determined by the at least one other thread.
In the latter stage, in step S1014, it is determined whether the operation is completed. In the illustrative example in which the value array to be operated on is a two-dimensional pixel value array, the operation is a separable two-dimensional gaussian filter operation, and the initial stage and the later stage of the operation performed in the two passes of steps S1004 to S1010 are, respectively, horizontal and vertical one-dimensional gaussian filter operations, it may be determined that the operation is completed after completion of the later stage 1000 of steps S1004 to S1010. In this case, the second thread may write the output value segment that the second thread has generated to the global memory 108. Each of the plurality of threads (e.g., each of the threads shown in fig. 8) may write the segment of output values that the thread has generated to global memory 108 such that the output value corresponding to each value of the array of values for which the operation is to be performed is written to global memory 108. The output values written to global memory 108, corresponding to the values of the array of values, may be the output of the method. In examples where the received two-dimensional image has been divided into a plurality of overlapping tiles for input to the method, the outputs of the method for the overlapping tiles may be recombined as described herein to form a filtered two-dimensional image.
It is advantageous to perform the separable two-dimensional gaussian filtering operation using the method of fig. 10 rather than using the simple method described herein. The method of fig. 10 is relatively fast because it involves performing only one set of reads from and one set of writes to the global memory 108 to perform the operation, as opposed to the two sets of reads and two sets of writes to the global memory 108 that are required in the simple method. Furthermore, in the method of fig. 10, each value is read only once from the global memory 108, regardless of the number of stages of the operation to be performed. This is because each thread completes each phase of the operation for each of the plurality of values to which the thread is assigned, and the processing values are stored in the local memory 106 between phases. This is in contrast to the simple method, in which each value is read from memory many times so that each thread can complete each phase for a single value. The methods described herein take advantage of the ability of threads within the same thread bundle to share access to the local memory 106 of the processing logic 104 dedicated to processing those threads during their processing.
It should be appreciated that the method described herein with reference to fig. 10 is also applicable to separable multi-dimensional operations performed on arrays of values having more than two dimensions. In these examples, additional passes of steps S1016 and S1004 through S1010 may be performed in order to perform additional stages of the operation.
Fast integral calculation
It should be understood that the method described herein with reference to fig. 10 is not limited to use in performing separable filtering operations, such as separable gaussian or box filtering operations. The method described herein with reference to fig. 10 may be used to perform other types of separable operations, such as a fast integral calculation operation or any other suitable type of separable operation.
As with the one-dimensional filter operation, the one-dimensional fast integral computation operation may be performed in multiple stages in order to implement a separable operation on a multi-dimensional array of values. In a one-dimensional fast integral calculation operation, values in a one-dimensional sequence of values are successively summed such that a first processed value in the processed sequence is equal to the first input value in the input sequence, a second processed value in the processed sequence is equal to the sum of the first input value and the second input value in the input sequence, a third processed value in the processed sequence is equal to the sum of the first input value, the second input value and the third input value, and so on, until the final processed value in the processed sequence is equal to the sum of all input values in the input sequence.
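In other words, the one-dimensional fast integral computation just described is an inclusive prefix sum; a minimal sequential sketch is given below (the function name is illustrative).

```cuda
// Sketch of the one-dimensional fast integral computation (an inclusive prefix sum):
// out[i] = in[0] + in[1] + ... + in[i].
__device__ void inclusive_prefix_sum(const float* in, float* out, int n)
{
    float running = 0.0f;
    for (int i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}
```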
The fast integral image calculation operation is a type of fast integral calculation operation performed in two stages on a two-dimensional image. When performing a fast integral image calculation operation, the input to the method of fig. 10 may be a two-dimensional array of pixel values.
In some examples where the received two-dimensional image is sufficiently small, one thread bundle of threads may be able to perform the desired operation for all of the pixel values of the two-dimensional image. In these examples, the entire two-dimensional image may be input as a two-dimensional array of pixel values to a method as shown in fig. 10.
In other examples, the received two-dimensional image may optionally be divided into a plurality of tiles. Each tile of the plurality of tiles includes a respective two-dimensional array of pixel values. Unlike the filtering operation example given above, the tiles do not need to overlap, as the edge condition does not occur at the beginning/end of each sequence of values when performing the fast integral image calculation operation. However, also unlike the filtering operation example given above, when performing a fast integral image calculation operation it is not always possible to process all of the tiles in parallel. This is because, in an example in which the initial phase is performed horizontally, the final output value for each row of a tile in the first "tile column" of the image is an input for the respective row of the tile in the second "tile column" of the image, the final output value for each row of a tile in the second "tile column" of the image is an input for the respective row of the tile in the third "tile column" of the image, and so on. Thus, when performing a fast integral image calculation operation, some of the tiles may be processed serially at the processing unit. Those skilled in the art will be aware of many techniques for propagating output values from the processing of tiles in a first "tile column" or "tile row" of an input image to the later processing of tiles in a second "tile column" or "tile row" of the image. For example, a work group including a plurality of thread bundles may be processed at a processing element of a processing unit, wherein each tile in a row or column of tiles is processed by one of the thread bundles and the thread bundles are processed in parallel at the processing element, such that tiles in the first "tile column" or "tile row" are processed by a first thread bundle of the work group, which then shares its output values via a local memory with a second thread bundle of the work group, which then processes tiles in the second "tile column" or "tile row", and so on. For brevity, these techniques will not be discussed further herein.
In the following, an example is discussed in which an entire two-dimensional image may be input to the method shown in fig. 10 as a two-dimensional array of pixel values to be processed by threads of a thread bundle running in parallel at a single processing element (e.g., core) of a processing unit.
In order to perform the initial stage of the fast integral image calculation operation, in step S1002, for each one of one or more one-dimensional value sequences of the value array, a respective value segment of the one-dimensional value sequence is assigned to each of the plurality of threads. Step S1002 for the fast integral image calculation operation may be performed in the same manner as step S1002 is performed for the separable filter operation, as described herein with reference to fig. 4 and 5. Each thread causes each of the values included in the value segments assigned to that thread to be read from global memory 108 into one or more registers in register bank 110 dedicated to that thread (e.g., "registers for that thread").
In step S1004, a first thread of the plurality of threads performs at least a portion of a stage of the operation on at least one set of one or more values of the value segment assigned to the first thread in order to determine at least one contribution from the at least one set of one or more values to the stage of the operation to be completed by a second thread of the plurality of threads for an adjacent value segment of the one-dimensional sequence of values. For example, referring to fig. 4 and 5, thread T1 may perform a summation of all of the values (e.g., all eight values) in the value segment to which the thread has been assigned. This sum is the contribution to the stage of the fast integral image calculation that thread T9 will complete. This sum is also a contribution to the stage of the fast integral calculation that threads T17 and T25 will complete when performing the fast integral image calculation operation. Each of the threads T1 to T24, T33 to T56, T65 to T88, and T97 to T120 (as shown in fig. 4) may determine a contribution in this way by performing a summation of all of the values (e.g., all eight values) in the value segment to which the thread has been assigned.
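In code, the contribution a thread determines in step S1004 of the fast integral example is simply the sum of its eight segment values, for instance (illustrative sketch; the function name is an assumption):

```cuda
// Sketch of step S1004 for the fast integral computation: the contribution of a thread
// such as T1 to the threads to its right (T9, T17, T25 in fig. 5) is the sum of its segment.
__device__ float segment_sum(const float r[8])
{
    float s = 0.0f;
    for (int i = 0; i < 8; ++i)
        s += r[i];
    return s;
}
```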
In step S1006, the first thread writes the at least one contribution that the thread has determined to memory. Specifically, the first thread (e.g., T1 in this example) may write at least one contribution to local memory 106. As described herein, multiple threads within a thread bundle may share access to local memory 106. In step S1006, each of the other threads in the plurality of threads may also write any contribution determined by the thread in step S1004 to the local memory 106.
In step S1008, the second thread reads from memory the at least one contribution determined by the first thread. Specifically, the second thread (e.g., T9 in this example) may read at least one contribution from the local memory 106. The second thread may cause the at least one contribution to be read into one or more registers accessible to the second thread such that the registers include the at least one contribution and values assigned to adjacent value segments of the second thread. For example, thread T9 may read the contribution determined by thread T1 from memory. In addition, when the fast integral image calculation operation is performed: thread T17 may read the contributions determined by threads T1 and T9 from memory, and thread T25 may read the contributions determined by threads T1, T9, and T17 from memory. The same principle applies to other one-dimensional sequences of values (e.g., rows) of a two-dimensional array of values.
In step S1010, the second thread completes a phase of the operation for the adjacent value segment allocated to the second thread in accordance with the at least one contribution read from memory, so as to generate a processed value segment. For example: thread T9 may determine a processed value for the first value in the value segment to which the thread has been assigned by summing the contribution determined by thread T1 with the first value in the value segment to which the thread has been assigned; thread T9 may determine a processed value for the second value in the value segment to which the thread has been assigned by summing the contribution determined by thread T1, the first value and the second value in the value segment to which the thread has been assigned; and so on, until thread T9 determines a processed value for the final (e.g., eighth) value in the value segment to which the thread has been assigned by summing the contribution determined by thread T1 with all of the values in the value segment to which the thread has been assigned. In addition: thread T17 may determine a processed value for the first value in the value segment to which the thread has been assigned by summing the contribution determined by thread T1, the contribution determined by thread T9, and the first value in the value segment to which the thread has been assigned; thread T17 may determine a processed value for the second value in the value segment to which the thread has been assigned by summing the contribution determined by thread T1, the contribution determined by thread T9, the first value and the second value in the value segment to which the thread has been assigned; and so on, until thread T17 determines a processed value for the final (e.g., eighth) value in the value segment to which the thread has been assigned by summing the contribution determined by thread T1, the contribution determined by thread T9, and all of the values in the value segment to which the thread has been assigned. In addition: thread T25 may determine a processed value for the first value in the value segment to which the thread has been assigned by summing the contribution determined by thread T1, the contribution determined by thread T9, the contribution determined by thread T17, and the first value in the value segment to which the thread has been assigned; thread T25 may determine a processed value for the second value in the value segment to which the thread has been assigned by summing the contribution determined by thread T1, the contribution determined by thread T9, the contribution determined by thread T17, the first value and the second value in the value segment to which the thread has been assigned; and so on, until thread T25 determines a processed value for the final (e.g., eighth) value in the value segment to which the thread has been assigned by summing the contribution determined by thread T1, the contribution determined by thread T9, the contribution determined by thread T17, and all of the values in the value segment to which the thread has been assigned. The same principle applies to other one-dimensional sequences of values (e.g., rows) of the two-dimensional array of values.
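Steps S1006 to S1010 of the fast integral example can be sketched as follows, assuming the per-segment sums for a row are published in shared memory indexed by the segment's position in the row (0 for the segment of T1, 1 for T9, 2 for T17, 3 for T25 in the fig. 5 example); the layout, barrier and names are illustrative assumptions of this sketch.

```cuda
// Sketch of steps S1006-S1010 for the fast integral computation: publish this thread's
// segment sum, read the sums of all segments to its left, and add that carry to a local
// inclusive prefix sum of the thread's own eight values.
__device__ void scan_segment(int row, int my_seg, const float r[8], float out[8])
{
    __shared__ float s_segsum[32][4];            // one sum per (row, segment); local memory 106

    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) sum += r[i];
    s_segsum[row][my_seg] = sum;                 // S1006: publish the segment sum
    __syncthreads();

    float carry = 0.0f;                          // S1008: contributions of segments to the left
    for (int s = 0; s < my_seg; ++s)
        carry += s_segsum[row][s];

    float running = carry;                       // S1010: complete the stage for this segment
    for (int i = 0; i < 8; ++i) {
        running += r[i];
        out[i] = running;
    }
}
```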
In step S1014, it is determined whether the operation is completed. In this example where the array of values to be operated is a two-dimensional array of pixel values, the operation is a separable two-dimensional fast integral image calculation operation, and the stage 1000 of the operation performed in steps S1004 to S1010 is a one-dimensional fast integral image calculation operation, it may be determined that the operation is not completed after the single (e.g., initial) stage 1000 of steps S1004 to S1010 is completed. In this case, a second thread (e.g., thread T9 in this example) may write the segment of processing values that the thread has generated to local memory 106. Each of the plurality of threads (e.g., each of the threads shown in fig. 4) may write a segment of processing values that the thread has generated to the local memory 106 such that the processing value corresponding to each value of the array of values for which an operation is to be performed is written to the local memory 106. As described herein, multiple threads within a thread bundle may share access to local memory 106. Subsequently, the method proceeds to step S1016.
To perform the latter stage of the operation, in step S1016, for each of one or more vertical one-dimensional sequences of values of the array of values, a respective plurality of processed values from a memory (e.g., local memory 106) corresponding to a one-dimensional value segment of the vertical one-dimensional sequence of values of the two-dimensional array of values are assigned to each of a plurality of threads. Step S1016 for the fast integral image calculation operation may be performed in the same manner as step S1016 is performed for the separable filtering operation, as described herein with reference to fig. 8 and 9. As described herein, in some alternative examples, step S1016 may include assigning a processing value generated by a thread in an initial stage of an operation back to the thread for further processing in a later stage of the operation.
The method then proceeds to the second pass of steps S1004 to S1010 in order to perform the latter stage 1000 of the fast integral image calculation operation. In the latter stage of the operation, the same one-dimensional fast integral image calculation operation as that performed in the initial stage of the operation for the value segments allocated to each of the plurality of threads is performed for the plurality of processing values allocated to each of the plurality of threads. Steps S1004 to S1010 of the later stage 1000 of the fast integral image calculation operation are performed in a similar manner to steps S1004 to S1010 of the initial stage 1000 of the fast integral image calculation operation, as described herein.
In step S1014 of the latter stage, it is determined whether the operation is completed. In this example, in which the value array to be operated on is a two-dimensional pixel value array, the operation is a separable two-dimensional fast integral calculation operation, and the initial stage and the later stage of the operation performed in the two passes of steps S1004 to S1010 are, respectively, horizontal and vertical one-dimensional fast integral calculation operations, it can be determined that the operation is completed after the later stage 1000 of steps S1004 to S1010 is completed. In this case, each of the plurality of threads (e.g., each of the threads shown in fig. 8) may write the output value segment that the thread has generated to global memory 108 such that the output value corresponding to each value of the array of values for which the operation is to be performed is written to global memory 108. The output value written to global memory 108 corresponding to each value of the array of values may be the output of the method.
It should be appreciated that the method described herein with reference to fig. 10 may also be applied to one-dimensional fast integral calculation operations performed on a one-dimensional array of values (e.g., a sequence of audio samples of an audio signal, or a sequence of signal samples of a transmission signal). In these examples, a single pass of steps S1002 through S1014 may be performed in order to perform a single stage of the fast integral calculation operation.
It should also be appreciated that the method described herein with reference to fig. 10 is also applicable to separable multi-dimensional fast integral calculation operations performed on arrays of values having more than two dimensions. In these examples, additional passes of steps S1016 and S1004 through S1014 may be performed in order to perform additional stages of the fast integral calculation operation.
Writing to and reading from memory between stages of separable operations
As described herein with reference to fig. 10, in step S1014 of the initial phase of the separable operation, each of the plurality of threads may cause the value processed by the thread to be written to memory (e.g., local memory 106). In step S1016, the processing values to be operated on in the latter stage of the separable operation may be allocated to a plurality of threads. Subsequently, a step S1004 of a later stage of the separable operation may begin with each of the plurality of threads such that the processing value to which the thread has been assigned is read from memory (e.g., local memory 106) into a register of the thread (e.g., a register of the thread in register bank 110).
Alternatively, the latency of the separable operations may be reduced by writing the processing values to memory (e.g., local memory 106) in a prescribed manner after an initial phase of the separable operations, as will be described herein.
FIG. 11 illustrates an example memory 106 that includes multiple memory banks. The memory 106 shown in fig. 11 may be a "local" memory 106, as previously defined herein with reference to fig. 1A and 1B. In particular, the memory 106 may be comprised within a processing element (e.g., a core) of the processing unit. The processing logic of the processing element may access the memory 106.
The local memory 106 described herein with reference to fig. 1A and 1B may have the same characteristics as the memory 106, as will be described herein with reference to fig. 11.
The memory 106 shown in fig. 11 includes sixteen memory banks, labeled 1 through 16 in fig. 11. The sixteen memory banks in fig. 11 are shown as sixteen columns of memory locations. A memory location is a portion of the memory into which a value (e.g., a pixel value) can be written. It should be appreciated that a value may comprise more than one bit of information. It should also be appreciated that the memory may include any suitable number of memory banks (e.g., 8, 32, 64, or any other suitable number of memory banks), and that the number of memory locations illustrated in FIG. 11 is not intended to be limiting: the memory may include any suitable number of memory locations. The sixteen memory banks shown in fig. 11 are each one memory location "wide". That is, each memory bank shown in fig. 11 can store one value in each of its rows. It should be appreciated that a memory bank may be more than one memory location "wide" (e.g., a memory bank may be 2, 4, or any other suitable number of memory locations "wide"). That is, a memory bank may store more than one value in each of its rows.
As described herein, a thread bundle comprising a plurality of threads may be processed by a processing element (e.g., a core) comprising processing logic and memory 106. Multiple threads within a thread bundle may share access to memory 106. In other words, each of the plurality of threads may cause processing logic to write and/or read values to/from memory 106.
Excessive bank conflicts occur when writing the processed values to memory after the initial stage of the separable operation
In each write step (e.g., clock cycle or instruction), a thread may cause only one value (e.g., one pixel value) to be written into the memory 106. In each write step, each memory bank may be written to by only one respective thread. A "bank conflict" occurs when, in a single write step, more than one of the plurality of threads attempts to write a respective value to the same memory bank. When a bank conflict occurs, the writes to that memory bank are performed over a plurality of write steps. That is, in each write step, each of a plurality of different threads may write a respective value to a respective one of a plurality of different memory banks of the memory. Thus, the most efficient (e.g., lowest latency) way for the plurality of threads to write their processed values to memory is for a number of different threads equal to the number of memory banks in the memory to each write a respective value to a respective one of those memory banks in each write step. For example, after performing the initial stage of the separable operation, the most efficient (e.g., lowest latency) way for the plurality of threads shown in FIG. 4 to write their processed values into the memory 106 of FIG. 11 is for 16 different threads to each write a respective value into a respective one of the 16 different memory banks of the memory 106 in each write step.
In examples where the number of processing values to be written by each of the plurality of threads is a factor of or equal to the number of banks in the memory, unnecessary bank conflicts may occur when the plurality of threads write their processing values into the memory. This means that not all of the memory banks of the memory can be written to in each writing step. This can be understood with reference to fig. 12A, which illustrates a number of processing values written to the memory 106 using a first simple method. The processing values shown in fig. 12A are values processed by a plurality of threads as shown in fig. 4 and 5, wherein each of the threads T1 to T128 processes eight respective values (e.g., pixels P1 to P8) in an initial stage of a separable operation as described herein. In this example, after the initial phase of performing the separable operation, the number of processing values to be written by each of the plurality of threads (i.e., 8) is a factor of the number of banks in memory 106 (i.e., 16). For ease of illustration, multiple rows of memory locations are omitted.
The memory bank to which each processed value is to be written may be determined from a write buffer array. In the first simple method shown in fig. 12A, the write buffer array may be a one-dimensional sequence of elements comprising: elements corresponding to the 8 processed values to be written by thread T1 (i.e., T1P1, T1P2 … T1P8); followed by elements corresponding to the 8 processed values to be written by thread T2 (i.e., T2P1, T2P2 … T2P8); followed by elements corresponding to the 8 processed values to be written by thread T3 (i.e., T3P1, T3P2 … T3P8); and so on, up to the elements corresponding to the 8 processed values to be written by thread T128 (i.e., T128P1, T128P2 … T128P8). The write buffer array may be mapped to the memory 106 by mapping the first 16 elements (e.g., the 1st through 16th values) in the write buffer array to the 16 memory locations of the first row, mapping the subsequent 16 elements (e.g., the 17th through 32nd values) in the write buffer array to the 16 memory locations of the second row, and so on. The write buffer array is mapped to the memory such that the processed values are written into the memory at the memory locations to which the corresponding elements of the write buffer array are mapped.
As described herein, in each write step (e.g., clock cycle or instruction), each thread may cause only one value to be written into memory 106. Thus, in the first write step, each of the threads T1 to T128 attempts to write the respective processed value for its first value (i.e., P1) to memory (i.e., processed values T1P1, T2P1, T3P1, T4P1, T5P1, T6P1, T7P1, T8P1 through T128P1). While it is not possible for all 128 threads to write to the 16 memory banks of memory 106 in a single write step, it would be preferable if successive sets of 16 of the 128 threads could write to the 16 memory banks of memory 106 in each write step (i.e., such that all 128 of the P1 values are written in 8 write steps). However, this is not possible using the first simple method, as will be appreciated with reference to fig. 12A. For example, in the first write step, thread T1 attempts to write T1P1 to the same memory bank (e.g., memory bank 1 in FIG. 12A) as: thread T3 attempts to write T3P1, thread T5 attempts to write T5P1, thread T7 attempts to write T7P1, thread T9 attempts to write T9P1, thread T11 attempts to write T11P1, thread T13 attempts to write T13P1, and thread T15 attempts to write T15P1. In addition, in this first write step, thread T2 attempts to write T2P1 to the same memory bank (e.g., memory bank 9 in fig. 12A) as: thread T4 attempts to write T4P1, thread T6 attempts to write T6P1, thread T8 attempts to write T8P1, thread T10 attempts to write T10P1, thread T12 attempts to write T12P1, thread T14 attempts to write T14P1, and thread T16 attempts to write T16P1. This means that, instead of the set of 16 threads (i.e., threads T1 to T16) writing to 16 different memory banks in one write step, in this first simple method the same number of writes is performed over eight write steps by pairs of those 16 threads (i.e., threads (i) T1 and T2, (ii) T3 and T4, and so on up to (viii) T15 and T16) writing to two different memory banks. This is inefficient because only two of the sixteen memory banks are written to in each write step. Each of the threads T17 to T32 then requires eight further write steps to write its respective processed value for P1 to memory, and so on, until the threads T113 to T128 require a further eight write steps to write their respective processed values for P1 to memory. This means that 64 write steps are required to write all 128 of the P1 values into the memory 106, rather than the 8 write steps that would be required if all 16 memory banks were written to in each write step. This inefficient write process is then repeated seven more times in order for each of the plurality of threads to write each of the other seven processed values of that thread (i.e., each of P2 through P8).
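The collisions described above can be illustrated with a minimal Python sketch (an illustration only, assuming the flat layout of fig. 12A, with 16 one-location-wide memory banks and each thread's eight values stored contiguously):

```python
NUM_BANKS = 16
VALUES_PER_THREAD = 8

# Under the first simple method, TtP1 sits at flat buffer index (t - 1) * 8,
# so its (0-indexed) bank is that index modulo the number of banks.
banks_hit = sorted({((t - 1) * VALUES_PER_THREAD) % NUM_BANKS for t in range(1, 17)})
print(banks_hit)  # [0, 8]: threads T1..T16 all target just two banks (banks 1 and 9 in fig. 12A)
```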
Excessive bank conflicts occur when reading the processed value from memory prior to the later stage of the separable operation
In each read step (e.g., clock cycle or instruction), a thread may cause only one value (e.g., one pixel value) to be read from memory 106. In each read step, each memory bank may be read by only one respective thread. A "bank conflict" occurs when, in a single read step, more than one of the plurality of threads attempts to read a respective value from the same memory bank. When a bank conflict occurs, the reads from that memory bank are performed over a plurality of read steps. That is, in each read step, each of a plurality of different threads may read a respective value from a respective one of a plurality of different memory banks of the memory. Thus, the most efficient (e.g., lowest latency) way for the plurality of threads to read the processed values from memory is for a number of different threads equal to the number of memory banks in the memory to each read a respective value from a respective one of those memory banks in each read step. For example, before performing the later stage of the separable operation, the most efficient (e.g., lowest latency) way for the plurality of threads shown in FIG. 8 to read the processed values from the memory 106 of FIG. 11 is for 16 different threads to each read a respective value from a respective one of the 16 different memory banks of the memory 106 in each read step.
In examples where the number of processed values to be read by each of the plurality of threads is a factor of, or equal to, the number of memory banks in the memory, unnecessary bank conflicts may occur when the plurality of threads read their processed values from the memory. This means that not all of the memory banks of the memory are read from in each read step. This can be understood with reference to fig. 12B, which illustrates a plurality of processed values written to the memory 106 using a second simple method. The processed values shown in fig. 12B are the values processed by the plurality of threads as shown in figs. 4 and 5, where the threads T1 through T128 each process eight respective values (e.g., pixels P1 through P8) in the initial stage of the separable operation as described herein. In this example, the number of values (i.e., 8) to be read by each of the plurality of threads (e.g., arranged as shown in fig. 8) before the later stage of the separable operation is performed is a factor of the number of memory banks (i.e., 16) in the memory. For ease of illustration, multiple rows of memory locations are omitted.
As described herein, the memory bank to which each processed value is to be written may be determined from a write buffer array. In the second simple method shown in fig. 12B, the write buffer array may be a one-dimensional sequence of elements comprising: elements corresponding to the first processed value P1 of each of the threads T1 to T8 (i.e., T1P1, T2P1 … T8P1); followed by elements corresponding to the second processed value P2 of each of the threads T1 to T8 (i.e., T1P2, T2P2 … T8P2); and so on, up to the elements corresponding to the eighth processed value P8 of each of the threads T1 to T8 (i.e., T1P8, T2P8 … T8P8); followed by elements corresponding to the first processed value P1 of each of the threads T9 to T16 (i.e., T9P1, T10P1 … T16P1); followed by elements corresponding to the second processed value P2 of each of the threads T9 to T16 (i.e., T9P2, T10P2 … T16P2); and so on, up to the elements corresponding to the eighth processed value P8 of each of the threads T9 to T16 (i.e., T9P8, T10P8 … T16P8); and so on, up to the elements corresponding to the eighth processed value P8 of each of the threads T121 through T128 (i.e., T121P8, T122P8 … T128P8). The write buffer array may be mapped to the memory 106 by mapping the first 16 elements (e.g., the 1st through 16th values) in the write buffer array to the 16 memory locations of the first row, mapping the subsequent 16 elements (e.g., the 17th through 32nd values) in the write buffer array to the 16 memory locations of the second row, and so on. The write buffer array is mapped to the memory such that the processed values are written into the memory at the memory locations to which the corresponding elements of the write buffer array are mapped.
Before the latter stage of the operation, the thread T1 shown in fig. 8 reads the processing values T1P1, T2P1, T3P1, T4P1, T5P1, T6P1, T7P1, and T8P1 from the memory 106. This can be understood by comparing the position within the value array assigned to the threads T1 to T8 in the initial stage as shown in fig. 4 with the position within the value array corresponding to the processing value assigned to the thread T1 in the later stage as shown in fig. 8. The same principle applies: the thread T2 shown in fig. 8 reads the processing values T1P2, T2P2, T3P2, T4P2, T5P2, T6P2, T7P2, and T8P2; the thread T3 shown in fig. 8 reads the processing values T1P3, T2P3, T3P3, T4P3, T5P3, T6P3, T7P3, and T8P3; the thread T4 shown in fig. 8 reads the processing values T1P4, T2P4, T3P4, T4P4, T5P4, T6P4, T7P4, and T8P4; the thread T5 shown in fig. 8 reads the processing values T1P5, T2P5, T3P5, T4P5, T5P5, T6P5, T7P5, and T8P5; the thread T6 shown in fig. 8 reads the processing values T1P6, T2P6, T3P6, T4P6, T5P6, T6P6, T7P6, and T8P6; the thread T7 shown in fig. 8 reads the processing values T1P7, T2P7, T3P7, T4P7, T5P7, T6P7, T7P7, and T8P7; and the thread T8 shown in fig. 8 reads the processing values T1P8, T2P8, T3P8, T4P8, T5P8, T6P8, T7P8, and T8P8. For simplicity, the processing values read by each of the other threads shown in FIG. 8 prior to the latter stage of the separable operation will not be elaborated herein. As described herein, by comparing fig. 4 and 8, one skilled in the art will readily determine which process values to read by each of the other threads shown in fig. 8.
As described herein, in each read step (e.g., clock cycle or instruction), a thread may cause only one value (e.g., a pixel value) to be read from memory 106. Thus, in the first read step, each of the threads T1 to T128 as shown in fig. 8 attempts to read a respective processed value from the memory. While it is not possible for all 128 threads to read from the 16 memory banks of memory 106 in a single read step, it would be preferable if successive sets of 16 of the 128 threads could read from the 16 memory banks of memory 106 in each read step (i.e., such that all 128 of the threads read a respective processed value in 8 read steps). However, this is not possible using the second simple method, as will be appreciated with reference to fig. 12B. For example, in the first read step, thread T1 attempts to read T1P1 from the same memory bank (e.g., memory bank 1 in FIG. 12B) as: thread T3 attempts to read T1P3; thread T5 attempts to read T1P5; thread T7 attempts to read T1P7; thread T33 attempts to read T9P1; thread T35 attempts to read T9P3; thread T37 attempts to read T9P5; and thread T39 attempts to read T9P7. In addition, in the first read step, thread T2 attempts to read T1P2 from the same memory bank (e.g., memory bank 9 in fig. 12B) as: thread T4 attempts to read T1P4; thread T6 attempts to read T1P6; thread T8 attempts to read T1P8; thread T34 attempts to read T9P2; thread T36 attempts to read T9P4; thread T38 attempts to read T9P6; and thread T40 attempts to read T9P8. This means that, instead of the set of 16 threads (i.e., threads T1 to T8 and T33 to T40) reading from 16 different memory banks in one read step, in this second simple method the same number of reads is performed over eight read steps by pairs of those 16 threads (i.e., threads (i) T1 and T2, (ii) T3 and T4, and so on up to (viii) T39 and T40) reading from two different memory banks. This is inefficient because only two of the sixteen memory banks are read from in each read step. This means that 64 read steps are required for all 128 of the threads to read a respective processed value, rather than the 8 read steps that would be required if all 16 memory banks were read from in each read step. This inefficient read process is then repeated seven more times so that each of the plurality of threads reads the other seven processed values assigned to that thread.
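A similar minimal sketch, assuming the layout of fig. 12B, shows the corresponding problem on the read side: the sixteen values requested in the first read step by threads T1 to T8 and T33 to T40 sit at flat indices 0, 8, 16, …, 120, so they map onto only two banks.

```python
NUM_BANKS = 16

# Flat buffer indices of T1P1..T1P8 and T9P1..T9P8 under the second simple method:
# T1P1 is at index 0, T1P2 at 8, ..., T1P8 at 56, T9P1 at 64, ..., T9P8 at 120.
first_read_step_indices = [8 * n for n in range(16)]
banks_hit = sorted({idx % NUM_BANKS for idx in first_read_step_indices})
print(banks_hit)  # [0, 8]: sixteen threads collide on just two banks (banks 1 and 9 in fig. 12B)
```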
Excessive bank conflicts are avoided when writing to/reading from memory between stages of separable operations
A computer-implemented method of performing separable operations on a two-dimensional array of values at a processing unit including a memory is described herein with reference to fig. 13. The separable operations may be any of the separable operations described herein (e.g., separable Gaussian filter operations, separable box filter operations, or separable fast integral calculation operations) or any other suitable type of separable operation. The method can be used both to: (i) minimize the number of bank conflicts that arise when writing the processed values to memory (e.g., local memory 106) after the initial stage of a separable operation; and (ii) minimize the number of bank conflicts that arise when reading those processed values from the memory (e.g., local memory 106) prior to the later stage of the separable operation. Thus, the method may reduce the latency associated with performing separable operations.
The input for the method shown in fig. 13 is a two-dimensional array of values. The two-dimensional array of values input to the method illustrated in fig. 13 may have the same characteristics as any of the two-dimensional arrays of values as discussed herein that may be input to the method illustrated in fig. 10.
In step S1302, the two-dimensional array of values input to the method is divided into a plurality of two-dimensional sub-arrays of values (e.g., cells). This step can be appreciated with reference to fig. 14, which shows a two-dimensional array 1400 of values divided into a plurality of two-dimensional sub-arrays of values 1 to 16. The two-dimensional sub-arrays of values may be non-overlapping. In fig. 14, the two-dimensional array of values is square, and each of the two-dimensional sub-arrays of values is square. In fig. 14, the number of sub-arrays in each row of the array is equal to the number of sub-arrays in each column of the array. A two-dimensional array of values divided into a plurality of two-dimensional sub-arrays of values may be represented by a multi-dimensional array [ I ] [ J ] [ K ] [ M ], where I and J represent the numbers of sub-arrays of values within the array of values in each of the two dimensions, and K and M represent the numbers of values within each of the sub-arrays of values in each of the two dimensions. For example, in the illustrative examples described herein with reference to figs. 4, 5, 8, 9, and 14, the multi-dimensional array may be represented by [4] [4] [8] [8]. That is, I represents the number of rows of sub-arrays, J represents the number of columns of sub-arrays, K represents the number of rows within each sub-array, and M represents the number of columns within each sub-array.
In step S1304, for each of the plurality of sub-arrays, an initial stage of separable operations is performed for the value sub-array using a plurality of threads to generate a respective processed value for each value of the value sub-array. Step S1304 may be performed by performing steps S1002 to S1010 as described herein with reference to fig. 10. For example: by using the threads T1 to T8 as shown in fig. 4, the initial stage of the separable operation can be performed for the subarray 1 as shown in fig. 14; by using the threads T9 to T16 as shown in fig. 4, an initial stage of the separable operation can be performed for the subarray 2 as shown in fig. 14; and so on until the initial stage of the separable operation can be performed for the sub-array 16 as shown in fig. 14 by using the threads T121 to T128 as shown in fig. 4.
In step S1306, for each of the plurality of sub-arrays, each of the plurality of threads writes a respective first plurality of processed values to a memory (e.g., the memory 106 shown in fig. 11) over a plurality of write steps, the first plurality of processed values corresponding to a one-dimensional sequence of values of that sub-array of values. For example, referring to figs. 4, 5, and 14, thread T1 writes eight processed values (e.g., T1P1, T1P2, T1P3, T1P4, T1P5, T1P6, T1P7, and T1P8) corresponding to the eight values within a one-dimensional sequence of values (e.g., a column) of sub-array 1, the sub-array for which that thread performed the initial stage of the separable operation. For simplicity, the processed values written by each of the other threads shown in FIG. 4 for each of the sub-arrays shown in FIG. 14 will not be elaborated herein. By referring to figs. 4 and 14, the skilled person will readily be able to determine which processed values are written by each of the other threads for each of the sub-arrays.
As described herein, the memory bank to which each processed value is to be written may be determined from a write buffer array. In accordance with the principles described herein, in step S1306, the memory bank to which each processed value is to be written may be determined from a write buffer array having a greater number of elements than the number of elements in the two-dimensional array of values. For example, in the illustrative example described herein, the two-dimensional array of values includes 1024 elements, and thus the write buffer array used in step S1306 includes more than 1024 elements. The write buffer array may include value elements corresponding to values of the two-dimensional array, and fill elements corresponding to padding in the memory. More specifically, the write buffer array may include multiple sets of consecutive value elements corresponding to values of the two-dimensional array, with fill elements interspersed between the sets. The number of value elements in each set may (i) be equal to the number of memory banks comprised by the memory, (ii) be a multiple of that number, or (iii) be a factor of that number. For example, the memory 106 includes 16 memory banks, and thus the number of value elements in each set may be (i) 16, (ii) a multiple of 16, such as 32, 64, or 128, or (iii) a factor of 16, such as 8. The number of value elements in each set may be equal to or less than the number of threads used to perform the separable operation on the array of values. For example, in the illustrative example described herein, 128 threads are used to perform the separable operation on the array of values. In this example, the number of value elements in each set of the write buffer array may be equal to or less than 128.
In examples where each memory bank in the memory is one memory location "wide" (e.g., as in the memory 106 shown in fig. 11), at least one fill element may be interspersed between the sets of value elements. In examples where each memory bank in the memory is more than one memory location "wide", multiple consecutive fill elements may be interspersed between the sets of consecutive value elements. In other words, the number of consecutive fill elements interspersed between the sets of consecutive value elements may be greater than or equal to the number of memory locations in each row of a memory bank of the memory.
The write buffer array may be a one-dimensional array, e.g., a one-dimensional sequence of elements. To determine into which memory bank each processed value is to be written, the write buffer array may be mapped to the structure of the memory accessible to the processing logic. For example, the write buffer array may be mapped to the structure of the memory 106 of FIG. 11 by mapping the first 16 elements (e.g., the 1st through 16th values) in the write buffer array to the 16 memory locations of the first row, mapping the subsequent 16 elements (e.g., the 17th through 32nd values) in the write buffer array to the 16 memory locations of the second row, and so on. It should be appreciated that the write buffer array need not be mapped to the structure of the memory 106 starting from the first (e.g., "upper left") memory location of the memory 106. That is, the write buffer array may be mapped to the structure of the memory 106 starting from any suitable memory location in the memory 106.
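A minimal sketch of this mapping, assuming 16 one-location-wide memory banks and a buffer that starts at the first memory location, is:

```python
NUM_BANKS = 16

def memory_position(buffer_index):
    """Map a flat (0-indexed) write buffer index to a (row, bank) position in the memory of fig. 11."""
    return buffer_index // NUM_BANKS, buffer_index % NUM_BANKS + 1  # bank is 1-indexed

print(memory_position(0))   # (0, 1): 1st element maps to the first row, memory bank 1
print(memory_position(17))  # (1, 2): 18th element maps to the second row, memory bank 2
```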
In step S1306, the write buffer array may be mapped to the memory such that the processing value corresponding to the value of the two-dimensional array is written to the memory in the memory location to which the value element of the write buffer array is mapped, and the processing value corresponding to the value of the two-dimensional array is not written to the memory in the memory location to which the fill element of the write buffer array is mapped. Any other information (e.g., one or more "0" bits, or any arbitrary value) may be written to memory in the memory location to which the fill element of the write buffer array is mapped. Alternatively, the memory locations to which the fill elements of the write buffer array are mapped may not be written at all (e.g., those memory locations may be "left empty").
The memory location to which a thread writes a processed value in step S1306 may be determined from a base memory address, a write offset, and a write fill amount. The memory location to which the thread writes the processed value may be determined from the sum of the base memory address, the write offset, and the write fill amount. The sum may be used to determine the element in the write buffer array to which the processed value is to be written. The base memory address may be the first element within the write buffer array.
The write offset and the write fill amount may depend on the position, within the array of values, of the value to which the processed value to be written to memory corresponds. For example, each value in the array of values may be assigned a coordinate [ i ] [ j ] [ k ] [ m ], defined using zero-based indexing (i.e., such that the first sub-array or value in a row or column is assigned a "0" coordinate for that row or column dimension), which defines the position of that value within the multi-dimensional array [ I ] [ J ] [ K ] [ M ] as defined herein. The write offset for a processed value to be written may be a function of the coordinates [ i ] [ j ] [ k ] [ m ] of the value to which that processed value corresponds. That is, i represents the row of sub-arrays in which the sub-array that includes the value is located, j represents the column of sub-arrays in which the sub-array that includes the value is located, k represents the row within the sub-array in which the value is located, and m represents the column within the sub-array in which the value is located. For example, referring to FIG. 4, the value T1P1 has the coordinates [0] [0] [0] [0], the value T1P2 has the coordinates [0] [0] [0] [1], the value T9P1 has the coordinates [0] [1] [0] [0], the value T78P3 has the coordinates [2] [1] [5] [2], and the value T128P6 has the coordinates [3] [3] [7] [5].
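For the illustrative example of a 32×32 array split into a 4×4 grid of 8×8 sub-arrays, the coordinate assignment can be sketched in Python; the closed-form index expressions below are an assumption that is consistent with the coordinates listed above:

```python
I, J, K, M = 4, 4, 8, 8  # 4x4 grid of sub-arrays, each 8x8 values

def coords(t, p):
    """Zero-indexed coordinates [i][j][k][m] of value Pp processed by thread Tt (both 1-indexed)."""
    i = (t - 1) // (J * K)   # row of sub-arrays containing the sub-array
    j = ((t - 1) // K) % J   # column of sub-arrays containing the sub-array
    k = (t - 1) % K          # row of the value within the sub-array
    m = p - 1                # column of the value within the sub-array
    return i, j, k, m

assert coords(1, 2) == (0, 0, 0, 1)     # T1P2 has coordinates [0][0][0][1]
assert coords(78, 3) == (2, 1, 5, 2)    # T78P3 has coordinates [2][1][5][2]
assert coords(128, 6) == (3, 3, 7, 5)   # T128P6 has coordinates [3][3][7][5]
```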
Non-limiting examples of suitable write offsets are provided below.
The write fill amount may be equal to the write offset divided by a fill frequency. The division may be integer division. Integer division involves dividing a first number by a second number and returning the integer part of the result as the output. For example, the integer division of 53 by 8 is equal to 6 (e.g., the "remainder" of 5 is not returned as part of the output). The fill frequency may (i) be equal to the number of memory banks comprised by the memory, (ii) be a multiple of that number, or (iii) be a factor of that number. For example, the memory 106 includes 16 memory banks, and thus the fill frequency may be (i) 16, (ii) a multiple of 16, such as 32, 64, or 128, or (iii) a factor of 16, such as 8. The fill frequency may be equal to or less than the number of threads used to perform the separable operation on the array of values. For example, in the illustrative example described herein, 128 threads are used to perform the separable operation on the array of values. In this example, the fill frequency may be equal to or less than 128. The fill frequency may also be a power of two, i.e., the fill frequency may be 2^n, where n is an integer. This means that the division by the fill frequency can be performed as a bitwise right shift rather than as a full division calculation, which would be less efficient to implement in hardware.
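A minimal sketch of the padded index computation, assuming a power-of-two fill frequency so that the division can be performed as a right shift:

```python
def padded_index(base, offset, fill_freq_log2):
    """Return base + offset + (offset // 2**fill_freq_log2), using a right shift for the division."""
    fill_amount = offset >> fill_freq_log2  # integer division by the fill frequency
    return base + offset + fill_amount

print(padded_index(0, 53, 3))  # fill frequency 8: 53 // 8 = 6, so the padded index is 59
```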
It should be appreciated that the write buffer array may exist in physical memory (e.g., may be implemented in registers in register bank 110) such that the processing values generated by the thread in step S1304 are physically written to the write buffer array before the contents of the write buffer array are transferred to memory (e.g., memory 106). Alternatively, the write buffer array may be a construct conceptually used by the threads to determine to which memory location in memory each of the processing values generated in step S1304 is to be written, wherein the processing values are physically written into the determined memory locations in memory (e.g., memory 106) directly from the corresponding one or more registers accessible to each thread.
In step S1306, the respective processing value is written to each of the memory banks of the memory (e.g., memory 106 of fig. 11) in at least one of the plurality of writing steps by applying the principles described herein. In particular, by applying the principles described herein, a respective processing value may be written to each of the memory banks of a memory (e.g., memory 106 of FIG. 11) in each of a plurality of write steps. That is, excessive bank conflicts are avoided during the writing in step S1306.
In the following, four specific examples are provided that illustrate memory locations in memory 106 to which multiple threads may write processing values by applying the principles described herein. It should be understood that these specific implementations are provided by way of example only and that the principles described herein may be variously applied.
Example 1
Referring to fig. 15A, which illustrates a plurality of processing values written to memory 106 using a first method according to principles described herein, example 1 may be appreciated. The processing values shown in fig. 15A are values processed by a plurality of threads as shown in fig. 4 and 5, wherein each of the threads T1 to T128 processes eight respective values (e.g., pixels P1 to P8) in an initial stage of a separable operation as described herein.
In example 1, the write buffer array for determining which memory location in memory 106 each process value is to be written to includes a plurality of sets of eight consecutive value elements corresponding to the values of the two-dimensional array, with a fill element interspersed between each of the plurality of sets. In example 1, the write buffer array is a one-dimensional sequence of elements comprising: a value element corresponding to the 8 processed values to be written by thread T1 (i.e., T1P1, T1P2 … T1P 8); followed by a filler element; followed by the value elements corresponding to the 8 processed values to be written by thread T2 (i.e., T2P1, T2P2 … T2P 8); followed by a filler element; followed by the value elements corresponding to the 8 processed values to be written by thread T3 (i.e., T3P1, T3P2 … T3P 8); followed by a filler element; and so on, up to the value element corresponding to the 8 processed values to be written by thread T128 (i.e., T128P1, T128P2 … T128P 8).
As described herein, the memory location to which a thread writes a processed value in step S1306 may be determined from the sum of the base memory address, the write offset, and the write fill amount. The sum may be used to determine the element of the write buffer array corresponding to the processed value to be written, wherein the base memory address is the first element within the write buffer array. In example 1, for a value having the coordinates [ i ] [ j ] [ k ] [ m ] within the multi-dimensional array [ I ] [ J ] [ K ] [ M ], the write offset is equal to (i×J×K×M) + (j×K×M) + (k×M) + m, and the write fill amount is equal to the write offset divided by 8 (the fill frequency). The division may be integer division.
Fig. 15A shows the contents of the memory 106 when this write buffer array is mapped to the memory 106 such that the following is implemented: the processed values corresponding to the values of the two-dimensional array are written into the memory 106 in the memory locations to which the corresponding value elements of the write buffer array are mapped, and the processed values corresponding to the values of the two-dimensional array are not written into the memory 106 in the memory locations to which the fill elements of the write buffer array are mapped. In fig. 15A, the memory location to which the fill element of the write buffer array is mapped is indicated by an "X". For ease of illustration, multiple rows of memory locations are omitted.
In example 1, a respective processed value may be written to each of the memory banks of the memory (e.g., the memory 106 of fig. 11) in at least one of the plurality of write steps. For example, referring to fig. 15A, in one write step: thread T1 may write T1P1 into memory bank 1; thread T2 may write T2P1 into memory bank 10; thread T3 may write T3P1 into memory bank 3; thread T4 may write T4P1 into memory bank 12; thread T5 may write T5P1 into memory bank 5; thread T6 may write T6P1 into memory bank 14; thread T7 may write T7P1 into memory bank 7; thread T8 may write T8P1 into memory bank 16; thread T9 may write T9P1 into memory bank 9; thread T10 may write T10P1 into memory bank 2; thread T11 may write T11P1 into memory bank 11; thread T12 may write T12P1 into memory bank 4; thread T13 may write T13P1 into memory bank 13; thread T14 may write T14P1 into memory bank 6; thread T15 may write T15P1 into memory bank 15; and thread T16 may write T16P1 into memory bank 8.
This first method is efficient because successive sets of 16 of the 128 threads can write to the 16 memory banks of the memory 106 in each write step (i.e., such that all 128 of the P1 values are written in 8 write steps). This efficient write process may then be repeated seven more times in order for each of the plurality of threads to write each of the other seven processed values of that thread (i.e., each of P2 through P8).
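The absence of bank conflicts in example 1 can be checked with a minimal Python sketch, assuming the thread-to-coordinate mapping of the illustrative example and a base memory address aligned to memory bank 1; it reproduces the bank assignments described above for fig. 15A:

```python
I, J, K, M = 4, 4, 8, 8
NUM_BANKS = 16
FILL_FREQUENCY = 8  # example 1

def write_bank(t, p):
    """1-indexed memory bank written by thread Tt for its value Pp under example 1."""
    i, j, k, m = (t - 1) // (J * K), ((t - 1) // K) % J, (t - 1) % K, p - 1
    offset = i * J * K * M + j * K * M + k * M + m      # write offset
    fill = offset // FILL_FREQUENCY                     # write fill amount (integer division)
    return (offset + fill) % NUM_BANKS + 1              # base address assumed bank-aligned

banks = [write_bank(t, 1) for t in range(1, 17)]        # first write step: T1..T16 write P1
print(banks)  # [1, 10, 3, 12, 5, 14, 7, 16, 9, 2, 11, 4, 13, 6, 15, 8], as in fig. 15A
assert len(set(banks)) == NUM_BANKS                     # all 16 banks are written, no conflict
```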
Example 2
Referring to fig. 15B, which illustrates a plurality of processing values written to memory 106 using a second method according to principles described herein, example 2 may be appreciated. The processing values shown in fig. 15B are values processed by a plurality of threads as shown in fig. 4 and 5, wherein each of the threads T1 to T128 processes eight respective values (e.g., pixels P1 to P8) in an initial stage of a separable operation as described herein.
In example 2, the write buffer array for determining which memory location in memory 106 each process value is to be written to includes a plurality of sets of sixteen consecutive value elements corresponding to the values of the two-dimensional array with a fill element interspersed between each of the plurality of sets. In example 2, the write buffer array is a one-dimensional sequence of elements comprising: a value element corresponding to the 8 processed values to be written by thread T1 (i.e., T1P1, T1P2 … T1P 8); followed by the value elements corresponding to the 8 processed values to be written by thread T2 (i.e., T2P1, T2P2 … T2P 8); followed by a filler element; followed by the value elements corresponding to the 8 processed values to be written by thread T3 (i.e., T3P1, T3P2 … T3P 8); followed by the value elements corresponding to the 8 processed values to be written by T4 (i.e., T4P1, T4P2 … T4P 8); followed by a filler element; and so on, up to the value element corresponding to the 8 processed values to be written by thread T128 (i.e., T128P1, T128P2 … T128P 8).
As described herein, the memory location to which a thread writes a processed value in step S1306 may be determined from the sum of the base memory address, the write offset, and the write fill amount. The sum may be used to determine the element of the write buffer array corresponding to the processed value to be written, wherein the base memory address is the first element within the write buffer array. In example 2, for a value having the coordinates [ i ] [ j ] [ k ] [ m ] within the multi-dimensional array [ I ] [ J ] [ K ] [ M ], the write offset is equal to (i×J×K×M) + (j×K×M) + (k×M) + m, and the write fill amount is equal to the write offset divided by 16 (the fill frequency). The division may be integer division.
Fig. 15B shows the contents of the memory 106 when this write buffer array is mapped to the memory 106 such that the following is implemented: the processed values corresponding to the values of the two-dimensional array are written into the memory 106 in the memory locations to which the corresponding value elements of the write buffer array are mapped, and the processed values corresponding to the values of the two-dimensional array are not written into the memory 106 in the memory locations to which the fill elements of the write buffer array are mapped. In fig. 15B, the memory location to which the fill element of the write buffer array is mapped is indicated by an "X". For ease of illustration, multiple rows of memory locations are omitted.
In example 2, a respective processed value may be written to each of the memory banks of the memory (e.g., the memory 106 of fig. 11) in at least one of the plurality of write steps. For example, referring to fig. 15B, in one write step: thread T1 may write T1P1 into memory bank 1; thread T2 may write T2P1 into memory bank 9; thread T3 may write T3P1 into memory bank 2; thread T4 may write T4P1 into memory bank 10; thread T5 may write T5P1 into memory bank 3; thread T6 may write T6P1 into memory bank 11; thread T7 may write T7P1 into memory bank 4; thread T8 may write T8P1 into memory bank 12; thread T9 may write T9P1 into memory bank 5; thread T10 may write T10P1 into memory bank 13; thread T11 may write T11P1 into memory bank 6; thread T12 may write T12P1 into memory bank 14; thread T13 may write T13P1 into memory bank 7; thread T14 may write T14P1 into memory bank 15; thread T15 may write T15P1 into memory bank 8; and thread T16 may write T16P1 into memory bank 16.
This second method is efficient because successive sets of 16 of the 128 threads can write to the 16 memory banks of the memory 106 in each write step (i.e., such that all 128 of the P1 values are written in 8 write steps). This efficient write process may then be repeated seven more times in order for each of the plurality of threads to write each of the other seven processed values of that thread (i.e., each of P2 through P8).
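The same check can be sketched for example 2, where only the fill frequency changes (again assuming the coordinate mapping of the illustrative example and a bank-aligned base address):

```python
I, J, K, M = 4, 4, 8, 8
NUM_BANKS = 16
FILL_FREQUENCY = 16  # example 2

def write_bank(t, p):
    i, j, k, m = (t - 1) // (J * K), ((t - 1) // K) % J, (t - 1) % K, p - 1
    offset = i * J * K * M + j * K * M + k * M + m
    return (offset + offset // FILL_FREQUENCY) % NUM_BANKS + 1

banks = [write_bank(t, 1) for t in range(1, 17)]  # first write step: T1..T16 write P1
print(banks)  # [1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15, 8, 16], as in fig. 15B
assert len(set(banks)) == NUM_BANKS
```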
Example 3
Referring to fig. 15C, which illustrates a plurality of processing values written to memory 106 using a third method according to principles described herein, example 3 may be appreciated. The processing values shown in fig. 15C are values processed by a plurality of threads as shown in fig. 4 and 5, wherein each of the threads T1 to T128 processes eight respective values (e.g., pixels P1 to P8) in an initial stage of a separable operation as described herein.
In example 3, the write buffer array for determining which memory location in memory 106 each process value is to be written to includes multiple sets of 32 consecutive value elements corresponding to the values of the two-dimensional array, with a fill element interspersed between each of the multiple sets. In example 3, the write buffer array is a one-dimensional sequence of elements comprising: a value element corresponding to the 8 processed values to be written by thread T1 (i.e., T1P1, T1P2 … T1P 8); followed by the value elements corresponding to the 8 processed values to be written by thread T2 (i.e., T2P1, T2P2 … T2P 8); followed by the value elements corresponding to the 8 processed values to be written by thread T3 (i.e., T3P1, T3P2 … T3P 8); followed by the value elements corresponding to the 8 processed values to be written by thread T4 (i.e., T4P1, T4P2 … T4P 8); followed by a filler element; followed by the value elements corresponding to the 8 processed values to be written by T5 (i.e., T5P1, T5P2 … T5P 8); followed by the value elements corresponding to the 8 processed values to be written by thread T6 (i.e., T6P1, T6P2 … T6P 8); followed by the value elements corresponding to the 8 processed values to be written by thread T7 (i.e., T7P1, T7P2 … T7P 8); followed by the value elements corresponding to the 8 processed values to be written by T8 (i.e., T8P1, T8P2 … T8P 8); followed by a filler element; and so on, up to the value element corresponding to the 8 processed values to be written by thread T128 (i.e., T128P1, T128P2 … T128P 8).
As described herein, the memory location to which a thread writes a processed value in step S1306 may be determined from the sum of the base memory address, the write offset, and the write fill amount. The sum may be used to determine the element of the write buffer array corresponding to the processed value to be written, wherein the base memory address is the first element within the write buffer array. In example 3, for a value having the coordinates [ i ] [ j ] [ k ] [ m ] within the multi-dimensional array [ I ] [ J ] [ K ] [ M ], the write offset is equal to (i×J×K×M) + (j×K×M) + (k×M) + m, and the write fill amount is equal to the write offset divided by 32 (the fill frequency). The division may be integer division.
Fig. 15C shows the contents of the memory 106 when this write buffer array is mapped to the memory 106 such that the following is implemented: the processed values corresponding to the values of the two-dimensional array are written into the memory 106 in the memory locations to which the corresponding value elements of the write buffer array are mapped, and the processed values corresponding to the values of the two-dimensional array are not written into the memory 106 in the memory locations to which the fill elements of the write buffer array are mapped. In fig. 15C, the memory location to which the fill element of the write buffer array is mapped is indicated by an "X". For ease of illustration, multiple rows of memory locations are omitted.
In example 3, a respective processed value may be written to each of the memory banks of the memory (e.g., the memory 106 of fig. 11) in at least one of the plurality of write steps. For example, referring to fig. 15C, in one write step: thread T1 may write T1P1 into memory bank 1; thread T2 may write T2P1 into memory bank 9; thread T5 may write T5P1 into memory bank 2; thread T6 may write T6P1 into memory bank 10; thread T9 may write T9P1 into memory bank 3; thread T10 may write T10P1 into memory bank 11; thread T13 may write T13P1 into memory bank 4; thread T14 may write T14P1 into memory bank 12; thread T17 may write T17P1 into memory bank 5; thread T18 may write T18P1 into memory bank 13; thread T21 may write T21P1 into memory bank 6; thread T22 may write T22P1 into memory bank 14; thread T25 may write T25P1 into memory bank 7; thread T26 may write T26P1 into memory bank 15; thread T29 may write T29P1 into memory bank 8; and thread T30 may write T30P1 into memory bank 16.
This third method is efficient because successive sets of 16 of the 128 threads can write to the 16 memory banks of the memory 106 in each write step (i.e., such that all 128 of the P1 values are written in 8 write steps). This efficient write process may then be repeated seven more times in order for each of the plurality of threads to write each of the other seven processed values of that thread (i.e., each of P2 through P8).
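Example 3 can be sketched in the same way; note that the conflict-free group in a single write step is now threads T1, T2, T5, T6, …, T29, T30 rather than T1 to T16 (assumptions as in the previous sketches):

```python
I, J, K, M = 4, 4, 8, 8
NUM_BANKS = 16
FILL_FREQUENCY = 32  # example 3

def write_bank(t, p):
    i, j, k, m = (t - 1) // (J * K), ((t - 1) // K) % J, (t - 1) % K, p - 1
    offset = i * J * K * M + j * K * M + k * M + m
    return (offset + offset // FILL_FREQUENCY) % NUM_BANKS + 1

group = [t for pair_start in range(1, 32, 4) for t in (pair_start, pair_start + 1)]
print(group)  # [1, 2, 5, 6, 9, 10, 13, 14, 17, 18, 21, 22, 25, 26, 29, 30]
banks = [write_bank(t, 1) for t in group]
print(banks)  # [1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15, 8, 16], as in fig. 15C
assert len(set(banks)) == NUM_BANKS
```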
Example 4
Referring to fig. 15D, which illustrates a plurality of processing values written to memory 106 using a fourth method according to principles described herein, example 4 may be appreciated. The processing values shown in fig. 15D are values processed by a plurality of threads as shown in fig. 4 and 5, wherein each of the threads T1 to T128 processes eight respective values (e.g., pixels P1 to P8) in an initial stage of a separable operation as described herein.
In example 4, the write buffer array for determining which memory location in memory 106 each processing value is to be written to includes multiple sets of 16 consecutive value elements corresponding to the values of the two-dimensional array, with a fill element interspersed between each of the multiple sets. In example 4, the write buffer array is a one-dimensional sequence of elements comprising: a value element corresponding to the first processed value P1 for each of threads T1 through T8 of sub-array 1 (i.e., T1P1, T2P1 … T8P 1); followed by a value element corresponding to the second processed value P2 for each of threads T1 through T8 of sub-array 1 (i.e., T1P2, T2P2 … T8P 2); followed by a filler element; followed by a value element (i.e., T1P3, T2P3 … T8P 3) corresponding to the third processed value P3 for each of the threads T1 through T8 of sub-array 1; followed by a value element corresponding to the fourth processed value P4 for each of threads T1 through T8 of sub-array 1 (i.e., T1P4, T2P4 … T8P 4); followed by a filler element; and so on, up to the value element corresponding to the eighth processing value P8 for each of the threads T1 to T8 of sub-array 1 (i.e., T1P8, T2P8 … T8P 8); followed by a filler element; followed by the value element of the first processed value P1 corresponding to each of the threads T9 to T16 of the sub-array 2 (i.e. T9P1, T10P1 … T16P 1); followed by a value element corresponding to the second processed value P2 for each of the threads T9 through T16 of sub-array 2 (i.e., T9P2, T10P2 … T16P 2); followed by a filler element; and so on, up to an element corresponding to the eighth processing value P8 for each of the threads T9 through T16 of the sub-array 2 (i.e., T9P8, T10P8 … T16P 8); followed by a filler element; and so on, up to the element corresponding to the eighth processing value P8 for each of the threads T121 through T128 of the sub-array 16 (i.e., T121P8, T122P8 … T128P 8).
As described herein, the memory location to which a thread writes a processed value in step S1306 may be determined from the sum of the base memory address, the write offset, and the write fill amount. The sum may be used to determine the element of the write buffer array corresponding to the processed value to be written, wherein the base memory address is the first element within the write buffer array. In example 4, for a value having the coordinates [ i ] [ j ] [ k ] [ m ] within the multi-dimensional array [ I ] [ J ] [ K ] [ M ], the write offset is equal to (i×J×K×M) + (j×K×M) + (m×K) + k, and the write fill amount is equal to the write offset divided by 16 (the fill frequency). The division may be integer division.
Fig. 15D shows the contents of the memory 106 when this write buffer array is mapped to the memory 106 such that the following is implemented: the processed values corresponding to the values of the two-dimensional array are written into the memory 106 in the memory locations to which the corresponding value elements of the write buffer array are mapped, and the processed values corresponding to the values of the two-dimensional array are not written into the memory 106 in the memory locations to which the fill elements of the write buffer array are mapped. In fig. 15D, the memory location to which the fill element of the write buffer array is mapped is indicated by an "X". For ease of illustration, multiple rows of memory locations are omitted.
In example 4, a respective processed value may be written to each of the memory banks of the memory (e.g., the memory 106 of fig. 11) in at least one of the plurality of write steps. For example, referring to fig. 15D, in one write step: thread T1 may write T1P1 into memory bank 1; thread T2 may write T2P1 into memory bank 2; thread T3 may write T3P1 into memory bank 3; thread T4 may write T4P1 into memory bank 4; thread T5 may write T5P1 into memory bank 5; thread T6 may write T6P1 into memory bank 6; thread T7 may write T7P1 into memory bank 7; thread T8 may write T8P1 into memory bank 8; thread T17 may write T17P1 into memory bank 9; thread T18 may write T18P1 into memory bank 10; thread T19 may write T19P1 into memory bank 11; thread T20 may write T20P1 into memory bank 12; thread T21 may write T21P1 into memory bank 13; thread T22 may write T22P1 into memory bank 14; thread T23 may write T23P1 into memory bank 15; and thread T24 may write T24P1 into memory bank 16.
This fourth method is efficient because successive sets of 16 of the 128 threads can write to the 16 memory banks of the memory 106 in each write step (i.e., such that all 128 of the P1 values are written in 8 write steps). This efficient write process may then be repeated seven more times in order for each of the plurality of threads to write each of the other seven processed values of that thread (i.e., each of P2 through P8).
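Example 4 uses a different write offset, with the roles of k and m swapped in the two lowest-order terms; a sketch under the same assumptions as the previous examples:

```python
I, J, K, M = 4, 4, 8, 8
NUM_BANKS = 16
FILL_FREQUENCY = 16  # example 4

def write_bank(t, p):
    i, j, k, m = (t - 1) // (J * K), ((t - 1) // K) % J, (t - 1) % K, p - 1
    offset = i * J * K * M + j * K * M + m * K + k      # note (m*K) + k rather than (k*M) + m
    return (offset + offset // FILL_FREQUENCY) % NUM_BANKS + 1

group = list(range(1, 9)) + list(range(17, 25))         # threads T1..T8 and T17..T24
banks = [write_bank(t, 1) for t in group]
print(banks)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], as in fig. 15D
assert len(set(banks)) == NUM_BANKS
```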
Returning to FIG. 13, in step S1308, for each of the plurality of subarrays, each of the plurality of threads reads a respective second plurality of processed values from the memory 106 through the plurality of read steps, the second plurality of processed values corresponding to a vertical one-dimensional sequence of values of the subarray of values in a transposed position relative to the subarray of values in the array of values. This can be understood by reference to fig. 4, 8, 14 and table 1.
As shown in fig. 14, if the array of values 1400 is transposed, the array of values is effectively reflected about the diagonal 1402. The sub-arrays intersecting the diagonal 1402 (i.e., sub-arrays 1, 6, 11, and 16 shown in fig. 14) do not change position during the transposition (although the rows and columns of values within each such sub-array are transposed). Thus, for example, for sub-array 1, the sub-array in the transposed position within the array 1400 is sub-array 1. For sub-arrays that do not intersect the diagonal 1402, the position of the sub-array changes during the transposition of the array 1400. Thus, for example, for sub-array 2, the sub-array in the transposed position within the array 1400 is sub-array 5. Table 1 gives, for each of the sub-arrays within the array 1400, the sub-array in the transposed position.
TABLE 1

Sub-array within array 1400:       1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
Sub-array in transposed position:  1   5   9   13  2   6   10  14  3   7   11  15  4   8   12  16
Thus, in an example, in step S1306, referring to fig. 4 and 14, a thread T1 writes eight processed values (e.g., T1P1, T1P2, T1P3, T1P4, T1P5, T1P6, T1P7, and T1P 8) corresponding to eight values within a one-dimensional sequence of values (e.g., column) of sub-array 1, the thread performing an initial phase of separable operations for the sub-array. In step S1308, referring to fig. 4, 8, and 14, the thread T1 reads eight processed values (e.g., T1P1, T2P1, T3P1, T4P1, T5P1, T6P1, T7P1, and T8P 1) corresponding to eight values within a vertical one-dimensional sequence of values (e.g., a column) of sub-array 1 (sub-array in a transposed position with respect to sub-array 1 within the array), for which the thread is to perform the later stage of the separable operation.
In another example, in step S1306, referring to fig. 4 and 14, thread T9 writes eight processed values (e.g., T9P1, T9P2, T9P3, T9P4, T9P5, T9P6, T9P7, and T9P 8) corresponding to eight values within a one-dimensional sequence of values (e.g., column) of sub-array 2, the thread performs an initial phase of separable operations for the sub-array. In step S1308, referring to fig. 4, 8, and 14, the thread T9 reads eight processed values (e.g., T33P1, T34P1, T35P1, T36P1, T37P1, T38P1, T39P1, and T40P 1) corresponding to eight values within a vertical one-dimensional sequence of values (e.g., a column) of sub-array 5 (sub-array in a transposed position with respect to sub-array 2 within the array), for which the thread is to perform the later stage of the separable operation.
For the sake of brevity, the processing values written by each of the other threads shown in FIG. 4 for each of the sub-arrays shown in FIG. 14 after the initial stage of the separable operation is performed, and the processing values read by each of the other threads shown in FIG. 8 before the latter stage of the separable operation is performed, will not be elaborated herein. By referring to fig. 4 and 14, the skilled person will be able to easily determine which processing values to write by each of the other threads for each of the sub-arrays. As described herein, by comparing fig. 4 and 8, one skilled in the art will readily determine which process values to read by each of the other threads shown in fig. 8.
The memory bank from which the thread is to read the processing value may be determined from the read buffer array. The read buffer array used in step S1308 may have the same characteristics as the write buffer array used in step S1306-as described herein. For example, the read buffer array may be a one-dimensional array. The read buffer array may have a greater number of elements than the number of elements in the two-dimensional array of values. The read buffer array may include multiple sets of consecutive value elements corresponding to the values of the two-dimensional array, and fill elements interspersed between the multiple sets. The relative numbers of values and fill elements in the read buffer array used in step S1308 may correspond to the relative numbers of values and fill elements in the write buffer array used in step S1306.
The contents of the memory may be mapped to the structure of the read buffer array such that processed values corresponding to the values of the two-dimensional array are read from memory locations in the memory that are mapped onto the value elements of the read buffer array, and values are not read from memory locations in the memory that are mapped onto the fill elements of the read buffer array. For example, the contents of the memory 106 of FIG. 11 may be mapped to the structure of the read buffer array by mapping the 16 memory locations of the first row of the memory 106 to the first 16 elements (e.g., the 1st through 16th values) and the 16 memory locations of the second row of the memory 106 to the subsequent 16 elements (e.g., the 17th through 32nd values). It should be appreciated that it is not necessary to map the memory 106 to the read buffer array starting from the first ("upper left") memory location of the memory 106. Instead, the memory 106 may be mapped to the read buffer array starting from the first memory location to which a processed value was written in step S1306.
As described herein, each value in the array of values may be assigned a coordinate [ i ] [ j ] [ k ] [ m ], which defines the position of that value within the multi-dimensional array [ I ] [ J ] [ K ] [ M ] as defined herein. In general, a thread that writes a processed value having coordinates [ i ] [ j ] [ k ] [ m ] within the multi-dimensional array [ I ] [ J ] [ K ] [ M ] to memory in step S1306 may read a processed value having transposed coordinates (e.g., [ j ] [ i ] [ m ] [ k ]) within the multi-dimensional array [ I ] [ J ] [ K ] [ M ] from memory in step S1308. For example, as described herein, in step S1306, thread T1 writes the following processed values: T1P1 having the coordinates [0] [0] [0] [0]; T1P2 having the coordinates [0] [0] [0] [1]; T1P3 having the coordinates [0] [0] [0] [2]; T1P4 having the coordinates [0] [0] [0] [3]; T1P5 having the coordinates [0] [0] [0] [4]; T1P6 having the coordinates [0] [0] [0] [5]; T1P7 having the coordinates [0] [0] [0] [6]; and T1P8 having the coordinates [0] [0] [0] [7]. As described herein, in step S1308, thread T1 reads the following processed values: T1P1 having the coordinates [0] [0] [0] [0]; T2P1 having the coordinates [0] [0] [1] [0]; T3P1 having the coordinates [0] [0] [2] [0]; T4P1 having the coordinates [0] [0] [3] [0]; T5P1 having the coordinates [0] [0] [4] [0]; T6P1 having the coordinates [0] [0] [5] [0]; T7P1 having the coordinates [0] [0] [6] [0]; and T8P1 having the coordinates [0] [0] [7] [0].
The memory location from which a thread reads a processed value in step S1308 may be determined from a base memory address, a read offset, and a read fill amount. The read offset and the read fill amount may depend on the position (e.g., the coordinates) within the array of values of the value to which the processed value to be read corresponds. The memory location from which the thread is to read the processed value may be determined from the sum of the base memory address, the read offset, and the read fill amount. The sum may be used to determine the element in the read buffer array corresponding to the processed value to be read. Non-limiting examples of suitable read offsets are provided below. The read fill amount may be equal to the read offset divided by the fill frequency. The division may be integer division. The fill frequency used in step S1308 may be equal to the fill frequency used in step S1306, as described herein.
It should be appreciated that the read buffer array may exist in physical memory (e.g., may be implemented in registers in register bank 110) such that a thread causes processing values from memory to be physically read into the read buffer array before the contents of the read buffer array are transferred into the corresponding one or more registers accessible to each thread. Alternatively, the read buffer array may be a construct conceptually used by threads to determine from which memory locations in memory those threads will read values, where processing values are physically read directly from the determined memory locations into a corresponding one or more registers accessible to each thread.
In step S1308, by applying the principles described herein, a respective processed value is read from each of the memory banks of the memory (e.g., memory 106 of fig. 11) in at least one of the plurality of read steps. In particular, by applying the principles described herein, a respective processed value may be read from each of the memory banks of the memory (e.g., memory 106 of fig. 11) in each of the plurality of read steps. That is, during the reading in step S1308, excessive bank conflicts can be avoided. This can be understood by returning to the four specific examples described above with reference to fig. 15A to 15D.
Example 1
The relative numbers of value elements and fill elements in the read buffer array used in step S1308 in example 1 may correspond to the relative numbers of value elements and fill elements in the write buffer array used in step S1306 in example 1, as described herein. In example 1, the contents of the memory 106 shown in fig. 15A may be mapped to a read buffer array. The memory location from which a thread reads a processed value in step S1308 may be determined from the sum of the base memory address, the read offset, and the read fill amount. The sum may be used to determine the element of the read buffer array corresponding to the processed value to be read, wherein the base memory address corresponds to the first element within the read buffer array. As described herein, in general, a thread that writes a processed value having coordinates [i][j][k][m] to memory in step S1306 may read a processed value having transposed coordinates (e.g., [j][i][m][k]) from memory in step S1308. Therefore, in example 1, the read offset used by the thread that wrote the processed value having coordinates [i][j][k][m] in step S1306 is equal to (j×J×K×M) + (i×K×M) + (m×M) + k, and the read fill amount is equal to the read offset divided by 8 (the fill frequency). The division may be integer division.
In example 1, a respective processed value may be read from each of the memory banks of the memory (e.g., memory 106 of fig. 11) in at least one of the plurality of read steps. For example, referring to fig. 15A, in one read step: thread T1 may read T1P1 from bank 1; thread T2 may read T1P2 from bank 2; thread T3 may read T1P3 from bank 3; thread T4 may read T1P4 from bank 4; thread T5 may read T1P5 from bank 5; thread T6 may read T1P6 from bank 6; thread T7 may read T1P7 from bank 7; thread T8 may read T1P8 from bank 8; thread T33 may read T9P1 from bank 9; thread T34 may read T9P2 from bank 10; thread T35 may read T9P3 from bank 11; thread T36 may read T9P4 from bank 12; thread T37 may read T9P5 from bank 13; thread T38 may read T9P6 from bank 14; thread T39 may read T9P7 from bank 15; and thread T40 may read T9P8 from bank 16. This first approach is efficient because multiple sets of 16 of the 128 threads can read from the 16 banks of the memory 106 in each read step (i.e., such that all 128 of the threads can each read in one respective processed value over 8 read steps). This efficient read process may then be repeated seven more times so that each of the plurality of threads reads each of the other seven respective processed values assigned to that thread.
Example 2
The relative numbers of value elements and fill elements in the read buffer array used in step S1308 in example 2 may correspond to the relative numbers of value elements and fill elements in the write buffer array used in step S1306 in example 2, as described herein. In example 2, the contents of the memory 106 shown in fig. 15B may be mapped to a read buffer array. The memory location from which a thread reads a processed value in step S1308 may be determined from the sum of the base memory address, the read offset, and the read fill amount. The sum may be used to determine the element of the read buffer array corresponding to the processed value to be read, wherein the base memory address corresponds to the first element within the read buffer array. As described herein, in general, a thread that writes a processed value having coordinates [i][j][k][m] to memory in step S1306 may read a processed value having transposed coordinates (e.g., [j][i][m][k]) from memory in step S1308. Therefore, in example 2, the read offset used by the thread that wrote the processed value having coordinates [i][j][k][m] in step S1306 is equal to (j×J×K×M) + (i×K×M) + (m×M) + k, and the read fill amount is equal to the read offset divided by 16 (the fill frequency). The division may be integer division.
In example 2, a respective processed value may be read from each of the memory banks of the memory (e.g., memory 106 of fig. 11) in at least one of the plurality of read steps. For example, referring to fig. 15B, in one read step: thread T1 may read T1P1 from bank 1; thread T2 may read T1P2 from bank 2; thread T3 may read T1P3 from bank 3; thread T4 may read T1P4 from bank 4; thread T5 may read T1P5 from bank 5; thread T6 may read T1P6 from bank 6; thread T7 may read T1P7 from bank 7; thread T8 may read T1P8 from bank 8; thread T65 may read T17P1 from bank 9; thread T66 may read T17P2 from bank 10; thread T67 may read T17P3 from bank 11; thread T68 may read T17P4 from bank 12; thread T69 may read T17P5 from bank 13; thread T70 may read T17P6 from bank 14; thread T71 may read T17P7 from bank 15; and thread T72 may read T17P8 from bank 16.
This second approach is efficient because multiple sets of 16 of the 128 threads can read from the 16 banks of the memory 106 in each read step (i.e., such that all 128 of the threads can each read in one respective processed value over 8 read steps). This efficient read process may then be repeated seven more times so that each of the plurality of threads reads each of the other seven respective processed values assigned to that thread.
Example 3
The relative numbers of value elements and fill elements in the read buffer array used in step S1308 in example 3 may correspond to the relative numbers of value elements and fill elements in the write buffer array used in step S1306 in example 3, as described herein. In example 3, the contents of the memory 106 shown in fig. 15C may be mapped to a read buffer array. The memory location from which a thread reads a processed value in step S1308 may be determined from the sum of the base memory address, the read offset, and the read fill amount. The sum may be used to determine the element of the read buffer array corresponding to the processed value to be read, wherein the base memory address corresponds to the first element within the read buffer array. As described herein, in general, a thread that writes a processed value having coordinates [i][j][k][m] to memory in step S1306 may read a processed value having transposed coordinates (e.g., [j][i][m][k]) from memory in step S1308. Therefore, in example 3, the read offset used by the thread that wrote the processed value having coordinates [i][j][k][m] in step S1306 is equal to (j×J×K×M) + (i×K×M) + (m×M) + k, and the read fill amount is equal to the read offset divided by 32 (the fill frequency). The division may be integer division.
In example 3, a respective processed value may be read from each of the memory banks of the memory (e.g., memory 106 of fig. 11) in at least one of the plurality of read steps. For example, referring to fig. 15C, in one read step: thread T1 may read T1P1 from bank 1; thread T2 may read T1P2 from bank 2; thread T3 may read T1P3 from bank 3; thread T4 may read T1P4 from bank 4; thread T5 may read T1P5 from bank 5; thread T6 may read T1P6 from bank 6; thread T7 may read T1P7 from bank 7; thread T8 may read T1P8 from bank 8; thread T9 may read T33P1 from bank 9; thread T10 may read T33P2 from bank 10; thread T11 may read T33P3 from bank 11; thread T12 may read T33P4 from bank 12; thread T13 may read T33P5 from bank 13; thread T14 may read T33P6 from bank 14; thread T15 may read T33P7 from bank 15; and thread T16 may read T33P8 from bank 16.
This third approach is efficient because multiple sets of 16 of the 128 threads can read from the 16 banks of the memory 106 in each read step (i.e., such that all 128 of the threads can each read in one respective processed value over 8 read steps). This efficient read process may then be repeated seven more times so that each of the plurality of threads reads each of the other seven respective processed values assigned to that thread.
Example 4
The relative numbers of value elements and fill elements in the read buffer array used in step S1308 in example 4 may correspond to the relative numbers of value elements and fill elements in the write buffer array used in step S1306 in example 4, as described herein. In example 4, the contents of the memory 106 shown in fig. 15D may be mapped to a read buffer array. The memory location from which a thread reads a processed value in step S1308 may be determined from the sum of the base memory address, the read offset, and the read fill amount. The sum may be used to determine the element of the read buffer array corresponding to the processed value to be read, wherein the base memory address corresponds to the first element within the read buffer array. As described herein, in general, a thread that writes a processed value having coordinates [i][j][k][m] to memory in step S1306 may read a processed value having transposed coordinates (e.g., [j][i][m][k]) from memory in step S1308. Therefore, in example 4, the read offset used by the thread that wrote the processed value having coordinates [i][j][k][m] in step S1306 is equal to (j×J×K×M) + (i×K×M) + (k×K) + m, and the read fill amount is equal to the read offset divided by 16 (the fill frequency). The division may be integer division.
In example 4, a respective processed value may be read from each of the memory banks of the memory (e.g., memory 106 of fig. 11) in at least one of the plurality of read steps. For example, referring to fig. 15D, in one read step: thread T1 may read T1P1 from bank 1; thread T2 may read T1P2 from bank 9; thread T3 may read T1P3 from bank 2; thread T4 may read T1P4 from bank 10; thread T5 may read T1P5 from bank 3; thread T6 may read T1P6 from bank 11; thread T7 may read T1P7 from bank 4; thread T8 may read T1P8 from bank 12; thread T33 may read T9P1 from bank 5; thread T34 may read T9P2 from bank 13; thread T35 may read T9P3 from bank 6; thread T36 may read T9P4 from bank 14; thread T37 may read T9P5 from bank 7; thread T38 may read T9P6 from bank 15; thread T39 may read T9P7 from bank 8; and thread T40 may read T9P8 from bank 16.
This fourth approach is efficient because multiple sets of 16 of the 128 threads can read from the 16 banks of the memory 106 in each read step (i.e., such that all 128 of the threads can each read in one respective processed value over 8 read steps). This efficient read process may then be repeated seven more times so that each of the plurality of threads reads each of the other seven respective processed values assigned to that thread.
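Each of the four examples above arranges the accesses of a read step so that the locations touched by a set of 16 threads fall in 16 different banks. A small host-side helper such as the following could be used to check a set of per-step memory locations for bank conflicts; it assumes that the bank of a location is the location index modulo the number of banks, and it is an illustrative utility rather than part of the method described above.

```cuda
#include <cstdio>
#include <vector>

// Number of banks in the memory; 16 matches the memory 106 discussed above.
constexpr int NUM_BANKS = 16;

// Returns true if every location in 'locations' falls in a different bank,
// assuming the bank of a location is its index modulo NUM_BANKS.
bool conflict_free(const std::vector<int>& locations)
{
    bool used[NUM_BANKS] = { false };
    for (int loc : locations) {
        int bank = loc % NUM_BANKS;
        if (used[bank]) return false;   // two locations hit the same bank
        used[bank] = true;
    }
    return true;
}

int main()
{
    // 16 consecutive locations always touch 16 different banks...
    std::vector<int> consecutive;
    for (int i = 0; i < 16; ++i) consecutive.push_back(i);

    // ...whereas 16 locations separated by a stride of 16 all collide in one bank.
    std::vector<int> strided;
    for (int i = 0; i < 16; ++i) strided.push_back(i * 16);

    std::printf("consecutive: %s\n", conflict_free(consecutive) ? "conflict free" : "conflicts");
    std::printf("strided:     %s\n", conflict_free(strided) ? "conflict free" : "conflicts");
    return 0;
}
```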
Returning to fig. 13, in step S1310, for each of the plurality of sub-arrays, a plurality of threads are used to perform a later stage of separable operations on the plurality of processed values read by the plurality of threads to generate a respective output value for each value of the sub-array of values in the transpose position. Step S1310 may be performed by performing steps S1004 to S1010 as described herein with reference to fig. 10. For example: by using the threads T1 to T8 as shown in fig. 8, the latter stage of the separable operation can be performed for the subarray 1 as shown in fig. 14; by using the threads T33 to T40 as shown in fig. 8, the latter stage of the separable operation can be performed for the subarray 2 as shown in fig. 14; by using the threads T65 to T72 as shown in fig. 8, the latter stage of the separable operation can be performed for the subarray 3 as shown in fig. 14; by using the threads T97 to T104 as shown in fig. 8, the latter stage of the separable operation can be performed for the subarray 4 as shown in fig. 14; and by using the threads T121 to T128 as shown in fig. 8, the latter stage of the separable operation can be performed for the subarray 16 as shown in fig. 14.
After the latter stage of the operation is performed, each of the plurality of threads (e.g., each of the threads shown in FIG. 8) may write the segment of output values that the thread has generated to memory (e.g., global memory 108) such that the output value corresponding to each value of the array of values for which the operation is to be performed is written to memory (e.g., global memory 108). This writing may also be performed using the principles described herein, for example, to minimize the number of bank conflicts that arise when writing output values to global memory 108.
The output value for each value corresponding to the array of values written to memory (e.g., global memory 108) may be the output of the method of fig. 13.
In the illustrative examples described above with reference to fig. 4, 8, 13, and 14, the initial stage of the separable operation in step S1304 is performed "horizontally", and the latter stage of the separable operation in step S1310 is performed "vertically". It should be understood that, alternatively, the initial stage of the separable operation in step S1304 may be performed "vertically", and the later stage of the separable operation in step S1310 may be performed "horizontally". In other words, in the illustrative examples described above with reference to fig. 4, 8, 13, and 14, in step S1306, the one-dimensional value sequence of each value sub-array is a row of values of the value sub-array, and in step S1308, the vertical one-dimensional value sequence of each value sub-array in the transposed position is a column of values of the value sub-array in the transposed position. It should be understood that, alternatively, in step S1306, the one-dimensional sequence of values for each sub-array of values may be a column of values for the sub-array of values, and in step S1308, the vertical one-dimensional sequence of values for each sub-array of values in the transposed position may be a row of values for the sub-array of values in the transposed position.
In the illustrative example of the method described herein with reference to fig. 13, in each stage of the separable operation (e.g., in steps S1304 and S1310), each one-dimensional sequence of values of the two-dimensional array may be divided into a plurality of value segments assigned to each of a respective plurality of threads. For example, referring to fig. 5, a first row of values 500 of the value array 400 may be divided into four segments that may be assigned to threads T1, T9, T17, and T25 in the initial stage of the operation, and referring to fig. 9, the processed values corresponding to a first column of values 900 of the value array 800 may be divided into four segments that may be assigned to threads T1, T9, T17, and T25 in the later stage of the operation. In this illustrative example, during each stage of the operation, adjacent threads may cooperate to complete each stage of the one-dimensional operation, as described herein with reference to steps S1004 through S1010 of fig. 10. It should be appreciated that this need not be the case in the method described herein with reference to fig. 13. That is, in an alternative example, in each stage of the separable operation (e.g., in steps S1304 and S1310), each entire one-dimensional sequence of values of the two-dimensional array may be assigned to a respective one of the plurality of threads. For example, referring to fig. 5, the entire first row of values 500 of the value array 400 may be assigned to a single thread in the initial stage of the operation (i.e., in step S1304), and referring to fig. 9, the processed values corresponding to the entire first column of values 900 of the value array 800 may be assigned to that single thread in the later stage of the operation (i.e., in step S1310). In this alternative example, the thread need not cooperate with another thread in order to complete each stage of the operation, because the thread will have access, within its registers, to all of the values in the one-dimensional sequence of values to be operated on in that stage, and thus each stage of the operation can be completed independently (e.g., the one-dimensional Gaussian operation described herein with reference to fig. 2).
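As a rough sketch of this alternative, the device code below has each thread hold an entire one-dimensional sequence of values locally and apply a short 1D filter to it without cooperating with any other thread. The 3-tap kernel weights, the row length of 8, and the border clamping are illustrative assumptions and are not intended to reproduce the one-dimensional Gaussian operation of fig. 2 exactly.

```cuda
#define ROW_LEN 8   // illustrative length of the one-dimensional sequence held by one thread

// Applies a short 1D filter to a row of values held entirely in the calling
// thread's local storage, with no cooperation between threads.
__device__ void filter_row_independently(const float in[ROW_LEN], float out[ROW_LEN])
{
    const float w[3] = { 0.25f, 0.5f, 0.25f };   // example separable 1D kernel weights
    for (int x = 0; x < ROW_LEN; ++x) {
        float acc = 0.0f;
        for (int t = -1; t <= 1; ++t) {
            int xx = x + t;
            if (xx < 0) xx = 0;                   // clamp at the row ends
            if (xx >= ROW_LEN) xx = ROW_LEN - 1;
            acc += w[t + 1] * in[xx];
        }
        out[x] = acc;
    }
}

// One thread per row: each thread loads its whole row, filters it on its own,
// and writes the result back.
__global__ void filter_rows(const float* in, float* out, int num_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    float local_in[ROW_LEN], local_out[ROW_LEN];
    for (int x = 0; x < ROW_LEN; ++x) local_in[x] = in[row * ROW_LEN + x];
    filter_row_independently(local_in, local_out);
    for (int x = 0; x < ROW_LEN; ++x) out[row * ROW_LEN + x] = local_out[x];
}
```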
FIG. 16 illustrates a computer system in which the processing units described herein may be implemented. The computer system includes a Central Processing Unit (CPU) 1602, a Graphics Processing Unit (GPU) 100, a memory 108, a Neural Network Accelerator (NNA) 1608, and other devices 1614 such as a display 1616, speakers 1618, and a camera 1622. The array of values (e.g., image and/or audio signals) input to the methods described herein may be generated by one or more of the other devices 1614 (e.g., the camera 1622 and/or a microphone (not shown in fig. 16)). The output of the methods described herein (e.g., the filtered image and/or the filtered audio signal) may be provided to one or more of the other devices 1614, such as the display 1616 and/or the speakers 1618. One or more processing elements 102 are implemented on the GPU 100. In other examples, one or more of the depicted components may be omitted from the computer system. The components of the computer system may communicate with each other via a communication bus 1620.
The graphics processing units of fig. 1A and 1B, and the computer system of fig. 16, are shown as including a plurality of functional blocks. This is merely illustrative and is not intended to define a strict division between the different logic elements of such entities. Each of the functional blocks may be provided in any suitable manner. It should be understood that intermediate values described herein as formed by a processing unit need not be physically generated by the processing unit at any point in time, and may represent only logical values that conveniently describe the processing performed by the processing unit between its inputs and outputs.
The processing units described herein may be embodied in hardware on an integrated circuit. The processing units described herein may be configured to perform any of the methods described herein. In general, any of the functions, methods, techniques or components described above may be implemented in software, firmware, hardware (e.g., fixed logic circuitry) or any combination thereof. The terms "module," "functionality," "component," "element," "unit," "block," and "logic" may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs specified tasks when executed on a processor. The algorithms and methods described herein may be executed by one or more processors executing code that causes the processors to perform the algorithms/methods. Examples of a computer-readable storage medium include Random Access Memory (RAM), read-only memory (ROM), optical disks, flash memory, hard disk memory, and other memory devices that can store instructions or other data using magnetic, optical, and other techniques and that can be accessed by a machine.
The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for a processor, including code expressed in a machine language, an interpreted language, or a scripting language. Executable code includes binary code, machine code, byte code, code defining an integrated circuit (e.g., a hardware description language or netlist), and code expressed in programming language code such as C, java or OpenCL. The executable code may be, for example, any kind of software, firmware, script, module, or library that, when properly executed, handled, interpreted, compiled, run in a virtual machine or other software environment, causes the processor of the computer system supporting the executable code to perform the tasks specified by the code.
The processor, computer, or computer system may be any kind of device, machine, or special purpose circuit, or a set or portion thereof, that has processing capabilities such that it can execute instructions. The processor may be or include any kind of general purpose or special purpose processor, such as CPU, GPU, NNA, a system on a chip, a state machine, a media processor, an Application Specific Integrated Circuit (ASIC), a programmable logic array, a Field Programmable Gate Array (FPGA), or the like. The computer or computer system may include one or more processors.
The present invention is also intended to cover software defining a configuration of hardware as described herein, such as HDL (hardware description language) software, as used for designing integrated circuits or for configuring programmable chips to perform desired functions. That is, a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition data set may be provided, which when processed (i.e., run) in an integrated circuit manufacturing system configures the system to manufacture a processing unit configured to perform any of the methods described herein, or to manufacture a processing unit comprising any of the devices described herein. The integrated circuit definition data set may be, for example, an integrated circuit description.
Accordingly, a method of manufacturing a processing unit as described herein at an integrated circuit manufacturing system may be provided. Furthermore, an integrated circuit definition data set may be provided which, when processed in an integrated circuit manufacturing system, causes a method of manufacturing a processing unit to be performed.
The integrated circuit definition data set may be in the form of computer code, for example, as a netlist, as code for configuring a programmable chip, or as a hardware description language defining hardware suitable for fabrication at any level in an integrated circuit, including as Register Transfer Level (RTL) code, as a high-level circuit representation (such as Verilog or VHDL), or as a low-level circuit representation (such as OASIS (RTM) and GDSII). A higher-level representation (e.g., RTL) that logically defines hardware suitable for fabrication in an integrated circuit may be processed at a computer system configured to generate a manufacturing definition of the integrated circuit, in the context of a software environment that includes definitions of circuit elements and rules for combining those elements, in order to generate the manufacturing definition of the integrated circuit so defined by the representation. As is typically the case when software executes at a computer system to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required in order for a computer system configured to generate a manufacturing definition of an integrated circuit to execute code defining the integrated circuit and thereby generate the manufacturing definition of that integrated circuit.
An example of processing an integrated circuit definition data set at an integrated circuit manufacturing system to configure the system to manufacture a processing unit will now be described with reference to fig. 17.
Fig. 17 illustrates an example of an Integrated Circuit (IC) fabrication system 1702 configured to fabricate a processing unit as described in any of the examples herein. Specifically, the IC manufacturing system 1702 includes a layout processing system 1704 and an integrated circuit generation system 1706. The IC fabrication system 1702 is configured to receive an IC definition data set (e.g., defining a processing unit as described in any of the examples herein), process the IC definition data set, and generate an IC (e.g., embodying the processing unit as described in any of the examples herein) from the IC definition data set. Processing of the IC definition data set configures the IC fabrication system 1702 to fabricate an integrated circuit embodying a processing unit as described in any of the examples herein.
Layout processing system 1704 is configured to receive and process the IC definition data set to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art and may involve, for example, synthesizing RTL codes to determine a gate level representation of a circuit to be generated, for example in terms of logic components (e.g., NAND, NOR, AND, OR, MUX and FLIP-FLOP components). By determining the location information of the logic components, the circuit layout may be determined from the gate level representation of the circuit. This may be done automatically or with the participation of a user in order to optimize the circuit layout. When the layout processing system 1704 has determined a circuit layout, it may output the circuit layout definition to the IC generation system 1706. The circuit layout definition may be, for example, a circuit layout description.
The IC generation system 1706 generates ICs according to a circuit layout definition, as is known in the art. For example, the IC generation system 1706 may implement a semiconductor device fabrication process for generating ICs that may involve a multi-step sequence of photolithography and chemical processing steps during which electronic circuits are gradually formed on a wafer made of semiconductor material. The circuit layout definition may be in the form of a mask that may be used in a lithographic process for generating an IC from the circuit definition. Alternatively, the circuit layout definitions provided to the IC generation system 1706 may be in the form of computer readable code that the IC generation system 1706 may use to form a suitable mask for generating the IC.
The different processes performed by the IC fabrication system 1702 may all be implemented in one location, e.g., by a party. Alternatively, the IC fabrication system 1702 may be a distributed system such that some processes may be performed at different locations and by different parties. For example, some of the following phases may be performed at different locations and/or by different parties: (i) Synthesizing an RTL code representing the IC definition dataset to form a gate level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate level representation; (iii) forming a mask according to the circuit layout; and (iv) using the mask to fabricate the integrated circuit.
In other examples, processing of the integrated circuit definition data set at the integrated circuit manufacturing system may configure the system to manufacture the processing unit without processing the integrated circuit definition data set to determine the circuit layout. For example, an integrated circuit definition dataset may define a configuration of a reconfigurable processor such as an FPGA, and processing of the dataset may configure the IC manufacturing system to generate (e.g., by loading configuration data into the FPGA) the reconfigurable processor having the defined configuration.
In some embodiments, the integrated circuit manufacturing definition data set, when processed in the integrated circuit manufacturing system, may cause the integrated circuit manufacturing system to generate an apparatus as described herein. For example, a configuration of an integrated circuit manufacturing system by an integrated circuit manufacturing definition dataset in the manner described above with reference to fig. 17 may manufacture an apparatus as described herein.
In some examples, the integrated circuit definition dataset may contain software running on or in combination with hardware defined at the dataset. In the example shown in fig. 17, the IC generation system may be further configured by the integrated circuit definition data set to load firmware onto the integrated circuit in accordance with program code defined at the integrated circuit definition data set at the time of manufacturing the integrated circuit or to otherwise provide the integrated circuit with program code for use with the integrated circuit.
Specific implementations of the concepts set forth in the present disclosure in apparatus, devices, modules, and/or systems (and in methods implemented herein) may provide improved performance over known specific implementations. Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During the manufacture of such devices, apparatuses, modules and systems (e.g., in integrated circuits), a tradeoff may be made between performance improvements and physical implementation, thereby improving the manufacturing method. For example, a tradeoff can be made between performance improvement and layout area, matching the performance of a known implementation, but using less silicon. This may be accomplished, for example, by reusing the functional blocks in a serial fashion or sharing the functional blocks among elements of an apparatus, device, module, and/or system. Rather, the concepts described herein that lead to improvements in the physical implementation of devices, apparatus, modules and systems (e.g., reduced silicon area) can be weighed against performance improvements. This may be accomplished, for example, by fabricating multiple instances of the module within a predefined area budget.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims (20)

1. A computer-implemented method of performing separable operations on a two-dimensional array of values at a processing unit comprising a memory, the memory comprising a plurality of memory banks, wherein in each writing or reading step, each memory bank is capable of being written to or read from by only one respective thread, the method comprising:
dividing the two-dimensional array of values into a plurality of two-dimensional sub-arrays of values;
for each sub-array of the plurality of sub-arrays:
Performing an initial stage of the separable operation for the sub-array of values using a plurality of threads to generate a respective processed value for each value of the sub-array of values;
Each of the plurality of threads writing a respective first plurality of processing values to the memory through a plurality of writing steps, the first plurality of processing values corresponding to a one-dimensional sequence of values of the sub-array of values;
Each of the plurality of threads reads a respective second plurality of processed values from the memory through a plurality of read steps, the second plurality of processed values corresponding to a vertical one-dimensional sequence of values of the subarray of values in a transposed position within the array of values relative to the subarray of values; and
Performing, using the plurality of threads, a later stage of the separable operation for the plurality of processed values read by the plurality of threads to generate a respective output value for each value of the sub-array of values in the transpose position;
Wherein in at least one of the plurality of writing steps a respective processing value is written into each of the memory banks of the memory, and in at least one of the plurality of reading steps a respective processing value is read from each of the memory banks of the memory.
2. The method of claim 1, wherein the memory bank into which a processed value is to be written is determined from a write buffer array having a greater number of elements than the number of elements in the two-dimensional array of values.
3. The method of claim 2, wherein the write buffer array comprises:
a value element corresponding to a value of the two-dimensional array; and
Filling elements.
4. A method as claimed in claim 3, wherein the fill element corresponds to a memory fill.
5. The method of claim 3 or 4, wherein the write buffer array comprises a plurality of sets of consecutive value elements corresponding to values of the two-dimensional array, and filler elements interspersed between the plurality of sets.
6. The method of claim 5, wherein the number of value elements in each group is (i) equal to the number of memory banks comprised by the memory, (ii) a multiple of the number, or (iii) a factor of the number.
7. The method of claim 5, wherein the number of value elements in each group is equal to or less than a number of threads for performing the separable operations on the array of values.
8. The method of claim 3 or 4, wherein the write buffer array is mapped to the memory such that processed values corresponding to values of the two-dimensional array are written into the memory in memory locations to which the value elements of the write buffer array are mapped, and processed values corresponding to values of the two-dimensional array are not written into the memory in memory locations to which the fill elements of the write buffer array are mapped.
9. The method of claim 1, wherein the memory location to which a thread writes a processed value is determined from a base memory address, a write offset, and a write fill amount, the write offset and the write fill amount being dependent on the location within the array of values of the value to which the processed value corresponds.
10. The method of claim 9, wherein the two-dimensional array of values divided into a plurality of two-dimensional sub-arrays of values is represented by a multi-dimensional array [ I ] [ J ] [ K ] [ M ], wherein I and J represent the number of sub-arrays of values within the array of values in each of two dimensions, and K and M represent the number of values within each of the sub-arrays of values in each of the two dimensions, and wherein each value in the array of values has coordinates [ I ] [ J ] [ K ] [ M ], the coordinates defining the position of the value within the multi-dimensional array [ I ] [ J ] [ K ] [ M ], and wherein:
the write offset is equal to (i×J×K×M) + (j×K×M) + (k×M) + m or (i×J×K×M) + (j×K×M) + (m×K) + k; and
The write fill amount is equal to the write offset divided by a fill frequency.
11. The method of claim 1, wherein the memory location from which a thread reads a processed value is determined from a base memory address, a read offset, and a read fill amount, the read offset and the read fill amount being dependent on the location within the array of values of the value to which the processed value corresponds.
12. The method of claim 11, wherein the two-dimensional array of values divided into a plurality of two-dimensional sub-arrays of values is represented by a multi-dimensional array [ I ] [ J ] [ K ] [ M ], wherein I and J represent the number of sub-arrays of values within the array of values in each of the two dimensions, and K and M represent the number of values within each of the sub-arrays of values in each of the two dimensions, and wherein each value in the array of values has coordinates [ I ] [ J ] [ K ] [ M ], the coordinates defining a position of the value within the multi-dimensional array [ I ] [ J ] [ K ] [ M ], and wherein:
the read offset is equal to (j×J×K×M) + (i×K×M) + (m×M) + k or (j×J×K×M) + (i×K×M) + (k×K) + m; and
The read fill amount is equal to the read offset divided by the fill frequency.
13. The method of claim 10 or 12, wherein the fill frequency is (i) equal to the number of memory banks comprised by the memory, (ii) a multiple of the number, or (iii) a factor of the number.
14. The method of claim 13, wherein the fill frequency is equal to or less than the number of threads used to perform the separable operations for the array of values.
15. The method of any one of claims 1 to 4, wherein the plurality of two-dimensional value sub-arrays are non-overlapping.
16. The method of any of claims 1-4, wherein the plurality of threads are processed by processing logic included by a core of the processing unit, the processing logic being implemented on a chip, and the memory being physically located on-chip with the processing logic.
17. The method of any one of claims 1 to 4, wherein the two-dimensional array of values is a two-dimensional array of pixel values.
18. The method of any one of claims 1 to 4, wherein:
the one-dimensional sequence of values of the subarray of values is a row of values of the subarray of values, and the vertical one-dimensional sequence of values of the subarray of values in the transposed position is a column of values of the subarray of values in the transposed position; or alternatively
The one-dimensional sequence of values of the sub-array of values is a column of values of the sub-array of values, and the vertical one-dimensional sequence of values of the sub-array of values in the transposed position is a row of values of the sub-array of values in the transposed position.
19. A processing unit for performing separable operations on a two-dimensional array of values, the processing unit comprising:
A memory comprising a plurality of memory banks, wherein the memory is configured such that in each writing or reading step, each memory bank is capable of being written to or read from by only one respective thread; and
Processing logic configured to:
dividing the two-dimensional array of values into a plurality of two-dimensional sub-arrays of values;
for each sub-array of the plurality of sub-arrays:
Performing an initial stage of the separable operation for the sub-array of values using a plurality of threads to generate a respective processed value for each value of the sub-array of values;
Writing, using each of the plurality of threads, a respective first plurality of processing values to the memory through a plurality of writing steps, the first plurality of processing values corresponding to a one-dimensional sequence of values of the sub-array of values;
Reading, using each of the plurality of threads, a respective second plurality of processed values from the memory by a plurality of reading steps, the second plurality of processed values corresponding to a vertical one-dimensional sequence of values of a subarray of values in a transposed position within the array of values relative to the subarray of values; and
Performing, using the plurality of threads, a later stage of the separable operation for the plurality of processed values read by the plurality of threads to generate a respective output value for each value of the sub-array of values in the transpose position;
Wherein in at least one of the plurality of writing steps a respective processing value is written into each of the memory banks of the memory, and in at least one of the plurality of reading steps a respective processing value is read from each of the memory banks of the memory.
20. A non-transitory computer-readable storage medium having stored thereon computer-readable instructions that, when executed at a computer system, cause the computer system to perform a computer-implemented method of performing separable operations on a two-dimensional array of values at a processing unit comprising a memory, the memory comprising a plurality of memory banks, wherein in each writing or reading step, each memory bank is capable of being written to or read from by only one respective thread, the method comprising:
dividing the two-dimensional array of values into a plurality of two-dimensional sub-arrays of values;
for each sub-array of the plurality of sub-arrays:
Performing an initial stage of the separable operation for the sub-array of values using a plurality of threads to generate a respective processed value for each value of the sub-array of values;
Each of the plurality of threads writing a respective first plurality of processing values to the memory through a plurality of writing steps, the first plurality of processing values corresponding to a one-dimensional sequence of values of the sub-array of values;
Each of the plurality of threads reads a respective second plurality of processed values from the memory through a plurality of read steps, the second plurality of processed values corresponding to a vertical one-dimensional sequence of values of the subarray of values in a transposed position within the array of values relative to the subarray of values; and
Performing, using the plurality of threads, a later stage of the separable operation for the plurality of processed values read by the plurality of threads to generate a respective output value for each value of the sub-array of values in the transpose position;
Wherein in at least one of the plurality of writing steps a respective processing value is written into each of the memory banks of the memory, and in at least one of the plurality of reading steps a respective processing value is read from each of the memory banks of the memory.
CN202311754998.XA 2022-12-21 2023-12-19 Performing separable operations on a two-dimensional array of values at a processing unit including a memory Pending CN118227515A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2219375.9 2022-12-21
GB2219374.2 2022-12-21

Publications (1)

Publication Number Publication Date
CN118227515A true CN118227515A (en) 2024-06-21

Similar Documents

Publication Publication Date Title
US11886536B2 (en) Methods and systems for implementing a convolution transpose layer of a neural network
US10942986B2 (en) Hardware implementation of convolutional layer of deep neural network
US20200202198A1 (en) Neural network processor
US10853988B2 (en) Single pass rendering for head mounted displays
JP7279064B2 (en) Memory organization for tensor data
US20220036501A1 (en) Hierarchical tiling in a graphics processing system using chain sorting of primitives
GB2554711A (en) Buffer addressing for a convolutional neural network
US20230394615A1 (en) Task execution in a simd processing unit with parallel groups of processing lanes
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
US11106968B1 (en) Circuit arrangements and methods for traversing input feature maps
CN111028360A (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN111767243A (en) Data processing method, related device and computer readable medium
CN118227515A (en) Performing separable operations on a two-dimensional array of values at a processing unit including a memory
CN118227516A (en) Performing operations on an array of values at a processing unit
EP4390711A1 (en) Performing an operation on an array of values at a processing unit
US11830114B2 (en) Reconfigurable hardware acceleration method and system for gaussian pyramid construction
US20210174181A1 (en) Hardware Implementation of a Neural Network
CN111562979B (en) Memory allocation
GB2580179A (en) Tile-based scheduling
GB2620810A (en) Performing a separable operation on a two-dimensional array of values at a processing unit comprising a memory
GB2585810A (en) Buffer addressing for a convolutional neural network
US20220092731A1 (en) Downscaler and Method of Downscaling
GB2625215A (en) Implementation of a neural network
GB2611658A (en) Hardware implementation of a neural network

Legal Events

Date Code Title Description
PB01 Publication