AU2003204212A1 - A Processor for Alpha-compositing

Info

Publication number: AU2003204212A1
Application number: AU2003204212A
Authority: AU
Inventors: Matthew William Gallagher
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-06-14
Filing date: 2003-05-15
Publication date: 2004-01-15

Description

S&F Ref: 637363

AUSTRALIA

PATENTS ACT 1990

COMPLETE SPECIFICATION FOR A STANDARD PATENT

Name and Address of Applicant: Canon Kabushiki Kaisha, 30-2, Shimomaruko 3-chome, Ohta-ku, Tokyo 146, Japan

Actual Inventor(s): Matthew William Gallagher

Address for Service: Spruson & Ferguson, St Martins Tower, Level 31 Market Street, Sydney NSW 2000 (CCN 3710000177)

Invention Title: A Processor for Alpha-compositing

ASSOCIATED PROVISIONAL APPLICATION DETAILS

[33] Country: AU; [31] Applic. No(s): PS2976; [32] Application Date: 14 Jun 2002

The following statement is a full description of this invention, including the best method of performing it known to me/us:-

A PROCESSOR FOR ALPHA-COMPOSITING

Copyright Notice

This patent specification contains material that is subject to copyright protection.

The copyright owner has no objection to the reproduction of this patent specification or related materials from associated patent office files for the purposes of review, but otherwise reserves all copyright whatsoever.

Technical Field of the Invention

The present invention relates generally to alpha compositing and, in particular, to a technique for the implementation of alpha compositing in a processor arrangement having multiple execution units, where each execution unit is specialised to execute specific types of instructions.

Background

Alpha compositing is a mechanism used in computer graphics and image processing as a means of overlaying and combining two layers of two-dimensional colour data to obtain a single output layer. For the purposes of this document it is appropriate to only consider the image region in which layers overlap.

Each layer comprises an array of pixels, with each pixel being formed by a set of values or channels. In alpha compositing, one channel is always the "$\alpha$ channel", which describes the opacity of the pixel to be reproduced. Opacity is the extent to which a pixel in a layer will override or obscure colour information from pixels on layers behind the layer in question. Opacity is often referred to by its conjugate name, transparency. In a normalised system, opacity = (1 - transparency).

The remaining channels describe the colour of the pixel. The number of these remaining channels and the manner in which they describe the colour of the pixel is implementation specific. For example, there may be just one colour channel (in the case of greyscale or colour separated images), or there may be multiple colour channels (for example red, green and blue). These different ways of describing colour are often referred to as the colourspace of the image. It is assumed that the colourspace is the same for all pixels on a layer.

Often, for storage efficiency, the different colour channels, and sometimes the alpha channel as well, are packed into a single value and stored at a single memory address.

For example, as shown in Fig. 1, if a pixel 100 includes 8 bits to store each of the colour channels red 102, green 104 and blue 106 and the alpha channel 108, then those values may be packed into a single 32-bit value 110.
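The packing of Fig. 1 can be illustrated with a minimal Python sketch. The byte order used here (red in the most significant byte, alpha in the least significant byte) and the function names `pack_pixel` and `unpack_pixel` are assumptions made for illustration only; the specification notes that the channel arrangement varies between implementations.

```python
def pack_pixel(red, green, blue, alpha):
    """Pack four 8-bit channels into one 32-bit value.

    Assumed layout (hypothetical, not fixed by the specification):
    bits 31-24 red, 23-16 green, 15-8 blue, 7-0 alpha.
    """
    for value in (red, green, blue, alpha):
        if not 0 <= value <= 0xFF:
            raise ValueError("each channel must fit in 8 bits")
    return (red << 24) | (green << 16) | (blue << 8) | alpha


def unpack_pixel(packed):
    """Recover the four 8-bit channels by shifting and masking."""
    return (
        (packed >> 24) & 0xFF,  # red
        (packed >> 16) & 0xFF,  # green
        (packed >> 8) & 0xFF,   # blue
        packed & 0xFF,          # alpha
    )


packed = pack_pixel(0x12, 0x34, 0x56, 0x78)
channels = unpack_pixel(packed)
```

Unpacking is the exact inverse of packing, so a round trip through both functions returns the original channel values.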

As with the colourspace used, the arrangement of channels in a packed pixel and the number of bits used to store each channel can vary between different implementations.

Each pixel is uniquely defined by two-dimensional coordinates within a layer.

Since two layers are overlaid and combined during the compositing process, each pixel in each layer has a corresponding pixel in the other layer which shares the same two-dimensional coordinates, and with which the pixel is to be combined to produce the resultant pixel in the output layer with the same coordinates.

When layers are combined, the two layers are not treated as equal. One layer is treated as being conceptually behind, or beneath, the other layer. The front, or top, layer is often referred to as the "source" and the rear, or lower, layer is referred to as the "destination". Layers can also be combined using any of a number of well known compositing operators, whose names and corresponding functions are shown in Fig. 4, or using a raster operation (ROP), or both. The compositing operators of Fig. 4 are those described in "Compositing Digital Images", Porter, T; Duff, T; Computer Graphics, Vol. 18 No. 3 (1984) pp. 253-259 (hereinafter "Porter Duff"). A ROP is an arithmetic expression performed on the source and destination colour values to obtain a combined value. Any arithmetic expression which takes two pixels as input, and returns one pixel as output, can be considered a valid ROP. The amount of colour which comes from each layer, and the manner in which this colour is used to generate the resultant colour, depends on both the compositing operator and the ROP.

The compositing described in this document assumes only two layers are being combined at any one time. More than two layers can be combined using these techniques by compositing the layers together two at a time.

Combining the compositing arithmetic of Porter Duff with an operation on the intersection term to implement a current ROP yields the following equation:

$$\alpha_R C_R = f_{\bar{S} \cap D}(1 - \alpha_S)(\alpha_D)C_D + f_{S \cap \bar{D}}(1 - \alpha_D)(\alpha_S)C_S + f_{S \cap D}(\alpha_S)(\alpha_D)\,\mathrm{ROP}(C_S, C_D) \qquad (1)$$

In Equation (1): the subscripts S and D represent the source and destination layers (compositing inputs) and the subscript R represents the resultant layer (compositing output); C represents the set of colour channels for the pixel specified by the subscript; $\alpha$ is the opacity level for the pixel specified by the subscript (as written in this equation, the domain of $\alpha$ is $\{x \in \mathbb{R} \mid 0 \le x \le 1\}$); ROP is the compositing operation specified to combine the source and destination layers; and f is a Boolean value. If the two overlapping shapes below are assumed to be the layers being composited, then the three values of f together implement the Porter Duff compositing operations by masking out the appropriate regions.

The result of Equation (1) is colour information multiplied by the alpha channel. This combination is often referred to as premultiplied colour. To separate the colour and alpha channel information again, it is appropriate to calculate $\alpha_R$ independently, with Equation (2) for example:

$$\alpha_R = f_{\bar{S} \cap D}(1 - \alpha_S)\alpha_D + f_{S \cap \bar{D}}(1 - \alpha_D)\alpha_S + f_{S \cap D}\,\alpha_S \alpha_D \qquad (2)$$

Dividing the premultiplied value given by Equation (1) by the opacity of Equation (2) gives the desired resultant colour, ie:

$$C_R = \frac{\alpha_R C_R}{\alpha_R} \qquad (3)$$

When two layers are composited, a single ROP is specified. This means that the ROP does not change from pixel to pixel. If desired, the compositing operation can be further simplified by dividing the image into areas where all three f values remain constant, allowing terms which are not required to be left uncalculated.
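By way of illustration only, Equations (1) to (3) can be evaluated for a single pixel with the following Python sketch. It assumes normalised floating-point channel values; the function name `composite_pixel`, its parameter names, and the choice to return a zero colour when the resulting opacity is zero are assumptions of this sketch rather than part of the described arrangement.

```python
def composite_pixel(colour_s, alpha_s, colour_d, alpha_d,
                    f_dest_only, f_src_only, f_both, rop):
    """Evaluate Equations (1)-(3) for one pixel (hypothetical helper).

    colour_s, colour_d : sequences of normalised colour channels (0.0 to 1.0)
    alpha_s, alpha_d   : normalised opacities (0.0 to 1.0)
    f_*                : the three Boolean region flags of Equation (1)
    rop                : function taking (colour_s, colour_d), returning a colour
    """
    terms = (
        (f_dest_only, (1.0 - alpha_s) * alpha_d, colour_d),
        (f_src_only, (1.0 - alpha_d) * alpha_s, colour_s),
        (f_both, alpha_s * alpha_d, rop(colour_s, colour_d)),
    )
    premultiplied = [0.0] * len(colour_s)  # accumulates alpha_R * C_R, Eq. (1)
    alpha_r = 0.0                          # accumulates alpha_R, Eq. (2)
    for enabled, weight, colour in terms:
        if not enabled:
            continue  # a false f masks the whole term out
        alpha_r += weight
        premultiplied = [p + weight * c for p, c in zip(premultiplied, colour)]
    if alpha_r == 0.0:
        # Assumption: colour is undefined for a fully transparent result,
        # so zero is returned instead of dividing by zero.
        return [0.0] * len(premultiplied), 0.0
    return [p / alpha_r for p in premultiplied], alpha_r  # Eq. (3)


# Half-transparent red over opaque blue, with a ROP that keeps the source colour.
colour, alpha = composite_pixel(
    [1.0, 0.0, 0.0], 0.5, [0.0, 0.0, 1.0], 1.0,
    True, True, True, lambda s, d: s)
```

In this example the destination-only weight is 0.5, the source-only weight is 0 and the overlap weight is 0.5, so the result is an opaque pixel that is an even mix of red and blue.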

It will be appreciated from the above, in view of the relatively involved nature of Equations (1) and (2), each including a number of multiplications and additions, and the divisional relationship therebetween, that the alpha channel imposes a substantial computational burden over and above that which would ordinarily be implemented in a graphics system utilizing entirely opaque objects or layers. This burden becomes particularly significant when it is desired for images to be printed, since printer resolutions necessitate calculation of, typically, between 4 and 16 times more pixel values in comparison to those that would be reproduced on a video display screen.

Traditional computer graphics and image processing arrangements utilize microprocessors, more correctly RISC and CISC microprocessors, to perform raster image processing (RIP) and compositing functions. The acronym RIP is used hereinafter to also refer to such microprocessors as will be clear from the given context. Utilizing a microprocessor has the advantage that the desired functions are relatively easy to implement in software and thus may be performed inexpensively. However, to execute complex tasks such as raster image processing, a microprocessor must read and execute potentially millions of smaller instructions, and each of those instructions must be executed in series. It is the mismatch between a typical instruction set of a microprocessor and the tasks required for raster image processing, and the serial nature of typical microprocessors, which make RIP's formed by microprocessors relatively slow.

For typical office printing arrangements, the microprocessor which performs the RIP is a dedicated processor and resides on either a circuit card incorporated in a host computer device or on a circuit within the printer itself. In either case, this dedicated circuit is typically referred to as a "printer RIP". Microprocessors are employed in RIP's because they are flexible, readily available and can be programmed to execute almost any type of computational task. To perform the large number of mathematical operations and large data transfers that are required for high-speed printing, microprocessors must process instructions at very high rates. For this reason, printer RIP's typically employ the fastest microprocessors available. For example, at the time of writing this specification, a 1.7GHz Pentium™ IV microprocessor, manufactured by Intel Corporation of the USA, represented the state-of-the-art commercial microprocessor available for general use and retailed for about US$1000. The cost of a microprocessor this fast can be a significant fraction of the overall cost of the printer. To reduce the per-unit cost, the general-purpose microprocessor can be replaced or supplemented with an application specific integrated circuit (ASIC) whose arrangement is designed exclusively to perform the onerous image processing tasks at hand. Whilst such devices can be made to render pages as quickly as a microprocessor for a small fraction of the cost per-unit (typically less than US$200), such is not achieved without the attendant high cost of designing the ASIC and implementing its operation. Such devices are also inflexible in their operation, particularly compared to microprocessors.

In order to solve Equations (1) and (2) and thus determine the resultant colour and $\alpha$ value for each composited pixel, a processing method 390 shown in Fig. 3A must be performed.

As seen in Fig. 3A, the method 390 begins by fetching colour and $\alpha$ values for the corresponding pixels from the input layers in step 391. Step 391 involves loading the pixel values $C_S$ and $C_D$ as well as the two corresponding $\alpha$ values. If the pixels are stored in a packed format, it is often necessary to unpack the data.

The colour values obtained in step 391 are used in steps 392, 393 and 394.

Step 392 determines the $\alpha$ products for the three different f terms in Equations (1) and (2), and involves calculating $(1 - \alpha_S)\alpha_D$, $(1 - \alpha_D)\alpha_S$ and $\alpha_S \alpha_D$. If the compositing has been divided into regions where the three f values are known and constant, then there is no need to calculate any component which is associated with a false f.

Step 393 involves determining the colours resulting from the ROP for the $S \cap D$ term. This is performed by applying the ROP expression to the $C_S$ and $C_D$ colour values. This may involve execution of any mathematical expression which returns the colour values for a single pixel.

Step 394 multiplies the three colours (two input-layer colours and the ROP result colour) by their respective $\alpha$. The values of $(1 - \alpha_S)\alpha_D$, $(1 - \alpha_D)\alpha_S$ and $\alpha_S \alpha_D$ have already been calculated and are sourced from an output of step 392. Step 394 involves the multiplication of the results of step 392 by $C_D$, $C_S$ and the result of step 393, $\mathrm{ROP}(C_S, C_D)$, respectively. Since each pixel in the colourspace may contain a number of different colour channels, each of the channels will need to be multiplied by the respective $\alpha$ product. Again, if it is known in advance that a given f value will be false, then there is no need to calculate the values associated with that term of the equation.

Step 395 follows step 394 and operates to sum the terms for which f is true. If it is not already known whether each f value is true or false, it is necessary to determine this by determining (according to the Porter Duff compositing operation being used) which of the three regions should be included in the result (f true) and which should be excluded (f false). With this, each result from the previous step in the flow-chart where the corresponding f value is true, is then summed, giving the colour values for a single premultiplied pixel as the result.

Step 396 receives an output from step 392 and calculates $\alpha_R$ from the $\alpha$ intermediates. The earlier calculated values of $(1 - \alpha_S)\alpha_D$, $(1 - \alpha_D)\alpha_S$ and $\alpha_S \alpha_D$, combined with the three Boolean values of f, are used to calculate the equation:

$$\alpha_R = f_{\bar{S} \cap D}(1 - \alpha_S)\alpha_D + f_{S \cap \bar{D}}(1 - \alpha_D)\alpha_S + f_{S \cap D}\,\alpha_S \alpha_D$$

Step 397 operates to divide by $\alpha_R$ to obtain the resultant colour. The premultiplied colour value $\alpha_R C_R$ obtained in step 394 is here divided by $\alpha_R$. Since there may be more than one colour channel in the colourspace, this operation may require multiple division operations.

Lastly, step 398 stores the colour and $\alpha$ values for the result. In the same way that the $C_S$ and $C_D$ values as well as the two corresponding $\alpha$ values were fetched at the beginning, $C_R$ and its associated $\alpha$ value must now be stored. If pixels are stored in a packed format, the pixels must be similarly repacked.

The method 390 of Fig. 3A is traditionally practised using a printer controller 532 such as that shown in Fig. 5. As seen, the printer controller 532 forms part of a printer 515 which is coupled by either a direct connection 530 or a network connection 531 to a computer system 500. The computer system 500 comprises a computer module 501, input devices such as a keyboard 502 and mouse 503, output devices including a printer 515 and a display device 514. A Modulator-Demodulator (Modem) transceiver device 516 is used by the computer module 501 for communicating to and from a communications network 520, for example connectable via a telephone line 521 or other functional medium. The modem 516 can be used to obtain access to the Internet, and other network systems, such as a Local Area Network (LAN) or a Wide Area Network (WAN), where for example the printer 515 may reside.

The computer module 501 typically includes at least one microprocessor unit 505, a memory unit 506, for example formed from semiconductor random access memory (RAM) and read only memory (ROM), input/output interfaces including a video interface 507, and an I/O interface 513 for the keyboard 502 and mouse 503 and optionally a joystick (not illustrated), and an interface 508 for the modem 516. A storage device 509 is provided and typically includes a hard disk drive 510 and a floppy disk drive 511. A magnetic tape drive (not illustrated) may also be used. A CD-ROM drive 512 is typically provided as a non-volatile source of data. The components 505 to 513 of the computer module 501, typically communicate via an interconnected bus 504 and in a manner, which results in a conventional mode of operation of the computer system 500 known to those in the relevant art. Examples of computers on which the described arrangements can be practised include IBM-PC's and compatibles, Sun Sparcstations or alike computer systems evolved therefrom.

The printer 515 includes the printer controller 532 and a printer engine 534. The controller 532 has a network interface 536 for receiving print jobs from the connections 530, 531 and a job controller 540 which passes print jobs to a bus 539.

Coupled to the bus 539 is a RAM memory 537 and a ROM memory 538, the latter containing firmware necessary for operation of the controller 532. A rendering module 542, a PDL interpreter 544, a colour converter 546 and a halftoning unit 548 also couple to the bus 539 to perform traditional functions known in the art. The printer engine includes a printer engine controller 550, which receives pixels for printing from the bus 539 and controls the reproduction of those pixels using printing drums 552.

In Fig. 5, the modules 540 to 548 can be implemented by either a microprocessor, an ASIC or, most commonly, a combination of both. In the case where the rendering module 542 is implemented on a typical microprocessor, the compositing technique employed would be a linear method, such as that described above in respect of Fig. 3A.

If the rendering module 542 is implemented in an ASIC, the linear method can also be used but accelerated due to hardware omission of unnecessary fetch and stores, and also by parallel duplication of the processing path.

During the powering up of the printer controller 532, information required to initialise the modules 540 to 548 is loaded from the ROM/firmware module 538. In the case of a microprocessor implementation, the information loaded will be a software program that will, when executed, cause the microprocessor to execute the functions of the different printer controller modules 540 to 548. In the case of an ASIC implementation, the job control module 540 will be initialised to a start-up state.

Normally a printer job will be generated by a computer 500 or other device on the network 520 connected to the printer controller 532. The printer 515 will receive the job through the network interface 536 and the job control unit 540 will store the job in memory 537 as it is received. Once enough of the job is received to proceed, the job control unit 540 feeds the job into the PDL interpretation unit 544 which interprets PDL instructions in the job and produces instructions that can be handled by the rendering module 542. The rendering module 542 then takes these instructions and converts them into raw pixels. This pixel data will be either stored in the RAM 537 or cached locally in the rendering module 542. This pixel data may have been generated by the rendering module 542 based on rendering instructions in the job file. The pixel data may also have been contained in the job file as a bitmap or other image type. These instructions may include among others: shape filling, bitmap rendering or compositing instructions. These raw pixels are then fed into the colourspace conversion unit 546 to convert RGB pixels into CMYK pixels that are colour-corrected for the target printer engine 534. The colour-corrected pixels are then optionally half-toned 548 (for non-continuous tone printers) before being sent to the printer drums 552 for imaging and reproduction.

Often, for four-drum printer engines 534, the four CMYK colour planes must be staggered. In this case, each colour plane may have to be compressed and stored in memory after either colourspace-conversion or half-toning while waiting for its respective drum. The colourplane is then decompressed and fed through half-toning and imaging as required.

Traditionally, for microprocessor-based implementations, the alpha channel composition application program is resident on the ROM 538 and read and controlled in its execution by the microprocessor. Image data to be composited is traditionally sourced from the storage device 509, CD-ROM 512 or the network 520, generally by way of a general image processing application program, often including a graphic object capability, of which the alpha channel compositing application may form a part.

Intermediate storage of these programs and any image data fetched from the network 520 may be accomplished using the semiconductor memory 506, possibly in concert with the hard disk drive 510 of the computer 500. In some instances, the application programs may be supplied to the user encoded on a CD-ROM or floppy disk and read via the corresponding drive 512 or 511, or alternatively may be read by the user from the network 520 via the modem device 516. The compositing application may be uploaded to the printer 515 from the computer 500. Still further, the software can also be loaded into the computer system 500 from other computer readable medium including magnetic tape, a ROM or integrated circuit, a magneto-optical disk, a radio or infra-red transmission channel between the computer module 501 and another device, a computer readable card such as a PCMCIA card, and the Internet and Intranets including e-mail transmissions and information recorded on websites and the like. The foregoing is merely exemplary of relevant computer readable media. Other computer readable media may alternately be used.

Though it is elegant and typographically efficient to simply write a mathematical expression and state that a particular point in the flow-chart of Fig. 3A requires evaluation of the expression, the implementation of that evaluation on a processor is not necessarily as simple or efficient. Every addition, subtraction and multiplication in the above expressions represents another instruction that must be executed.

More complicated still is evaluating the product of a scalar number (such as $\alpha$) with the colour values of a packed pixel. Evaluation of this type of expression on a typical microprocessor involves unpacking each colour channel, by masking and bit-shifting the data for each channel, performing the required arithmetic operations on the channel values, and then repacking the pixel.
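The mask, shift, multiply and repack sequence described above can be sketched in Python as follows. The sketch assumes four 8-bit fields in a 32-bit word, an 8-bit scalar in which 255 represents an opacity of 1, and rounded integer division by 255; the function name `scale_packed_pixel` and these numeric conventions are illustrative assumptions and are not taken from the specification.

```python
def scale_packed_pixel(packed, alpha):
    """Multiply every 8-bit field of a packed 32-bit pixel by alpha / 255.

    Assumptions of this sketch: all four fields are scaled uniformly,
    alpha is an integer in 0..255, and results are rounded to nearest.
    """
    if not 0 <= alpha <= 0xFF:
        raise ValueError("alpha must fit in 8 bits")
    result = 0
    for shift in (24, 16, 8, 0):
        channel = (packed >> shift) & 0xFF       # unpack: shift, then mask
        scaled = (channel * alpha + 127) // 255  # arithmetic on one channel
        result |= scaled << shift                # repack into the output word
    return result


half = scale_packed_pixel(0xFF804000, 128)
```

Even in this compact form, each channel costs a shift, a mask, a multiply, an add, a divide, another shift and an OR, which illustrates why a sequential processor spends many instructions on a single packed pixel.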

While the steps required to implement the compositing algorithm were summarised simply in what appears as the small flow-chart of Fig. 3A, the assembly language implementation of each step might typically comprise 10 to 20 statements. This makes the time taken to fully composite just one pixel significant in itself. The time taken to composite an entire image, which may contain millions of pixels, will be understandably even more substantial. For applications where compositing of large images must happen quickly, sequential execution of the algorithm may not be fast enough. Technically, there is no impediment to implementing the algorithm as discussed thus far on a standard microprocessor but the need for fast compositing of high resolution images increasingly calls for better approaches.

A typical approach when an algorithm will not run fast enough on a microprocessor is to design specific hardware to accelerate the process. An ASIC can be designed to implement independent steps in parallel, and independent iterations, such as compositing of different pixels, can occur simultaneously. ASIC's can even be designed to handle data in a packed form, thereby avoiding the additional processing involved in unpacking and repacking the data. Custom designing hardware however is not always an ideal solution. Custom hardware has many drawbacks including the high cost to design and produce, as well as inflexibility and difficulty to update or change.

It will be appreciated from the above discussion of Fig. 3A in relation to the traditional processing arrangement shown in Fig. 5, that performing alpha-channel compositing is a complex task that, for efficiency, involves high speed processing facilitated by high speed expensive microprocessors executing software or high speed and expensive ASIC's performing the functions in an inflexible hardware fashion. A need therefore exists for a processing arrangement that can perform compositing operations in the presence of an alpha channel at high speeds but at a lower cost than traditional arrangements.

Summary of the Invention

According to a first aspect of the present disclosure, there is provided a method of performing alpha-channel compositing, said method comprising the steps of: dividing an alpha-channel compositing algorithm into a plurality of stages; arranging the stages in a pipeline fashion such that stages that require corresponding computing operations are sequentially arranged in said pipeline and stages that are performed by independent computing operations are parallel, subject to dependency criteria in said algorithm; replicating the pipeline as a plurality of substantially parallel configured pipelines; and implementing the pipelines in a substantially parallel configuration such that corresponding stages in each said pipeline are offset in sequence with each other.

According to another aspect of the present disclosure, there is provided a digital signal processor configured to perform alpha-channel compositing.

According to another aspect of the present disclosure, there is provided a method of alpha-compositing at least two layers to provide a digital image using a processor having a plurality of asymmetric execution units capable of parallel operation, said method including the steps of: calculating intermediate alphas for a pixel dependent upon said at least two layers and an overlap term between said at least two layers and in parallel determining a colour resulting from a ROP operation applied to said overlap term; determining a pre-multiplied colour result for said pixel dependent upon a colour and a respective one of said intermediate alphas for each of said at least two layers and said ROP result and in parallel calculating a resulting alpha from said intermediate alphas; and dividing said pre-multiplied colour result by said resulting alpha to provide a resulting colour and a resulting alpha for said pixel; wherein said steps are applied in a time-staggered manner to a plurality of pixels of said digital image.
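The time-staggered application of the steps to successive pixels can be pictured with the following Python sketch. It assumes, purely for illustration, that every stage takes one cycle and that each pixel enters the pipeline one cycle after the previous pixel; the function name `staggered_schedule` and the stage labels are hypothetical.

```python
def staggered_schedule(stage_names, pixel_count):
    """List, for each cycle, which pixel occupies which stage.

    Assumptions of this sketch: one cycle per stage and a one-cycle
    offset between consecutive pixels.
    """
    total_cycles = pixel_count + len(stage_names) - 1
    schedule = []
    for cycle in range(total_cycles):
        active = []
        for pixel in range(pixel_count):
            stage_index = cycle - pixel  # pixel p starts at cycle p
            if 0 <= stage_index < len(stage_names):
                active.append((pixel, stage_names[stage_index]))
        schedule.append(active)
    return schedule


# Three stages, loosely named after the three steps recited above.
stages = ["alphas+ROP", "premultiply+alpha", "divide"]
for cycle, active in enumerate(staggered_schedule(stages, 4)):
    print(cycle, active)
```

Once the pipeline is full, every cycle has a different pixel in each stage, so stages that use different kinds of execution unit can be kept busy at the same time.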

According to another aspect of the present disclosure, there is provided apparatus for alpha-compositing at least two layers to provide a digital image using a processor having a plurality of asymmetric execution units capable of parallel operation, said apparatus including: means for calculating intermediate alphas for a pixel dependent upon said at least two layers and an overlap term between said at least two layers and in parallel determining a colour resulting from a ROP operation applied to said overlap term; means for determining a pre-multiplied colour result for said pixel dependent upon a colour and a respective one of said intermediate alphas for each of said at least two layers and said ROP result and in parallel calculating a resulting alpha from said intermediate alphas; and means for dividing said pre-multiplied colour result by said resulting alpha to provide a resulting colour and a resulting alpha for said pixel; wherein the processing is applied in a time-staggered manner to a plurality of pixels of said digital image.

According to another aspect of the present disclosure, there is provided a computer program product having a computer readable medium with a computer program recorded therein for alpha-compositing at least two layers to provide a digital image using a processor having a plurality of asymmetric execution units capable of parallel operation, said computer program product including: computer program code means for calculating intermediate alphas for a pixel dependent upon said at least two layers and an overlap term between said at least two layers and in parallel determining a colour resulting from a ROP operation applied to said overlap term; computer program code means for determining a pre-multiplied colour result for said pixel dependent upon a colour and a respective one of said intermediate alphas for each of said at least two layers and said ROP result and in parallel calculating a resulting alpha from said intermediate alphas; and computer program code means for dividing said pre-multiplied colour result by said resulting alpha to provide a resulting colour and a resulting alpha for said pixel; wherein the processing is applied in a time-staggered manner to a plurality of pixels of said digital image.

According to another aspect of the present disclosure, there is provided a digital signal processor configured to perform division by values other than multiples of two.

According to another aspect of the present disclosure, there is provided a VLIW processor configured for alpha-compositing at least two layers to provide a digital image using a processor having a plurality of asymmetric execution units capable of parallel operation, said processor comprising: a first module configured for calculating intermediate alphas for a pixel dependent upon said at least two layers and an overlap term between said at least two layers and in parallel determining a colour resulting from a ROP operation applied to said overlap term; a second module configured for determining a pre-multiplied colour result for said pixel dependent upon a colour and a respective one of said intermediate alphas for each of said at least two layers and said ROP result and in parallel calculating a resulting alpha from said intermediate alphas; and a third module configured for dividing said pre-multiplied colour result by said resulting alpha to provide a resulting colour and a resulting alpha for said pixel; wherein the processing is applied in a time-staggered manner to a plurality of pixels of said digital image.

Brief Description of the Drawings

A number of embodiments of the present invention will now be described with reference to the drawings, in which:

Fig. 1 shows representations of packed and unpacked pixel data;

Fig. 2 illustrates areas and corresponding masking functions that relate to Porter Duff compositing operations;

Fig. 3A is a flowchart illustrating prior art method steps for implementing alpha channel compositing according to Equations (1) and (2);

Fig. 3B shows a re-arrangement of Fig. 3A to optimise pipeline processing in accordance with the present disclosure;

Fig. 4 illustrates various Porter Duff compositing operations;

Fig. 5 is a schematic block diagram of a computer arrangement upon which prior art alpha-channel compositing approaches can be practiced;

Fig. 6 depicts a preferred architecture for implementing the method of Fig. 3B in a parallel processing environment;

Figs. 7A, 7B and 7C show approaches for packing multiples to facilitate division operations;

Fig. 8 is a schematic block diagram of an exemplary VLIW-DSP core;

Fig. 9 shows the structure of pre-multiplied pixel data in the preferred implementation;

Fig. 10 is a schematic block diagram of a preferred alpha-channel compositor arrangement; and

Appendix A is a code listing for the preferred alpha-compositing implementation.

Detailed Description including Best Mode

An ASIC or other parallel processor however can be made to perform some operations simultaneously. The operations which can be performed simultaneously are determined by separate data dependency paths in the set of operations. Data dependency paths are those paths through the set of operations where each operation is dependent on the output of preceding operations in the path. The data dependency path for the compositing operation of Fig. 3A is shown in Fig. 3B with respect to a revised method 300, where steps 302, 304, 306, 308, 310, 312, 314 and 316 correspond to steps 391, 392, 393, 394, 395, 396, 397, and 398, respectively. Fig. 3B shows that steps 304 (392) and 306 (393) are independent and hence can be performed simultaneously once step 302 (391) is completed. Similarly step 312 (396) is on a separate path to steps 308 (394) and 310 (395), so these two paths can be executed simultaneously, once their preceding steps are completed.
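The grouping of steps by data dependency can be illustrated with a short Python sketch. The dependency table below is an assumption inferred from the step descriptions given earlier (Fig. 3B itself is not reproduced here), and the function name `parallel_levels` is hypothetical. Steps placed in the same level have no dependency on one another and so are candidates for simultaneous execution.

```python
# Assumed dependencies between the steps of method 300, read from the
# textual description of steps 391 to 398; not copied from Fig. 3B.
DEPENDS_ON = {
    302: [],               # fetch colour and alpha values
    304: [302],            # alpha products
    306: [302],            # ROP on the overlap term
    308: [302, 304, 306],  # multiply colours by alpha products
    310: [308],            # sum the enabled terms
    312: [304],            # resulting alpha
    314: [310, 312],       # divide by resulting alpha
    316: [314],            # store result
}


def parallel_levels(depends_on):
    """Group steps into levels; a step's level is one more than the
    deepest step it depends on, so equal-level steps are independent."""
    level = {}

    def depth(step):
        if step not in level:
            level[step] = 1 + max(
                (depth(dep) for dep in depends_on[step]), default=-1)
        return level[step]

    for step in depends_on:
        depth(step)
    grouped = {}
    for step, step_level in level.items():
        grouped.setdefault(step_level, []).append(step)
    return [sorted(grouped[step_level]) for step_level in sorted(grouped)]


levels = parallel_levels(DEPENDS_ON)
```

Under these assumed dependencies, steps 304 and 306 fall in the same level, as do steps 308 and 312, which matches the two opportunities for simultaneous execution identified in the text.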

A better solution to the problems of using microprocessors and ASIC's for alpha compositing discussed above in the Background section is a processor which implements multiple parallel execution units, and can handle data in packed formats, while still maintaining the "off-the-shelf" availability of more mainstream purely sequential processors. A desired processor for this type of compositing operation implements a very long instruction word (VLIW) architecture and preferably implements some form of single-instruction-multiple-data instruction (SIMD). SIMD allows packed data, such as multiple-channel colour, to be operated upon without unpacking. VLIW allows multiple instructions to be issued and executed simultaneously, while still allowing the programmer control over the scheduling and execution of instructions.

Many processors, and not just VLIW processors, feature multiple execution units that are capable of executing instructions in parallel. Such processors can be classified as superscalar or VLIW. Whilst sharing the characteristic of multiple execution units operable in parallel, VLIW processors and superscalar processors use very different techniques to achieve high performance. Specifically, the parallelism that is explicit in VLIW instructions must be discovered by hardware at run-time in superscalar processors. Further, with superscalar processors, the programmer has limited opportunity to influence the order in which instructions are issued to the various parallel execution units, and hence the level of parallelism achieved.

With a VLIW processor, the programmer has direct control over which instructions will be executed in parallel by the multiple execution units. As a result, the programmer can schedule the instructions in an optimal order. The programmer can also arrange for instruction operands to be fetched before they are needed by the execution units, thus hiding any memory latency.

Due to the degree of parallelism possible with a VLIW architecture, a VLIW processor can achieve the same performance as a conventional serial microprocessor of far higher clocking rate. Since device fabrication technology limits the clocking rate, this allows higher absolute performance to be attained than would otherwise be possible.

Since VLIW processors do not require special logic for deciding which instructions can be executed in parallel, VLIW processors can be much simpler and cheaper than serial processors. For this reason, and because VLIW processors require a lower clocking rate, power consumption tends to be lower than with serial microprocessors.

VLIW types of processors, which include digital signal processors (DSP's), have typically been used in signal processing applications such as digital filtering and performing Fast Fourier Transforms, amongst many other functions. Traditional RIP's have not supported compositing because, until recently, PostScript™ and Windows™ GDI have not supported compositing. Proper alpha-compositing was the domain of very high-end printer RIP's running on either custom ASIC's or one or more high-end microprocessors. DSP's have been used for printer RIP's previously, but such implementations have generally been modelled on their earlier low-end microprocessor ancestors and have shied away from computationally-intensive tasks such as alpha-compositing, using the DSP simply to take advantage of the VLIW nature of these processors in a traditional RIP. With more RIP's now extant that support transparency, and with the increased power of this class of processor as both a general microprocessor and a SIMD mathematical processor, the present inventor has discovered that it is now feasible to implement alpha-compositing on a lower-end VLIW processor instead of a custom ASIC or high-end microprocessor.

There are two significant limitations that remain with VLIW processors. Firstly, while VLIW processors typically have multiple execution units, these units are rarely homogeneous. Typical VLIW processors include some units which perform data transfer operations (moving memory, performing loads and stores etc.), some which perform basic arithmetic operations (addition, subtraction), some which perform multiplication, and other units which perform different functions. This means that it is easy to perform a load and an addition simultaneously, but two simultaneous loads might not be possible. The implication of this is that the traditional method of accelerating compositing, namely performing multiple compositing operations in parallel, is not a viable option with VLIW processors, because these processing units are not truly parallel.

The other limitation comes from dependencies in algorithm processing. In order to perform an operation on data, it is first necessary to have the data. If the data is produced in a previous step, then the desired operation cannot be performed until after the previous step has been completed. This means that the operation also cannot be performed in parallel with the previous step. It is therefore not possible to simply break the compositing operation up into as many steps as there are available execution units and attempt to execute the algorithm all at once.

Instead of simply breaking a single iteration into a number of components and trying to execute all components simultaneously, which would not work due to dependency problems, it is possible to take multiple iterations of the compositing algorithm (ie. executing on multiple pixels at once) and send these to different units simultaneously. If iterations are staggered so that similar operations do not occur at once, then the scheduling problems due to different execution unit types can be avoided.

Further, once enough staggered iterations of the loop become operational to maximise usage of the execution units on the processor, further iterations need only be started when previous iterations complete. This technique is called loop-pipelining. If properly implemented this arrangement can allow the compositing of multiple pixels to occur simultaneously, dramatically increasing overall throughput.
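The staggering described above can be sketched as a simple schedule calculation. The sketch assumes that each iteration is divided into a fixed number of sequential components (the parameter depth), that each component occupies one time period, and that iteration k is started in time period k; the function names are hypothetical.

```c
/* Component index (0 .. depth-1) executed by iteration `iter` during time
 * period `t`, or -1 if that iteration has not started or has finished.
 * Assumption: iteration k starts at period k (a one-period stagger). */
int component_at(int iter, int t, int depth)
{
    int c = t - iter;
    return (c >= 0 && c < depth) ? c : -1;
}

/* Time periods needed when iterations are staggered through the pipeline. */
int pipelined_periods(int iterations, int depth)
{
    return (iterations > 0) ? iterations + depth - 1 : 0;
}

/* Time periods needed when every component of every iteration runs in turn. */
int sequential_periods(int iterations, int depth)
{
    return iterations * depth;
}
```

Under these assumptions, six iterations of a six-component pipeline occupy eleven time periods, against thirty-six when run strictly one after another, and in any one period no two active iterations execute the same component.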

An example of how a VLIW processor can operate for the alpha-channel compositing algorithm is shown in Fig. 6. In Fig. 6, a process 600 is shown for the compositing of six different pixels scheduled to occur in pseudo-parallel over eleven (11) time periods t0 to t10. Six iterations 602, 604, 606, 608, 610 and 612 of the method 300 of Fig. 3B are used in pseudo-parallel, with the execution thereof being staggered over the eleven time periods t0 to t10. Each column in Fig. 6 represents one stream of the parallel execution, and each row represents instructions that occur simultaneously, effectively corresponding to an execution stage or cycle on the VLIW processor. The actual scheduling arrangement used for a given implementation will depend on the scheduling and resource restrictions of the specific processor being used.

It is important to note that each column does not represent a specific (hardware) execution unit of a VLIW processor. This is because each column contains the full range of instructions of the compositing algorithm and so could not be executed exclusively on any one execution unit, such as a multiplication unit for example. Instead each column shows how a single pixel is processed through the pipeline.

Examining the centre of the diagram at time t5 reveals that even in the most parallel section of the process 600, only one load (step 302 in process 612), one store (step 316 in process 602), one divide (step 314 in process 604) and a limited number of additions and multiplications (steps 312, 308, 310, 304 and 306 in processes 604, 606, 608 and 610) are involved. This broad range of tasks means that as many execution units as possible can be utilised simultaneously.

Fig. 6 also shows three specific regions: a top region formed by time periods t0 to t4, which may be referred to as the "prologue"; a bottom region formed by the periods t6 to t10, which is the "epilogue"; and a middle region at time t5, which is the "loop kernel". For a situation where many pixels are being composited, the loop kernel can be executed repeatedly, resulting in one pixel being output from the kernel at every iteration, where scheduling or dependency problems do not hinder execution. It is to be noted that the arrangement shown in Fig. 6, and for which the scheduling shown represents a valid pipeline, relates to a hypothetical VLIW processor. In this regard, Fig. 6 is used to demonstrate pipelining, and not a particular scheduling to be used. Specifically, in some implementations, most of the steps of Fig. 3A take between about 3 and 10 instructions to complete, and in a real processor the number will depend upon the ROP used and the pre-multiplied state of the pixels.
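The three regions named above can be expressed as a classification of each time period. This is a sketch only, assuming a one-period stagger between iterations and at least as many iterations as pipeline components; the enumeration and function names are hypothetical.

```c
typedef enum { PROLOGUE, KERNEL, EPILOGUE } Region;

/* Number of iterations that are in flight during time period t. */
int active_iterations(int t, int iterations, int depth)
{
    int first = t - depth + 1;          /* oldest iteration still running */
    int last  = t;                      /* newest iteration started */
    if (first < 0) first = 0;
    if (last > iterations - 1) last = iterations - 1;
    return (last >= first) ? last - first + 1 : 0;
}

/* Region of the schedule that time period t belongs to.
 * Assumption: iterations >= depth, so the pipeline fills completely. */
Region region_at(int t, int iterations, int depth)
{
    if (t < depth - 1)      return PROLOGUE;   /* pipeline still filling */
    if (t > iterations - 1) return EPILOGUE;   /* no new iterations start */
    return KERNEL;                             /* every component is busy */
}
```

With six iterations and six components this gives a prologue over t0 to t4, a kernel at t5 and an epilogue over t6 to t10. When more pixels are supplied, only the kernel range grows.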

As shown in Fig. 6, the kernel is executed in one cycle, as depicted by the one row of parallel process steps within the period t5. This means that one pixel may be composited for every cycle using such a processor arrangement, as opposed to the eight cycles taken if each step shown for a pixel compositing operation were to be executed sequentially, as seen in Fig. 3A.

If there are as many iterations running in parallel as there are dependent steps in the dependency path, all instructions can still be executed simultaneously, provided all instructions can be scheduled simultaneously between the different execution units on the processor.

The final result is that, by exploiting the multiple execution units of such a processor, the process of compositing can run many times faster than is possible on a serial microprocessor, such as that described above, and at a much reduced cost and increased flexibility compared to an ASIC or a traditional (sequential) microprocessor.

An alpha-channel compositing arrangement in accordance with the present disclosure may therefore be implemented using a VLIW processor having more than two instruction execution units, and preferably a VLIW-SIMD processor. Examples of such processors include the FR500 DSP manufactured by Fujitsu Ltd. of Japan, the MAP-CA™ DSP manufactured by Equator Technologies Inc. of the United States, and the TMS320C64x™ family of devices manufactured by Texas Instruments of the USA. An example of a non-DSP VLIW processor is the Itanium™ family of processors manufactured by Intel Corporation of the USA. The Itanium™ family uses Explicitly Parallel Instruction Computing (EPIC), which is an implementation of a VLIW architecture.

A specific example of the use of a VLIW processor for performing the alpha channel compositing algorithm will now be described with reference to the TMS320C64x™ family of devices. This family of devices was targeted in its development to support telecommunications functions associated with third-generation wireless base stations, such as Viterbi and Reed-Solomon coding/decoding processes.

This family, like its TMS320™ ancestors, also supports traditional DSP functions such as filtering, vector-dot products, and Fast Fourier Transforms (FFT). The family is also described by its manufacturer as being useful in imaging applications such as 3x3 correlation, 8x8 block motion estimation, 3x3 median filtering, horizontal and vertical wavelet transformations, 8x8 block IDCT and quantization. Each of these applications is characterised by a reliance upon multiplication and arithmetic operations, which are founded upon the eight core functional units of the TMS320C64x™, these including two multipliers and six arithmetic units. The family uses an advanced Very Long Instruction Word (VLIW) architecture and a SIMD instruction set. One specific advantage of the TMS320C64x™ family lies in an ability to work with packed data values, as encountered in pixel compositing as described above.

However, unlike general-purpose microprocessors, the TMS320C64x™ family does not possess an ability to perform division, which conflicts with the need for division in the alpha channel compositing algorithm as described above. This means that careful consideration must be given to algorithm implementation. It is understood in digital hardware that a shift left is multiplication by a power of two, and a shift right is division by a power of two. The TMS320C64x™ family can bit shift both left (_shl) and right; however, it can only SIMD bit shift right. In this regard, there is an instruction _shr2, but there is no equivalent _shl2. This means that to implement _shl2 would require two instructions (_mpyu2 by a power of 2 and _pack2 to truncate the results), which is considerably slower than a _shl2 would be since _mpyu2 has a latency of 2 cycles. The preferred approach is to employ bit-shifting right (ie. division by powers of two) wherever possible, or most preferably, to truncate values when division by 256 (8-bits) or 65536 (16-bits) is required. In addition to this approach, the preferred approach would leave pixel data in its premultiplied state as long as possible so that unnecessary divisions are avoided. In the ultimate case where conversion back to an unpremultiplied 8-bit per channel pixel value is required, truncation and a two-dimensional, 8-bit look-up table (LUT) are employed to perform division by alpha. In this regard, where there exists premultiplied data (eg. $\alpha_R C_R$), it is necessary to divide by $\alpha_R$ to obtain the desired result. The two-dimensional LUT would then accept $\alpha_R$ and $C_R$ as the two indices into the table, such as: LUT[$\alpha_R$: 0...255][$C_R$: 0...255].
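The two-instruction substitute for a packed left shift described above may be modelled in portable C as follows. This is a host-side sketch of the idea only, not the intrinsics themselves: a 32-bit word is assumed to hold two unsigned 16-bit lanes, and the function names are hypothetical.

```c
#include <stdint.h>

/* Model of a packed right shift: each 16-bit lane is shifted independently,
 * i.e. each lane is divided by 2^n with truncation. */
uint32_t shr2_model(uint32_t x, unsigned n)
{
    uint32_t hi = (x >> 16) >> n;
    uint32_t lo = (x & 0xFFFFu) >> n;
    return (hi << 16) | lo;
}

/* Model of a packed left shift built from two operations: multiply each
 * lane by 2^n (32-bit products), then keep the low 16 bits of each product.
 * n must be in the range 0 to 15. */
uint32_t shl2_model(uint32_t x, unsigned n)
{
    uint32_t mul = 1u << n;
    uint32_t hi = ((x >> 16) * mul) & 0xFFFFu;      /* multiply, then truncate */
    uint32_t lo = ((x & 0xFFFFu) * mul) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

The multiply-then-truncate form shows why the right shift is preferred: the right shift is a single packed operation, whereas the left shift needs a multiply followed by a repacking step.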

Significantly, the TMS320C64x™ family has 1 Meg of SRAM on-chip, thereby possessing ample capacity to retain these LUT's. Accordingly the LUT's may be used to perform the un-premultiply function on conclusion of the compositing algorithm.
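A two-dimensional table of the kind described may be sketched as follows. The contents of the table used in Appendix A are not given in this specification, so the rounding and clamping choices here are assumptions, as are the names unpremul_lut, build_unpremul_lut and unpremultiply. A 256 by 256 table of 8-bit entries occupies 64 kilobytes.

```c
#include <stdint.h>

/* Hypothetical un-premultiply table indexed by [alpha][premultiplied colour]. */
static uint8_t unpremul_lut[256][256];

/* Fill the table once, on a processor or host that can divide. */
void build_unpremul_lut(void)
{
    for (int a = 0; a < 256; a++) {
        for (int c = 0; c < 256; c++) {
            /* colour = premultiplied * 255 / alpha, rounded to nearest;
             * alpha of zero is mapped to zero (assumption). */
            int v = (a == 0) ? 0 : (c * 255 + a / 2) / a;
            unpremul_lut[a][c] = (uint8_t)(v > 255 ? 255 : v);
        }
    }
}

/* Division by alpha becomes a single table access. */
uint8_t unpremultiply(uint8_t alpha, uint8_t premul_colour)
{
    return unpremul_lut[alpha][premul_colour];
}
```

Once the table is built, no division instruction is needed at compositing time, which is the property relied upon on a device that lacks one.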

Further, where an image is being composited that has more than two overlapping layers, a further composite will be necessary for each additional layer. In these instances, access to the LUT's need only be performed once, on conclusion of the last composite.

This is advantageous since the number of cycles required to access the LUT (between 4 and 20 cycles, depending on the location and cache status of the entry) is commensurate with the number of cycles (about 11) required to composite one pixel. Thus processing time savings can be obtained by leaving values in their premultiplied form.

However, the use of the LUT is constrained to the particular values in the table and the tables must be first constructed for each pixel being composited. Accordingly a general solution to division is nevertheless required.

Consider the situation where two 8-bit values are multiplied, this giving a 16-bit output value. It is necessary to divide by 255 in order to return to an 8-bit value. A general-purpose microprocessor can perform this function since it typically possesses integer division, and most often a floating-point arithmetic instruction set. In hardware, such a division may be approximated by a right bit shift by 8, but such is equivalent to a division by 256. Implementing such an approach introduces errors by virtue of the difference between 255 and 256, and such is not possible in any event in the TMS320C64x™ family, as mentioned above.
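The error arising from the difference between 255 and 256 can be seen in a short comparison. The two function names are hypothetical; the arithmetic is ordinary unsigned integer arithmetic.

```c
#include <stdint.h>

/* Product of two 8-bit values returned to 8 bits by a right shift of 8,
 * which is a division by 256. */
uint8_t scale_by_shift(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a * b) >> 8);
}

/* The same product returned to 8 bits by true integer division by 255. */
uint8_t scale_exact(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a * b) / 255u);
}
```

For example, full opacity multiplied by full colour (255 by 255) yields 254 with the shift but 255 with the division, so a fully opaque white pixel would darken slightly on every composite if the shift alone were used.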

The solution proposed in the present implementation is to treat all values as 16-bit values, thereby obtaining a 32-bit multiplied word after multiplication, which can simply be truncated back to 16-bits with no more than a single bit of error. Fig. 7A shows how these 16-bit values are obtained from 8-bit colour values. This is useful because 16-bit compositing affords better precision than 8-bit compositing and the 16-bit data size is better supported on the preferred implementation. In Fig. 7A, the 8-bit values are multiplied to give a 16-bit intermediate result, which is then summed with a bit-shifted version of itself to give a final result. This process ensures that the resulting 16-bit value can always span the full range over zero to 0x7FFF for its high 15-bits, with the lowest bit always being undefined. This trailing bit will result in accumulation of errors over time due to truncation and its undefined nature upon creation, but since the preferred target precision after un-premultiplication is just 8-bits of precision, this is sufficient to ensure better than 8-bit accuracy even after a hundred composite operations on the same pixel (many more than will ever be encountered in a well-formed compositing stack).
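The extension to 16 bits may be sketched in scalar form as below. The multiplier 0x101 and the shift of 7 are taken from a reading of lines 25, 28 and 31 of the Appendix A listing; the scalar function names are hypothetical, and the listing itself operates on packed values rather than one channel at a time.

```c
#include <stdint.h>

/* Extend an 8-bit alpha to 16 bits: multiplying by 0x101 replicates the
 * byte, so that 255 maps to 65535. */
uint16_t extend_alpha16(uint8_t a)
{
    return (uint16_t)(a * 0x101u);
}

/* Premultiply an 8-bit colour by an 8-bit alpha and extend to 16 bits:
 * the 16-bit product is summed with itself shifted right by 7, which
 * approximates scaling by 65535 / (255 * 255). */
uint16_t premultiply16(uint8_t c, uint8_t a)
{
    uint32_t p = (uint32_t)c * a;       /* at most 65025, fits in 16 bits */
    return (uint16_t)(p + (p >> 7));    /* at most 65533, still 16 bits */
}
```

In this sketch the largest result is 65533 rather than 65535, a shortfall confined to the lowest bits of the 16-bit value.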

Although this approach accumulates error over time, such has been found to be sufficiently accurate for present purposes. In the code listing of Appendix A, lines 28 and 31 achieve the process of Fig. 7A for the three colour channels of the source pixel, and lines 55 and 56 achieve the process of Fig. 7A for the three colour channels of the destination pixel. Figs. 7B and 7C show truncation operations. Fig. 7B shows a 16-bit by 16-bit multiplication with truncation back to 16-bits. Fig. 7C shows how the lower 8-bits are ignored so as to truncate back to 8-bits. Truncation back to 16-bits is achieved through the _pack2 and _packh2 instructions, as seen at lines 111 and 86 respectively, for example.
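The two truncations of Figs. 7B and 7C amount to keeping the upper half of a wider value. A scalar sketch, with hypothetical function names, is:

```c
#include <stdint.h>

/* Fig. 7B style: multiply two 16-bit values and keep the upper 16 bits of
 * the 32-bit product, which is a division by 65536 without a divide. */
uint16_t mul16_trunc(uint16_t x, uint16_t y)
{
    return (uint16_t)(((uint32_t)x * y) >> 16);
}

/* Fig. 7C style: ignore the lower 8 bits of a 16-bit value. */
uint8_t trunc16_to_8(uint16_t v)
{
    return (uint8_t)(v >> 8);
}
```

Multiplying the two largest 16-bit values gives 65534 after truncation, one less than the ideal 65535, which is the single bit of error referred to above; truncating that result to 8 bits still gives 255.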

Fig. 8 shows the core architecture 800 of the TMS320C64x™ family of DSP's which may be used in a preferred implementation of the described alpha-compositing processes. The core 800 is characterised by two data paths 820, 840 each having a register file 822, 842 incorporating two 16-bit registers 824, 826; 844, 846 configured for reading as a 32-bit word, thereby offering 64-bit operation. The register files 822, 842 are coupled via data bus connections 828, 848 to respective banks of load 830, 850, shift 832, 852, multiply 834, 854 and arithmetic 836, 856 process modules, which afford quad 16-bit and octal 8-bit multiply-accumulate performance. The data paths 820, 840 are cross-coupled via a bus connection 860 that permits the register files 822, 842 access to the other bank of process modules. The core 800 is founded upon a VLIW architecture and has separate modules for each of instruction fetch 808, instruction dispatch 810 and instruction decode 812, thereby permitting pipelined instruction operation in concert with the separate data paths 820, 840. Control registers 802, interrupt control 804 and emulation 806 modules are also incorporated into the core 800 and perform traditional roles.

Since the core 800 has eight parallel operational units, it is capable of executing eight instructions simultaneously. This parallelism allows the compositing algorithm to be executed in 6 to 16 processor cycles (depending on the ROP used and the premultiplied/unpremultiplied nature of the data) by working on between 4 and 7 pixels simultaneously. Optimal (ie. fastest) performance necessitates that the code be optimised for each case and appropriately scheduled using pipelining to take advantage of available execution units.

The instruction set associated with the core 800 provides for a number or combination of operations that may be advantageously adapted to the compositing algorithm. These may now be described, with particular reference to Appendix A which provides the code fragments for a preferred implementation.

Firstly, because pixel data, as shown in Fig. 1, is in a 32-bit format, and since the core 800 can handle 64-bit words, such permits pre-multiplying all pixel values by alpha to obtain the data structure shown in Fig. 9. With this, approximately the first half of the code listing of Appendix A, from lines 19 to 70, attends to the organisation of source and destination pixel data that is either premultiplied or un-premultiplied, the former for example being derived from a previous composite operation and the latter as new pixel data. These operations are advantageously founded upon the _pack, _mpyu2 and _mpyu4 operators from the instruction set. The _pack instructions perform the packing operation of Fig. 1, the _mpy2 and _mpyu2 instructions multiply 2 packed 16-bit values to obtain 2 packed 32-bit values, and the _mpyu4 instruction multiplies 4 packed 8-bit values to obtain 4 packed 16-bit values.
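For readers without access to the target device, the packed operations described above can be modelled in portable C. The models below follow the behaviour stated in this description (2 packed 16-bit values giving 2 packed 32-bit values, and 4 packed 8-bit values giving 4 packed 16-bit values); the lane ordering, the unsigned treatment and the _model names are assumptions made for illustration and are not the intrinsics themselves.

```c
#include <stdint.h>

/* Model of packing two low halfwords: the low 16 bits of `a` form the
 * upper half of the result and the low 16 bits of `b` the lower half. */
uint32_t pack2_model(uint32_t a, uint32_t b)
{
    return (a << 16) | (b & 0xFFFFu);
}

/* Model of a four-lane unsigned 8-bit multiply: each byte of x is
 * multiplied by the corresponding byte of y, giving four 16-bit products
 * held in one 64-bit word. */
uint64_t mpyu4_model(uint32_t x, uint32_t y)
{
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t xb = (x >> (8 * lane)) & 0xFFu;
        uint32_t yb = (y >> (8 * lane)) & 0xFFu;
        r |= (uint64_t)(xb * yb) << (16 * lane);
    }
    return r;
}

/* Model of a two-lane unsigned 16-bit multiply: two 32-bit products
 * held in one 64-bit word. */
uint64_t mpyu2_model(uint32_t x, uint32_t y)
{
    uint64_t lo = (uint64_t)(x & 0xFFFFu) * (y & 0xFFFFu);
    uint64_t hi = (uint64_t)(x >> 16) * (y >> 16);
    return (hi << 32) | lo;
}
```

These models make visible why one packed multiply can premultiply all colour channels of a pixel at once, where a sequential processor would need one multiply per channel.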

With the composite terms multiplied, it is then necessary for the terms to be added together. Logically, for example, such may be expressed as:

if $f_{S \cap D}$
    add2 loword1, loword2
else
    (no operation)

Whilst such is a logical representation, such is not conducive to pipelining. The core 800 however has a "predicate" command that operates with the control register in the following fashion:

[A1] add2 loword1, loword2

which is functionally equivalent to the above logical operation and only executes when the register A1 is set.

Although the above discusses the concept of predication, and predication is actually used by the preferred implementation, it is not essential to predicate on these Boolean f values. The preferred approach is to predicate on whether or not the source, destination and result are premultiplied. This may be implemented by a change of the condition from $f_{S \cap D}$ to src_premultiplied, and this bit is still accurate. Predication of the Boolean f values is implemented using the bit masks on lines 76, 77, and 86-89 of the code of Appendix A.
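The use of bit masks in place of a conditional can be sketched as below. The variable names echo the terms of the Appendix A listing (srd for the overlap term, dos for destination outside source, sod for source outside destination), but the functions themselves are hypothetical and operate on plain 32-bit words.

```c
#include <stdint.h>

/* All ones when the flag is set, all zeros otherwise. */
uint32_t flag_mask(int flag)
{
    return 0u - (uint32_t)(flag != 0);
}

/* Branching form: each term is added only if its Boolean f value is true. */
uint32_t sum_branching(uint32_t srd, uint32_t dos, uint32_t sod,
                       int f_sd, int f_d, int f_s)
{
    uint32_t sum = 0;
    if (f_sd) sum += srd;
    if (f_d)  sum += dos;
    if (f_s)  sum += sod;
    return sum;
}

/* Masked form: every term is always added, after undesired terms have been
 * forced to zero, so the instruction sequence is the same for every pixel. */
uint32_t sum_masked(uint32_t srd, uint32_t dos, uint32_t sod,
                    int f_sd, int f_d, int f_s)
{
    return (srd & flag_mask(f_sd)) + (dos & flag_mask(f_d)) + (sod & flag_mask(f_s));
}
```

Because the masked form contains no branch, the same fixed sequence can be scheduled into a loop kernel regardless of which Porter Duff regions are selected; the masks themselves can be computed once per operation rather than once per pixel.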

Raster operations (ROPs) represent a step that is not present in Porter Duff operations. They are only implemented in the $S \cap D$ region and are a way of combining the colour information other than by simple alpha-transparency. Colour ROPs include XORing colour values, merging (ORing) colour values, and arithmetic operations such as thresholding and chroma-keying. Implementing a ROP is simply a matter of applying the correct operations to the source and destination pixel values when a ROP is selected. The most common ROPs, however, are simply NOP and COPYPEN, which simply result in alpha transparency without any additional operations performed on the data.
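A small selection of the colour ROPs mentioned above may be sketched per 8-bit channel as follows. The enumeration and function names are hypothetical, and treating COPYPEN as returning the source colour unchanged is an assumption consistent with its description as resulting in plain alpha transparency.

```c
#include <stdint.h>

/* Hypothetical identifiers for three of the ROPs named in the text. */
typedef enum { ROP_COPYPEN, ROP_XOR, ROP_OR } Rop;

/* Combine one source and one destination colour channel in the overlap
 * region according to the selected ROP. */
uint8_t apply_rop(Rop rop, uint8_t src, uint8_t dst)
{
    switch (rop) {
    case ROP_XOR:
        return (uint8_t)(src ^ dst);    /* XOR the colour values */
    case ROP_OR:
        return (uint8_t)(src | dst);    /* merge (OR) the colour values */
    case ROP_COPYPEN:
    default:
        return src;                     /* no extra operation on the data */
    }
}
```

In a pipelined implementation the selection would normally be made once, outside the pixel loop, so that the loop body contains only the operations of the chosen ROP.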

When all compositing operations are completed (ie. line 94 of Appendix A), the command _packl4 is used to take 4 values (ie. 64-bits) and insert them into a 32-bit word. This is seen in line 109 of Appendix A.

The code then proceeds to the look-up table, as described above, to perform the necessary division operation to strip the $\alpha$'s from the premultiplied pixels to give a 32-bit word. This is seen at lines 107 to 121 of Appendix A.

In the preferred implementation, operating the code of Appendix A upon the core 800 requires approximately 20 clock cycles to fill the pipeline (cf. Fig. 6). With this, one pixel is output from the pipeline every 8 cycles when using a basic LCOCOPYPEN operation for the ROP. As the TMS320C64x™ family is scalable to clock speeds up to 1.1 GHz, this provides a compositing rate of approximately 130 megapixels per second, noting that the cycle count will vary. This compositing rate, coupled with a cost of about US$50 per unit, compares most favourably with alternatives such as a dedicated software compositor formed on the Pentium™ IV processor mentioned above.
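The quoted rate can be checked with a short calculation, assuming exactly 8 cycles per pixel in the steady state and a one-off pipeline fill of about 20 cycles; the symbol $N$ for the number of pixels is introduced here for illustration.

```latex
\[
\text{rate} = \frac{1.1 \times 10^{9}\ \text{cycles/s}}{8\ \text{cycles/pixel}}
            = 1.375 \times 10^{8}\ \text{pixels/s}
            \approx 137.5\ \text{megapixels/s}
\]
\[
\text{cycles for } N \text{ pixels} \approx 20 + 8N
\]
```

The ideal figure of about 137.5 megapixels per second is an upper bound; the quoted figure of approximately 130 megapixels per second sits a little below it, which is consistent with the note that the cycle count will vary.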

It is to be appreciated that much of the above description relates to an implementation using the TMS320C64x™ family of DSPs. Where other VLIW processors are used, such as the FR500, MAP-CA™ or Itanium™ mentioned above, certain device-specific implementation variations may be required.

Fig. 10 is a schematic block diagram of a printer 1000 including an alpha-channel compositor board 1010 and the printer engine 534 of Fig. 5, and which may be used with the computer system 500. Notably, the board 1010 includes the network interface 536 for coupling to the computer 500 via the connection 530 or the network 520 via the connection 531. The RAM 537 is retained for temporary storage and ROM/firmware 1030 is used to retain the operating code of a DSP device 1020. A bus 1050 couples the DSP 1020 to the RAM 540 and an ASIC 1040 which may be configured to perform the roles of the colour converter 546 and the halftoning unit 548.

Industrial Applicability
The arrangements described are applicable to the image compositing and digital signal processing arts.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiment(s) being illustrative and not restrictive.

Appendix A
Below is a code listing for performing alpha-channel compositing using the TMS320C64x™ device manufactured by Texas Instruments Inc. of the USA. Note that all functions that begin with an underscore character are "intrinsic" functions, in that such functions map directly to TMS320C64x™ assembly functions. From their frequency throughout the code, the skilled reader will be able to infer how difficult this code might be to write without use of the intrinsic functions.

All mpy commands perform multiplication, and the pack and itod commands re-order and repack the bytes from different variables into a single variable. ufsr_div255 is a look-up table used to perform division, the nature of which has already been described.

In the code listing, all code is in Arial font and comments are in Italicised Times font bounded by symbols.

16
17   /* Code Listing */
18
19   /* commence with actions for the un-premultiplied source */
20   #define LOAD_SRC_UNPRE \
21   { \
22       /* load the pixel data */ \
23       src_32 = src_data[i]; \
24       /* extract and extend alpha to 16-bits */ \
25       alpha_s = _mpyu(src_32 >> 24, 0x101); \
26       double_s = _pack2(alpha_s, alpha_s); \
27       /* premultiply all channels */ \
28       src_dword3 = _mpyu4(_extu(src_32, 8, 8), _packl4(double_s, double_s)); \
29       /* the remainder of this sub-routine operates to complete the extension of all colour channels \
30          to 16 bits and reassemble data */ \
31       src_dword2 = _itod(_shru2(_hi(src_dword3),7), _shru2(_lo(src_dword3),7)); \
32       src_dword = _itod \
33       ( \
34           (double_s<<16) | _add2(_hi(src_dword3), _hi(src_dword2)) \
35           , \
36           _add2(_lo(src_dword3), _lo(src_dword2)) \
37       ); \
38   }
39   /* now for the premultiplied source */
40
41   #define LOAD_SRC_PRE \
42   { \
43       /* load premultiplied pixel */ \
44       src_dword = ((Int64*)src_data)[i & src_x_dep_mask]; \
45       /* extract alpha */ \
46       alpha_s = _hi(src_dword) >> 16; \
47   }
48
49   /* same as above, but for un-premultiplied destination */
50   #define LOAD_DST_UNPRE \
51   { \
52       dst_32 = dst_data[i]; \
53       alpha_d = _mpyu(dst_32 >> 24, 0x101); \
54       double_d = _pack2(alpha_d, alpha_d); \
55       dst_dword3 = _mpyu4(_extu(dst_32, 8, 8), _packl4(double_d, double_d)); \
56       dst_dword2 = _itod(_shru2(_hi(dst_dword3),7), _shru2(_lo(dst_dword3),7)); \
57       dst_dword = _itod \
58       ( \
59           (double_d<<16) | _add2(_hi(dst_dword3), _hi(dst_dword2)) \
60           , \
61           _add2(_lo(dst_dword3), _lo(dst_dword2)) \
62       ); \
63   }
64   /* same as above, but for premultiplied destination */
65
66   #define LOAD_DST_PRE \
67   { \
68       dst_dword = ((Int64*)dst_data)[i & dst_x_dep_mask]; \
69       alpha_d = _hi(dst_dword) >> 16; \
70   }
71
72   /* now perform the composite function */
73   #define DO_COMPOSITE \
74   { \
75       /* the next two lines perform the raster operation ROP */ \
76       srd_temp2 = (ROP_RESULT_HI) & use_srd_mask; \
77       srd_temp3 = (ROP_RESULT_LO) & use_srd_mask; \
78       /* the next ten (10) lines perform the required multiplications and mask out \
79          undesired Porter Duff regions */ \
80       other_s = _pack2(65535 - alpha_s, 65535 - alpha_s); \
81       other_d = _pack2(65535 - alpha_d, 65535 - alpha_d); \
82       dos_dtemp1 = _mpyu2(other_s, _hi(dst_dword)); \
83       sod_dtemp1 = _mpyu2(other_d, _hi(src_dword)); \
84       dos_dtemp2 = _mpyu2(other_s, _lo(dst_dword)); \
85       sod_dtemp2 = _mpyu2(other_d, _lo(src_dword)); \
86       dos_temp2 = _packh2(_hi(dos_dtemp1), _lo(dos_dtemp1)) & use_dos_mask; \
87       dos_temp3 = _packh2(_hi(dos_dtemp2), _lo(dos_dtemp2)) & use_dos_mask; \
88       sod_temp2 = _packh2(_hi(sod_dtemp1), _lo(sod_dtemp1)) & use_sod_mask; \
89       sod_temp3 = _packh2(_hi(sod_dtemp2), _lo(sod_dtemp2)) & use_sod_mask; \
90       /* the next two lines sum the terms */ \
91       hi_word = srd_temp2 + dos_temp2 + sod_temp2; \
92       lo_word = srd_temp3 + dos_temp3 + sod_temp3; \
93   }
94   /* store the data */
95
96   #define STORE_RES_PRE \
97   { \
98       ((Int64*)res_data)[i] = _itod(hi_word, lo_word); \
99   }
100
101  /* the remaining code operates to un-premultiply the colour channels and store the pixel
102     as a 32-bit word */
103  #define STORE_RES_UNPRE \
104  { \
105      alpha_r = hi_word >> 24; \
106      \
107      res_data[i] = \
108      \
109      _packl4 \
110      ( \
111          _pack2(alpha_r, ufsr_div255[alpha_r][_extu(hi_word, 16, 24)]) \
112          , \
113          _pack2 \
114          ( \
115              ufsr_div255[alpha_r][lo_word >> 24] \
116              , \
117              ufsr_div255[alpha_r][_extu(lo_word, 16, 24)] \
118          ) \
119      ); \
120  }
121  /* code listing ends */

Claims

1. A method of performing alpha-channel compositing, said method comprising the steps of: dividing an alpha-channel compositing algorithm into a plurality of stages; arranging the stages in a pipeline fashion such that stages that require corresponding computing operations are sequentially arranged in said pipeline and stages that are performed by independent computing operations are parallel, subject to dependency criteria in said algorithm; replicating the pipeline as a plurality of substantially parallel configured pipelines; and implementing the pipelines in a substantially parallel configuration such that corresponding stages in each said pipeline are offset in sequence with each other.

2. A method according to claim 1 wherein said compositing algorithm comprises:

$\alpha_R C_R = f_{\bar{S} \cap D}\,(1 - \alpha_S)(\alpha_D)\,C_D + f_{S \cap \bar{D}}\,(1 - \alpha_D)(\alpha_S)\,C_S + f_{S \cap D}\,(\alpha_S)(\alpha_D)\,\mathrm{ROP}(C_S, C_D)$

$\alpha_R = f_{\bar{S} \cap D}\,(1 - \alpha_S)\,\alpha_D + f_{S \cap \bar{D}}\,(1 - \alpha_D)\,\alpha_S + f_{S \cap D}\,\alpha_S \alpha_D$

and

$C_R = \dfrac{\alpha_R C_R}{\alpha_R}$

wherein: the subscripts S and D represent the source and destination compositing input layers and the subscript R represents the resultant compositing output layer; C represents the set of colour channels for the pixel specified by the subscript; $\alpha$ is the opacity level for the pixel specified by the subscript; ROP is a compositing operation specified to combine the source and destination layers; and f is a Boolean value.

3. A method according to claim 2 wherein the dividing step forms processing stages for said algorithm for each of: (1) fetching colour and $\alpha$ values for corresponding pixels from the input layers; (2) determining $\alpha$ products for the three different f terms in said algorithm; (3) determining the colours resulting from the ROP for the $S \cap D$ term; (4) multiplying the three colours, being the two input-layer colours and the ROP result colour, by their respective $\alpha$; (5) summing the terms for which f is true; (6) calculating $\alpha_R$ from the $\alpha$ intermediates; (7) dividing by $\alpha_R$ to obtain the resultant colour $C_R$; and (8) storing the colour $C_R$ and $\alpha_R$ values for the result.

4. A method according to claim 3 wherein step forms said pipeline as plural sequential components, being: stage (ii) stages and arranged in parallel and receiving inputs from stage (iii) stages and arranged in parallel, stage receiving input from stage (2) and stage receiving input from stages and (iv) stage 5, receiving input from stage stage receiving input from stages and and (vi) stage receiving input from stage 637363.DOC -36- A method according to claim 4 wherein step comprises replicating said pipeline in substantial parallel such that sequential component (ii) of one said pipeline is operative coincident with sequential component of another said pipeline. 6. A method according to claim 5 wherein when said pluralityof said pipelines numbers six, coincident operation of each said pipeline results in parallel operation of one said sequential component of each said pipeline. 7. A method according to any one of claims 1 to 6 wherein step is performed using a processor having a plurality of parallel process pipelines such that the process pipelines at any moment in the substantially parallel configuration can each operate independently by virtue of the offset sequence. 8. A method according to claim 7 wherein said processor comprises a digital signal processor. 9. A digital signal processor configured to perform alpha-channel compositing. 
A digital signal processor according to claim 9 wherein said compositing implements an algorithm comprising: aR R fD (1 a )(aD)CD fsn (1 aD)(as)C, fsnD(a )(aD)ROP(Cs,CD) aR fD( 1 aS)aD fs (1 aD)aS fsnDaSaD and CR (R CR CR wherein: 637363.DOC 37 the subscripts S and D represent the source and destination compositing input layers and the subscript R represents the resultant compositing output layer; C represents the set of colour channels for the pixel specified by the subscript; c is the opacity level for the pixel specified by the subscript; ROP is a compositing operation specified to combine the source and destination layers; and f is a Boolean value; and said digital signal processor has multiple execution units configurable to form a plurality of substantially parallel processing paths each arranged to implement at least one part of said algorith simultaneously.. 11. A method of alpha-compositing at least two layers to provide a digital image using a processor having a plurality of asymmetric execution units capable of parallel operation, said method including the steps of: calculating intermediate alphas for a pixel dependent upon said at least two layers and an overlap term between said at least two layers and in parallel determining a colour resulting from a ROP operation applied to said overlap term; determining a pre-multiplied colour result for said pixel dependent upon a colour and a respective one of said intermediate alphas for each of said at least two layers and said ROP result and in parallel calculating a resulting alpha from said intermediate alphas; and dividing said pre-multiplied colour result by said resulting alpha to provide a resulting colour and a resulting alpha for said pixel; wherein said steps are applied in a time-staggered manner to a plurality of pixels of said digital image. 637363.DOC -38 12. A method according to claim 11, wherein said processor implements very-long- instruction words (VLIW) for multiple parallel instructions. 13. 
A method according to claim 12, wherein said processor implements single-instruction-multiple-data instructions.

14. A method according to claim 12, wherein said processor is one of a microprocessor and a digital signal processor.

15. A method according to claim 10, wherein each pixel of said digital image is encoded in a packed pixel format.

16. A method according to claim 10, further including the steps of: fetching said pixel colour and alpha values from said at least two layers; and storing said resulting colour and said resulting alpha for said pixel.

17. A method according to claim 1, wherein: said step of determining said pre-multiplied colour result includes: multiplying said colour for each of said at least two layers and said ROP result by said respective one of said intermediate alphas; and combining said alpha-multiplied colours dependent upon one of said at least two layers and said ROP result being a correct term for said pixel to provide said pre-multiplied colour result; and said step for calculating said resulting alpha is carried out in parallel with at least one of said multiplying and said combining steps.

18. A method according to claim 10, further including the step of: accessing a lookup table (LUT) using said resulting colour and said resulting alpha as indices of said LUT to perform an un-premultiply function.

19.
Apparatus for alpha-compositing at least two layers to provide a digital image using a processor having a plurality of asymmetric execution units capable of parallel operation, said apparatus including: means for calculating intermediate alphas for a pixel dependent upon said at least two layers and an overlap term between said at least two layers and in parallel determining a colour resulting from a ROP operation applied to said overlap term; means for determining a pre-multiplied colour result for said pixel dependent upon a colour and a respective one of said intermediate alphas for each of said at least two layers and said ROP result and in parallel calculating a resulting alpha from said intermediate alphas; and means for dividing said pre-multiplied colour result by said resulting alpha to provide a resulting colour and a resulting alpha for said pixel; wherein the processing is applied in a time-staggered manner to a plurality of pixels of said digital image.

20. Apparatus according to claim 19, wherein said processor is a digital signal processor.

21. Apparatus according to claim 20, wherein said processor implements single-instruction-multiple-data instructions.

22. Apparatus according to claim 20, wherein said processor implements very-long-instruction words (VLIW) for multiple parallel instructions.

23. Apparatus according to claim 10, wherein each pixel of said digital image is encoded in a packed pixel format.

24. Apparatus according to claim 9, further including: means for fetching for said pixel colour and alpha values from said at least two layers; and means for storing said resulting colour and said resulting alpha for said pixel.
25. Apparatus according to claim 9, wherein: said means for determining said pre-multiplied colour result includes: means for multiplying said colour for each of said at least two layers and said ROP result by said respective one of said intermediate alphas; and means for combining said alpha-multiplied colours dependent upon one of said at least two layers and said ROP result being a correct term for said pixel to provide said pre-multiplied colour result; and said means for calculating said resulting alpha operates in parallel with at least one of said multiplying and said combining means.

26. Apparatus according to claim 9, further including: a lookup table (LUT) using said resulting colour and said resulting alpha as indices of said LUT to perform an un-premultiply function.

27. A computer program product having a computer readable medium with a computer program recorded therein for alpha-compositing at least two layers to provide a digital image using a processor having a plurality of asymmetric execution units capable of parallel operation, said computer program product including: computer program code means for calculating intermediate alphas for a pixel dependent upon said at least two layers and an overlap term between said at least two layers and in parallel determining a colour resulting from a ROP operation applied to said overlap term; computer program code means for determining a pre-multiplied colour result for said pixel dependent upon a colour and a respective one of said intermediate alphas for each of said at least two layers and said ROP result and in parallel calculating a resulting alpha from said intermediate alphas; and computer program code means for dividing said pre-multiplied colour result by said resulting alpha to provide a resulting colour and a resulting alpha for said pixel; wherein the processing is applied in a time-staggered manner to a plurality of pixels of said digital image.

28.
A computer program product according to claim 27, wherein said processor is a digital signal processor.

29. A computer program product according to claim 28, wherein said processor implements single-instruction-multiple-data instructions.

30. A computer program product according to claim 28, wherein said processor implements very-long-instruction words (VLIW) for multiple parallel instructions.

31. A computer program product according to claim 27, wherein each pixel of said digital image is encoded in a packed pixel format.

32. A computer program product according to claim 27, further including: computer program code means for fetching for said pixel colour and alpha values from said at least two layers; and computer program code means for storing said resulting colour and said resulting alpha for said pixel.

33. A computer program product according to claim 27, wherein: said computer program code means for determining said pre-multiplied colour result includes: computer program code means for multiplying said colour for each of said at least two layers and said ROP result by said respective one of said intermediate alphas; and computer program code means for combining said alpha-multiplied colours dependent upon one of said at least two layers and said ROP result being a correct term for said pixel to provide said pre-multiplied colour result; and said computer program code means for calculating said resulting alpha operates in parallel with at least one of said computer program code means for multiplying and said computer program code means for combining.

34. A computer program product according to claim 27, further including: computer program code means for accessing a lookup table (LUT) using said resulting colour and said resulting alpha as indices of said LUT to perform an un-premultiply function.

35. A digital signal processor configured to perform division by values other than multiples of two.

36.
A digital signal processor according to claim 35, comprising: means for multiplying pairs of integer values to form a corresponding multiple; means for storing each said pair of said values and the corresponding said multiple in a look-up table; means for determining a division result of a numerator and a denominator, said means comprising: means for traversing said look-up table along a first axis thereof corresponding to said denominator to identify one said multiple corresponding to said numerator; and means, having identified said one multiple, for identifying said division result from a second axis of said look-up table.

37. A processor according to claim 36 wherein said processor is operative for pixel compositing and said means for determining said division result operates upon one colour channel of said pixel at a time.

38. A digital signal processor according to claim 36 wherein said multiplying is performed in a fashion corresponding to that depicted in Fig. 7A or Fig. 7B.

39. A digital signal processor according to claim 36 further comprising means for truncating values in said look-up table in a fashion corresponding to that of Fig. 7B or Fig. 7C.

40. An alpha-channel compositing system substantially as described herein with reference to Figs. 6, 7A to 7C, 9 and 10 of the drawings.

41.
A VLIW processor configured for alpha-compositing at least two layers to provide a digital image using a processor having a plurality of asymmetric execution units capable of parallel operation, said processor comprising: a first module configured for calculating intermediate alphas for a pixel dependent upon said at least two layers and an overlap term between said at least two layers and in parallel determining a colour resulting from a ROP operation applied to said overlap term; a second module configured for determining a pre-multiplied colour result for said pixel dependent upon a colour and a respective one of said intermediate alphas for each of said at least two layers and said ROP result and in parallel calculating a resulting alpha from said intermediate alphas; and a third module configured for dividing said pre-multiplied colour result by said resulting alpha to provide a resulting colour and a resulting alpha for said pixel; wherein the processing is applied in a time-staggered manner to a plurality of pixels of said digital image.

DATED this FIFTEENTH Day of MAY 2003
CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant
SPRUSON & FERGUSON
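For illustration only, the three steps recited in claims 11 and 41 (intermediate alphas with the ROP colour, pre-multiplied colour with the resulting alpha, then division) might be sketched for a single pixel as follows. This is a minimal sequential sketch under stated assumptions: the function name `composite_pixel`, its parameter names, the use of floating-point values in the range 0 to 1, and the handling of a zero resulting alpha are all hypothetical and are not taken from the specification. It does not model the parallel execution units, the look-up-table division, or the time-staggered scheduling across pixels.

```python
def composite_pixel(c_s, a_s, c_d, a_d, rop, f_s=True, f_d=True, f_sd=True):
    """Composite one source pixel over one destination pixel.

    c_s, c_d : tuples of colour channels as floats in [0, 1] (assumed encoding)
    a_s, a_d : opacities as floats in [0, 1]
    rop      : per-channel function rop(source_channel, destination_channel)
    f_s, f_d, f_sd : Boolean flags selecting the source-only, destination-only
                     and overlap contributions (hypothetical parameter names)
    """
    # Step 1: intermediate alphas for the three regions, plus the ROP colour
    # for the overlap region.
    a_dest_only = (1.0 - a_s) * a_d if f_d else 0.0
    a_src_only = (1.0 - a_d) * a_s if f_s else 0.0
    a_overlap = a_s * a_d if f_sd else 0.0
    c_rop = tuple(rop(s, d) for s, d in zip(c_s, c_d))

    # Step 2: pre-multiplied colour result and resulting alpha.
    premultiplied = tuple(
        a_dest_only * d + a_src_only * s + a_overlap * r
        for s, d, r in zip(c_s, c_d, c_rop)
    )
    a_r = a_dest_only + a_src_only + a_overlap

    # Step 3: divide the pre-multiplied colour by the resulting alpha.
    # Returning zero colour for a fully transparent result is an assumption.
    if a_r == 0.0:
        return (0.0,) * len(c_s), 0.0
    return tuple(p / a_r for p in premultiplied), a_r


# Usage: half-opaque red over opaque blue, with a ROP that selects the source.
colour, alpha = composite_pixel(
    (1.0, 0.0, 0.0), 0.5, (0.0, 0.0, 1.0), 1.0, rop=lambda s, d: s
)
```

With all three flags set and a ROP that returns the source colour, the sketch reduces to the familiar source-over operation; clearing individual flags drops the corresponding region's contribution.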