EP1754198A1

EP1754198A1 - Animation systems

Info

Publication number: EP1754198A1
Application number: EP05731119A
Authority: EP
Inventors: David Jason Lander
Original assignee: Gameware Europe Ltd
Current assignee: Gameware Europe Ltd
Priority date: 2004-05-26
Filing date: 2005-04-11
Publication date: 2007-02-21
Also published as: WO2005116932A1

Abstract

This invention relates to methods, apparatus, computer program code and signals for capturing and animating images. A method of providing an animated image, the method comprising: inputting and storing two-dimensional image data defining an image; inputting and storing coordinate data defining one or more reference coordinates within said image identifying at least one portion of said image for animation; inputting audio data defining sounds to which animation of said image is to be synchronised; reading a portion of said audio data; determining deformation mesh data for said image responsive to said portion of audio data using said coordinate data, said deformation mesh data defining a deformation mesh for generating a deformed version of said image; mapping said stored image data using said deformation mesh data to generate deformed image data defining an image deformed in accordance with said mesh; outputting said deformed image data to provide a frame of said animated image; and repeating said reading, determining, mapping and outputting to provide data for said animated image.

Description

Animation Systems

This invention relates to methods, apparatus, computer program code and signals for capturing and animating images. The application claims the benefit of US provisional application no. 60/575,025 filed May 28, 2004.

Known image animation techniques generally employ three dimensional model making, requiring substantial computing power and data storage capabilities. It is known to implement such systems on a server in a development environment; see, for example, US 2003/0137516, Pulse Entertainment Inc. and the related Veepers (trademark) system, and WO 01/13332, Biovirtual Limited. Lightweight systems for animating predetermined icons on a mobile phone are also known - see, for example, FaceWave (registered trademark) from Anthropics Technology Limited, and also GB 2378879, EP 1272979, WO 02/052863, WO 02/052508, GB 2360919, GB 23609183, and WO 00/17820.

The above described systems are either cumbersome, not suitable for use by other than software development engineers and processor and memory intensive, and/or limited in their capabilities. There therefore exists a need for easy-to-use animation systems which do not require large amounts of computing power and/or memory, and which are preferably implementable on a mobile computing device such as a mobile phone.

According to a first aspect of the present invention there is therefore provided a method of providing an animated image, the method comprising: inputting and storing two- dimensional image data defining an image; inputting and storing coordinate data defining one or more reference coordinates within said image identifying at least one portion of said image for animation; inputting audio data defining sounds to which animation of said image is to be synchronised; reading a portion of said audio data; determining defoπnation mesh data for said image responsive to said portion of audio data using said coordinate data, said deformation mesh data defining a deformation mesh for generating a deformed version of said image; mapping said stored image data using said deformation mesh data to generate deformed image data defining an image defoπned in accordance with said mesh; outputting said deformed image data to provide a frame of said animated image; and repeating said reading, determining, mapping and outputting to provide data for said animated image.

The above method facilitates the provision of an animated image in substantially real time on a lightweight platform such as mobile phone. Embodiments of the invention perform the deformation mesh generating and stored image mapping procedurally, that is generating a new deformation mesh each frame based upon the reference coordinates rather than generating a three dimensional model which is morphed to a target, thus giving the appearance of an animated object without the need for substantial amounts of memory and processing power. Since embodiments of the method may be operated in substantially real time they may be employed an audio feed from any of a range of sources including the internet and a local or wide area network, although in preferred embodiments the method is implemented on a mobile communications device such as a mobile phone. In such an arrangement the animation system may operate substantially independently from an audio playback system provided that there is access to at least a portion of a sound buffer, and in this way audio playback of, say, a stored message may be performed in a conventional manner and the animation system added as an extra code module component. This facilitates implementation of embodiments of the method by means of program code which is sent in an message to a mobile communications device together a captured image, the audio data and, preferably, the coordinate data defining reference coordinates within the image identifying one or more portions of the image for animation. Such reference coordinates may identify, for example, an overall shape or outline of a region for animation such as a head and/or lip shape/region of a "talking head".

In preferred embodiments the deformation mesh data determining is performed on a per-frame basis, that it is determined afresh for each successive frame rather than by modifying a mesh determined for a previous frame. Despite this there will generally be some correlation between the defoπnation mesh of one frame and that of a previous frame because successive frame are based upon successive portions of the same audio data, in preferred embodiments a defoπnation mesh being responsive to a signal (level) profile within a sliding window onto the audio data.

In preferred embodiments of the mesh a defoπnation mesh is generated by determining a first mesh to sub divide the image based upon the coordinate data and then deforming this to generate the deformation mesh data. This first mesh may comprise a two- dimensional mesh and the defoπnation mesh a three-dimensional mesh in which case the deforming may be performed in two dimensions (for speed) and then the deformed mesh converted to a three-dimensional mesh. Such a conversion may be based upon a reference object or structure such as a reference head which incoiporates some depth (three-dimensional) information. This reference structure can be used to apply some pseudo-depth to the image by, in effect, pulling the eyes, nose and mouth forward by an amount scaled to, say, an overall head size.

Broadly speaking a three-dimensional effect may be created by mapping portions of the image marked out by the first mesh onto corresponding portions in the defoπned mesh - in effect triangles or polygons of the (original) stored image as "textures" and mapping these onto the three-dimensional deformation mesh to generate a 3D image which can then rendered in 2D on the display. It will be appreciated that portions of the original image may be mapped either onto a two-dimensional defoπned mesh (before pseudo depth is added) or onto the three-dimensional deformation mesh. Generally an image can be presumed to comprise a foreground and a background portion (the foreground can be identified, for example, by simply assuming this occupies the greater portion of the image) and in this case only the foreground portion of the image need be deformed; this may then be simply overlaid on an unchanged background derived from the (originally) stored image. This again reduces the processing requirements and results in a visually acceptable appearance because the overlaid image is simply a defoπned version of the original image and the "join" with the background is therefore little more obvious than in the original image. Thus preferably the reference coordinates include one or more coordinates for use in distinguishing the foreground from the background portions of the image. The background may always be treated as a two-dimensional image. In preferred embodiments the reference coordinates identify a location of a head which the image is considered to depict (although in practice the "head" may comprise an image of any animate or inanimate object this, when animated, giving the general appearance of a talking head). The reference coordinates preferably therefore locate the head and also locate a lip region of the head; preferably the defoπnation mesh then comprises or defines a defoπnation of the lip region. For example, a mouth may be defined by a mesh comprising two, four or more rows divided into a number of horizontal segments to allow the mouth to be opened and closed and, optionally, to facilitate expressions such as curvature at one or both ends of the mouth. The reference coordinates may define a boundary box to limit the mouth deformation. Mouth reference points may be defined in the vicinity of the top and/or bottom lip (the bottom lip being more important than the top as this moves more). In embodiments one or more predetemiined images may be employed in generating the defoπned image, for example an image or mouth internals such as teeth and/or tongue which can be employed when the mouth is open, and/or a closed eye image for use in making an animated head blink. Thus the defoπned image may be combined with a predetemiined image based upon a degree of defomiation, for example when the mouth is open. Optionally a pseudo-random defomiation may also be applied to the image, for example to give the appearance of an eye blink. This may be in response to an eye blink level parameter; this may be dependent upon one or more of a random interval between blinks, a time of a last blink, and blink rate.

The degree of defoπnation applied, for example percentage defomiation in two (x and y) directions may be deteπnined by one or more parameters. These may include an exaggeration factor and a background noise level factor (having the effect of filtering out background noise). Other parameters may control the deformation of specific regions of the image, for example a lip region. Such parameters may comprise an expression parameter, a lip thickness parameter (relating to the need or otherwise to display mouth internals) an eye sized parameter (for example lid to eye size ratio) and the like.

Preferably the deformation is deteπnined responsive to a signal level profile of the audio data over a time interval, preferably in the range 0.1 to 2 seconds, for example 0.75 seconds. Thus the method may read audio data from a sound buffer and determine a signal level profile for the read data, then mapping this to one of a plurality of mouth shapes. In embodiments a mouth shape is specified by a data object comprising, for example, 8 bytes, preferably defined in parametric teπns, that is one or more (say horizontal and/or vertical) defomiation factors which may be expressed in percentage terms. A role, pitch, and/or yaw for a foreground portion or object within the image may also be deteπnined for increased realism.

Thus in broad terms in embodiments the method starts the audio data playing and then on a per-frame basis, for example every 40 milliseconds, looks at the last 750 milliseconds of audio, maps the signal level profile onto a lip shape, detemiines whether or not an eye blink is to be applied, defoπns an initial grid based upon the lip and eye infoπnation, converts this to three dimensions using, for example, a simple half sphere model of a head, optionally applies further deformations and/or pitch/role/yaw, applies textures from the original image to the tliree-dimensional mesh (optionally this may be applied at an earlier stage before applying pseudo-depth to the mesh) and then rendering the result for display as a two-dimensional image, superimposed upon the original (two- dimensional) background. Optionally the three-dimensional rather than two- dimensional mesh may be deformed although this requires greater processing power. This technique may be employed to animate a head of a human or animal figure and may also be adapted to animate an image of an inanimate object. Reference points to identify parts of the image for animation may entered by a user and because, in effect the original captured image is being distorted it does not particularly matter if these are not accurate.

The invention also provides processor control code to, when running, implement embodiments of the above described methods. Embodiments of the above described methods have the advantage that they may be implemented using only integer arithmetic, facilitating fast, low memory, low power implementations. For example embodiments of the above described methods may be implemented in less than 100K bytes or even less than 50K bytes and thus processor control code for animating an image may be included within a message bearing the image data for animation and/or audio data. In this way, for example, an image and an associated audio file together with code for displaying the animated image may be sent as a package to mobile communications device without relying upon appropriate animation code needing to be embedded within the device. A particularly convenient form of such a message comprises a MMS (multimedia messaging service) message in which processor control code for an animation application or module may be incorporate into a custom data packet.

Thus in another aspect the invention provides a signal providing data for animation in substantially real time by a mobile communications device, the signal comprising: two- dimensional image data defining an image; coordinate data defining one or more reference coordinates within said image identifying at least one portion of said image for animation; and audio data defining sounds to which animation of said image is to be synchronised.

The invention further provides a computer system for providing an animated image, the system comprising: data memory for storing two-dimensional image data defining an image, coordinate data defining one or more reference coordinates within said image identifying at least one portion of said image for animation, and audio data defining sounds to which animation of said image is to be synchronised; program memory storing processor control code; and a processor coupled to said data memory and to said program memory for loading and implementing said code, said code comprising code for controlling the processor to: read a portion of said audio data; deteπnine defomiation mesh data for said image responsive to said portion of audio data using said coordinate data, said defomiation mesh data defining a defomiation mesh for generating a deformed version of said image; map said stored image data using said defomiation mesh data to generate defo ied image data defining an image defoπued in accordance with said mesh; output said defomied image data to provide a frame of said animated image; and repeat said reading, deteπnining, mapping and outputting to provide data for said animated image.

The computer system may comprise, for example, a portion of a mobile communications device such as a mobile phone or a general purpose computer system for connection to a wired network such as the internet, hi this latter case the audio data may comprise, for example, voice over IP (internet protocol) data. In another aspect the invention provides a computer system for capturing an image of an object for animation, the system comprising: an image capture device; a user input device; a display device; data memory for storing a captured image of said object; program memory storing processor control code; and a processor coupled to said image capture device, to said user input device, to said display device, to said data memory, and to said program memory for loading and implementing said code, said code comprising code for controlling the processor to: capture an image of said object using said display device; store said captured image in said data memory; determine an initial position for two sets of points, a set of reference points and a set of image portion identification points for identifying said object or a portion of said object for animation, locations of said reference points detemiining locations of said image portion identification points; and wherein there are fewer of said reference points than said image portion identification points; input control data from said input device to adjust a position of one or more of said reference points; determine an adjusted position of said image portion identification points responsive to said adjusted position of said one or more reference points; output to said display device data for displaying said captured image and marker data for displaying positions of said reference points and of said image portion identification points in conjunction with said captured image; and store coordinates of said image portion identification points in association with said captured image.

The initial position for the sets of points may be determined based upon default positions, for example for a head by assuming that the head occupies a particular percentage of the image and then defining eye and mouth positions, hi embodiments the image portion identification points may identify a head outline and/or lip region and/or eye regions, hi embodiments the reference points in effect act as control points so that moving one of these moves one or more of the image portion identification points - in this way a user can quickly and easily define regions of an image by moving only a few points. Thus in embodiments user selection data may be input to select one or more of the reference points and these may then be used to move some or all of the image portion identification points, for example a translation of a reference point corresponding to a translation of image portion identification points, for example using a linear stretch. In embodiments an image portion identification point is also movable by a user, one or more of the reference points then optionally being adjusted in response to this motion. In a preferred embodiment one of the reference points is selectable and controllable to rotate some or all of the image portion identification points and, preferably with them some or all of the reference points. In a prefeπed embodiment the reference points comprise four points defining a rectangle, preferably framing the object, and preferably with a fifth substantially central point for controlling rotation of the image portion identification points.

The invention also provides a method coπesponding to the above described processor control code, for determining positions of two sets of points, inputting control data, detemiining an adjusted position of at least the image portion identification points, outputting a user display for controlling the adjusted position(s), and storing coordinates of the image portion identification points in association with a captured image.

In another aspect the invention provides a method a method of providing a talking image, the method comprising: capturing an image with a first mobile communications device; determining positions of one or more reference coordinates within said image identifying at least one portion of said image for animation; capturing audio data using said first mobile communications device; sending said captured image, said reference coordinates and said captured audio data to a second mobile communications device; receiving said captured image, said reference coordinates and said captured audio data at said second mobile communications device; displaying said captured image on said second mobile communications device; playing said captured audio data on said second mobile communications device; and modifying said captured, displayed image in synchronism with said playing audio using said reference coordinates to provide said talking image.

The detemiining of reference coordinate positions preferably comprises displaying initial positions of the reference coordinate positions to a user of the first mobile communications device, and then receiving a control input from the user to modify these initial positions. Preferably at least one control point is displayed, modifying a control point position resulting in modification of a plurality of the reference coordinates. The captured image, reference coordinates, and captured audio may be sent in a message such as a multimedia message; this message may also include processor control code to perfoπn the modifying of the captured, displayed image to provide a talking image.

As of the invention also provide processor control code for implementing the reference coordinate position determining, and for implementing the captured, displayed image modifying.

Broadly speaking in embodiments a user of the first mobile communications device captures an image, preferably in colour, and the device then displays a set of points at default positions defining a perimeter for a head and top left and bottom right positions for eyes and a mouth, displaying these over the captured image. The device then inputs user controls for moving the points until they are at the coπect positions and then (or before) records an audio message. The message, captured image and reference coordinates (for image portion identification points) are then sent to the second mobile communications device, which starts the audio playing and animates the captured image in time to the audio using the reference coordinates positioned by the user of the sending device.

The above described processor control code and/or code to implement embodiments of the above described methods may be provided on a data earner such as a disk, CD- or DVD-ROM, programmed memory such as read-only memory (fiπnware), non-volatile memory such as Flash, or on a data earner such as an optical or electrical signal earner. Code (and data) to implement embodiments of the invention may comprise code in a conventional progi'amming language such as C or Java, or microcode, or even code for a hardware description language such as SystemC or VHDL (for example where the code is to be embedded in a mobile device). As the skilled person will appreciate such code and/or data may be distributed between a plurality of coupled components in communication with one another.

These and other aspects of the present invention will now be further described, by way of example only, with reference to the accompanying figures.

Figure 1 shows a hardware block diagram of a system 100 for creation and playback of "Living Pictures" on a mobile device. Figure 2 shows the steps used in gathering information for animation.

Figure 3 shows a bounding box and points arranged in an oval to define a head perimeter.

Figures 4a and 4b illustrate user manipulation of a bounding box.

Figure 5 shows top left and bottom right points of a bounding box (not shown) around the eyes, and centre markers for adjusting the width of the eyes.

Figure 5b shows the image of figure 4 with the bounding box positioned around the eyes.

Figure 6a shows a bounding box of a mouth, with a centre line to aid positioning.

Figure 6b shows the image of figure 5b with the bounding box positioned around the mouth.

Figure 7 shows steps required to generate a 3D model of the head.

Figure 8 shows polygons created by the system.

Figure 9a shows a grid constructed around each eye.

Figure 9b shows the grid when the eye is completely closed.

Figure 9c shows a grid around the eyes applied to the image of figures 4a, b.

Figure 10a shows the image of Figure 4 with a grid around the mouth and the mouth closed.

Figure 10b shows the mouth open. Figure 10c shows polygons created for representing the internals of the mouth.

Figure 1 1 shows an animation playback procedure.

Figures 12a and 12b illustrate applying a vertical bias to the lips.

Figures 12c and 12d illustrate applying a vertical and horizontal bias to the lips.

Figure 13 shows software modules for a prefeπed embodiment of the system.

We will describe an optimised software technology, here refeπed to as "Living Pictures" (Trade Mark), that can produce lifelike computer generated 3D talking animations fi-om a digital photograph or picture. The system is given the position of the outline of the head to be animated, along with basic information on the position of key features, such as eyes and mouth, and a sound to lip-synch to.

The system then generates an appropriate 3D representation of the head and its features and animates it in real time. The generated animation includes lip movements synchronized with the speech in the sound, head movements, blinks and facial expressions.

The user interface has been designed to be as simple as possible, and this simplicity, coupled with low computer resource requirements, enable the system to run on a wide range of computer devices, from mobile phones to desktop computer systems.

The preferred embodiment of the invention has been designed to run on any computer device that has a colour display, at least two keys for user input, a modest processor and onboard RAM, and the capability to record and playback speech.

Devices with onboard cameras, such as modem mobile telephones, are particularly suitable as "Living Pictures" can generate animations from any picture a user takes with the onboard camera. Figure 1 shows a hardware block diagram of a system 100 for creation and playback of "Living Pictures" on a mobile device. The system comprises a data processor 102 coupled to non-volatile memory 104, 106 storing program code and program data respectively for implementing embodiments of the invention as described in more detail below. Processor 102 is coupled to a display 108 and audio playback device 1 10 for providing a substantially synchronised audio and moving image output. Input devices coupled to processor 102 preferably include a digital camera 112, audio recording device 114, and user controls 1 16 which may comprise a keyboard and/or pointing device. The system 100 of figure 1 is, in preferred embodiments, implemented on a portable consumer electronic device such as a mobile phone, PDA or the like. There is also conventional volatile memory (not shown in Figure 1) for working data storage.

The process starts with a picture. On devices with an integrated camera, such as a mobile phone with integrated digital camera, the user can take any picture of his or her choosing. On any supported device, the user can select any existing digital picture to be animated. The picture does not have to be a photograph, the system will work with any digital image, whether it is a photograph, drawing or sketch.

An assumption of the process is that a face or head is on the picture, looking approximately towards the front. For best results the model in the picture should have their eyes open and mouth closed with a neutral expression.

Figure 2 shows the steps used in gathering information for animation. Referring to figure 2 at step 200 a user inputs data to the system 100 of figure 1 to select a picture for animation. This may include taking a picture with a built in camera and/or loading a picture fi-om storage. Then, at step 202 the procedure inputs user data to specify a head perimeter by adjusting a bounding box and perimeter points. At step 204 the procedure inputs data specifying eye positions, allowing adjustment of a bounding box and inner points. At step 206 user data is input to specify mouth position data, again by inputting data to adjust a bounding box and preferably also by inputting data positioning a centre line on the lips. Finally at step 210 user data is input to select a sound file. This may comprise, for example, recording a sound using a microphone and/or loading a sound fi-om storage. We next describe a procedure for specifying head perimeter.

Figure 3 shows a bounding box and points aπanged in an oval to define a head perimeter. First the user is asked to define the approximate region that the head to be animated encompasses on the picture 300. The system presents a number of points 302 in an oval shape 304 that are positioned by default centrally on the picture, with the default height of the oval approximately 75 percent of the height of the image.

The user can then define a head region by adjusting a bounding box 306 of the set of points that make up the oval, and then by direct manipulation of the points themselves.

For devices with no mouse and only a few keys, such as a mobile phone, the user can toggle through the points to select the active point. The active point can be the top left 306a or the bottom right 306b point on the bounding box, or any point on the oval. The active point can then be manipulated with cursor or numeric keys.

For devices with mouse or pen support, the user can directly select the active point by clicking on the required point. The system scales the points that make up the oval of the head perimeter, when the top left or bottom right point of the bounding box is manipulated by the user. Thus, to make the head naπower, the user can simply move the bottom right bounding box point to the left or the top left bounding box point to the right.

When an oval point is manipulated, the system recalculates the bounding box again to occupy all the points on the oval, so at all time, the user has a choice of either moving a bounding box point to quickly adjust the overall head position and dimensions, or moving a perimeter point from the oval to adjust he shape of the head more directly. For example, simply lowering the bottom most point could extend the chin.

Figures 4a and 4b illustrate user manipulation of a bounding box. In figure 4a the four comer markers 400 indicate the bounding box (not shown). In Figure 4b, the guide points 402 have been manipulated to indicate the perimeter of the head.

We next describe a procedure to specify eye positions. Figure 5a shows top left 500 and bottom right 502 points of a bounding box (not shown) around the eyes, and centre markers 504, 506 for adjusting the width of the eyes. The user is asked to show the position of the eyes, again by moving a bounding box around them. The system is aware that, for humans, the distance between the eyes is approximately equal to the width of an eye, and this approximation is used to allow the bounding box around both eyes to be defined with just two points 500, 502, the top left and bottom right of the bounding box. The system generates two boxes 508a, b from these two points, with the left eye occupying the left-most third of the total eye bounding box and the right eye occupying the right-most third.

Two further points 504, 506, which are presented on the inside of the eyes, allow the user to adjust the width of each eye independently if required. This feature is provided for images where the model is not looking straight on, or where the eyes have long lashes.

For instance if someone is looking towards the side in a photograph, one eye will appear wider than the other, due to the perspective effect. Also if the eyes have large lashes, the system will need to manipulate a larger area around the eyes, to ensure blinking for instance moves all the lashes and prevents tearing effects.

We next describe a procedure to specify mouth positions.

Next the user is asked to define a box around the mouth. The bounding box is designed to be slightly wider and larger than the actual mouth itself, to enable the animation generation process to morph the skin around the mouth to represent muscles stretching and contracting,

Figure 6a shows a bounding box 600 of a mouth 602, with a centre line 604 to aid positioning. The bounding box 600 for the mouth is shown to the user, along with a horizontal centre line 604 that is 80 percent as wide as the bounding box itself, as shown in the figure.

The bounding box can be manipulated by the user by moving top left 606 and bottom right 608 points. The box is manipulated until the indicated centre line covers the centre of the lips on the picture. The width of the centre line should be the same as the width of the inside of the lips on the picture.

Figure 6b shows the image of figure 5b with the bounding box 600 positioned around the mouth.

We next describe selection of a sound file.

With the head perimeter defined, along with a box around the eyes and the mouth, this embodiment of the algorithm has the information that it requires to display the picture.

The user is then asked to either record a sound file on devices with built in audio recording, such as mobile phones, or load an existing sound file from the devices memory. Any digital sound recording may be employed, but for optimal results a recording of some speech with low background noise will produce the best lip synching results.

We next describe automatic model generation.

When the user interface steps for gathering the required infoπnation are complete, the system generates a 3D representation of the head on the picture. The process of generating a 3D head is split into several sections, as shown in the procedure of figure

7.

Figure 7 shows steps required to generate a 3D model of the head.

Refeπing to figure 7, at step 700 the procedure generates polygons to represent the head by connecting the perimeter points defined as described above. Then, at step 702 the procedure generates polygons to represent the eye by constructing grids over the eyes and, at step 704, generates polygons for representing the mouth constructing a grid over the mouth. At step 706 polygons are generated to represent internals of the mouth, in particular for teeth and tongue. Finally the procedure applies a 3D defomiation 708 to convert the flat model into a 3D model with depth by displacing the points in the z-direction, that is out of the x-y image plane by an amount, for each point, depending upon the position of the point within the image plane.

These steps are described in more detail below:

We first describe constructing the head 700, with reference to Figure 8 which shows polygons 800 created by the system. The system first defines a grid structure around the head, using the perimeter points. Each point is linked to the centre of the head, (the centre is calculated as a mathematical mid point of the perimeter points). Between the perimeter point and the centre point, intemiediate points are calculated at approximately 20 percent, 50 percent and 80 percent of the distance along the vector from the perimeter point to the centre.

The perimeter points, the centre point, and the intemiediate points are used to construct a series of polygons to represent the model, as shown in figure 8. This effectively produces four circles 802, 804, 806, 808 of polygons of varying thickness from the centre.

The vertex positions for each polygon are used to determine a texture position. Each texture position is simply he 2D vertex position converted by a 1 : 1 mapping from an (x,y) pixel position to a (u,v) texture position. For instance if the image is 256 pixels high, a vertex at position 128, 128, would have a (u,v) position o (0.5, 0.5), as u, v coordinates are typically in the range zero to one.

We next describe the step 702 of constructing the eyes.

Figure 9a shows a grid 900 constructed around each eye. The grid 900 is drawn using the bounding box around the eyes. The grid is split into a number of horizontal sections 902. By default we chose six or seven horizontal sections or segments as a compromise between apparent quality of the curve produced, and rendering speed. Each horizontal section is foπried of two parts, a top and bottom part, with top and bottom dividers 906a, b.

Vertically, a centre line 904 is used to mark the centre of the eyes, and the system uses an ellipsoid algorithm to calculate the vertical position of the eye at each of the horizontal sections. The top and bottom comer sections of the grid, are given etched comers to better blend with the suπounding polygons, and make the edges of the boxes for the eyes less obvious.

Figure 9a shows how the polygons of the grid look when each eye is open. The inner vertices of the top and bottom sections follow the approximate perimeter of the eye. To animate the eyes blinking or closing, the vertex positions of the points marking the top and bottom perimeter of each eye can be adjusted towards the centre of the eye. As the triangles are rendered using 3D texture mapping techniques, this results in hepolygons representing above and below the eyes seeming to stretch to fill the gap created This provides a good illusion of eyes closing.

Figure 9b shows the grid 900 when the eye is completely closed; the inner vertices of the top and bottom sections 906a, b are all on a horizontal centre line.

Figure 9c shows a grid around the eyes applied to the image of figures 4a, b.

We next describe step 704 of constructing the mouth.

The mouth is modelled in a similar manner to the eyes, with a sectional grid, but the m mouth polygons are constructed in the closed position, and the inner points animate outwards from the centre of the mouth to give the illusion of the mouth opening.

The grid for the mouth is initially constructed with the two lip lines together. Thus there are two rows of polygons created for them, but they have no apparent height on the screen until the mouth is open.

Figure 10a shows the image of Figure 4 with a grid around the mouth and the mouth closed; Figure 10b shows the mouth open. In figure 10b, the polygons representing the lips have been moved apart. The lower lip is moved more than the upper one to represent the jaw moving, as in real life.

The system uses the position of the centre lip point for the upper and lower lips to dete iine a reference point for placing polygons to represent the teeth and tongue. The upper teeth are represented by a polygon running vertically fonri the lowest point in the upper lip to the top of the top lip.

The lower teeth and tongue are represented by a polygon which runs from the upper teeth to the bottom lip. The dept of the lower teeth polygon is adjusted so that the back of the mouth is further into the screen than the front, giving a perspective effect and a lifelike mouth.

Figure 10c shows polygons 1000 created for representing the internals of the mouth. The top polygon 1002 connects to the rear centre 1006 of the mouth to the inside of the upper lip. The bottom polygon 1004 connects the rear centre of the mouth to the inside of the lower lip. As the mouth opens and closes, the top most points of the top polygon and the bottom most points of the bottom polygon move to match the associated lip position.

We next describe the step 708 of conversion to 3-dimensions.

The system now has a series of 2D polygons. Each polygon has three or four vertices, and each vertex has an associated texture position. The vertices are then "extruded" to provide depth to the model. The smallest of the width and height of the perimeter of the model is used to determine an approximate radius for the head. This radius value provides the maximum extrusion to be applied, which would be near the nose of the model. Polygon vertices at the perimeter points are not extruded as this is the reference depth for the model, and the depth that the background is given.

The extrusion of a point is calculated based on its distance from the centre of the head. Points at the centre are extruded by radius value amount towards the front of the screen. Points 50 percent of the distance for the centre to the perimeter, are extruded by 50 percent of the radius. The algorithm is not designed to be geometrically accurate, but to give an approximation to the shape of a head, and provide a realistic 3D effect.

The same procedure is applied to the polygons of the grids constructed to represent the eyes and mouth, so that the eyes and mouth are coπectly nearer to the front than the section of the head behind them. The mouth internal polygons are preferably given more depth at the centre than at the top and bottom to give the mouth a 3D effect when the lips move.

We now describe animation of the image at playback. Figure 11 shows an animation playback procedure.

An animation is displayed by starting playback of the sound file and then updating the display frequently to show the state of the head and its features at any point in time. During playback the algorithm animates the head and lip-synchs the model. Thus the mouth and lips of the model are made to appear as though the model is actually talking the given sound.

The update interval for the display is ideally around 25 times a second to give smooth playback, however realistic results can be achieved with much update rates on slower devices. No temporal compression is involved in the animation system, thus it can easily detemiine the state of all objects at any point in time during the playback, allowing variable update rates to be used without problems.

Refeπing to figure 11, at step 1100 the procedure begins background playing of a sound file and initialises the animation position to a start position for "zero" position. Then at step 1102 the procedure detemiines a cuπent playback position in the sound data and responsive to this calculates 1104 mouth/lip positions to simulate speech and, preferably, calculates 1106 eye positions to simulate blinking, for increased realism. Likewise at step 1108 the procedure preferably calculates head rotations and movements to simulate an actively talking head. Then the calculated picture is displayed 1 110 with the 3D model over the top of it - that is the 2D image is given 3D z-position data. The procedure then loops 1112 back to step 1102 until the end of the sound file is reached. We first describe animating the mouth.

Samples from the sound, are used to detemiine the animation of the head and the lips. For the lips three average samples are taken from the sound. One at the current animation position, one at a first interval, e.g., 150ms prior to that, and one at a second interval, e.g., 300ms prior to the first.

Each average sample is taken over a predeteπnined, e.g. a 50ms period. Thus the first sample represents the average amplitude of the sound from the 50ms prior to the cuπent animation position to the cu rent animation position, and the second sample represents the average for 150ms prior to that.

The three average values are then reduced by the required background factor. This is preferably a simple subtraction to reduce background noise.

If all three samples are over 50 percent amplitude, the lips are given a vertically biased position, hi other words, the vertical delta for the lips is calculated by the percent amplitude. For a 50 percent amplitude value,k the vertical delta for the lips would be 50 percent of the vertical distance from the centre of the mouth to the edge of the mouth bounding box.

Figures 12a and 12b illustrate applying a vertical bias to the lips. The inner vertex positions are moved vertically outwards from the centre. 100 percent amplitude would equate to the vertex positions being moved so as to be at the same vertical position as the top and bottom of the mouth grid.

If the first two samples are over 50 percent but the third is not then the mouth is judged to be closing both the vertical and horizontal deltas for the lips are calculated as above. Horizontal movement is reduced by approximately 250 percent as lips move inwards and outwards less they then move up and down.

Figures 12c and 12d illustrate applying a vertical and horizontal bias to the lips. The vertex positions are moved vertically outwards and horizontally towards the centre. As the lips are moved by direction based on the three samples, and by delta based on the amplitude, a large variety of mouth shapes are recreated. The lip movements are in time with the sound as the samples are taken over a short 50ms period prior to the cunent animation position, hence the maximum latency between sound and lip movements is zero ms for the calculation, and thus will just be down to rendering performance.

We next describe animating the eyes.

To make the eyes blinlc, a blink factor is calculated each frame. The blink factor detemiines how near the eyes are to being closed, here as a percentage. Zero percent is fully open, 100 percent is fully closed. Blinks are calculated using a reference starting time in milliseconds, a required duration for a blink. The blink percentage is then calculated by seeing how much of a blink action should be complete at a point in time.

For instance a blinlc which lasts 200ms is 50 percent complete at 100ms past its starting time. The blinlc percentage is then weighted so that it takes longer for the eyes to close than open, as in real life. After a blinlc action is complete, a new reference time is set a random amount further along the animation for when the next blinlc should commence.

The grid for the eyes is defoπned in the same manner as the grid for the mouth is defoπned, except the higher the bias of the blinlc (blinlc percentage) the more vertically inwards the points are moved. The eye defomiation is preferably only applied vertically, preferably horizontal positions are unchanged.

We now describe animating the head.

All the vertices in the head are rotated by one or more (preferably all) of a yaw, pitch and roll angle, which may be determined based on the amplitude of the sound and using a small fixed sequence (of motion data) which is predefined by the system to give a natural flowing movement.

Preferably the procedure is configured such that the head will still animate when there is no sound, albeit by a smaller amount than during speech. The rotation of the head is applied in 3D about the centre, and the eyes and mouth are transformed by the same rotation.

We now describe displaying the animation.

To display the animation, the system draws the original picture, and then draws the polygons representing the head, eyes and mouth over the top, by rendering the polygons with a 3D rasterisation function. The polygons are rendered via an orthographic projection. This ensures the polygons are drawn in the coπect position relative to the original picture.

If at a point in the animation, there is no head rotation, no blink and no lip movement, the position of all the polygons would be identical to the original picture. Thus it would appear as if only the picture was being displayed.

To increase perfomiance, a z-buffer or similar depth sorting algorithm is not required if the polygons are drawn in the following order: head polygons, then mouth internal polygons, then eye and mouth polygons.

We now describe software modules used in a prefeπed embodiment of the animation creation and playback system. The embodiment of the animation and playback system comprises software modules as shown in figure 13.

Figure 13 shows software modules for a prefeπed embodiment of the system. These comprise a system module 1300 which manages the other modules. The other modules comprise an image storage module 1302 coupled to an image compression module 1304; a sound storage module 1306 coupled to a sound compression module 1308; an animation module 1310 coupled to a picture model module 1312, a sound analysis module 1314 and an audio playback module 1316; and a 3D rasterisation engine 1318 coupled preferably to a fixed-point integer maths library 1320. Prefeπed embodiments of the system are implemented using primarily or only fixed point or integer maths rather than floating point maths in the interest of speed/power consumption/ease of implementation on a variety of platfoirns. Apart from the previously described modules these functions, can, if desired, implemented using conventional library code; we have already described the implementation of modules 1300, 1310, 1312 and 1314.

The functions of the software modules will now be described.

The Living Pictures (system) module 1300 manages the data required for a "Living Picture" (Trade Mark), using the image, sound and model modules for storage and manipulation, and the animation module for playback. It preferably also contains parameters for viewing such as offset and zoom.

The image storage module 1302 functions to store an image as a bitmap using an RGB data structure, and access its pixels. Includes functions to flip, minor and rotate an image. For mobile applications, a 16-bit image structure is used with one-word (2 bytes) representing a pixel in a bitmap, with 4-bits for each of the red, green and blue colour components. For desktop applications, a 32-bit image structure is used with one double word (4 bytes) representing a pixel in a bitmap, with 8-bits for each of the red, green and blue colour components.

The image compression module 1302 functions to compress/decompress an image as stored in the Image Storage module to a smaller size for transmission via network, over- air or other communication methods. As photos are typically used in embodiments of the system, the JPEG compression technique is employed. This can compress image data significantly.

The sound storage module 1306 functions to store an audio recording as a PCM (Pulse- code-modulalion) sound data buffer. The sound is sampled at between 8Khz and 44KHz depending on available memory. Each sound sample is 16-bits.

The sound compression module 1308 functions to compress/decompress an audio recording in the Sound Storage module to a smaller size suitable for transmission. The audio data is typically speech, hence GSM audio compression/decompression is preferably used. This is the method for audio compression and decompression employed by mobile phones. The Living Pictures animation module 1310 functions to animate the vertices of the eyes and mouth and head based on audio samples, and to create new vertex positions for each frame of an animation.

The 3D rasterisation engine module 1318 provides 3D rasterisation of texture mapped triangles using a z-buffer. At the start of each display frame, the frame buffer and z- buffer are cleared. A triangle is rasterised using integer interpolation. The vertices of each triangle have a 3D vertex position and a 2D texture position that correlates to a position on the picture. This information is used to render the triangle using a scanline rasterisation algorithm.

The fixed-point integer maths library 1320 provides functions for adding, subtracting and multiplying 2D and 3D vector positions. Contains integer based conversion functions to generate sine and cosine values from angles. Includes functions for integer- based fixed-point 2D and 3D matrices to transfomi 2D and 3D vector positions using compound transformations. All functions use fixed-point integer based arithmetic for speed.

The Living Picture model module 1312 stores the keypoints required for a model in a Living Picture. It stores the positions of the head perimeter points, the eye bounding box points, and the mouth bounding box points. It also stores the rotation amount for the head, and key parameters for animation, such as the exaggeration required for lipsynching. and the required blinking interval and duration.

The sound analysis module 1314 has functions for detemiining lip x positions based on an audio buffer, as well as functions for retrieving a blink amount at any point in an animation, based on required interval, duration and frequency of a blinlc. It also has functions for detemiining a yaw, pitch, and roll rotation for the head based on both animation position and the sound at that position in the animation. For instance, the head movements are more exaggerated during speech than during periods of silence in the animation. The audio playback module 1316 provides functions for playing back an audio file or audio data buffer.

No doubt many other effective alternatives will occur to the skilled person. It will be understood that the invention is not limited to the described embodiments and encompasses modifications apparent to those skilled in the art lying within the spirit and scope of the claims appended hereto.

Claims

CLAIMS:

1. A method of providing an animated image, the method comprising: inputting and storing two-dimensional image data defining an image; inputting and storing coordinate data defining one or more reference coordinates within said image identifying at least one portion of said image for animation; inputting audio data defining sounds to which animation of said image is to be synchronised; reading a portion of said audio data; determining defomiation mesh data for said image responsive to said portion of audio data using said coordinate data, said defomiation mesh data defining a defomiation mesh for generating a deformed version of said image; mapping said stored image data using said deformation mesh data to generate defoπned image data defining an image defomied in accordance with said mesh; outputting said defoπned image data to provide a frame of said animated image; and repeating said reading, detemiining, mapping and outputting to provide data for said animated image.

2. A method as claimed in claim 1 for providing said animated image in substantially real time, the method comprising repeating said reading, determining, mapping and outputting in substantially real time to provide said data for said animated image.

3. A method as claimed in claim 1 or 2 wherein said detemiining of said defomiation mesh data is substantially independent of previously deteπnined defomiation mesh data for said image.

4. A method as claimed in claim or 3 wherein said deformation mesh data determining comprises determining first mesh data for a first mesh to subdivide said image responsive to said coordinate data, and deforming said first mesh responsive to said portion of audio data to generate said deformation mesh data.

5. A method as claimed in claim 4 wherein said first mesh comprises a two- dimensional mesh and said deformation mesh comprises a three-dimensional mesh.

6. A method as claimed in claim 5 wherein said defoπning comprises deforming said first mesh in two-dimensions to generate a deformed mesh, and then converting said defomied mesh to a three-dimensional mesh to generate said defoπnation mesh data.

7. A method as claimed in claim 6 wherein said converting comprises: determining a dimension of said at least one identified portion of said image; scaling three dimensional reference data for said identified image portion by said dimension to determine scaled reference depth data; and generating depth coordinate data for points of said image using said scaled reference depth data.

8. A method as claimed in any one of claims 4 to 7 wherein said mapping comprises mapping subdivisions of said image by said first mesh onto coπesponding, defomied subdivision of said image associated with said defomiation mesh.

9. A method as claimed in any preceding claim wherein said image comprises a foreground and a background portion, and wherein said at least one image portion identified by said reference coordinates comprises one of said foreground and background portions of said image.

10. A method as claimed in claim 9 wherein said mapping comprises mapping said foreground image portion using said defomiation mesh to generate defomied image data defining a defomied foreground portion of said image; and combining said deformed foreground image portion with said stored image data.

11. A method as claimed in claim 10 further comprising applying a three- dimensional rotation and/or translation operation to said foreground image portion before said combining.

12. A method as claimed in any preceding claim said image is considered to depict a head, wherein said reference coordinates identify a location of said head within the image and a location of a lip region of said head, and wherein said defomiation mesh comprises a deformation of said lip region.

13. A method as claimed in claim 12 wherein said generation of defonned image data includes combining a defomied version of said image with predeteπnined image data for a portion of said head.

14. A method as claimed in claim 13 wherein said combining with said predeteπnined image is responsive to a degree of defomiation of said image.

15. A method as claimed in claim 13 or 14 wherein said combining with said predetermined image is configured to appear random.

16. A method as claimed in any preceding claim wherein said defomiation mesh data deteπnining is responsive to a signal level profile of said audio data over a time interval.

17. Processor control code to, when running, perfomi the method of any preceding claim.

18. Processor control code as claimed in claim 17 wherein comprising arithmetic operations, and wherein substantially all said arithmetic operations employ integer arithmetic.

19. A carrier carrying the processor control code of claim 17 or 18.

20. A caπier as claimed in claim 19, wherein the carrier is a signal for a mobile communications device, in particular a MMS (multimedia messaging service) signal.

21. A carrier as claimed in claim 20 including said image data, said audio data, and said coordinate data.

22. A mobile communications device configured to perfomi the method of any one of claims 1 to 16.

23. A signal providing data for animation in substantially real time by a mobile communications device, the signal comprising: two-dimensional image data defining an image; coordinate data defining one or more reference coordinates within said image identifying at least one portion of said image for animation; and audio data defining sounds to which animation of said image is to be synchronised.

24. A signal as claimed in claim 23 further comprising processor control code for implementation by said mobile communications device to generate animated image data for display by said device substantially in synclironism with said sounds defined by said audio data.

25. A signal as claimed in claim 23 or 24 comprising an MMS signal.

26. A computer system for providing an animated image, the system comprising: data memory for storing two-dimensional image data defining an image, coordinate data defining one or more reference coordinates within said image identifying at least one portion of said image for animation, and audio data defining sounds to which animation of said image is to be synchronised; program memory storing processor control code; and a processor coupled to said data memory and to said program memory for loading and implementing said code, said code comprising code for controlling the processor to: read a portion of said audio data; detemiine defomiation mesh data for said image responsive to said portion of audio data using said coordinate data, said defomiation mesh data defining a defomiation mesh for generating a defomied version of said image; map said stored image data using said defoπnation mesh data to generate defomied image data defining an image defoπned in accordance with said mesh; output said defomied image data to provide a frame of said animated image: and repeat said reading, deteπnining, mapping and outputting to provide data for said animated image.

27. A computer system as claimed in claim 26 configured to repeat said reading, detemiining, mapping and outputting to provide data for said animated image in substantially real time.

28. A computer system for capturing an image of an object for animation, in particular for use with the method of claim 1, the system comprising: an image capture device; a user input device; a display device; data memory for storing a captured image of said object; program memory storing processor control code; and a processor coupled to said image capture device, to said user input device, to said display device, to said data memory, and to said program memory for loading and implementing said code, said code comprising code for controlling the processor to: capture an image of said object using said display device; store said captured image in said data memory; detemiine an initial position for two sets of points, a set of reference points and a set of image portion identification points for identifying said object or a portion of said object for animation, locations of said reference points deteπnining locations of said image portion identification points; and wherein tliere are fewer of said reference points than said image portion identification points; input control data from said input device to adjust a position of one or more of said reference points; detemiine an adjusted position of said image portion identification points responsive to said adjusted position of said one or more reference points; output to said display device data for displaying said captured image and marker data for displaying positions of said reference points and of said image portion identification points in conjunction with said captured image; and store coordinates of said image portion identification points in association with said captured image.

29. A computer system as claimed in claim 28 wherein said object comprises a head and wherein said image position identification points are for identifying an outline of said head and a lip region of said head.

30. The processor control code of claim 28 or 29, in particular on a caπier.

31. A mobile communications device including the processor control code of claim

32. A method of providing a talking image, the method comprising: capturing an image with a first mobile communications device; detemiining positions of one or more reference coordinates within said image identifying at least one portion of said image for animation; capturing audio data using said first mobile communications device; sending said captured image, said reference coordinates and said captured audio data to a second mobile communications device; receiving said captured image, said reference coordinates and said captured audio data at said second mobile communications device; displaying said captured image on said second mobile communications device; playing said captured audio data on said second mobile communications device; and modifying said captured, displayed image in synclironism with said playing audio using said reference coordinates to provide said talking image.

33. A method as claimed in claim 32 wherein said detemiining of reference coordinate positions comprises displaying initial positions of said reference coordinates positions to a user of said first mobile communications device; and receiving a control input from said user to modify said initial positions.

34. A method as claimed in claim 33 wherein said determining of reference coordinate positions further comprises displaying at least one control point, and receiving a control input from said user to modify a position of said control point, wherein modifying said control point position modifies a position of a plurality of said reference coordinates.

35. A method as claimed in claim 32, 33, 34 or 35 wherein said sending comprises sending a message, in particular a multimedia message comprising said captured image, said reference coordinates and said captured audio data; the method further comprising receiving said message at said second mobile communications device.

36. A method as claimed in claim 35 wherein said message further comprises control code to perform said modifying to provide said talking image.

37. A mobile communications system configured to implement the method of any one of claims 32 to 36.

38. Processor control code, in particular on a caπier, to when running implement the portion of the method of claim 34 operating on said first mobile communications device comprising at least said reference coordinate position detemiining.

39. Processor control code, in particular on a caπier, to when running implement the portion of the method of claim 32 operating on said second mobile communications device comprising at least said captured, displayed image modifying.

40. A mobile communications device including the processor control code of claim 38 or 39.