GB2609996A - Image stitching - Google Patents

Image stitching

Info

Publication number
GB2609996A
GB2609996A GB2114637.8A GB202114637A GB2609996A GB 2609996 A GB2609996 A GB 2609996A GB 202114637 A GB202114637 A GB 202114637A GB 2609996 A GB2609996 A GB 2609996A
Authority
GB
United Kingdom
Prior art keywords
stream
video
image
streams
background
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2114637.8A
Other versions
GB202114637D0 (en)
GB2609996A8 (en)
Inventor
Michael Paul Alexander Geissler
Oliver Augustus Kingshott
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mo Sys Engineering Ltd
Original Assignee
Mo Sys Engineering Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mo Sys Engineering Ltd filed Critical Mo Sys Engineering Ltd
Publication of GB202114637D0 publication Critical patent/GB202114637D0/en
Priority to PCT/GB2022/050585 priority Critical patent/WO2022185078A1/en
Priority to US18/549,074 priority patent/US20240171698A1/en
Priority to EP22710705.9A priority patent/EP4302475A1/en
Priority to PCT/GB2022/051721 priority patent/WO2023281250A1/en
Publication of GB2609996A publication Critical patent/GB2609996A/en
Publication of GB2609996A8 publication Critical patent/GB2609996A8/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265 Mixing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/10 Image acquisition
    • G06V 10/16 Image acquisition using multiple overlapping images; Image stitching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/14 Transformations for image registration, e.g. adjusting or mapping for alignment of images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/24 Aligning, centring, orientation detection or correction of the image
    • G06V 10/245 Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules
    • H04N 23/698 Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/272 Means for inserting a foreground image in a background image, i.e. inlay, outlay
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2200/00 Indexing scheme for image data processing or generation, in general
    • G06T 2200/32 Indexing scheme for image data processing or generation, in general involving image mosaicing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method of compositing video streams comprises: obtaining first 30 and second 40 video streams, where the backgrounds of the streams overlap; identifying a common feature 32 in the background of the streams; and stitching the streams together along the identified feature within the background of the streams. The streams may be captured by a camera and the background may be displayed on a display screen within the field of view of the camera. The backgrounds may be part of a common 3D model. A second invention defines a render engine which receives a first captured stream and a second CGI stream, identifies areas of difference between the streams to define possible subjects for extraction in the first stream, extracts the identified areas, and stitches the areas into the second (CGI) stream. A third invention comprises obtaining a first captured stream where the background is displayed on an LED screen and where the background is based on a 3D model. A second stream is obtained based on the 3D model, and at least a portion of the first stream is rendered then stitched into the second stream along one or more seams.

Description

Intellectual Property Office Application No. GB2114637.8 RTM Date: 11 November 2022
The following terms are registered trade marks and should be read as such wherever they occur in this document: Unreal Engine (page 10), Epic Games (page 10), StarTracker (page 20).
Intellectual Property Office is an operating name of the Patent Office. www.gov.uk/ipo
IMAGE STITCHING
This invention relates to image stitching, for example in relation to video content, where images for use in a video stream are formed from multiple sources and are joined together.
Figure 1 shows an arrangement for recording video. A subject 1 is in front of a display screen 2. Multiple cameras 3, 4 are located so as to view the subject against the screen. A display controller 5 can control the display of images on the screen. These may be still or moving images which serve as a background behind the subject. This setup provides an economical way to generate video content with complex backgrounds. Instead of the background being built as a traditional physical set it can be computer generated and displayed on the screen.
When the background represents a three-dimensional scene, with depth, the image displayed on the screen should be displayed with a suitable transformation so that it appears realistic from the point of view of the camera. This is usually achieved by a render engine implemented in display controller 5. The render engine has access to a datastore 6 which stores three-dimensional locations of the objects to be represented in the background scene. The render engine then calculates the position and appearance of those objects as they would be seen from the point of view of the active camera, for instance camera 3. The results of that calculation are used to form the output to the display screen 2. When filming switches to another camera, for instance camera 4, the render engine re-calculates the background image as it would be seen from the point of view of that camera.
The use of screens to depict the background is becoming increasingly widespread and in many situations is replacing so called "green screens" whereby one or more cameras are used to capture the action, usually played by actors using a limited number of props, against a neutral background, usually a green screen. Computers are then used to subtract the foreground, that is to say the "real" actors and props from the neutral background, enabling the CGI to be added to the remaining portions of the footage.
In many situations, it is possible to perform this process in real time, most notably in relatively low-resolution filming situations, such as those found in newsrooms and weather studios. In such cases, the CGI or background image is relatively simple, meaning that the processing requirements are sufficiently low for the compositor to be able to "keep up" with the footage as it is being shot, the result being real-time compositing.
However, when the complexity of the CGI increases, the data processing requirements of the compositor increase. For example, when filming a motion picture, it is commonplace for "location shots" to be filmed against a green screen background to avoid having to enact the scene at a real location. In such a situation, the background can be a complex landscape, including buildings, weather phenomena, moving water and so on, which all add a great deal of complexity to the background image.
Whilst the use of "green screen" technology allows for enormous complexity in the form of the background images, it does mean that the personnel involved in the filming have limited reference points within the scene and therefore have very limited ability to know exactly where to stand, how to move or indeed how otherwise to interact with the "scene" around them. This can lead to less natural acting or presenting.
The benefit therefore of the use of a background screen is that the actors/presenters can see the scene/images around them and can better interact with that scene. This leads to a more natural and/or realistic performance. Furthermore, issues surrounding "green spill", that is, difficulty acquiring detail of the subject against the green background (for example all the details of hair where there might be very small patches of green background between hairs), can be avoided. However, many scenes, especially those that are required to depict complex outdoor settings, require giant screen arrangements which can be of the order of 20-30 metres wide and 5-10 metres high, may be curved (for example a curve having a diameter of say 20 metres) and may additionally include a "roof" screen extending laterally at least partially over the performers. Such screens are extremely expensive and cumbersome to install.
Further, the image from the LED background can be of lower quality, requiring significant post-capture processing.
According to the present invention, there is provided a method of compositing a video stream comprising the steps of: obtaining first and second video streams, wherein there is overlap between the background of the first and second video streams; identifying a common feature in the background of the first and second video streams; and stitching the first and second video streams together along the identified feature within the background of the first and second video streams.
Thus, the present invention provides a hybrid arrangement in which a relatively small screen is used behind the actors/presenters to enable them to have a visible but relatively narrow (either or both in terms of width or height) "scene" with which to interact, and in which the remainder of the scene can be added by way of CGI or the like, but typically without requiring the provision of a "green screen" or similar technology. Thus the invention allows a relatively low cost solution for providing a set extension.
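By way of illustration only, the following is a minimal sketch of the compositing step described above, under the assumption that both frames are already rendered from the same perspective and that a binary seam mask, marking which pixels come from the first (captured) stream, has been derived from an identified common feature such as a wall edge. The function and parameter names are illustrative rather than part of any actual implementation.

```python
import cv2
import numpy as np

def composite_along_feature(first_frame: np.ndarray,
                            second_frame: np.ndarray,
                            seam_mask: np.ndarray,
                            feather_px: int = 15) -> np.ndarray:
    """Stitch two equally sized frames together along a seam mask.

    seam_mask is non-zero where the first, captured stream should be
    kept and 0 where the second, CGI stream should be kept; the mask
    boundary is assumed to follow a feature common to both backgrounds.
    """
    mask = (seam_mask > 0).astype(np.float32)
    # Feather the hard mask slightly so the transition hides in the feature.
    mask = cv2.GaussianBlur(mask, (0, 0), feather_px)[..., None]
    out = first_frame.astype(np.float32) * mask + \
          second_frame.astype(np.float32) * (1.0 - mask)
    return np.clip(out, 0, 255).astype(np.uint8)
```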
The first video stream may have a relatively narrow background and the second video stream may have a relatively wide background. The background of the first and second streams may be generated from the same 3D model and/or from the same perspective.
The method may further comprise the step of extracting a portion of the first image stream for stitching into the second image stream.
The method may further comprise the step of comparing the first and second image streams to identify areas of difference, thereby identifying potential extraction portions.
The method may further comprise the step of identifying a portion of the first image stream for extraction and then increasing the size of the extracted portion by dilating in at least one direction.
In the method, the common feature may be inboard of an edge of the first video stream.
The common feature is preferably outboard of a subject within the first video stream.
The background of the first video stream is preferably provided by a display screen, typically an LED screen. The screen may be formed of different parts: the more centrally located sections behind foreground objects may be of higher resolution, as these images are more likely to be retained in any combined stream, with lower resolution sections towards the outer edges, as these areas are more likely to be excluded from a combined stream and, in effect, replaced by the higher resolution images from a CGI image stream.
The identified feature may change during the first video stream from a first identified feature to a second identified feature. The identified feature may be a common edge of an object in each background.
The identified feature may be determined by one or more of: dilation about an identified object, identification of a feature line along which the join can lie, or determination of a smart seam based on best fit criteria. For example, best fit criteria may include setting the generated stitching seam to be a hard edge on edge features and/or a soft edge on smooth areas at a threshold distance from the extracted subject. Thus, the generated stitching seam may be dependent on the characteristics of the background around the subject.
Also disclosed is a video compositing system comprising: one or more inputs for receiving and/or storing first and second video streams to be joined together, the video streams having at least partially overlapping backgrounds; an input for receiving an identification of a feature common to the backgrounds of the first and second video streams, and a processor configured to stitch the first and second video streams together along the identified feature.
The processor may be further configured to detect the common feature and provide the necessary input. The processor may be further configured to ensure that the common feature is inboard of the edge of the first video stream.
The processor may be further configured to ensure that the common feature is outboard of any subject in the first video stream.
The system may further comprise a camera for capturing the first video stream.
The system may further comprise a display screen for displaying the background to be used during capture of the first video stream. The display screen may obtain the background image to display from a video store in the video compositing system, or it may be supplied in real time from a render engine, which may be rendering a 3D model into the 2D images for display.
The system may further comprise an infra-red camera for determining the location of any subject in the first video stream.
The system may further comprise a comparator for determining the location of any subject in the first video stream by comparing the first and second video streams, typically by identifying areas of difference between the first and second video streams.
The system may further comprise one or more primary compositors running in parallel with one or more secondary compositors, the primary compositor or compositors being optimised for rendering real-time CGI footage and the secondary compositor or compositors being optimised for rendering high-quality CGI footage.
Also disclosed is a render engine configured to receive a first captured image stream and a second CGI image stream, compare the image streams to identify areas of difference, thereby defining one or more possible subjects for extraction in the first stream, extract one or more of the areas of difference, and stitch the extracted area or areas of difference into the second CGI image stream.
Also disclosed is a method of compositing a video stream comprising the steps of: obtaining a first captured video stream in which the background is displayed on an LED screen and in which the background is based on a 3D model, obtaining a second video stream based on the 3D model, rendering at least a portion of the first captured video stream, stitching at least a portion of the images from the first captured video stream into the second video stream along one or more seams.
The one or more seams may be determined by one or more of: dilation about an identified object, identification of a feature line along which the seam can lie, or determination of a smart seam based on best fit criteria.
In this description, the terms "subject", "actor", "presenter", "people" are generally synonymous and are intended to cover any form of object that it is desired to video and may include a performer. Such performers may be human or animal or even robots, and may be delivering a fictional portrayal of a scene such as in a film or television programme, a live or pre-recorded report, or any other form of video content. Alternatively, the subject may be or include one or more inanimate objects.
The provision of the subjects against a screen also means that the additional CGI does not need to appear behind the subjects in the scene, which is a complex and time-consuming task especially when the subjects are people that are moving, such that the scene behind them is continually changing. The computing power required by the relevant render engines and the like in that scenario is significant and can only be done in post production.
By "overlapping" of the backgrounds between the first and second video streams, we are including both when the background of the first captured video stream is either at least partially, or wholly, contained within the CGI video stream, and also where a common edge is in both video streams. For example, the first video stream may have on an edge of the stream a vertical line indicative of say the end of a wall. The second video stream may have an equivalent vertical line on an opposite edge of the stream, such that when the two edges of the streams are aligned, a continuous image is formed. The "overlap" is therefore the common edge which permits alignment. It is, however, more common that the CGI background has significant portions in common with, and preferably contains all of, the display screen background.
The background of the captured video stream is preferably only images shown on the screen. The display screen is typically an LED screen.
The background for the display screen and for the CGI stream are preferably both generated / rendered from the same 3D model. This allows for more accurate alignment when overlaying the two streams and/or greater accuracy when identifying the areas of difference. The background for the CGI stream may be adapted to match the background of the display screen stream. By this, we mean that as the angle of the camera view (pan, tilt, height etc) that generates the first captured video stream is altered, the equivalent changes are made to the CGI stream background to replicate the effect of the camera motion. Thus the first and second image streams that are combined may be created from the same perspective. Thus, one stream may be a captured/filmed image stream of the LED screen and foreground objects such as actors, where the perspective is determined based upon the camera position. The second stream may then be computer generated based on the same perspective from the same virtual camera position.
Thus, a method is disclosed in which two image streams are combined, wherein the two image streams are created from the same perspective, where a first image stream is a captured image stream of a display screen and a second image stream is computer generated. The displayed image on the screen and the second image stream are preferably based on the same 3D model. This may be combined with any of the disclosed methods of identifying foreground objects within the captured video stream and extracting them for stitching into the second image stream, for example, by identifying non-common and/or overlapping areas, adding rim dilation, e.g. extracted subject or extracted actor dilation, around those non-common areas / foreground objects and/or expanding the extracted image out to a non-obvious border like lines or fades.
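As an illustrative aside, "created from the same perspective" can be understood as projecting the shared 3D model with the same camera intrinsics and the same tracked pose that describe the physical camera. The following pinhole-projection sketch is an assumption about one possible formulation and is not taken from the application; all names are illustrative.

```python
import numpy as np

def project_points(points_3d: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project Nx3 world points of the shared 3D model to pixel coordinates.

    K is the 3x3 intrinsic matrix; R (3x3) and t (3,) describe the tracked
    world-to-camera pose.  Using the same K, R and t for the LED wall
    render and for the wider CGI render keeps both streams in the same
    perspective, so their overlapping backgrounds line up.
    """
    cam = R @ points_3d.T + t.reshape(3, 1)   # world -> camera coordinates
    pix = K @ cam                             # camera -> image plane
    return (pix[:2] / pix[2]).T               # perspective divide -> (u, v)
```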
Any of the aspects of the invention discussed herein are advantageously carried out automatically, that is without human intervention.
The invention described above may advantageously be combined with the sort of computer generated imagery compositors that are described in, for example, WO2013156776. Thus, the present invention may further include a CGI compositing system comprising one or more primary compositors running in parallel with one or more secondary compositors, the primary compositor or compositors being optimised for rendering real-time CGI footage and the secondary compositor or compositors being optimised for rendering high-quality CGI footage.
Thus, the present invention may also include a system for compositing a scene comprising a captured video image and a computer-generated background image, the system comprising: at least one video camera for recording and storing a raw video image of a scene and for outputting a working video image of the scene; a selector switch operatively interposed between each of the video cameras and a real-time compositor for feeding a selected one of the working video images at any given time to the real time compositor; the real time compositor being adapted to incorporate a first CGI image into the selected working video image and to display a composite video image representative of the final shot on a display screen; the system further comprising: a data storage device adapted to store an archive copy of the raw video images from each of the video cameras and a postproduction compositor operatively connected to the data storage device for superimposing a second CGI image onto any one or more of the video images captured by the video cameras, wherein the first CGI image is a lower resolution, or simplified, version of the second CGI image.
Suitably, therefore, the invention may include and may operate two CGI rendering engines in parallel that work from a common set of video footage, for example the raw video images. Such a configuration suitably allows for CGI rendering, that is to say rendering of a three-dimensional computer model, to be conducted at two different resolutions (levels of detail) simultaneously, with the lower resolution version of the CGI footage being composited in real time with the actual video footage to allow the effect to be seen in real time, whereas the higher resolution CGI rendering is carried out in near-time, to enable the final composited footage to be reviewed later on. By operating two compositors or sets of compositors in parallel, it may be possible for directors or actors to be able to appreciate the likely end result of the shoot in real time, whilst retaining the ability to render final footage in a higher quality format afterwards, that is to say in near-time or in postproduction. One of the main advantages is that it may afford directors and actors an opportunity to review the footage as it is shot, thereby enabling critical on-set decisions to be made quickly and reducing the amount of re-shooting required. This may also afford improved continuity because scenes, sets, props etc. may not need to be laboriously re-created at a later point in time to reproduce a scene shot days or weeks previously, should a re-shoot be required.
The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:
Figure 1 shows an arrangement for recording video.
Figure 2 shows a further arrangement for recording video.
Figure 3 shows an example of combined first and second video streams.
Figure 4 shows an example of image dilation.
Figure 5 shows a further arrangement for recording video.
Figure 6 shows a yet further arrangement for recording video.
Figure 7 shows perspective differences when a camera moves.
Figure 8 shows a correction applied to Figure 7.
Figure 2 shows an arrangement for recording video. The arrangement comprises a display screen 10, multiple video cameras 11, 12, a video feed switch 13, a camera selector unit 14, a camera selector user interface 15, multiple display controllers 16, 17 and a scene database 18.
The display screen 10 is controllable to display a desired scene. It may be a front- or back-projection screen, or a light emissive screen such as an LED wall. It may be made up of multiple sub-units such as individual frameless displays butted together. It may be planar, curved or of any other suitable shape.
A subject 19 is in front of the screen, so that the subject can be viewed against an image displayed on the screen. That is, with the image displayed on the screen as a background to the subject. The subject may be an actor, an inanimate object or any other item that is desired to be videoed. Typically, the subject is free to move in front of the screen.
The outputs from each camera, representing video data streams, pass to the video feed switch 13. This selects a single one of the incoming video data streams for output at 20. The stream is selected in dependence on a signal from the camera selector unit 14, which operates under the control of user interface 15. Thus, by operating the user interface 15, an operator can cause a selected one of the incoming video streams captured by the cameras to be output. In this way the operator can cut between the two cameras.
Each display controller comprises a processor 21, 22 and a memory 23, 24. Each memory stores in non-transitory form instructions executable by the respective processor to cause the processor to provide the respective display controller with the functions as described herein. In practice, the two display controllers may be substantially identical.
Each display controller has access to the scene database 18. The scene database stores information from which the display controllers can generate images of a desired scene from a given point of view to allow such images to be displayed on the screen 10. In one example, the scene database may store one or more images that can be subject to transformations (e.g. any of affine or projective transformations, trapezoidal transformations and/or scaling transformations) by the display controllers to adapt the stored images to a representation of how the scenes they depict may appear from different points of view. The transformations may take into account the distortion induced by the lens currently installed in the camera, the pan/tilt attitude of the camera and any offset of the camera image plane from a datum location of the camera. Transformations to deal with these issues are known in the literature. In another example, the scene database may store data defining the appearance of multiple objects and those objects' placement in three dimensions in one or more scenes. With this data the display controllers can calculate the appearance of the collection of objects from a given point of view. Again, the transformations may take into account the distortion induced by the lens currently installed in the camera, the pan/tilt attitude of the camera and any offset of the camera image plane from a datum location of the camera. To achieve the required processing the display controllers may implement a three-dimensional rendering engine. An example of such an engine is Unreal Engine available from Epic Games, Inc. A display controller may be continuously active but may output control data to the screen only when it determines itself to be operational to control the screen. When a display controller is outputting data to the screen, the screen displays that data as an image.
Thus, each display controller 16, 17 has a processor running code stored in the respective memory. That code causes the respective display controller to retrieve data from the memory 18 and to form an image of a scene using that data and the location of a given point of view. Then, when the controller is operational to control the screen it outputs that image to the screen, which displays the image.
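For illustration, a hedged sketch of the kind of point-of-view adaptation described above, assuming OpenCV-style camera parameters: a stored background image is re-projected with a planar homography derived from the camera's change of attitude, and the model of the currently installed lens is applied. This is one possible simplified formulation, not necessarily the transformation used by the display controllers.

```python
import cv2
import numpy as np

def adapt_background(stored_bg: np.ndarray, K: np.ndarray,
                     dist_coeffs: np.ndarray, R_delta: np.ndarray) -> np.ndarray:
    """Re-render a stored background plane for a new camera attitude.

    K and dist_coeffs model the installed lens; R_delta is the rotation
    between the reference attitude and the current pan/tilt attitude.
    For a purely rotating camera the image-to-image mapping is the
    homography H = K @ R_delta @ inv(K).
    """
    H = K @ R_delta @ np.linalg.inv(K)
    h, w = stored_bg.shape[:2]
    warped = cv2.warpPerspective(stored_bg, H, (w, h))
    # Account for the distortion model of the currently installed lens
    # (a simplification; a real pipeline may pre-distort instead).
    return cv2.undistort(warped, K, dist_coeffs)
```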
The image displayed by the screen may be a still image or a video image. When a subject is in front of the screen, the use of a video image allows the background to vary during the course of a presentation by the subject that is being recorded by the cameras. The video image may, for example, portray falling rain, moving leaves, flying birds or other background motion.
An operator may determine the locations of the cameras 11, 12 before videoing starts and provide that data to the display controllers for use as point-of-view locations. Alternatively, each camera may be provided with a location estimating device 25, 26 that estimates the location of the camera in the studio or other environment where filming is taking place. That device may, for example, be a StarTracker sensor/processor system as is commercially available from the applicant. Such a device can allow the location of the camera to be tracked as the camera moves. Location data determined by such a device can be passed to the display controllers for use as point of view locations. These are just examples of mechanisms whereby the display controllers can receive the locations of the cameras. Once the display controllers have the location of the cameras 11, 12 they can use that information as point of view locations from which to estimate the appearance of a scene. Figure 2 shows a camera location estimation unit 27, which could form part of a StarTracker system. In this example, that unit communicates wirelessly with the devices 25, 26 to learn the cameras' locations and provides those locations to the display controllers 16, 17, although other forms of communication may be possible.
Whilst Figures 1 and 2 show a relatively small screen, such as 3-4 metres wide and 2-3 metres high, in practice these screens could be much larger as discussed above in order to provide the necessary scale of the background scenery. This is especially true for outdoor and/or outer space scenes which deliver dramatic effect by way of the "vastness" of the scene in which the actors are being portrayed.
However, by using a combination of a relatively small screen, say 2 x 3 metres, and combining that image with CGI technology to extend the scene, the vast scale that a scene requires can be achieved without the cost and space requirements of a giant screen, but allowing the actors to see the scene around them to make the acting or presenting more natural. Thus, the present invention has two image streams: firstly, a relatively narrow captured image stream of the actors/presenters in front of the screen; and secondly, a relatively wide stream of the CGI to fit around the relatively narrow captured stream. The CGI stream would typically contain the background images shown on the screen, and therefore captured in the narrow stream in which the actors are present, as this allows for easier further production work. This is illustrated schematically in Figure 5.
The camera 11 captures images of the subject 19 in front of display screen 10. Those images are of a relatively narrow frame shot in which the background on the screen 10 is provided by a first render engine 59. Typically, this image will be of relatively low quality as the render engine may be generating the background in real time or near real time. Both the background on the display screen and the second CGI image stream are typically generated (i.e. rendered) from the same 3D model. Thus, rendering in this context is the generation of 2D images from the 3D model.
Location information is captured by the tracker 25 and fed via line 52 to a second render engine 60, as well as to the first render engine 59. The feed to the first render engine allows the background on the display screen to be adjusted based on the changing camera position. Also fed to this second render engine 60 is lens distortion information via line 51. The second render engine can then determine how to correct and/or re-render the captured images to compensate for lens distortion issues, lighting discrepancies and colour differences and to improve the overall quality of the image in the captured image stream. Additional render engines and/or computers may be used to carry out one or more of the tasks disclosed herein.
The second render engine 60 may communicate either with the internet, for example via the cloud 63, or with the first render engine, to obtain the second image stream, or it may already contain details of the second video stream, which is typically the CGI background and is the wider background which does not fit on the display screen 10. The second render engine may utilise its own onboard processing, memory and other computing requirements, or may alternatively use cloud based services 63 to carry out one or more of the tasks.
The second render engine 60 then stitches the captured image stream and the CGI image stream together. Various steps can be carried out when stitching the image streams together and these are discussed below. Not all steps are necessarily required, and the steps may be carried out in a different order to that described below.
In combining the first, captured image stream and the second, CGI background image, it is preferable that a comparison 61 of the first and second streams is carried out.
This comparison can also be known as a difference key. Such a comparison is looking for common points, features or colours, or indeed areas that are absent of any such common features. This identification 62 recognises absent areas or "areas of difference" which are typically people, props or other objects that are the subject(s) of the image stream. Portions of the first image stream then need to be extracted 64 so that they can be combined by insertion, i.e. stitched into the second image stream. These portions need to include all relevant areas of difference to ensure that the final video stream includes all relevant parts of the captured image. Once extracted, the comparison between the two image streams will allow the extracted image(s) to be placed in the correct location within the second image stream. This may be done by overlaying the second CGI stream over the extracted portion(s) of the first image stream or vice versa.
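A minimal sketch of such a difference key, under assumptions: the captured frame is compared with the CGI frame containing the same background, the per-pixel difference is thresholded, and the result is cleaned up morphologically to give candidate extraction areas. The threshold and kernel size are illustrative.

```python
import cv2
import numpy as np

def difference_key(captured: np.ndarray, cgi: np.ndarray,
                   thresh: int = 30) -> np.ndarray:
    """Return a binary mask of the 'areas of difference' (likely subjects)."""
    diff = cv2.absdiff(captured, cgi)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    # Remove speckle caused by moire, colour shift or LED refresh noise,
    # then close small holes inside the subject.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```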
The comparison may also include one or more steps of segmentation of one or both of the image streams, that is the identification 62 of distinct areas via segmentation algorithms, or through object recognition within the images, e.g. a chair or a car, such that the identified object can then be recognised at a later point for further processing, such as extraction or alteration, such as changing colour.
The combining of the two image streams is preferably done in such a way to minimise the visibility of the stitching or patching of the two streams. In an ideal situation, the viewer of the combined stream would not be able to distinguish the location at which the two streams were combined, for example due to precise colour and/or brightness matching. Thus, image blending techniques may be included to correct colour and/or brightness when merging join lines. Such techniques may include two band image blending, multi-band image blending and Poisson image editing.
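As a hedged example of the blending step, Poisson image editing is exposed in OpenCV as seamlessClone; multi-band (pyramid) blending would be an alternative. The sketch assumes the extracted portion and its mask are already positioned in the coordinate frame of the CGI image and that the mask does not touch the image border; the names are illustrative.

```python
import cv2
import numpy as np

def blend_extracted(extracted: np.ndarray, cgi: np.ndarray,
                    mask: np.ndarray) -> np.ndarray:
    """Blend an extracted patch into the CGI frame along its seam.

    Poisson editing matches colour and lighting gradients across the
    join, hiding brightness and colour mismatches between the streams.
    """
    ys, xs = np.nonzero(mask)
    centre = (int(xs.mean()), int(ys.mean()))   # placement centre, (x, y)
    return cv2.seamlessClone(extracted, cgi, mask, centre, cv2.NORMAL_CLONE)
```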
When stitching the images together, typically the second CGI image stream will be of a higher quality than the captured image stream (which is captured in real time and therefore will be of lower quality compared to the near-time CGI stream), so to minimise the amount of re-rendering of the captured image stream that is required, it is desirable to reduce the size of the extracted image where possible. As such, one technique is to extract close to, preferably on the edge of, the area of difference, i.e. ensure that the area of difference is as small as possible. It is clear that the stitch line is therefore outboard of the subject, else part of the subject would be lost. However, this may mean that certain detailing is lost, so one option is to "dilate" the extracted image (see step 63), that is to expand the extracted image beyond just the area of difference and include some of the background that is common to both the captured image stream and the CGI image stream. The dilation can potentially go as far as the boundary of the first captured image stream, although it is preferable that the stitch line is inboard of the boundary of the first captured image stream. This is illustrated in Figure 4 in which a subject is identified as an area of difference. It is desired to extract the subject and, to minimise the amount of captured LED background that is taken, it may be preferable to extract along line 70 close to the subject. However, certain detailing, such as hair 75, may get lost. As such, the area to be extracted may be dilated, that is expanded, for example to line 71. This may still be insufficient if the subject moves significantly, for example with extravagant arm movements, as continually changing the stitching location will increase the likelihood of a viewer noticing the stitching location.
As such, the extracted image may be dilated further to line 72 or even beyond. Such dilation may be helpful when looking to use the technique described with reference to Figure 3 below, which demonstrates one way in which two streams can be combined.
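A minimal sketch, assuming a binary subject mask such as the one produced by the difference key above, of dilating the extraction region outwards, broadly corresponding to moving from line 70 towards lines 71 or 72 in Figure 4. The kernel size and number of levels are illustrative.

```python
import cv2
import numpy as np

def dilate_extraction(mask: np.ndarray, step_px: int = 15,
                      levels: int = 2) -> np.ndarray:
    """Expand a subject mask so fine detail (hair, motion blur) and some
    background common to both streams are included in the extraction."""
    kernel = cv2.getStructuringElement(
        cv2.MORPH_ELLIPSE, (2 * step_px + 1, 2 * step_px + 1))
    dilated = mask.copy()
    for _ in range(levels):   # each level is roughly one further line in Fig. 4
        dilated = cv2.dilate(dilated, kernel)
    return dilated
```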
The determination of the location of the stitching is important when ensuring that the viewer of the final combined image stream cannot discern where, in any given image in the final product, one stream finishes and another stream starts.
Whilst known stitching algorithms can assist in merging two images together, for example by blurring and/or fading, further options for how to reduce the visibility of the stitching can be beneficial.
An example of combined first 30 and second 40 video streams is shown in Figure 3 such that the combined stream 50 can be broadcast/sent for further processing etc. In this example, the first stream 30, depicted by the dotted line, is wholly within the bounds of the second stream 40. It may be that only part of the first stream 30 overlaps with the second stream 40, or it may be that an edge of first stream 30 has a common edge with second stream 40.
In this example, the background of the first stream 30 is also included within the background of the second stream 40, such that the edge 32 of house 31, cloud 33, hill 34 and horizon 35 all appear in each background. It is therefore possible to identify one or more of these features as being features along which stitching between the first and second video streams can occur. If, for example, the edge 32 of the house 31 extended fully from top to bottom of the first video stream, then such an edge would provide an ideal location to stitch. Such features may be identified by the segmentation step discussed above. In practice, however, the scene will not necessarily contain features which extend fully from one side to the other, but rather the stitching will need to occur at various different features. For example, given the position of the subject 36, the stitching could occur along part of wall edge 32, then horizon 35 and finally along hill 34. The definition of the hill and the horizon, being in the far distance, would also provide an ideal location to stitch the two streams together as these features would naturally include some blurring due to the distance. In this case, a blend or fade using stitching algorithms may be beneficial. The edge 32 of the house 31 would, being in the foreground, require less blending/fading but more of a "clean cut". Given the likely significant difference between the colour, contrast and/or brightness of the house relative to the background sky, such a clean cut would not be noticeable to the viewer.
In an alternative, if the subject were further to the left in the image, the hill 34 and the horizon 35 may pass behind the subject 36, such that these would not provide suitable "hard" features to stitch along. In that case, it might be necessary to stitch across the sky in which the cloud 33 is located. Being a relatively blurry object, the cloud and/or the sky itself would provide a suitable feature to use for location of the stitching, as blurring or fading between the first and second video streams would be less noticeable by the viewer.
Importantly, it is the identification of the most appropriate feature in any given scene, or indeed the varying of the feature as the scene develops, e.g. by movement of the subject, which gives the invention its greatest benefit.
There will likely be perspective, colour, brightness and/or alignment incompatibilities between the two streams. Traditional stitching algorithms can provide a correction to blend one image into another, but this tends to mean that, whilst the join may not be a sharp change in appearance, there is nevertheless a visible difference between the two streams that make up the video content, which may very well be detectable by a viewer.
This problem can be reduced further by making the "join" between the two streams lie along the edge of an object, e.g. the edge 32 of the wall 31 within the combined stream, such that the edge of the object provides a "natural" change in appearance and any change due to the appearance difference between the two streams is "lost" within the edge of the object. The object may have a linear or substantially linear edge, such as a tree trunk, the edge of a building or a billboard or the like, or may include one or more curved sections, for example the curve of a cloud formation, a wheel or similar. Using a natural border between two objects, however, ensures that the viewer will find it much harder to see any stitching or patching together of the two streams.
The natural border between objects may be very clear and well defined, e.g. a wall edge, such that whilst there is a significant change in appearance between the captured stream and the CGI stream, the significant change is already part of the scene and therefore is not distracting to the viewer. Alternatively, where for example a cloud is used as the join, this has a natural blurriness and/or gradual change of appearance, such that any smoothing that is generated by a traditional stitching algorithm is not noticeable. Such a region may be known as a smooth area.
The join may be made up of any of one or more linear sections, one or more curved sections and/or one or more smooth areas depending upon the make-up of the scene.
Thus, the selection of where to place the join or seam between the captured screen stream and the CGI stream is crucial. Preferably any join line between the two streams is placed inboard of the edge of the screen to avoid limitations on the choice of joining locations. Further, it is preferable that any join line does not pass behind any actors, presenters and/or props, as to do so greatly increases the post-production work required. Further, the system may be configured to recognise the location of actors, or indeed any other object or feature to be extracted, within the scene, for example by detecting a silhouette of a human form in the captured stream and comparing it to the CGI stream containing the same background images. The human form would be missing from the CGI stream. This is discussed above in relation to the extraction of an area of difference. An alternative method for detecting the human form would be to have a further camera, for example an infrared camera, associated with the camera capturing the actors in front of the screen, such that the actor(s) would appear as a shadow within the image captured by the infrared camera. The main camera (i.e. the camera doing the filming) may have an IR light source mounted on it so that it casts a shadow on the LED wall, this shadow being detected by a separate IR camera having a fixed location and which is trained on the LED wall. This method may also be useful to detect shadows when using LED floors, if each actual light source has an associated IR light next to it to cast equivalent shadows, which are then detected by a fixed IR camera and synthesised into the scene.
The join line or lines may be chosen to be lines of colour and/or contrast and/or brightness changes. The join line or lines may alternatively be in regions of a single colour and/or contrast and/or brightness such that fading between the joined streams may be effected.
The edges or join lines may be automatically detected by known edge detection algorithms such as "Canny" or may be found by way of image segmentation algorithms that can identify, for example, a building with a straight edge wall. Alternatively, the joins could be manually selected depending upon the user's desire and skill levels. Segmentation techniques may include instance segmentation (i.e. labelling each object) and semantic segmentation (i.e. labelling particular classes of object such as humans) and/or may include algorithms such as "graph cut" or "max flow" to find a good seam based on whatever stitching strategy is used, such as minimum texture regions or strong boundaries. Alternatively and/or additionally, machine learning methods (i.e. AI) could also be used, e.g. using segmentation ground-truth from green screen as training data, such as "Mask R-CNN" or "DensePose".
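By way of a hedged illustration of a "smart seam", and as a simplified stand-in for the graph-cut or max-flow formulations named above, a seam can be found by dynamic programming over a cost map that is low where the two streams already agree and lower still along strong edges that can hide the cut (the "hard edges on edge features" criterion). All thresholds and weights are illustrative.

```python
import cv2
import numpy as np

def find_seam(captured: np.ndarray, cgi: np.ndarray,
              edge_bonus: float = 0.5) -> np.ndarray:
    """Return, for each image row, the column of a top-to-bottom seam."""
    # Cost: per-pixel disagreement between the streams, discounted on edges.
    diff = cv2.absdiff(captured, cgi).astype(np.float32).sum(axis=2)
    edges = cv2.Canny(cv2.cvtColor(captured, cv2.COLOR_BGR2GRAY), 50, 150)
    cost = diff * (1.0 - edge_bonus * (edges > 0))

    h, w = cost.shape
    acc = cost.copy()
    for y in range(1, h):                      # accumulate an 8-connected seam cost
        left = np.roll(acc[y - 1], 1); left[0] = np.inf
        right = np.roll(acc[y - 1], -1); right[-1] = np.inf
        acc[y] += np.minimum(np.minimum(left, acc[y - 1]), right)

    seam = np.empty(h, dtype=np.int64)
    seam[-1] = int(np.argmin(acc[-1]))
    for y in range(h - 2, -1, -1):             # backtrack towards the top row
        x = seam[y + 1]
        lo, hi = max(0, x - 1), min(w, x + 2)
        seam[y] = lo + int(np.argmin(acc[y, lo:hi]))
    return seam
```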
The join line may alter location during a given scene. For example, whilst an actor is in a first position, the edge of a building may provide the most suitable join line, but later in the scene the actor may move to a second position in which the actor overlaps with the edge of the building. In that situation, it is not desirable for the join line to pass behind the actor, so the join line may be switched to an alternative location on the captured stream that is no longer behind the actor. It is desirable to minimise the switching of stitching locations, both to minimise production difficulties and because the human eye is adept at recognising changes: the fewer the different stitching locations, the less opportunity there is for the stitch to be detected.
In a further example of compositing two image streams in line with any of the examples described above, where a first image stream is captured by a camera, the first image stream including a background displayed on a display screen, and wherein the second image stream is a 2D rendering, typically of a 3D model, it has been noted that there can be latency in updating the display screen when the camera moves from a first position to a second position.
This is illustrated in Figures 6, 7 and 8.
There is a camera 11 and camera tracking device 25 (i.e. a StarTracker) setup. The camera is recording footage and sending it either to a storage unit or to another computer (not shown, but e.g. 60 in Figure 5) to be processed. Meanwhile, the camera tracking device 25 is tracking the pose (position and orientation) of the camera. This pose information is sent to a computer 59.
The computer 59 receives the pose information and renders a virtual scene 81 that has a perspective aligned to the pose of the tracked camera. This rendering is depicted in frame 81. A subsection 82 of this rendering (denoted by the dashed frame) is then displayed on the screen 80, e.g. an LED wall. A human actor 19 is positioned in front of the LED wall 80.
The latency problem is now explained and stems from the change in perspective of the camera 11 from the dotted position A to the solid position B: there is a latency in updating the LED wall to match the new perspective of the camera in position B, such that for at least a few frames when the camera is in the new position, the image shown on the LED screen is created based on the perspective of the camera in position A or in the transition from A to B. When the camera 11 shifts its position, it thereby changes its perspective. The dotted camera depicts its original (old) perspective, and the solid line camera depicts its new perspective.
The camera tracking device 25 sends the new pose information of the camera to the computer 59 that renders the virtual world 81. The computer 59 processes the new camera pose data and re-renders the virtual scene to match the new camera perspective.
However, the time it takes for the new pose data to be sent and received by the rendering computer and for a new render to be generated is not immediate. Therefore, as shown in Figure 7, there is a perspective disparity between the captured footage (by the camera 11) and the virtual rendering 81 displayed on the LED wall, as shown in the section labelled 84.
The camera footage (depicted by frame 83) has captured an LED wall 80 whose lines are no longer parallel with the 2D render 81. This causes a mismatch between the details in the camera footage and the virtual render as shown, for example, by the clouds 85.
To reiterate, the cause of this problem is due to the latency between the camera moving and the update on the virtual rendering.
To correct this change in perspective due to latency, a geometric transformation is applied, as shown in Figure 8.
The first step is to identify a transformation that explains the change in perspective due to the movement of the camera. Such a transformation can be solved through the known trajectory of the camera (from the tracking information) and the known delay between the camera movement and re-rendering of the virtual scene.
Once the transformation has been applied to the contents of the camera footage 83, the edges of the LED wall are made parallel with the 2D virtual world. This better aligns the camera footage 83 with the 2D virtual world rendering 81.
The transformation does not have to be on the camera footage. We could also apply the transformation on the 2D virtual rendering to instead match the virtual rendering to the camera footage. It may be possible to apply a transformation to each of the camera footage and the 2D virtual rendering. The overall aim however is the same, namely to match the two streams' perspective.
The transformation may be a simple shifting of position of the captured camera footage 83, i.e. one or more of up, down, left or right, or may include a more complex transformation to address the sort of misalignment shown in Figure 7. One example of transformation may be an Affine Transformation (or a Projective Transformation, which has greater generality). Methods for finding such transformations include using the Direct Linear Transform, which simply involves solving a system of linear equations via the identification of corresponding points; since the system can use, for example, StarTracker's tracking information, it is possible to recover the transformation more easily.
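A hedged sketch of recovering such a corrective projective transformation from corresponding points, using OpenCV's DLT-based homography solver, and applying it to the camera footage. Whether the warp is applied to the camera footage or to the virtual render is a design choice, as noted above; the names are illustrative.

```python
import cv2
import numpy as np

def latency_correction(camera_frame: np.ndarray,
                       pts_camera: np.ndarray,
                       pts_render: np.ndarray) -> np.ndarray:
    """Warp the camera footage so the LED wall lines up with the render.

    pts_camera / pts_render are Nx2 arrays of corresponding points, e.g.
    the LED wall corners located in each stream, or points predicted
    from the tracked camera trajectory and the known update delay.
    """
    H, _ = cv2.findHomography(pts_camera.astype(np.float32),
                              pts_render.astype(np.float32), cv2.RANSAC)
    h, w = camera_frame.shape[:2]
    return cv2.warpPerspective(camera_frame, H, (w, h))
```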
Another challenge to overcome is how to match the frames to be transformed into each other. One method is to identify the delay between the camera pose change and the LED wall update, which can be obtained empirically by timing when the rendering updates after shifting the camera pose.
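One illustrative way to obtain that delay empirically: record a per-frame measure of camera pose change alongside a per-frame measure of how much the rendered wall content changes, then pick the lag that best correlates the two. This is an assumption about one possible measurement, not the application's own procedure.

```python
import numpy as np

def estimate_delay_frames(pose_change: np.ndarray,
                          render_change: np.ndarray,
                          max_lag: int = 30) -> int:
    """Return the lag (in frames) between camera motion and wall update.

    pose_change[i]   - magnitude of camera pose change at frame i
    render_change[i] - magnitude of change in the rendered background at frame i
    """
    best_lag, best_score = 0, -np.inf
    for lag in range(max_lag + 1):
        a = pose_change[:len(pose_change) - lag]
        b = render_change[lag:lag + len(a)]
        score = float(np.dot(a - a.mean(), b - b.mean()))   # covariance-style score
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```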
The next step is to key the actor, or any other foreground object, out of the camera footage which may result in large areas of the LED wall being removed. This is shown schematically in Figure 4 where the various lines 70, 71 and 72 represent various levels of dilation and/or keying out of the subject to be extracted.
Once the shape to be extracted (the "Key") is produced, we need to align the Key in a position that best fits in the 2D virtual world 81. When doing this, the edges of the Key may be non-trivial to merge into the 2D virtual world, such that:
- edges of the Key could be made to align with an actual edge in the picture, e.g. the contours of furniture; by using these edges, the join lines will be less noticeable to the audience;
- edges of the Key may also be some distance away from the actor's actual body to ensure finer features such as hair, folds of clothes and/or even motion blur are not cropped away.
The keying of the actor or an object out of a scene can be conducted in a variety of ways. The general name for this task is image segmentation. Within this topic, there are algorithms such as edge detection, k-means clustering and watershed, or network-trained methods such as Mask R-CNN. Alignment can then be performed through methods such as feature descriptor matching or template matching.
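A minimal sketch of the feature-descriptor-matching route to alignment, using ORB features and a RANSAC homography; template matching (cv2.matchTemplate) would be a simpler alternative where only a translation is sought. This is an illustrative assumption, not the application's own alignment method.

```python
import cv2
import numpy as np

def align_by_features(key_patch: np.ndarray, virtual: np.ndarray) -> np.ndarray:
    """Estimate a homography mapping the extracted Key into the virtual render."""
    g1 = cv2.cvtColor(key_patch, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(virtual, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(g1, None)
    k2, d2 = orb.detectAndCompute(g2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]
    src = np.float32([k1[m.queryIdx].pt for m in matches])
    dst = np.float32([k2[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H
```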
Upon aligning the Key, it may be necessary to compensate for other properties between the camera footage and the 2D virtual rendering. Adjusting these properties to make a seamless final image may include:
- compensating for colour differences;
- compensating for different exposure levels;
- blending colour and lighting gradients in the images (i.e. as in Poisson Image Editing).
Even after aligning and trying to compensate for certain image properties, due to numerical errors, poor camera calibration or even imperfect camera models, it may still be that features between the camera footage and the 2D virtual rendering are misaligned.
To resolve these imperfections, the camera footage Key can be warped to match features in the 2D virtual render. Warping is typically the localised bending/curving or other transformation that aligns features between the camera footage and the virtual scene. For example, if the clouds 85 still do not perfectly align, localised warping can be applied to ensure that edges or other features correctly align. The key is to warp only local areas, so as to avoid affecting the rest of the Key, which may otherwise degrade its quality.
As with other transformations, the warping operation can be applied to either the Key or the 2D virtual render, or to both. The overall aim is simply to join up features to create a seamless product from the two streams. The methods used for image warping include any of: finding corresponding feature descriptors or corresponding edges, creating triangulated segments, forward mapping and inverse mapping, and/or two-pass mesh warping, to name a few related techniques.
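As a hedged illustration of a localised warp: a displacement that pulls one misaligned feature onto its target, attenuated by a Gaussian falloff so only a local neighbourhood bends, applied with cv2.remap. Real implementations might instead use mesh warps or thin-plate splines; the radius and names are illustrative.

```python
import cv2
import numpy as np

def local_warp(image: np.ndarray, src_pt, dst_pt, radius: float = 60.0) -> np.ndarray:
    """Locally pull the content at src_pt towards dst_pt, leaving the rest
    of the image untouched outside the Gaussian falloff."""
    h, w = image.shape[:2]
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    dx, dy = dst_pt[0] - src_pt[0], dst_pt[1] - src_pt[1]
    # Gaussian weight centred on the destination point (1 at dst, ~0 far away).
    weight = np.exp(-((xs - dst_pt[0]) ** 2 + (ys - dst_pt[1]) ** 2)
                    / (2.0 * radius ** 2)).astype(np.float32)
    # Inverse mapping: each output pixel samples from where the content was.
    map_x = xs - dx * weight
    map_y = ys - dy * weight
    return cv2.remap(image, map_x, map_y, cv2.INTER_LINEAR)
```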
The various techniques described above may or may not all be required, and they may be carried out in a different order depending upon the circumstances of the particular compositing that is being carried out. Thus, whilst warping is described as a final step, this may be carried out at any stage in the process. The same applies to any of the techniques disclosed.
In relation to a single image frame, the steps carried out above could be summarised as follows (a sketch of step (e) follows this list):
(a) Rough actor segmentation: identify the object to be extracted from the camera image and define the border around the object.
(b) Image alignment: determine where in the virtual image the extracted object is to be placed.
(c) Virtual image warp to camera image: generate a transformation that improves the alignment of the virtual image and the camera image.
(d) Smart fringe: apply any edge effects to smooth the joins/edges and/or remove some or all of the camera background.
(e) Colour and exposure compensation: correct for any colour and/or exposure differences; this may be done either in the virtual image or the camera image.
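As a hedged illustration of step (e), one simple form of colour and exposure compensation matches the mean and standard deviation of each colour channel within a region that both streams share, such as the dilated rim of background around the Key. Per-channel gain and offset are a deliberate simplification of what a production pipeline would do; the names are illustrative.

```python
import numpy as np

def match_colour_exposure(camera: np.ndarray, virtual: np.ndarray,
                          shared_mask: np.ndarray) -> np.ndarray:
    """Adjust the camera image so its colour/exposure statistics match the
    virtual render within a region common to both streams."""
    out = camera.astype(np.float32)
    region = shared_mask > 0
    for c in range(3):                        # per-channel gain and offset
        cam_vals = out[..., c][region]
        vir_vals = virtual[..., c][region].astype(np.float32)
        gain = vir_vals.std() / (cam_vals.std() + 1e-6)
        out[..., c] = (out[..., c] - cam_vals.mean()) * gain + vir_vals.mean()
    return np.clip(out, 0, 255).astype(np.uint8)
```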
In a further example, which may be combined with any of the above steps and/or features, there is also disclosed a method of reusing the identified transformation for both virtual and camera image alignment.
Once the geometric transformation has been obtained, in any of the ways described above, for the extracted part of the image, it allows the system to align the virtual images and the camera images together. The same transformation can be reused to allow other special effects or details to be added onto the background behind the actor (e.g. explosions). By applying this reusable geometric transform, the perspective distortions will remain coherent with the actor in the camera image.
Therefore the method of reusing the identified transformation for both virtual and camera image alignment may include the following: 1. The transformation for alignment is estimated for a single frame.
2. The transformation is then saved in a non-volatile memory storage space.
3. The transformation is then applied to align both the virtual and camera image.
4. The saved transformation may be used in post production to add further special effects or additional background details onto the same frame to ensure coherent perspective distortion.
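A minimal sketch of saving and reusing the per-frame transformation follows; the JSON file layout and the function names are illustrative assumptions rather than a prescribed format.

import json
import cv2
import numpy as np

def save_alignment(path, H):
    """Persist the per-frame alignment transform (step 2) so it can be
    reused in post production (step 4)."""
    with open(path, "w") as f:
        json.dump({"homography": H.tolist()}, f)

def apply_saved_alignment(path, effects_layer_bgr, frame_size):
    """Reapply the saved transform to an extra background layer (e.g. an
    explosion element) so its perspective distortion stays coherent with
    the actor in the camera image. frame_size is (width, height)."""
    with open(path) as f:
        H = np.array(json.load(f)["homography"], dtype=np.float64)
    width, height = frame_size
    return cv2.warpPerspective(effects_layer_bgr, H, (width, height))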
In the earlier described methods, the background around the actor may have been dilated. If this has been carried out, the dilated sections will remain the same as the original LED background content, i.e. no additional special effects will appear in the dilated sections around the actor. This may be compensated for, or the dilation may be omitted.
Image alignment quality consistency checker
The system and/or methods may include an image alignment quality consistency checker.
The transformation mentioned defines the alignment between the virtual and the camera image. This transformation will be estimated based on different factors, for example but not limited to:
- corresponding feature points
- image alignment metrics such as mutual information or cross correlation.
The quality assessment of this transformation is however limited to the frame it is estimated for. Due to the nondeterministic nature of the method, the following could occur alone or in combination: 1. The search for the best transformation to align a single frame may be quantitatively optimal but end up with a qualitatively poor alignment to the human eye.
2. The factors for estimating the transformation could be particularly poor for a single frame, and therefore the estimate is poor compared to those of neighbouring frames.
Therefore, a new method of constraining and assessing the quality consistency of the transformation for alignment may be utilised. Preferably, for every individual frame that is aligned, the assessment will be subject to the following constraints: (a) Applying a neighbour frame constraint. The assumption is that neighbouring frames should have: (i) similar transformations for alignment, and (ii) similar scores that quantify the alignment through image alignment metrics (e.g. mutual information or cross correlation). If a frame violates these constraints, the frame can be flagged for further human operator assessment during or after the entire video sequence is stitched. This allows the operator to fast track to the problematic frame to be fixed instead of watching the entire video sequence frame by frame.
For point (i), a transformation is an assembly of numeric parameters. Neighbouring frames, which contain much of the same content and a similar camera perspective, should therefore be expected to have similar numeric parameters estimated for their transformations. This enables a measure of the similarity of transformation between frames to be obtained, and frames whose transformation parameters differ from those of their neighbours by more than a certain threshold can be flagged.
For point (ii), the alignment procedure searches for a transformation that best aligns the virtual and the camera images based on a numeric metric (e.g. mutual information, cross correlation, or sum squared error for corresponding points). The assumption is that for neighbouring frames with similar content and similar perspective, the numeric metric should not differ by large margins. Therefore, frames whose numeric metrics differ from those of their neighbours by more than a certain threshold can be flagged.
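Constraints (i) and (ii) could be implemented along the following lines; the arrays of per-frame transformation parameters and alignment scores, and the tolerances, are illustrative assumptions.

import numpy as np

def flag_inconsistent_frames(params, scores, param_tol=0.05, score_tol=0.1):
    """Compare each frame's transformation parameters and alignment score
    against the average of its neighbours and flag outliers for human
    review. params is an (N, P) array of per-frame transform parameters,
    scores an (N,) array of alignment metrics."""
    flagged = []
    for i in range(1, len(scores) - 1):
        neighbour_params = (params[i - 1] + params[i + 1]) / 2.0
        neighbour_score = (scores[i - 1] + scores[i + 1]) / 2.0
        if (np.max(np.abs(params[i] - neighbour_params)) > param_tol
                or abs(scores[i] - neighbour_score) > score_tol):
            flagged.append(i)
    return flagged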
(b) Applying a tracked motion constraint. The tracked motion data from devices such as the Star Tracker can be used to constrain the directionality of the transformation. The transformations can be expected to have zero directionality or to be in a direction related to the camera motion. This constrains transformations estimated from poor data, which might otherwise produce erratic results. When the estimated transformation is not coherent with the camera motion, it can be flagged.
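The tracked motion constraint could, for example, compare the direction of the estimated per-frame translation with the image-space camera motion reported by the tracking system; the tolerances below are illustrative assumptions.

import numpy as np

def coherent_with_camera_motion(transform_translation, camera_motion,
                                angle_tol_deg=20.0, still_tol=1.0):
    """Return True when the estimated per-frame translation is near zero
    while the camera is still, or points roughly along the image-space
    motion reported by the tracking system."""
    t = np.asarray(transform_translation, dtype=np.float64)
    m = np.asarray(camera_motion, dtype=np.float64)
    if np.linalg.norm(m) < still_tol:
        # Camera essentially static: expect (near) zero directionality.
        return np.linalg.norm(t) < still_tol
    cos_angle = np.dot(t, m) / (np.linalg.norm(t) * np.linalg.norm(m) + 1e-9)
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return angle <= angle_tol_deg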
The method of image alignment quality consistency checking therefore may comprise:
1. Input virtual stream in line with any of the methods described above
2. Input camera stream in line with any of the methods described above
3. Stitch streams frame by frame in line with any of the methods described above
4. Assess stitch quality for each frame to flag bad frames
5. Adjust bad frames by human operator
A driver sketch of this checking loop is given below.
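In the sketch below, stitch and assess are hypothetical callables standing in for the stitching and quality assessment methods described above; only the control flow of steps 3 to 5 is illustrated.

def check_stitched_sequence(virtual_frames, camera_frames, stitch, assess):
    """Stitch frame by frame, assess each stitch, and collect the frame
    indices an operator should revisit."""
    flagged = []
    for index, (virtual, camera) in enumerate(zip(virtual_frames, camera_frames)):
        result = stitch(virtual, camera)          # step 3: stitch the streams
        if not assess(result, virtual, camera):   # step 4: flag bad frames
            flagged.append(index)                 # step 5: hand to the operator
    return flagged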
Such a method largely automates the checking process and reduces the human burden by requiring intervention only for those frames that have been flagged as potentially needing attention.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description, it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims (25)

1. A method of compositing a video stream comprising the steps of: obtaining first and second video streams, wherein there is overlap between the background of the first and second video streams; identifying a common feature in the background of the first and second video streams; and stitching the first and second video streams together along the identified feature within the background of the first and second video streams.
2. A method according to claim 1, wherein the first and second video streams are created with the same perspective.
3. A method according to claim 1 or claim 2, further comprising the step of adjusting the perspective of one or both image streams to compensate for movement of a camera capturing one of the image streams.
4. A method according to any one of the preceding claims, wherein the background of the first and second streams is generated from the same 3D model.
5. A method according to any one of the preceding claims, further comprising the step of extracting a portion of the first image stream for stitching into the second image stream.
6. A method according to any one of the preceding claims, further comprising the step of comparing the first and second image streams to identify areas of difference, thereby identifying potential extraction portions.
7. A method according to any of the preceding claims, further comprising the step of identifying a portion of the first image stream for extraction and then increasing the size of the extracted portion by dilating in at least one direction.
8. A method according to any of the preceding claims, wherein the common feature is inboard of an edge of the first video stream.
9. A method according to any of claims 1 to 7, wherein the common feature is outboard of a subject within the first video stream.
10. A method according to any of the preceding claims, wherein the background of the first video stream is provided by a display screen.
11. A method according to any of the preceding claims, wherein the identified feature is a common edge of an object in each background.
12. A method according to any of the preceding claims, wherein the identified feature is determined by one or more of: dilation about an identified object, identification of a feature line along which the seam can lie, or determination of a smart seam based on best fit criteria.
13. A video compositing system comprising: one or more inputs for receiving and/or storing first and second video streams to be joined together, the video streams having at least partially overlapping backgrounds; an input for receiving an identification of a feature common to the backgrounds of the first and second video streams; and a processor configured to stitch the first and second video streams together along the identified feature.
14. A system according to claim 13, wherein the processor is further configured to detect the common feature and provide the necessary input.
15. A system according to claim 13 or claim 14, wherein the processor is further configured to ensure that the common feature is inboard of the edge of the first video stream.
16. A system according to any of claims 13 to 15, wherein the processor is further configured to ensure that the common feature is outboard of any subject in the first video stream.
17. A system according to any of claims 13 to 16, further comprising a camera for capturing the first video stream.
18. A system according to any of claims 13 to 17, further comprising a display screen for displaying the background to be used during capture of the first video stream.
19. A system according to claim 18, wherein the display screen obtains the background image to display from a video store in the video compositing system.
20. A system according to any one of claims 13 to 19, further comprising an infra-red camera, separate from any camera capturing the first video stream, for determining the location of any subject in the first video stream.
21. A system according to any of claims 13 to 20, further comprising a comparator for determining the location of any subject in the first video stream by comparing the first and second video streams.
22. A system according to any of claims 13 to 21, further comprising one or more primary compositors running in parallel with one or more secondary compositors, the primary compositor or compositors being optimised for rendering real-time CGI footage and the secondary compositor or compositors being optimised for rendering high-quality CGI footage.
23. A render engine configured to receive a first captured image stream and a second CGI image stream, compare the image streams to identify areas of difference, thereby defining one or more possible subjects for extraction in the first stream, extract one or more of the areas of difference, and stitch the extracted area of difference into the second CGI image stream.
24. A method of compositing a video stream comprising the steps of: obtaining a first captured video stream in which the background is displayed on an LED screen and in which the background is based on a 3D model, obtaining a second video stream based on the 3D model, rendering at least a portion of the first captured video stream, and stitching at least a portion of the images from the first captured video stream into the second video stream along one or more seams.
25. A method according to claim 24, wherein the one or more seams are determined by one or more of: dilation about an identified object, identification of a feature line along which the seam can lie, or determination of a smart seam based on best fit criteria.
GB2114637.8A 2021-03-04 2021-10-13 Image stitching Pending GB2609996A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/GB2022/050585 WO2022185078A1 (en) 2021-03-04 2022-03-04 Video capture
US18/549,074 US20240171698A1 (en) 2021-03-04 2022-03-04 Video capture
EP22710705.9A EP4302475A1 (en) 2021-03-04 2022-03-04 Video capture
PCT/GB2022/051721 WO2023281250A1 (en) 2021-07-07 2022-07-04 Image stitching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GBGB2109804.1A GB202109804D0 (en) 2021-07-07 2021-07-07 image stitching

Publications (3)

Publication Number Publication Date
GB202114637D0 GB202114637D0 (en) 2021-11-24
GB2609996A true GB2609996A (en) 2023-02-22
GB2609996A8 GB2609996A8 (en) 2023-03-15

Family

ID=77274578

Family Applications (2)

Application Number Title Priority Date Filing Date
GBGB2109804.1A Ceased GB202109804D0 (en) 2021-03-04 2021-07-07 image stitching
GB2114637.8A Pending GB2609996A (en) 2021-03-04 2021-10-13 Image stitching

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GBGB2109804.1A Ceased GB202109804D0 (en) 2021-03-04 2021-07-07 image stitching

Country Status (1)

Country Link
GB (2) GB202109804D0 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160050368A1 (en) * 2014-08-18 2016-02-18 Samsung Electronics Co., Ltd. Video processing apparatus for generating panoramic video and method thereof
WO2016086754A1 (en) * 2014-12-03 2016-06-09 中国矿业大学 Large-scale scene video image stitching method
CN107203970A (en) * 2017-06-20 2017-09-26 长沙全度影像科技有限公司 A kind of video-splicing method based on dynamic optimal suture
US20190082103A1 (en) * 2017-09-11 2019-03-14 Qualcomm Incorporated Systems and methods for image stitching
CN110660023A (en) * 2019-09-12 2020-01-07 中国测绘科学研究院 Video stitching method based on image semantic segmentation
CN111800609A (en) * 2020-06-29 2020-10-20 中国矿业大学 Mine roadway video splicing method based on multi-plane multi-perception suture line


Also Published As

Publication number Publication date
GB202114637D0 (en) 2021-11-24
GB202109804D0 (en) 2021-08-18
GB2609996A8 (en) 2023-03-15
