CN115661280A - Method and device for implanting multimedia into video, electronic equipment and storage medium

Info

Publication number: CN115661280A
Application number: CN202211220862.6A
Authority: CN (China)
Prior art keywords: image, video, target, foreground, frame
Legal status: Pending
Other languages: Chinese (zh)
Inventors: Huang Xing, Guo Xiaoyan, Shi Feng, Zhao Songtao
Assignee: Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd; priority to CN202211220862.6A

Landscapes

  • Image Processing (AREA)

Abstract

The present disclosure relates to a method, apparatus, electronic device, computer-readable storage medium, and computer program product for embedding multimedia in video. The method comprises the steps of obtaining a key frame of a video to be implanted, implanting preset multimedia into the key frame as a foreground to obtain a first synthetic image of the key frame, obtaining a first display lookup table of the first synthetic image, obtaining a mapped first target synthetic image based on the first display lookup table, and performing time sequence frame interpolation according to the first target synthetic image to obtain a target video. When harmony processing is performed on the video implanted with multimedia, only the key frames of the video need to be extracted and processed, and time sequence frame interpolation is performed based on the effect of the key frames, which reduces the amount of calculation and improves processing efficiency. In addition, because the harmony processing is carried out by introducing a LUT, a higher-quality harmonious effect can be achieved without being limited by resolution.

Description

Method and device for implanting multimedia into video, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for embedding multimedia in a video.
Background
With the development of image processing technology, image composition has been widely used. Image composition cuts the foreground out of one image and pastes it onto another image to obtain a composite image. Composite images have wide application prospects; for example, they can be used to obtain an image of a target of interest, and can also be used for data augmentation.
However, since the foreground and background images may be captured under different conditions (such as time, season, illumination, weather, etc.), there may be a significant mismatch in brightness and color. As a result, the composite image may have problems such as an unreasonable size or position of the foreground, or a foreground that does not appear harmonious with the background. In the related art, a foreground integrated into an image can be tuned to be harmonious with the background by image harmonization (image harmony). For example, Poisson fusion, which is based on the Poisson equation, solves for optimal pixel values by solving a Poisson equation under user-specified boundary conditions to enforce continuity in the gradient domain, so that the source image and target image are fused seamlessly at the boundary while the gradient information of the source image is preserved.
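As an illustration of this related-art technique (not the method of the present disclosure), the sketch below applies Poisson fusion via OpenCV's seamless cloning; the synthetic images, their sizes, and the placement are assumptions made for the example.

```python
import cv2
import numpy as np

# Synthetic stand-ins for a foreground patch and a background image.
background = np.full((200, 200, 3), 120, np.uint8)  # uniform gray background
foreground = np.zeros((60, 60, 3), np.uint8)
foreground[:] = (40, 180, 220)                      # colored foreground patch
mask = 255 * np.ones(foreground.shape[:2], np.uint8)  # blend the whole patch
center = (100, 100)  # where the patch is placed in the background

# Poisson (gradient-domain) fusion: solves a Poisson equation under boundary
# conditions taken from the background, preserving the source's gradients.
blended = cv2.seamlessClone(foreground, background, mask, center,
                            cv2.NORMAL_CLONE)
```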
However, the solving process of Poisson fusion involves a large amount of calculation, so it can only achieve harmonization in simple scenes. For complex scenes, such as video scenes, it is not only time-consuming but also difficult to achieve an ideal fusion effect.
Disclosure of Invention
The present disclosure provides a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for embedding multimedia in a video, so as to at least solve the problem in the related art that the solving process based on Poisson fusion involves a large amount of calculation and is difficult to apply to video scenes. The technical scheme of the disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a method for embedding multimedia in a video, including:
performing frame extraction processing on a video to be implanted to obtain an extracted key frame;
implanting preset multimedia as a foreground into the key frame to obtain a first synthetic image of the key frame, wherein the first synthetic image comprises a first foreground area taking the multimedia as the foreground;
acquiring a first display lookup table of the first synthetic image, where the first display lookup table is a mapping relationship table between pixel values of the first foreground region and pixel values of first other regions in the first synthetic image, and the first other regions are regions other than the first foreground region in the first synthetic image;
mapping the pixel value of the first foreground area according to the first display lookup table to obtain a mapped first target synthetic image;
and performing time sequence frame interpolation according to the first target synthetic image to obtain a target video implanted with multimedia.
In one embodiment, the obtaining the first display look-up table of the first composite image includes: inputting the first synthetic image into a pre-trained harmony model to indicate that the harmony model outputs a first display lookup table corresponding to the relationship according to the relationship between the pixel value of the first foreground region and the pixel value of the first other region in the first synthetic image.
In one embodiment, the method for obtaining the harmony model includes: acquiring a first sample image, and performing image segmentation on the first sample image by adopting a foreground mask to obtain a foreground mask area of the first sample image; adjusting the display parameters of the foreground mask area to obtain an adjusted target foreground mask area; generating a second sample image from the target foreground mask region and the first sample image; and training a convolutional neural network by adopting the first sample image and the second sample image to obtain a trained harmonious model.
In one embodiment, the generating a second sample image from the target foreground mask region and the first sample image comprises: and replacing the foreground mask area in the first sample image according to the target foreground mask area to obtain a second sample image after area replacement.
In one embodiment, the training a convolutional neural network by using the first sample image and the second sample image to obtain a trained harmony model includes: inputting the second sample image into the convolutional neural network to obtain a harmonious image output by the convolutional neural network; calculating a loss value between the harmonious image and the first sample image by adopting a set loss function; and training the convolutional neural network according to the loss value to obtain a trained harmony model.
In one embodiment, the mapping the pixel values of the first foreground region according to the first display lookup table to obtain a mapped first target composite image includes: aiming at the pixel value of each pixel in the first foreground area, acquiring a target pixel value of a corresponding pixel according to the first display lookup table; and updating the pixel value of the corresponding pixel in the first synthetic image according to the target pixel value to obtain the mapped first target synthetic image.
In one embodiment, the key frames have a corresponding frame ordering; the performing time sequence frame interpolation according to the first target composite image to obtain a target video implanted with multimedia includes: acquiring a second display lookup table corresponding to a second synthetic image in the video to be embedded according to the first display lookup tables corresponding to adjacent key frames respectively, wherein the second synthetic image is a synthetic image corresponding to other frames positioned between the adjacent key frames in the video to be embedded, the second synthetic image includes a second foreground region taking the multimedia as a foreground, the second display lookup table is a mapping relation table between pixel values of the second foreground region and pixel values of second other regions in the second synthetic image, and the second other regions are regions except the second foreground region in the second synthetic image; mapping the pixel value of the second foreground area according to the second display lookup table to obtain a mapped second target synthetic image; and video synthesis is carried out on the first target synthetic image and the second target synthetic image according to the positions of the key frame corresponding to the first target synthetic image and other frames corresponding to the second target synthetic image in the video to be implanted, so as to obtain a target video.
In one embodiment, the neighboring key frames include a first key frame and a second key frame; the obtaining of the second display lookup table corresponding to the second synthesized image in the video to be implanted according to the first display lookup tables corresponding to the adjacent key frames respectively includes: and calculating a second display lookup table corresponding to a second composite image in the video to be implanted according to the first sequence of the first key frame in the video to be implanted, the second sequence of the second key frame in the video to be implanted, the third sequence of the other frames corresponding to the second composite image in the video to be implanted, the first display lookup table corresponding to the first key frame and the first display lookup table corresponding to the second key frame.
In one embodiment, the frame extraction processing on the video to be implanted to obtain an extracted key frame includes: acquiring image characteristics corresponding to each image frame in the video to be implanted; clustering the image frames according to the image characteristics to obtain a plurality of corresponding frame types; and extracting at least one target image frame from each frame category to serve as an extracted key frame.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for embedding multimedia in video, including:
the frame extraction module is configured to perform frame extraction processing on a video to be implanted to obtain an extracted key frame;
the image synthesis module is configured to implant preset multimedia as a foreground into the key frame to obtain a first synthesis image of the key frame, wherein the first synthesis image comprises a first foreground area taking the multimedia as the foreground;
a display lookup table obtaining module configured to perform obtaining of a first display lookup table of the first synthesized image, where the first display lookup table is a mapping relationship table between pixel values of the first foreground region and pixel values of a first other region in the first synthesized image, and the first other region is a region other than the first foreground region in the first synthesized image;
an image mapping module configured to perform mapping on pixel values of the first foreground region according to the first display lookup table to obtain a mapped first target composite image;
and the target video acquisition module is configured to execute time sequence frame interpolation processing according to the first target composite image to obtain a target video implanted with multimedia.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of embedding multimedia in video as described above in the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of implanting multimedia in video as described in the first aspect above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product including instructions, wherein the instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method for embedding multimedia in video as described in the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects: the method comprises the steps of performing frame extraction processing on a video to be implanted to obtain an extracted key frame, implanting preset multimedia into the key frame as a foreground to obtain a first composite image of the key frame, obtaining a first display lookup table of the first composite image, mapping a pixel value of a first foreground area according to the first display lookup table to obtain a mapped first target composite image, and performing time sequence frame interpolation processing according to the first target composite image to obtain a target video implanted with the multimedia. In the embodiment, when the harmony processing is performed on the video implanted with the multimedia, only the key frame of the video needs to be extracted for processing, and for the non-key frame, the time sequence frame interpolation can be performed based on the effect of the key frame, so that the calculation amount can be reduced and the processing efficiency can be improved compared with the harmony processing performed on each frame of the video. In addition, the embodiment performs the harmony processing by introducing the LUT, so that the harmony effect with higher quality can be realized without the limitation of resolution.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flow chart illustrating a method of embedding multimedia in video according to an exemplary embodiment.
FIG. 2 is a diagram illustrating the steps of obtaining a harmony model according to an example embodiment.
FIG. 3 is a schematic diagram illustrating a harmonious model training process in accordance with an exemplary embodiment.
FIG. 4 is a schematic diagram illustrating the principles of a harmonic model training in accordance with an exemplary embodiment.
Fig. 5 is a diagram illustrating a key frame extraction step according to an exemplary embodiment.
FIG. 6 is a schematic diagram illustrating an image mapping step according to an exemplary embodiment.
Fig. 7 is a diagram illustrating a sequential frame interpolation step in accordance with an example embodiment.
Fig. 8 is a block diagram illustrating an apparatus for embedding multimedia in video according to an example embodiment.
FIG. 9 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be further noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Because the solving process based on Poisson fusion involves a large amount of calculation and a long processing time, it can only achieve harmonization in simple scenes, and it is difficult to achieve an ideal fusion effect for complex scenes with higher quality requirements (such as video scenes). For this reason, a domain-verification-based deep image harmonization method has been proposed, which uses deep learning to perform the harmonization end to end. Specifically, two discriminators are introduced. The first is a standard discriminator from a Generative Adversarial Network (GAN), which acts on the whole picture so that the data distribution of the generated picture is close to that of a real picture. The second is a domain verification discriminator, which makes the domains of the foreground and background in the generated picture as close as possible. However, in a video advertisement implantation scenario, this scheme needs to process every frame of the video, so it still suffers from a large amount of computation and long processing time, and when the video resolution is high, it is difficult to keep the implanted advertisement sharp.
Based on this, the present disclosure provides a method for embedding multimedia in video, and this embodiment is illustrated by applying this method to a server, and it is understood that this method may also be applied to a terminal, and may also be applied to a system including a terminal and a server, and is implemented by interaction between the terminal and the server. As shown in fig. 1, in this embodiment, the method may include the following steps:
in step S110, a frame extraction process is performed on the video to be implanted to obtain an extracted key frame.
The video to be implanted refers to an original video into which multimedia needs to be implanted. Multimedia is generally a composite of multiple media forms, such as text, sound, and images. In this embodiment, a multimedia image is taken as an example for description.
In order to achieve an ideal and harmonious effect after the multimedia image is implanted into the original video, in this embodiment the server first performs frame extraction processing on the video to be implanted, that is, the original video, so as to obtain extracted key frames. Specifically, a key frame may be a frame containing a key action in the motion of a character or object in the video, or a frame that best reflects the main content of the video. The key frames can be extracted from the original video based on certain rules.
In step S120, a preset multimedia is implanted into the key frame as a foreground to obtain a first composite image of the key frame.
The preset multimedia refers to a preset multimedia image which needs to be embedded into the original video. In this embodiment, after the server extracts the key frame from the original video based on the above steps, the preset multimedia can be implanted into the key frame as a foreground. That is, the server may use the key frame as a background, use the preset multimedia image as a foreground, and paste the foreground into the background, thereby obtaining a composite image corresponding to the key frame, specifically, the composite image includes a foreground region using the implanted multimedia image as a foreground. In this embodiment, in order to distinguish the composite image of the key frame from the composite image of the non-key frame, the composite image of the key frame is defined as the first composite image, and the foreground region of the composite image of the key frame is defined as the first foreground region.
In step S130, a first display look-up table of the first composite image is acquired.
Here, a Look-Up Table (LUT) can be understood in essence as a RAM (Random Access Memory): after data is written into the RAM in advance, each input signal is equivalent to an address, and a table lookup finds and outputs the content stored at that address. In this embodiment, the LUT may be a mapping table of pixel values that applies a certain transformation, such as thresholding, inversion, binarization, contrast adjustment, or a linear transformation, to the actually sampled pixel values to obtain corresponding new values, so as to highlight useful information of the image and enhance its contrast, and finally achieve a harmonious effect after the foreground region and the background region are fused in the first synthesized image. Specifically, the first display lookup table is a mapping relationship table between pixel values of the first foreground region and pixel values of the first other region in the first synthesized image, where the first other region is the region of the first synthesized image other than the first foreground region.
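As a minimal illustration of this idea at 8-bit pixel granularity — a sketch, not the LUT produced by the harmony model — a 256-entry table maps each input intensity to an output intensity, and several of the transformations named above are simply different tables; the gamma value below is an assumed example.

```python
import numpy as np

# Build a 256-entry LUT; here a gamma curve stands in for the learned mapping.
gamma = 0.6  # assumed value for illustration
lut = np.clip(((np.arange(256) / 255.0) ** gamma) * 255.0, 0, 255).astype(np.uint8)

# Other classic transformations mentioned above are just different tables:
inversion_lut = (255 - np.arange(256)).astype(np.uint8)
binarize_lut = ((np.arange(256) >= 128) * 255).astype(np.uint8)

# Applying a LUT to an 8-bit image is a lookup: each pixel value is an address.
image = np.random.randint(0, 256, (4, 4), dtype=np.uint8)  # toy image
mapped = lut[image]  # fancy indexing performs the table lookup per pixel
```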
In this embodiment, the server may obtain the corresponding first display look-up table, that is, obtain the LUT corresponding to the key frame, according to the obtained first composite image of the key frame. And then the first synthetic image of the key frame can be subjected to harmony processing through subsequent steps so as to realize an ideal fusion effect. Specifically, the server may directly obtain the LUT of the first synthesized image through a pre-trained harmony model, or may obtain the corresponding LUT by fitting the change of the pixel gray levels of the foreground area and the background area in the first synthesized image, which is not limited in this embodiment.
In step S140, the pixel values of the first foreground region are mapped according to the first display lookup table, so as to obtain a mapped first target composite image.
The first target synthetic image is a result of performing harmony processing on the first synthetic image. In this embodiment, the server maps the pixel values of the first foreground region in the first synthetic image according to the LUT of the first synthetic image, so as to obtain a mapped first target synthetic image.
Specifically, the server searches a target pixel value corresponding to the pixel value in the LUT corresponding to the first synthesized image according to the pixel value of the first foreground region in the first synthesized image, and replaces the original pixel value with the target pixel value until the replacement of each pixel value of the first foreground region in the first synthesized image is completed, so that the first target synthesized image obtained by performing the harmony processing on the first synthesized image can be obtained.
In step S150, a time-series frame interpolation process is performed according to the first target composite image to obtain a target video embedded with multimedia.
The time-series frame interpolation processing may be a process of generating a non-key frame fusion effect based on a key frame fusion effect, and performing video synthesis based on an order in which the key frame and the non-key frame are respectively located in the video. The target video is a final video obtained by implanting multimedia into the original video and performing harmony processing. In this embodiment, the server may perform time-series frame interpolation according to the first target composite image corresponding to the key frame, so as to obtain the target video embedded with multimedia.
According to the method for implanting the multimedia into the video, frame extraction processing is carried out on the video to be implanted to obtain an extracted key frame, preset multimedia is implanted into the key frame as a foreground to obtain a first synthetic image of the key frame, a first display lookup table of the first synthetic image is obtained, pixel values of a first foreground area are mapped according to the first display lookup table to obtain a mapped first target synthetic image, and time sequence frame insertion processing is carried out according to the first target synthetic image to obtain the target video implanted with the multimedia. In the embodiment, when the harmony processing is performed on the video implanted with the multimedia, only the key frame of the video needs to be extracted for processing, and for the non-key frame, the time sequence frame interpolation can be performed based on the effect of the key frame, so that the calculation amount can be reduced and the processing efficiency can be improved compared with the harmony processing performed on each frame of the video. In addition, the embodiment performs the harmony processing by introducing the LUT, so that the harmony effect with higher quality can be realized without the limitation of resolution.
In an exemplary embodiment, in step S130, acquiring a first display lookup table of the first composite image may specifically include: and inputting the first synthetic image into a pre-trained harmony model to indicate the harmony model to output a first display lookup table corresponding to the relation according to the relation between the pixel value of the first foreground area and the pixel value of the first other area in the first synthetic image.
The harmony model can be obtained by training a convolutional neural network with sample data based on a deep learning method. Specifically, the base network adopted in model training can be a semantic segmentation network (U-Net) built by stacking multiple convolutional layers. The sample data comprises paired data: a synthetic image having a disharmonious effect and the corresponding synthetic image having a harmonious effect. In this embodiment, paired sample data is input to the convolutional neural network so that it can learn the difference between the disharmonious effect and the harmonious effect of the corresponding synthetic images and fit the corresponding mapping relationship; thereafter, for any frame's synthetic image, the mapping relationship between that image and its harmonious effect, that is, the LUT, can be fitted.
Based on this, in this embodiment, the server inputs the first synthetic image into the pre-trained harmony model, so as to obtain the first display look-up table output by the model and corresponding to the first synthetic image, that is, obtain the LUT corresponding to the first synthetic image. In the embodiment, the pre-trained harmony model is adopted, so that the mapping relation between the first synthetic image and the corresponding harmony effect, namely the LUT, can be quickly fitted.
In an exemplary embodiment, as shown in fig. 2, the method for obtaining the harmony model specifically includes:
in step S210, a first sample image is obtained, and the foreground mask is used to perform image segmentation on the first sample image, so as to obtain a foreground mask region of the first sample image.
Since model training aims to enable the model to learn the mapping relationship between a synthetic image with a disharmonious effect and the corresponding synthetic image with a harmonious effect, paired data of such images is needed during training. However, no real harmonious composite image exists after implantation in an actual scene.
Based on this, the present embodiment may employ an image-processing-based method to construct such paired data for model training. Specifically, the server first acquires a first sample image, which refers to an image captured under uniform shooting conditions; in particular, the first sample image need not be a synthesized image (i.e., an image synthesized from two images), but may be a single original image captured in the real world. In this embodiment, the first sample image serves as the synthetic image having the harmonious effect in the paired data.
In addition, the server can also process the first sample image based on an image processing method, so as to obtain a corresponding composite image with an inharmonious effect. Specifically, the server performs image segmentation on the first sample image by using a foreground mask, so as to obtain a foreground mask region of the first sample image. The mask is a template for controlling an image processing area to partially block an image to be processed. In this embodiment, the foreground mask is a template for blocking a non-foreground region in the first sample image to obtain a foreground region to be processed. The foreground mask region is an image segmentation of the first sample image based on the foreground mask to obtain a foreground region of the segmented first sample image. And then the foreground area can be subjected to image processing through subsequent steps to obtain a synthetic image with an inharmonious effect.
In step S220, the display parameters of the foreground mask area are adjusted to obtain an adjusted target foreground mask area.
The display parameters may be parameters that change the display effect of an image, including but not limited to the gamma value, color range, non-linear mapping, color crosstalk, hue, saturation, brightness, and so on. Adjusting the display parameters may mean adjusting any of the above parameters; specifically, only one of them may be adjusted, or several may be adjusted together.
In this embodiment, the server may adjust the display parameter of the foreground mask area of the obtained first sample image, so as to obtain an adjusted target foreground mask area. The target foreground mask area is obtained by adjusting the display parameters of the foreground mask area, so that the target foreground mask area has different display effects relative to the foreground mask area.
In step S230, a second sample image is generated from the target foreground mask region and the first sample image.
Wherein the second sample image is a composite image having an inharmonious effect corresponding to the first sample image. That is, the first sample image and the second sample image together constitute pairing data for model training.
Specifically, the obtained target foreground mask region has a different display effect relative to the original foreground mask region. Therefore, the server pastes the target foreground mask region onto the corresponding foreground mask region of the first sample image, so that the second sample image with the disharmonious effect is obtained, completing the construction of the synthetic image with the disharmonious effect.
In a scenario, the server may further replace the foreground mask region in the first sample image according to the target foreground mask region, that is, replace the foreground mask region in the first sample image with the corresponding target foreground mask region, so as to obtain a second sample image after the region replacement, so as to implement the construction of the composite image with the dissonant effect.
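A minimal sketch of steps S210 to S230, assuming an 8-bit RGB first sample image and a binary foreground mask; the gamma perturbation stands in for any of the display-parameter adjustments listed above.

```python
import numpy as np

def make_disharmonious_pair(first_sample, mask, gamma=1.8):
    """first_sample: HxWx3 uint8 real image; mask: HxW bool foreground mask.
    Returns the constructed second sample image (a disharmonious composite)."""
    fg = first_sample[mask].astype(np.float32) / 255.0   # segment the foreground
    fg = np.clip(fg ** gamma, 0.0, 1.0)                  # adjust display params
    second_sample = first_sample.copy()
    second_sample[mask] = (fg * 255.0).astype(np.uint8)  # replace the region
    return second_sample                                 # paired with first_sample

# Toy usage with synthetic data standing in for a real photograph.
first = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64), bool)
mask[16:48, 16:48] = True
second = make_disharmonious_pair(first, mask)
```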
In step S240, the first sample image and the second sample image are used to train the convolutional neural network, so as to obtain the trained harmony model.
The first sample image and the second sample image jointly form the matched data of the synthetic image with the dissonant effect and the synthetic image corresponding to the harmonious effect, so that the first sample image and the second sample image are adopted to train the convolutional neural network, the trained harmonious model is obtained, the model can learn the difference change between the dissonant effect and the harmonious effect of the corresponding image, and the corresponding mapping relation can be fitted.
In the above embodiment, the first sample image is obtained, the foreground mask is used to perform image segmentation on the first sample image to obtain a foreground mask region of the first sample image, the display parameters of the foreground mask region are adjusted to obtain an adjusted target foreground mask region, a second sample image is generated according to the target foreground mask region and the first sample image, and the convolutional neural network is trained by using the first sample image and the second sample image to obtain a trained harmony model. Since the present embodiment can construct the second sample image based on the first sample image, pairing data for model training can be obtained to implement model training.
In an exemplary embodiment, as shown in fig. 3, in step S240, training a convolutional neural network by using the first sample image and the second sample image to obtain a trained harmony model, which may specifically include:
in step S310, the second sample image is input into the convolutional neural network, and a harmonized image output by the convolutional neural network is obtained.
The second sample image includes the corresponding target foreground mask region, that is, a foreground region inconsistent with the background. The harmonious image is the harmonious version of the second sample image that the convolutional neural network predicts by fitting from the disharmonious second sample image. In this embodiment, the server inputs the second sample image into the convolutional neural network, so as to obtain the harmonious image output by the convolutional neural network.
In step S320, a loss value between the harmony image and the first sample image is calculated using the set loss function.
The loss function is used to measure the degree of inconsistency between the predicted value and the true value of the model; generally, the smaller the loss, the better the robustness of the model. In this embodiment, the predicted value of the model is the harmonious image output by the convolutional neural network, and the true value is the first sample image with the harmonious effect in the paired data. Based on this, the server can calculate the loss value between the harmonious image and the first sample image through the set loss function. Specifically, the loss function may calculate a pixel-by-pixel error based on the Mean Square Error (MSE), or may compute a statistical error between the pixel histograms of the corresponding images.
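As a sketch of the two options mentioned here, assuming 8-bit images as numpy arrays (the function names and bin count are illustrative):

```python
import numpy as np

def mse_loss(pred, target):
    """Pixel-wise mean-square error between predicted and real images."""
    diff = pred.astype(np.float32) - target.astype(np.float32)
    return float(np.mean(diff ** 2))

def histogram_loss(pred, target, bins=64):
    """Statistical error between the pixel histograms of the two images."""
    hp, _ = np.histogram(pred, bins=bins, range=(0, 256), density=True)
    ht, _ = np.histogram(target, bins=bins, range=(0, 256), density=True)
    return float(np.abs(hp - ht).sum())
```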
In step S330, the convolutional neural network is trained according to the loss value, and a trained harmony model is obtained.
Specifically, the server iteratively trains the convolutional neural network according to the loss value obtained through calculation until the loss value reaches the minimum value, and a trained harmonious model is obtained, so that the model has good robustness.
Specifically, as shown in fig. 4, the server may perform image processing based on the first sample image Q1 and the foreground mask M to obtain a constructed second sample image Q2, which specifically refers to the process of constructing the second sample image Q2 described in the above step S210 to step S230, and this is not described again in this embodiment. And then the second sample image Q2 is input into the convolutional neural network, so that a harmonious image Q3 output by the convolutional neural network is obtained. And calculating a loss value between the harmonious image Q3 and the first sample image Q1 based on a loss function, and adjusting network parameters based on the loss value to perform iterative training until the model converges, namely the loss value is minimum, so as to obtain the trained harmonious model.
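A hedged PyTorch sketch of this training procedure: the small convolutional stack below merely stands in for the U-Net backbone described above, the random tensors stand in for real paired data (Q2, Q1), and the step count and learning rate are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for the U-Net backbone; real data, network, and hyperparameters
# are assumptions for this sketch.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),  # predict harmonized RGB
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # the "set loss function"; MSE as in the text

# Toy paired batch: disharmonious composites Q2 and real harmonious images Q1.
q2 = torch.rand(4, 3, 64, 64)
q1 = torch.rand(4, 3, 64, 64)

for step in range(100):              # iterate until the loss converges
    harmonized = model(q2)           # predicted harmonious image Q3
    loss = loss_fn(harmonized, q1)   # compare against the first sample image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```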
In an exemplary embodiment, as shown in fig. 5, in step S110, performing frame extraction processing on the video to be embedded to obtain an extracted key frame, which may specifically include the following steps:
in step S510, image features corresponding to each image frame in the video to be implanted are obtained.
Wherein the image frame is a minimum unit constituting the video, i.e., one frame image. The image features may include color features, texture features, shape features, and spatial relationships of the image frame. In this embodiment, the server performs frame division processing on the video to be implanted, so as to obtain each image frame of the video to be implanted, and further performs feature extraction on each image frame, so as to obtain image features corresponding to each image frame.
In step S520, the image frames are clustered according to the image features to obtain a plurality of corresponding frame categories.
Clustering is the process of dividing a collection of physical or abstract objects into classes composed of similar objects. Specifically, in this embodiment, all image frames constituting the video to be implanted are corresponding sets, and the frame types are multiple types obtained through clustering.
In this embodiment, the server may perform clustering processing on the image frames according to the image features of each image frame of the video to be implanted, so as to obtain a plurality of corresponding frame categories. Take the color feature of an image frame as an example, where the color feature may be the RGB values of the pixels in the image. For instance, the RGB histogram of the first image frame may be computed and used as the initial centroid. The second frame is then compared with this centroid (specifically, by comparing the distance between the RGB histograms of the two frames against a preset threshold); if the second frame is similar to the first, it is added to the first frame's cluster, and a new centroid is generated from the newly added image and the previous centroid to serve as the basis for comparison with subsequent frames. If not, a new cluster, that is, a new frame category, is generated. Each subsequent frame is then compared with the centroids of all existing clusters and either joins the cluster it belongs to or spawns a new cluster. In this way, a plurality of clusters are generated, that is, a plurality of frame categories are obtained, and every frame is assigned to one of them.
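The clustering just described can be sketched as follows, using per-channel RGB histograms as the image feature and the Euclidean distance to running centroids; the bin count and threshold are assumptions.

```python
import numpy as np

def cluster_frames(frames, threshold=0.25):
    """frames: list of HxWx3 uint8 images. Returns a list of clusters, each a
    list of frame indices; centroids are maintained as running-mean histograms."""
    clusters, centroids = [], []
    for idx, frame in enumerate(frames):
        # Normalized RGB histogram as the image feature (16 bins per channel).
        hist = np.concatenate([np.histogram(frame[..., c], bins=16,
                                            range=(0, 256))[0] for c in range(3)])
        hist = hist / hist.sum()
        if centroids:
            dists = [np.linalg.norm(hist - c) for c in centroids]
            best = int(np.argmin(dists))
            if dists[best] < threshold:        # similar: join existing cluster
                clusters[best].append(idx)
                n = len(clusters[best])        # update centroid as running mean
                centroids[best] = centroids[best] * (n - 1) / n + hist / n
                continue
        clusters.append([idx])                 # dissimilar: start a new cluster
        centroids.append(hist)
    return clusters

# Toy usage: two near-identical dark frames and one bright frame.
frames = [np.full((8, 8, 3), v, np.uint8) for v in (10, 12, 200)]
print(cluster_frames(frames))  # -> [[0, 1], [2]]
```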
In step S530, at least one target image frame is extracted from each frame category as an extracted key frame.
The target image frame may be an image frame corresponding to a centroid of a corresponding cluster, i.e., frame class. In the present embodiment, the server extracts at least one target image frame from each frame category as an extracted key frame.
According to the method and the device, the key frame of the video to be implanted is determined in a clustering mode, and then the key frame is subjected to harmony processing, so that compared with the harmony processing of each frame of the video, the calculation amount can be reduced, and the processing efficiency is improved.
In an exemplary embodiment, as shown in fig. 6, in step S140, mapping the pixel values of the first foreground region according to the first display lookup table to obtain a mapped first target composite image, which may specifically include the following steps:
in step S610, for the pixel value of each pixel in the first foreground region, a target pixel value of the corresponding pixel is obtained according to the first display lookup table.
Since the first display look-up table is an LUT of the first composite image corresponding to a certain key frame, the LUT can reflect the mapping relationship between the dissonance effect and the harmony effect of the corresponding image. Thus, for the pixel value of each pixel of the first foreground region in the first composite image, the target pixel value to harmonize it can be found in the first display look-up table, i.e. LUT.
In step S620, the pixel value of the corresponding pixel in the first synthetic image is updated according to the target pixel value, so as to obtain the mapped first target synthetic image.
Specifically, the server updates the pixel value of the corresponding pixel in the first synthetic image according to the target pixel value obtained by table lookup, so as to obtain the mapped first target synthetic image, namely the result of the corresponding keyframe after implantation and harmony processing.
For example, if the pixel value of a certain pixel n in the first foreground region of the first composite image is 25 (with a range of 0 to 255), the LUT is searched; if the target value corresponding to the pixel value 25 is 112, then 112 is the target pixel value of pixel n, and the pixel value of pixel n is replaced with 112. Each pixel of the first foreground region in the first synthetic image is processed in this way, so that the mapped first target synthetic image, namely the result of implantation and harmony processing of the corresponding key frame, is obtained. In this embodiment, the synthetic image of the key frame is harmonized using the LUT-based method; because this method operates at pixel granularity, a higher-quality harmonious effect can be achieved without being limited by resolution.
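A sketch of this masked table lookup, assuming a single 256-entry LUT applied to all channels of the first foreground region; the toy table below echoes the 25-to-112 example.

```python
import numpy as np

def apply_lut_to_foreground(composite, mask, lut):
    """composite: HxWx3 uint8 first composite image; mask: HxW bool first
    foreground region; lut: length-256 uint8 table from the harmony model."""
    result = composite.copy()
    result[mask] = lut[composite[mask]]  # e.g. a pixel value of 25 becomes lut[25]
    return result                        # mapped first target composite image

# Toy check: an identity table where value 25 is remapped to 112.
lut = np.arange(256, dtype=np.uint8)
lut[25] = 112
img = np.full((2, 2, 3), 25, np.uint8)
mask = np.array([[True, False], [False, False]])
out = apply_lut_to_foreground(img, mask, lut)  # out[0, 0] == [112, 112, 112]
```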
In an exemplary embodiment, the key frames further have a corresponding frame ordering, where the frame ordering may be a total ordering of the corresponding key frames in the video to be embedded, or an order of the key frames extracted based on the video progress. As shown in fig. 7, in step S150, performing a time-series frame interpolation process according to the first target composite image to obtain a target video embedded with multimedia, which may specifically include the following steps:
in step S710, a second display lookup table corresponding to a second composite image in the video to be implanted is obtained according to the first display lookup tables respectively corresponding to the adjacent key frames.
The second composite image is a composite image corresponding to other frames between adjacent key frames in the video to be embedded, and similarly, the second composite image includes a second foreground region taking multimedia as a foreground. Specifically, the other frames are non-key frames between adjacent key frames. The second display lookup table is a mapping relation table between pixel values of a second foreground region and pixel values of a second other region in the second synthesized image, wherein the second other region is a region except the second foreground region in the second synthesized image.
In this embodiment, the server may obtain the second display lookup table, i.e., the LUT corresponding to the non-key frames between adjacent key frames, by interpolating the first display lookup tables (LUTs) of those adjacent key frames. For example, suppose there are an adjacent first key frame I2 and second key frame I3, where the first ordering of I2 in the original video (i.e., the video to be embedded) is 10 and its LUT obtained from the model is R2 (i.e., the first display lookup table corresponding to the first key frame), and the second ordering of I3 in the original video is 20 and its LUT obtained from the model is R3 (i.e., the first display lookup table corresponding to the second key frame). Then, for the non-key frames between I2 and I3 (i.e., the other frames whose third orderings in the original video are 11 to 19), interpolation may be performed based on R2 and R3 to obtain the LUTs corresponding to those non-key frames. Specifically, taking the non-key frame with third ordering 13 as an example, its LUT13 can be expressed as:
LUT13 = ((20-13)*R2 + (13-10)*R3)/(20-10).
Therefore, the non-key frames do not need to be processed by the model; their LUTs are simply obtained by temporal interpolation based on the key frames' results, so the amount of calculation can be reduced and the processing efficiency improved.
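The interpolation above can be written generically as follows, where t1 and t2 are the orderings of the adjacent key frames and t is the ordering of the in-between frame (a sketch assuming the LUTs are stored as numpy arrays):

```python
import numpy as np

def interpolate_lut(lut1, lut2, t1, t2, t):
    """Linearly interpolate between the LUTs of key frames at t1 < t < t2."""
    w2 = (t - t1) / (t2 - t1)   # weight of the later key frame's LUT
    w1 = (t2 - t) / (t2 - t1)   # weight of the earlier key frame's LUT
    lut = w1 * lut1.astype(np.float32) + w2 * lut2.astype(np.float32)
    return np.clip(lut, 0, 255).astype(np.uint8)

# Example from the text: t1=10 (R2), t2=20 (R3), t=13 -> (7*R2 + 3*R3)/10.
r2 = np.zeros(256, np.uint8)
r3 = np.full(256, 100, np.uint8)
lut13 = interpolate_lut(r2, r3, 10, 20, 13)  # every entry == 30
```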
In step S720, the pixel values of the second foreground region are mapped according to the second display look-up table, so as to obtain a mapped second target composite image.
In the embodiment, the operation of obtaining the mapped second target synthetic image by mapping the pixel value of the second foreground region through the second display lookup table is similar to the operation of obtaining the mapped first target synthetic image by mapping the pixel value of the first foreground region through the first display lookup table, and reference may be specifically made to the embodiment shown in fig. 6.
In step S730, video synthesis is performed according to the positions of the key frame corresponding to the first target synthesized image and the other frames corresponding to the second target synthesized image in the video to be implanted, so as to obtain a target video.
Because the first target synthetic image is an image obtained after the key frame is implanted and harmonised, and the second target synthetic image is an image obtained after the non-key frame is implanted and harmonised, the server carries out video synthesis on the first target synthetic image and the second target synthetic image according to the fact that the key frame corresponding to the first target synthetic image and other frames corresponding to the second target synthetic image are respectively located in the position of the video to be implanted, and therefore the target video after implantation and harmonisation can be obtained. In the embodiment, when the harmony processing is performed on the video with the multimedia embedded therein, only the key frames of the video need to be extracted for processing, and for the non-key frames, the time sequence frame interpolation is performed based on the effect of the key frames, so that the overall processing efficiency of the video can be improved compared with the harmony processing performed on each frame of the video.
It should be understood that although the various steps in the flowcharts of figs. 1-7 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 1-7 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps.
It is understood that the same/similar parts between the embodiments of the method described above in this specification can be referred to each other, and each embodiment focuses on the differences from the other embodiments, and it is sufficient that the relevant points are referred to the descriptions of the other method embodiments.
Fig. 8 is a block diagram illustrating an apparatus for embedding multimedia in video according to an example embodiment. Referring to fig. 8, the apparatus includes a frame extraction module 802, an image composition module 804, a display look-up table acquisition module 806, an image mapping module 808, and a target video acquisition module 810.
A frame extraction module 802 configured to perform frame extraction processing on a video to be implanted to obtain an extracted key frame;
an image synthesis module 804, configured to perform implantation of preset multimedia as a foreground into the key frame, to obtain a first synthetic image of the key frame, where the first synthetic image includes a first foreground region in which the multimedia is a foreground;
a display lookup table obtaining module 806 configured to perform obtaining a first display lookup table of the first synthesized image, where the first display lookup table is a mapping relationship table between pixel values of the first foreground region and pixel values of a first other region in the first synthesized image, and the first other region is a region in the first synthesized image except the first foreground region;
an image mapping module 808, configured to perform mapping on the pixel value of the first foreground region according to the first display look-up table, so as to obtain a mapped first target composite image;
and the target video acquisition module 810 is configured to perform time sequence frame interpolation processing according to the first target composite image to obtain a target video implanted with multimedia.
In an exemplary embodiment, the display look-up table acquisition module is configured to perform: and inputting the first synthetic image into a pre-trained harmony model to obtain a first display lookup table which is output by the harmony model and corresponds to the first synthetic image.
In an exemplary embodiment, the display lookup table obtaining module further includes: the device comprises a first sample image acquisition unit, a foreground mask acquisition unit and a foreground mask processing unit, wherein the first sample image acquisition unit is configured to acquire a first sample image and perform image segmentation on the first sample image by adopting a foreground mask to obtain a foreground mask area of the first sample image; a parameter adjusting unit, configured to perform adjustment on the display parameter of the foreground mask area, so as to obtain an adjusted target foreground mask area; a second sample image generation unit configured to perform generating a second sample image from the target foreground mask region and the first sample image; and the model training unit is configured to train a convolutional neural network by adopting the first sample image and the second sample image to obtain a trained harmonious model.
In an exemplary embodiment, the second sample image generation unit is further configured to perform: and replacing the foreground mask area in the first sample image according to the target foreground mask area to obtain a second sample image after area replacement.
In an exemplary embodiment, the model training unit is further configured to perform: inputting the second sample image into the convolutional neural network to obtain a harmonious image output by the convolutional neural network; calculating a loss value between the harmonious image and the first sample image by adopting a set loss function; and training the convolutional neural network according to the loss value to obtain a trained harmonious model.
In an exemplary embodiment, the image mapping module is further configured to perform: aiming at the pixel value of each pixel in the first foreground area, acquiring a target pixel value of a corresponding pixel according to the first display lookup table; and updating the pixel value of the corresponding pixel in the first synthetic image according to the target pixel value to obtain the mapped first target synthetic image.
In an exemplary embodiment, the key frames have a corresponding frame ordering; the target video acquisition module is further configured to perform: acquiring a second display lookup table corresponding to a second synthetic image in the video to be implanted according to the first display lookup tables corresponding to adjacent key frames respectively, wherein the second synthetic image is a synthetic image corresponding to other frames positioned between the adjacent key frames in the video to be implanted, and the second synthetic image comprises a second foreground area taking the multimedia as a foreground; mapping the pixel value of the second foreground area according to the second display lookup table to obtain a mapped second target synthetic image; and video synthesis is carried out on the first target synthetic image and the second target synthetic image according to the positions of the key frame corresponding to the first target synthetic image and other frames corresponding to the second target synthetic image in the video to be implanted, so as to obtain a target video.
In an exemplary embodiment, the frame extraction module is further configured to perform: acquiring image characteristics corresponding to each image frame in the video to be implanted; clustering the image frames according to the image characteristics to obtain a plurality of corresponding frame types; and extracting at least one target image frame from each frame category to serve as an extracted key frame.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 9 is a block diagram illustrating an electronic device S00 for embedding multimedia in video in accordance with an exemplary embodiment. For example, the electronic device S00 may be a server. Referring to fig. 9, the electronic device S00 comprises a processing component S20, which further comprises one or more processors, and memory resources, represented by memory S22, for storing instructions, e.g. application programs, executable by the processing component S20. The application stored in the memory S22 may include one or more modules each corresponding to a set of instructions. Furthermore, the processing component S20 is configured to execute instructions to perform the above-described method.
The electronic device S00 may further include: a power supply component S24 configured to perform power management of the electronic device S00, a wired or wireless network interface S26 configured to connect the electronic device S00 to a network, and an input-output (I/O) interface S28. The electronic device S00 may operate based on an operating system stored in the memory S22, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, or the like.
In an exemplary embodiment, a computer-readable storage medium including instructions, such as the memory S22 including instructions, is also provided, where the instructions are executable by a processor of the electronic device S00 to perform the above-described method. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, which includes instructions executable by a processor of the electronic device S00 to perform the above method.
It should be noted that the above descriptions of the apparatus, the electronic device, the computer-readable storage medium, the computer program product, and the like may also cover other embodiments corresponding to the method embodiments; for specific implementations, reference may be made to the descriptions of the related method embodiments, which are not detailed herein.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method for embedding multimedia in a video, the method comprising:
performing frame extraction processing on a video to be implanted to obtain an extracted key frame;
implanting preset multimedia as a foreground into the key frame to obtain a first composite image of the key frame, wherein the first composite image comprises a first foreground region taking the multimedia as the foreground;
acquiring a first display lookup table of the first composite image, wherein the first display lookup table is a mapping relationship table between pixel values of the first foreground region and pixel values of a first other region in the first composite image, and the first other region is the region of the first composite image other than the first foreground region;
mapping the pixel values of the first foreground region according to the first display lookup table to obtain a mapped first target composite image;
and performing time sequence frame interpolation processing according to the first target composite image to obtain a target video implanted with the multimedia.
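Read end to end, claim 1 is a four-stage pipeline. The sketch below strings the stages together using the helper functions sketched earlier (extract_key_frames, apply_display_lut, interpolate_lut); predict_lut stands in for the harmony model of claim 2, composite(t) returns the composite frame at index t with the multimedia pasted in, and mask(t) returns its foreground mask. All of these names are illustrative assumptions.

```python
# End-to-end sketch of claim 1 (helper names are assumptions; see the earlier
# sketches for extract_key_frames / apply_display_lut / interpolate_lut).
def embed_multimedia(frames, composite, mask, predict_lut):
    key_idx = extract_key_frames(frames)                    # 1) frame extraction
    luts = {i: predict_lut(composite(i)) for i in key_idx}  # 2)+3) key-frame LUTs
    out = []
    for t in range(len(frames)):                            # 4) time sequence frame interpolation
        a = max((i for i in key_idx if i <= t), default=key_idx[0])
        b = min((i for i in key_idx if i >= t), default=key_idx[-1])
        lut = luts[a] if a == b else interpolate_lut(luts[a], luts[b], a, b, t)
        out.append(apply_display_lut(composite(t), mask(t), lut))
    return out                                              # target video frames
```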
2. The method of claim 1, wherein the acquiring a first display lookup table of the first composite image comprises:
inputting the first composite image into a pre-trained harmony model to instruct the harmony model to output, according to a relationship between the pixel values of the first foreground region and the pixel values of the first other region in the first composite image, the first display lookup table corresponding to the relationship.
3. The method of claim 2, wherein the harmony model is obtained by:
acquiring a first sample image, and performing image segmentation on the first sample image using a foreground mask to obtain a foreground mask region of the first sample image;
adjusting display parameters of the foreground mask region to obtain an adjusted target foreground mask region;
generating a second sample image from the target foreground mask region and the first sample image;
and training a convolutional neural network using the first sample image and the second sample image to obtain a trained harmony model.
4. The method of claim 3, wherein the generating a second sample image from the target foreground mask region and the first sample image comprises:
replacing the foreground mask region in the first sample image with the target foreground mask region to obtain a second sample image after region replacement.
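Claims 3 and 4 together describe how training pairs for the harmony model are synthesized. The sketch below illustrates one way to do this, taking the "display parameters" to be brightness/contrast jitter; that choice is an assumption for illustration, since the claims do not enumerate the parameters.

```python
import numpy as np

# Synthesize one training pair: perturb the display parameters of the
# foreground region (brightness/contrast jitter is an assumed choice) and
# paste it back into the first sample image to form the second sample image.
def make_sample_pair(first: np.ndarray, mask: np.ndarray,
                     rng: np.random.Generator):
    gain = rng.uniform(0.6, 1.4)                  # contrast-like display parameter
    bias = rng.uniform(-30.0, 30.0)               # brightness-like display parameter
    fg = np.clip(first.astype(np.float32) * gain + bias, 0, 255)
    second = first.copy()
    second[mask] = fg[mask].astype(first.dtype)   # region replacement (claim 4)
    return second, first                          # (network input, ground truth)
```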
5. The method of claim 3, wherein the training a convolutional neural network using the first sample image and the second sample image to obtain a trained harmony model comprises:
inputting the second sample image into the convolutional neural network to obtain a harmonized image output by the convolutional neural network;
calculating a loss value between the harmonized image and the first sample image using a preset loss function;
and training the convolutional neural network according to the loss value to obtain the trained harmony model.
6. The method according to any one of claims 1 to 5, wherein the mapping the pixel values of the first foreground region according to the first display lookup table to obtain a mapped first target composite image comprises:
for the pixel value of each pixel in the first foreground region, acquiring a target pixel value of the corresponding pixel according to the first display lookup table;
and updating the pixel value of the corresponding pixel in the first composite image according to the target pixel value to obtain the mapped first target composite image.
7. The method according to any one of claims 1 to 5, wherein the key frames have a corresponding frame ordering, and the performing time sequence frame interpolation processing according to the first target composite image to obtain the target video implanted with the multimedia comprises:
acquiring a second display lookup table corresponding to a second composite image in the video to be implanted according to the first display lookup tables respectively corresponding to adjacent key frames, wherein the second composite image is a composite image corresponding to another frame located between the adjacent key frames in the video to be implanted, the second composite image comprises a second foreground region taking the multimedia as a foreground, the second display lookup table is a mapping relationship table between pixel values of the second foreground region and pixel values of a second other region in the second composite image, and the second other region is the region of the second composite image other than the second foreground region;
mapping the pixel values of the second foreground region according to the second display lookup table to obtain a mapped second target composite image;
and performing video synthesis on the first target composite image and the second target composite image according to the positions, in the video to be implanted, of the key frame corresponding to the first target composite image and of the other frame corresponding to the second target composite image, to obtain the target video.
8. The method of claim 7, wherein the adjacent key frames comprise a first key frame and a second key frame, and the acquiring a second display lookup table corresponding to a second composite image in the video to be implanted according to the first display lookup tables respectively corresponding to adjacent key frames comprises:
calculating the second display lookup table corresponding to the second composite image in the video to be implanted according to a first order of the first key frame in the video to be implanted, a second order of the second key frame in the video to be implanted, a third order, in the video to be implanted, of the other frame corresponding to the second composite image, the first display lookup table corresponding to the first key frame, and the first display lookup table corresponding to the second key frame.
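Assuming linear interpolation by frame order (the claim fixes the inputs of the computation but not the interpolation scheme), the calculation can be written as

$$\mathrm{LUT}_{2} = \frac{o_{2}-o_{3}}{o_{2}-o_{1}}\,\mathrm{LUT}_{k_{1}} + \frac{o_{3}-o_{1}}{o_{2}-o_{1}}\,\mathrm{LUT}_{k_{2}},$$

where $o_1$, $o_2$, and $o_3$ are the first, second, and third orders, and $\mathrm{LUT}_{k_1}$, $\mathrm{LUT}_{k_2}$ are the first display lookup tables of the first and second key frames; the weights reduce to selecting the key-frame table exactly at either key frame.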
9. The method according to any one of claims 1 to 5, wherein the performing frame extraction processing on the video to be implanted to obtain the extracted key frame comprises:
acquiring image features corresponding to each image frame in the video to be implanted;
clustering the image frames according to the image features to obtain a plurality of corresponding frame categories;
and extracting at least one target image frame from each frame category as an extracted key frame.
10. An apparatus for embedding multimedia in a video, comprising:
a frame extraction module configured to perform frame extraction processing on a video to be implanted to obtain an extracted key frame;
an image synthesis module configured to implant preset multimedia as a foreground into the key frame to obtain a first composite image of the key frame, wherein the first composite image comprises a first foreground region taking the multimedia as the foreground;
a display lookup table acquisition module configured to acquire a first display lookup table of the first composite image, wherein the first display lookup table is a mapping relationship table between pixel values of the first foreground region and pixel values of a first other region in the first composite image, and the first other region is the region of the first composite image other than the first foreground region;
an image mapping module configured to map the pixel values of the first foreground region according to the first display lookup table to obtain a mapped first target composite image;
and a target video acquisition module configured to perform time sequence frame interpolation processing according to the first target composite image to obtain a target video implanted with the multimedia.
11. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the method for embedding multimedia in a video according to any one of claims 1 to 9.
12. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method for embedding multimedia in a video according to any one of claims 1 to 9.
CN202211220862.6A 2022-10-08 2022-10-08 Method and device for implanting multimedia into video, electronic equipment and storage medium Pending CN115661280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211220862.6A CN115661280A (en) 2022-10-08 2022-10-08 Method and device for implanting multimedia into video, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115661280A (en) 2023-01-31

Family

ID=84985994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211220862.6A Pending CN115661280A (en) 2022-10-08 2022-10-08 Method and device for implanting multimedia into video, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012258A (en) * 2023-02-14 2023-04-25 山东大学 Image harmony method based on cyclic generation countermeasure network
CN116012258B (en) * 2023-02-14 2023-10-13 山东大学 Image harmony method based on cyclic generation countermeasure network

Similar Documents

Publication Publication Date Title
Li et al. Low-light image and video enhancement using deep learning: A survey
Ren et al. Low-light image enhancement via a deep hybrid network
WO2018141232A1 (en) Image processing method, computer storage medium, and computer device
Hole et al. Application of genetic algorithm for image enhancement and segmentation
US20210225005A1 (en) Selection of Video Frames Using a Machine Learning Predictor
CN111353546A (en) Training method and device of image processing model, computer equipment and storage medium
CN109064525A (en) A kind of picture format conversion method, device, equipment and storage medium
CN112037109A (en) Improved image watermarking method and system based on saliency target detection
CN106709504A (en) Detail-preserving high fidelity tone mapping method
CN115661280A (en) Method and device for implanting multimedia into video, electronic equipment and storage medium
Lou et al. Integrating haze density features for fast nighttime image dehazing
CN110807738A (en) Fuzzy image non-blind restoration method based on edge image block sharpening
Lecca et al. SuPeR: Milano Retinex implementation exploiting a regular image grid
Tangsakul et al. Single image haze removal using deep cellular automata learning
Wang et al. Semi-supervised parametric real-world image harmonization
Liu et al. Progressive complex illumination image appearance transfer based on CNN
CN113034412B (en) Video processing method and device
Patle et al. High dynamic range image analysis through various tone mapping techniques
US20210224571A1 (en) Automated Cropping of Images Using a Machine Learning Predictor
Janardhana Rao et al. MABC‐EPF: Video in‐painting technique with enhanced priority function and optimal patch search algorithm
Kang et al. Hierarchical palette extraction based on local distinctiveness and cluster validation for image recoloring
Ko et al. IceNet for interactive contrast enhancement
Huang et al. Learning image-adaptive lookup tables with spatial awareness for image harmonization
Tang et al. Fuzzy medical computer vision image restoration and visual application
CN113628121B (en) Method and device for processing and training multimedia data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination