CN113408332A - Video shot splitting method, device, equipment and computer readable storage medium

Video shot splitting method, device, equipment and computer readable storage medium

Info

Publication number
CN113408332A
CN113408332A (application CN202110246190.5A)
Authority
CN
China
Prior art keywords
video
video frame
target
shot
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110246190.5A
Other languages
Chinese (zh)
Inventor
郭卉 (Guo Hui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110246190.5A
Publication of CN113408332A
Legal status: Pending

Abstract

The application provides a video shot splitting method, apparatus, device and computer readable storage medium. The method comprises: performing frame extraction on a target video to obtain a video frame sequence corresponding to the target video; performing feature extraction on each video frame in the video frame sequence to obtain the image embedding feature of each video frame; acquiring, based on the image embedding feature of each video frame, the similarity between the image embedding features of adjacent video frames in the video frame sequence; splitting the video frame sequence into at least two initial shots based on the similarity between the image embedding features of the adjacent video frames, wherein each initial shot comprises at least one video frame; and aggregating the at least two initial shots to obtain at least two target shots. Through the method and the device, both the accuracy of shot splitting and the internal cohesion of the resulting shots can be improved.

Description

Video shot splitting method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a video shot splitting method, apparatus, device, and computer-readable storage medium.
Background
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
Computer Vision (CV) is the science of making machines "see": using cameras and computers in place of human eyes to identify, track and measure targets, and performing further image processing so that the processed image is better suited for human observation or for transmission to downstream instruments. Video processing is an important technology within computer vision.
Video shot splitting is an important task in video processing: determining which video frames of a video belong to the same shot, i.e., which frames depict the same event with the same subjects. The scene-change boundaries provided by shot splitting can greatly compress the samples to be handled in video processing and provide a main clue to the storyline changes of a video sequence.
In the related art, videos are usually split into shots based on the Scale-Invariant Feature Transform (SIFT) features of their video frames. Because of the limited expressive power of SIFT features, both the accuracy of the shot splitting and the internal cohesion of the resulting shots are low.
Disclosure of Invention
The embodiments of the present application provide a video shot splitting method, apparatus, device and computer readable storage medium, which can improve the accuracy of shot splitting and the cohesion of the resulting shots.
The technical solutions of the embodiments of the present application are implemented as follows:
The embodiments of the present application provide a video shot splitting method, comprising:
performing frame extraction on a target video to obtain a video frame sequence corresponding to the target video;
performing feature extraction on each video frame in the video frame sequence to obtain the image embedding feature of each video frame;
acquiring, based on the image embedding feature of each video frame, the similarity between the image embedding features of adjacent video frames in the video frame sequence;
and splitting the video frame sequence into at least two target shots based on the similarity between the image embedding features of the adjacent video frames, wherein each target shot comprises at least one video frame.
The embodiments of the present application provide a video shot splitting apparatus, comprising:
a frame extraction module, configured to perform frame extraction on a target video to obtain a video frame sequence corresponding to the target video;
an extraction module, configured to perform feature extraction on each video frame in the video frame sequence to obtain the image embedding feature of each video frame;
an acquisition module, configured to acquire, based on the image embedding feature of each video frame, the similarity between the image embedding features of adjacent video frames in the video frame sequence;
a shot splitting module, configured to split the video frame sequence into at least two initial shots based on the similarity between the image embedding features of the adjacent video frames, wherein each initial shot comprises at least one video frame;
and an aggregation module, configured to aggregate the at least two initial shots to obtain at least two target shots.
In the above solution, the frame extraction module is further configured to acquire the video frame rate of the target video and the duration of the target video; determine the number of frames to extract according to the video frame rate and the duration of the target video; and perform frame extraction on the target video according to that number, obtaining a video frame sequence containing that number of video frames.
In the above solution, the extraction module is further configured to perform feature extraction on each video frame in the video frame sequence through a first image embedding model to obtain the image embedding feature of each video frame;
the apparatus further comprises:
an updating module, configured to screen shot-splitting hard cases out of the at least two target shots; acquire the manual shot splitting result corresponding to each hard case, the manual result indicating which video frames belong to the same shot; and construct a training set based on the hard cases and the manual shot splitting results, and update the model parameters of the first image embedding model through the training set to obtain a second image embedding model.
In the above solution, the updating module is further configured to acquire the number of video frames in each target shot and the average number of video frames per target shot; determine, for each target shot, the ratio of its number of video frames to the average number of video frames; and treat any target shot whose ratio does not reach a target ratio as a shot-splitting hard case.
In the above solution, the updating module is further configured to, when a target shot includes at least two video frames, perform the following operations for each target shot:
starting from the first video frame of the target shot, take a target proportion of the video frames in play-time order to obtain a first video frame subset, and
starting from the last video frame of the target shot, take a target proportion of the video frames in reverse play-time order to obtain a second video frame subset;
acquire the similarity between the first video frame subset and the second video frame subset;
and when the similarity does not reach a similarity threshold, determine that the target shot is a shot-splitting hard case.
In the above solution, the training set includes a plurality of triplets, each triplet comprising a reference image, a positive image belonging to the same shot as the reference image, and a negative image belonging to a different shot from the reference image;
the updating module is further configured to input the reference image, the positive image and the negative image of a triplet into the first image embedding model; perform a forward pass on the reference image, the positive image and the negative image through the first image embedding model, predicting their image embedding features; acquire a first difference, between the image embedding feature of the reference image and that of the positive image, and a second difference, between the image embedding feature of the reference image and that of the negative image, and determine the value of the loss function of the first image embedding model based on the first difference and the second difference; and update the model parameters of the first image embedding model based on the value of the loss function.
In the above solution, the updating module is further configured to acquire a test set and input it into the first image embedding model and the second image embedding model respectively, obtaining a first prediction result corresponding to the first image embedding model and a second prediction result corresponding to the second image embedding model; determine the value of the loss function corresponding to the first image embedding model based on the first prediction result; determine the value of the loss function corresponding to the second image embedding model based on the second prediction result; and replace the first image embedding model with the second image embedding model when the value of the loss function corresponding to the second image embedding model is smaller than that corresponding to the first image embedding model.
In the above solution, the shot splitting module is further configured to acquire a first similarity threshold and a second similarity threshold, where the first similarity threshold is greater than the second similarity threshold; divide two video frames whose image-embedding similarity reaches the first similarity threshold into the same shot; and divide two video frames whose image-embedding similarity does not reach the second similarity threshold into different shots, obtaining at least two initial shots.
In the above solution, the shot splitting module is further configured to acquire at least two triplets, each triplet comprising a reference image, a positive image belonging to the same shot as the reference image, and a negative image belonging to a different shot from the reference image; acquire at least two candidate similarity thresholds; determine, based on the triplets, the shot-splitting accuracy and the shot-splitting balance score corresponding to each candidate similarity threshold; take the candidate similarity threshold with the highest balance score as the first similarity threshold; and take the candidate similarity threshold with the highest accuracy as the second similarity threshold.
In the above solution, the aggregation module is further configured to acquire the similarity between the initial shots, and to aggregate the at least two initial shots based on that similarity, such that the similarity between any two initial shots aggregated into the same target shot reaches a similarity threshold.
In the above solution, the aggregation module is further configured to, when the initial shots include at least two video frames, take any two of the at least two initial shots as a first initial shot and a second initial shot and perform the following processing on them:
for each target video frame in the first initial shot, acquire the similarity between that target video frame and each video frame in the second initial shot, obtaining at least two similarities corresponding to the target video frame;
determine the maximum similarity corresponding to each target video frame in the first initial shot based on its at least two similarities;
acquire a first number, the number of target video frames whose maximum similarity reaches a first similarity threshold, and a second number, the total number of target video frames in the first initial shot;
and take the ratio of the first number to the second number as the similarity between the first initial shot and the second initial shot.
In the above solution, the aggregation module is further configured to, when the initial shots include at least two video frames, take any two of the at least two initial shots as a first initial shot and a second initial shot and perform the following processing on them:
when the first initial shot precedes the second initial shot in play order, acquire the target video frame with the latest play time point in the first initial shot;
acquire the similarity between that target video frame and each video frame in the second initial shot, obtaining at least two similarities corresponding to the target video frame;
and take the average of the at least two similarities as the similarity between the first initial shot and the second initial shot.
An embodiment of the present application provides a computer device, comprising:
a memory for storing executable instructions;
and a processor, configured to implement the video shot splitting method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions for causing a processor to execute the video shot splitting method provided by the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
Frame extraction is performed on a target video to obtain the corresponding video frame sequence; feature extraction is performed on each video frame in the sequence to obtain its image embedding feature; the similarity between the image embedding features of adjacent video frames is acquired based on those features; and the video frame sequence is split into at least two initial shots based on that similarity, each initial shot comprising at least one video frame. Compared with SIFT features, image embedding features describe video frames more accurately, which improves the accuracy of shot splitting; moreover, aggregating the initial shots raises the internal cohesion of each target shot, improving the cohesion of the shot splitting result.
Drawings
FIG. 1 is a diagram illustrating the result of manually splitting a video into shots;
FIG. 2 is a diagram illustrating the result of splitting a video into shots based on SIFT features;
FIG. 3 is a schematic diagram of an alternative architecture of a video shot splitting system according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a video shot splitting method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a ResNet module provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of the process of determining initial shots provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the process of determining initial shots provided by an embodiment of the present application;
FIG. 8 is a schematic illustration of the aggregation effect provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a video shot splitting method provided by an embodiment of the present application;
FIG. 10 is a schematic flowchart of a video shot splitting method according to an embodiment of the present application;
FIG. 11 is a schematic processing flow diagram of video shot splitting provided by an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a video shot splitting apparatus provided in an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by a person skilled in the art without creative effort fall within the protection scope of the present application.
In the following description, "some embodiments" describes a subset of all possible embodiments; it should be understood that "some embodiments" may refer to the same subset or to different subsets of all possible embodiments, and that these may be combined with one another where no conflict arises.
In the following description, the terms "first", "second" and "third" serve only to distinguish similar objects and do not denote a particular order; where permitted, the specific order or sequence may be interchanged so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms used in the embodiments of the present application are explained as follows.
1) Video shot splitting: film and television scenes are typically captured with cameras at different positions and angles; each camera change produces a scene change in the video, and the frames within the same scene are visually consistent (often the same person against the same background environment).
The related art provides a video shot splitting method implemented with the PySceneDetect library: PySceneDetect is used to extract the SIFT features of each video frame of the video to be split, and the video is then split into shots according to the similarity of the SIFT features of adjacent video frames.
FIG. 1 is a schematic diagram of the result of manually splitting a video into shots, and FIG. 2 is a schematic diagram of the result of splitting the same video based on SIFT features. Referring to FIG. 1 and FIG. 2, one shot 201 is missing from the splitting result of FIG. 2 compared with FIG. 1.
In implementing the embodiments of the present application, the applicant found the following problems with the above scheme:
1) due to the limited expressive power of SIFT, the method cannot distinguish all images in massive data, for example different athletes wearing the same sportswear;
2) corner features are sensitive to image noise, distortion and deformation, which easily causes expression errors, so that similar images cannot be matched to the same shot under this feature representation;
3) fine-grained differences are hard to distinguish; for example, the SIFT corner points of a face tend to lie at the same positions on the eyes and nose but carry no capability to distinguish the eyes and noses of different people, so different face structures cannot be told apart when the scene switches between faces;
4) the method is a logic-driven rather than data-driven shot splitting framework: the "adjacent-feature-similarity" rule does not transfer to other features, and SIFT features have no feature learning capability and are difficult to optimize.
Based on this, the embodiments of the present application provide a video shot splitting method, apparatus, device and computer-readable storage medium that can solve at least one of the above problems.
Referring to FIG. 3, FIG. 3 is a schematic diagram of an alternative architecture of a video shot splitting system provided by an embodiment of the present application. To support an exemplary application, terminals (terminal 400-1 and terminal 400-2 are shown) are connected to a server 200 through a network 300, which may be a wide area network, a local area network, or a combination of the two. In practical applications, each terminal is provided with clients, such as a video client, a browser client, a news client or an education client; the number of terminals and servers is not limited.
The terminal is configured to receive an input target video and send it to the server.
The server 200 is configured to perform frame extraction on the target video to obtain the corresponding video frame sequence; perform feature extraction on each video frame in the sequence to obtain its image embedding feature; acquire, based on the image embedding feature of each video frame, the similarity between the image embedding features of adjacent video frames; split the video frame sequence into at least two initial shots based on that similarity, each initial shot comprising at least one video frame; and aggregate the at least two initial shots to obtain at least two target shots.
The terminal is further configured to acquire the at least two target shots.
In some embodiments, the server 200 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch or a smart television.
Based on the above description of the video shot splitting system, the video shot splitting method provided by the embodiments of the present application is described below. Referring to FIG. 4, FIG. 4 is a schematic flowchart of the video shot splitting method; in some embodiments the method may be implemented by a terminal alone, by a server alone, or by a terminal and a server cooperatively. Taking the server alone as an example, the method includes:
Step 401: the server performs frame extraction on the target video to obtain the video frame sequence corresponding to the target video.
The source and type of the target video are not limited: the server may obtain it from an online service stream, a database or another data source, and it may be a video published by a creator, a chat video from an instant messaging session, or an advertisement video published by an e-commerce platform.
In practical implementation, after acquiring the target video, the server determines the time points at which frames are to be extracted. Frames may be extracted uniformly in time, for example one time point per second, i.e., one frame per second; or video frames may be extracted at specified time points, which may be uniform or non-uniform, e.g., at 1 second, 2 seconds and 5 seconds.
In some embodiments, the frame extraction may be performed as follows: acquire the video frame rate of the target video and the duration of the target video; determine the number of frames to extract from the video frame rate and the duration; and extract that number of frames from the target video, obtaining a video frame sequence containing that number of video frames.
Here, the video frame rate measures the number of frames displayed per second. In practical implementation, the extraction time points may be determined from the frame rate, i.e., within each second, as many time points as the frame rate: for example, at a frame rate of 25, 25 video frames are extracted per second, one every 0.04 seconds. The number of frames to extract then follows from the frame rate and the duration: for a frame rate of 25 and a duration of 4 seconds, 100 video frames are extracted in total, and they form the video frame sequence ordered by play time point.
In this way, acquiring the video frame rate and the duration of the target video and determining the number of frames to extract from them makes the extracted video frames more coherent, which facilitates the subsequent shot splitting.
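For illustration, the following is a minimal sketch of this uniform frame extraction, assuming OpenCV is available; the function name and the frames_per_second parameter are illustrative and not part of the patent.

```python
# Sketch: uniform frame extraction from a video, assuming OpenCV (cv2).
import cv2

def extract_frames(video_path, frames_per_second=None):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)                  # video frame rate
    # Sample at the native rate (every frame) unless a lower, uniform
    # sampling rate is requested.
    rate = frames_per_second or fps
    step = max(int(round(fps / rate)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                     # keep frames in play order
        index += 1
    cap.release()
    return frames                                    # the video frame sequence
```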
Step 402: perform feature extraction on each video frame in the video frame sequence to obtain the image embedding feature of each video frame.
In some embodiments, feature extraction may be performed on each video frame in the sequence through a first image embedding model to obtain the image embedding feature of each video frame.
Here, the first image embedding model is trained on a training set. In practical implementation, the server constructs a neural network model, such as a convolutional neural network, inputs a training set containing multiple image samples, computes the value of a loss function from the output, back-propagates that value through each layer of the network, and updates the model parameters of each layer by Stochastic Gradient Descent (SGD), thereby training the network into the first image embedding model. Once the first image embedding model is obtained, it is used to extract features from the video frames in the video frame sequence.
In this way, the features of each video frame in the video frame sequence are extracted through the first image embedding model, yielding the image embedding feature of each video frame.
In some embodiments, the first image embedding model may use ResNet101 as its network structure. Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a ResNet module provided in an embodiment of the present application: a 1×1 convolution reduces the 256-dimensional input to 64 dimensions, a 3×3 convolution follows, and a final 1×1 convolution restores the dimensionality to 256, which reduces the amount of parameter computation.
Here, ResNet101 comprises a feature module and an embedding module. Table 1 shows the feature module structure of ResNet101 provided in the embodiment of the present application: the feature module comprises 5 convolutional stages, Conv1, Conv2_x, Conv3_x, Conv4_x and Conv5_x, where Conv1 is a 7×7×64 convolution with stride 2, Conv2_x comprises a 3×3 max pooling layer and 3 ResNet modules (blocks), and Conv3_x, Conv4_x and Conv5_x comprise 4, 23 and 3 ResNet modules, respectively. Table 2 shows the embedding module structure of ResNet101: the embedding module comprises a max pooling layer.
Layer name Layer(s)
Conv1 7×7×64 convolution, stride 2
Conv2_x 3×3 max pool; 3 ResNet modules
Conv3_x 4 ResNet modules
Conv4_x 23 ResNet modules
Conv5_x 3 ResNet modules
Table 1 Feature module structure of ResNet101
Layer name Output size Layer(s)
Pool_cr 1×2048 Max pooling layer
Table 2 Embedding module structure of ResNet101
In practical implementation, during training a classification layer, i.e., a fully connected layer, may be appended after the max pooling layer to classify images based on the image embedding features output by the max pooling layer. In practical application, images labeled with an image class are used as training samples: a sample is input into the ResNet101 structure, the corresponding image embedding feature is output and fed into the fully connected layer, which outputs the image class of the sample; the value of the loss function is determined from the difference between the predicted class and the labeled class, and the model parameters of each layer are then updated based on that value.
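As an illustration of this network, the following sketch assumes PyTorch and torchvision: a ResNet101 backbone whose stock classification head is replaced by a max-pooling embedding head, mirroring Table 2, plus an optional fully connected classification layer used only during training. All names and the pooling choice are illustrative assumptions.

```python
# Sketch: ResNet101 feature module + max-pooling embedding module, with an
# optional classification layer for training; assumes PyTorch + torchvision.
import torch.nn as nn
from torchvision import models

class EmbeddingNet(nn.Module):
    def __init__(self, num_classes=None):
        super().__init__()
        backbone = models.resnet101(weights=None)
        # Keep Conv1..Conv5_x; drop the stock average-pool and fc head.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveMaxPool2d(1)          # embedding module (max pool)
        self.classifier = nn.Linear(2048, num_classes) if num_classes else None

    def forward(self, x):
        x = self.features(x)                         # B x 2048 x H x W feature map
        emb = self.pool(x).flatten(1)                # B x 2048 image embedding
        if self.classifier is not None:
            return emb, self.classifier(emb)         # embedding + class logits
        return emb
```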
Step 403: acquire, based on the image embedding feature of each video frame, the similarity between the image embedding features of adjacent video frames in the video frame sequence.
Here, the similarity may be computed based on the Minkowski distance, the Euclidean distance, the cosine similarity, the Pearson correlation coefficient, or the like.
Taking cosine similarity as an example: the image embedding features can be represented as vectors, so for any two adjacent video frames in the video frame sequence, the image embedding features of the two frames are obtained, the cosine of the angle between the two feature vectors is computed, and that cosine is taken as the similarity between the two image embedding features.
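A minimal sketch of this step, assuming NumPy, with the frame embeddings stacked row-wise in play order:

```python
# Sketch: cosine similarity between the embeddings of adjacent frames.
import numpy as np

def adjacent_similarities(embeddings):
    e = np.asarray(embeddings, dtype=np.float32)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)   # L2-normalize each row
    # Row-wise dot product of each frame with its successor = cosine similarity.
    return np.sum(e[:-1] * e[1:], axis=1)
```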
Step 404: split the video frame sequence into at least two initial shots based on the similarity between the image embedding features of adjacent video frames.
Each initial shot comprises at least one video frame.
In practical implementation, it is judged whether the similarity between the image embedding features of adjacent video frames satisfies a similarity condition; if so, the adjacent frames belong to the same shot, and the video frame sequence is split accordingly. The video frames within one initial shot are contiguous in play time.
In some embodiments, a similarity threshold is acquired, and any two adjacent video frames whose image-embedding similarity reaches the threshold are divided into the same shot, yielding at least two initial shots.
In practical implementation, for any two adjacent video frames, a first video frame and a second video frame in play-time order, it is judged whether the similarity between their image embedding features reaches the similarity threshold; if so, the second video frame is divided into the same initial shot as the first; otherwise, a new initial shot is created and the second video frame is divided into it.
FIG. 6 is a schematic diagram of the process of determining initial shots according to an embodiment of the present application. Referring to FIG. 6: the similarity between the image embedding features of the first and second video frames reaches the similarity threshold, so both are divided into initial shot 1; the similarity between the second and third video frames reaches the threshold, so the third frame joins the same initial shot as the second, namely initial shot 1; the similarity between the third and fourth video frames does not reach the threshold, so the fourth frame is divided into a different initial shot, initial shot 2; and the similarity between the fifth and fourth video frames reaches the threshold, so the fifth frame joins the same initial shot as the fourth, namely initial shot 2.
In some embodiments, a first similarity threshold and a second similarity threshold are acquired, the first being greater than the second; two adjacent video frames whose image-embedding similarity reaches the first threshold are divided into the same shot, and two adjacent video frames whose image-embedding similarity does not reach the second threshold are divided into different shots, yielding at least two initial shots.
In practical implementation, for any two adjacent video frames, a first video frame and a second video frame in play-time order, it is first judged whether the similarity between their image embedding features reaches the first similarity threshold; if so, the second frame is divided into the same initial shot as the first. If not, it is judged whether the similarity reaches the second similarity threshold; if so, the second frame is a transition frame and is not divided into any initial shot; otherwise, a new initial shot is created and the second frame is divided into it.
For example, FIG. 7 is a schematic diagram of the process of determining initial shots according to an embodiment of the present application. Referring to FIG. 7: the similarity between the first and second video frames reaches the first similarity threshold, so both are divided into initial shot 1; the similarity between the second and third frames reaches the first threshold, so the third frame joins initial shot 1; the similarity between the third and fourth frames does not reach the first threshold but does reach the second, so the fourth frame is not divided into any initial shot; and the similarity between the fifth and fourth frames reaches neither the first nor the second threshold, so the fifth frame is divided into a different initial shot, initial shot 2.
In practical applications, a video frame sequence contains transition frames, i.e., frames inserted during shot changes to improve the visual effect, such as fade-ins and fade-outs, cross-dissolves, and zooms. The present application leaves such frames out of the shot splitting: two frames whose similarity reaches the first threshold are put into the same shot, two frames whose similarity does not reach the second threshold are put into different shots, and frames falling between the two thresholds are not considered. This prevents transition frames from affecting the overall splitting result and improves the accuracy of the shot splitting.
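A sketch of one reading of this dual-threshold rule, where sims[i] is the similarity between frame i and frame i+1; how an unassigned transition frame interacts with the next comparison is an assumption, since the text only describes pairwise adjacent comparisons:

```python
# Sketch: dual-threshold shot splitting over adjacent-frame similarities.
def split_shots(sims, thr1, thr2):
    shots = [[0]]                        # frame 0 opens the first shot
    for i, s in enumerate(sims):         # s compares frame i with frame i + 1
        if s >= thr1:
            shots[-1].append(i + 1)      # similar enough: same shot
        elif s < thr2:
            shots.append([i + 1])        # dissimilar: open a new shot
        # thr2 <= s < thr1: transition frame, assigned to no shot
    return shots
```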
In some embodiments, the first and second similarity thresholds may be obtained as follows: acquire at least two triplets, each comprising a reference image, a positive image belonging to the same shot as the reference image, and a negative image belonging to a different shot; acquire at least two candidate similarity thresholds; determine, based on the triplets, the shot-splitting accuracy and the shot-splitting balance score corresponding to each candidate similarity threshold; take the candidate threshold with the highest balance score as the first similarity threshold; and take the candidate threshold with the highest accuracy as the second similarity threshold.
In practical implementation, a set of candidate similarity thresholds is preset, for example one candidate every 0.01 from 0.05 to 0.99, and the candidates are traversed. For each candidate, the image embedding features of the reference, positive and negative images of each triplet are obtained, along with the similarity between the reference and positive features and between the reference and negative features. The number of triplets in which the reference-positive similarity is greater than the candidate threshold while the reference-negative similarity is smaller than it is counted, and the ratio of this count to the total number of triplets is the shot-splitting accuracy. A first count is taken of the triplets whose reference-positive similarity is greater than the candidate threshold, and a second count of the triplets whose reference-negative similarity is greater than it; the ratio of the first count to the sum of the two counts is taken as the recall rate.
After the shot-splitting accuracy and recall rate corresponding to each candidate threshold are obtained, the balance score is computed as

F = 2 × precision × recall / (precision + recall),

where precision is the shot-splitting accuracy and recall is the recall rate. According to the acquired accuracies and balance scores, the candidate similarity threshold with the highest balance score is taken as the first similarity threshold, and the candidate similarity threshold with the highest accuracy as the second similarity threshold.
In practical applications, if the second similarity threshold obtained in this way is greater than the first, the second threshold may instead be determined as thr2 = thr1 - 0.05, where thr1 is the first similarity threshold and thr2 is the second.
Because the first and second similarity thresholds are selected based on the image embedding features of the images in the triplets, they can be adjusted for different image embedding models; compared with fixed thresholds, thresholds determined in this way make the shot splitting more accurate.
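A sketch of this threshold search, assuming paired arrays of reference-positive and reference-negative similarities computed from the triplets; the candidate grid follows the 0.05-0.99, step 0.01 example, and the fallback ordering follows the thr2 = thr1 - 0.05 rule above:

```python
# Sketch: sweep candidate thresholds; score each by accuracy and F-score.
import numpy as np

def select_thresholds(pos_sims, neg_sims, candidates=np.arange(0.05, 1.0, 0.01)):
    best_f = best_acc = -1.0
    thr1 = thr2 = None
    n = len(pos_sims)
    for thr in candidates:
        first = np.sum(pos_sims > thr)               # ref-pos above threshold
        second = np.sum(neg_sims > thr)              # ref-neg above threshold
        accuracy = np.sum((pos_sims > thr) & (neg_sims < thr)) / n
        recall = first / (first + second) if first + second else 0.0
        f = 2 * accuracy * recall / (accuracy + recall) if accuracy + recall else 0.0
        if f > best_f:
            best_f, thr1 = f, thr                    # highest balance score
        if accuracy > best_acc:
            best_acc, thr2 = accuracy, thr           # highest accuracy
    if thr2 > thr1:
        thr2 = thr1 - 0.05                           # enforce thr1 > thr2
    return thr1, thr2
```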
Step 405: aggregate the at least two initial shots to obtain at least two target shots.
In practical implementation, content of the same shot can appear at different play times; for example, a video of a conversation between two people may switch back and forth between the two camera positions.
For example, FIG. 8 is a schematic diagram of the aggregation effect provided by an embodiment of the present application. Referring to FIG. 8, initial shot 801A and initial shot 801B in FIG. 8 were captured by the same camera; although other shots are inserted between them, they can be aggregated into one target shot. Likewise, initial shot 802A and initial shot 802B were captured by the same camera and are aggregated into one target shot.
In some embodiments, the at least two initial shots may be aggregated into at least two target shots as follows: acquire the similarity between the initial shots; and aggregate the at least two initial shots based on those similarities, such that the similarity between any two initial shots aggregated into the same target shot reaches a similarity threshold.
Here, two or more initial shots may be aggregated into the same target shot. In practical implementation, for each initial shot, the similarity between it and every other initial shot is acquired; if the similarity between two initial shots reaches the similarity threshold, they are aggregated into one target shot. For a further initial shot to be aggregated into that target shot, its similarity with every initial shot already in the target shot must reach the similarity threshold.
For example, if initial shots A, B and C are merged into one target shot, then the similarity between shot A and shot B, between shot A and shot C, and between shot B and shot C each reach the similarity threshold.
In some embodiments, the similarity between initial shots may be obtained as follows: when the initial shots include at least two video frames, take any two of the at least two initial shots as a first initial shot and a second initial shot and perform the following processing on them: for each target video frame in the first initial shot, acquire the similarity between that frame and every video frame in the second initial shot, obtaining at least two similarities corresponding to the target video frame; determine the maximum similarity corresponding to each target video frame in the first initial shot from its similarities; acquire a first number, the number of target video frames whose maximum similarity reaches a third similarity threshold, and a second number, the total number of target video frames in the first initial shot; and take the ratio of the first number to the second number as the similarity between the first and second initial shots.
In practical implementation, if the first initial shot contains m video frames and the second contains n, then for each target video frame in the first shot, its similarities to the n frames of the second shot yield n values whose maximum is the maximum similarity of that target frame; the m frames of the first shot thus yield m maximum similarities. These are compared with the third similarity threshold, the first number is how many of the m maxima reach the threshold, and the ratio of the first number to m is the similarity between the first and second initial shots.
Here, when determining the similarities, the image embedding feature of each video frame is acquired, and the similarities between the target video frame and the n frames of the second initial shot are the similarities between their image embedding features.
For example, suppose the first initial shot contains 5 video frames A, B, C, D, E and the second contains 4 video frames F, G, H, J. The similarities between the embedding of A and the embeddings of F, G, H, J are, say, 0.8, 0.75, 0.82 and 0.71, so the highest value, 0.82, is taken as the similarity corresponding to A. In the same manner, the similarities corresponding to B, C, D, E are obtained; if the similarities corresponding to A, B, C, D, E are 0.82, 0.73, 0.78, 0.77 and 0.74 and the third similarity threshold is 0.75, three of the five reach the threshold and the similarity between the first and second initial shots is 0.6.
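A sketch of this ratio-based shot similarity, assuming NumPy with one row per frame embedding; thr3 stands in for the third similarity threshold:

```python
# Sketch: fraction of shot-1 frames whose best match in shot 2 clears thr3.
import numpy as np

def shot_similarity_ratio(emb1, emb2, thr3):
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sims = e1 @ e2.T                     # m x n frame-to-frame cosine matrix
    best = sims.max(axis=1)              # max similarity per frame of shot 1
    return float(np.mean(best >= thr3))  # first number / second number
```

On the worked example above (maxima 0.82, 0.73, 0.78, 0.77, 0.74 with thr3 = 0.75), this returns 3/5 = 0.6.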
In some embodiments, the similarity between initial shots may instead be obtained as follows: when the initial shots include at least two video frames, take any two of the at least two initial shots as a first initial shot and a second initial shot and perform the following processing on them: when the first initial shot precedes the second in play order, acquire the target video frame with the latest play time point in the first initial shot; acquire the similarity between that frame and each video frame in the second initial shot, obtaining at least two similarities corresponding to the target video frame; and take the average of the at least two similarities as the similarity between the first and second initial shots.
In practical implementation, the video frames within one shot should generally be contiguous in play time; accordingly, the last target video frame of the earlier initial shot can be matched against every video frame of the later initial shot to obtain the similarity between it and each frame, and the average of those similarities is the similarity between the first and second initial shots.
For example, if the first initial shot contains 5 video frames A, B, C, D, E (sorted by play time point, E being the last) and the second contains 4 video frames F, G, H, J, then the similarities between the embedding of E and the embeddings of F, G, H, J, say 0.8, 0.75, 0.82 and 0.71, average to 0.77, and 0.77 is taken as the similarity between the first and second initial shots.
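A companion sketch of this boundary-frame measure, under the same assumptions as above:

```python
# Sketch: average similarity between the last frame of the earlier shot and
# every frame of the later shot.
import numpy as np

def shot_similarity_boundary(emb1, emb2):
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sims = e2 @ e1[-1]                   # last frame of shot 1 vs each frame of shot 2
    return float(sims.mean())
```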
In some embodiments, after the at least two initial shots have been aggregated into at least two target shots, the server may further screen shot-splitting hard cases out of the target shots; acquire the manual shot splitting result corresponding to each hard case, the manual result indicating which video frames belong to the same shot; and construct a training set based on the hard cases and the manual shot splitting results, updating the model parameters of the first image embedding model through the training set to obtain a second image embedding model.
In practical implementation, after the target shots are obtained, hard cases can be screened from the at least two target shots, a hard case being a target shot judged to be wrongly split according to a hard-case criterion, for example that the number of video frames in the target shot does not reach a count threshold. The screened hard cases are then judged manually, i.e., it is judged whether the video frames in each hard case belong to the same shot; when they do not, the video frames of that target shot are re-split, yielding the manual shot splitting result. A training set is constructed from the hard cases and the manual results and input into the first image embedding model, whose model parameters are updated to obtain the second image embedding model.
By constructing a training set based on the hard cases and the manual shot splitting results, and updating the model parameters of the first image embedding model through it to obtain a second image embedding model, the first model can be updated according to the shot splitting results, improving its feature expression ability and in turn the accuracy of subsequent shot splitting.
In practical implementation, after the target shots are obtained, each target shot can be checked for splitting errors, and if an error is found the shot is marked as a hard case. In some embodiments, hard cases may be screened from the at least two target shots as follows: acquire the number of video frames in each target shot and the average number of video frames per target shot; determine, for each target shot, the ratio of its frame count to the average; and treat any target shot whose ratio does not reach the target ratio as a hard case.
In actual implementation, the frame counts of all target shots are summed, and the ratio of that sum to the number of target shots is the average number of video frames per target shot; for each target shot, the ratio of its frame count to this average is determined, and if that ratio is less than the target ratio the shot is a hard case.
For example, with the target ratio set to 1/4 and five target shots containing 8, 12, 13, 2 and 10 video frames respectively, the average number of video frames per target shot is 9; the ratio of the 4th shot's frame count to the average is less than 1/4, so that target shot is a hard case.
In some embodiments, whether a target shot is a hard case may also be determined directly from its number of video frames, i.e., whether the count reaches a count threshold; if not, the target shot is a hard case. For example, any target shot containing fewer than 2 video frames may be judged a hard case.
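A sketch of the frame-count screening, with each shot represented by its list of frames; target_ratio = 0.25 reflects the 1/4 example above:

```python
# Sketch: flag shots whose frame count falls below target_ratio x average.
def screen_hard_cases(shots, target_ratio=0.25):
    avg = sum(len(s) for s in shots) / len(shots)    # average frames per shot
    return [s for s in shots if len(s) / avg < target_ratio]
```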
In some embodiments, the hard cases of the objective lens can be screened out from at least two objective lenses by: when the target sub-mirror comprises at least two video frames, for each target sub-mirror the following is performed: the method comprises the steps that video frames with a target proportion are obtained according to a playing time point sequence from a first video frame in a target split mirror to obtain a first video frame subset, and video frames with the target proportion are obtained in a reverse sequence according to the playing time point from a last video frame in the target split mirror to obtain a second video frame subset; acquiring the similarity between the first video frame subset and the second video frame subset; and when the similarity does not reach the fourth similarity threshold, determining that the target lens splitting is a lens splitting difficult example.
In the process of implementing the present application, the applicant found that a common hard split-mirror case is an erroneous split mirror consisting of scene a → short-time transition video frames → scene b. Based on this, the present application determines hard split-mirror cases by comparing the video frames of the front target proportion with those of the rear target proportion in a target split mirror, so that the short-time transition video frames in the middle do not reduce the accuracy of determining hard split-mirror cases.
In practical implementation, the similarity between the first video frame subset and the second video frame subset can be calculated in the same manner as the similarity between initial split mirrors is determined.
For example, if the target proportion is set to 1/3, then, in the order of playing time points, the front 1/3 of the video frames in the target split mirror are taken as the first video frame subset and the rear 1/3 as the second video frame subset. For each target video frame in the first video frame subset, the similarity between the target video frame and each video frame in the second video frame subset is acquired, and the maximum among these similarities is taken as the similarity corresponding to the target video frame. The similarities corresponding to all target video frames in the first video frame subset are acquired, and each is compared with a third similarity threshold to determine the number of similarities reaching the third similarity threshold. The ratio of this number to the number of video frames in the first video frame subset is the similarity between the first video frame subset and the second video frame subset; when this similarity does not reach a fourth similarity threshold, the target split mirror is determined to be a hard split-mirror case.
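As an illustration of this computation, the following sketch assumes each frame is represented by an L2-normalized graph-embedding vector; the names and the concrete threshold values (0.9 per frame, 0.3 at the subset level, matching values that appear elsewhere in this description) are assumptions for the example.

```python
import numpy as np

def subset_similarity(first, second, third_thr=0.9):
    """first, second: (n, d) arrays of L2-normalized graph-embedding features;
    returns the fraction of frames in `first` whose best match in `second`
    reaches the third similarity threshold."""
    sims = first @ second.T            # pairwise cosine similarities
    best = sims.max(axis=1)            # max similarity per frame of `first`
    return float((best >= third_thr).mean())

def is_hard_case(mirror_feats, target_proportion=1/3, fourth_thr=0.3):
    """mirror_feats: (n, d) features of one target split mirror, ordered by
    playing time point."""
    k = max(1, int(len(mirror_feats) * target_proportion))
    head, tail = mirror_feats[:k], mirror_feats[-k:]   # front/rear subsets
    return subset_similarity(head, tail) < fourth_thr
```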
In some embodiments, the training set comprises a plurality of triplets including a reference image, a positive image belonging to the same segment as the reference image, and a negative image belonging to a different segment from the reference image; the model parameters of the first graph embedding model may be updated based on the values of the loss function by: inputting a reference image, a positive image and a negative image in the triple into a first image embedding model; carrying out forward calculation on the reference image, the positive image and the negative image through the first image embedding model, and predicting to obtain image embedding characteristics of the reference image, the positive image and the negative image; acquiring a first difference between the graph embedding feature of the reference image and the graph embedding feature of the positive image and a second difference between the graph embedding feature of the reference image and the graph embedding feature of the negative image, and determining the value of a loss function of the first graph embedding model based on the first difference and the second difference; model parameters of the first graph embedding model are updated based on the values of the loss function.
In practical implementation, a plurality of hard-case pairs A-B can be determined according to the hard split-mirror cases and the manual split-mirror results, where A and B each represent a split mirror, the video frames within A are similar to each other, the video frames within B are similar to each other, and the video frames of A are dissimilar to those of B; that is, frames within a set are similar to each other and frames across sets are dissimilar. A plurality of triples is extracted from each hard-case pair; for example, with A = {a1, a2, …} and B = {B1, B2, B3, …}, triples such as (a1, a2, B1) and (a1, a3, B2) can be extracted. Taking (a1, a2, B1) as an example, a1 is the reference image, a2, which belongs to the same split mirror as a1, is the positive image, and B1, which belongs to a different split mirror from a1, is the negative image.
After the triples are obtained, the images in the triples are input into a first image embedding model, forward calculation is carried out on the reference image, the positive image and the negative image through the first image embedding model, and image embedding characteristics of the reference image, the positive image and the negative image are obtained through prediction. Here, in updating the first map embedding model, the learning task is to increase the degree of similarity between the reference image and the positive image and decrease the degree of similarity between the reference image and the negative image, that is, to make the distance between the map embedding feature of the reference image and the map embedding feature of the positive image as small as possible and the distance between the map embedding feature of the reference image and the map embedding feature of the negative image as large as possible.
Based on the learning task, a loss function can be constructed as follows:
L(a, p, n) = max(||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + α, 0)
where a is the reference image, p is the positive image, n is the negative image, f is the first graph embedding model, and α is set to 0.2.
The value of the loss function is calculated based on this loss function and propagated back to each layer of the first graph embedding model, and the model parameters of each layer are updated by a stochastic gradient descent method, thereby completing one round of training. Here, multiple rounds of training may be performed to obtain the second graph embedding model.
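A minimal PyTorch sketch of one such training round follows, assuming `model` is the first graph embedding model and that features are compared by squared Euclidean distance; all names are illustrative, and the 0.2 margin follows the text.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, anchor, positive, negative, alpha=0.2):
    optimizer.zero_grad()
    fa, fp, fn_ = model(anchor), model(positive), model(negative)
    d_pos = (fa - fp).pow(2).sum(dim=1)   # first difference: anchor vs positive
    d_neg = (fa - fn_).pow(2).sum(dim=1)  # second difference: anchor vs negative
    loss = F.relu(d_pos - d_neg + alpha).mean()  # triplet loss with margin
    loss.backward()                       # back-propagate to every layer
    optimizer.step()                      # stochastic gradient descent update
    return loss.item()

# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # SGD, per the text
```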
In some embodiments, the server may further obtain a test set, and input the test set to the first graph embedding model and the second graph embedding model respectively to obtain a first prediction result corresponding to the first graph embedding model and a second prediction result corresponding to the second graph embedding model; determining a value of a loss function corresponding to the first graph embedding model based on the first prediction result; determining a value of a loss function corresponding to the second graph embedding model based on the second prediction result; and updating the first graph embedding model by using the second graph embedding model when the value of the loss function corresponding to the second graph embedding model is smaller than the value of the loss function corresponding to the first graph embedding model.
In practical implementation, the first graph embedding model is updated with the second graph embedding model only when the second graph embedding model is better than the first graph embedding model, so as to perform subsequent mirroring based on the second graph embedding model. Here, the first graph embedding model and the second graph embedding model are tested by a test set, and the first graph embedding model and the second graph embedding model are compared with each other with the value of the loss function as a parameter for measuring the merits of the models.
In some embodiments, the test set may include a standard test set and a hard-case test set. The value of the loss function of the first graph embedding model on the standard test set is denoted L00, and on the hard-case test set L01; the value of the loss function of the second graph embedding model on the standard test set is denoted L10, and on the hard-case test set L11. When L00 is greater than L10 and L01 is greater than L11, the first graph embedding model is updated with the second graph embedding model.
The standard test set is constructed based on uniformly distributed long video episode samples and is obtained by manual cleaning; the hard case test set is constructed based on a lens hard case.
Taking the construction of the standard test set as an example: a standard sample pair A-B is obtained and a plurality of triples is randomly extracted from it; for example, 10 triples form test subset 1. The extraction is performed one or more times to obtain one or more test subsets, which together form the standard test set; for example, 10 extractions yield a standard test set consisting of 10 test subsets. The hard-case test set is constructed in the same manner.
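A sketch of the triplet extraction used when building such a test subset is given below; drawing 10 triples per subset follows the example above, while the function name is an assumption.

```python
import random

def sample_triplets(a_frames, b_frames, num=10):
    """a_frames, b_frames: frame ids of the A and B split mirrors of one
    standard sample pair; returns `num` (anchor, positive, negative) triples."""
    triplets = []
    for _ in range(num):
        anchor, positive = random.sample(a_frames, 2)  # same split mirror
        negative = random.choice(b_frames)             # different split mirror
        triplets.append((anchor, positive, negative))
    return triplets

# test_subset_1 = sample_triplets(["a1", "a2", "a3"], ["B1", "B2", "B3"])
```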
It should be noted that the triples are extracted in the same manner as described above when the training set is constructed.
Then, each triple in the standard test set is input into the first graph embedding model, graph embedding features of the reference image, graph embedding features of the positive image and graph embedding features of the negative image in each triple are output, and the value of the loss function is calculated based on the obtained graph embedding features. When the standard test set includes a plurality of test subsets, the average value of the values of the loss functions of the plurality of test subsets is taken to be L00.
Correspondingly, each triple in the hard-case test set is input into the first graph embedding model, the graph embedding features of the reference image, the positive image, and the negative image in each triple are output, and the value of the loss function is calculated based on the obtained graph embedding features. When the hard-case test set includes a plurality of test subsets, the average of the values of the loss functions of the plurality of test subsets is taken as L01.
Here, the value L10 of the loss function of the second graph embedding model on the standard test set and the value L11 of its loss function on the hard-case test set can be obtained in the same manner as described above.
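The update gate described above can then be expressed compactly, as in the following sketch (the function names are assumptions):

```python
def test_set_loss(model, test_subsets, loss_fn):
    """Average the loss over all test subsets, as for L00, L01, L10, L11."""
    return sum(loss_fn(model, s) for s in test_subsets) / len(test_subsets)

def should_replace(l00, l01, l10, l11):
    """Replace the first model only if the second is better on both sets."""
    return l00 > l10 and l01 > l11
```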
By applying the embodiment, the video frame sequence corresponding to the target video is obtained by performing frame extraction processing on the target video; respectively extracting the characteristics of each video frame in the video frame sequence through a first image embedding model to obtain image embedding characteristics of each video frame; acquiring similarity between image embedding characteristics of adjacent video frames in a video frame sequence based on the image embedding characteristics of each video frame; based on the similarity between the image embedding characteristics of adjacent video frames, performing mirror splitting processing on the video frame sequence to obtain at least two initial mirrors, wherein each initial mirror comprises at least one video frame; compared with SIFT characteristics, the image embedding characteristics can more accurately describe video frames, and therefore accuracy of the lens splitting is improved; moreover, the polymerization is carried out on the initial split mirror, so that the internal polymerization degree in the target split mirror is higher, and the polymerization degree of the split mirror is improved.
Continuing to describe the video mirroring method provided in the embodiments of the present application, referring to fig. 9, fig. 9 is a schematic flowchart of a video mirroring method provided in an embodiment of the present application; the method is cooperatively implemented by a first terminal, a second terminal, and a server, and includes:
step 901: the first terminal sends the target video to the server.
Step 902: and the server performs frame extraction processing on the target video to obtain a video frame sequence corresponding to the target video.
Step 903: and the server respectively extracts the features of each video frame in the video frame sequence through the first image embedding model to obtain the image embedding features of each video frame.
Step 904: the server divides two video frames with the similarity between the image embedding characteristics reaching a first similarity threshold value into the same partial mirror, and divides two video frames with the similarity between the image embedding characteristics not reaching a second similarity threshold value into different partial mirrors so as to obtain at least two initial partial mirrors.
Step 905: and the server acquires the similarity between the initial split mirrors.
Step 906: and the server carries out aggregation processing on at least two initial sub-mirrors based on the similarity between the initial sub-mirrors to obtain at least two target sub-mirrors.
Step 907: the server obtains 1/3 video frames according to the sequence of playing time points from the first video frame in the target split mirror to obtain a first video frame subset, and obtains 1/3 video frames according to the reverse sequence of playing time points from the last video frame in the target split mirror to obtain a second video frame subset.
Step 908: the server obtains the similarity between the first video frame subset and the second video frame subset.
Step 909: and when the similarity does not reach the similarity threshold value, the server determines that the target mirror splitting is a mirror splitting difficult case.
Step 910: the server sends the split mirror case to the second terminal.
Step 911: and the second terminal acquires a manual lens splitting result corresponding to the lens splitting difficulty case and returns the result to the server.
Step 912: and the server constructs a training set based on the hard-to-split example and the manual split result, and updates the model parameters of the first graph embedding model through the training set to obtain a second graph embedding model.
Step 913: the server obtains the test set, and respectively inputs the test set to the first graph embedding model and the second graph embedding model to obtain a first prediction result corresponding to the first graph embedding model and a second prediction result corresponding to the second graph embedding model.
Step 914: the server determines a value of a loss function corresponding to the first graph embedding model based on the first prediction result.
Step 915: the server determines a value of a loss function corresponding to the second graph embedding model based on the second prediction result.
Step 916: and when the value of the loss function corresponding to the second graph embedding model is smaller than the value of the loss function corresponding to the first graph embedding model, the server updates the first graph embedding model by using the second graph embedding model.
By applying the embodiment, the first graph embedding model is obtained through data training, and compared with the SIFT characteristic, the graph embedding characteristic can describe a video frame more accurately, so that the accuracy of lens splitting is improved; according to the method, the initial split mirror is polymerized, so that the internal polymerization degree in the target split mirror is higher, and the polymerization degree of the split mirror is improved; and moreover, a hard split mirror case is obtained from the target split mirror to construct a training set, and model parameters of the first graph embedding model are updated, so that self-optimization is realized, and the split mirror accuracy is further improved.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described. Fig. 10 is a schematic flowchart of a video mirroring method provided in an embodiment of the present application, and referring to fig. 10, the video mirroring method provided in the embodiment of the present application includes:
step 1001: the server obtains the video.
Step 1002: and the server performs frame extraction on the video.
In practical implementation, the open source image library opencv is adopted to extract frames from the video, and the extracted frames are assembled into a video frame sequence.
For example, for a 10-second 25 fps video, a total of 250 video frames can be extracted.
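A minimal OpenCV frame-extraction sketch consistent with the above (the function name is an assumption):

```python
import cv2  # the open source image library opencv

def extract_frames(video_path):
    """Decode a video file into a video frame sequence."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()   # read the next decoded frame
        if not ok:
            break
        frames.append(frame)
    capture.release()
    return frames                    # e.g. 250 frames for a 10 s 25 fps video
```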
Step 1003: and the server acquires the image embedding characteristics of each video frame through a deep learning network model.
Feature extraction is performed on each video frame by a preset deep learning network model to obtain the graph embedding features of each video frame.

In practical implementation, resnet101 may be used as the network structure, where the open source ML-Images (a large-scale image classification data set) is used to train the model parameters to obtain the preset deep learning network model. Features are extracted from the video frame sequence frame by frame through this model: after each video frame is input into the deep learning network model, the result output from the Pool_cr layer in table 2 is used as its graph embedding feature, that is, a feature of 1 × 2048 dimensions.
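As an illustration, the following sketch truncates a standard torchvision ResNet-101 after global average pooling to obtain a 1 × 2048-dimensional feature per frame; the patent's own model is trained on ML-Images and reads its Pool_cr layer, so the torchvision backbone here is only an assumption standing in for that network.

```python
import torch
import torchvision

backbone = torchvision.models.resnet101(weights=None)  # resnet101 structure
extractor = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop fc layer
extractor.eval()

@torch.no_grad()
def embed(frame_batch):
    """frame_batch: (n, 3, H, W) float tensor of preprocessed video frames."""
    feats = extractor(frame_batch)   # (n, 2048, 1, 1) after global pooling
    return feats.flatten(1)          # (n, 2048) graph-embedding features
```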
Step 1004: and the server performs the split mirror processing based on the similarity between the image embedding characteristics of the adjacent video frames.
In practical implementation, a first similarity threshold and a second similarity threshold are preset (for example, the first similarity threshold is 0.9 and the second similarity threshold is 0.85), and a split-mirror result dictionary D = {1: [], 2: [], 3: [], 4: [], …, M: []} is initialized, where the key 1 represents the 1st initial split mirror and its list stores the subsequent split-mirror result (a list of frame ids); for example, D = {1: [1, 2, 3], 2: [4, 5], …} indicates that the 1st, 2nd, and 3rd frames belong to the 1st initial split mirror, and the 4th and 5th frames belong to the 2nd initial split mirror.

The 1st video frame is stored into the 1st initial split mirror, and the current split-mirror index is preset as shot = 1 (indicating the initial split mirror to which the current video frame belongs); the following operations are performed for each of the 2nd to Nth video frames x:

1) if the similarity between the graph embedding features of video frame x and video frame x-1 is greater than the first similarity threshold (thr1), x and x-1 belong to the same initial split mirror, and x is recorded in the same split-mirror list as x-1;

2) if the similarity between the graph embedding features of video frame x and video frame x-1 is smaller than the second similarity threshold (thr2), a new initial split mirror shot+1 is created in D, and video frame x is recorded in the new split-mirror list.

During the split-mirror processing, video frames whose similarity with the graph embedding features of the previous video frame lies between thr2 and thr1 are not retained.
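The two-threshold pass above can be sketched as follows, assuming `sim` returns the similarity between the graph embedding features of two frames; names and structure are illustrative.

```python
def initial_split(features, sim, thr1=0.9, thr2=0.85):
    """features: graph embeddings of frames 1..N (0-indexed list);
    returns the split-mirror result dictionary D of frame ids."""
    D = {1: [1]}   # the 1st video frame goes into the 1st initial split mirror
    shot = 1       # index of the current initial split mirror
    for x in range(2, len(features) + 1):
        s = sim(features[x - 1], features[x - 2])   # frame x vs frame x-1
        if s > thr1:       # same initial split mirror as frame x-1
            D[shot].append(x)
        elif s < thr2:     # a new initial split mirror is created
            shot += 1
            D[shot] = [x]
        # frames with thr2 <= s <= thr1 are not retained
    return D
```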
Step 1005: and the server carries out aggregation processing on the initial split mirror.
Here, for each initial split mirror p from initial split mirror 1 to initial split mirror M, the number n of video frames in p is acquired, and for each initial split mirror q after p up to initial split mirror M: if q has not been merged, it is judged whether the similarity between p and q is greater than 0.5, and if so, p and q are merged into one target split mirror; if q has already been merged, q is skipped.
The similarity between the initial partial mirror p and the initial partial mirror q can be obtained through the following method: for each target video frame in the initial split mirror p, acquiring the similarity between the target video frame and each video frame in the initial split mirror q, acquiring the maximum similarity among the acquired multiple similarities, taking the maximum similarity as the maximum similarity corresponding to the target video frame, acquiring the number nsim of the maximum similarity corresponding to each target video frame reaching a first similarity threshold, and taking nsim/n as the similarity between the initial split mirror p and the initial split mirror q.
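A sketch of this aggregation pass is given below, computing nsim/n between two initial split mirrors and merging greedily; the array names, the L2-normalization assumption, and the helper structure are illustrative.

```python
import numpy as np

def mirror_similarity(p_feats, q_feats, thr1=0.9):
    """p_feats: (n, d), q_feats: (m, d) L2-normalized features; returns nsim/n."""
    best = (p_feats @ q_feats.T).max(axis=1)   # best match per frame of p
    return float((best >= thr1).mean())

def aggregate(mirror_feats, merge_thr=0.5, thr1=0.9):
    """Greedy aggregation of initial split mirrors into target split mirrors."""
    merged = [False] * len(mirror_feats)
    targets = []
    for p in range(len(mirror_feats)):
        if merged[p]:
            continue                           # p was already merged; skip it
        group = [p]
        for q in range(p + 1, len(mirror_feats)):
            if not merged[q] and mirror_similarity(
                    mirror_feats[p], mirror_feats[q], thr1) > merge_thr:
                group.append(q)
                merged[q] = True
        targets.append(group)                  # one target split mirror
    return targets
```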
Step 1006: the server extracts the branch mirror difficult cases.
Here, hard split-mirror cases are obtained from the plurality of target split mirrors according to the length of each target split mirror and the similarity between the front and rear parts within each target split mirror.
In practical implementation, the average number of video frames of the target mirrors, namely the total number of video frames/the number of target mirrors, is calculated, and for each target mirror, if the number of video frames in the target mirror is less than the average number or the number of video frames in the target mirror is less than 2, the target mirror is determined as a mirror splitting hard case.
Alternatively, if the similarity between the video frame subsets formed by the front 1/3 and the rear 1/3 of the video frames in a split mirror is less than 0.3, the front 1/3 and the rear 1/3 are mutually hard split-mirror cases and are recorded as an A-B hard-case pair. In hard-case mining, because the data volume is huge and the manpower investment is limited, the internal polymerization degree of the mined hard-case data is expected to be high enough to avoid secondary labeling caused by poor internal polymerization; therefore a smaller threshold (0.3, compared with thr1) is adopted, so that the number of mined hard cases is small but the effect is accurate.
The method for calculating the similarity between the video frame subsets is the same as the method for determining the similarity between the initial split mirrors.
Step 1007: the server performs difficult data collection.
Here, the obtained split mirrors are judged manually, that is, whether the mirror splitting of each target split mirror is correct is judged, so as to mark the video frames belonging to the same split mirror.
And for the acquired difficult split mirror case pair, manually judging whether A and B are different split mirror contents, and if so, saving the difficult split mirror case pair.
Step 1008: and the server performs video basic split-mirror data accumulation.
Here, video frames belonging to the same segment are acquired for base segment data accumulation.
The method and the device need to maintain 3 data sets, namely a lens hard case test set, a lens data set and a lens standard test set. Firstly, initializing a lens-splitting hard case test set to be null; initializing a lens splitting data set into a lens splitting result corresponding to a target lens in a uniformly distributed long video episode sample P; initializing a standard lens test set into a long video episode sample Q (without intersection with P) which is uniformly distributed, and manually cleaning a lens result corresponding to a target lens to obtain the standard lens test set; and adding the above mirror-division difficult-case pairs into a mirror-division database, and simultaneously adding the mirror-division difficult-case pairs into a mirror-division difficult-case test set.
Here, each test set contains A-B hard-case pairs, namely {a1, a2, …} and {B1, B2, B3, …}, which represent split mirror A composed of a1, a2, … and split mirror B composed of B1, B2, B3, …; images within a set are similar to each other, and images across sets are dissimilar to each other. In each training, triples (Anchor, Positive, Negative) can be extracted as sample combinations, where Anchor is a reference image, Positive is a positive image belonging to the same lens as the reference image, and Negative is a negative image belonging to a different lens from the reference image; for example, (a1, a2, B1) and (a1, a3, B2) can be extracted from A-B. The task of the subsequent deep learning network model is to increase the Anchor-Positive similarity and decrease the Anchor-Negative similarity.
Step 1009: the server trains the deep learning network model.
Here, metric learning-based model training is achieved by the accumulated split-mirror data.
In practical application, an SGD-based gradient descent method is adopted to update the convolution template parameters w and bias parameters b of the deep learning network model; in each iteration, the prediction error is calculated and back-propagated to each layer of the deep learning network, the gradients are calculated, and the parameters of the convolutional neural network model are updated. The specific process is as follows: all parameters of the model are set to a state requiring learning; during training, forward calculation is performed on the three input images (Anchor, Positive, Negative) through the deep learning network model to obtain a prediction result, the value of the loss function is calculated based on the prediction result, the value of the loss function is propagated back through the deep learning network model, and the model parameters are updated by a stochastic gradient descent method, thereby realizing one parameter optimization. Here, 60 epochs are trained, and each epoch traverses all A-B sample pairs in the training set (10 triples are taken from each sample pair).
Here, the loss function is:

L(a, p, n) = max(||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + α, 0)

and the total loss is the sum over all triples in a batch: L = Σ L(a_i, p_i, n_i).
Wherein a is a reference image, p is a positive image, n is a negative image, f function is a deep learning network model, and alpha value is 0.2.
Step 1010: the server updates a deep learning network model for extracting graph-embedded features.
If the index test shows that the model obtained through training meets the update criterion, the deep learning network model is updated, so that video mirror splitting is subsequently realized through the updated deep learning network model.
In practical application, first, 10 triples are randomly extracted from all standard sample pairs A-B of the split-mirror standard test set to form test subset 1; this is done 10 times in total, forming 10 test subsets. For each test subset, the graph embedding features of the three images in each triple are extracted by the pre-trained deep learning network model, the value of the loss function is calculated according to the loss function, and the average of the loss values of all triples is recorded as Loss.

Then, the average of the loss values of the 10 test subsets is taken as the baseline index L00 corresponding to the split-mirror standard test set.

A hard-case test set is constructed in a similar manner, and the baseline index L01 corresponding to the hard-case test set is determined. If the hard-case test set is empty, L01 = L00 × 10.

Next, the features of the images in the triples of the standard test set are extracted with the updated deep learning network model to obtain the corresponding graph embedding features, and the value of the loss function is calculated, giving the value L10 of the loss function of the updated model on the standard test set; the features of the images in the triples of the hard-case test set are extracted with the updated deep learning network model, and the value of the loss function is calculated, giving the value L11 of the loss function of the updated model on the hard-case test set. L00 is compared with L10, and L01 with L11; when L00 is greater than L10 and L01 is greater than L11, the pre-trained deep learning network model is replaced with the updated deep learning network model.

Here, after it is determined that the pre-trained deep learning network model is to be replaced by the updated one, the similarity threshold is traversed from 0.05 to 0.99 (step size 0.01), the recall and accuracy over the 10 triple subsets of the standard test set are calculated, f1 is computed, the similarity threshold that yields the highest f1 is selected to replace the original thr1, and the similarity threshold that yields the highest accuracy is selected to replace the original thr2. If thr2 is greater than thr1, thr2 is set to thr1 - 0.05, so as to ensure that the relative sizes of the two thresholds can still distinguish different shots from the same shot.
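The threshold search can be sketched as below, assuming per-pair similarities with same-shot labels derived from the test triples; beyond the traversal range, step size, and selection rules stated above, the scoring details are assumptions.

```python
import numpy as np

def search_thresholds(sims, labels):
    """sims: similarity of each frame pair; labels: 1 = same shot, 0 = different."""
    sims, labels = np.asarray(sims), np.asarray(labels, dtype=bool)
    best_f1, best_prec, thr1, thr2 = -1.0, -1.0, 0.9, 0.85
    for t in np.arange(0.05, 0.99, 0.01):       # traverse with step size 0.01
        pred = sims >= t
        tp = int((pred & labels).sum())
        precision = tp / max(int(pred.sum()), 1)
        recall = tp / max(int(labels.sum()), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        if f1 > best_f1:
            best_f1, thr1 = f1, float(t)        # thr1: highest f1
        if precision > best_prec:
            best_prec, thr2 = precision, float(t)  # thr2: highest accuracy
    if thr2 > thr1:
        thr2 = thr1 - 0.05                      # keep the relative order
    return thr1, thr2
```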
And after the deep learning network model for extracting the graph embedding features is updated, continuing the next round of partial mirror and data and model iteration.
In some embodiments, the deep learning network model obtained through the above training on noisy data can be deployed on a cloud service to provide a video split-mirror service.
Fig. 11 is a schematic view of a processing flow of video mirroring provided in an embodiment of the present application, referring to fig. 11, a terminal a receives a video input by a user, uploads the video to a server, the server performs mirroring on the video by using the above method, extracts a hard-to-mirror case, outputs a mirroring result to a terminal B, and accumulates a mirroring database to update a deep learning network model.
The embodiment of the application has the following beneficial effects:
1) the video mirror splitting is carried out by adopting a deep learning network model based on data driving, so that the accuracy of the video mirror splitting is improved.
2) A weak-supervision active learning method is adopted to extract more accurate samples from massive video samples as a training set, and a standard test set for measuring the model effect is established through manual cleaning.
3) The self-optimization effect of the split-mirror system is realized through a closed-loop process of initial split-mirror, difficult-case backflow, model learning and split-mirror optimization, wherein the continuous iteration of the system under massive movie and television series data can be maintained only through limited manual cleaning.
The following continues to describe the video lens apparatus provided in the embodiments of the present application. Referring to fig. 12, fig. 12 is a schematic structural diagram of a video splitting device according to an embodiment of the present application, where the video splitting device according to the embodiment of the present application includes:
the frame extracting module 210 is configured to perform frame extracting processing on a target video to obtain a video frame sequence corresponding to the target video;
an extraction module 220, configured to perform feature extraction on each video frame in the video frame sequence to obtain an image embedding feature of each video frame;
an obtaining module 230, configured to obtain, based on the graph embedding feature of each video frame, a similarity between graph embedding features of adjacent video frames in the sequence of video frames;
a mirror splitting module 240, configured to perform mirror splitting on the video frame sequence based on a similarity between the image embedding features of the adjacent video frames to obtain at least two initial mirrors, where each initial mirror includes at least one video frame;
and the aggregation module 250 is configured to perform aggregation processing on the at least two initial partial mirrors to obtain at least two target partial mirrors.
In some embodiments, the frame extracting module 210 is further configured to obtain a video frame rate of the target video and a duration of the target video;
determining the number of frame extraction according to the video frame rate of the target video and the duration of the target video;
and performing frame extraction processing on the target video according to the frame extraction quantity to obtain a video frame sequence containing the video frames of the frame extraction quantity.
In some embodiments, the extraction module is further configured to perform feature extraction on each video frame in the video frame sequence through a first graph embedding model to obtain a graph embedding feature of each video frame;
the device further comprises:
the updating module is used for screening out a hard-case of the split mirrors from at least two target split mirrors, wherein the hard-case of the split mirrors is a target split mirror which is judged to be in error;
acquiring an artificial lens splitting result corresponding to the lens splitting difficulty case, wherein the artificial lens splitting result is used for indicating a video frame belonging to the same lens splitting;
and constructing a training set based on the hard-to-split example and the manual splitting result, and updating the model parameters of the first graph embedding model through the training set to obtain a second graph embedding model.
In some embodiments, the update module is further configured to obtain the number of video frames in each target sub-mirror and an average number of video frames corresponding to each target sub-mirror;
respectively determining the ratio of the number of the video frames in each target sub-mirror to the average number of the video frames;
and taking the target lens with the ratio of the number of the video frames to the average number not reaching the target ratio as a lens difficult example.
In some embodiments, the update module is further configured to, when the target segment includes at least two video frames, perform the following for each target segment:
starting from the first video frame in the target lens, obtaining the video frames of the target proportion according to the sequence of playing time points to obtain a first video frame subset, and
starting from the last video frame in the target split mirror, obtaining video frames with a target proportion according to the playing time point in a reverse order, and obtaining a second video frame subset;
acquiring the similarity between the first video frame subset and the second video frame subset;
and when the similarity does not reach the similarity threshold value, determining that the target lens splitting is a difficult lens splitting case.
In some embodiments, the training set comprises a plurality of triplets including a reference image, a positive image belonging to the same segment as the reference image, and a negative image belonging to a different segment from the reference image;
the updating module is further used for inputting the reference image, the positive image and the negative image in the triple into the first image embedding model;
forward calculation is carried out on the reference image, the positive image and the negative image through the first image embedding model, and image embedding characteristics of the reference image, the positive image and the negative image are obtained through prediction;
acquiring a first difference between a graph embedding feature of a reference image and a graph embedding feature of a positive image and a second difference between the graph embedding feature of the reference image and a graph embedding feature of a negative image, and determining a value of a loss function of the first graph embedding model based on the first difference and the second difference;
updating model parameters of the first graph embedding model based on the value of the loss function.
In some embodiments, the updating module is further configured to obtain a test set, and input the test set to a first graph embedding model and a second graph embedding model respectively to obtain a first prediction result corresponding to the first graph embedding model and a second prediction result corresponding to the second graph embedding model;
determining a value of a loss function corresponding to the first graph embedding model based on a first prediction result;
determining a value of a loss function corresponding to the second graph embedding model based on a second prediction result;
and updating the first graph embedding model by using the second graph embedding model when the value of the loss function corresponding to the second graph embedding model is smaller than the value of the loss function corresponding to the first graph embedding model.
In some embodiments, the mirror splitting module is further configured to obtain a first similarity threshold and a second similarity threshold, where the first similarity threshold is greater than the second similarity threshold;
dividing the two video frames with the similarity between the image-embedded features reaching the first similarity threshold into the same mirror,
and dividing the two video frames of which the similarity among the image embedding features does not reach the second similarity threshold value into different mirrors to obtain at least two initial mirrors.
In some embodiments, the mirror splitting module 240 is further configured to obtain at least two triplets, where the triplets include a reference image, a positive image belonging to the same mirror as the reference image, and a negative image belonging to a different mirror from the reference image;
obtaining at least two candidate similarity thresholds;
determining the split mirror accuracy and the split mirror balance score corresponding to each candidate similarity threshold based on the triples;
taking the candidate similarity threshold with the highest partial mirror balance score as a first similarity threshold;
and taking the candidate similarity threshold with the highest accuracy as a second similarity threshold.
In some embodiments, the aggregation module 250 is further configured to obtain a similarity between the initial mirrors;
and based on the similarity between the initial sub-mirrors, performing aggregation processing on the at least two initial sub-mirrors so as to enable the similarity between any two initial sub-mirrors aggregated to the same target sub-mirror to reach a similarity threshold value.
In some embodiments, the aggregating module 250 is further configured to, when the initial segments include at least two video frames, take the first initial segment and the second initial segment as any two initial segments of the at least two initial segments, and perform the following processing on any two initial segments:
aiming at any target video frame in a first initial lens, respectively obtaining the similarity between the target video frame and each video frame in a second initial lens to obtain at least two similarities corresponding to the target video frame;
determining the maximum similarity corresponding to each target video frame in the first initial split mirror based on at least two similarities corresponding to each target video frame in the first initial split mirror;
acquiring a first number of target video frames with the maximum similarity reaching a first similarity threshold and a second number of target video frames in a first initial split mirror;
and taking the ratio of the first number to the second number as the similarity between the first initial partial mirror and the second initial partial mirror.
In some embodiments, the aggregation module 250 is further configured to, when the initial split mirrors include at least two video frames, take the first initial split mirror and the second initial split mirror as any two initial split mirrors of the at least two initial split mirrors, and perform the following processing on any two initial split mirrors:
when the playing time sequence of the first initial split mirror is prior to that of the second initial split mirror, acquiring a target video frame with the last playing time point sequence in the first initial split mirror;
acquiring the similarity between the target video frame and each video frame in a second initial split mirror to obtain at least two similarities corresponding to the target video frame;
and taking the similarity average value of at least two similarities as the similarity between the first initial partial mirror and the second initial partial mirror.
By applying the embodiment, compared with the SIFT feature, the image embedding feature can more accurately describe the video frame, so that the accuracy of the lens splitting is improved; moreover, the polymerization is carried out on the initial split mirror, so that the internal polymerization degree in the target split mirror is higher, and the polymerization degree of the split mirror is improved.
An embodiment of the present application further provides a computer device, where the computer device may be a terminal or a server, see fig. 13, and fig. 13 is a schematic structural diagram of the computer device provided in the embodiment of the present application, and the computer device provided in the embodiment of the present application includes:
a memory 550 for storing executable instructions;
the processor 510 is configured to, when executing the executable instructions stored in the memory, implement the video mirroring method provided in the embodiments of the present application.
Here, the Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor or the like.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
At least one network interface 520 and user interface 530 may also be included in some embodiments. The various components in computer device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in FIG. 13.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the video split mirror method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, cause the processor to perform a method provided by embodiments of the present application, for example, the method shown in fig. 4.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for video mirroring, the method comprising:
performing frame extraction processing on a target video to obtain a video frame sequence corresponding to the target video;
extracting the characteristics of each video frame in the video frame sequence to obtain the image embedding characteristics of each video frame;
based on the image embedding characteristics of each video frame, acquiring the similarity between the image embedding characteristics of adjacent video frames in the video frame sequence;
based on the similarity between the image embedding characteristics of the adjacent video frames, performing mirror splitting processing on the video frame sequence to obtain at least two initial mirrors, wherein each initial mirror comprises at least one video frame;
and polymerizing the at least two initial sub-mirrors to obtain at least two target sub-mirrors.
2. The method of claim 1, wherein the decimating the target video to obtain a sequence of video frames corresponding to the target video comprises:
acquiring a video frame rate of a target video and the duration of the target video;
determining the number of frame extraction according to the video frame rate of the target video and the duration of the target video;
and performing frame extraction processing on the target video according to the frame extraction quantity to obtain a video frame sequence containing the video frames of the frame extraction quantity.
3. The method of claim 1, wherein said extracting features from each video frame in the sequence of video frames to obtain the map-embedded features of each video frame comprises:
extracting the features of each video frame in the video frame sequence through a first image embedding model to obtain image embedding features of each video frame;
after the at least two initial sub-mirrors are subjected to polymerization treatment to obtain at least two target sub-mirrors, the method further comprises:
screening out a hard case of a lens from at least two target lenses;
acquiring an artificial lens splitting result corresponding to the lens splitting difficulty case, wherein the artificial lens splitting result is used for indicating a video frame belonging to the same lens splitting;
and constructing a training set based on the hard-to-split example and the manual splitting result, and updating the model parameters of the first graph embedding model through the training set to obtain a second graph embedding model.
4. The method of claim 3, wherein the screening of the at least two target scopes for scope difficulty comprises:
acquiring the number of video frames in each target lens and the number of average video frames corresponding to each target lens;
respectively determining the ratio of the number of the video frames in each target sub-mirror to the average number of the video frames;
and taking the target lens with the ratio of the number of the video frames to the average number of the video frames not reaching the target ratio as a lens difficult example.
5. The method of claim 3, wherein the screening of the at least two target scopes for scope difficulty comprises:
when the target sub-mirror comprises at least two video frames, performing the following for each target sub-mirror:
starting from the first video frame in the target lens, obtaining the video frames of the target proportion according to the sequence of the playing time points to obtain a first video frame subset, and
starting from the last video frame in the target split mirror, obtaining video frames with a target proportion according to the playing time point in a reverse order, and obtaining a second video frame subset;
acquiring the similarity between the first video frame subset and the second video frame subset;
and when the similarity does not reach the similarity threshold value, determining that the target lens splitting is a difficult lens splitting case.
6. The method of claim 3, wherein the training set comprises a plurality of triplets including a reference image, a positive image that is the same mirror as the reference image, and a negative image that is a different mirror from the reference image;
updating the model parameters of the first graph embedding model through the training set, including:
inputting a reference image, a positive image and a negative image in the triple into the first image embedding model;
forward calculation is carried out on the reference image, the positive image and the negative image through the first image embedding model, and image embedding characteristics of the reference image, the positive image and the negative image are obtained through prediction;
acquiring a first difference between the map embedding feature of the reference image and the map embedding feature of the positive image, and a second difference between the map embedding feature of the reference image and the map embedding feature of the negative image, and
determining a value of a loss function of the first graph embedding model based on the first difference and the second difference;
updating model parameters of the first graph embedding model based on the value of the loss function.
7. The method of claim 3, wherein after updating the model parameters of the first graph embedding model, the method further comprises:
obtaining a test set, and inputting the test set to the first graph embedding model and the second graph embedding model respectively to obtain a first prediction result corresponding to the first graph embedding model and a second prediction result corresponding to the second graph embedding model;
determining a value of a loss function corresponding to the first graph embedding model based on the first prediction result;
determining a value of a loss function corresponding to the second graph embedding model based on the second prediction result;
and updating the first graph embedding model by using the second graph embedding model when the value of the loss function corresponding to the second graph embedding model is smaller than the value of the loss function corresponding to the first graph embedding model.
8. The method of claim 1, wherein said performing a split-mirror processing on said sequence of video frames based on a similarity between map-embedded features of said neighboring video frames to obtain at least two initial split-mirrors comprises:
acquiring a first similarity threshold and a second similarity threshold, wherein the first similarity threshold is greater than the second similarity threshold;
dividing two video frames with the similarity between the image embedding features reaching the first similarity threshold into the same partial mirror,
and dividing the two video frames of which the similarity among the image embedding features does not reach the second similarity threshold value into different sub-mirrors to obtain at least two initial sub-mirrors.
9. The method of claim 8, wherein obtaining the first similarity threshold and the second similarity threshold comprises:
acquiring at least two triples, wherein the triples comprise a reference image, a positive image belonging to the same lens with the reference image and a negative image belonging to different lenses with the reference image;
obtaining at least two candidate similarity thresholds;
determining the split mirror accuracy and the split mirror balance score corresponding to each candidate similarity threshold based on the triples;
taking the candidate similarity threshold with the highest partial mirror balance score as a first similarity threshold;
and taking the candidate similarity threshold with the highest accuracy as a second similarity threshold.
10. The method of claim 1, wherein said aggregating the at least two initial segments to obtain at least two target segments comprises:
acquiring the similarity between the initial partial mirrors;
and based on the similarity between the initial sub-mirrors, performing aggregation processing on the at least two initial sub-mirrors so as to enable the similarity between any two initial sub-mirrors aggregated to the same target sub-mirror to reach a similarity threshold value.
11. The method of claim 10, wherein said obtaining a similarity between initial components comprises:
when the initial sub-mirrors comprise at least two video frames, taking the first initial sub-mirror and the second initial sub-mirror as any two initial sub-mirrors in the at least two initial sub-mirrors, and executing the following processing on any two initial sub-mirrors:
aiming at any target video frame in a first initial lens, respectively obtaining the similarity between the target video frame and each video frame in a second initial lens to obtain at least two similarities corresponding to the target video frame;
determining the maximum similarity corresponding to each target video frame in the first initial split mirror based on at least two obtained similarities corresponding to each target video frame in the first initial split mirror;
acquiring a first number of target video frames with the maximum similarity reaching a third similarity threshold and a second number of target video frames in the first initial split mirror;
and taking the ratio of the first number to the second number as the similarity between the first initial partial mirror and the second initial partial mirror.
12. The method of claim 10, wherein said obtaining a similarity between initial components comprises:
when the initial sub-mirrors comprise at least two video frames, taking the first initial sub-mirror and the second initial sub-mirror as any two initial sub-mirrors in the at least two initial sub-mirrors, and executing the following processing on any two initial sub-mirrors:
when the playing time sequence of the first initial split mirror is prior to that of the second initial split mirror, acquiring a target video frame with the last playing time point sequence in the first initial split mirror;
acquiring the similarity between the target video frame and each video frame in a second initial split mirror to obtain at least two similarities corresponding to the target video frame;
and taking the similarity average value of at least two similarities as the similarity between the first initial partial mirror and the second initial partial mirror.
13. A video separation device, comprising:
the frame extracting module is used for carrying out frame extracting processing on a target video to obtain a video frame sequence corresponding to the target video;
the extraction module is used for extracting the characteristics of each video frame in the video frame sequence to obtain the image embedding characteristics of each video frame;
the acquisition module is used for acquiring the similarity between the image embedding characteristics of adjacent video frames in the video frame sequence based on the image embedding characteristics of each video frame;
the mirror splitting module is used for carrying out mirror splitting processing on the video frame sequence based on the similarity between the image embedding characteristics of the adjacent video frames to obtain at least two initial mirrors, and each initial mirror comprises at least one video frame;
and the polymerization module is used for polymerizing the at least two initial sub-mirrors to obtain at least two target sub-mirrors.
14. A computer device, comprising:
a memory for storing executable instructions;
a processor, configured to implement the video mirror splitting method of any one of claims 1 to 12 when executing the executable instructions stored in the memory.
15. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, implement the video mirror splitting method of any one of claims 1 to 12.
CN202110246190.5A 2021-03-05 2021-03-05 Video mirror splitting method, device, equipment and computer readable storage medium Pending CN113408332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110246190.5A CN113408332A (en) 2021-03-05 2021-03-05 Video mirror splitting method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110246190.5A CN113408332A (en) 2021-03-05 2021-03-05 Video mirror splitting method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113408332A true CN113408332A (en) 2021-09-17

Family

ID=77691410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110246190.5A Pending CN113408332A (en) 2021-03-05 2021-03-05 Video mirror splitting method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113408332A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114189754A (en) * 2021-12-08 2022-03-15 湖南快乐阳光互动娱乐传媒有限公司 Video plot segmentation method and system
CN114979764A (en) * 2022-04-25 2022-08-30 中国平安人寿保险股份有限公司 Video generation method and device, computer equipment and storage medium
CN114979764B (en) * 2022-04-25 2024-02-06 中国平安人寿保险股份有限公司 Video generation method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110119711B (en) Method and device for acquiring character segments of video data and electronic equipment
CN109492128B (en) Method and apparatus for generating a model
CN110942006B (en) Motion gesture recognition method, motion gesture recognition apparatus, terminal device, and medium
CN111428088A (en) Video classification method and device and server
CN110263215B (en) Video emotion positioning method and system
CN113709384A (en) Video editing method based on deep learning, related equipment and storage medium
CN105005777A (en) Face-based audio and video recommendation method and face-based audio and video recommendation system
CN112733660B (en) Method and device for splitting video strip
CN110941978B (en) Face clustering method and device for unidentified personnel and storage medium
CN113572981B (en) Video dubbing method and device, electronic equipment and storage medium
CN113408332A (en) Video mirror splitting method, device, equipment and computer readable storage medium
CN113408566A (en) Target detection method and related equipment
CN111860353A (en) Video behavior prediction method, device and medium based on double-flow neural network
CN110852224B (en) Expression recognition method and related device
CN115439927A (en) Gait monitoring method, device, equipment and storage medium based on robot
CN115205736A (en) Video data identification method and device, electronic equipment and storage medium
CN113825012B (en) Video data processing method and computer device
CN116261009B (en) Video detection method, device, equipment and medium for intelligently converting video audience
CN113392867A (en) Image identification method and device, computer equipment and storage medium
CN112270748A (en) Three-dimensional reconstruction method and device based on image
CN115953813B (en) Expression driving method, device, equipment and storage medium
US20220262099A1 (en) Learning data generation device, learning device, identification device, generation method and storage medium
CN108460768B (en) Video attention object segmentation method and device for hierarchical time domain segmentation
CN113920353B (en) Unsupervised face image secondary clustering method, unsupervised face image secondary clustering device and unsupervised face image secondary clustering medium
CN113240004B (en) Video information determining method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40051762
Country of ref document: HK

SE01 Entry into force of request for substantive examination