CN113449824A - Video processing method, device and computer readable storage medium - Google Patents


Info

Publication number
CN113449824A
Authority
CN
China
Prior art keywords
video
similarity
positioning
model
target video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111018263.1A
Other languages
Chinese (zh)
Other versions
CN113449824B (en)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yayue Technology Co ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111018263.1A priority Critical patent/CN113449824B/en
Publication of CN113449824A publication Critical patent/CN113449824A/en
Application granted granted Critical
Publication of CN113449824B publication Critical patent/CN113449824B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video processing method, a video processing device and a computer-readable storage medium. The method comprises the following steps: performing feature extraction on a target video to obtain a feature sequence comprising feature information of multiple frames of images in the target video, and calling a positioning point identification model to process the feature sequence to obtain positioning points of the target video, wherein the positioning point identification model is obtained by training based on accumulated feature information corresponding to each frame of image in the multiple frames of images of a sample video; dividing the target video into a plurality of video segments according to the positioning points; and calling a similarity judgment model to obtain the similarity among the video segments, and determining whether the target video is a cyclic video according to the similarity. By the method, the accuracy and efficiency of judging whether the target video is a cyclic video can be improved.

Description

Video processing method, device and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video processing method and apparatus, and a computer-readable storage medium.
Background
A cyclically played video, referred to as a loop video for short, is a video in which a segment is played repeatedly, either because the material is insufficient (for example, the original video is only 10 seconds long but a 60-second output is required) or because a highlight needs to be emphasized (for example, the moment of a goal is replayed). In a cyclically played video, two sub-segments are therefore always identical, which makes the detection of loop-played videos important.
At present, there is no systematic scheme for detecting whether a video is a loop-played video. The existing idea for detecting whether multiple segments of video are the same is mainly to determine whether the videos are identical by thresholding the similarity of the frame images corresponding to the multiple videos.
Under these circumstances, how to efficiently and accurately detect whether a video is a loop-played video is a technical problem that needs to be solved.
Disclosure of Invention
The embodiment of the application provides a video processing method and device and a computer readable storage medium, which can improve the accuracy and efficiency of judging whether a target video is a cyclic video.
An embodiment of the present application discloses a video processing method, which includes:
extracting features of a target video to obtain a feature sequence, wherein the feature sequence comprises feature information of a plurality of frames of images in the target video;
calling a positioning point identification model to process the feature sequence to obtain a positioning point of the target video, wherein the positioning point identification model is obtained by training based on accumulated feature information corresponding to each frame of image in a plurality of frames of images of a sample video;
dividing the target video into a plurality of video segments according to the positioning points;
and calling a similarity judgment model to obtain the similarity among the video clips, and determining whether the target video is a cyclic video according to the similarity.
An embodiment of the present application discloses a video processing apparatus, which includes:
the processing unit is used for extracting the characteristics of the target video to obtain a characteristic sequence, and the characteristic sequence comprises the characteristic information of a plurality of frames of images in the target video;
the processing unit is further configured to invoke a positioning point identification model to process the feature sequence to obtain a positioning point of the target video, wherein the positioning point identification model is obtained by training based on accumulated feature information corresponding to each frame of image in a plurality of frames of images of a sample video;
the processing unit is further configured to divide the target video into a plurality of video segments according to the positioning points;
the processing unit is further used for calling a similarity judgment model to obtain the similarity among the plurality of video clips;
and the determining unit is used for determining whether the target video is a circulating video according to the similarity.
An embodiment of the present application discloses a computer device in one aspect, where the computer device includes:
a processor adapted to implement one or more computer programs; and a computer storage medium storing one or more computer programs adapted to be loaded by the processor and to execute the video processing method as described above.
An aspect of the present application discloses a computer-readable storage medium storing one or more computer programs adapted to be loaded by a processor and to perform the above-mentioned video processing method.
An aspect of an embodiment of the present application discloses a computer program product, which includes a computer program, and the computer program is stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device executes the video processing method described above.
In an embodiment of the present application, a video processing method includes: firstly, performing feature extraction on a target video to obtain a feature sequence comprising feature information of multiple frames of images in the target video, and calling a positioning point identification model to process the feature sequence to obtain positioning points of the target video, wherein the positioning point identification model is obtained by training based on the accumulated feature information corresponding to each frame of image in the multiple frames of images of a sample video; then dividing the target video into a plurality of video segments according to the positioning points; and finally calling a similarity judgment model to obtain the similarity among the video segments, and determining whether the target video is a cyclic video according to the similarity. In judging whether the target video is a cyclic video, the positioning points are first determined by means of the positioning point identification model, which indicates at which time points the target video may repeat and thus realizes a temporal cycle judgment of the video; the video is then divided at the positioning points to obtain a plurality of video segments, and the similarity judgment model is used to judge the similarity between the features of any two of the video segments, which realizes a feature-level cycle judgment of the video. By combining the temporal cycle judgment and the feature-level cycle judgment to determine whether the target video is a cyclic video, the accuracy and efficiency of judging whether the target video is a cyclic video can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram of an architecture of a video processing system according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a video processing method disclosed in an embodiment of the present application;
fig. 3 is a block diagram of a video similarity determination according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a residual error network structure disclosed in the embodiment of the present application;
FIG. 5a is a structural diagram of sequence feature extraction based on a long short-term memory network according to an embodiment of the present application;
fig. 5b is a model structure diagram of a long short-term memory network disclosed in an embodiment of the present application;
fig. 5c is an internal structure diagram of a long short-term memory network disclosed in an embodiment of the present application;
FIG. 6 is a schematic flowchart of a process for training a positioning recognition model and a similarity determination model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a video processing system according to an embodiment of the present disclosure, and as shown in fig. 1, the video processing system 100 may at least include a plurality of first terminal devices 101, a plurality of second terminal devices 102, and a server 103, where the first terminal devices 101 and the second terminal devices 102 may be the same device or different devices. The first terminal device 101 and the second terminal device 102 are mainly used for sending a target video and receiving a similarity result of the target video; the server 103 is mainly used for executing relevant steps of the video processing method to obtain a similarity result. The first terminal device 101, the second terminal device 102, and the server 103 may implement communication connection, and the connection manner may include wired connection and wireless connection, which is not limited herein.
In a possible implementation manner, any of the first terminal device 101 and any of the second terminal device 102 mentioned above may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart car, and the like, but are not limited thereto; the server 103 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. Fig. 1 is a diagram illustrating an architecture of a video processing system, and is not intended to be limiting. For example, the server 103 in fig. 1 may be deployed as a node in a blockchain network, or the server 103 is connected to the blockchain network, so that the server 103 may upload video data and a result of whether the video is a looped video to the blockchain network for storage, to prevent internal data from being tampered, thereby ensuring data security.
With reference to the video processing system, the video processing method according to the embodiment of the present application may generally include: receiving a target video sent by a first terminal device 101 or a second terminal device 102; the server 103 performs feature extraction on the target video to obtain a feature sequence including feature information of multiple frames of images in the target video, and calls a positioning point identification model to process the feature sequence to obtain positioning points of the target video, wherein the positioning point identification model is obtained by training based on accumulated feature information corresponding to each frame of image in the multiple frames of images of a sample video; the target video is divided into a plurality of video segments according to the positioning points; and a similarity judgment model is called to obtain the similarity among the video segments, and whether the target video is a cyclic video is determined according to the similarity. Finally, the server 103 transmits the result of whether the target video is a loop video to the first terminal device 101 or the second terminal device 102. In judging whether the target video is a cyclic video, the positioning points are determined by means of the positioning point identification model, which indicates at which time points the target video may repeat and realizes a temporal cycle judgment of the video; the video is divided at the positioning points to obtain a plurality of video segments, and the similarity judgment model is used to judge the similarity between the features of any two of the video segments, which realizes a feature-level cycle judgment of the video. Combining the temporal cycle judgment and the feature-level cycle judgment to determine whether the target video is a cyclic video improves the accuracy and efficiency of the judgment.
In a possible implementation manner, the first terminal device 101 and the second terminal device 102 may be different devices, a specific scenario may be that the first terminal device 101 receives data (a target video input by a user), and then uploads the data to the server 103, the server 103 performs similarity judgment on the target video by using the video processing method provided in the embodiment of the present application to obtain a similarity result, and sends the similarity result to the second terminal device 102, where the scenario may correspond to a scenario in which the target video is checked, where a user corresponding to the first terminal device 101 may be a video uploader, and a user corresponding to the second terminal device 102 may be a reviewer.
In another possible implementation manner, the first terminal device 101 and the second terminal device 102 may be the same device. Taking the first terminal device 101 as an example, a specific scenario may be that the first terminal device 101 receives data (a target video input by a user) and uploads the data to the server 103, and the server 103 performs similarity judgment on the target video by using the video processing method provided in the embodiment of the present application, obtains a similarity result, and returns the similarity result to the first terminal device 101. This scenario may correspond to determining whether the target video is a cyclic video; for example, an application program for performing cycle judgment on videos can use the method, and any user can use the method to determine whether a video is a cyclic video, which makes it convenient for the user to clip the cyclic video.
In some possible embodiments, the video processing system may be deployed on a node on a blockchain. Meanwhile, the related data in the embodiment of the present application, such as the similarity result of the target video, may also be stored in the block chain, so as to facilitate the subsequent acquisition of the loop video.
Based on the above description of the architecture of the video processing system, an embodiment of the present application discloses a video processing method, please refer to fig. 2, which is a flowchart illustrating the video processing method disclosed in the embodiment of the present application, where the video processing method may be executed by a computer device, and the computer device may specifically be the server 103 in the video processing system. The video processing method specifically comprises the following steps S201-S204:
S201, extracting the features of the target video to obtain a feature sequence, wherein the feature sequence comprises feature information of a plurality of frames of images in the target video.
In the embodiment of the present application, the target video may refer to one single video. Before extracting the features of the target video to obtain the feature sequence, the computer device needs to acquire the target video, which may be sent by a user of the client, for detecting whether the target video is a circular video. In some implementation scenarios, the target video may also be pulled from the network in real time, for example, in an online auditing system, it needs to be determined whether the video is a cyclic video, so as to perform related auditing.
In a possible implementation manner, after the computer device obtains the target video, the computer device may perform feature extraction on the target video to obtain a feature sequence, where the feature sequence includes feature information of multiple frames of images in the target video, and the multiple frames of images may be a preset number of images. Further, after the computer device obtains the target video, frame extraction is performed on the target video to obtain multiple frames of images. The frame extraction mode may be sparse frame extraction: a video is originally very long and contains many frames per second, and sparse frame extraction keeps only the frames that carry the key information. Feature extraction is then performed on the extracted frames to obtain the feature information of each frame of image, and the feature information of the frames is combined to obtain the feature sequence. For example, after the target video is determined, 128 time-point images (128 being a preset number) can be uniformly extracted from the target video in proportion to its duration, so that uniform frame extraction is performed regardless of the length of the target video, and feature extraction is performed on the 128 images to obtain the feature sequence. The feature extraction network may be resnet101, or another feature extraction network, which is not limited herein.
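The step above can be illustrated with a minimal sketch. It assumes OpenCV for frame decoding and a torchvision ResNet-101 backbone (torchvision 0.13+ weights API); the patent does not prescribe these libraries, and all function names are illustrative.

```python
# Sketch of uniform frame sampling plus ResNet-101 feature extraction.
# Library choices (OpenCV, torchvision) are assumptions, not the patent's requirement.
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

def sample_frames(video_path: str, num_frames: int = 128) -> np.ndarray:
    """Uniformly sample `num_frames` frames regardless of video length."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

def extract_features(frames: np.ndarray) -> torch.Tensor:
    """Return a (num_frames, 2048) feature sequence from a ResNet-101 trunk."""
    backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled feature
    backbone.eval()
    preprocess = T.Compose([
        T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return backbone(batch)                 # shape: (num_frames, 2048)

# feature_sequence = extract_features(sample_frames("target_video.mp4"))
```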
S202, calling a positioning point identification model to process the feature sequence to obtain a positioning point of the target video, wherein the positioning point identification model is obtained by training based on accumulated feature information corresponding to each frame of image in a plurality of frames of images of a sample video.
The positioning point identification model determines, based on a long short-term memory network, the accumulated feature information of each feature in the feature sequence, and then determines identical video frame images from the accumulated feature information, so as to determine the positioning points. Specifically, the model determines the accumulated feature information of each feature in the feature sequence and derives the positioning points from how that accumulated information changes: when the accumulated feature information stops changing or changes abruptly, the corresponding time points are taken as positioning points, which include the start time point and the end time point of the loop segment. For example, suppose 30 frames of images are obtained after a video segment is sampled and the segment is processed by the positioning point identification model. From the 1st frame to the 12th frame, the accumulated feature information of each frame changes with respect to that of the previous frame; from the 13th frame to the 20th frame, the accumulated feature information stays unchanged with respect to that of the 12th frame; and from the 21st frame to the 30th frame, the accumulated feature information of each frame again changes with respect to that of the previous frame. It can then be determined that the video segment corresponding to frames 13 to 20 is a repetition of the video segment corresponding to frames 1 to 12, the 13th and 21st frames can be taken as the determined positioning points, the time point corresponding to the 13th frame can be regarded as the start time point of the loop video, and the time point corresponding to the 21st frame can be regarded as the end time point of the loop video.
In one possible implementation, after determining the feature sequence of the target video, the computer device may invoke the positioning point identification model to determine an autocorrelation matrix of the feature sequence, where the autocorrelation matrix indicates the degree of correlation between every two frames of the multiple frames of images. For example, for a 128-frame feature sequence the autocorrelation matrix is a 128x128 matrix Re, and Re[i, j] indicates the correlation between the information learned up to time i and the information learned up to time j; when segments 0 to i and i to j repeat, Re[i, j] indicates a strong correlation. The positioning point identification model is called to preprocess the autocorrelation matrix, where the preprocessing specifically includes processing the diagonal elements and taking the upper triangular part of the autocorrelation matrix, and maximum pooling is then performed on the preprocessed matrix to obtain the pooled result. Finally, the positioning point identification model is called to calculate the gradient of the pooled result, and the positioning points are determined from the gradient. A positioning point specifically refers to a time point and includes one or both of a positioning start point and a positioning end point, where a positioning start point indicates the predicted start time point of a loop video segment and a positioning end point indicates the predicted end time point of a loop video segment. The number of positioning points is not limited: there may be several, or only one, depending on the specific situation.
For example, for a sequence of time points 1-6 in which the sub-sequence 123 and the sub-sequence 456 form a cycle, the autocorrelation matrix Re of the sequence may be as in equation (1):

Re = [[1, 0, 0, 1, 0, 0],
      [0, 1, 0, 0, 1, 0],
      [0, 0, 1, 0, 0, 1],
      [1, 0, 0, 1, 0, 0],
      [0, 1, 0, 0, 1, 0],
      [0, 0, 1, 0, 0, 1]]    (1)

Further, setting the diagonal of the autocorrelation matrix Re to 0 and taking the upper triangle gives Re1, as in equation (2):

Re1 = [[0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]]    (2)

Re1 is then maximum-pooled (taking the maximum of each row) to yield Vre, which represents the probability that each position and a subsequent feature may form a cycle, as in equation (3):

Vre = [1, 1, 1, 0, 0, 0]    (3)

Then the gradient of Vre is calculated as Vre[i+1] - Vre[i]; for the 1st time point, the probability of the previous time point is 0 by default. The resulting gradient is given by equation (4):

grad(Vre) = [1, 0, 0, -1, 0, 0]    (4)

As can be seen from equation (4), the 1st time point is predicted to be a positioning start point and the 4th time point a positioning end point; these two time points are the positioning points referred to in the embodiment of the present application.
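The computation in equations (1)-(4) can be reproduced with a short NumPy sketch. The function name and the 1-based indexing convention are illustrative; only the zero-diagonal, upper-triangle, row-max-pooling and gradient steps come from the description above.

```python
# NumPy sketch of the anchor-point extraction in equations (1)-(4): zero the
# diagonal of the autocorrelation matrix, keep the upper triangle, max-pool
# each row, take the gradient, and read off +1 (start) / -1 (end) positions.
import numpy as np

def anchor_points_from_autocorrelation(re: np.ndarray):
    re1 = np.triu(re, k=1)                 # diagonal -> 0, keep upper triangle
    vre = re1.max(axis=1)                  # per-time-point cycling probability
    grad = np.diff(vre, prepend=0.0)       # previous probability defaults to 0
    starts = np.where(grad > 0)[0] + 1     # 1-based time points
    ends = np.where(grad < 0)[0] + 1
    return starts.tolist(), ends.tolist()

# The 6-point example where time points 1-3 repeat as time points 4-6:
re = np.array([[1, 0, 0, 1, 0, 0],
               [0, 1, 0, 0, 1, 0],
               [0, 0, 1, 0, 0, 1],
               [1, 0, 0, 1, 0, 0],
               [0, 1, 0, 0, 1, 0],
               [0, 0, 1, 0, 0, 1]], dtype=float)
print(anchor_points_from_autocorrelation(re))   # ([1], [4])
```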
In some possible implementations, the target video may be a non-cyclic video; in this case, the positioning point identification model is called to process the feature sequence of the target video and the number of obtained positioning points may be 0.
Further, the above-mentioned invoking of the positioning point identification model by the computer device to determine the autocorrelation matrix of the feature sequence may specifically include: calling a long short-term memory network in the positioning point identification model to process the feature sequence to obtain the accumulated feature information corresponding to each frame of image in the multiple frames of images, determining the degree of correlation between every two frames of images according to the difference between the accumulated feature information corresponding to the two frames, and finally creating the autocorrelation matrix of the feature sequence from the degrees of correlation between every two frames of images. The advantage of the long short-term memory network is that it can accumulate information: it screens the historical time-series information to determine which, and how much, long-term memory (cell state information) needs to be retained for the output at a certain time, and which, and how much, short-term memory (the output of the model at the previous time) needs to be retained. For example, for a sequence of length T = 30, the long-range memory may include the accumulation of the 29 cell states generated at T = 1-29. By using a long short-term memory network (LSTM) to identify the positioning points, the relations between different time steps are easier to compute and the computed information is more complete. Suppose feature calculation is performed on a video sequence in which several periods repeat, for example 1-60 seconds, 60-120 seconds and 120-128 seconds: at the 60th second the LSTM has learned the accumulated feature information of seconds 1-60, and at the 120th second it has learned the same accumulated feature information of the two identical segments, so that the feature output at 120 seconds is strongly correlated with the feature output at 60 seconds, from which it can be determined that 1-60 seconds and 60-120 seconds are two repeated video segments.
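A minimal sketch of this step is shown below: an LSTM accumulates information over the feature sequence and the autocorrelation matrix is built from the per-step outputs. The patent only states that the correlation is derived from differences between accumulated features; the cosine measure, the 64-dimensional hidden size (chosen to mirror the 128x64 positioning-module output described later) and the two-layer configuration are assumptions.

```python
# Sketch of building the autocorrelation matrix from LSTM accumulated features.
# Cosine similarity between per-step LSTM outputs is an assumed correlation measure.
import torch
import torch.nn.functional as F

lstm = torch.nn.LSTM(input_size=2048, hidden_size=64, num_layers=2, batch_first=True)

def autocorrelation_matrix(feature_sequence: torch.Tensor) -> torch.Tensor:
    """feature_sequence: (T, 2048) -> Re: (T, T)."""
    with torch.no_grad():
        accumulated, _ = lstm(feature_sequence.unsqueeze(0))   # (1, T, 64)
    accumulated = F.normalize(accumulated.squeeze(0), dim=1)   # unit-norm rows
    return accumulated @ accumulated.t()                       # Re[i, j]
```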
Step S201 may also be implemented by invoking the positioning point identification model, that is, the positioning point identification model is invoked to perform feature extraction on the target video to obtain the feature sequence. It can simply be understood that the positioning point identification model includes multiple layers, and different layers are responsible for different tasks.
S203, dividing the target video into a plurality of video segments according to the positioning points.
According to the above description, the positioning points may include one or both of positioning start points and positioning end points; video segmentation points are determined according to one or both of the positioning start points and positioning end points, and the target video is then segmented at the video segmentation points to obtain the plurality of video segments corresponding to the target video. For example, if the duration of a target video is 60 seconds and the 1st second and the 30th second are positioning start points, a video segmentation point may be determined from the 30th second and the target video may be divided into two video segments at that point.
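This segmentation step amounts to cutting the timeline at the positioning points, as in the short sketch below; times are in seconds and the function name is illustrative.

```python
# Sketch: split the target video timeline into segments at the positioning points.
def split_by_anchor_points(duration: float, anchor_points: list[float]) -> list[tuple[float, float]]:
    cut_points = sorted(p for p in anchor_points if 0.0 < p < duration)
    bounds = [0.0] + cut_points + [duration]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

print(split_by_anchor_points(60.0, [30.0]))   # [(0.0, 30.0), (30.0, 60.0)]
```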
In steps S202 to S203, the feature sequence of the target video is processed by the positioning point identification model; in this process, the positioning points at which the video frames may repeat are obtained through global feature similarity learning based on autocorrelation pooling, which further ensures the reliability of the repetition judgment on the video.
S204, calling a similarity judgment model to obtain the similarity among the video segments, and determining whether the target video is a cyclic video according to the similarity.
After a plurality of video segments of a target video are determined, in order to judge the similarity between every two video segments, the plurality of video segments need to be combined every two first to obtain a plurality of video segment pairs, and then the similarity between each video segment pair is judged. Suppose there are N video segments, pairWhich are combined pairwise to obtain M video segment pairs, wherein,
Figure 878501DEST_PATH_IMAGE005
. For example, there are four video clips, video clip a, video clip B, video clip C, and video clip D, and the result of combining two by two is six,
Figure 904226DEST_PATH_IMAGE006
:AB、AC、AD、BC、BD、CD。
After the plurality of video segments are combined pairwise, the similarity of each video segment pair needs to be determined. If the similarity of any one of the video segment pairs is greater than or equal to a similarity threshold, the target video can be determined to be a cyclic video; the similarity threshold can be set directly to 1, so that the target video is determined to be a cyclic video if the similarity equals 1 and is otherwise regarded as a non-cyclic video. In the embodiment of the application, when the similarity model is called to process the video segment pairs, the output may be a binary classification result, i.e. 0 or 1. For example, with 6 video segment pairs, if any one of the 6 results is 1, the target video can be determined to be a cyclic video; if all results are 0, the target video is determined to be a non-cyclic video.
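The pairwise check can be sketched as follows; the similarity model is a stand-in callable and the threshold value of 1.0 follows the binary-output description above.

```python
# Sketch of the pairwise check: combine the segments two by two and call a
# similarity-judgment model on each pair; the model object is a stand-in.
from itertools import combinations

def is_loop_video(segments, similarity_model, threshold: float = 1.0) -> bool:
    # N segments yield M = N * (N - 1) / 2 pairs, e.g. 4 segments -> 6 pairs.
    for seg_a, seg_b in combinations(segments, 2):
        if similarity_model(seg_a, seg_b) >= threshold:
            return True     # any repeated pair makes the target a loop video
    return False
```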
In a specific implementation process, the following example may be used to illustrate the determination of the similarity of a video segment pair, where any video segment pair includes a first video segment and a second video segment, and the determining of the similarity between the two video segments may include the following steps:
1. respectively performing frame extraction on the first video clip and the second video clip, where the frame extraction mode may be uniform frame extraction or frame skipping, to obtain a first image frame sequence of the first video clip and a second image frame sequence of the second video clip;
2. extracting the features of the first image frame sequence and the second image frame sequence by using a feature extraction network, such as resnet101, to obtain a first feature sequence and a second feature sequence;
3. respectively performing feature fusion on the first feature sequence and the second feature sequence by using a long short-term memory network and taking the output at the last time step as the video feature, i.e. taking the output corresponding to the last feature of the first feature sequence as the video feature of the first video clip, and the output corresponding to the last feature of the second feature sequence as the video feature of the second video clip;
4. performing similarity calculation on the video feature of the first video clip and the video feature of the second video clip to obtain the similarity result. The similarity of the two video features may be calculated as in equation (5), for example as a cosine similarity:

sim(x, y) = (x . y) / (||x|| ||y||)    (5)

where x and y are the video feature of the first video segment and the video feature of the second video segment, respectively, and may be represented in the form of vectors.
Fig. 3 shows the framework corresponding to steps S201 to S204, which is a framework diagram for video similarity determination disclosed in an embodiment of the present application. The general steps are as follows. Sparse frame extraction and image feature extraction are performed on the input target video to obtain a sparse frame feature sequence, which is input into the positioning identification module; the result is mapped back onto the target video, and positioning pooling and prediction are performed to determine the positioning points. The target video is segmented at the positioning points to obtain a plurality of video segments, and frame extraction and feature extraction are performed on these video segments before they are input into the similarity judgment model. For example, frame extraction is performed on any two of the video segments to obtain image sequence 1 and image sequence 2, image feature extraction is performed on the two image sequences to obtain feature sequence 1 and feature sequence 2, the feature sequences are input into the similarity learning module to obtain video feature 1 and video feature 2, and the similarity measurement module finally outputs the similarity result of the two video segments, from which it is determined whether the target video is a cyclic video. The output is generally 0 or 1: an output of 1 indicates that the target video is a cyclic video, and an output of 0 indicates that it is a non-cyclic video. The method first determines the positioning points of the target video and the video segments obtained by division, and then judges the similarity of the video segments; that is, it first determines at which positions of the target video a loop segment may exist, then extracts features of the presumed segments, and then judges whether the target video loops, which makes the judgment of whether the target video is a cyclic video more accurate.
The structure of the feature extraction network resnet101 used in steps S201 and S204 may be as shown in Table 1, where 101 refers to a 101-layer network: there is an input convolution of 7x7 with 64 channels, followed by 3 + 4 + 23 + 3 = 33 building blocks of 3 layers each, i.e. 33 x 3 = 99 layers, and finally the pooling and fully connected layers, giving 1 + 99 + 1 = 101 layers. An image frame sequence is input into the network, and a feature of dimension 2048 is obtained through layer-by-layer convolution processing. A residual network is also used in the network; the corresponding residual structure is shown in fig. 4, which is a schematic diagram of a residual network structure disclosed in an embodiment of the present application:
TABLE 1

Layer name | Output size | 101-layer structure
conv1      | 112 x 112   | 7x7, 64, stride 2
conv2_x    | 56 x 56     | 3x3 max pool, stride 2; [1x1, 64; 3x3, 64; 1x1, 256] x 3
conv3_x    | 28 x 28     | [1x1, 128; 3x3, 128; 1x1, 512] x 4
conv4_x    | 14 x 14     | [1x1, 256; 3x3, 256; 1x1, 1024] x 23
conv5_x    | 7 x 7       | [1x1, 512; 3x3, 512; 1x1, 2048] x 3
pool       | 1 x 1       | global average pooling, 2048-d output feature
The embodiment of the present application mainly describes how to determine whether a target video is a cyclic video, which mainly includes: acquiring the target video, performing feature extraction on the target video to obtain a feature sequence comprising feature information of multiple frames of images in the target video, and calling the positioning point identification model to process the feature sequence to obtain the positioning points of the target video; dividing the target video into a plurality of video segments according to the positioning points; and calling the similarity judgment model to obtain the similarity among the video segments and determining whether the target video is a cyclic video according to the similarity. In judging whether the target video is a cyclic video, the positioning points are determined by means of the positioning point identification model, which indicates at which time points the target video may repeat and realizes a temporal cycle judgment of the video; the video is divided at the positioning points to obtain a plurality of video segments, and the similarity judgment model is used to judge the similarity between the features of any two of the video segments, realizing a feature-level cycle judgment of the video. Combining the temporal cycle judgment and the feature-level cycle judgment to determine whether the target video is a cyclic video improves the accuracy and efficiency of the judgment.
As described above, the video processing method of the present application calls a positioning point identification model and a similarity judgment model; how to train these two models is explained below. The general idea is to pre-train the positioning point identification model by self-supervision and then train it jointly with the similarity judgment model, so as to obtain the final positioning point identification model and similarity judgment model. Before describing the training of the two models, the network structures used are explained. The structure used for LSTM-based sequence feature extraction may be an n-layer LSTM network as shown in fig. 5a; in practical applications, the number of LSTM layers can be set as required, for example the similarity learning module DNN1 in fig. 3 may be set to 3 layers and the positioning learning module may be set to 2 layers.
As shown in Table 2, the similarity learning module DNN1 comprises a three-layer LSTM network. The 3 LSTM layers generate 36 time-series feature embeddings, and the feature at the last time step is used as the sequence representation of the video; that is, the model learns the relations of the 36-step input time series and represents each time point as a 1x512 vector, yielding a 36x512 representation of the 36 time points, and the representation of the last time point is taken as the final representation of the video sequence, i.e. the video feature of the video segment:
TABLE 2

Similarity learning module DNN1: three stacked LSTM layers applied to the 36-step input feature sequence; each time step is represented as a 1x512 vector (36x512 in total), and the representation at the last time step is taken as the 1x512 video feature.
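A minimal sketch of this module is shown below; the 2048-dimensional input and 512-dimensional hidden size match the surrounding description, but the exact layer widths are not given in the patent and are assumptions.

```python
# Sketch of the similarity-learning module DNN1 (Table 2): three stacked LSTM
# layers over a 36-step feature sequence; the last time step is the video feature.
import torch
import torch.nn as nn

class SimilarityLSTM(nn.Module):
    def __init__(self, feature_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers=3, batch_first=True)

    def forward(self, feature_sequence: torch.Tensor) -> torch.Tensor:
        # feature_sequence: (batch, 36, feature_dim)
        outputs, _ = self.lstm(feature_sequence)   # (batch, 36, hidden_dim)
        return outputs[:, -1, :]                   # last time step: (batch, 512)

video_feature = SimilarityLSTM()(torch.randn(1, 36, 2048))   # (1, 512)
```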
The framework of the positioning identification module can be as shown in Table 3. It is a module comprising two LSTM layers: the 2 LSTM layers generate information feature embeddings for the input sequence (128 features), and the module outputs a 128x64 matrix representing the feature prediction at 128 time points, where the prediction at the 2nd time point is learned from the input features of the 1st and 2nd time points, the 3rd time point learns the information of the input features of the 1st, 2nd and 3rd time points, and so on, so that the last time point has learned the most features. In the embodiment of the present application, the accumulated feature information of the video frames is determined by the positioning identification module so as to determine the positioning points at which the video may repeat, and the positioning points do not need to be determined from a designated action in the video frames. Determining positioning points from a designated action involves a threshold problem: the thresholds needed to judge whether images are similar differ between large and small action changes in a video, so the accuracy of loop positioning points determined in that way is not as high as when the positioning points are determined using the accumulated feature information of the feature sequence. Moreover, videos are of many kinds, and determining positioning points according to a specific action cannot satisfy the positioning requirements of arbitrary video content:
TABLE 3

Positioning identification module: two stacked LSTM layers applied to the 128-step input feature sequence; the output is a 128x64 matrix of per-time-point feature predictions.
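A corresponding sketch of the positioning identification module is given below; the 128x64 output shape follows Table 3, while the input feature dimension and layer configuration are assumptions.

```python
# Sketch of the positioning identification module (Table 3): two stacked LSTM
# layers over the 128-step input sequence, producing a 128x64 prediction matrix.
import torch
import torch.nn as nn

class PositioningLSTM(nn.Module):
    def __init__(self, feature_dim: int = 2048, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, feature_sequence: torch.Tensor) -> torch.Tensor:
        # feature_sequence: (batch, 128, feature_dim)
        outputs, _ = self.lstm(feature_sequence)   # (batch, 128, 64)
        return outputs

anchor_features = PositioningLSTM()(torch.randn(1, 128, 2048))   # (1, 128, 64)
```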
Further, the model structure of the LSTM is shown in fig. 5b, which includes 501 and 502. The input length is 3, and h is the output of each layer; the result h_{t-1} at time t-1 in the LSTM is passed on to time t for learning. 502 is the LSTM structure annotated with long-term and short-term information (long-term memory and short-term memory): the cell state is the cumulative sum of all previous information and therefore carries the long-range information (long-term memory), and the output h_t at each time is determined jointly by the input at the current time, the output at the previous time, and the long-range information.

The composition of the parts in the LSTM structure can be seen in fig. 5c. The first step in the LSTM is to decide what information to discard from the cell state. This decision is made by a layer called the forget gate, as shown at 510: the gate reads h_{t-1} and x_t and outputs a value between 0 and 1 for each entry of the cell state C_{t-1}, where 1 represents "completely retain" and 0 represents "completely discard":

f_t = sigma(W_f . [h_{t-1}, x_t] + b_f)

The next step is to determine what new information is stored in the cell state, which comprises two parts, as shown at 520: a sigmoid layer called the "input gate layer" determines which values are to be updated, and a tanh layer creates a new vector of candidate values C~_t that may be added to the state:

i_t = sigma(W_i . [h_{t-1}, x_t] + b_i),   C~_t = tanh(W_C . [h_{t-1}, x_t] + b_C)

The next step is to update the old cell state, as shown at 530: C_{t-1} is updated to C_t by multiplying the old state by f_t, discarding the information determined to be discarded, and then adding i_t * C~_t, the new candidate values scaled by how much each state is to be updated:

C_t = f_t * C_{t-1} + i_t * C~_t

Finally, it is necessary to determine what value to output. The output is based on the cell state, but in a filtered version, as shown at 540: a sigmoid layer is first run to determine which parts of the cell state will be output, then the cell state is passed through tanh (to obtain values between -1 and 1) and multiplied by the output of the sigmoid gate to finally determine the output:

o_t = sigma(W_o . [h_{t-1}, x_t] + b_o),   h_t = o_t * tanh(C_t)
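For concreteness, one LSTM time step implementing the gates described above can be written as the following NumPy sketch; the weight shapes and the tiny usage example are illustrative.

```python
# NumPy sketch of one LSTM time step implementing the gates described above
# (forget gate 510, input gate 520, state update 530, output gate 540).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                # what to discard from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)                # which values to update
    c_hat = np.tanh(W_c @ z + b_c)              # candidate values
    c_t = f_t * c_prev + i_t * c_hat            # updated cell state (530)
    o_t = sigmoid(W_o @ z + b_o)                # which parts to output
    h_t = o_t * np.tanh(c_t)                    # filtered output (540)
    return h_t, c_t

# Illustrative usage with random weights (H hidden units, D input features):
H, D = 64, 2048
rng = np.random.default_rng(0)
W = lambda: rng.normal(scale=0.01, size=(H, H + D))
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 W(), W(), W(), W(), *(np.zeros(H) for _ in range(4)))
```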
Based on the above description, the joint training of the positioning recognition model and the similarity determination model is explained first, please refer to fig. 6, which is a schematic flowchart of a process for training the positioning recognition model and the similarity determination model disclosed in the embodiment of the present application, and specifically includes steps S601-S603:
S601, obtaining a first training sample, wherein the first training sample comprises a plurality of sample video pairs and labeling information of each sample video pair.
The annotation information comprises a reference anchor point of a cyclic segment in each sample video pair and a reference similarity of each sample video pair, wherein the reference anchor point refers to a marked time point in the process of acquiring the first training sample, and the reference similarity refers to a mark of whether the sample video pair is repeated or not in the process of acquiring the first training sample, so that the reference similarity comprises two indication values, one is used for indicating that the sample video pair is repeated, and the other is used for indicating that the sample video pair is not repeated. The plurality of sample video pairs herein includes a positive sample video pair and a negative sample video pair, and in this application, the positive sample video pair is considered as a repeated video pair, and the negative sample video pair is a non-repeated video pair.
The acquisition process of the first training sample is as follows. A plurality of repeated video segments are first determined: the start and end positions (reference positioning points) of videos containing repeated segments need to be labeled in advance, and several mutually repeated videos are cut out at those start and end positions (or a self-supervised generation mode in which videos are spliced to generate a cyclic video is adopted, see the next embodiment) to be used as positive sample video pairs. Mining of negative sample video pairs: among all the video segments generated based on whether repeat points exist, segments coming from different videos can be regarded as negative sample video pairs. For any video in each positive sample pair, 10 different videos can be randomly selected from the total of N videos as negatives, so that the positive-to-negative ratio is 1:10. The positive sample pair label is 1 and the negative sample pair label is 0, where 0 and 1 represent the reference similarity information.
S602, performing joint training on the first network model and the pre-trained second network model by using the plurality of sample video pairs and the label information of each sample video pair to obtain the trained first network model and the trained second network model.
The second network model has been obtained through pre-training, and its parameters are further adjusted here; positioning and similarity are trained jointly, with loop-segment positioning placed in front of the similarity model, so that positioning and similarity can be learned jointly. This finally realizes the cycle judgment of the input video and improves the accuracy of the judgment.
In the embodiment of the application, the parameters of the neural network model can be solved by adopting a gradient descent method. In the training process, parameters are initialized, the second network model is obtained by pre-training, the parameters of the pre-trained model can be directly used as initialization values, and if the first network model is not pre-trained, the first network model is initialized by adopting Gaussian distribution with the variance of 0.01 and the mean value of 0; and then, setting learning parameters, a learning rate and a learning process, wherein the learning process carries out epoch iteration on the full data, and each iteration processes a full sample.
The specific iterative process includes dividing the full set of samples into Nb batches of batch-size samples each and training on each batch:
1. inputting the sample video pairs into the second network model to obtain predicted positioning point features, and inputting the sample video pairs into the first network model to obtain predicted similarities;
2. calculating the loss values: a similarity loss value and a positioning loss value are calculated separately. The positioning loss value may be calculated as in equation (6), for example as a regression loss between p_i, the predicted positioning point feature of video i (a vector of dimension 1x128), and g_i, the reference positioning point feature of video i:

L_loc = (1/N) sum_i ||p_i - g_i||^2    (6)

The similarity loss value is calculated as in equations (7) and (8), where K is the weight of the negative sample video pairs in the loss and is set to 3. The output similarity is mapped to between 0 and 1 with a sigmoid, and the loss is then calculated separately for the positive sample video pairs and the negative sample video pairs, for example as a weighted cross-entropy:

p = sigmoid(s)    (7)

L_sim = - sum_{positive pairs} log(p) - K * sum_{negative pairs} log(1 - p)    (8)

K may be adjusted according to the ratio of positive and negative sample pairs: for positive : negative = 1 : 10 it can be adjusted to 0.1, the goal being to balance the positive and negative contributions, and K is 1 when the samples are balanced.
3. Updating model parameters: the gradient of the loss from the previous step is back-propagated using SGD (stochastic gradient descent) to obtain the updated values of all model parameters, and the network is updated. When the model loss has not decreased for 10 consecutive rounds, model training is stopped, yielding the trained first network model and the trained second network model.
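One joint training step can be sketched as below. The L2 positioning loss and the weighted cross-entropy similarity loss follow the hedged forms given for equations (6)-(8), and the module interfaces (both models consuming the same pair features) are simplified stand-ins rather than the patent's exact signatures.

```python
# Sketch of one joint training step, assuming an L2 positioning loss (eq. 6)
# and a weighted binary cross-entropy (eqs. 7-8); both forms are assumptions.
import torch
import torch.nn.functional as F

def joint_training_step(positioning_model, similarity_model, optimizer,
                        pair_features, reference_anchor, reference_label, k_neg=3.0):
    pred_anchor = positioning_model(pair_features)              # (batch, 128)
    pred_score = similarity_model(pair_features)                # (batch,) raw score
    loc_loss = F.mse_loss(pred_anchor, reference_anchor)        # assumed eq. (6)
    p = torch.sigmoid(pred_score)                               # map to (0, 1)
    weight = torch.where(reference_label > 0.5, torch.ones_like(p),
                         torch.full_like(p, k_neg))             # weight K on negatives
    sim_loss = F.binary_cross_entropy(p, reference_label.float(), weight=weight)
    loss = loc_loss + sim_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                            # SGD update
    return loss.item()
```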
S603, taking the trained first network model as a similarity judgment model, and taking the trained second network model as a positioning point identification model.
In step S602, the second network model is obtained by pre-training the third network model, where the pre-training process is an auto-supervised training process, and the obtaining of the second network model by training the third network model with massive videos substantially includes: and acquiring a second training sample, wherein the second training sample comprises a plurality of sample videos and marking information of each sample video, the marking information comprises a reference positioning point of a cyclic segment in each sample video, calling a third network model to process each sample video to obtain a predicted positioning point of the cyclic segment in each sample video, and adjusting model parameters of the third network model according to the reference positioning point and the predicted positioning point to obtain a second network model.
The specific training process comprises the following steps:
1. Preparing training samples: first, a video i with an extremely low probability of internal looping (whether it loops can be roughly judged with existing methods, such as a traditional loop judgment method) is randomly clipped multiple times (for example 1-10 times), each time extracting a clip whose duration is a random multiple of the full length of the original video (for example 0.5-1 times), and the extracted clips are spliced to obtain a new video i2. The repeat times in i2 are recorded. If one full-length repeat of video i is taken, the final i2 is twice as long as i, and the repeat times are marked as a vector of the form [1, 0, ..., 0, -1, 0, ..., 0], where 1 represents the start point and -1 represents the end point of the previous segment of video and the start point of the next loop segment. For any input video, 128 time-point images are extracted uniformly in proportion to its duration (uniform frame extraction is performed regardless of video length), and the repeat time marks are mapped onto these 128 images to obtain a 128x1 repeat mark vector. This yields the second training sample, i.e. a plurality of sample videos and the labeling information of each sample video (a sketch of this sample construction is given after this list).
2. Training process: the above sample videos are input, resnet101 representations of the video images are extracted and input into the third network model as sequence features (the framework can be as shown in Table 3), and the autocorrelation pooling module is used for training. The gradient is updated using SGD, the learning rate is set to 0.01, and each loss is calculated using equation (6). The parameters of the third network model are adjusted using the labeling information of each sample video, and training ends when the set number of training iterations is reached, yielding the second network model.
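The self-supervised sample construction in step 1 above can be sketched as follows. The splice-and-mark strategy and the 128-point +1/-1 vector follow the description; the exact clip lengths, helper name and rounding are illustrative assumptions.

```python
# Sketch of self-supervised loop-sample generation: splice randomly cut pieces
# of a non-looping video and record the repeat boundaries as a 128-point vector.
import random
import numpy as np

def make_loop_sample(duration: float):
    n_clips = 1 + random.randint(1, 10)                   # original clip + 1-10 repeats
    clip_lengths = [duration * random.uniform(0.5, 1.0) for _ in range(n_clips)]
    total = sum(clip_lengths)
    marks = np.zeros(128, dtype=np.int8)
    marks[0] = 1                                           # start point of the loop
    t = 0.0
    for length in clip_lengths[:-1]:
        t += length                                        # boundary between repeats
        marks[int(round(t / total * 127))] = -1            # end of previous / start of next
    return clip_lengths, marks

clip_lengths, repeat_marks = make_loop_sample(duration=30.0)
```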
Because the LSTM network is not easy to converge, the first network model in S602 may also be pre-trained in practical applications; the purpose of pre-training is to make the overall model more sensitive to time-series feature extraction and to improve the convergence speed of the model. The pre-training method is the same as that described above, except that the samples are different. The step of pre-training the first network model generally includes: acquiring a third training sample, where the third training sample comprises a plurality of sample video pairs and the labeling information of each sample video pair, and the labeling information comprises the reference similarity of each sample video pair; calling the first network model to process each sample video pair to obtain the predicted similarity of each sample video pair; and adjusting the model parameters of the first network model according to the reference similarity and the predicted similarity to obtain the pre-trained first network model.
The specific training process comprises the following steps:
1. Obtaining a third training sample. Positive sample video pairs: a batch of N videos v1 is drawn each time; for each video, a random time t0 is selected as the starting point and a sequence s1 of 36 video frames is generated according to the frame extraction scheme described above (6-second segmentation); a random number x in the range 0.5-3.5 is generated, and a second sequence s2 of 36 frames is generated with t0 + x seconds as the starting point, again with 6-second segmentation; s1 and s2 then form a positive sample video pair. Negative sample video pairs: for each sample i among the s1 sequences, 10 sequences n1i are selected from the other N-1 sequences according to the following rule: the number of frames whose frame-by-frame distance (L2 distance) to the image frames of sample i is less than 0.08 must be less than 3; that is, the proportion of image-level similar frames (frames below the distance threshold) between the two 36-frame sequences is less than 1/10 of the sequence length (36/10 = 3.6), so the two sequences can be regarded as a negative sample pair. An illustrative sketch of this sampling rule is given after step 2 below.
2. Training of the first network model: the first network model is trained with the above positive and negative sample video pairs at a ratio of 1:10, model parameters are optimized with SGD at a learning rate of 0.005, the learning rate is multiplied by 0.5 every 10 epochs, and 200 epochs are run in total; when the iterations are completed, the pre-trained first network model is obtained.
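The sketch below illustrates the pair-sampling rule described in step 1; it is a simplified illustration only, the frame features are assumed to be per-frame vectors, and the helper names (is_negative_pair, sample_pairs) are hypothetical.

```python
import numpy as np

DIST_THRESHOLD = 0.08   # per-frame L2 distance below which two frames count as similar
MAX_SIMILAR_FRAMES = 3  # fewer than 3 similar frames (< 1/10 of 36) -> negative pair

def is_negative_pair(seq_a: np.ndarray, seq_b: np.ndarray) -> bool:
    """seq_a, seq_b: (36, D) frame-feature sequences.
    The pair is negative if fewer than 3 frame positions are 'similar',
    i.e. have a frame-by-frame L2 distance below the threshold."""
    dists = np.linalg.norm(seq_a - seq_b, axis=1)
    return int((dists < DIST_THRESHOLD).sum()) < MAX_SIMILAR_FRAMES

def sample_pairs(batch_seqs, negatives_per_sample=10):
    """batch_seqs: list of (s1, s2) tuples, one per video in the batch.
    Returns the positive pairs (s1, s2) and up to 10 negatives per sample,
    giving roughly a 1:10 positive-to-negative ratio."""
    positives = [(s1, s2) for s1, s2 in batch_seqs]
    negatives = []
    for i, (s1_i, _) in enumerate(batch_seqs):
        candidates = [s1_j for j, (s1_j, _) in enumerate(batch_seqs) if j != i]
        picked = [c for c in candidates if is_negative_pair(s1_i, c)]
        negatives.extend((s1_i, c) for c in picked[:negatives_per_sample])
    return positives, negatives
```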
This pre-training method is likewise a self-supervised learning method: no manual intervention is needed, and the positive and negative sample pairs are generated directly from the sequence features of the videos, from which the model learns.
In a possible implementation manner, the method can also handle badcase optimization during model prediction: problems encountered in the application can be captured directly and flowed back to the model for fine-tuning training. This also makes it possible to feed the video-segment similarity learning results and the labeling information directly back to the positioning identification module, avoiding the problem that a purely logical positioning method cannot be optimized, while the positioning effect is applied directly to subsequent similarity-model training, achieving joint optimization learning of positioning and similarity. Specifically, the model is optimized with data recorded during application. For example, the judgment result Pcheck and the output similarity Psim of each judged video pair are recorded; the similarity range 0-1 is divided into 10 bins of width 0.1, 1000 video pairs are randomly selected from each Psim bin to obtain 10,000 video pairs, and their labels are manually audited; for the sample pairs Pwrung whose labels were predicted incorrectly (badcases), the true labels are recorded and stored; after accumulating for one week, the error samples {Pwrung} are added to the training samples and the model is re-learned at a learning rate of 0.0005. The new model is then used to update the positioning recognition model and the similarity judgment model in the system. Optimizing the model through data reflux, data expansion, and retraining or fine-tuning makes maintenance of the model simpler and more friendly. An illustrative sketch of this stratified badcase sampling is given below.
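A minimal sketch of the stratified badcase sampling described above, assuming each logged record is a dictionary holding the video pair, its predicted similarity Psim and the predicted label Pcheck; the record structure and the function names are assumptions of this sketch.

```python
import random
from collections import defaultdict

def sample_for_audit(records, pairs_per_bin=1000, bin_width=0.1):
    """records: iterable of dicts like
       {"pair": (vid_a, vid_b), "psim": 0.73, "pcheck": True}.
    Returns up to 1000 randomly chosen pairs from each 0.1-wide Psim bin
    (10 bins over 0-1, i.e. at most 10,000 pairs) for manual auditing."""
    bins = defaultdict(list)
    for rec in records:
        idx = min(int(rec["psim"] / bin_width), 9)   # clamp psim == 1.0 into the last bin
        bins[idx].append(rec)
    audited = []
    for idx in range(10):
        candidates = bins.get(idx, [])
        audited.extend(random.sample(candidates, min(pairs_per_bin, len(candidates))))
    return audited

def collect_badcases(audited, true_labels):
    """Keep the pairs whose predicted label disagrees with the manually audited
    true label; these form the error set {Pwrung} that is added back into the
    training samples for fine-tuning."""
    return [dict(rec, label=true_labels[rec["pair"]])
            for rec in audited
            if rec["pcheck"] != true_labels[rec["pair"]]]
```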
The embodiment of the application mainly describes how the models are trained. A positioning-point concept is introduced by means of autocorrelation pooling, better initial model weights are obtained through self-supervised pre-training, and an end-to-end positioning identification model and similarity judgment model are formed through fixed-length sequence extraction and a long short-term memory network, so that automatic splitting of long videos and sequence-embedding identification are realized, the temporal order of video content is preserved better, and the accuracy of video loop judgment is improved. The positioning identification model obtains positioning points based on autocorrelation pooling, so that repetition inside a video can be located from the video's own information, and the similarity judgment model measures whether two sequence representations contain the same segment according to their similarity, thereby giving the identification result of whether two video segments form a loop. Because rules can hardly cover all scenes, an end-to-end similarity model is used for similarity learning: compared with manual rules, which require different thresholds for different scenes to achieve effective similarity judgment, the model can automatically learn the differences between types of videos produced by different kinds of image change and judge similarity automatically, avoiding the need to design a threshold manually for every type, or the inaccuracy caused by using only a single threshold.
Based on the above method embodiment, the embodiment of the present application further provides a schematic structural diagram of a video processing apparatus. Fig. 7 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure. The video processing apparatus 700 shown in fig. 7 may operate as follows:
a processing unit 701 configured to:
extracting features of a target video to obtain a feature sequence, wherein the feature sequence comprises feature information of a plurality of frames of images in the target video;
calling a positioning point identification model to process the characteristic sequence to obtain a positioning point of the target video, wherein the positioning point identification model is obtained based on accumulated characteristic information corresponding to each frame of image in a plurality of frames of images of the sample video;
dividing the target video into a plurality of video segments according to the positioning points;
calling a similarity judgment model to obtain the similarity among the video clips;
a determining unit 702, configured to determine whether the target video is a loop video according to the similarity.
In a possible implementation manner, the processing unit 701 invokes a localization point identification model to process the feature sequence to obtain a localization point of the target video, which specifically includes:
calling a positioning point identification model to determine an autocorrelation matrix of the characteristic sequence, wherein the autocorrelation matrix is used for indicating the correlation degree between every two frames of images in the multi-frame images;
calling the positioning point identification model to perform pooling processing on the autocorrelation matrix to obtain a pooled matrix;
and calling the locating point identification model to obtain the gradient of the matrix after the pooling, and determining a locating point according to the gradient.
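For illustration only, and not as a limitation of the embodiment, the following sketch shows one way such a pooling-and-gradient step could look; the pooling window size, the gradient threshold and the function name are assumptions and are not taken from the disclosure.

```python
import numpy as np

def locate_anchor_points(autocorr: np.ndarray, pool: int = 4, grad_thresh: float = 0.5):
    """autocorr: (T, T) autocorrelation matrix of the feature sequence.
    1) average-pool the matrix to smooth it,
    2) take the gradient of the pooled response along the time axis,
    3) return the time indices where the gradient magnitude is large,
       which serve as candidate positioning points."""
    T = autocorr.shape[0]
    Tp = T // pool
    pooled = autocorr[:Tp * pool, :Tp * pool].reshape(Tp, pool, Tp, pool).mean(axis=(1, 3))
    response = pooled.mean(axis=1)          # per-time-step pooled correlation
    grad = np.gradient(response)            # change of correlation over time
    idx = np.where(np.abs(grad) > grad_thresh)[0]
    return (idx * pool).tolist()            # map back to original frame indices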
In a possible implementation manner, the processing unit 701 invokes a localization point identification model to determine an autocorrelation matrix of the feature sequence, which specifically includes:
calling a long-time memory network in a positioning point identification model to process the characteristic sequence to obtain accumulated characteristic information corresponding to each frame of image in the multi-frame image;
determining the correlation degree between every two frames of images according to the difference between the accumulated characteristic information corresponding to every two frames of images in the multi-frame images;
and creating an autocorrelation matrix of the characteristic sequence according to the correlation degree between every two frames of images.
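A non-limiting sketch of how accumulated per-frame features could be turned into such an autocorrelation matrix is given below; the use of PyTorch, the choice of negative L2 distance as the correlation measure, and the layer sizes are assumptions of this sketch, not part of the embodiment.

```python
import torch
import torch.nn as nn

class AutocorrelationHead(nn.Module):
    """Accumulate per-frame features with an LSTM, then build a (T, T)
    autocorrelation matrix from the pairwise differences between the
    accumulated (hidden) features of every two frames."""
    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (1, T, feat_dim) feature sequence of one video
        accumulated, _ = self.lstm(feats)     # (1, T, hidden_dim), cumulative context per frame
        h = accumulated.squeeze(0)            # (T, hidden_dim)
        diff = torch.cdist(h, h, p=2)         # pairwise L2 differences between frames
        return -diff                          # smaller difference -> higher correlation degree
```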
In one possible implementation manner, the positioning point includes one or both of a positioning start point and a positioning end point, where the positioning start point is used to indicate a start time point of the predicted loop video segment, and the positioning end point is used to indicate an end time point of the predicted loop video segment, and the dividing, by the processing unit 701, the target video into a plurality of video segments according to the positioning point specifically includes:
determining a video segmentation point according to one or two of the positioning starting point and the positioning end point;
and carrying out segmentation processing on the target video by using the video segmentation points to obtain a plurality of video segments corresponding to the target video.
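The following sketch illustrates the segmentation step under the assumption that positioning points are given as frame indices; the function name is hypothetical.

```python
def split_by_anchor_points(num_frames: int, anchor_points: list[int]) -> list[tuple[int, int]]:
    """Use the positioning points as segmentation points and cut the
    frame range [0, num_frames) into consecutive video segments."""
    cuts = sorted({p for p in anchor_points if 0 < p < num_frames})
    bounds = [0, *cuts, num_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# Example: positioning points at frames 64 and 96 of a 128-frame video
# -> segments (0, 64), (64, 96), (96, 128)
```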
In a possible implementation manner, the processing unit 701 performs feature extraction on a target video to obtain a feature sequence, which specifically includes:
performing frame extraction processing on a target video to obtain a plurality of frame images in the target video;
extracting the features of each frame of image in the multiple frames of images to obtain the feature information of each frame of image;
and combining the characteristic information of each frame of image to obtain a characteristic sequence.
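A non-limiting sketch of this feature-sequence construction follows; the uniform sampling of 128 frames and the resnet101 backbone follow the training description above, while the OpenCV/torchvision calls and the preprocessing choices are assumptions of this sketch.

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

def extract_feature_sequence(video_path: str, num_frames: int = 128) -> torch.Tensor:
    """Uniformly sample frames from the target video, extract a feature vector
    per frame with a resnet101 backbone, and stack the per-frame features into
    a (num_frames, 2048) feature sequence."""
    backbone = models.resnet101()  # load pretrained weights as appropriate
    backbone = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()  # drop the classifier
    preprocess = T.Compose([
        T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    feats = []
    with torch.no_grad():
        for i in range(num_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
            ok, frame = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            feat = backbone(preprocess(rgb).unsqueeze(0))   # (1, 2048, 1, 1)
            feats.append(feat.flatten())
    cap.release()
    return torch.stack(feats)                               # the feature sequence
```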
In a possible implementation manner, the processing unit 701 calls a similarity determination model to obtain similarities between the multiple video segments, and the determining unit 702 determines whether the target video is a cyclic video according to the similarities, specifically including:
calling a similarity judgment model to obtain the feature vectors corresponding to every two video clips in the plurality of video clips;
calling the similarity judgment model to determine the similarity between the feature vectors corresponding to each two video segments;
and if the similarity between the feature vectors corresponding to two video segments in the plurality of video segments is greater than or equal to a similarity threshold value, determining that the target video is a cyclic video.
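A non-limiting sketch of the loop decision over segment feature vectors is shown below; cosine similarity and the threshold value are placeholders, since the embodiment leaves the similarity measure to the trained similarity judgment model.

```python
import itertools
import torch
import torch.nn.functional as F

def is_loop_video(segment_vectors: list[torch.Tensor], sim_threshold: float = 0.9) -> bool:
    """segment_vectors: one embedding per video segment produced by the
    similarity judgment model. The target video is treated as a loop video
    if any two segment embeddings are at least as similar as the threshold."""
    for a, b in itertools.combinations(segment_vectors, 2):
        sim = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
        if sim >= sim_threshold:
            return True
    return False
```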
In a possible implementation manner, the obtaining unit 703 is configured to obtain a first training sample, where the first training sample includes a plurality of sample video pairs and annotation information of each sample video pair, where the annotation information includes a reference anchor point and a reference similarity of a cyclic segment in each sample video pair, and the plurality of sample video pairs include a positive sample video pair and a negative sample video pair;
the processing unit 701 is further configured to perform joint training on the first network model and the pre-trained second network model by using the plurality of sample video pairs and the label information of each sample video pair to obtain a trained first network model and a trained second network model, use the trained first network model as a similarity determination model, and use the trained second network model as a positioning point identification model.
In a possible implementation manner, the obtaining unit 703 is further configured to obtain a second training sample, where the second training sample includes a plurality of sample videos and annotation information of each sample video, and the annotation information includes a reference anchor point of a cyclic segment in each sample video;
the processing unit 701 is further configured to invoke a third network model to process each sample video, to obtain a predicted positioning point of a cyclic segment in each sample video, and to adjust a model parameter of the third network model according to the reference positioning point and the predicted positioning point, to obtain a second network model.
According to an embodiment of the present application, the steps involved in the video processing method shown in fig. 2 may be performed by the units in the video processing apparatus shown in fig. 7. For example, steps S201 to S203 in the video processing method shown in fig. 2 may be performed by the processing unit 701 in the video processing apparatus shown in fig. 7, and step S204 may be performed by the determination unit 702 in the video processing apparatus shown in fig. 7. As another example, step S601 in the method shown in fig. 6 may be performed by the acquisition unit 703 in the video processing apparatus shown in fig. 7, and steps S602 to S603 may be performed by the processing unit 701 in the video processing apparatus shown in fig. 7.
According to another embodiment of the present application, the units in the video processing apparatus shown in fig. 7 may be partly or wholly combined into one or several other units, or one or more of them may be further split into multiple units with smaller functions, which can achieve the same operations without affecting the technical effects of the embodiments of the present application. The units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the video processing apparatus may also include other units, and in practical applications these functions may be implemented with the assistance of other units and through the cooperation of multiple units.
According to another embodiment of the present application, the video processing apparatus shown in fig. 7 may be constructed, and the video processing method according to the embodiment of the present application may be implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 and fig. 6 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM) and a read-only memory (ROM). The computer program may, for example, be recorded on a computer-readable storage medium, loaded into the computer device described above via the computer-readable storage medium, and executed therein.
In the embodiment of the application, the processing unit 701 performs feature extraction on a target video to obtain a feature sequence containing feature information of a plurality of frames of images in the target video, calls a positioning point identification model to process the feature sequence to obtain positioning points of the target video, divides the target video into a plurality of video segments according to the positioning points, and calls a similarity judgment model to obtain the similarity among the plurality of video segments; the determining unit 702 then determines whether the target video is a loop video according to the similarity. In the process of judging whether the target video is a loop video, the positioning point identification model is used to determine positioning points, that is, to judge which time points in the target video may be repeated, so that temporal loop judgment of the video is realized through the positioning points; the video is divided into a plurality of video segments according to the positioning points, and the similarity judgment model is then used to judge the similarity of the features of any two of these video segments, so that feature-level loop judgment of the video is realized. Combining the temporal loop judgment and the feature loop judgment to determine whether the target video is a loop video improves both the accuracy and the efficiency of this determination.
Based on the above method and apparatus embodiments, the present application provides a computer device, and the computer device may be the server 103 shown in fig. 1. Referring to fig. 8, a schematic structural diagram of a computer device according to an embodiment of the present application is provided. The computer device 800 shown in fig. 8 comprises at least a processor 801, an input interface 802, an output interface 803, a computer storage medium 804 and a memory 805. The processor 801, the input interface 802, the output interface 803, the computer storage medium 804, and the memory 805 may be connected by a bus or other means.
A computer storage medium 804 may be stored in the memory 805 of the computer device 800, the computer storage medium 804 being for storing a computer program comprising program instructions, the processor 801 being for executing the program instructions stored by the computer storage medium 804. The processor 801 (or CPU) is a computing core and a control core of the computer device 800, and is adapted to implement one or more instructions, and in particular, to load and execute one or more computer instructions to implement corresponding method flows or corresponding functions.
Embodiments of the present application also provide a computer-readable storage medium (Memory), which is a Memory device in the computer device 800 and is used for storing programs and data. It is understood that the computer-readable storage medium herein can include both the built-in storage medium in the computer device 800 and, of course, the extended storage medium supported by the computer device 800. The computer-readable storage medium provides storage space that stores an operating system for the computer device 800. Also stored in this memory space are one or more computer programs (including program code) adapted to be loaded and executed by processor 801. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer readable storage medium located remotely from the aforementioned processor.
In one embodiment, one or more computer programs stored in the computer-readable storage medium may be loaded and executed by the processor 801 to implement the corresponding steps of the video processing method described above with respect to fig. 2 and of the model training method illustrated in fig. 6. In particular implementations, one or more instructions in the computer storage medium are loaded by the processor 801 to perform the following steps:
extracting features of a target video to obtain a feature sequence, wherein the feature sequence comprises feature information of a plurality of frames of images in the target video;
calling a positioning point identification model to process the characteristic sequence to obtain a positioning point of the target video, wherein the positioning point identification model is obtained based on accumulated characteristic information corresponding to each frame of image in a plurality of frames of images of the sample video;
dividing the target video into a plurality of video segments according to the positioning points;
calling a similarity judgment model to obtain the similarity among the video clips;
and determining whether the target video is a cyclic video according to the similarity.
In a possible implementation manner, the processor 801 invokes a localization point identification model to process the feature sequence to obtain a localization point of the target video, which specifically includes:
calling a positioning point identification model to determine an autocorrelation matrix of the characteristic sequence, wherein the autocorrelation matrix is used for indicating the correlation degree between every two frames of images in the multi-frame images;
calling the positioning point identification model to perform pooling processing on the autocorrelation matrix to obtain a pooled matrix;
and calling the locating point identification model to obtain the gradient of the matrix after the pooling, and determining a locating point according to the gradient.
In a possible implementation manner, the processor 801 invokes an anchor point identification model to determine the autocorrelation matrix of the feature sequence, which specifically includes:
calling a long-time memory network in a positioning point identification model to process the characteristic sequence to obtain accumulated characteristic information corresponding to each frame of image in the multi-frame image;
determining the correlation degree between every two frames of images according to the difference between the accumulated characteristic information corresponding to every two frames of images in the multi-frame images;
and creating an autocorrelation matrix of the characteristic sequence according to the correlation degree between every two frames of images.
In one possible implementation manner, the positioning point includes one or both of a positioning start point and a positioning end point, where the positioning start point is used to indicate a start time point of the predicted looped video segment, and the positioning end point is used to indicate an end time point of the predicted looped video segment, and the processor 801 divides the target video into a plurality of video segments according to the positioning point, and specifically includes:
determining a video segmentation point according to one or two of the positioning starting point and the positioning end point;
and carrying out segmentation processing on the target video by using the video segmentation points to obtain a plurality of video segments corresponding to the target video.
In a possible implementation manner, the processor 801 performs feature extraction on the target video to obtain a feature sequence, which specifically includes:
performing frame extraction processing on a target video to obtain a plurality of frame images in the target video;
extracting the features of each frame of image in the multiple frames of images to obtain the feature information of each frame of image;
and combining the characteristic information of each frame of image to obtain a characteristic sequence.
In a possible implementation manner, the processor 801 invokes a similarity determination model to obtain similarities between the multiple video segments, and determines whether the target video is a loop video according to the similarities, which specifically includes:
calling a similarity judgment model to obtain the feature vectors corresponding to every two video clips in the plurality of video clips;
calling the similarity judgment model to determine the similarity between the feature vectors corresponding to each two video segments;
and if the similarity between the feature vectors corresponding to two video segments in the plurality of video segments is greater than or equal to a similarity threshold value, determining that the target video is a cyclic video.
In one possible implementation manner, the processor 801 is further configured to:
obtaining a first training sample, wherein the first training sample comprises a plurality of sample video pairs and labeling information of each sample video pair, the labeling information comprises a reference positioning point and a reference similarity of a cyclic segment in each sample video pair, and the plurality of sample video pairs comprise positive sample video pairs and negative sample video pairs;
performing joint training on a first network model and a pre-trained second network model by using the plurality of sample video pairs and the labeling information of each sample video pair to obtain a trained first network model and a trained second network model;
and taking the trained first network model as a similarity judgment model, and taking the trained second network model as a positioning point identification model.
In one possible implementation manner, the processor 801 is further configured to:
acquiring a second training sample, wherein the second training sample comprises a plurality of sample videos and labeling information of each sample video, and the labeling information comprises a reference positioning point of a cyclic segment in each sample video;
calling a third network model to process each sample video to obtain a predicted positioning point of a cyclic segment in each sample video;
and adjusting the model parameters of the third network model according to the reference positioning points and the predicted positioning points to obtain a second network model.
In the embodiment of the application, the processor 801 performs feature extraction on a target video to obtain a feature sequence containing feature information of a plurality of frames of images in the target video, calls a positioning point identification model to process the feature sequence to obtain positioning points of the target video, divides the target video into a plurality of video segments according to the positioning points, calls a similarity judgment model to obtain the similarity among the plurality of video segments, and determines whether the target video is a loop video according to the similarity. In the process of judging whether the target video is a loop video, the positioning point identification model is used to determine positioning points, that is, to judge which time points in the target video may be repeated, so that temporal loop judgment of the video is realized through the positioning points; the video is divided into a plurality of video segments according to the positioning points, and the similarity judgment model is then used to judge the similarity of the features of any two of these video segments, so that feature-level loop judgment of the video is realized. Combining the temporal loop judgment and the feature loop judgment to determine whether the target video is a loop video improves both the accuracy and the efficiency of this determination.
According to an aspect of the present application, the embodiment of the present application further provides a computer program product, which includes a computer program stored in a computer-readable storage medium. The processor 801 reads the computer program from the computer-readable storage medium and executes it, so that the computer device 800 performs the video processing method of fig. 2 and the model training method shown in fig. 6.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of video processing, the method comprising:
extracting features of a target video to obtain a feature sequence, wherein the feature sequence comprises feature information of a plurality of frames of images in the target video;
calling a positioning point identification model to process the characteristic sequence to obtain a positioning point of the target video, wherein the positioning point identification model is obtained based on accumulated characteristic information corresponding to each frame of image in a plurality of frames of images of the sample video;
dividing the target video into a plurality of video segments according to the positioning points;
and calling a similarity judgment model to obtain the similarity among the video clips, and determining whether the target video is a cyclic video according to the similarity.
2. The method of claim 1, wherein the invoking of the anchor point recognition model to process the feature sequence to obtain the anchor point of the target video comprises:
calling a positioning point identification model to determine an autocorrelation matrix of the characteristic sequence, wherein the autocorrelation matrix is used for indicating the correlation degree between every two frames of images in the multi-frame images;
calling the positioning point identification model to perform pooling processing on the autocorrelation matrix to obtain a pooled matrix;
and calling the locating point identification model to obtain the gradient of the matrix after the pooling, and determining a locating point according to the gradient.
3. The method of claim 2, wherein said invoking an anchor point recognition model to determine an autocorrelation matrix of the sequence of features comprises:
calling a long-time memory network in a positioning point identification model to process the characteristic sequence to obtain accumulated characteristic information corresponding to each frame of image in the multi-frame image;
determining the correlation degree between every two frames of images according to the difference between the accumulated characteristic information corresponding to every two frames of images in the multi-frame images;
and creating an autocorrelation matrix of the characteristic sequence according to the correlation degree between every two frames of images.
4. The method according to any one of claims 1-3, wherein the anchor point comprises one or both of a positioning start point and a positioning end point, the positioning start point is used for indicating a start time point of the predicted loop video segment, the positioning end point is used for indicating an end time point of the predicted loop video segment, and the dividing the target video into a plurality of video segments according to the anchor point comprises:
determining a video segmentation point according to one or two of the positioning starting point and the positioning end point;
and carrying out segmentation processing on the target video by using the video segmentation points to obtain a plurality of video segments corresponding to the target video.
5. The method of claim 1, wherein the extracting features of the target video to obtain a feature sequence comprises:
performing frame extraction processing on a target video to obtain a plurality of frame images in the target video;
extracting the features of each frame of image in the multiple frames of images to obtain the feature information of each frame of image;
and combining the characteristic information of each frame of image to obtain a characteristic sequence.
6. The method of claim 1, wherein the invoking the similarity determination model to obtain the similarity between the plurality of video segments and determining whether the target video is a circular video according to the similarity comprises:
calling a similarity judgment model to obtain the feature vectors corresponding to every two video clips in the plurality of video clips;
calling the similarity judgment model to determine the similarity between the feature vectors corresponding to each two video segments;
and if the similarity between the feature vectors corresponding to two video segments in the plurality of video segments is greater than or equal to a similarity threshold value, determining that the target video is a cyclic video.
7. The method of claim 1, further comprising:
obtaining a first training sample, wherein the first training sample comprises a plurality of sample video pairs and labeling information of each sample video pair, the labeling information comprises a reference positioning point and a reference similarity of a cyclic segment in each sample video pair, and the plurality of sample video pairs comprise positive sample video pairs and negative sample video pairs;
performing joint training on a first network model and a pre-trained second network model by using the plurality of sample video pairs and the labeling information of each sample video pair to obtain a trained first network model and a trained second network model;
and taking the trained first network model as a similarity judgment model, and taking the trained second network model as a positioning point identification model.
8. The method of claim 7, further comprising:
acquiring a second training sample, wherein the second training sample comprises a plurality of sample videos and labeling information of each sample video, and the labeling information comprises a reference positioning point of a cyclic segment in each sample video;
calling a third network model to process each sample video to obtain a predicted positioning point of a cyclic segment in each sample video;
and adjusting the model parameters of the third network model according to the reference positioning points and the predicted positioning points to obtain a second network model.
9. A video processing apparatus, characterized in that the apparatus comprises:
the processing unit is used for extracting the characteristics of the target video to obtain a characteristic sequence, and the characteristic sequence comprises the characteristic information of a plurality of frames of images in the target video;
the processing unit is further configured to invoke a positioning point identification model to process the feature sequence to obtain a positioning point of the target video, wherein the positioning point identification model is obtained by training accumulated feature information corresponding to each frame of image in a plurality of frames of images of the sample video;
the processing unit is further configured to divide the target video into a plurality of video segments according to the positioning points;
the processing unit is further used for calling a similarity judgment model to obtain the similarity among the plurality of video clips;
and the determining unit is used for determining whether the target video is a circulating video according to the similarity.
10. A computer-readable storage medium, characterized in that it stores one or more computer programs adapted to be loaded by a processor and to perform the video processing method according to any of claims 1-8.
CN202111018263.1A 2021-09-01 2021-09-01 Video processing method, device and computer readable storage medium Active CN113449824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111018263.1A CN113449824B (en) 2021-09-01 2021-09-01 Video processing method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113449824A true CN113449824A (en) 2021-09-28
CN113449824B CN113449824B (en) 2021-11-30

Family

ID=77819244

Country Status (1)

Country Link
CN (1) CN113449824B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937506A (en) * 2010-05-06 2011-01-05 复旦大学 Similar copying video detection method
JP2016009921A (en) * 2014-06-23 2016-01-18 船井電機株式会社 Video processing apparatus
CN109389096A (en) * 2018-10-30 2019-02-26 北京字节跳动网络技术有限公司 Detection method and device
CN109977262A (en) * 2019-03-25 2019-07-05 北京旷视科技有限公司 The method, apparatus and processing equipment of candidate segment are obtained from video
CN111523430A (en) * 2020-04-16 2020-08-11 南京优慧信安科技有限公司 Customizable interactive video production method and device based on UCL
CN113177538A (en) * 2021-06-30 2021-07-27 腾讯科技(深圳)有限公司 Video cycle identification method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BHAVESH AHUJA 等: "Video Analysis and Natural Language Description Generation System", 《PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ELECTRONICS AND SUSTAINABLE COMMUNICATION SYSTEMS》 *
GU Jiawei et al.: "A Survey of Video Copy Detection Methods", Journal of Computer Research and Development *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492667A (en) * 2022-02-16 2022-05-13 平安科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN114492667B (en) * 2022-02-16 2024-06-04 平安科技(深圳)有限公司 Image processing method, device, electronic equipment and storage medium
CN115002550A (en) * 2022-05-19 2022-09-02 深圳康佳电子科技有限公司 Video playing control method based on image recognition, terminal equipment and storage medium
CN115002550B (en) * 2022-05-19 2024-07-02 深圳康佳电子科技有限公司 Video playing control method based on image recognition, terminal equipment and storage medium

Also Published As

Publication number Publication date
CN113449824B (en) 2021-11-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051403

Country of ref document: HK

TR01 Transfer of patent right

Effective date of registration: 20221115

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518000

Patentee after: Shenzhen Yayue Technology Co.,Ltd.

Address before: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.