CN113011320B - Video processing method, device, electronic equipment and storage medium - Google Patents

Video processing method, device, electronic equipment and storage medium

Info

Publication number
CN113011320B
Authority
CN
China
Prior art keywords
feature
video frame
video
image
features
Prior art date
Legal status
Active
Application number
CN202110285487.2A
Other languages
Chinese (zh)
Other versions
CN113011320A (en)
Inventor
宋浩 (Song Hao)
黄珊 (Huang Shan)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority claimed from CN202110285487.2A
Publication of CN113011320A
Application granted
Publication of CN113011320B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 20/00 - Scenes; Scene-specific elements
          • G06V 20/40 - Scenes; Scene-specific elements in video content
          • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
          • G06V 10/00 - Arrangements for image or video recognition or understanding
          • G06V 10/20 - Image preprocessing
          • G06V 10/22 - Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
        • G06F - ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 - Pattern recognition
          • G06F 18/20 - Analysing
          • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
          • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
          • G06F 18/22 - Matching criteria, e.g. proximity measures
          • G06F 18/24 - Classification techniques
          • G06F 18/25 - Fusion techniques
          • G06F 18/253 - Fusion techniques of extracted features
        • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 - Computing arrangements based on biological models
          • G06N 3/02 - Neural networks
          • G06N 3/04 - Architecture, e.g. interconnection topology
          • G06N 3/045 - Combinations of networks
          • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a video processing method, a video processing apparatus, an electronic device, and a computer-readable storage medium, relating to computer vision technology in the field of artificial intelligence. The method includes: extracting a first video frame feature from a first video frame of a video and a second video frame feature from a second video frame of the video; dividing the first video frame feature into a plurality of first video frame sub-features and the second video frame feature into a plurality of second video frame sub-features; determining a similarity between the first video frame and the second video frame based on the plurality of first video frame sub-features and the plurality of second video frame sub-features; and determining an identification frame in the video according to the similarity between the first video frame and the second video frame. The method and apparatus can accurately and efficiently identify representative identification frames in a video.

Description

Video processing method, device, electronic equipment and storage medium
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a video processing method, apparatus, electronic device, and computer readable storage medium.
Background
Artificial Intelligence (AI) refers to the theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. As research on artificial intelligence advances, the technology has been developed and applied in a growing variety of fields.
Taking video processing as an example, a video identification frame is a key video frame that represents the information in a video. Through the identification frame, the information expressed in the video can be determined quickly, which supports downstream processing such as classifying the video or using the identification frame as the video cover.
When multiple identification frames are extracted from a video in the related art, frames that share the same local information lead to a high repetition rate among the extracted identification frames; that is, the accuracy of the extracted identification frames is low and unnecessary computing resources are consumed.
As can be seen, the related art offers no effective solution for accurately and efficiently extracting identification frames from a video.
Disclosure of Invention
The embodiment of the application provides a video processing method, a video processing device, electronic equipment and a computer readable storage medium, which can accurately and efficiently extract an identification frame from a video.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a video processing method, which comprises the following steps:
extracting first video frame features from first video frames of a video and extracting second video frame features from second video frames of the video;
Dividing the first video frame feature into a plurality of first video frame sub-features and dividing the second video frame feature into a plurality of second video frame sub-features;
Determining a similarity between the first video frame and the second video frame based on the plurality of first video frame sub-features and the plurality of second video frame sub-features;
And determining an identification frame in the video according to the similarity between the first video frame and the second video frame.
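The four steps above can be illustrated with a minimal Python sketch. The per-stripe mean feature, the similarity formula, and the 0.8 threshold below are placeholders chosen for illustration only; the patent itself uses the learned networks described later, so this is a rough outline of the pipeline, not the claimed method.

```python
import numpy as np

def frame_feature(frame: np.ndarray, stripes: int = 4) -> np.ndarray:
    """Step 1 stand-in: one sub-feature per horizontal stripe of a frame of shape (H, W, C)."""
    h = frame.shape[0] // stripes
    return np.stack([frame[i * h:(i + 1) * h].mean(axis=(0, 1)) for i in range(stripes)])

def frame_similarity(feat_a: np.ndarray, feat_b: np.ndarray) -> float:
    """Steps 2-3: compare sub-features position by position and keep the minimum similarity."""
    sims = [1.0 / (1.0 + np.abs(a - b).mean()) for a, b in zip(feat_a, feat_b)]
    return float(min(sims))

def identification_frames(frames, threshold: float = 0.8):
    """Step 4: keep a frame only if it is not too similar to the last kept frame."""
    kept = [frames[0]]
    for frame in frames[1:]:
        if frame_similarity(frame_feature(kept[-1]), frame_feature(frame)) <= threshold:
            kept.append(frame)
    return kept

video = [np.random.rand(64, 64, 3) for _ in range(10)]  # stand-in for decoded frames
print(len(identification_frames(video)))
```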
An embodiment of the present application provides a video processing apparatus, including:
The extraction module is used for extracting first video frame characteristics from first video frames of the video and extracting second video frame characteristics from second video frames of the video;
The dividing module is used for dividing the first video frame characteristic into a plurality of first video frame sub-characteristics and dividing the second video frame characteristic into a plurality of second video frame sub-characteristics;
a similarity module for determining a similarity between the first video frame and the second video frame based on the plurality of first video frame sub-features and the plurality of second video frame sub-features;
and the determining module is used for determining the identification frame in the video according to the similarity between the first video frame and the second video frame.
In the above solution, the similarity module is further configured to perform, for each first video frame sub-feature, the following processing: selecting a second video frame sub-feature corresponding to the same position as the first video frame sub-feature from the plurality of second video frame sub-features, and determining the similarity between the first video frame sub-feature and the selected second video frame sub-feature; and selecting the minimum similarity from the similarities corresponding to the sub-features of the plurality of first video frames as the similarity between the first video frames and the second video frames.
In the above scheme, the similarity module is further configured to perform absolute value subtraction processing on the first video frame sub-feature and the selected second video frame sub-feature to obtain a video frame difference feature; mapping the video frame difference feature to probability corresponding to a plurality of candidate similarities; and determining the candidate similarity corresponding to the maximum probability as the similarity corresponding to the first video frame sub-feature.
In the above aspect, the extracting module is further configured to extract a first image feature from the first video frame and extract a second image feature from the second video frame; extracting a first text mask feature from the first image feature and a second text mask feature from the second image feature; performing fusion processing on the first text mask feature and the first image feature to obtain the first video frame feature; and carrying out fusion processing on the second text mask feature and the second image feature to obtain the second video frame feature.
In the above scheme, the extraction module is further configured to perform dimension-increasing processing on the first image feature to obtain a first dimension-increasing image feature; determine the attention weight corresponding to each channel in the first dimension-increasing image feature; and perform weighted summation on the data in each channel of the first dimension-increasing image feature according to those attention weights, to obtain the first text mask feature.
In the above scheme, the extraction module is further configured to perform dimension-increasing processing on the second image feature to obtain a second dimension-increasing image feature; determine the attention weight corresponding to each channel in the second dimension-increasing image feature; and perform weighted summation on the data in each channel of the second dimension-increasing image feature according to those attention weights, to obtain the second text mask feature.
In the above scheme, the extracting module is further configured to perform convolution processing on the first image feature to obtain a first convolution feature, and perform deconvolution processing on the first convolution feature to obtain a first deconvolution feature; performing fusion processing on the first image feature and the first deconvolution feature to obtain a first fusion feature; and carrying out fusion processing on the first fusion feature and the first deconvolution feature to obtain the first dimension-increasing image feature.
In the above scheme, the extracting module is further configured to perform convolution processing on the second image feature to obtain a second convolution feature, and perform deconvolution processing on the second convolution feature to obtain a second deconvolution feature; performing fusion processing on the second image feature and the second deconvolution feature to obtain a second fusion feature; and carrying out fusion processing on the second fusion feature and the second deconvolution feature to obtain the second dimension-increasing image feature.
In the above solution, the extracting module is further configured to determine weights corresponding to the first text mask feature and the first image feature respectively; and carrying out weighted summation on the first text mask feature and the first image feature based on weights respectively corresponding to the first text mask feature and the first image feature to obtain the first video frame feature.
In the above solution, the extracting module is further configured to determine weights corresponding to the second text mask feature and the second image feature respectively; and carrying out weighted summation on the second text mask feature and the second image feature based on weights respectively corresponding to the second text mask feature and the second image feature to obtain the second video frame feature.
In the above scheme, the extracting module is further configured to divide the first video frame into a plurality of first image blocks, and perform feature extraction processing on each first image block to obtain a plurality of first image sub-features corresponding to the plurality of first image blocks one to one; and combining the plurality of first image sub-features to obtain the first image features.
In the above scheme, the extracting module is further configured to divide the second video frame into a plurality of second image blocks, and perform feature extraction processing on each second image block to obtain a plurality of second image sub-features corresponding to the plurality of second image blocks one by one; and combining the plurality of second image sub-features to obtain the second image features.
In the above scheme, the dividing module is further configured to perform reduction processing on the data corresponding to each channel in the first video frame feature to obtain a first reduced feature; and performing dimension reduction processing on the first reduced feature in the horizontal direction to obtain a first dimension reduction feature, and dividing the first dimension reduction feature into a plurality of first video frame sub-features according to the horizontal direction.
In the above scheme, the dividing module is further configured to perform reduction processing on the data corresponding to each channel in the second video frame feature to obtain a second reduced feature; and performing dimension reduction processing on the second reduced feature in the horizontal direction to obtain a second dimension reduction feature, and dividing the second dimension reduction feature into a plurality of second video frame sub-features according to the horizontal direction.
In the above aspect, the video processing apparatus further includes: a classification module for extracting first image features from the first video frame and extracting second image features from the second video frame; extracting a first text mask feature from the first image feature and a second text mask feature from the second image feature; and classifying the first text mask features to obtain a classification result of whether the first video frame contains text, and classifying the second text mask features to obtain a classification result of whether the second video frame contains text.
In the above solution, the similarity module is further configured to, when the classification result is that the first video frame contains text, the second video frame contains text, and a similarity between the first video frame and the second video frame does not exceed a similarity threshold, use the first video frame and the second video frame as the identification frame in the video; when the classification result is that the first video frame contains text, the second video frame contains text, and the similarity between the first video frame and the second video frame exceeds a similarity threshold, the first video frame or the second video frame is used as an identification frame in the video; when the classification result is that the first video frame contains text and the second video frame does not contain text, the first video frame is used as an identification frame in the video; and when the classification result is that the second video frame contains text and the first video frame does not contain text, the second video frame is used as an identification frame in the video.
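A short sketch of these decision rules follows, as one reading of the text: the `contains_text` flags and the similarity are assumed to be produced by the classification and similarity modules, the threshold value is illustrative, and the behavior in the cases the text leaves open is noted in comments.

```python
def select_identification_frames(frame_a, frame_b, text_a: bool, text_b: bool,
                                 similarity: float, threshold: float = 0.8):
    if text_a and text_b:
        # both contain text: keep both only when they are not too similar;
        # when they are too similar the text says "the first or the second" is kept,
        # so returning the first one here is an arbitrary choice
        return [frame_a, frame_b] if similarity <= threshold else [frame_a]
    if text_a:
        return [frame_a]
    if text_b:
        return [frame_b]
    return []  # neither frame contains text: a case the rules above do not cover explicitly

print(select_identification_frames("frame 1", "frame 2", True, True, similarity=0.9))
```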
In the above solution, the similarity module is further configured to use the first video frame or the second video frame as an identification frame in the video when a similarity between the first video frame and the second video frame exceeds a similarity threshold; and when the similarity between the first video frame and the second video frame does not exceed a similarity threshold, taking the first video frame and the second video frame as identification frames in the video.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
And the processor is used for realizing the video processing method provided by the embodiment of the application when executing the computer executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores computer executable instructions for realizing the video processing method provided by the embodiment of the application when being executed by a processor.
An embodiment of the present application provides a computer program product, where the computer program product includes computer executable instructions for implementing the video processing method provided by the embodiment of the present application when the computer executable instructions are executed by a processor.
The embodiment of the application has the following beneficial effects:
The similarity between the first video frame and the second video frame is determined based on the divided sub-features of the two frames. Because the divided features represent local information in the video frames, identification frames selected based on this similarity can be effectively distinguished at the level of local information, which improves both the efficiency and the accuracy of identification frame detection.
Drawings
Fig. 1 is a schematic architecture diagram of a video processing system 100 according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present application;
fig. 3 is a schematic flow chart of a video processing method according to an embodiment of the present application;
Fig. 4 is a schematic flow chart of a video processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a machine learning model provided by an embodiment of the present application;
fig. 6 is a schematic flow chart of a video processing method according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a twin network according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first" and "second" are used merely to distinguish similar objects and do not imply a particular order. It should be understood that, where permitted, "first" and "second" may be interchanged in a particular order or sequence so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application will be described, and the terms and terminology involved in the embodiments of the present application will be used in the following explanation.
1) Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize, track, and measure targets, and further performs graphic processing so that the result is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition.
2) A Siamese neural network, also known as a twin neural network or twin network, is a coupled framework built from two artificial neural networks. It takes two samples as input and outputs their representations in a high-dimensional embedding space in order to compare the degree of similarity (hereinafter, similarity) of the two samples. Specifically, the twin neural network comprises two sub-networks, each of which receives one input, maps it to a high-dimensional feature space, and outputs the corresponding representation. By computing the distance between the two representations, for example the Euclidean distance, the similarity of the two inputs can be compared.
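As an illustration of the structure just described, the following is a minimal PyTorch sketch of a twin network: two inputs pass through the same weight-shared encoder, and the Euclidean distance between their embeddings is used to compare them. The layer sizes are illustrative and are not taken from the patent.

```python
import torch
import torch.nn as nn

class SiameseNetwork(nn.Module):
    def __init__(self, in_dim: int = 128, embed_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        z1, z2 = self.encoder(x1), self.encoder(x2)   # the two inputs share the same weights
        distance = torch.norm(z1 - z2, dim=-1)        # Euclidean distance between the embeddings
        return z1, z2, distance

net = SiameseNetwork()
a, b = torch.randn(1, 128), torch.randn(1, 128)
_, _, d = net(a, b)
print(d.item())  # a smaller distance means the two inputs are more similar
```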
3) A video identification frame is an image frame used to describe a video and reflects the main content of the video. A text identification frame is an identification frame that contains text information.
4) The sampling rate is the number of samples taken per second. For video, the sampling rate corresponds to the frame rate; for example, when the sampled still pictures are played back at the same rate at which they were sampled, they are perceived as continuous motion.
The video identification frame detection technology is an important technical branch of video processing and is also an important research hotspot in the field of computer vision. Its main task is to select a small number of video frames or video segments in the video to describe the stories occurring in the video, which can help to improve the understanding efficiency of important content in the video. Therefore, with the increasing number of videos in the internet, the video identification frame detection technology is widely used. Generally, machine learning is typically employed when video identification frames in a video are acquired using video identification frame detection techniques.
For example, a subset selection process is used to select video frames or video segments from the video, i.e., an optimal solution is learned over the video through a submodular optimization algorithm to obtain the video identification frames. Alternatively, gaze-following technology is used to detect the video identification frames in the video, and a submodular optimization algorithm is used to improve the relevance and diversity of the obtained identification frames.
For another example, dictionary learning and sparse coding are adopted to promote correlation of video identification frames, and the video identification frames are obtained through extraction according to local motion areas and correlation of the video identification frames.
For another example, the video identification frames are acquired based on deep learning; specifically, a reinforcement learning strategy is used, and supervised and unsupervised detection of the video identification frames is realized by designing reward functions for the diversity and expressiveness of the identification frames.
As yet another example, sequence-to-sequence techniques are used to determine the video identification frames of a video, specifically by constructing an attention-based encoder-decoder network. Alternatively, video identification frames are detected automatically through a supervised learning strategy using a long short-term memory (LSTM) network and determinantal point processes. Or dilated temporal units in the video to be processed are reconstructed by a generative adversarial network together with a long short-term memory network, and the video identification frames are detected from the reconstruction error.
Finally, for example, text information in the video is taken as a factor for extracting the video identification frames; specifically, the similarity of adjacent video frames is computed based on a text twin network, and an attention module (Attention Block) is added to identify whether a video frame contains text, thereby determining the video identification frames.
However, although the above scheme takes text information in the video as a factor for extracting video identification frames, the detection process compares the similarity of the global features of whole video frames. When the scene changes greatly between video frames while local information (such as text information or scene characters) remains unchanged, the extracted identification frames therefore suffer from a high repetition rate of local information and an unstable recall rate; as a result, the accuracy of the extracted video identification frames is low.
In view of the above technical problems, embodiments of the present application provide a video processing method that can accurately and efficiently identify representative identification frames in a video. Exemplary applications of the method are described below. The video processing method provided by the embodiments of the present application may be implemented by various electronic devices: for example, by a terminal alone, where the terminal determines the identification frame in a video using its own computing power, displays the identification frame as the video cover, and classifies or recommends the video according to the content recognition result; or cooperatively by a terminal and a server, where the terminal relies on the computing power of the server to determine the identification frame, then displays it as the video cover and classifies or recommends the video according to the content recognition result.
Next, an embodiment of the present application is described through a cooperative implementation of a server and a terminal. Referring to fig. 1, fig. 1 is a schematic architecture diagram of a video processing system 100 according to an embodiment of the present application. The video processing system 100 includes the server 200, the network 300, and the terminal 400, which are described separately below.
The server 200 is a background server of the client 410 and is configured to determine identification frames in a video according to the similarity between a first video frame and a second video frame of the video; it is also configured to perform content recognition on the identification frames to obtain a content recognition result and to classify the video according to that result to obtain a video classification list; and it is further configured to send the video classification list to the client 410 for presentation in response to a video classification list acquisition request from the client 410.
The network 300 may be a wide area network or a local area network, or a combination of both, for mediating communication between the server 200 and the terminal 400.
The terminal 400 is configured to run the client 410, where the client 410 is a client with a video playing function, such as a video client, a short video client, etc.
As one example, consider a presentation scenario for a video list. The client 410 is configured to send a video classification list acquisition request to the server 200 in response to a user's video list viewing operation, and to receive the video classification list sent by the server 200 and display it on the human-computer interaction interface. The video classification list includes the identification frame of each video, and the client 410 displays the identification frame of a video as its cover in the human-computer interaction interface, helping the user quickly grasp the content of the video through the cover.
As another example, consider a cold-start recommendation scenario: a newly published video may be difficult to recommend accurately based on its own characteristics because sufficient information about the video is lacking. In this case, the server 200 may further obtain user information (for example, the video viewing record) sent by the client 410, select from the plurality of videos a video matching the user information according to the content recognition result obtained by recognizing the identification frames, and send recommendation information of the matched video to the client 410, thereby improving the efficiency of video recommendation.
The embodiments of the present application may be implemented by means of cloud technology, which refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to realize computing, storage, processing, and sharing of data.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and so on based on the cloud computing business model; it can form a resource pool that is used on demand flexibly and conveniently. Cloud computing technology will become an important support, because the background services of technical network systems require a large amount of computing and storage resources.
As an example, the server 200 may be a stand-alone physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 400 may be a smart phone, a tablet computer, a vehicle-mounted terminal, a smart wearable device, a notebook computer, a desktop computer, or other various types of user terminals. The terminal 400 and the server 200 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
Next, the structure of the server 200 in fig. 1 will be described. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 according to an embodiment of the present application, and the server 200 shown in fig. 2 includes: at least one processor 210, a memory 240, and at least one network interface 220. The various components in server 200 are coupled together by bus system 230. It is understood that the bus system 230 is used to enable connected communications between these components. The bus system 230 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 230.
The processor 210 may be an integrated circuit chip with signal processing capability, for example a general-purpose processor such as a microprocessor or any conventional processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The memory 240 may be volatile memory or nonvolatile memory, and may include both. The nonvolatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM). The memory 240 described in the embodiments of the present application is intended to include any suitable type of memory. The memory 240 optionally includes one or more storage devices physically remote from the processor 210.
In some embodiments, memory 240 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 241, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, used to implement various basic services and handle hardware-based tasks; a network communication module 242, used to reach other computing devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include Bluetooth, Wireless Fidelity (Wi-Fi), Universal Serial Bus (USB), and the like.
In some embodiments, the video processing apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows a video processing apparatus 243 stored in a memory 240, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the extraction module 2431, the partitioning module 2432, the similarity module 2433, and the determination module 2434. These modules may be logical functional modules, and thus may be arbitrarily combined or further split depending on the functionality implemented. The functions of the respective modules will be described hereinafter.
In the following, a video processing method provided by the embodiment of the present application is described by way of example by the server 200 in fig. 1. Referring to fig. 3, fig. 3 is a flowchart of a video processing method according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3.
In step S101, first video frame features are extracted from first video frames of a video, and second video frame features are extracted from second video frames of the video.
In some embodiments, the video may be further decoded before step S101 to obtain a plurality of video frames, from which the first video frame and the second video frame are selected.
As an example, the first video frame and the second video frame are not two particular video frames; the terms are relative and serve only to distinguish different video frames. The first video frame and the second video frame may be any two video frames in the video, for example any two adjacent video frames, any two video frames separated by a fixed number of frames, the key frames (i.e., I-frames) of any two adjacent groups of pictures (GOP, Group of Pictures), or the key frames of any two groups of pictures separated by a fixed number of groups.
For example, when the frame rate of the video is 24 frames per second and the video duration is 3 seconds, the video may be decoded into video frame 1, video frame 2, video frame 3, video frame 4, ..., and video frame 72. The first video frame and the second video frame may be any two adjacent video frames in the video, for example, video frame 1 is the first video frame and video frame 2 is the second video frame. They may also be any two video frames separated by a fixed number of frames; taking an interval of 5 video frames as an example, video frame 1 is the first video frame and video frame 6 is the second video frame.
For example, when the video includes 4 groups of pictures (group of pictures 1, group of pictures 2, group of pictures 3, and group of pictures 4), the first video frame and the second video frame may be the key frames of any two adjacent groups of pictures in the video, e.g., the I-frame of group of pictures 1 is the first video frame and the I-frame of group of pictures 2 is the second video frame. They may also be the key frames of any two groups of pictures separated by a fixed number of groups, e.g., the I-frame of group of pictures 1 is the first video frame and the I-frame of group of pictures 3 is the second video frame.
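A small sketch of these pairing options, assuming the video has already been decoded into an in-memory list of frames (for example with OpenCV or ffmpeg):

```python
def frame_pairs(frames, gap: int = 1):
    """Yield (first_video_frame, second_video_frame) pairs that are `gap` frames apart.

    gap=1 pairs adjacent frames; gap=5 pairs e.g. frame 1 with frame 6. The same idea
    applies to pairing GOP key frames by treating `frames` as the list of I-frames.
    """
    for i in range(len(frames) - gap):
        yield frames[i], frames[i + gap]

frames = list(range(1, 73))                          # stand-in for 72 decoded video frames
adjacent_pairs = list(frame_pairs(frames, gap=1))    # (frame 1, frame 2), (frame 2, frame 3), ...
interval_pairs = list(frame_pairs(frames, gap=5))    # (frame 1, frame 6), (frame 2, frame 7), ...
```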
In some embodiments, referring to fig. 4, fig. 4 is a flowchart of a video processing method according to an embodiment of the present application, and based on fig. 3, step S101 may include steps S1011 to S1014.
In step S1011, a first image feature is extracted from a first video frame, and a second image feature is extracted from a second video frame.
In some embodiments, dividing a first video frame into a plurality of first image blocks, and performing feature extraction processing on each first image block to obtain a plurality of first image sub-features corresponding to the plurality of first image blocks one to one; and combining the plurality of first image sub-features to obtain first image features.
As an example, referring to fig. 5, which is a schematic structural diagram of a machine learning model provided in an embodiment of the present application: the first video frame is divided into a plurality of first image blocks by a first convolutional network, and feature extraction is performed on each first image block to obtain a plurality of first image sub-features in one-to-one correspondence with the first image blocks; the plurality of first image sub-features are then combined to obtain the first image feature.
For example, the method of combining the plurality of first image sub-features may be to determine weights corresponding to the plurality of first image sub-features, and perform weighted summation on the plurality of first image sub-features according to the weights corresponding to the plurality of first image sub-features to obtain the first image feature.
In some embodiments, dividing the second video frame into a plurality of second image blocks, and performing feature extraction processing on each second image block to obtain a plurality of second image sub-features corresponding to the plurality of second image blocks one to one; and combining the plurality of second image sub-features to obtain a second image feature.
Continuing with fig. 5, the second video frame is divided into a plurality of second image blocks by a second convolutional network, and feature extraction is performed on each second image block to obtain a plurality of second image sub-features in one-to-one correspondence with the second image blocks; the plurality of second image sub-features are then combined to obtain the second image feature.
For example, the method of combining the plurality of second image sub-features may be to determine weights corresponding to the plurality of second image sub-features, and perform weighted summation on the plurality of second image sub-features according to the weights corresponding to the plurality of second image sub-features to obtain the second image feature.
As an example, the first convolutional network and the second convolutional network may be neural network models with the same parameters or with different parameters. The neural network model may be of various types, such as a convolutional neural network (CNN) model, a recurrent neural network (RNN) model, or a multi-layer feedforward neural network model.
In the embodiment of the application, when the image features of a video frame are extracted, the frame is divided into a plurality of image blocks whose features are extracted separately. Compared with directly extracting features for every pixel in the video frame, this is computationally simpler, reduces the consumption of computing resources, and speeds up image feature extraction.
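A toy sketch of this block-based extraction follows; the grid size, the per-block extractor (mean pooling), and the uniform combination weights are placeholders for the convolutional network described above.

```python
import torch

def extract_image_feature(frame: torch.Tensor, grid: int = 4) -> torch.Tensor:
    """frame: (C, H, W) -> combined image feature of shape (C,)."""
    c, h, w = frame.shape
    bh, bw = h // grid, w // grid
    sub_features = []
    for i in range(grid):
        for j in range(grid):
            block = frame[:, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            sub_features.append(block.mean(dim=(1, 2)))               # one sub-feature per block
    stacked = torch.stack(sub_features)                               # (grid * grid, C)
    weights = torch.softmax(torch.ones(len(sub_features)), dim=0)     # uniform placeholder weights
    return (stacked * weights[:, None]).sum(dim=0)                    # weighted summation

image_feature = extract_image_feature(torch.rand(3, 64, 64))
print(image_feature.shape)  # torch.Size([3])
```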
In step S1012, a first text mask feature is extracted from the first image feature and a second text mask feature is extracted from the second image feature.
In some embodiments, dimension-increasing processing is performed on the first image feature to obtain a first dimension-increasing image feature; the attention weight corresponding to each channel in the first dimension-increasing image feature is determined; and weighted summation is performed on the data in each channel of the first dimension-increasing image feature according to those attention weights, to obtain the first text mask feature.
Continuing with fig. 5, the first text position extraction network includes a channel attention module. Specifically, the first text position extraction network performs dimension-increasing processing on the first image feature to obtain a first dimension-increasing image feature; the channel attention module determines the attention weight corresponding to each channel in the first dimension-increasing image feature; and weighted summation is performed on the data in each channel of the first dimension-increasing image feature according to those attention weights, to obtain the first text mask feature.
For example, the first text mask feature includes a plurality of sub-mask features, each corresponding to an image block, each sub-mask feature including a mask of 0 or 1, the mask representing whether text is contained in the corresponding image block, e.g., 0 indicates no text in the corresponding image block, and 1 indicates text in the corresponding image block.
According to the embodiment of the application, weighting the first dimension-increasing image feature through the attention mechanism gives the reconstructed first text mask feature higher accuracy and better resolution, which in turn improves the accuracy of subsequently determining whether the first video frame contains text and of the similarity between the first video frame and the second video frame.
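The per-channel attention weighting can be sketched in the style of a squeeze-and-excitation block. The reduction ratio, the layer shapes, and the final sum over channels (one possible reading of the "weighted summation on the data in each channel") are assumptions rather than details given by the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) dimension-increasing image feature
        weights = self.fc(x.mean(dim=(2, 3)))          # one attention weight per channel
        weighted = x * weights[:, :, None, None]       # re-weight every channel
        return weighted.sum(dim=1)                     # (N, H, W) spatial text mask feature

attention = ChannelAttention(channels=64)
text_mask_feature = attention(torch.rand(1, 64, 32, 32))
print(text_mask_feature.shape)  # torch.Size([1, 32, 32])
```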
As an example, performing the dimension-increasing processing on the first image feature to obtain the first dimension-increasing image feature may include: performing convolution processing on the first image feature to obtain a first convolution feature, and performing deconvolution processing on the first convolution feature to obtain a first deconvolution feature; fusing the first image feature with the first deconvolution feature to obtain a first fusion feature; and fusing the first fusion feature with the first deconvolution feature to obtain the first dimension-increasing image feature.
According to the embodiment of the application, more accurate feature reconstruction can be realized through deconvolution and multiple fusion processes, so that the matching performance between the first dimension-increasing image features and the first video frame is higher, and the accuracy of extracting the identification frame from the video is improved.
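A sketch of this convolution, deconvolution, and two-stage fusion is given below; implementing fusion as element-wise addition, as well as the kernel sizes and channel counts, are assumptions.

```python
import torch
import torch.nn as nn

class DimensionIncrease(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=3, stride=2,
                                         padding=1, output_padding=1)

    def forward(self, image_feature: torch.Tensor) -> torch.Tensor:
        conv_feature = self.conv(image_feature)            # first convolution feature
        deconv_feature = self.deconv(conv_feature)         # first deconvolution feature
        fused = image_feature + deconv_feature             # first fusion feature
        return fused + deconv_feature                      # dimension-increasing image feature

module = DimensionIncrease()
out = module(torch.rand(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32]) for even spatial sizes
```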
In some embodiments, dimension-increasing processing is performed on the second image feature to obtain a second dimension-increasing image feature; the attention weight corresponding to each channel in the second dimension-increasing image feature is determined; and weighted summation is performed on the data in each channel of the second dimension-increasing image feature according to those attention weights, to obtain the second text mask feature.
Continuing with fig. 5, the second text position extraction network includes a channel attention module. Specifically, the second text position extraction network performs dimension-increasing processing on the second image feature to obtain a second dimension-increasing image feature; the channel attention module determines the attention weight corresponding to each channel in the second dimension-increasing image feature; and weighted summation is performed on the data in each channel of the second dimension-increasing image feature according to those attention weights, to obtain the second text mask feature.
For example, the second text mask feature includes a plurality of sub-mask features, each corresponding to an image block, each sub-mask feature including a mask of 0 or 1, the mask representing whether text is contained in the corresponding image block, e.g., 0 indicates no text in the corresponding image block, and 1 indicates text in the corresponding image block.
According to the embodiment of the application, weighting the second dimension-increasing image feature through the attention mechanism gives the reconstructed second text mask feature higher accuracy and better resolution, which in turn improves the accuracy of subsequently determining whether the second video frame contains text and of the similarity between the first video frame and the second video frame.
For example, the first text position extraction network and the second text position extraction network may be neural network models with the same parameters or with different parameters. The neural network model may be of various types, such as a convolutional neural network model, a recurrent neural network model, or a multi-layer feedforward neural network model.
As an example, performing the dimension-increasing processing on the second image feature to obtain the second dimension-increasing image feature may include: performing convolution processing on the second image feature to obtain a second convolution feature, and performing deconvolution processing on the second convolution feature to obtain a second deconvolution feature; fusing the second image feature with the second deconvolution feature to obtain a second fusion feature; and fusing the second fusion feature with the second deconvolution feature to obtain the second dimension-increasing image feature.
According to the embodiment of the application, more accurate feature reconstruction can be realized through deconvolution and multiple fusion processes, so that the matching performance between the second dimension-increasing image features and the second video frame is higher, and the accuracy of extracting the identification frame from the video is improved.
In step S1013, the first text mask feature and the first image feature are subjected to fusion processing, and a first video frame feature is obtained.
In some embodiments, weights corresponding to the first text mask feature and the first image feature, respectively, are determined; and weighting and summing the first text mask feature and the first image feature based on weights respectively corresponding to the first text mask feature and the first image feature to obtain a first video frame feature.
Continuing with fig. 5, the weights respectively corresponding to the first text mask feature and the first image feature are determined through the first text position extraction network; and the first text mask feature and the first image feature are weighted and summed based on those weights to obtain the first video frame feature.
According to the embodiment of the application, the first text mask feature, which represents the text information of the first video frame, is fused with the first image feature, which represents its image information, so that identification frames can be extracted according to both the text information and the image information of the video frame. This addresses the problem in the related art that the identification frame repetition rate is high when the scene changes greatly between video frames while the text information remains unchanged, and thereby improves the accuracy of extracting identification frames from the video.
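A sketch of this weighted fusion follows; the two learnable scalar weights, the softmax normalization, and the assumption that the text mask feature and image feature share a common shape are illustrative choices not fixed by the text.

```python
import torch
import torch.nn as nn

class MaskImageFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.raw_weights = nn.Parameter(torch.zeros(2))    # weights for (mask, image)

    def forward(self, text_mask_feature: torch.Tensor, image_feature: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.raw_weights, dim=0)         # keep the two weights normalized
        return w[0] * text_mask_feature + w[1] * image_feature

fusion = MaskImageFusion()
video_frame_feature = fusion(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
print(video_frame_feature.shape)  # torch.Size([1, 64, 32, 32])
```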
In step S1014, the second text mask feature and the second image feature are fused to obtain a second video frame feature.
In some embodiments, weights corresponding to the second text mask feature and the second image feature, respectively, are determined; and weighting and summing the second text mask feature and the second image feature based on weights respectively corresponding to the second text mask feature and the second image feature to obtain a second video frame feature.
Continuing with fig. 5, the weights respectively corresponding to the second text mask feature and the second image feature are determined through the second text position extraction network; and the second text mask feature and the second image feature are weighted and summed based on those weights to obtain the second video frame feature.
According to the embodiment of the application, the second text mask feature, which represents the text information of the second video frame, is fused with the second image feature, which represents its image information, so that identification frames can be extracted according to both the text information and the image information of the video frame. This addresses the problem in the related art that the identification frame repetition rate is high when the scene changes greatly between video frames while the text information remains unchanged, and thereby improves the accuracy of extracting identification frames from the video.
In step S102, the first video frame feature is divided into a plurality of first video frame sub-features, and the second video frame feature is divided into a plurality of second video frame sub-features.
In some embodiments, the data of each channel corresponding to the first video frame feature is subjected to reduction processing to obtain a first reduced feature; and performing dimension reduction processing (e.g. pooling processing) on the first reduced feature in the horizontal direction to obtain a first dimension reduction feature, and dividing the first dimension reduction feature into a plurality of first video frame sub-features according to the horizontal direction.
Continuing with fig. 5, the data corresponding to each channel of the first video frame feature is reduced through a similarity network (for example, a twin network) to obtain a first reduced feature; dimension reduction is then performed on the first reduced feature in the horizontal direction to obtain a first dimension-reduced feature, which is divided into a plurality of first video frame sub-features along the horizontal direction.
In some embodiments, the data of each channel corresponding to the second video frame feature is subjected to reduction processing to obtain a second reduced feature; and performing dimension reduction processing (e.g. pooling processing) on the second reduced features in the horizontal direction to obtain second dimension reduction features, and dividing the second dimension reduction features into a plurality of second video frame sub-features according to the horizontal direction.
Continuing with fig. 5, the data corresponding to each channel of the second video frame feature is reduced through the similarity network to obtain a second reduced feature; dimension reduction is then performed on the second reduced feature in the horizontal direction to obtain a second dimension-reduced feature, which is divided into a plurality of second video frame sub-features along the horizontal direction.
For example, the reduction processing of the per-channel data of the first video frame feature and of the second video frame feature may consist of simultaneously subtracting the same preset value from the data of each channel in both features. This leaves the similarity relationship between the two features unchanged, reduces the complexity of the subsequent similarity calculation, and speeds up the similarity determination.
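Under one reading of step S102 and the note above, the division can be sketched as follows: subtract the same preset value from every channel, pool along one spatial axis, and split the result into horizontal parts; the part count, the pooling choice, and the axis convention are assumptions.

```python
import torch

def divide_into_sub_features(frame_feature: torch.Tensor, preset: float = 0.5,
                             num_parts: int = 4):
    """frame_feature: (C, H, W) -> list of num_parts sub-features, each of shape (C, H/num_parts)."""
    reduced = frame_feature - preset                     # subtract the same preset value per channel
    pooled = reduced.mean(dim=2)                         # (C, H): pool out one spatial axis
    return list(torch.chunk(pooled, num_parts, dim=1))   # split the remaining axis into parts

subs = divide_into_sub_features(torch.rand(64, 32, 32))
print(len(subs), subs[0].shape)  # 4 torch.Size([64, 8])
```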
In step S103, a similarity between the first video frame and the second video frame is determined based on the plurality of first video frame sub-features and the plurality of second video frame sub-features.
In some embodiments, the following processing is performed for each first video frame sub-feature: selecting, from the plurality of second video frame sub-features, the second video frame sub-feature at the same position as the first video frame sub-feature, and determining the similarity between the first video frame sub-feature and the selected second video frame sub-feature; and selecting the minimum similarity among the similarities corresponding to the plurality of first video frame sub-features as the similarity between the first video frame and the second video frame.
Continuing the example of fig. 5, the following processing is performed for each first video frame sub-feature through the similar network: selecting, from the plurality of second video frame sub-features, the second video frame sub-feature at the same position as the first video frame sub-feature, and determining the similarity between the first video frame sub-feature and the selected second video frame sub-feature; and selecting the minimum similarity among the similarities corresponding to the plurality of first video frame sub-features as the similarity between the first video frame and the second video frame.
As an example, determining the similarity between the first video frame sub-feature and the selected second video frame sub-feature may include: performing absolute value subtraction processing on the first video frame sub-feature and the selected second video frame sub-feature to obtain a video frame difference feature; mapping the video frame difference feature into probabilities corresponding to a plurality of candidate similarities; and determining the candidate similarity corresponding to the maximum probability as the similarity corresponding to the sub-feature of the first video frame.
For example, the absolute-value subtraction of the first video frame sub-feature and the second video frame sub-feature may be performed by subtracting, channel by channel, the data corresponding to each channel in the first video frame sub-feature and the second video frame sub-feature, and taking the absolute value of the result to obtain the video frame difference feature, where the number of channels of the video frame difference feature is the same as the number of channels of the first video frame sub-feature/second video frame sub-feature.
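A minimal sketch of this sub-feature comparison is given below, assuming a simple linear mapping to candidate-similarity probabilities; the weight matrix, bias, candidate similarity values, and all shapes are hypothetical stand-ins for the learned components:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sub_feature_similarity(sub_a, sub_b, W, b, candidate_sims):
    """Absolute-value subtraction, then mapping to candidate-similarity probabilities."""
    diff = np.abs(sub_a - sub_b).reshape(-1)      # video frame difference feature
    probs = softmax(W @ diff + b)                 # probability for each candidate similarity
    return candidate_sims[int(np.argmax(probs))]  # candidate with the maximum probability

def frame_similarity(first_subs, second_subs, W, b, candidate_sims):
    # Compare sub-features at the same position and keep the minimum similarity.
    sims = [sub_feature_similarity(a, s, W, b, candidate_sims)
            for a, s in zip(first_subs, second_subs)]
    return min(sims)

candidate_sims = np.linspace(0.0, 1.0, 11)            # assumed candidate similarities
W = np.random.randn(len(candidate_sims), 256) * 0.01  # stands in for learned weights
b = np.zeros(len(candidate_sims))
first_subs = [np.random.rand(256, 1, 1) for _ in range(28)]
second_subs = [np.random.rand(256, 1, 1) for _ in range(28)]
print(frame_similarity(first_subs, second_subs, W, b, candidate_sims))
```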
According to the embodiment of the application, the minimum similarity among the similarities corresponding to the sub-features of the plurality of first video frames is used as the similarity between the first video frames and the second video frames, so that the repetition rate of the identification frames determined later can be reduced. And the similarity between the first video frame and the second video frame is determined based on the divided local first video frame sub-features and the second video frame sub-features, so that the accuracy of the determined similarity is higher than that of the similarity determined based on the global first video frame features and the global second video frame features, the identification frames with different local information can be accurately extracted subsequently, the repetition rate of the extracted identification frames is reduced, and the accuracy of extracting the identification frames from the video is improved.
In step S104, an identification frame in the video is determined according to the similarity between the first video frame and the second video frame.
In some embodiments, when the similarity between the first video frame and the second video frame exceeds a similarity threshold, the first video frame or the second video frame is used as an identification frame in the video; and when the similarity between the first video frame and the second video frame does not exceed the similarity threshold, taking the first video frame and the second video frame as identification frames in the video.
As an example, the similarity threshold may be a parameter obtained during training of the machine learning model, or may be a value set by a user, a client, or a server.
The similarity between the first video frame and the second video frame characterizes the likelihood that the first video frame and the second video frame are similar: the greater the similarity, the greater the likelihood that the two frames are similar, and the smaller the similarity, the smaller the likelihood that they are similar. In this way, selecting only one video frame from each video frame pair (including the first video frame and the second video frame) whose similarity exceeds the similarity threshold as the identification frame can reduce the repetition rate of the extracted identification frames, thereby improving the accuracy of extracting the identification frame from the video.
In some embodiments, before step S104, it may further include: classifying the first video frame to obtain a classification result of whether the first video frame contains text; and classifying the second video frame to obtain a classification result of whether the second video frame contains text. As such, step S104 may determine an identification frame (or text identification frame) in the video according to the classification result and the similarity between the first video frame and the second video frame.
As an example, performing a classification process on the first video frame to obtain a classification result of whether the first video frame contains text may include: and extracting first image features from the first video frame, extracting first text mask features from the first image features, and performing classification processing on the first text mask features to obtain a classification result of whether the first video frame contains text.
Continuing the example of fig. 5, the first text mask feature is classified through the first classification network to obtain a classification result of whether the first video frame contains text.
For example, when the mask in the first text mask feature includes a mask of type 1, determining that the first video frame contains text; when the mask in the first text mask feature does not include a mask of type 1, it is determined that the first video frame does not contain text.
As an example, performing a classification process on the second video frame to obtain a classification result of whether the second video frame contains text may include: and extracting second image features from the second video frames, extracting second text mask features from the second image features, and performing classification processing on the second text mask features to obtain a classification result of whether the second video frames contain texts.
Continuing the example of fig. 5, the second text mask feature is classified through the second classification network to obtain a classification result of whether the second video frame contains text.
For example, when the mask in the second text mask feature includes a mask of type 1, determining that the second video frame contains text; when the mask in the second text mask feature does not include a mask of type 1, it is determined that the second video frame does not contain text.
As an example, determining the identification frame in the video based on the classification result and the similarity between the first video frame and the second video frame may include: when the classification result is that the first video frame contains text, the second video frame contains text, and the similarity between the first video frame and the second video frame does not exceed a similarity threshold, the first video frame and the second video frame are used as identification frames in the video; when the classification result is that the first video frame contains text, the second video frame contains text, and the similarity between the first video frame and the second video frame exceeds a similarity threshold, the first video frame or the second video frame is used as an identification frame in the video; when the classification result is that the first video frame contains text and the second video frame does not contain text, the first video frame is used as an identification frame in the video; when the classification result is that the second video frame contains text and the first video frame does not contain text, the second video frame is used as an identification frame in the video.
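The selection logic described above can be summarized by the following sketch; the function and parameter names are illustrative and not part of the original disclosure:

```python
def select_identification_frames(first_frame, second_frame,
                                 first_has_text, second_has_text,
                                 similarity, threshold):
    """Return the identification frame(s) for one video frame pair."""
    if first_has_text and second_has_text:
        if similarity > threshold:
            return [first_frame]               # similar pair: keep either frame
        return [first_frame, second_frame]     # dissimilar pair: keep both frames
    if first_has_text:
        return [first_frame]                   # only the first frame contains text
    if second_has_text:
        return [second_frame]                  # only the second frame contains text
    return []                                  # neither frame contains text
```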
For example, after determining the identification frame in the video, text recognition processing may be performed on the identification frame to obtain a text recognition result; and classifying the videos according to the character recognition result.
According to the embodiment of the application, the video frames containing texts are selected from the video frame pairs (comprising the first video frame and the second video frame) with the similarity exceeding the similarity threshold as the text identification frames, so that the repetition rate of the extracted text identification frames can be reduced, the extracted text identification frames can be ensured to contain text information, and the subsequent processing of the text identification frames, such as classification of videos, can be facilitated.
The video processing method provided by the embodiment of the application is described below by taking a specific application scenario as an example.
Video identification frame detection selects a small number of video frames or video segments to describe the story occurring in the video, which can improve the efficiency of understanding the important content of the video. The identification frames extracted from the video can also serve as a basis for video classification, and an identification frame can be displayed as a video cover. The embodiment of the application can detect video frames containing different text in the video, can effectively process videos containing complex text scenes (such as street-scene text), and can obtain more stable and more accurate identification frames under video frames at multiple sampling rates.
Referring to fig. 6, fig. 6 is a flowchart of a video processing method according to an embodiment of the present application. In fig. 6, the video processing method according to the embodiment of the present application may be divided into three stages, wherein the first stage is to perform frame division processing on a video through a video stream decoding interface, the second stage is to determine inter-frame similarity of an extracted video frame and whether the extracted video frame contains text through a twin network by using an image segmentation technique and a local similarity matching technique, and the third stage is to determine an identification frame according to the inter-frame similarity of the extracted video frame and whether the extracted video frame contains text.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a twin network according to an embodiment of the present application, and the above three stages will be described with reference to fig. 7.
(I) The video is decoded into successive video frames using a video decoding tool, such as ffmpeg.
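As a simple illustration of this decoding stage, the frames can be read with OpenCV as one possible alternative to an ffmpeg-based decoder; the function name and the sampling parameter are assumptions:

```python
import cv2  # OpenCV used here only as one possible decoding tool

def decode_video(path, sample_rate=1):
    """Decode a video into successive frames; keep every `sample_rate`-th frame."""
    capture = cv2.VideoCapture(path)
    frames, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % sample_rate == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```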
(II) Adjacent first and second video frames are extracted to form a video frame pair, and the video frame pair is taken as the input of a twin network to determine the similarity between the first video frame and the second video frame and whether the first video frame and the second video frame contain text.
(1) It is determined whether the first video frame and the second video frame contain text.
In some embodiments, each of the two video frames is divided into a 28 × 28 grid of image blocks, and a prediction is made as to whether text is contained within each image block.
As an example, a Resnet18 network is used as the model framework, and the input of the Resnet18 network is modified to 448 × 448. Unlike image segmentation, where pixel-level masks are predicted, embodiments of the present application use only the output of the 4_2-th convolutional layer in the Resnet18 network to generate a feature map (or feature) of size 256 × 28 × 28. To locate the text position in a video frame more accurately, the text position extraction network in fig. 7 first uses an upsampling technique to concatenate the outputs of the Resnet18 network and the text position extraction network, and then performs more accurate feature reconstruction by deconvolution of small feature maps (14 × 14), thereby locating the text position in a video frame more effectively. At the same time, the text position extraction network also introduces a channel attention module (Channel Attention Module); the channel attention module weights the 256 channels of the generated feature map to generate a corresponding 28 × 28 text mask area (i.e. the text mask feature), where the text mask area corresponds to 28 × 28 tiles, each tile corresponds to an image block, and each tile has a mask of 0 or 1 that represents whether there is text in the corresponding image block; for example, 0 represents that the image block contains no text, and 1 represents that the image block contains text. The specific process of generating the text mask area may include: (1) flattening each 28 × 28 channel of the feature map into a 784-dimensional vector to obtain a 256 × 784 feature map; (2) denoting the resulting 256 features of 784 dimensions as f_i; (3) using the attention mechanism, calculating weights for the 256 features and performing a weighted sum to generate the final 784-dimensional vector representation. The specific calculation method is shown in formulas (1) - (3):
e_i = W_i · f_i + b_i    (1)
α_i = exp(e_i) / Σ_j exp(e_j)    (2)
f_attn = Σ_i α_i · f_i    (3)
where W_i and b_i are learnable parameters of the text position extraction network, e_i is an intermediate encoding vector, f_attn is the 784-dimensional vector representation generated using the attention mechanism, α_i is the weight of the i-th channel, and j indexes the 256 features in the normalization of formula (2). In order to accurately identify the position of a text block in an image, the embodiment of the application introduces an L2 norm loss function to train the text position extraction network.
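A compact numerical sketch of formulas (1) - (3) follows; the per-channel weight matrix and bias, as well as the final reshaping and the binarization step, are illustrative assumptions rather than the exact parameterization of the disclosure:

```python
import numpy as np

def channel_attention(feature_map, W, b):
    """feature_map: (256, 28, 28); W: (256, 784); b: (256,)."""
    f = feature_map.reshape(256, -1)            # each f_i is a 784-dimensional vector
    e = (W * f).sum(axis=1) + b                 # formula (1): e_i = W_i · f_i + b_i
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                 # formula (2): softmax over the 256 channels
    f_attn = (alpha[:, None] * f).sum(axis=0)   # formula (3): weighted sum -> 784-d vector
    # Reshape back to the 28 x 28 mask area; a binarization to 0/1 per tile
    # (not shown) would follow to obtain the final text mask.
    return f_attn.reshape(28, 28)

mask_area = channel_attention(np.random.rand(256, 28, 28),
                              np.random.randn(256, 784) * 0.01,
                              np.zeros(256))
```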
In some embodiments, after obtaining the text mask areas of the first video frame and the second video frame, it may be determined whether the first video frame and the second video frame contain text, e.g., when the mask in the text mask area of the first video frame includes a mask of type 1, it is determined that the first video frame contains text; when the mask in the text mask area of the first video frame does not include a mask of type 1, determining that the first video frame does not contain text; determining that the second video frame contains text when the mask in the text mask area of the second video frame includes a mask of type 1; when the mask in the text mask area of the second video frame does not include a mask of type 1, it is determined that the second video frame does not contain text.
(2) A similarity between the first video frame and the second video frame is determined.
In some embodiments, in order to calculate the similarity between the first video frame and the second video frame, embodiments of the present application introduce a local similarity matching technique and a twin network. After the text mask area is obtained, it is weighted and fused with the 256 × 28 × 28 feature map (i.e. the image feature) output by the 4_2-th convolution layer in the Resnet18 network, so as to obtain a corresponding weighted feature map (i.e. the video frame feature). The similarity network first performs subtraction processing on the corresponding positions of the two weighted feature maps, then performs pooling processing on the resulting 256 × 28 × 28 feature map in the horizontal direction to pool it into a 256 × 28 × 1 feature map, divides the 256 × 28 × 1 feature map into 28 parts along the horizontal direction, performs similarity comparison on each of the 28 parts to obtain 28 similarity scores, and finally selects the lowest of the 28 similarity scores as the similarity between the first video frame and the second video frame.
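The weighted fusion of the text mask area with the image feature described above can be sketched as follows; the fusion weights are hypothetical, since the disclosure only states that the two are weighted and fused:

```python
import numpy as np

def fuse_mask_and_feature(image_feature, text_mask, w_img=0.5, w_mask=0.5):
    """image_feature: (256, 28, 28); text_mask: (28, 28) mask area."""
    # Broadcast the mask over the 256 channels and combine with assumed weights.
    return w_img * image_feature + w_mask * text_mask[None, :, :]

weighted_feature = fuse_mask_and_feature(np.random.rand(256, 28, 28),
                                         np.random.randint(0, 2, (28, 28)))
```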
In some embodiments, the output of the twin network covers two tasks, respectively: determining whether the first video frame and the second video frame contain text, and determining the similarity between the first video frame and the second video frame. The embodiment of the application can realize the joint training of the two different tasks through a multi-task loss function. Specifically, the input data is assumed to include a sample: a first image x_1 and a second image x_2, together with annotation data for the sample: the position y_1 of the text block contained in the first image x_1, the position y_2 of the text block contained in the second image x_2, and the similarity y(x_1, x_2) between the first image x_1 and the second image x_2. The multi-task loss function is shown in formula (4).
where L_text_mask(·) is the loss function of the text mask module (including the convolutional network and the text position extraction network in fig. 7), L_sim(·) is the loss function of the similar module (including the convolutional network and the similar network in fig. 7), L_2(·) is the L2 norm, the outputs of the text position extraction network for the first image x_1 and the second image x_2 are compared against the annotated positions y_1 and y_2 respectively, p(x_1, x_2) is the output of the similar network for the first image x_1 and the second image x_2, α is the weight of the loss function of the text mask module, and β is the weight of the loss function of the similar module.
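Since formula (4) itself is not reproduced here, the following is only an assumed instantiation of the described multi-task loss: L2 text-mask terms for the two images plus a similarity term, weighted by α and β, with binary cross-entropy chosen arbitrarily for the similarity part:

```python
import numpy as np

def multitask_loss(mask_pred_1, y1, mask_pred_2, y2,
                   sim_pred, sim_label, alpha=1.0, beta=1.0):
    """Assumed form: alpha * L_text_mask + beta * L_sim."""
    # L2-norm text-mask loss over both images of the sample.
    l_text_mask = np.sum((mask_pred_1 - y1) ** 2) + np.sum((mask_pred_2 - y2) ** 2)
    # Binary cross-entropy as one possible choice for the similarity loss.
    eps = 1e-7
    p = np.clip(sim_pred, eps, 1 - eps)
    l_sim = -(sim_label * np.log(p) + (1 - sim_label) * np.log(1 - p))
    return alpha * l_text_mask + beta * l_sim
```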
As an example, an Adam optimizer may be used to train the twin network, where the initial learning rate may be set to 0.0005 and the learning rate may be reduced to 0.1 times its previous value after every 30 training epochs. The number of samples for mini-batch training (Mini-Batch) may be set to 64. During training of the twin network, the momentum (Momentum) and the weight decay may be set to 0.9 and 0.0001, respectively. In order to train the similar module more effectively, the weight of the loss function of the text mask module and the weight of the loss function of the similar module may both be set to 1.
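These hyper-parameters could be wired up as in the following PyTorch sketch; the tiny stand-in network and random dataset are placeholders, and treating the stated momentum of 0.9 as Adam's beta1 is an assumption:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder network and data; only the hyper-parameters mirror the text above.
twin_network = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
train_dataset = TensorDataset(torch.randn(8, 3, 64, 64), torch.randint(0, 2, (8,)))

optimizer = torch.optim.Adam(twin_network.parameters(),
                             lr=0.0005,            # initial learning rate
                             betas=(0.9, 0.999),   # beta1 = 0.9 as the stated momentum (assumption)
                             weight_decay=0.0001)  # stated weight decay
# Multiply the learning rate by 0.1 after every 30 training epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # mini-batch size of 64
```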
(III) When the similarity between the first video frame and the second video frame is higher than the similarity threshold and both frames contain text, the two frames are merged into one frame (that is, either one of the two frames is discarded).
By this strategy, after the video frames that do not contain text and the redundant similar frames are filtered out, the identification frames of the video remain.
The embodiment of the application can extract the identification frames in the video, so that the identification frames can be used in place of all the video frames of the video for in-frame text detection and recognition (for example, only the text in the identification frames is recognized, the recognition result can be used as keywords of the video, and the keywords can serve as a basis for classifying the video), thereby reducing the processing time of the video.
An exemplary structure of the video processing apparatus according to the embodiment of the present application implemented as a software module is described below with reference to fig. 2.
In some embodiments, as shown in fig. 2, the software modules stored in the video processing device 243 of the memory 240 may include:
An extraction module 2431 configured to extract a first video frame feature from a first video frame of a video and a second video frame feature from a second video frame of the video; a dividing module 2432 configured to divide the first video frame feature into a plurality of first video frame sub-features and divide the second video frame feature into a plurality of second video frame sub-features; a similarity module 2433 configured to determine a similarity between the first video frame and the second video frame based on the plurality of first video frame sub-features and the plurality of second video frame sub-features; and a determining module 2434 configured to determine an identification frame in the video according to the similarity between the first video frame and the second video frame.
In the above-mentioned scheme, the similarity module 2433 is further configured to perform the following processing for each first video frame sub-feature: selecting a second video frame sub-feature corresponding to the same position as the first video frame sub-feature from a plurality of second video frame sub-features, and determining the similarity between the first video frame sub-feature and the selected second video frame sub-feature; and selecting the minimum similarity from the similarities corresponding to the sub-features of the plurality of first video frames as the similarity between the first video frames and the second video frames.
In the above scheme, the similarity module 2433 is further configured to perform absolute value subtraction on the first video frame sub-feature and the selected second video frame sub-feature to obtain a video frame difference feature; mapping the video frame difference feature into probabilities corresponding to a plurality of candidate similarities; and determining the candidate similarity corresponding to the maximum probability as the similarity corresponding to the sub-feature of the first video frame.
In the above aspect, the extracting module 2431 is further configured to extract a first image feature from the first video frame and extract a second image feature from the second video frame; extracting a first text mask feature from the first image feature and a second text mask feature from the second image feature; fusing the first text mask feature and the first image feature to obtain a first video frame feature; and carrying out fusion processing on the second text mask feature and the second image feature to obtain a second video frame feature.
In the above scheme, the extracting module 2431 is further configured to perform dimension-increasing processing on the first image feature to obtain a first dimension-increasing image feature; determine the attention weight corresponding to each channel in the first dimension-increasing image feature; and perform weighted summation on the data in each channel of the first dimension-increasing image feature according to the attention weight corresponding to each channel in the first dimension-increasing image feature, to obtain a first text mask feature.
In the above scheme, the extracting module 2431 is further configured to perform dimension-increasing processing on the second image feature to obtain a second dimension-increasing image feature; determine the attention weight corresponding to each channel in the second dimension-increasing image feature; and perform weighted summation on the data in each channel of the second dimension-increasing image feature according to the attention weight corresponding to each channel in the second dimension-increasing image feature, to obtain a second text mask feature.
In the above scheme, the extracting module 2431 is further configured to perform convolution processing on the first image feature to obtain a first convolution feature, and perform deconvolution processing on the first convolution feature to obtain a first deconvolution feature; performing fusion processing on the first image feature and the first deconvolution feature to obtain a first fusion feature; and carrying out fusion processing on the first fusion feature and the first deconvolution feature to obtain a first dimension-increasing image feature.
In the above scheme, the extracting module 2431 is further configured to perform convolution processing on the second image feature to obtain a second convolution feature, and perform deconvolution processing on the second convolution feature to obtain a second deconvolution feature; performing fusion processing on the second image feature and the second deconvolution feature to obtain a second fusion feature; and carrying out fusion processing on the second fusion feature and the second deconvolution feature to obtain a second dimension-increasing image feature.
In the above solution, the extracting module 2431 is further configured to determine weights corresponding to the first text mask feature and the first image feature, respectively; and weighting and summing the first text mask feature and the first image feature based on weights respectively corresponding to the first text mask feature and the first image feature to obtain a first video frame feature.
In the above solution, the extracting module 2431 is further configured to determine weights corresponding to the second text mask feature and the second image feature, respectively; and weighting and summing the second text mask feature and the second image feature based on weights respectively corresponding to the second text mask feature and the second image feature to obtain a second video frame feature.
In the above scheme, the extracting module 2431 is further configured to divide the first video frame into a plurality of first image blocks, and perform feature extraction processing on each first image block to obtain a plurality of first image sub-features corresponding to the plurality of first image blocks one to one; and combining the plurality of first image sub-features to obtain first image features.
In the above scheme, the extracting module 2431 is further configured to divide the second video frame into a plurality of second image blocks, and perform feature extraction processing on each second image block to obtain a plurality of second image sub-features corresponding to the plurality of second image blocks one to one; and combining the plurality of second image sub-features to obtain a second image feature.
In the above scheme, the dividing module 2432 is further configured to perform reduction processing on the data corresponding to each channel in the first video frame feature to obtain a first reduced feature; and perform dimension reduction processing on the first reduced feature in the horizontal direction to obtain a first dimension reduction feature, and divide the first dimension reduction feature into a plurality of first video frame sub-features according to the horizontal direction.
In the above scheme, the dividing module 2432 is further configured to perform reduction processing on the data corresponding to each channel in the second video frame feature to obtain a second reduced feature; and performing dimension reduction processing on the second reduced feature in the horizontal direction to obtain a second dimension reduction feature, and dividing the second dimension reduction feature into a plurality of second video frame sub-features according to the horizontal direction.
In the above-described aspect, the video processing apparatus 243 further includes: the classification module is used for extracting first image features from the first video frame and extracting second image features from the second video frame; extracting a first text mask feature from the first image feature and a second text mask feature from the second image feature; and classifying the first text mask features to obtain a classification result of whether the first video frame contains text, and classifying the second text mask features to obtain a classification result of whether the second video frame contains text.
In the above solution, the similarity module 2433 is further configured to, when the classification result is that the first video frame contains text, the second video frame contains text, and the similarity between the first video frame and the second video frame does not exceed the similarity threshold, use the first video frame and the second video frame as the identification frame in the video; when the classification result is that the first video frame contains text, the second video frame contains text, and the similarity between the first video frame and the second video frame exceeds a similarity threshold, the first video frame or the second video frame is used as an identification frame in the video; when the classification result is that the first video frame contains text and the second video frame does not contain text, the first video frame is used as an identification frame in the video; when the classification result is that the second video frame contains text and the first video frame does not contain text, the second video frame is used as an identification frame in the video.
In the above solution, the similarity module 2433 is further configured to use the first video frame or the second video frame as the identification frame in the video when the similarity between the first video frame and the second video frame exceeds the similarity threshold; and when the similarity between the first video frame and the second video frame does not exceed the similarity threshold, taking the first video frame and the second video frame as identification frames in the video.
In some embodiments, the logic of the video processing method provided by the embodiment of the present application may be implemented in a smart contract, where different nodes determine the identification frames of the video by calling their respective smart contracts, and the final identification frames are determined by taking the intersection. The embodiment of the application can further improve the accuracy of extracting the identification frame from the video through the cooperative processing among the plurality of nodes.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the video processing method according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, cause the processor to perform the video processing method provided by the embodiments of the present application, for example, the video processing methods shown in fig. 3 and 4, where the computer includes various computing devices including a smart terminal and a server.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiment of the application, the similarity between the first video frame and the second video frame is determined based on the divided local first video frame sub-feature and the second video frame sub-feature, and compared with the similarity determined based on the global first video frame feature and the global second video frame feature, the accuracy of the determined similarity is higher, so that the identification frames with different local information can be accurately extracted later, the repetition rate of the extracted identification frames is reduced, and the accuracy of extracting the identification frames from the video is improved.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (14)

1. A method of video processing, the method comprising:
extracting first video frame features from first video frames of a video and extracting second video frame features from second video frames of the video;
Dividing the first video frame feature into a plurality of first video frame sub-features and dividing the second video frame feature into a plurality of second video frame sub-features;
The following is performed for each first video frame sub-feature: selecting a second video frame sub-feature corresponding to the same position as the first video frame sub-feature from the plurality of second video frame sub-features, and determining the similarity between the first video frame sub-feature and the selected second video frame sub-feature;
Selecting the minimum similarity from the similarities corresponding to the sub-features of the plurality of first video frames as the similarity between the first video frames and the second video frames;
And determining an identification frame in the video according to the similarity between the first video frame and the second video frame.
2. The method of claim 1, wherein the determining the similarity between the first video frame sub-feature and the selected second video frame sub-feature comprises:
Performing absolute value subtraction processing on the first video frame sub-feature and the selected second video frame sub-feature to obtain a video frame difference feature;
Mapping the video frame difference feature to probability corresponding to a plurality of candidate similarities;
and determining the candidate similarity corresponding to the maximum probability as the similarity corresponding to the first video frame sub-feature.
3. The method of claim 1, wherein extracting a first video frame feature from a first video frame of a video and extracting a second video frame feature from a second video frame of the video comprises:
extracting first image features from the first video frame and second image features from the second video frame;
extracting a first text mask feature from the first image feature and a second text mask feature from the second image feature;
Performing fusion processing on the first text mask feature and the first image feature to obtain the first video frame feature;
and carrying out fusion processing on the second text mask feature and the second image feature to obtain the second video frame feature.
4. The method of claim 3, wherein
The extracting a first text mask feature from the first image feature includes:
performing dimension-increasing processing on the first image features to obtain first dimension-increasing image features;
determining the attention weight corresponding to each channel in the first dimension-increasing image characteristic;
According to the attention weight corresponding to each channel in the first dimension-increasing image feature, weighting and summing the data in each channel in the first dimension-increasing image feature to obtain the first text mask feature;
The extracting a second text mask feature from the second image feature includes:
performing dimension-increasing processing on the second image features to obtain second dimension-increasing image features;
determining the attention weight corresponding to each channel in the second dimension-increasing image characteristic;
And carrying out weighted summation on the data in each channel in the second dimension-increasing image feature according to the attention weight corresponding to each channel in the second dimension-increasing image feature to obtain the second text mask feature.
5. The method of claim 4, wherein
The step of performing the dimension-increasing processing on the first image feature to obtain a first dimension-increasing image feature includes:
Performing convolution processing on the first image feature to obtain a first convolution feature, and performing deconvolution processing on the first convolution feature to obtain a first deconvolution feature;
performing fusion processing on the first image feature and the first deconvolution feature to obtain a first fusion feature;
Performing fusion processing on the first fusion feature and the first deconvolution feature to obtain the first dimension-increasing image feature;
The step of performing the dimension-increasing processing on the second image feature to obtain a second dimension-increasing image feature includes:
Performing convolution processing on the second image feature to obtain a second convolution feature, and performing deconvolution processing on the second convolution feature to obtain a second deconvolution feature;
Performing fusion processing on the second image feature and the second deconvolution feature to obtain a second fusion feature;
And carrying out fusion processing on the second fusion feature and the second deconvolution feature to obtain the second dimension-increasing image feature.
6. The method of claim 3, wherein
The fusing the first text mask feature and the first image feature to obtain the first video frame feature includes:
Determining weights corresponding to the first text mask feature and the first image feature respectively;
Weighting and summing the first text mask feature and the first image feature based on weights respectively corresponding to the first text mask feature and the first image feature to obtain the first video frame feature;
the fusing the second text mask feature and the second image feature to obtain the second video frame feature includes:
Determining weights corresponding to the second text mask feature and the second image feature respectively;
And carrying out weighted summation on the second text mask feature and the second image feature based on weights respectively corresponding to the second text mask feature and the second image feature to obtain the second video frame feature.
7. The method of claim 3, wherein
The extracting a first image feature from the first video frame includes:
Dividing the first video frame into a plurality of first image blocks, and carrying out feature extraction processing on each first image block to obtain a plurality of first image sub-features corresponding to the plurality of first image blocks one by one;
combining the plurality of first image sub-features to obtain the first image features;
the extracting a second image feature from the second video frame includes:
Dividing the second video frame into a plurality of second image blocks, and carrying out feature extraction processing on each second image block to obtain a plurality of second image sub-features which are in one-to-one correspondence with the plurality of second image blocks;
And combining the plurality of second image sub-features to obtain the second image features.
8. The method of claim 1, wherein
The dividing the first video frame feature into a plurality of first video frame sub-features includes:
the data of each channel corresponding to the first video frame characteristic is subjected to reduction processing to obtain a first reduction characteristic;
performing dimension reduction processing on the first reduced feature in the horizontal direction to obtain a first dimension reduction feature, and dividing the first dimension reduction feature into a plurality of first video frame sub-features according to the horizontal direction;
the dividing the second video frame feature into a plurality of second video frame sub-features includes:
The data of each channel corresponding to the second video frame characteristic is subjected to reduction processing to obtain a second reduction characteristic;
and performing dimension reduction processing on the second reduced feature in the horizontal direction to obtain a second dimension reduction feature, and dividing the second dimension reduction feature into a plurality of second video frame sub-features according to the horizontal direction.
9. The method according to claim 1, wherein the method further comprises:
extracting first image features from the first video frame and second image features from the second video frame;
extracting a first text mask feature from the first image feature and a second text mask feature from the second image feature;
And classifying the first text mask features to obtain a classification result of whether the first video frame contains text, and classifying the second text mask features to obtain a classification result of whether the second video frame contains text.
10. The method of claim 9, wherein the determining the identified frame in the video based on the similarity between the first video frame and the second video frame comprises:
When the classification result is that the first video frame contains text, the second video frame contains text, and the similarity between the first video frame and the second video frame does not exceed a similarity threshold, the first video frame and the second video frame are used as identification frames in the video;
When the classification result is that the first video frame contains text, the second video frame contains text, and the similarity between the first video frame and the second video frame exceeds a similarity threshold, the first video frame or the second video frame is used as an identification frame in the video;
When the classification result is that the first video frame contains text and the second video frame does not contain text, the first video frame is used as an identification frame in the video;
And when the classification result is that the second video frame contains text and the first video frame does not contain text, the second video frame is used as an identification frame in the video.
11. The method of claim 1, wherein the determining the identified frame in the video based on the similarity between the first video frame and the second video frame comprises:
When the similarity between the first video frame and the second video frame exceeds a similarity threshold, taking the first video frame or the second video frame as an identification frame in the video;
And when the similarity between the first video frame and the second video frame does not exceed a similarity threshold, taking the first video frame and the second video frame as identification frames in the video.
12. A video processing apparatus, the apparatus comprising:
The extraction module is used for extracting first video frame characteristics from first video frames of the video and extracting second video frame characteristics from second video frames of the video;
The dividing module is used for dividing the first video frame characteristic into a plurality of first video frame sub-characteristics and dividing the second video frame characteristic into a plurality of second video frame sub-characteristics;
a similarity module for performing the following processing for each first video frame sub-feature: selecting a second video frame sub-feature corresponding to the same position as the first video frame sub-feature from the plurality of second video frame sub-features, and determining the similarity between the first video frame sub-feature and the selected second video frame sub-feature; selecting the minimum similarity from the similarities corresponding to the sub-features of the plurality of first video frames as the similarity between the first video frames and the second video frames;
and the determining module is used for determining the identification frame in the video according to the similarity between the first video frame and the second video frame.
13. An electronic device, comprising:
a memory for storing computer executable instructions;
A processor for implementing the video processing method of any one of claims 1 to 11 when executing computer executable instructions stored in said memory.
14. A computer readable storage medium storing computer executable instructions which when executed are adapted to implement the video processing method of any one of claims 1 to 11.
CN202110285487.2A 2021-03-17 2021-03-17 Video processing method, device, electronic equipment and storage medium Active CN113011320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110285487.2A CN113011320B (en) 2021-03-17 2021-03-17 Video processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110285487.2A CN113011320B (en) 2021-03-17 2021-03-17 Video processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113011320A CN113011320A (en) 2021-06-22
CN113011320B true CN113011320B (en) 2024-06-21

Family

ID=76409160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110285487.2A Active CN113011320B (en) 2021-03-17 2021-03-17 Video processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113011320B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419517A (en) * 2022-01-27 2022-04-29 腾讯科技(深圳)有限公司 Video frame processing method and device, computer equipment and storage medium
US20230334839A1 (en) * 2022-04-19 2023-10-19 Lemon Inc. Feature extraction
CN117909865A (en) * 2024-01-12 2024-04-19 中南大学 Data driving fault diagnosis method and system for multi-sampling rate industrial process

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083741A (en) * 2019-04-11 2019-08-02 中国科学技术大学 Text combines the video abstraction extraction method towards personage of modeling with image
CN110147745A (en) * 2019-05-09 2019-08-20 深圳市腾讯计算机***有限公司 A kind of key frame of video detection method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804980B (en) * 2017-04-28 2022-01-04 阿里巴巴(中国)有限公司 Video scene switching detection method and device
US20200380263A1 (en) * 2019-05-29 2020-12-03 Gyrfalcon Technology Inc. Detecting key frames in video compression in an artificial intelligence semiconductor solution
CN111368656A (en) * 2020-02-21 2020-07-03 华为技术有限公司 Video content description method and video content description device
CN112183275A (en) * 2020-09-21 2021-01-05 北京达佳互联信息技术有限公司 Video description information generation method and device and server
CN112101329B (en) * 2020-11-19 2021-03-30 腾讯科技(深圳)有限公司 Video-based text recognition method, model training method and model training device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083741A (en) * 2019-04-11 2019-08-02 中国科学技术大学 Text combines the video abstraction extraction method towards personage of modeling with image
CN110147745A (en) * 2019-05-09 2019-08-20 深圳市腾讯计算机***有限公司 A kind of key frame of video detection method and device

Also Published As

Publication number Publication date
CN113011320A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
CN111898696B (en) Pseudo tag and tag prediction model generation method, device, medium and equipment
WO2021190451A1 (en) Method and apparatus for training image processing model
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN113011320B (en) Video processing method, device, electronic equipment and storage medium
EP3757905A1 (en) Deep neural network training method and apparatus
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN110852256A (en) Method, device and equipment for generating time sequence action nomination and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN116958323A (en) Image generation method, device, electronic equipment, storage medium and program product
CN114359775A (en) Key frame detection method, device, equipment, storage medium and program product
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN116958324A (en) Training method, device, equipment and storage medium of image generation model
CN116863003A (en) Video generation method, method and device for training video generation model
CN113657272B (en) Micro video classification method and system based on missing data completion
CN114529761A (en) Video classification method, device, equipment, medium and product based on classification model
CN115292439A (en) Data processing method and related equipment
CN116975347A (en) Image generation model training method and related device
CN113901330B (en) Video searching method and device, electronic equipment and storage medium
CN113822117B (en) Data processing method, device and computer readable storage medium
CN116992947A (en) Model training method, video query method and device
Xu et al. Deep Neural Network‐Based Sports Marketing Video Detection Research
CN116542292B (en) Training method, device, equipment and storage medium of image generation model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40047265

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant