CN111405288A - Video frame extraction method and device, electronic equipment and computer readable storage medium - Google Patents

Video frame extraction method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN111405288A
CN111405288A (application number CN202010198341.XA)
Authority
CN
China
Prior art keywords
video
frame
decoding
group
image frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010198341.XA
Other languages
Chinese (zh)
Inventor
施磊 (Shi Lei)
周峰 (Zhou Feng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010198341.XA
Publication of CN111405288A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/33: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/587: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/593: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/59: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments of the disclosure provide a video frame extraction method and device, an electronic device, and a computer-readable storage medium. The method includes the following steps: determining groups of pictures serving as coding units in a video, each group of pictures comprising a plurality of image frames; acquiring a decoding timestamp corresponding to a time point to be extracted, and searching for the group of pictures that includes the decoding timestamp, the time point to be extracted corresponding to a target image frame in the video; and decoding the searched group of pictures while ignoring the image frames other than the target image frame in that group, to obtain the target image frame in the video. The embodiments of the disclosure can reduce the waiting time during decoding and improve frame extraction efficiency.

Description

Video frame extraction method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to video processing technologies, and in particular, to a method and an apparatus for video frame extraction, an electronic device, and a computer-readable storage medium.
Background
A video is a sequence of image frames. Because the raw data volume of a video is usually large, the original video is typically encoded and compressed before operations such as transmission and storage are performed. Therefore, in some video processing scenarios, such as video editing, the encoded video must be decoded, i.e., frame extraction must be performed.
In the solutions provided by the related art, all image frames in the video are usually decoded, and then the image frames required for video processing (such as video editing) are filtered out. However, this decoding approach causes excessive waiting time and low frame extraction efficiency, and cannot meet the timeliness requirements of video processing.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, an embodiment of the present disclosure provides a video frame extraction method, including:
determining a group of pictures as a coding unit in a video; wherein the image group comprises a plurality of image frames;
acquiring a decoding time stamp corresponding to a time point to be extracted, and searching an image group comprising the decoding time stamp; wherein, the time point to be extracted corresponds to a target image frame in the video;
and decoding the searched image group while ignoring the image frames other than the target image frame in the searched image group, to obtain the target image frame in the video.
In the foregoing solution, before the decoding processing is performed on the searched image group, the method further includes:
determining a dependent image frame which has a dependency relationship with the target image frame in the searched image group;
the decoding processing on the searched image group to ignore the image frames except the target image frame in the searched image group comprises:
sequentially decoding the dependent image frame and the target image frame in the searched image group, and ignoring the image frames in the searched image group other than the dependent image frame and the target image frame.
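For illustration only, the dependent-frame selection described above can be sketched as follows. The frame representation used here — a list of `(frame_type, deps)` pairs, where `deps` are the indices of frames a frame references — is a hypothetical simplification for the sketch, not a bitstream format prescribed by the disclosure.

```python
def frames_to_decode(gop, target_index):
    """Collect the indices of the target frame and every frame it
    (transitively) depends on inside one group of pictures.

    gop: list of (frame_type, deps) pairs; deps are indices of the
    frames referenced by that frame. Frames outside the returned set
    can be ignored during decoding.
    """
    needed = set()
    stack = [target_index]
    while stack:
        i = stack.pop()
        if i in needed:
            continue
        needed.add(i)
        stack.extend(gop[i][1])   # follow the dependency chain
    return sorted(needed)
```

With a GOP `[("I", []), ("P", [0]), ("B", [1, 3]), ("P", [1])]`, extracting frame 2 requires decoding frames 0, 1, 3, and 2, while extracting frame 1 only requires frames 0 and 1 — the intended saving over decoding the whole group.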
In the foregoing solution, before the decoding processing is performed on the searched image group, the method further includes:
determining a non-reference frame in the searched image group; wherein the non-reference frame is an image frame without dependency relationship with other image frames in the searched image group;
performing an operation of discarding the non-reference frame.
In the foregoing solution, the decoding processing on the searched image group includes:
performing a single software decoding pass on the target image frame in the searched image group according to a decoding order;
wherein the decoding order proceeds from the intra-coded image frame at the start of the searched image group, frame by frame, to the next intra-coded image frame; the group of pictures represents the interval between two adjacent intra-coded image frames in the video.
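The single-pass software decode can be sketched as a loop over the group's frames in decode order, stopping once the target's decoding timestamp is reached; the `decode` callable and the `{"dts": ...}` frame dictionaries are hypothetical stand-ins for a real software decoder and packet structure.

```python
def decode_in_order(gop_frames, target_dts, decode):
    """Feed frames of one group of pictures to a software decoder in
    decode order (leading I frame first), returning the picture whose
    decoding timestamp matches target_dts. Frames after the target are
    never touched, which is where the time saving comes from.
    """
    for frame in gop_frames:          # already sorted in decode order
        picture = decode(frame)
        if frame["dts"] == target_dts:
            return picture
    return None                       # target not present in this GOP
```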
In the foregoing solution, the decoding processing on the searched image group includes:
creating and initializing a hardware decoder of an asynchronous mode;
and sending the target image frame in the searched image group to the hardware decoder so that the hardware decoder performs hardware decoding processing on the target image frame in the asynchronous mode.
In the above scheme, the video frame extraction method further includes:
creating a sub-thread for software decoding processing and a plurality of sub-threads for hardware decoding processing;
splitting the video, and distributing the resulting sub-videos to different sub-threads, so that each sub-thread performs software decoding processing or hardware decoding processing on its assigned sub-video.
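The split-and-dispatch step can be sketched as follows; the per-segment `decode_segment` callable is a hypothetical stand-in for either the software or the hardware decoding path.

```python
import threading

def split_and_decode(duration, n_workers, decode_segment):
    """Split a video's [0, duration) timeline into n_workers equal
    sub-videos and decode each on its own thread, returning results in
    segment order."""
    step = duration / n_workers
    segments = [(i * step, min((i + 1) * step, duration))
                for i in range(n_workers)]
    results = [None] * n_workers

    def work(i, seg):
        results[i] = decode_segment(seg)

    threads = [threading.Thread(target=work, args=(i, s))
               for i, s in enumerate(segments)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```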
In the above scheme, the video frame extraction method further includes:
acquiring a sample image frame and a corresponding sample score;
predicting the sample image frame through a machine learning model to obtain a score to be compared;
updating the weight parameters of the machine learning model according to the difference between the sample score and the score to be compared;
predicting the target image frame in the video through the updated machine learning model to obtain the score of the target image frame;
averaging the scores of all the target image frames in the video to obtain an average score for the video;
and selecting, according to the average scores, the videos to be used for splicing from among a plurality of candidate videos.
In the above scheme, the video frame extraction method further includes:
constructing a time axis of the video according to the time point to be extracted;
and presenting the target image frame corresponding to the time point to be extracted in a display area corresponding to each time point to be extracted in the time axis.
In a second aspect, an embodiment of the present disclosure provides a video frame extraction apparatus, including:
a group-of-pictures determination unit for determining a group of pictures as a coding unit in a video; wherein the image group comprises a plurality of image frames;
the image group searching unit is used for acquiring a decoding time stamp corresponding to a time point to be extracted and searching an image group comprising the decoding time stamp; wherein, the time point to be extracted corresponds to a target image frame in the video;
a decoding unit, configured to decode the searched image group while ignoring the image frames other than the target image frame in the searched image group, to obtain the target image frame in the video.
In the foregoing solution, the video frame extracting apparatus further includes:
the dependent frame determining unit is used for determining a dependent image frame which has a dependent relation with the target image frame in the searched image group;
the decoding unit is further configured to:
sequentially decoding the dependent image frame and the target image frame in the searched image group, and ignoring the image frames in the searched image group other than the dependent image frame and the target image frame.
In the foregoing solution, the video frame extracting apparatus further includes:
a non-reference frame determination unit, configured to determine a non-reference frame in the searched image group; wherein the non-reference frame is an image frame without dependency relationship with other image frames in the searched image group;
a discarding unit configured to perform an operation of discarding the non-reference frame.
In the foregoing scheme, the decoding unit is further configured to:
perform a single software decoding pass on the target image frame in the searched image group according to a decoding order;
wherein the decoding order proceeds from the intra-coded image frame at the start of the searched image group, frame by frame, to the next intra-coded image frame; the group of pictures represents the interval between two adjacent intra-coded image frames in the video.
In the foregoing scheme, the decoding unit is further configured to:
creating and initializing a hardware decoder of an asynchronous mode;
and sending the target image frame in the searched image group to the hardware decoder so that the hardware decoder performs hardware decoding processing on the target image frame in the asynchronous mode.
In the foregoing solution, the video frame extracting apparatus further includes:
a sub-thread creating unit for creating one sub-thread for software decoding processing and a plurality of sub-threads for hardware decoding processing;
a splitting unit, configured to split the video and distribute the resulting sub-videos to different sub-threads, so that each sub-thread performs software decoding processing or hardware decoding processing on its assigned sub-video.
In the foregoing solution, the video frame extracting apparatus further includes:
the sample acquisition unit is used for acquiring a sample image frame and a corresponding sample score;
the first prediction unit is used for performing prediction processing on the sample image frame through a machine learning model to obtain a score to be compared;
the updating unit is used for updating the weight parameters of the machine learning model according to the difference between the sample scores and the scores to be compared;
the second prediction unit is used for performing prediction processing on the target image frame in the video through the updated machine learning model to obtain the score of the target image frame;
the average processing unit is used for averaging the scores of all the target image frames in the video to obtain an average score for the video;
and the video selecting unit is used for selecting, according to the average scores, the videos to be used for splicing from among a plurality of candidate videos.
In the foregoing solution, the video frame extracting apparatus further includes:
a time axis construction unit, configured to construct a time axis of the video according to the time point to be extracted;
and the presentation unit is used for presenting the target image frames corresponding to the time points to be extracted in the display area corresponding to each time point to be extracted in the time axis.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the video frame extracting method provided by the embodiment of the disclosure when the executable instruction is executed.
In a fourth aspect, the present disclosure provides a computer-readable storage medium storing executable instructions, which when executed, are configured to implement the video frame extracting method provided by the present disclosure.
The embodiment of the disclosure has the following beneficial effects:
the corresponding image group is decoded by acquiring the time point to be extracted corresponding to the target image frame so as to ignore the image frames except the target image frame in the image group, thereby shortening the waiting time of decoding processing, improving the frame extraction efficiency, and being suitable for manufacturing scenes with higher timeliness requirements, such as a stuck point video.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is an alternative schematic diagram of an electronic device implementing an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an alternative structure of a video frame extracting apparatus for implementing the embodiment of the present disclosure;
FIG. 3A is a schematic flow chart of an alternative video frame extraction method according to an embodiment of the present disclosure;
FIG. 3B is a schematic flow chart of an alternative video frame extraction method according to an embodiment of the present disclosure;
FIG. 3C is a schematic flow chart of an alternative video frame extraction method according to an embodiment of the present disclosure;
fig. 4 is an alternative schematic diagram of a group of images provided by an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It is worth noting that in the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise. Further, "a plurality" in the embodiments of the present disclosure means at least two.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Referring now to fig. 1, fig. 1 is a schematic diagram of an electronic device 100 implementing an embodiment of the present disclosure. The electronic device may be various terminals including a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a Personal Digital Assistant (PDA), a tablet computer (PAD), a Portable Multimedia Player (PMP), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital Television (TV), a desktop computer, and the like, and may also be a server disposed in the cloud. The electronic device shown in fig. 1 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 1, the electronic device 100 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 110, which may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 120 or a program loaded from a storage means 180 into a Random Access Memory (RAM) 130. In the RAM 130, various programs and data necessary for the operation of the electronic apparatus 100 are also stored. The processing device 110, the ROM 120, and the RAM 130 are connected to each other through a bus 140. An Input/Output (I/O) interface 150 is also connected to bus 140.
In general: input devices 160 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 170 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, etc.; storage devices 180 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 190. The communication device 190 may allow the electronic apparatus 100 to communicate wirelessly or by wire with other apparatuses to exchange data.
In particular, the processes described by the provided flowcharts may be implemented as computer software programs according to embodiments of the present disclosure. For example, the disclosed embodiments include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network through the communication device 190, or installed from the storage device 180, or installed from the ROM 120. The computer program, when executed by the processing device 110, performs the functions in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium described above in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an erasable programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the disclosed embodiments, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the disclosed embodiments, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including over electrical wiring, fiber optics, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
The computer readable medium may be included in the electronic device 100; or may be separate and not incorporated into the electronic device 100.
The computer readable medium carries one or more programs, which when executed by the electronic device 100, cause the electronic device to perform the video frame extracting method provided by the embodiments of the present disclosure.
Computer program code for carrying out operations in embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages, or combinations thereof.
The flowchart and block diagrams provided by the embodiments of the present disclosure illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Here, the name of a unit does not constitute a limitation of the unit itself in some cases, and for example, the group of pictures determining unit may also be described as a unit of "determining a group of pictures as a coding unit in a video".
For example, without limitation, exemplary types of hardware logic components that may be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so forth.
In the context of embodiments of the present disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is understood that the units in the video frame extracting apparatus may be implemented in the electronic device shown in fig. 1 by software (for example, the computer software program described above), or by the hardware logic components described above (for example, FPGA, ASIC, ASSP, SOC, and CPLD).
Referring to fig. 2, fig. 2 is an alternative structural diagram of a video frame extraction apparatus 200 implementing an embodiment of the present disclosure, showing the following modules: a group of pictures determining unit 210, a group of pictures searching unit 220, and a decoding unit 230.
It should be noted that the above-mentioned classification of units does not constitute a limitation of the electronic device itself, for example, some units may be split into two or more sub-units, or some units may be combined into a new unit.
It is to be noted that the names of the above units do not in some cases constitute a limitation on the units themselves, and for example, the above-described group of pictures determining unit 210 may also be described as a unit of "determining a group of pictures as a coding unit in a video".
Likewise, units and/or modules of the electronic device that are not described in detail here are not thereby absent; all operations performed by the electronic device may be implemented by the corresponding units and/or modules within it.
With continuing reference to fig. 3A, fig. 3A is an optional flowchart of a video frame extraction method implementing an embodiment of the disclosure. For example, when the processing device 110 loads the program in the Read-Only Memory (ROM) 120, or a program from the storage device 180, into the Random Access Memory (RAM) 130 and executes it, the video frame extraction method shown in fig. 3A may be implemented. The steps shown in fig. 3A are described below.
In step 101, a group of pictures in a video as a coding unit is determined; wherein the image group comprises a plurality of image frames.
A video is a combination of a plurality of image frames; because the raw data volume of a video is usually large, the original video is typically encoded and compressed to facilitate operations such as transmission and storage. During encoding and compression, a set number of image frames in the video are grouped into a Group Of Pictures (GOP), which is the unit of encoding in compression. A group of pictures generally includes three types of image frames: Intra-coded picture frames, Predictive-coded picture frames, and Bi-directionally predicted picture frames. An intra-coded image frame, also called an I frame or key frame, contains complete image information and does not need to refer to other image frames during decoding; the group of pictures therefore also represents the interval between two adjacent I frames in a video. A predictive-coded image frame, also called a P frame, must refer to the preceding I frame or P frame when decoded. A bi-directionally predicted image frame, also called a B frame, must refer to both preceding and following I or P frames when decoded.
Video decoding is the inverse of video encoding: the compressed video is decompressed, the original information in the video is restored as far as possible, and operations such as playing or clipping can then be performed on the decoded video. In the video frame extraction process of the embodiment of the present disclosure, the image groups serving as coding units in the coded video are determined first; for example, the file information generated when the video was encoded is read, the time span occupied by each image group is obtained from the file information, and each image group in the coded video is thereby determined.
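As an illustrative sketch (function name and data layout are hypothetical, not part of the disclosed implementation), the image groups can be represented as time intervals built from the I-frame time stamps read from the file information:

```python
def build_gop_index(keyframe_timestamps, video_duration):
    """Turn the decoding time stamps of the I frames (the GOP starts)
    into a list of (start, end) intervals, one per group of pictures."""
    gops = []
    for i, start in enumerate(keyframe_timestamps):
        is_last = i + 1 == len(keyframe_timestamps)
        end = video_duration if is_last else keyframe_timestamps[i + 1]
        gops.append((start, end))
    return gops
```

With I frames at 0 s, 2 s and 4 s in a 5.5 s video, this yields the intervals (0, 2), (2, 4) and (4, 5.5).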
In step 102, a decoding time stamp corresponding to a time point to be extracted is obtained, and an image group including the decoding time stamp is searched; and the time point to be extracted corresponds to a target image frame in the video.
Here, a time point to be extracted and its corresponding timestamp information, i.e., the decoding time stamp, are acquired. For example, when the electronic device is a terminal device, the time point to be extracted that is manually set by the user of the terminal device is obtained, such as a particular minute and second in the video; when the electronic device is a server deployed in the cloud, the time point to be extracted that is sent by the terminal device and manually set by the user is obtained. Of course, a plurality of time points to be extracted may also be determined according to a set frame extraction frequency, such as one frame per second or two frames per second, depending on the actual application scenario. Each time point to be extracted corresponds to an image frame to be extracted from the video; for ease of distinction, the image frame corresponding to the time point to be extracted is named the target image frame. Then, in order to decode the target image frame, the image group in which the decoding time stamp falls is searched among the plurality of image groups of the video.
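The lookup of the image group containing the decoding time stamp can be sketched as a binary search over the interval starts (a hypothetical helper, assuming the image groups are given as (start, end) intervals sorted by start time):

```python
from bisect import bisect_right

def find_gop(gops, decode_ts):
    """Locate the group of pictures whose [start, end) interval contains
    the decoding time stamp of the frame to be extracted."""
    starts = [start for start, _ in gops]
    idx = bisect_right(starts, decode_ts) - 1   # last GOP starting at or before the stamp
    if idx < 0 or decode_ts >= gops[idx][1]:
        raise ValueError("time stamp falls outside the video")
    return idx
```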
In step 103, the searched image group is subjected to a decoding process to ignore image frames in the searched image group except the target image frame, and obtain the target image frame in the video.
The searched image group is decoded. During decoding, only the image frame corresponding to the decoding time stamp, namely the target image frame, is decoded; the image frames in the searched image group other than the target image frame are ignored (skipped), and decoding ends once the decoded target image frame is obtained. Because only the target image frame needs to be decoded, the waiting time is shortened and the frame extraction efficiency is improved. The decoded target image frame can be further processed according to actual service requirements, and the embodiment of the present disclosure does not limit its use; for example, the decoded target image frame may be presented on a graphical interface of a terminal device, or scored in order to determine the quality of the video.
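The skip-all-but-target behaviour of step 103 can be sketched as follows (hypothetical names; reference dependencies between frames are deliberately ignored here, as they are handled by the refinement with dependent image frames described below):

```python
def extract_target(gop_frames, target_ts, decode):
    """Walk the frames of the located image group in decoding order:
    fully decode only the frame whose time stamp matches the decoding
    time stamp, skip (ignore) every other frame, and stop as soon as
    the target frame has been decoded."""
    for frame in gop_frames:
        if frame["ts"] == target_ts:
            return decode(frame)    # decode only the target image frame
        # non-target frames are skipped without decoding
    return None
```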
In some embodiments, before step 103, further comprising: determining a dependent image frame which has a dependent relationship with a target image frame in the searched image group;
the above-described decoding process of the found group of pictures may also be implemented in such a manner as to ignore image frames other than the target image frame in the found group of pictures: and sequentially decoding the dependent image frame and the target image frame in the searched image group to ignore the image frames except the dependent image frame and the target image frame in the searched image group.
According to the rules of video coding, there may be dependencies between different image frames in the searched image group, that is, before decoding a certain image frame, other image frames must be decoded, for example, a B frame needs to refer to a previous I frame or a previous P frame when decoding. Therefore, in the embodiment of the present disclosure, in addition to the target image frame, a dependent image frame having a dependency relationship with the target image frame is determined in the searched image group.
The target image frame can be successfully decoded only by decoding the dependent image frame, so that the dependent image frame and the target image frame are sequentially decoded after the dependent image frame is determined, and the searched image frames except the dependent image frame and the target image frame in the image group are ignored in the decoding process. By the method, the success rate of decoding the target image frame is improved.
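A minimal way to compute which frames must not be ignored is a transitive closure over the reference relationships (an illustrative sketch; in practice the dependency map would be derived from the frame types and the codec's reference lists):

```python
def frames_to_decode(deps, target):
    """Return the set of frames that must be decoded for `target`: the
    target itself plus the transitive closure of its reference
    dependencies. Everything else in the GOP can be ignored.
    `deps` maps each frame id to the ids of the frames it references."""
    needed, stack = set(), [target]
    while stack:
        frame = stack.pop()
        if frame not in needed:
            needed.add(frame)
            stack.extend(deps.get(frame, []))
    return needed
```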
In some embodiments, before step 103, further comprising: determining a non-reference frame in the searched image group; the non-reference frame is an image frame which has no dependency relationship with other image frames in the searched image group; an operation of discarding non-reference frames is performed.
In the searched image group, there may be non-reference frames, i.e., image frames on which no other image frame in the searched image group depends. Decoding the non-reference frames wastes decoding time and is not conducive to obtaining the decoded target image frame quickly. Therefore, in the embodiment of the present disclosure, all non-reference frames in the searched image group are discarded, for example by using the AVDISCARD_NONREF option in the FFmpeg (Fast Forward Moving Picture Experts Group) tool. In this way, the frame extraction efficiency can be further improved. The inventor's experiments verified that, for the same video, the decoding duration is 12144 milliseconds if non-reference frames are not discarded, and 10922 milliseconds if they are discarded, a reduction of about 10%.
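The reported saving can be checked from the two quoted durations alone:

```python
# Durations quoted in the experiment above, in milliseconds.
without_discard_ms = 12144
with_discard_ms = 10922

saving = (without_discard_ms - with_discard_ms) / without_discard_ms
print(f"decoding time saved by discarding non-reference frames: {saving:.1%}")
```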
In some embodiments, after step 103, further comprising: constructing a time axis of the video according to the time points to be extracted; and presenting the target image frame corresponding to each time point to be extracted in the display area corresponding to each time point to be extracted in the time axis.
In some application scenarios such as video preview, a plurality of time points to be extracted may be sorted in order from morning to evening to obtain a time axis of the video, and a separate display area is set for each time point to be extracted in the time axis. And after the target image frame is decoded, displaying the decoded target image frame corresponding to the time point to be extracted in a display area corresponding to each time point to be extracted in the time axis. By the mode, the time axis thumbnail of the video can be generated, so that a user can conveniently and quickly know the approximate content of the video, and the user can jump to a certain time point of the video to start watching or clip a certain part of the video according to the needs of the user.
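Building the time axis amounts to sorting the extraction time points and pairing each with its decoded frame (illustrative sketch, hypothetical names):

```python
def build_timeline(points, thumbnails):
    """Sort the time points to be extracted from earliest to latest and
    pair each with its decoded target frame, yielding one display slot
    per time point on the video's time axis."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    return [(points[i], thumbnails[i]) for i in order]
```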
As can be seen from the above exemplary implementation of fig. 3A, during decoding the disclosed embodiment decodes only the target image frame and ignores the other image frames; when the target image frame has dependent image frames, the image frames other than the target image frame and its dependent image frames are ignored. The waiting time of the decoding process is thereby shortened, the frame extraction efficiency is improved, and the method is suitable for various application scenarios of video processing.
In some embodiments, referring to fig. 3B, fig. 3B is an optional flowchart of a video frame extraction method provided by the embodiment of the present disclosure, and step 103 shown in fig. 3A may be implemented by steps 201 to 203, which will be described with reference to each step.
In step 201, a single software decoding process is performed on the target image frame in the searched image group according to the decoding order to ignore the image frames other than the target image frame in the searched image group.
In the embodiment of the present disclosure, the searched image group may be subjected to software decoding Processing, where the software decoding Processing refers to using software to enable a Central Processing Unit (CPU) of the electronic device to perform a decoding operation. Specifically, a single software decoding process is performed on a target image frame in the searched image group according to a decoding order to ignore image frames except for the target image frame in the searched image group, wherein the decoding order is an order from a previous I frame to a next I frame corresponding to the searched image group. By the method, the target image frame is decoded only once in the decoding process, and the waste of computing resources and time resources caused by repeated decoding can be effectively avoided.
For ease of understanding, the image group diagram shown in fig. 4 is provided in the embodiment of the present disclosure. In fig. 4, the letters indicate the frame types (I frame, P frame, and B frame), and the numbers below the frames indicate the display order of the image frames in the image group. In the solutions provided in the related art, the software decoding process is performed out of order, and the decoding of each image frame needs to start from the beginning of the group of pictures. For example, if the image frames to be decoded in the first image group include the B3 frame corresponding to number 5 and the B3 frame corresponding to number 7, then during the decoding of the B3 frame corresponding to number 5, decoding starts from the I0 frame corresponding to number 0 and proceeds to the B3 frame corresponding to number 5; during the decoding of the B3 frame corresponding to number 7, decoding likewise starts from the I0 frame corresponding to number 0 and proceeds to the B3 frame corresponding to number 7. This approach easily causes some image frames to be decoded repeatedly, increases the waiting time of decoding, and results in low frame extraction efficiency.
By way of example again, in the embodiment of the present disclosure, according to the preset decoding order, the B3 frame corresponding to number 5 is decoded first; after the decoded B3 frame corresponding to number 5 is obtained, decoding continues without stopping to the B3 frame corresponding to number 7. It is worth noting that when decoding the B3 frame corresponding to number 7, the B3 frame corresponding to number 5 does not necessarily serve as the starting point of decoding; the starting point depends on the specific dependencies in the GOP. In this way, each target image frame in the image group is guaranteed to be decoded only once, which improves the frame extraction efficiency.
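The benefit of the single decoding pass can be counted under the simplifying assumption that every frame from the start of the GOP up to a target must be decoded (an illustrative model, not the disclosed implementation, which follows the actual dependencies in the GOP):

```python
def restart_decodes(targets):
    """Related-art behaviour: decoding each target restarts from the I
    frame at display position 0, re-decoding every frame up to it."""
    return sum(pos + 1 for pos in targets)

def single_pass_decodes(targets):
    """Single-pass behaviour: decode forward once, up to the last
    target, so each frame is decoded at most once."""
    return max(targets) + 1
```

For the fig. 4 example with targets at display positions 5 and 7, the restart approach performs 14 frame decodes while the single pass performs 8.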
In step 202, a hardware decoder for asynchronous mode is created and initialized.
In addition to the software decoding process, the embodiment of the present disclosure may perform a hardware decoding process on the image frame, where the hardware decoding process is to replace a part of the decoding work of the CPU with a Graphics Processing Unit (GPU) of the electronic device. In the course of the hardware decoding process, first, a hardware decoder is created and initialized, which operates in an asynchronous mode, for example, the hardware decoder in the asynchronous mode may be created and initialized by a MediaCodec class.
In step 203, the target image frame in the searched image group is sent to the hardware decoder to ignore the image frames except the target image frame in the searched image group, so that the hardware decoder performs the hardware decoding process on the target image frame in the asynchronous mode.
Here, the target image frame in the searched image group is sent to the hardware decoder through the callback function of the hardware decoder, so that the hardware decoder performs the hardware decoding process on the target image frame in the asynchronous mode. Similarly, the decoded target image frame is obtained through a callback function of the hardware decoder. Experiments show that, compared with a hardware decoder in synchronous mode, the decoding time in asynchronous mode can be shortened by about 50%.
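The asynchronous flow can be modelled with a worker thread and a callback standing in for the MediaCodec callbacks (a toy sketch only; a real decoder hands buffers back through `MediaCodec.Callback.onOutputBufferAvailable` rather than a Python queue):

```python
import queue
import threading

def async_decode(frames, decode, on_frame_decoded):
    """Toy model of an asynchronous decoder: frames are fed into an
    input queue, a worker thread decodes them, and results come back
    through a callback instead of blocking the caller per frame."""
    inbox = queue.Queue()

    def worker():
        while True:
            frame = inbox.get()
            if frame is None:        # end-of-stream sentinel
                break
            on_frame_decoded(decode(frame))

    t = threading.Thread(target=worker)
    t.start()
    for f in frames:
        inbox.put(f)                 # feed input buffers without waiting
    inbox.put(None)
    t.join()
```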
In addition, the hardware decoder in asynchronous mode can be further optimized, and the embodiment of the disclosure provides two exemplary optimization approaches. The first is to set the hardware decoder to surface mode; compared with buffer mode, a gain of about 5% is obtained, that is, the frame extraction speed increases by about 5%. The second is to raise the operating frequency of the hardware decoder, for example by setting an operating-rate field for the hardware decoder, thereby increasing the frame extraction speed. For three processors of different models, the inventor tested both the case where the operating frequency is not set and the case where it is raised via the operating-rate field; the decoding time consumption obtained is as follows:
(Table: decoding time consumption for three processor models, with and without the operating-rate field set; the table appears as an image in the original document and its values are not recoverable from the text.)
From this table, it can be determined that the decoding efficiency of the hardware decoder is doubled after the operating frequency is raised via the operating-rate field.
In some embodiments, before step 101, further comprising: creating a sub-thread for software decoding processing and a plurality of sub-threads for hardware decoding processing; and splitting the video, and respectively distributing the split sub-videos to different sub-threads so that the sub-threads perform software decoding processing or hardware decoding processing on the distributed sub-videos.
In the embodiments of the present disclosure, a software decoding process or a hardware decoding process may be used alone, where the software decoding process consumes memory resources and CPU resources, and the hardware decoding process consumes GPU resources and less CPU resources. After testing on a certain type of terminal equipment according to a one-way strategy, the obtained resource consumption results are as follows:
Terminal device of a certain model | Frame extraction time | CPU occupancy | Memory usage
Software decoding process          | 12 seconds            | 37%           | 130 MB
Hardware decoding process          | 14 seconds            | 4%            | 65 MB
The one-way strategy refers to performing either software decoding processing or hardware decoding processing in a single thread. From the table, it can be determined that decoding through a one-way strategy is still time-consuming and cannot satisfy scenarios with high timeliness requirements.
For this case, in the embodiment of the present disclosure, a multi-way strategy is adopted, combining software decoding processing and hardware decoding processing. Specifically, one sub-thread for software decoding processing and a plurality of sub-threads for hardware decoding processing are created, the video is split, and the resulting sub-videos are respectively distributed to different sub-threads, so that each sub-thread performs software decoding processing or hardware decoding processing on the sub-videos distributed to it. It should be noted that splitting a video may mean splitting one video into multiple parts, or allocating multiple videos to different sub-threads; for example, if the videos to be decoded include video 1, video 2, and video 3, then allocating video 1 to sub-thread 1, video 2 to sub-thread 2, and video 3 to sub-thread 3 completes the splitting.
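The splitting described above can be sketched as a round-robin assignment of (sub-)videos to decoding sub-threads (hypothetical helper; for instance, bucket 0 might feed the software-decoding thread and the remaining buckets the hardware-decoding threads):

```python
def assign_videos(videos, n_threads):
    """Distribute (sub-)videos across decoding sub-threads round-robin,
    so each thread receives roughly the same number of items."""
    buckets = [[] for _ in range(n_threads)]
    for i, video in enumerate(videos):
        buckets[i % n_threads].append(video)
    return buckets
```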
By combining software decoding processing and hardware decoding processing, the time consumed by frame extraction can be further reduced without excessively high resource occupation. For illustration, the inventor performed a multi-way strategy test on a certain type of terminal device, with the following results:
(Table: frame extraction time, CPU occupancy, and memory usage for several multi-way decoding combinations on a certain terminal device; the table appears as an image in the original document and its values are not recoverable from the text.)
In the table, two-way hardware decoding processing means that hardware decoding is performed by two sub-threads; 10.3 seconds/10.5 seconds means that the frame extraction time of one sub-thread is 10.3 seconds and that of the other is 10.5 seconds, and so on. From the table, it can be determined that, compared with performing only software decoding processing or only hardware decoding processing, combining the two shortens the frame extraction time further while keeping the occupation of CPU resources and memory resources within an acceptable range. The inventor's tests verified that, on the basis of one sub-thread for software decoding, configuring the callback of the hardware decoder in an asynchronous thread and creating two sub-threads for hardware decoding shortens the frame extraction time by about one second compared with one-way software decoding plus one-way hardware decoding; in practical application scenarios, three sub-threads for hardware decoding may also be created. In conclusion, by combining one-way software decoding with multi-way hardware decoding, both the time consumption and the resource occupation of frame extraction can be balanced, achieving a good frame extraction effect.
As can be seen from the above exemplary implementation of fig. 3B, the present disclosure provides two manners, namely a software decoding process and a hardware decoding process, so as to improve the flexibility of frame extraction, and according to different actual application scenarios, either of the two manners may be applied, or the two manners may be combined to implement fast frame extraction.
In some embodiments, referring to fig. 3C, fig. 3C is an optional flowchart of a video frame extraction method provided by the embodiment of the disclosure, and after step 103 shown in fig. 3A, a sample image frame and a corresponding sample score may also be obtained in step 301.
After the decoded target image frame corresponding to the time point to be extracted in the video is obtained, the decoded target image frame can be subjected to scoring processing, so that the quality of the video is measured. The embodiment of the present disclosure provides a way of performing scoring processing according to a machine learning model, but it should be understood that the way of performing scoring processing is not limited thereto, and scoring processing may also be performed by color channel data of a target image frame, for example.
Before scoring a target image frame, training a machine learning model, specifically acquiring a sample image frame and a corresponding sample score, wherein the sample image frame can be acquired from an open-source image data set, and the sample score is obtained through artificial marking. The type of the machine learning model is not limited in the embodiments of the present disclosure, and may be, for example, a random forest model or a neural network model.
In step 302, a prediction process is performed on the sample image frame through a machine learning model, so as to obtain a score to be compared.
Here, the sample image frame is subjected to forward prediction processing by the machine learning model, and a score to be compared is obtained.
In step 303, the weight parameters of the machine learning model are updated according to the difference between the sample score and the score to be compared.
The sample score and the score to be compared are processed according to a loss function of the machine learning model to obtain the difference between them, which is equivalent to a loss value; the loss function may be, for example, a cross-entropy loss function. Back propagation is performed in the machine learning model according to the obtained difference, and the weight parameters of the machine learning model are updated during back propagation. This updating process is repeated until the machine learning model satisfies a set convergence condition, such as a set number of training rounds.
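The training loop of steps 301 to 303 can be sketched with a minimal linear scorer (an illustrative stand-in only: a squared-error loss and one scalar feature per frame replace the actual model and loss function, and all names and sample values are hypothetical):

```python
def predict(w, b, x):
    """Forward prediction: map a frame feature to a score."""
    return w * x + b

def train(samples, epochs=300, lr=0.05):
    """Fit the scorer by gradient descent on the squared difference
    between the predicted score and the sample score; the convergence
    condition is a fixed number of training rounds."""
    w = b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = predict(w, b, x) - y   # score to compare vs. sample score
            w -= lr * 2 * err * x        # back-propagate the difference
            b -= lr * 2 * err
    return w, b
```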
In step 304, a target image frame in the video is subjected to prediction processing through the updated machine learning model, so as to obtain a score of the target image frame.
And predicting the decoded target image frame in the video through the updated machine learning model to obtain the score of the target image frame.
In step 305, the scores of all target image frames in the video are averaged to obtain an average score of the video.
A video usually comprises a plurality of target image frames, so the scores of all the target image frames in the video are averaged to obtain the average score of the video, and the average score represents the video quality of the video.
In step 306, among the plurality of videos to be subjected to video splicing, the videos to be subjected to video splicing are selected according to the average scores.
The embodiment of the disclosure can be applied to the application scenario of video splicing; for example, among a plurality of videos, the videos with better quality are selected and spliced into a stuck-point (beat-synced) video, and the spliced video often has a better playing effect. Among the plurality of videos to be spliced, the videos for video splicing are selected according to the average score of each video. It is worth noting that the plurality of videos to be spliced here may refer to a plurality of videos from different sources, or to a plurality of video segments in the same video.
According to different practical application scenes, the modes for selecting the videos for video splicing are different. For example, a video corresponding to the K average scores with the largest value may be used as the video for video splicing, where K is an integer greater than 1.
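Steps 305 and 306 together reduce to averaging the per-frame scores and taking the top K videos (hypothetical names and score values):

```python
import heapq

def video_average_score(frame_scores):
    """Mean of the scores of all target frames extracted from one video."""
    return sum(frame_scores) / len(frame_scores)

def select_for_splicing(videos_frame_scores, k):
    """Pick the K videos with the highest average score for splicing.
    `videos_frame_scores` maps a video id to its per-frame scores."""
    averages = {vid: video_average_score(s) for vid, s in videos_frame_scores.items()}
    return heapq.nlargest(k, averages, key=averages.get)
```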
As can be seen from the above exemplary implementation of fig. 3C, in the embodiment of the present disclosure, the score of the decoded target image frame is predicted according to the machine learning model, and the average score of the video is further determined, where the average score represents the quality of the video, and is suitable for an application scenario that needs to select the video, such as video splicing (for example, producing a stuck point video).
According to one or more embodiments of the present disclosure, there is provided a video framing method, including: determining a group of pictures as a coding unit in a video; wherein the image group comprises a plurality of image frames; acquiring a decoding time stamp corresponding to a time point to be extracted, and searching an image group comprising the decoding time stamp; the time point to be extracted corresponds to a target image frame in the video; and decoding the searched image group to ignore the image frames except the target image frame in the searched image group to obtain the target image frame in the video.
In some embodiments, before the decoding process of the searched image group, the method further includes: determining a dependent image frame which has a dependent relationship with a target image frame in the searched image group;
decoding the searched image group to ignore image frames except the target image frame in the searched image group, comprising:
and sequentially decoding the dependent image frame and the target image frame in the searched image group to ignore the image frames except the dependent image frame and the target image frame in the searched image group.
In some embodiments, before the decoding process of the searched image group, the method further includes: determining a non-reference frame in the searched image group; the non-reference frame is an image frame which has no dependency relationship with other image frames in the searched image group; an operation of discarding non-reference frames is performed.
In some embodiments, the decoding process for the searched group of pictures includes: performing single software decoding processing on the target image frame in the searched image group according to the decoding sequence; wherein, the decoding sequence represents that the previous intra-frame coding image frame corresponding to the searched image group is sequentially changed to the next intra-frame coding image frame; the group of pictures is used to represent the interval between two adjacent intra-coded image frames in the video.
In some embodiments, the decoding process for the searched group of pictures includes: creating and initializing a hardware decoder of an asynchronous mode; and sending the target image frame in the searched image group to a hardware decoder so that the hardware decoder performs hardware decoding processing on the target image frame in an asynchronous mode.
In some embodiments, further comprising: creating a sub-thread for software decoding processing and a plurality of sub-threads for hardware decoding processing; and splitting the video, and respectively distributing the split sub-videos to different sub-threads so that the sub-threads perform software decoding processing or hardware decoding processing on the distributed sub-videos.
In some embodiments, further comprising: acquiring a sample image frame and a corresponding sample score; predicting the sample image frame through a machine learning model to obtain a score to be compared; updating the weight parameters of the machine learning model according to the difference between the sample scores and the scores to be compared; predicting a target image frame in the video through the updated machine learning model to obtain a score of the target image frame; carrying out average processing on scores of all target image frames in the video to obtain average scores of the video; and selecting the videos to be subjected to video splicing from the plurality of videos to be subjected to video splicing according to the average scores.
In some embodiments, further comprising: constructing a time axis of the video according to the time points to be extracted; and presenting the target image frame corresponding to each time point to be extracted in the display area corresponding to each time point to be extracted in the time axis.
According to one or more embodiments of the present disclosure, there is provided a video frame extracting apparatus including: a group-of-pictures determination unit for determining a group of pictures as a coding unit in a video; wherein the image group comprises a plurality of image frames; the image group searching unit is used for acquiring a decoding time stamp corresponding to a time point to be extracted and searching an image group comprising the decoding time stamp; the time point to be extracted corresponds to a target image frame in the video; and the decoding unit is used for performing decoding processing on the searched image group so as to ignore the image frames except the target image frame in the searched image group and obtain the target image frame in the video.
In some embodiments, the video framing apparatus further comprises: the dependent frame determining unit is used for determining a dependent image frame which has a dependent relation with the target image frame in the searched image group;
a decoding unit further configured to: and sequentially decoding the dependent image frame and the target image frame in the searched image group to ignore the image frames except the dependent image frame and the target image frame in the searched image group.
In some embodiments, the video framing apparatus further comprises: a non-reference frame determination unit for determining a non-reference frame in the searched image group; the non-reference frame is an image frame which has no dependency relationship with other image frames in the searched image group; a discarding unit for performing an operation of discarding the non-reference frame.
In some embodiments, the decoding unit is further configured to: performing single software decoding processing on the target image frame in the searched image group according to the decoding sequence; wherein, the decoding sequence represents that the previous intra-frame coding image frame corresponding to the searched image group is sequentially changed to the next intra-frame coding image frame; the group of pictures is used to represent the interval between two adjacent intra-coded image frames in the video.
In some embodiments, the decoding unit is further configured to: creating and initializing a hardware decoder of an asynchronous mode; and sending the target image frame in the searched image group to a hardware decoder so that the hardware decoder performs hardware decoding processing on the target image frame in an asynchronous mode.
In some embodiments, the video framing apparatus further comprises: a sub-thread creating unit for creating one sub-thread for software decoding processing and a plurality of sub-threads for hardware decoding processing; and the splitting unit is used for splitting the video and respectively distributing the split sub-videos to different sub-threads so that the sub-threads perform software decoding processing or hardware decoding processing on the distributed sub-videos.
In some embodiments, the video framing apparatus further comprises: the sample acquisition unit is used for acquiring a sample image frame and a corresponding sample score; the first prediction unit is used for carrying out prediction processing on the sample image frame through a machine learning model to obtain a score to be compared; the updating unit is used for updating the weight parameters of the machine learning model according to the difference between the sample scores and the scores to be compared; the second prediction unit is used for performing prediction processing on a target image frame in the video through the updated machine learning model to obtain a score of the target image frame; the average processing unit is used for carrying out average processing on the scores of all target image frames in the video to obtain the average score of the video; and the video selecting unit is used for selecting the videos for video splicing from the plurality of videos to be subjected to video splicing according to the average scores.
In some embodiments, the video framing apparatus further comprises: the time axis construction unit is used for constructing a time axis of the video according to the time point to be extracted; and the presentation unit is used for presenting the target image frames corresponding to the time points to be extracted in the display area corresponding to each time point to be extracted in the time axis.
According to one or more embodiments of the present disclosure, there is provided an electronic device including: a memory for storing executable instructions; and the processor is used for realizing the video frame extracting method provided by the embodiment of the disclosure when executing the executable instruction.
According to one or more embodiments of the present disclosure, a computer-readable storage medium is provided, which stores executable instructions for implementing a video frame extraction method provided by an embodiment of the present disclosure when the executable instructions are executed.
The above description is only an illustration of the embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features with similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (11)

1. A video frame extraction method, comprising:
determining groups of pictures serving as coding units in a video, wherein each group of pictures comprises a plurality of image frames;
acquiring a decoding time stamp corresponding to a time point to be extracted, and searching for the group of pictures that includes the decoding time stamp, wherein the time point to be extracted corresponds to a target image frame in the video; and
decoding the found group of pictures, ignoring the image frames in the found group of pictures other than the target image frame, to obtain the target image frame in the video.
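The lookup-and-selective-decode flow of claim 1 can be sketched with a video modeled as a list of groups of pictures (GOPs), each frame carrying a decoding time stamp (`dts`). This is a sketch under those assumptions; `find_gop` and `extract_target_frame` are hypothetical helper names, and real decoding is stood in for by tagging the frame.

```python
def find_gop(gops, dts):
    """Locate the group of pictures whose dts range covers the
    decoding time stamp of the time point to be extracted."""
    for gop in gops:
        if gop[0]["dts"] <= dts <= gop[-1]["dts"]:
            return gop
    return None


def extract_target_frame(gops, dts):
    """Decode the found GOP only up to the target frame; frames after
    the target are never decoded at all."""
    gop = find_gop(gops, dts)
    if gop is None:
        return None
    for frame in gop:                        # iterate in decoding order
        decoded = dict(frame, decoded=True)  # stand-in for real decoding
        if frame["dts"] == dts:
            return decoded                   # stop: later frames ignored
    return None
```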
2. The video frame extraction method of claim 1, wherein before the decoding of the found group of pictures, the method further comprises:
determining a dependent image frame in the found group of pictures that has a dependency relationship with the target image frame;
and wherein the decoding of the found group of pictures while ignoring the image frames other than the target image frame comprises:
sequentially decoding the dependent image frame and the target image frame in the found group of pictures, and ignoring the image frames in the found group of pictures other than the dependent image frame and the target image frame.
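The reference-chain walk of claim 2 can be sketched as a graph traversal: starting from the target frame, collect every frame it (transitively) depends on, and decode only those, in decoding order. A minimal sketch; `decode_with_dependencies` and the `deps` mapping (frame index to the indices it references) are hypothetical.

```python
def decode_with_dependencies(gop, target_idx, deps):
    """Return the indices to decode: the target frame plus every frame
    it transitively depends on, in decoding order. All other frames in
    the group of pictures are ignored."""
    needed = {target_idx}
    stack = [target_idx]
    while stack:                              # walk the reference chain
        for d in deps.get(stack.pop(), []):
            if d not in needed:
                needed.add(d)
                stack.append(d)
    # Emit in decoding order so each reference is decoded before its user.
    return [i for i in range(len(gop)) if i in needed]
```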
3. The video frame extraction method of claim 1, wherein before the decoding of the found group of pictures, the method further comprises:
determining a non-reference frame in the found group of pictures, wherein the non-reference frame is an image frame having no dependency relationship with any other image frame in the found group of pictures; and
discarding the non-reference frame.
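Identifying the non-reference frames of claim 3 reduces to finding frames that neither reference nor are referenced by any other frame. A sketch under the same hypothetical `deps` mapping (frame index to referenced indices) as above:

```python
def find_non_reference_frames(num_frames, deps):
    """Indices of frames with no dependency relationship with any other
    frame in the group of pictures; these can be discarded before
    decoding without affecting any other frame."""
    related = set()
    for src, targets in deps.items():
        related.add(src)          # frame that references others
        related.update(targets)   # frames that are referenced
    return [i for i in range(num_frames) if i not in related]
```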
4. The video frame extraction method of claim 1, wherein the decoding of the found group of pictures comprises:
performing a single pass of software decoding on the target image frame in the found group of pictures according to a decoding order,
wherein the decoding order runs from the intra-coded image frame preceding the found group of pictures to the next intra-coded image frame, and the group of pictures represents the interval between two adjacent intra-coded image frames in the video.
5. The video frame extraction method of claim 1, wherein the decoding of the found group of pictures comprises:
creating and initializing a hardware decoder in an asynchronous mode; and
sending the target image frame in the found group of pictures to the hardware decoder, so that the hardware decoder performs hardware decoding on the target image frame in the asynchronous mode.
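The asynchronous decoder of claim 5 can be mocked with a background worker and two queues: `send` returns immediately, and decoded frames arrive on an output queue later. `AsyncDecoder` is a hypothetical simulation of the pattern, not a real hardware decoder API.

```python
import queue
import threading

class AsyncDecoder:
    """Mock of a hardware decoder in asynchronous mode: input frames go
    into an inbox, a worker thread 'decodes' them in the background, and
    results appear on an outbox."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.outbox = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            frame = self.inbox.get()
            if frame is None:                 # sentinel: shut down
                break
            self.outbox.put(("decoded", frame))

    def send(self, frame):
        self.inbox.put(frame)                 # returns immediately

    def close(self):
        self.inbox.put(None)
        self._worker.join()
```

The caller sends the target frame and continues working; it collects the decoded result from `outbox` whenever it is ready.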
6. The video frame extraction method of claim 1, further comprising:
creating a sub-thread for software decoding and a plurality of sub-threads for hardware decoding; and
splitting the video and distributing the resulting sub-videos to different sub-threads, so that each sub-thread performs software decoding or hardware decoding on the sub-video distributed to it.
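The split-and-distribute scheme of claim 6 can be sketched as contiguous chunking plus one worker thread per chunk. A sketch only: `decode_split` is hypothetical, and the per-frame "decoding" is a trivial stand-in for the real software or hardware decoding each sub-thread would perform.

```python
import threading

def decode_split(video_frames, num_threads):
    """Split the video into contiguous sub-videos and decode each one on
    its own sub-thread, then reassemble the results in order."""
    chunk = (len(video_frames) + num_threads - 1) // num_threads
    results = [None] * num_threads

    def worker(i, sub):
        # Stand-in for software/hardware decoding of one sub-video.
        results[i] = [f.upper() for f in sub]

    threads = []
    for i in range(num_threads):
        sub = video_frames[i * chunk:(i + 1) * chunk]
        t = threading.Thread(target=worker, args=(i, sub))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    # Concatenate per-thread results, preserving the original frame order.
    return [f for part in results if part for f in part]
```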
7. The video frame extraction method according to any one of claims 1 to 6, further comprising:
acquiring a sample image frame and a corresponding sample score;
performing prediction processing on the sample image frame through a machine learning model to obtain a score to be compared;
updating weight parameters of the machine learning model according to the difference between the sample score and the score to be compared;
performing prediction processing on the target image frame in the video through the updated machine learning model to obtain a score of the target image frame;
averaging the scores of all the target image frames in the video to obtain an average score of the video; and
selecting, according to the average scores, videos for video splicing from a plurality of videos to be spliced.
8. The video frame extraction method according to any one of claims 1 to 6, further comprising:
constructing a time axis of the video according to the time points to be extracted; and
presenting, in a display area corresponding to each time point to be extracted in the time axis, the target image frame corresponding to that time point.
9. A video frame extraction apparatus, comprising:
a group-of-pictures determination unit configured to determine groups of pictures serving as coding units in a video, wherein each group of pictures comprises a plurality of image frames;
a group-of-pictures search unit configured to acquire a decoding time stamp corresponding to a time point to be extracted and to search for the group of pictures that includes the decoding time stamp, wherein the time point to be extracted corresponds to a target image frame in the video; and
a decoding unit configured to decode the found group of pictures, ignoring the image frames in the found group of pictures other than the target image frame, to obtain the target image frame in the video.
10. An electronic device, comprising:
a memory for storing executable instructions;
a processor, configured to execute the executable instructions to implement the video frame extraction method according to any one of claims 1 to 8.
11. A computer-readable storage medium having stored thereon executable instructions that, when executed, implement the video frame extraction method according to any one of claims 1 to 8.
CN202010198341.XA 2020-03-19 2020-03-19 Video frame extraction method and device, electronic equipment and computer readable storage medium Pending CN111405288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010198341.XA CN111405288A (en) 2020-03-19 2020-03-19 Video frame extraction method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111405288A 2020-07-10

Family

ID=71428965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010198341.XA Pending CN111405288A (en) 2020-03-19 2020-03-19 Video frame extraction method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111405288A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104581436A (en) * 2015-01-28 2015-04-29 青岛海信宽带多媒体技术有限公司 Video frame positioning method and device
CN104703040A (en) * 2015-03-13 2015-06-10 天脉聚源(北京)教育科技有限公司 Video processing method and device
US20160198166A1 (en) * 2015-01-07 2016-07-07 Texas Instruments Incorporated Multi-pass video encoding
CN107979621A (en) * 2016-10-24 2018-05-01 杭州海康威视数字技术股份有限公司 A kind of storage of video file, positioning playing method and device
CN110139169A (en) * 2019-06-21 2019-08-16 上海摩象网络科技有限公司 Method for evaluating quality and its device, the video capture system of video flowing
CN110460790A (en) * 2018-05-02 2019-11-15 北京视联动力国际信息技术有限公司 A kind of abstracting method and device of video frame
CN110505513A (en) * 2019-08-15 2019-11-26 咪咕视讯科技有限公司 A kind of video interception method, apparatus, electronic equipment and storage medium
CN110636317A (en) * 2019-09-05 2019-12-31 天脉聚源(杭州)传媒科技有限公司 Live video frame screenshot display method, system, device and storage medium
CN110740344A (en) * 2019-09-17 2020-01-31 浙江大华技术股份有限公司 Video extraction method and related device

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111866508A (en) * 2020-07-13 2020-10-30 腾讯科技(深圳)有限公司 Video processing method, device, medium and electronic equipment
CN112541429A (en) * 2020-12-08 2021-03-23 浙江大华技术股份有限公司 Intelligent image capturing method and device, electronic equipment and storage medium
CN112541429B (en) * 2020-12-08 2024-05-31 浙江大华技术股份有限公司 Intelligent image capture method and device, electronic equipment and storage medium
WO2022120828A1 (en) * 2020-12-11 2022-06-16 深圳市大疆创新科技有限公司 Video frame extraction method, device, and storage medium
CN112863647A (en) * 2020-12-31 2021-05-28 北京小白世纪网络科技有限公司 Video stream processing and displaying method, system and storage medium
CN112766066A (en) * 2020-12-31 2021-05-07 北京小白世纪网络科技有限公司 Method and system for processing and displaying dynamic video stream and static image
CN112866799A (en) * 2020-12-31 2021-05-28 百果园技术(新加坡)有限公司 Video frame extraction processing method, device, equipment and medium
WO2022143688A1 (en) * 2020-12-31 2022-07-07 百果园技术(新加坡)有限公司 Video frame extraction processing method, apparatus and device, and medium
CN112866799B (en) * 2020-12-31 2023-08-11 百果园技术(新加坡)有限公司 Video frame extraction processing method, device, equipment and medium
CN114845162A (en) * 2021-02-01 2022-08-02 北京字节跳动网络技术有限公司 Video playing method and device, electronic equipment and storage medium
CN114845162B (en) * 2021-02-01 2024-04-02 北京字节跳动网络技术有限公司 Video playing method and device, electronic equipment and storage medium
CN113286174A (en) * 2021-05-21 2021-08-20 浙江商汤科技开发有限公司 Video frame extraction method and device, electronic equipment and computer readable storage medium
CN113792600A (en) * 2021-08-10 2021-12-14 武汉光庭信息技术股份有限公司 Video frame extraction method and system based on deep learning
CN113792600B (en) * 2021-08-10 2023-07-18 武汉光庭信息技术股份有限公司 Video frame extraction method and system based on deep learning
CN114627036B (en) * 2022-03-14 2023-10-27 北京有竹居网络技术有限公司 Processing method and device of multimedia resources, readable medium and electronic equipment
CN114627036A (en) * 2022-03-14 2022-06-14 北京有竹居网络技术有限公司 Multimedia resource processing method and device, readable medium and electronic equipment
WO2023246936A1 (en) * 2022-06-24 2023-12-28 杭州海康威视数字技术股份有限公司 Image processing method and apparatus, and device
CN116437161B (en) * 2022-10-24 2024-02-09 昆易电子科技(上海)有限公司 Video data processing method, injection method, system, equipment and storage medium
CN116437161A (en) * 2022-10-24 2023-07-14 昆易电子科技(上海)有限公司 Video data processing method, injection method, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111405288A (en) Video frame extraction method and device, electronic equipment and computer readable storage medium
US20120017069A1 (en) Out-of-order command execution
CN112437345B (en) Video double-speed playing method and device, electronic equipment and storage medium
CN110708602A (en) Video starting method and device, electronic equipment and storage medium
US20170180746A1 (en) Video transcoding method and electronic apparatus
CN110691281B (en) Video playing processing method, terminal device, server and storage medium
US20170220283A1 (en) Reducing memory usage by a decoder during a format change
EP3410302B1 (en) Graphic instruction data processing method, apparatus
CN113457160A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN112532998B (en) Method, device and equipment for extracting video frame and readable storage medium
US20120082240A1 (en) Decoding apparatus, decoding method, and editing apparatus
CN103618911A (en) Video streaming providing method and device based on video attribute information
CN112423140A (en) Video playing method and device, electronic equipment and storage medium
CN112969075A (en) Frame supplementing method and device in live broadcast process and computing equipment
CN111432141B (en) Method, device and equipment for determining mixed-cut video and storage medium
CN113676769A (en) Video decoding method, apparatus, storage medium, and program product
CN113507637A (en) Media file processing method, device, equipment, readable storage medium and product
CN115767181A (en) Live video stream rendering method, device, equipment, storage medium and product
CN115761090A (en) Special effect rendering method, device, equipment, computer readable storage medium and product
CN111240793A (en) cell pre-rendering method and device, electronic equipment and computer readable medium
CN114422799A (en) Video file decoding method and device, electronic equipment and program product
CN114339412A (en) Video quality enhancement method, mobile terminal, storage medium and device
CN115883857A (en) Live gift cloud rendering method and device, electronic equipment and storage medium
CN112929728A (en) Video rendering method, device and system, electronic equipment and storage medium
CN113096218A (en) Dynamic image playing method, device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200710