WO2021159896A1 - Video processing method, video processing device, and storage medium - Google Patents

Video processing method, video processing device, and storage medium

Info

Publication number
WO2021159896A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
video
key
sub
distribution
Prior art date
Application number
PCT/CN2021/070875
Other languages
English (en)
French (fr)
Inventor
Ao Huanhuan
Luo Wei
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021159896A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4398 Processing of audio elementary streams involving reformatting operations of audio signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440245 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display the reformatting operation being performed only on part of the stream, e.g. a region of the image or a time segment
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Description

  • One or more embodiments of the present application relate generally to video processing on touch devices, and specifically to a video processing method, a video processing device, and a storage medium.
  • Video editing technology segments and re-splices recorded video source files. The technology has a long history, and a large body of editing software exists; classified by operating platform, it can be divided into PC software and mobile software. The display interface of mobile software is small and is operated mainly by finger touch. Because a finger covers a large area, it is difficult to accurately select a particular time point or frame, and several repeated adjustments are often needed to achieve what should be a single adjustment. Finger touch therefore leads to an unfriendly operating experience.
  • Some embodiments of the present application provide a video processing method and a video processing device, storage medium, and system.
  • The following describes the application from multiple aspects; the implementations and beneficial effects of these aspects may be referred to mutually.
  • The embodiments of the present application provide a video processing method, including: obtaining an instruction from a user to divide a video; and, in response to the instruction, dividing the video into multiple sub-videos, where the duration of each of the multiple sub-videos is based at least in part on the sampling moment of one key image frame among at least one key image frame of the video, or on the sampling moment of one key audio node among at least one key audio node of the video; where, between the sub-sequence of image frames appearing before a key image frame in the video's image frame sequence and the sub-sequence appearing after it, there is at least one of a change in the image scene and a change in the image subject; and where, between the sub-audio distribution appearing before a key audio node in the video's audio distribution and the sub-audio distribution appearing after it, there is at least one of a change in the speaking subject and a change in the noise distribution.
  • The embodiments of the present application analyze the image frame sequence and the audio distribution of the video loaded by the user: key image frames are determined by detecting changes in the scene and/or subject of the image sequences before and after a frame, and key audio nodes are determined by detecting abrupt changes in the frequency of the audio distribution. From the key image frames and key audio nodes, the nodes that will not damage speech integrity are selected as split points for video segmentation, thereby avoiding unnecessary cuts in the audio and solving the problem of incomplete speech caused by video processing.
  • the instruction includes a user's long-press instruction on the video.
  • implementation manners of the present application can make it more convenient for users to cut videos on the mobile terminal.
  • It further includes: acquiring a user's selection instruction for at least one sub-video of the multiple sub-videos, and selecting the at least one sub-video from the multiple sub-videos, where the selection instruction includes the user's click instruction for the at least one sub-video.
  • It further includes: obtaining a user's move instruction for one or more sub-videos among the selected at least one sub-video, and moving the one or more sub-videos to a position designated by the user, thereby sorting the selected at least one sub-video, where the move instruction includes a sliding instruction for the one or more sub-videos.
  • the implementation manners of the present application can enable the user to quickly select multiple video clips, and can also quickly adjust the sequence between the video clips.
  • Dividing the video into multiple sub-videos further includes: selecting at least one key image frame from the image frame sequence; selecting at least one key audio node from the audio distribution; determining whether to retain each key image frame of the at least one key image frame and each key audio node of the at least one key audio node; and determining the duration of each sub-video based at least in part on at least one of the sampling moment of a retained key image frame and the sampling moment of a retained key audio node.
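As a rough illustration of this flow, the sketch below merges the retained sampling moments into a sorted list of split points; sub-video durations then follow from the gaps between consecutive points. The function and parameter names, the representation of sampling moments as seconds, and the retention predicates are all illustrative assumptions, not the patent's actual implementation.

```python
from typing import Callable, Iterable, List

def merge_split_points(
    key_frame_moments: Iterable[float],   # sampling moments of key image frames (s)
    key_node_moments: Iterable[float],    # sampling moments of key audio nodes (s)
    keep_frame: Callable[[float], bool],  # retention check for a key image frame
    keep_node: Callable[[float], bool],   # retention check for a key audio node
) -> List[float]:
    """Collect the retained moments as split points; each sub-video spans
    the interval between two consecutive split points."""
    points = [t for t in key_frame_moments if keep_frame(t)]
    points += [t for t in key_node_moments if keep_node(t)]
    return sorted(set(points))
```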
  • the embodiments of the present application further avoid unnecessary segmentation of audio.
  • It can prevent the speaker's audio from being segmented while the speaker is speaking continuously, which would otherwise leave the segmented speech incomplete in expression or missing part of the voice.
  • Determining whether to retain one key image frame of the at least one key image frame and one key audio node of the at least one key audio node includes: determining whether the sub-audio distribution before the key audio node and the sub-audio distribution after the key audio node include a noise distribution; when only one of the two sub-audio distributions includes a noise distribution, or when neither of them includes a noise distribution, determining to keep the key audio node; and when both sub-audio distributions include a noise distribution, determining to discard the key audio node.
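A minimal sketch of this retention rule, with each side of the node reduced to a boolean "includes a noise distribution" flag (an encoding assumed here for illustration):

```python
def keep_key_audio_node(noise_before: bool, noise_after: bool) -> bool:
    """Per the rule above, a key audio node is discarded only when the
    sub-audio distributions on both sides include a noise distribution;
    if one side or neither side is noise-like, the node is kept."""
    return not (noise_before and noise_after)
```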
  • The sub-audio distribution before a key audio node spans from the start node of the video to the key audio node.
  • The sub-audio distribution after a key audio node spans from the key audio node to the end node of the video.
  • Determining whether to retain one key image frame of the at least one key image frame and one key audio node of the at least one key audio node includes: determining whether the sub-audio distribution that includes the sampling moment of the key image frame includes a noise distribution; when it does, determining to keep the key image frame; and when it does not, determining to discard the key image frame.
  • the noise distribution includes at least one of a silent distribution, a non-human noise distribution, and a multi-person noise distribution.
  • The change in the noise distribution includes: the sub-audio distribution appearing before a key audio node includes a noise distribution while the sub-audio distribution appearing after it includes a non-noise distribution; or the sub-audio distribution before the key audio node includes a non-noise distribution while the sub-audio distribution after it includes a noise distribution.
  • The change in the noise distribution also includes: the sub-audio distribution appearing before a key audio node includes at least one type of noise distribution, and the sub-audio distribution appearing after it includes at least one other type of noise distribution.
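Pulling the last few items together, here is a hedged sketch of the noise types, the key-image-frame retention rule, and the noise-change test. The `NoiseKind` enumeration and the use of `None` to encode a non-noise distribution are illustrative assumptions:

```python
from enum import Enum
from typing import Optional

class NoiseKind(Enum):
    """The noise distribution types named above."""
    SILENT = "silent"
    NON_HUMAN = "non-human noise"
    MULTI_PERSON = "multi-person noise"

def keep_key_image_frame(noise_at_frame: bool) -> bool:
    """Keep a key image frame only when the sub-audio distribution covering
    its sampling moment includes a noise distribution (so no speech is cut)."""
    return noise_at_frame

def noise_distribution_changed(before: Optional[NoiseKind],
                               after: Optional[NoiseKind]) -> bool:
    """A change exists when exactly one side is noise-like, or when both
    sides are noise-like but of different kinds."""
    if (before is None) != (after is None):
        return True
    return before is not None and before != after
```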
  • Selecting at least one key audio node from the audio distribution includes: detecting the audio frequency of the video and determining multiple audio frequency distributions of the video according to the detected frequencies, where each audio frequency distribution includes the distribution of one same audio frequency; clustering the multiple audio frequency distributions to obtain multiple audio frequency distribution categories, where each category includes at least one of the multiple audio frequency distributions; and selecting the intersection of every two of the multiple audio frequency distribution categories as at least one key audio node.
  • The clustering includes clustering the multiple audio frequency distributions using a clustering algorithm, where the clustering algorithm includes at least one of the SVM and the K-means algorithm.
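A hedged sketch of this selection using K-means, one of the two algorithms named. The per-frame dominant-frequency feature and the fixed cluster count are assumptions not specified by the source; time indices where the cluster label of adjacent frames changes are taken as the boundaries ("intersections") between frequency-distribution categories:

```python
import numpy as np
from sklearn.cluster import KMeans

def key_audio_nodes(frame_freqs: np.ndarray, n_clusters: int = 4) -> list:
    """frame_freqs: shape (T,), dominant audio frequency per sampled frame."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        frame_freqs.reshape(-1, 1)
    )
    # A key audio node sits wherever the frequency-cluster label changes.
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
```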
  • The embodiments of the present application provide a video processing method, including: obtaining a first instruction from a user to segment a video, where the first instruction includes a long-press instruction on the video; and, in response to the first instruction, dividing the video into multiple sub-videos, where each of the multiple sub-videos includes the images and audio associated with its time period.
  • the implementation manners of this application can enable users to quickly cut multiple video clips, and can also make it more convenient for users to cut videos on mobile terminals.
  • The embodiments of the present application analyze the image frame sequence and the audio distribution of the video loaded by the user: key image frames are determined by detecting changes in the scene and/or subject of the image sequences before and after a frame, and key audio nodes are determined by detecting abrupt changes in the frequency of the audio distribution. From the key image frames and key audio nodes, the nodes that will not damage speech integrity are selected as split points for video segmentation, thereby avoiding unnecessary cuts in the audio and solving the problem of incomplete speech caused by video processing.
  • A second instruction is obtained from the user to divide one sub-video of the multiple sub-videos, where the second instruction includes a long-press instruction on the sub-video; and, in response to the second instruction, the sub-video is divided into multiple grandchild videos, where each of the multiple grandchild videos includes the images and audio associated with its time period.
  • it further includes: acquiring a user's move instruction for at least one sub-video of the plurality of sub-videos, and moving the at least one sub-video to a position designated by the user, thereby sorting the plurality of sub-videos.
  • implementation manner of the present application can enable the user to quickly select multiple video clips, and can also quickly adjust the sequence between the video clips.
  • It further includes: acquiring a user's move instruction for at least one of the multiple grandchild videos, and moving the at least one grandchild video to a position designated by the user, so as to sort the multiple grandchild videos.
  • The embodiments of the present application provide a video processing device, including: an instruction acquisition module for obtaining a user's instruction to divide a video; and a segmentation module for dividing the video into multiple sub-videos in response to the instruction, where the duration of each of the multiple sub-videos is based at least in part on the sampling moment of one key image frame among at least one key image frame of the video, or on the sampling moment of one key audio node among at least one key audio node of the video; where, between the sub-sequence of image frames appearing before a key image frame in the video's image frame sequence and the sub-sequence appearing after it, there is at least one of a change in the image scene and a change in the image subject; and where, between the sub-audio distribution appearing before a key audio node in the video's audio distribution and the sub-audio distribution appearing after it, there is at least one of a change in the speaking subject and a change in the noise distribution.
  • the implementation manners of the present application can enable the user to quickly cut multiple video clips.
  • The embodiments of the present application analyze the image frame sequence and the audio distribution of the video loaded by the user: key image frames are determined by detecting changes in the scene and/or subject of the image sequences before and after a frame, and key audio nodes are determined by detecting abrupt changes in the frequency of the audio distribution. From the key image frames and key audio nodes, the nodes that will not damage speech integrity are selected as split points for video segmentation, thereby avoiding unnecessary cuts in the audio and solving the problem of incomplete speech caused by video processing.
  • the instruction includes a user's long-press instruction on the video.
  • implementation manners of the present application can make it more convenient for users to cut videos on the mobile terminal.
  • It further includes: acquiring a user's selection instruction for at least one sub-video of the multiple sub-videos, and selecting the at least one sub-video from the multiple sub-videos, where the selection instruction includes the user's click instruction for the at least one sub-video.
  • It further includes: a sorting module, configured to obtain a user's move instruction for one or more of the selected at least one sub-video, and to move the one or more sub-videos to a position designated by the user, thereby sorting the selected at least one sub-video, where the move instruction includes a sliding instruction for the one or more sub-videos.
  • implementation manners of the present application can enable the user to quickly select multiple video clips, and can also quickly adjust the sequence between the video clips.
  • Dividing the video into multiple sub-videos further includes: selecting at least one key image frame from the image frame sequence; selecting at least one key audio node from the audio distribution; determining whether to retain each key image frame of the at least one key image frame and each key audio node of the at least one key audio node; and determining the duration of each sub-video based at least in part on at least one of the sampling moment of a retained key image frame and the sampling moment of a retained key audio node.
  • the embodiments of the present application further avoid unnecessary segmentation of audio.
  • It can prevent the speaker's audio from being segmented while the speaker is speaking continuously, which would otherwise leave the segmented speech incomplete in expression or missing part of the voice.
  • Determining whether to retain one key image frame of the at least one key image frame and one key audio node of the at least one key audio node includes: determining whether the sub-audio distribution before the key audio node and the sub-audio distribution after the key audio node include a noise distribution; when only one of the two sub-audio distributions includes a noise distribution, or when neither of them includes a noise distribution, determining to keep the key audio node; and when both sub-audio distributions include a noise distribution, determining to discard the key audio node.
  • The sub-audio distribution before a key audio node spans from the start node of the video to the key audio node.
  • The sub-audio distribution after a key audio node spans from the key audio node to the end node of the video.
  • Determining whether to retain one key image frame of the at least one key image frame and one key audio node of the at least one key audio node includes: determining whether the sub-audio distribution that includes the sampling moment of the key image frame includes a noise distribution; when it does, determining to keep the key image frame; and when it does not, determining to discard the key image frame.
  • the noise distribution includes at least one of a silent distribution, a non-human noise distribution, and a multi-person noise distribution.
  • The change in the noise distribution includes: the sub-audio distribution appearing before a key audio node includes a noise distribution while the sub-audio distribution appearing after it includes a non-noise distribution; or the sub-audio distribution before the key audio node includes a non-noise distribution while the sub-audio distribution after it includes a noise distribution.
  • The change in the noise distribution also includes: the sub-audio distribution appearing before a key audio node includes at least one type of noise distribution, and the sub-audio distribution appearing after it includes at least one other type of noise distribution.
  • Selecting at least one key audio node from the audio distribution includes: detecting the audio frequency of the video and determining multiple audio frequency distributions of the video according to the detected frequencies, where each audio frequency distribution includes the distribution of one same audio frequency; clustering the multiple audio frequency distributions to obtain multiple audio frequency distribution categories, where each category includes at least one of the multiple audio frequency distributions; and selecting the intersection of every two of the multiple audio frequency distribution categories as at least one key audio node.
  • The clustering includes clustering the multiple audio frequency distributions using a clustering algorithm, where the clustering algorithm includes at least one of the SVM and the K-means algorithm.
  • The embodiments of the present application provide a video processing device, including: an instruction acquisition module, configured to obtain a user's first instruction to segment a video, where the first instruction includes a long-press instruction on the video; and a segmentation module, configured to divide the video into multiple sub-videos in response to the first instruction, where each of the multiple sub-videos includes the images and audio associated with its time period.
  • the embodiments of the present application can enable the user to quickly cut multiple video clips, and can also make it more convenient for the user to cut the video on the mobile terminal.
  • The embodiments of the present application analyze the image frame sequence and the audio distribution of the video loaded by the user: key image frames are determined by detecting changes in the scene and/or subject of the image sequences before and after a frame, and key audio nodes are determined by detecting abrupt changes in the frequency of the audio distribution. From the key image frames and key audio nodes, the nodes that will not damage speech integrity are selected as split points for video segmentation, thereby avoiding unnecessary cuts in the audio and solving the problem of incomplete speech caused by video processing.
  • It further includes: obtaining a second instruction from the user to divide one sub-video among the multiple sub-videos, where the second instruction includes a long-press instruction on the sub-video; and, in response to the second instruction, dividing the sub-video into multiple grandchild videos, where each of the multiple grandchild videos includes the images and audio associated with its time period.
  • It further includes: a sorting module, configured to obtain a user's move instruction for at least one sub-video of the multiple sub-videos, and to move the at least one sub-video to a position specified by the user, so as to sort the multiple sub-videos.
  • implementation manners of the present application can enable the user to quickly select multiple video clips, and can also quickly adjust the sequence between the video clips.
  • The sorting module is further configured to: acquire a user's move instruction for at least one grandchild video of the multiple grandchild videos, and move the at least one grandchild video to a position designated by the user, so as to sort the multiple grandchild videos.
  • the present application provides a computer-readable storage medium, which may be non-volatile.
  • the storage medium contains instructions that, after being executed, implement the method described in any one of the foregoing aspects or implementation manners.
  • The present application provides an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device, and a processor for executing the instructions in the memory to perform the method described in any one of the foregoing aspects or implementation manners.
  • Fig. 1 shows a schematic diagram of modules of an exemplary electronic device according to an embodiment of the present application.
  • Fig. 2 shows a schematic diagram of a graphical user interface presented by a display screen of an electronic device according to an exemplary embodiment.
  • Fig. 3 shows a schematic flowchart of a video processing method according to an embodiment of the present application.
  • Fig. 4 shows a schematic diagram of possible user operations of video processing according to an embodiment of the present application.
  • Fig. 5 shows a schematic flowchart of a method for segmenting a video according to an embodiment of the present application.
  • Fig. 6 shows a schematic flowchart of a method for selecting a key image frame according to an embodiment of the present application.
  • Fig. 7 shows a schematic flowchart of a method for selecting a key audio node according to an embodiment of the present application.
  • Fig. 8 shows a schematic diagram of an image sequence of a video, respective audio nodes, and a segmented video segment according to an embodiment of the present application.
  • the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof.
  • The disclosed embodiments can also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which can be read and executed by one or more processors.
  • The instructions can be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (for example, a computer), including, but not limited to, a floppy disk, an optical disk, a compact disc read-only memory (CD-ROM), a magneto-optical disk, read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), a magnetic or optical card, flash memory, or any other type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine.
  • Although the terms first, second, etc. may be used herein to describe various units or data, these units or data should not be limited by these terms. These terms are used only to distinguish one feature from another.
  • For example, a first feature may be referred to as a second feature, and similarly a second feature may be referred to as a first feature.
  • A module or unit may refer to, be part of, or include an application-specific integrated circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory (shared, dedicated, or group) that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functions.
  • The display interface of the video editing software of a mobile electronic device is small, and operation is usually performed mainly by finger touch, so operations designed for the PC become extremely inconvenient when transplanted directly to the mobile electronic device.
  • For example, selecting an accurate cutting point by dragging, and moving a video clip to the desired position after cutting, are both inconvenient operations.
  • the experience is not ideal.
  • The finger contact area is large, and it is difficult to accurately control a certain time point or a certain frame; it is often necessary to adjust repeatedly to achieve what should be a single adjustment.
  • The appropriate duration of each video clip depends on the video content, so a fixed clip duration cannot meet users' needs.
  • Current editing is performed on the entire video, which makes it inconvenient to edit and cut multiple clips; playback easily runs for the duration of the entire video and requires separate operations such as opening and pausing, which is inconvenient for repeated playback on a mobile phone.
  • The technical solution of the present application aims to solve the above-mentioned problems of video cutting on the mobile terminal.
  • One or more embodiments of the present application propose a video editing method, which makes it more convenient for users to cut videos on a mobile terminal.
  • users can quickly cut multiple video clips, quickly select multiple video clips, and quickly adjust the sequence between video clips.
  • The technical solution of this application solves the problem of incomplete voice caused by video clipping.
  • Fig. 1 shows a schematic diagram of modules of an exemplary electronic device according to an embodiment of the present application.
  • the electronic device can be used to perform video processing methods.
  • the electronic device 100 may include a control component 110, an external memory interface 120, an internal memory 121, an audio module 130, a sensor module 140, a display screen 150, and so on.
  • the control component 110 may include a processor 111.
  • the sensor module 140 may include a pressure sensor 140A, a touch sensor 140B, and the like.
  • the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the electronic device 100.
  • the electronic device 100 may include more or fewer components than those shown in the figure, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the processor 111 may include one or more processing units.
  • The processor 111 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 111 may include one or more GPUs, which execute program instructions to generate or change display information.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
  • The NPU is a neural-network (NN) computing processor.
  • With it, applications such as intelligent cognition of the electronic device 100 can be realized, for example image recognition, face recognition, speech recognition, and text understanding.
  • a memory may also be provided in the processor 111 to store instructions and data.
  • the memory in the processor 111 is a cache memory.
  • The memory can store instructions or data that the processor 111 has just used or uses cyclically. If the processor 111 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 111, and improves system efficiency.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the control component 110 through the external memory interface 120 to realize the data storage function. For example, save files such as databases in an external memory card.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, an application program required by at least one function, and the like.
  • the storage program area may store the video editing device 160, which can implement various aspects of the video editing method of the present application.
  • the video editing device 160 may include an instruction acquisition module 161, a segmentation module 162, a selection module 163, a sorting module 164, and a synthesis module 165.
  • the instruction acquisition module 161 may be used to acquire a user's instruction to edit a video and other instructions input by the user.
  • the segmentation module 162 may be used to respond to instructions to segment the video into multiple video segments.
  • The segmentation module 162 can separately analyze the image frame sequence and the audio distribution of the video loaded by the user: it divides all the frames of the entire video into several different scenes according to changes in the image content, and determines key image frames from the changes in the scene and/or subject of the image sequences before and after a frame; by clustering the audio frequency features of the video, it detects abrupt changes in the frequency distribution of the audio and determines key audio nodes; and from the key image frames and key audio nodes, it selects the nodes that do not destroy the integrity of the speech as split points.
  • the selection module 163 may be used to select one or more video clips to be synthesized into a new video from a plurality of video clips.
  • the sorting module 164 may be used to move at least one video segment to a position designated by the user, so as to sort a plurality of video segments.
  • The synthesis module 165 may be used to synthesize multiple video clips into a new video. In some possible implementation manners, one or more video clips may simply be selected for synthesis through the selection module 163; or, after multiple video clips are selected, the selected clips may be sorted by the sorting module 164 before synthesis; and/or one or more clips may simply be moved before all the video clips are synthesized.
  • the video editing device 160 shown in FIG. 1 can be implemented in software, but the video editing device 160 and one or more of its modules can also be implemented in hardware, software or a combination of software and hardware.
  • the data storage area of the internal memory 121 can store data created during the use of the electronic device 100 (for example, the split point of the video timeline) and the like.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the processor 111 executes various functional applications and data processing of the electronic device 100 by running instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
  • the audio module 130 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 130 may also be used to encode and decode audio signals.
  • the audio module 130 may be provided in the processor 111, or part of the functional modules of the audio module 130 may be provided in the processor 111.
  • the pressure sensor 140A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
  • the pressure sensor 140A may be provided on the display screen 150.
  • the capacitive pressure sensor may include at least two parallel plates with conductive materials. When a force is applied to the pressure sensor 140A, the capacitance between the electrodes changes.
  • the electronic device 100 determines the intensity of the pressure according to the change in capacitance.
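For context, this behavior follows the standard parallel-plate capacitor relation (general physics background, not stated in the source):

```latex
C = \frac{\varepsilon A}{d}
```

where \varepsilon is the permittivity of the dielectric, A the plate overlap area, and d the plate spacing; pressing the screen reduces d, so C increases, and the measured change in C is mapped to a touch intensity.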
  • the electronic device 100 detects the intensity of the touch operation according to the pressure sensor 140A.
  • the electronic device 100 may also calculate the touched position based on the detection signal of the pressure sensor 140A.
  • touch operations that act on the same touch position but have different touch operation strengths may correspond to different operation instructions. For example: when a touch operation whose intensity of the touch operation is less than the first pressure threshold is applied to the short message application icon, an instruction to view the short message is executed. When a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on the short message application icon, an instruction to create a new short message is executed.
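A toy illustration of this mapping from touch intensity to instruction; the threshold value and the normalized intensity scale are assumptions:

```python
FIRST_PRESSURE_THRESHOLD = 0.5  # assumed normalized intensity; illustrative only

def dispatch_short_message_touch(intensity: float) -> str:
    """Same touch position, different intensity, different instruction,
    as in the short-message example above."""
    if intensity < FIRST_PRESSURE_THRESHOLD:
        return "view_short_message"
    return "create_new_short_message"
```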
  • the touch sensor 140B is also called a "touch device”.
  • the touch sensor 140B may be disposed on the display screen 150, and the touch sensor 140B and the display screen 150 form a touch screen, which is also called a “touch screen”.
  • the touch sensor 140B is used to detect a touch operation acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • the visual output related to the touch operation may be provided through the display screen 150.
  • the touch sensor 140B may also be disposed on the surface of the electronic device 100, which is different from the position of the display screen 150.
  • the display screen 150 is used to display images, videos, and the like.
  • the display screen 150 includes a display panel.
  • The display panel can use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 100 may include one or N display screens 150, and N is a positive integer greater than one.
  • the electronic device 100 implements video playback and editing functions through GPU, video codec, display 150, NPU, and/or application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 150 and the application processor.
  • Video codecs are used to compress or decompress digital video.
  • The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, and so on.
  • Electronic devices include, but are not limited to, laptop devices, handheld PCs, personal digital assistants, cellular phones, portable media players, wearable devices (for example, display glasses or goggles, head-mounted displays (HMDs), watches, headsets, armbands, jewelry, etc.), virtual reality (VR) and/or augmented reality (AR) devices, in-vehicle infotainment devices, streaming media client devices, e-book reading devices, and various other electronic devices.
  • In general, a variety of devices and electronic devices capable of containing the processor and/or other execution logic disclosed herein are suitable.
  • Fig. 2 shows an example of a graphical user interface 200 presented on the display screen of the electronic device 100 according to an exemplary embodiment.
  • the user can edit the video in the graphical user interface 200.
  • the graphical user interface 200 includes a video preview and playback area 210, a video editing area 220, an audio display area 230, and a save button 240.
  • the video preview and play area 210 can be used to play the edited video and display the effect of the video editing.
  • The video editing area 220 can be the main operation area of video editing and can display the timeline of the video, for example, displaying multiple images of the video along the time dimension. It is understandable that, by convention, the left end of the timeline is the beginning of the video.
  • the video editing area 220 may also display one or more video clips edited by the user.
  • the audio display area 230 may be used to display an audio track corresponding to the video or a segment of the video in the video editing area 220.
  • the graphical user interface 200 may include more or fewer areas or modules than shown, or combine certain areas or modules, or split certain areas or modules, or adopt Different arrangements.
  • The graphical user interface 200 may also, optionally or additionally, include a tool area 250, which may provide other functional operations, for example loading a new video, saving a video, viewing a selected video or video clip, sharing, reversing order, and other functions. The user can share a saved video with other users, or post it on social media, by selecting the sharing function. As another example, the graphical user interface 200 may omit the audio display area 230.
  • the video editing method may be implemented by the video editing device 160 in the electronic device 100.
  • FIG. 3 shows the flow of a video processing method 300 according to an embodiment of the present application.
  • The method at least includes 301: the user loads the video to be edited in the graphical user interface 200. After the video is loaded, it can be presented in a timeline manner in the video editing area 220, and a preview of the video is displayed in the video preview and playback area 210, or the video is played.
  • the electronic device 100 obtains different editing instructions of the user for the video through the user's touch operation on the video displayed in the video editing area 220, for example, obtains different editing instructions of the user for the video through the instruction acquisition module 161.
  • The user's touch operation may include: when a touch operation whose duration is less than a first time threshold acts on the video, an instruction to select the video is executed.
  • When a touch operation whose duration reaches the first time threshold (a long press) acts on the video, the video segmentation instruction is executed.
  • When a touch operation continues to act on the video and the touch position continuously changes, an instruction to drag the video along with the touch position is executed.
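A compact sketch of this gesture-to-instruction mapping; the threshold value and the event fields are assumptions for illustration:

```python
FIRST_TIME_THRESHOLD_S = 0.5  # long-press boundary; the actual value is unspecified

def dispatch_touch(duration_s: float, position_changed: bool) -> str:
    """Maps a touch on the video in the editing area to an instruction."""
    if position_changed:
        return "drag_video"      # follow the touch position
    if duration_s < FIRST_TIME_THRESHOLD_S:
        return "select_video"    # short tap
    return "segment_video"       # long press triggers automatic segmentation
```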
  • the user can long press the video in the video editing area 220, that is, the user gives an instruction to perform video editing.
  • The segmentation module 162 may, at 303, in response to the instruction, automatically segment the video into a plurality of shorter video segments (sub-videos).
  • The user can also long-press one of the multiple video clips (sub-videos) again, and that video clip is likewise automatically divided into multiple shorter-duration video clips (grandchild videos).
  • A threshold for the number of video segments produced by one segmentation pass can be set according to the duration of the video, to control the granularity of the segmentation. If the granularity is not fine enough, a video segment can be segmented again, until the segment can no longer be processed by the automatic segmentation method.
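One way such a duration-dependent cap might look; the source only says the threshold "can be set according to the duration of the video", so the tiers below are pure illustration:

```python
def max_segments_per_pass(duration_s: float) -> int:
    """Cap the number of segments produced by one automatic split,
    coarser for short videos and finer for long ones (assumed tiers)."""
    if duration_s < 30:
        return 4
    if duration_s < 120:
        return 8
    return 12
```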
  • the divided multiple video clips can be displayed in the video editing area 220.
  • The user can click one or more video clips, and the selection module 163, at 304, selects the desired one or more video clips from the multiple video clips according to the user's instruction.
  • the selected video clip can be automatically played in the video preview area to display the video content in real time, so that the user can view the newly selected video content.
  • Optionally or additionally, the user can enter fine adjustment of the video by tapping the selected video again; the fine-adjustment operations can be the same as some existing video-editing operations, for example dragging the time bar to locate the region of each frame, and so on.
  • the user can also drag the selected video clip to freely adjust the position of the video clip on the timeline to realize the reordering of the video clip.
  • For example, the sorting module 164 may, at 305: according to the user's move instruction for at least one of the selected video clips, move that video clip to a position designated by the user.
  • In this way, when the user needs to splice multiple selected video clips, the splicing order of these clips can be determined from their respective current positions on the timeline.
  • Finally, once the selected video clips have been edited, the user taps the save button 240 and, for example through the synthesis module 165 at 306: a new video is generated from all the selected video clips and the positions specified by the user.
  • It can be understood that, in optional examples, the graphical user interface may also include buttons realizing other functions, for example a share button; the user can share the saved video with other users or on social media by tapping the share button.
  • FIG. 4 shows some possible user operations of video editing in the video editing area 220 of the graphical user interface 200.
  • the video 40 can be presented in the video editing area 220 in a timeline manner.
  • In FIG. 4, the video 40 or a video clip 42 is represented by a rectangular block, and the long side of the rectangle schematically represents the length of the timeline of the video 40 or the video clip 42.
  • When the user needs to edit the video 40, video editing is triggered by long-pressing 45 on the video 40. In response to the editing instruction, the video 40 is automatically divided into a plurality of shorter video clips 42(1...n), and these video clips 42 are arranged in the video editing area 220 in the order of the video's timeline. In some embodiments, the user can long-press one of the video clips 42(1...n) again, for example long-press 45 on the video clip 42(i), and the video clip 42(i) is likewise automatically divided into multiple shorter video clips 42(i.1...i.j).
  • As an example, as shown in FIG. 4, for the already-segmented video clips 42, the user selects the required clip or clips by tapping 47 one or more video clips 42.
  • For example, the user taps 47 video clip 42(1), video clip 42(i.1) and video clip 42(i.j) respectively, establishing that these three clips are the ones the user needs.
  • The user can then tap the save button 240, and all the selected video clips are generated into a new video with one tap.
  • For example, video clip 42(1), video clip 42(i.1) and video clip 42(i.j) are spliced in sequence, according to each clip's current position with reference to the timeline, using transition-splicing methods known in the field: splicing starts from video clip 42(1), followed by video clip 42(i.1), and finally video clip 42(i.j). The start time of the newly generated video is the start time of video clip 42(1), and the end time of the new video is the end time of video clip 42(i.j).
  • the user can also drag the selected video clip to freely adjust the position of the video clip on the timeline to realize the reordering of the video clip.
  • As shown in FIG. 4, the user taps 47 video clip 42(1), video clip 42(i.1) and video clip 42(i.j) respectively, establishing that these three clips are the ones the user needs.
  • Then the user drags 49 video clip 42(i.j), moving it to before video clip 42(i.1).
  • In this case, when the user needs to splice these video clips, according to each clip's current position, splicing starts from video clip 42(1), followed by video clip 42(i.j), and finally video clip 42(i.1); the start time of the newly generated video is the start time of video clip 42(1), and the end time of the new video is the end time of video clip 42(i.1). A small sketch of this ordering rule follows below.
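  • A minimal sketch of this ordering rule, assuming a hypothetical Clip record; actual transition splicing of the underlying media is omitted:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    name: str
    timeline_pos: float  # current left edge of the clip on the timeline
    start_s: float       # start time within the source video
    end_s: float         # end time within the source video

def splice_order(selected: list[Clip]) -> list[Clip]:
    """Return the selected clips in the order in which they will be joined."""
    return sorted(selected, key=lambda c: c.timeline_pos)

clips = [Clip("42(1)", 0.0, 0.0, 4.0),
         Clip("42(i.1)", 20.0, 30.0, 33.0),
         Clip("42(i.j)", 12.0, 40.0, 45.0)]  # dragged to before 42(i.1)

order = splice_order(clips)
print([c.name for c in order])  # ['42(1)', '42(i.j)', '42(i.1)']
# The new video starts at order[0].start_s and ends at order[-1].end_s.
```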
  • FIG. 5 shows a flow of an example method 500 for segmenting a video implemented by the electronic device 100 according to an embodiment of the present application.
  • The example method 500 is a specific example description of part 303 of the example method 300 shown in FIG. 3. For content not described in the embodiment of method 300 above, reference may be made to the embodiment of method 500 below; likewise, for content not described in the embodiment of method 500, reference may be made to the embodiment of method 300 above.
  • the method 500 at least includes: at 501: selecting at least one key image frame and at least one key audio node from the image frame sequence and the audio distribution of the video, respectively.
  • In some embodiments, the electronic device 100 analyzes the image frame sequence and the audio distribution in the user-loaded video separately. On the one hand, key frames and key events are analyzed and recognized in the images to determine possible image-frame segmentation points, namely the key image frames. On the other hand, the audio distribution is analyzed: the frequency distribution of the audio is examined, and speech, noise and other content in the audio are recognized, so as to determine possible audio segmentation points, namely the key audio nodes. These two aspects are described below with reference to FIG. 6 and FIG. 7, respectively.
  • FIG. 6 shows a schematic flowchart of a method 600 for selecting a key image frame.
  • In some embodiments, for the image frame sequence of the video, first, at 601: similar frames in the image frame sequence can be clustered, so that all image frames of the entire video are divided into several different scenes according to changes in the frames' image content; that is, the image frames within each class are similar, while image frames between classes are not.
  • As an example, first, seed frames are chosen: according to the image content of the frames, candidate frames are generated by matching methods such as random generation or temporally uniform generation, their image content is compared, and the image frames whose content difference exceeds a threshold are used as seed frames.
  • Next, low-level features of the image frames are generated, typically by methods such as SIFT (scale-invariant feature transform), human-body region detection, and motion region detection. The image-content relationship between frames is then computed by matching these low-level features; for example, different regions of interest, such as moving objects and human bodies, can be given different weights in the computation.
  • Finally, with the seed frames as the initial classification, the image frames are clustered using algorithms such as Meanshift or K-means (a minimal sketch of this step follows below).
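  • A minimal sketch of this clustering step, substituting grayscale-histogram features for the SIFT, human-body and motion features named above (an assumption made only to keep the example small) and using scikit-learn's KMeans with temporally uniform seed frames:

```python
import numpy as np
from sklearn.cluster import KMeans

def frame_features(frames: np.ndarray, bins: int = 32) -> np.ndarray:
    """frames: (N, H, W) grayscale frames -> (N, bins) histogram features."""
    feats = [np.histogram(f, bins=bins, range=(0, 255), density=True)[0]
             for f in frames]
    return np.asarray(feats)

def cluster_scenes(frames: np.ndarray, n_scenes: int) -> np.ndarray:
    feats = frame_features(frames)
    # Temporally uniform "seed frames" serve as the initial classification.
    seed_idx = np.linspace(0, len(frames) - 1, n_scenes).astype(int)
    km = KMeans(n_clusters=n_scenes, init=feats[seed_idx], n_init=1)
    return km.fit_predict(feats)  # one scene label per frame

# Class boundaries are the indices where the scene label changes.
labels = cluster_scenes(np.random.randint(0, 256, (120, 72, 128)), n_scenes=3)
boundaries = np.flatnonzero(np.diff(labels)) + 1
print(boundaries)
```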
  • After the image frames are clustered, at 602: a criticality value can also be computed for frames from the inter-frame content. In general, the boundary of an image-frame class is the starting point or end point of an event.
  • the scene change of the video is also a continuous change process, so it is necessary to carefully determine the appropriate segmentation time point at the boundary of each video segment.
  • Several key factors to determine the time point of the split include at least the following aspects:
  • The first aspect is the blur of the image: for example, a significant scene change may be accompanied by significant camera movement, which can produce a series of blurred image frames. In this case, the distinctiveness of the blurred frames needs to be computed from the previously computed features (a possible blur measure is sketched after this list).
  • the second aspect is the tracking of moving objects.
  • In some cases, a scene change occurs because a foreground object moves and leaves the field of view. A video segmentation point therefore needs to encompass the object's entire motion, and should, as far as possible, prevent the departing object from entering the next scene after segmentation.
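  • For the blur factor, one common measure (an assumption here; the application does not commit to a specific one) is the variance of a Laplacian filter response, which drops sharply on blurred frames:

```python
import numpy as np
from scipy.ndimage import laplace

def blur_score(gray_frame: np.ndarray) -> float:
    """Variance of the Laplacian: higher means sharper, lower means blurrier."""
    return float(laplace(gray_frame.astype(np.float64)).var())

def is_blurred(gray_frame: np.ndarray, threshold: float = 100.0) -> bool:
    return blur_score(gray_frame) < threshold  # threshold is a tuning assumption
```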
  • Based on the above factors, the criticality value is computed from the inter-frame content at the boundary of each class obtained by clustering, and at 603: the key image frames serving as segmentation points are selected according to the computed criticality values.
  • As an example, the criteria for selecting a key image frame may include the following. First, the segmentation point to be selected should ensure that the preceding image frame is relatively clear and that the image content before and after the frame is uncorrelated; that is, the frame to be selected is the one whose computed correlation with the preceding frame is smallest within the scene-transition stage.
  • Second, if the moving object is present throughout the transition stage, no key image frame is selected; if the object's motion ends completely within the transition stage, the image frame in which the object last appears is selected as the key image frame. If the first criterion conflicts with the second, the second takes precedence in selecting the key image frame (a small routine combining both criteria is sketched below).
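  • The two criteria can be combined into a small selection routine. This is a hedged reconstruction: the per-frame records, the sharpness flag standing in for "relatively clear", and the correlation values are assumed to come from the feature-matching step above.

```python
from typing import Optional

def pick_key_frame(transition: list[dict]) -> Optional[int]:
    """transition: per-frame records over one class boundary, e.g.
    {"idx": 17, "corr_prev": 0.21, "sharp": True, "object_visible": False}.
    Returns the index of the chosen key image frame, or None."""
    if transition and all(f["object_visible"] for f in transition):
        return None  # the moving object spans the whole transition: no key frame
    with_object = [f for f in transition if f["object_visible"]]
    if with_object:  # the object's motion ends inside the transition:
        return with_object[-1]["idx"]  # take its last appearance (criterion 2)
    clear = [f for f in transition if f["sharp"]]
    if not clear:
        return None
    # Criterion 1: smallest correlation with the preceding frame.
    return min(clear, key=lambda f: f["corr_prev"])["idx"]
```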
  • FIG. 7 shows a schematic flowchart of a method 700 for selecting a key audio node.
  • 701: Detect the audio frequencies of the video, and determine multiple audio frequency distributions of the video according to the detected frequencies.
  • As an example, time-frequency conversion is performed on the audio, for example by fast Fourier transform (FFT) or short-time Fourier transform (STFT), to obtain the audio frequencies; the frequency distribution of the audio is then detected by audio frequency-feature detection methods known in the field.
  • Then, 702: cluster the multiple audio frequency distributions to obtain multiple audio-frequency-distribution categories. For example, feature clustering is performed on the entire audio using methods such as SVM (support vector machine) or K-means, yielding audio segments that meet a predetermined frequency similarity, together with the start and end points of these segments.
  • Then, 703: select the intersection points of every two audio-frequency-distribution categories among the multiple categories as the at least one key audio node. For example, compared with the frequencies before and after them, an abrupt frequency change usually occurs at these start and end points; that is, the change in frequency exceeds a certain threshold. These points are the key audio nodes and can serve as video segmentation points (steps 701-703 are sketched below).
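  • A compact sketch of steps 701-703, assuming SciPy's STFT and K-means over per-window magnitude spectra; under this reading, the key audio nodes fall where the cluster label of consecutive windows changes:

```python
import numpy as np
from scipy.signal import stft
from sklearn.cluster import KMeans

def key_audio_nodes(audio: np.ndarray, sr: int, n_classes: int = 4) -> np.ndarray:
    """Return the times (in seconds) at which the audio class changes."""
    f, t, Z = stft(audio, fs=sr, nperseg=1024)   # 701: time-frequency transform
    spectra = np.abs(Z).T                        # one magnitude spectrum per window
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(spectra)  # 702
    change = np.flatnonzero(np.diff(labels)) + 1  # 703: category intersections
    return t[change]

sr = 16000
audio = np.random.randn(sr * 10)                 # 10 s of placeholder audio
print(key_audio_nodes(audio, sr))
```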
  • After audio recognition is applied to the audio segments obtained by clustering, the scene corresponding to each audio segment is identified, for example noise scenes, vocalization scenes of one or more subjects, and so on.
  • The noise of a noise scene may include noise floor (silence), non-human noise, multi-person noise, and so on; one case of noise floor is a speaker's pause while speaking.
  • A subject's vocalization scene may include a single human voice and the like (that is, the same person's frequency accounts for the main audio energy). A rough labeling heuristic is sketched below.
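  • One possible labeling heuristic (an assumption; the application leaves the recognizer open) uses signal energy for silence and spectral flatness to separate voice-like from noise-like segments; distinguishing multi-person noise would additionally require a speaker-count model and is omitted here:

```python
import numpy as np

def label_segment(segment: np.ndarray) -> str:
    """Rough scene labels: noise floor / non-human noise / single voice."""
    if np.sqrt(np.mean(segment ** 2)) < 1e-3:    # RMS energy near zero
        return "noise floor (silence)"
    spec = np.abs(np.fft.rfft(segment)) + 1e-12
    # Spectral flatness: geometric mean over arithmetic mean; ~1 is noise-like.
    flatness = np.exp(np.mean(np.log(spec))) / np.mean(spec)
    return "non-human noise" if flatness > 0.5 else "single voice"
```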
  • Continuing with FIG. 5, at block 502: each of the at least one key image frame and the at least one key audio node is taken out one by one in time order. The electronic device 100 obtains, for example from the internal memory 121, all the key image frames and key audio nodes selected by the above methods.
  • The electronic device 100 then determines the final video segmentation points from these key image frames and key audio nodes.
  • FIG. 8 shows a schematic diagram of an image sequence of a video, respective nodes of audio, and a segmented video segment according to an embodiment of the present application.
  • As shown in FIG. 8, suppose the starting point of the loaded video is denoted S and the end point E; all intermediate key image frames are denoted ICi, and the intermediate key audio nodes VCi; the sequence of image frames between ICi and ICi+1 is denoted IFi, and the audio distribution composed of the audio signal between VCi and VCi+1 is denoted VFi. Here, IFi is a sub-image-frame sequence of the video's image frame sequence, and VFi is a sub-audio distribution within the video's audio distribution.
  • The starting point S serves as the starting point of both the image sequence and the audio, while the ICi and the VCi are the nodes for which it must be judged whether they serve as video segmentation points for dividing the video into video segments (for example, sub-videos, grandchild videos, and so on).
  • In other embodiments, it can also be judged whether the time interval between the current node and the previous node is long enough; if the interval is too short, evaluation of the current node can be abandoned.
  • For example, the minimum interval can be set to 1/20 of the total duration of the video, and the current node is abandoned if the interval is smaller than that (a simple pre-filter implementing this rule is sketched below).
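  • This minimum-spacing rule can be applied as a simple pre-filter over the time-ordered candidate nodes; 1/20 of the total duration is the example value from the text, and treating the video start S as the first "previous node" is an assumption:

```python
def filter_close_nodes(node_times: list[float], video_duration: float,
                       fraction: float = 1 / 20) -> list[float]:
    """Drop any candidate closer than fraction*duration to the last kept node."""
    min_gap = video_duration * fraction
    kept, last = [], 0.0  # the video start S acts as the first previous node
    for t in sorted(node_times):
        if t - last >= min_gap:
            kept.append(t)
            last = t
    return kept

print(filter_close_nodes([1.0, 1.5, 9.0, 9.2, 30.0], video_duration=60.0))
# [9.0, 30.0]: 1.0/1.5 are too close to the start, 9.2 is too close to 9.0
```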
  • Continuing with FIG. 5, the nodes are then examined one by one in timeline order. At 503: determine whether the item taken out is a key audio node or a key image frame. If it is a key audio node, for example VCi, then at 504: determine whether the sub-audio distributions before and after the key audio node are both noise. If not, go to 506a: keep the sampling moment of the key audio node as a video segmentation point; if so, go to 507a: discard the key audio node. For example, if the two sub-audio distributions VFi-1 and VFi before and after VCi are both noise, VCi is not kept as a video segmentation point; if one of VFi-1 and VFi is noise, or if neither VFi-1 nor VFi is noise, VCi is kept as a video segmentation point.
  • Then, at 508a: determine whether the key audio node is the last node. If the key audio node is the last of all the key image frames and key audio nodes taken out, the method ends; if it is not the last node, return to 502 and perform 503-508 for the next key image frame or key audio node after this key audio node.
  • As an example, at block 501 in FIG. 5, after the electronic device 100 clusters the audio, the sub-audio distributions VFi-1 and VFi are placed in different classes separated by VCi; the frequency similarity between the two is therefore insufficient, and they can be regarded as different sounds.
  • In one possible case, after the sub-audio distributions VFi-1 and VFi are recognized separately, they are identified as two different kinds of noise; for example, the sub-audio distribution VFi-1 is silent noise floor and the sub-audio distribution VFi is non-human noise such as environmental noise. In this case, VCi does not need to be a final segmentation point of the video, and in the final segmentation the sub-audio distributions VFi-1 and VFi can be divided out together as a single noise segment.
  • In another possible case, after the sub-audio distributions VFi-1 and VFi are recognized separately, one of them is identified as noise; for example, the sub-audio distribution VFi-1 is silent noise floor and the sub-audio distribution VFi is a single person's speech, or VFi-1 is a single person's speech and VFi is non-human noise. The single person's speech can then be separated from the noise, so VCi can be used as a final segmentation point of the video.
  • In yet another possible case, after the sub-audio distributions VFi-1 and VFi are recognized separately, both are identified as non-noise; for example, VFi-1 and VFi are the voices of different people. The voices of the different people can then be separated, so VCi can be used as a final segmentation point of the video.
  • In one possible scenario, suppose the video content is a conversation between two people, whose speech differs in audio frequency. At the same time, natural short pauses may occur in their speech, such as the pause at the end of each sentence, or a longer pause while one person considers how to respond to a question raised by the other; in the audio, these possible pauses can be regarded as noise floor. In such a video, the part in which each speaker talks can be segmented out, and, within each person's part, a complete sentence or passage can further be segmented out according to the pauses.
  • In such scenarios, reasonable segmentation of the speech is all the more critical, and segmenting the audio while a speaker is talking continuously must be avoided.
  • The foregoing method avoids unnecessary segmentation of the audio; for example, it prevents the audio from being cut while the speaker is talking continuously, which would leave the segmented speech incompletely expressed or lose part of the speech.
  • Returning to block 503 of FIG. 5, if the item taken out is a key image frame, that is, the node is ICi, then at 505: determine whether the sub-audio distribution covering the key image frame ICi is a noise distribution. If so, go to 506b: keep the sampling moment of the key image frame ICi as a video segmentation point; if not, go to 507b: discard the key image frame ICi. Then, at 508b: determine whether the key image frame is the last node. If the key image frame is the last of all the key image frames and key audio nodes taken out, the method ends; if it is not the last node, return to 502 and perform 503-508 for the next key image frame or key audio node after this key image frame.
  • As an example, as shown in FIG. 8, on the timeline the time point corresponding to the key image frame ICi lies within the sub-audio distribution VFi. If the sub-audio distribution VFi is noise, ICi can be used as a final segmentation point of the video: because the audio in which ICi lies is not a speaker's speech, using ICi as a segmentation point will not cut through the speaker's speech, and this also ensures effective segmentation of the image sequence. Conversely, if the sub-audio distribution VFi is a human-voice distribution, the key image frame ICi is not kept as a video segmentation point, because it would cut the speaker's speech and leave the segmented speech incompletely expressed or missing parts. The above method thus ensures the integrity of the speech audio (the keep/discard decisions of 502-507 are sketched below).
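  • Putting 502-507 together, the keep/discard decision can be sketched as follows, assuming the sub-audio distributions around each node have already been labeled noise or non-noise; the Node record is illustrative, and for an image frame the before_noise flag stands for the sub-audio distribution covering it:

```python
from dataclasses import dataclass

@dataclass
class Node:
    time_s: float
    kind: str           # "audio" for a VCi, "image" for an ICi
    before_noise: bool  # noise label of the sub-audio before/covering the node
    after_noise: bool   # noise label after the node (audio nodes only)

def final_split_points(nodes: list[Node]) -> list[float]:
    points = []
    for n in sorted(nodes, key=lambda n: n.time_s):        # 502: in time order
        if n.kind == "audio":                              # 503 -> 504
            keep = not (n.before_noise and n.after_noise)  # both noise: discard
        else:                                              # 503 -> 505
            keep = n.before_noise  # ICi kept only if its sub-audio is noise
        if keep:
            points.append(n.time_s)                        # 506a / 506b
    return points                                          # others: 507a / 507b

# VC1 (noise|voice), ICi lying in noise, VCi+1 (voice|voice) are all kept:
nodes = [Node(3.0, "audio", True, False),
         Node(7.5, "image", True, True),
         Node(12.0, "audio", False, False)]
print(final_split_points(nodes))  # [3.0, 7.5, 12.0]
```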
  • After all the key image frames and key audio nodes have been judged as above, the final video segmentation points are obtained.
  • For example, as shown in FIG. 8, in timeline order the final video segmentation points are VC1, ICi and VCi+1.
  • The video is then segmented at the sampling moments of these segmentation points, and the resulting video segments are 81, 82, 83 and 84, respectively (turning kept points into segment intervals is sketched below).
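  • Finally, the kept sampling moments, together with the start S and end E, define consecutive segments such as 81-84; a trivial sketch:

```python
def to_segments(split_points: list[float], start: float, end: float):
    cuts = [start, *sorted(split_points), end]
    return list(zip(cuts[:-1], cuts[1:]))

# With VC1, ICi and VCi+1 kept, a 20 s video falls into four segments:
print(to_segments([3.0, 7.5, 12.0], start=0.0, end=20.0))
# [(0.0, 3.0), (3.0, 7.5), (7.5, 12.0), (12.0, 20.0)]
```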
  • According to the embodiments of the present application, a method is provided for quickly editing a video on a mobile terminal, which can cope with scenarios in which pixel-level positioning cannot be finely adjusted.
  • In addition, the embodiments of the present application also provide a method for automatic video segmentation, by which independent and meaningful video segments of the video subject can be obtained.
  • Program code can be applied to the input instructions to perform the functions described herein and to generate output information.
  • the output information can be applied to one or more output devices in a known manner.
  • a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
  • the program code can be implemented in a high-level programming language or an object-oriented programming language to communicate with the processing system.
  • assembly language or machine language can also be used to implement the program code.
  • In fact, the mechanisms described herein are not limited in scope to any particular programming language. In either case, the language may be a compiled or an interpreted language.
  • One or more aspects of at least one embodiment may be implemented by representative instructions stored on a computer-readable storage medium; the instructions represent various logic in a processor and, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. These representations, known as "IP cores", can be stored on a tangible computer-readable storage medium and provided to multiple customers or production facilities to be loaded into the fabrication machines that actually manufacture the logic or processor.
  • the instruction converter can be used to convert instructions from the source instruction set to the target instruction set.
  • The instruction converter may translate (for example, using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core.
  • the instruction converter can be implemented by software, hardware, firmware, or a combination thereof.
  • the instruction converter may be on the processor, off the processor, or part on the processor and part off the processor.


Abstract

The present application provides a video processing method, a device for video processing, and a storage medium. The method analyzes the image frame sequence and the audio distribution in a video loaded by a user to determine key image frames and key audio nodes that can be used to segment the video, and, in response to a user's segmentation instruction, selects suitable key image frames and key audio nodes to segment the video automatically, obtaining multiple video clips. According to the embodiments of the present application, a method for quickly editing a video on a mobile terminal is provided, which can cope with scenarios in which pixel-level positioning cannot be finely adjusted. In addition, the embodiments of the present application further provide an automatic video segmentation method, by which independent and meaningful video segments of the video subject can be obtained.

Description

视频处理方法和视频处理的设备、存储介质
本申请要求于2020年02月13日提交国家知识产权局、申请号为202010090350.7、申请名称为“视频处理方法和视频处理的设备、存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请的一个或多个实施例通常涉及触控设备的视频处理领域,具体涉及一种视频处理方法和视频处理的设备、存储介质。
背景技术
视频剪辑技术是对录制的视频源文件进行分割和重新拼接的技术。该技术已经发展较长时间,并且已有非常多的剪辑软件,如果从操作平台上分类,可以分成PC端的软件和移动端软件。移动端软件的显示界面较小,且主要通过手指点触操作,完全依赖于手指触控来编辑软件,由于手指面积较大,很难精准的控制到某一个时间点或者某一帧,常常需要反复调节多次才能达到一次调节的目的。因此手指点触会造成不友好的操作体验。
发明内容
本申请的一些实施方式提供了一种视频处理方法和视频处理的装置、存储介质和***。以下从多个方面介绍本申请,以下多个方面的实施方式和有益效果可互相参考。
为了应对上述场景,第一方面,本申请的实施方式提供了一种视频处理方法,包括:获得用户对一个视频进行分割的指令;和响应于指令,将视频分割成多个子视频,其中多个子视频中的每个子视频的持续时间段是至少部分地基于视频的至少一个关键图像帧中的一个关键图像帧的采样时刻或者视频的至少一个关键音频节点中的一个关键音频节点的采样时刻,其中,在视频的图像帧序列中出现在一个关键图像帧之前的子图像帧序列与出现在一个关键图像帧之后的子图像帧序列之间存在图像场景的变化和图像主体的变化中的至少一个,以及其中,在视频的音频分布中出现在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布之间存在说话人主体的变化和噪声分布的变化的至少一个。
从上述第一方面的实施方式中可以看出,本申请的实施方式可以使用户可以快速的剪切多段视频片段。此外,本申请的实施方式通过对用户载入的视频中的图像帧序列和音频分布分别进行分析,通过判断图像帧前后的图像序列的场景和/或主体的变化确定关键图像帧,以及通过判断音频分布的频率突变,确定关键音频节点,并在视频分割时从关键图像帧和关键音频节点中选择不会破坏语音完整性的节点作为分割点,由此,避免对音频的不必要的分割,解决了视频处理造成语音不完整的问题。
结合第一方面,在一些实施方式中,指令包括用户对于视频的长按指令。
从上述结合第一方面的实施方式中可以看出,本申请的实施方式可以使用户在移动终端上剪切视频更加方便。
结合第一方面,在一些实施方式中,还包括:获取用户对多个子视频中的至少一个子视频的选择指令,从多个子视频中选出至少一个子视频,其中选择指令包括用户对至少一个子视频的点击指令。
结合第一方面,在一些实施方式中,还包括:获取用户对选出的至少一个子视频中的一个或多个子视频的移动指令,将一个或多个子视频移动到用户指定的位置,从而对选出的至少一个子视频进行排序,其中移动指令包括对一个或多个子视频的滑动指令。
从上述结合第一方面的实施方式中可以看出,本申请的实施方式可以使用户可以快速的选择多段视频片段,还可以快速的调节视频片段间的顺序。
结合第一方面,在一些实施方式中,响应于指令,将视频分割成多个子视频,还包括:从图像帧序列中选出至少一个关键图像帧;从音频分布中选出至少一个关键音频节点;确定是否保留至少一个关键图像帧中的一个关键图像帧和至少一个关键音频节点中的一个关键音频节点;至少部分地根据保留的一个关键图像帧的采样时刻和一个关键音频节点的采样时刻中的至少一个,确定时间段。
从上述结合第一方面的实施方式中可以看出,本申请的实施方式进一步避免对音频的不必要的分割,例如,可以防止对说话人在连续说话时的音频进行分割,使得分割的语音音频的表达不完整,或丢失部分语音。
结合第一方面,在一些实施方式中,确定是否保留至少一个关键图像帧中的一个关键图像帧和至少一个关键音频节点中的一个关键音频节点,包括:确定在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布是否包括噪声分布;在确定在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布中的一个包括噪声分布,或者在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布均不包括噪声分布的情况下,确定保留一个关键音频节点;和在确定在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布均包括噪声分布的情况下,确定放弃一个关键音频节点。
结合第一方面,在一些实施方式中,在至少一个关键音频节点包括多个关键音频节点的情况下,在一个关键音频节点之前的子音频分布包括视频的起始节点和一个关键音频节点之间的子音频分布,或者多个关键音频节点中位于一个关键音频节点之前的关键音频节点与一个关键音频节点之间的子音频分布。
结合第一方面,在一些实施方式中,在至少一个关键音频节点包括多个关键音频节点的情况下,在一个关键音频节点之后的子音频分布包括一个关键音频节点与视频的终止节点之间的子音频分布,或者一个关键音频节点与多个关键音频节点中位于一个关键音频节点之后的关键音频节点之间的子音频分布。
结合第一方面,在一些实施方式中,确定是否保留至少一个关键图像帧中的一个关键图像帧和至少一个关键音频节点中的一个关键音频节点,包括:确定子音频分布是否包括噪声分布,其中子音频分布包括与一个关键图像帧相关的采样时刻;在确定子音频分布包括噪声分布的情况下,确定保留一个关键图像帧;和在确定子音频分布不包括噪声分布的情况下,确定放弃一个关键图像帧。
结合第一方面,在一些实施方式中,噪声分布包括:无声分布,非人噪声分布和多人噪声分布中的至少一个。
结合第一方面,在一些实施方式中,噪声分布的变化包括出现在一个关键音频节点之前的子音频分布包括噪声分布,并且出现在一个关键音频节点之后的子音频分布包括非噪声分布;或者出现在一个关键音频节点之前的子音频分布包括非噪声分布,并且出现在一个关键音频节点之后的子音频分布包括噪声分布。
结合第一方面,在一些实施方式中,噪声分布的变化包括出现在一个关键音频节点之前的子音频分布包括噪声分布中的至少一种噪声分布,并且出现在一个关键音频节点之后的子音频分布包括噪声分布中的至少另一种噪声分布。
结合第一方面,在一些实施方式中,从音频分布中选出至少一个关键音频节点,包括:检测视频的音频频率,并根据检测到的音频频率确定视频的多个音频频率分布,其中多个音频频率分布中的每个音频频率分布包括同一个音频频率的分布;对多个音频频率分布进行聚类,以获得多个音频频率分布类别,其中多个音频频率分布类别中的每个音频频率分布类别包括多个音频频率分布中的至少一个音频频率分布;和选择多个音频频率分布类别中的每两个音频频率分布类别的交点,作为至少一个关键音频节点。
结合第一方面,在一些实施方式中,聚类包括利用聚类算法对多个音频频率分布进行聚类,其中聚类算法包括SVM和Kmeans算法中的至少一个。
第二方面,本申请的实施方式提供了一种视频处理方法,包括:获得用户对一个视频进行分割的第一指令,其中第一指令包括对视频的长按指令;响应于第一指令,将视频分割成多个子视频,其中多个子视频中的每个子视频包括与每个子视频的时间段相关联的图像和音频。
从上述第二方面的实施方式中可以看出,本申请的实施方式可以使用户可以快速的剪切多段视频片段,还可以使用户在移动终端上剪切视频更加方便。此外,本申请的实施方式通过对用户载入的视频中的图像帧序列和音频分布分别进行分析,通过判断图像帧前后的图像序列的场景和/或主体的变化确定关键图像帧,以及通过判断音频分布的频率突变,确定关键音频节点,并在视频分割时从关键图像帧和关键音频节点中选择不会破坏语音完整性的节点作为分割点,由此,避免对音频的不必要的分割,解决了视频处理造成语音不完整的问题。
结合第二方面,在一些实施方式中,获得用户对多个子视频中的一个子视频进行分割的第二指令,其中第二指令包括对一个子视频的长按指令;和响应于第二指令,将一个子视频分割成多个孙视频,其中多个孙视频中的每个孙视频包括与每个孙视频的时间段相关联的图像和音频。
结合第二方面,在一些实施方式中,还包括:获取用户对多个子视频中的至少一个子视频的移动指令,将至少一个子视频移动到用户指定的位置,从而对多个子视频进行排序。
从上述结合第二方面的实施方式中可以看出,本申请的实施方式可以使用户可以快速的选择多段视频片段,还可以快速的调节视频片段间的顺序。
结合第二方面,在一些实施方式中,还包括:获取用户对多个孙视频中的至少一个孙视频的移动指令,将至少一个孙视频移动到用户指定的位置,从而对多个孙视频进行 排序。
第三方面,本申请的实施方式提供了一种视频处理装置,包括:指令获取模块,用于获得用户对一个视频进行分割的指令;和分割模块,用于响应于指令,将视频分割成多个子视频,其中多个子视频中的每个子视频的持续时间段是至少部分地基于视频的至少一个关键图像帧中的一个关键图像帧的采样时刻或者视频的至少一个关键音频节点中的一个关键音频节点的采样时刻,其中,在视频的图像帧序列中出现在一个关键图像帧之前的子图像帧序列与出现在一个关键图像帧之后的子图像帧序列之间存在图像场景的变化和图像主体的变化中的至少一个,以及其中,在视频的音频分布中出现在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布之间存在说话人主体的变化和噪声分布的变化的至少一个。
从上述第三方面的实施方式中可以看出,本申请的实施方式可以使用户可以快速的剪切多段视频片段。此外,本申请的实施方式通过对用户载入的视频中的图像帧序列和音频分布分别进行分析,通过判断图像帧前后的图像序列的场景和/或主体的变化确定关键图像帧,以及通过判断音频分布的频率突变,确定关键音频节点,并在视频分割时从关键图像帧和关键音频节点中选择不会破坏语音完整性的节点作为分割点,由此,避免对音频的不必要的分割,解决了视频处理造成语音不完整的问题。
结合第三方面,在一些实施方式中,指令包括用户对于视频的长按指令。
从上述结合第三方面的实施方式中可以看出,本申请的实施方式可以使用户在移动终端上剪切视频更加方便。
结合第三方面,在一些实施方式中,还包括:获取用户对多个子视频中的至少一个子视频的选择指令,从多个子视频中选出至少一个子视频,其中选择指令包括用户对至少一个子视频的点击指令。
结合第三方面,在一些实施方式中,还包括:排序模块,用于获取用户对选出的至少一个子视频中的一个或多个子视频的移动指令,将一个或多个子视频移动到用户指定的位置,从而对选出的至少一个子视频进行排序,其中移动指令包括对一个或多个子视频的滑动指令。
从上述结合第三方面的实施方式中可以看出,本申请的实施方式可以使用户可以快速的选择多段视频片段,还可以快速的调节视频片段间的顺序。
结合第三方面,在一些实施方式中,响应于指令,将视频分割成多个子视频,还包括:从图像帧序列中选出至少一个关键图像帧;从音频分布中选出至少一个关键音频节点;确定是否保留至少一个关键图像帧中的一个关键图像帧和至少一个关键音频节点中的一个关键音频节点;至少部分地根据保留的一个关键图像帧的采样时刻和一个关键音频节点的采样时刻中的至少一个,确定时间段。
从上述结合第三方面的实施方式中可以看出,本申请的实施方式进一步避免对音频的不必要的分割,例如,可以防止对说话人在连续说话时的音频进行分割,使得分割的语音音频的表达不完整,或丢失部分语音。
结合第三方面,在一些实施方式中,确定是否保留至少一个关键图像帧中的一个关键图像帧和至少一个关键音频节点中的一个关键音频节点,包括:确定在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布是否包括噪声分布; 在确定在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布中的一个包括噪声分布,或者在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布均不包括噪声分布的情况下,确定保留一个关键音频节点;和在确定在一个关键音频节点之前的子音频分布与出现在一个关键音频节点之后的子音频分布均包括噪声分布的情况下,确定放弃一个关键音频节点。
结合第三方面,在一些实施方式中,在至少一个关键音频节点包括多个关键音频节点的情况下,在一个关键音频节点之前的子音频分布包括视频的起始节点和一个关键音频节点之间的子音频分布,或者多个关键音频节点中位于一个关键音频节点之前的关键音频节点与一个关键音频节点之间的子音频分布。
结合第三方面,在一些实施方式中,在至少一个关键音频节点包括多个关键音频节点的情况下,在一个关键音频节点之后的子音频分布包括一个关键音频节点与视频的终止节点之间的子音频分布,或者一个关键音频节点与多个关键音频节点中位于一个关键音频节点之后的关键音频节点之间的子音频分布。
结合第三方面,在一些实施方式中,确定是否保留至少一个关键图像帧中的一个关键图像帧和至少一个关键音频节点中的一个关键音频节点,包括:确定子音频分布是否包括噪声分布,其中子音频分布包括与一个关键图像帧相关的采样时刻;在确定子音频分布包括噪声分布的情况下,确定保留一个关键图像帧;和在确定子音频分布不包括噪声分布的情况下,确定放弃一个关键图像帧。
结合第三方面,在一些实施方式中,噪声分布包括:无声分布,非人噪声分布和多人噪声分布中的至少一个。
结合第三方面,在一些实施方式中,噪声分布的变化包括出现在一个关键音频节点之前的子音频分布包括噪声分布,并且出现在一个关键音频节点之后的子音频分布包括非噪声分布;或者出现在一个关键音频节点之前的子音频分布包括非噪声分布,并且出现在一个关键音频节点之后的子音频分布包括噪声分布。
结合第三方面,在一些实施方式中,噪声分布的变化包括出现在一个关键音频节点之前的子音频分布包括噪声分布中的至少一种噪声分布,并且出现在一个关键音频节点之后的子音频分布包括噪声分布中的至少另一种噪声分布。
结合第三方面,在一些实施方式中,从音频分布中选出至少一个关键音频节点,包括:检测视频的音频频率,并根据检测到的音频频率确定视频的多个音频频率分布,其中多个音频频率分布中的每个音频频率分布包括同一个音频频率的分布;对多个音频频率分布进行聚类,以获得多个音频频率分布类别,其中多个音频频率分布类别中的每个音频频率分布类别包括多个音频频率分布中的至少一个音频频率分布;和选择多个音频频率分布类别中的每两个音频频率分布类别的交点,作为至少一个关键音频节点。
结合第三方面,在一些实施方式中,聚类包括利用聚类算法对多个音频频率分布进行聚类,其中聚类算法包括SVM和Kmeans算法中的至少一个。
第四方面,本申请的实施方式提供了一种视频处理装置,包括:指令获取模块,用于获得用户对一个视频进行分割的第一指令,其中第一指令包括对视频的长按指令;分割模块,用于响应于第一指令,将视频分割成多个子视频,其中多个子视频中的每个子视频包括与每个子视频的时间段相关联的图像和音频。
从上述第四方面的实施方式中可以看出,本申请的实施方式可以使用户可以快速的剪切多段视频片段,还可以使用户在移动终端上剪切视频更加方便。此外,本申请的实施方式通过对用户载入的视频中的图像帧序列和音频分布分别进行分析,通过判断图像帧前后的图像序列的场景和/或主体的变化确定关键图像帧,以及通过判断音频分布的频率突变,确定关键音频节点,并在视频分割时从关键图像帧和关键音频节点中选择不会破坏语音完整性的节点作为分割点,由此,避免对音频的不必要的分割,解决了视频处理造成语音不完整的问题。语音不完整的问题。
结合第四方面,在一些实施方式中,还包括:获得用户对多个子视频中的一个子视频进行分割的第二指令,其中第二指令包括对一个子视频的长按指令;和响应于第二指令,将一个子视频分割成多个孙视频,其中多个孙视频中的每个孙视频包括与每个孙视频的时间段相关联的图像和音频。
结合第四方面,在一些实施方式中,还包括:排序模块,用于获取用户对多个子视频中的至少一个子视频的移动指令,将至少一个子视频移动到用户指定的位置,从而对多个子视频进行排序。
从上述结合第四方面的实施方式中可以看出,本申请的实施方式可以使用户可以快速的选择多段视频片段,还可以快速的调节视频片段间的顺序。
结合第四方面,在一些实施方式中,排序模块还包括:获取用户对多个孙视频中的至少一个孙视频的移动指令,将至少一个孙视频移动到用户指定的位置,从而对多个孙视频进行排序。
第五方面,本申请提供了一种计算机可读存储介质,该存储介质可以是非易失性的。该存储介质中包含指令,该指令在执行后实施如前述任意一个方面或实施方式所描述的方法。
第六方面,本申请提供了一种电子设备,包括:存储器,用于存储由电子设备的一个或多个处理器执行的指令,以及处理器,用于执行存储器中的指令,以执行根据前述任意一个方面或实施方式所描述的方法。
附图说明
图1示出了根据本申请实施方式的示例电子设备的模块示意图。
图2示出了根据示例性实施方式的电子设备的显示屏呈现的图形用户界面的示意图。
图3示出了根据本申请实施方式的视频处理方法的流程示意图。
图4示出根据本申请实施方式的视频处理的可能的用户操作的示意图。
图5示出了根据本申请实施方式的对视频进行分割的方法的流程示意图。
图6示出了根据本申请实施方式的选取关键图像帧的方法的流程示意图。
图7示出根据本申请实施方式的选取关键音频节点的方法的流程示意图。
图8示出了根据本申请实施方式的视频的图像序列、音频各自节点和分割视频片段的示意图。
具体实施方式
以下由特定的具体实施例说明本申请的实施方式,本领域技术人员可由本说明书所揭示的内容轻易地了解本申请的其他优点及功效。虽然本申请的描述将结合较佳实施例一起介绍,但这并不代表此发明的特征仅限于该实施方式。恰恰相反,结合实施方式作发明介绍的目的是为了覆盖基于本申请的权利要求而有可能延伸出的其它选择或改造。为了提供对本申请的深度了解,以下描述中将包含许多具体的细节。本申请也可以不使用这些细节实施。此外,为了避免混乱或模糊本申请的重点,有些具体细节将在描述中被省略。需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。
此外,各种操作将以最有助于理解说明性实施例的方式被描述为多个离散操作;然而,描述的顺序不应被解释为暗示这些操作必须依赖于顺序。特别是,这些操作不需要按呈现顺序执行。
除非上下文另有规定,否则术语“包含”,“具有”和“包括”是同义词。短语“A/B”表示“A或B”。短语“A和/或B”表示“(A和B)或者(A或B)”。
在一些情况下,所公开的实施例可以以硬件、固件、软件或其任何组合来实现。所公开的实施例还可以被实现为由一个或多个暂时或非暂时性机器可读(例如,计算机可读)存储介质承载或存储在其上的指令,其可以由一个或多个处理器读取和执行。例如,指令可以通过网络或通过其他计算机可读介质的途径分发。因此,机器可读介质可以包括用于以机器(例如,计算机)可读的形式存储或传输信息的任何机制、但不限于、软盘、光盘、光盘、只读存储器(CD-ROM)、磁光盘、只读存储器(ROM)、随机存取存储器(RAM)、可擦除可编程只读存储器(EPROM)、电可擦除可编程只读存储器(EEPROM)、磁卡或光卡、闪存、或用于通过电、光、声或其他形式的传播信号(例如,载波、红外信号、数字信号等)通过因特网传输信息的有形的机器可读存储器。因此,机器可读介质包括适合于以机器(例如,计算机)可读的形式存储或传输电子指令或信息的任何类型的机器可读介质。
在附图中,以特定布置和/或顺序示出一些结构或方法特征。然而,应该理解,可以不需要这样的特定布置和/或排序。在一些实施例中,这些特征可以以不同于说明性附图中所示的方式和/或顺序来布置。另外,在特定图中包含结构或方法特征并不意味着暗示在所有实施例中都需要这样的特征,并且在一些实施例中,可以不包括这些特征或者可以与其他特征组合。
应当理解的是,虽然在这里可能使用了术语“第一”、“第二”等等来描述各个单元或是数据,但是这些单元或数据不应当受这些术语限制。使用这些术语仅仅是为了将一个特征与另一个特征进行区分。举例来说,在不背离示例性实施例的范围的情况下,第一特征可以被称为第二特征,并且类似地第二特征可以被称为第一特征。
应注意的是,在本说明书中,相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请的实施方式作进一步地详细描述。
如本文所使用的,术语“模块或单元”可以指或者包括专用集成电路(ASIC)、电子电路、执行一个或多个软件或固件程序的处理器(共享的、专用的或组)和/或存储器(共 享的、专用的或组)、组合逻辑电路、和/或提供所描述的功能的其他合适的组件,或者可以是专用集成电路(ASIC)、电子电路、执行一个或多个软件或固件程序的处理器(共享的、专用的或组)和/或存储器(共享的、专用的或组)、组合逻辑电路、和/或提供所描述的功能的其他合适的组件的一部分。
在现有技术中,移动电子设备的视频剪辑软件的显示界面较小,并且通常主要通过手指点触操作,从而导致PC端的较多操作直接移植至移动电子设备后,操作极端不方便。例如,通过拖动选择准确的视频剪切点的操作,以及剪切后如何移动视频片段至想要的位置的操作都是不便的。完全依赖于手指触控来编辑视频,体验并不理想。手指面积较大,很难精准的控制到某一个时间点或者某一帧。常常需要反复调节多次才能达到一次调节的目的。目前,有采用固定剪辑片段的时长的技术来规避调节剪辑点的体验问题,但是,因为固定时长,只能提供有限的选项,所以基本无法覆盖到使用场景,另外视频剪辑每个视频片段是跟视频内容相关的,固定时长也不能满足用户的需求。此外,现在的剪辑都是在整段视频上进行的,编辑和剪切多段视频的时候并不方便,很容易播放整段视频的时长,而且需要不同的打开暂停等操作,在手机上并不方便反复播放。
还有些现有技术在移动电子设备的视频编辑软件中提供自动剪辑功能,通过算法一键生成一段视频。从而避免通过手指点触等非常不友好的操作体验。在移动电子设备上,借助图像识别等算法,提出了一些自动剪辑和生成视频的技术。但是,因为人工干预的太少,算法现在还很难准确的剪切出用户完全满意的视频,并且视频剪切的多样性也不够。此外,用户无法控制最后生成的内容,也无法控制最后视频的时长,完全依赖于算法设置。现在的自动化算法没有根据语音内容来进行剪切,一般来说剪切后视频中的音频内容基本被破坏,无法使用。
本申请的技术方案希望解决移动终端上的视频剪切的上述问题。本申请的一个或多个实施方式提出了一种视频剪辑方法,使用户在移动终端上剪切视频更加方便。此外,用户可以快速的剪切多段视频片段,可以快速的选择多段视频片段,还可以快速的调节视频片段间的顺序。最后,本申请的技术方案解决了视频剪辑造成语音不完整的问题。
图1示出根据本申请实施方式的示例电子设备的模块示意图。该电子设备可以用于执行视频处理的方法。
电子设备100可以包括控制组件110,外部存储器接口120,内部存储器121,音频模块130,传感器模块140,和显示屏150等。其中,控制组件110可以包括处理器111。传感器模块140可以包括压力传感器140A和触摸传感器140B等。
可以理解的是,本发明实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器111可以包括一个或多个处理单元,例如:处理器111可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个 或多个处理器中。
GPU用于执行数学和几何计算,用于图形渲染。处理器111可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
处理器111中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器111中的存储器为高速缓冲存储器。该存储器可以保存处理器111刚用过或循环使用的指令或数据。如果处理器111需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器111的等待时间,因而提高了***的效率。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与控制组件110通信,实现数据存储功能。例如将数据库等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作***,至少一个功能所需的应用程序等。在本申请的一个或多个实施方式中,存储程序区可存储视频剪辑装置160,该视频剪辑装置160可以实施本申请的视频剪辑方法的各方面。视频剪辑装置160可以包括,指令获取模块161,分割模块162,选择模块163,排序模块164和合成模块165。其中,指令获取模块161可以用于获得用户对一个视频进行剪辑的指令,以及用户输入的其他指令。分割模块162可以用于响应指令,将视频分割成多个视频片段。分割模块162可以通过对用户载入的视频中的图像帧序列和音频分布分别进行分析,通过将整个视频的所有帧根据图像内容的变化,分成几个不同的场景,判断图像帧前后的图像序列的场景和/或主体的变化确定关键图像帧,以及通过对视频的音频的频率特征进行聚类,判断音频分布的频率分布的突变,确定关键音频节点,并在视频分割时从关键图像帧和关键音频节点中选择不会破坏语音完整性的节点作为分割点。选择模块163可以用于从多个视频片段中选出需要合成为新的视频的一个或多个视频片段。排序模块164可以用于将至少一个视频片段移动到用户指定的位置,从而对多个视频片段进行排序。合成模块165可以用于将多个视频片段合成为新的视频。在一些可能的实施方式中,可以通过选择模块163仅选择一个或多个视频片段进行合成,还可以选择多个视频片段后,对选择的视频片段通过排序模块164排序后,再进行视频合成,和/或仅移动一个或多个视频片段后进行全部视频片段的合成。可以理解,图1所示的视频剪辑装置160可以以软件方式实现,但是视频剪辑装置160及其中一个或多个模块还可以以硬件,软件或软件和硬件的组合实现。
内部存储器121的存储数据区可存储电子设备100使用过程中所创建的数据(比如视频时间线的切分点)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal  flash storage,UFS)等。处理器111通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,执行电子设备100的各种功能应用以及数据处理。
音频模块130用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块130还可以用于对音频信号编码和解码。在一些实施例中,音频模块130可以设置于处理器111中,或将音频模块130的部分功能模块设置于处理器111中。
压力传感器140A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器140A可以设置于显示屏150。压力传感器140A的种类很多,如电阻式压力传感器,电感式压力传感器,电容式压力传感器等。电容式压力传感器可以是包括至少两个具有导电材料的平行板。当有力作用于压力传感器140A,电极之间的电容改变。电子设备100根据电容的变化确定压力的强度。当有触摸操作作用于显示屏150,电子设备100根据压力传感器140A检测所述触摸操作强度。电子设备100也可以根据压力传感器140A的检测信号计算触摸的位置。在一些实施例中,作用于相同触摸位置,但不同触摸操作强度的触摸操作,可以对应不同的操作指令。例如:当有触摸操作强度小于第一压力阈值的触摸操作作用于短消息应用图标时,执行查看短消息的指令。当有触摸操作强度大于或等于第一压力阈值的触摸操作作用于短消息应用图标时,执行新建短消息的指令。
触摸传感器140B,也称“触控器件”。触摸传感器140B可以设置于显示屏150,由触摸传感器140B与显示屏150组成触摸屏,也称“触控屏”。触摸传感器140B用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏150提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器140B也可以设置于电子设备100的表面,与显示屏150所处的位置不同。
显示屏150用于显示图像,视频等。显示屏150包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备100可以包括1个或N个显示屏150,N为大于1的正整数。
电子设备100通过GPU,视频编解码器,显示屏150,NPU和/或应用处理器等实现视频播放和剪辑功能。GPU为图像处理的微处理器,连接显示屏150和应用处理器。
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
电子设备包括但不局限于,膝上型设备、手持PC、个人数字助理、蜂窝电话、便携式媒体播放器、可穿戴设备(例如,显示眼镜或护目镜,头戴式显示器(Head-Mounted Display,简称HMD),手表,头戴设备,臂带,珠宝等),虚拟现实(Virtual Reality,简称VR)和/或增强现实(Augment Reality,简称AR)设备,车载信息娱乐设备,流媒体客户端设备,电子书阅读设备,以及各种其他电子设备。一般地,能够包含本文中所 公开的处理器和/或其它执行逻辑的多个装置和电子设备一般都是合适的。
图2示出了根据一示例性实施方式的电子设备100的显示屏呈现的图形用户界面200的示例。用户可以在该图形用户界面200中对视频进行编辑。图形用户界面200包括,视频预览及播放区域210,视频编辑区域220,音频显示区域230,以及保存按键240。其中,视频预览及播放区域210可以用于播放在编辑的视频,以及显示视频编辑的效果。视频编辑区域220可以是视频编辑的主要操作区域,可以显示视频的时间线,例如,显示视频在时间维度上的多个图像等,可以理解,通常视频的时间线是左侧为视频起始的时刻,然后一直向右延伸到视频结束的时刻。在一个或多个实施方式中,视频编辑区域220还可以显示一个或多个用户剪辑的视频的片段。音频显示区域230可以用于显示与视频编辑区域220中的视频或视频的片段相对应的音频轨道。用户在视频编辑区域220完成视频编辑后,通过点按保存按键240生成和保存经编辑的新视频。可以理解,在本申请另一些实施方式中,图形用户界面200可以包括比图示更多或更少的区域或模块,或者组合某些区域或模块,或者拆分某些区域或模块,或者采用不同的布置。例如,在一些实施方式中,图形用户界面200还可以可选地或附加地包括工具区域250,工具区域250可以提供一些其他的功能操作,例如,加载新视频,保存视频,对选中的视频或视频片段进行分享,倒序等功能。用户通过选择分享功能可以将保存的视频分享给其他用户或者在社交媒体上分享。作为另一个示例,图形用户界面200可以省略音频显示区域230。
下面结合图形用户界面200对在电子设备100上执行的视频剪辑的方法进行描述。在一个或多个可选的实施方式中,视频剪辑的方法可以通过电子设备100中的视频剪辑装置160实施。
图3示出了根据本申请实施方式的视频处理方法300的流程。如图2所示,该方法至少包括,301:用户在图形用户界面200载入需要编辑的视频。视频载入后,在视频编辑区域220可以通过时间线的方式呈现该视频,并且在视频预览及播放区域210显示该视频的预览图,或者播放该视频。
302:获取用户的视频分割指令。电子设备100通过用户在视频编辑区域220中显示的视频上的触控操作,获取用户对视频的不同编辑指令,例如,通过指令获取模块161获取用户对视频的不同编辑指令。举例来说,用户的触控操作可以包括,当有触摸操作时间小于第一时间阈值的触摸操作作用于视频时,执行选中视频的指令。当有触摸操作时间大于或等于第一时间阈值的触摸操作作用于视频时,执行视频剪辑的指令。当有触摸操作持续作用于视频并且触摸位置连续改变时,执行拖拽视频随触摸位置移动的指令。
作为一个示例,如果用户需要裁剪视频长度或者内容,用户在视频编辑区域220可以长按视频,即,用户给出进行视频剪辑的指令。例如,分割模块162可以在303:响应于该指令,该视频可以自动分割成多个时长较短的视频片段(子视频)。对于视频的自动分割的各种实施方式,将后续参考图5-图8进一步说明。在可能的实施方式中,用户还可以对多个视频片段(子视频)中的一个视频片段再次进行长按,该视频片段还可以自动分割成多个时长更短的视频片段(孙视频)。
在一些实施方式中,可以根据视频的时长设置执行一次视频分割的所分割的视频片段的数量阈值,控制视频分割的颗粒度。如果认为分割的粒度不够细,可以对视频片段 再次分割,直到视频片段无法被自动分割方法处理为止。
分割后的多个视频片段可以显示在视频编辑区域220中,用户通过点击一个或多个视频片段,通过选择模块163可以304:根据用户的指令,从多个视频片段选择需要的该一个或多个视频片段。
在一些实施方式中,选中的视频片段可以在视频预览区自动播放,以实时显示视频内容,便于用户查看新选择的视频内容。
可选地或附加地,用户可以通过再次点击选中的视频,可以进入视频精细调节中,其中,精细调节操作方法可以如已有的一些视频操作方法,例如,通过拉动时间条确定每一帧的区域等。
在可能的实施方式中,用户还可以拖拽已选择的视频片段,来随意调整视频片段在时间线上的位置,实现视频片段的重新排序。例如,通过排序模块164可以在305:根据用户对选择的视频片段中的至少一个视频片段的移动指令,将至少一个视频片段移动到用户指定的位置。通过这种方式,当用户需要拼接多个选择的视频片段时,这些视频片段的拼接顺序可以按照这些视频片段当前在时间线上各自的位置来确定。
最后,如果选择的视频片段已经编辑完成,点按保存按键240,可以例如通过合成模块165在306:根据所选择的所有已选择的视频片段和用户指定的位置生成为一个新的视频。可以理解,在可选的示例中,图形用户界面还可以包括实现其他功能的按键,例如,分享按键等,用户通过点击分享按键可以将保存的视频分享给其他用户或者在社交媒体上分享。
为了便于理解上述用户的触控操作,下面结合附图对上述用户和图形用户界面200的交互过程进行说明。
图4示出了在图形用户界面200的视频编辑区域220的视频编辑的一些可能的用户操作。
当用户载入视频40后,在视频编辑区域220可以通过时间线的方式呈现该视频40,在图4中以矩形块示意视频40或视频片段42,矩形的长边示意性的表示视频40或视频片段42的时间线的长度。
用户需要剪辑视频40时,通过对视频40进行长按45触发视频剪辑,响应于视频剪辑的指令,视频40被自动分割成多个时长较短的视频片段42(1…n),并且这些视频片段42对应于视频的时间线顺序排列在视频编辑区域220中。在一些实施方式中,用户还可以对多个视频片段42(1…n)中的一个视频片段再次进行长按,例如,对视频片段42(i)进行长按45,该视频片段42(i)还可以自动分割成多个时长更短的视频片段42(i.1….i.j)。可以理解,在附图中,引用编号之后的括号内的字母和数字,例如“42(i.1)”,表示对具有该特定引用编号的元素的引用。文本中没有后续括号的引用编号,例如“42”,表示对带有该引用编号的元素的实施方式的总体引用。
作为一个示例,如图4所示,针对已分割的这些视频片段42,用户通过点击47一个或多个视频片段42,选择需要的一个或多个视频片段42。例如,用户分别点击47视频片段42(1)、视频片段42(i.1)和视频片段42(i.j),确定这3段视频片段是用户所需的。之后,用户可以点按保存按键240,可以一键将所有已选择的视频片段生成为一个新的视频。例如,视频片段42(1)、视频片段42(i.1)和视频片段42(i.j)通过本 领域的过渡拼接方法,按照各视频片段参照于时间线的当前的位置依次拼接这些视频片段,具体地,从视频片段42(1)开始,之后拼接视频片段42(i.1),最后拼接视频片段42(i.j),新生成的视频的起始时刻为视频片段42(1)的起始时刻,新的视频的终止时刻为视频片段42(i.j)的终止时刻。
作为另一个示例,用户还可以拖拽已选择的视频片段,来随意调整视频片段在时间线上的位置,实现视频片段的重新排序。如图4所示,用户分别点击47视频片段42(1)、视频片段42(i.1)和视频片段42(i.j),确定这3段视频片段是用户所需的。之后,用户对视频片段42(i.j)进行拖拽49,将视频片段42(i.j)移动到视频片段42(i.1)之前。在这种情况下,当用户需要拼接这些视频片段时,根据各视频片段当前的位置,拼接从视频片段42(1)开始,之后拼接视频片段42(i.j),最后拼接视频片段42(i.1),新生成的视频的起始时刻为视频片段42(1)的起始时刻,新的视频的终止时刻为视频片段42(i.1)的终止时刻。
根据本申请的各个实施方式,用户在移动终端上剪切视频更加方便。此外,用户可以快速的剪切多段视频片段,可以快速的选择多段视频片段,还可以快速的调节视频片段间的顺序。
图5示出了根据本申请实施方式的电子设备100实施的对视频进行分割的示例方法500的流程。示例方法500是对图3所示示例方法300中303部分的具体示例描述,对于上述示例方法300实施方式中未描述的内容,可以参见下述方法500实施方式;同样地,对于方法500实施方式中未描述的内容,可参见上述示例方法300实施方式。
如图5所示,方法500至少包括:在501:分别从视频的图像帧序列和音频分布中选出至少一个关键图像帧和至少一个关键音频节点。
在一些实施方式中,电子设备100对用户载入的视频中的图像帧序列和音频分布分别进行分析,一方面,针对图像进行关键帧和关键事件的分析和识别,从而确定可能的图像帧分割点,即关键图像帧。另一方面,针对音频分布进行分析,分析音频的频率分布,对音频内容中的语音、噪声等内容进行识别,从而确定可能的音频分割点,即关键音频节点。以下参考图6和图7分别对这两方面进行描述。
图6示出了选取关键图像帧的方法600的流程示意图。在一些实施方式中,针对视频的图像帧序列,首先,601:可以对图像帧序列中的相似帧进行聚类,从而将整个视频的所有图像帧根据图像帧的图像内容的变化,分成几个不同的场景,即,每个类内的图像帧是相似的,而类与类之间的图像帧是不相似的。
作为一个示例,首先,根据图像帧的图像内容,匹配诸如随机生成法、按时间均匀生成法等方法,比较图像内容,将图像内容大于阈值的图像帧作为种子帧。其次,生成图像帧的底层特征,通常可以采用SIFT(Scale-invariant feature transform,尺度不变特征变换),人体区域检测,运动区域检测等方法生成底层特征,之后,通过底层特征匹配计算帧与帧之间的图像内容关系,例如,对于诸如移动物体、人体,等不同的感兴趣区域可以设置不同的权重进行计算。最后,以种子帧作为分类的初始,使用Meanshift算法,K-means等算法对图像帧进行聚类。
在对图像帧聚类之后,还可以602:根据图像帧间内容计算帧的关键性值。通常,图像帧分类的边界一般来说是事件的起始点或者终结点。但是,视频的场景变化也是一 个连续的变化过程,所以需要在每个视频片段的边界仔细确定合适的分割时间点。确定分割时间点的几个关键因素至少包括以下方面:
第一方面,图像的模糊度,例如,场景显著变化可能伴随显著的镜头移动,这可能导致一系列模糊的图像帧。在这种情况下,需要根据前面计算的特征,计算模糊的图像帧的区分度。
第二方面,对移动物体的跟踪。在一些情况下,某些场景的变化,是因为前景的物体移动并且离开了视野。因此,一个视频分割点需要包含该物体整个运动过程,尽量不让离开的物体进入到分割后的下一个场景。
根据以上因素,对聚类后的各个分类的边界的帧间内容计算帧的关键性值,603:根据计算的关键性值,选取作为分割点的关键图像帧。作为一个示例,选择关键图像帧的标准可以包括,第一点,要选择的分割点应该保证上一个图像帧是相对清晰的,且该帧前与该帧后的图像内容是不相关的,即,要选择的图像帧是与上一图像帧在场景过渡阶段中关联性的计算值最小的图像帧。此外第二点,如果移动物体在整个过渡阶段都出现,则不选择关键图像帧;如果过渡阶段中该物体完全终止,则选取最后出现该物体的图像帧作为关键图像帧。如果第一点和第二点相冲突,优先使用第二点选择关键图像帧。
对于视频的音频部分而言,图7示出了选取关键音频节点的方法700的流程示意图。首先,701:检测视频的音频频率,并根据检测到的音频频率确定视频的多个音频频率分布。作为一个示例,对音频进行时频转换,例如通过快速傅里叶变换((Fast Fourier transform,FFT)、短时傅里叶变换(short-time Fourier transform,STFT)等方式,获得音频的频率,随后,通过本领域的音频频率特征检测方法,检测出音频的频率分布。之后,702:对多个音频频率分布进行聚类,以获得多个音频频率分布类别。例如,基于诸如SVM(Support Vector Machine,支持向量机)、K-means等聚类方法,对整段音频进行特征聚类分析,获得符合预定频率相似度的音频片段,以及这些音频段的起始点和终止点。而后,703:选择多个音频频率分布类别中的每两个音频频率分布类别的交点,作为至少一个关键音频节点,例如,这些起始点和终止点的之前的频率和之后的频率相比而言,通常发生了频率突变,即频率的变化超过某一特定阈值。这些点是关键音频节点,并且可以用于视频的分割点。
将聚类后得到的音频片段经过音频识别后,识别出每个音频片段的对应场景,例如,噪声场景,一个或多个主体的发声场景等。其中,噪声场景的噪声可以包括底噪声(无声),非人噪声,多人噪声等,其中,底噪声的一种情况可以包括说话人在说话过程中的停顿。主体的发声场景可以包括人类的单人声等(即,同一人频率占主要的音频能量)。
以下继续描述图5,在块502:按照时序逐个取出至少一个关键图像帧中的每一个和至少一个关键音频节点中的每一个。电子设备100例如从内部存储器121获取所有通过上述方法选出的关键图像帧和关键音频节点。电子设备100之后从这些关键图像帧和关键音频节点中确定最终的视频分割点。
确定视频分割点具体的示例方法参考图8描述,图8示出了根据本申请实施方式的视频的图像序列、音频的各自节点和分割视频片段的示意图。如图8所示,假设载入的视频的起始点记为S,终结点记为E,将所有中间的关键图像帧记为ICi,中间的关键音频节点记为VCi,ICi与ICi+1之间的图像帧序列记为IFi,在VCi和VCi+1之间的音 频信号组成的音频分布记为VFi,其中,IFi是视频的图像帧序列的子图像帧序列,VFi是视频的音频分布中的子音频分布。其中,起始点S作为图像序列和音频的起始点,ICi和VCi是需要判断是否作为用于将视频分割为视频片段(例如,子视频和孙视频等)的视频分割点的节点。
在另一些实施方式中,还可以判断当前节点与上一节点的时间间隔是否足够长,如果时间间隔较短,可以放弃当前节点的判断。例如,时间间隔可以设置为视频总时长的1/20等等,小于该时间间隔便放弃当前节点。
继续参考图5,接下来按照时间线的顺序,依次判断每个节点503:取出的是关键音频节点还是关键图像帧。如果是关键音频节点,例如VCi,则在504:判断该关键音频节点分割的之前和之后的子音频分布是否都是噪声,如果否,则到506a:保留该关键音频节点的采样时刻为视频的分割点;如果是,则到507a:丢弃该关键音频节点。例如,如果VCi分割的前后两段音频分布VFi-1和VFi都是噪声,则不保留VCi为视频的分割点,如果VCi分割的前后两段音频分布VFi-1和VFi其中至一是噪声,或者音频分布VFi-1和VFi都不是噪声,则保留VCi为视频的分割点。
之后,在508a:判断该关键音频节点是否为最后一个节点。如果该关键音频节点是取出的所有关键图像帧和关键音频节点中最后一个节点,则结束本方法;如果该关键音频节点不是最后一个节点,则返回502,对该关键音频节点的下一个关键图像帧或关键音频节点执行503-508。
作为一个示例,在图5中的块501中,电子设备100对音频聚类后,分类出子音频分布VFi-1和VFi和VCi,因此它们两者的频率的相似度不足,可以认为是不同的声音。在一种可能的情况下,当对子音频分布VFi-1和VFi分别识别后,识别出子音频分布VFi-1和VFi分别为两种不同的噪声,例如,子音频分布VFi-1是无声的底噪声,子音频分布VFi是诸如环境噪声的非人噪声,那么在这种情况下,并不需要将VCi作为视频的最终的一个分割点,由此,在最终分割时,可以将子音频分布VFi-1和VFi作为一段噪声分割。
在另一种可能的情况下,当对子音频分布VFi-1和VFi分别识别后,识别出子音频分布VFi-1和VFi中的一个是噪声,例如,子音频分布VFi-1是无声的底噪声,音频分布VFi是单人说话声,或者,子音频分布VFi-1是单人说话声,子音频分布VFi是非人噪声,那么可以将单人说话声和噪声分割,因此,可以将VCi作为视频的最终的一个分割点。
在在另一种可能的情况下,当对子音频分布VFi-1和VFi分别识别后,识别出子音频分布VFi-1和VFi都是非噪声,例如,子音频分布VFi-1和VFi分别为不同人的说话声,那么可以将不同人的说话声分割,因此,可以将VCi作为视频的最终的一个分割点。
在一种可能的场景中,假如视频内容为两个人的对谈,这两个人说话的音频的频率不同,同时,在他们的说话过程中可能会出现的自然的短暂的停顿,例如每句话结束时的停顿,或者其中一个人思考如何回应对方提出的问题等的较长的停顿,这些可能的停顿在音频表现上可以认为是底噪声。那么在这个视频中,每个说话人各自说话的部分可以被分割出来,并且对于每个人各自的说话部分,还可以根据停顿,将一句话或一段话完整的分割出来。在这种场景中,对语音的合理分割更为关键,需要避免在说话人在连 续说话时的音频进行分割。
采用上述方法可以避免对音频的不必要的分割,例如,可以防止对说话人在连续说话时的音频进行分割,使得分割的语音音频的表达不完整,或丢失部分语音。
继续参考图5中的块503,如果是关键图像帧,即,该节点为ICi,那么到505:判断覆盖该关键图像帧ICi所在的子音频分布是否为噪声分布,如果是,则到506b:保留该关键图像帧ICi的采样时刻为视频的分割点;如果否,则到507b:丢弃该关键图像帧ICi。之后,在508b:判断该关键图像帧是否为最后一个节点。如果该关键图像帧是取出的所有关键图像帧和关键音频节点中最后一个节点,则结束本方法;如果该关键图像帧不是最后一个节点,则返回502,对该关键图像帧的下一个关键图像帧或关键音频节点执行503-508。
作为一个示例,如图8所示,在时间线上,关键图像帧ICi对应的时间点是处于音频分布VFi中,那么如果子音频分布VFi是噪声,则可以将ICi作为视频的最终的一个分割点。因为ICi所在的音频不是说话人的语音音频,所以ICi作为分割点不会分割说话人的语音音频。此外,这样可以保证图像序列的有效的分割。相反地,如果音频分布VFi是人声的音频分布,则不保留该关键图像帧ICi为视频的分割点,因为关键图像帧ICi会分割说话人的语音音频,会造成分割的语音音频的表达不完整,或丢失部分语音。由此,采用上述方法可以确保人声音频的完整性。
在对所有的关键图像帧和关键音频节点进行上述判断后,就可以得到最终的视频分割点。例如,如图8所示,按照时间线的顺序,最终的视频分割点依次为VC1、ICi、VCi+1,那么根据这些视频分割点的采样时刻对视频进行分割,最终分割后的视频片段分别为81、82、83和84。
根据本申请的实施方式,提供了在移动终端上对视频进行快速剪辑的方法,能够应对无法精细调节像素级别的位置定位的场景。此外,本申请的实施方式还提供了一种视频自动分割方法,通过该方法可以获得视频主体的独立有意义的视频片段。
本申请的各方法实施方式均可以以软件、磁件、固件等方式实现。
可将程序代码应用于输入指令,以执行本文描述的各功能并生成输出信息。可以按已知方式将输出信息应用于一个或多个输出设备。为了本申请的目的,处理***包括具有诸如例如数字信号处理器(DSP)、微控制器、专用集成电路(ASIC)或微处理器之类的处理器的任何***。
程序代码可以用高级程序化语言或面向对象的编程语言来实现,以便与处理***通信。在需要时,也可用汇编语言或机器语言来实现程序代码。事实上,本文中描述的机制不限于任何特定编程语言的范围。在任一情形下,该语言可以是编译语言或解释语言。
至少一个实施例的一个或多个方面可以由存储在计算机可读存储介质上的表示性指令来实现,指令表示处理器中的各种逻辑,指令在被机器读取时使得该机器制作用于执行本文所述的技术的逻辑。被称为“IP核”的这些表示可以被存储在有形的计算机可读存储介质上,并被提供给多个客户或生产设施以加载到实际制造该逻辑或处理器的制造机器中。
在一些情况下,指令转换器可用来将指令从源指令集转换至目标指令集。例如,指令转换器可以变换(例如使用静态二进制变换、包括动态编译的动态二进制变换)、变形、 仿真或以其它方式将指令转换成将由核来处理的一个或多个其它指令。指令转换器可以用软件、硬件、固件、或其组合实现。指令转换器可以在处理器上、在处理器外、或者部分在处理器上且部分在处理器外。

Claims (18)

  1. A video processing method, characterized by comprising:
    obtaining an instruction of a user to segment a video; and
    in response to the instruction, segmenting the video into a plurality of sub-videos, wherein a duration period of each sub-video of the plurality of sub-videos is based at least in part on a sampling moment of one key image frame among at least one key image frame of the video, or on a sampling moment of one key audio node among at least one key audio node of the video,
    wherein, in an image frame sequence of the video, at least one of a change of image scene and a change of image subject exists between the sub-image-frame sequence appearing before the one key image frame and the sub-image-frame sequence appearing after the one key image frame, and
    wherein, in an audio distribution of the video, at least one of a change of speaker subject and a change of noise distribution exists between the sub-audio distribution appearing before the one key audio node and the sub-audio distribution appearing after the one key audio node.
  2. The video processing method according to claim 1, characterized in that the segmentation instruction comprises a long-press instruction of the user on the video.
  3. The video processing method according to any one of claims 1-2, characterized by further comprising:
    obtaining a selection instruction of the user for at least one sub-video of the plurality of sub-videos, and selecting the at least one sub-video from the plurality of sub-videos, wherein the selection instruction comprises a tap instruction of the user on the at least one sub-video.
  4. The video processing method according to any one of claims 1-3, characterized by further comprising:
    obtaining a move instruction of the user for one or more sub-videos of the selected at least one sub-video, and moving the one or more sub-videos to a position specified by the user, thereby sorting the selected at least one sub-video, wherein the move instruction comprises a slide instruction on the one or more sub-videos.
  5. The video processing method according to claim 1, characterized in that segmenting the video into a plurality of sub-videos in response to the instruction further comprises:
    selecting the at least one key image frame from the image frame sequence;
    selecting the at least one key audio node from the audio distribution;
    determining whether to keep the one key image frame among the at least one key image frame and the one key audio node among the at least one key audio node; and
    determining the time period based at least in part on at least one of the sampling moment of the one key image frame that is kept and the sampling moment of the one key audio node that is kept.
  6. The video processing method according to claim 5, characterized in that determining whether to keep the one key image frame among the at least one key image frame and the one key audio node among the at least one key audio node comprises:
    determining whether the sub-audio distribution before the one key audio node and the sub-audio distribution appearing after the one key audio node comprise the noise distribution;
    in a case where it is determined that one of the sub-audio distribution before the one key audio node and the sub-audio distribution appearing after the one key audio node comprises the noise distribution, or that neither the sub-audio distribution before the one key audio node nor the sub-audio distribution appearing after the one key audio node comprises the noise distribution, determining to keep the one key audio node; and
    in a case where it is determined that both the sub-audio distribution before the one key audio node and the sub-audio distribution appearing after the one key audio node comprise the noise distribution, determining to discard the one key audio node.
  7. The video processing method according to any one of claims 1-6, characterized in that, in a case where the at least one key audio node comprises a plurality of key audio nodes, the sub-audio distribution before the one key audio node comprises the sub-audio distribution between a start node of the video and the one key audio node, or the sub-audio distribution between a key audio node, among the plurality of key audio nodes, that is located before the one key audio node and the one key audio node.
  8. The video processing method according to any one of claims 1-7, characterized in that, in a case where the at least one key audio node comprises a plurality of key audio nodes, the sub-audio distribution after the one key audio node comprises the sub-audio distribution between the one key audio node and an end node of the video, or the sub-audio distribution between the one key audio node and a key audio node, among the plurality of key audio nodes, that is located after the one key audio node.
  9. The video processing method according to claim 5, characterized in that determining whether to keep the one key image frame among the at least one key image frame and the one key audio node among the at least one key audio node comprises:
    determining whether the sub-audio distribution comprises the noise distribution, wherein the sub-audio distribution comprises a sampling moment related to the one key image frame;
    in a case where it is determined that the sub-audio distribution comprises the noise distribution, determining to keep the one key image frame; and
    in a case where it is determined that the sub-audio distribution does not comprise the noise distribution, determining to discard the one key image frame.
  10. The video processing method according to any one of claims 1-9, characterized in that the noise distribution comprises at least one of: a silence distribution, a non-human-noise distribution, and a multi-person-noise distribution.
  11. The video processing method according to any one of claims 1-10, characterized in that the change of noise distribution comprises: the sub-audio distribution appearing before the one key audio node comprising the noise distribution and the sub-audio distribution appearing after the one key audio node comprising a non-noise distribution; or the sub-audio distribution appearing before the one key audio node comprising the non-noise distribution and the sub-audio distribution appearing after the one key audio node comprising the noise distribution.
  12. The video processing method according to any one of claims 1-11, characterized in that the change of noise distribution comprises the sub-audio distribution appearing before the one key audio node comprising at least one kind of noise distribution, and the sub-audio distribution appearing after the one key audio node comprising at least one other kind of noise distribution.
  13. The video processing method according to claim 5, characterized in that selecting the at least one key audio node from the audio distribution comprises:
    detecting audio frequencies of the video, and determining a plurality of audio frequency distributions of the video according to the detected audio frequencies, wherein each audio frequency distribution of the plurality of audio frequency distributions comprises a distribution of one and the same audio frequency;
    clustering the plurality of audio frequency distributions to obtain a plurality of audio-frequency-distribution categories, wherein each audio-frequency-distribution category of the plurality of audio-frequency-distribution categories comprises at least one audio frequency distribution of the plurality of audio frequency distributions; and
    selecting an intersection point of every two audio-frequency-distribution categories among the plurality of audio-frequency-distribution categories as the at least one key audio node.
  14. The video processing method according to claim 13, characterized in that the clustering comprises clustering the plurality of audio frequency distributions using a clustering algorithm, wherein the clustering algorithm comprises at least one of an SVM algorithm and a K-means algorithm.
  15. A video processing method, characterized by comprising:
    obtaining a first instruction of a user to segment a video, wherein the first instruction comprises a long-press instruction on the video; and
    in response to the first instruction, segmenting the video into a plurality of sub-videos, wherein each sub-video of the plurality of sub-videos comprises the images and audio associated with the time period of that sub-video.
  16. The method according to claim 15, characterized by further comprising: obtaining a second instruction of the user to segment one sub-video of the plurality of sub-videos, wherein the second instruction comprises a long-press instruction on the one sub-video; and
    in response to the second instruction, segmenting the one sub-video into a plurality of grandchild videos, wherein each grandchild video of the plurality of grandchild videos comprises the images and audio associated with the time period of that grandchild video.
  17. A computer-readable storage medium, characterized in that instructions are stored on the computer-readable storage medium which, when executed on a computer, cause the computer to perform the method according to any one of claims 1-16.
  18. An electronic device, characterized by comprising:
    a memory, configured to store instructions to be executed by one or more processors of the electronic device, and
    a processor, configured to execute the instructions in the memory, so as to perform the method according to any one of claims 1-16.
PCT/CN2021/070875 2020-02-13 2021-01-08 视频处理方法和视频处理的设备、存储介质 WO2021159896A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010090350.7A CN113259761B (zh) 2020-02-13 2020-02-13 视频处理方法和视频处理的设备、存储介质
CN202010090350.7 2020-02-13

Publications (1)

Publication Number Publication Date
WO2021159896A1 true WO2021159896A1 (zh) 2021-08-19

Family

ID=77219825

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070875 WO2021159896A1 (zh) 2020-02-13 2021-01-08 视频处理方法和视频处理的设备、存储介质

Country Status (2)

Country Link
CN (1) CN113259761B (zh)
WO (1) WO2021159896A1 (zh)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222159A (zh) * 2021-12-01 2022-03-22 北京奇艺世纪科技有限公司 一种视频场景变化点确定和视频片段生成方法及***
CN115086759A (zh) * 2022-05-13 2022-09-20 北京达佳互联信息技术有限公司 视频处理方法、装置、计算机设备及介质
CN115209218B (zh) * 2022-06-27 2024-06-18 联想(北京)有限公司 一种视频信息处理方法、电子设备及存储介质


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6404925B1 (en) * 1999-03-11 2002-06-11 Fuji Xerox Co., Ltd. Methods and apparatuses for segmenting an audio-visual recording using image similarity searching and audio speaker recognition
US6751354B2 (en) * 1999-03-11 2004-06-15 Fuji Xerox Co., Ltd Methods and apparatuses for video segmentation, classification, and retrieval using image class statistical models
US10134440B2 (en) * 2011-05-03 2018-11-20 Kodak Alaris Inc. Video summarization using audio and visual cues
CN110197135B (zh) * 2019-05-13 2021-01-08 北京邮电大学 一种基于多维分割的视频结构化方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1822645A (zh) * 2005-02-15 2006-08-23 乐金电子(中国)研究开发中心有限公司 可摘要提供活动影像的移动通信终端及其摘要提供方法
US20130279701A1 (en) * 2007-10-17 2013-10-24 International Business Machines Corporation Automatic announcer voice attenuation in a presentation of a broadcast event
CN106534951A (zh) * 2016-11-30 2017-03-22 北京小米移动软件有限公司 视频分割方法和装置
CN110121104A (zh) * 2018-02-06 2019-08-13 上海全土豆文化传播有限公司 视频剪辑方法及装置
CN110213670A (zh) * 2019-05-31 2019-09-06 北京奇艺世纪科技有限公司 视频处理方法、装置、电子设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100725A (zh) * 2022-08-23 2022-09-23 浙江大华技术股份有限公司 目标识别方法、目标识别装置以及计算机存储介质
CN115100725B (zh) * 2022-08-23 2022-11-22 浙江大华技术股份有限公司 目标识别方法、目标识别装置以及计算机存储介质

Also Published As

Publication number Publication date
CN113259761B (zh) 2022-08-26
CN113259761A (zh) 2021-08-13

Similar Documents

Publication Publication Date Title
WO2021159896A1 (zh) 视频处理方法和视频处理的设备、存储介质
Albanie et al. Emotion recognition in speech using cross-modal transfer in the wild
US20190095946A1 (en) Automatically analyzing media using a machine learning model trained on user engagement information
US20180374105A1 (en) Leveraging an intermediate machine learning analysis
Han et al. Strength modelling for real-worldautomatic continuous affect recognition from audiovisual signals
US11158206B2 (en) Assisting learners based on analytics of in-session cognition
US20110243452A1 (en) Electronic apparatus, image processing method, and program
CN115082602B (zh) 生成数字人的方法、模型的训练方法、装置、设备和介质
Iyer et al. Emotion based mood enhancing music recommendation
KR20190081243A (ko) 정규화된 표현력에 기초한 표정 인식 방법, 표정 인식 장치 및 표정 인식을 위한 학습 방법
CN110072047B (zh) 图像形变的控制方法、装置和硬件装置
Kabani et al. Emotion based music player
Ringeval et al. Emotion recognition in the wild: Incorporating voice and lip activity in multimodal decision-level fusion
US10755087B2 (en) Automated image capture based on emotion detection
Zhang et al. Seeing like a human: Asynchronous learning with dynamic progressive refinement for person re-identification
Elshaer et al. Transfer learning from sound representations for anger detection in speech
Liang et al. Computational modeling of human multimodal language: The mosei dataset and interpretable dynamic fusion
CN112840313A (zh) 电子设备及其控制方法
Salah et al. Video-based emotion recognition in the wild
CN113923521B (zh) 一种视频的脚本化方法
CN108628454B (zh) 基于虚拟人的视觉交互方法及***
CN117795551A (zh) 用于自动捕捉和处理用户图像的方法和***
Milchevski et al. Multimodal affective analysis combining regularized linear regression and boosted regression trees
WO2022194084A1 (zh) 视频播放方法、终端设备、装置、***及存储介质
US20230419663A1 (en) Systems and Methods for Video Genre Classification

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21754372; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 21754372; Country of ref document: EP; Kind code of ref document: A1)