CN115604539A - Video segmentation method, electronic device, and storage medium - Google Patents

Video segmentation method, electronic device, and storage medium

Info

Publication number
CN115604539A
CN115604539A (application CN202110722106.2A)
Authority
CN
China
Prior art keywords
audio
video
segment
time
preset
Prior art date
Legal status
Pending
Application number
CN202110722106.2A
Other languages
Chinese (zh)
Inventor
马天泽
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110722106.2A
Publication of CN115604539A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present disclosure relates to a video segmentation method, an electronic device, and a storage medium, in the technical field of video processing. The method includes: extracting the audio from a video to be processed to obtain a first audio file; performing a human-voice and background-sound separation operation on the first audio file to obtain a second audio file including the human voice; segmenting the second audio file to obtain at least one audio segment, wherein each audio segment includes continuous human voice; and segmenting the video to be processed based on the critical moments of the audio segments to obtain at least one video segment. The beneficial effect of the present disclosure is that highlight videos can be accurately extracted from the video to be processed based on the audio characteristics of the video, and in particular, highlight segments of movie and TV series videos can be accurately extracted from such videos.

Description

Video segmentation method, electronic device, and storage medium
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video segmentation method, an electronic device, and a storage medium.
Background
Short video refers to short-form video, generally video distributed on new internet media with a duration of no more than 5 minutes. With the rise of the mobile internet, short videos are increasingly favored by users. How to convert long-form video content, such as movies and TV series, into short-form video has therefore become a research focus in the industry.
In the related art, a short video is generally generated by detecting motion scenes in a video and cutting them out as highlights, or by detecting behaviors or actions in the video. However, due to the particularity of movie and TV series content, for example, scene changes are often small, the existing methods cannot accurately cut highlight videos out of such videos.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a video segmentation method, an electronic device, and a storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a video segmentation method, including:
extracting audio in a video to be processed to obtain a first audio file;
performing a human-voice and background-sound separation operation on the first audio file to obtain a second audio file comprising the human voice;
segmenting the second audio file to obtain at least one audio segment, wherein the audio segment comprises continuous human voice;
and segmenting the video to be processed based on the critical moments of the audio segments to obtain at least one video segment.
In some embodiments, the segmenting the second audio file to obtain at least one audio segment includes:
detecting each audio frame in the second audio file and determining the audio category to which the audio frame belongs, wherein the audio categories comprise human voice audio and non-human voice audio;
and segmenting the second audio file based on the boundaries between audio frames belonging to the human voice audio and audio frames belonging to the non-human voice audio in the second audio file to obtain the audio segments.
In some embodiments, the detecting the audio frame and determining the audio category to which the audio frame belongs includes:
judging whether the signal energy of the audio frame is greater than or equal to a preset threshold;
and under the condition that the signal energy of the audio frame is greater than or equal to the preset threshold, determining the audio category of the audio frame based on a voice activity detection algorithm.
In some embodiments, the method further comprises:
under the condition that the audio category to which the audio frame belongs is determined to be the human voice audio based on the voice activity detection algorithm, calculating a first conditional probability that the audio frame belongs to noise and a second conditional probability that the audio frame belongs to human voice;
and when the first conditional probability is larger than the second conditional probability, determining the audio category of the audio frame as the non-human voice audio.
In some embodiments, after segmenting the second audio file to obtain at least one audio segment, the method further comprises:
for each of the audio segments, performing the steps of:
determining a start time, an end time and an audio duration of the audio segment;
under the condition that the audio duration of the audio segment is less than or equal to a first preset audio duration, if the time interval between the end time of the audio segment and the start time of the next audio segment is less than or equal to a preset time interval, merging the audio segment and the next audio segment to obtain a new audio segment;
the segmenting the video to be processed based on the critical moments of the audio segments to obtain at least one video segment includes:
and segmenting the video to be processed based on the critical moments of the new audio segment to obtain the video segments.
In some embodiments, the method further comprises:
under the condition that the time interval between the end time of the audio segment and the start time of the next audio segment is greater than the preset time interval, if the audio duration of the audio segment is greater than or equal to a second preset audio duration, retaining the audio segment;
if the audio duration of the audio segment is less than the second preset audio duration, discarding the audio segment;
wherein the second preset audio duration is less than the first preset audio duration.
In some embodiments, the method further comprises:
and under the condition that the audio duration of the merged audio segment is greater than the first preset audio duration, discarding the merged audio segment.
In some embodiments, the video to be processed is obtained by:
acquiring an original video;
under the condition that the video duration of the original video is greater than a preset duration threshold, intercepting the video between a first time point and a second time point in the original video to obtain the video to be processed, wherein the second time point is later than the first time point;
and under the condition that the video duration of the original video is less than or equal to the preset duration threshold, intercepting the video between the first time point and the end time point of the original video to obtain the video to be processed.
According to a second aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions stored in the memory to implement the steps of the method of the first aspect.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the video segmentation method provided by the first aspect of the present disclosure.
The technical scheme provided by the embodiments of the present disclosure can have the following beneficial effects: by segmenting the second audio file including the human voice, the second audio file can be divided, based on the human voice in the video to be processed, into a plurality of audio segments each including continuous human voice, and the video segments corresponding to these audio segments are highlight segments. The video to be processed is then segmented based on the critical moments of the audio segments, so that the highlight segments are cut out of the video to be processed and highlight short videos are generated. This video segmentation method can accurately extract highlight videos from the video to be processed based on the audio characteristics of the video.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of video segmentation in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating video segmentation in accordance with an exemplary embodiment;
FIG. 3 is a flowchart illustrating segmenting a second audio file according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating segmenting a second audio file in accordance with an exemplary embodiment;
FIG. 5 is a flow diagram illustrating detection of an audio class to which an audio frame belongs in accordance with an exemplary embodiment;
FIG. 6 is a flowchart illustrating the determination of an audio category of an audio frame according to another exemplary embodiment;
FIG. 7 is a flowchart illustrating identifying an audio category to which an audio frame belongs, according to another example embodiment;
FIG. 8 is a flowchart illustrating merging audio segments according to another exemplary embodiment;
FIG. 9 is a schematic diagram illustrating merging audio segments according to another exemplary embodiment;
FIG. 10 is a flowchart illustrating the acquisition of a video to be processed according to an exemplary embodiment;
FIG. 11 is a block diagram illustrating a video segmentation apparatus in accordance with an exemplary embodiment;
FIG. 12 is a block diagram illustrating an electronic device 800 in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
Fig. 1 is a flow diagram illustrating a video segmentation method in accordance with an exemplary embodiment. As shown in fig. 1, the video segmentation method may be applied to an electronic device, which may include a terminal device, a server, etc., and the method includes the following steps.
In step 110, the audio in the video to be processed is extracted to obtain a first audio file.
Here, the video to be processed refers to a long video that needs to be edited, such as a movie or a TV series. The first audio file is the audio stream separated from the video to be processed, and its time axis corresponds one-to-one to that of the video to be processed.
It should be understood that the audio may be extracted from the video to be processed by a dedicated algorithm or tool, such as an audio extraction utility; the process of obtaining the first audio file is not described in detail here.
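As an illustration only (the disclosure does not name a specific tool), step 110 could be sketched with ffmpeg invoked from Python; the file names and sampling parameters below are assumptions:

    import subprocess

    def extract_audio(video_path, audio_path):
        # Drop the video stream (-vn) and save the audio track as 16 kHz,
        # 16-bit mono PCM, a common input format for later voice detection.
        # Extracting the whole track keeps the time axes of the video and
        # the first audio file in one-to-one correspondence.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-vn",
             "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", audio_path],
            check=True,
        )

    extract_audio("movie.mp4", "first_audio.wav")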
In step 120, a person sound and background sound separation operation is performed on the first audio file to obtain a second audio file including the person sound.
Here, the second audio file refers to the audio file obtained by removing the background sound from the first audio file, and its time-axis length is consistent with that of the first audio file. The human-voice and background-sound separation may be performed on the first audio file by means of track separation, yielding a second audio file that includes the human voice.
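The disclosure does not specify a separation tool; as a sketch, a publicly available source-separation model such as Spleeter's two-stem model could play this role (the paths below are assumptions):

    from spleeter.separator import Separator

    # The '2stems' model splits a track into vocals and accompaniment;
    # the vocals stem serves as the "second audio file" in this sketch.
    separator = Separator("spleeter:2stems")
    separator.separate_to_file("first_audio.wav", "separated/")
    # Produces separated/first_audio/vocals.wav and .../accompaniment.wav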
In step 130, the second audio file is segmented to obtain at least one audio segment, wherein the audio segment includes continuous human voice.
Here, after obtaining the second audio file including the human voice, the second audio file is segmented based on the continuous human voice in the second audio file, and at least one audio segment including the continuous human voice is obtained. For example, the second audio file is divided into a plurality of segments based on a boundary between the human voice audio and the non-human voice audio, and the segments including the human voice are retained as audio segments.
In step 140, the video to be processed is segmented based on the critical time of the audio segment, so as to obtain at least one video segment.
Here, the critical time of the audio segment refers to a start time point and an end time point of the audio segment, and since the audio segment is consistent with the time axis of the video to be processed, after the audio segment is obtained, the video to be processed may be segmented based on the critical time of each audio segment, so as to obtain the video segment corresponding to the audio segment. It should be understood that the video segment is a highlight video segment extracted from the video to be processed.
The above-described embodiment will be described in detail with reference to fig. 2.
Fig. 2 is a schematic diagram illustrating video segmentation in accordance with an exemplary embodiment. As shown in fig. 2, audio is extracted from the video to be processed to obtain a first audio file, and the human voice and background sound are then separated from the first audio file to obtain a second audio file including the human voice. The time-axis lengths of the video to be processed, the first audio file, and the second audio file are consistent, and the time points of the audio frames and the video frames correspond one-to-one. The second audio file is then segmented, resulting in audio segments that include continuous human voice, as shown by the grey squares in fig. 2. The video to be processed is then segmented based on the critical moments of the audio segments, so as to obtain the corresponding video segments, shown as the black squares in fig. 2. It should be understood that the final segmentation results in the video segments indicated by the black squares in fig. 2.
Therefore, by segmenting the second audio file including the human voice, the second audio file can be divided, based on the human voice in the video to be processed, into a plurality of audio segments each including continuous human voice, and the video segments corresponding to these audio segments are highlight segments. The video to be processed is then segmented based on the critical moments of the audio segments, so that highlight segments are clipped from the video to be processed and highlight short videos are generated. This video segmentation method can accurately extract highlight videos from the video to be processed based on the audio characteristics of the video, and in particular can accurately extract highlight segments from movie and TV series videos. Experiments show that the extracted video segments match manually screened highlight segments.
FIG. 3 is a flowchart illustrating segmenting a second audio file according to an exemplary embodiment. As shown in fig. 3, in some realizable embodiments, segmenting the second audio file in step 130 to obtain at least one audio segment may include:
in step 131, for each frame of audio frame in the second audio file, the audio frame is detected, and an audio category to which the audio frame belongs is determined, where the audio category includes human audio and non-human audio.
Here, an audio frame refers to a frame of audio obtained according to the encoding format of an audio file, and the length of the audio frame is different for different encoding formats. For example, for an AMR (Adaptive Multi-Rate) format, each 20ms of Audio is an Audio frame, and for an mp3 (Moving Picture Experts Group Audio Layer III) format, the number of Audio frames is determined by a file size and a frame length, the length of each frame may not be fixed or may be fixed, which is determined by a code Rate, each Audio frame is divided into a frame header and a data entity, the frame header records information of the mp3, such as a code Rate, a sampling Rate, and a version, and each frame is independent of each other. The audio category includes human audio and non-human audio, where the human audio refers to a voice signal emitted by a person in the second audio file, such as a speech dialogue in a movie and television show, and the non-human audio refers to background noise in the second audio file, such as background music in the movie and television show.
And detecting each audio frame in the second audio file to determine the audio type of each frame. The audio category of the audio frame may be determined by Voice Activity Detection (VAD) algorithm. Wherein the voice activity detection algorithm is operative to detect whether the audio frame belongs to human speech.
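The energy gating, six sub-bands, and aggressiveness mode described later in this disclosure match the design of the WebRTC VAD; assuming such a detector is acceptable, the per-frame classification of step 131 might be sketched with the py-webrtcvad binding (the sample rate and frame length are assumptions):

    import webrtcvad

    SAMPLE_RATE = 16000                                # webrtcvad accepts 8/16/32/48 kHz
    FRAME_MS = 20                                      # and 10/20/30 ms frames
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono PCM

    vad = webrtcvad.Vad(2)   # aggressiveness mode, 0 (lenient) to 3 (strict)

    def classify_frames(pcm):
        # One flag per frame: True = human voice audio, False = non-human voice audio.
        flags = []
        for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
            flags.append(vad.is_speech(pcm[off:off + FRAME_BYTES], SAMPLE_RATE))
        return flags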
In step 132, the second audio file is segmented based on a boundary between the audio frame belonging to the human voice audio and the audio frame belonging to the non-human voice audio in the second audio file, so as to obtain the audio segment.
Here, a boundary refers to the dividing point between an audio frame belonging to the human voice audio and an audio frame belonging to the non-human voice audio. For example, when the first through fifth audio frames belong to the human voice audio and the sixth and seventh audio frames belong to the non-human voice audio, a segmentation boundary is placed between the fifth and sixth frames, yielding an audio segment that includes the first through fifth audio frames.
FIG. 4 is a schematic diagram illustrating segmenting a second audio file according to an exemplary embodiment. As shown in fig. 4, the second audio file includes seven audio frames; the first, second, and seventh frames are non-human voice audio, shown as the white squares in fig. 4, and the third through sixth frames are human voice audio, shown as the grey squares in fig. 4. The boundary between the second and third frames and the boundary between the sixth and seventh frames are the boundaries between frames belonging to human voice audio and frames belonging to non-human voice audio, based on which an audio segment comprising the third through sixth audio frames is obtained.
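A minimal sketch of step 132, grouping the per-frame flags from the previous sketch into runs and keeping only the voiced runs as audio segments:

    from itertools import groupby

    def frames_to_segments(flags, frame_ms=20):
        # Consecutive equally-labelled frames form a run; each voiced run
        # becomes one audio segment, returned as (start, end) in seconds.
        segments, t = [], 0
        for is_voice, run in groupby(flags):
            n = len(list(run))
            if is_voice:
                segments.append((t * frame_ms / 1000.0, (t + n) * frame_ms / 1000.0))
            t += n
        return segments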
FIG. 5 is a flow diagram illustrating detection of an audio class to which an audio frame belongs according to an example embodiment. As shown in fig. 5, in some implementation embodiments, in step 131, detecting the audio frame and determining an audio category to which the audio frame belongs may include:
in step 1311, it is determined whether the signal energy of the audio frame is greater than or equal to a preset threshold.
Here, if the signal energy of the audio frame is greater than or equal to the preset threshold, a human voice signal may be present in the frame. If the signal energy is below the preset threshold, no human voice signal is present, and the audio category of the frame can be directly determined to be non-human voice audio.
It should be understood that the preset threshold value may be determined according to a signal energy difference between the human voice signal and the non-human voice signal.
In step 1312, in the case that the signal energy of the audio frame is greater than or equal to the preset threshold, the audio category to which the audio frame belongs is determined based on a voice activity detection algorithm.
Here, when the signal energy of the audio frame is greater than or equal to the preset threshold, the audio category of the frame is further determined by the voice activity detection algorithm. The specific principle can be as follows. According to the spectral range of the human voice, the spectrum of the audio frame is divided into six sub-bands, e.g., 80 Hz-250 Hz, 250 Hz-500 Hz, 500 Hz-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz. The energy of each of the six sub-bands is then calculated, and the probability density function of a Gaussian model is used to obtain a log-likelihood ratio. The log-likelihood ratio is computed both globally and locally: the global value is a weighted sum over the six sub-bands, and the local values are computed per sub-band. In the voice decision, each sub-band is judged first; if no sub-band is judged to be speech, the global value is further judged. If either a local or the global decision indicates a human voice signal, the audio frame belongs to the human voice audio.
In this way, the start and end time points between human voice audio and non-human voice audio in the second audio file can be accurately detected based on the Gaussian model through the voice activity detection algorithm, so that the second audio file can be accurately segmented, and the voice activity detection algorithm also greatly reduces the computational cost.
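In notation, the decision quantities described above can be sketched as follows; the exact mixture structure is an assumption, since the disclosure only states that the probability density function of a Gaussian model is used. With E_k denoting the energy of sub-band k:

    \Lambda_k = \log \frac{p(E_k \mid \mathrm{speech})}{p(E_k \mid \mathrm{noise})}, \qquad
    p(E_k \mid c) = \sum_{m} w_{k,m}^{(c)} \, \mathcal{N}\!\left(E_k;\, \mu_{k,m}^{(c)}, \, \sigma_{k,m}^{(c)\,2}\right), \qquad
    \Lambda_{\mathrm{global}} = \sum_{k=1}^{6} \alpha_k \Lambda_k

A frame is classified as human voice audio if any local \Lambda_k exceeds its decision threshold or, failing that, if the global \Lambda_{\mathrm{global}} does.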
Fig. 6 is a flowchart illustrating determining an audio genre of an audio frame according to another exemplary embodiment. As shown in fig. 6, the method may further include:
in step 1313, in a case where it is determined that the audio category to which the audio frame belongs is the human voice audio based on the voice activation detection algorithm, a first conditional probability that the audio frame belongs to noise and a second conditional probability that the audio frame belongs to human voice are calculated.
Here, since there may be a portion of the background sound that cannot be completely separated during the process of separating the human voice and the background sound in the first audio file, the voice activity detection algorithm may recognize the audio frame with the background sound as the human voice audio. Therefore, in the case where it is determined based on the voice activity detection algorithm that the audio category to which the audio frame belongs is human voice audio, the first conditional probability that the audio frame determined by the voice activity detection algorithm as human voice audio belongs to noise and the second conditional probability that it belongs to human voice are further calculated. Wherein the first conditional probability and the second conditional probability may be calculated based on a gaussian model.
In step 1314, when the first conditional probability is greater than the second conditional probability, determining the audio category of the audio frame as the non-human voice audio.
Here, when the first conditional probability is greater than the second conditional probability, it is explained that the audio frame determined as the human voice audio by the voice activity detection algorithm is non-human voice audio, such as the human voice in the background sound, but not the human voice in the speech-to-speech dialogue in the movie. Therefore, the audio category of the audio frame is determined as the non-human voice audio. When the conditional probability is smaller than the second conditional probability, it indicates that the audio frame determined as the human voice audio by the voice activity detection algorithm is the human voice audio, such as the human voice in the speech-to-speech dialogue in the movie and television show, and therefore, the audio category of the audio frame is determined as the human voice audio.
Therefore, by further calculating the first conditional probability that the audio frame determined as the audio of the human voice by the voice activation detection algorithm belongs to the noise and the second conditional probability that the audio frame belongs to the human voice, the audio type to which the audio frame in the second audio file belongs can be accurately determined, so that the video to be processed can be accurately segmented.
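A minimal sketch of this secondary check, simplifying each class to a single Gaussian over the frame energy; the model parameters below are assumptions, and the disclosure itself uses a Gaussian mixture whose parameters are updated online:

    import math

    def gaussian_pdf(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    def recheck(energy, noise=(2.0, 1.0), voice=(8.0, 2.0)):
        # noise/voice are assumed (mean, variance) pairs per class.
        p1 = gaussian_pdf(energy, *noise)   # first conditional probability (noise)
        p2 = gaussian_pdf(energy, *voice)   # second conditional probability (human voice)
        return "non-human voice audio" if p1 > p2 else "human voice audio"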
The above embodiment will be described in detail with reference to fig. 7.
The present disclosure may determine the audio category to which each audio frame in the second audio file belongs based on a GMM (Gaussian mixture model). Fig. 7 is a flowchart illustrating identifying the audio category to which an audio frame belongs according to another exemplary embodiment. As shown in fig. 7, determining the audio category based on the GMM may include the following steps.
First, a value of the aggressiveness mode is determined according to the frame length.
Here, the aggressiveness mode of the GMM is set according to the frame length of the audio frames, and the second audio file is then taken as the input of the GMM.
Next, it is judged whether the signal energy is greater than or equal to the preset threshold.
The GMM judges whether the signal energy of each audio frame is greater than or equal to the preset threshold. When it is, the speech likelihood probability is calculated and voice activity detection is performed; when it is below the preset threshold, the audio frame is smoothed.
The log-likelihood ratios for the six sub-bands and the sum of the log-likelihood ratios for the six sub-bands are calculated based on a two-dimensional gaussian model.
Next, it is judged whether the log-likelihood ratio is greater than a decision threshold.
Here, the details of the voice activity detection algorithm have been described in the above embodiments and are not repeated. When the log-likelihood ratio is greater than the decision threshold, the audio frame is determined to be human voice audio. When the log-likelihood ratio is smaller than the decision threshold, or after the frame has been determined to be human voice, the first conditional probability that the audio frame belongs to noise and the second conditional probability that it belongs to human voice are calculated.
These two conditional probabilities can be used to distinguish whether the audio frame belongs to the human voice or to noise.
The parameters of the GMM are then updated based on the first conditional probability and the second conditional probability, so that the GMM can identify human voice and noise more accurately.
After updating the Gaussian model parameters, it is judged whether the global means of the noise and human voice models are smaller than a threshold; when they are, the noise and human voice model parameters are separated according to their weights.
FIG. 8 is a flowchart illustrating merging audio segments according to another exemplary embodiment. As shown in fig. 8, after step 130, the method further comprises, for each of the audio segments, performing the steps of:
in step 1301, the start time, end time, and audio duration of the audio segment are determined.
Here, the start time of an audio segment refers to the start time point of the first frame audio frame of the audio segment, the end time of the audio segment refers to the end time point of the last frame audio frame of the audio segment, and the audio duration of the audio segment refers to the duration between the first frame audio frame and the last frame audio frame of the audio segment.
It should be noted that the start time and the end time of the audio segment both refer to the time point when the audio segment is mapped in the video to be processed.
In step 1302, under the condition that the audio duration of the audio segment is less than or equal to a first preset audio duration, if the time interval between the end time of the audio segment and the start time of the next audio segment is less than or equal to a preset time interval, the audio segment and the next audio segment are merged to obtain a new audio segment.
Here, when the audio duration of an audio segment is less than or equal to a first preset audio duration, it is further determined whether a time interval between the end time of the audio segment and the start time of a next audio segment is less than or equal to a preset time interval, and if the time interval is less than or equal to the preset time interval, the audio segment and the next audio segment are merged to obtain a new audio segment.
It should be understood that the first preset audio duration may be set according to the actual situation, e.g., 5 minutes, and likewise the preset time interval, e.g., 5 seconds. A time interval less than or equal to the preset time interval indicates that the current audio segment and the next audio segment belong to the same plot in the movie or TV series video, with continuity between them.
Fig. 9 is a schematic diagram illustrating merging audio segments according to another exemplary embodiment. As shown in fig. 9, there are five audio segments, namely the first through fifth audio segments from left to right. The time interval between the end time of the first segment and the start time of the second is 4 s, between the second and third 12 s, between the third and fourth 3 s, and between the fourth and fifth 10 s; the preset time interval is 5 s. Accordingly, the first audio segment is merged with the second, and the third audio segment is merged with the fourth. The new audio segments are: the segment obtained by merging the first and second audio segments, the segment obtained by merging the third and fourth audio segments, and the fifth audio segment.
In step 1401, the video to be processed is segmented based on the start time and the end time of the new audio segment, and the video segment is obtained.
Here, the start time and end time of the new audio segment are its critical moments. If the new audio segment is a merged audio segment, the video to be processed is segmented based on the start time and end time of the merged segment; that is, the video between those two time points is clipped out. The resulting video segment is a highlight video corresponding to the video to be processed.
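For illustration only (the disclosure does not name a cutting tool), step 1401 could be sketched with ffmpeg; the segment times below are assumed placeholders standing in for the output of the merging step:

    import subprocess

    # Assumed output of the merging step above: (start, end) pairs in seconds.
    merged_segments = [(63.0, 171.5), (300.2, 410.0)]

    def cut_segment(video_path, start_s, end_s, out_path):
        # Clip the to-be-processed video between the critical moments of a
        # (possibly merged) audio segment. Stream copy (-c copy) avoids
        # re-encoding but cuts at keyframes; re-encode for frame accuracy.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path,
             "-ss", str(start_s), "-to", str(end_s),
             "-c", "copy", out_path],
            check=True,
        )

    for i, (start, end) in enumerate(merged_segments):
        cut_segment("movie.mp4", start, end, "highlight_%d.mp4" % i)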
In some implementations, the method further includes:
under the condition that the time interval between the end time of the audio segment and the start time of the next audio segment is greater than the preset time interval, if the audio duration of the audio segment is greater than or equal to a second preset audio duration, the audio segment is retained;
if the audio duration of the audio segment is less than the second preset audio duration, the audio segment is discarded;
wherein the second preset audio duration is less than the first preset audio duration.
When the time interval between the end time of an audio segment and the start time of the next segment is greater than the preset time interval, for example greater than 5 seconds, it is further judged whether the audio duration of the segment is greater than or equal to the second preset audio duration, which is less than the first preset audio duration. If it is, the segment is retained; if not, the segment is discarded. It should be understood that a segment whose duration is greater than or equal to the second preset audio duration can be regarded as a target audio segment, i.e., a highlight that needs to be cut out of the movie or TV series video.
It should be noted that the second preset audio duration may be set according to the actual situation, for example according to the length of a short video; since short videos are generally between 30 seconds and 5 minutes long, the second preset audio duration may be set to 1 minute.
In some implementations, the method further includes:
and under the condition that the audio duration of the merged audio segment is greater than the first preset audio duration, discarding the merged audio segment.
Here, the audio duration of the merged audio segment refers to the time interval between its start time and end time. When this duration is greater than the first preset audio duration, for example greater than 5 minutes, the merged segment falls outside the duration range of a short video, or does not belong to the highlight segments of the movie or TV series video, and the merged audio segment is discarded.
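Pulling the rules of steps 1301-1302 and the two discard conditions together, a minimal sketch; the threshold values are the examples given above, not parameters fixed by the disclosure:

    def merge_and_filter(segments, max_len=300.0, min_len=60.0, max_gap=5.0):
        # segments: time-ordered (start, end) pairs in seconds.
        # max_len: first preset audio duration (5 min in the example above),
        # min_len: second preset audio duration (1 min),
        # max_gap: preset time interval (5 s).
        if not segments:
            return []
        merged = []
        cur_start, cur_end = segments[0]
        for start, end in segments[1:]:
            if cur_end - cur_start <= max_len and start - cur_end <= max_gap:
                cur_end = end                     # merge with the next segment
            else:
                merged.append((cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((cur_start, cur_end))
        # Retain segments no shorter than min_len; discard merged segments
        # that have grown longer than max_len.
        return [(s, e) for s, e in merged if min_len <= e - s <= max_len]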
Fig. 10 is a flow diagram illustrating the acquisition of a pending video according to an exemplary embodiment. As shown in fig. 10, in some implementations, the video to be processed can be obtained by:
in step 101, an original video is acquired.
Here, the original video may be a movie or TV video, such as a TV series episode or a film.
In step 102, it is determined whether the video duration of the original video is greater than a preset duration threshold.
Here, the video duration refers to the total duration of the movie or TV video, and the preset duration threshold may be set according to the typical duration of such videos. For example, Chinese movie and TV videos generally run between 40 minutes and 2 hours, so the preset duration threshold may be set to 1 hour.
In step 103, in a case that the video duration of the original video is greater than a preset duration threshold, capturing a video between a first time point and a second time point in the original video to obtain the to-be-processed video, wherein the second time point is later than the first time point.
Here, when the video duration of the original video is greater than the preset duration threshold, for example greater than 1 hour, the video between a first time point and a second time point in the original video is captured, and the captured video is taken as the video to be processed, where the second time point is later than the first time point.
It should be understood that the first and second time points can be determined according to the duration of the original video. For example, for a movie video longer than 1 hour, the first time point may be set to the 5-minute mark and the second time point to the 1-hour mark, so that the captured video to be processed is the video in the interval from 5 minutes to 1 hour.
It is worth noting that an original video longer than 1 hour may yield multiple videos to be processed through multiple captures. For example, a 2-hour original video can be divided into two 1-hour originals, from which two 55-minute videos to be processed are captured, reducing the amount of computation for each video segmentation.
In step 104, in a case that the video duration of the original video is less than or equal to the preset duration threshold, intercepting a video between the first time point and an end time point of the original video in the original video to obtain the to-be-processed video.
Here, when the video duration of the original video is less than or equal to the preset duration threshold, for example less than or equal to 1 hour, the video between the first time point and the end time point of the original video is captured, and the captured video is taken as the video to be processed. For example, if the video duration of the original video is 45 minutes, the captured video to be processed is the video in the interval from 5 minutes to 45 minutes.
In this way, the opening and ending credits of most movie and TV series videos can be accurately removed, reducing the amount of computation.
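A minimal sketch of how the to-be-processed window of FIG. 10 might be computed, using the example values above (5-minute head trim, 1-hour threshold), which are illustrations rather than fixed parameters:

    def trim_window(duration_s,
                    head_s=300.0,         # first time point: the 5-minute mark
                    threshold_s=3600.0):  # preset duration threshold: 1 hour
        # Returns (start, end) of the video to be processed, in seconds.
        if duration_s > threshold_s:
            return head_s, threshold_s    # second time point: the 1-hour mark
        return head_s, duration_s         # up to the end of the original video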
Fig. 11 is a block diagram illustrating a video segmentation apparatus according to an example embodiment. Referring to fig. 11, the apparatus includes an extraction module 121, a separation module 122, an audio segmentation module 123, and a video segmentation module 124.
The extracting module 121 is configured to extract an audio in a video to be processed, and obtain a first audio file;
the separation module 122 is configured to perform an operation of separating human voice and background sound on the first audio file, so as to obtain a second audio file including human voice;
the audio segmentation module 123 is configured to segment the second audio file to obtain at least one audio segment, wherein the audio segment includes a continuous human voice;
the video segmentation module 124 is configured to segment the video to be processed based on the critical time of the audio segment to obtain at least one video segment.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video segmentation method provided by the present disclosure.
Fig. 12 is a block diagram illustrating an electronic device 800 in accordance with an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 12, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the video segmentation method described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power components 806 provide power to the various components of the electronic device 800. Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described video segmentation method.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the video segmentation method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the video segmentation method described above when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for video segmentation, comprising:
extracting audio in a video to be processed to obtain a first audio file;
performing a human-voice and background-sound separation operation on the first audio file to obtain a second audio file comprising the human voice;
segmenting the second audio file to obtain at least one audio segment, wherein the audio segment comprises continuous human voice;
and segmenting the video to be processed based on the critical moments of the audio segments to obtain at least one video segment.
2. The video segmentation method of claim 1, wherein the segmenting the second audio file to obtain at least one audio segment comprises:
detecting each audio frame in the second audio file and determining the audio category to which the audio frame belongs, wherein the audio categories comprise human voice audio and non-human voice audio;
and segmenting the second audio file based on the boundaries between audio frames belonging to the human voice audio and audio frames belonging to the non-human voice audio in the second audio file to obtain the audio segments.
3. The method of claim 2, wherein the detecting the audio frame and determining the audio category to which the audio frame belongs comprises:
judging whether the signal energy of the audio frame is greater than or equal to a preset threshold;
and under the condition that the signal energy of the audio frame is greater than or equal to the preset threshold, determining the audio category of the audio frame based on a voice activity detection algorithm.
4. The video segmentation method of claim 3, wherein the method further comprises:
under the condition that the audio category to which the audio frame belongs is determined to be the human voice audio based on the voice activity detection algorithm, calculating a first conditional probability that the audio frame belongs to noise and a second conditional probability that the audio frame belongs to human voice;
and when the first conditional probability is greater than the second conditional probability, determining the audio category of the audio frame as the non-human voice audio.
5. The video segmentation method according to any one of claims 1 to 4, wherein after segmenting the second audio file to obtain at least one audio segment, the method further comprises:
for each of the audio segments, performing the steps of:
determining a start time, an end time and an audio duration of the audio segment;
under the condition that the audio duration of the audio segment is less than or equal to a first preset audio duration, if the time interval between the end time of the audio segment and the start time of the next audio segment is less than or equal to a preset time interval, merging the audio segment and the next audio segment to obtain a new audio segment;
the segmenting the video to be processed based on the critical moments of the audio segments to obtain at least one video segment includes:
and segmenting the video to be processed based on the start time and the end time of the new audio segment to obtain the video segments.
6. The video segmentation method of claim 5, wherein the method further comprises:
under the condition that the time interval between the end time of the audio segment and the start time of the next audio segment is greater than the preset time interval, if the audio duration of the audio segment is greater than or equal to a second preset audio duration, retaining the audio segment;
if the audio duration of the audio segment is less than the second preset audio duration, discarding the audio segment;
wherein the second preset audio duration is less than the first preset audio duration.
7. The video segmentation method of claim 5, wherein the method further comprises:
and under the condition that the audio duration of the merged audio segment is greater than the first preset audio duration, discarding the merged audio segment.
8. The video segmentation method according to claim 1, wherein the video to be processed is obtained by:
acquiring an original video;
under the condition that the video duration of the original video is greater than a preset duration threshold, intercepting the video between a first time point and a second time point in the original video to obtain the video to be processed, wherein the second time point is later than the first time point;
and under the condition that the video duration of the original video is less than or equal to the preset duration threshold, intercepting the video between the first time point and the end time point of the original video to obtain the video to be processed.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute instructions stored in the memory to implement the steps of the method of any one of claims 1 to 8.
10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 8.
CN202110722106.2A (priority date 2021-06-28, filing date 2021-06-28): Video segmentation method, electronic device, and storage medium. Status: Pending; publication: CN115604539A.

Priority Applications (1)

Application number: CN202110722106.2A; Publication: CN115604539A; Title: Video segmentation method, electronic device, and storage medium

Applications Claiming Priority (1)

Application number: CN202110722106.2A; Publication: CN115604539A; Title: Video segmentation method, electronic device, and storage medium

Publications (1)

Publication number: CN115604539A; Publication date: 2023-01-13

Family

ID=84840971

Family Applications (1)

Application number: CN202110722106.2A; Status: Pending; Publication: CN115604539A; Title: Video segmentation method, electronic device, and storage medium

Country Status (1)

CN: CN115604539A


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination