CN108307229A - Method and device for processing audio-video data - Google Patents

Method and device for processing audio-video data

Info

Publication number
CN108307229A
CN108307229A (application CN201810107188.8A; granted as CN108307229B)
Authority
CN
China
Prior art keywords
video
content
audio
subobject
feature information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810107188.8A
Other languages
Chinese (zh)
Other versions
CN108307229B (en)
Inventor
徐常亮
李尉冉
傅丕毅
张云远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Wisdom Cloud Technology Co Ltd
Original Assignee
Xinhua Wisdom Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Wisdom Cloud Technology Co Ltd
Priority to CN201810107188.8A
Publication of CN108307229A
Application granted
Publication of CN108307229B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a method and device for processing audio-video data. The scheme first segments an audio-video data object into multiple sub-objects, then extracts video feature information about the video content of each sub-object and audio feature information about its audio content, and determines a content label for each sub-object from these features. The content labels identify the specific content each sub-object contributes to the audio-video data object, and the associations between labels indicate the relationships between the parts, so the audio-video content of the object can be used effectively and video and audio material can be scheduled in a unified way.

Description

Method and device for processing audio-video data
Technical field
This application relates to the field of information technology, and in particular to a method and device for processing audio-video data.
Background
With the development of smart devices and audio-video technology, audio-video data objects that contain both audio content and video content, such as films and TV series, are generated and spread at a rapidly increasing pace. However, these objects generally exist in isolation, and there is no unified way or channel to identify and use the content they contain. Current techniques mainly identify video or audio through video/audio fingerprints matched against corresponding video/audio libraries; it is difficult to determine what content an audio-video data object specifically contains and the associations between its parts, so the audio-video content of the object cannot be used effectively.
Summary of the application
The purpose of this application is to provide a method and device for processing audio-video data, so as to solve the problem in the prior art that it is difficult to determine what content an audio-video data object specifically contains and the associations between its parts.
To achieve the above object, this application provides a method for processing audio-video data, the method comprising:
segmenting an audio-video data object into a plurality of sub-objects;
extracting video feature information about the video content of each sub-object and audio feature information about its audio content; and
determining a content label for each sub-object according to the video feature information and audio feature information.
In another aspect, this application further provides a device for processing audio-video data, the device comprising:
a segmentation module, configured to segment an audio-video data object into a plurality of sub-objects;
a feature extraction module, configured to extract video feature information about the video content of each sub-object and audio feature information about its audio content; and
a category matching module, configured to determine a content label for each sub-object according to the video feature information and audio feature information.
In addition, this application further provides a device for processing audio-video data, wherein the device comprises:
a processor; and
one or more machine-readable media storing machine-readable instructions which, when executed by the processor, cause the device to carry out the processing method described above.
In the processing scheme provided by this application, an audio-video data object is first segmented into multiple sub-objects; video feature information about the video content of each sub-object and audio feature information about its audio content are then extracted; and a content label is determined for each sub-object from these features. The content labels identify the specific content each sub-object contributes to the audio-video data object, and the associations between labels indicate the relationships between the parts, so the audio-video content of the object can be used effectively and video and audio material can be scheduled in a unified way.
Description of the drawings
Other features, objects and advantages of the application will become more apparent from the following detailed description of non-restrictive embodiments, read with reference to the accompanying drawings:
Fig. 1 shows a flow chart of a method for processing audio-video data provided by an embodiment of the application;
Fig. 2 shows a schematic diagram of the overall flow when an audio-video data object is processed with the method provided by an embodiment of the application;
Fig. 3 shows a schematic structural diagram of a device for processing audio-video data provided by an embodiment of the application;
Fig. 4 shows a schematic structural diagram of another device for processing audio-video data provided by an embodiment of the application.
In the drawings, the same or similar reference numerals denote the same or similar components.
Detailed description of embodiments
The application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of this application, the terminals and the devices of the service network each include one or more processors (CPUs), an input/output interface, a network interface and memory.
The memory may include computer-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or disk storage or other magnetic storage devices, and any other non-transmission medium that can be used to store information accessible by a computing device.
An embodiment of the application provides a method for processing audio-video data. The method can determine the specific content contained in each sub-object of an audio-video data object, so that the audio-video content of the object can be used effectively and video and audio material can be scheduled in a unified way. The method may be executed by a user device, by a network device, by a device formed by integrating a user device and a network device over a network, or by an application program running on such devices. User devices include, but are not limited to, terminal devices of all kinds such as computers, mobile phones and tablets; network devices include, but are not limited to, implementations such as a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud consists of a large number of hosts or network servers based on cloud computing, a form of distributed computing in which a virtual computer is made up of a group of loosely coupled computer sets.
Fig. 1 shows a method for processing audio-video data provided by an embodiment of the application. The method comprises the following steps:
Step S101: segment an audio-video data object into multiple sub-objects. In the embodiments of this application, an audio-video data object is a file or data stream containing audio and video data; concretely it may be a film, a TV series, and so on. A sub-object is a part of the content of an audio-video data object: a 120-minute film, for example, can be divided evenly by duration into multiple segments, each segment being one sub-object.
In some embodiments of the application, the audio-video data object can be segmented by clustering spatio-temporal slices: according to the video content of the object, spatio-temporal slice clustering is performed on the object, and the multiple sub-objects are determined from the clustering result. A spatio-temporal slice is the image formed, in temporal order, by the strip of pixels at the same position in the successive frames of a video image sequence. Since pictures with similar content are visually similar to a certain degree, segmenting the object by spatio-temporal slice clustering ensures that the audio-video data within each sub-object belongs to similar content.
For example, suppose the pictures in a piece of video comprise three parts: the first shows two people talking in an indoor scene, the second shows garden scenery in an outdoor scene, and the third shows an explosion in an outdoor scene. Since these three parts look very different, spatio-temporal slice clustering can accurately divide the video into three parts; the video frames in each part form one clustering result, and the corresponding video and audio form one sub-object.
In real scenes the actual pictures can be more complicated, and clustering based on spatio-temporal slices may produce errors. For instance, in the first part (two people talking indoors), the movement of a person may change part of the picture considerably, so that the first part is split into two clustering results; or the pictures of the second and third parts might be merged into one clustering result. Therefore, when determining the sub-objects from the clustering result, the clustering result can be dynamically adjusted according to the similarity between clusters: by setting a dynamic threshold, the similarity threshold used in clustering can be adjusted on the fly, merging preliminary clustering results or splitting them further, so that the final clustering result is more accurate.
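The segmentation described above can be sketched in code. The following toy example is not part of the patent: each spatio-temporal slice is reduced to one hypothetical per-frame feature vector (standing in for a real pixel strip), consecutive frames are grouped when their slices stay similar, and neighbouring groups are merged when their mean features are still close, a simple stand-in for the dynamic threshold adjustment.

```python
# Hypothetical sketch of spatio-temporal slice clustering. Each frame is
# represented by a small feature vector (a stand-in for a pixel strip);
# thresholds are illustrative, not from the patent.

def slice_distance(a, b):
    """Euclidean distance between two per-frame slice features."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def segment_by_slices(slices, split_thresh=10.0, merge_thresh=5.0):
    """Group consecutive frames into sub-objects, then merge adjacent
    groups whose mean features remain similar (dynamic adjustment)."""
    if not slices:
        return []
    # initial split: start a new segment whenever the slice jumps
    segments = [[0]]
    for i in range(1, len(slices)):
        if slice_distance(slices[i], slices[i - 1]) > split_thresh:
            segments.append([])
        segments[-1].append(i)

    def mean(seg):
        dims = len(slices[0])
        return [sum(slices[i][d] for i in seg) / len(seg) for d in range(dims)]

    # merge neighbouring segments that are still too similar
    merged = [segments[0]]
    for seg in segments[1:]:
        if slice_distance(mean(merged[-1]), mean(seg)) < merge_thresh:
            merged[-1].extend(seg)
        else:
            merged.append(seg)
    return merged

# toy "film": indoor dialogue (dark), garden scenery (bright), explosion (reddish)
frames = [[10, 10, 10]] * 5 + [[200, 220, 200]] * 5 + [[250, 60, 40]] * 5
parts = segment_by_slices(frames)
print(len(parts))  # three sub-objects for the three scenes
```

A real implementation would cluster actual pixel strips across frames; the merge step here plays the role of the dynamic threshold described above.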
Step S102: extract video feature information about the video content of each sub-object and audio feature information about its audio content.
The video part is processed on the basis of the video content of each sub-object: after a film is divided into multiple segments, feature extraction on the video content of each segment yields its feature information. In some embodiments of the application, key frames are first extracted from the video content of the sub-object; the key frames are then processed to obtain their video feature information, which serves as the video feature information about the video content of the sub-object.
A key frame is a frame at a key action in the motion or change of the images; it reflects what the video image sequence actually expresses. For a video about an explosion, for example, the key frames may be the frame marking the cause of the explosion (such as the moment of an impact), the frame where the flame appears, the frame where the flame is largest, the frame where the flame disappears, and so on. Since the key frames already reflect the actual meaning of the video content well, using their video feature information as the video feature information of the sub-object reduces the processing workload and increases processing speed.
The video feature information may be image features such as texture, colour, shape or spatial relationships. In real scenes, one or more image features suited to the current scene can be selected as the video feature information according to need, to improve processing accuracy. The extracted video feature information may be recorded in the form of a set of multi-dimensional vectors.
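A minimal sketch of the key-frame idea, not taken from the patent: frame features are toy vectors standing in for texture/colour descriptors, and a frame becomes a key frame when it differs enough from the previous key frame, which naturally captures moments such as a flame appearing, peaking and fading.

```python
# Hypothetical key-frame selection. Feature vectors and the threshold
# are illustrative stand-ins for real image descriptors.

def extract_key_frames(features, thresh=50.0):
    """Return indices of key frames: the first frame, plus every frame
    whose feature distance to the last key frame exceeds `thresh`."""
    if not features:
        return []
    keys = [0]
    for i in range(1, len(features)):
        ref = features[keys[-1]]
        dist = sum((a - b) ** 2 for a, b in zip(features[i], ref)) ** 0.5
        if dist > thresh:
            keys.append(i)
    return keys

# explosion sub-object: calm, flame appears, flame peaks, flame fades
feats = [[5, 5], [6, 5], [120, 80], [122, 81], [250, 160], [40, 20]]
print(extract_key_frames(feats))  # [0, 2, 4, 5]
```

Only the selected frames would then go through full feature extraction, which is what reduces the processing workload.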
The audio part is processed on the basis of the audio content of each sub-object. For a film, after it is divided into multiple segments, feature extraction on the audio content of each segment yields its feature information. In a typical audio-video data object the audio content comprises multiple types, for example voices, sound effects, ambient sound and background music. Taking the video of two people talking indoors as an example, the corresponding audio content may contain the voices of the two people, their footsteps as they walk about, the sound of a vehicle passing outside, background music, and so on; these contents correspond to different waveforms in different frequency bands. Thus, in some embodiments of the application, when extracting audio features, waveform recognition can be carried out in different bands to extract sets of audio of different types from the audio content of the sub-object, such as a voice/sound-effect set, an ambient-sound set or a background-music set. The audio feature information of each of these sets is then extracted separately and serves as the audio feature information about the audio content of the sub-object. The extracted audio feature information may be recorded in the form of a set of multi-dimensional vectors.
In real scenes, when the audio content of a sub-object is processed, the audio content can first be separated out of the sub-object. Moreover, to improve accuracy during audio feature extraction, noise reduction can be applied to the audio content of the sub-object before waveform recognition is carried out in the different bands.
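As a rough illustration of band-wise analysis (not the patent's method), the sketch below splits a mono signal into coarse frequency bands with a naive DFT and reports per-band energy as a toy audio feature; a real system would use proper filter banks and waveform recognition per band.

```python
# Hypothetical band-energy extraction with a naive O(n^2) DFT, as a
# stand-in for recognising waveform types (speech, ambient sound,
# background music) in different bands. For sketching only.
import math

def band_energies(samples, rate, bands):
    """Return the spectral energy of `samples` in each (lo_hz, hi_hz) band."""
    n = len(samples)
    energies = []
    for lo, hi in bands:
        e = 0.0
        for k in range(int(lo * n / rate), int(hi * n / rate)):
            re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            e += re * re + im * im
        energies.append(e)
    return energies

rate = 800
t = [i / rate for i in range(400)]
# 50 Hz "background hum" plus a quieter 200 Hz "speech-like" tone
signal = [math.sin(2 * math.pi * 50 * x) + 0.5 * math.sin(2 * math.pi * 200 * x)
          for x in t]
low, high = band_energies(signal, rate, [(20, 120), (150, 300)])
print(low > high)  # the louder hum dominates its band
```

The two per-band energies form a toy two-dimensional feature vector of the kind the scheme records as multi-dimensional vector sets.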
Step S103: determine the content label of each sub-object according to the video feature information and audio feature information. A content label is information indicating what video content a sub-object actually contains; it can describe the video content from various angles according to the user's needs, such as the content shown, the scene it takes place in, or the corresponding emotion.
In some embodiments of the application, content labels are identified by deep learning. Before the audio-video data is processed, a deep-learning model can be built and trained with audio content and video content whose content labels have already been marked, so that the model can identify the content labels of sub-objects. For example, if the scheme provided by an embodiment of the application is to recognise whether a segment of a film is about an explosion, then all kinds of videos and audios about explosions can be supplied as the training set; the training set contains the video feature information of these videos and the audio feature information of these audios, with the content label marked as "explosion". Provided there are enough training samples, the deep-learning model can take unlabelled video feature information or audio feature information as input, decide whether its content label should be "explosion", and thereby determine the content of the film segment.
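The supervised labelling step can be illustrated without an actual deep network. In the sketch below, which is my own stand-in and not the patent's model, a tiny nearest-centroid classifier is trained on labelled feature vectors (imagined as concatenated video and audio features) and then labels an unseen sub-object.

```python
# Hypothetical stand-in for the deep-learning model: a nearest-centroid
# classifier over labelled (video+audio) feature vectors. The feature
# values and labels are invented for illustration.

def train_centroids(samples):
    """samples: list of (feature_vector, label) -> {label: centroid}."""
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_label.items()
    }

def predict_label(centroids, vec):
    """Return the label whose centroid is closest to `vec`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist(centroids[lab], vec))

training = [
    ([0.9, 0.8, 0.1], "explosion"),   # bright flashes, loud broadband audio
    ([0.8, 0.9, 0.2], "explosion"),
    ([0.1, 0.2, 0.9], "dialogue"),    # static scene, speech-band audio
    ([0.2, 0.1, 0.8], "dialogue"),
]
model = train_centroids(training)
print(predict_label(model, [0.85, 0.75, 0.15]))  # explosion
```

With enough labelled samples a real model replaces the centroids, but the interface is the same: features in, content label out.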
In another embodiment of the application, after the content labels of the sub-objects have been determined, the sub-objects of the audio-video data object can be sorted according to their content labels to produce classified sets. For a film, for example, all segments about explosions can be placed in a set of explosion segments, and all fight segments can likewise be placed in a separate set.
In real scenes, the sub-objects can be sorted on the basis of external input or preset classification conditions. For example, keywords entered by a user can be obtained, matching content labels chosen according to the keywords, and a suitable content set produced. Taking a film as an example, to generate its trailer the scheme provided by the embodiments can divide the film into multiple segments and generate the content label of each segment. The user can then enter keywords according to actual needs to choose the segments the trailer requires: a user who wants a trailer with a tender style can choose the segments whose content labels match that style as the material for the trailer, forming a set of segments; similarly, a user who wants a trailer with more fighting can choose the segments with the corresponding content labels.
Separate labels can also be set for the audio content and the video content, dividing the labels into video content labels and audio content labels that correspond one to one and are associated with the sub-objects obtained by segmentation. Sorting by content label can then be done on video alone, on audio alone, or on both in combination, producing the sets the user needs: according to the video content labels and/or audio content labels of the sub-objects, the video content and/or audio content of the sub-objects in the audio-video data object can be sorted to obtain video content sets and/or audio content sets.
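The sorting and keyword selection just described can be sketched as follows; the clip ids and labels are invented for illustration and do not come from the patent.

```python
# Hypothetical classification of labelled sub-objects into sets, plus
# keyword-based selection of trailer material.

def classify(sub_objects):
    """sub_objects: list of (clip_id, content_label) -> {label: [ids]}."""
    sets_ = {}
    for clip_id, label in sub_objects:
        sets_.setdefault(label, []).append(clip_id)
    return sets_

def select_by_keyword(classified, keyword):
    """Return all clip ids whose content label contains the keyword."""
    return [cid for label, ids in classified.items()
            if keyword in label for cid in ids]

clips = [("c1", "explosion"), ("c2", "tender dialogue"),
         ("c3", "fight"), ("c4", "tender farewell")]
sets_ = classify(clips)
print(select_by_keyword(sets_, "tender"))  # ['c2', 'c4']
```

The same two functions could run over video content labels, audio content labels, or both, yielding the video content sets and audio content sets described above.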
Fig. 2 shows a schematic diagram of the overall flow when an audio-video data object is processed with the method provided by an embodiment of the application. The overall flow comprises the following steps:
S201: first segment the object on the basis of the video content, dividing it into multiple sub-objects.
S202: perform video feature extraction on the segmented video content to obtain the video feature information.
S203: at the same time, separate the audio from the video to obtain the audio content corresponding to the segmented video.
S204: apply noise reduction to the audio content to eliminate noise.
S205: recognise waveforms in different bands and separate out different types of audio, for example voices and sound effects.
S206: perform audio feature extraction on the different types of audio to obtain the audio feature information.
S207: feed the video feature information and audio feature information into the deep-learning model for processing.
S208: identify the content labels according to the result of the deep learning, and sort the sub-objects into multiple video content sets and audio content sets.
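The S201 to S208 flow can be expressed as an end-to-end skeleton. Every stage below is a deliberately trivial stub of my own, standing in for the techniques described above; the point is the order of operations, not the internals.

```python
# Hypothetical skeleton of the Fig. 2 flow (S201-S208) with stub stages.

def process(av_object):
    sub_objects = split_by_video(av_object)                    # S201
    results = []
    for sub in sub_objects:
        v_feat = extract_video_features(sub["video"])          # S202
        audio = separate_audio(sub)                            # S203
        audio = denoise(audio)                                 # S204
        streams = split_bands(audio)                           # S205
        a_feat = [extract_audio_features(s) for s in streams]  # S206
        label = deep_model(v_feat, a_feat)                     # S207
        results.append((sub["id"], label))
    return classify(results)                                   # S208

# Stubs, just enough to make the skeleton run:
def split_by_video(obj): return [{"id": i, "video": v, "audio": a}
                                 for i, (v, a) in enumerate(obj)]
def extract_video_features(v): return sum(v)
def separate_audio(sub): return sub["audio"]
def denoise(a): return [x for x in a if abs(x) > 0.1]
def split_bands(a): return [a[::2], a[1::2]]
def extract_audio_features(s): return sum(s)
def deep_model(v, a): return "explosion" if v > 10 else "dialogue"
def classify(results):
    sets_ = {}
    for cid, label in results:
        sets_.setdefault(label, []).append(cid)
    return sets_

film = [([9, 9], [0.5, 0.05, 0.4]), ([1, 2], [0.2, 0.3, 0.02])]
print(process(film))  # {'explosion': [0], 'dialogue': [1]}
```

Swapping any stub for a real implementation (slice clustering, filter banks, a trained model) leaves the pipeline shape unchanged.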
Based on the same inventive concept, an embodiment of the application further provides a device for processing audio-video data. The method corresponding to the device is the method of the preceding embodiment, and it solves the problem on a similar principle.
An embodiment of the application provides a device for processing audio-video data. The device can determine the specific content contained in each sub-object of an audio-video data object, so that the audio-video content of the object can be used effectively and video and audio material can be scheduled in a unified way. The device may be implemented as a user device, a network device, a device formed by integrating a user device and a network device over a network, or an application program running on such devices. User devices include, but are not limited to, terminal devices of all kinds such as computers, mobile phones and tablets; network devices include, but are not limited to, implementations such as a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud consists of a large number of hosts or network servers based on cloud computing, a form of distributed computing in which a virtual computer is made up of a group of loosely coupled computer sets.
Fig. 3 shows a device for processing audio-video data provided by an embodiment of the application. The device comprises a segmentation module 310, a feature extraction module 320 and a category matching module 330. The segmentation module 310 is configured to segment an audio-video data object into multiple sub-objects. In the embodiments of this application, an audio-video data object is a file or data stream containing audio and video data; concretely it may be a film, a TV series, and so on. A sub-object is a part of the content of an audio-video data object: a 120-minute film, for example, can be divided evenly by duration into multiple segments, each segment being one sub-object.
In some embodiments of the application, the segmentation module 310 can segment the audio-video data object by clustering spatio-temporal slices: according to the video content of the object, spatio-temporal slice clustering is performed on the object, and the multiple sub-objects are determined from the clustering result. A spatio-temporal slice is the image formed, in temporal order, by the strip of pixels at the same position in the successive frames of a video image sequence. Since pictures with similar content are visually similar to a certain degree, segmenting the object by spatio-temporal slice clustering ensures that the audio-video data within each sub-object belongs to similar content.
For example, suppose the pictures in a piece of video comprise three parts: the first shows two people talking in an indoor scene, the second shows garden scenery in an outdoor scene, and the third shows an explosion in an outdoor scene. Since these three parts look very different, spatio-temporal slice clustering can accurately divide the video into three parts; the video frames in each part form one clustering result, and the corresponding video and audio form one sub-object.
In real scenes the actual pictures can be more complicated, and clustering based on spatio-temporal slices may produce errors. For instance, in the first part (two people talking indoors), the movement of a person may change part of the picture considerably, so that the first part is split into two clustering results; or the pictures of the second and third parts might be merged into one clustering result. Therefore, when determining the sub-objects from the clustering result, the clustering result can be dynamically adjusted according to the similarity between clusters: by setting a dynamic threshold, the similarity threshold used in clustering can be adjusted on the fly, merging preliminary clustering results or splitting them further, so that the final clustering result is more accurate.
The feature extraction module 320 is configured to extract video feature information about the video content of each sub-object and audio feature information about its audio content. Since both video and audio are processed, the feature extraction module may comprise a video feature extraction submodule and an audio feature extraction submodule.
The video part is processed on the basis of the video content of each sub-object: after a film is divided into multiple segments, feature extraction on the video content of each segment yields its feature information. In some embodiments of the application, key frames are first extracted from the video content of the sub-object; the key frames are then processed to obtain their video feature information, which serves as the video feature information about the video content of the sub-object.
A key frame is a frame at a key action in the motion or change of the images; it reflects what the video image sequence actually expresses. For a video about an explosion, for example, the key frames may be the frame marking the cause of the explosion (such as the moment of an impact), the frame where the flame appears, the frame where the flame is largest, the frame where the flame disappears, and so on. Since the key frames already reflect the actual meaning of the video content well, using their video feature information as the video feature information of the sub-object reduces the processing workload and increases processing speed.
The video feature information may be image features such as texture, colour, shape or spatial relationships. In real scenes, one or more image features suited to the current scene can be selected as the video feature information according to need, to improve processing accuracy. The extracted video feature information may be recorded in the form of a set of multi-dimensional vectors.
The audio part is processed on the basis of the audio content of each sub-object. For a film, after it is divided into multiple segments, feature extraction on the audio content of each segment yields its feature information. In a typical audio-video data object the audio content comprises multiple types, for example voices, sound effects, ambient sound and background music. Taking the video of two people talking indoors as an example, the corresponding audio content may contain the voices of the two people, their footsteps as they walk about, the sound of a vehicle passing outside, background music, and so on; these contents correspond to different waveforms in different frequency bands. Thus, in some embodiments of the application, when extracting audio features, waveform recognition can be carried out in different bands to extract sets of audio of different types from the audio content of the sub-object, such as a voice/sound-effect set, an ambient-sound set or a background-music set. The audio feature information of each of these sets is then extracted separately and serves as the audio feature information about the audio content of the sub-object. The extracted audio feature information may be recorded in the form of a set of multi-dimensional vectors.
In an actual scenario, the device provided in the embodiments of the present application may further include a noise reduction module, an audio-video separation module, and the like. When the audio content in a subobject is to be processed, the audio-video separation module may first separate the audio content from the subobject. In addition, to improve accuracy during audio feature extraction, the noise reduction module may first perform noise reduction on the audio content of the subobject before waveform recognition is performed in the different frequency bands.
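As a minimal stand-in for the noise reduction step, the sketch below applies a moving-average low-pass filter to a 1-D signal; real systems would more likely use spectral methods, and the window size here is an arbitrary assumption.

```python
# Minimal noise-reduction sketch: a moving-average low-pass filter that
# attenuates rapid sample-to-sample fluctuations (a crude noise model).

def moving_average(signal, window=3):
    """Smooth a 1-D audio signal with a centered moving average."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]   # alternating "noise"
smoothed = moving_average(noisy)
print(smoothed)  # extremes pulled toward the middle
```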
The categorical matching module 330 determines the content label of each subobject according to the video feature information and the audio feature information. The content label is information indicating what the subobject actually contains; it may describe the video content according to the user's needs from various dimensions, such as the objects it contains, the scene in which it takes place, or the emotion it conveys.
In some embodiments of the present application, the categorical matching module 330 may use deep learning to identify the content labels. Before processing the video-audio data, a deep learning model may be built and trained on audio content and video content that have been annotated with content labels, so that the model can identify the content labels of subobjects. For example, if the scheme provided in the embodiments of the present application is required to identify whether a segment of a film contains content about an explosion, various videos and audios about explosions may be provided as a training set; the training set contains the video feature information of those videos and the audio feature information of those audios, with their content labels marked as "explosion". Given sufficient training samples, the deep learning model can then process input video feature information or audio feature information without content labels and determine whether the content label should be "explosion", thereby determining the content of the film segment.
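The deep learning model itself is not specified in detail in this application; as a hedged stand-in, the sketch below trains a nearest-centroid classifier on labeled feature vectors and assigns a content label to an unlabeled segment. All feature vectors and label names are invented for the example.

```python
# Nearest-centroid label matching as a toy stand-in for the trained model:
# each content label gets the mean of its training vectors, and a new
# segment takes the label of the closest centroid.
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(samples):
    """samples: {label: [feature vectors]} -> {label: centroid}"""
    return {label: centroid(vecs) for label, vecs in samples.items()}

def predict(model, vec):
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(model, key=lambda label: dist(model[label], vec))

training = {
    "explosion": [[0.9, 0.8, 0.1], [0.8, 0.9, 0.2]],
    "dialogue":  [[0.1, 0.2, 0.9], [0.2, 0.1, 0.8]],
}
model = train(training)
print(predict(model, [0.85, 0.75, 0.15]))  # -> explosion
```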
In another embodiment of the present application, after determining the content labels of the subobjects, the categorical matching module 330 may classify the subobjects in the video-audio data object according to their content labels to generate classified object sets. For example, for a film, all segments about explosions may be grouped into an explosion-segment set, and all segments about fighting may likewise be grouped into a separate set.
In an actual scenario, the subobjects may be classified based on external input or preset classification conditions. For example, a keyword entered by a user may be obtained, the matching content labels selected according to the keyword, and a suitable content set thereby obtained. Taking a film as an example, if a trailer for the film is to be generated, the scheme provided in the embodiments of the present application may first divide the film into multiple segments and then generate the content label of each segment. The user may enter keywords according to actual needs to select the segments required for the trailer. For instance, if the user needs a trailer in a tender style, segments whose content labels match that style can be selected as material for the trailer, forming a segment set. Similarly, if the user needs a trailer with more fighting content, segments with the corresponding content labels can be selected.
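The keyword-driven selection described above can be sketched as follows: each segment carries content labels, and a user keyword picks out the matching segments as trailer material. The segment data and label names are illustrative assumptions.

```python
# Sketch of keyword-based selection of trailer material from labeled segments.

segments = [
    {"id": 1, "labels": ["explosion", "chase"]},
    {"id": 2, "labels": ["dialogue", "romance"]},
    {"id": 3, "labels": ["fight", "explosion"]},
    {"id": 4, "labels": ["romance"]},
]

def select_segments(segments, keyword):
    """Return the ids of segments whose content labels match the keyword."""
    return [s["id"] for s in segments if keyword in s["labels"]]

print(select_segments(segments, "romance"))    # tender-style material: [2, 4]
print(select_segments(segments, "explosion"))  # action-style material: [1, 3]
```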
Labels may also be set separately for the audio content and the video content, i.e., divided into video content labels and audio content labels; the two correspond one to one and are associated with the subobjects obtained by segmenting the video-audio data object. When classifying based on content labels, the classification can thus be performed on audio alone, on video alone, or on audio and video in combination, so as to generate the sets the user needs: according to the video content labels and/or audio content labels of the subobjects, the video content and/or audio content of the subobjects in the video-audio data object are classified to obtain a video content set and/or an audio content set.
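Classifying separately by video and audio content labels can be sketched as a simple grouping step; the label pairs below are illustrative, not data from this application.

```python
# Sketch: group subobjects independently by their video content label and by
# their audio content label, yielding a video content set and an audio
# content set.

subobjects = [
    {"id": 1, "video_label": "explosion", "audio_label": "blast"},
    {"id": 2, "video_label": "dialogue",  "audio_label": "speech"},
    {"id": 3, "video_label": "explosion", "audio_label": "music"},
]

def group_by(subobjects, key):
    groups = {}
    for s in subobjects:
        groups.setdefault(s[key], []).append(s["id"])
    return groups

video_sets = group_by(subobjects, "video_label")  # video content sets
audio_sets = group_by(subobjects, "audio_label")  # audio content sets
print(video_sets)  # {'explosion': [1, 3], 'dialogue': [2]}
print(audio_sets)
```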
In summary, in the video-audio data processing scheme provided by the present application, a video-audio data object is first segmented into multiple subobjects; the video feature information about the video content and the audio feature information about the audio content in each subobject are then extracted; and the content label of each subobject is determined according to the video feature information and the audio feature information. The content labels make it possible to determine the specific content each subobject of the video-audio data object contains, while the associations between content labels can indicate the relationships between the various parts of the content. The audio-video content in the video-audio data object can thereby be used effectively, realizing unified scheduling and use of video and audio material.
In addition, part of the present application may be implemented as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present application through the operation of that computer. The program instructions that invoke the method of the present application may be stored in a fixed or removable recording medium, transmitted via broadcast or a data stream in another signal-bearing medium, and/or stored in the working memory of a computer device running according to the program instructions. Here, one embodiment of the present application includes a device as shown in Figure 4, which comprises one or more machine-readable media 410 storing machine-readable instructions and a processor 420 for executing the machine-readable instructions, wherein, when the machine-readable instructions are executed by the processor, the device is caused to execute the methods and/or technical solutions of the multiple embodiments of the present application described above.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, a software program of the present application may be executed by a processor to realize the steps or functions described above. Likewise, the software programs of the present application (including related data structures) may be stored in a computer-readable recording medium such as a RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuits that cooperate with a processor to execute each step or function.
It is obvious to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that it can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive, and the scope of the present application is defined by the appended claims rather than by the above description; all changes that fall within the meaning and range of equivalency of the claims are therefore intended to be embraced in the application. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" denote names and do not indicate any particular order.

Claims (21)

1. A method for processing video-audio data, wherein the method comprises:
segmenting a video-audio data object into a plurality of subobjects;
extracting video feature information about video content in the subobjects and audio feature information about audio content in the subobjects; and
determining a content label of each subobject according to the video feature information and the audio feature information.
2. The method according to claim 1, wherein segmenting a video-audio data object into a plurality of subobjects comprises:
performing spatio-temporal slice clustering on the video-audio data object according to the video content in the video-audio data object; and
determining the plurality of subobjects based on the clustering result.
3. The method according to claim 2, wherein determining the plurality of subobjects based on the clustering result comprises:
dynamically adjusting the clustering result according to the similarity between clustering results, and determining the plurality of subobjects.
4. The method according to claim 1, wherein extracting the video feature information about the video content in the subobject comprises:
extracting key frames from the video content of the subobject; and
acquiring the video feature information of the key frames as the video feature information about the video content in the subobject.
5. The method according to claim 1, wherein extracting the audio feature information about the audio content in the subobject comprises:
performing waveform recognition in different frequency bands, and extracting different types of audio sets from the audio content of the subobject; and
extracting the audio feature information in each audio set as the audio feature information about the audio content in the subobject.
6. The method according to claim 5, wherein, before performing waveform recognition in different frequency bands and extracting different types of audio sets from the audio content of the subobject, the method further comprises:
performing noise reduction on the audio content of the subobject.
7. The method according to claim 1, wherein, before extracting the audio feature information about the audio content in the subobject, the method further comprises:
separating the audio content from the subobject.
8. The method according to claim 1, wherein determining the content label of each subobject according to the video feature information and the audio feature information comprises:
inputting the video feature information and the audio feature information into a deep learning model to obtain the content label of each subobject, wherein the deep learning model is trained on audio content and video content annotated with content labels.
9. The method according to claim 1, wherein the method further comprises:
classifying the subobjects in the video-audio data object according to the content labels of the subobjects to generate a classified object set.
10. The method according to claim 9, wherein the content labels comprise video content labels and audio content labels; and
classifying the subobjects in the video-audio data object according to the content labels of the subobjects to obtain a classified object set comprises:
classifying the video content and/or audio content of the subobjects in the video-audio data object according to the video content labels and/or audio content labels of the subobjects, so as to obtain a video content set and/or an audio content set.
11. A device for processing video-audio data, wherein the device comprises:
a segmentation module, configured to segment a video-audio data object into a plurality of subobjects;
a feature extraction module, configured to extract video feature information about video content in the subobjects and audio feature information about audio content in the subobjects; and
a categorical matching module, configured to determine a content label of each subobject according to the video feature information and the audio feature information.
12. The device according to claim 11, wherein the segmentation module is configured to perform spatio-temporal slice clustering on the video-audio data object according to the video content in the video-audio data object, and to determine the plurality of subobjects based on the clustering result.
13. The device according to claim 12, wherein the segmentation module is configured to dynamically adjust the clustering result according to the similarity between clustering results and determine the plurality of subobjects.
14. The device according to claim 11, wherein the feature extraction module is configured to extract key frames from the video content of the subobject and to acquire the video feature information of the key frames as the video feature information about the video content in the subobject.
15. The device according to claim 11, wherein the feature extraction module is configured to perform waveform recognition in different frequency bands, to extract different types of audio sets from the audio content of the subobject, and to extract the audio feature information in each audio set as the audio feature information about the audio content in the subobject.
16. The device according to claim 15, wherein the device further comprises:
a noise reduction module, configured to perform noise reduction on the audio content of the subobject before waveform recognition is performed in different frequency bands and different types of audio sets are extracted from the audio content of the subobject.
17. The device according to claim 11, wherein the device further comprises:
an audio-video separation module, configured to separate the audio content from the subobject.
18. The device according to claim 11, wherein determining the content label of each subobject according to the video feature information and the audio feature information comprises:
inputting the video feature information and the audio feature information into a deep learning model to obtain the content label of each subobject, wherein the deep learning model is trained on audio content and video content annotated with content labels.
19. The device according to claim 11, wherein the categorical matching module is further configured to classify the subobjects in the video-audio data object according to the content labels of the subobjects to generate a classified object set.
20. The device according to claim 19, wherein the content labels comprise video content labels and audio content labels; and
the categorical matching module is configured to classify the video content and/or audio content of the subobjects in the video-audio data object according to the video content labels and/or audio content labels of the subobjects, so as to obtain a video content set and/or an audio content set.
21. A device for processing video-audio data, wherein the device comprises:
a processor; and
one or more machine-readable media storing machine-readable instructions, wherein, when the processor executes the machine-readable instructions, the device is caused to execute the method according to any one of claims 1 to 10.
CN201810107188.8A 2018-02-02 2018-02-02 Video and audio data processing method and device Active CN108307229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810107188.8A CN108307229B (en) 2018-02-02 2018-02-02 Video and audio data processing method and device


Publications (2)

Publication Number Publication Date
CN108307229A true CN108307229A (en) 2018-07-20
CN108307229B CN108307229B (en) 2023-12-22

Family

ID=62850942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810107188.8A Active CN108307229B (en) 2018-02-02 2018-02-02 Video and audio data processing method and device

Country Status (1)

Country Link
CN (1) CN108307229B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101920A (en) * 2018-08-07 2018-12-28 石家庄铁道大学 Video time domain unit partioning method
CN109120996A (en) * 2018-08-31 2019-01-01 深圳市万普拉斯科技有限公司 Video information recognition methods, storage medium and computer equipment
CN109257622A (en) * 2018-11-01 2019-01-22 广州市百果园信息技术有限公司 A kind of audio/video processing method, device, equipment and medium
CN109587568A (en) * 2018-11-01 2019-04-05 北京奇艺世纪科技有限公司 Video broadcasting method, device, computer readable storage medium
CN110213670A (en) * 2019-05-31 2019-09-06 北京奇艺世纪科技有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN110234038A (en) * 2019-05-13 2019-09-13 特斯联(北京)科技有限公司 A kind of user management method based on distributed storage
CN110324726A (en) * 2019-05-29 2019-10-11 北京奇艺世纪科技有限公司 Model generation, method for processing video frequency, device, electronic equipment and storage medium
CN110677716A (en) * 2019-08-20 2020-01-10 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN111008287A (en) * 2019-12-19 2020-04-14 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium
CN111770375A (en) * 2020-06-05 2020-10-13 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
CN112487248A (en) * 2020-12-01 2021-03-12 深圳市易平方网络科技有限公司 Video file label generation method and device, intelligent terminal and storage medium
CN113095231A (en) * 2021-04-14 2021-07-09 上海西井信息科技有限公司 Video identification method, system, device and storage medium based on classified object
CN113163272A (en) * 2020-01-07 2021-07-23 海信集团有限公司 Video editing method, computer device and storage medium

Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040041127A (en) * 2004-04-23 2004-05-14 학교법인 한국정보통신학원 An intelligent agent system for providing viewer-customized video skims in digital TV broadcasting
US6829781B1 (en) * 2000-05-24 2004-12-07 At&T Corp. Network-based service to provide on-demand video summaries of television programs
CN1938714A (en) * 2004-03-23 2007-03-28 英国电讯有限公司 Method and system for semantically segmenting scenes of a video sequence
CN100538698C (en) * 2004-01-14 2009-09-09 三菱电机株式会社 Summary transcriber and summary reproducting method
JP2010039877A (en) * 2008-08-07 2010-02-18 Nippon Telegr & Teleph Corp <Ntt> Apparatus and program for generating digest content
US20100104261A1 (en) * 2008-10-24 2010-04-29 Zhu Liu Brief and high-interest video summary generation
US20120201519A1 (en) * 2011-02-03 2012-08-09 Jennifer Reynolds Generating montages of video segments responsive to viewing preferences associated with a video terminal
US20120281969A1 (en) * 2011-05-03 2012-11-08 Wei Jiang Video summarization using audio and visual cues
CN103299324A (en) * 2010-11-11 2013-09-11 谷歌公司 Learning tags for video annotation using latent subtags
US20140082663A1 (en) * 2009-05-29 2014-03-20 Cognitive Media Networks, Inc. Methods for Identifying Video Segments and Displaying Contextually Targeted Content on a Connected Television
CN103854014A (en) * 2014-02-25 2014-06-11 中国科学院自动化研究所 Terror video identification method and device based on sparse representation of context
US20140270699A1 (en) * 2013-03-14 2014-09-18 Centurylink Intellectual Property Llc Auto-Summarizing Video Content System and Method
US9002175B1 (en) * 2013-03-13 2015-04-07 Google Inc. Automated video trailer creation
US20150134673A1 (en) * 2013-10-03 2015-05-14 Minute Spoteam Ltd. System and method for creating synopsis for multimedia content
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105611413A (en) * 2015-12-24 2016-05-25 小米科技有限责任公司 Method and device for adding video clip class markers
US9635337B1 (en) * 2015-03-27 2017-04-25 Amazon Technologies, Inc. Dynamically generated media trailers
CN106649713A (en) * 2016-12-21 2017-05-10 中山大学 Movie visualization processing method and system based on content
CN106779073A (en) * 2016-12-27 2017-05-31 西安石油大学 Media information sorting technique and device based on deep neural network
CN106878632A (en) * 2017-02-28 2017-06-20 北京知慧教育科技有限公司 A kind for the treatment of method and apparatus of video data
CN107077595A (en) * 2014-09-08 2017-08-18 谷歌公司 Selection and presentation representative frame are for video preview
CN107436921A (en) * 2017-07-03 2017-12-05 李洪海 Video data handling procedure, device, equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
B.L. TSENG et al.: "Personalized video summary using visual semantic annotations and automatic speech transcriptions", IEEE *
LAN Yijie: "Research on emotion-based video summarization", China Master's Theses Electronic Journal *
XIE Yuxiang, LUAN Xidao, WU Lingda, LAO Songyang: "NVPS: a multimodal news video processing ***", Journal of the China Society for Scientific and Technical Information, no. 04 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101920A (en) * 2018-08-07 2018-12-28 石家庄铁道大学 Video time domain unit partioning method
CN109101920B (en) * 2018-08-07 2021-06-25 石家庄铁道大学 Video time domain unit segmentation method
CN109120996A (en) * 2018-08-31 2019-01-01 深圳市万普拉斯科技有限公司 Video information recognition methods, storage medium and computer equipment
CN109257622A (en) * 2018-11-01 2019-01-22 广州市百果园信息技术有限公司 A kind of audio/video processing method, device, equipment and medium
CN109587568A (en) * 2018-11-01 2019-04-05 北京奇艺世纪科技有限公司 Video broadcasting method, device, computer readable storage medium
CN110234038B (en) * 2019-05-13 2020-02-14 特斯联(北京)科技有限公司 User management method based on distributed storage
CN110234038A (en) * 2019-05-13 2019-09-13 特斯联(北京)科技有限公司 A kind of user management method based on distributed storage
CN110324726B (en) * 2019-05-29 2022-02-18 北京奇艺世纪科技有限公司 Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium
CN110324726A (en) * 2019-05-29 2019-10-11 北京奇艺世纪科技有限公司 Model generation, method for processing video frequency, device, electronic equipment and storage medium
CN110213670A (en) * 2019-05-31 2019-09-06 北京奇艺世纪科技有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN110213670B (en) * 2019-05-31 2022-01-07 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and storage medium
CN110677716B (en) * 2019-08-20 2022-02-01 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110677716A (en) * 2019-08-20 2020-01-10 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN110930997B (en) * 2019-12-10 2022-08-16 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN111008287A (en) * 2019-12-19 2020-04-14 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium
CN111008287B (en) * 2019-12-19 2023-08-04 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium
CN113163272B (en) * 2020-01-07 2022-11-25 海信集团有限公司 Video editing method, computer device and storage medium
CN113163272A (en) * 2020-01-07 2021-07-23 海信集团有限公司 Video editing method, computer device and storage medium
CN111770375A (en) * 2020-06-05 2020-10-13 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
US11800042B2 (en) 2020-06-05 2023-10-24 Baidu Online Network Technology (Beijing) Co., Ltd. Video processing method, electronic device and storage medium thereof
CN112487248A (en) * 2020-12-01 2021-03-12 深圳市易平方网络科技有限公司 Video file label generation method and device, intelligent terminal and storage medium
CN113095231A (en) * 2021-04-14 2021-07-09 上海西井信息科技有限公司 Video identification method, system, device and storage medium based on classified object
CN113095231B (en) * 2021-04-14 2023-04-18 上海西井信息科技有限公司 Video identification method, system, device and storage medium based on classified object

Also Published As

Publication number Publication date
CN108307229B (en) 2023-12-22

Similar Documents

Publication Publication Date Title
CN108307229A (en) A kind of processing method and equipment of video-audio data
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN109168024B (en) Target information identification method and device
US9961403B2 (en) Visual summarization of video for quick understanding by determining emotion objects for semantic segments of video
US10970334B2 (en) Navigating video scenes using cognitive insights
US20190066732A1 (en) Video Skimming Methods and Systems
WO2022184117A1 (en) Deep learning-based video clipping method, related device, and storage medium
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
US10277834B2 (en) Suggestion of visual effects based on detected sound patterns
CN115004299A (en) Classifying audio scenes using composite image features
CN110505498A (en) Processing, playback method, device and the computer-readable medium of video
CN111737516A (en) Interactive music generation method and device, intelligent sound box and storage medium
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN108921032A (en) A kind of new video semanteme extracting method based on deep learning model
WO2023197749A1 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN111797850A (en) Video classification method and device, storage medium and electronic equipment
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
WO2019127940A1 (en) Video classification model training method, device, storage medium, and electronic device
CN113923504B (en) Video preview moving picture generation method and device
CN115222858A (en) Method and equipment for training animation reconstruction network and image reconstruction and video reconstruction thereof
CN110475139B (en) Video subtitle shielding method and device, storage medium and electronic equipment
JP2005513675A (en) Moving picture shape descriptor extracting apparatus and method showing statistical characteristics of still picture shape descriptor and moving picture index system using the same
CN110275988A (en) Obtain the method and device of picture
CN116013274A (en) Speech recognition method, device, computer equipment and storage medium
CN113542874A (en) Information playing control method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant