CN108307229A - Method and device for processing audio-video data - Google Patents

Method and device for processing audio-video data

Info

Publication number
CN108307229A
CN108307229A (application CN201810107188.8A; granted as CN108307229B)
Authority
CN
China
Prior art keywords
video
content
audio
subobject
feature information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810107188.8A
Other languages
Chinese (zh)
Other versions
CN108307229B (en)
Inventor
徐常亮
李尉冉
傅丕毅
张云远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Wisdom Cloud Technology Co Ltd
Original Assignee
Xinhua Wisdom Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Wisdom Cloud Technology Co Ltd
Priority to CN201810107188.8A
Publication of CN108307229A
Application granted
Publication of CN108307229B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a method and device for processing audio-video data. The scheme first segments an audio-video data object into multiple sub-objects, then extracts video feature information about the video content of each sub-object and audio feature information about its audio content, and determines a content label for each sub-object from these features. The content labels identify the specific content each sub-object contributes to the audio-video data object, and the associations between labels indicate the relationships between the parts, so the audio-video content of the object can be used effectively and video and audio material can be scheduled in a unified way.

Description

Method and device for processing audio-video data
Technical field
This application relates to the field of information technology, and in particular to a method and device for processing audio-video data.
Background
With the development of smart devices and audio-video technology, audio-video data objects that contain both audio content and video content, such as films and TV series, are generated and spread at a rapidly increasing pace. However, these objects generally exist in isolation, and there is no unified way or channel to identify and use the content they contain. Current techniques mainly identify video or audio through video/audio fingerprints matched against corresponding video/audio libraries; it is difficult to determine what content an audio-video data object specifically contains and the associations between its parts, so the audio-video content of the object cannot be used effectively.
Summary of the application
The purpose of this application is to provide a method and device for processing audio-video data, so as to solve the problem in the prior art that it is difficult to determine what content an audio-video data object specifically contains and the associations between its parts.
To achieve the above object, this application provides a method for processing audio-video data, the method comprising:
segmenting an audio-video data object into a plurality of sub-objects;
extracting video feature information about the video content of each sub-object and audio feature information about its audio content; and
determining a content label for each sub-object according to the video feature information and audio feature information.
In another aspect, this application further provides a device for processing audio-video data, the device comprising:
a segmentation module, configured to segment an audio-video data object into a plurality of sub-objects;
a feature extraction module, configured to extract video feature information about the video content of each sub-object and audio feature information about its audio content; and
a category matching module, configured to determine a content label for each sub-object according to the video feature information and audio feature information.
In addition, this application further provides a device for processing audio-video data, wherein the device comprises:
a processor; and
one or more machine-readable media storing machine-readable instructions which, when executed by the processor, cause the device to carry out the processing method described above.
In the processing scheme provided by this application, an audio-video data object is first segmented into multiple sub-objects; video feature information about the video content of each sub-object and audio feature information about its audio content are then extracted; and a content label is determined for each sub-object from these features. The content labels identify the specific content each sub-object contributes to the audio-video data object, and the associations between labels indicate the relationships between the parts, so the audio-video content of the object can be used effectively and video and audio material can be scheduled in a unified way.
Description of the drawings
Other features, objects and advantages of the application will become more apparent from the following detailed description of non-restrictive embodiments, read with reference to the accompanying drawings:
Fig. 1 shows a flow chart of a method for processing audio-video data provided by an embodiment of the application;
Fig. 2 shows a schematic diagram of the overall flow when an audio-video data object is processed with the method provided by an embodiment of the application;
Fig. 3 shows a schematic structural diagram of a device for processing audio-video data provided by an embodiment of the application;
Fig. 4 shows a schematic structural diagram of another device for processing audio-video data provided by an embodiment of the application.
In the drawings, the same or similar reference numerals denote the same or similar components.
Detailed description of embodiments
The application is described in further detail below with reference to the accompanying drawings.
In a typical configuration of this application, the terminals and the devices of the service network each include one or more processors (CPUs), an input/output interface, a network interface and memory.
The memory may include computer-readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or disk storage or other magnetic storage devices, and any other non-transmission medium that can be used to store information accessible by a computing device.
An embodiment of the application provides a method for processing audio-video data. The method can determine the specific content contained in each sub-object of an audio-video data object, so that the audio-video content of the object can be used effectively and video and audio material can be scheduled in a unified way. The method may be executed by a user device, by a network device, by a device formed by integrating a user device and a network device over a network, or by an application program running on such devices. User devices include, but are not limited to, terminal devices of all kinds such as computers, mobile phones and tablets; network devices include, but are not limited to, implementations such as a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud consists of a large number of hosts or network servers based on cloud computing, a form of distributed computing in which a virtual computer is made up of a group of loosely coupled computer sets.
Fig. 1 shows a method for processing audio-video data provided by an embodiment of the application. The method comprises the following steps:
Step S101: segment an audio-video data object into multiple sub-objects. In the embodiments of this application, an audio-video data object is a file or data stream containing audio and video data; concretely it may be a film, a TV series, and so on. A sub-object is a part of the content of an audio-video data object: a 120-minute film, for example, can be divided evenly by duration into multiple segments, each segment being one sub-object.
In some embodiments of the application, the audio-video data object can be segmented by clustering spatio-temporal slices: according to the video content of the object, spatio-temporal slice clustering is performed on the object, and the multiple sub-objects are determined from the clustering result. A spatio-temporal slice is the image formed, in temporal order, by the strip of pixels at the same position in the successive frames of a video image sequence. Since pictures with similar content are visually similar to a certain degree, segmenting the object by spatio-temporal slice clustering ensures that the audio-video data within each sub-object belongs to similar content.
For example, suppose the pictures in a piece of video comprise three parts: the first shows two people talking in an indoor scene, the second shows garden scenery in an outdoor scene, and the third shows an explosion in an outdoor scene. Since these three parts look very different, spatio-temporal slice clustering can accurately divide the video into three parts; the video frames in each part form one clustering result, and the corresponding video and audio form one sub-object.
In real scenes the actual pictures can be more complicated, and clustering based on spatio-temporal slices may produce errors. For instance, in the first part (two people talking indoors), the movement of a person may change part of the picture considerably, so that the first part is split into two clustering results; or the pictures of the second and third parts might be merged into one clustering result. Therefore, when determining the sub-objects from the clustering result, the clustering result can be dynamically adjusted according to the similarity between clusters: by setting a dynamic threshold, the similarity threshold used in clustering can be adjusted on the fly, merging preliminary clustering results or splitting them further, so that the final clustering result is more accurate.
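The segmentation described above can be sketched in code. The following toy example is not part of the patent: each spatio-temporal slice is reduced to one hypothetical per-frame feature vector (standing in for a real pixel strip), consecutive frames are grouped when their slices stay similar, and neighbouring groups are merged when their mean features are still close, a simple stand-in for the dynamic threshold adjustment.

```python
# Hypothetical sketch of spatio-temporal slice clustering. Each frame is
# represented by a small feature vector (a stand-in for a pixel strip);
# thresholds are illustrative, not from the patent.

def slice_distance(a, b):
    """Euclidean distance between two per-frame slice features."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def segment_by_slices(slices, split_thresh=10.0, merge_thresh=5.0):
    """Group consecutive frames into sub-objects, then merge adjacent
    groups whose mean features remain similar (dynamic adjustment)."""
    if not slices:
        return []
    # initial split: start a new segment whenever the slice jumps
    segments = [[0]]
    for i in range(1, len(slices)):
        if slice_distance(slices[i], slices[i - 1]) > split_thresh:
            segments.append([])
        segments[-1].append(i)

    def mean(seg):
        dims = len(slices[0])
        return [sum(slices[i][d] for i in seg) / len(seg) for d in range(dims)]

    # merge neighbouring segments that are still too similar
    merged = [segments[0]]
    for seg in segments[1:]:
        if slice_distance(mean(merged[-1]), mean(seg)) < merge_thresh:
            merged[-1].extend(seg)
        else:
            merged.append(seg)
    return merged

# toy "film": indoor dialogue (dark), garden scenery (bright), explosion (reddish)
frames = [[10, 10, 10]] * 5 + [[200, 220, 200]] * 5 + [[250, 60, 40]] * 5
parts = segment_by_slices(frames)
print(len(parts))  # three sub-objects for the three scenes
```

A real implementation would cluster actual pixel strips across frames; the merge step here plays the role of the dynamic threshold described above.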
Step S102: extract video feature information about the video content of each sub-object and audio feature information about its audio content.
The video part is processed on the basis of the video content of each sub-object: after a film is divided into multiple segments, feature extraction on the video content of each segment yields its feature information. In some embodiments of the application, key frames are first extracted from the video content of the sub-object; the key frames are then processed to obtain their video feature information, which serves as the video feature information about the video content of the sub-object.
A key frame is a frame at a key action in the motion or change of the images; it reflects what the video image sequence actually expresses. For a video about an explosion, for example, the key frames may be the frame marking the cause of the explosion (such as the moment of an impact), the frame where the flame appears, the frame where the flame is largest, the frame where the flame disappears, and so on. Since the key frames already reflect the actual meaning of the video content well, using their video feature information as the video feature information of the sub-object reduces the processing workload and increases processing speed.
The video feature information may be image features such as texture, colour, shape or spatial relationships. In real scenes, one or more image features suited to the current scene can be selected as the video feature information according to need, to improve processing accuracy. The extracted video feature information may be recorded in the form of a set of multi-dimensional vectors.
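A minimal sketch of the key-frame idea, not taken from the patent: frame features are toy vectors standing in for texture/colour descriptors, and a frame becomes a key frame when it differs enough from the previous key frame, which naturally captures moments such as a flame appearing, peaking and fading.

```python
# Hypothetical key-frame selection. Feature vectors and the threshold
# are illustrative stand-ins for real image descriptors.

def extract_key_frames(features, thresh=50.0):
    """Return indices of key frames: the first frame, plus every frame
    whose feature distance to the last key frame exceeds `thresh`."""
    if not features:
        return []
    keys = [0]
    for i in range(1, len(features)):
        ref = features[keys[-1]]
        dist = sum((a - b) ** 2 for a, b in zip(features[i], ref)) ** 0.5
        if dist > thresh:
            keys.append(i)
    return keys

# explosion sub-object: calm, flame appears, flame peaks, flame fades
feats = [[5, 5], [6, 5], [120, 80], [122, 81], [250, 160], [40, 20]]
print(extract_key_frames(feats))  # [0, 2, 4, 5]
```

Only the selected frames would then go through full feature extraction, which is what reduces the processing workload.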
The audio part is processed on the basis of the audio content of each sub-object. For a film, after it is divided into multiple segments, feature extraction on the audio content of each segment yields its feature information. In a typical audio-video data object the audio content comprises multiple types, for example voices, sound effects, ambient sound and background music. Taking the video of two people talking indoors as an example, the corresponding audio content may contain the voices of the two people, their footsteps as they walk about, the sound of a vehicle passing outside, background music, and so on; these contents correspond to different waveforms in different frequency bands. Thus, in some embodiments of the application, when extracting audio features, waveform recognition can be carried out in different bands to extract sets of audio of different types from the audio content of the sub-object, such as a voice/sound-effect set, an ambient-sound set or a background-music set. The audio feature information of each of these sets is then extracted separately and serves as the audio feature information about the audio content of the sub-object. The extracted audio feature information may be recorded in the form of a set of multi-dimensional vectors.
In real scenes, when the audio content of a sub-object is processed, the audio content can first be separated out of the sub-object. Moreover, to improve accuracy during audio feature extraction, noise reduction can be applied to the audio content of the sub-object before waveform recognition is carried out in the different bands.
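As a rough illustration of band-wise analysis (not the patent's method), the sketch below splits a mono signal into coarse frequency bands with a naive DFT and reports per-band energy as a toy audio feature; a real system would use proper filter banks and waveform recognition per band.

```python
# Hypothetical band-energy extraction with a naive O(n^2) DFT, as a
# stand-in for recognising waveform types (speech, ambient sound,
# background music) in different bands. For sketching only.
import math

def band_energies(samples, rate, bands):
    """Return the spectral energy of `samples` in each (lo_hz, hi_hz) band."""
    n = len(samples)
    energies = []
    for lo, hi in bands:
        e = 0.0
        for k in range(int(lo * n / rate), int(hi * n / rate)):
            re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            e += re * re + im * im
        energies.append(e)
    return energies

rate = 800
t = [i / rate for i in range(400)]
# 50 Hz "background hum" plus a quieter 200 Hz "speech-like" tone
signal = [math.sin(2 * math.pi * 50 * x) + 0.5 * math.sin(2 * math.pi * 200 * x)
          for x in t]
low, high = band_energies(signal, rate, [(20, 120), (150, 300)])
print(low > high)  # the louder hum dominates its band
```

The two per-band energies form a toy two-dimensional feature vector of the kind the scheme records as multi-dimensional vector sets.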
Step S103: determine the content label of each sub-object according to the video feature information and audio feature information. A content label is information indicating what video content a sub-object actually contains; it can describe the video content from various angles according to the user's needs, such as the content shown, the scene it takes place in, or the corresponding emotion.
In some embodiments of the application, content labels are identified by deep learning. Before the audio-video data is processed, a deep-learning model can be built and trained with audio content and video content whose content labels have already been marked, so that the model can identify the content labels of sub-objects. For example, if the scheme provided by an embodiment of the application is to recognise whether a segment of a film is about an explosion, then all kinds of videos and audios about explosions can be supplied as the training set; the training set contains the video feature information of these videos and the audio feature information of these audios, with the content label marked as "explosion". Provided there are enough training samples, the deep-learning model can take unlabelled video feature information or audio feature information as input, decide whether its content label should be "explosion", and thereby determine the content of the film segment.
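The supervised labelling step can be illustrated without an actual deep network. In the sketch below, which is my own stand-in and not the patent's model, a tiny nearest-centroid classifier is trained on labelled feature vectors (imagined as concatenated video and audio features) and then labels an unseen sub-object.

```python
# Hypothetical stand-in for the deep-learning model: a nearest-centroid
# classifier over labelled (video+audio) feature vectors. The feature
# values and labels are invented for illustration.

def train_centroids(samples):
    """samples: list of (feature_vector, label) -> {label: centroid}."""
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_label.items()
    }

def predict_label(centroids, vec):
    """Return the label whose centroid is closest to `vec`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist(centroids[lab], vec))

training = [
    ([0.9, 0.8, 0.1], "explosion"),   # bright flashes, loud broadband audio
    ([0.8, 0.9, 0.2], "explosion"),
    ([0.1, 0.2, 0.9], "dialogue"),    # static scene, speech-band audio
    ([0.2, 0.1, 0.8], "dialogue"),
]
model = train_centroids(training)
print(predict_label(model, [0.85, 0.75, 0.15]))  # explosion
```

With enough labelled samples a real model replaces the centroids, but the interface is the same: features in, content label out.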
In another embodiment of the application, after the content labels of the sub-objects have been determined, the sub-objects of the audio-video data object can be sorted according to their content labels to produce classified sets. For a film, for example, all segments about explosions can be placed in a set of explosion segments, and all fight segments can likewise be placed in a separate set.
In real scenes, the sub-objects can be sorted on the basis of external input or preset classification conditions. For example, keywords entered by a user can be obtained, matching content labels chosen according to the keywords, and a suitable content set produced. Taking a film as an example, to generate its trailer the scheme provided by the embodiments can divide the film into multiple segments and generate the content label of each segment. The user can then enter keywords according to actual needs to choose the segments the trailer requires: a user who wants a trailer with a tender style can choose the segments whose content labels match that style as the material for the trailer, forming a set of segments; similarly, a user who wants a trailer with more fighting can choose the segments with the corresponding content labels.
Separate labels can also be set for the audio content and the video content, dividing the labels into video content labels and audio content labels that correspond one to one and are associated with the sub-objects obtained by segmentation. Sorting by content label can then be done on video alone, on audio alone, or on both in combination, producing the sets the user needs: according to the video content labels and/or audio content labels of the sub-objects, the video content and/or audio content of the sub-objects in the audio-video data object can be sorted to obtain video content sets and/or audio content sets.
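The sorting and keyword selection just described can be sketched as follows; the clip ids and labels are invented for illustration and do not come from the patent.

```python
# Hypothetical classification of labelled sub-objects into sets, plus
# keyword-based selection of trailer material.

def classify(sub_objects):
    """sub_objects: list of (clip_id, content_label) -> {label: [ids]}."""
    sets_ = {}
    for clip_id, label in sub_objects:
        sets_.setdefault(label, []).append(clip_id)
    return sets_

def select_by_keyword(classified, keyword):
    """Return all clip ids whose content label contains the keyword."""
    return [cid for label, ids in classified.items()
            if keyword in label for cid in ids]

clips = [("c1", "explosion"), ("c2", "tender dialogue"),
         ("c3", "fight"), ("c4", "tender farewell")]
sets_ = classify(clips)
print(select_by_keyword(sets_, "tender"))  # ['c2', 'c4']
```

The same two functions could run over video content labels, audio content labels, or both, yielding the video content sets and audio content sets described above.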
Fig. 2 shows a schematic diagram of the overall flow when an audio-video data object is processed with the method provided by an embodiment of the application. The overall flow comprises the following steps:
S201: first segment the object on the basis of the video content, dividing it into multiple sub-objects.
S202: perform video feature extraction on the segmented video content to obtain the video feature information.
S203: at the same time, separate the audio from the video to obtain the audio content corresponding to the segmented video.
S204: apply noise reduction to the audio content to eliminate noise.
S205: recognise waveforms in different bands and separate out different types of audio, for example voices and sound effects.
S206: perform audio feature extraction on the different types of audio to obtain the audio feature information.
S207: feed the video feature information and audio feature information into the deep-learning model for processing.
S208: identify the content labels according to the result of the deep learning, and sort the sub-objects into multiple video content sets and audio content sets.
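The S201 to S208 flow can be expressed as an end-to-end skeleton. Every stage below is a deliberately trivial stub of my own, standing in for the techniques described above; the point is the order of operations, not the internals.

```python
# Hypothetical skeleton of the Fig. 2 flow (S201-S208) with stub stages.

def process(av_object):
    sub_objects = split_by_video(av_object)                    # S201
    results = []
    for sub in sub_objects:
        v_feat = extract_video_features(sub["video"])          # S202
        audio = separate_audio(sub)                            # S203
        audio = denoise(audio)                                 # S204
        streams = split_bands(audio)                           # S205
        a_feat = [extract_audio_features(s) for s in streams]  # S206
        label = deep_model(v_feat, a_feat)                     # S207
        results.append((sub["id"], label))
    return classify(results)                                   # S208

# Stubs, just enough to make the skeleton run:
def split_by_video(obj): return [{"id": i, "video": v, "audio": a}
                                 for i, (v, a) in enumerate(obj)]
def extract_video_features(v): return sum(v)
def separate_audio(sub): return sub["audio"]
def denoise(a): return [x for x in a if abs(x) > 0.1]
def split_bands(a): return [a[::2], a[1::2]]
def extract_audio_features(s): return sum(s)
def deep_model(v, a): return "explosion" if v > 10 else "dialogue"
def classify(results):
    sets_ = {}
    for cid, label in results:
        sets_.setdefault(label, []).append(cid)
    return sets_

film = [([9, 9], [0.5, 0.05, 0.4]), ([1, 2], [0.2, 0.3, 0.02])]
print(process(film))  # {'explosion': [0], 'dialogue': [1]}
```

Swapping any stub for a real implementation (slice clustering, filter banks, a trained model) leaves the pipeline shape unchanged.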
Based on the same inventive concept, an embodiment of the application further provides a device for processing audio-video data. The method corresponding to the device is the method of the preceding embodiment, and it solves the problem on a similar principle.
An embodiment of the application provides a device for processing audio-video data. The device can determine the specific content contained in each sub-object of an audio-video data object, so that the audio-video content of the object can be used effectively and video and audio material can be scheduled in a unified way. The device may be implemented as a user device, a network device, a device formed by integrating a user device and a network device over a network, or an application program running on such devices. User devices include, but are not limited to, terminal devices of all kinds such as computers, mobile phones and tablets; network devices include, but are not limited to, implementations such as a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud consists of a large number of hosts or network servers based on cloud computing, a form of distributed computing in which a virtual computer is made up of a group of loosely coupled computer sets.
Fig. 3 shows a device for processing audio-video data provided by an embodiment of the application. The device comprises a segmentation module 310, a feature extraction module 320 and a category matching module 330. The segmentation module 310 is configured to segment an audio-video data object into multiple sub-objects. In the embodiments of this application, an audio-video data object is a file or data stream containing audio and video data; concretely it may be a film, a TV series, and so on. A sub-object is a part of the content of an audio-video data object: a 120-minute film, for example, can be divided evenly by duration into multiple segments, each segment being one sub-object.
In some embodiments of the application, the segmentation module 310 can segment the audio-video data object by clustering spatio-temporal slices: according to the video content of the object, spatio-temporal slice clustering is performed on the object, and the multiple sub-objects are determined from the clustering result. A spatio-temporal slice is the image formed, in temporal order, by the strip of pixels at the same position in the successive frames of a video image sequence. Since pictures with similar content are visually similar to a certain degree, segmenting the object by spatio-temporal slice clustering ensures that the audio-video data within each sub-object belongs to similar content.
For example, suppose the pictures in a piece of video comprise three parts: the first shows two people talking in an indoor scene, the second shows garden scenery in an outdoor scene, and the third shows an explosion in an outdoor scene. Since these three parts look very different, spatio-temporal slice clustering can accurately divide the video into three parts; the video frames in each part form one clustering result, and the corresponding video and audio form one sub-object.
In real scenes the actual pictures can be more complicated, and clustering based on spatio-temporal slices may produce errors. For instance, in the first part (two people talking indoors), the movement of a person may change part of the picture considerably, so that the first part is split into two clustering results; or the pictures of the second and third parts might be merged into one clustering result. Therefore, when determining the sub-objects from the clustering result, the clustering result can be dynamically adjusted according to the similarity between clusters: by setting a dynamic threshold, the similarity threshold used in clustering can be adjusted on the fly, merging preliminary clustering results or splitting them further, so that the final clustering result is more accurate.
The feature extraction module 320 is configured to extract video feature information about the video content of each sub-object and audio feature information about its audio content. Since both video and audio are processed, the feature extraction module may comprise a video feature extraction submodule and an audio feature extraction submodule.
The video part is processed on the basis of the video content of each sub-object: after a film is divided into multiple segments, feature extraction on the video content of each segment yields its feature information. In some embodiments of the application, key frames are first extracted from the video content of the sub-object; the key frames are then processed to obtain their video feature information, which serves as the video feature information about the video content of the sub-object.
A key frame is a frame at a key action in the motion or change of the images; it reflects what the video image sequence actually expresses. For a video about an explosion, for example, the key frames may be the frame marking the cause of the explosion (such as the moment of an impact), the frame where the flame appears, the frame where the flame is largest, the frame where the flame disappears, and so on. Since the key frames already reflect the actual meaning of the video content well, using their video feature information as the video feature information of the sub-object reduces the processing workload and increases processing speed.
The video feature information may be image features such as texture, colour, shape or spatial relationships. In real scenes, one or more image features suited to the current scene can be selected as the video feature information according to need, to improve processing accuracy. The extracted video feature information may be recorded in the form of a set of multi-dimensional vectors.
The audio part is processed on the basis of the audio content of each sub-object. For a film, after it is divided into multiple segments, feature extraction on the audio content of each segment yields its feature information. In a typical audio-video data object the audio content comprises multiple types, for example voices, sound effects, ambient sound and background music. Taking the video of two people talking indoors as an example, the corresponding audio content may contain the voices of the two people, their footsteps as they walk about, the sound of a vehicle passing outside, background music, and so on; these contents correspond to different waveforms in different frequency bands. Thus, in some embodiments of the application, when extracting audio features, waveform recognition can be carried out in different bands to extract sets of audio of different types from the audio content of the sub-object, such as a voice/sound-effect set, an ambient-sound set or a background-music set. The audio feature information of each of these sets is then extracted separately and serves as the audio feature information about the audio content of the sub-object. The extracted audio feature information may be recorded in the form of a set of multi-dimensional vectors.
In an actual scenario, the device provided in the embodiments of the present application may further include a noise reduction module, an audio-video separation module, and the like. When the audio content in a subobject is to be processed, the audio-video separation module may first separate the audio content from the subobject. In addition, to improve accuracy during audio feature extraction, the noise reduction module may first perform noise reduction on the audio content of the subobject before waveform recognition is performed in the different frequency bands.
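As a minimal stand-in for the noise reduction step, the sketch below applies a moving-average low-pass filter to a 1-D signal; real systems would more likely use spectral methods, and the window size here is an arbitrary assumption.

```python
# Minimal noise-reduction sketch: a moving-average low-pass filter that
# attenuates rapid sample-to-sample fluctuations (a crude noise model).

def moving_average(signal, window=3):
    """Smooth a 1-D audio signal with a centered moving average."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]   # alternating "noise"
smoothed = moving_average(noisy)
print(smoothed)  # extremes pulled toward the middle
```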
The categorical matching module 330 determines the content label of each subobject according to the video feature information and the audio feature information. The content label is information indicating what the subobject actually contains; it may describe the video content according to the user's needs from various dimensions, such as the objects it contains, the scene in which it takes place, or the emotion it conveys.
In some embodiments of the present application, the categorical matching module 330 may use deep learning to identify the content labels. Before processing the video-audio data, a deep learning model may be built and trained on audio content and video content that have been annotated with content labels, so that the model can identify the content labels of subobjects. For example, if the scheme provided in the embodiments of the present application is required to identify whether a segment of a film contains content about an explosion, various videos and audios about explosions may be provided as a training set; the training set contains the video feature information of those videos and the audio feature information of those audios, with their content labels marked as "explosion". Given sufficient training samples, the deep learning model can then process input video feature information or audio feature information without content labels and determine whether the content label should be "explosion", thereby determining the content of the film segment.
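The deep learning model itself is not specified in detail in this application; as a hedged stand-in, the sketch below trains a nearest-centroid classifier on labeled feature vectors and assigns a content label to an unlabeled segment. All feature vectors and label names are invented for the example.

```python
# Nearest-centroid label matching as a toy stand-in for the trained model:
# each content label gets the mean of its training vectors, and a new
# segment takes the label of the closest centroid.
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(samples):
    """samples: {label: [feature vectors]} -> {label: centroid}"""
    return {label: centroid(vecs) for label, vecs in samples.items()}

def predict(model, vec):
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(model, key=lambda label: dist(model[label], vec))

training = {
    "explosion": [[0.9, 0.8, 0.1], [0.8, 0.9, 0.2]],
    "dialogue":  [[0.1, 0.2, 0.9], [0.2, 0.1, 0.8]],
}
model = train(training)
print(predict(model, [0.85, 0.75, 0.15]))  # -> explosion
```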
In another embodiment of the present application, after determining the content labels of the subobjects, the categorical matching module 330 may classify the subobjects in the video-audio data object according to their content labels to generate classified object sets. For example, for a film, all segments about explosions may be grouped into an explosion-segment set, and all segments about fighting may likewise be grouped into a separate set.
In an actual scenario, the subobjects may be classified based on external input or preset classification conditions. For example, a keyword entered by a user may be obtained, the matching content labels selected according to the keyword, and a suitable content set thereby obtained. Taking a film as an example, if a trailer for the film is to be generated, the scheme provided in the embodiments of the present application may first divide the film into multiple segments and then generate the content label of each segment. The user may enter keywords according to actual needs to select the segments required for the trailer. For instance, if the user needs a trailer in a tender style, segments whose content labels match that style can be selected as material for the trailer, forming a segment set. Similarly, if the user needs a trailer with more fighting content, segments with the corresponding content labels can be selected.
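The keyword-driven selection described above can be sketched as follows: each segment carries content labels, and a user keyword picks out the matching segments as trailer material. The segment data and label names are illustrative assumptions.

```python
# Sketch of keyword-based selection of trailer material from labeled segments.

segments = [
    {"id": 1, "labels": ["explosion", "chase"]},
    {"id": 2, "labels": ["dialogue", "romance"]},
    {"id": 3, "labels": ["fight", "explosion"]},
    {"id": 4, "labels": ["romance"]},
]

def select_segments(segments, keyword):
    """Return the ids of segments whose content labels match the keyword."""
    return [s["id"] for s in segments if keyword in s["labels"]]

print(select_segments(segments, "romance"))    # tender-style material: [2, 4]
print(select_segments(segments, "explosion"))  # action-style material: [1, 3]
```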
Labels may also be set separately for the audio content and the video content, i.e., divided into video content labels and audio content labels; the two correspond one to one and are associated with the subobjects obtained by segmenting the video-audio data object. When classifying based on content labels, the classification can thus be performed on audio alone, on video alone, or on audio and video in combination, so as to generate the sets the user needs: according to the video content labels and/or audio content labels of the subobjects, the video content and/or audio content of the subobjects in the video-audio data object are classified to obtain a video content set and/or an audio content set.
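Classifying separately by video and audio content labels can be sketched as a simple grouping step; the label pairs below are illustrative, not data from this application.

```python
# Sketch: group subobjects independently by their video content label and by
# their audio content label, yielding a video content set and an audio
# content set.

subobjects = [
    {"id": 1, "video_label": "explosion", "audio_label": "blast"},
    {"id": 2, "video_label": "dialogue",  "audio_label": "speech"},
    {"id": 3, "video_label": "explosion", "audio_label": "music"},
]

def group_by(subobjects, key):
    groups = {}
    for s in subobjects:
        groups.setdefault(s[key], []).append(s["id"])
    return groups

video_sets = group_by(subobjects, "video_label")  # video content sets
audio_sets = group_by(subobjects, "audio_label")  # audio content sets
print(video_sets)  # {'explosion': [1, 3], 'dialogue': [2]}
print(audio_sets)
```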
In summary, in the video-audio data processing scheme provided by the present application, a video-audio data object is first segmented into multiple subobjects; the video feature information about the video content and the audio feature information about the audio content in each subobject are then extracted; and the content label of each subobject is determined according to the video feature information and the audio feature information. The content labels make it possible to determine the specific content each subobject of the video-audio data object contains, while the associations between content labels can indicate the relationships between the various parts of the content. The audio-video content in the video-audio data object can thereby be used effectively, realizing unified scheduling and use of video and audio material.
In addition, part of the present application may be implemented as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present application through the operation of that computer. The program instructions that invoke the method of the present application may be stored in a fixed or removable recording medium, transmitted via broadcast or a data stream in another signal-bearing medium, and/or stored in the working memory of a computer device running according to the program instructions. Here, one embodiment of the present application includes a device as shown in Figure 4, which comprises one or more machine-readable media 410 storing machine-readable instructions and a processor 420 for executing the machine-readable instructions, wherein, when the machine-readable instructions are executed by the processor, the device is caused to execute the methods and/or technical solutions of the multiple embodiments of the present application described above.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, a software program of the present application may be executed by a processor to realize the steps or functions described above. Likewise, the software programs of the present application (including related data structures) may be stored in a computer-readable recording medium such as a RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuits that cooperate with a processor to execute each step or function.
It is obvious to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that it can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as illustrative and not restrictive, and the scope of the present application is defined by the appended claims rather than by the above description; all changes that fall within the meaning and range of equivalency of the claims are therefore intended to be embraced in the application. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" denote names and do not indicate any particular order.

Claims (21)

1. A method for processing video-audio data, wherein the method comprises:
segmenting a video-audio data object into a plurality of subobjects;
extracting video feature information about video content in the subobjects and audio feature information about audio content in the subobjects; and
determining a content label of each subobject according to the video feature information and the audio feature information.
2. The method according to claim 1, wherein segmenting a video-audio data object into a plurality of subobjects comprises:
performing spatio-temporal slice clustering on the video-audio data object according to the video content in the video-audio data object; and
determining the plurality of subobjects based on the clustering result.
3. The method according to claim 2, wherein determining the plurality of subobjects based on the clustering result comprises:
dynamically adjusting the clustering result according to the similarity between clustering results, and determining the plurality of subobjects.
4. The method according to claim 1, wherein extracting the video feature information about the video content in the subobject comprises:
extracting key frames from the video content of the subobject; and
acquiring the video feature information of the key frames as the video feature information about the video content in the subobject.
5. The method according to claim 1, wherein extracting the audio feature information about the audio content in the subobject comprises:
performing waveform recognition in different frequency bands, and extracting different types of audio sets from the audio content of the subobject; and
extracting the audio feature information in each audio set as the audio feature information about the audio content in the subobject.
6. The method according to claim 5, wherein, before performing waveform recognition in different frequency bands and extracting different types of audio sets from the audio content of the subobject, the method further comprises:
performing noise reduction on the audio content of the subobject.
7. The method according to claim 1, wherein, before extracting the audio feature information about the audio content in the subobject, the method further comprises:
separating the audio content from the subobject.
8. The method according to claim 1, wherein determining the content label of each subobject according to the video feature information and the audio feature information comprises:
inputting the video feature information and the audio feature information into a deep learning model to obtain the content label of each subobject, wherein the deep learning model is trained on audio content and video content annotated with content labels.
9. The method according to claim 1, wherein the method further comprises:
classifying the subobjects in the video-audio data object according to the content labels of the subobjects to generate a classified object set.
10. The method according to claim 9, wherein the content labels comprise video content labels and audio content labels; and
classifying the subobjects in the video-audio data object according to the content labels of the subobjects to obtain a classified object set comprises:
classifying the video content and/or audio content of the subobjects in the video-audio data object according to the video content labels and/or audio content labels of the subobjects, so as to obtain a video content set and/or an audio content set.
11. A device for processing video-audio data, wherein the device comprises:
a segmentation module, configured to segment a video-audio data object into a plurality of subobjects;
a feature extraction module, configured to extract video feature information about video content in the subobjects and audio feature information about audio content in the subobjects; and
a categorical matching module, configured to determine a content label of each subobject according to the video feature information and the audio feature information.
12. The device according to claim 11, wherein the segmentation module is configured to perform spatio-temporal slice clustering on the video-audio data object according to the video content in the video-audio data object, and to determine the plurality of subobjects based on the clustering result.
13. The device according to claim 12, wherein the segmentation module is configured to dynamically adjust the clustering result according to the similarity between clustering results and determine the plurality of subobjects.
14. The device according to claim 11, wherein the feature extraction module is configured to extract key frames from the video content of the subobject and to acquire the video feature information of the key frames as the video feature information about the video content in the subobject.
15. The device according to claim 11, wherein the feature extraction module is configured to perform waveform recognition in different frequency bands, to extract different types of audio sets from the audio content of the subobject, and to extract the audio feature information in each audio set as the audio feature information about the audio content in the subobject.
16. The device according to claim 15, wherein the device further comprises:
a noise reduction module, configured to perform noise reduction on the audio content of the subobject before waveform recognition is performed in different frequency bands and different types of audio sets are extracted from the audio content of the subobject.
17. The device according to claim 11, wherein the device further comprises:
an audio-video separation module, configured to separate the audio content from the subobject.
18. The device according to claim 11, wherein determining the content label of each subobject according to the video feature information and the audio feature information comprises:
inputting the video feature information and the audio feature information into a deep learning model to obtain the content label of each subobject, wherein the deep learning model is trained on audio content and video content annotated with content labels.
19. The device according to claim 11, wherein the categorical matching module is further configured to classify the subobjects in the video-audio data object according to the content labels of the subobjects to generate a classified object set.
20. The device according to claim 19, wherein the content labels comprise video content labels and audio content labels; and
the categorical matching module is configured to classify the video content and/or audio content of the subobjects in the video-audio data object according to the video content labels and/or audio content labels of the subobjects, so as to obtain a video content set and/or an audio content set.
21. A device for processing video-audio data, wherein the device comprises:
a processor; and
one or more machine-readable media storing machine-readable instructions, wherein, when the processor executes the machine-readable instructions, the device is caused to execute the method according to any one of claims 1 to 10.
CN201810107188.8A 2018-02-02 2018-02-02 Video and audio data processing method and device Active CN108307229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810107188.8A CN108307229B (en) 2018-02-02 2018-02-02 Video and audio data processing method and device


Publications (2)

Publication Number Publication Date
CN108307229A true CN108307229A (en) 2018-07-20
CN108307229B CN108307229B (en) 2023-12-22

Family

ID=62850942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810107188.8A Active CN108307229B (en) 2018-02-02 2018-02-02 Video and audio data processing method and device

Country Status (1)

Country Link
CN (1) CN108307229B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101920A (en) * 2018-08-07 2018-12-28 石家庄铁道大学 Video time domain unit partioning method
CN109120996A (en) * 2018-08-31 2019-01-01 深圳市万普拉斯科技有限公司 Video information recognition methods, storage medium and computer equipment
CN109257622A (en) * 2018-11-01 2019-01-22 广州市百果园信息技术有限公司 A kind of audio/video processing method, device, equipment and medium
CN109587568A (en) * 2018-11-01 2019-04-05 北京奇艺世纪科技有限公司 Video broadcasting method, device, computer readable storage medium
CN110213670A (en) * 2019-05-31 2019-09-06 北京奇艺世纪科技有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN110234038A (en) * 2019-05-13 2019-09-13 特斯联(北京)科技有限公司 A kind of user management method based on distributed storage
CN110324726A (en) * 2019-05-29 2019-10-11 北京奇艺世纪科技有限公司 Model generation, method for processing video frequency, device, electronic equipment and storage medium
CN110677716A (en) * 2019-08-20 2020-01-10 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN111008287A (en) * 2019-12-19 2020-04-14 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium
CN111770375A (en) * 2020-06-05 2020-10-13 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
CN112487248A (en) * 2020-12-01 2021-03-12 深圳市易平方网络科技有限公司 Video file label generation method and device, intelligent terminal and storage medium
CN113095231A (en) * 2021-04-14 2021-07-09 上海西井信息科技有限公司 Video identification method, system, device and storage medium based on classified object
CN113163272A (en) * 2020-01-07 2021-07-23 海信集团有限公司 Video editing method, computer device and storage medium

Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040041127A (en) * 2004-04-23 2004-05-14 학교법인 한국정보통신학원 An intelligent agent system for providing viewer-customized video skims in digital TV broadcasting
US6829781B1 (en) * 2000-05-24 2004-12-07 At&T Corp. Network-based service to provide on-demand video summaries of television programs
CN1938714A (en) * 2004-03-23 2007-03-28 英国电讯有限公司 Method and system for semantically segmenting scenes of a video sequence
CN100538698C (en) * 2004-01-14 2009-09-09 三菱电机株式会社 Summary transcriber and summary reproducting method
JP2010039877A (en) * 2008-08-07 2010-02-18 Nippon Telegr & Teleph Corp <Ntt> Apparatus and program for generating digest content
US20100104261A1 (en) * 2008-10-24 2010-04-29 Zhu Liu Brief and high-interest video summary generation
US20120201519A1 (en) * 2011-02-03 2012-08-09 Jennifer Reynolds Generating montages of video segments responsive to viewing preferences associated with a video terminal
US20120281969A1 (en) * 2011-05-03 2012-11-08 Wei Jiang Video summarization using audio and visual cues
CN103299324A (en) * 2010-11-11 2013-09-11 谷歌公司 Learning tags for video annotation using latent subtags
US20140082663A1 (en) * 2009-05-29 2014-03-20 Cognitive Media Networks, Inc. Methods for Identifying Video Segments and Displaying Contextually Targeted Content on a Connected Television
CN103854014A (en) * 2014-02-25 2014-06-11 中国科学院自动化研究所 Terror video identification method and device based on sparse representation of context
US20140270699A1 (en) * 2013-03-14 2014-09-18 Centurylink Intellectual Property Llc Auto-Summarizing Video Content System and Method
US9002175B1 (en) * 2013-03-13 2015-04-07 Google Inc. Automated video trailer creation
US20150134673A1 (en) * 2013-10-03 2015-05-14 Minute Spoteam Ltd. System and method for creating synopsis for multimedia content
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN105611413A (en) * 2015-12-24 2016-05-25 小米科技有限责任公司 Method and device for adding video clip class markers
US9635337B1 (en) * 2015-03-27 2017-04-25 Amazon Technologies, Inc. Dynamically generated media trailers
CN106649713A (en) * 2016-12-21 2017-05-10 中山大学 Movie visualization processing method and system based on content
CN106779073A (en) * 2016-12-27 2017-05-31 西安石油大学 Media information sorting technique and device based on deep neural network
CN106878632A (en) * 2017-02-28 2017-06-20 北京知慧教育科技有限公司 A kind for the treatment of method and apparatus of video data
CN107077595A (en) * 2014-09-08 2017-08-18 谷歌公司 Selection and presentation representative frame are for video preview
CN107436921A (en) * 2017-07-03 2017-12-05 李洪海 Video data handling procedure, device, equipment and storage medium


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
B.L. TSENG et al.: "Personalized video summary using visual semantic annotations and automatic speech transcriptions", IEEE *
LAN Yijie: "Research on emotion-based video summarization", China Master's Theses Electronic Journal *
XIE Yuxiang, LUAN Xidao, WU Lingda, LAO Songyang: "NVPS: a multimodal news video processing ***", Journal of the China Society for Scientific and Technical Information, no. 04 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101920A (en) * 2018-08-07 2018-12-28 石家庄铁道大学 Video time domain unit partioning method
CN109101920B (en) * 2018-08-07 2021-06-25 石家庄铁道大学 Video time domain unit segmentation method
CN109120996A (en) * 2018-08-31 2019-01-01 深圳市万普拉斯科技有限公司 Video information recognition methods, storage medium and computer equipment
CN109257622A (en) * 2018-11-01 2019-01-22 广州市百果园信息技术有限公司 A kind of audio/video processing method, device, equipment and medium
CN109587568A (en) * 2018-11-01 2019-04-05 北京奇艺世纪科技有限公司 Video broadcasting method, device, computer readable storage medium
CN110234038B (en) * 2019-05-13 2020-02-14 特斯联(北京)科技有限公司 User management method based on distributed storage
CN110234038A (en) * 2019-05-13 2019-09-13 特斯联(北京)科技有限公司 A kind of user management method based on distributed storage
CN110324726B (en) * 2019-05-29 2022-02-18 北京奇艺世纪科技有限公司 Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium
CN110324726A (en) * 2019-05-29 2019-10-11 北京奇艺世纪科技有限公司 Model generation, method for processing video frequency, device, electronic equipment and storage medium
CN110213670A (en) * 2019-05-31 2019-09-06 北京奇艺世纪科技有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN110213670B (en) * 2019-05-31 2022-01-07 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and storage medium
CN110677716B (en) * 2019-08-20 2022-02-01 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110677716A (en) * 2019-08-20 2020-01-10 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN110930997B (en) * 2019-12-10 2022-08-16 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN111008287A (en) * 2019-12-19 2020-04-14 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium
CN111008287B (en) * 2019-12-19 2023-08-04 Oppo(重庆)智能科技有限公司 Audio and video processing method and device, server and storage medium
CN113163272B (en) * 2020-01-07 2022-11-25 海信集团有限公司 Video editing method, computer device and storage medium
CN113163272A (en) * 2020-01-07 2021-07-23 海信集团有限公司 Video editing method, computer device and storage medium
CN111770375A (en) * 2020-06-05 2020-10-13 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
US11800042B2 (en) 2020-06-05 2023-10-24 Baidu Online Network Technology (Beijing) Co., Ltd. Video processing method, electronic device and storage medium thereof
CN112487248A (en) * 2020-12-01 2021-03-12 深圳市易平方网络科技有限公司 Video file label generation method and device, intelligent terminal and storage medium
CN113095231A (en) * 2021-04-14 2021-07-09 上海西井信息科技有限公司 Video identification method, system, device and storage medium based on classified object
CN113095231B (en) * 2021-04-14 2023-04-18 上海西井信息科技有限公司 Video identification method, system, device and storage medium based on classified object

Also Published As

Publication number Publication date
CN108307229B (en) 2023-12-22

Similar Documents

Publication Publication Date Title
CN108307229A (en) A kind of processing method and equipment of video-audio data
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN109168024B (en) Target information identification method and device
US9961403B2 (en) Visual summarization of video for quick understanding by determining emotion objects for semantic segments of video
US10970334B2 (en) Navigating video scenes using cognitive insights
US20190066732A1 (en) Video Skimming Methods and Systems
WO2022184117A1 (en) Deep learning-based video clipping method, related device, and storage medium
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
US10277834B2 (en) Suggestion of visual effects based on detected sound patterns
CN115004299A (en) Classifying audio scenes using composite image features
CN110505498A (en) Processing, playback method, device and the computer-readable medium of video
CN111737516A (en) Interactive music generation method and device, intelligent sound box and storage medium
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN108921032A (en) A kind of new video semanteme extracting method based on deep learning model
WO2023197749A1 (en) Background music insertion time point determining method and apparatus, device, and storage medium
CN111797850A (en) Video classification method and device, storage medium and electronic equipment
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
WO2019127940A1 (en) Video classification model training method, device, storage medium, and electronic device
CN113923504B (en) Video preview moving picture generation method and device
CN115222858A (en) Method and equipment for training animation reconstruction network and image reconstruction and video reconstruction thereof
CN110475139B (en) Video subtitle shielding method and device, storage medium and electronic equipment
JP2005513675A (en) Moving picture shape descriptor extracting apparatus and method showing statistical characteristics of still picture shape descriptor and moving picture index system using the same
CN110275988A (en) Obtain the method and device of picture
CN116013274A (en) Speech recognition method, device, computer equipment and storage medium
CN113542874A (en) Information playing control method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant