CN109472232A - Video semantic characterization method, system and medium based on multi-modal fusion mechanism - Google Patents
- Publication number
- CN109472232A (application CN201811289502.5A)
- Authority
- CN
- China
- Prior art keywords
- video
- feature
- layer
- convolutional layer
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present disclosure provides a video semantic characterization method, system and medium based on a multi-modal fusion mechanism. Feature extraction: the visual features, speech features, motion features, text features and domain features of a video are extracted. Feature fusion: the extracted visual, speech, motion, text and domain features are fused through a constructed multi-level latent Dirichlet allocation topic model. Feature mapping: the fused features are mapped to a high-level semantic space to obtain a fused feature representation sequence. The model exploits the unique advantage of topic models in the field of semantic analysis, and the video representation obtained by training the proposed model on this basis has relatively ideal discriminability in the semantic space.
Description
Technical field
This disclosure relates to a video semantic characterization method, system and medium based on a multi-modal fusion mechanism.
Background technique
As the volume of data in the network era grows explosively, the arrival of the era of media big data has accelerated. Video, as an important carrier of multimedia information, is closely bound up with people's lives. The evolution of massive data not only demands great changes in the way data are processed, but also brings great challenges to the storage, processing and application of video. One urgent problem is how to organize and manage data effectively. As data are continuously generated, hardware limitations mean that data can only be stored in segments or in time slices, which inevitably causes information loss to varying degrees. Therefore, providing a concise and efficient data characterization method for video is significant for video analysis and for improving the efficiency of data management.
Video data has the following characteristics. 1) In data form, video data has a complex multi-modal structure; it is a non-fully-structured data stream. Each video is a streaming structure composed of a series of image frames distributed along the time axis. It exhibits multiple characteristics such as vision and motion in a spatio-temporal high-dimensional space, while also incorporating acoustic characteristics along the time span. It is highly expressive and information-dense, and its content is rich, massive and unstructured. This multi-modal character of video poses a huge challenge to video characterization. 2) In content composition, video is also strongly logical. It is composed of a series of logical units and contains rich semantic information: several consecutive frames can depict an event occurring in a specific spatio-temporal environment and thereby express specific semantic content. The diversity of video content, together with the differences and ambiguity in how video content is understood, makes feature extraction for characterizing video data difficult, so that video understanding based on semantic information is all the more challenging.
Traditional data characterization methods, such as vision-based video feature learning methods, can obtain a concise characterization of video; however, constructing good features requires considerable experience and domain expertise. The use of deep learning methods has brought remarkable breakthroughs to visual tasks, but problems such as the "semantic gap" and the "multi-modal heterogeneity gap" remain. At present, establishing an efficient characterization of video through multi-modal fusion technology is an effective way to bridge the "multi-modal heterogeneity" gap. The most natural way to understand a video is to express its content with the high-level concepts of human thinking on the basis of the multi-modal information in the video; this is also the best path across the "semantic gap". However, video analysis for a specific domain requires the comprehensive use of the corresponding domain features together with existing multi-modal fusion techniques to mine an effective representation pattern and complete the specific task. Although computer technology continues to develop, how to make computers accurately understand the semantic concepts in video remains a difficult problem.
Summary of the invention
In order to overcome the deficiencies of the prior art, the present disclosure provides a video semantic characterization method, system and medium based on a multi-modal fusion mechanism. It is an extensible general characterization model: during model training and global optimization it is extensible in the number of single-modality information sources, and the domain features contained in any type of video can be incorporated into the model. The model fully considers the relationships between the modalities and incorporates the multi-modal interaction process into the joint training and global optimization of the entire model. The model exploits the unique advantage of topic models in the field of semantic analysis, and the video representation obtained by training the proposed model on this basis has relatively ideal discriminability in the semantic space.
In order to solve the above technical problem, the disclosure adopts the following technical scheme.
As a first aspect of the disclosure, a video semantic characterization method based on a multi-modal fusion mechanism is provided.
The video semantic characterization method based on a multi-modal fusion mechanism comprises:
Feature extraction: extracting the visual features, speech features, motion features, text features and domain features of the video itself;
Feature fusion: fusing the extracted visual, speech, motion and text features and the domain features through a constructed multi-level latent Dirichlet allocation topic model;
Feature mapping: mapping the fused features to a high-level semantic space to obtain a fused feature representation sequence.
As some possible implementations, the specific steps of extracting the visual features of the video are as follows:
Preprocessing: the video is segmented into several shots, and the image frames in each shot form an image frame sequence in temporal order.
Step (a1): a deep learning neural network model is established.
The deep learning neural network model comprises: a sequentially connected input layer, first convolutional layer C1, first pooling layer S2, second convolutional layer C3, second pooling layer S4, third convolutional layer C5, fully connected layer F6 and output layer.
Step (a2): the image frame sequence of each shot of the video is input to the input layer of the deep learning neural network model, and the input layer passes the image frames to the first convolutional layer C1.
The first convolutional layer C1 convolves each frame in the image frame sequence with a group of trainable convolution kernels, averages the feature maps obtained by convolution to obtain an average feature map, then feeds the average feature map together with a bias into an activation function, and outputs a group of feature maps.
The first pooling layer S2 performs an overlapping pooling operation on the pixel values of the feature maps obtained by the first convolutional layer C1, reducing the height and width of the feature map matrices output by the first convolutional layer, and then passes the result to the second convolutional layer C3.
The second convolutional layer C3 performs a convolution operation on the result of the first pooling layer S2; the number of convolution kernels of the second convolutional layer C3 is twice the number of convolution kernels of the first convolutional layer C1.
The second pooling layer S4 performs an overlapping pooling operation on the feature maps output by the second convolutional layer C3 to reduce the size of the feature map matrices.
The third convolutional layer C5 performs a convolution operation on the result of the second pooling layer S4 with convolution kernels of the same size as the feature maps output by S4, finally obtaining several 1 × 1 feature maps.
In the fully connected layer F6, each neuron is fully connected with every neuron in the third convolutional layer C5, and the result obtained by the third convolutional layer C5 is expressed as a feature vector.
In the output layer, the feature vector output by the fully connected layer F6 is fed into a classifier for classification, and the classification accuracy is calculated. When the classification accuracy is below a set threshold, the parameters are adjusted by back-propagation and step (a2) is repeated until the classification accuracy exceeds the set threshold. When the classification accuracy exceeds the set threshold, the feature vector corresponding to that accuracy is taken as the learning result for the final video visual feature.
As some possible implementations, the specific steps of extracting the speech features of the video are as follows:
The speech signal in the video is extracted and the audio data is converted into a spectrogram. The spectrogram is taken as the input of a deep learning neural network model, which performs unsupervised learning on the audio information; a fully connected layer then yields the vector representation of the video speech features.
As some possible implementations, the specific steps of extracting the motion features of the video are as follows:
The optical flow field in the video is extracted, and a weighted statistic over the flow directions is computed to obtain the Histogram of Oriented Optical Flow (HOF) feature, which serves as the vector representation of the motion features.
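A magnitude-weighted direction histogram of the kind described (HOF) can be sketched as follows; the 8-bin direction quantization is an assumption for illustration.

```python
import numpy as np

def hof(flow, bins=8):
    # flow: (H, W, 2) array of per-pixel (dx, dy) optical flow vectors
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)                    # weight each vector by magnitude
    ang = np.arctan2(dy, dx) % (2 * np.pi)    # direction in [0, 2*pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    total = hist.sum()
    return hist / total if total > 0 else hist

flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0                            # uniform rightward motion
h = hof(flow)                                 # all mass falls in the first bin
```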
As some possible implementations, the specific steps of extracting the text features of the video are as follows:
The text in the video frames and the peripheral text information of the video (such as the video title and tags) are collected, and text features are extracted from the text information with a bag-of-words model.
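The bag-of-words conversion can be sketched as below; the vocabulary and the sample title are made-up illustrations, not part of the disclosure.

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    # count how often each vocabulary word occurs in the token stream
    counts = Counter(t for t in tokens if t in vocab)
    return [counts[w] for w in vocab]

vocab = ["goal", "corner", "penalty"]        # assumed dictionary
title = "goal corner goal replay"            # assumed peripheral text
vec = bag_of_words(title.split(), vocab)
```

The fixed-length count vector is the "numerical description that can be modeled" referred to later in the detailed description.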
Domain features refer to the set of rule features of the field to which the video belongs. For example, a football video can define scene specifications (such as front court, back court and penalty area) and events (such as shots, corner kicks and free kicks) according to the rules of football matches and broadcasting conventions. News videos have a nearly identical sequential organization and scene semantics, i.e., news shots switch chronologically between the anchor and the news reports. Advertisement videos generally contain the logo information associated with the promoted goods or services.
As some possible implementations, the specific steps of the multi-modal feature fusion are as follows:
Step (a1): with a latent Dirichlet allocation (LDA) topic model, the visual feature vector of the video is mapped from the visual feature space to the semantic feature space Γ; the input of the LDA topic model is the visual feature vector of the video, and its output is the semantic characterization on the feature space Γ.
Step (a2): with an LDA topic model, the speech feature vector of the video is mapped from the speech feature space to the semantic feature space Γ; the input of the LDA topic model is the speech feature vector of the video, and the output is the semantic characterization on the feature space Γ.
Step (a3): with an LDA topic model, the HOF feature of the video is mapped from the motion feature space to the semantic feature space Γ; the input of the LDA model is the HOF feature of the video, and the output is the semantic characterization on the feature space Γ.
Step (a4): with an LDA topic model, the text feature of the video is mapped from the text feature space to the semantic feature space Γ; the input of the LDA model is the text of the video, and the output is the semantic characterization on the feature space Γ.
Step (a5): the domain features of the video are converted into prior knowledge Ω.
Step (a6): with an LDA topic model, given the semantic characterizations of the modal features on the semantic feature space Γ obtained in steps (a1)-(a4), the weight of each modal feature is set, and the video characterization after modality fusion is obtained by weighted fusion.
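The weighted fusion of step (a6) can be sketched as a convex combination of the per-modality topic distributions on Γ. The three-topic space and the weights ρ below are illustrative assumptions; in the disclosure ρ is learned by likelihood maximization rather than fixed by hand.

```python
import numpy as np

def fuse(modal_topics, weights):
    # modal_topics: modality -> topic distribution over the shared space Γ
    # weights:      modality -> weight ρ; the result is renormalized
    dims = {len(v) for v in modal_topics.values()}
    assert len(dims) == 1, "all modalities must map into the same space Γ"
    fused = sum(weights[m] * np.asarray(v) for m, v in modal_topics.items())
    return fused / fused.sum()

topics = {"visual": [0.7, 0.2, 0.1],
          "speech": [0.5, 0.3, 0.2],
          "motion": [0.6, 0.1, 0.3]}
rho = {"visual": 0.5, "speech": 0.3, "motion": 0.2}
z = fuse(topics, rho)       # fused video characterization on Γ
```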
The procedure for finding the weight of each modal feature is as follows:
Step (a61): select a topic distribution θ | α ~ Dir(α), where α is the prior parameter of the Dirichlet prior distribution.
Step (a62): for each word in a training sample video, select a top-level topic distribution z^top; the topic obeys a multinomial distribution.
Step (a63): for each modal feature weight ρ ∈ NV = {NV modal feature dictionaries}, select a bottom-level topic distribution z^low; the topic obeys a multinomial distribution.
Step (a64): under each modal feature weight ρ, based on the selected topics and in combination with the domain knowledge Ω, generate a word from the distribution β.
For a single video d, given α, β and ρ, the topic θ and top-level topic z^top are jointly mapped from the NV single-modality spaces to the high-level semantic space. The joint distribution probability of θ and z^top is
p(θ, z^top, d | α, Ω, ρ, β) p(β) = p(θ | α) ∏_n p(z^top_n | θ) p(w_n | z^top_n, ρ, Ω, β) · ∏_η p(β_η),
where the parameters θ and z^top are hidden variables; they are eliminated by taking the marginal distribution. Here p(β_η) represents the prior interest between dictionary elements under the η-th modality space, modeled with a Gauss-Markov random field prior, namely
p(β_η) ∝ ∏_i ∏_{j ∈ Π_j} exp( −(β_i − β_j)² / σ_i² ),
where Π_j denotes the set of words with prior interest under the η-th modality space, σ_i is the smoothing factor used to adjust the prior model, and exp denotes the exponential function with the natural constant e as its base.
For a video corpus D containing M videos, the generation probability is obtained by multiplying the marginal probabilities of the M videos:
p(D | α, Ω, ρ, β) = ∏_{d=1}^{M} p(d | α, Ω, ρ, β).
The objective function is set as the likelihood function of D, i.e.
ℒ = log_a p(D | α, Ω, ρ, β).
When the likelihood function of D is maximized, the corresponding parameter ρ is exactly the weight of each single-modality feature; log denotes the logarithm with base a, and ℒ denotes the likelihood function.
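The generative story of steps (a61)-(a64) can be sketched as an ancestral sampler. The dimensions (K = 3 topics, NV = 2 modalities, V = 5 dictionary entries) are illustrative assumptions, and the domain prior Ω is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_video(alpha, rho, beta, n_words):
    # alpha: Dirichlet prior over the K top-level topics
    # rho:   modality weights (length NV, summing to 1)
    # beta:  (NV, K, V) per-modality, per-topic word distributions
    theta = rng.dirichlet(alpha)                      # step (a61): θ|α ~ Dir(α)
    words = []
    for _ in range(n_words):
        z_top = rng.choice(len(theta), p=theta)       # step (a62): top-level topic
        m = rng.choice(len(rho), p=rho)               # step (a63): modality chosen by ρ
        w = rng.choice(beta.shape[2], p=beta[m, z_top])  # step (a64): emit a word
        words.append((m, w))
    return theta, words

K, NV, V = 3, 2, 5
beta = rng.dirichlet(np.ones(V), size=(NV, K))
theta, words = generate_video(np.ones(K), [0.6, 0.4], beta, 10)
```

Fitting the model inverts this process: ρ is adjusted until the likelihood of the observed words is maximized.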
As a second aspect of the disclosure, a video semantic characterization system based on a multi-modal fusion mechanism is provided.
The video semantic characterization system based on a multi-modal fusion mechanism comprises a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of any of the above methods are completed.
As a third aspect of the disclosure, a computer-readable storage medium is provided.
The computer-readable storage medium stores computer instructions; when the computer instructions are run by a processor, the steps of any of the above methods are completed.
Compared with the prior art, the beneficial effects of the disclosure are:
(1) The disclosure primarily studies a video semantic characterization method based on a multi-modal fusion mechanism, comprehensively using algorithms from related fields such as image processing, pattern recognition and machine learning to process the sequence information in video. It provides a new research perspective and theoretical reference for video characterization analysis in different domains.
(2) Traditional methods and deep learning methods are combined to study the efficient characterization of video at the semantic level, effectively narrowing the "multi-modal gap" and the "semantic gap" that are ubiquitous in video understanding.
(3) A deep visual feature learning model based on an adaptive learning mechanism is proposed. The adaptivity of the learning system is mainly manifested in two aspects: first, shot detection technology makes the input of the deep model a frame sequence of variable length, and the number of frames can be adjusted adaptively according to the length of the shot; second, in the pooling layer S2, the size and stride of the pooling window are computed dynamically according to the scale of the feature map, so as to guarantee that the data characterization dimension of all shot videos is consistent.
(4) A shot-adaptive 3D deep learning neural network is designed, algorithms for automatically learning the visual features of video are studied, the performance of the classifier is improved and the parameters of the whole system are optimized, so as to characterize the visual information of video in the most effective way.
(5) A multi-modal multi-level topic characterization model is proposed. The main characteristics of the model lie in three aspects: first, it is an extensible general characterization model, the number of single-modality information sources is extensible, and the domain features contained in any type of video can be incorporated into the model, improving the specificity of the video characterization capability; second, the model fully considers the relationships between the modalities and incorporates the multi-modal interaction process into the joint training and global optimization of the entire model; third, exploiting the unique advantage of topic models in the field of semantic analysis, the video representation obtained by training has relatively ideal discriminability in the semantic space and can effectively yield a concise representation of video.
Brief description of the drawings
The accompanying drawings, which constitute a part of this application, are used to provide a further understanding of the application; the illustrative embodiments of the application and their explanations are used to explain the application and do not constitute an undue limitation on the application.
Fig. 1 shows the shot-adaptive 3D deep learning framework;
Fig. 2 shows the adaptive 3D convolution process;
Fig. 3 shows the process of convolution computation with a convolution kernel;
Fig. 4 shows the overall framework of the video multi-modal fusion mechanism;
Fig. 5 shows the multi-modal multi-level topic generation model.
Detailed description of the embodiments
It is noted that the following detailed description is illustrative and is intended to provide further instruction for the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the technical field to which the application belongs.
It should be noted that the terms used herein are merely for describing specific embodiments and are not intended to restrict the illustrative embodiments according to the application. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. Additionally, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or their combinations.
The disclosure first proposes a spatio-temporal feature learning model with an adaptive frame selection mechanism to obtain the visual features of video. On this basis, a model that can effectively fuse the visual features with the other modal features in combination with the domain features is further proposed, realizing the semantic characterization of video.
To achieve the above purpose, the video characterization model described in the disclosure combines traditional methods and deep learning methods, comprehensively using the advantages of traditional feature extraction techniques, deep learning mechanisms and topic model theory to study the multi-modal fusion mechanism of video, and further studies the efficient characterization of video at the semantic level.
The specific research technical scheme is as follows:
First, the spatio-temporal information representation learning mechanism of video is analyzed in depth, and an efficient characterization of the visual information of the video is obtained on the basis of guaranteeing the continuity and integrity of the spatio-temporal information. Then, the fusion mechanism of multi-modal information is studied while the domain features of the video are incorporated into the multi-modal information fusion of the video; finally, a set of domain video semantic representation models is established.
(1) Automatic learning of the spatio-temporal deep features of video
A video shot feature learning model with strong data fitting and learning capability is designed, capable of giving full play to the advantage of layer-by-layer feature extraction. With shot detection technology, taking the shot length as the adaptive unit, the spatio-temporal sequence information contained in a video shot is mined in a layer-by-layer extraction manner. A shot-adaptive 3D deep learning network model is therefore designed (see Fig. 1).
The process is as follows:
Step 1: shot segmentation is performed on the video with shot detection technology.
Step 2: a group of video frames of a shot is taken as the input of the model; the information is successively passed to different layers, and each layer obtains the most significant feature information of the observed data on different categories through a group of filters.
Step 3: the pixel values of all final shot frames are rasterized and concatenated into a vector.
The above adaptive 3D convolution process is embodied in the C1 convolutional layer. Fig. 2 shows the process of convolving a shot containing L frames: taking the L-frame sequence as input, a group of learnable filters convolves the corresponding positions of the different frames; the resulting neurons are then merged by averaging, and finally one group of feature maps is output through an activation function. For video frames, the spatial relationships inside a frame can be considered local, so each neuron is set to perceive only a local region.
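The per-frame convolution followed by averaging over the L frames and an activation, as described for C1, can be sketched as follows; tanh as the activation function and the tiny kernel are illustrative assumptions.

```python
import numpy as np

def adaptive_c1(frames, kernel):
    # frames: (L, H, W) shot of variable length L; kernel: (k, k) learnable filter
    L, H, W = frames.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for f in frames:                     # convolve corresponding positions of each frame
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] += np.sum(f[i:i + k, j:j + k] * kernel)
    out /= L                             # merge the L per-frame responses by averaging
    return np.tanh(out)                  # activation yields one feature map

shot = np.ones((3, 4, 4))                # a 3-frame shot of 4x4 frames
fmap = adaptive_c1(shot, np.ones((2, 2)))
```

Because the averaging runs over however many frames the shot contains, the same filter bank accepts shots of any length L.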
In the convolution process, the neuron weights within the same plane layer are shared; Fig. 3 shows the process of nonlinear transformation with a convolution kernel.
In Fig. 3, w = (w1, w2, …, wK) denotes the weights of the convolution kernel in the convolutional layer, a group of learnable parameters; a = (a1, a2, …, aL) is the local receptive field at the corresponding position in the L frames, where ai = (ai1, ai2, …, aiK). When a convolution kernel executes a convolution operation on an image, what participates in the operation is some region of the input image, and the size of this region is precisely the receptive field.
In the S2 pooling layer, the pixel units of the feature maps obtained by the C1 convolutional layer are first weighted, and the operation result is then passed on to the next layer through a nonlinear function. The pooling operation is implemented with overlapping pooling, i.e., the stride is set smaller than the pooling window size.
Considering that the sizes of video frames from different data sources differ, the feature map sizes obtained after the C1 convolution operation may be inconsistent, which would cause the feature dimensions of the video shots at the fully connected layer to differ and thus lead to inconsistent data characterizations. The strategy taken in the disclosure is to compute the size and stride of the pooling window dynamically according to the scale of the feature map, so as to guarantee the dimensional consistency of the characterization vectors of all video shots.
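The dynamic pooling-window strategy can be sketched as choosing, for a feature map of any size, a window and stride that land exactly on a fixed output size. The formula below is one way to satisfy that constraint and is an illustrative assumption, not the disclosure's exact rule.

```python
def adaptive_pool_params(in_size, out_size):
    # pick stride and window so pooling maps in_size to exactly out_size;
    # when window > stride the pooling regions overlap, as in S2/S4
    stride = in_size // out_size
    window = in_size - (out_size - 1) * stride
    return window, stride

def pooled_size(in_size, window, stride):
    return (in_size - window) // stride + 1

# feature maps of different sizes all pool down to the same 4 units
params = {size: adaptive_pool_params(size, 4) for size in (13, 17, 28)}
```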
In the C3 convolutional layer, convolution kernels numbering twice those of the C1 convolutional layer act on the S2 layer, with the purpose of decreasing the spatial resolution from layer to layer so that more feature information can be detected. The S4 pooling layer uses an operation similar to S2, with the purpose of passing the operation result on to the next layer through down-sampling. In the C5 layer, convolution kernels of the same size as the feature maps obtained by the S4 layer are used to convolve them, finally obtaining a series of feature maps of size 1 × 1. F6 is a fully connected layer: by connecting each neuron with all neurons in C5, the input is ultimately expressed as a feature vector of a certain length, achieving the characterization purpose. The features obtained by training are then fed into a classifier, and the performance of the classifier is further improved to optimize the parameters of the whole system, so as to characterize the visual information of the video in the most effective way.
(2) Multi-modal information fusion for video characterization
In the preprocessing stage of the multi-modal feature fusion of video, each modal feature of the video needs to be extracted and characterized separately. In general, video features fall into two major classes. One class is generic features, comprising:
1) visual features containing time series, including time-dimension and space-dimension information;
2) text features, including the text in video frames and the text on the video periphery; with a bag-of-words model the text information is converted into a numerical description that can be modeled;
3) motion features: the optical flow information in the video is extracted first and then described with the Histogram of Oriented Optical Flow (HOF);
4) audio features: the audio information in the video is first converted into a spectrogram, which is then taken as input; unsupervised learning is carried out by fine-tuning an existing network model to obtain a vector representing the speech information.
The other class is domain features, which are related to the video genre and the specific application field.
The processing of the modal features of video is not a simple combination of the features of each class, but the interaction and fusion of several different modal features. The disclosure takes latent Dirichlet allocation, popular in topic modeling, as the point of entry, and synthesizes theories from disciplines such as machine learning, image processing and speech recognition to fuse the above modal information of video. By constructing a multi-level topic model, the video data is organically mapped from each modal space together with the domain features to an upper-level space, and the video-level characterization sequence on that upper-level space is obtained. Fig. 4 shows the overall framework of the multi-modal fusion mechanism.
For (2) described above, the multi-modal information fusion for video characterization requires extracting each modal feature of the video separately and then constructing a multi-level topic model for multi-feature fusion. The process is as follows:
(1) Each modal feature of the video is extracted separately, i.e., the visual information, speech information, motion information and text information of the video are extracted respectively.
For visual information, a shot (a group of video frames) is taken as the unit of input to the model, and unsupervised learning on the visual information is carried out by fine-tuning an existing network model such as AlexNet or GoogleNet; finally, the pixel values of all shot frames are rasterized and concatenated into a vector. For speech information, the audio data is converted into a spectrogram, the spectrogram information is taken as the input of the model, unsupervised learning on the audio information is then carried out by fine-tuning an existing network model, and a vector is obtained through a fully connected layer. For motion information, the optical flow information in the video is extracted first and then characterized with the Histogram of Oriented Optical Flow (HOF). For text information, including the text in video frames and the text on the video periphery, a bag-of-words model is adopted to convert the text information into a numerical description that can be modeled.
(2) Construct a multi-level topic model to realize multi-feature fusion.
By constructing a multi-level topic model (Fig. 5), the video data is organically mapped from each modality's feature space, together with the domain features, into a single high-level semantic space, realizing multi-modal fusion. The specific implementation is as follows:
The model assumes the corpus contains M videos, denoted D = {d1, d2, …, dM}. Each video di (1 ≤ i ≤ M) contains a group of latent topics; these topics are considered to be generated in a high-level semantic space by organically mapping the dictionary elements of each modality space, according to a certain distribution, under certain prior conditions. The model takes a video as its processing unit and involves topic models at two levels; it realizes multi-modal information fusion with the domain features as prior knowledge, and finally yields a topic representation in vector form. The two levels of topics are denoted Ztop and Zlow respectively: Ztop denotes the topics after video fusion and Zlow the topics before fusion; the former is composed from the latter according to a multinomial distribution with parameter ρ, and the parameter Ω corresponds to the domain features of the video. The model assumes that words in different modality spaces are independent and identically distributed. The visual feature dictionary is constructed with K-means clustering, in a bag-of-words manner.
The model parameter θ follows a Dirichlet distribution with hyperparameter α and represents the topic distribution contained in the video currently being processed. The parameter NV is the number of modalities, and β denotes the dictionaries of the different modality spaces. By solving for the parameter ρ, the model addresses the weighting of the different modalities when the multi-modal space is mapped to the semantic space.
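The K-means bag-of-words construction of the visual dictionary mentioned above can be sketched as follows. This is a minimal pure-NumPy illustration under assumed settings: the deterministic initialization (k descriptors spread evenly through the data), the dictionary size, and the toy descriptors are all illustrative choices, not values fixed by the disclosure.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal K-means; returns (k, d) cluster centers ("dictionary words").

    Deterministic init: k descriptors spread evenly through X
    (an illustrative choice; the disclosure does not fix the scheme).
    """
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign each descriptor to its nearest center, then re-estimate
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers

def bow_histogram(X, centers):
    """Quantize descriptors against the dictionary, count, L1-normalize."""
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    counts = np.bincount(dist.argmin(1), minlength=len(centers))
    return counts / counts.sum()

# Two well-separated descriptor clouds -> a 2-word visual dictionary.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 4)), rng.normal(5.0, 0.1, (50, 4))])
centers = kmeans(X, k=2)
print(bow_histogram(X, centers))   # roughly equal weight on both words
```

Each video (or shot) would then be represented in the visual modality by such a normalized word-count histogram over the learned dictionary.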
The generative process of each video in the corpus is as follows:
Step 1: choose a topic distribution θ | α ~ Dir(α), where α is the hyperparameter of the Dirichlet prior;
Step 2: for each word in the video, choose a top-level topic ztop; the topic follows a multinomial distribution;
Step 3: in modality space p (one of the NV modality spaces), choose a bottom-level topic zlow; the topic follows a multinomial distribution;
Step 4: given the chosen topic, generate a word from the corresponding dictionary distribution.
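The four steps above can be simulated numerically. The following is a hedged sketch of the two-level generative process using standard Dirichlet and multinomial sampling; the topic counts K_top and K_low, the per-modality vocabulary size V, and all parameter values are illustrative assumptions, not values fixed by the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

NV = 4                 # number of modality spaces (visual, speech, motion, text)
K_top, K_low = 3, 5    # top-level / bottom-level topic counts (assumed)
V = 20                 # dictionary size per modality (assumed)

alpha = np.full(K_top, 0.5)                          # Dirichlet hyperparameter
rho = rng.dirichlet(np.ones(K_low), size=K_top)      # Ztop -> Zlow mixing (param rho)
beta = rng.dirichlet(np.ones(V), size=(NV, K_low))   # per-modality dictionaries (param beta)

def generate_video(n_words=50):
    theta = rng.dirichlet(alpha)                 # Step 1: theta | alpha ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z_top = rng.choice(K_top, p=theta)       # Step 2: top-level topic
        p = rng.integers(NV)                     # pick one of the NV modality spaces
        z_low = rng.choice(K_low, p=rho[z_top])  # Step 3: bottom-level topic
        w = rng.choice(V, p=beta[p, z_low])      # Step 4: word from that dictionary
        words.append((p, w))
    return words

doc = generate_video()
print(len(doc), doc[:3])
```

Inference in the patent runs this process in reverse, estimating ρ, β, and θ from observed multi-modal words; the sketch only illustrates the forward (generative) direction.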
For a single video d, given the parameters α, β, ρ and combined with the domain knowledge Ω, the joint probability distribution of the topic distribution θ and the top-level topics ztop in the model, when jointly mapped from the NV modality spaces into the upper-level space, is as follows. The parameters θ and ztop are hidden variables and can be eliminated by computing the marginal distribution.
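The joint distribution referred to here appeared as a display equation in the original filing and does not survive in this text copy. By analogy with standard LDA, and using only the symbols already defined above (θ, ztop, zlow, ρ, β, α, Ω), a plausible reconstruction is the following; the exact factorization in the granted patent may differ:

```latex
p(\theta, z^{top}, z^{low}, w \mid \alpha, \beta, \rho, \Omega)
  = p(\theta \mid \alpha)
    \prod_{n=1}^{N} p(z^{top}_{n} \mid \theta)\,
                    p(z^{low}_{n} \mid z^{top}_{n}, \rho)\,
                    p(w_{n} \mid z^{low}_{n}, \beta_{\eta_{n}}, \Omega)
```

Here N is the number of words in the video and η_n indexes the modality space from which word n is drawn.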
In the above, p(βη) denotes the prior interest among the dictionary elements of the η-th modality space, for which a typical Gauss-Markov random field prior model is adopted, namely:
In that formula, Πj denotes the set of words with prior interest in the η-th modality space, and σi is the smoothing factor of the model, used to adjust the prior.
For a video corpus D containing M videos, the likelihood can be obtained as the product of the marginal probabilities of the M videos.
The aim is to find parameters α, ρ, β that maximize the likelihood function of the corpus; that is, the objective function is expressed as maximizing this likelihood.
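The corpus likelihood and the objective were also display equations in the original. A hedged reconstruction consistent with the surrounding text (a product of per-video marginals, maximized over α, ρ, β) is:

```latex
\mathcal{L}(\alpha,\rho,\beta)
  = \sum_{i=1}^{M} \log p(d_i \mid \alpha, \rho, \beta, \Omega),
\qquad
(\alpha^{*},\rho^{*},\beta^{*})
  = \arg\max_{\alpha,\rho,\beta} \mathcal{L}(\alpha,\rho,\beta)
```

The log of the product of marginals is written as a sum of log-marginals, as is conventional for likelihood maximization.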
By solving the above model, the multi-modal features and the video's domain features can be organically fused, finally obtaining the semantic representation of the video.
The general idea of the above process is shown in Fig. 4. To summarize, the multi-modal multi-level topic model proposed by this disclosure has the following characteristics relative to existing methods:
1) It is an extensible, general representation model: during model training and global optimization, not only is the number of single-modality information sources extensible, but the domain features contained in any type of video can be incorporated into the model, improving the specificity of the video representation;
2) The model fully considers the relationships among the modalities, incorporating the multi-modal interaction into the joint training and global optimization of the entire model;
3) Topic models have a unique advantage in the field of semantic analysis; on this basis, the video representation obtained by training the proposed model has relatively ideal discriminability in the semantic space, which is also one of the effective ways to obtain a concise representation of a video.
The foregoing are merely preferred embodiments of the present application and are not intended to limit it; for those skilled in the art, various changes and modifications to this application are possible. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this application shall be included within its scope of protection.
Claims (9)
1. A video semantic representation method based on a multi-modal fusion mechanism, characterized by comprising:
feature extraction: extracting the visual feature, speech feature, motion feature, and text feature of the video, together with the domain features of the video itself;
feature fusion: fusing the extracted visual, speech, motion, and text features with the domain features by constructing a multi-level latent Dirichlet allocation topic model;
feature mapping: mapping the fused features into a high-level semantic space to obtain a fused feature representation sequence.
2. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of extracting the visual feature of the video are:
preprocessing: segmenting the video into several shots; the image frames in each shot form an image frame sequence in temporal order;
step (a1): establishing a deep-learning neural network model;
the deep-learning neural network model comprises, connected in sequence: an input layer, a first convolutional layer C1, a first pooling layer S2, a second convolutional layer C3, a second pooling layer S4, a third convolutional layer C5, a fully connected layer F6, and an output layer;
step (a2): inputting the image frame sequence of each shot of the video into the input layer of the deep-learning neural network model; the input layer passes the image frames to the first convolutional layer C1;
the first convolutional layer C1 convolves each frame image in the image frame sequence with a group of trainable convolution kernels, averages the feature maps obtained by the convolution to obtain an average feature map, and then inputs the average feature map, together with a bias, into an activation function, outputting a group of feature maps;
the first pooling layer S2 performs an overlapping pooling operation on the pixel values of each pixel of the feature maps produced by the first convolutional layer C1, reducing the height and width of the feature-map matrices output by the first convolutional layer, and then passes the result to the second convolutional layer C3;
the second convolutional layer C3 performs a convolution operation on the result of the first pooling layer S2; the number of convolution kernels of the second convolutional layer C3 is twice the number of convolution kernels of the first convolutional layer C1;
the second pooling layer S4 performs an overlapping pooling operation on the feature maps output by the second convolutional layer C3 to reduce the size of the feature-map matrices;
the third convolutional layer C5 performs a convolution operation on the result of the second pooling layer S4 with convolution kernels of the same size as the second pooling layer S4, finally obtaining several feature maps of 1 × 1 pixel;
the fully connected layer F6, in which each neuron is fully connected to each neuron in the third convolutional layer C5, expresses the result obtained by the third convolutional layer C5 as a feature vector;
the output layer inputs the feature vector output by the fully connected layer F6 into a classifier for classification and computes the classification accuracy; when the classification accuracy is below a set threshold, the parameters are adjusted by backpropagation and step (a2) is repeated until the classification accuracy exceeds the set threshold; when the classification accuracy exceeds the set threshold, the feature vector corresponding to that accuracy is taken as the final learning result of the video's visual feature.
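The overlapping pooling used by layers S2 and S4 above can be sketched as follows. A window size of 3 with stride 2 (window larger than stride, which is what makes the pooling "overlapping") is an illustrative choice; the claim does not fix these values.

```python
import numpy as np

def overlapping_max_pool(fmap, size=3, stride=2):
    """Max-pool a 2-D feature map with stride < window size (overlapping).

    Neighbouring windows share rows/columns, so each reduces the
    feature-map height and width while reusing some pixels.
    """
    h, w = fmap.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i * stride:i * stride + size,
                          j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.arange(25, dtype=float).reshape(5, 5)
print(overlapping_max_pool(fmap))   # 5x5 map -> 2x2 map
```

With stride 2 and window 3, adjacent pooling windows overlap by one row or column, which is the distinction from non-overlapping pooling (where stride equals window size).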
3. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of extracting the speech feature of the video are:
extracting the speech signal in the video, converting the audio data into a spectrogram, and using the spectrogram as the input of a deep-learning neural network model; unsupervised learning is then performed on the audio information by the deep-learning neural network model and, through a fully connected layer, a vector representation of the video's speech feature is obtained.
4. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of extracting the motion feature of the video are:
extracting the optical flow field in the video and computing a magnitude-weighted statistic over the flow directions to obtain a histogram-of-optical-flow-directions feature, which serves as the vector representation of the motion feature.
5. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of extracting the text feature of the video are:
collecting the text in the video frames and the text surrounding the video, and extracting the text feature from this text information with a bag-of-words model.
6. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the domain features refer to the set of rules possessed by the field to which the video belongs.
7. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of the multi-modal feature fusion are:
step (a1): using an LDA (latent Dirichlet allocation) topic model to map the visual feature vector of the video from the visual feature space into the semantic feature space Γ; the input of the LDA topic model is the visual feature vector of the video, and its output is the semantic representation in the feature space Γ;
step (a2): using an LDA topic model to map the speech feature vector of the video from the speech feature space into the semantic feature space Γ; the input of the LDA topic model is the speech feature vector of the video, and the output is the semantic representation in the feature space Γ;
step (a3): using an LDA topic model to map the histogram-of-optical-flow-directions feature of the video from the motion feature space into the semantic feature space Γ; the input of the LDA is the histogram feature of the video, and the output is the semantic representation in the feature space Γ;
step (a4): using an LDA topic model to map the text feature of the video from the text feature space into the semantic feature space Γ; the input of the LDA is the text of the video, and the output is the semantic representation in the feature space Γ;
step (a5): converting the video's domain features into prior knowledge Ω;
step (a6): using an LDA topic model, setting a weight for each modality's semantic representation in the feature space Γ obtained in steps (a1)-(a4), and obtaining the fused video representation by weighted fusion across the modalities.
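The weighted fusion of step (a6) can be sketched as a convex combination of the per-modality semantic vectors over the shared space Γ. This is a minimal pure-NumPy illustration; the claim does not state how the weights are chosen, so the fixed weight values below are an assumption, as are the toy three-topic vectors.

```python
import numpy as np

def fuse(modal_reprs, weights):
    """Weighted fusion of per-modality semantic vectors over space Gamma.

    modal_reprs: dict modality -> 1-D topic/semantic vector (same length).
    weights:     dict modality -> non-negative weight; normalized so the
                 weights sum to 1, making the result a convex combination.
    """
    total = sum(weights.values())
    return sum((weights[m] / total) * np.asarray(v)
               for m, v in modal_reprs.items())

# Assumed per-modality topic distributions over a 3-topic space Gamma.
reprs = {
    "visual": np.array([0.6, 0.3, 0.1]),
    "speech": np.array([0.2, 0.5, 0.3]),
    "motion": np.array([0.4, 0.4, 0.2]),
    "text":   np.array([0.1, 0.2, 0.7]),
}
w = {"visual": 0.4, "speech": 0.2, "motion": 0.2, "text": 0.2}
print(fuse(reprs, w))
```

Because each modality vector is itself a distribution over Γ and the weights sum to 1, the fused vector is again a distribution over Γ.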
8. A video semantic representation system based on a multi-modal fusion mechanism, characterized by comprising: a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of the method of any one of claims 1-7 are completed.
9. A computer-readable storage medium, characterized in that computer instructions are stored thereon; when the computer instructions are run by a processor, the steps of the method of any one of claims 1-7 are completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811289502.5A CN109472232B (en) | 2018-10-31 | 2018-10-31 | Video semantic representation method, system and medium based on multi-mode fusion mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109472232A true CN109472232A (en) | 2019-03-15 |
CN109472232B CN109472232B (en) | 2020-09-29 |
Family
ID=65666408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811289502.5A Active CN109472232B (en) | 2018-10-31 | 2018-10-31 | Video semantic representation method, system and medium based on multi-mode fusion mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109472232B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7920761B2 (en) * | 2006-08-21 | 2011-04-05 | International Business Machines Corporation | Multimodal identification and tracking of speakers in video |
CN103778443A (en) * | 2014-02-20 | 2014-05-07 | 公安部第三研究所 | Method for achieving scene analysis description based on theme model method and field rule library |
CN104090971A (en) * | 2014-07-17 | 2014-10-08 | 中国科学院自动化研究所 | Cross-network behavior association method for individual application |
CN105760507A (en) * | 2016-02-23 | 2016-07-13 | 复旦大学 | Cross-modal subject correlation modeling method based on deep learning |
Non-Patent Citations (3)
Title |
---|
Liu Zheng et al.: "MMDF-LDA: An improved Multi-Modal Latent Dirichlet Allocation model for social image annotation", Expert Systems with Applications * |
Qin Jin et al.: "Describing Videos using Multi-modal Fusion", Proceedings of the 24th ACM International Conference on Multimedia * |
Zhang De et al.: "Video multi-modal content analysis technology based on unified semantic-space representation", Video Technology * |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162669A (en) * | 2019-04-04 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Visual classification processing method, device, computer equipment and storage medium |
CN110162669B (en) * | 2019-04-04 | 2021-07-02 | 腾讯科技(深圳)有限公司 | Video classification processing method and device, computer equipment and storage medium |
CN110046279A (en) * | 2019-04-18 | 2019-07-23 | 网易传媒科技(北京)有限公司 | Prediction technique, medium, device and the calculating equipment of video file feature |
CN110046279B (en) * | 2019-04-18 | 2022-02-25 | 网易传媒科技(北京)有限公司 | Video file feature prediction method, medium, device and computing equipment |
CN110234018A (en) * | 2019-07-09 | 2019-09-13 | 腾讯科技(深圳)有限公司 | Multimedia content description generation method, training method, device, equipment and medium |
CN110234018B (en) * | 2019-07-09 | 2022-05-31 | 腾讯科技(深圳)有限公司 | Multimedia content description generation method, training method, device, equipment and medium |
CN110390311A (en) * | 2019-07-27 | 2019-10-29 | 苏州过来人科技有限公司 | A kind of video analysis algorithm based on attention and subtask pre-training |
CN110399934A (en) * | 2019-07-31 | 2019-11-01 | 北京达佳互联信息技术有限公司 | A kind of video classification methods, device and electronic equipment |
CN110647804A (en) * | 2019-08-09 | 2020-01-03 | 中国传媒大学 | Violent video identification method, computer system and storage medium |
CN110489593A (en) * | 2019-08-20 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Topic processing method, device, electronic equipment and the storage medium of video |
CN110580509A (en) * | 2019-09-12 | 2019-12-17 | 杭州海睿博研科技有限公司 | multimodal data processing system and method for generating countermeasure model based on hidden representation and depth |
CN110674348A (en) * | 2019-09-27 | 2020-01-10 | 北京字节跳动网络技术有限公司 | Video classification method and device and electronic equipment |
CN110674348B (en) * | 2019-09-27 | 2023-02-03 | 北京字节跳动网络技术有限公司 | Video classification method and device and electronic equipment |
WO2021134277A1 (en) * | 2019-12-30 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Emotion recognition method, intelligent device, and computer-readable storage medium |
CN113094550B (en) * | 2020-01-08 | 2023-10-24 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN113094550A (en) * | 2020-01-08 | 2021-07-09 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN111414959B (en) * | 2020-03-18 | 2024-02-02 | 南京星火技术有限公司 | Image recognition method, device, computer readable medium and electronic equipment |
CN111401259B (en) * | 2020-03-18 | 2024-02-02 | 南京星火技术有限公司 | Model training method, system, computer readable medium and electronic device |
CN111414959A (en) * | 2020-03-18 | 2020-07-14 | 南京星火技术有限公司 | Image recognition method and device, computer readable medium and electronic equipment |
CN111401259A (en) * | 2020-03-18 | 2020-07-10 | 南京星火技术有限公司 | Model training method, system, computer readable medium and electronic device |
CN111235709A (en) * | 2020-03-18 | 2020-06-05 | 东华大学 | Online detection system for spun yarn evenness of ring spinning based on machine vision |
CN111723239A (en) * | 2020-05-11 | 2020-09-29 | 华中科技大学 | Multi-mode-based video annotation method |
CN111723239B (en) * | 2020-05-11 | 2023-06-16 | 华中科技大学 | Video annotation method based on multiple modes |
JP7334395B2 (en) | 2021-03-05 | 2023-08-29 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Video classification methods, devices, equipment and storage media |
JP2022135930A (en) * | 2021-03-05 | 2022-09-15 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Video classification method, apparatus, device, and storage medium |
WO2022198854A1 (en) * | 2021-03-24 | 2022-09-29 | 北京百度网讯科技有限公司 | Method and apparatus for extracting multi-modal poi feature |
CN113177914B (en) * | 2021-04-15 | 2023-02-17 | 青岛理工大学 | Robot welding method and system based on semantic feature clustering |
CN113177914A (en) * | 2021-04-15 | 2021-07-27 | 青岛理工大学 | Robot welding method and system based on semantic feature clustering |
CN113239184A (en) * | 2021-07-09 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Knowledge base acquisition method and device, computer equipment and storage medium |
CN113806609B (en) * | 2021-09-26 | 2022-07-12 | 郑州轻工业大学 | Multi-modal emotion analysis method based on MIT and FSM |
CN113806609A (en) * | 2021-09-26 | 2021-12-17 | 郑州轻工业大学 | Multi-modal emotion analysis method based on MIT and FSM |
CN113903358B (en) * | 2021-10-15 | 2022-11-04 | 贝壳找房(北京)科技有限公司 | Voice quality inspection method, readable storage medium and computer program product |
CN113903358A (en) * | 2021-10-15 | 2022-01-07 | 北京房江湖科技有限公司 | Voice quality inspection method, readable storage medium and computer program product |
CN117933269A (en) * | 2024-03-22 | 2024-04-26 | 合肥工业大学 | Multi-mode depth model construction method and system based on emotion distribution |
Also Published As
Publication number | Publication date |
---|---|
CN109472232B (en) | 2020-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109472232A (en) | Video semantic representation method, system and medium based on multi-modal fusion mechanism | |
Yao et al. | Exploring visual relationship for image captioning | |
Ge et al. | Facial expression recognition based on deep learning | |
Kontschieder et al. | Structured class-labels in random forests for semantic image labelling | |
CN108829677A (en) | A kind of image header automatic generation method based on multi-modal attention | |
Banica et al. | Video object segmentation by salient segment chain composition | |
Bin et al. | Adaptively attending to visual attributes and linguistic knowledge for captioning | |
CN108427740B (en) | Image emotion classification and retrieval algorithm based on depth metric learning | |
GB2565775A (en) | A Method, an apparatus and a computer program product for object detection | |
Zhu et al. | Efficient action detection in untrimmed videos via multi-task learning | |
He et al. | Moving object recognition using multi-view three-dimensional convolutional neural networks | |
CN110956158A (en) | Pedestrian shielding re-identification method based on teacher and student learning frame | |
CN108154156B (en) | Image set classification method and device based on neural topic model | |
Islam et al. | A review on video classification with methods, findings, performance, challenges, limitations and future work | |
Liu et al. | Robust salient object detection for RGB images | |
CN107506792A (en) | A kind of semi-supervised notable method for checking object | |
Kindiroglu et al. | Temporal accumulative features for sign language recognition | |
Meena et al. | A review on video summarization techniques | |
Tighe et al. | Scene parsing with object instance inference using regions and per-exemplar detectors | |
Liu et al. | Application of gcForest to visual tracking using UAV image sequences | |
Li et al. | Research on efficient feature extraction: Improving YOLOv5 backbone for facial expression detection in live streaming scenes | |
Luo et al. | An optimization framework of video advertising: using deep learning algorithm based on global image information | |
Wu et al. | Self-learning and explainable deep learning network toward the security of artificial intelligence of things | |
Yang et al. | Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system | |
Shi et al. | Uncertain and biased facial expression recognition based on depthwise separable convolutional neural network with embedded attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2021-05-06
Address after: Room 1605, Kangzhen Building, 18 Louyang Road, Suzhou Industrial Park, 215000, Jiangsu Province
Patentee after: Suzhou Wuyun Pen and Ink Education Technology Co., Ltd.
Address before: No. 1 Daxue Road, University Science Park, Changqing District, Jinan City, Shandong Province
Patentee before: SHANDONG NORMAL University