CN109472232A - Video semantic characterization method, system and medium based on multi-modal fusion mechanism - Google Patents
- Publication number
- CN109472232A (application CN201811289502.5A)
- Authority
- CN
- China
- Prior art keywords
- video
- feature
- layer
- convolutional layer
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The present disclosure provides a video semantic characterization method, system and medium based on a multi-modal fusion mechanism. Feature extraction: the visual features, speech features, motion features, text features and domain features of a video are extracted. Feature fusion: the extracted visual, speech, motion, text and domain features are fused through a constructed multi-level latent Dirichlet allocation topic model. Feature mapping: the fused features are mapped to a high-level semantic space to obtain a fused feature representation sequence. The model exploits the unique advantage of topic models in the field of semantic analysis, and the video representation obtained by training the proposed model on this basis has relatively ideal discriminability in the semantic space.
Description
Technical field
This disclosure relates to a video semantic characterization method, system and medium based on a multi-modal fusion mechanism.
Background technique
As the volume of data in the network era grows explosively, the arrival of the era of media big data has accelerated. Video, as an important carrier of multimedia information, is closely bound up with people's lives. The evolution of massive data not only demands great changes in the way data are processed, but also brings great challenges to the storage, processing and application of video. One urgent problem is how to organize and manage data effectively. As data are continuously generated, hardware limitations mean that data can only be stored in segments or in time slices, which inevitably causes information loss to varying degrees. Therefore, providing a concise and efficient data characterization method for video is significant for video analysis and for improving the efficiency of data management.
Video data has the following characteristics. 1) In data form, video data has a complex multi-modal structure; it is a non-fully-structured data stream. Each video is a streaming structure composed of a series of image frames distributed along the time axis. It exhibits multiple characteristics such as vision and motion in a spatio-temporal high-dimensional space, while also incorporating acoustic characteristics along the time span. It is highly expressive and information-dense, and its content is rich, massive and unstructured. This multi-modal character of video poses a huge challenge to video characterization. 2) In content composition, video is also strongly logical. It is composed of a series of logical units and contains rich semantic information: several consecutive frames can depict an event occurring in a specific spatio-temporal environment and thereby express specific semantic content. The diversity of video content, together with the differences and ambiguity in how video content is understood, makes feature extraction for characterizing video data difficult, so that video understanding based on semantic information is all the more challenging.
Traditional data characterization methods, such as vision-based video feature learning methods, can obtain a concise characterization of video; however, constructing good features requires considerable experience and domain expertise. The use of deep learning methods has brought remarkable breakthroughs to visual tasks, but problems such as the "semantic gap" and the "multi-modal heterogeneity gap" remain. At present, establishing an efficient characterization of video through multi-modal fusion technology is an effective way to bridge the "multi-modal heterogeneity" gap. The most natural way to understand a video is to express its content with the high-level concepts of human thinking on the basis of the multi-modal information in the video; this is also the best path across the "semantic gap". However, video analysis for a specific domain requires the comprehensive use of the corresponding domain features together with existing multi-modal fusion techniques to mine an effective representation pattern and complete the specific task. Although computer technology continues to develop, how to make computers accurately understand the semantic concepts in video remains a difficult problem.
Summary of the invention
In order to overcome the deficiencies of the prior art, the present disclosure provides a video semantic characterization method, system and medium based on a multi-modal fusion mechanism. It is an extensible general characterization model: during model training and global optimization it is extensible in the number of single-modality information sources, and the domain features contained in any type of video can be incorporated into the model. The model fully considers the relationships between the modalities and incorporates the multi-modal interaction process into the joint training and global optimization of the entire model. The model exploits the unique advantage of topic models in the field of semantic analysis, and the video representation obtained by training the proposed model on this basis has relatively ideal discriminability in the semantic space.
In order to solve the above technical problem, the disclosure adopts the following technical scheme.
As a first aspect of the disclosure, a video semantic characterization method based on a multi-modal fusion mechanism is provided.
The video semantic characterization method based on a multi-modal fusion mechanism comprises:
Feature extraction: extracting the visual features, speech features, motion features, text features and domain features of the video itself;
Feature fusion: fusing the extracted visual, speech, motion and text features and the domain features through a constructed multi-level latent Dirichlet allocation topic model;
Feature mapping: mapping the fused features to a high-level semantic space to obtain a fused feature representation sequence.
As some possible implementations, the specific steps of extracting the visual features of the video are as follows:
Preprocessing: the video is segmented into several shots, and the image frames in each shot form an image frame sequence in temporal order.
Step (a1): a deep learning neural network model is established.
The deep learning neural network model comprises: a sequentially connected input layer, first convolutional layer C1, first pooling layer S2, second convolutional layer C3, second pooling layer S4, third convolutional layer C5, fully connected layer F6 and output layer.
Step (a2): the image frame sequence of each shot of the video is input to the input layer of the deep learning neural network model, and the input layer passes the image frames to the first convolutional layer C1.
The first convolutional layer C1 convolves each frame in the image frame sequence with a group of trainable convolution kernels, averages the feature maps obtained by convolution to obtain an average feature map, then feeds the average feature map together with a bias into an activation function, and outputs a group of feature maps.
The first pooling layer S2 performs an overlapping pooling operation on the pixel values of the feature maps obtained by the first convolutional layer C1, reducing the height and width of the feature map matrices output by the first convolutional layer, and then passes the result to the second convolutional layer C3.
The second convolutional layer C3 performs a convolution operation on the result of the first pooling layer S2; the number of convolution kernels of the second convolutional layer C3 is twice the number of convolution kernels of the first convolutional layer C1.
The second pooling layer S4 performs an overlapping pooling operation on the feature maps output by the second convolutional layer C3 to reduce the size of the feature map matrices.
The third convolutional layer C5 performs a convolution operation on the result of the second pooling layer S4 with convolution kernels of the same size as the feature maps output by S4, finally obtaining several 1 × 1 feature maps.
In the fully connected layer F6, each neuron is fully connected with every neuron in the third convolutional layer C5, and the result obtained by the third convolutional layer C5 is expressed as a feature vector.
In the output layer, the feature vector output by the fully connected layer F6 is fed into a classifier for classification, and the classification accuracy is calculated. When the classification accuracy is below a set threshold, the parameters are adjusted by back-propagation and step (a2) is repeated until the classification accuracy exceeds the set threshold. When the classification accuracy exceeds the set threshold, the feature vector corresponding to that accuracy is taken as the learning result for the final video visual feature.
As some possible implementations, the specific steps of extracting the speech features of the video are as follows:
The speech signal in the video is extracted and the audio data is converted into a spectrogram. The spectrogram is taken as the input of a deep learning neural network model, which performs unsupervised learning on the audio information; a fully connected layer then yields the vector representation of the video speech features.
As some possible implementations, the specific steps of extracting the motion features of the video are as follows:
The optical flow field in the video is extracted, and a weighted statistic over the flow directions is computed to obtain the Histogram of Oriented Optical Flow (HOF) feature, which serves as the vector representation of the motion features.
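A magnitude-weighted direction histogram of the kind described (HOF) can be sketched as follows; the 8-bin direction quantization is an assumption for illustration.

```python
import numpy as np

def hof(flow, bins=8):
    # flow: (H, W, 2) array of per-pixel (dx, dy) optical flow vectors
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)                    # weight each vector by magnitude
    ang = np.arctan2(dy, dx) % (2 * np.pi)    # direction in [0, 2*pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    total = hist.sum()
    return hist / total if total > 0 else hist

flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0                            # uniform rightward motion
h = hof(flow)                                 # all mass falls in the first bin
```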
As some possible implementations, the specific steps of extracting the text features of the video are as follows:
The text in the video frames and the peripheral text information of the video (such as the video title and tags) are collected, and text features are extracted from the text information with a bag-of-words model.
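The bag-of-words conversion can be sketched as below; the vocabulary and the sample title are made-up illustrations, not part of the disclosure.

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    # count how often each vocabulary word occurs in the token stream
    counts = Counter(t for t in tokens if t in vocab)
    return [counts[w] for w in vocab]

vocab = ["goal", "corner", "penalty"]        # assumed dictionary
title = "goal corner goal replay"            # assumed peripheral text
vec = bag_of_words(title.split(), vocab)
```

The fixed-length count vector is the "numerical description that can be modeled" referred to later in the detailed description.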
Domain features refer to the set of rule features of the field to which the video belongs. For example, a football video can define scene specifications (such as front court, back court and penalty area) and events (such as shots, corner kicks and free kicks) according to the rules of football matches and broadcasting conventions. News videos have a nearly identical sequential organization and scene semantics, i.e., news shots switch chronologically between the anchor and the news reports. Advertisement videos generally contain the logo information associated with the promoted goods or services.
As some possible implementations, the specific steps of the multi-modal feature fusion are as follows:
Step (a1): with a latent Dirichlet allocation (LDA) topic model, the visual feature vector of the video is mapped from the visual feature space to the semantic feature space Γ; the input of the LDA topic model is the visual feature vector of the video, and its output is the semantic characterization on the feature space Γ.
Step (a2): with an LDA topic model, the speech feature vector of the video is mapped from the speech feature space to the semantic feature space Γ; the input of the LDA topic model is the speech feature vector of the video, and the output is the semantic characterization on the feature space Γ.
Step (a3): with an LDA topic model, the HOF feature of the video is mapped from the motion feature space to the semantic feature space Γ; the input of the LDA model is the HOF feature of the video, and the output is the semantic characterization on the feature space Γ.
Step (a4): with an LDA topic model, the text feature of the video is mapped from the text feature space to the semantic feature space Γ; the input of the LDA model is the text of the video, and the output is the semantic characterization on the feature space Γ.
Step (a5): the domain features of the video are converted into prior knowledge Ω.
Step (a6): with an LDA topic model, given the semantic characterizations of the modal features on the semantic feature space Γ obtained in steps (a1)-(a4), the weight of each modal feature is set, and the video characterization after modality fusion is obtained by weighted fusion.
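The weighted fusion of step (a6) can be sketched as a convex combination of the per-modality topic distributions on Γ. The three-topic space and the weights ρ below are illustrative assumptions; in the disclosure ρ is learned by likelihood maximization rather than fixed by hand.

```python
import numpy as np

def fuse(modal_topics, weights):
    # modal_topics: modality -> topic distribution over the shared space Γ
    # weights:      modality -> weight ρ; the result is renormalized
    dims = {len(v) for v in modal_topics.values()}
    assert len(dims) == 1, "all modalities must map into the same space Γ"
    fused = sum(weights[m] * np.asarray(v) for m, v in modal_topics.items())
    return fused / fused.sum()

topics = {"visual": [0.7, 0.2, 0.1],
          "speech": [0.5, 0.3, 0.2],
          "motion": [0.6, 0.1, 0.3]}
rho = {"visual": 0.5, "speech": 0.3, "motion": 0.2}
z = fuse(topics, rho)       # fused video characterization on Γ
```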
The procedure for finding the weight of each modal feature is as follows:
Step (a61): select a topic distribution θ | α ~ Dir(α), where α is the prior parameter of the Dirichlet prior distribution.
Step (a62): for each word in a training sample video, select a top-level topic distribution z^top; the topic obeys a multinomial distribution.
Step (a63): for each modal feature weight ρ ∈ NV = {NV modal feature dictionaries}, select a bottom-level topic distribution z^low; the topic obeys a multinomial distribution.
Step (a64): under each modal feature weight ρ, based on the selected topics and in combination with the domain knowledge Ω, generate a word from the distribution β.
For a single video d, given α, β and ρ, the topic θ and top-level topic z^top are jointly mapped from the NV single-modality spaces to the high-level semantic space. The joint distribution probability of θ and z^top is
p(θ, z^top, d | α, Ω, ρ, β) p(β) = p(θ | α) ∏_n p(z^top_n | θ) p(w_n | z^top_n, ρ, Ω, β) · ∏_η p(β_η),
where the parameters θ and z^top are hidden variables; they are eliminated by taking the marginal distribution. Here p(β_η) represents the prior interest between dictionary elements under the η-th modality space, modeled with a Gauss-Markov random field prior, namely
p(β_η) ∝ ∏_i ∏_{j ∈ Π_j} exp( −(β_i − β_j)² / σ_i² ),
where Π_j denotes the set of words with prior interest under the η-th modality space, σ_i is the smoothing factor used to adjust the prior model, and exp denotes the exponential function with the natural constant e as its base.
For a video corpus D containing M videos, the generation probability is obtained by multiplying the marginal probabilities of the M videos:
p(D | α, Ω, ρ, β) = ∏_{d=1}^{M} p(d | α, Ω, ρ, β).
The objective function is set as the likelihood function of D, i.e.
ℒ = log_a p(D | α, Ω, ρ, β).
When the likelihood function of D is maximized, the corresponding parameter ρ is exactly the weight of each single-modality feature; log denotes the logarithm with base a, and ℒ denotes the likelihood function.
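The generative story of steps (a61)-(a64) can be sketched as an ancestral sampler. The dimensions (K = 3 topics, NV = 2 modalities, V = 5 dictionary entries) are illustrative assumptions, and the domain prior Ω is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_video(alpha, rho, beta, n_words):
    # alpha: Dirichlet prior over the K top-level topics
    # rho:   modality weights (length NV, summing to 1)
    # beta:  (NV, K, V) per-modality, per-topic word distributions
    theta = rng.dirichlet(alpha)                      # step (a61): θ|α ~ Dir(α)
    words = []
    for _ in range(n_words):
        z_top = rng.choice(len(theta), p=theta)       # step (a62): top-level topic
        m = rng.choice(len(rho), p=rho)               # step (a63): modality chosen by ρ
        w = rng.choice(beta.shape[2], p=beta[m, z_top])  # step (a64): emit a word
        words.append((m, w))
    return theta, words

K, NV, V = 3, 2, 5
beta = rng.dirichlet(np.ones(V), size=(NV, K))
theta, words = generate_video(np.ones(K), [0.6, 0.4], beta, 10)
```

Fitting the model inverts this process: ρ is adjusted until the likelihood of the observed words is maximized.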
As a second aspect of the disclosure, a video semantic characterization system based on a multi-modal fusion mechanism is provided.
The video semantic characterization system based on a multi-modal fusion mechanism comprises a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of any of the above methods are completed.
As a third aspect of the disclosure, a computer-readable storage medium is provided.
The computer-readable storage medium stores computer instructions; when the computer instructions are run by a processor, the steps of any of the above methods are completed.
Compared with the prior art, the beneficial effects of the disclosure are:
(1) The disclosure primarily studies a video semantic characterization method based on a multi-modal fusion mechanism, comprehensively using algorithms from related fields such as image processing, pattern recognition and machine learning to process the sequence information in video. It provides a new research perspective and theoretical reference for video characterization analysis in different domains.
(2) Traditional methods and deep learning methods are combined to study the efficient characterization of video at the semantic level, effectively narrowing the "multi-modal gap" and the "semantic gap" that are ubiquitous in video understanding.
(3) A deep visual feature learning model based on an adaptive learning mechanism is proposed. The adaptivity of the learning system is mainly manifested in two aspects: first, shot detection technology makes the input of the deep model a frame sequence of variable length, and the number of frames can be adjusted adaptively according to the length of the shot; second, in the pooling layer S2, the size and stride of the pooling window are computed dynamically according to the scale of the feature map, so as to guarantee that the data characterization dimension of all shot videos is consistent.
(4) A shot-adaptive 3D deep learning neural network is designed, algorithms for automatically learning the visual features of video are studied, the performance of the classifier is improved and the parameters of the whole system are optimized, so as to characterize the visual information of video in the most effective way.
(5) A multi-modal multi-level topic characterization model is proposed. The main characteristics of the model lie in three aspects: first, it is an extensible general characterization model, the number of single-modality information sources is extensible, and the domain features contained in any type of video can be incorporated into the model, improving the specificity of the video characterization capability; second, the model fully considers the relationships between the modalities and incorporates the multi-modal interaction process into the joint training and global optimization of the entire model; third, exploiting the unique advantage of topic models in the field of semantic analysis, the video representation obtained by training has relatively ideal discriminability in the semantic space and can effectively yield a concise representation of video.
Brief description of the drawings
The accompanying drawings, which constitute a part of this application, are used to provide a further understanding of the application; the illustrative embodiments of the application and their explanations are used to explain the application and do not constitute an undue limitation on the application.
Fig. 1 shows the shot-adaptive 3D deep learning framework;
Fig. 2 shows the adaptive 3D convolution process;
Fig. 3 shows the process of convolution computation with a convolution kernel;
Fig. 4 shows the overall framework of the video multi-modal fusion mechanism;
Fig. 5 shows the multi-modal multi-level topic generation model.
Detailed description of the embodiments
It is noted that the following detailed description is illustrative and is intended to provide further instruction for the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the technical field to which the application belongs.
It should be noted that the terms used herein are merely for describing specific embodiments and are not intended to restrict the illustrative embodiments according to the application. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. Additionally, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or their combinations.
The disclosure first proposes a spatio-temporal feature learning model with an adaptive frame selection mechanism to obtain the visual features of video. On this basis, a model that can effectively fuse the visual features with the other modal features in combination with the domain features is further proposed, realizing the semantic characterization of video.
To achieve the above purpose, the video characterization model described in the disclosure combines traditional methods and deep learning methods, comprehensively using the advantages of traditional feature extraction techniques, deep learning mechanisms and topic model theory to study the multi-modal fusion mechanism of video, and further studies the efficient characterization of video at the semantic level.
The specific research technical scheme is as follows:
First, the spatio-temporal information representation learning mechanism of video is analyzed in depth, and an efficient characterization of the visual information of the video is obtained on the basis of guaranteeing the continuity and integrity of the spatio-temporal information. Then, the fusion mechanism of multi-modal information is studied while the domain features of the video are incorporated into the multi-modal information fusion of the video; finally, a set of domain video semantic representation models is established.
(1) Automatic learning of the spatio-temporal deep features of video
A video shot feature learning model with strong data fitting and learning capability is designed, capable of giving full play to the advantage of layer-by-layer feature extraction. With shot detection technology, taking the shot length as the adaptive unit, the spatio-temporal sequence information contained in a video shot is mined in a layer-by-layer extraction manner. A shot-adaptive 3D deep learning network model is therefore designed (see Fig. 1).
The process is as follows:
Step 1: shot segmentation is performed on the video with shot detection technology.
Step 2: a group of video frames of a shot is taken as the input of the model; the information is successively passed to different layers, and each layer obtains the most significant feature information of the observed data on different categories through a group of filters.
Step 3: the pixel values of all final shot frames are rasterized and concatenated into a vector.
The above adaptive 3D convolution process is embodied in the C1 convolutional layer. Fig. 2 shows the process of convolving a shot containing L frames: taking the L-frame sequence as input, a group of learnable filters convolves the corresponding positions of the different frames; the resulting neurons are then merged by averaging, and finally one group of feature maps is output through an activation function. For video frames, the spatial relationships inside a frame can be considered local, so each neuron is set to perceive only a local region.
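The per-frame convolution followed by averaging over the L frames and an activation, as described for C1, can be sketched as follows; tanh as the activation function and the tiny kernel are illustrative assumptions.

```python
import numpy as np

def adaptive_c1(frames, kernel):
    # frames: (L, H, W) shot of variable length L; kernel: (k, k) learnable filter
    L, H, W = frames.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for f in frames:                     # convolve corresponding positions of each frame
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] += np.sum(f[i:i + k, j:j + k] * kernel)
    out /= L                             # merge the L per-frame responses by averaging
    return np.tanh(out)                  # activation yields one feature map

shot = np.ones((3, 4, 4))                # a 3-frame shot of 4x4 frames
fmap = adaptive_c1(shot, np.ones((2, 2)))
```

Because the averaging runs over however many frames the shot contains, the same filter bank accepts shots of any length L.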
In the convolution process, the neuron weights within the same plane layer are shared; Fig. 3 shows the process of nonlinear transformation with a convolution kernel.
In Fig. 3, w = (w1, w2, …, wK) denotes the weights of the convolution kernel in the convolutional layer, a group of learnable parameters; a = (a1, a2, …, aL) is the local receptive field at the corresponding position in the L frames, where ai = (ai1, ai2, …, aiK). When a convolution kernel executes a convolution operation on an image, what participates in the operation is some region of the input image, and the size of this region is precisely the receptive field.
In the S2 pooling layer, the pixel units of the feature maps obtained by the C1 convolutional layer are first weighted, and the operation result is then passed on to the next layer through a nonlinear function. The pooling operation is implemented with overlapping pooling, i.e., the stride is set smaller than the pooling window size.
Considering that the sizes of video frames from different data sources differ, the feature map sizes obtained after the C1 convolution operation may be inconsistent, which would cause the feature dimensions of the video shots at the fully connected layer to differ and thus lead to inconsistent data characterizations. The strategy taken in the disclosure is to compute the size and stride of the pooling window dynamically according to the scale of the feature map, so as to guarantee the dimensional consistency of the characterization vectors of all video shots.
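The dynamic pooling-window strategy can be sketched as choosing, for a feature map of any size, a window and stride that land exactly on a fixed output size. The formula below is one way to satisfy that constraint and is an illustrative assumption, not the disclosure's exact rule.

```python
def adaptive_pool_params(in_size, out_size):
    # pick stride and window so pooling maps in_size to exactly out_size;
    # when window > stride the pooling regions overlap, as in S2/S4
    stride = in_size // out_size
    window = in_size - (out_size - 1) * stride
    return window, stride

def pooled_size(in_size, window, stride):
    return (in_size - window) // stride + 1

# feature maps of different sizes all pool down to the same 4 units
params = {size: adaptive_pool_params(size, 4) for size in (13, 17, 28)}
```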
In the C3 convolutional layer, convolution kernels numbering twice those of the C1 convolutional layer act on the S2 layer, with the purpose of decreasing the spatial resolution from layer to layer so that more feature information can be detected. The S4 pooling layer uses an operation similar to S2, with the purpose of passing the operation result on to the next layer through down-sampling. In the C5 layer, convolution kernels of the same size as the feature maps obtained by the S4 layer are used to convolve them, finally obtaining a series of feature maps of size 1 × 1. F6 is a fully connected layer: by connecting each neuron with all neurons in C5, the input is ultimately expressed as a feature vector of a certain length, achieving the characterization purpose. The features obtained by training are then fed into a classifier, and the performance of the classifier is further improved to optimize the parameters of the whole system, so as to characterize the visual information of the video in the most effective way.
(2) Multi-modal information fusion for video characterization
In the preprocessing stage of the multi-modal feature fusion of video, each modal feature of the video needs to be extracted and characterized separately. In general, video features fall into two major classes. One class is generic features, comprising:
1) visual features containing time series, including time-dimension and space-dimension information;
2) text features, including the text in video frames and the text on the video periphery; with a bag-of-words model the text information is converted into a numerical description that can be modeled;
3) motion features: the optical flow information in the video is extracted first and then described with the Histogram of Oriented Optical Flow (HOF);
4) audio features: the audio information in the video is first converted into a spectrogram, which is then taken as input; unsupervised learning is carried out by fine-tuning an existing network model to obtain a vector representing the speech information.
The other class is domain features, which are related to the video genre and the specific application field.
The processing of the modal features of video is not a simple combination of the features of each class, but the interaction and fusion of several different modal features. The disclosure takes latent Dirichlet allocation, popular in topic modeling, as the point of entry, and synthesizes theories from disciplines such as machine learning, image processing and speech recognition to fuse the above modal information of video. By constructing a multi-level topic model, the video data is organically mapped from each modal space together with the domain features to an upper-level space, and the video-level characterization sequence on that upper-level space is obtained. Fig. 4 shows the overall framework of the multi-modal fusion mechanism.
For (2) described above, the multi-modal information fusion for video characterization requires extracting each modal feature of the video separately and then constructing a multi-level topic model for multi-feature fusion. The process is as follows:
(1) Each modal feature of the video is extracted separately, i.e., the visual information, speech information, motion information and text information of the video are extracted respectively.
For visual information, a shot (a group of video frames) is taken as the unit of input to the model, and unsupervised learning on the visual information is carried out by fine-tuning an existing network model such as AlexNet or GoogleNet; finally, the pixel values of all shot frames are rasterized and concatenated into a vector. For speech information, the audio data is converted into a spectrogram, the spectrogram information is taken as the input of the model, unsupervised learning on the audio information is then carried out by fine-tuning an existing network model, and a vector is obtained through a fully connected layer. For motion information, the optical flow information in the video is extracted first and then characterized with the Histogram of Oriented Optical Flow (HOF). For text information, including the text in video frames and the text on the video periphery, a bag-of-words model is adopted to convert the text information into a numerical description that can be modeled.
(2) Construct a multi-level topic model to realize multi-feature fusion.
By constructing a multi-level topic model (Fig. 5), the video data is organically mapped from each modality's feature space, together with the domain features, into a single high-level semantic space, realizing multi-modal fusion. The specific implementation is as follows:
The model assumes the corpus contains M videos, denoted D = {d1, d2, …, dM}. Each video di (1 ≤ i ≤ M) contains a group of latent topics; these topics are considered to be generated in a high-level semantic space by organically mapping the dictionary elements of each modality space, according to a certain distribution, under certain prior conditions. The model takes a video as its processing unit and involves topic models at two levels; it realizes multi-modal information fusion with the domain features as prior knowledge, and finally yields a topic representation in vector form. The two levels of topics are denoted Ztop and Zlow respectively: Ztop denotes the topics after video fusion and Zlow the topics before fusion; the former is composed from the latter according to a multinomial distribution with parameter ρ, and the parameter Ω corresponds to the domain features of the video. The model assumes that words in different modality spaces are independent and identically distributed. The visual feature dictionary is constructed with K-means clustering, in a bag-of-words manner.
The model parameter θ follows a Dirichlet distribution with hyperparameter α and represents the topic distribution contained in the video currently being processed. The parameter NV is the number of modalities, and β denotes the dictionaries of the different modality spaces. By solving for the parameter ρ, the model addresses the weighting of the different modalities when the multi-modal space is mapped to the semantic space.
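The K-means bag-of-words construction of the visual dictionary mentioned above can be sketched as follows. This is a minimal pure-NumPy illustration under assumed settings: the deterministic initialization (k descriptors spread evenly through the data), the dictionary size, and the toy descriptors are all illustrative choices, not values fixed by the disclosure.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal K-means; returns (k, d) cluster centers ("dictionary words").

    Deterministic init: k descriptors spread evenly through X
    (an illustrative choice; the disclosure does not fix the scheme).
    """
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign each descriptor to its nearest center, then re-estimate
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers

def bow_histogram(X, centers):
    """Quantize descriptors against the dictionary, count, L1-normalize."""
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    counts = np.bincount(dist.argmin(1), minlength=len(centers))
    return counts / counts.sum()

# Two well-separated descriptor clouds -> a 2-word visual dictionary.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 4)), rng.normal(5.0, 0.1, (50, 4))])
centers = kmeans(X, k=2)
print(bow_histogram(X, centers))   # roughly equal weight on both words
```

Each video (or shot) would then be represented in the visual modality by such a normalized word-count histogram over the learned dictionary.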
The generative process of each video in the corpus is as follows:
Step 1: choose a topic distribution θ | α ~ Dir(α), where α is the hyperparameter of the Dirichlet prior;
Step 2: for each word in the video, choose a top-level topic ztop; the topic follows a multinomial distribution;
Step 3: in modality space p (one of the NV modality spaces), choose a bottom-level topic zlow; the topic follows a multinomial distribution;
Step 4: given the chosen topic, generate a word from the corresponding dictionary distribution.
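The four steps above can be simulated numerically. The following is a hedged sketch of the two-level generative process using standard Dirichlet and multinomial sampling; the topic counts K_top and K_low, the per-modality vocabulary size V, and all parameter values are illustrative assumptions, not values fixed by the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

NV = 4                 # number of modality spaces (visual, speech, motion, text)
K_top, K_low = 3, 5    # top-level / bottom-level topic counts (assumed)
V = 20                 # dictionary size per modality (assumed)

alpha = np.full(K_top, 0.5)                          # Dirichlet hyperparameter
rho = rng.dirichlet(np.ones(K_low), size=K_top)      # Ztop -> Zlow mixing (param rho)
beta = rng.dirichlet(np.ones(V), size=(NV, K_low))   # per-modality dictionaries (param beta)

def generate_video(n_words=50):
    theta = rng.dirichlet(alpha)                 # Step 1: theta | alpha ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z_top = rng.choice(K_top, p=theta)       # Step 2: top-level topic
        p = rng.integers(NV)                     # pick one of the NV modality spaces
        z_low = rng.choice(K_low, p=rho[z_top])  # Step 3: bottom-level topic
        w = rng.choice(V, p=beta[p, z_low])      # Step 4: word from that dictionary
        words.append((p, w))
    return words

doc = generate_video()
print(len(doc), doc[:3])
```

Inference in the patent runs this process in reverse, estimating ρ, β, and θ from observed multi-modal words; the sketch only illustrates the forward (generative) direction.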
For a single video d, given the parameters α, β, ρ and combined with the domain knowledge Ω, the joint probability distribution of the topic distribution θ and the top-level topics ztop in the model, when jointly mapped from the NV modality spaces into the upper-level space, is as follows. The parameters θ and ztop are hidden variables and can be eliminated by computing the marginal distribution.
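The joint distribution referred to here appeared as a display equation in the original filing and does not survive in this text copy. By analogy with standard LDA, and using only the symbols already defined above (θ, ztop, zlow, ρ, β, α, Ω), a plausible reconstruction is the following; the exact factorization in the granted patent may differ:

```latex
p(\theta, z^{top}, z^{low}, w \mid \alpha, \beta, \rho, \Omega)
  = p(\theta \mid \alpha)
    \prod_{n=1}^{N} p(z^{top}_{n} \mid \theta)\,
                    p(z^{low}_{n} \mid z^{top}_{n}, \rho)\,
                    p(w_{n} \mid z^{low}_{n}, \beta_{\eta_{n}}, \Omega)
```

Here N is the number of words in the video and η_n indexes the modality space from which word n is drawn.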
In the above, p(βη) denotes the prior interest among the dictionary elements of the η-th modality space, for which a typical Gauss-Markov random field prior model is adopted, namely:
In that formula, Πj denotes the set of words with prior interest in the η-th modality space, and σi is the smoothing factor of the model, used to adjust the prior.
For a video corpus D containing M videos, the likelihood can be obtained as the product of the marginal probabilities of the M videos.
The aim is to find parameters α, ρ, β that maximize the likelihood function of the corpus; that is, the objective function is expressed as maximizing this likelihood.
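The corpus likelihood and the objective were also display equations in the original. A hedged reconstruction consistent with the surrounding text (a product of per-video marginals, maximized over α, ρ, β) is:

```latex
\mathcal{L}(\alpha,\rho,\beta)
  = \sum_{i=1}^{M} \log p(d_i \mid \alpha, \rho, \beta, \Omega),
\qquad
(\alpha^{*},\rho^{*},\beta^{*})
  = \arg\max_{\alpha,\rho,\beta} \mathcal{L}(\alpha,\rho,\beta)
```

The log of the product of marginals is written as a sum of log-marginals, as is conventional for likelihood maximization.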
By solving the above model, the multi-modal features and the video's domain features can be organically fused, finally obtaining the semantic representation of the video.
The general idea of the above process is shown in Fig. 4. To summarize, the multi-modal multi-level topic model proposed by this disclosure has the following characteristics relative to existing methods:
1) It is an extensible, general representation model: during model training and global optimization, not only is the number of single-modality information sources extensible, but the domain features contained in any type of video can be incorporated into the model, improving the specificity of the video representation;
2) The model fully considers the relationships among the modalities, incorporating the multi-modal interaction into the joint training and global optimization of the entire model;
3) Topic models have a unique advantage in the field of semantic analysis; on this basis, the video representation obtained by training the proposed model has relatively ideal discriminability in the semantic space, which is also one of the effective ways to obtain a concise representation of a video.
The foregoing are merely preferred embodiments of the present application and are not intended to limit it; for those skilled in the art, various changes and modifications to this application are possible. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this application shall be included within its scope of protection.
Claims (9)
1. A video semantic representation method based on a multi-modal fusion mechanism, characterized by comprising:
feature extraction: extracting the visual feature, speech feature, motion feature, and text feature of the video, together with the domain features of the video itself;
feature fusion: fusing the extracted visual, speech, motion, and text features with the domain features by constructing a multi-level latent Dirichlet allocation topic model;
feature mapping: mapping the fused features into a high-level semantic space to obtain a fused feature representation sequence.
2. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of extracting the visual feature of the video are:
preprocessing: segmenting the video into several shots; the image frames in each shot form an image frame sequence in temporal order;
step (a1): establishing a deep-learning neural network model;
the deep-learning neural network model comprises, connected in sequence: an input layer, a first convolutional layer C1, a first pooling layer S2, a second convolutional layer C3, a second pooling layer S4, a third convolutional layer C5, a fully connected layer F6, and an output layer;
step (a2): inputting the image frame sequence of each shot of the video into the input layer of the deep-learning neural network model; the input layer passes the image frames to the first convolutional layer C1;
the first convolutional layer C1 convolves each frame image in the image frame sequence with a group of trainable convolution kernels, averages the feature maps obtained by the convolution to obtain an average feature map, and then inputs the average feature map, together with a bias, into an activation function, outputting a group of feature maps;
the first pooling layer S2 performs an overlapping pooling operation on the pixel values of each pixel of the feature maps produced by the first convolutional layer C1, reducing the height and width of the feature-map matrices output by the first convolutional layer, and then passes the result to the second convolutional layer C3;
the second convolutional layer C3 performs a convolution operation on the result of the first pooling layer S2; the number of convolution kernels of the second convolutional layer C3 is twice the number of convolution kernels of the first convolutional layer C1;
the second pooling layer S4 performs an overlapping pooling operation on the feature maps output by the second convolutional layer C3 to reduce the size of the feature-map matrices;
the third convolutional layer C5 performs a convolution operation on the result of the second pooling layer S4 with convolution kernels of the same size as the second pooling layer S4, finally obtaining several feature maps of 1 × 1 pixel;
the fully connected layer F6, in which each neuron is fully connected to each neuron in the third convolutional layer C5, expresses the result obtained by the third convolutional layer C5 as a feature vector;
the output layer inputs the feature vector output by the fully connected layer F6 into a classifier for classification and computes the classification accuracy; when the classification accuracy is below a set threshold, the parameters are adjusted by backpropagation and step (a2) is repeated until the classification accuracy exceeds the set threshold; when the classification accuracy exceeds the set threshold, the feature vector corresponding to that accuracy is taken as the final learning result of the video's visual feature.
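The overlapping pooling used by layers S2 and S4 above can be sketched as follows. A window size of 3 with stride 2 (window larger than stride, which is what makes the pooling "overlapping") is an illustrative choice; the claim does not fix these values.

```python
import numpy as np

def overlapping_max_pool(fmap, size=3, stride=2):
    """Max-pool a 2-D feature map with stride < window size (overlapping).

    Neighbouring windows share rows/columns, so each reduces the
    feature-map height and width while reusing some pixels.
    """
    h, w = fmap.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i * stride:i * stride + size,
                          j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fmap = np.arange(25, dtype=float).reshape(5, 5)
print(overlapping_max_pool(fmap))   # 5x5 map -> 2x2 map
```

With stride 2 and window 3, adjacent pooling windows overlap by one row or column, which is the distinction from non-overlapping pooling (where stride equals window size).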
3. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of extracting the speech feature of the video are:
extracting the speech signal in the video, converting the audio data into a spectrogram, and using the spectrogram as the input of a deep-learning neural network model; unsupervised learning is then performed on the audio information by the deep-learning neural network model and, through a fully connected layer, a vector representation of the video's speech feature is obtained.
4. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of extracting the motion feature of the video are:
extracting the optical flow field in the video and computing a magnitude-weighted statistic over the flow directions to obtain a histogram-of-optical-flow-directions feature, which serves as the vector representation of the motion feature.
5. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of extracting the text feature of the video are:
collecting the text in the video frames and the text surrounding the video, and extracting the text feature from this text information with a bag-of-words model.
6. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the domain features refer to the set of rules possessed by the field to which the video belongs.
7. The video semantic representation method based on a multi-modal fusion mechanism of claim 1, characterized in that the specific steps of the multi-modal feature fusion are:
step (a1): using an LDA (latent Dirichlet allocation) topic model to map the visual feature vector of the video from the visual feature space into the semantic feature space Γ; the input of the LDA topic model is the visual feature vector of the video, and its output is the semantic representation in the feature space Γ;
step (a2): using an LDA topic model to map the speech feature vector of the video from the speech feature space into the semantic feature space Γ; the input of the LDA topic model is the speech feature vector of the video, and the output is the semantic representation in the feature space Γ;
step (a3): using an LDA topic model to map the histogram-of-optical-flow-directions feature of the video from the motion feature space into the semantic feature space Γ; the input of the LDA is the histogram feature of the video, and the output is the semantic representation in the feature space Γ;
step (a4): using an LDA topic model to map the text feature of the video from the text feature space into the semantic feature space Γ; the input of the LDA is the text of the video, and the output is the semantic representation in the feature space Γ;
step (a5): converting the video's domain features into prior knowledge Ω;
step (a6): using an LDA topic model, setting a weight for each modality's semantic representation in the feature space Γ obtained in steps (a1)-(a4), and obtaining the fused video representation by weighted fusion across the modalities.
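The weighted fusion of step (a6) can be sketched as a convex combination of the per-modality semantic vectors over the shared space Γ. This is a minimal pure-NumPy illustration; the claim does not state how the weights are chosen, so the fixed weight values below are an assumption, as are the toy three-topic vectors.

```python
import numpy as np

def fuse(modal_reprs, weights):
    """Weighted fusion of per-modality semantic vectors over space Gamma.

    modal_reprs: dict modality -> 1-D topic/semantic vector (same length).
    weights:     dict modality -> non-negative weight; normalized so the
                 weights sum to 1, making the result a convex combination.
    """
    total = sum(weights.values())
    return sum((weights[m] / total) * np.asarray(v)
               for m, v in modal_reprs.items())

# Assumed per-modality topic distributions over a 3-topic space Gamma.
reprs = {
    "visual": np.array([0.6, 0.3, 0.1]),
    "speech": np.array([0.2, 0.5, 0.3]),
    "motion": np.array([0.4, 0.4, 0.2]),
    "text":   np.array([0.1, 0.2, 0.7]),
}
w = {"visual": 0.4, "speech": 0.2, "motion": 0.2, "text": 0.2}
print(fuse(reprs, w))
```

Because each modality vector is itself a distribution over Γ and the weights sum to 1, the fused vector is again a distribution over Γ.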
8. A video semantic representation system based on a multi-modal fusion mechanism, characterized by comprising: a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of the method of any one of claims 1-7 are completed.
9. A computer-readable storage medium, characterized in that computer instructions are stored thereon; when the computer instructions are run by a processor, the steps of the method of any one of claims 1-7 are completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811289502.5A CN109472232B (en) | 2018-10-31 | 2018-10-31 | Video semantic representation method, system and medium based on multi-mode fusion mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109472232A true CN109472232A (en) | 2019-03-15 |
CN109472232B CN109472232B (en) | 2020-09-29 |
Family
ID=65666408
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811289502.5A Active CN109472232B (en) | 2018-10-31 | 2018-10-31 | Video semantic representation method, system and medium based on multi-mode fusion mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109472232B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7920761B2 (en) * | 2006-08-21 | 2011-04-05 | International Business Machines Corporation | Multimodal identification and tracking of speakers in video |
CN103778443A (en) * | 2014-02-20 | 2014-05-07 | 公安部第三研究所 | Method for achieving scene analysis description based on theme model method and field rule library |
CN104090971A (en) * | 2014-07-17 | 2014-10-08 | 中国科学院自动化研究所 | Cross-network behavior association method for individual application |
CN105760507A (en) * | 2016-02-23 | 2016-07-13 | 复旦大学 | Cross-modal subject correlation modeling method based on deep learning |
Non-Patent Citations (3)
Title |
---|
Liu Zheng et al.: "MMDF-LDA: An improved Multi-Modal Latent Dirichlet Allocation model for social image annotation", Expert Systems with Applications * |
Qin Jin et al.: "Describing Videos using Multi-modal Fusion", Proceedings of the 24th ACM International Conference on Multimedia * |
Zhang De et al.: "Video multi-modal content analysis technology based on unified semantic-space representation", Video Technology * |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110162669A (en) * | 2019-04-04 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Visual classification processing method, device, computer equipment and storage medium |
CN110162669B (en) * | 2019-04-04 | 2021-07-02 | 腾讯科技(深圳)有限公司 | Video classification processing method and device, computer equipment and storage medium |
CN110046279A (en) * | 2019-04-18 | 2019-07-23 | 网易传媒科技(北京)有限公司 | Prediction technique, medium, device and the calculating equipment of video file feature |
CN110046279B (en) * | 2019-04-18 | 2022-02-25 | 网易传媒科技(北京)有限公司 | Video file feature prediction method, medium, device and computing equipment |
CN110234018A (en) * | 2019-07-09 | 2019-09-13 | 腾讯科技(深圳)有限公司 | Multimedia content description generation method, training method, device, equipment and medium |
CN110234018B (en) * | 2019-07-09 | 2022-05-31 | 腾讯科技(深圳)有限公司 | Multimedia content description generation method, training method, device, equipment and medium |
CN110390311A (en) * | 2019-07-27 | 2019-10-29 | 苏州过来人科技有限公司 | A kind of video analysis algorithm based on attention and subtask pre-training |
CN110399934A (en) * | 2019-07-31 | 2019-11-01 | 北京达佳互联信息技术有限公司 | A kind of video classification methods, device and electronic equipment |
CN110647804A (en) * | 2019-08-09 | 2020-01-03 | 中国传媒大学 | Violent video identification method, computer system and storage medium |
CN110489593A (en) * | 2019-08-20 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Topic processing method, device, electronic equipment and the storage medium of video |
CN110580509A (en) * | 2019-09-12 | 2019-12-17 | 杭州海睿博研科技有限公司 | multimodal data processing system and method for generating countermeasure model based on hidden representation and depth |
CN110674348A (en) * | 2019-09-27 | 2020-01-10 | 北京字节跳动网络技术有限公司 | Video classification method and device and electronic equipment |
CN110674348B (en) * | 2019-09-27 | 2023-02-03 | 北京字节跳动网络技术有限公司 | Video classification method and device and electronic equipment |
WO2021134277A1 (en) * | 2019-12-30 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Emotion recognition method, intelligent device, and computer-readable storage medium |
CN113094550B (en) * | 2020-01-08 | 2023-10-24 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN113094550A (en) * | 2020-01-08 | 2021-07-09 | 百度在线网络技术(北京)有限公司 | Video retrieval method, device, equipment and medium |
CN111414959B (en) * | 2020-03-18 | 2024-02-02 | 南京星火技术有限公司 | Image recognition method, device, computer readable medium and electronic equipment |
CN111401259B (en) * | 2020-03-18 | 2024-02-02 | 南京星火技术有限公司 | Model training method, system, computer readable medium and electronic device |
CN111414959A (en) * | 2020-03-18 | 2020-07-14 | 南京星火技术有限公司 | Image recognition method and device, computer readable medium and electronic equipment |
CN111401259A (en) * | 2020-03-18 | 2020-07-10 | 南京星火技术有限公司 | Model training method, system, computer readable medium and electronic device |
CN111235709A (en) * | 2020-03-18 | 2020-06-05 | 东华大学 | Online detection system for spun yarn evenness of ring spinning based on machine vision |
CN111723239A (en) * | 2020-05-11 | 2020-09-29 | 华中科技大学 | Multi-mode-based video annotation method |
CN111723239B (en) * | 2020-05-11 | 2023-06-16 | 华中科技大学 | Video annotation method based on multiple modes |
JP7334395B2 (en) | 2021-03-05 | 2023-08-29 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Video classification methods, devices, equipment and storage media |
JP2022135930A (en) * | 2021-03-05 | 2022-09-15 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Video classification method, apparatus, device, and storage medium |
WO2022198854A1 (en) * | 2021-03-24 | 2022-09-29 | 北京百度网讯科技有限公司 | Method and apparatus for extracting multi-modal poi feature |
CN113177914B (en) * | 2021-04-15 | 2023-02-17 | 青岛理工大学 | Robot welding method and system based on semantic feature clustering |
CN113177914A (en) * | 2021-04-15 | 2021-07-27 | 青岛理工大学 | Robot welding method and system based on semantic feature clustering |
CN113239184A (en) * | 2021-07-09 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Knowledge base acquisition method and device, computer equipment and storage medium |
CN113806609B (en) * | 2021-09-26 | 2022-07-12 | 郑州轻工业大学 | Multi-modal emotion analysis method based on MIT and FSM |
CN113806609A (en) * | 2021-09-26 | 2021-12-17 | 郑州轻工业大学 | Multi-modal emotion analysis method based on MIT and FSM |
CN113903358B (en) * | 2021-10-15 | 2022-11-04 | 贝壳找房(北京)科技有限公司 | Voice quality inspection method, readable storage medium and computer program product |
CN113903358A (en) * | 2021-10-15 | 2022-01-07 | 北京房江湖科技有限公司 | Voice quality inspection method, readable storage medium and computer program product |
CN117933269A (en) * | 2024-03-22 | 2024-04-26 | 合肥工业大学 | Multi-mode depth model construction method and system based on emotion distribution |
Also Published As
Publication number | Publication date |
---|---|
CN109472232B (en) | 2020-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109472232A (en) | Video semantic representation method, system and medium based on multi-modal fusion mechanism | |
Yao et al. | Exploring visual relationship for image captioning | |
Ge et al. | Facial expression recognition based on deep learning | |
Kontschieder et al. | Structured class-labels in random forests for semantic image labelling | |
CN108829677A (en) | A kind of image header automatic generation method based on multi-modal attention | |
Banica et al. | Video object segmentation by salient segment chain composition | |
Bin et al. | Adaptively attending to visual attributes and linguistic knowledge for captioning | |
CN108427740B (en) | Image emotion classification and retrieval algorithm based on depth metric learning | |
GB2565775A (en) | A Method, an apparatus and a computer program product for object detection | |
Zhu et al. | Efficient action detection in untrimmed videos via multi-task learning | |
He et al. | Moving object recognition using multi-view three-dimensional convolutional neural networks | |
CN110956158A (en) | Pedestrian shielding re-identification method based on teacher and student learning frame | |
CN108154156B (en) | Image set classification method and device based on neural topic model | |
Islam et al. | A review on video classification with methods, findings, performance, challenges, limitations and future work | |
Liu et al. | Robust salient object detection for RGB images | |
CN107506792A (en) | A kind of semi-supervised notable method for checking object | |
Kindiroglu et al. | Temporal accumulative features for sign language recognition | |
Meena et al. | A review on video summarization techniques | |
Tighe et al. | Scene parsing with object instance inference using regions and per-exemplar detectors | |
Liu et al. | Application of gcForest to visual tracking using UAV image sequences | |
Li et al. | Research on efficient feature extraction: Improving YOLOv5 backbone for facial expression detection in live streaming scenes | |
Luo et al. | An optimization framework of video advertising: using deep learning algorithm based on global image information | |
Wu et al. | Self-learning and explainable deep learning network toward the security of artificial intelligence of things | |
Yang et al. | Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system | |
Shi et al. | Uncertain and biased facial expression recognition based on depthwise separable convolutional neural network with embedded attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2021-05-06
Address after: Room 1605, Kangzhen Building, 18 Louyang Road, Suzhou Industrial Park, 215000, Jiangsu Province
Patentee after: Suzhou Wuyun Pen and Ink Education Technology Co., Ltd.
Address before: No. 1 Daxue Road, University Science Park, Changqing District, Jinan City, Shandong Province
Patentee before: SHANDONG NORMAL University