CN109472232A - Video semantic representation method, system and medium based on multi-modal fusion mechanism - Google Patents

Video semantic representation method, system and medium based on multi-modal fusion mechanism

Info

Publication number
CN109472232A
Authority
CN
China
Prior art keywords
video
feature
layer
convolutional layer
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811289502.5A
Other languages
Chinese (zh)
Other versions
CN109472232B (en)
Inventor
侯素娟
车统统
王海帅
郑元杰
王静
贾伟宽
史云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wuyun Pen And Ink Education Technology Co ltd
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN201811289502.5A priority Critical patent/CN109472232B/en
Publication of CN109472232A publication Critical patent/CN109472232A/en
Application granted granted Critical
Publication of CN109472232B publication Critical patent/CN109472232B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a video semantic representation method, system and medium based on a multi-modal fusion mechanism. Feature extraction: the visual features, audio features, motion features, text features and domain features of the video are extracted. Feature fusion: the extracted visual, audio, motion and text features, together with the domain features, are fused through a constructed multi-level latent Dirichlet allocation topic model. Feature mapping: the fused features are mapped to a high-level semantic space to obtain a fused feature representation sequence. The model exploits the unique advantage of topic models in the field of semantic analysis, and the video representation obtained by training the proposed model has good discriminability in the semantic space.

Description

Video semantic representation method, system and medium based on multi-modal fusion mechanism
Technical field
The present disclosure relates to a video semantic representation method, system and medium based on a multi-modal fusion mechanism.
Background art
As the volume of data in the network era grows explosively, the arrival of the era of media big data has been accelerated. Video, as an important carrier of multimedia information, is closely bound up with people's lives. The evolution of massive data not only requires great changes in the way data are processed, but also brings great challenges to the storage, processing and application of video. An urgent problem is how to organize and manage data effectively. Because data are continuously generated while hardware is limited, data can only be stored in segments or time slices, which inevitably causes some degree of information loss. Therefore, providing a concise and efficient data representation for video is of great significance for video analysis and for improving the efficiency of data management.
Video data has the following characteristics. 1) In data form, video data has a complex multi-modal structure and is a stream of non-fully-structured data. Each video is a streaming structure formed by a series of image frames distributed along the time axis; it exhibits visual, motion and other properties in the spatio-temporal space, while also incorporating acoustic properties along the time span. It is highly expressive and information-rich, and its content is abundant, massive and unstructured. This multi-modal nature of video poses a huge challenge to video representation. 2) In content composition, video has strong logicality. It is composed of a series of logical units and contains rich semantic information: several consecutive frames can depict an event occurring in a specific spatio-temporal environment and thereby express specific semantic content. The diversity of video content, together with the differences and ambiguities in how video content is understood, makes the feature extraction needed to represent video data difficult, and makes semantics-based video understanding even more challenging.
Traditional data representation methods, such as vision-based video feature learning methods, can obtain concise representations of video, but constructing good features requires considerable experience and domain expertise. The use of deep learning methods has brought remarkable breakthroughs to visual tasks, yet problems such as the "semantic gap" and the "multi-modal heterogeneity gap" remain. At present, establishing an efficient representation of video by means of multi-modal fusion technology is an effective way to bridge the "multi-modal heterogeneity" gap. The most natural way to understand video is to express its content with the high-level concepts of human thinking, based on the multi-modal information in the video; this is also the best path across the "semantic gap". However, video analysis for a specific domain requires the comprehensive use of the corresponding domain features and existing multi-modal fusion techniques to mine an effective representation pattern and complete the specific task. Although computer technology continues to develop, how to make a computer accurately understand the semantic concepts in video is still a problem.
Summary of the invention
To overcome the deficiencies of the prior art, the present disclosure provides a video semantic representation method, system and medium based on a multi-modal fusion mechanism. It is an extensible general representation model: during model training and global optimization, not only is the number of single-modality information sources extensible, but the domain features of any type of video can also be incorporated into the model. The model fully considers the relationships among the modalities, and the multi-modal interaction process is incorporated into the joint training and global optimization of the whole model. The model exploits the unique advantage of topic models in the field of semantic analysis, and the video representation obtained by training the proposed model has good discriminability in the semantic space.
To solve the above technical problem, the disclosure adopts the following technical scheme.
As a first aspect of the disclosure, a video semantic representation method based on a multi-modal fusion mechanism is provided.
The video semantic representation method based on the multi-modal fusion mechanism comprises:
feature extraction: extracting the visual features, audio features, motion features, text features and domain features of the video;
feature fusion: fusing the extracted visual, audio, motion and text features together with the domain features through a constructed multi-level latent Dirichlet allocation topic model;
feature mapping: mapping the fused features to a high-level semantic space to obtain a fused feature representation sequence.
As some possible implementations, the specific steps of extracting the visual features of the video are as follows:
Pre-processing: the video is segmented into several shots by shot segmentation; the image frames in each shot form an image frame sequence in chronological order.
Step (a1): a deep learning neural network model is established;
The deep learning neural network model comprises: an input layer, a first convolutional layer C1, a first pooling layer S2, a second convolutional layer C3, a second pooling layer S4, a third convolutional layer C5, a fully connected layer F6 and an output layer, connected in sequence;
Step (a2): the image frame sequence of each shot of the video is input to the input layer of the deep learning neural network model, and the input layer passes the image frames to the first convolutional layer C1;
The first convolutional layer C1 convolves each frame image in the image frame sequence of the video with a group of trainable convolution kernels, averages the feature maps obtained by the convolution to obtain an average feature map, then feeds the obtained average feature map together with a bias into an activation function, and outputs a group of feature maps;
The first pooling layer S2 performs an overlapping pooling operation on the pixel values of the feature maps obtained by the first convolutional layer C1, reducing the length and width of the feature-map matrices output by the first convolutional layer; the result is then passed to the second convolutional layer C3;
The second convolutional layer C3 performs a convolution operation on the output of the first pooling layer S2; the number of convolution kernels of the second convolutional layer C3 is twice the number of convolution kernels of the first convolutional layer C1;
The second pooling layer S4 performs an overlapping pooling operation on the feature maps output by the second convolutional layer C3 to reduce the size of the feature-map matrices;
The third convolutional layer C5 performs a convolution operation on the output of the second pooling layer S4 with convolution kernels of the same size as the second pooling layer S4, finally obtaining several feature maps of 1 × 1 pixel;
In the fully connected layer F6, each neuron of this layer is fully connected with each neuron in the third convolutional layer C5, and the result obtained by the third convolutional layer C5 is expressed as a feature vector;
In the output layer, the feature vector output by the fully connected layer F6 is input into a classifier for classification, and the classification accuracy is calculated. When the classification accuracy is lower than a set threshold, the parameters are adjusted by back-propagation and step (a2) is repeated until the classification accuracy is higher than the set threshold; when the classification accuracy is higher than the set threshold, the feature vector corresponding to that classification accuracy is taken as the learning result, i.e. the final visual feature of the video (an illustrative sketch of such a network is given below).
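As an illustration only, the following PyTorch sketch follows the C1-S2-C3-S4-C5-F6 layout described above for one shot. The channel counts, kernel sizes, the 64 × 64 input resolution and the class count are hypothetical choices made so that the example runs; they are not values taken from the disclosure.

import torch
import torch.nn as nn

class ShotCNN(nn.Module):
    """C1-S2-C3-S4-C5-F6 network sketch for one video shot (assumed shapes)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.c1 = nn.Conv2d(3, 16, kernel_size=5)        # C1: first convolutional layer
        self.s2 = nn.MaxPool2d(kernel_size=3, stride=2)  # S2: overlapping pooling (stride < window)
        self.c3 = nn.Conv2d(16, 32, kernel_size=5)       # C3: twice as many kernels as C1
        self.s4 = nn.MaxPool2d(kernel_size=3, stride=2)  # S4: overlapping pooling
        self.c5 = nn.Conv2d(32, 64, kernel_size=12)      # C5: kernel sized to yield 1 x 1 maps
        self.f6 = nn.Linear(64, 128)                     # F6: fully connected feature vector
        self.out = nn.Linear(128, num_classes)           # output layer feeding the classifier

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (L, 3, 64, 64), the image frame sequence of one shot.
        # C1 convolves every frame with shared kernels (bias included), the per-frame
        # maps are averaged, and the average map goes through the activation function.
        x = torch.relu(self.c1(frames).mean(dim=0, keepdim=True))
        x = self.s2(x)
        x = torch.relu(self.c3(x))
        x = self.s4(x)
        x = torch.relu(self.c5(x))                       # -> (1, 64, 1, 1)
        feat = torch.relu(self.f6(x.flatten(1)))         # feature vector of the shot
        return self.out(feat)

# Example: logits = ShotCNN()(torch.randn(8, 3, 64, 64))  # an 8-frame shot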
As some possible implementations, the specific steps of extracting the audio features of the video are as follows:
The audio signal in the video is extracted and the audio data is converted into a spectrogram; the spectrogram is taken as the input of a deep learning neural network model, unsupervised learning is then performed on the audio information by the deep learning neural network model, and a vector representation of the audio feature of the video is obtained through a fully connected layer.
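A minimal sketch of this step, assuming the audio track has already been extracted to a WAV file and that librosa is available; the sampling rate, clip length and STFT settings are illustrative values, not taken from the disclosure.

import numpy as np
import librosa

def audio_to_spectrogram(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Load the audio track and turn it into a log-scaled spectrogram image."""
    signal, _ = librosa.load(wav_path, sr=sr, mono=True, duration=3.0)
    spec = np.abs(librosa.stft(signal, n_fft=512, hop_length=160)) ** 2
    return librosa.power_to_db(spec)   # 2-D array that can be fed to the network as an image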
As some possible implementations, the specific steps of extracting the motion features of the video are as follows:
The optical flow field in the video is extracted, and weighted statistics are computed over the optical flow directions to obtain the Histogram of Oriented Optical Flow (HOF) feature, which serves as the vector representation of the motion feature.
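One possible HOF computation for a pair of consecutive frames, sketched with OpenCV's Farneback dense optical flow; the 8-bin histogram layout and the normalisation are assumptions made for illustration.

import cv2
import numpy as np

def hof_descriptor(prev_gray: np.ndarray, next_gray: np.ndarray, bins: int = 8) -> np.ndarray:
    """Magnitude-weighted histogram of optical-flow directions between two grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Each direction bin is weighted by the flow magnitude, then the histogram is normalised.
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 2 * np.pi), weights=magnitude)
    return hist / (hist.sum() + 1e-8)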
As some possible implementations, the specific steps of extracting the text features of the video are as follows:
The text in the video frames and the peripheral text information of the video (such as the video title and tags) are collected, and text features are extracted from the text information using a bag-of-words model.
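A minimal bag-of-words sketch for the textual side (frame text, title, tags), using scikit-learn; the example strings are placeholders, not real recognised text.

from sklearn.feature_extraction.text import CountVectorizer

texts = ["goal scored in the second half",        # hypothetical text recognised in frames
         "premier league highlights full match"]  # hypothetical title / surrounding tags

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts)              # sparse document-term count matrix
text_feature = bow.toarray().sum(axis=0)           # one count vector for the whole video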
Domain features refer to the rule-based features determined by the domain to which the video belongs. For example, a soccer video can be given scene specifications (such as front court, back court and penalty area) and event definitions (such as shots on goal, corner kicks and free kicks) according to the rules of soccer matches and broadcast conventions. A news video has an almost identical temporal organization and scene semantics, i.e. the news shots switch chronologically between the anchor and the news reports. An advertisement video generally contains logo information associated with the promoted goods or services.
As some possible implementations, the specific steps of the multi-modal feature fusion are as follows:
Step (a1): the visual feature vector of the video is mapped from the visual feature space onto the semantic feature space Γ with an LDA (Latent Dirichlet Allocation) topic model; the input of the LDA topic model is the visual feature vector of the video, and its output is the semantic representation on the feature space Γ;
Step (a2): the audio feature vector of the video is mapped from the audio feature space onto the semantic feature space Γ with an LDA topic model; the input of the LDA topic model is the audio feature vector of the video, and its output is the semantic representation on the feature space Γ;
Step (a3): the HOF feature of the video is mapped from the motion feature space onto the semantic feature space Γ with an LDA topic model; the input of the LDA model is the HOF feature of the video, and its output is the semantic representation on the feature space Γ;
Step (a4): the text feature of the video is mapped from the text feature space onto the semantic feature space Γ with an LDA topic model; the input of the LDA model is the text of the video, and its output is the semantic representation on the feature space Γ;
Step (a5): the domain features of the video are converted into prior knowledge Ω;
Step (a6): with the LDA topic model, the weights of the modal features are set for the semantic representations on the semantic feature space Γ obtained in steps (a1)-(a4), and the video representation after modality fusion is obtained by weighted fusion (see the sketch after this step).
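The sketch below illustrates steps (a1)-(a6) under simplifying assumptions: each modality is taken to be already quantised into bag-of-words count vectors, scikit-learn's LatentDirichletAllocation stands in for the per-modality mapping onto Γ, and the weights ρ are supplied by the caller instead of being learned by the likelihood maximisation described next.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def map_to_semantic_space(counts: np.ndarray, n_topics: int = 20) -> np.ndarray:
    """Map one modality's count vectors onto the semantic feature space (steps a1-a4)."""
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)               # rows are topic (semantic) representations

def fuse(counts_per_modality: list, rho: np.ndarray) -> np.ndarray:
    """Weighted fusion over modalities (step a6); rho plays the role of the modal weights."""
    semantic = [map_to_semantic_space(c) for c in counts_per_modality]
    return sum(w * s for w, s in zip(rho, semantic))

# counts_per_modality: one (n_videos, vocab_size) count matrix per modality
# (visual words, audio words, motion words, text words); the domain prior Ω is not modelled here.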
The process of finding the weight of each modal feature is as follows:
Step (a61): a topic distribution θ | α ~ Dir(α) is selected, where α is the prior parameter of the Dirichlet prior distribution;
Step (a62): for each word in the training sample video, a top-level topic z_top is selected; the topic obeys a multinomial distribution;
Step (a63): for each modal feature weight ρ ∈ NV = {the NV modal feature dictionaries}, a bottom-level topic z_low is selected; the topic obeys a multinomial distribution;
Step (a64): under each modal feature weight ρ, based on the selected topic and in combination with the domain knowledge Ω, a word is generated from the corresponding word distribution.
For a single video d, given α, β and ρ, the topic θ and the top-level topics z_top are jointly mapped from the NV single-modality spaces to the high-level semantic space, with joint distribution probability p(θ, z_top, d | α, Ω, ρ, β)p(β).
The parameters θ and z_top are hidden variables and are eliminated by taking the marginal distribution.
Here p(β_η) denotes the prior interest among the dictionary elements under the η-th modality space, and is modelled with a Gauss-Markov random field prior model, in which Π_j denotes the set of words with prior interest under the η-th modality space, σ_i is the smoothing factor of the model used to adjust the prior model, and exp denotes the exponential function with the natural constant e as base.
For a video corpus D containing M videos, the generating probability is obtained by multiplying together the marginal probabilities of the M videos.
The objective function is set as the likelihood function of D. When the likelihood function of D is maximized, the corresponding parameter ρ is exactly the weight of each single-modality feature; log denotes the logarithm to base a.
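The formula images from the original filing are not reproduced in this text. The following block is only a plausible reconstruction of the three expressions referred to above (the joint distribution of a single video, the Gauss-Markov random field prior, and the corpus likelihood objective), written in a standard LDA-style factorisation suggested by the surrounding definitions; the exact form in the patent may differ.

% Joint distribution of a single video d (theta and z^top are later marginalised out):
p(\theta, z^{top}, d \mid \alpha, \Omega, \rho, \beta)\, p(\beta)
  = p(\theta \mid \alpha)
    \prod_{n=1}^{N_d} p(z^{top}_{n} \mid \theta)\,
      p(w_{n} \mid z^{top}_{n}, \rho, \Omega, \beta)
    \prod_{\eta=1}^{NV} p(\beta_{\eta})

% Gauss-Markov random field prior over the dictionary elements of modality eta:
p(\beta_{\eta}) \propto
  \prod_{i}\ \prod_{j \in \Pi_{j}}
    \exp\!\left(-\frac{(\beta_{\eta i}-\beta_{\eta j})^{2}}{2\sigma_{i}^{2}}\right)

% Corpus likelihood for D = {d_1, ..., d_M} and the training objective:
\mathcal{L}(D) = \log \prod_{m=1}^{M} p(d_{m} \mid \alpha, \Omega, \rho, \beta),
\qquad
(\alpha^{*}, \rho^{*}, \beta^{*}) = \arg\max_{\alpha, \rho, \beta} \mathcal{L}(D)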
As a second aspect of the disclosure, a video semantic representation system based on a multi-modal fusion mechanism is provided.
The video semantic representation system based on the multi-modal fusion mechanism comprises: a memory, a processor, and computer instructions stored on the memory and run on the processor; when the computer instructions are run by the processor, the steps of any one of the above methods are completed.
As a third aspect of the disclosure, a computer-readable storage medium is provided.
A computer-readable storage medium has computer instructions stored thereon; when the computer instructions are run by a processor, the steps of any one of the above methods are completed.
Compared with the prior art, the beneficial effects of the disclosure are:
(1) The disclosure mainly studies a video semantic representation method based on a multi-modal fusion mechanism, and comprehensively uses algorithms from related fields such as image processing, pattern recognition and machine learning to process the sequence information in video. It provides a new perspective and a theoretical reference for research on video representation analysis in different domains.
(2) Traditional methods and deep learning methods are combined to study the efficient representation of video at the semantic level, effectively narrowing the "multi-modal gap" and the "semantic gap" that are pervasive in video understanding.
(3) A deep visual feature learning model based on an adaptive learning mechanism is proposed. The adaptivity of the learning system is mainly reflected in two aspects: first, shot detection technology makes the input of the deep model a group of frame sequences of variable length, and the number of frames can be adjusted adaptively according to the length of the shot; second, in the pooling layer, the size and stride of the pooling window are computed dynamically according to the scale of the feature maps, so as to guarantee that the data representation dimensions of all shot videos are consistent.
(4) An adaptive 3D deep learning neural network for video shots is designed, the algorithm for automatically learning the visual features of video is studied, the performance of the classifier is improved and the parameters of the whole system are optimized, so that the visual information of the video is represented in the most effective way.
(5) A multi-modal multi-level topic representation model is proposed, whose main characteristics lie in the following three aspects: first, it is an extensible general representation model in which the number of single-modality information sources is extensible and the domain features of any type of video can also be incorporated, improving the specificity of the video representation capability; second, the model fully considers the relationships among the modalities, and the multi-modal interaction process is incorporated into the joint training and global optimization of the whole model; third, by exploiting the unique advantage of topic models in the field of semantic analysis, the video representation obtained by training has good discriminability in the semantic space, which makes it possible to obtain a concise representation of the video effectively.
Detailed description of the invention
The accompanying drawings, which constitute a part of this application, are used to provide further understanding of the application; the illustrative embodiments of the application and their descriptions are used to explain the application and do not constitute an undue limitation on the application.
Fig. 1: the adaptive 3D deep learning framework for video shots;
Fig. 2: the adaptive 3D convolution process;
Fig. 3: the process of convolution computation using a convolution kernel;
Fig. 4: the overall framework of the video multi-modal fusion mechanism;
Fig. 5: the multi-modal multi-level topic generative model.
Specific embodiment
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the technical field to which the application belongs.
It should be noted that the terms used herein are merely for describing specific embodiments and are not intended to limit the illustrative embodiments according to the application. As used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; in addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of the features, steps, operations, devices, components and/or combinations thereof.
The disclosure first proposes a spatio-temporal feature learning model with an adaptive frame selection mechanism to obtain the visual features of the video. On this basis, a model is then further proposed that can effectively fuse the visual features with the other modal features in combination with the domain features, so as to realize the semantic representation of the video.
To achieve the above objective, the video representation model described in the disclosure combines traditional methods and deep learning methods, makes comprehensive use of the advantages of traditional feature extraction technology, the deep learning mechanism and topic model theory to study the multi-modal fusion mechanism of video, and further studies the efficient representation of video at the semantic level.
The specific technical scheme of the research is as follows:
The spatio-temporal information representation learning mechanism of video is first analysed in depth, and an efficient representation of the visual information of the video is obtained on the basis of guaranteeing the continuity and integrity of the spatio-temporal information. Then, the fusion mechanism of the multi-modal information is studied, the domain features of the video are incorporated into the multi-modal information fusion of the video, and finally a set of domain video semantic representation models is established.
(1) Automatic learning of the spatio-temporal deep features of video
A video shot feature learning model with strong data fitting capability and learning ability is designed, which can give full play to the advantage of layer-by-layer feature extraction. Using shot detection technology and taking the shot length as the adaptive unit, the spatio-temporal sequence information contained in a video shot is mined in a layer-by-layer extraction manner. For this purpose an adaptive 3D deep learning network model for video shots is designed (see Fig. 1).
The process is as follows:
Step 1: shot segmentation is performed on the video with shot detection technology (a simple sketch of this step is given after this list).
Step 2: a group of video frames of one shot is taken as the input of the model, the information is delivered layer by layer, and each layer obtains, through a group of filters, the most significant feature information of the observed data over the different categories.
Step 3: finally, the pixel values of all shot frames are rasterized and connected into one vector.
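A simple sketch of the shot-segmentation step, assuming OpenCV and a colour-histogram difference test; the HSV histogram size and the 0.5 threshold are illustrative values, not taken from the disclosure.

import cv2

def segment_shots(video_path: str, threshold: float = 0.5) -> list:
    """Return the frame indices at which a new shot is assumed to start."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)],
                            [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None and \
           cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
            boundaries.append(idx)        # large histogram distance -> new shot
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries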
The above adaptive 3D convolution process is embodied in the C1 convolutional layer. Fig. 2 shows the process of convolving a shot containing L frames: the L-frame sequence is taken as input, a group of learnable filters convolves the corresponding positions of the different frames, the resulting neurons are then merged and averaged, and one group of feature maps is finally output through an activation function. For video frames, the spatial relationships inside a frame are considered to be local, so each neuron is set to perceive only a local region.
In the convolution process, the neuron weights within the same plane are shared; Fig. 3 shows the process of non-linear transformation using a convolution kernel.
In Fig. 3, W = (w1, w2, ..., wK) denotes the weights of the convolution kernel at this convolutional layer, which form a group of learnable parameters; a = (a1, a2, ..., aL) is the local receptive field at the corresponding position in the L frames, where ai = (ai1, ai2, ..., aiK). When a convolution kernel performs a convolution operation on an image, what participates in the operation is a certain region of the input image, and the size of this region is exactly the receptive field.
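A small numpy sketch of the shared-weight response of Figs. 2 and 3, assuming the local receptive fields of the L frames have already been gathered into an array; the tanh activation is an illustrative choice.

import numpy as np

def averaged_response(receptive_fields: np.ndarray, W: np.ndarray, bias: float = 0.0) -> float:
    """receptive_fields: (L, K) local patches a_i; W: (K,) shared kernel weights."""
    per_frame = receptive_fields @ W + bias   # one response per frame, same weights for all frames
    return float(np.tanh(per_frame.mean()))   # merge/average the L responses, then activate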
In the S2 pooling layer, the pixel units of the feature maps obtained by the C1 convolutional layer are first weighted, and the result is then passed on to the next layer through a non-linear function. The pooling operation uses overlapping pooling in the implementation, i.e. the stride is set smaller than the pooling window size.
Considering that the sizes of video frames from different data sources vary, the sizes of the feature maps obtained after the C1 convolution may be inconsistent; this would make the feature dimensions of the video shots at the fully connected layer differ and thus lead to inconsistent data representations. The strategy adopted in the disclosure is to compute the size and stride of the pooling window dynamically according to the scale of the feature maps, so as to guarantee that the representation vectors of all video shots are consistent in dimension.
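A sketch of this dynamic pooling strategy: the window and stride are derived from the actual feature-map size so that every shot pools down to the same fixed size. The target size of 12 is an assumed value used only for illustration.

def adaptive_pool_params(feature_size: int, target_size: int = 12):
    """Return (window, stride) such that pooling a feature_size-wide map yields exactly
    target_size outputs, with stride < window, i.e. overlapping pooling."""
    assert feature_size > target_size
    stride = max(1, (feature_size - 1) // target_size)
    window = feature_size - stride * (target_size - 1)
    return window, stride

# e.g. 60-wide and 90-wide maps from different sources both pool to 12 outputs:
# adaptive_pool_params(60) -> (16, 4), adaptive_pool_params(90) -> (13, 7)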
In the C3 convolutional layer, twice as many convolution kernels as in the C1 convolutional layer are applied to the S2 layer, so that the spatial resolution decreases from layer to layer and more feature information can be detected. The S4 pooling layer uses an operation similar to S2, so that the result continues to be passed to the next layer through down-sampling. In the C5 layer, the feature maps obtained by the S4 layer are convolved with convolution kernels of the same size as S4, finally yielding a series of feature maps of size 1 × 1. F6 is a fully connected layer: by connecting each of its neurons with all neurons in C5, the input is ultimately expressed as a feature vector of a certain length, achieving the representation purpose. The features obtained by training are then fed into a classifier to further improve the performance of the classifier and optimize the parameters of the whole system, so that the visual information of the video is represented in the most effective way.
(2) Multi-modal information fusion for video representation
In the pre-processing stage of the multi-modal feature fusion of video, each modal feature of the video needs to be extracted and represented separately. In general, video features fall into two major classes. One class is generic features, comprising:
1) visual features containing a time series, which include temporal and spatial dimension information;
2) text features, including the text in the video frames and the peripheral text of the video; the text information is converted with a bag-of-words model into a numerical description that can be modelled;
3) motion features: the optical flow information in the video is first extracted and then described with the Histogram of Oriented Optical Flow (HOF);
4) audio features: the audio information in the video is first converted into a spectrogram, the spectrogram is then taken as input, and unsupervised learning is performed by fine-tuning an existing network model to obtain a vector representing the audio information.
The other class is domain features, which are related to the video genre and the specific application field.
The processing of the modal features of a video is not a simple combination of the individual features, but the interaction and fusion of several different modal features. The disclosure takes latent Dirichlet allocation, popular in topic modelling, as the entry point and synthesizes theories from machine learning, image processing, speech recognition and other subjects to fuse the above modal information of the video. By constructing a multi-level topic model, the video data are organically mapped from each modal space, together with the domain features, into an upper semantic space, and a video-level representation sequence on that space is obtained. Fig. 4 shows the overall framework of the multi-modal fusion mechanism.
As stated in (2) above, the multi-modal information fusion for video representation requires extracting each modal feature of the video separately and then constructing a multi-level topic model to perform multi-feature fusion. The process is as follows:
(1) The modal features of the video are extracted separately, i.e. the visual information, audio information, motion information and text information of the video are each extracted.
For the visual information, a shot (a group of video frames) is taken as the unit input of the model, and unsupervised learning is performed by fine-tuning an existing network model such as AlexNet or GoogleNet; finally, the pixel values of all shot frames are rasterized and connected into one vector. For the audio information, the audio data is converted into a spectrogram, the spectrogram is taken as the model input, unsupervised learning is then performed on the audio information by fine-tuning an existing network model, and a vector is obtained through a fully connected layer. For the motion information, the optical flow information in the video is first extracted and then represented with the Histogram of Oriented Optical Flow (HOF). For the text information, including the text in the video frames and the peripheral text of the video, a bag-of-words model is used to convert the text into a numerical description that can be modelled.
(2) A multi-level topic model is constructed to realize multi-feature fusion.
By constructing a multi-level topic model (Fig. 5), the video data are organically mapped from each modal feature space, together with the domain features, to a high-level semantic space, realizing multi-modal fusion. The specific implementation process is as follows.
The model assumes that the corpus contains M videos, denoted D = {d1, d2, ..., dM}. Each video di (1 ≤ i ≤ M) contains a group of latent topics, and these topics are considered to be generated by organically mapping the dictionary elements of each modal space, under certain prior conditions, into a high-level semantic space according to a certain distribution. Taking the video as the processing unit, the model involves topic models at two levels in total, realizes multi-modal information fusion with the domain features as prior knowledge, and finally obtains a topic representation in vector form. The two levels of topics are denoted Ztop and Zlow respectively: Ztop denotes the topics after video fusion and Zlow the topics before fusion, and the former is composed of the latter according to a multinomial distribution with parameter ρ; the parameter Ω corresponds to the domain features of the video. The model assumes that the words under different modal spaces are independently and identically distributed. The visual feature dictionary is constructed with the K-means clustering technique in a bag-of-words manner.
The model parameter θ obeys a Dirichlet distribution with prior parameter α and denotes the topic distribution contained in the video currently being processed. The parameter NV is the number of modalities, and β denotes the dictionaries under the different modal spaces. By solving for the parameter ρ, the model settles the question of how the different modalities are weighted when the multi-modal space is transformed into the semantic space.
The generating process of each video in the corpus is as follows:
Step 1: a topic distribution θ | α ~ Dir(α) is selected, where α is the prior parameter of the Dirichlet prior distribution;
Step 2: for each word in the video, a top-level topic z_top is selected; the topic obeys a multinomial distribution;
Step 3: for the NV modal spaces, at each modal space, a bottom-level topic z_low is selected; the topic obeys a multinomial distribution;
Step 4: according to the selected topics, a word is generated from the corresponding word distribution (a toy simulation of this process is given below).
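A toy numpy simulation of Steps 1-4 above; the numbers of modalities, topics and dictionary words, and the way the domain prior Ω would bias the word distributions, are all assumptions made only to make the sketch concrete.

import numpy as np

rng = np.random.default_rng(0)
NV, n_top, n_low, vocab = 4, 5, 8, 50                   # modalities, top/bottom topics, dictionary size
alpha = np.ones(n_top)                                  # Dirichlet prior parameter
rho = rng.dirichlet(np.ones(n_low), size=n_top)         # top-level topic -> bottom-level topic mixing
beta = rng.dirichlet(np.ones(vocab), size=(NV, n_low))  # per-modality word distributions (Ω folded in here)

def generate_video(n_words: int = 20):
    theta = rng.dirichlet(alpha)                        # Step 1: theta | alpha ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z_top = rng.choice(n_top, p=theta)              # Step 2: top-level topic, multinomial in theta
        eta = rng.integers(NV)                          # pick one of the NV modal spaces
        z_low = rng.choice(n_low, p=rho[z_top])         # Step 3: bottom-level topic, multinomial in rho
        words.append((eta, rng.choice(vocab, p=beta[eta, z_low])))  # Step 4: emit a word
    return words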
For a single video d, given the parameters α, β and ρ and in combination with the domain knowledge Ω, the topic θ and the top-level topics z_top in the model are jointly mapped from the NV modal spaces to the upper space, which defines their joint distribution probability.
The parameters θ and z_top are hidden variables and can be eliminated by taking the marginal distribution.
Here p(β_η) denotes the prior interest among the dictionary elements under the η-th modal space, and a typical Gauss-Markov random field prior model is used, in which Π_j denotes the set of words with prior interest under the η-th modal space and σ_i is the smoothing factor of the model, used to adjust the prior model.
For a video corpus D containing M videos, the likelihood value can be obtained by multiplying together the marginal probabilities of the M videos.
Suitable parameters α, ρ and β are expected to be found that maximize the likelihood function of the corpus, which constitutes the objective function.
By solving the above model, the organic fusion of the multi-modal features and the video domain features can be realized, and the semantic representation of the video is finally obtained.
The general idea of the above process is shown in Fig. 4. Summarizing the above, the multi-modal multi-level topic model proposed in the disclosure has the following characteristics relative to existing methods:
1) It is an extensible general representation model: during model training and global optimization, not only is the number of single-modality information sources extensible, but the domain features of any type of video can also be incorporated into the model, improving the specificity of the video representation capability;
2) The model fully considers the relationships among the modalities, and the multi-modal interaction process is incorporated into the joint training and global optimization of the whole model;
3) Topic models have their own unique advantage in the field of semantic analysis; the video representation obtained by training the model proposed on this basis has good discriminability in the semantic space, which is also one of the effective ways to obtain a concise representation of the video.
The above are merely preferred embodiments of the application and are not intended to limit the application; for those skilled in the art, various changes and modifications can be made to the application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the application shall be included within the scope of protection of the application.

Claims (9)

1. A video semantic representation method based on a multi-modal fusion mechanism, characterized by comprising:
feature extraction: extracting the visual features, audio features, motion features, text features and domain features of the video;
feature fusion: fusing the extracted visual, audio, motion and text features together with the domain features through a constructed multi-level latent Dirichlet allocation topic model;
feature mapping: mapping the fused features to a high-level semantic space to obtain a fused feature representation sequence.
2. The video semantic representation method based on the multi-modal fusion mechanism according to claim 1, characterized in that the specific steps of extracting the visual features of the video are as follows:
pre-processing: the video is segmented into several shots by shot segmentation; the image frames in each shot form an image frame sequence in chronological order;
step (a1): a deep learning neural network model is established;
the deep learning neural network model comprises: an input layer, a first convolutional layer C1, a first pooling layer S2, a second convolutional layer C3, a second pooling layer S4, a third convolutional layer C5, a fully connected layer F6 and an output layer, connected in sequence;
step (a2): the image frame sequence of each shot of the video is input to the input layer of the deep learning neural network model, and the input layer passes the image frames to the first convolutional layer C1;
the first convolutional layer C1 convolves each frame image in the image frame sequence of the video with a group of trainable convolution kernels, averages the feature maps obtained by the convolution to obtain an average feature map, then feeds the obtained average feature map together with a bias into an activation function, and outputs a group of feature maps;
the first pooling layer S2 performs an overlapping pooling operation on the pixel values of the feature maps obtained by the first convolutional layer C1, reducing the length and width of the feature-map matrices output by the first convolutional layer, and then passes the result to the second convolutional layer C3;
the second convolutional layer C3 performs a convolution operation on the output of the first pooling layer S2; the number of convolution kernels of the second convolutional layer C3 is twice the number of convolution kernels of the first convolutional layer C1;
the second pooling layer S4 performs an overlapping pooling operation on the feature maps output by the second convolutional layer C3 to reduce the size of the feature-map matrices;
the third convolutional layer C5 performs a convolution operation on the output of the second pooling layer S4 with convolution kernels of the same size as the second pooling layer S4, finally obtaining several feature maps of 1 × 1 pixel;
in the fully connected layer F6, each neuron of this layer is fully connected with each neuron in the third convolutional layer C5, and the result obtained by the third convolutional layer C5 is expressed as a feature vector;
in the output layer, the feature vector output by the fully connected layer F6 is input into a classifier for classification, and the classification accuracy is calculated; when the classification accuracy is lower than a set threshold, the parameters are adjusted by back-propagation and step (a2) is repeated until the classification accuracy is higher than the set threshold; when the classification accuracy is higher than the set threshold, the feature vector corresponding to the classification accuracy higher than the set threshold is taken as the learning result, i.e. the final visual feature of the video.
3. The video semantic representation method based on the multi-modal fusion mechanism according to claim 1, characterized in that the specific steps of extracting the audio features of the video are as follows:
the audio signal in the video is extracted and the audio data is converted into a spectrogram; the spectrogram is taken as the input of a deep learning neural network model, unsupervised learning is then performed on the audio information by the deep learning neural network model, and a vector representation of the audio feature of the video is obtained through a fully connected layer.
4. The video semantic representation method based on the multi-modal fusion mechanism according to claim 1, characterized in that the specific steps of extracting the motion features of the video are as follows:
the optical flow field in the video is extracted, and weighted statistics are computed over the optical flow directions to obtain the histogram of oriented optical flow feature, which serves as the vector representation of the motion feature.
5. The video semantic representation method based on the multi-modal fusion mechanism according to claim 1, characterized in that the specific steps of extracting the text features of the video are as follows:
the text in the video frames and the peripheral text information of the video are collected, and text features are extracted from the text information using a bag-of-words model.
6. The video semantic representation method based on the multi-modal fusion mechanism according to claim 1, characterized in that the domain features refer to the rule-based features determined by the domain to which the video belongs.
7. The video semantic representation method based on the multi-modal fusion mechanism according to claim 1, characterized in that
the specific steps of the multi-modal feature fusion are as follows:
step (a1): the visual feature vector of the video is mapped from the visual feature space onto the semantic feature space Γ with an LDA (latent Dirichlet allocation) topic model; the input of the LDA topic model is the visual feature vector of the video, and the output of the LDA topic model is the semantic representation on the feature space Γ;
step (a2): the audio feature vector of the video is mapped from the audio feature space onto the semantic feature space Γ with an LDA topic model; the input of the LDA topic model is the audio feature vector of the video, and its output is the semantic representation on the feature space Γ;
step (a3): the histogram of oriented optical flow feature of the video is mapped from the motion feature space onto the semantic feature space Γ with an LDA topic model; the input of the LDA model is the histogram of oriented optical flow feature of the video, and its output is the semantic representation on the feature space Γ;
step (a4): the text feature of the video is mapped from the text feature space onto the semantic feature space Γ with an LDA topic model; the input of the LDA model is the text of the video, and its output is the semantic representation on the feature space Γ;
step (a5): the domain features of the video are converted into prior knowledge Ω;
step (a6): with the LDA topic model, the weights of the modal features are set for the semantic representations on the semantic feature space Γ obtained in steps (a1)-(a4), and the video representation after modality fusion is obtained by weighted fusion.
8. A video semantic representation system based on a multi-modal fusion mechanism, characterized by comprising: a memory, a processor, and computer instructions stored on the memory and run on the processor, wherein when the computer instructions are run by the processor, the steps of the method according to any one of claims 1-7 are completed.
9. A computer-readable storage medium, characterized in that computer instructions are stored thereon, wherein when the computer instructions are run by a processor, the steps of the method according to any one of claims 1-7 are completed.
CN201811289502.5A 2018-10-31 2018-10-31 Video semantic representation method, system and medium based on multi-mode fusion mechanism Active CN109472232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811289502.5A CN109472232B (en) 2018-10-31 2018-10-31 Video semantic representation method, system and medium based on multi-mode fusion mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811289502.5A CN109472232B (en) 2018-10-31 2018-10-31 Video semantic representation method, system and medium based on multi-mode fusion mechanism

Publications (2)

Publication Number Publication Date
CN109472232A true CN109472232A (en) 2019-03-15
CN109472232B CN109472232B (en) 2020-09-29

Family

ID=65666408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811289502.5A Active CN109472232B (en) 2018-10-31 2018-10-31 Video semantic representation method, system and medium based on multi-mode fusion mechanism

Country Status (1)

Country Link
CN (1) CN109472232B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7920761B2 (en) * 2006-08-21 2011-04-05 International Business Machines Corporation Multimodal identification and tracking of speakers in video
CN103778443A (en) * 2014-02-20 2014-05-07 公安部第三研究所 Method for achieving scene analysis description based on theme model method and field rule library
CN104090971A (en) * 2014-07-17 2014-10-08 中国科学院自动化研究所 Cross-network behavior association method for individual application
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU ZHENG ET.AL: "MMDF-LDA: An improved Multi-Modal Latent Dirichlet Allocation model for social image annotation", 《EXPERT SYSTEMS WITH APPLICATIONS》 *
QIN JIN ET.AL: "Describing Videos using Multi-modal Fusion", 《PROCEEDINGS OF THE 24TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA》 *
张德 等 (ZHANG De et al.): "基于语义空间统一表征的视频多模态内容分析技术" (Video multi-modal content analysis technology based on a unified semantic-space representation), 《电视技术》 (Video Engineering) *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162669A (en) * 2019-04-04 2019-08-23 腾讯科技(深圳)有限公司 Visual classification processing method, device, computer equipment and storage medium
CN110162669B (en) * 2019-04-04 2021-07-02 腾讯科技(深圳)有限公司 Video classification processing method and device, computer equipment and storage medium
CN110046279A (en) * 2019-04-18 2019-07-23 网易传媒科技(北京)有限公司 Prediction technique, medium, device and the calculating equipment of video file feature
CN110046279B (en) * 2019-04-18 2022-02-25 网易传媒科技(北京)有限公司 Video file feature prediction method, medium, device and computing equipment
CN110234018A (en) * 2019-07-09 2019-09-13 腾讯科技(深圳)有限公司 Multimedia content description generation method, training method, device, equipment and medium
CN110234018B (en) * 2019-07-09 2022-05-31 腾讯科技(深圳)有限公司 Multimedia content description generation method, training method, device, equipment and medium
CN110390311A (en) * 2019-07-27 2019-10-29 苏州过来人科技有限公司 A kind of video analysis algorithm based on attention and subtask pre-training
CN110399934A (en) * 2019-07-31 2019-11-01 北京达佳互联信息技术有限公司 A kind of video classification methods, device and electronic equipment
CN110647804A (en) * 2019-08-09 2020-01-03 中国传媒大学 Violent video identification method, computer system and storage medium
CN110489593A (en) * 2019-08-20 2019-11-22 腾讯科技(深圳)有限公司 Topic processing method, device, electronic equipment and the storage medium of video
CN110580509A (en) * 2019-09-12 2019-12-17 杭州海睿博研科技有限公司 multimodal data processing system and method for generating countermeasure model based on hidden representation and depth
CN110674348A (en) * 2019-09-27 2020-01-10 北京字节跳动网络技术有限公司 Video classification method and device and electronic equipment
CN110674348B (en) * 2019-09-27 2023-02-03 北京字节跳动网络技术有限公司 Video classification method and device and electronic equipment
WO2021134277A1 (en) * 2019-12-30 2021-07-08 深圳市优必选科技股份有限公司 Emotion recognition method, intelligent device, and computer-readable storage medium
CN113094550B (en) * 2020-01-08 2023-10-24 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN113094550A (en) * 2020-01-08 2021-07-09 百度在线网络技术(北京)有限公司 Video retrieval method, device, equipment and medium
CN111414959B (en) * 2020-03-18 2024-02-02 南京星火技术有限公司 Image recognition method, device, computer readable medium and electronic equipment
CN111401259B (en) * 2020-03-18 2024-02-02 南京星火技术有限公司 Model training method, system, computer readable medium and electronic device
CN111414959A (en) * 2020-03-18 2020-07-14 南京星火技术有限公司 Image recognition method and device, computer readable medium and electronic equipment
CN111401259A (en) * 2020-03-18 2020-07-10 南京星火技术有限公司 Model training method, system, computer readable medium and electronic device
CN111235709A (en) * 2020-03-18 2020-06-05 东华大学 Online detection system for spun yarn evenness of ring spinning based on machine vision
CN111723239A (en) * 2020-05-11 2020-09-29 华中科技大学 Multi-mode-based video annotation method
CN111723239B (en) * 2020-05-11 2023-06-16 华中科技大学 Video annotation method based on multiple modes
JP7334395B2 (en) 2021-03-05 2023-08-29 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Video classification methods, devices, equipment and storage media
JP2022135930A (en) * 2021-03-05 2022-09-15 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Video classification method, apparatus, device, and storage medium
WO2022198854A1 (en) * 2021-03-24 2022-09-29 北京百度网讯科技有限公司 Method and apparatus for extracting multi-modal poi feature
CN113177914B (en) * 2021-04-15 2023-02-17 青岛理工大学 Robot welding method and system based on semantic feature clustering
CN113177914A (en) * 2021-04-15 2021-07-27 青岛理工大学 Robot welding method and system based on semantic feature clustering
CN113239184A (en) * 2021-07-09 2021-08-10 腾讯科技(深圳)有限公司 Knowledge base acquisition method and device, computer equipment and storage medium
CN113806609B (en) * 2021-09-26 2022-07-12 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN113806609A (en) * 2021-09-26 2021-12-17 郑州轻工业大学 Multi-modal emotion analysis method based on MIT and FSM
CN113903358B (en) * 2021-10-15 2022-11-04 贝壳找房(北京)科技有限公司 Voice quality inspection method, readable storage medium and computer program product
CN113903358A (en) * 2021-10-15 2022-01-07 北京房江湖科技有限公司 Voice quality inspection method, readable storage medium and computer program product
CN117933269A (en) * 2024-03-22 2024-04-26 合肥工业大学 Multi-mode depth model construction method and system based on emotion distribution

Also Published As

Publication number Publication date
CN109472232B (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN109472232A Video semantic representation method, system and medium based on multi-modal fusion mechanism
Yao et al. Exploring visual relationship for image captioning
Ge et al. Facial expression recognition based on deep learning
Kontschieder et al. Structured class-labels in random forests for semantic image labelling
CN108829677A (en) A kind of image header automatic generation method based on multi-modal attention
Banica et al. Video object segmentation by salient segment chain composition
Bin et al. Adaptively attending to visual attributes and linguistic knowledge for captioning
CN108427740B (en) Image emotion classification and retrieval algorithm based on depth metric learning
GB2565775A (en) A Method, an apparatus and a computer program product for object detection
Zhu et al. Efficient action detection in untrimmed videos via multi-task learning
He et al. Moving object recognition using multi-view three-dimensional convolutional neural networks
CN110956158A (en) Pedestrian shielding re-identification method based on teacher and student learning frame
CN108154156B (en) Image set classification method and device based on neural topic model
Islam et al. A review on video classification with methods, findings, performance, challenges, limitations and future work
Liu et al. Robust salient object detection for RGB images
CN107506792A (en) A kind of semi-supervised notable method for checking object
Kindiroglu et al. Temporal accumulative features for sign language recognition
Meena et al. A review on video summarization techniques
Tighe et al. Scene parsing with object instance inference using regions and per-exemplar detectors
Liu et al. Application of gcForest to visual tracking using UAV image sequences
Li et al. Research on efficient feature extraction: Improving YOLOv5 backbone for facial expression detection in live streaming scenes
Luo et al. An optimization framework of video advertising: using deep learning algorithm based on global image information
Wu et al. Self-learning and explainable deep learning network toward the security of artificial intelligence of things
Yang et al. Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system
Shi et al. Uncertain and biased facial expression recognition based on depthwise separable convolutional neural network with embedded attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210506

Address after: Room 1605, Kangzhen building, 18 Louyang Road, Suzhou Industrial Park, 215000, Jiangsu Province

Patentee after: Suzhou Wuyun pen and ink Education Technology Co.,Ltd.

Address before: No.1 Daxue Road, University Science Park, Changqing District, Jinan City, Shandong Province

Patentee before: SHANDONG NORMAL University