CN104778224A - Target object social relation identification method based on video semantics - Google Patents

Target object social relation identification method based on video semantics

Info

Publication number
CN104778224A
CN104778224A (application CN201510137760.1A; granted as CN104778224B)
Authority
CN
China
Prior art keywords
person
node
semantic
scene
semantic information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510137760.1A
Other languages
Chinese (zh)
Other versions
CN104778224B (en)
Inventor
陈志
高翔
岳文静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fan Liyang
Li Bo
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201510137760.1A (granted as CN104778224B)
Publication of CN104778224A
Application granted
Publication of CN104778224B
Status: Expired - Fee Related
Anticipated expiration


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a target object social relation identification method based on video semantics. The method comprises the following steps: first, the video data input by a user is preprocessed to obtain shot image-frame sequences, and a key frame is extracted from each sequence; feature vectors of the key frames are analyzed with an SVM learning model, and the resulting shot semantics are stored in the shot nodes of an XML file; according to the time node of each shot and the semantic node corresponding to each person, the shot nodes that share identical person semantic nodes are grouped together; each group of classified shot nodes is stored into a scene node of the XML file in increasing order of the value of the node named time, so that shot semantic sequences representing scenes are constructed in turn; finally, the semantic information and social relations of the persons in each scene are stored in scene semantic matrices, one matrix per scene, and the person semantic information and social relations in all scene semantic matrices are merged, by taking unions, into one large matrix representing the semantics of the whole video.

Description

A target object social relation identification method based on video semantics
Technical field
The present invention relates to a method for identifying social relations: by performing semantic analysis on a video, it identifies the social relations implied therein. The invention belongs to the intersecting technical fields of image processing, social networks, and software engineering.
Background technology
Social media refers to the websites and technologies that allow people to write, share, evaluate, discuss, and communicate with one another. It is the set of tools and platforms people use to share suggestions, opinions, experiences, and viewpoints with each other; at present it mainly comprises social networking sites, microblogs, WeChat, blogs, and the like. Social media is a kind of cloud service: cloud computing technology is widely applied, and social media is in essence a Web application based on cloud computing.
Social networks and social networking services: as the images people project of themselves on the network gradually became more complete, social networking services appeared. The main representative products abroad are Facebook and Twitter; the main domestic representatives are Renren, Kaixin001, Sina Weibo, and others. The current hot technologies in social networking combine cloud computing, e-commerce, and affective-computing techniques. A social relation refers to an interpersonal relation formed in social interaction, and also to the relation between friends in a social network. According to the frequency of contact and interaction between users, social relations fall into two broad classes: strong ties and weak ties.
The structure of a video can generally be divided into four levels, from high to low: video sequence, scene, shot, and frame. A video sequence usually refers to an independent video file or a video clip and is composed of several scenes. Each scene comprises one or more semantically related shots, which may be consecutive or separated. Each shot contains a number of consecutive image frames. Video semantic extraction can be decomposed into segmenting the video into shots and performing image semantic extraction on the segmented shots. Block-based shot segmentation can group content-related frames into shots and then choose key frames within each shot to represent it; image semantic extraction is one of the key steps of video semantic extraction and mainly comprises the detection and classification of target objects and the extraction of their semantics.
An SVM (support vector machine) is a supervised learning model, in essence a classifier, commonly used for pattern recognition, classification, and regression analysis. It shows many distinctive advantages in solving small-sample, nonlinear, and high-dimensional pattern recognition problems. The linearly separable case is analyzed directly; for the linearly inseparable case, a nonlinear mapping converts the samples of the low-dimensional, linearly inseparable input space into a high-dimensional feature space where they become linearly separable, which makes it possible to apply linear algorithms in the high-dimensional feature space to analyze the nonlinear characteristics of the samples.
XML, as an extensible markup language, is an important software technology. It can manage information in a flexible, structured way, describing the structural relations of the information itself through hierarchies of different nodes. Dom4j is an open-source XML parsing package produced by dom4j.org; with dom4j, a user can read and write every node of an XML file.
The present invention uses video processing, SVM, XML, and related technologies to solve the problem of identifying the social relations of target objects.
Summary of the invention
Technical problem: the object of the present invention is to provide a target object social relation identification method based on video semantics. Social media resources contain rich semantic information, and video is an important source of social semantics; at present, however, identification mainly relies on people recognizing and labeling the relevant social semantics, and effective techniques that mine, through software analysis of semantic information, the social relations implied by the target objects in a video are lacking. The object of the invention is to solve this problem.
Technical scheme: the present invention first preprocesses the video data, segmenting it into a series of shots and obtaining the semantic sets of the target objects in the shots; then, according to the content and temporal relations of the shots, related shots are composed into semantic sequences that form specific scenes; finally, the social relations between target objects are mined by analyzing the semantic information of the scenes.
The target object social relation identification method based on video semantics proposed by the present invention comprises the following steps:
Step 1) First, the video data input by the user is preprocessed. The concrete processing flow is as follows:
Step 1.1) The video data is segmented with a block-based comparison method to obtain the shots of the video. The block-based comparison method divides the image of each frame of the video data into a user-specified number of region blocks and partitions the video into different shots by comparing the similarity of the region blocks between consecutive frames; here a frame, i.e. one frame image, is the smallest unit of video data, and a shot is a group of consecutive frames in the video. The features and concrete criteria used for the similarity of the region blocks are specified by the user; the region blocks between consecutive frames of the same shot are similar.
Step 1.2) From each shot, the frame in the middle of the shot's frame sequence is extracted as the key frame; this key frame represents the shot in subsequent processing.
Step 2) The semantic sets of the target objects specified by the user are extracted from all key frames, converted into key-value-pair form, and saved in a file in XML format. The target objects comprise two classes, background objects and foreground objects: foreground objects are person objects, and background objects are the place where the persons are located and the time information. The semantic set is the set of semantic information extracted for the target objects in the video, comprising the semantics of background, time, dialogue, person, color, shape, and texture. The XML file comprises three layers of nested nodes: the first layer is the scene node, marked with the <scene> tag, where a scene refers to a group of shot sequences composed according to the semantic information of the shots and the temporal relations between them; the second layer is the shot node, marked with the <short> tag; the third layer is the concrete semantic node, marked with the <key> tag. The concrete processing flow for extracting the semantic set of the user-specified target objects in each key frame is as follows:
Step 2.1) Target-object detection and classification are performed on the key frame, extracting all target objects the key frame contains; at the same time, the dialogue information between persons in the key frame and the time point at which the key frame appears in the playback of the video are recorded.
Step 2.2) The visual features of all foreground and background objects of the key frame are extracted to form the corresponding feature vectors. The visual features of background objects comprise color and texture; the visual features of foreground objects comprise color, texture, and shape.
Step 2.3) The feature vectors of the target objects in the key frame are learned with an SVM, extracting the semantic information of the foreground and background objects. The semantic information of a foreground object is the semantic information of its visual and behavioral appearance, comprising color, shape, texture, person, and dialogue; the semantic information obtained for a background object is the semantic information of the environment it is in, comprising background and time. The SVM is a supervised learning model.
Step 2.4) The semantic information of the foreground and background objects of the key frame thus obtained is saved, in key-value-pair form, under the shot node of the XML file.
Step 3) The time under each shot node obtained in step 2) and the semantic nodes corresponding to the persons are analyzed, and the shot nodes that have identical person semantic nodes are grouped into one group of shot nodes. A person semantic node is the key-value pair in a <key> node under a <short> node of the XML file whose name attribute is person.
Step 4) The data of each classified group of shot nodes is saved under a scene node of the XML file in increasing order of the value of the node named time, constructing shot semantic sequences one after another, each representing one scene.
Step 5) Each scene node in the XML file is analyzed in turn, all the semantic information it contains is analyzed, and the semantic information of the persons and of the relations between persons is obtained. This information is saved, scene by scene, into matrices: the elements of each row or column of a matrix store the semantic information of one person and of the relations between that person and the other persons, and the row and column numbers of each person in the matrix are specified by the user. The scene semantic information comprises the semantic information of the persons and of the social relations between persons. The concrete processing flow for saving the social relations and the semantic information of the persons into a matrix is as follows:
Step 5.1) The semantic nodes in all shot nodes under a scene node of the XML file are extracted, obtaining all the semantic information of the scene.
Step 5.2) The semantic information of the persons is found among all the semantic information obtained for the scene, and a matrix is built accordingly. Apart from the diagonal elements, when the row number of a row element equals the column number of a column element, the row element and the column element represent the social relations of the same person; the diagonal elements, whose row and column numbers are equal, store the semantic information of the corresponding person.
Step 5.3) The elements of the matrix corresponding to the scene are assigned: from all the semantic information obtained in step 5.1), the social relations between persons and the semantic information of the persons are extracted; the social relations and the person semantic information are then saved in turn in HashMap sets, and the sets are assigned to the elements at the corresponding positions of the matrix. A HashMap set is a data collection used to store key-value pairs.
Step 6) A matrix representing the semantic information of the whole video is obtained from all the matrices representing scene semantics. This matrix stores the semantic information and social relations of all persons in the video; the elements of each row or column of the matrix store the semantic information of one person and of the relations between that person and the other persons, and the row and column numbers of each person in the matrix are specified by the user. The concrete flow is as follows:
Step 6.1) From all the scene semantic matrices, all persons and the person semantic information set (a HashMap) corresponding to each person are extracted; the semantic information sets of the same person are united in turn, merged, and saved into one HashMap set, which is then saved into the corresponding diagonal element of the matrix.
Step 6.2) From all the scene semantic matrices, the social relations between persons are extracted; according to each person's row or column number in the matrix, the social relation sets of the same person are united in turn, merged, and saved into one HashMap set, which is then saved at that person's position in the matrix.
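Steps 1.1) and 1.2) can be sketched as follows. This is a toy illustration under strong assumptions: each frame is already reduced to a list of per-region-block features (here, mean intensities), and similarity is a simple absolute difference with an invented tolerance and threshold — the actual features and criteria are, as the text says, specified by the user.

```python
def similar_fraction(blocks_a, blocks_b, tol=10.0):
    # Fraction of corresponding region blocks whose feature difference is within tol.
    hits = sum(1 for a, b in zip(blocks_a, blocks_b) if abs(a - b) <= tol)
    return hits / len(blocks_a)

def segment_shots(frames, threshold=0.5, tol=10.0):
    # Start a new shot whenever too few region blocks stay similar between
    # consecutive frames (the block-based comparison of step 1.1).
    shots, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if similar_fraction(prev, cur, tol) >= threshold:
            current.append(cur)
        else:
            shots.append(current)
            current = [cur]
    shots.append(current)
    return shots

def key_frame(shot):
    # The frame in the middle of the shot's frame sequence (step 1.2).
    return shot[len(shot) // 2]

# Eight frames, each reduced to three per-block mean intensities; the jump
# from 10 to 200 marks a shot boundary:
shots = segment_shots([[10, 10, 10]] * 3 + [[200, 200, 200]] * 5)
```

With this input, the sequence splits into two shots of three and five frames, and each shot's middle frame becomes its key frame.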
Beneficial effects: the present invention first stores the content of the video in a structured form, which makes it convenient for a computer to recognize and analyze the video semantics, so that the social relations implied in the video can be effectively inferred, broadening the ways of mining social relations. Specifically, the method of the present invention has the following beneficial effects:
(1) The present invention uses XML technology to convert the content of the video into a structured data format, which is convenient for preserving the inherent semantic information contained in the video and provides a structural basis for the subsequent parsing of the semantic information and social relations contained in the video.
(2) The present invention extracts video semantics from multiple angles and can obtain rich semantic information, providing an ample and accurate content basis for the subsequent analysis of the social relations contained in the video.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the present invention;
Fig. 2 is the structural diagram of the scene semantic information matrix of the present invention.
Embodiment
The specific implementation of the present invention is described in more detail below in conjunction with Fig. 1.
One. Extracting the semantics of the target objects in the shots
First, target-object detection and classification are performed on each shot: foreground and background objects are extracted from the shot, together with their visual and behavioral features; the time and the persons' dialogue information are recorded; and the semantics analyzed from the feature vectors of the target objects are saved, in key-value-pair form, in the <key> nodes under the <short> node of the XML file, as follows:
When the visual and behavioral features are extracted, the three aspects of color, texture, and shape are taken as the starting point; in the XML file, the semantic information obtained from these three aspects is represented by the values of <key> nodes named color, shape, and texture respectively. Time is used to distinguish different shots and also provides clues for the construction of the semantic sequences below; it is represented by a <key> node named time. The target objects here mainly refer to persons, represented by a <key> node named object. The dialogue in a shot is represented by a <key> node named dialogue, whose obj_name attribute indicates which target object the feature node belongs to.
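A hypothetical sketch of one shot under the three-layer structure just described (the concrete listings referenced by "as follows" appear only as figures in the original patent; every value below is invented for illustration):

```xml
<video>
  <scene>
    <short>
      <key name="time">00:01:12</key>
      <key name="object">A</key>
      <key name="color" obj_name="A">red</key>
      <key name="shape" obj_name="A">upright figure</key>
      <key name="texture" obj_name="A">smooth</key>
      <key name="dialogue" obj_name="A">Let's see a film together this evening!</key>
    </short>
  </scene>
</video>
```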
In the process of converting shots into the <short> nodes of the XML file, the SVM model is trained on the visual and behavioral feature vectors inside the shot samples to classify the target objects; the SVM model is then applied to the feature vectors of the shots outside the sample set, and, with the same conversion as above, the semantic information of each shot is saved under a <short> node of the XML file.
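The role of the SVM here can be illustrated with a minimal linear SVM trained by sub-gradient descent on the hinge loss. The feature vectors and labels are invented stand-ins for the color/texture/shape vectors of step 2.2), and a real system would use a mature SVM library rather than this sketch:

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.01, lam=0.01):
    # Sub-gradient descent on the L2-regularized hinge loss; labels are +1/-1.
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # inside the margin: hinge term is active
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:            # outside the margin: only the regularizer acts
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b

def predict(w, b, x):
    # Sign of the decision function: +1 (e.g. foreground) or -1 (background).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Invented 2-D stand-ins for foreground (+1) vs. background (-1) feature vectors:
samples = [[2.0, 2.0], [3.0, 1.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0], [-1.0, -3.0]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
```

After training on the labeled sample shots, the same `predict` is applied to the feature vectors of the remaining shots, mirroring the train-then-apply flow described above.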
Two. Constructing scenes from shot sequences
Each scene is composed of a group of associated shots, and the shot sequences are constructed according to the temporal order and the connections between target objects. Since all shots have been converted into <short> nodes in the corresponding XML file above, constructing a scene amounts to parsing this group of <short> nodes with dom4j: the <key> nodes inside the <short> nodes whose name attribute is object are matched, all <short> nodes with the same value are found, and they are saved, in increasing order of time, under a <scene> node of the XML file; this group of shot sequences represents one scene of the video. As follows:
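The grouping just described (the patent names dom4j, a Java library) can be sketched with Python's `xml.etree.ElementTree` as a stand-in; the shot data and the simplification that each <short> names a single person are assumptions of this sketch:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SHOTS = """<video>
  <short><key name="time">1</key><key name="object">A</key></short>
  <short><key name="time">3</key><key name="object">B</key></short>
  <short><key name="time">2</key><key name="object">A</key></short>
</video>"""

def build_scenes(xml_text):
    # Group <short> nodes by the value of their 'object' key and emit one
    # <scene> per group, with the shots ordered by their 'time' key.
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for short in root.findall("short"):
        keys = {k.get("name"): k.text for k in short.findall("key")}
        groups[keys["object"]].append((int(keys["time"]), short))
    out = ET.Element("video")
    for person in sorted(groups):
        scene = ET.SubElement(out, "scene")
        for _, short in sorted(groups[person], key=lambda pair: pair[0]):
            scene.append(short)
    return out

scenes = build_scenes(SHOTS)
```

On this input, the two shots of person A (times 1 and 2) end up in one <scene> in time order, and the shot of person B forms a second <scene>.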
Three. Analyzing the scene semantics
Scene semantics is built with a matrix, in fact an n-by-n square matrix, where n is the number of all foreground objects, i.e. persons, contained in the video. The scenes of the video have been represented above by the <scene> nodes of the XML file; on this basis dom4j is used to parse the XML file, and the parsed content is saved into the constructed matrices. Each element of a matrix is a HashMap set that stores the key-value-pair data under the <short> nodes of the corresponding scene in the XML file; the <scene> node information of the XML file is parsed in a loop, and all the shot semantic information of each scene is saved into the corresponding HashMaps.
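The scene semantic matrix can be sketched as an n-by-n array whose cells are key-value sets (the text's HashMap; a plain dict in this Python stand-in). The keys and values below are invented for illustration:

```python
def empty_scene_matrix(persons):
    # n-by-n matrix of independent key-value sets: diagonal cells hold a
    # person's own semantics, off-diagonal cells the relation semantics
    # between the row person and the column person.
    n = len(persons)
    return [[{} for _ in range(n)] for _ in range(n)]

persons = ["A", "B"]            # row/column assignment chosen by the user
m = empty_scene_matrix(persons)
m[0][0]["expression"] = "smiling"          # A's own semantics (diagonal)
m[0][1]["dialogue"] = "movie invitation"   # relation semantics A -> B
```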
Four. The scene semantic matrices
Each scene node in the XML file is analyzed in turn, all the semantic information it contains is analyzed, and the semantic information of the persons and of the relations between persons is obtained. The semantic information of each scene is saved into a matrix, one scene at a time. The scene semantic information comprises the semantic information of the persons and of the social relations between persons. The concrete processing flow for saving the social relations and the semantic information of the persons into a matrix is as follows:
(1) The semantic nodes in all shot nodes under a scene node of the XML file are analyzed, obtaining all the semantic information.
(2) The semantic information of the persons is found from the semantic information obtained, and a matrix is built accordingly. Apart from the diagonal elements, the two groups of elements in the matrix with equal row and column numbers represent the social relations of the same person; the diagonal elements store the semantic information of the corresponding persons.
(3) The elements of the matrix are assigned: all the semantic information obtained is analyzed, the semantic information of the persons and of the social relations between persons is obtained, and the social relations and the semantic information of the corresponding persons are then saved in turn with HashMap sets. A HashMap set is a data collection used to store key-value pairs. Finally, the sets are assigned in turn to the elements at the corresponding positions of the matrix.
Five. The video semantic matrix
A matrix representing the semantic information of the whole video is obtained from all the matrices representing scene semantics. This matrix stores the semantic information and social relations of all persons in the video. The elements of each row or column of the matrix store the semantic information of one person and of the relations between that person and the other persons, and the row and column numbers of each person in the matrix are specified by the user. The concrete flow is as follows:
(1) From all the scene semantic matrices, all persons and the person semantic information set (a HashMap) corresponding to each person are extracted. The semantic information sets of the same person are united in turn and merged into one large set, which is then saved into the corresponding diagonal element of the matrix.
(2) From all the scene semantic matrices, the social relation sets (HashMaps) between persons are extracted. Grouped by person, the social relation sets of the same person are united in turn and merged into one large set. According to each person's row or column number in the matrix, the large set of each person's social relations is saved at the corresponding position in the matrix.
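The union-merging in steps (1) and (2) above can be sketched as follows; `dict.update` stands in for the HashMap union, and the choice that the later scene wins on clashing keys is an assumption the text does not settle:

```python
def merge_scene_cells(cells):
    # Union of the per-scene key-value sets for one matrix position; on a
    # key clash the value from the later scene overwrites the earlier one.
    merged = {}
    for cell in cells:
        merged.update(cell)
    return merged

# Two scenes contribute different keys to the same matrix position:
video_cell = merge_scene_cells([{"place": "bus stop"}, {"time": "evening"}])
```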
The embodiment is further elaborated below with a concrete case.
Suppose there is a video showing four people waiting for a bus at a bus stop; the four persons are denoted A, B, C, and D respectively, and the video comprises only two scenes.
(1) Scene one: A smiles and says to B, "Let's see a film together this evening!"; B then smiles and answers, "Dear, anything is fine as long as you are happy."
(2) Scene two: C and D are each looking at their own phones, without any exchange between them.
This video has four target persons, so the semantics are stored with a 4-by-4 square matrix, with the four person targets represented by A, B, C, and D. The first row of the matrix stores the semantic information arising between A and the others, and so on for the rows below. The value of each matrix element stores the semantic information extracted from the video for the target person of its row and the target person of its column.
Scene one comprises a sequence of two shots:
1) A smiles and says to B: "Let's see a film together this evening!"
2) B smiles and answers: "Dear, anything is fine as long as you are happy."
Scene two comprises one shot: 1) C and D are looking at their phones and never exchange a word.
In scene one, the shots are converted into an XML file as follows:
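The XML listing itself appears only as a figure in the original patent; the following is a hypothetical reconstruction for scene one, with all node values invented:

```xml
<scene>
  <short>
    <key name="time">1</key>
    <key name="object">A</key>
    <key name="expression" obj_name="A">smiling</key>
    <key name="dialogue" obj_name="A">Let's see a film together this evening!</key>
  </short>
  <short>
    <key name="time">2</key>
    <key name="object">B</key>
    <key name="expression" obj_name="B">smiling</key>
    <key name="dialogue" obj_name="B">Dear, anything is fine as long as you are happy.</key>
  </short>
</scene>
```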
The structure of the scene semantic information matrix is shown in Fig. 2.
Analysis of the social relations:
(1) The relation between A and B is close.
(2) The relation between C and D is that of strangers.
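The worked example above can be sketched end-to-end: two scene matrices are filled and merged into the video matrix, and a relation is read off each cell. All dict keys and the "dialogue implies closeness" rule are illustrative assumptions, not part of the patent's text:

```python
PERSONS = ["A", "B", "C", "D"]        # row/column order chosen by the user
IDX = {p: i for i, p in enumerate(PERSONS)}

def empty(n):
    return [[{} for _ in range(n)] for _ in range(n)]

# Scene one: A and B smile at each other and exchange dialogue.
scene1 = empty(4)
scene1[IDX["A"]][IDX["B"]] = {"dialogue": "movie invitation", "expression": "smiling"}
scene1[IDX["B"]][IDX["A"]] = {"dialogue": "affectionate reply", "expression": "smiling"}

# Scene two: C and D look at their phones and never speak.
scene2 = empty(4)
scene2[IDX["C"]][IDX["D"]] = {"interaction": "none"}
scene2[IDX["D"]][IDX["C"]] = {"interaction": "none"}

# Video matrix: element-wise union of the scene matrices (step 6).
video = empty(4)
for scene in (scene1, scene2):
    for i in range(4):
        for j in range(4):
            video[i][j].update(scene[i][j])

def relation(p, q):
    # Crude illustrative rule: exchanged dialogue marks a close relation.
    cell = video[IDX[p]][IDX[q]]
    return "close" if "dialogue" in cell else "distant"
```

Reading the merged matrix back with this rule reproduces the conclusions above: A and B are close, while C and D are strangers to each other.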

Claims (1)

1., based on a destination object social networks recognition methods for video semanteme, it is characterized in that the step that the method comprises is:
Step 1) first pre-service is carried out to the video data of user's input, concrete treatment scheme is as follows:
Step 1.1) utilize block-based comparative approach to split video data, obtain the camera lens of this video data, described block-based comparative approach is the region unit image of each frame of video data being divided into user's specified quantity, different camera lenses is marked off by the similarity comparing the region unit between successive frame, wherein frame is a least unit i.e. two field picture of video data, camera lens is one group of continuous print frame sequence in video, the characteristic sum specific standards of the similarity of described region unit is specified by user, region unit between the successive frame of same camera lens has similarity,
Step 1.2) from each camera lens, extraction is in that frame in this camera lens frame sequence centre position as key frame successively, and this key frame represents this camera lens in subsequent treatment;
Step 2) extract the semanteme collection of the destination object that in all key frames, user specifies, form semanteme collection being converted to key-value pair is saved in the file of XML format; Described destination object comprises background object and foreground object two class, and foreground object is who object, and background object is place residing for personage, temporal information; Described semanteme collection is the set of the semantic information that in video, destination object extracts, and comprises the semanteme of background, time, dialogue, personage, color, shape, texture; The file of described XML format comprises 3 layers of nested node, ground floor is scene node, use <scene> labeled marker, described scene refers to the arrangement of mirrors header sequence according to the sequential relationship composition between the semantic information of camera lens and camera lens; The second layer is camera lens node, uses <short> labeled marker; Third layer is concrete semantic node, uses <key> labeled marker; The concrete treatment scheme extracting the semanteme collection of the destination object that in each key frame, user specifies is as follows:
Step 2.1) key frame is carried out to detection and the classification of destination object, extract all destination objects that this key frame packet contains, record the time point that dialog information in this key frame between personage and this key frame are arranged in the broadcasting of video simultaneously;
Step 2.2) extract the visual signature of all prospects of key frame and background object, form corresponding proper vector, the visual signature of described background object comprises color, texture; The visual signature of foreground object comprises color, texture, shape;
Step 2.3) with SVM, the proper vector of destination object in key frame is learnt, extract the semantic information of foreground object and background object; The semantic information of described foreground object is the semantic information of the visual behaviour performance of foreground object, comprises color, shape, texture, personage, dialogue; The semantic information that described background object is got is the environment semantic information residing for background object, comprises background, time, and described SVM is a kind of learning model having supervision;
Step 2.4) by the foreground object of key frame of acquisition and the semantic information of background object, be saved in the camera lens node of XML file according to the form of key-value pair under;
Step 3) analyzing step 2) node lower time of each camera lens of obtaining and the semantic node corresponding to personage, the camera lens node having the semantic node of identical personage is classified as an arrangement of mirrors head node; The semantic node of described personage be exactly in XML file under <short> node in <key> node name attribute be that key-value pair of personage;
Step 4) data of every arrangement of mirrors head node of having classified are called according to name the incremental order of the nodal value of time node is saved in the scene node of XML file under, construct camera lens semantic sequence successively, represent scene one by one;
Step 5) each scene node in analyzing XML file successively, analyze all semantic informations that it comprises, obtain the semantic information of relation between personage and personage, these information of each scene are saved in successively one by one in matrix, the element of every a line of these matrixes or each row stores the semantic information of relation between a personage and other personages and this personage, and the line number of each personage in a matrix or row number are specified by user; Described Scene Semantics information comprises the semantic information of social networks between personage and personage, and the concrete the treatment scheme wherein social networks of personage and the semantic information of personage being saved in a matrix is as follows:
Step 5.1) extract under XML file Scene node all camera lens nodes in semantic node, obtain the semantic information that this scene is all;
Step 5.2) from obtaining the semantic information finding out personage all semantic informations of scene, set up a matrix according to this, in matrix except diagonal entry, when line number and a column element of a row element arranges number identical, this row element and a column element represent the social networks of same personage, and cornerwise element preserves the semantic information of this personage; The line number of described cornerwise element and row are number identical;
Step 5.3) assignment is carried out to the element of scene homography, from step 5.1) all semantic informations of obtaining, social networks between extraction personage and the semantic information of personage, social networks and the semantic information of personage is preserved again successively, by aggregate assignment to the element of the correspondence position of matrix with set HashMap; Described set HashMap is a data acquisition being used for depositing key-value pair;
Step 6) obtain according to all matrixes representing Scene Semantics the matrix that represents video semanteme information; This matrix preserves semantic information and the social networks of all personages in video, the element of every a line of matrix or each row stores the semantic information of relation between a personage and other personages and this personage, the line number of each personage in a matrix or row number are specified by user, wherein each personage line number in a matrix or row number are specified by user, and idiographic flow is as follows:
Step 6.1) from the matrix of all Scene Semantics, extract all personages and each personage corresponding personage's semantic information set HashMap, successively union is got in these semantic information set, merge and be saved in a HashMap set, then the HashMap set after this merging is saved in the corresponding diagonal entry of matrix;
Step 6.2) From all the scene semantic matrices, extract the social relations between persons; according to each person's row or column number in the matrix, take the union of the relation sets belonging to the same person in turn, merge them into a single HashMap set, and store the merged HashMap set at that person's position in the matrix.
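The union-merge of Steps 6.1)–6.2) can be sketched as below. This is a minimal illustration under stated assumptions: every scene matrix is already n-by-n and aligned to the user-assigned global row/column numbers (real scenes contain only subsets of persons, so alignment is omitted here), and on a duplicate key the later scene's value wins, which is one possible convention the patent does not specify.

```python
def merge_scene_matrices(matrices, n):
    """Union-merge the per-cell key-value sets of all scene matrices
    into one n-by-n matrix representing the whole video's semantics."""
    video = [[{} for _ in range(n)] for _ in range(n)]
    for m in matrices:
        for i in range(n):
            for j in range(n):
                video[i][j].update(m[i][j])  # union of key-value pairs
    return video

# Two hypothetical 2x2 scene matrices for the same two persons.
scene1 = [[{"gender": "male"}, {"relation": "colleague"}],
          [{"relation": "colleague"}, {"gender": "female"}]]
scene2 = [[{"mood": "happy"}, {}],
          [{}, {"mood": "sad"}]]
video = merge_scene_matrices([scene1, scene2], 2)
```

After merging, each diagonal cell carries the union of that person's semantics across scenes (Step 6.1), and each off-diagonal cell carries the union of the pair's relations (Step 6.2).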
CN201510137760.1A 2015-03-26 2015-03-26 Target object social relation identification method based on video semantics Expired - Fee Related CN104778224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510137760.1A CN104778224B (en) 2015-03-26 2015-03-26 Target object social relation identification method based on video semantics


Publications (2)

Publication Number Publication Date
CN104778224A true CN104778224A (en) 2015-07-15
CN104778224B CN104778224B (en) 2017-11-14

Family

ID=53619688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510137760.1A Expired - Fee Related CN104778224B (en) 2015-03-26 2015-03-26 Target object social relation identification method based on video semantics

Country Status (1)

Country Link
CN (1) CN104778224B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294783A (en) * 2016-08-12 2017-01-04 乐视控股(北京)有限公司 A kind of video recommendation method and device
CN106778537A (en) * 2016-11-28 2017-05-31 中国科学院心理研究所 A kind of collection of animal social network structure and analysis system and its method based on image procossing
CN107909038A (en) * 2017-11-16 2018-04-13 北京邮电大学 A kind of social networks disaggregated model training method, device, electronic equipment and medium
CN107992598A (en) * 2017-12-13 2018-05-04 北京航空航天大学 A kind of method that colony's social networks excavation is carried out based on video data
CN109344285A (en) * 2018-09-11 2019-02-15 武汉魅瞳科技有限公司 A kind of video map construction and method for digging, equipment towards monitoring
CN109471959A (en) * 2018-06-15 2019-03-15 中山大学 Personage's social relationships discrimination method and system in image based on figure inference pattern
CN110070438A (en) * 2019-04-25 2019-07-30 上海掌门科技有限公司 A kind of credit score calculation method, equipment and storage medium
WO2019144840A1 (en) * 2018-01-25 2019-08-01 北京一览科技有限公司 Method and apparatus for acquiring video semantic information
CN110765314A (en) * 2019-10-21 2020-02-07 长沙品先信息技术有限公司 Video semantic structural extraction and labeling method
CN112132118A (en) * 2020-11-23 2020-12-25 北京世纪好未来教育科技有限公司 Character relation recognition method and device, electronic equipment and computer storage medium
CN114493905A (en) * 2020-11-13 2022-05-13 四川大学 Social relationship identification method based on multilevel feature fusion
CN117676187A (en) * 2023-04-18 2024-03-08 德联易控科技(北京)有限公司 Video data processing method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100332558A1 (en) * 2005-05-03 2010-12-30 Comcast Cable Communications, Llc Verification of Semantic Constraints in Multimedia Data and in its Announcement, Signaling and Interchange
CN102663095A (en) * 2012-04-11 2012-09-12 北京中科希望软件股份有限公司 Method and system for carrying out semantic description on audio and video contents


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAI L et al.: "Video semantic content analysis based on ontology", IEEE *
ZHU Huayu et al.: "MPEG-7-based video semantic description method", Journal of Nanjing University (Natural Science) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294783A (en) * 2016-08-12 2017-01-04 乐视控股(北京)有限公司 A kind of video recommendation method and device
CN106778537A (en) * 2016-11-28 2017-05-31 中国科学院心理研究所 A kind of collection of animal social network structure and analysis system and its method based on image procossing
CN107909038B (en) * 2017-11-16 2022-01-28 北京邮电大学 Social relationship classification model training method and device, electronic equipment and medium
CN107909038A (en) * 2017-11-16 2018-04-13 北京邮电大学 A kind of social networks disaggregated model training method, device, electronic equipment and medium
CN107992598A (en) * 2017-12-13 2018-05-04 北京航空航天大学 A kind of method that colony's social networks excavation is carried out based on video data
CN107992598B (en) * 2017-12-13 2022-03-15 北京航空航天大学 Method for mining social relation of group based on video material
WO2019144840A1 (en) * 2018-01-25 2019-08-01 北京一览科技有限公司 Method and apparatus for acquiring video semantic information
CN109471959A (en) * 2018-06-15 2019-03-15 中山大学 Personage's social relationships discrimination method and system in image based on figure inference pattern
CN109471959B (en) * 2018-06-15 2022-06-14 中山大学 Figure reasoning model-based method and system for identifying social relationship of people in image
CN109344285A (en) * 2018-09-11 2019-02-15 武汉魅瞳科技有限公司 A kind of video map construction and method for digging, equipment towards monitoring
CN109344285B (en) * 2018-09-11 2020-08-07 武汉魅瞳科技有限公司 Monitoring-oriented video map construction and mining method and equipment
CN110070438A (en) * 2019-04-25 2019-07-30 上海掌门科技有限公司 A kind of credit score calculation method, equipment and storage medium
CN110765314A (en) * 2019-10-21 2020-02-07 长沙品先信息技术有限公司 Video semantic structural extraction and labeling method
CN114493905A (en) * 2020-11-13 2022-05-13 四川大学 Social relationship identification method based on multilevel feature fusion
CN114493905B (en) * 2020-11-13 2023-04-07 四川大学 Social relationship identification method based on multilevel feature fusion
CN112132118A (en) * 2020-11-23 2020-12-25 北京世纪好未来教育科技有限公司 Character relation recognition method and device, electronic equipment and computer storage medium
CN112132118B (en) * 2020-11-23 2021-03-12 北京世纪好未来教育科技有限公司 Character relation recognition method and device, electronic equipment and computer storage medium
CN117676187A (en) * 2023-04-18 2024-03-08 德联易控科技(北京)有限公司 Video data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN104778224B (en) 2017-11-14

Similar Documents

Publication Publication Date Title
CN104778224A (en) Target object social relation identification method based on video semantics
CN112199375B (en) Cross-modal data processing method and device, storage medium and electronic device
Zhou et al. Contextual ensemble network for semantic segmentation
CN109344285B (en) Monitoring-oriented video map construction and mining method and equipment
Blasch et al. Wide-area motion imagery (WAMI) exploitation tools for enhanced situation awareness
CN110750656A (en) Multimedia detection method based on knowledge graph
CN113705218B (en) Event element gridding extraction method based on character embedding, storage medium and electronic device
CN112633431B (en) Tibetan-Chinese bilingual scene character recognition method based on CRNN and CTC
Ke et al. Video mask transfiner for high-quality video instance segmentation
Chen et al. A temporal attentive approach for video-based pedestrian attribute recognition
CN111177559A (en) Text travel service recommendation method and device, electronic equipment and storage medium
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
Lv et al. Storyrolenet: Social network construction of role relationship in video
Wang et al. Deep multi-person kinship matching and recognition for family photos
CN113343941A (en) Zero sample action identification method and system based on mutual information similarity
CN113657272B (en) Micro video classification method and system based on missing data completion
CN113901228B (en) Cross-border national text classification method and device fusing domain knowledge graph
Luo et al. An optimization framework of video advertising: using deep learning algorithm based on global image information
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
CN110674265A (en) Unstructured information oriented feature discrimination and information recommendation system
CN112306985A (en) Digital retina multi-modal feature combined accurate retrieval method
CN114998809A (en) False news detection method and system based on ALBERT and multi-mode cycle fusion
Li et al. Short text sentiment analysis based on convolutional neural network
CN115168609A (en) Text matching method and device, computer equipment and storage medium
Qu et al. Video visual relation detection via 3d convolutional neural network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20181114

Address after: Room 929, 102 Chaoyang North Road, Chaoyang District, Beijing 100123

Co-patentee after: Li Bo

Patentee after: Fan Liyang

Address before: 210046 No. 9 Wenyuan Road, Yadong New Town, Qixia District, Nanjing, Jiangsu

Patentee before: Nanjing Post & Telecommunication Univ.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171114

Termination date: 20190326