CN103312938B - Video processing apparatus, video processing method, and device - Google Patents

Video processing apparatus, video processing method, and device

Info

Publication number
CN103312938B
CN103312938B
Authority
CN
China
Legal status
Active
Application number
CN201210071078.3A
Other languages
Chinese (zh)
Other versions
CN103312938A (en)
Inventor
李斐
刘汝杰
石原正树
上原祐介
Current Assignee
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201210071078.3A priority Critical patent/CN103312938B/en
Priority to JP2013053509A priority patent/JP6015504B2/en
Publication of CN103312938A publication Critical patent/CN103312938A/en
Application granted granted Critical
Publication of CN103312938B publication Critical patent/CN103312938B/en


Landscapes

  • Image Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video processing apparatus, a video processing method, and a device, so as to overcome at least the problem of the poor processing effect of existing supervised and semi-supervised video processing techniques. The video processing apparatus includes: a preprocessing unit for extracting representative frames and performing image segmentation; a feature extraction unit for extracting shot-level, frame-level, and region-level visual features; a weighted graph establishing unit for building shot-level, frame-level, and region-level weighted graphs; a function constructing unit for constructing a cost function; a calculating unit for obtaining the soft labels of the video shots, representative frames, and regions by solving the optimization problem of the cost function; and a video processing unit for performing video processing according to the soft labels. The video processing method performs processing that realizes the functions of the video processing apparatus. The device includes the video processing apparatus. With the above technique, a good video processing effect can be obtained, and the technique is applicable to the field of video processing.

Description

Video processing apparatus, video processing method, and device
Technical Field
The present invention relates to the field of video processing, and in particular, to a video processing apparatus, a video processing method, and a device.
Background
With the rapid increase in the number of digital videos, effective video processing techniques need to be researched and developed. Generally, some existing video processing technologies require a user to provide training video shots and then perform the corresponding video processing according to those training video shots. The training video shots may include both tagged and untagged video shots, and the tagged video shots typically include positive-example video shots (i.e., positively tagged video shots) and negative-example video shots (i.e., negatively tagged video shots). Depending on the type of training video shots used, these video processing techniques can be classified into two categories: supervised video processing techniques and semi-supervised video processing techniques.
For a supervised video processing technique, all of the training video shots adopted are tagged video shots. However, the number of tagged video shots is often limited, so processing performed with this technique is often ineffective, and the information contained in untagged video shots cannot be used.
For semi-supervised video processing techniques, the training video shots include both tagged and untagged video shots. Compared with supervised techniques, semi-supervised techniques can make relatively effective use of the information contained in untagged video shots. However, most existing semi-supervised video processing technologies use only a shot-level weighted graph or only a frame-level weighted graph; even in the technologies that use both a shot-level and a frame-level weighted graph, the two graphs are used to compute results separately and the results are simply combined, without considering the relationship between the two graphs during the computation, so the processing effect is poor.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description discussed later.
In view of the foregoing defects of the prior art, it is an object of the present invention to provide a video processing apparatus, a video processing method and a device, so as to overcome at least the problem of poor video processing effect of the existing supervised video processing technology and semi-supervised video processing technology.
In order to achieve the above object, according to one aspect of the present invention, there is provided a video processing apparatus including: a preprocessing unit configured to extract at least one representative frame of each video shot in a video shot set, respectively, and to divide each extracted representative frame into a plurality of regions, wherein at least some of the video shots in the video shot set are tagged video shots; a feature extraction unit configured to extract a shot-level visual feature, a frame-level visual feature, and a region-level visual feature of each video shot in the video shot set; a weighted graph establishing unit configured to construct a shot-level weighted graph according to the shot-level visual features, a frame-level weighted graph according to the frame-level visual features, and a region-level weighted graph according to the region-level visual features; a function constructing unit configured to construct a cost function according to the structure information of the shot-level, frame-level, and region-level weighted graphs, and according to the relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region, with the soft label of each video shot in the video shot set, the soft label of each representative frame, and the soft label of each region in each representative frame as unknown quantities; a calculating unit configured to obtain calculated values of the unknown quantities by solving an optimization problem of the cost function; and a video processing unit configured to perform video processing based on the calculated values obtained by the calculating unit.
According to another aspect of the present invention, there is also provided a video processing method, including: respectively extracting at least one representative frame of each video shot in a video shot set, and dividing each extracted representative frame into a plurality of regions, wherein at least some of the video shots in the video shot set are tagged video shots; extracting shot-level visual features, frame-level visual features, and region-level visual features of each video shot in the video shot set; constructing a shot-level weighted graph according to the shot-level visual features, a frame-level weighted graph according to the frame-level visual features, and a region-level weighted graph according to the region-level visual features; constructing a cost function according to the structure information of the shot-level, frame-level, and region-level weighted graphs, and according to the relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region, with the soft label of each video shot in the video shot set, the soft label of each representative frame in each video shot, and the soft label of each region in each representative frame as unknown quantities; obtaining calculated values of the unknown quantities by solving an optimization problem of the cost function; and performing video processing based on the obtained calculated values.
According to another aspect of the present invention, there is also provided an apparatus comprising the video processing device as described above.
According to other aspects of the present invention, there is also provided a corresponding computer-readable storage medium having stored thereon a computer program executable by a computing device, the program, when executed, being capable of causing the computing device to perform the above-mentioned video processing method.
The video processing apparatus, the video processing method, and the device including the video processing apparatus described above can realize at least one of the following benefits: by using the three weighted graphs, the feature information of the video shots is fully utilized and the relationships among the three weighted graphs are fully mined, so that a better video processing effect can be obtained; video processing can be realized by further utilizing untagged video shots in addition to tagged video shots, so that the processing effect can be improved; a more accurate video retrieval result can be obtained; and a more accurate video concept detection result can be obtained.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Drawings
The invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like reference numerals are used throughout the figures to indicate like or similar parts. The accompanying drawings, which are incorporated in and form a part of this specification, illustrate preferred embodiments of the present invention and, together with the detailed description, serve to further explain the principles and advantages of the invention. In the drawings:
fig. 1 is a block diagram schematically showing an example structure of a video processing apparatus according to an embodiment of the present invention.
Fig. 2 is a block diagram schematically illustrating one possible example structure of the weighted graph building unit in fig. 1.
Fig. 3 is a block diagram schematically illustrating one possible example structure of the function construction unit in fig. 1.
Fig. 4 is a block diagram schematically illustrating one possible example structure of the computing unit in fig. 1.
Fig. 5 is a block diagram schematically illustrating one possible example structure of the video processing unit in fig. 1.
Fig. 6 is a flow chart schematically illustrating an exemplary process of a video processing method according to an embodiment of the present invention.
Fig. 7 is a flowchart schematically illustrating one possible exemplary process of step S660 shown in fig. 6.
Fig. 8 is a flowchart schematically showing one possible exemplary process of step S670 shown in fig. 6 in the case where the video processing is an example of video concept detection.
Fig. 9 is a block diagram showing a configuration of hardware of one possible information processing apparatus that can be used to implement the video processing device and the video processing method according to the embodiment of the present invention.
Skilled artisans appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve the understanding of the embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the device structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
As described above, the supervised and semi-supervised video processing techniques in the prior art have a poor processing effect when processing video shots, for the reasons described above. In order to improve the video processing effect, the invention provides a video processing apparatus that can simultaneously utilize the shot-level, frame-level, and region-level visual features of video shots, thereby fully utilizing the information in the video shots and better reflecting both the characteristics of each video shot and the relationships among video shots.
The video processing apparatus includes: a preprocessing unit configured to extract at least one representative frame of each video shot in a video shot set, respectively, and to divide each extracted representative frame into a plurality of regions, wherein at least some of the video shots in the video shot set are tagged video shots; a feature extraction unit configured to extract a shot-level visual feature, a frame-level visual feature, and a region-level visual feature of each video shot in the video shot set; a weighted graph establishing unit configured to construct a shot-level weighted graph according to the shot-level visual features, a frame-level weighted graph according to the frame-level visual features, and a region-level weighted graph according to the region-level visual features; a function constructing unit configured to construct a cost function according to the structure information of the shot-level, frame-level, and region-level weighted graphs, and according to the relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region, with the soft label of each video shot in the video shot set, the soft label of each representative frame, and the soft label of each region in each representative frame as unknown quantities; a calculating unit configured to obtain calculated values of the unknown quantities by solving an optimization problem of the cost function; and a video processing unit configured to perform video processing based on the calculated values obtained by the calculating unit.
A video processing apparatus according to an embodiment of the present invention is described in detail below with reference to fig. 1 to 5.
Fig. 1 is a block diagram schematically showing an example structure of a video processing apparatus 100 according to an embodiment of the present invention. As shown in fig. 1, the video processing apparatus 100 according to an embodiment of the present invention includes a preprocessing unit 110, a feature extraction unit 120, a weighted graph creation unit 130, a function construction unit 140, a calculation unit 150, and a video processing unit 160.
As shown in fig. 1, the preprocessing unit 110 in the video processing apparatus 100 is configured to extract at least one representative frame from each video shot in the video shot set and to perform image segmentation on each extracted representative frame, that is, to divide each extracted representative frame of each video shot into a plurality of regions. The representative frames extracted from each video shot may be any one or more frames in the video shot, or may be frames extracted by some existing frame-extraction method; the image segmentation described herein may be implemented by any existing image segmentation method and will not be described in detail here. Further, the video shot set may include a plurality of video shots, and at least some of them may be tagged video shots. That is, the video shots in the video shot set may all be tagged video shots, or some of them may be tagged and the rest untagged. A tagged video shot may be a positively tagged video shot (hereinafter referred to as a "positive-example video shot") or a negatively tagged video shot (hereinafter referred to as a "negative-example video shot"). It should be noted that the "tag" (also called a hard tag) carried by a video shot here is labeling information, generally information representing the category of an object (e.g., a video shot) that is pre-labeled on the object by a user. A positively tagged video shot (i.e., a video shot with a positive hard tag) is typically a video shot that conforms to a particular category, and a negatively tagged video shot (i.e., a video shot with a negative hard tag) is typically a video shot that does not conform to that category. For example, a positive tag may be of the form "A", and correspondingly a negative tag of the form "non-A". As a simple example, let "A" be "tiger": video shots with positive tags are video shots labeled "tiger" (these video shots conform to the category "tiger", indicating that a tiger appears in them), while video shots with negative tags are video shots labeled "non-tiger" (these video shots do not conform to the category "tiger", indicating that no tiger appears in them).
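By way of non-limiting illustration, the following Python sketch shows one way the preprocessing described above could be realized, assuming OpenCV is available. The middle-frame choice and the uniform grid segmentation are stand-in assumptions of this sketch; any existing representative-frame extraction or image segmentation method could be substituted.

```python
import cv2

def preprocess_shot(video_path, start_frame, end_frame, grid=(4, 4)):
    """Extract one representative frame (here simply the middle frame of
    the shot) and divide it into a grid of regions. Both choices are
    illustrative; any existing frame-extraction or image-segmentation
    method could be used instead."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, (start_frame + end_frame) // 2)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise IOError("could not read representative frame")
    h, w = frame.shape[:2]
    rows, cols = grid
    regions = [frame[r * h // rows:(r + 1) * h // rows,
                     c * w // cols:(c + 1) * w // cols]
               for r in range(rows) for c in range(cols)]
    return frame, regions
```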
It should be noted that the number of extracted representative frames of each video shot in the video shot set may be the same or different. In addition, the representative frame of each video shot in the video shot set may be divided into a plurality of regions by image division, but the number of regions into which each representative frame is divided may be the same or different.
Then, the shot-level, frame-level, and region-level visual features of each video shot in the above video shot set are extracted by the feature extraction unit 120. The shot-level visual feature of a video shot is the visual feature of the video shot extracted at the shot level; the frame-level visual feature is the visual feature extracted at the frame level; and the region-level visual feature is the visual feature extracted at the region level. The "visual features" here are information that can reflect the content of a video shot to a certain extent, and may be any one of, or a combination of, visual features such as color features, texture features, and shape features. In addition, the various feature-extraction methods existing in the prior art can be used by the present invention and are not described in detail here.
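As a non-authoritative illustration of the three feature levels, the following sketch extracts features at the shot, frame, and region levels, assuming color histograms as the visual feature (one of the feature types named above) and reusing OpenCV.

```python
import cv2
import numpy as np

def color_histogram(image, bins=8):
    """A simple visual feature: normalized 3-D BGR color histogram."""
    hist = cv2.calcHist([image], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()

def extract_three_level_features(shot_frames, rep_frame, regions):
    """Shot level: average of per-frame histograms across the shot;
    frame level: histogram of the representative frame;
    region level: one histogram per segmented region."""
    shot_feat = np.mean([color_histogram(f) for f in shot_frames], axis=0)
    frame_feat = color_histogram(rep_frame)
    region_feats = [color_histogram(r) for r in regions]
    return shot_feat, frame_feat, region_feats
```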
The weighted graph establishing unit 130 may establish three kinds of weighted graphs according to the shot-level, frame-level, and region-level visual features of each video shot in the video shot set extracted by the feature extraction unit 120. Specifically, the weighted graph establishing unit 130 may establish a shot-level weighted graph according to the shot-level visual features of each video shot, a frame-level weighted graph according to the frame-level visual features (i.e., the visual features of each representative frame), and a region-level weighted graph according to the region-level visual features (i.e., the visual features of each region).
In one implementation of the video processing apparatus according to the embodiment of the present invention, the weighted graph establishing unit 130 as shown in fig. 1 may be implemented with a structure as shown in fig. 2. Fig. 2 is a block diagram schematically illustrating one possible example structure of the weighted graph establishing unit 130 in fig. 1. As shown in fig. 2, the weighted graph establishing unit 130 may include a first establishing subunit 210, a second establishing subunit 220, and a third establishing subunit 230.
The first establishing subunit 210 may be configured to construct the shot-level weighted graph, for example, by using each video shot in the video shot set as a node, and using the similarity between the shot-level visual features of each pair of nodes as the weight of the weighted edge between those two nodes. In other words, in the shot-level weighted graph constructed by the first establishing subunit 210, each node represents one of the video shots in the video shot set, and the weight of the weighted edge connecting two nodes represents the similarity, based on the shot-level visual features, between the two video shots corresponding to the two nodes. The nodes in the shot-level weighted graph correspond one-to-one to the video shots in the video shot set.

Similarly, the second establishing subunit 220 may be configured to construct the frame-level weighted graph, for example, by using each representative frame of each video shot in the video shot set as a node, and using the similarity between the frame-level visual features of each pair of nodes as the weight of the weighted edge between them. In other words, in the frame-level weighted graph constructed by the second establishing subunit 220, each node represents one representative frame of one of the video shots, and the weight of the weighted edge connecting two nodes represents the similarity, based on the frame-level visual features, between the two representative frames corresponding to the two nodes. The nodes in the frame-level weighted graph correspond one-to-one to the representative frames of the video shots in the video shot set.

Furthermore, the third establishing subunit 230 may be configured to construct the region-level weighted graph, for example, by taking each region of each representative frame of each video shot in the video shot set as a node, and taking the similarity between the region-level visual features of each pair of nodes as the weight of the weighted edge between them. In other words, in the region-level weighted graph constructed by the third establishing subunit 230, each node represents one region of one representative frame of one of the video shots, and the weight of the weighted edge connecting two nodes represents the similarity, based on the region-level visual features, between the two regions corresponding to the two nodes. The nodes in the region-level weighted graph correspond one-to-one to the regions contained in the representative frames of the video shots in the video shot set.
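The following sketch illustrates, under assumptions of this description only, how any one of the three weighted graphs could be built: a Gaussian kernel on Euclidean feature distance serves as the similarity measure, though the text above does not prescribe a particular measure.

```python
import numpy as np

def build_weighted_graph(features, sigma=1.0):
    """Given one feature vector per node (shots, frames, or regions),
    return the weight matrix W of a fully connected weighted graph,
    using a Gaussian kernel on Euclidean distance as the similarity.
    The kernel and its bandwidth are assumptions of this sketch."""
    X = np.asarray(features, dtype=float)
    sq = (X ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    W = np.exp(-np.maximum(dist2, 0.0) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W  # the same construction yields W^S, W^F, and W^R
```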
Turning back to fig. 1, after the shot-level, frame-level, and region-level weighted graphs are constructed by the weighted graph establishing unit 130, a cost function may be constructed by the function constructing unit 140. In the cost function, the unknown quantities are the soft label of each video shot in the video shot set, the soft label of each representative frame of each video shot, and the soft label of each region of each representative frame. The cost function can then be constructed according to the structure information of the shot-level, frame-level, and region-level weighted graphs constructed by the weighted graph establishing unit 130, and according to the relationship among the soft label of each video shot, the soft labels of its representative frames, and the soft labels of the regions in those representative frames.
Note that a soft label is a concept defined in relation to the concept of a hard label. A hard label is a real label, usually information reflecting the sample category that is pre-labeled on a predetermined sample (e.g., a video shot); a soft label is a virtual label that generally reflects how well the object to which it belongs (e.g., a video shot, frame, or region) conforms to the category information characterized by the hard labels of the predetermined samples. In general, a soft label can be any real number between -1 and 1 (inclusive). In this case, the closer (i.e., the larger) the value of a soft label is to 1, the more the corresponding object conforms to the category of the positively labeled objects in the predetermined samples; conversely, the closer (i.e., the smaller) the value of a soft label is to -1, the less the corresponding object conforms to that category. In other words, the larger the value of the soft label, the more likely the corresponding object belongs to the category of the positively labeled objects, and the smaller the value, the less likely it does. The soft label may also be allowed to take other real values, for example values greater than 1 or less than -1; in that case as well, the larger the soft label, the more the corresponding object conforms to the category of the positively labeled objects in the predetermined samples.
For example, suppose the predetermined samples contain positively labeled video shots labeled "tiger" and negatively labeled video shots labeled "non-tiger". If the soft label of one video shot is 0.1 and the soft label of another is 0.8, the probability that a tiger appears in the video shot with soft label 0.8 is much higher than in the video shot with soft label 0.1.
Specifically, the function and operation of the function construction unit 140 may be implemented using a structure as shown in fig. 3. Fig. 3 is a block diagram schematically illustrating one possible example structure of the function construction unit 140 in fig. 1.
As shown in fig. 3, the function constructing unit 140 may include a first setting subunit 310, a second setting subunit 320, and a function constructing subunit 330. The first setting subunit 310 is configured to set a first constraint condition according to the structure information of the shot-level, frame-level, and region-level weighted graphs constructed by the weighted graph establishing unit 130. The second setting subunit 320 is configured to set a second constraint condition according to the relationship among the soft labels of the tagged video shots in the video shot set, the soft labels of the representative frames of those tagged video shots, and the soft labels of the regions in those representative frames. The function constructing subunit 330 is then configured to obtain the cost function according to the above two constraint conditions. As described above, the unknown quantities in the cost function are the soft label of each video shot in the video shot set, the soft label of each representative frame of each video shot, and the soft label of each region in each representative frame.
Specifically, in consideration of the structure information of the above three weighted graphs, the first setting subunit 310 may set the following first constraint condition: the more similar the shot-level visual features of two video shots are, the smaller the difference between their soft labels should be; the more similar the frame-level visual features of two representative frames are, the smaller the difference between their soft labels should be; and the more similar the region-level visual features of two regions are, the smaller the difference between their soft labels should be.
Further, for those tagged video shots in the video shot collection described above, the soft tags for negatively tagged video shots may be made as close to-1 as possible, while the soft tags for positively tagged video shots may be made as close to 1 as possible. This is because, in a case where a video shot with a positive tag contains a content of a certain category, and a video shot with a negative tag does not contain the content of the certain category, when a soft tag is set to any real number between-1 and 1, the video shot with the soft tag closer to 1 is more likely to contain the content of the certain category, and the video shot with the soft tag closer to-1 is less likely to contain the content of the certain category. For example, for a video shot labeled "non-tiger" (i.e., negative label), the soft label of the video shot may be as close to-1 as possible; conversely, for a video shot labeled "tiger" (i.e., a positive label), the soft label of the video shot may be made as close to 1 as possible.
For the representative frames of negatively tagged video shots: if a video shot has a negative tag, the video shot does not contain the content of the specific category, so no frame in the video shot contains that content, and no region in any frame of the video shot contains it. Therefore, the soft label of each representative frame in a negatively tagged video shot may be made as close as possible to -1, and the soft label of each region of each such representative frame may also be made as close as possible to -1.
The situation is somewhat more complicated for the representative frames of the positively labeled video shots and the regions therein.
For example, for the representative frames of a positively tagged video shot: if a video shot has a positive tag, the video shot contains the "content of the specific category", i.e., at least one frame of the video shot contains it, but it cannot be determined which frames do. Considering only the representative frames of the video shot, at least one representative frame of the positively tagged video shot can be assumed to contain the "content of the specific category", but again it cannot be determined which ones. Therefore, for a positively tagged video shot, only the representative frame with the largest soft label is considered, and its soft label is made as close as possible to the soft label of the video shot. In this way, the shot-level weighted graph and the frame-level weighted graph are correlated.
In addition, as described above, when it is considered that the "content of the specific category" is included in at least one representative frame of a positively tagged video shot, at least one region containing the "content of the specific category" exists in each of those representative frames. For each of these representative frames, only the region with the largest soft label in the frame may be considered, so that the soft label of that region is as close as possible to the soft label of the representative frame to which it belongs. In this way, the frame-level weighted graph and the region-level weighted graph are correlated.
Furthermore, it should be noted that, in general, it is not known which frames in a positively tagged video shot are positive examples (i.e., which frames contain the above "content of a specific category"). Therefore, some frames that may be positive examples (i.e., frames that may contain the "content of a specific category", hereinafter referred to as "possible positive frames") may be selected according to certain criteria. For example, the possible positive frames may be the representative frames whose soft label is higher than a fifth preset threshold, or the representative frames containing a region whose soft label is higher than a sixth preset threshold.
Thus, the second setting subunit 320 may set the following second constraint condition: the soft labels of the negatively tagged video shots, of all their representative frames, and of all regions of those representative frames are made as close as possible to -1; the soft labels of the positively tagged video shots are made as close as possible to 1; for each positively tagged video shot, the soft label of the representative frame with the largest soft label is made as close as possible to the soft label of the video shot to which that frame belongs; and in each possible positive frame of a positively tagged video shot, the soft label of the region with the largest soft label is made as close as possible to the soft label of the representative frame to which that region belongs.
According to the above two constraints, the cost function can be constructed by the function construction subunit 330. For example, the function constructing subunit 330 may construct the following cost function according to the above two constraints:
expression one:
$$
\begin{aligned}
Q(f^S, f^F, f^R)
&= \frac{1}{2}\sum_{g,h} W^S_{gh}\left(\frac{f^S_g}{\sqrt{d^S_g}} - \frac{f^S_h}{\sqrt{d^S_h}}\right)^{\!2}
 + \frac{\mu^F_G}{2}\sum_{i,j} W^F_{ij}\left(\frac{f^F_i}{\sqrt{d^F_i}} - \frac{f^F_j}{\sqrt{d^F_j}}\right)^{\!2} \\
&\quad + \frac{\mu^R_G}{2}\sum_{k,l} W^R_{kl}\left(\frac{f^R_k}{\sqrt{d^R_k}} - \frac{f^R_l}{\sqrt{d^R_l}}\right)^{\!2}
 + \mu^S_{-}\sum_{S_g\in S^-} H_1(f^S_g, -1) \\
&\quad + \mu^F_{-}\sum_{F_i\in F^-} H_1(f^F_i, -1)
 + \mu^R_{-}\sum_{R_k\in R^-} H_1(f^R_k, -1)
 + \mu^S_{+}\sum_{S_g\in S^+} H_2(f^S_g, 1) \\
&\quad + \mu^F_{+}\sum_{S_g\in S^+} H_2\Big(\max_{F_i\in S_g} f^F_i,\; f^S_g\Big)
 + \mu^R_{+}\sum_{F_i\in C^+} H_2\Big(\max_{R_l\in F_i} f^R_l,\; f^F_i\Big)
\end{aligned}
$$

where $f^S_g$ and $f^S_h$ denote the soft labels of the $g$-th and $h$-th video shots in the video shot set ($g, h = 1, 2, \dots, L$, with $L$ the number of video shots in the set); $f^F_i$ and $f^F_j$ denote the soft labels of the $i$-th and $j$-th representative frames among all representative frames of all video shots in the set ($i, j = 1, 2, \dots, M$, with $M$ the total number of representative frames); and $f^R_k$ and $f^R_l$ denote the soft labels of the $k$-th and $l$-th regions among all regions contained in all representative frames of all video shots in the set ($k, l = 1, 2, \dots, N$, with $N$ the total number of such regions). Furthermore, $f^S$ is the vector consisting of the soft labels of all video shots in the video shot set, $f^F$ the vector consisting of the soft labels of all representative frames, and $f^R$ the vector consisting of the soft labels of all regions. $W^S_{gh}$ denotes the weight of the weighted edge, in the shot-level weighted graph, between the nodes corresponding to the $g$-th and $h$-th video shots, and $W^S$ denotes the matrix formed by the weights of all weighted edges in the shot-level weighted graph; that is, $W^S_{gh}$ is the element in row $g$, column $h$ of $W^S$. In addition, $d^S_g$ and $d^S_h$ denote the sums of all elements of row $g$ and of row $h$ of $W^S$, respectively. $W^F_{ij}$ denotes the weight of the weighted edge, in the frame-level weighted graph, between the nodes corresponding to the $i$-th and $j$-th representative frames, and $W^F$ the matrix formed by the weights of all weighted edges in the frame-level weighted graph; $d^F_i$ and $d^F_j$ denote the sums of all elements of rows $i$ and $j$ of $W^F$. Similarly, $W^R_{kl}$ denotes the weight of the weighted edge, in the region-level weighted graph, between the nodes corresponding to the $k$-th and $l$-th regions, and $W^R$ the matrix formed by the weights of all weighted edges in the region-level weighted graph; $d^R_k$ and $d^R_l$ denote the sums of all elements of rows $k$ and $l$ of $W^R$.

Furthermore, in expression one, $S_g$ denotes the $g$-th video shot in the video shot set; $S^+$ and $S^-$ denote the sets of positive-example and negative-example video shots, respectively; $F_i$ denotes the $i$-th representative frame among all representative frames of all video shots in the set; $F^-$ denotes the set of all representative frames of the negative-example video shots; $R_k$ denotes the $k$-th region among all regions of all representative frames; $R^-$ denotes the set of all regions of all representative frames of the negative-example video shots; and $C^+$ is the set of possible positive frames among all representative frames contained in all video shots of the set. $H_1(x, y)$ and $H_2(x, y)$ are functions measuring the inconsistency between two quantities $x$ and $y$; one form that may be used is $H_1(x, y) = (\max(x - y, 0))^2$ and $H_2(x, y) = (\max(y - x, 0))^2$. In addition, $\mu^F_G$, $\mu^R_G$, $\mu^S_-$, $\mu^F_-$, $\mu^R_-$, $\mu^S_+$, $\mu^F_+$, and $\mu^R_+$ are the weighting coefficients of the corresponding cost terms in the formula, preset according to empirical values or determined through experiments.
In expression one, the first three terms are the cost terms corresponding to the first constraint condition, and the remaining six terms are the cost terms corresponding to the second constraint condition. In addition, the superscript "S" appearing in the formula denotes a video shot, the superscript "F" a frame, and the superscript "R" a region.
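As a non-authoritative illustration, the following NumPy sketch evaluates expression one for given soft-label vectors, using the $H_1$ and $H_2$ forms given above. The container conventions (`frames_of`, `regions_of`) and the coefficient dictionary `mu` are assumptions of the sketch, not part of the patent text.

```python
import numpy as np

def H1(x, y):
    """Penalizes x exceeding y; zero when x <= y."""
    return max(x - y, 0.0) ** 2

def H2(x, y):
    """Penalizes x falling below y; zero when x >= y."""
    return max(y - x, 0.0) ** 2

def smoothness(W, f):
    """First-constraint term:
    (1/2) * sum_{a,b} W_ab * (f_a/sqrt(d_a) - f_b/sqrt(d_b))^2."""
    d = W.sum(axis=1)
    g = f / np.sqrt(d)
    return 0.5 * np.sum(W * (g[:, None] - g[None, :]) ** 2)

def cost(fS, fF, fR, WS, WF, WR, S_pos, S_neg, F_neg, R_neg,
         frames_of, regions_of, C_pos, mu):
    """Evaluate expression one. frames_of[g] lists the representative-frame
    indices of shot g and regions_of[i] the region indices of frame i;
    mu maps coefficient names to values. These conventions are
    assumptions of this sketch."""
    Q = smoothness(WS, fS)
    Q += mu['GF'] * smoothness(WF, fF)
    Q += mu['GR'] * smoothness(WR, fR)
    Q += mu['-S'] * sum(H1(fS[g], -1.0) for g in S_neg)
    Q += mu['-F'] * sum(H1(fF[i], -1.0) for i in F_neg)
    Q += mu['-R'] * sum(H1(fR[k], -1.0) for k in R_neg)
    Q += mu['+S'] * sum(H2(fS[g], 1.0) for g in S_pos)
    Q += mu['+F'] * sum(H2(max(fF[i] for i in frames_of[g]), fS[g])
                        for g in S_pos)
    Q += mu['+R'] * sum(H2(max(fR[k] for k in regions_of[i]), fF[i])
                        for i in C_pos)
    return Q
```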
It should be noted that the specific formula of the cost function given above is only an exemplary expression of the cost function, and is not used to limit the scope of the present invention. For example, the expression for the cost function given above may also be:
expression two:

$$
\begin{aligned}
Q(f^S, f^F, f^R)
&= \frac{1}{2}\sum_{g,h} W^S_{gh}\left(f^S_g - f^S_h\right)^2
 + \frac{\mu^F_G}{2}\sum_{i,j} W^F_{ij}\left(f^F_i - f^F_j\right)^2
 + \frac{\mu^R_G}{2}\sum_{k,l} W^R_{kl}\left(f^R_k - f^R_l\right)^2 \\
&\quad + \mu^S_{-}\sum_{S_g\in S^-} H_1(f^S_g, -1)
 + \mu^F_{-}\sum_{F_i\in F^-} H_1(f^F_i, -1)
 + \mu^R_{-}\sum_{R_k\in R^-} H_1(f^R_k, -1) \\
&\quad + \mu^S_{+}\sum_{S_g\in S^+} H_2(f^S_g, 1)
 + \mu^F_{+}\sum_{S_g\in S^+} H_2\Big(\max_{F_i\in S_g} f^F_i,\; f^S_g\Big)
 + \mu^R_{+}\sum_{F_i\in C^+} H_2\Big(\max_{R_l\in F_i} f^R_l,\; f^F_i\Big)
\end{aligned}
$$

Compared with expression one, expression two removes the degree normalization factors $\sqrt{d^S_g}$ and $\sqrt{d^S_h}$ from the first term, $\sqrt{d^F_i}$ and $\sqrt{d^F_j}$ from the second term, and $\sqrt{d^R_k}$ and $\sqrt{d^R_l}$ from the third term.
In addition, the expression of the cost function may have other variations. For example, in expression one and expression two above, the specific forms of $H_1(x, y)$ and $H_2(x, y)$ may instead be $H_1(x, y) = (x - y)^2$ and $H_2(x, y) = (x - y)^2$, and so on. Variations, modifications, and other expressions of the above formulas that may occur to those skilled in the art from the above disclosure and/or common general knowledge are intended to be included within the scope of the present invention.
Next, in order to obtain the unknown quantities from the constructed cost function, that is, to obtain the value of the soft label of each video shot in the video shot set, of each representative frame of each video shot, and of each region in each representative frame, the optimization problem of the cost function may be solved by the calculating unit 150. Specifically, the function and operation of the calculating unit 150 may be realized by the structure shown in fig. 4.
Fig. 4 is a block diagram schematically illustrating one possible example structure of the calculating unit 150 in fig. 1. As shown in fig. 4, the calculating unit 150 may include an initialization subunit 410, a third calculating subunit 420, a fourth calculating subunit 430, a fifth calculating subunit 440, and a third judging subunit 450. With the exemplary structure shown in fig. 4, the calculating unit 150 may solve the optimization problem by an iterative method: initial values are assigned to $f^S$ and $f^F$, iterative computation is carried out using the cost function, and the values of $f^R$, $f^F$, and $f^S$ are finally obtained. The specific functions and processes of the subunits of the calculating unit 150 shown in fig. 4 are described in detail below.
As shown in fig. 4, the initialization subunit 410 is used to assign initial values to the soft labels $f^S$ of the video shots in the video shot set and the soft labels $f^F$ of the representative frames of the video shots in the video shot set.
For example, the initialization subunit 410 may set the initial values $f^S(0)$ of the soft labels of the video shots in the video shot set as follows: if $S_g$ is a positively tagged video shot, let $f^S_g(0) = 1$; if $S_g$ is a negatively tagged video shot, let $f^S_g(0) = -1$; and if $S_g$ is an untagged video shot, let $f^S_g(0) = 0$.
Similarly, the initialization subunit 410 may set the initial values $f^F(0)$ of the soft labels of the representative frames as follows: if $F_i$ is a representative frame in a positively tagged video shot, let $f^F_i(0) = 1$; if $F_i$ is a representative frame in a negatively tagged video shot, let $f^F_i(0) = -1$; and if $F_i$ is a representative frame in an untagged video shot, let $f^F_i(0) = 0$.
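A minimal sketch of this initialization, assuming shot tags are given as +1, -1, or None and that `frame_shot` maps each representative frame to the index of its shot:

```python
import numpy as np

def init_soft_labels(shot_tags, frame_shot):
    """Assign f^S(0) and f^F(0): 1 for positively tagged shots (and their
    representative frames), -1 for negatively tagged ones, 0 otherwise.
    shot_tags[g] is +1, -1, or None; frame_shot[i] gives the shot that
    frame i belongs to (assumed container conventions)."""
    fS0 = np.array([0.0 if t is None else float(t) for t in shot_tags])
    fF0 = np.array([fS0[g] for g in frame_shot])
    return fS0, fF0
```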
As shown in fig. 4, the third calculating subunit 420 is used to convert the cost function, given the current values of the soft labels $f^S$ of the video shots and the soft labels $f^F$ of the representative frames, into a constrained minimization problem, and to solve that problem using the constrained concave-convex procedure (CCCP) to obtain calculated values of the soft labels $f^R$ of the regions of the representative frames, as the current value of $f^R$.
For example, at the first calculation, the current values of $f^S$ and $f^F$ are their initial values. Given the current values of $f^S$ and $f^F$, the cost function in the form of expression one can be reduced to the following equation,
expression 1-1:

$$
Q(f^R) = \frac{\mu^R_G}{2}\sum_{k,l} W^R_{kl}\left(\frac{f^R_k}{\sqrt{d^R_k}} - \frac{f^R_l}{\sqrt{d^R_l}}\right)^{\!2}
+ \mu^R_{-}\sum_{R_k\in R^-} H_1(f^R_k, -1)
+ \mu^R_{+}\sum_{F_i\in C^+} H_2\Big(\max_{R_l\in F_i} f^R_l,\; f^F_i\Big)
$$

where each quantity in expression 1-1 has the same meaning as in expression one. In addition, in expression 1-1, the set $C^+$ of possible positive frames among all representative frames contained in all video shots in the above video shot set can be defined as $C^+ = \{F_i \mid f^F_i \geq TH^F\}$, where $TH^F$ is the fifth preset threshold described above, and the value of $TH^F$ can be determined according to the following equation:

$$
TH^F = \max\left\{\, t \;\middle|\; \forall S_g \in S^+,\ \exists F_i \in S_g,\ f^F_i \geq t \right\} = \min_{S_g\in S^+}\,\max_{F_i\in S_g} f^F_i .
$$
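The min-max computation of $TH^F$ and the resulting set $C^+$ can be sketched as follows (indexing conventions assumed):

```python
def possible_positive_frames(fF, S_pos, frames_of):
    """TH^F is the smallest, over positively tagged shots, of the largest
    frame soft label within the shot; C+ collects every representative
    frame whose soft label reaches TH^F."""
    TH_F = min(max(fF[i] for i in frames_of[g]) for g in S_pos)
    return [i for i, v in enumerate(fF) if v >= TH_F]
```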
the optimization problem can be solved using CCCP by transforming a cost function in the form of expression one into a minimization problem with constraints by introducing relaxation factors. For a detailed description of CCCP, reference is made to documents a.j.smola, s.v.n.vishwananathan, andt.hofmann, "kernelmethods for discovery variables," inproc.int.workbenchoporatifial intellignends, 2005.
Thus, using the current values of $f^S$ and $f^F$ and the cost function, the third calculating subunit 420 can obtain, in the above manner, the calculated value of $f^R$ as the current value of $f^R$.
As shown in fig. 4, the fourth calculating subunit 430 may convert the cost function, given the current values of the soft labels $f^S$ of the video shots and the soft labels $f^R$ of the regions of the representative frames, into a constrained minimization problem, and solve it using CCCP to obtain calculated values of the soft labels $f^F$ of the representative frames, as the current value of $f^F$.
In particular, with the soft labels $f^S$ of the video shots and the soft labels $f^R$ of the regions fixed, the cost function in the form of expression one can be simplified to:
expression 1-2:

$$
Q(f^F) = \frac{\mu^F_G}{2}\sum_{i,j} W^F_{ij}\left(\frac{f^F_i}{\sqrt{d^F_i}} - \frac{f^F_j}{\sqrt{d^F_j}}\right)^{\!2}
+ \mu^F_{-}\sum_{F_i\in F^-} H_1(f^F_i, -1)
+ \mu^F_{+}\sum_{S_g\in S^+} H_2\Big(\max_{F_i\in S_g} f^F_i,\; f^S_g\Big)
+ \mu^R_{+}\sum_{F_i\in C^+} H_2\Big(\max_{R_l\in F_i} f^R_l,\; f^F_i\Big)
$$

where each quantity in expression 1-2 has the same meaning as in expression one. Furthermore, in expression 1-2, the set $C^+$ of possible positive frames among all representative frames contained in all video shots in the above video shot set can be defined as $C^+ = \{F_i \mid \max_{R_k\in F_i} f^R_k \geq TH^R\}$, where $TH^R$ is the sixth preset threshold described above, and the value of $TH^R$ can be determined according to the following equation:

$$
TH^R = \max\left\{\, t \;\middle|\; \forall S_g \in S^+,\ \exists R_k \in S_g,\ f^R_k \geq t \right\} = \min_{S_g\in S^+}\,\max_{R_k\in S_g} f^R_k .
$$
Likewise, the cost function in the form of expression 1-2 can be converted into a constrained minimization problem by introducing a relaxation factor, and the problem can be solved using the constrained concave-convex procedure.
Thus, using the current values of $f^S$ and $f^R$ and the cost function, the fourth calculating subunit 430 can obtain, in the above manner, the calculated value of $f^F$ as the current value of $f^F$.
As shown in fig. 4, the fifth calculating subunit 440 may, given the current values of the soft labels $f^F$ of the representative frames and the soft labels $f^R$ of the regions, directly use the cost function to calculate the soft labels $f^S$ of the video shots in the video shot set, as the current value of $f^S$.
In particular, with the soft labels $f^F$ of the representative frames and the soft labels $f^R$ of the regions fixed, the cost function in the form of expression one can be simplified to:
expression 1-3:

$$
Q(f^S) = \frac{1}{2}\sum_{g,h} W^S_{gh}\left(\frac{f^S_g}{\sqrt{d^S_g}} - \frac{f^S_h}{\sqrt{d^S_h}}\right)^{\!2}
+ \mu^S_{-}\sum_{S_g\in S^-} H_1(f^S_g, -1)
+ \mu^S_{+}\sum_{S_g\in S^+} H_2(f^S_g, 1)
+ \mu^F_{+}\sum_{S_g\in S^+} H_2\Big(\max_{F_i\in S_g} f^F_i,\; f^S_g\Big)
$$
where each quantity in expression 1-3 has the same meaning as in expression one. According to expression 1-3, the fifth calculating subunit 440 can directly solve for the calculated value of $f^S$ as the current value of $f^S$.
As shown in fig. 4, the third judging subunit 450 is configured to judge, each time after the third calculating subunit 420, the fourth calculating subunit 430, and the fifth calculating subunit 440 have each performed one calculation in turn, whether the current calculation results of $f^R$, $f^F$, and $f^S$ converge: if they converge, the current values of $f^R$, $f^F$, and $f^S$ are taken as the calculated values of the unknown quantities in the cost function; otherwise, the third calculating subunit 420, the fourth calculating subunit 430, and the fifth calculating subunit 440 perform the next iterative calculation, the third judging subunit 450 performs the judgment again, and so on, repeating the iterative calculation until the third judging subunit 450 determines that the current calculation results of $f^R$, $f^F$, and $f^S$ converge.
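The overall alternating iteration can be sketched as follows; the three update callables stand in for the CCCP solvers for $f^R$ and $f^F$ and the direct minimization for $f^S$, whose internals are not reproduced here:

```python
import numpy as np

def solve(fS, fF, fR, update_regions, update_frames, update_shots,
          tol=1e-4, max_iter=100):
    """Alternate the three block updates (expressions 1-1, 1-2, 1-3)
    until f^R, f^F, and f^S stop changing. The update callables are
    assumed to be supplied by the caller."""
    for _ in range(max_iter):
        fR_new = update_regions(fS, fF)        # expression 1-1 (CCCP)
        fF_new = update_frames(fS, fR_new)     # expression 1-2 (CCCP)
        fS_new = update_shots(fF_new, fR_new)  # expression 1-3 (direct)
        converged = (np.allclose(fR_new, fR, atol=tol) and
                     np.allclose(fF_new, fF, atol=tol) and
                     np.allclose(fS_new, fS, atol=tol))
        fR, fF, fS = fR_new, fF_new, fS_new
        if converged:
            break
    return fS, fF, fR
```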
As described above, through the processing of the preprocessing unit 110, the feature extraction unit 120, the weighted graph creation unit 130, the function construction unit 140, and the calculation unit 150, the calculation values of the soft label of each video shot, each representative frame, and each region in the video shot set can be obtained, and the video processing unit 160 can perform video processing based on the obtained calculation values.
The video processing performed by the video processing unit 160 may be any of various kinds of processing that operate using the above-described soft labels.
For example, in one application example of the video processing apparatus according to the embodiment of the present invention, the above-described "video processing" may be video retrieval, that is, the above-described video processing apparatus may be a video retrieval apparatus.
Generally, to retrieve desired video shots, a user provides some tagged training video shots to the retrieval system as query video shots. This technology can be applied to many aspects of daily life, such as digital video libraries, personal video recording and management, and online movie and television websites.
In this example, the number of query video shots provided by the user may be one or more. When there is one query video shot, it is a positively tagged video shot. When there are multiple query video shots, they may all be positively tagged video shots, or a combination of positively and negatively tagged video shots. In the special case where a query video shot itself consists of only one frame of image, the query video shot is a query image, and the representative frame extracted from it is the query image itself.
As described above, through the series of processing operations of the preprocessing unit 110, the feature extraction unit 120, the weighted graph establishing unit 130, the function constructing unit 140, and the calculating unit 150, calculated values can be obtained for the soft label of each video shot in the video shot set, the soft label of each representative frame of each video shot, and the soft label of each region of each representative frame. Using these calculated soft labels, the video processing unit 160 can determine the similarity between each video shot in the video shot set (other than the query video shots) and the query video shots, and then determine those video shots whose similarity with the query video shots is within a predetermined range as the results of the video retrieval (i.e., retrieval results).
For example, in one example, the video processing unit 160 may determine as retrieval results the video shots that satisfy the following conditions: the soft label of the video shot is higher than a first preset threshold, the soft label of the representative frame with the largest soft label in the video shot is higher than a second preset threshold, and the soft label of the region with the largest soft label in that representative frame is higher than a third preset threshold. The values of the first, second, and third preset thresholds may be the same or different. For example, the video processing unit 160 may determine as retrieval results those video shots for which, in the final calculation result, the soft label of the video shot is higher than 0.8, the soft label of its representative frame with the largest soft label is higher than 0.75, and the soft label of the region with the largest soft label in that representative frame is higher than 0.7.
In another example, the video processing unit 160 may determine as the results of the video retrieval the first $N$ video shots (where $N$ is a positive integer) for which the weighted sum of the soft label of the video shot, the soft label of the representative frame with the largest soft label in the video shot, and the soft label of the region with the largest soft label in that representative frame is the largest. For example, the expression for the weighted sum may be:

$$
\alpha f^S_g + \beta \max_{F_i\in S_g} f^F_i + (1 - \alpha - \beta)\max_{R_k\in F_{i_0}} f^R_k .
$$

That is, for each video shot $S_g$ ($g = 1, 2, \dots, L$), a corresponding weighted sum is calculated according to the above formula, and the video shots corresponding to the largest first $N$ weighted sums are selected as the final retrieval result. Here, $\max_{F_i\in S_g} f^F_i$ is the value of the soft label of the representative frame with the largest soft label in video shot $S_g$, $F_{i_0}$ denotes that representative frame, and $\max_{R_k\in F_{i_0}} f^R_k$ is the value of the soft label of the region with the largest soft label in $F_{i_0}$; $\alpha$ and $\beta$ are linear combination coefficients with $0 < \alpha < 1$, $0 < \beta < 1$, and $0 < \alpha + \beta < 1$.
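A sketch of this weighted-sum ranking, with illustrative coefficient values and assumed container conventions:

```python
def rank_shots(fS, fF, fR, frames_of, regions_of, alpha=0.4, beta=0.3, N=10):
    """Score every shot by alpha*f^S_g + beta*f^F_{i0} +
    (1-alpha-beta)*(largest region label in F_{i0}), where F_{i0} is the
    shot's best-scoring representative frame, and return the indices of
    the top-N shots. Coefficient values here are illustrative only."""
    scores = {}
    for g in range(len(fS)):
        i0 = max(frames_of[g], key=lambda i: fF[i])  # frame with largest soft label
        best_region = max(fR[k] for k in regions_of[i0])
        scores[g] = alpha * fS[g] + beta * fF[i0] + (1 - alpha - beta) * best_region
    return sorted(scores, key=scores.get, reverse=True)[:N]
```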
Further, the video processing unit 160 may output the retrieval results to the user in any one of the following orders: by the magnitude of the soft labels of the video shots corresponding to the retrieval results; by the magnitude of the soft labels of the representative frames with the largest soft label in those video shots; by the magnitude of the soft labels of the regions with the largest soft label in those representative frames; or by the magnitude of the weighted sum of the soft label of the video shot, the soft label of the representative frame with the largest soft label in the video shot, and the soft label of the region with the largest soft label in that representative frame.
In this example, the video processing apparatus obtains, from the query video shots and their tag information provided by the user, soft labels for each video shot in the video shot set and for each representative frame and each region of each video shot, by using the structural features of the shot-level, frame-level, and region-level weighted graphs and the associations among the three. According to these soft labels, it determines the correlation (or similarity) between each video shot in the video shot set other than the query video shots and the query video shots, and thereby determines the video shots most correlated with (or most similar to) the query video shots as the retrieval results. Compared with existing video retrieval technologies, the video processing apparatus according to the embodiment of the present invention can realize video retrieval by simultaneously utilizing the three weighted graphs, fully mining the relationships among them, and utilizing both tagged and untagged video shots without being limited by the scarcity of tagged video shots, thereby obtaining a better video processing effect, i.e., a more accurate retrieval result.
Further, in another application example of the video processing apparatus according to the embodiment of the present invention, the above-described "video processing" may also be video concept detection, that is, the above-described video processing apparatus may be video concept detection apparatus.
Generally, the purpose of video concept detection is to determine whether (or to what extent) some given semantic concept is contained in the video shot under test. The technology can be applied to many aspects of people's daily life, such as video libraries, home video management, video on demand and the like.
In this example, the video shot under test is an untagged video shot, which may or may not be included in the video shot set, and there may be one or more video shots under test. Further, as described above, at least some of the video shots in the video shot set in this example are tagged video shots, so that it can be determined whether the video shot under test contains semantic concepts related to those tagged video shots.
Similarly to the foregoing example, through the series of processing operations of the preprocessing unit 110, the feature extraction unit 120, the weighted graph establishing unit 130, the function constructing unit 140, and the calculating unit 150, the calculated values of the soft label of each video shot in the video shot set, and of each representative frame and each region of each representative frame therein, can be obtained. Using these calculated values, the video processing unit 160 can determine whether the video shot under test contains the above semantic concept, that is, a semantic concept related to the tagged video shots in the video shot set. For example, where the video shot set includes positively tagged video shots labeled "tiger" and negatively tagged video shots labeled "non-tiger", "tiger" is the semantic concept related to the tagged video shots in the set, and the video processing unit 160 needs to determine whether the content of the video shot under test includes a tiger. Specifically, the functions and processing of the video processing unit 160 can be realized by the structure shown in fig. 5.
Fig. 5 is a block diagram schematically showing one possible example structure of the video processing unit 160 shown in fig. 1 in this application example. As shown in fig. 5, the video processing unit 160 may include a first decision subunit 510, a first calculation subunit 520, a second calculation subunit 530, and a second decision subunit 540.
In order to determine whether the video shot under test contains "semantic concepts related to tagged video shots in the video shot set", the first determining subunit 510 may first determine whether the video shot under test is included in the video shot set; the subsequent calculation is then handled in the two cases described below.
In the first case, i.e. where the video shot under test is not included in the video shot set, at least one representative frame of the video shot under test may first be extracted by the first calculating subunit 520, and image segmentation may then be performed on each extracted representative frame to obtain a plurality of regions of each representative frame. The calculated value of the soft label of the video shot under test, of each of its representative frames, and of each region of each of its representative frames can then be obtained from the results obtained by the calculating unit 150 (that is, from the calculated values of the soft labels of the video shots, the representative frames, and the regions in the video shot set); the specific calculation is described below. Then, based on these calculated values, the degree value to which the video shot under test contains semantic concepts related to the tagged video shots in the video shot set can be calculated by the second calculating subunit 530.
In this case, the soft label of the video shot under test, and the soft labels of its representative frames and regions, can be calculated according to the following expressions three to five.

Expression three:

$$f^S(S_t) = \frac{\sum_g f_g^S W^S(S_t, S_g)/d_g^S}{\sum_g W^S(S_t, S_g)/d_t^S} = \frac{d_t^S \sum_g f_g^S W^S(S_t, S_g)/d_g^S}{\sum_g W^S(S_t, S_g)}$$

Expression four:

$$f^F(F_t) = \frac{\sum_i f_i^F W^F(F_t, F_i)/d_i^F}{\sum_i W^F(F_t, F_i)/d_t^F} = \frac{d_t^F \sum_i f_i^F W^F(F_t, F_i)/d_i^F}{\sum_i W^F(F_t, F_i)}$$

Expression five:

$$f^R(R_t) = \frac{\sum_k f_k^R W^R(R_t, R_k)/d_k^R}{\sum_k W^R(R_t, R_k)/d_t^R} = \frac{d_t^R \sum_k f_k^R W^R(R_t, R_k)/d_k^R}{\sum_k W^R(R_t, R_k)}$$

Here $S_t$ denotes the video shot under test, $F_t$ a representative frame of the video shot under test, and $R_t$ a region in such a representative frame; $f^S(S_t)$, $f^F(F_t)$ and $f^R(R_t)$ are their respective soft labels, and $S_g$, $F_i$ and $R_k$ are as described hereinbefore. $W^S(S_t, S_g)$ is the similarity, based on shot-level visual features, between the video shot under test $S_t$ and the g-th video shot $S_g$ in the video shot set; $d_g^S$ is the sum of the similarities between $S_g$ and the video shots corresponding to all nodes in the shot-level weighted graph, and $d_t^S$ is the sum of the similarities between the video shot under test and the video shots corresponding to all nodes in the shot-level weighted graph. Likewise, $W^F(F_t, F_i)$ is the similarity, based on frame-level visual features, between a representative frame $F_t$ of the video shot under test and the i-th representative frame $F_i$ among all representative frames of all video shots in the video shot set; $d_i^F$ is the sum of the similarities between $F_i$ and the representative frames corresponding to all nodes in the frame-level weighted graph, and $d_t^F$ is the sum of the similarities between $F_t$ and the representative frames corresponding to all nodes in the frame-level weighted graph. Finally, $W^R(R_t, R_k)$ is the similarity, based on region-level visual features, between a region $R_t$ of a representative frame of the video shot under test and the k-th region $R_k$ among all regions of all representative frames of all video shots in the video shot set; $d_k^R$ is the sum of the similarities between $R_k$ and the regions corresponding to all nodes in the region-level weighted graph, and $d_t^R$ is the sum of the similarities between $R_t$ and the regions corresponding to all nodes in the region-level weighted graph.
In addition, in another implementation, the soft label of the video shot under test, and the soft labels of its representative frames and regions, can instead be calculated according to the following expressions six to eight.

Expression six:

$$f^S(S_t) = \frac{\sum_g f_g^S W^S(S_t, S_g)}{\sum_g W^S(S_t, S_g)}$$

Expression seven:

$$f^F(F_t) = \frac{\sum_i f_i^F W^F(F_t, F_i)}{\sum_i W^F(F_t, F_i)}$$

Expression eight:

$$f^R(R_t) = \frac{\sum_k f_k^R W^R(R_t, R_k)}{\sum_k W^R(R_t, R_k)}$$
It should be noted that, when the cost function is constructed using expression one described above, the soft label of the video shot under test and the soft labels of its representative frames and regions should be calculated using expressions three to five; similarly, when the cost function is constructed using expression two described above, they should be calculated using expressions six to eight.
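Either variant of this out-of-sample calculation reduces to a similarity-weighted average of already-computed soft labels, optionally with degree normalization. A minimal sketch (illustrative only; the array layout is an assumption) is given below; the same helper applies at the shot, frame, and region levels by passing the corresponding similarity vector and soft labels:

```python
import numpy as np

def out_of_sample_soft_label(w_t, f, d=None, d_t=None):
    """Soft label of a new (out-of-set) node.

    With d and d_t omitted, this implements expressions six to eight:
    a plain similarity-weighted average of the in-set soft labels.
    With the degree normalizers supplied, it implements the variant of
    expressions three to five.

    w_t : similarities between the new node and every in-set node
          (e.g. W^S(S_t, S_g) for all g).
    f   : calculated soft labels of the in-set nodes (e.g. f^S_g).
    d   : per-node degree sums d_g (each in-set node's similarity sum).
    d_t : degree sum of the new node.
    """
    w_t = np.asarray(w_t, dtype=float)
    f = np.asarray(f, dtype=float)
    if d is None:                       # expressions six to eight
        return np.dot(w_t, f) / np.sum(w_t)
    # expressions three to five (degree-normalized)
    return d_t * np.dot(w_t, f / np.asarray(d, dtype=float)) / np.sum(w_t)
```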
In the second case, that is, where the video shot under test is included in the video shot set, the calculated value of its soft label and the calculated values of the soft labels of each of its representative frames and regions have already been obtained by the calculating unit 150, so the degree value to which it contains semantic concepts related to the tagged video shots in the video shot set can be calculated directly by the second calculating subunit 530 in the manner described above.
In either of the two cases, the degree value to which the video shot under test contains semantic concepts related to the tagged video shots in the video shot set can be calculated using the following formula:

$$\alpha f_g^S + \beta \max_{F_i \in S_g} f_i^F + (1 - \alpha - \beta) \max_{R_k \in F_{i_0}} f_k^R$$

where the parameters are the same as defined above and are not repeated here (with the soft labels of the video shot under test and of its representative frames and regions substituted accordingly).
Thus, in this example, the degree value to which the video shot under test contains semantic concepts related to the tagged video shots in the video shot set can be obtained by the first determining subunit 510, the first calculating subunit 520, and the second calculating subunit 530. For example, where the positive label is "tiger", these three subunits 510 to 530 can determine to what degree the content of the video shot under test includes a tiger.
Then, if the degree value is greater than or equal to a fourth preset threshold (for example, 0.75), the second determining subunit 540 may determine that the content of the video shot under test contains "semantic concepts related to the tagged video shots in the video shot set"; if the degree value is smaller than the fourth preset threshold, the second determining subunit 540 may determine that it does not.

Accordingly, in the case where the second determining subunit 540 determines that the video shot under test contains "semantic concepts related to the tagged video shots in the video shot set", the second determining subunit 540 may further label the video shot under test with that semantic concept, that is, with the label information of the positively tagged video shots in the video shot set. For example, when the second determining subunit 540 determines that the video shot under test contains a "tiger", a "tiger" label may be attached to the video shot under test.
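As a hedged illustration of the decision carried out by the second calculating subunit 530 and the second determining subunit 540 (the function name, data layout, and default parameter values below are assumptions, not the patent's implementation):

```python
def detect_concept(f_S_t, f_F_t, f_R_t_by_frame, alpha=0.5, beta=0.3,
                   threshold=0.75, concept="tiger"):
    """Threshold the degree value computed from the soft labels of the shot
    under test (f_S_t), of its representative frames (f_F_t, one value per
    frame), and of the regions of each frame (f_R_t_by_frame[i] lists the
    region soft labels of frame i).  Returns the attached label, or None.
    """
    # representative frame F_{i0} with the largest soft label
    i0 = max(range(len(f_F_t)), key=lambda i: f_F_t[i])
    degree = (alpha * f_S_t
              + beta * f_F_t[i0]
              + (1 - alpha - beta) * max(f_R_t_by_frame[i0]))
    # fourth preset threshold, e.g. 0.75 as in the example above
    return concept if degree >= threshold else None
```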
In this example, the video processing apparatus obtains the soft labels of each video shot in the video shot set and of each representative frame and each region in each video shot by using the structural features of the three weighted graphs (the shot-level, frame-level, and region-level weighted graphs) and the relationships among them, and determines from these soft labels whether the video shot under test contains semantic concepts related to the tagged video shots in the video shot set. Compared with existing video concept detection techniques, the video concept detection realized by the video processing apparatus of the embodiment of the present invention uses the three weighted graphs simultaneously, makes fuller use of the feature information of the video shots, fully exploits the connections among the three graphs, and uses untagged video shots in addition to tagged ones, thereby obtaining a better video processing effect, that is, a more accurate concept detection result.
As can be seen from the above description, with the video processing apparatus according to the embodiment of the present invention, it is possible to utilize three types of weighting graphs, namely, the shot-level weighting graph, the frame-level weighting graph, and the region-level weighting graph, to more fully utilize the feature information of the video shot, and to fully mine the connection between the three types of weighting graphs, so as to obtain a better video processing effect.
In addition, the embodiment of the invention also provides a video processing method. An exemplary process of the method is described below in conjunction with fig. 6 and 7.
Fig. 6 is a flow chart schematically illustrating an exemplary process of a video processing method according to an embodiment of the present invention. As shown in fig. 6, the process flow 600 of the video processing method according to the embodiment of the present invention starts at step S610 and then performs step S620.
In step S620, at least one representative frame of each video shot in a video shot set is extracted, and each extracted representative frame is divided into a plurality of regions, wherein at least a portion of the video shots in the video shot set are tagged video shots. Then, step S630 is performed. The image segmentation involved in step S620 may adopt the method described above.
In step S630, a shot-level visual feature, a frame-level visual feature, and a region-level visual feature of each video shot in the video shot set are extracted. Then, step S640 is performed. The characteristics, selection, extraction method, etc. of the above three visual features can all refer to the corresponding contents described above, and detailed description thereof is omitted here.
In step S640, a lens-level weighting graph is constructed according to the lens-level visual characteristics, a frame-level weighting graph is constructed according to the frame-level visual characteristics, and a region-level weighting graph is constructed according to the region-level visual characteristics. Then, step S650 is performed.
Among them, in one implementation, the above-mentioned lens-level weighting graph, frame-level weighting graph and region-level weighting graph may be constructed as follows: taking each video shot in the video shot set as a node, and taking the similarity of the visual characteristics of the shot level between every two nodes as the weight of the weighted edge between the two nodes to construct the shot level weighted graph; constructing the frame-level weighted graph by taking each representative frame of each video shot in the video shot set as a node and taking the similarity of frame-level visual features between every two nodes as the weight of a weighted edge between the two nodes; and constructing the region-level weighted graph by taking each region in each representative frame of each video shot in the video shot set as a node and taking the similarity of region-level visual features between every two nodes as the weight of a weighted edge between the two nodes.
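As an illustration only, each of the three graphs could be stored as a symmetric affinity matrix built from the corresponding visual features. The Gaussian-kernel similarity below is an assumption for concreteness, since step S640 leaves the similarity measure open:

```python
import numpy as np

def build_weighted_graph(features, sigma=1.0):
    """Build one of the three weighted graphs (shot-, frame- or region-level)
    as an affinity matrix: nodes are the items whose feature vectors are the
    rows of `features`, and the weight of the edge between two nodes is their
    visual-feature similarity.  A Gaussian kernel is used here as a common
    choice; the patent does not prescribe this particular kernel.
    """
    features = np.asarray(features, dtype=float)
    sq_dists = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)          # no self-loops
    return W                          # W[a, b] = weight of edge (a, b)
```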
In step S650, with the soft label of each video shot in the video shot set, the soft label of each representative frame in the video shot set, and the soft label of each region in the representative frame as unknowns, a cost function is constructed according to the structure information of the lens-level weighting graph, the frame-level weighting graph, and the region-level weighting graph, and according to the relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region. Then, step S660 is performed.
Specifically, the above cost function may be constructed using a method as will be described below.
For example, such a first constraint condition may be set according to the structure information of the above-described lens-level weighting map, frame-level weighting map, and region-level weighting map: let the difference between the soft labels of two video shots for which the shot-level visual features are more similar be smaller, let the difference between the soft labels of two representative frames for which the frame-level visual features are more similar be smaller, and let the difference between the soft labels of two regions for which the region-level visual features are more similar be smaller.
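Graph-based constraints of this kind are commonly encoded as smoothness penalties. A representative shot-level term, given here only as an illustration of the first constraint and not as the patent's expression one, is

$$\sum_{g,h} W^S(S_g, S_h)\left(\frac{f_g^S}{\sqrt{d_g^S}} - \frac{f_h^S}{\sqrt{d_h^S}}\right)^2,$$

with analogous terms over $W^F$ and $W^R$ at the frame and region levels; the degree normalization mirrors the $d$ quantities appearing in expressions three to five.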
Further, a second constraint condition may be set according to the relationship between the soft labels of the tagged video shots in the video shot set and the soft labels of their representative frames and regions: the soft label of each negatively tagged video shot, of every representative frame in it, and of every region of every representative frame in it should be as close as possible to -1; the soft label of each positively tagged video shot should be as close as possible to 1; the soft label of the representative frame with the largest soft label in a positively tagged video shot should be as close as possible to the soft label of the video shot to which that frame belongs; and the soft label of the region with the largest soft label in each possible positive frame of a positively tagged video shot should be as close as possible to the soft label of the representative frame to which that region belongs. An illustrative form of such fitting terms is sketched after this paragraph.
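One way such fitting terms could be written (again an illustrative form under stated assumptions, not the patent's cost function) is, for a negatively tagged shot $S_g$ and a positively tagged shot $S_{g'}$:

$$\left(f_g^S + 1\right)^2 + \left(f_{g'}^S - 1\right)^2 + \left(\max_{F_i \in S_{g'}} f_i^F - f_{g'}^S\right)^2,$$

with analogous $(f_i^F + 1)^2$ and $(f_k^R + 1)^2$ terms for the frames and regions of negatively tagged shots, and a $\bigl(\max_{R_k \in F_i} f_k^R - f_i^F\bigr)^2$ term for each possible positive frame $F_i$ of a positively tagged shot. The $\max$ terms are what make the problem non-convex and motivate the constrained concave-convex procedure used in step S660.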
It should be noted that a possible positive frame may be a frame whose soft label value is higher than a fifth preset threshold, or a frame containing a region whose soft label value is higher than a sixth preset threshold.
Then, a cost function may be constructed according to the first constraint and the second constraint described above. The cost function may be in any form described above, and is not described herein again.
Then, in step S660, the calculated value of the unknown quantity is obtained by solving the optimization problem of the cost function. Then, step S670 is performed.
In step S660, an iterative calculation method may be used to solve the optimization problem: initial values are assigned to $f^S$ and $f^F$, iterative computation is then carried out using the cost function, and the values of $f^R$, $f^F$ and $f^S$ are finally obtained. Next, one possible example calculation procedure of step S660 will be described with reference to fig. 7.
Fig. 7 is a flow chart schematically illustrating one possible exemplary process of step S660 shown in fig. 6. As shown in fig. 7, in step S710, initial values are first assigned to the soft label $f^S$ of each video shot in the video shot set and the soft label $f^F$ of each representative frame in each video shot in the video shot set. In step S710, the same processing method as that performed by the initialization subunit 410 described above in conjunction with fig. 4 may be adopted to assign these initial values, and it will not be described in detail here. Then, step S720 is performed.
Next, the values of $f^R$, $f^F$ and $f^S$ are calculated by the loop processing of steps S720 to S750.
In step S720, according to the current value of the soft label $f^S$ of each video shot in the video shot set and the current value of the soft label $f^F$ of each representative frame in each video shot in the video shot set, the cost function is converted into a constrained minimization problem, which is solved using the constrained concave-convex procedure (CCCP) to obtain a calculated value of the soft label $f^R$ of each region of each representative frame in each video shot in the video shot set, as the current value of $f^R$. In step S720, $f^R$ can be obtained by the same method as the processing performed by the third calculation subunit 420 described above in connection with fig. 4, which is not described in detail here. Then, step S730 is performed.
In step S730, according to the current value of the soft label $f^S$ of each video shot in the video shot set and the current value of the soft label $f^R$ of each region of each representative frame in each video shot in the video shot set, the cost function is converted into a constrained minimization problem, which is solved using CCCP to obtain a calculated value of the soft label $f^F$ of each representative frame in each video shot in the video shot set, as the current value of $f^F$. In step S730, $f^F$ can be obtained by the same method as the processing performed by the fourth calculation subunit 430 described above in connection with fig. 4, which is not described in detail here. Then, step S740 is performed.
In step S740, according to the current value of the soft label $f^F$ of each representative frame in each video shot in the video shot set and the current value of the soft label $f^R$ of each region of each representative frame in each video shot in the video shot set, a calculated value of the soft label $f^S$ of each video shot in the video shot set can be obtained directly from the cost function, as the current value of $f^S$. In step S740, $f^S$ can be obtained by the same method as the processing performed by the fifth calculation subunit 440 described above in connection with fig. 4, which is not described in detail here. Then, step S750 is performed.
In step S750, it is judged whether the current calculated values of $f^R$, $f^F$ and $f^S$ have converged: if so, the current values of the soft labels of the video shots, of the representative frames and of the regions are retained as the calculated values of the unknown quantities in the cost function, and processing continues with step S670; otherwise, the process returns to step S720 for the next iteration.
Thus, through the iterative calculation of the loop of steps S720 to S750, with two of the three vectors $f^R$, $f^F$ and $f^S$ held fixed, the elements of the remaining vector are treated as variables and solved for; the iteration proceeds in the order $f^R \to f^F \to f^S \to f^R \to f^F \to f^S \to \cdots$ until the calculation results converge. In this way the calculated values of the unknown quantities in the cost function described in step S650 are obtained.
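A minimal sketch of this alternating iteration is given below; the `update_*` callables are placeholders for the subproblem solvers (the CCCP-based minimizations of steps S720/S730 and the direct calculation of step S740), and only the loop structure of steps S710 to S750 is illustrated:

```python
import numpy as np

def solve_soft_labels(update_fR, update_fF, update_fS, f_S0, f_F0,
                      tol=1e-6, max_iter=100):
    """Alternating iteration of steps S720-S750.  Each update_* callable
    solves for one of the three soft-label vectors with the other two held
    fixed; the solvers themselves are assumptions standing in for the
    patent's subproblems.
    """
    f_S = np.asarray(f_S0, dtype=float)   # step S710: initial values
    f_F = np.asarray(f_F0, dtype=float)
    f_R = None
    for _ in range(max_iter):
        f_R_new = update_fR(f_S, f_F)          # step S720
        f_F_new = update_fF(f_S, f_R_new)      # step S730
        f_S_new = update_fS(f_F_new, f_R_new)  # step S740
        # step S750: stop when all three vectors have converged
        converged = f_R is not None and all(
            np.max(np.abs(a - b)) < tol
            for a, b in ((f_S_new, f_S), (f_F_new, f_F), (f_R_new, f_R)))
        f_S, f_F, f_R = f_S_new, f_F_new, f_R_new
        if converged:
            break
    return f_S, f_F, f_R
```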
Turning to fig. 6, in step S670, video processing is performed based on the calculated value of the unknown quantity calculated as described above. Step S680 is then performed.
In one example of the video processing method according to the embodiment of the present invention, the video processing involved may be video retrieval, in which case the video shot set includes a tagged query video shot. In this case, in step S670, video shots in the video shot set other than the query video shot whose similarity to the query video shot is within a predetermined range may be determined as the retrieval result, based on the obtained calculated values. For example, the retrieval result may be the video shots satisfying all of the following: the soft label of the video shot is above a first preset threshold, the soft label of the representative frame with the largest soft label in the video shot is above a second preset threshold, and the soft label of the region with the largest soft label in that representative frame is above a third preset threshold.
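The threshold-based variant just described might be sketched as follows, using the same assumed data layout as the earlier retrieval sketch; the threshold values are placeholders, since the method leaves the first to third preset thresholds unspecified:

```python
def filter_by_thresholds(f_S, f_F, f_R, frames_of_shot, regions_of_frame,
                         t1=0.6, t2=0.6, t3=0.6):
    """Triple-threshold retrieval filter: keep a shot only if its own soft
    label, the soft label of its best representative frame, and the soft
    label of the best region in that frame all clear their thresholds.
    """
    results = []
    for g, frame_idx in enumerate(frames_of_shot):
        i0 = max(frame_idx, key=lambda i: f_F[i])   # best representative frame
        if (f_S[g] > t1 and f_F[i0] > t2
                and max(f_R[k] for k in regions_of_frame[i0]) > t3):
            results.append(g)
    return results
```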
Alternatively, the retrieval result may be the first N video shots (N being a positive integer) for which the weighted sum of the soft label of the video shot, the soft label of the representative frame with the largest soft label in the video shot, and the soft label of the region with the largest soft label in that representative frame is largest. In the special case where the query video shot includes only one frame of image, the query video shot is a query image, and the representative frame extracted from it is the query image itself. In addition, the retrieval results may be output in a certain order, for example in any one of the orders described above, which is not repeated here.
Further, in another example of the video processing method according to the embodiment of the present invention, the video processing involved may be video concept detection. In this case, in step S670, it may be determined, from the calculated values of the soft labels obtained in step S660, whether an untagged video shot under test contains semantic concepts related to the tagged video shots in the video shot set. The process of step S670 may then be implemented by steps S810 to S860 as shown in fig. 8, described below.
Fig. 8 is a flowchart schematically showing one possible exemplary process of step S670 shown in fig. 6 in the case where the video processing is video concept detection. As shown in fig. 8, in step S810, it is determined whether the video shot under test is included in the video shot set: if so, the soft label of the video shot under test and the soft labels of its representative frames and regions have already been obtained, so step S830 can be executed directly for the next calculation; if not, since those soft labels are not yet known, step S820 is performed to obtain them.
In step S820, at least one frame of the video shot under test may first be extracted as its representative frames, each representative frame may then be divided into a plurality of regions, and the calculated values of the soft label of the video shot under test, of each of its representative frames, and of each region of each of its representative frames may then be obtained from the calculated values of the unknown quantities already obtained. The specific calculation may follow the method described above for calculating these soft labels (for example, expressions three to five or six to eight), which is not repeated here. After step S820, step S830 is performed.
In step S830, the degree value to which the video shot under test contains semantic concepts related to the tagged video shots in the video shot set is calculated from the obtained calculated values of the soft label of the video shot under test, of each of its representative frames, and of each region of each of its representative frames; this calculation may follow the method for the degree value described in the corresponding section above and is not repeated here. Then, step S840 is performed.
In step S840, it is determined whether the degree value is greater than or equal to the fourth preset threshold: if so, step S850 is executed, in which it is determined that the video shot under test contains "semantic concepts related to the tagged video shots in the video shot set", and the subsequent steps are then performed (for example, step S680 shown in fig. 6); otherwise, step S860 is executed, in which it is determined that the video shot under test does not contain such semantic concepts, and the subsequent steps are then performed (for example, step S680 shown in fig. 6).
It should be noted that the processing or sub-processing of each step in the above-described video processing method according to the embodiment of the present invention may realize the operations and functions of the corresponding units, subunits, modules or submodules of the video processing apparatus described above, and can achieve similar technical effects; the description thereof is omitted here.
As can be seen from the above description, by applying the video processing method according to the embodiment of the present invention, it is possible to utilize three types of weighting graphs, namely, the shot-level weighting graph, the frame-level weighting graph and the region-level weighting graph, to more fully utilize the feature information of the video shot, and to fully mine the connection between the three types of weighting graphs, so as to obtain a better video processing effect. In addition, the video processing method according to the embodiment of the invention can simultaneously utilize the video shots with labels and the video shots without labels, thereby greatly enriching the available resources and enabling the processing effect to be better and more accurate.
In addition, an embodiment of the present invention also provides a device that includes the video processing apparatus described above, for example a camera, a video camera, a computer (e.g., a desktop or laptop computer), a mobile phone (e.g., a smart phone), a personal digital assistant, or a multimedia processing device with video playing capability (e.g., an MP3 or MP4 player).
According to the device of the embodiment of the invention, by integrating the video processing device, three types of weighted graphs of a lens-level weighted graph, a frame-level weighted graph and a region-level weighted graph can be utilized, the characteristic information of the video lens can be more fully utilized, and the relation among the three types of weighted graphs can be fully mined, so that a better video processing effect can be obtained.
Each constituent unit, sub-unit, and the like in the above-described video processing apparatus according to an embodiment of the present invention may be configured by software, firmware, hardware, or any combination thereof. In the case of implementation by software or firmware, a program constituting the software or firmware may be installed from a storage medium or a network to a machine having a dedicated hardware structure (for example, a general-purpose machine 900 shown in fig. 9), and the machine may be capable of executing various functions of the above-described constituent units and sub-units when various programs are installed.
Fig. 9 is a block diagram showing a configuration of hardware of one possible information processing apparatus that can be used to implement the video processing device and the video processing method according to the embodiment of the present invention.
In fig. 9, a Central Processing Unit (CPU)901 performs various processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 to a Random Access Memory (RAM) 903. In the RAM903, data necessary when the CPU901 executes various processes and the like is also stored as necessary. The CPU901, ROM902, and RAM903 are connected to each other via a bus 904. An input/output interface 905 is also connected to bus 904.
The following components are also connected to the input/output interface 905: an input section 906 (including a keyboard, a mouse, and the like), an output section 907 (including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like), a storage section 908 (including a hard disk, and the like), a communication section 909 (including a network interface card such as a LAN card, a modem, and the like). The communication section 909 performs communication processing via a network such as the internet. The driver 910 may also be connected to the input/output interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like can be mounted on the drive 910 as needed, so that a computer program read out therefrom can be mounted in the storage section 908 as needed.
In the case where the series of processes described above is implemented by software, a program constituting the software may be installed from a network such as the internet or from a storage medium such as the removable medium 911.
It will be understood by those skilled in the art that such a storage medium is not limited to the removable medium 911 shown in fig. 9 in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 911 include a magnetic disk (including a flexible disk), an optical disk (including a compact disc-read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a mini-disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM902, a hard disk included in the storage section 908, or the like, in which programs are stored, and which is distributed to users together with the device including them.
In addition, the invention also provides a program product which stores the machine-readable instruction codes. The instruction codes are read and executed by a machine, and can execute the video processing method according to the embodiment of the invention. Accordingly, various storage media such as magnetic disks, optical disks, magneto-optical disks, semiconductor memories, etc., for carrying such program products are also included in the disclosure of the present invention.
In the foregoing description of specific embodiments of the invention, features described and/or illustrated with respect to one embodiment may be used in the same or similar manner in one or more other embodiments, in combination with or instead of the features of the other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, specifies the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components. Ordinal terms such as "first" and "second" do not denote an order of execution or importance of the features, elements, steps or components they qualify, but are used merely to distinguish among those features, elements, steps or components for clarity of description.
Furthermore, the methods of the embodiments of the present invention are not limited to being performed in the time sequence described in the specification or shown in the drawings, and may be performed in other time sequences, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
Further, it is apparent that the respective operational procedures of the above-described method according to the present invention can also be implemented in the form of computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present invention can also be achieved by: a storage medium storing the above executable program code is directly or indirectly supplied to a system or an apparatus, and a computer or a Central Processing Unit (CPU) in the system or the apparatus reads out and executes the program code.
At this time, as long as the system or the apparatus has a function of executing a program, the embodiment of the present invention is not limited to the program, and the program may be in any form, for example, an object program, a program executed by an interpreter, a script program provided to an operating system, or the like.
Such machine-readable storage media include, but are not limited to: various memories and storage units, semiconductor devices, magnetic disk units such as optical, magnetic, and magneto-optical disks, and other media suitable for storing information, etc.
In addition, the present invention can also be implemented by a client computer connecting to a corresponding website on the internet, and downloading and installing computer program codes according to the present invention into the computer and then executing the program.
Finally, it should also be noted that, in this document, relational terms such as left and right or first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element introduced by "comprises a ..." does not preclude the existence of additional identical elements in the process, method, article or apparatus that comprises it.
In summary, in the embodiments according to the present invention, the present invention provides the following solutions:
supplementary note 1. a video processing apparatus, comprising: a pre-processing unit configured to extract at least one representative frame of each video shot in a set of video shots, respectively, and to divide each extracted representative frame into a plurality of regions, wherein at least some of the video shots of the set of video shots are tagged video shots; a feature extraction unit configured to extract a shot-level visual feature, a frame-level visual feature, and a region-level visual feature of each video shot in the set of video shots; a weighting graph establishing unit configured to construct a lens-level weighting graph according to the lens-level visual features, a frame-level weighting graph according to the frame-level visual features, and a region-level weighting graph according to the region-level visual features; a function constructing unit configured to construct a cost function according to structural information of the lens-level weighting map, the frame-level weighting map, and the region-level weighting map, and according to a relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region, with the soft label of each video shot in the video shot set, the soft label of each representative frame in the video shot set, and the soft label of each region in the video shot set being unknown quantities; a calculation unit configured to obtain a calculated value of the unknown quantity by solving an optimization problem of the cost function; and a video processing unit configured to perform video processing based on the calculated value obtained by the calculating unit.
Reference numeral 2 denotes the video processing apparatus according to reference numeral 1, which is a video retrieval apparatus, wherein the video shot set includes a query video shot with a tag, and the video processing unit is configured to determine, as a retrieval result, a video shot in the video shot set, other than the query video shot, whose similarity to the query video shot is within a predetermined range, based on the calculated value obtained by the calculating unit.
Note 3. The video processing apparatus according to note 2, wherein the video shots in the video shot set other than the query video shot whose similarity to the query video shot is within a predetermined range are one of the following: the video shots whose soft labels are higher than a first preset threshold, where the soft label of the representative frame with the largest soft label in the video shot is higher than a second preset threshold, and the soft label of the region with the largest soft label in that representative frame is higher than a third preset threshold; and the first N video shots with the largest weighted sum of the soft label of the video shot, the soft label of the representative frame with the largest soft label in the video shot, and the soft label of the region with the largest soft label in that representative frame, wherein N is a positive integer.
Note 4. the video processing apparatus according to note 2 or 3, wherein in a case where the query video footage includes only one frame image, the query video footage is a query image, and a representative frame in the query video footage is the query image itself.
Reference numeral 5, the video processing apparatus according to reference numeral 1, which is a video concept detection apparatus, wherein the video processing unit is configured to determine whether a video shot under test without a tag contains a semantic concept related to a tagged video shot in the video shot set, based on a result obtained by the calculation unit.
Supplementary note 6. The video processing apparatus according to supplementary note 5, wherein the video processing unit includes: a first determining subunit configured to determine whether the video shot under test is included in the video shot set; a first calculating subunit configured to, in a case where the video shot under test is not included in the video shot set, extract at least one representative frame of the video shot under test, divide each representative frame of the video shot under test into a plurality of regions, and obtain, from the results obtained by the calculating unit, a calculated value of the soft label of the video shot under test, of each representative frame of the video shot under test, and of each region of each representative frame of the video shot under test; a second calculating subunit configured to calculate, from the results obtained by the first calculating subunit, a degree value to which the video shot under test contains semantic concepts related to the tagged video shots in the video shot set; and a second determining subunit configured to determine that the video shot under test contains semantic concepts related to the tagged video shots in the video shot set if the degree value calculated by the second calculating subunit is greater than or equal to a fourth preset threshold, and to determine that the video shot under test does not contain such semantic concepts if the degree value is less than the fourth preset threshold.
Reference numeral 7, the video processing apparatus according to reference numeral 5 or 6, wherein the video processing unit is further configured to label the video shot to be tested with a label of a positively labeled video shot in the video shot set if the video shot to be tested is determined to contain a semantic concept related to the labeled video shot in the video shot set.
Note 8. the video processing apparatus according to any one of notes 1 to 7, wherein the weighted graph creating unit includes: a first establishing subunit, configured to construct the lens-level weighted graph with each video lens in the set of video lenses as a node and with a similarity between each two nodes on a lens-level visual feature as a weight of a weighted edge between the two nodes; a second establishing subunit configured to establish the frame-level weighted graph with each representative frame of each video shot in the set of video shots as a node and with a similarity in frame-level visual features between each two nodes as a weight of a weighted edge between the two nodes; and a third establishing subunit configured to establish the region-level weighted graph with each region of each representative frame of each video shot in the video shot set as a node, and with a similarity in region-level visual characteristics between each two nodes as a weight of a weighted edge between the two nodes.
Note 9. the video processing apparatus according to any one of notes 1 to 8, wherein the function constructing unit includes: a first setting subunit configured to set such a first constraint condition according to the structure information of the lens-level weighting map, the frame-level weighting map, and the region-level weighting map: the difference between the soft labels of two video shots with more similar shot-level visual features is made smaller, the difference between the soft labels of two representative frames with more similar frame-level visual features is made smaller, and the difference between the soft labels of two areas with more similar area-level visual features is made smaller; a second setting subunit configured to set such a second constraint condition according to a relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region: the method comprises the steps of enabling soft labels of video shots with negative labels, soft labels of all representative frames in the video shots with negative labels and soft labels of all regions of all representative frames in the video shots with negative labels to be as close as possible to-1, enabling the soft labels of the video shots with positive labels to be as close as possible to 1, enabling the soft labels of the representative frames with the largest soft labels in the video shots with positive labels to be as close as possible to the soft labels of the video shots which the representative frames belong to, and enabling the soft labels of the regions with the largest soft labels in each possible positive frame in the video shots with positive labels to be as close as possible to the soft labels of the representative frames which the regions belong to; and a function construction subunit configured to construct a cost function according to the first constraint condition and the second constraint condition with the soft label of each video shot in the set of video shots, the soft label of each representative frame of each video shot in the set of video shots, and the soft label of each region of each representative frame of each video shot in the set of video shots as unknowns.
Supplementary note 10 the video processing apparatus according to supplementary note 9, wherein the possible positive frames are frames in which: the value of the soft label of the frame is higher than a fifth preset threshold value; or the frame contains a region with the soft label higher than a sixth preset threshold.
Note 11. the video processing apparatus according to any one of notes 1 to 10, wherein the calculation unit includes:
an initialization subunit configured to initialize the soft label of each video shot in the set of video shots and the soft label of each representative frame in each video shot in the set of video shots;
a third computing subunit, configured to convert the cost function into a constrained minimization problem according to the current value of the soft label of each video shot in the video shot set and according to the current value of the soft label of each representative frame in each video shot in the video shot set, and solve the constrained minimization problem by using a constrained concave-convex process to obtain a computed value of the soft label of each region of each representative frame in each video shot in the video shot set;
a fourth calculating subunit, configured to convert the cost function into a constrained minimization problem according to the current value of the soft label of each video shot in the video shot set and according to the current value of the soft label of each region of each representative frame in each video shot in the video shot set, and solve the constrained minimization problem by using a constrained concave-convex process to obtain a calculated value of the soft label of each representative frame in each video shot in the video shot set;
a fifth calculating subunit, configured to calculate, according to the current value of the soft label of each representative frame in each video shot in the video shot set and according to the current value of the soft label of each region of each representative frame in each video shot in the video shot set, by using the cost function, to obtain a calculated value of the soft label of each video shot in the video shot set; and
a third determining subunit, configured to determine whether current values of the soft label of each video shot in the video shot set, the soft label of each representative frame in each video shot in the video shot set, and the soft label of each region of each representative frame in each video shot in the video shot set converge after each calculation performed in sequence by the third calculating subunit, the fourth calculating subunit, and the fifth calculating subunit: if so, keeping the current values of the soft labels of the video shots, the soft labels of the representative frames and the soft labels of the regions as the calculation values of the unknown quantity in the cost function; otherwise, respectively performing the next iterative computation by using the third computation subunit, the fourth computation subunit and the fifth computation subunit again until the third determination subunit determines that the current values of the soft labels of the video shots, the soft labels of the representative frames and the soft labels of the regions are converged.
Supplementary note 12. a video processing method, comprising: respectively extracting at least one representative frame of each video shot in a video shot set, and dividing each extracted representative frame into a plurality of areas, wherein at least part of the video shots in the video shot set are tagged video shots; extracting lens-level visual features, frame-level visual features and region-level visual features of each video lens in the video lens set; constructing a lens-level weighted graph according to the lens-level visual features, constructing a frame-level weighted graph according to the frame-level visual features, and constructing a region-level weighted graph according to the region-level visual features; constructing a cost function according to the structure information of the lens-level weighting graph, the frame-level weighting graph and the region-level weighting graph and according to the relationship among the soft label of each video shot, the soft label of each representative frame and the soft label of each region, with the soft label of each video shot in the video shot set, the soft label of each representative frame in the video shot set and the soft label of each region in the video shot as unknowns; obtaining a calculation value of the unknown quantity by solving an optimal problem of the cost function; and performing video processing based on the obtained calculated value.
Reference 13. the video processing method according to reference 12, the video processing being video retrieval, wherein the set of video shots comprises tagged query video shots, and the step of video processing according to the obtained calculated values comprises: and according to the obtained calculated value, determining the video shots, except the query video shot, in the video shot set, of which the similarity with the query video shot is within a preset range as a retrieval result.
Supplementary note 14. The video processing method according to supplementary note 13, wherein the video shots in the video shot set other than the query video shot whose similarity to the query video shot is within a predetermined range are one of the following: the video shots whose soft labels are higher than a first preset threshold, where the soft label of the representative frame with the largest soft label in the video shot is higher than a second preset threshold, and the soft label of the region with the largest soft label in that representative frame is higher than a third preset threshold; and the first N video shots with the largest weighted sum of the soft label of the video shot, the soft label of the representative frame with the largest soft label in the video shot, and the soft label of the region with the largest soft label in that representative frame, wherein N is a positive integer.
Reference 15. the video processing method according to reference 13 or 14, wherein, in a case where the query video footage includes only one frame image, the query video footage is a query image, and a representative frame in the query video footage is the query image itself.
Note 16. the video processing method according to note 12, wherein the video processing is video concept detection, and the step of performing video processing based on the obtained calculation value includes: and judging whether the video shots to be detected without labels contain semantic concepts related to the labeled video shots in the video shot set or not according to the obtained calculated values.
Reference numeral 17, the video processing method according to reference numeral 16, wherein the determining whether the video shots to be tested without labels contain semantic concepts related to the labeled video shots in the video shot set comprises: judging whether the video shot to be detected is included in the video shot set or not; under the condition that the video lens to be detected is not included in the video lens set, extracting at least one representative frame of the video lens to be detected, dividing each representative frame of the video lens to be detected into a plurality of areas, and obtaining a calculation value of a soft label of the video lens to be detected, a calculation value of a soft label of each representative frame in the video lens to be detected and a calculation value of a soft label of each area of each representative frame in the video lens to be detected according to the calculation value of the unknown quantity; calculating a degree value of the video shot to be detected containing semantic concepts related to the tagged video shots in the video shot set according to the obtained calculated value of the soft tag of the video shot to be detected, the calculated value of the soft tag of each representative frame in the video shot to be detected and the calculated value of the soft tag of each area of each representative frame in the video shot to be detected; and determining that the video shot under test contains semantic concepts related to the tagged video shots in the video shot set if the degree value is greater than or equal to a fourth preset threshold, and determining that the video shot under test does not contain semantic concepts related to the tagged video shots in the video shot set if the degree value is less than the fourth preset threshold.
Supplementary note 18. the video processing method according to supplementary note 16 or 17, further comprising: and under the condition that the video lens to be detected is judged to contain semantic concepts related to the video lens with the label in the video lens set, labeling the video lens to be detected by using the label of the video lens with the positive label in the video lens set.
Supplementary note 19. the video processing method according to any of supplementary notes 12-18, wherein said constructing a lens-level weighting map according to said lens-level visual characteristics, a frame-level weighting map according to said frame-level visual characteristics, and a region-level weighting map according to said region-level visual characteristics comprises: taking each video shot in the video shot set as a node, and taking the similarity of every two nodes on a shot-level visual feature as a weight of a weighted edge between the two nodes to construct the shot-level weighted graph; constructing the frame-level weighted graph by taking each representative frame of each video shot in the video shot set as a node and taking the similarity of each two nodes on the frame-level visual feature as the weight of a weighted edge between the two nodes; and constructing the region-level weighted graph by taking each region of each representative frame of each video shot in the video shot set as a node and taking the similarity of each two nodes on the region-level visual characteristics as the weight of a weighted edge between the two nodes.
Supplementary note 20. The video processing method according to any one of supplementary notes 12-19, wherein the cost function is constructed by: setting a first constraint condition according to the structure information of the shot-level weighted graph, the frame-level weighted graph, and the region-level weighted graph, namely that the more similar two video shots are in shot-level visual features, the smaller the difference between their soft labels; the more similar two representative frames are in frame-level visual features, the smaller the difference between their soft labels; and the more similar two regions are in region-level visual features, the smaller the difference between their soft labels; setting a second constraint condition according to the relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region, namely that the soft labels of the negatively labeled video shots, the soft labels of all representative frames in those shots, and the soft labels of all regions of those representative frames are made as close as possible to -1; the soft label of each positively labeled video shot is made as close as possible to 1; within each positively labeled video shot, the soft label of the representative frame having the largest soft label is made as close as possible to the soft label of the video shot to which that frame belongs; and within each possible positive frame of each positively labeled video shot, the soft label of the region having the largest soft label is made as close as possible to the soft label of the representative frame to which that region belongs; and constructing the cost function according to the first constraint condition and the second constraint condition, with the soft label of each video shot in the video shot set, the soft label of each representative frame of each video shot in the set, and the soft label of each region of each representative frame of each video shot in the set as unknown quantities.
Supplementary note 21. The video processing method according to supplementary note 20, wherein a possible positive frame is a frame satisfying either of the following: the value of the frame's soft label is higher than a fifth preset threshold; or the frame contains a region whose soft label value is higher than a sixth preset threshold.
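Supplementary notes 20-21 describe the cost function only through its constraints. One plausible closed form, assuming squared penalties (an illustration, not the patent's exact formula), is:

```latex
% Illustrative cost function. f^s, f^r, f^g: soft labels of shots,
% representative frames, and regions; w^s, w^r, w^g: edge weights of the
% three weighted graphs; L^- / L^+: negatively / positively labeled shots.
J = \sum_{i<j} w^{s}_{ij}\,(f^{s}_{i}-f^{s}_{j})^{2}
  + \sum_{i<j} w^{r}_{ij}\,(f^{r}_{i}-f^{r}_{j})^{2}
  + \sum_{i<j} w^{g}_{ij}\,(f^{g}_{i}-f^{g}_{j})^{2}
  + \lambda \Big[ \sum_{k\in L^{-}} (f^{s}_{k}+1)^{2}
                + \sum_{k\in L^{+}} (f^{s}_{k}-1)^{2}
                + \sum_{k\in L^{+}} \big( \max_{m\in k} f^{r}_{m} - f^{s}_{k} \big)^{2} \Big]
```

Analogous terms for the frames and regions of negatively labeled shots, and for the region-to-frame coupling within possible positive frames, are omitted for brevity. Note that the max terms are what couple the three levels and make J non-convex, which is plausibly why supplementary note 22 solves the problem by alternation with the constrained concave-convex procedure rather than in one pass.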
Supplementary note 22. The video processing method according to any one of supplementary notes 12-21, wherein obtaining the calculated values of the unknown quantities by solving the optimization problem of the cost function comprises the following steps, sketched in code after this note:
setting initial values for the soft label of each video shot in the video shot set and the soft label of each representative frame in each video shot in the video shot set;
converting the cost function into a constrained minimization problem according to the current values of the soft labels of the video shots in the video shot set and the current values of the soft labels of the representative frames in those video shots, and solving the constrained minimization problem with the constrained concave-convex procedure to obtain a calculated value of the soft label of each region of each representative frame in each video shot in the video shot set;
converting the cost function into a constrained minimization problem according to the current values of the soft labels of the video shots in the video shot set and the current values of the soft labels of the regions of the representative frames in those video shots, and solving the constrained minimization problem with the constrained concave-convex procedure to obtain a calculated value of the soft label of each representative frame in each video shot in the video shot set;
calculating, with the cost function, according to the current values of the soft labels of the representative frames in each video shot in the video shot set and the current values of the soft labels of the regions of those representative frames, a calculated value of the soft label of each video shot in the video shot set; and
determining whether the current values of the soft label of each video shot, the soft label of each representative frame, and the soft label of each region have converged: if so, keeping these current values as the calculated values of the unknown quantities in the cost function; otherwise, performing the next round of iterative calculation, computing in turn the calculated value of the soft label of each region, the calculated value of the soft label of each representative frame, and the calculated value of the soft label of each video shot, until all three sets of values converge.
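Here is the promised schematic of this alternating scheme, as a minimal Python sketch. The solve_* callables are placeholders for the three subproblems; in the patent the region-level and frame-level updates are constrained minimizations solved with the constrained concave-convex procedure, while the shot-level update evaluates the cost function directly. The function names, the list-of-floats representation, and the convergence test are all illustrative assumptions.

```python
def solve_soft_labels(shots, frames, solve_regions, solve_frames, solve_shots,
                      tol=1e-4, max_iter=100):
    """shots, frames: initial soft-label lists; each solve_* callable returns
    the updated soft-label list of its level given the other two levels."""
    prev = None
    for _ in range(max_iter):
        regions = solve_regions(shots, frames)   # step (a): fix shot + frame labels
        frames = solve_frames(shots, regions)    # step (b): fix shot + region labels
        shots = solve_shots(frames, regions)     # step (c): direct evaluation of the cost function
        cur = shots + frames + regions
        if prev is not None and max(abs(a - b) for a, b in zip(cur, prev)) < tol:
            break                                # step (d): soft labels have converged
        prev = cur
    return shots, frames, regions
```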
Supplementary note 23. An apparatus comprising the video processing device according to any one of supplementary notes 1-11.
Supplementary note 24. The apparatus according to supplementary note 23, wherein the apparatus is any one of the following: a camera, a camcorder, a computer, a cell phone, a personal digital assistant, and a multimedia processing device.
Supplementary note 25. A computer-readable storage medium storing a computer program executable by a computing device, the program, when executed, being capable of causing the computing device to carry out the video processing method according to any one of supplementary notes 12-22.

Claims (10)

1. A video processing apparatus comprising:
a pre-processing unit configured to extract at least one representative frame of each video shot in a set of video shots, respectively, and to divide each extracted representative frame into a plurality of regions, wherein at least some of the video shots of the set of video shots are tagged video shots;
a feature extraction unit configured to extract a shot-level visual feature, a frame-level visual feature, and a region-level visual feature of each video shot in the set of video shots;
a weighted-graph establishing unit configured to construct a shot-level weighted graph according to the shot-level visual features, a frame-level weighted graph according to the frame-level visual features, and a region-level weighted graph according to the region-level visual features;
a function constructing unit configured to construct a cost function according to structure information of the shot-level weighted graph, the frame-level weighted graph, and the region-level weighted graph, and according to the relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region, with the soft label of each video shot in the video shot set, the soft label of each representative frame of each video shot in the set, and the soft label of each region of each representative frame in the set as unknown quantities;
a calculation unit configured to obtain calculated values of the unknown quantities by solving an optimization problem of the cost function; and
a video processing unit configured to perform video processing based on the calculated values obtained by the calculation unit.
2. The video processing device of claim 1, the video processing device being a video retrieval device, wherein,
the set of video shots includes a tagged query video shot, and
the video processing unit is configured to determine, according to the calculated values obtained by the calculation unit, video shots in the video shot set other than the query video shot whose similarity to the query video shot is within a predetermined range, as retrieval results.
3. The video processing apparatus according to claim 2, wherein a video shot in the video shot set, other than the query video shot, whose similarity to the query video shot is within the predetermined range is one of the following (the second criterion is sketched after this claim):
a video shot whose soft label is higher than a first preset threshold, in which the soft label of the representative frame having the largest soft label is higher than a second preset threshold, and in which the soft label of the region having the largest soft label within that representative frame is higher than a third preset threshold; and
a video shot for which the weighted sum of three values, namely the soft label of the video shot, the soft label of its representative frame having the largest soft label, and the soft label of the region having the largest soft label within that representative frame, is among the N largest such sums over the video shot set, wherein N is a positive integer.
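A hypothetical sketch of the second criterion: rank candidate shots by a weighted sum of the three soft labels and keep the N best. The weight values and N are free parameters here, not fixed by the claim.

```python
def retrieve_top_n(candidates, weights=(0.5, 0.3, 0.2), n=10):
    """candidates: list of (shot_id, shot_soft, best_frame_soft, best_region_soft)."""
    def score(entry):
        _, s, f, r = entry
        return weights[0] * s + weights[1] * f + weights[2] * r
    ranked = sorted(candidates, key=score, reverse=True)  # largest weighted sum first
    return ranked[:n]
```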
4. The video processing device of claim 1, the video processing device being a video concept detection device, wherein,
the video processing unit is configured to determine, based on the results obtained by the calculation unit, whether an untagged video shot under test contains semantic concepts related to the tagged video shots in the set of video shots.
5. The video processing device of claim 4, wherein the video processing unit comprises:
a first determination subunit configured to determine whether the video shot under test is included in the set of video shots;
a first calculating subunit configured to, in a case where the video shot under test is not included in the video shot set, extract at least one representative frame of the video shot under test, divide each representative frame of the video shot under test into a plurality of regions, and obtain, according to the results obtained by the calculation unit, a calculated value of the soft label of the video shot under test, a calculated value of the soft label of each representative frame of the video shot under test, and a calculated value of the soft label of each region of each representative frame of the video shot under test;
a second calculating subunit configured to calculate, according to the result obtained by the first calculating subunit, a degree value indicating the extent to which the video shot under test contains semantic concepts related to the tagged video shots in the video shot set; and
a second determination subunit configured to determine that the video shot under test contains semantic concepts related to the tagged video shots in the video shot set if the degree value calculated by the second calculating subunit is greater than or equal to a fourth preset threshold, and to determine that the video shot under test does not contain such semantic concepts if the degree value is less than the fourth preset threshold.
6. The video processing apparatus according to any one of claims 1 to 5, wherein the function constructing unit comprises:
a first setting subunit configured to set a first constraint condition according to the structure information of the shot-level weighted graph, the frame-level weighted graph, and the region-level weighted graph, namely that the more similar two video shots are in shot-level visual features, the smaller the difference between their soft labels; the more similar two representative frames are in frame-level visual features, the smaller the difference between their soft labels; and the more similar two regions are in region-level visual features, the smaller the difference between their soft labels;
a second setting subunit configured to set a second constraint condition according to the relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region, namely that the soft labels of the negatively labeled video shots, the soft labels of all representative frames in those shots, and the soft labels of all regions of those representative frames are made as close as possible to -1; the soft label of each positively labeled video shot is made as close as possible to 1; within each positively labeled video shot, the soft label of the representative frame having the largest soft label is made as close as possible to the soft label of the video shot to which that frame belongs; and within each possible positive frame of each positively labeled video shot, the soft label of the region having the largest soft label is made as close as possible to the soft label of the representative frame to which that region belongs; and
a function constructing subunit configured to construct the cost function according to the first constraint condition and the second constraint condition, with the soft label of each video shot in the set of video shots, the soft label of each representative frame of each video shot in the set, and the soft label of each region of each representative frame of each video shot in the set as unknown quantities.
7. The video processing apparatus of claim 6, wherein a possible positive frame is a frame satisfying either of the following:
the value of the frame's soft label is higher than a fifth preset threshold; or
the frame contains a region whose soft label value is higher than a sixth preset threshold.
8. The video processing apparatus according to any one of claims 1 to 5, wherein the calculation unit comprises:
an initialization subunit configured to initialize the soft label of each video shot in the set of video shots and the soft label of each representative frame in each video shot in the set of video shots;
a third calculating subunit configured to convert the cost function into a constrained minimization problem according to the current value of the soft label of each video shot in the video shot set and according to the current value of the soft label of each representative frame in each video shot in the set, and to solve the constrained minimization problem with the constrained concave-convex procedure to obtain a calculated value of the soft label of each region of each representative frame in each video shot in the set;
a fourth calculating subunit configured to convert the cost function into a constrained minimization problem according to the current value of the soft label of each video shot in the video shot set and according to the current value of the soft label of each region of each representative frame in each video shot in the set, and to solve the constrained minimization problem with the constrained concave-convex procedure to obtain a calculated value of the soft label of each representative frame in each video shot in the set;
a fifth calculating subunit configured to calculate, with the cost function, according to the current value of the soft label of each representative frame in each video shot in the video shot set and according to the current value of the soft label of each region of each representative frame in each video shot in the set, a calculated value of the soft label of each video shot in the set; and
a third determining subunit configured to determine, after each round of calculation performed in sequence by the third, fourth, and fifth calculating subunits, whether the current values of the soft label of each video shot in the video shot set, the soft label of each representative frame in each video shot in the set, and the soft label of each region of each representative frame in each video shot in the set have converged: if so, keeping these current values as the calculated values of the unknown quantities in the cost function; otherwise, causing the third, fourth, and fifth calculating subunits to perform the next round of iterative calculation, until the third determining subunit determines that the current values of the soft labels of the video shots, the representative frames, and the regions have converged.
9. A video processing method, comprising:
extracting at least one representative frame of each video shot in a video shot set, respectively, and dividing each extracted representative frame into a plurality of regions, wherein at least some of the video shots in the video shot set are tagged video shots;
extracting shot-level visual features, frame-level visual features, and region-level visual features of each video shot in the video shot set;
constructing a shot-level weighted graph according to the shot-level visual features, a frame-level weighted graph according to the frame-level visual features, and a region-level weighted graph according to the region-level visual features;
constructing a cost function according to the structure information of the shot-level weighted graph, the frame-level weighted graph, and the region-level weighted graph, and according to the relationship among the soft label of each video shot, the soft label of each representative frame, and the soft label of each region, with the soft label of each video shot in the video shot set, the soft label of each representative frame in each video shot in the set, and the soft label of each region of each representative frame in the set as unknown quantities;
obtaining calculated values of the unknown quantities by solving an optimization problem of the cost function; and
performing video processing based on the obtained calculated values.
10. An apparatus comprising the video processing device of any of claims 1-8.
CN201210071078.3A 2012-03-16 2012-03-16 Video process apparatus, method for processing video frequency and equipment Active CN103312938B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201210071078.3A CN103312938B (en) 2012-03-16 2012-03-16 Video process apparatus, method for processing video frequency and equipment
JP2013053509A JP6015504B2 (en) 2012-03-16 2013-03-15 Video processing apparatus, video processing method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210071078.3A CN103312938B (en) 2012-03-16 2012-03-16 Video process apparatus, method for processing video frequency and equipment

Publications (2)

Publication Number Publication Date
CN103312938A CN103312938A (en) 2013-09-18
CN103312938B true CN103312938B (en) 2016-07-06

Family

ID=49137695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210071078.3A Active CN103312938B (en) 2012-03-16 2012-03-16 Video process apparatus, method for processing video frequency and equipment

Country Status (2)

Country Link
JP (1) JP6015504B2 (en)
CN (1) CN103312938B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310221B (en) * 2012-03-16 2016-04-13 富士通株式会社 Image processing apparatus, image processing method and equipment
CN111368732B (en) * 2020-03-04 2023-09-01 阿波罗智联(北京)科技有限公司 Method and device for detecting lane lines
CN114390200B (en) * 2022-01-12 2023-04-14 平安科技(深圳)有限公司 Camera cheating identification method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6078688A (en) * 1996-08-23 2000-06-20 Nec Research Institute, Inc. Method for image segmentation by minimizing the ratio between the exterior boundary cost and the cost of the enclosed region
CN101299241A (en) * 2008-01-14 2008-11-05 浙江大学 Method for detecting multi-mode video semantic conception based on tensor representation
CN102184242A (en) * 2011-05-16 2011-09-14 天津大学 Cross-camera video abstract extracting method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5633734B2 (en) * 2009-11-11 2014-12-03 ソニー株式会社 Information processing apparatus, information processing method, and program
JP5531865B2 (en) * 2010-09-03 2014-06-25 カシオ計算機株式会社 Image processing apparatus, image processing method, and program

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6078688A (en) * 1996-08-23 2000-06-20 Nec Research Institute, Inc. Method for image segmentation by minimizing the ratio between the exterior boundary cost and the cost of the enclosed region
CN101299241A (en) * 2008-01-14 2008-11-05 浙江大学 Method for detecting multi-mode video semantic conception based on tensor representation
CN102184242A (en) * 2011-05-16 2011-09-14 天津大学 Cross-camera video abstract extracting method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-Layer Multi-Instance Learning for Video Concept Detection; Zhiwei Gu et al.; IEEE Transactions on Multimedia; 20081212; Vol. 10, No. 8; pp. 1605-1616 *
Video retrieval based on fusion of visual and structural spectral features; Zhai Sulan et al.; Computer Engineering and Applications; 20121111; pp. 176-180 *

Also Published As

Publication number Publication date
JP2013196700A (en) 2013-09-30
JP6015504B2 (en) 2016-10-26
CN103312938A (en) 2013-09-18

Similar Documents

Publication Publication Date Title
Zang et al. Attention-based temporal weighted convolutional neural network for action recognition
CN102334118B (en) Promoting method and system for personalized advertisement based on interested learning of user
US8737771B2 (en) Annotation addition method, annotation addition system using the same, and machine-readable medium
JP5861539B2 (en) Method and apparatus for acquiring descriptive information of a plurality of images, and image matching method
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN113987119B (en) Data retrieval method, and cross-modal data matching model processing method and device
CN114037876A (en) Model optimization method and device
CN109918513B (en) Image processing method, device, server and storage medium
US9081822B2 (en) Discriminative distance weighting for content-based retrieval of digital pathology images
CN111460223B (en) Short video single-label classification method based on multi-mode feature fusion of deep network
Bandla et al. Active learning of an action detector from untrimmed videos
CN103310221B (en) Image processing apparatus, image processing method and equipment
WO2023160290A1 (en) Neural network inference acceleration method, target detection method, device, and storage medium
CN103312938B (en) Video process apparatus, method for processing video frequency and equipment
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN114218380A (en) Multi-mode-based cold chain loading user portrait label extraction method and device
CN104281569B (en) Construction device and method, sorter and method and electronic equipment
Dong et al. A supervised dictionary learning and discriminative weighting model for action recognition
CN111259932A (en) Classification method, medium, device and computing equipment
CN113610106B (en) Feature compatible learning method and device between models, electronic equipment and medium
CN115757844A (en) Medical image retrieval network training method, application method and electronic equipment
Zhang et al. Self-paced uncertainty estimation for one-shot person re-identification
Shi et al. Audio segment classification using online learning based tensor representation feature discrimination
Nag et al. CNN based approach for post disaster damage assessment
CN116486093A (en) Information processing apparatus, information processing method, and machine-readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant