CN112912888A - Apparatus and method for identifying video activity - Google Patents

Apparatus and method for identifying video activity

Info

Publication number
CN112912888A
CN112912888A (application CN201880098842.1A)
Authority
CN
China
Prior art keywords
video
tag
label
rgb
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880098842.1A
Other languages
Chinese (zh)
Inventor
米兰·雷德齐克
塔里克·乔杜里
刘少卿
遇冰
袁鹏
哈姆迪·奥兹贝布尔特卢
王洪斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN112912888A publication Critical patent/CN112912888A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention relate to action recognition in videos. To this end, one embodiment of the invention provides an apparatus and method for identifying one or more activities in a video, where the apparatus and method use a deep learning network. The apparatus is configured to: receive the video; divide the video into an RGB portion and an Optical Flow (OF) portion; calculate a plurality of spatial label prediction values based on the RGB portion using a spatial portion of the deep learning network; calculate a plurality of temporal label prediction values based on the OF portion using a temporal portion of the deep learning network; and fuse the spatial label prediction values and the temporal label prediction values to obtain a label associated with an activity in the video.

Description

Apparatus and method for identifying video activity
Technical Field
Embodiments of the invention relate to action recognition in videos. To this end, embodiments of the present invention provide an apparatus and method for identifying one or more activities in a video, where the apparatus and method use a deep learning network. Accordingly, embodiments of the present invention are also directed to designing an efficient deep learning network architecture that is particularly well suited to recognizing activities in videos. For example, embodiments of the invention are suitable for video surveillance systems and video cameras.
Background
Traditional video surveillance systems assist police or security personnel in preventing crime. The benefits of a network of surveillance cameras are evident: rather than deploying security or law-enforcement personnel in every corner, a large area can be monitored by a small number of personnel, for example from a control room. Since the 1990s, the number of surveillance cameras has increased exponentially. Video surveillance systems typically use motion detection algorithms, which are sensitive to background motion such as illumination changes, camera shake, swaying branches and leaves, or distant vehicles, but generally cannot handle continuous motion in the camera field of view.
Accordingly, a great deal of research is currently underway to combine image and video analysis methods with deep learning techniques for more autonomous analysis. The use of deep learning algorithms may enhance the robustness of video surveillance systems, particularly when large amounts of data are used and the algorithms are trained over long periods of time (e.g., days). Current deep learning frameworks are network based, tuned by many parameters, and achieve better performance than traditional computer-vision-based solutions. Notably, full use of hardware and careful dataset creation are required to achieve more reliable results in this area of research. It is therefore promising to develop a robust end-to-end solution for behavior analysis of videos and to integrate it with existing surveillance systems and the like.
In particular, Behavior Analysis (BA) based on video action recognition in video surveillance data, or abnormal behavior analysis (UBA) for certain categories of activities, has attracted a great deal of research interest in the consumer industry. Compared with traditional computer-vision-based BA, a deep-learning-based BA system has the following advantages:
1. extraction of hand-crafted features (edges, corners, colors) can be avoided;
2. it is robust and can be used on different datasets (by supporting data augmentation);
3. the model can be saved and its weights fine-tuned using transfer learning in order to achieve robustness across different datasets;
4. it provides a way to exploit large amounts of computing resources, running the system with a multi-GPU heuristic on the available GPU processors.
However, conventional BA approaches based on video action recognition still rely on training with manual labeling (i.e., supervised learning), which needs to be provided for each camera sensor. Furthermore, conventional BA methods based on video action recognition still suffer from low accuracy, low efficiency, and/or poor robustness.
Disclosure of Invention
In view of the above, embodiments of the present invention aim to improve on conventional video action recognition methods. It is an object to provide an apparatus and method that are more efficient, more accurate, and more reliable in identifying activities in a video than conventional methods, or in other words, more robust than conventional methods.
In particular, the apparatus and method are able to identify different types of user activities (e.g., categories) given in the form of videos or images (i.e., so-called action events in videos) by associating a tag with each activity determined in the video. Based on such tags, the apparatus and method are also able to analyze the behavior of people appearing in these videos. Currently, video surveillance systems and applications target tasks that are typically performed by security personnel, namely patrol inspection, perimeter intrusion monitoring, detection of unattended objects, and the like. Thus, the apparatus and method are capable of specifically detecting the above-mentioned activities in a video. The apparatus and method are based on deep learning techniques.
Embodiments of the invention are defined in the appended independent claims. Advantageous embodiments of the invention are further defined in the respective dependent claims.
In particular, embodiments of the present invention support the implementation of a BA module based on activity recognition heuristics, taking into account several traditional approaches but fusing them in a new, unified framework. The main idea is, in particular, to use a late-fusion function in order to obtain more information about the input video. Embodiments of the invention also consider the principles of an effective deep learning network architecture for action recognition in videos, and can learn the network model from only a limited number of training samples.
Yet another idea is to extract Red, Green, Blue (RGB) frames and Optical Flow (OF) frames, respectively, from the input video in a dual-network manner. A sparse temporal sampling strategy is then combined with video-level supervision, so that the late-fusion function can be used for effective learning. The RGB frames are individual video images extracted at a particular frame rate. The OF may be calculated, for example, by determining the pattern of apparent motion of image objects and/or of the camera between two consecutive frames, caused by the motion of the objects. The OF can be described as a two-dimensional (2D) vector field, where each vector is a displacement vector showing the movement of a point from the first frame to the second frame. A 2-channel array can be obtained from the optical flow vectors, and the magnitude and direction of these vectors can be found. The direction corresponds to the hue value of the image, and the magnitude corresponds to the value plane.
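For illustration, the OF part could be computed with a standard dense optical flow method. The following Python sketch uses OpenCV's Farneback algorithm and the hue/value encoding described above; the choice of algorithm, the input path, and all parameter values are assumptions of this sketch, not requirements of the embodiments.

import cv2
import numpy as np

# Hedged sketch: dense optical flow between two consecutive frames, visualized
# with direction -> hue and magnitude -> value, as described above.
cap = cv2.VideoCapture("input_video.mp4")   # hypothetical input file
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()
assert ok1 and ok2, "at least two frames are needed"

prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# 2-channel displacement field: flow[..., 0] = dx, flow[..., 1] = dy
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
hsv = np.zeros_like(frame1)
hsv[..., 0] = ang * 180 / np.pi / 2                               # direction -> hue
hsv[..., 1] = 255
hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)   # magnitude -> value
of_image = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)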
A first aspect of the invention provides a device for identifying one or more activities in a video, each activity being associated with a predetermined label, wherein the device is adapted to use a deep learning network and to perform the following operations during an inference phase: receiving the video; dividing the video into an RGB portion and an OF portion; calculating a plurality of spatial label prediction values based on the RGB portion using a spatial portion of the deep learning network; calculating a plurality of temporal label prediction values based on the OF portion using a temporal portion of the deep learning network; and fusing the spatial label prediction values and the temporal label prediction values to obtain a label associated with an activity in the video.
By first obtaining the spatial tag prediction value and the temporal tag prediction value separately and then fusing them to obtain the final tag (i.e., using a late-stage fusion function), the activity in the video can be identified more efficiently, more accurately, and more reliably.
"deep learning networks" include, for example, Neural networks, such as Convolutional Neural Networks (CNN) or Convolutional networks (ConvNet), and/or include one or more hopping connections as proposed in Residual networks (ResNet), and/or batch normalization (bn) -initiation type networks. The deep learning network can be trained during a training phase of the device and can be used to identify activity in the video during an inference phase of the device.
A "tag" or "category tag" identifies an activity or class of activities (e.g., "patrol" or "perimeter intrusion"). That is, one tag is directly associated with one activity. The tag may be determined prior to operating the device for activity recognition.
"label prediction value" refers to a predicted label, i.e., at least one preliminary label, and may generally include a prediction value for a plurality of labels, such as label candidates, each associated with a different probability of becoming a correct label.
"temporal" is based on OF, i.e., referring to motion in the video, and "spatial" is based on RGB, i.e., referring to the spatial distribution OF color intensity, etc., characteristics OF pixels or regions, etc., in the video.
In one implementation form of the first aspect, the apparatus is further configured to: extracting a plurality of RGB segments and a plurality of OF segments from the video so as to divide the video into the RGB portion and the OF portion; computing a plurality of label prediction values for each of the RGB segments using the spatial portion of the deep learning network; calculating a plurality of label prediction values for each of the OF segments using the temporal portion of the deep learning network; computing the plurality of spatial label prediction values based on the label prediction values of the RGB segments; and calculating the plurality of temporal label prediction values based on the label prediction values of the OF segments.
In this way, the tags can be predicted more accurately and efficiently. A "segment" is a short portion or fraction of the video that may, for example, be randomly sampled from the video. An "RGB segment" contains the RGB frames extracted from a video segment, and an "OF segment" contains the OF frames extracted from a video segment.
In another implementation form of the first aspect, the apparatus is further configured to: computing a plurality of label prediction values for each RGB frame in a given RGB segment using the spatial portion of the deep learning network, and computing the plurality of label prediction values for the given RGB segment based on the label prediction values of the RGB frames; and/or calculating a plurality of label prediction values for each OF frame in a given OF segment using the temporal portion of the deep learning network, and calculating the plurality of label prediction values for the given OF segment based on the label prediction values of the OF frames.
The RGB portion and each RGB segment comprise a plurality of frames, i.e. "RGB frames". The OF portion and each OF segment comprise a plurality of frames, i.e. "OF frames". A frame is an image or picture of the video; the label prediction value of a frame thus considers that picture of the video to predict one or more labels associated with an activity.
In another implementation form of the first aspect, in order to fuse the spatial label prediction values and the temporal label prediction values, the apparatus is further configured to: calculating the sum of the normalized label prediction values of the same label from a determined number of the plurality of spatial label prediction values and a determined number of the plurality of temporal label prediction values; and selecting the normalized label prediction value with the highest score as the label.
In this way, the tags can be predicted more accurately and efficiently.
In another implementation form of the first aspect, the apparatus is further configured to: calculating the sum of the normalized scaled occurrence frequencies of all spatial label prediction values and temporal label prediction values of the same label as the sum of the normalized label prediction values of that label.
"Frequency of occurrence" refers to how often a spatial or temporal label (candidate) is predicted, i.e., the label prediction frequency.
In another implementation form of the first aspect, the apparatus is further configured to: a tag is obtained for each of a plurality of videos in a dataset, and a precision of the dataset is calculated based on the obtained tags.
Thus, the accuracy of action recognition can be further improved.
In a further embodiment of the first aspect, the deep learning network is a Temporal Segment Network (TSN) BN-Inception type network, enhanced by skip connections from a residual network (ResNet).
In this way, the device can efficiently and accurately acquire the tag based on deep learning. Skip connections allow layers of the deep learning network to be bypassed.
In another implementation form of the first aspect, the spatial portion and/or the temporal portion of the deep learning network comprises a plurality of connected input layers, a plurality of connected output layers, and a plurality of skip connections, each skip connection connecting one input layer to one output layer.
In another implementation form of the first aspect, the apparatus is further configured to, during a training and testing phase: receiving a training/testing video, and outputting a result comprising a ranked list of predicted labels based on the training/testing video, each predicted label associated with a confidence value.
Thus, it is possible to accurately train a deep learning network and improve the results obtained in the inference phase of the device.
In another implementation form of the first aspect, the result further comprises the calculated loss.
In another implementation form of the first aspect, the apparatus is further configured to: interrupting the training phase if a loss of a predetermined value is calculated.
In another implementation form of the first aspect, the apparatus is further configured to: when the training phase is finished, acquiring a pre-trained network model of the deep learning network.
A second aspect of the invention provides a method for identifying one or more activities in a video, each activity being associated with a predetermined label, wherein the method uses a deep learning network and comprises, in an inference phase: receiving the video; dividing the video into an RGB portion and an OF portion; calculating a plurality of spatial label prediction values based on the RGB portion using a spatial portion of the deep learning network; calculating a plurality of temporal label prediction values based on the OF portion using a temporal portion of the deep learning network; and fusing the spatial label prediction values and the temporal label prediction values to obtain a label associated with an activity in the video.
In one embodiment of the second aspect, the method further comprises: extracting a plurality of RGB segments and a plurality of OF segments from the video so as to divide the video into the RGB portion and the OF portion; computing a plurality of label prediction values for each of the RGB segments using the spatial portion of the deep learning network; calculating a plurality of label prediction values for each of the OF segments using the temporal portion of the deep learning network; computing the plurality of spatial label prediction values based on the label prediction values of the RGB segments; and calculating the plurality of temporal label prediction values based on the label prediction values of the OF segments.
In another implementation form of the second aspect, the method further comprises: computing a plurality of label prediction values for each RGB frame in a given RGB segment using the spatial portion of the deep learning network, and computing the plurality of label prediction values for the given RGB segment based on the label prediction values of the RGB frames; and/or calculating a plurality of label prediction values for each OF frame in a given OF segment using the temporal portion of the deep learning network, and calculating the plurality of label prediction values for the given OF segment based on the label prediction values of the OF frames.
In a further implementation form of the second aspect, the method further comprises, in order to fuse the spatial label prediction values and the temporal label prediction values: calculating the sum of the normalized label prediction values of the same label from a determined number of the plurality of spatial label prediction values and a determined number of the plurality of temporal label prediction values; and selecting the normalized label prediction value with the highest score as the label.
In another implementation form of the second aspect, the method further comprises: calculating the sum of the normalized scaled occurrence frequencies of all spatial label prediction values and temporal label prediction values of the same label as the sum of the normalized label prediction values of that label.
In another implementation form of the second aspect, the method further comprises: a tag is obtained for each of a plurality of videos in a dataset, and a precision of the dataset is calculated based on the obtained tags.
In a further embodiment of the second aspect, the deep learning network is a TSN BN-Inception type network, enhanced by skip connections from a residual network.
In a further embodiment of the second aspect, the spatial portion and/or the temporal portion of the deep learning network comprise a plurality of connected input layers, a plurality of connected output layers, and a plurality of skip connections, each skip connection connecting one input layer to one output layer.
In another implementation form of the second aspect, the method further comprises, in a training/testing phase: receiving a training/testing video, and outputting a result comprising a ranked list of predicted labels based on the training/testing video, each predicted label associated with a confidence value.
In a further embodiment of the second aspect, the result further comprises a calculated loss.
In another implementation form of the second aspect, the method further comprises: interrupting the training phase if a loss of a predetermined value is calculated.
In another implementation form of the second aspect, the method further comprises: and when the training stage is finished, acquiring a pre-training network model of the deep learning network.
The method of the second aspect and its implementation forms achieve all the advantages and effects described above in connection with the device of the first aspect and its corresponding implementation forms.
A third aspect of the invention provides a computer program product comprising program code which, when executed by one or more processors of an apparatus, is operable to control the apparatus to perform the method of the second aspect or any implementation form thereof.
The advantages of the method of the second aspect and its embodiments may thus be achieved by executing the program code.
In general, in the above aspects and embodiments, the fusion of the spatial label prediction values and the temporal label prediction values (the late-fusion function) can be performed while considering the prediction values of the RGB part and the OF part simultaneously; specifically, the output prediction values of the two data streams are obtained separately and then fused. The fusion function may be based on the top k prediction values of each stream (k ≧ 1): for example, for each RGB frame of the input video, the top k prediction values of the network output can be found. All prediction values can then be grouped based on one information source (all RGB frames), and the top k prediction values can again be selected (based on majority vote or frequency of occurrence). In order to output a prediction value for the input video based on the RGB part only, the first (most likely) ranked prediction value may be taken as the correct prediction value and may be compared to its label (ground-truth value). For the OF part, the same process can be repeated to obtain a prediction value based on that part only. To fuse the RGB and OF parts, the process for RGB and OF can be repeated, but this time taking into account the top m (m ≧ 1, preferably m ≧ k) prediction values. Then, the union (sum) of the normalized prediction values of the same label can be found from the two parts, and the label with the most votes can be selected. It may be noted that this fusion heuristic does not actually depend on the type of input data.
The major improvements (as defined in the aspects and embodiments) provided by embodiments of the present invention improve accuracy and efficiency over conventional video action recognition methods. The improvement in accuracy is reflected on three different test datasets, and the training speed is also slightly improved.
It has to be noted that all devices, elements, units and means described in the present application can be implemented in software or hardware elements or any kind of combination thereof. All steps performed by the various entities described in the present application and the functions described to be performed by the various entities are intended to indicate that the respective entities are adapted or arranged to perform the respective steps and functions. Although in the following description of specific embodiments specific functions or steps performed by an external entity are not reflected in the description of specific elements of the entity performing the specific steps or functions, it should be clear to a skilled person that these methods and functions may be implemented in respective hardware or software elements or any combination thereof.
Drawings
The foregoing aspects and many of the attendant aspects of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:
fig. 1 shows an apparatus according to an embodiment of the invention.
Fig. 2 shows the inference phase of the device according to an embodiment of the invention.
Fig. 3 shows a training phase of the apparatus according to an embodiment of the invention.
Fig. 4 illustrates an exemplary inference phase flow implemented by an apparatus according to an embodiment of the invention.
Fig. 5 illustrates a basic block example of a deep learning network using a hopping connection of a device according to an embodiment of the present invention.
FIG. 6 shows an example of a portion of a deep learning network of devices according to an embodiment of the invention.
FIG. 7 illustrates a method according to an embodiment of the invention.
Detailed Description
Fig. 1 shows an apparatus 100 according to an embodiment of the invention. The device 100 is used to identify one or more activities in an (input) video 101. The apparatus 100 may be implemented in a video surveillance system and/or may receive the video 101 from a camera, in particular a camera of a video surveillance system. However, the device 100 is capable of performing action recognition on any type of input video, regardless of the source of the input video. Each activity is associated with a predetermined tag 104. The device 100 may be aware of some predetermined tags 104 and/or the device 100 may learn or be trained on the predetermined tags 104. The apparatus 100 is particularly configured to use a deep learning network 102 and can accordingly operate in an inference phase and a training phase. The deep learning network may be implemented by at least one processor or processing circuit of the device 100.
As shown in fig. 1, during the inference phase, the device 100 is configured to receive the video 101 (e.g., from a camera or from a video post-processing device that outputs post-processed video), in which video activity is to be recognized by the device 100. To this end, the apparatus 100 is arranged to first divide the video 101 into an RGB part 101a and an OF part 101b. The RGB part 101a represents spatial features (e.g., color, contrast, shape, etc.) in the video 101, and the OF part 101b represents temporal features (i.e., motion features) in the video 101.
Further, the device 100 is configured to use a deep learning network 102, the deep learning network 102 comprising a spatial portion 102a and a temporal portion 102b. The deep learning network 102 may be implemented in the device 100 by software. The spatial portion 102a is used to compute a plurality of spatial label prediction values 103a based on the RGB portion 101a of the video 101, while the temporal portion 102b is used to compute a plurality of temporal label prediction values 103b based on the OF portion 101b of the video 101.
The apparatus 100 is then used to fuse the spatial tag prediction value 103a and the temporal tag prediction value 103b in order to obtain a (fused) tag 104 associated with the activity in the video 101. This fusion is also called late fusion because it operates on the tag predictors, i.e., on the preliminary results. The tag 104 classifies the activity in the video 101, i.e., the activity in the video 101 has been identified.
Fig. 2 and 3 show two more detailed block diagrams of the device 100, respectively, which give more insight into some dependencies and functions between certain components of the device 100. In particular, fig. 2 shows an apparatus 100 according to an embodiment of the invention, which is based on the apparatus 100 shown in fig. 1 and which operates in an inference phase. Fig. 3 also shows an apparatus 100 according to an embodiment of the invention, which is based on the apparatus 100 shown in fig. 1 but which operates in a training phase. The apparatus 100 of fig. 2 and 3 may be identical. Since the device 100 is based on a deep learning network, the training and reasoning (also called testing) phases (also called epochs) can be distinguished.
Fig. 2 shows a block diagram of the testing/inference phase of the device 100. It can be seen that the apparatus 100 is arranged to extract a plurality of RGB segments 200a and a plurality of OF segments 200b from the video 101, respectively, in order to divide the video into an RGB part 101a and an OF part 101b. The RGB segments 200a and OF segments 200b then propagate through the corresponding portions of the deep learning network 102 (i.e., through the spatial portion 102a and the temporal portion 102b, respectively). Thus, the spatial portion 102a calculates a plurality of label prediction values 201a for each of the RGB segments 200a, and the temporal portion 102b calculates a plurality of label prediction values 201b for each of the OF segments 200b.
Then, a spatio-temporal consensus prediction is obtained, wherein the apparatus 100 is configured to calculate a plurality of spatial label prediction values 103a based on the label prediction values 201a of the RGB segments 200a, and a plurality of temporal label prediction values 103b based on the label prediction values 201b of the OF segments 200b.
The spatial tag prediction value 103a and the temporal tag prediction value 103b are then fused (late fusion) to obtain at least one tag 104 associated with at least one activity in the video 101, i.e. to make a final prediction of the activity. The tag 104 or tags 104 may be provided to a watch-list ordering block 202. Multiple prediction values for multiple available videos (from the data set) can be processed to obtain a final precision (over the entire data set). In other words, the apparatus 100 may obtain at least one tag 104 for each of the plurality of videos 101 in the dataset and may calculate the accuracy of the dataset based on the obtained tags 104.
Fig. 3 shows a block diagram of the training phase of the apparatus 100. After each training iteration, the validation accuracy and the corresponding confidence value scores may be calculated using a validation output 301, where the validation output 301 comprises a sorted list of (current) predicted tags 104 based on input video frames (images) of the training/test video 300. In other words, the apparatus 100 may output a result 301, where the result 301 comprises a sorted list of predicted tags 104 based on the training/test video 300, each predicted tag 104 associated with a confidence value. Furthermore, the loss may be calculated, and the entire training process may be repeated until the process completes the last training iteration or a certain (predefined) loss value is reached. In other words, the result 301 output by the apparatus 100 during the training phase may also include the calculated loss, and the apparatus 100 may be configured to interrupt the training phase when a predetermined value of loss is calculated.
Finally, the apparatus 100 obtains a pre-trained network model (i.e., at least one network graph and trained network weights) of the deep learning network 102 at the end of the training phase, which the apparatus 100 may use to identify activity in the video 101 during a testing (reasoning) phase (e.g., as shown in fig. 2).
The deep learning network 102 used by the device 100 provided by an embodiment of the present invention may be a TSN BN-Inception type network, enhanced by skip connections from ResNet. In particular, the deep learning network 102 may be a modification and/or a combination of different building blocks; specifically, it may be based on a combination of TSN and BN-Inception type networks with skip connections as proposed in ResNet. The building blocks are described first, followed by a description of the combined deep learning network 102.
A TSN may be selected as a building block of the deep learning network 102. TSNs are typically able to model dynamics throughout the whole video. To this end, the TSN may consist of a spatial-stream ConvNet and a temporal-stream ConvNet. Fig. 4 shows a general method of performing action recognition in a video by the device 100 using such a TSN; in particular, the inference phase of such a device 100 is shown. In the example of fig. 4, the input video is divided 400 into a plurality of segments (also referred to as slices or chunks) and a short snippet is then extracted 401 from each segment, wherein a snippet comprises more than one frame, i.e. comprises a plurality of frames. This means that the TSN operates on a sequence of short snippets that are sparsely sampled (in the temporal and/or spatial domain, e.g., depending on the video size) from the entire video, rather than processing a single frame or a single segment (also referred to as a frame stack). Each snippet in the sequence may generate its own preliminary prediction of the action category (category score 402). The category scores 402 of the different snippets may be fused 403 by a segmental consensus function to produce a segment consensus, which is a video-level prediction. The predictions from all modalities are then fused 404 to produce the final prediction. The ConvNets on all snippets can share parameters.
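A minimal Python sketch of this sparse sampling strategy is given below; the number of segments, the snippet length, and the use of frame indices are illustrative assumptions.

import random

def sample_snippets(num_frames, num_segments=3, snippet_len=5):
    # Split the frame indices into num_segments equal segments and draw one
    # short snippet (a small window of consecutive frames) from each segment.
    seg_len = num_frames // num_segments
    snippets = []
    for k in range(num_segments):
        start, end = k * seg_len, (k + 1) * seg_len
        s = random.randint(start, max(start, end - snippet_len))
        snippets.append(list(range(s, min(s + snippet_len, num_frames))))
    return snippets

# e.g. sample_snippets(300) returns three lists of frame indices, one per segment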
In the training/learning phase, the loss values of the video-level prediction, rather than of the snippet-level predictions, can be optimized by iteratively updating the model parameters. Formally, a given video V may be divided into K segments of equal duration {S1, S2, ..., SK}. Then, the TSN may model the sequence of snippets as follows:
TSN(T1, T2, ..., TK) = M(G(F(T1; W), F(T2; W), ..., F(TK; W)))
here, (T1, T2, … …, TK) is the fragment sequence. Each segment Tk may be randomly sampled from the corresponding segment Sk, where K is an integer index ranging from 1 to K. F (Tk; W) may define a function representing ConvNet with parameter W that operates on the short segments Tk and produces category scores for all categories. The piecewise consensus function G combines the outputs of multiple short segments to obtain a class hypothesis consensus between them. Based on this consensus, the probability of each action category in the entire video is predicted by a prediction function M (Softmax function). In combination with standard classification cross-entropy losses, the final loss function for segment consensus may be:
L(y, G) = -Σ_{i=1..C} y_i (G_i - log Σ_{j=1..C} exp(G_j))
here, C is the number of action classes, yi is the ground truth label for class i. The category score Gi here is inferred from the scores of the same category on all segments using an aggregation function.
Batch Normalization Inception (BN-Inception) may be selected as another building block of the deep learning network 102. That is, the deep learning network 102 may specifically be, or may include, a BN-Inception type network. This particular choice of a BN-Inception type network is possible because it offers a good balance between accuracy and efficiency.
The BN-Inception architecture may be designed specifically for a dual-stream ConvNet as the first building block. The spatial-stream ConvNet may operate on a single RGB image, and the temporal-stream ConvNet may have as input a stack of consecutive OF fields. A dual-stream ConvNet may use RGB images for the spatial stream and stacked OF fields for the temporal stream. A single RGB image typically encodes a static appearance at a particular point in time and lacks contextual information about the previous and next frames. The temporal-stream ConvNet takes the OF field as input, aiming at capturing motion information. However, in real video there is usually camera motion, and the OF field may not be concentrated on the human motion.
The ResNet framework may be selected as yet another building block of the deep learning network 102. Although deep networks have better classification performance in most cases, they are more difficult to train than ResNet, mainly for two reasons:
1. Vanishing/exploding gradients: sometimes neurons die during training and, depending on their activation function, may never become active again. This problem can be addressed by using suitable initialization techniques.
2. Difficulty of optimization: the more parameters the model introduces, the more difficult the network is to train.
The main difference of ResNets is that they have shortcut connections parallel to their normal convolutional layers. This allows for faster training and also provides a clear path for the gradient to flow back to the early layers of the network. This makes the learning process faster by avoiding vanishing gradients or dying neurons.
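For illustration, the following PyTorch sketch shows a generic residual block with such a shortcut connection; it is not the exact block structure of the network described below, and the channel count is an assumption.

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Generic residual block: the shortcut runs parallel to the convolutional
    # path and is added back before the final ReLU.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                       # shortcut (skip) connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity               # add layer: gradient flows through both paths
        return self.relu(out)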
A ResNet model designed specifically for the deep learning network 102 of the device 100 provided by embodiments of the present invention can accept images and classify them. A simple method is to upsample only the image and then provide it to the trained model, or to skip only the first layer and insert the original image as input for the second convolutional layer, and then fine-tune the last few layers to achieve higher accuracy.
In summary, as described above, the deep learning network 102 of the device 100 provided by the embodiments of the present invention may be based on a TSN BN-Inception network with skip connections (as proposed in ResNet). To obtain such a network 102, the following two-step approach may be used:
1. When stacking more layers on a very deep network model, skip connections and deep residual layers can be used to enable the network to learn deviations from the identity mapping.
2. The network can be simplified by reducing the number of layers and bringing them close to the layers that can better distinguish features and improve accuracy.
One embodiment of the deep learning network includes, for example, a total of 202 layers. Residual connections are also part of the network. A Rectified Linear Unit (ReLU) layer is connected to the convolutional layer of each earlier sub-unit, and the convolutional layer of this unit is connected to the output of the 8th unit located in the middle of the large network. An add layer connects the input to the batch normalization layer and leads to the ReLU layer after the addition. The RGB and OF streams, and the following parts of the whole network, are modified as listed below (a code sketch of this pattern follows the layer definitions):
1. An additional add layer is placed between the convolutional 'inception_3a_1x1' layer and the BN 'inception_3a_1x1_bn' layer; it connects the input directly to this add layer.
2. The output of batch_normalization('inception_3b_1x1_bn') is taken as an input to an add layer placed between the convolutional 'inception_3c_3x3' layer and the BN 'inception_3c_3x3_bn' layer.
3. The output of batch_normalization('inception_3a_double_3x3_2_bn') is taken as an input to an add layer placed between the convolutional 'inception_3b_double_3x3_2' layer and the BN 'inception_3b_double_3x3_2_bn' layer.
4. The output of batch_normalization('inception_3c_3x3_bn') is taken as an input to an add layer placed between the convolutional 'inception_4a_3x3' layer and the BN 'inception_4a_3x3_bn' layer.
5. The output of batch_normalization('inception_4a_3x3_bn') is taken as an input to an add layer placed between the convolutional 'inception_4b_3x3' layer and the BN 'inception_4b_3x3_bn' layer.
6. The output of batch_normalization('inception_4c_pool_proj_bn') is taken as an input to an add layer placed between the convolutional 'inception_4e_3x3' layer and the BN 'inception_4e_3x3_bn' layer.
7. The output of batch_normalization('inception_4e_double_3x3_2_bn') is taken as an input to an add layer placed between the convolutional 'inception_5a_3x3' layer and the BN 'inception_5a_3x3_bn' layer.
8. The output of batch_normalization('inception_5a_1x1_bn') is taken as an input to an add layer placed between the convolutional 'inception_5a_double_3x3_2' layer and the BN 'inception_5a_double_3x3_2_bn' layer.
9. The output of batch_normalization('inception_5b_3x3_bn') is taken as an input to an add layer placed between the convolutional 'inception_5b_pool_proj' layer and the BN 'inception_5b_pool_proj_bn' layer.
The layer names used above are defined as follows:
inception_3a_1x1 = Convolution2D(192, 64, kernel_size=(1,1), stride=(1,1))
inception_3a_1x1_bn = the above, followed by batch normalization
inception_3a_3x3 = Convolution2D(64, 64, kernel_size=3, stride=1, pad=(1,1))
inception_3a_5x5 = Convolution2D(32, 32, kernel_size=5, stride=1, pad=(1,1))
inception_3a_pool_proj = Convolution2D(192, 32, kernel_size=1, stride=(1,1))
inception_3a_double_3x3_2 = Convolution2D(96, 96, kernel_size=3, stride=1, pad=(1,1))
inception_3c_3x3 = Convolution2D(320, 128, kernel_size=(1,1), stride=(1,1))
inception_4a_3x3 = Convolution2D(576, 64, kernel_size=(1,1), stride=(1,1))
inception_4c_pool_proj = Convolution2D(576, 128, kernel_size=1, stride=(1,1))
inception_4e_double_3x3 = Convolution2D(608, 192, kernel_size=(1,1), stride=(1,1))
inception_5a_1x1 = Convolution2D(1056, 352, kernel_size=(1,1), stride=(1,1))
inception_5b_3x3 = Convolution2D(192, 320, kernel_size=3, stride=1, pad=(1,1))
If '_bn' is part of a name, batch normalization is applied afterwards, as in the inception_3a_1x1_bn example.
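The following PyTorch-style sketch illustrates the pattern of modification 2 above: the output of the earlier batch-normalization layer is fed into an add layer placed between a later convolutional layer and its batch-normalization layer. The module and attribute names are illustrative assumptions, and the shapes of the added tensors are assumed to match.

import torch.nn as nn

class Inception3cWithSkip(nn.Module):
    # Sketch of one modified unit: conv -> add(skip from an earlier BN output) -> BN -> ReLU
    def __init__(self):
        super().__init__()
        self.inception_3c_3x3 = nn.Conv2d(320, 128, kernel_size=1, stride=1)
        self.inception_3c_3x3_bn = nn.BatchNorm2d(128)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, bn_3b_1x1_out):
        out = self.inception_3c_3x3(x)
        out = out + bn_3b_1x1_out          # add layer fed by the earlier BN output
        out = self.inception_3c_3x3_bn(out)
        return self.relu(out)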
Due to space limitations, not the entire modified network is shown in this form. However, by adding simple shortcut connections, the accuracy of the inference part is improved and the training process is accelerated. The tradeoff is that the ResNet-style network is more prone to overfitting, which is undesirable. Experiments have shown that overfitting can be reduced by using a dropout layer and data augmentation.
Fig. 5 shows a basic block example of a network, an enlarged part of which is shown in fig. 6.
Details of data augmentation are described below. It is well known that the learning performance of a deep learning network depends on whether sufficiently large training data is available. Data augmentation is an efficient method of extending the training data by applying transformations to the labeled data, making the new samples additional training data. In this work, the following data augmentation techniques were used: random brightness, random flipping (left-to-right flipping), and a few random alignments. Adding these different versions of an image enables the network to model different characteristics related to these different representations. Therefore, training the deep network with augmented data can improve its generalization to unseen samples.
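A sketch of such an augmentation pipeline, using torchvision transforms, is shown below; "random alignment" is interpreted here as random cropping, and the crop size and brightness range are illustrative assumptions.

from torchvision import transforms

# Hedged sketch of the augmentation pipeline: random brightness, random
# left-to-right flipping, and random cropping (our reading of "random
# alignment"). Crop size and brightness range are assumptions.
train_augmentation = transforms.Compose([
    transforms.ColorJitter(brightness=0.3),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])
# augmented = train_augmentation(pil_frame)   # applied per extracted RGB frame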
Details of network training are described below. A cross-modality pre-training technique is used, in which the temporal network is initialized with an RGB model. First, the OF fields (OF segments) are discretized into the interval 0 to 255 by a linear transformation. This step makes the range of the OF fields the same as that of the RGB images (RGB segments). The weights of the first convolutional layer of the RGB model are then modified to handle the input of the OF fields. In particular, the weights are averaged over the RGB channels, and this average is replicated as many times as there are input channels of the temporal network.
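A sketch of this cross-modality initialization is shown below; the tensor shapes and the number of stacked OF input channels are illustrative assumptions.

import torch

def adapt_first_conv(rgb_weight, num_flow_channels=10):
    # rgb_weight: (out_channels, 3, kH, kW) from the pre-trained RGB model.
    # Average over the R, G, B channels and replicate the average once per
    # OF input channel (e.g. 5 stacked flow fields x 2 components = 10 channels).
    mean_w = rgb_weight.mean(dim=1, keepdim=True)
    return mean_w.repeat(1, num_flow_channels, 1, 1)

rgb_w = torch.randn(64, 3, 7, 7)       # e.g. first convolution of the RGB model
flow_w = adapt_first_conv(rgb_w)       # shape (64, 10, 7, 7)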
During the learning process, the mean and variance of the activations in each batch can be estimated by batch normalization and used to transform these activation values into a standard Gaussian distribution. This operation speeds up the convergence of training, but also leads to overfitting during transfer, due to biased estimation of the activation distribution from a limited number of training samples. Thus, after initialization with the pre-trained model, the mean and variance parameters of all batch normalization layers except the first layer are frozen. Since the distribution of OF differs from that of RGB images, the activation values of the first convolutional layer have a different distribution, thus requiring re-estimation of the mean and variance. An additional dropout layer is added after the global pooling layer in the BN-Inception architecture to further reduce the effect of overfitting. Data augmentation can generate diverse training samples to prevent severe overfitting. In the original dual-stream ConvNet, in addition to random alignment and brightness, random left-to-right flipping was used to augment the training samples. In addition, the size of the input images or optical flow fields is fixed, as are the width and height of all training and validation images. The same is done for the test images.
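A sketch of this "partial BN" freezing is shown below; freezing all batch-normalization layers except the first is the described behavior, while the PyTorch mechanics (eval mode, disabled gradients) are implementation assumptions.

import torch.nn as nn

def freeze_bn_except_first(model):
    # Keep the first BatchNorm layer trainable (its statistics are re-estimated
    # for the OF input distribution) and freeze the mean/variance updates and
    # affine parameters of all the others.
    bn_layers = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    for bn in bn_layers[1:]:
        bn.eval()                          # stop updating running statistics
        bn.weight.requires_grad = False
        bn.bias.requires_grad = False
    # Note: bn.eval() must be re-applied after every call to model.train().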
Details of the fusion part and of network testing are described below. Since all snippet-level ConvNets share model parameters in the temporal segment network, the learned model can perform frame-by-frame evaluation like a general ConvNet. This enables a fair comparison with models that are learned without the temporal segment network framework. Specifically, following the testing scheme of the original dual-stream ConvNet, 25 RGB frames or optical-flow stacks were extracted as samples from the action video. For the fusion of the spatial and temporal stream networks, their weighted average is used. This means that: for each RGB frame of the input video, the top 8 prediction values of the network output are found. All prediction values are then grouped based on all RGB frames, and the top 8 prediction values are again selected (based on majority vote or frequency of occurrence). In order to output a prediction value for the input video based on the RGB part only, the first (most likely) ranked prediction value is taken as the correct prediction value and compared to its label (ground-truth label). For the OF part, the same process is repeated and a prediction value based on that part only is obtained. To fuse the RGB and OF parts, the process for RGB and OF is repeated, but this time the top 13 prediction values are considered. Then, the union (sum) of the normalized prediction values of the same label is found from the two parts, and the label with the most votes is selected.
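The following Python sketch illustrates this late-fusion heuristic: the top-ranked labels of each stream are grouped over all frames, their occurrence frequencies are normalized, and the per-label sums of the two streams are compared. The value of m and the frequency-based scoring are assumptions consistent with the description above.

from collections import Counter

def fuse_streams(rgb_frame_topk, of_frame_topk, m=13):
    # Each argument is a list over frames; each entry is that frame's ranked
    # list of predicted labels. Frequencies are normalized per stream and the
    # per-label sums of both streams are compared.
    def normalized_counts(frame_topk):
        counts = Counter(label for frame in frame_topk for label in frame[:m])
        total = sum(counts.values())
        return {label: c / total for label, c in counts.items()}

    rgb_scores = normalized_counts(rgb_frame_topk)
    of_scores = normalized_counts(of_frame_topk)
    labels = set(rgb_scores) | set(of_scores)
    fused = {l: rgb_scores.get(l, 0.0) + of_scores.get(l, 0.0) for l in labels}
    return max(fused, key=fused.get)       # label with the most (normalized) votes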
Some variations of this fusion heuristic have been tested and applied. One approach is to use normalized Softmax predictions (normalized to the [0,1] interval) and to scale the normalized frequency of occurrence with the method described previously. In this way, outputs for the RGB and optical flow components can be derived, similar to those described above. Then, the sum of the normalized, weighted, scaled occurrence frequencies of all prediction values is found in the fusion process. Finally, the most likely output is taken and compared to the ground-truth label. Another variation of the fusion function is to apply an additional weight to the scaled joint prediction described above and/or to decide between the first- and second-ranked predictions based on a threshold on their difference: by observing the difference between the top-two prediction values of RGB and OF over many confidence pairs, the following rule is derived: if the difference exceeds some reliably large threshold, based on either RGB or OF (or both), the correct prediction is taken to be the first-ranked prediction value. This prediction value is then taken as the correct prediction.
It is also worth mentioning that the performance gap between the spatial-stream and temporal-stream ConvNets is much smaller than in the original dual-stream ConvNet. Based on this fact, the embodiment sets the weights of the spatial stream and the temporal stream to 1 and analyzes their outputs by classification, thereby giving equal credit to the spatial stream and the temporal stream. Therefore, a segmental consensus function is used before Softmax normalization. To obtain video-level predictions for testing, the prediction scores of all extracted RGB and optical-flow frames and of the different streams are fused before Softmax normalization.
FIG. 7 illustrates a method according to an embodiment of the invention. The method 700 is used to identify one or more activities in the video 101, each activity being associated with a predetermined tag 104. The method 700 uses the deep learning network 102 and may be performed by the device 100 shown in fig. 1 or fig. 2.
In the inference phase, the method 700 comprises: step 701, receiving the video 101; step 702, dividing the video 101 into an RGB part 101a and an OF part 101b; step 703, calculating a plurality of spatial label prediction values 103a based on the RGB part 101a using the spatial portion 102a of the deep learning network 102; step 704, calculating a plurality of temporal label prediction values 103b based on the OF part 101b using the temporal portion 102b of the deep learning network 102; and step 705, fusing the spatial label prediction values 103a and the temporal label prediction values 103b to obtain a label 104 associated with the activity in the video 101.
Embodiments of the invention may be implemented in hardware, software, or any combination thereof. Embodiments of the invention, such as apparatus and/or hardware implementations, may be implemented as any of a variety of suitable circuitry, such as one or more microprocessors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), discrete logic, hardware, etc., or any combinations thereof. Embodiments may include computer program products comprising program code for performing any of the methods described herein when the program code is implemented on a processor. Other embodiments may include at least one memory and at least one processor for storing and executing program code to perform any of the methods described herein. For example, an embodiment may comprise an apparatus for storing software instructions in a suitable non-transitory computer-readable storage medium and executing the instructions in hardware using one or more processors to perform any of the methods described herein.
The invention has been described in connection with various embodiments and implementations as examples. Other variations will be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the independent claims. In the claims and in the description, the term "comprising" does not exclude other elements or steps, and "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (14)

1. An apparatus (100) for identifying one or more activities in a video (101), each activity being associated with a predetermined label (104), the apparatus (100) being adapted to use a deep learning network (102) and to perform the following operations during an inference phase:
-receiving the video (101),
-dividing the video (101) into an RGB part (101a) and an Optical Flow (OF) part (101b),
-calculating a plurality of spatial label prediction values (103a) based on the RGB part (101a) using a spatial portion (102a) of the deep learning network (102),
-calculating a plurality of temporal label prediction values (103b) based on the OF part (101b) using a temporal portion (102b) of the deep learning network (102), and
-fusing the spatial label prediction values (103a) and the temporal label prediction values (103b) to obtain a label (104) associated with an activity in the video (101).
2. The apparatus (100) of claim 1, further configured to:
-extracting a plurality of RGB segments (200a) and a plurality of OF segments (200b) from the video (101) in order to divide the video into the RGB part (101a) and the OF part (101b),
-calculating a plurality of label prediction values (201a) for each of the RGB segments (200a) using the spatial portion (102a) of the deep learning network (102),
-calculating a plurality of label prediction values (201b) for each of the OF segments (200b) using the temporal portion (102b) of the deep learning network (102),
-calculating the plurality of spatial label prediction values (103a) based on the label prediction values (201a) of the RGB segments (200a), and
-calculating the plurality of temporal label prediction values (103b) based on the label prediction values (201b) of the OF segments (200b).
3. The apparatus (100) of claim 1 or 2, further configured to:
-calculating a plurality of label prediction values for each RGB frame in a given RGB segment (200a) using the spatial portion (102a) of the deep learning network (102), and calculating the plurality of label prediction values (201a) for the given RGB segment (200a) based on the label prediction values of the RGB frames, and/or
-calculating a plurality of label prediction values for each OF frame in a given OF segment (200b) using the temporal portion (102b) of the deep learning network (102), and calculating the plurality of label prediction values (201b) for the given OF segment (200b) based on the label prediction values of the OF frames.
4. The apparatus (100) according to one of claims 1 to 3, further configured to, in order to fuse the spatial label prediction values (103a) and the temporal label prediction values (103b):
-calculating a sum of the normalized label prediction values of the same label based on the determined number of said plurality of spatial label prediction values (103a) and the determined number of said plurality of temporal label prediction values (103b), and
-selecting as the label (104) the normalized label prediction value with the highest score.
5. The apparatus (100) of claim 4, further configured to:
-calculating a sum of the normalized scaled occurrence frequencies of all spatial label prediction values (103a) and temporal label prediction values (103b) of said same label as the sum of said normalized label prediction values of said same label.
6. The apparatus (100) of one of claims 1 to 5, further configured to:
-obtaining a label (104) for each of a plurality of videos (101) in a data set, and
-calculating an accuracy of the data set based on the obtained labels (104).
7. The apparatus (100) according to one of claims 1 to 5, characterized in that:
the deep learning network (102) is a TSN-bn-initiation type network, enhanced by a hopping connection from a residual network.
8. The apparatus (100) according to one of claims 1 to 6, characterized in that:
the spatial portion (102a) and/or the temporal portion (102b) of the deep learning network (102) comprises a plurality of connected input layers, a plurality of connected output layers, and a plurality of hopping connections, each hopping connection connecting one input layer to one output layer.
9. The apparatus (100) according to one of claims 1 to 8, further being configured to perform the following operations during a training/testing phase:
-receiving a training or test video (300), and
-outputting a result (301) comprising a sorted list of predicted labels (104) based on the training/testing video (300), each predicted label (104) being associated with a confidence value score.
10. The apparatus (100) of claim 9, wherein:
the result (301) also includes the calculated loss.
11. The apparatus (100) according to claim 9 or 10, for:
-interrupting the training phase if a loss of a predetermined value is calculated.
12. The apparatus (100) according to one of claims 9 to 11, configured to:
-at the end of the training phase, obtaining a pre-trained network model of the deep learning network (102).
13. A method (700) for identifying one or more activities in a video (101), each activity being associated with a predetermined label (104), the method (700) using a deep learning network (102) and comprising, in an inference phase:
-receiving (701) the video (101),
-dividing (702) the video (101) into an RGB part (101a) and an Optical Flow (OF) part (101b),
-calculating a plurality of spatial label prediction values (103a) based on the RGB part (101a) using (703) a spatial portion (102a) of the deep learning network (102),
-calculating a plurality of temporal label prediction values (103b) based on the OF part (101b) using (704) a temporal portion (102b) of the deep learning network (102), and
-fusing (705) the spatial label prediction values (103a) and the temporal label prediction values (103b) to obtain a label (104) associated with an activity in the video (101).
14. A computer program product comprising program code, which, when executed by one or more processors of an apparatus, is configured to control the apparatus to perform the method (700) according to claim 13.
CN201880098842.1A 2018-10-31 2018-10-31 Apparatus and method for identifying video activity Pending CN112912888A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2018/079890 WO2020088763A1 (en) 2018-10-31 2018-10-31 Device and method for recognizing activity in videos

Publications (1)

Publication Number Publication Date
CN112912888A true CN112912888A (en) 2021-06-04

Family

ID=64109862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880098842.1A Pending CN112912888A (en) 2018-10-31 2018-10-31 Apparatus and method for identifying video activity

Country Status (2)

Country Link
CN (1) CN112912888A (en)
WO (1) WO2020088763A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639563B (en) * 2020-05-18 2023-07-18 浙江工商大学 Basketball video event and target online detection method based on multitasking
CN111738171B (en) * 2020-06-24 2023-12-08 北京奇艺世纪科技有限公司 Video clip detection method and device, electronic equipment and storage medium
CN111709410B (en) * 2020-08-20 2020-12-01 深兰人工智能芯片研究院(江苏)有限公司 Behavior identification method for strong dynamic video
CN112668438A (en) * 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Infrared video time sequence behavior positioning method, device, equipment and storage medium
CN113095128B (en) * 2021-03-01 2023-09-19 西安电子科技大学 Semi-supervised time sequence behavior positioning method based on K furthest cross consistency regularization
CN113139479B (en) * 2021-04-28 2022-07-29 山东大学 Micro-expression recognition method and system based on optical flow and RGB modal contrast learning
CN113255489B (en) * 2021-05-13 2024-04-16 东南大学 Multi-mode diving event intelligent evaluation method based on mark distribution learning

Also Published As

Publication number Publication date
WO2020088763A1 (en) 2020-05-07

Similar Documents

Publication Publication Date Title
US10366313B2 (en) Activation layers for deep learning networks
US10579860B2 (en) Learning model for salient facial region detection
CN112912888A (en) Apparatus and method for identifying video activity
US10726244B2 (en) Method and apparatus detecting a target
Elharrouss et al. Gait recognition for person re-identification
Bianco et al. Combination of video change detection algorithms by genetic programming
US20200012923A1 (en) Computer device for training a deep neural network
CA3077517A1 (en) Method and system for classifying an object-of-interest using an artificial neural network
WO2022037541A1 (en) Image processing model training method and apparatus, device, and storage medium
Sadgrove et al. Real-time object detection in agricultural/remote environments using the multiple-expert colour feature extreme learning machine (MEC-ELM)
CN108960412B (en) Image recognition method, device and computer readable storage medium
JP5936561B2 (en) Object classification based on appearance and context in images
Xu et al. Video salient object detection using dual-stream spatiotemporal attention
CN111783665A (en) Action recognition method and device, storage medium and electronic equipment
Wang et al. Skip-connection convolutional neural network for still image crowd counting
Krithika et al. MAFONN-EP: A minimal angular feature oriented neural network based emotion prediction system in image processing
Riche et al. Bottom-up saliency models for still images: A practical review
Luna et al. People re-identification using depth and intensity information from an overhead camera
WO2020192868A1 (en) Event detection
Gowda Age estimation by LS-SVM regression on facial images
Yanakova et al. Facial recognition technology on ELcore semantic processors for smart cameras
AU2013263838A1 (en) Method, apparatus and system for classifying visual elements
Ghimire et al. Online sequential extreme learning machine-based co-training for dynamic moving cast shadow detection
Ranganatha et al. SELECTED SINGLE FACE TRACKING IN TECHNICALLY CHALLENGING DIFFERENT BACKGROUND VIDEO SEQUENCES USING COMBINED FEATURES.
Campilho et al. Image Analysis and Recognition: 17th International Conference, ICIAR 2020, Póvoa de Varzim, Portugal, June 24–26, 2020, Proceedings, Part I

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination