CN110998594A - Method and system for detecting motion - Google Patents

Method and system for detecting motion

Info

Publication number
CN110998594A
Authority
CN
China
Prior art keywords
video
sequence
cropped
images
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201880048903.3A
Other languages
Chinese (zh)
Other versions
CN110998594B (en)
Inventor
M. Jones
T. Marks
K. Kulkarni
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mitsubishi Electric Corp
Original Assignee
Mitsubishi Electric Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corp filed Critical Mitsubishi Electric Corp
Publication of CN110998594A publication Critical patent/CN110998594A/en
Application granted granted Critical
Publication of CN110998594B publication Critical patent/CN110998594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G06F 18/2111 Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G06F 18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20021 Dividing image into blocks, subimages or windows

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Physiology (AREA)
  • Image Analysis (AREA)

Abstract

A method and system for detecting motion of an object in a scene from a video of the scene. The video is a video sequence that is divided into chunks, and each chunk comprises consecutive video frames. The method includes the following elements. A video of a scene is acquired, wherein the video comprises a sequence of images. Objects are tracked in the video, and for each object and each chunk of the video, the method further comprises: determining a sequence of contour images from video frames of the video sequence to represent motion data within a bounding box located around the object; using the bounding box to generate a cropped contour image and a cropped image for one or more images in the respective chunk; and passing the cropped contour image and the cropped image to a recurrent neural network (RNN), which outputs relative scores for each action of interest.

Description

Method and system for detecting motion
Technical Field
The present disclosure relates generally to computer vision and camera surveillance applications, and more particularly, to detecting instances of objects (e.g., people) in a video that perform a particular action of interest by using a sequence of contour images computed from frames of the video sequence to represent motion information.
Background
In computer vision and camera surveillance applications, a common problem is identifying and detecting specific actions performed by objects such as people, machinery, vehicles, robots, etc. Much work has been done on the general problem of analyzing motion in video, but most prior art work has focused on motion recognition, rather than motion detection.
Motion recognition refers to classifying (i.e., recognizing) which motion is being performed in a video segment that has been temporally cropped such that the segment begins at or near the beginning of the motion and ends at or near the end of the motion. We use the term temporally cropped to denote such video segments. Motion detection refers to the temporal or spatio-temporal localization of each occurrence of each motion from a set of known motion classes occurring in a long (i.e., not temporally cropped) video sequence.
A task related to motion recognition is activity recognition. In an activity recognition task, a video segment depicting an activity (e.g., a particular sport being played) is analyzed, and the goal is to determine which activity (e.g., which sport) is depicted in the video.
Fine-grained motion detection refers to motion detection in which the differences between the motion classes to be detected are small. For example, in a cooking scenario, detecting motions from a set of similar motions such as shredding, grinding, and peeling is an example of fine-grained motion detection. However, at least one drawback of prior art methods for motion detection is their relatively low accuracy. That is, prior art motion detection methods do not perform well enough for most computer vision and other applications.
The standard pipeline for most video analysis tasks, such as motion recognition, event detection, and video retrieval, is the computation of hand-crafted features, such as Histogram of Oriented Gradients (HOG), Motion Boundary Histogram (MBH), and Histogram of Optical Flow (HOF). Traditional methods rely on computationally expensive input representations (e.g., improved dense trajectories or dense optical flow), create Fisher vectors for individual video clips, and then perform classification using a support vector machine. However, a major drawback of these previous motion detection/recognition methods, among many, is that they rely on input and intermediate representations that are computationally very time consuming and require a large amount of memory to store. This makes these traditional methods infeasible for real-world motion detection applications.
Therefore, there is a need to develop motion detection methods that can efficiently detect motion in video in terms of time and memory requirements.
Disclosure of Invention
The present disclosure relates generally to computer vision and camera surveillance applications, and more particularly, to detecting instances of objects (e.g., people) in a video that perform a particular action of interest by using sequence representative motion information of contour images computed from frames of the video sequence.
The present disclosure provides methods and systems that overcome the problems of video analytics tasks, such as action recognition, event detection, and video retrieval, which rely on input representations and intermediate representations that are computationally very time consuming and also require a large amount of memory to store. In particular, the present disclosure describes motion detection methods and systems that can efficiently detect motion in video in terms of minimizing time-consuming computations and reducing memory storage/requirements.
In particular, the present disclosure is based on the recognition that using a sequence of contour images computed from frames of a video sequence to represent motion information can provide a fast and memory-efficient detector for actions and the like in a video. For example, the present disclosure addresses motion detection in videos by locating the occurrence of a particular motion in time (in which frames of the video it occurs) and space (where it occurs within the individual frames). We recognized through experimentation that we can detect motion in a video using a deep neural network with recurrent connections that takes as input a sequence of cropped images around an object (e.g., a person) and contour images representing motion within the cropped region across multiple frames. We found that previous methods, which use optical flow-based representations, are computationally expensive, i.e., they require time-consuming computations and large amounts of memory and storage. This makes these previous motion detection methods infeasible for real-world applications.
The present disclosure also includes an object/person tracker that can spatially locate where within a video frame an action occurs. We have found through experimentation that conventional methods of analyzing motion and appearance over an entire frame, rather than using trackers, use a large amount of information that is not relevant to the motion of interest. In addition, these methods do not have sufficiently detailed information from the areas that are most important to the task.
In addition, the present disclosure uses a multi-stream recurrent neural network (RNN) that learns features representing two important aspects, motion and appearance, and learns the important temporal dynamics over many video frames that distinguish different actions. For example, these methods and systems may be used to detect motion of objects in a video, where the objects may be humans, animals, machinery, vehicles, robots, industrial robots in a factory environment, and the like. The present disclosure provides more accurate motion detection for motions of objects occurring in video that is not temporally cropped.
Another aspect of the present disclosure includes using a Long Short Term Memory (LSTM) network included as one or more layers of RNNs that can learn patterns having a longer duration than can be learned using conventional RNNs. The present disclosure may provide better performance using bi-directional LSTM, which means that the present disclosure may use information from past and future video frames to detect motion.
To facilitate a further understanding of the present disclosure, we provide at least one method step comprising: motion of an object in a scene is detected from a video of the scene, where the video may be captured by a video device and the video itself may be a video sequence that is segmented into chunks, such that each chunk may comprise a succession of video frames.
For example, the method of the present disclosure may include the steps of: a video of a scene is acquired, wherein the video comprises a sequence of images. The video may be downloaded into the memory by the processor, wherein the processor accesses the memory to retrieve the video. A next step may include tracking objects in the video, and for each object and each chunk of the video, the method may further include the steps of: a sequence of contour images is determined from video frames of a video sequence to represent motion data within a bounding box located around an object. A next step may use the bounding box to generate a cropped contour image and a cropped image for one or more images in the respective chunks. Finally, a final step may pass the cropped contour image and the cropped image to a Recurrent Neural Network (RNN), which outputs a relative score for each action of interest.
It is envisaged that the output interface may be connected to a processor, wherein some or all of the data relating to the act of detecting an object in a scene from a video of the scene may be output.
According to an embodiment of the present disclosure, a method is provided for detecting motion of an object in a scene from a video of the scene, wherein the video may be a video sequence segmented into chunks and each chunk comprises consecutive video frames. The method comprises the following steps. A video of the scene is acquired, wherein the video comprises a sequence of images. Objects are tracked in the video, and for each object and each chunk of the video, the method further comprises the steps of: determining a sequence of contour images from video frames of the video sequence to represent motion data within a bounding box located around the object; using the bounding box to generate a cropped contour image and a cropped image for one or more images in the respective chunk; and passing the cropped contour image and the cropped image to a recurrent neural network (RNN), which outputs relative scores for each action of interest.
According to an embodiment of the present disclosure, a system detects a motion of interest of an object in a scene from a video of the scene, wherein the video is a video sequence of the scene divided into chunks and each chunk comprises consecutive video frames. The system includes a processor that acquires a video of the scene such that the video includes a sequence of images, wherein the processor is configured to track objects in the video and, for each object and each chunk of the video, to perform the following steps. A sequence of contour images is determined from video frames of the video sequence to represent motion information within a bounding box located around the object. The bounding box is used to generate a cropped contour image and a cropped image for one or more images in the respective chunk. The cropped contour image and the cropped image are passed to a recurrent neural network (RNN), which outputs relative scores for each action of interest.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium has embodied thereon a program executable by a computer to perform a method of detecting an action of interest of an object in a scene from a video of the scene. Where the video may be a video sequence of a scene that is partitioned into chunks, such that each chunk comprises a succession of video frames. The method includes acquiring, by a processor, a video of a scene, wherein the video may include a sequence of images. Tracking, by a processor, objects in the video, and for each object and each chunk of the video, the processor is configured to: determining a sequence of contour images from video frames of the video sequence within a bounding box located around the object; generating a cropped contour image and a cropped image for one or more images in the respective chunks using the bounding box; and passing the cropped contour image and the cropped image to a Recurrent Neural Network (RNN), the RNN outputting relative scores for each motion of interest via an output interface in communication with the processor.
The presently disclosed embodiments will be further explained with reference to the drawings. The drawings shown are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the presently disclosed embodiments.
Drawings
Fig. 1A shows a block diagram of a method of detecting motion of an object in a scene from a video of the scene according to an embodiment of the present disclosure.
Fig. 1B is a schematic diagram illustrating some components of the method of fig. 1A detecting motion of objects in a scene from a video of the scene according to some embodiments of the present disclosure.
Fig. 2 is a schematic diagram illustrating a Recurrent Neural Network (RNN) including a multi-stream Convolutional Neural Network (CNN) as its initial layer and a Long Short Term Memory (LSTM) network as its final layer, according to some embodiments of the present disclosure.
Fig. 3A gives an example of a contour image by showing an input image from a sequence of images according to some embodiments of the present disclosure.
Fig. 3B gives an example of a contour image by showing a contour image determined from an input image according to some embodiments of the present disclosure.
Fig. 4 is a schematic diagram illustrating an LSTM unit according to some embodiments of the present disclosure.
Fig. 5 is a schematic diagram of at least one method and system of detecting motion of an object according to an embodiment of the present disclosure.
Fig. 6 is a block diagram illustrating the method of fig. 1A, which may be implemented using alternative computer or processor configurations, according to an embodiment of the present disclosure.
Detailed Description
While the above-identified drawing figures set forth embodiments of the present disclosure, other embodiments are also contemplated, as noted in the discussion. The present disclosure presents illustrative embodiments by way of representation and not limitation. Numerous other modifications and embodiments can be devised by those skilled in the art which fall within the scope and spirit of the principles of the presently disclosed embodiments.
The following description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the following description of the exemplary embodiments will provide those skilled in the art with a description that will enable one or more exemplary embodiments to be implemented. Various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the disclosed subject matter as set forth in the appended claims. In the following description, specific details are given to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, systems, processes, and other elements of the disclosed subject matter may be shown in block diagram form as components in order to avoid obscuring the embodiments in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments. Moreover, like reference numbers and designations in the various drawings indicate like elements.
Furthermore, various embodiments may be described as a process which is depicted as a flowchart, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. Additionally, the order of the operations may be rearranged. A process may terminate when its operations are completed, but may have additional steps not discussed or included in the figures. Moreover, not all operations in any specifically described process may be present in all embodiments. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When the procedure corresponds to a function, the termination of the function may correspond to a return of the function to the calling function or the main function.
Moreover, embodiments of the disclosed subject matter can be implemented, at least in part, manually or automatically. May be implemented or at least assisted manually or automatically using a machine, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine readable medium. The processor may perform the required tasks.
Overview
The present disclosure relates generally to computer vision and camera surveillance applications, and more particularly, to detecting instances of objects (e.g., people) in a video that perform a particular action of interest by using a sequence of contour images computed from frames of the video sequence to represent motion information.
The present disclosure provides methods and systems that overcome the problems of video analytics tasks, such as action recognition, event detection, and video retrieval, which rely on input representations and intermediate representations that are computationally very time consuming and also require a large amount of memory to store. In particular, the present disclosure describes motion detection methods and systems that can efficiently detect motion in video in terms of minimizing time-consuming computations and reducing memory storage/requirements.
In particular, the present disclosure is based on the recognition that using a sequence of contour images computed from frames of a video sequence to represent motion information can provide a fast and memory-efficient detector for actions and the like in a video. For example, the present disclosure addresses motion detection in videos by locating the occurrence of a particular motion in time (in which frames of the video it occurs) and space (where it occurs within the individual frames). We recognized through experimentation that we can detect motion in a video using a deep neural network with recurrent connections that takes as input a cropped image around an object (e.g., a person) and a sequence of contour images representing motion within the cropped region across multiple frames. We found that previous methods, which use optical flow-based representations, are computationally expensive, i.e., they require time-consuming computations and large amounts of memory and storage. This makes these previous motion detection methods infeasible for real-world applications.
The present disclosure also includes an object/person tracker that can spatially locate where within a video frame an action occurs. We have found through experimentation that conventional methods of analyzing motion and appearance over an entire frame, rather than using trackers, use a large amount of information that is not relevant to the motion of interest. In addition, these methods do not have sufficiently detailed information from the areas that are most important to the task.
In addition, the present disclosure uses a multi-stream recurrent neural network (RNN) that learns features representing two important aspects, motion and appearance, and learns the important temporal dynamics over many video frames that distinguish different actions. For example, these methods and systems may be used to detect motion of objects in a video, where the objects may be humans, animals, machinery, vehicles, robots, industrial robots in a factory setting, and the like. The present disclosure provides more accurate motion detection for motions of objects occurring in video that is not temporally cropped.
Another aspect of the present disclosure includes using a Long Short Term Memory (LSTM) network included as one or more layers of RNNs that can learn patterns having a longer duration than can be learned using conventional RNNs. The present disclosure may provide better performance using bi-directional LSTM, which means that the present disclosure may use information from past and future video frames to detect motion.
Method and system
Fig. 1A illustrates a block diagram of a method 100 of detecting motion of an object in a scene from a video of the scene according to an embodiment of the disclosure. The video may be a video sequence that is partitioned into chunks such that each chunk comprises a succession of video frames. An initial step 120 includes acquiring, by the processor 110, a video of a scene, wherein the video includes a sequence of images.
Step 122 includes tracking objects in the video and, for each object and each chunk of the video, further includes: determining a sequence of contour images from video frames of the video sequence to represent motion data within a bounding box located around the object, step 125; and a step 127 of generating a trimming contour image and a trimming image for one or more images in the respective blocks using the bounding box.
Finally, step 128 includes passing the cropped contour image and the cropped image to a Recurrent Neural Network (RNN), which outputs a relative score for each action of interest.
Fig. 1B is a schematic diagram illustrating components of the method 100 of fig. 1A detecting motion of an object in a scene from a video of the scene in accordance with an embodiment of the present disclosure. In particular, fig. 1B illustrates the basic operations of the method 100 of detecting motion of an object 107 in a scene 105 (e.g., detecting a person in the scene performing a particular motion). Video data 108 of a scene 105 from a video camera 104 is acquired 120 as a sequence of images 115, wherein each image comprises pixels. A scene may include one or more objects 107 that perform an action, such as a person running up a staircase or some other action. The video data is acquired by the processor 110. Further, one or more objects 107 are tracked 122, and bounding boxes 123 for each tracked object 107 are estimated in each chunk of the video image. For example, a chunk may be a sequence of six consecutive images, less than six images, or more than six images.
The image is cropped to the extent of the bounding box 123 and the sequence of contour images is computed 125 and cropped to the extent of the bounding box 123. The resulting cropped contour image and cropped image 127 are passed to a Recurrent Neural Network (RNN)130, the RNN 130 having been trained to output relative scores 140 for the respective actions of interest. These steps may be performed in a processor 110 connected to a memory (not shown).
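For illustration, the per-chunk processing just described could be organized as in the following Python sketch. It is only an outline under stated assumptions: the tracker, the contour-image routine, the crop helper, and the pre-trained RNN are hypothetical placeholders (not names defined in this disclosure), and a chunk size of six frames is used as in the example above.

# Minimal sketch of the per-chunk detection loop of Fig. 1B.
# `tracker`, `contour_fn`, `crop`, and `rnn` are hypothetical placeholders for an
# object tracker, a contour-image routine, a bounding-box crop, and a pre-trained
# recurrent network; they are not defined in this disclosure.

def chunked(frames, chunk_size=6):
    """Split the frame sequence into consecutive chunks (e.g., six frames each)."""
    for i in range(0, len(frames) - chunk_size + 1, chunk_size):
        yield frames[i:i + chunk_size]

def detect_actions(frames, tracker, contour_fn, crop, rnn, chunk_size=6):
    """Yield (object id, chunk index, per-action scores) for every tracked object."""
    for k, chunk in enumerate(chunked(frames, chunk_size)):
        for obj in tracker.update(chunk):                    # one bounding box per object per chunk
            box = obj.bounding_box
            cropped_images = [crop(f, box) for f in chunk]
            cropped_contours = [crop(contour_fn(f), box) for f in chunk]
            scores = rnn(cropped_contours, cropped_images)   # relative score per action of interest
            yield obj.id, k, scores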
As described above, embodiments of the present disclosure provide methods and systems for detecting motion of an object in a video. Some embodiments include a training phase and a testing phase, wherein the training phase involves learning parameters of the RNN from training data. Some embodiments may include only a testing phase. For example, a method with only a test phase may be embedded in a small device using a pre-trained RNN.
Fig. 2 is a schematic diagram illustrating a Recurrent Neural Network (RNN) including a multi-stream Convolutional Neural Network (CNN) as its initial layer and a Long Short Term Memory (LSTM) network as its final layer according to an embodiment of the present disclosure.
For example, during the training phase, we train four independent Convolutional Neural Networks (CNNs) 220, as shown in fig. 2. Each CNN processes one of four streams 210: a motion stream 211 and an appearance stream 212 of video images cropped around the position of the tracked object, and a motion stream 213 and an appearance stream 214 of full-frame (not spatially cropped) video images. Some embodiments have only two streams: a motion stream 211 and an appearance stream 212 of video images cropped around the position of the tracked object. This may be useful, for example, in situations where the background scene is noisy, uninformative, or unrelated to the action being performed by the object.
Still referring to fig. 2, in some embodiments, each convolutional network (CNN) uses a VGG (visual geometry group) architecture. However, other CNN architectures may also be used for the individual flows, such as the AlexNet architecture or the ResNet architecture.
The four networks perform action classification tasks on successive small chunks 201 of the video 200. For example, each chunk may consist of six consecutive video frames. The CNNs are followed by a projection layer 230 and a Long Short-Term Memory (LSTM) unit 240, the projection layer 230 projecting the outputs of the CNNs of all streams into a single space. The output for each chunk is a detected action class 250 from the set of N action classes A1, A2, ..., AN.
Two Convolutional Neural Networks (CNNs), one for images and one for motion, are trained on chunks consisting of video frames cropped to the bounding box of the tracked object. Cropping the frames to the bounding box confines the input to the vicinity of the action, which helps classify the action. In some implementations, the bounding box has a fixed pixel size, which helps align objects across multiple performances of an action.
Still referring to fig. 2, in some preferred embodiments, two additional CNNs, one each for image and motion, are trained on chunks consisting of video frames that are not spatially cropped (i.e., each frame is a full frame of video, thus preserving the spatial context of the actions performed within the scene). We refer to this network as a multi-stream neural network because it has multiple (e.g., four) CNNs, each of which handles a different information stream from the video.
After the four networks 220 have been trained, we learn the fully-connected projection layer 230 on the outputs of the four networks to create a joint representation of these independent streams. In some embodiments where the CNN uses a VGG architecture, the output of the network is its fc7 tier output, where fc7 tier is the last fully connected tier in the VGG network. The full-length video 200 is provided to the multi-stream network as a time-series arrangement of chunks 201, and then the corresponding time-series of outputs of the projection layers are fed into a long-short term memory (LSTM) network 240. In some embodiments, the LSTM network operates in both directions, i.e., the LSTM network is bidirectional.
A bidirectional LSTM network consists of two directional LSTM networks (one connected forward in time and the other connected backward in time). In some embodiments, each of the two directional LSTM networks is followed by a fully connected layer (not shown in fig. 2 for clarity) over the hidden state of the respective directional LSTM network, followed by a softmax layer, to obtain an intermediate score corresponding to each action. Finally, the scores of the two directional LSTM networks are combined (e.g., averaged) to obtain a score for each particular action.
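To make this arrangement concrete, the following PyTorch-style sketch shows one possible realization of the four-stream network, the fully connected projection layer, and the bidirectional LSTM whose per-direction softmax scores are averaged. The backbone constructor, feature and layer sizes, and the number of action classes are illustrative assumptions, not values taken from this disclosure.

import torch
import torch.nn as nn

class MultiStreamActionDetector(nn.Module):
    """Illustrative sketch: four per-stream CNNs -> joint projection -> bidirectional LSTM -> averaged scores."""
    def __init__(self, backbone_fn, feat_dim=4096, proj_dim=512, hidden=256, n_actions=10):
        super().__init__()
        # One CNN per stream: cropped motion, cropped appearance, full-frame motion, full-frame appearance.
        self.streams = nn.ModuleList([backbone_fn() for _ in range(4)])
        self.projection = nn.Linear(4 * feat_dim, proj_dim)      # joint representation of all streams
        self.lstm = nn.LSTM(proj_dim, hidden, batch_first=True, bidirectional=True)
        self.head_fwd = nn.Linear(hidden, n_actions)              # classifier over forward hidden states
        self.head_bwd = nn.Linear(hidden, n_actions)              # classifier over backward hidden states

    def forward(self, stream_chunks):
        # stream_chunks: list of 4 tensors, each (batch, n_chunks, C, H, W), one per stream.
        feats = []
        for net, x in zip(self.streams, stream_chunks):
            b, t = x.shape[:2]
            f = net(x.flatten(0, 1)).view(b, t, -1)                # per-chunk fc7-like features
            feats.append(f)
        joint = torch.relu(self.projection(torch.cat(feats, dim=-1)))
        out, _ = self.lstm(joint)                                  # (batch, n_chunks, 2 * hidden)
        fwd, bwd = out.chunk(2, dim=-1)
        # Average the per-direction softmax scores to obtain a score for each action.
        scores = 0.5 * (self.head_fwd(fwd).softmax(-1) + self.head_bwd(bwd).softmax(-1))
        return scores                                              # per-chunk probability of each action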
Still referring to FIG. 2, there are a number of components in the action detection pipeline that are critical to achieving good performance. In this task, we use a model that characterizes the spatial and long-term temporal information present in the video.
The contour images determined using the bounding box provide a frame of reference that makes many actions easier to learn by removing positional variation from the input representation. However, some actions are location dependent. For scenes acquired using a static video camera, these actions always occur at the same image location. For example, in a cooking video, washing and rinsing are almost always performed near the sink, and opening a door will most likely be performed near the refrigerator or a cabinet. For these reasons, we train two separate deep networks on the cropped and un-cropped chunks of the contour images and video frames.
The first two CNNs are trained on images cropped using the box from the object tracker, which reduces background noise and provides an object-centered frame of reference for the contour images and image regions. The other two CNNs are trained on the entire (spatially full-frame) images to preserve the global spatial context.
Fig. 3A and 3B illustrate a contour image determined from an input image. The input image represents an image from a sequence of images. An object contour may be determined from the input image using an image processing algorithm (e.g., an algorithm using a deep neural network) to determine a contour image.
The contour image may be computed automatically from the input image and represents edges along the boundaries of the various objects in the image. Furthermore, the contour image does not represent the colors and textures within the input image, but only the object boundaries. A sequence of contour images thus contains only the information most relevant to the motion of objects in the corresponding image sequence: the object contours.
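As a simple illustration of such a contour image, the sketch below uses a standard Canny edge detector from OpenCV as a stand-in; the disclosure leaves the exact contour algorithm open (a deep neural network may be used instead), so this is only an assumed approximation that keeps boundaries and discards color and texture.

import cv2

# Illustrative stand-in for a contour-image computation; the disclosure does not
# specify Canny, and the thresholds below are arbitrary assumptions.
def contour_image(frame_bgr, low=100, high=200):
    """Approximate a contour image: object boundary edges only, no color or texture."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # suppress fine texture/noise before edge detection
    edges = cv2.Canny(blurred, low, high)         # binary image of object boundaries
    return edges

# Hypothetical usage: a per-frame contour image cropped to the tracked bounding box.
# frame = cv2.imread("frame_0001.png")            # hypothetical frame file
# x, y, w, h = 120, 80, 64, 128                   # hypothetical bounding box
# cropped_contour = contour_image(frame)[y:y + h, x:x + w]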
Since the action to be detected can have a wide range of durations, our method uses the LSTM network 140 to learn the duration and long-term temporal context of the action in a data-driven manner. Our results demonstrate that LSTM is effective in learning long-term temporal context for fine-grained motion detection.
Tracking for fine-grained motion detection
To provide a bounding box around the object for the location-independent (cropped) appearance and motion streams, any object tracking method may be used. In a preferred embodiment, we use a state-based tracker to spatially localize actions in the video. Keeping the size of the tracking bounding box fixed, we update the position of the bounding box to maximize the magnitude of the difference-image energy within the bounding box. If the magnitude of the difference-image energy is greater than a threshold, the position of the bounding box is updated to the position that maximizes it. Otherwise, the object is either moving slowly or not moving at all; in that case the bounding box from the previous chunk is used, i.e., the bounding box is not updated. The position of the bounding box is updated only after a chunk (e.g., six images) has been processed and the motion and appearance features for that chunk have been determined, to ensure that the bounding box is stationary over all images in the chunk.
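A minimal sketch of this bounding-box update is shown below, assuming grayscale frames and a simple local search; the difference-image definition, search radius, step, and threshold are illustrative assumptions rather than values specified in this disclosure.

import numpy as np

# Illustrative sketch of the state-based tracker update; parameters are assumptions.
def update_box(chunk_frames, box, box_size, energy_thresh, search_radius=16, step=4):
    """Update the fixed-size bounding box for one chunk of grayscale frames (T, H, W)."""
    # Difference image: accumulated absolute frame-to-frame differences over the chunk.
    diff = np.abs(np.diff(chunk_frames.astype(np.float32), axis=0)).sum(axis=0)
    h, w = box_size
    y0, x0 = box
    best_box, best_energy = box, -1.0
    # Search candidate positions near the previous box for maximum difference-image energy.
    for dy in range(-search_radius, search_radius + 1, step):
        for dx in range(-search_radius, search_radius + 1, step):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + h > diff.shape[0] or x + w > diff.shape[1]:
                continue
            energy = diff[y:y + h, x:x + w].sum()
            if energy > best_energy:
                best_energy, best_box = energy, (y, x)
    # If the motion energy is below the threshold, the object is (nearly) still: keep the old box.
    return best_box if best_energy > energy_thresh else box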
Our tracking method can be applied effectively when the camera is stationary and a reasonable estimate of the object size is available. This is a practical assumption for many videos taken in retail stores, private homes, or surveillance settings in which fine-grained motion detection is likely to be used. For more difficult tracking situations, a more sophisticated tracker may be used.
In a preferred embodiment, the bounding box is a rectangular region containing the object, but the bounding box need not be rectangular. More generally, a bounding box is an area of any shape that contains or substantially contains an object to be tracked and may additionally contain a smaller area surrounding the object.
Motion detection over long sequences using bidirectional LSTM networks
Fig. 4 is a schematic diagram illustrating an LSTM unit according to some embodiments of the present disclosure. We now provide a brief description of recurrent neural network (RNN) and long short-term memory (LSTM) units. Given an input sequence x = (x_1, ..., x_T), an RNN uses a hidden state representation h = (h_1, ..., h_T) so that the RNN can map the input x to the output sequence y = (y_1, ..., y_T).
To determine this representation, the RNN iterates the following recurrence equations:
h_t = g(W_xh x_t + W_hh h_{t-1} + b_h),   y_t = g(W_hy h_t + b_z),
where g is the activation function, W_xh is the weight matrix that maps the input to the hidden state, W_hh is the transition matrix between hidden states at two adjacent time steps, W_hy is the matrix that maps the hidden state h to the output y, and b_h and b_z are bias terms.
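In code, this recurrence is simply a loop over time steps. The NumPy sketch below assumes tanh as the activation g and treats the weight matrices and biases as given.

import numpy as np

# Illustrative NumPy sketch of the RNN recurrence defined above (g assumed to be tanh).
def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_z, g=np.tanh):
    """Vanilla RNN: h_t = g(W_xh x_t + W_hh h_{t-1} + b_h), y_t = g(W_hy h_t + b_z)."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x_t in x_seq:                        # x_seq: iterable of input vectors x_1, ..., x_T
        h = g(W_xh @ x_t + W_hh @ h + b_h)   # hidden state update
        ys.append(g(W_hy @ h + b_z))         # output at time t
    return np.stack(ys)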
Still referring to FIG. 4, unlike Hidden Markov Models (HMMs), which use a discrete hidden state representation, recurrent neural networks use a continuous-space representation of the hidden state. However, it is difficult to train RNNs to learn long-term sequence information, because training requires unrolling the network using backpropagation through time. This leads to the vanishing or exploding gradient problem.
To avoid this problem, the LSTM unit, shown in fig. 4, has a memory cell c_t and a forget gate f_t that help the LSTM learn when to retain the previous state and when to forget it. This enables the LSTM network to learn long-term temporal information. The weight update equations for the LSTM unit are as follows:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
g_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)
c_t = f_t c_{t-1} + i_t g_t
h_t = o_t tanh(c_t)
where σ is the sigmoid function, tanh is the hyperbolic tangent function, and i_t, f_t, o_t, and c_t are the input gate, forget gate, output gate, and memory cell activation vectors, respectively.
The forget gate f_t determines when (and which) information is cleared from the memory cell c_t. The input gate i_t determines when (and which) new information is incorporated into the memory. The tanh layer g_t generates a set of candidate values that are added to the memory cell when the input gate allows it.
Still referring to FIG. 4, the memory cell c_t is updated based on the forget gate f_t, the input gate i_t, and the new candidate values g_t. The output gate o_t determines which information in the memory cell is used as the hidden state representation. The hidden state is the product of the output gate and a function of the memory cell state.
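These update equations translate directly into code. The NumPy sketch below implements a single LSTM step, using elementwise products for the gating terms, with the weight matrices and biases supplied in a parameter dictionary (an illustrative convention, not one defined in this disclosure).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sketch of one LSTM step following the gate equations above.
def lstm_step(x_t, h_prev, c_prev, p):
    """Single LSTM step; p is a dict of weight matrices and bias vectors (assumed convention)."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])   # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])   # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])   # output gate
    g_t = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])   # candidate values
    c_t = f_t * c_prev + i_t * g_t                                   # memory cell update
    h_t = o_t * np.tanh(c_t)                                         # hidden state
    return h_t, c_t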
The LSTM architecture for RNNs has been used successfully for generating sentences from images, for video-to-text video description, and for speech recognition. However, for the motion recognition task, the performance of LSTM networks is still close to that of classifiers based on Fisher vectors generated over improved dense trajectories. RNNs using LSTM have not previously been used for motion detection from video, which is the focus of this disclosure, perhaps because of their modest performance on motion recognition from video.
In common motion recognition datasets, the video is temporally cropped to start and end at or near the start and end times of each motion. Temporally cropped videos are typically short (e.g., 2-20 seconds). Thus, in the action recognition task there is not enough long-term context to learn in a data-driven manner. Long-term context may include properties such as the expected duration of an action, which actions tend to follow or precede an action, and other long-term motion patterns that extend temporally beyond the boundaries of the action.
Still referring to fig. 4, LSTM networks have had little access to longer-term temporal context in the action recognition task. However, in fine-grained motion detection, the video duration is typically on the order of minutes or hours. Therefore, our key insight is that LSTM networks will be more suited to motion detection (to which we apply them) than to motion recognition (to which they were previously applied), because LSTMs model long-term temporal dynamics in a sequence.
A bidirectional LSTM network integrates information from both future and past chunks to form a prediction for each chunk in the video sequence. Therefore, we expect a bidirectional LSTM network to be better than a unidirectional LSTM at predicting the temporal boundaries (i.e., the start and end) of an action.
As described herein, the forward LSTM network and the backward LSTM network each generate a softmax score for each action category, and we average the softmax scores of the two LSTM networks to obtain a score (probability) for each action.
Although the LSTM network is trained on long sequences, backpropagation through time is performed using only short sub-sequences of chunks, up to a fixed number of steps. To preserve long-term context, we retain the hidden state of the last element of the previous chunk sub-sequence when training on the subsequent chunk sub-sequence.
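A sketch of this training detail is shown below for a unidirectional LSTM (the bidirectional case is analogous per direction), assuming PyTorch: backpropagation through time runs over short windows of chunks, and the final hidden and cell states of each window are detached and carried into the next window to preserve long-term context. The model, classifier, and tensor shapes are illustrative assumptions.

import torch

# Illustrative sketch; `lstm_model` is assumed to be an nn.LSTM with batch_first=True,
# `chunk_features` a (batch, n_chunks, feat) tensor, and `labels` per-chunk action labels.
def train_long_sequence(lstm_model, classifier, chunk_features, labels, optimizer,
                        bptt_steps=32, loss_fn=torch.nn.CrossEntropyLoss()):
    """Truncated BPTT over a long chunk sequence, carrying hidden state across windows."""
    state = None                                                  # (h, c) carried across windows
    for start in range(0, chunk_features.size(1), bptt_steps):
        window = chunk_features[:, start:start + bptt_steps]      # (batch, steps, feat)
        target = labels[:, start:start + bptt_steps]
        out, state = lstm_model(window, state)
        logits = classifier(out)                                  # (batch, steps, n_actions)
        loss = loss_fn(logits.flatten(0, 1), target.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Keep the last hidden/cell state but detach it so gradients do not flow across windows.
        state = tuple(s.detach() for s in state)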
Fig. 5 is a schematic diagram of at least one method and system of detecting motion of an object according to embodiments of the present disclosure. For example, as provided above, the training phase of the method involves training a Recurrent Neural Network (RNN). In the testing phase (i.e., motion detection), the RNN that has been trained is used to detect motion of the subject.
Fig. 5 illustrates the basic operations of a method and system 500 for detecting motion of an object (e.g., detecting a person in a scene performing a particular motion). For example, the method 500 may include at least one sensor 504 that generates input video data of a scene 505. The sensor 504 may be a video camera or some other device that generates input video data. It is contemplated that sensor 504 may collect other data, such as time, temperature, and other data related to scene 505.
The computer readable memory 512 of the computer 514 may store and/or provide the input video data 501 generated by the sensor 504. The sensor 504 collects input video data 501 of a scene 505, which may optionally be stored in an external memory 506 or may be sent directly to an input interface/preprocessor 507 and then to a processor 510.
Further, a video 501 of a scene 505 is acquired 520 as a sequence of images 515, wherein each image comprises pixels. The scene 505 may include one or more objects 507 that perform an action, for example, a person running up a staircase. Optionally, there may be an external memory 506 connected to an input interface/preprocessor 507 connected to the memory 512, which is connected to retrieve the video 520 as described above.
In addition, one or more objects are tracked 522, and bounding boxes 523 of the tracked objects are estimated in respective chunks of the video image. For example, as a non-limiting example, a chunk may be a sequence of six images.
The image is cropped to the extent of the bounding box and a contour image is computed 525 within the bounding box. The resulting cropped contour image and cropped image 527 are passed to a Recurrent Neural Network (RNN)550, the RNN 550 having been trained to output relative scores 560 for each action of interest.
In outputting the relative scores 560 for each action of interest, the output of the relative scores 560 may be stored in the memory 512 or output via the output interface 561. During processing, the processor 514 may communicate with the memory 512 to store or retrieve stored instructions or other data related to the processing.
FIG. 6 is a block diagram illustrating the method of FIG. 1A, which may be implemented using alternative computer or processor configurations, according to an embodiment of the present disclosure. The computer/controller 611 includes a processor 640, a computer readable memory 612, a storage 658, and a user interface 649 with a display 652 and a keyboard 651, connected by a bus 656. For example, the user interface 649, which is in communication with the processor 640 and the computer-readable memory 612, retrieves data and stores it in the computer-readable memory 612 when user input is received from a surface of the user interface 657, a keyboard surface.
It is contemplated that memory 612 may store instructions executable by a processor, historical data, and any data usable by the methods and systems of the present disclosure. Processor 640 may be a single-core processor, a multi-core processor, a computing cluster, or any number of other configurations. The processor 640 may be connected to one or more input devices and output devices by a bus 656. The memory 612 may include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, or any other suitable memory system.
Still referring to fig. 6, the storage 658 may be configured to store supplemental data and/or software modules used by the processor. For example, the storage 658 may store historical data as well as other relevant data as mentioned above with respect to the present disclosure. Additionally or alternatively, the storage 658 may store historical data similar to that as mentioned above with respect to the present disclosure. The storage 658 may include a hard disk drive, an optical disk drive, a thumb drive, an array of drives, or any combination thereof.
The system may optionally be linked by a bus 656 to a display interface (not shown) arranged to connect the system to a display device (not shown), which may include a computer monitor, camera, television, projector, mobile device, or the like.
The controller 611 may include a power supply 654, and the power supply 654 may optionally be located external to the controller 611, depending on the application. A user input interface 657, which is configured to be connected to a display device 648, can be linked via the bus 656, where the display device 648 can include a computer monitor, camera, television, projector, or mobile device, among others. Printer interface 659 may also be connected by bus 656 and configured to connect to printing device 632, where printing device 632 may include a liquid inkjet printer, a solid ink printer, a large commercial printer, a thermal printer, a UV printer, or a dye sublimation printer, among others. A Network Interface Controller (NIC)634 is provided to connect to the network 636 via the bus 656, where data or other data, etc. may be rendered on a third party display device, a third party imaging device, and/or a third party printing device external to the controller 611.
Still referring to fig. 6, data or other data or the like may be transmitted via a communication channel of the network 636 and/or stored within the storage system 658 for storage and/or further processing. Further, data or other data may be received wirelessly or hardwired from the receiver 646 (or an external receiver 638) or transmitted wirelessly or hardwired via the transmitter 647 (or an external transmitter 639), both the receiver 646 and the transmitter 647 being connected by a bus 656. Further, the GPS 601 may be connected to the controller 611 via a bus 656. The controller 611 may be connected to an external sensing device 644 and an external input/output device 641 via the input interface 608. The controller 611 can be connected to other external computers 642. Output interface 609 may be used to output processed data from processor 640.
Aspects of the present disclosure may also include a bidirectional long short-term memory (LSTM) network that manages data stored over time based on conditions, wherein the conditions include an input gate, a forget gate, and an output gate, so as to manage the stored data as it changes over time. The data stored over time may relate to the action of interest, such that the stored data includes historical properties of the expected duration of the action of interest, historical types of actions occurring after or before the action of interest, and historical long-term motion patterns that extend beyond the bounding box boundaries of the action of interest.
The above-described embodiments of the present disclosure may be implemented in any of numerous ways. For example, embodiments may be implemented using hardware, software, or a combination thereof. Use of ordinal terms such as "first" and "second" in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another, or the temporal order in which acts of a method are performed; such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for the use of the ordinal term).
Additionally, embodiments of the present disclosure may be embodied as a method, examples of which are provided. The actions performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed which perform acts in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.

Claims (20)

1. A method of detecting motion of an object in a scene from a video of the scene, such that the video is a video sequence of the scene divided into chunks, and each chunk comprises successive video frames, the method comprising the steps of:
obtaining, by a processor, the video of the scene, wherein the video comprises a sequence of images;
tracking, by the processor, the objects in the video, and for each object and each chunk of the video, further comprising:
determining a sequence of contour images from video frames of the video sequence to represent motion data within a bounding box located around the object;
generating a cropped contour image and a cropped image for one or more images in the respective chunks using the bounding box; and
the cropped contour image and the cropped image are passed to a recurrent neural network RNN, which outputs relative scores for each action of interest.
2. The method of claim 1, wherein the RNN comprises a convolutional neural network layer and one or more recurrent neural network layers.
3. The method of claim 2, wherein the convolutional neural network layer operates on a plurality of streams including a sequence of cropped contour images and the cropped images.
4. The method of claim 2, wherein the convolutional neural network layer operates on multiple streams including a sequence of cropped contour images and the cropped images, and contour images and images having a complete spatial extent of the video frame.
5. The method of claim 2, wherein the recurrent neural network layer comprises long-short term memory (LSTM) units.
6. The method of claim 5, wherein the recurrent neural network layer comprises bidirectional long-short term memory (LSTM) units.
7. The method of claim 1, wherein the object is one of a human, a robot, or an industrial robot.
8. The method of claim 7, further comprising a person detector and a person tracker.
9. The method of claim 8, wherein the person tracker identifies at least one bounding box around each person in the video.
10. The method of claim 9, wherein the video frames of the video sequence representing motion data of the object are within a plurality of bounding boxes positioned around the object over time.
11. The method of claim 1, wherein the bounding box is a region having a shape that encompasses at least a portion or all of the tracked object.
12. The method of claim 1, wherein the video is initially acquired in a form other than a sequence of images and converted to a sequence of images.
13. The method of claim 1, wherein the method is used for fine-grained motion detection in the video.
14. A method as claimed in claim 1, wherein the method comprises training the RNN prior to the detecting step, or the RNN has been pre-trained prior to acquiring the video of the scene.
15. The method of claim 1, wherein the detecting step comprises one of a temporal action detection or a space-time action detection.
16. A system for detecting a motion of interest of an object in a scene from a video of the scene, such that the video is a video sequence of the scene divided into chunks, and each chunk comprises successive video frames, the system comprising:
a processor acquires the video of the scene such that the video comprises a sequence of images, wherein the processor is configured to:
tracking the objects in the video, and for each object and each chunk of the video:
determining a sequence of contour images from video frames of the video sequence to represent motion information within a bounding box located around the object;
generating a cropped contour image and a cropped image for one or more images in the respective chunks using the bounding box; and
passing the cropped contour image and the cropped image to a recurrent neural network (RNN), which outputs relative scores for each action of interest.
17. The system of claim 16, wherein the RNN comprises a convolutional neural network layer and one or more recurrent neural network layers, such that the convolutional neural network layer operates on multiple streams comprising a sequence of cropped contour images and the cropped images.
18. The system of claim 16, wherein the recurrent neural network layer includes long-short term memory (LSTM) units.
19. A non-transitory computer readable storage medium having embodied thereon a program, the program being executable by a computer for performing a method of detecting, from a video of a scene, an action of interest of an object in the scene such that the video is a video sequence of the scene divided into chunks, and each chunk comprises successive video frames, the method comprising the steps of:
obtaining, by a processor, the video of the scene, wherein the video comprises a sequence of images;
tracking, by the processor, the object in the video, and for each object and each chunk of the video, the processor being configured to:
determine a sequence of contour images from video frames of the video sequence within a bounding box located around the object;
generate a cropped contour image and a cropped image for one or more images in the respective chunk using the bounding box; and
pass the cropped contour image and the cropped image to a recurrent neural network (RNN) that outputs, via an output interface in communication with the processor, relative scores for each action of interest.
20. The storage medium of claim 19, wherein the RNN comprises a convolutional neural network layer and one or more recurrent neural network layers, wherein the convolutional neural network layer operates on a plurality of streams comprising the sequence of cropped contour images and the cropped images.
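
Illustrative sketch (outside the claims): one way the relative scores output by the RNN (claims 1, 16, and 19) could be converted into temporal action detections as in claim 15, by thresholding per-chunk scores and merging consecutive above-threshold chunks into intervals. The threshold and chunk length below are assumed values, not parameters specified in the patent.

def temporal_detections(chunk_scores, action_names, threshold=0.5, chunk_len=6):
    """chunk_scores: list of per-chunk score vectors aligned with action_names."""
    detections = []
    for a, name in enumerate(action_names):
        start = None
        # Append a sentinel so an interval still open at the end gets closed.
        for t, scores in enumerate(chunk_scores + [None]):
            active = scores is not None and scores[a] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                # Report the interval in frame indices (chunk_len frames per chunk).
                detections.append((name, start * chunk_len, t * chunk_len))
                start = None
    return detections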
CN201880048903.3A 2017-08-07 2018-06-18 Method and system for detecting motion Active CN110998594B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/670,021 2017-08-07
US15/670,021 US10210391B1 (en) 2017-08-07 2017-08-07 Method and system for detecting actions in videos using contour sequences
PCT/JP2018/023910 WO2019031083A1 (en) 2017-08-07 2018-06-18 Method and system for detecting action

Publications (2)

Publication Number Publication Date
CN110998594A true CN110998594A (en) 2020-04-10
CN110998594B CN110998594B (en) 2024-04-09

Family

ID=62948285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880048903.3A Active CN110998594B (en) 2017-08-07 2018-06-18 Method and system for detecting motion

Country Status (5)

Country Link
US (1) US10210391B1 (en)
EP (1) EP3665613A1 (en)
JP (1) JP6877630B2 (en)
CN (1) CN110998594B (en)
WO (1) WO2019031083A1 (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10762637B2 (en) * 2017-10-27 2020-09-01 Siemens Healthcare Gmbh Vascular segmentation using fully convolutional and recurrent neural networks
WO2019097784A1 (en) * 2017-11-16 2019-05-23 ソニー株式会社 Information processing device, information processing method, and program
EP3495988A1 (en) 2017-12-05 2019-06-12 Aptiv Technologies Limited Method of processing image data in a connectionist network
US11501522B2 (en) * 2017-12-06 2022-11-15 Nec Corporation Image recognition model generating device, image recognition model generating method, and image recognition model generating program storing medium
US10762662B2 (en) * 2018-03-14 2020-09-01 Tata Consultancy Services Limited Context based position estimation of target of interest in videos
EP3561726A1 (en) 2018-04-23 2019-10-30 Aptiv Technologies Limited A device and a method for processing data sequences using a convolutional neural network
EP3561727A1 (en) * 2018-04-23 2019-10-30 Aptiv Technologies Limited A device and a method for extracting dynamic information on a scene using a convolutional neural network
US10795933B1 (en) * 2018-05-01 2020-10-06 Flock Group Inc. System and method for object based query of video content captured by a dynamic surveillance network
US11055854B2 (en) * 2018-08-23 2021-07-06 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning
CN110111358B (en) * 2019-05-14 2022-05-24 西南交通大学 Target tracking method based on multilayer time sequence filtering
US11663448B2 (en) 2019-06-28 2023-05-30 Conduent Business Services, Llc Neural network systems and methods for event parameter determination
WO2021055536A1 (en) * 2019-09-17 2021-03-25 Battelle Memorial Institute Activity assistance system
US11798272B2 (en) 2019-09-17 2023-10-24 Battelle Memorial Institute Activity assistance system
US11373407B2 (en) * 2019-10-25 2022-06-28 International Business Machines Corporation Attention generation
CN110826702A (en) * 2019-11-18 2020-02-21 方玉明 Abnormal event detection method for multitask deep network
CN111027510A (en) * 2019-12-23 2020-04-17 上海商汤智能科技有限公司 Behavior detection method and device and storage medium
CN111400545A (en) * 2020-03-01 2020-07-10 西北工业大学 Video annotation method based on deep learning
US11195039B2 (en) * 2020-03-10 2021-12-07 International Business Machines Corporation Non-resource-intensive object detection
CN111243410B (en) * 2020-03-20 2022-01-28 上海中科教育装备集团有限公司 Chemical funnel device construction experiment operation device and intelligent scoring method
CN113744373A (en) * 2020-05-15 2021-12-03 完美世界(北京)软件科技发展有限公司 Animation generation method, device and equipment
CN111881720B (en) * 2020-06-09 2024-01-16 山东大学 Automatic enhancement and expansion method, recognition method and system for data for deep learning
JP7472073B2 (en) 2021-04-26 2024-04-22 株式会社東芝 Training data generation device, training data generation method, and training data generation program
CN113362369A (en) * 2021-06-07 2021-09-07 中国科学技术大学 State detection method and detection device for moving object
CN115359059B (en) * 2022-10-20 2023-01-31 一道新能源科技(衢州)有限公司 Solar cell performance test method and system
CN117994850A (en) * 2024-02-26 2024-05-07 中国人民解放军军事科学院军事医学研究院 Behavior detection method, equipment and system for experimental animal

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4481663B2 (en) 2004-01-15 2010-06-16 キヤノン株式会社 Motion recognition device, motion recognition method, device control device, and computer program
US8345984B2 (en) 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
US9147260B2 (en) * 2010-12-20 2015-09-29 International Business Machines Corporation Detection and tracking of moving objects
CN103593661B (en) 2013-11-27 2016-09-28 天津大学 A kind of human motion recognition method based on sort method
JP6517681B2 (en) * 2015-12-17 2019-05-22 日本電信電話株式会社 Image pattern learning apparatus, method and program

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999007153A1 (en) * 1997-07-31 1999-02-11 Reality Fusion, Inc. Systems and methods for software control through analysis and interpretation of video information
US20020101932A1 (en) * 2000-11-29 2002-08-01 Montgomery Dennis L. Method and apparatus for encoding information using multiple passes and decoding in a single pass
WO2003036557A1 (en) * 2001-10-22 2003-05-01 Intel Zao Method and apparatus for background segmentation based on motion localization
CN101464952A (en) * 2007-12-19 2009-06-24 中国科学院自动化研究所 Abnormal behavior identification method based on contour
US20090278937A1 (en) * 2008-04-22 2009-11-12 Universitat Stuttgart Video data processing
CN101872418A (en) * 2010-05-28 2010-10-27 电子科技大学 Detection method based on group environment abnormal behavior
CN103377479A (en) * 2012-04-27 2013-10-30 索尼公司 Event detecting method, device and system and video camera
CN103824070A (en) * 2014-03-24 2014-05-28 重庆邮电大学 Rapid pedestrian detection method based on computer vision
US20160042621A1 (en) * 2014-06-13 2016-02-11 William Daylesford Hogg Video Motion Detection Method and Alert Management
CN104408444A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Human body action recognition method and device
CN105184818A (en) * 2015-09-06 2015-12-23 山东华宇航天空间技术有限公司 Video monitoring abnormal behavior detection method and detections system thereof
US20170083764A1 (en) * 2015-09-23 2017-03-23 Behavioral Recognition Systems, Inc. Detected object tracker for a video analytics system
CN105426820A (en) * 2015-11-03 2016-03-23 中原智慧城市设计研究院有限公司 Multi-person abnormal behavior detection method based on security monitoring video data
US20170199010A1 (en) * 2016-01-11 2017-07-13 Jonathan Patrick Baker System and Method for Tracking and Locating Targets for Shooting Applications
CN106952269A (en) * 2017-02-24 2017-07-14 北京航空航天大学 The reversible video foreground object sequence detection dividing method of neighbour and system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BHARAT SINGH; TIM K. MARKS; MICHAEL JONES; ONCEL TUZEL; MING SHAO: "A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection" *
DA-WEI KUO, GUAN-YU CHENG, SHYI-CHYI CHENG: "Detecting Salient Fragments for Video Human Action Detection and Recognition Using an Associative Memory" *
MING YANG; FENGJUN LV; WEI XU; KAI YU; YIHONG GONG: "Human action detection by boosting efficient motion features" *
LIU HUIZHEN; SHANG ZHENHONG: "Research on the Detection of Multiple Moving Objects" *
ZHANG JIE; WU JIANZHANG; TANG JIALI; FAN HONGHUI: "Human Action Recognition Method Based on Spatio-temporal Image Segmentation and Interaction Region Detection" *

Also Published As

Publication number Publication date
JP2020530162A (en) 2020-10-15
EP3665613A1 (en) 2020-06-17
CN110998594B (en) 2024-04-09
US10210391B1 (en) 2019-02-19
JP6877630B2 (en) 2021-05-26
WO2019031083A1 (en) 2019-02-14
US20190042850A1 (en) 2019-02-07

Similar Documents

Publication Publication Date Title
CN110998594B (en) Method and system for detecting motion
JP6625220B2 (en) Method and system for detecting the action of an object in a scene
CN108961312B (en) High-performance visual object tracking method and system for embedded visual system
CN107273782B (en) Online motion detection using recurrent neural networks
Wang et al. Hidden‐Markov‐models‐based dynamic hand gesture recognition
CN110287844B (en) Traffic police gesture recognition method based on convolution gesture machine and long-and-short-term memory network
JP4208898B2 (en) Object tracking device and object tracking method
Li et al. Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans
CN108446585A (en) Method for tracking target, device, computer equipment and storage medium
KR100421740B1 (en) Object activity modeling method
US9798923B2 (en) System and method for tracking and recognizing people
US20090296989A1 (en) Method for Automatic Detection and Tracking of Multiple Objects
KR102465960B1 (en) Multi-Class Multi-Object Tracking Method using Changing Point Detection
Rout A survey on object detection and tracking algorithms
CN117425916A (en) Occlusion aware multi-object tracking
CN112184767A (en) Method, device, equipment and storage medium for tracking moving object track
CN113869274B (en) Unmanned aerial vehicle intelligent tracking monitoring method and system based on city management
CN115035158A (en) Target tracking method and device, electronic equipment and storage medium
JP7450754B2 (en) Tracking vulnerable road users across image frames using fingerprints obtained from image analysis
Chen et al. Mode-based multi-hypothesis head tracking using parametric contours
Mohamed et al. Real-time moving objects tracking for mobile-robots using motion information
Chuang et al. Human Body Part Segmentation of Interacting People by Learning Blob Models
Ji et al. Visual-based view-invariant human motion analysis: A review
US20230206641A1 (en) Storage medium, information processing method, and information processing apparatus
Challa et al. Facial Landmarks Detection System with OpenCV Mediapipe and Python using Optical Flow (Active) Approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant