CN110998594A - Method and system for detecting motion - Google Patents

Method and system for detecting motion

Info

Publication number
CN110998594A
Authority
CN
China
Prior art keywords
video
sequence
cropped
images
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201880048903.3A
Other languages
Chinese (zh)
Other versions
CN110998594B (en)
Inventor
M. Jones
T. Marks
K. Kulkarni
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mitsubishi Electric Corp
Original Assignee
Mitsubishi Electric Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Corp filed Critical Mitsubishi Electric Corp
Publication of CN110998594A publication Critical patent/CN110998594A/en
Application granted granted Critical
Publication of CN110998594B publication Critical patent/CN110998594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G06F 18/2111 Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G06F 18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20021 Dividing image into blocks, subimages or windows

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Physiology (AREA)
  • Image Analysis (AREA)

Abstract

A method and system for detecting motion of an object in a scene from a video of the scene. The video is a video sequence that is divided into chunks, and each chunk comprises consecutive video frames. The method includes the following elements. A video of a scene is acquired, wherein the video comprises a sequence of images. Objects are tracked in the video, and for each object and each chunk of the video, the method further comprises: determining a sequence of contour images from video frames of the video sequence to represent motion data within a bounding box located around the object; using the bounding box to generate a cropped contour image and a cropped image for one or more images in the respective chunk; and passing the cropped contour image and the cropped image to a recurrent neural network (RNN), which outputs relative scores for each action of interest.

Description

Method and system for detecting motion
Technical Field
The present disclosure relates generally to computer vision and camera surveillance applications, and more particularly, to detecting instances of objects (e.g., people) in a video that perform a particular action of interest by using a sequence of contour images computed from frames of the video sequence to represent motion information.
Background
In computer vision and camera surveillance applications, a common problem is identifying and detecting specific actions performed by objects such as people, machinery, vehicles, robots, etc. Much work has been done on the general problem of analyzing motion in video, but most prior art work has focused on motion recognition, rather than motion detection.
Motion recognition refers to classifying (i.e., recognizing) which motion is being performed in a video segment that has been temporally cropped such that the segment begins at or near the beginning of the motion and ends at or near the end of the motion. We use the term temporally cropped to denote such video segments. Motion detection refers to the temporal or spatio-temporal localization of each occurrence of each motion from a set of known motion classes occurring in a long (i.e., not temporally cropped) video sequence.
A task related to motion recognition is activity recognition. In an activity recognition task, a video segment depicting an activity (e.g., a particular sport being played) is analyzed, and the goal is to determine which activity (e.g., which sport) is depicted in the video.
Fine-grained motion detection refers to motion detection in which the differences between the motion classes to be detected are small. For example, in a cooking scenario, detecting motions from a set of similar motions such as shredding, grinding, and peeling is an example of fine-grained motion detection. However, at least one drawback of prior art methods for motion detection is their relatively low accuracy. That is, prior art motion detection methods do not perform well enough for most computer vision and other applications.
The standard pipeline for most video analysis tasks, such as motion recognition, event detection, and video retrieval, is the computation of hand-crafted features, such as Histogram of Oriented Gradients (HOG), Motion Boundary Histogram (MBH), and Histogram of Optical Flow (HOF). Traditional methods rely on computationally expensive input representations (e.g., improved dense trajectories or dense optical flow), create Fisher vectors for individual video clips, and then perform classification using a support vector machine. However, a major drawback of these previous motion detection/recognition methods, among many, is that they rely on input and intermediate representations that are computationally very time consuming and require a large amount of memory to store. This makes these traditional methods infeasible for real-world motion detection applications.
Therefore, there is a need to develop motion detection methods that can efficiently detect motion in video in terms of time and memory requirements.
Disclosure of Invention
The present disclosure relates generally to computer vision and camera surveillance applications, and more particularly, to detecting instances of objects (e.g., people) in a video that perform a particular action of interest by using sequence representative motion information of contour images computed from frames of the video sequence.
The present disclosure provides methods and systems that overcome the problems of video analytics tasks, such as action recognition, event detection, and video retrieval, which rely on input representations and intermediate representations that are computationally very time consuming and also require a large amount of memory to store. In particular, the present disclosure describes motion detection methods and systems that can efficiently detect motion in video in terms of minimizing time-consuming computations and reducing memory storage/requirements.
In particular, the present disclosure is based on the recognition that using a sequence of contour images computed from frames of a video sequence to represent motion information can provide a fast and memory-efficient detector for actions and the like in a video. For example, the present disclosure addresses motion detection in videos by locating the occurrence of a particular motion in time (in which frames of the video it occurs) and space (where it occurs within the individual frames). We recognized through experimentation that we can detect motion in a video using a deep neural network with recurrent connections that takes as input a sequence of cropped images around an object (e.g., a person) and contour images representing motion within the cropped region across multiple frames. We found that previous methods, which use optical flow-based representations, are computationally expensive, i.e., they require time-consuming computations and large amounts of memory and storage. This makes these previous motion detection methods infeasible for real-world applications.
The present disclosure also includes an object/person tracker that can spatially locate where within a video frame an action occurs. We have found through experimentation that conventional methods of analyzing motion and appearance over an entire frame, rather than using trackers, use a large amount of information that is not relevant to the motion of interest. In addition, these methods do not have sufficiently detailed information from the areas that are most important to the task.
In addition, the present disclosure uses a multi-stream recurrent neural network (RNN) that learns features representing two important aspects, motion and appearance, and learns the important temporal dynamics over many video frames that distinguish different actions. For example, these methods and systems may be used to detect motion of objects in a video, where the objects may be humans, animals, machinery, vehicles, robots, industrial robots in a factory environment, and the like. The present disclosure provides more accurate motion detection for motions of objects occurring in video that is not temporally cropped.
Another aspect of the present disclosure includes using a Long Short Term Memory (LSTM) network included as one or more layers of RNNs that can learn patterns having a longer duration than can be learned using conventional RNNs. The present disclosure may provide better performance using bi-directional LSTM, which means that the present disclosure may use information from past and future video frames to detect motion.
To facilitate a further understanding of the present disclosure, we provide at least one method step comprising: motion of an object in a scene is detected from a video of the scene, where the video may be captured by a video device and the video itself may be a video sequence that is segmented into chunks, such that each chunk may comprise a succession of video frames.
For example, the method of the present disclosure may include the steps of: a video of a scene is acquired, wherein the video comprises a sequence of images. The video may be downloaded into the memory by the processor, wherein the processor accesses the memory to retrieve the video. A next step may include tracking objects in the video, and for each object and each chunk of the video, the method may further include the steps of: a sequence of contour images is determined from video frames of a video sequence to represent motion data within a bounding box located around an object. A next step may use the bounding box to generate a cropped contour image and a cropped image for one or more images in the respective chunks. Finally, a final step may pass the cropped contour image and the cropped image to a Recurrent Neural Network (RNN), which outputs a relative score for each action of interest.
It is envisaged that the output interface may be connected to a processor, wherein some or all of the data relating to the act of detecting an object in a scene from a video of the scene may be output.
According to an embodiment of the present disclosure, a method is provided for detecting motion of an object in a scene from a video of the scene, wherein the video may be a video sequence segmented into chunks and each chunk comprises consecutive video frames. The method comprises the following steps. A video of the scene is acquired, wherein the video comprises a sequence of images. Objects are tracked in the video, and for each object and each chunk of the video, the method further comprises the steps of: determining a sequence of contour images from video frames of the video sequence to represent motion data within a bounding box located around the object; using the bounding box to generate a cropped contour image and a cropped image for one or more images in the respective chunk; and passing the cropped contour image and the cropped image to a recurrent neural network (RNN), which outputs relative scores for each action of interest.
According to an embodiment of the present disclosure, a system detects a motion of interest of an object in a scene from a video of the scene, wherein the video is a video sequence of the scene divided into chunks and each chunk comprises consecutive video frames. The system includes a processor that acquires a video of the scene such that the video includes a sequence of images, wherein the processor is configured to track objects in the video and, for each object and each chunk of the video, to perform the following steps. A sequence of contour images is determined from video frames of the video sequence to represent motion information within a bounding box located around the object. The bounding box is used to generate a cropped contour image and a cropped image for one or more images in the respective chunk. The cropped contour image and the cropped image are passed to a recurrent neural network (RNN), which outputs relative scores for each action of interest.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium has embodied thereon a program executable by a computer to perform a method of detecting an action of interest of an object in a scene from a video of the scene. Where the video may be a video sequence of a scene that is partitioned into chunks, such that each chunk comprises a succession of video frames. The method includes acquiring, by a processor, a video of a scene, wherein the video may include a sequence of images. Tracking, by a processor, objects in the video, and for each object and each chunk of the video, the processor is configured to: determining a sequence of contour images from video frames of the video sequence within a bounding box located around the object; generating a cropped contour image and a cropped image for one or more images in the respective chunks using the bounding box; and passing the cropped contour image and the cropped image to a Recurrent Neural Network (RNN), the RNN outputting relative scores for each motion of interest via an output interface in communication with the processor.
The presently disclosed embodiments will be further explained with reference to the drawings. The drawings shown are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the presently disclosed embodiments.
Drawings
Fig. 1A shows a block diagram of a method of detecting motion of an object in a scene from a video of the scene according to an embodiment of the present disclosure.
Fig. 1B is a schematic diagram illustrating some components of the method of fig. 1A detecting motion of objects in a scene from a video of the scene according to some embodiments of the present disclosure.
Fig. 2 is a schematic diagram illustrating a Recurrent Neural Network (RNN) including a multi-stream Convolutional Neural Network (CNN) as its initial layer and a Long Short Term Memory (LSTM) network as its final layer, according to some embodiments of the present disclosure.
Fig. 3A gives an example of a contour image by showing an input image from a sequence of images according to some embodiments of the present disclosure.
Fig. 3B gives an example of a contour image by showing a contour image determined from an input image according to some embodiments of the present disclosure.
Fig. 4 is a schematic diagram illustrating an LSTM unit according to some embodiments of the present disclosure.
Fig. 5 is a schematic diagram of at least one method and system of detecting motion of an object according to an embodiment of the present disclosure.
Fig. 6 is a block diagram illustrating the method of fig. 1A, which may be implemented using alternative computer or processor configurations, according to an embodiment of the present disclosure.
Detailed Description
While the above-identified drawing figures set forth embodiments of the present disclosure, other embodiments are also contemplated, as noted in the discussion. The present disclosure presents illustrative embodiments by way of representation and not limitation. Numerous other modifications and embodiments can be devised by those skilled in the art which fall within the scope and spirit of the principles of the presently disclosed embodiments.
The following description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the following description of the exemplary embodiments will provide those skilled in the art with a description that will enable one or more exemplary embodiments to be implemented. Various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the disclosed subject matter as set forth in the appended claims. In the following description, specific details are given to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, systems, processes, and other elements of the disclosed subject matter may be shown in block diagram form as components in order to avoid obscuring the embodiments in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments. Moreover, like reference numbers and designations in the various drawings indicate like elements.
Furthermore, various embodiments may be described as a process which is depicted as a flowchart, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. Additionally, the order of the operations may be rearranged. A process may terminate when its operations are completed, but may have additional steps not discussed or included in the figures. Moreover, not all operations in any specifically described process may be present in all embodiments. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When the procedure corresponds to a function, the termination of the function may correspond to a return of the function to the calling function or the main function.
Moreover, embodiments of the disclosed subject matter can be implemented, at least in part, manually or automatically. May be implemented or at least assisted manually or automatically using a machine, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine readable medium. The processor may perform the required tasks.
Overview
The present disclosure relates generally to computer vision and camera surveillance applications, and more particularly, to detecting instances of objects (e.g., people) in a video that perform a particular action of interest by using a sequence of contour images computed from frames of the video sequence to represent motion information.
The present disclosure provides methods and systems that overcome the problems of video analytics tasks, such as action recognition, event detection, and video retrieval, which rely on input representations and intermediate representations that are computationally very time consuming and also require a large amount of memory to store. In particular, the present disclosure describes motion detection methods and systems that can efficiently detect motion in video in terms of minimizing time-consuming computations and reducing memory storage/requirements.
In particular, the present disclosure is based on the recognition that using a sequence of contour images computed from frames of a video sequence to represent motion information can provide a fast and memory-efficient detector for actions and the like in a video. For example, the present disclosure addresses motion detection in videos by locating the occurrence of a particular motion in time (in which frames of the video it occurs) and space (where it occurs within the individual frames). We recognized through experimentation that we can detect motion in a video using a deep neural network with recurrent connections that takes as input a cropped image around an object (e.g., a person) and a sequence of contour images representing motion within the cropped region across multiple frames. We found that previous methods, which use optical flow-based representations, are computationally expensive, i.e., they require time-consuming computations and large amounts of memory and storage. This makes these previous motion detection methods infeasible for real-world applications.
The present disclosure also includes an object/person tracker that can spatially locate where within a video frame an action occurs. We have found through experimentation that conventional methods of analyzing motion and appearance over an entire frame, rather than using trackers, use a large amount of information that is not relevant to the motion of interest. In addition, these methods do not have sufficiently detailed information from the areas that are most important to the task.
In addition, the present disclosure uses a multi-stream recurrent neural network (RNN) that learns features representing two important aspects, motion and appearance, and learns the important temporal dynamics over many video frames that distinguish different actions. For example, these methods and systems may be used to detect motion of objects in a video, where the objects may be humans, animals, machinery, vehicles, robots, industrial robots in a factory setting, and the like. The present disclosure provides more accurate motion detection for motions of objects occurring in video that is not temporally cropped.
Another aspect of the present disclosure includes using a Long Short Term Memory (LSTM) network included as one or more layers of RNNs that can learn patterns having a longer duration than can be learned using conventional RNNs. The present disclosure may provide better performance using bi-directional LSTM, which means that the present disclosure may use information from past and future video frames to detect motion.
Method and system
Fig. 1A illustrates a block diagram of a method 100 of detecting motion of an object in a scene from a video of the scene according to an embodiment of the disclosure. The video may be a video sequence that is partitioned into chunks such that each chunk comprises a succession of video frames. An initial step 120 includes acquiring, by the processor 110, a video of a scene, wherein the video includes a sequence of images.
Step 122 includes tracking objects in the video and, for each object and each chunk of the video, further includes: determining a sequence of contour images from video frames of the video sequence to represent motion data within a bounding box located around the object, step 125; and a step 127 of generating a trimming contour image and a trimming image for one or more images in the respective blocks using the bounding box.
Finally, step 128 includes passing the cropped contour image and the cropped image to a Recurrent Neural Network (RNN), which outputs a relative score for each action of interest.
Fig. 1B is a schematic diagram illustrating components of the method 100 of fig. 1A detecting motion of an object in a scene from a video of the scene in accordance with an embodiment of the present disclosure. In particular, fig. 1B illustrates the basic operations of the method 100 of detecting motion of an object 107 in a scene 105 (e.g., detecting a person in the scene performing a particular motion). Video data 108 of a scene 105 from a video camera 104 is acquired 120 as a sequence of images 115, wherein each image comprises pixels. A scene may include one or more objects 107 that perform an action, such as a person running up a staircase or some other action. The video data is acquired by the processor 110. Further, one or more objects 107 are tracked 122, and bounding boxes 123 for each tracked object 107 are estimated in each chunk of the video image. For example, a chunk may be a sequence of six consecutive images, less than six images, or more than six images.
The image is cropped to the extent of the bounding box 123 and the sequence of contour images is computed 125 and cropped to the extent of the bounding box 123. The resulting cropped contour image and cropped image 127 are passed to a Recurrent Neural Network (RNN)130, the RNN 130 having been trained to output relative scores 140 for the respective actions of interest. These steps may be performed in a processor 110 connected to a memory (not shown).
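For illustration, the per-chunk processing just described could be organized as in the following Python sketch. It is only an outline under stated assumptions: the tracker, the contour-image routine, the crop helper, and the pre-trained RNN are hypothetical placeholders (not names defined in this disclosure), and a chunk size of six frames is used as in the example above.

# Minimal sketch of the per-chunk detection loop of Fig. 1B.
# `tracker`, `contour_fn`, `crop`, and `rnn` are hypothetical placeholders for an
# object tracker, a contour-image routine, a bounding-box crop, and a pre-trained
# recurrent network; they are not defined in this disclosure.

def chunked(frames, chunk_size=6):
    """Split the frame sequence into consecutive chunks (e.g., six frames each)."""
    for i in range(0, len(frames) - chunk_size + 1, chunk_size):
        yield frames[i:i + chunk_size]

def detect_actions(frames, tracker, contour_fn, crop, rnn, chunk_size=6):
    """Yield (object id, chunk index, per-action scores) for every tracked object."""
    for k, chunk in enumerate(chunked(frames, chunk_size)):
        for obj in tracker.update(chunk):                    # one bounding box per object per chunk
            box = obj.bounding_box
            cropped_images = [crop(f, box) for f in chunk]
            cropped_contours = [crop(contour_fn(f), box) for f in chunk]
            scores = rnn(cropped_contours, cropped_images)   # relative score per action of interest
            yield obj.id, k, scores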
As described above, embodiments of the present disclosure provide methods and systems for detecting motion of an object in a video. Some embodiments include a training phase and a testing phase, wherein the training phase involves learning parameters of the RNN from training data. Some embodiments may include only a testing phase. For example, a method with only a test phase may be embedded in a small device using a pre-trained RNN.
Fig. 2 is a schematic diagram illustrating a Recurrent Neural Network (RNN) including a multi-stream Convolutional Neural Network (CNN) as its initial layer and a Long Short Term Memory (LSTM) network as its final layer according to an embodiment of the present disclosure.
For example, during the training phase, we train four independent Convolutional Neural Networks (CNNs) 220, as shown in fig. 2. Each CNN processes one of four streams 210: a motion stream 211 and an appearance stream 212 of video images cropped around the position of the tracked object, and a motion stream 213 and an appearance stream 214 of full-frame (not spatially cropped) video images. Some embodiments have only two streams: a motion stream 211 and an appearance stream 212 of video images cropped around the position of the tracked object. This may be useful, for example, in situations where the background scene is noisy, uninformative, or unrelated to the action being performed by the object.
Still referring to fig. 2, in some embodiments, each convolutional network (CNN) uses a VGG (visual geometry group) architecture. However, other CNN architectures may also be used for the individual flows, such as the AlexNet architecture or the ResNet architecture.
The four networks perform action classification tasks on successive small chunks 201 of the video 200. For example, each chunk may consist of six consecutive video frames. The CNNs are followed by a projection layer 230 and a Long Short-Term Memory (LSTM) unit 240, the projection layer 230 projecting the outputs of the CNNs of all streams into a single space. The output for each chunk is a detected action class 250 from the set of N action classes A1, A2, ..., AN.
Two Convolutional Neural Networks (CNNs), one for images and one for motion, are trained on chunks consisting of video frames cropped to the bounding box of the tracked object. Cropping the frames to the bounding box confines the input to the vicinity of the action, which helps classify the action. In some implementations, the bounding box has a fixed pixel size, which helps align objects across multiple performances of an action.
Still referring to fig. 2, in some preferred embodiments, two additional CNNs, one each for image and motion, are trained on chunks consisting of video frames that are not spatially cropped (i.e., each frame is a full frame of video, thus preserving the spatial context of the actions performed within the scene). We refer to this network as a multi-stream neural network because it has multiple (e.g., four) CNNs, each of which handles a different information stream from the video.
After the four networks 220 have been trained, we learn the fully-connected projection layer 230 on the outputs of the four networks to create a joint representation of these independent streams. In some embodiments where the CNN uses a VGG architecture, the output of the network is its fc7 tier output, where fc7 tier is the last fully connected tier in the VGG network. The full-length video 200 is provided to the multi-stream network as a time-series arrangement of chunks 201, and then the corresponding time-series of outputs of the projection layers are fed into a long-short term memory (LSTM) network 240. In some embodiments, the LSTM network operates in both directions, i.e., the LSTM network is bidirectional.
A bidirectional LSTM network consists of two directional LSTM networks (one connected forward in time and the other connected backward in time). In some embodiments, each of the two directional LSTM networks is followed by a fully connected layer (not shown in fig. 2 for clarity) over the hidden state of the respective directional LSTM network, followed by a softmax layer, to obtain an intermediate score corresponding to each action. Finally, the scores of the two directional LSTM networks are combined (e.g., averaged) to obtain a score for each particular action.
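To make this arrangement concrete, the following PyTorch-style sketch shows one possible realization of the four-stream network, the fully connected projection layer, and the bidirectional LSTM whose per-direction softmax scores are averaged. The backbone constructor, feature and layer sizes, and the number of action classes are illustrative assumptions, not values taken from this disclosure.

import torch
import torch.nn as nn

class MultiStreamActionDetector(nn.Module):
    """Illustrative sketch: four per-stream CNNs -> joint projection -> bidirectional LSTM -> averaged scores."""
    def __init__(self, backbone_fn, feat_dim=4096, proj_dim=512, hidden=256, n_actions=10):
        super().__init__()
        # One CNN per stream: cropped motion, cropped appearance, full-frame motion, full-frame appearance.
        self.streams = nn.ModuleList([backbone_fn() for _ in range(4)])
        self.projection = nn.Linear(4 * feat_dim, proj_dim)      # joint representation of all streams
        self.lstm = nn.LSTM(proj_dim, hidden, batch_first=True, bidirectional=True)
        self.head_fwd = nn.Linear(hidden, n_actions)              # classifier over forward hidden states
        self.head_bwd = nn.Linear(hidden, n_actions)              # classifier over backward hidden states

    def forward(self, stream_chunks):
        # stream_chunks: list of 4 tensors, each (batch, n_chunks, C, H, W), one per stream.
        feats = []
        for net, x in zip(self.streams, stream_chunks):
            b, t = x.shape[:2]
            f = net(x.flatten(0, 1)).view(b, t, -1)                # per-chunk fc7-like features
            feats.append(f)
        joint = torch.relu(self.projection(torch.cat(feats, dim=-1)))
        out, _ = self.lstm(joint)                                  # (batch, n_chunks, 2 * hidden)
        fwd, bwd = out.chunk(2, dim=-1)
        # Average the per-direction softmax scores to obtain a score for each action.
        scores = 0.5 * (self.head_fwd(fwd).softmax(-1) + self.head_bwd(bwd).softmax(-1))
        return scores                                              # per-chunk probability of each action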
Still referring to FIG. 2, there are a number of components in the action detection pipeline that are critical to achieving good performance. In this task, we use a model that characterizes the spatial and long-term temporal information present in the video.
The contour images determined using the bounding box provide a frame of reference that makes many actions easier to learn by removing positional variation from the input representation. However, some actions are location dependent. For scenes acquired using a static video camera, these actions always occur at the same image location. For example, in a cooking video, washing and rinsing are almost always performed near the sink, and opening a door will most likely be performed near the refrigerator or a cabinet. For these reasons, we train two separate deep networks on the cropped and un-cropped chunks of the contour images and video frames.
The first two CNNs are trained on images cropped using the box from the object tracker, which reduces background noise and provides an object-centered frame of reference for the contour images and image regions. The other two CNNs are trained on the entire (spatially full-frame) images to preserve the global spatial context.
Fig. 3A and 3B illustrate a contour image determined from an input image. The input image represents an image from a sequence of images. An object contour may be determined from the input image using an image processing algorithm (e.g., an algorithm using a deep neural network) to determine a contour image.
The contour image may be computed automatically from the input image and represents edges along the boundaries of the various objects in the image. Furthermore, the contour image does not represent the colors and textures within the input image, but only the object boundaries. A sequence of contour images thus contains only the information most relevant to the motion of objects in the corresponding image sequence: the object contours.
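As a simple illustration of such a contour image, the sketch below uses a standard Canny edge detector from OpenCV as a stand-in; the disclosure leaves the exact contour algorithm open (a deep neural network may be used instead), so this is only an assumed approximation that keeps boundaries and discards color and texture.

import cv2

# Illustrative stand-in for a contour-image computation; the disclosure does not
# specify Canny, and the thresholds below are arbitrary assumptions.
def contour_image(frame_bgr, low=100, high=200):
    """Approximate a contour image: object boundary edges only, no color or texture."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)   # suppress fine texture/noise before edge detection
    edges = cv2.Canny(blurred, low, high)         # binary image of object boundaries
    return edges

# Hypothetical usage: a per-frame contour image cropped to the tracked bounding box.
# frame = cv2.imread("frame_0001.png")            # hypothetical frame file
# x, y, w, h = 120, 80, 64, 128                   # hypothetical bounding box
# cropped_contour = contour_image(frame)[y:y + h, x:x + w]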
Since the action to be detected can have a wide range of durations, our method uses the LSTM network 140 to learn the duration and long-term temporal context of the action in a data-driven manner. Our results demonstrate that LSTM is effective in learning long-term temporal context for fine-grained motion detection.
Tracking for fine-grained motion detection
To provide a bounding box around the object for the location-independent (cropped) appearance and motion streams, any object tracking method may be used. In a preferred embodiment, we use a state-based tracker to spatially localize actions in the video. Keeping the size of the tracking bounding box fixed, we update the position of the bounding box to maximize the magnitude of the difference-image energy within the bounding box. If the magnitude of the difference-image energy is greater than a threshold, the position of the bounding box is updated to the position that maximizes it. Otherwise, the object is either moving slowly or not moving at all; in that case the bounding box from the previous chunk is used, i.e., the bounding box is not updated. The position of the bounding box is updated only after a chunk (e.g., six images) has been processed and the motion and appearance features for that chunk have been determined, to ensure that the bounding box is stationary over all images in the chunk.
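A minimal sketch of this bounding-box update is shown below, assuming grayscale frames and a simple local search; the difference-image definition, search radius, step, and threshold are illustrative assumptions rather than values specified in this disclosure.

import numpy as np

# Illustrative sketch of the state-based tracker update; parameters are assumptions.
def update_box(chunk_frames, box, box_size, energy_thresh, search_radius=16, step=4):
    """Update the fixed-size bounding box for one chunk of grayscale frames (T, H, W)."""
    # Difference image: accumulated absolute frame-to-frame differences over the chunk.
    diff = np.abs(np.diff(chunk_frames.astype(np.float32), axis=0)).sum(axis=0)
    h, w = box_size
    y0, x0 = box
    best_box, best_energy = box, -1.0
    # Search candidate positions near the previous box for maximum difference-image energy.
    for dy in range(-search_radius, search_radius + 1, step):
        for dx in range(-search_radius, search_radius + 1, step):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + h > diff.shape[0] or x + w > diff.shape[1]:
                continue
            energy = diff[y:y + h, x:x + w].sum()
            if energy > best_energy:
                best_energy, best_box = energy, (y, x)
    # If the motion energy is below the threshold, the object is (nearly) still: keep the old box.
    return best_box if best_energy > energy_thresh else box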
Our tracking method can be applied effectively when the camera is stationary and a reasonable estimate of the object size is available. This is a practical assumption for many videos taken in retail stores, private homes, or surveillance settings in which fine-grained motion detection is likely to be used. For more difficult tracking situations, a more sophisticated tracker may be used.
In a preferred embodiment, the bounding box is a rectangular region containing the object, but the bounding box need not be rectangular. More generally, a bounding box is an area of any shape that contains or substantially contains an object to be tracked and may additionally contain a smaller area surrounding the object.
Motion detection over long sequences using bidirectional LSTM networks
Fig. 4 is a schematic diagram illustrating an LSTM unit according to some embodiments of the present disclosure. We now provide a brief description of recurrent neural network (RNN) and long short-term memory (LSTM) units. Given an input sequence x = (x_1, ..., x_T), an RNN uses a hidden state representation h = (h_1, ..., h_T) so that the RNN can map the input x to the output sequence y = (y_1, ..., y_T).
To determine this representation, the RNN iterates the following recurrence equations:
h_t = g(W_xh x_t + W_hh h_{t-1} + b_h),   y_t = g(W_hy h_t + b_z),
where g is the activation function, W_xh is the weight matrix that maps the input to the hidden state, W_hh is the transition matrix between hidden states at two adjacent time steps, W_hy is the matrix that maps the hidden state h to the output y, and b_h and b_z are bias terms.
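In code, this recurrence is simply a loop over time steps. The NumPy sketch below assumes tanh as the activation g and treats the weight matrices and biases as given.

import numpy as np

# Illustrative NumPy sketch of the RNN recurrence defined above (g assumed to be tanh).
def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_z, g=np.tanh):
    """Vanilla RNN: h_t = g(W_xh x_t + W_hh h_{t-1} + b_h), y_t = g(W_hy h_t + b_z)."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x_t in x_seq:                        # x_seq: iterable of input vectors x_1, ..., x_T
        h = g(W_xh @ x_t + W_hh @ h + b_h)   # hidden state update
        ys.append(g(W_hy @ h + b_z))         # output at time t
    return np.stack(ys)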
Still referring to FIG. 4, unlike Hidden Markov Models (HMMs), which use a discrete hidden state representation, recurrent neural networks use a continuous-space representation of the hidden state. However, it is difficult to train RNNs to learn long-term sequence information, because training requires unrolling the network using backpropagation through time. This leads to the vanishing or exploding gradient problem.
To avoid this problem, the LSTM unit, shown in fig. 4, has a memory cell c_t and a forget gate f_t that help the LSTM learn when to retain the previous state and when to forget it. This enables the LSTM network to learn long-term temporal information. The weight update equations for the LSTM unit are as follows:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
g_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)
c_t = f_t c_{t-1} + i_t g_t
h_t = o_t tanh(c_t)
where σ is the sigmoid function, tanh is the hyperbolic tangent function, and i_t, f_t, o_t, and c_t are the input gate, forget gate, output gate, and memory cell activation vectors, respectively.
The forget gate f_t determines when (and which) information is cleared from the memory cell c_t. The input gate i_t determines when (and which) new information is incorporated into the memory. The tanh layer g_t generates a set of candidate values that are added to the memory cell when the input gate allows it.
Still referring to FIG. 4, the memory cell c_t is updated based on the forget gate f_t, the input gate i_t, and the new candidate values g_t. The output gate o_t determines which information in the memory cell is used as the hidden state representation. The hidden state is the product of the output gate and a function of the memory cell state.
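These update equations translate directly into code. The NumPy sketch below implements a single LSTM step, using elementwise products for the gating terms, with the weight matrices and biases supplied in a parameter dictionary (an illustrative convention, not one defined in this disclosure).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sketch of one LSTM step following the gate equations above.
def lstm_step(x_t, h_prev, c_prev, p):
    """Single LSTM step; p is a dict of weight matrices and bias vectors (assumed convention)."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])   # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])   # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])   # output gate
    g_t = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])   # candidate values
    c_t = f_t * c_prev + i_t * g_t                                   # memory cell update
    h_t = o_t * np.tanh(c_t)                                         # hidden state
    return h_t, c_t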
The LSTM architecture for RNNs has been used successfully for generating sentences from images, for video-to-text video description, and for speech recognition. However, for the motion recognition task, the performance of LSTM networks is still close to that of classifiers based on Fisher vectors generated over improved dense trajectories. RNNs using LSTM have not previously been used for motion detection from video, which is the focus of this disclosure, perhaps because of their modest performance on motion recognition from video.
In common motion recognition datasets, the video is temporally cropped to start and end at or near the start and end times of each motion. Temporally cropped videos are typically short (e.g., 2-20 seconds). Thus, in the action recognition task there is not enough long-term context to learn in a data-driven manner. Long-term context may include properties such as the expected duration of an action, which actions tend to follow or precede an action, and other long-term motion patterns that extend temporally beyond the boundaries of the action.
Still referring to fig. 4, LSTM networks have had little access to longer-term temporal context in the action recognition task. However, in fine-grained motion detection, the video duration is typically on the order of minutes or hours. Therefore, our key insight is that LSTM networks will be more suited to motion detection (to which we apply them) than to motion recognition (to which they were previously applied), because LSTMs model long-term temporal dynamics in a sequence.
A bidirectional LSTM network integrates information from both future and past chunks to form a prediction for each chunk in the video sequence. Therefore, we expect a bidirectional LSTM network to be better than a unidirectional LSTM at predicting the temporal boundaries (i.e., the start and end) of an action.
As described herein, the forward LSTM network and the backward LSTM network each generate a softmax score for each action category, and we average the softmax scores of the two LSTM networks to obtain a score (probability) for each action.
Although the LSTM network is trained on long sequences, backpropagation through time is performed using only short sub-sequences of chunks, up to a fixed number of steps. To preserve long-term context, we retain the hidden state of the last element of the previous chunk sub-sequence when training on the subsequent chunk sub-sequence.
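A sketch of this training detail is shown below for a unidirectional LSTM (the bidirectional case is analogous per direction), assuming PyTorch: backpropagation through time runs over short windows of chunks, and the final hidden and cell states of each window are detached and carried into the next window to preserve long-term context. The model, classifier, and tensor shapes are illustrative assumptions.

import torch

# Illustrative sketch; `lstm_model` is assumed to be an nn.LSTM with batch_first=True,
# `chunk_features` a (batch, n_chunks, feat) tensor, and `labels` per-chunk action labels.
def train_long_sequence(lstm_model, classifier, chunk_features, labels, optimizer,
                        bptt_steps=32, loss_fn=torch.nn.CrossEntropyLoss()):
    """Truncated BPTT over a long chunk sequence, carrying hidden state across windows."""
    state = None                                                  # (h, c) carried across windows
    for start in range(0, chunk_features.size(1), bptt_steps):
        window = chunk_features[:, start:start + bptt_steps]      # (batch, steps, feat)
        target = labels[:, start:start + bptt_steps]
        out, state = lstm_model(window, state)
        logits = classifier(out)                                  # (batch, steps, n_actions)
        loss = loss_fn(logits.flatten(0, 1), target.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Keep the last hidden/cell state but detach it so gradients do not flow across windows.
        state = tuple(s.detach() for s in state)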
Fig. 5 is a schematic diagram of at least one method and system of detecting motion of an object according to embodiments of the present disclosure. For example, as provided above, the training phase of the method involves training a Recurrent Neural Network (RNN). In the testing phase (i.e., motion detection), the RNN that has been trained is used to detect motion of the subject.
Fig. 5 illustrates the basic operations of a method and system 500 for detecting motion of an object (e.g., detecting a person in a scene performing a particular motion). For example, the method 500 may include at least one sensor 504 that generates input video data of a scene 505. The sensor 504 may be a video camera or some other device that generates input video data. It is contemplated that sensor 504 may collect other data, such as time, temperature, and other data related to scene 505.
The computer readable memory 512 of the computer 514 may store and/or provide the input video data 501 generated by the sensor 504. The sensor 504 collects input video data 501 of a scene 505, which may optionally be stored in an external memory 506 or may be sent directly to an input interface/preprocessor 507 and then to a processor 510.
Further, a video 501 of a scene 505 is acquired 520 as a sequence of images 515, wherein each image comprises pixels. The scene 505 may include one or more objects 507 that perform an action, for example, a person running up a staircase. Optionally, there may be an external memory 506 connected to an input interface/preprocessor 507 connected to the memory 512, which is connected to retrieve the video 520 as described above.
In addition, one or more objects are tracked 522, and bounding boxes 523 of the tracked objects are estimated in respective chunks of the video image. For example, as a non-limiting example, a chunk may be a sequence of six images.
The image is cropped to the extent of the bounding box and a contour image is computed 525 within the bounding box. The resulting cropped contour image and cropped image 527 are passed to a Recurrent Neural Network (RNN)550, the RNN 550 having been trained to output relative scores 560 for each action of interest.
In outputting the relative scores 560 for each action of interest, the output of the relative scores 560 may be stored in the memory 512 or output via the output interface 561. During processing, the processor 514 may communicate with the memory 512 to store or retrieve stored instructions or other data related to the processing.
FIG. 6 is a block diagram illustrating the method of FIG. 1A, which may be implemented using alternative computer or processor configurations, according to an embodiment of the present disclosure. The computer/controller 611 includes a processor 640, a computer readable memory 612, a storage 658, and a user interface 649 with a display 652 and a keyboard 651, connected by a bus 656. For example, the user interface 649, which is in communication with the processor 640 and the computer-readable memory 612, retrieves data and stores it in the computer-readable memory 612 when user input is received from a surface of the user interface 657, a keyboard surface.
It is contemplated that memory 612 may store instructions executable by a processor, historical data, and any data usable by the methods and systems of the present disclosure. Processor 640 may be a single-core processor, a multi-core processor, a computing cluster, or any number of other configurations. The processor 640 may be connected to one or more input devices and output devices by a bus 656. The memory 612 may include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, or any other suitable memory system.
Still referring to fig. 6, the storage 658 may be configured to store supplemental data and/or software modules used by the processor. For example, the storage 658 may store historical data as well as other relevant data as mentioned above with respect to the present disclosure. Additionally or alternatively, the storage 658 may store historical data similar to that as mentioned above with respect to the present disclosure. The storage 658 may include a hard disk drive, an optical disk drive, a thumb drive, an array of drives, or any combination thereof.
The system may optionally be linked by a bus 656 to a display interface (not shown) arranged to connect the system to a display device (not shown), which may include a computer monitor, camera, television, projector, mobile device, or the like.
The controller 611 may include a power supply 654, and the power supply 654 may optionally be located external to the controller 611, depending on the application. A user input interface 657, which is configured to be connected to a display device 648, can be linked via the bus 656, where the display device 648 can include a computer monitor, camera, television, projector, or mobile device, among others. Printer interface 659 may also be connected by bus 656 and configured to connect to printing device 632, where printing device 632 may include a liquid inkjet printer, a solid ink printer, a large commercial printer, a thermal printer, a UV printer, or a dye sublimation printer, among others. A Network Interface Controller (NIC)634 is provided to connect to the network 636 via the bus 656, where data or other data, etc. may be rendered on a third party display device, a third party imaging device, and/or a third party printing device external to the controller 611.
Still referring to fig. 6, data or other data or the like may be transmitted via a communication channel of the network 636 and/or stored within the storage system 658 for storage and/or further processing. Further, data or other data may be received wirelessly or hardwired from the receiver 646 (or an external receiver 638) or transmitted wirelessly or hardwired via the transmitter 647 (or an external transmitter 639), both the receiver 646 and the transmitter 647 being connected by a bus 656. Further, the GPS 601 may be connected to the controller 611 via a bus 656. The controller 611 may be connected to an external sensing device 644 and an external input/output device 641 via the input interface 608. The controller 611 can be connected to other external computers 642. Output interface 609 may be used to output processed data from processor 640.
Aspects of the present disclosure may also include a bidirectional long short-term memory (LSTM) network that manages data stored over time based on conditions, wherein the conditions include an input gate, a forget gate, and an output gate, so as to manage the stored data as it changes over time. The data stored over time may relate to the action of interest, such that the stored data includes historical properties of the expected duration of the action of interest, historical types of actions occurring after or before the action of interest, and historical long-term motion patterns that extend beyond the bounding box boundaries of the action of interest.
The above-described embodiments of the present disclosure may be implemented in any of numerous ways. For example, embodiments may be implemented using hardware, software, or a combination thereof. Use of ordinal terms such as "first" and "second" in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another, or the temporal order in which acts of a method are performed; such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for the use of the ordinal term).
Additionally, embodiments of the present disclosure may be embodied as a method, examples of which are provided. The actions performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed which perform acts in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.

Claims (20)

1. A method of detecting motion of an object in a scene from a video of the scene, such that the video is a video sequence of the scene divided into chunks, and each chunk comprises successive video frames, the method comprising the steps of:
obtaining, by a processor, the video of the scene, wherein the video comprises a sequence of images;
tracking, by the processor, the objects in the video, and for each object and each chunk of the video, further comprising:
determining a sequence of contour images from video frames of the video sequence to represent motion data within a bounding box located around the object;
generating a cropped contour image and a cropped image for one or more images in the respective chunks using the bounding box; and
the cropped contour image and the cropped image are passed to a recurrent neural network RNN, which outputs relative scores for each action of interest.
2. The method of claim 1, wherein the RNN comprises a convolutional neural network layer and one or more recurrent neural network layers.
3. The method of claim 2, wherein the convolutional neural network layer operates on a plurality of streams including a sequence of cropped contour images and the cropped images.
4. The method of claim 2, wherein the convolutional neural network layer operates on multiple streams including a sequence of cropped contour images and the cropped images, and contour images and images having a complete spatial extent of the video frame.
5. The method of claim 2, wherein the recurrent neural network layer comprises long-short term memory (LSTM) units.
6. The method of claim 5, wherein the recurrent neural network layer comprises bidirectional long-short term memory (LSTM) units.
7. The method of claim 1, wherein the object is one of a human, a robot, or an industrial robot.
8. The method of claim 7, further comprising a person detector and a person tracker.
9. The method of claim 8, wherein the person tracker identifies at least one bounding box around each person in the video.
10. The method of claim 9, wherein the video frames of the video sequence representing motion data of the object are within a plurality of bounding boxes positioned around the object over time.
11. The method of claim 1, wherein the bounding box is a region having a shape that encompasses at least a portion or all of the tracked object.
12. The method of claim 1, wherein the video is initially acquired in a form other than a sequence of images and converted to a sequence of images.
13. The method of claim 1, wherein the method is used for fine-grained motion detection in the video.
14. A method as claimed in claim 1, wherein the method comprises training the RNN prior to the detecting step, or the RNN has been pre-trained prior to acquiring the video of the scene.
15. The method of claim 1, wherein the detecting step comprises one of a temporal action detection or a space-time action detection.
16. A system for detecting a motion of interest of an object in a scene from a video of the scene, such that the video is a video sequence of the scene divided into chunks, and each chunk comprises successive video frames, the system comprising:
a processor acquires the video of the scene such that the video comprises a sequence of images, wherein the processor is configured to:
tracking the objects in the video, and for each object and each chunk of the video:
determining a sequence of contour images from video frames of the video sequence to represent motion information within a bounding box located around the object;
generating a cropped contour image and a cropped image for one or more images in the respective chunks using the bounding box; and
passing the cropped contour image and the cropped image to a recurrent neural network (RNN), which outputs relative scores for each action of interest.
17. The system of claim 16, wherein the RNN comprises a convolutional neural network layer and one or more recurrent neural network layers, such that the convolutional neural network layer operates on multiple streams comprising a sequence of cropped contour images and the cropped images.
18. The system of claim 16, wherein the recurrent neural network layer includes long-short term memory (LSTM) units.
19. A non-transitory computer readable storage medium having embodied thereon a program, the program being executable by a computer for performing a method of detecting, from a video of a scene, an action of interest of an object in the scene such that the video is a video sequence of the scene divided into chunks, and each chunk comprises successive video frames, the method comprising the steps of:
obtaining, by a processor, the video of the scene, wherein the video comprises a sequence of images;
tracking, by the processor, the object in the video, and for each object and each chunk of the video, the processor being configured to:
determine a sequence of contour images from video frames of the video sequence within a bounding box located around the object;
generate a cropped contour image and a cropped image for one or more images in the respective chunk using the bounding box; and
pass the cropped contour image and the cropped image to a recurrent neural network (RNN) that outputs, via an output interface in communication with the processor, relative scores for each action of interest.
20. The storage medium of claim 19, wherein the RNN comprises a convolutional neural network layer and one or more recurrent neural network layers, wherein the convolutional neural network layer operates on a plurality of streams comprising the sequence of cropped contour images and the cropped images.
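
Illustrative sketch (outside the claims): one way the relative scores output by the RNN (claims 1, 16, and 19) could be converted into temporal action detections as in claim 15, by thresholding per-chunk scores and merging consecutive above-threshold chunks into intervals. The threshold and chunk length below are assumed values, not parameters specified in the patent.

def temporal_detections(chunk_scores, action_names, threshold=0.5, chunk_len=6):
    """chunk_scores: list of per-chunk score vectors aligned with action_names."""
    detections = []
    for a, name in enumerate(action_names):
        start = None
        # Append a sentinel so an interval still open at the end gets closed.
        for t, scores in enumerate(chunk_scores + [None]):
            active = scores is not None and scores[a] > threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                # Report the interval in frame indices (chunk_len frames per chunk).
                detections.append((name, start * chunk_len, t * chunk_len))
                start = None
    return detections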
CN201880048903.3A 2017-08-07 2018-06-18 Method and system for detecting motion Active CN110998594B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/670,021 2017-08-07
US15/670,021 US10210391B1 (en) 2017-08-07 2017-08-07 Method and system for detecting actions in videos using contour sequences
PCT/JP2018/023910 WO2019031083A1 (en) 2017-08-07 2018-06-18 Method and system for detecting action

Publications (2)

Publication Number Publication Date
CN110998594A true CN110998594A (en) 2020-04-10
CN110998594B CN110998594B (en) 2024-04-09

Family

ID=62948285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880048903.3A Active CN110998594B (en) 2017-08-07 2018-06-18 Method and system for detecting motion

Country Status (5)

Country Link
US (1) US10210391B1 (en)
EP (1) EP3665613A1 (en)
JP (1) JP6877630B2 (en)
CN (1) CN110998594B (en)
WO (1) WO2019031083A1 (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10762637B2 (en) * 2017-10-27 2020-09-01 Siemens Healthcare Gmbh Vascular segmentation using fully convolutional and recurrent neural networks
WO2019097784A1 (en) * 2017-11-16 2019-05-23 ソニー株式会社 Information processing device, information processing method, and program
EP3495988A1 (en) 2017-12-05 2019-06-12 Aptiv Technologies Limited Method of processing image data in a connectionist network
US11501522B2 (en) * 2017-12-06 2022-11-15 Nec Corporation Image recognition model generating device, image recognition model generating method, and image recognition model generating program storing medium
US10762662B2 (en) * 2018-03-14 2020-09-01 Tata Consultancy Services Limited Context based position estimation of target of interest in videos
EP3561726A1 (en) 2018-04-23 2019-10-30 Aptiv Technologies Limited A device and a method for processing data sequences using a convolutional neural network
EP3561727A1 (en) * 2018-04-23 2019-10-30 Aptiv Technologies Limited A device and a method for extracting dynamic information on a scene using a convolutional neural network
US10795933B1 (en) * 2018-05-01 2020-10-06 Flock Group Inc. System and method for object based query of video content captured by a dynamic surveillance network
US11055854B2 (en) * 2018-08-23 2021-07-06 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning
CN110111358B (en) * 2019-05-14 2022-05-24 西南交通大学 Target tracking method based on multilayer time sequence filtering
US11663448B2 (en) 2019-06-28 2023-05-30 Conduent Business Services, Llc Neural network systems and methods for event parameter determination
WO2021055536A1 (en) * 2019-09-17 2021-03-25 Battelle Memorial Institute Activity assistance system
US11798272B2 (en) 2019-09-17 2023-10-24 Battelle Memorial Institute Activity assistance system
US11373407B2 (en) * 2019-10-25 2022-06-28 International Business Machines Corporation Attention generation
CN110826702A (en) * 2019-11-18 2020-02-21 方玉明 Abnormal event detection method for multitask deep network
CN111027510A (en) * 2019-12-23 2020-04-17 上海商汤智能科技有限公司 Behavior detection method and device and storage medium
CN111400545A (en) * 2020-03-01 2020-07-10 西北工业大学 Video annotation method based on deep learning
US11195039B2 (en) * 2020-03-10 2021-12-07 International Business Machines Corporation Non-resource-intensive object detection
CN111243410B (en) * 2020-03-20 2022-01-28 上海中科教育装备集团有限公司 Chemical funnel device construction experiment operation device and intelligent scoring method
CN113744373A (en) * 2020-05-15 2021-12-03 完美世界(北京)软件科技发展有限公司 Animation generation method, device and equipment
CN111881720B (en) * 2020-06-09 2024-01-16 山东大学 Automatic enhancement and expansion method, recognition method and system for data for deep learning
JP7472073B2 (en) 2021-04-26 2024-04-22 株式会社東芝 Training data generation device, training data generation method, and training data generation program
CN113362369A (en) * 2021-06-07 2021-09-07 中国科学技术大学 State detection method and detection device for moving object
CN115359059B (en) * 2022-10-20 2023-01-31 一道新能源科技(衢州)有限公司 Solar cell performance test method and system
CN117994850A (en) * 2024-02-26 2024-05-07 中国人民解放军军事科学院军事医学研究院 Behavior detection method, equipment and system for experimental animal

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4481663B2 (en) 2004-01-15 2010-06-16 キヤノン株式会社 Motion recognition device, motion recognition method, device control device, and computer program
US8345984B2 (en) 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
US9147260B2 (en) * 2010-12-20 2015-09-29 International Business Machines Corporation Detection and tracking of moving objects
CN103593661B (en) 2013-11-27 2016-09-28 天津大学 A kind of human motion recognition method based on sort method
JP6517681B2 (en) * 2015-12-17 2019-05-22 日本電信電話株式会社 Image pattern learning apparatus, method and program

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999007153A1 (en) * 1997-07-31 1999-02-11 Reality Fusion, Inc. Systems and methods for software control through analysis and interpretation of video information
US20020101932A1 (en) * 2000-11-29 2002-08-01 Montgomery Dennis L. Method and apparatus for encoding information using multiple passes and decoding in a single pass
WO2003036557A1 (en) * 2001-10-22 2003-05-01 Intel Zao Method and apparatus for background segmentation based on motion localization
CN101464952A (en) * 2007-12-19 2009-06-24 中国科学院自动化研究所 Abnormal behavior identification method based on contour
US20090278937A1 (en) * 2008-04-22 2009-11-12 Universitat Stuttgart Video data processing
CN101872418A (en) * 2010-05-28 2010-10-27 电子科技大学 Detection method based on group environment abnormal behavior
CN103377479A (en) * 2012-04-27 2013-10-30 索尼公司 Event detecting method, device and system and video camera
CN103824070A (en) * 2014-03-24 2014-05-28 重庆邮电大学 Rapid pedestrian detection method based on computer vision
US20160042621A1 (en) * 2014-06-13 2016-02-11 William Daylesford Hogg Video Motion Detection Method and Alert Management
CN104408444A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Human body action recognition method and device
CN105184818A (en) * 2015-09-06 2015-12-23 山东华宇航天空间技术有限公司 Video monitoring abnormal behavior detection method and detections system thereof
US20170083764A1 (en) * 2015-09-23 2017-03-23 Behavioral Recognition Systems, Inc. Detected object tracker for a video analytics system
CN105426820A (en) * 2015-11-03 2016-03-23 中原智慧城市设计研究院有限公司 Multi-person abnormal behavior detection method based on security monitoring video data
US20170199010A1 (en) * 2016-01-11 2017-07-13 Jonathan Patrick Baker System and Method for Tracking and Locating Targets for Shooting Applications
CN106952269A (en) * 2017-02-24 2017-07-14 北京航空航天大学 The reversible video foreground object sequence detection dividing method of neighbour and system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BHARAT SINGH; TIM K. MARKS; MICHAEL JONES; ONCEL TUZEL; MING SHAO: "A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection" *
DA-WEI KUO, GUAN-YU CHENG, SHYI-CHYI CHENG: "Detecting Salient Fragments for Video Human Action Detection and Recognition Using an Associative Memory" *
MING YANG; FENGJUN LV; WEI XU; KAI YU; YIHONG GONG: "Human action detection by boosting efficient motion features" *
LIU HUIZHEN; SHANG ZHENHONG: "Research on the Detection of Multiple Moving Objects" *
ZHANG JIE; WU JIANZHANG; TANG JIALI; FAN HONGHUI: "Human Action Recognition Method Based on Spatio-temporal Image Segmentation and Interaction Region Detection" *

Also Published As

Publication number Publication date
JP2020530162A (en) 2020-10-15
EP3665613A1 (en) 2020-06-17
CN110998594B (en) 2024-04-09
US10210391B1 (en) 2019-02-19
JP6877630B2 (en) 2021-05-26
WO2019031083A1 (en) 2019-02-14
US20190042850A1 (en) 2019-02-07

Similar Documents

Publication Publication Date Title
CN110998594B (en) Method and system for detecting motion
JP6625220B2 (en) Method and system for detecting the action of an object in a scene
CN108961312B (en) High-performance visual object tracking method and system for embedded visual system
CN107273782B (en) Online motion detection using recurrent neural networks
Wang et al. Hidden‐Markov‐models‐based dynamic hand gesture recognition
CN110287844B (en) Traffic police gesture recognition method based on convolution gesture machine and long-and-short-term memory network
JP4208898B2 (en) Object tracking device and object tracking method
Li et al. Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans
CN108446585A (en) Method for tracking target, device, computer equipment and storage medium
KR100421740B1 (en) Object activity modeling method
US9798923B2 (en) System and method for tracking and recognizing people
US20090296989A1 (en) Method for Automatic Detection and Tracking of Multiple Objects
KR102465960B1 (en) Multi-Class Multi-Object Tracking Method using Changing Point Detection
Rout A survey on object detection and tracking algorithms
CN117425916A (en) Occlusion aware multi-object tracking
CN112184767A (en) Method, device, equipment and storage medium for tracking moving object track
CN113869274B (en) Unmanned aerial vehicle intelligent tracking monitoring method and system based on city management
CN115035158A (en) Target tracking method and device, electronic equipment and storage medium
JP7450754B2 (en) Tracking vulnerable road users across image frames using fingerprints obtained from image analysis
Chen et al. Mode-based multi-hypothesis head tracking using parametric contours
Mohamed et al. Real-time moving objects tracking for mobile-robots using motion information
Chuang et al. Human Body Part Segmentation of Interacting People by Learning Blob Models
Ji et al. Visual-based view-invariant human motion analysis: A review
US20230206641A1 (en) Storage medium, information processing method, and information processing apparatus
Challa et al. Facial Landmarks Detection System with OpenCV Mediapipe and Python using Optical Flow (Active) Approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant