CN109670523B - Method for acquiring bounding box corresponding to object in image by convolutional neural network including tracking network and computing device using same


Info

Publication number
CN109670523B
CN109670523B (application CN201811191036.7A)
Authority
CN
China
Prior art keywords
bounding box
feature map
boxes
box
layer
Prior art date
Legal status
Active
Application number
CN201811191036.7A
Other languages
Chinese (zh)
Other versions
CN109670523A (en)
Inventor
金镕重
南云铉
夫硕焄
成明哲
吕东勋
柳宇宙
张泰雄
郑景中
诸泓模
赵浩辰
Current Assignee
Stradvision Inc
Original Assignee
Stradvision Inc
Priority date
Filing date
Publication date
Application filed by Stradvision Inc filed Critical Stradvision Inc
Publication of CN109670523A publication Critical patent/CN109670523A/en
Application granted granted Critical
Publication of CN109670523B publication Critical patent/CN109670523B/en

Classifications

    • G06V10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Learning methods
    • G06F16/55: Clustering; classification (information retrieval of still image data)
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/24: Classification techniques
    • G06F18/24133: Distances to prototypes
    • G06T7/20: Analysis of motion
    • G06T7/248: Analysis of motion using feature-based methods involving reference images or patches
    • G06V10/50: Feature extraction by performing operations within image blocks or by using histograms, e.g. histogram of oriented gradients [HoG]
    • G06V10/62: Extraction of features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V10/759: Region-based matching
    • G06V10/764: Recognition using classification, e.g. of video objects
    • G06V10/82: Recognition using neural networks
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V2201/07: Target detection
    • G06T2207/10016: Video; image sequence
    • G06T2207/10024: Color image
    • G06T2207/20021: Dividing image into blocks, subimages or windows
    • G06T2207/20081: Training; learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/30232: Surveillance
    • G06T2207/30236: Traffic on road, railway or crossing
    • G06T2210/12: Bounding box


Abstract

A method of acquiring a bounding box corresponding to an object is provided. The method comprises the following steps: (a) acquiring suggestion boxes; (b) selecting a specific suggestion box among the suggestion boxes by referring to (i) a result of comparing the distances between a reference bounding box and the suggestion boxes and/or (ii) a result of comparing scores representing whether the suggestion boxes include the object, and then setting the specific suggestion box as a starting region of a tracking box; (c) determining a specific region of the current frame as a target region of the tracking box by using a mean shift tracking algorithm; and (d) allowing a pooling layer to generate a pooled feature map by applying a pooling operation to a region corresponding to the specific region, and then allowing an FC layer to acquire the bounding box by applying a regression operation to the pooled feature map.

Description

Method for acquiring bounding box corresponding to object in image by convolutional neural network including tracking network and computing device using same
Technical Field
The present invention relates to a method of acquiring a bounding box corresponding to an object in a test image using a convolutional neural network (CNN) including a tracking network, and a test apparatus using the same; and more particularly to a method of acquiring at least one bounding box corresponding to at least one object in a test image by using a CNN including a tracking network, and a test apparatus performing the method, the method comprising the steps of: (a) if a feature map is generated by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, the test apparatus acquires, or supports another apparatus to acquire, the plurality of suggestion boxes; (b) the test apparatus selects, or supports another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of: (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then sets, or supports another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (c) by using the mean shift tracking algorithm, the test apparatus determines, or supports another apparatus to determine, a specific region of the current frame as a target region of the tracking box, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and (d) the test apparatus allows a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allows an FC layer to acquire the bounding box by applying a regression operation to the pooled feature map.
Background
In machine learning, convolutional neural networks (CNN or ConvNet) are a class of deep feed-forward artificial neural networks that have been successfully applied to analyze visual images.
Fig. 1 is a diagram schematically illustrating a learning process of a conventional CNN according to the related art.
Specifically, FIG. 1 illustrates a process of obtaining a loss by comparing a predicted bounding box with a ground truth (GT) bounding box. Here, the loss represents the difference between the predicted bounding box and the GT bounding box, and is denoted by dx_c, dy_c, dw, and dh, as shown in FIG. 1.
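The patent does not define the exact encoding of dx_c, dy_c, dw, and dh. The sketch below assumes the common R-CNN-style parameterization (center offsets normalized by the predicted box size, log-scale size ratios); all names are illustrative, not taken from the patent.

```python
import numpy as np

def bbox_deltas(pred, gt):
    """Compute (dx_c, dy_c, dw, dh) between a predicted box and a GT box.

    Boxes are (x_center, y_center, width, height). This follows the common
    R-CNN parameterization; the patent does not define the exact encoding,
    so treat this as an illustrative assumption.
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    dx_c = (gx - px) / pw   # horizontal center offset, normalized by width
    dy_c = (gy - py) / ph   # vertical center offset, normalized by height
    dw = np.log(gw / pw)    # log-scale width ratio
    dh = np.log(gh / ph)    # log-scale height ratio
    return dx_c, dy_c, dw, dh

print(bbox_deltas((50, 50, 20, 40), (54, 48, 24, 36)))
```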
First, as shown in FIG. 1, the learning device may acquire an RGB image as an input to be fed to a plurality of convolutional layers (i.e., convolution filters) included in a convolution block. As the RGB image passes through the plurality of convolutional layers, the spatial size (i.e., width and height) of the resulting feature maps becomes smaller while the number of channels increases.
As shown in FIG. 1, the learning device allows a region proposal network (RPN) to generate suggestion boxes from the final feature map output by the convolution block, and allows a pooling layer (e.g., an ROI pooling layer) to resize the region of the feature map corresponding to each suggestion box to a predetermined size (e.g., 2×2) by applying a max pooling operation (or an average pooling operation) to the pixel data of that region. A pooled feature map is thereby obtained. For reference, the pooled feature map may also be referred to as a feature vector. Here, the max pooling operation is an operation by which the maximum value in each of the sub-regions divided from the subject region on the feature map is selected as that sub-region's representative value, as shown in the lower right of FIG. 1.
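As a concrete illustration of the max pooling operation just described, the following sketch pools a single-channel ROI of a feature map down to a 2×2 grid by taking the maximum of each sub-region. It is a simplified stand-in for an ROI pooling layer, assuming the ROI is given in feature-map coordinates and is at least 2×2 in size.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the ROI of a single-channel feature map to out_size x out_size.

    feature_map: 2-D numpy array; roi: (x0, y0, x1, y1) in feature-map
    coordinates. Each output cell is the maximum of its sub-region, as in
    the max pooling operation described above. Illustrative sketch only.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)  # sub-region row bounds
    xs = np.linspace(0, w, out_size + 1).astype(int)  # sub-region col bounds
    pooled = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_max_pool(fmap, (1, 1, 7, 7)))  # 6x6 ROI pooled to 2x2
```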
Next, the pooled feature map may be fed to a fully connected (FC) layer.
The learning device may then allow the FC layer to identify the class of the object in the RGB image. In addition, a predicted bounding box in the RGB image may be obtained through the FC layer, and the loss may be obtained by comparing the predicted bounding box with the ground truth (GT) bounding box. Here, the GT bounding box is a bounding box that precisely surrounds the object in the RGB image, and it is generally prepared by a human annotator.
Finally, the learning device in FIG. 1 may adjust at least one of the parameters of the FC layer, the RPN, or the plurality of convolutional layers by using the loss during backpropagation.
Thereafter, a test apparatus (not shown) having a CNN with the adjusted parameters may acquire a bounding box surrounding an object in a test image. However, even with the adjusted parameters, it is difficult to obtain a bounding box that accurately surrounds the object in the test image.
The applicant of the present invention therefore proposes a method for acquiring at least one bounding box corresponding to at least one object in a test image with high accuracy.
Disclosure of Invention
An object of the present invention is to solve the above problems.
It is another object of the present invention to provide a method for acquiring a high-accuracy bounding box corresponding to an object in an image by using a tracking network included in a CNN.
It is a further object of the invention to more accurately track objects by using a mean shift tracking algorithm.
It is another object of the present invention to increase the reliability of the tracking result and to verify the result by having the tracking network reuse a classifier and a regressor of a detection network included in the CNN.
According to one aspect of the present invention, there is provided a method of acquiring at least one bounding box corresponding to at least one object in a test image by using a CNN including a tracking network, comprising the steps of: (a) if a feature map is generated by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, the test apparatus acquires, or supports another apparatus to acquire, the plurality of suggestion boxes; (b) the test apparatus selects, or supports another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then sets, or supports another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (c) by using the mean shift tracking algorithm, the test apparatus determines, or supports another apparatus to determine, a specific region of the current frame as a target region of the tracking box, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and (d) the test apparatus allows a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allows an FC layer to acquire the bounding box by applying a regression operation to the pooled feature map.
According to another aspect of the present invention, there is provided a method of acquiring a bounding box corresponding to an object in a test image by using a CNN including a tracking network and a detection network, comprising the steps of: (a) if a feature map is generated by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, the test apparatus acquires, or supports another apparatus to acquire, the plurality of suggestion boxes; (b) (b-1) the test apparatus selects, or supports another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then sets, or supports another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (b-2) the test apparatus sets, or supports another apparatus to set, at least some of the plurality of suggestion boxes that have not been set as tracking boxes as a plurality of untracked boxes; and (c) (c-1) after step (b-1), the test apparatus determines, or supports another apparatus to determine, a specific region of the current frame as a target region of the tracking box by using the mean shift tracking algorithm, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and allows a first pooling layer to generate a first pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allows an FC layer to obtain a first bounding box by applying a regression operation to the first pooled feature map; (c-2) after step (b-2), the test apparatus allows a second pooling layer to generate a second pooled feature map by applying a pooling operation to a region of the feature map corresponding to at least one of the plurality of untracked boxes; and, if the FC layer detects a new object by applying a classification operation to the second pooled feature map, the test apparatus allows the FC layer to acquire a second bounding box by applying a regression operation to the second pooled feature map.
According to another aspect of the present invention, there is provided a test apparatus for acquiring at least one bounding box corresponding to at least one object in a test image by using a CNN including a tracking network, comprising: a communication section for acquiring the test image or a feature map converted therefrom; and a processor for performing the following processes: (I) if the feature map is acquired by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, acquiring, or supporting another apparatus to acquire, the plurality of suggestion boxes; (II) selecting, or supporting another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then setting, or supporting another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (III) determining, or supporting another apparatus to determine, a specific region of the current frame as a target region of the tracking box by using the mean shift tracking algorithm, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and (IV) allowing a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allowing an FC layer to obtain the bounding box by applying a regression operation to the pooled feature map.
According to still another aspect of the present invention, there is provided a test apparatus for acquiring a bounding box corresponding to an object in a test image by using a CNN including a tracking network and a detection network, comprising: a communication section for acquiring the test image or a feature map converted therefrom; and a processor for performing the following processes: (I) if the feature map is generated by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, acquiring, or supporting another apparatus to acquire, the plurality of suggestion boxes; (II) (II-1) selecting, or supporting another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then setting, or supporting another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (II-2) setting, or supporting another apparatus to set, at least some of the plurality of suggestion boxes that have not been set as tracking boxes as a plurality of untracked boxes; and (III) (III-1) after process (II-1), determining, or supporting another apparatus to determine, a specific region of the current frame as a target region of the tracking box by using the mean shift tracking algorithm, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and allowing a first pooling layer to generate a first pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allowing an FC layer to obtain a first bounding box by applying a regression operation to the first pooled feature map; (III-2) after process (II-2), allowing a second pooling layer to generate a second pooled feature map by applying a pooling operation to a region of the feature map corresponding to at least one of the plurality of untracked boxes; and, if the FC layer detects a new object by applying a classification operation to the second pooled feature map, allowing the FC layer to obtain a second bounding box by applying a regression operation to the second pooled feature map.
Drawings
The following drawings are included to illustrate exemplary embodiments of the invention and are only a part of its preferred embodiments. Other drawings may be obtained from the drawings herein without inventive work by those skilled in the art. The above and other objects and features of the present invention will become apparent from the following description of the preferred embodiments given in conjunction with the accompanying drawings, in which:
fig. 1 is a diagram exemplarily illustrating a learning process of a conventional CNN according to the related art;
FIG. 2 is a block diagram schematically illustrating a test apparatus according to an example embodiment of the invention;
fig. 3A is a block diagram exemplarily showing a configuration of a CNN capable of acquiring a bounding box according to an exemplary embodiment of the present invention;
FIG. 3B is a flowchart illustrating a process of acquiring a bounding box by using a CNN including a tracking network, according to an example embodiment of the present invention;
fig. 4A is a block diagram exemplarily showing a configuration of a CNN capable of acquiring a bounding box according to another exemplary embodiment of the present invention;
fig. 4B is a flowchart illustrating a process of acquiring a bounding box by using a CNN including a tracking network and a detection network according to another exemplary embodiment of the present invention;
fig. 5 is a diagram illustrating a mean shift tracking algorithm used in the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clear, reference is made to the accompanying drawings, which illustrate, by way of example, more detailed embodiments in which the invention may be practiced. These preferred embodiments are described in sufficient detail to enable those skilled in the art to practice the invention.
It is to be understood that the various embodiments of the invention, although different, are not necessarily mutually exclusive. For example, a particular feature, structure, or characteristic described herein in connection with one embodiment may be implemented within other embodiments without departing from the spirit and scope of the invention. In addition, it is to be understood that the location or arrangement of individual elements within each disclosed embodiment may be modified without departing from the spirit and scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims, appropriately interpreted, along with the full range of equivalents to which the claims are entitled. In the drawings, like numerals refer to the same or similar functionality throughout the several views.
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement the present invention.
Fig. 2 is a block diagram schematically illustrating a test apparatus according to an exemplary embodiment of the present invention.
As shown in FIG. 2, the test apparatus 200 may include a communication section 210 and a processor 220, and may further include a database 230. Optionally, however, the test apparatus 200 may be configured without the database. Here, any digital computing device having at least one processor capable of performing the operations may be employed as the test apparatus 200 of the present invention.
The communication section 210 may be configured to acquire a test image or at least one feature map obtained therefrom.
The processor 220 may be configured to perform the following processes: (i) selecting at least one specific suggestion box among all the suggestion boxes generated by the RPN according to certain criteria, (ii) setting the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a tracking algorithm, e.g., a mean shift tracking algorithm, (iii) determining a specific region as a target region of the tracking box by using the mean shift tracking algorithm, and (iv) allowing a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allowing an FC layer to acquire a bounding box by applying a regression operation to the pooled feature map. Further details of these processes are described below.
Meanwhile, the database 230 may be accessed by the communication section 210 of the test apparatus 200, and information on the scores of the suggestion boxes, information on the reference bounding box of the object in the previous frame, information on the parameters of the CNN, and the like may be stored therein.
Fig. 3A is a block diagram exemplarily showing a configuration of a CNN capable of acquiring a bounding box according to an exemplary embodiment of the present invention, and fig. 3B illustrates a process of acquiring a bounding box by using a CNN including a tracking network according to an exemplary embodiment of the present invention.
For reference, the test apparatus 200 may comprise several separate digital computing devices to perform the functions or processes disclosed herein. Nevertheless, for ease of description and illustration, the present disclosure assumes that the test apparatus 200 is implemented as a single digital computing device.
Referring to FIG. 3A, if the input image 301 is transmitted to the convolution block 310, at least one feature map is generated by applying at least one convolution operation to the input image 301. The feature map is then forwarded to the RPN 320 to generate suggestion boxes.
After the suggestion boxes are sent to the tracking module 330, at least one tracking box is obtained from the tracking module 330 by: (i) selecting a particular suggestion box from the suggestion boxes according to certain criteria, and (ii) setting it as the starting region (i.e., the initial window) of the tracking box, wherein the starting region is used for the mean shift tracking algorithm. The criteria are described in detail later.
Next, the pooling layer 340 may receive (i) the feature map from the convolution block 310 and (ii) the information about the tracking box from the tracking module 330, and may generate a pooled feature map by applying a pooling operation to the region of the feature map corresponding to the tracking box (i.e., the ROI). Thereafter, the fully connected (FC) layer 350 may identify the object class 302 via the classifier 351 and may generate the bounding box 303 via the regressor 352 by using the pooled feature map.
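Putting the modules of FIG. 3A together, the overall forward pass of the tracking network can be sketched as below. The callables (conv_block, rpn, tracking_module, and so on) are placeholders for the patent's components, not an API the patent defines.

```python
def track_forward(frame, conv_block, rpn, tracking_module, pooling_layer,
                  fc_layer, prev_reference_box):
    """One forward pass of the tracking network of FIG. 3A (sketch only).

    Every callable here is an assumed stand-in for the corresponding
    module in the figure; the patent does not specify such an interface.
    """
    feature_map = conv_block(frame)                     # convolutional features
    proposals, scores = rpn(feature_map)                # suggestion boxes + scores
    start_region = tracking_module.select(proposals, scores, prev_reference_box)
    target_region = tracking_module.mean_shift(frame, start_region)
    pooled = pooling_layer(feature_map, target_region)  # ROI pooling on target region
    obj_class = fc_layer.classifier(pooled)
    bounding_box = fc_layer.regressor(pooled)           # becomes next frame's reference
    return obj_class, bounding_box
```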
Unlike the CNN shown in FIG. 3A, which includes the convolution block 310, the RPN 320, the tracking module 330, the pooling layer 340, and the FC layer 350, the CNN in the test apparatus 200 may, as the case may be, include only some of these components.
Referring to FIG. 3B, which follows the configuration shown in FIG. 3A, in step S311 the input image 301 (e.g., a test image with a size of 1280×720 and 3 channels) is transmitted to the convolution block 310; as a result, in step S312, a feature map 304 with a size of 40×23 and 256 channels may be generated by applying convolution operations to the input image 301. Here, the input image 301 may be regarded as the current frame.
For reference, convolution block 310 includes one or more convolution layers. The width and height of the input may be reduced by a specific ratio each time a convolution operation is applied, and the number of channels may be increased by a specific ratio, but is not limited thereto. Here, the specific ratio may be determined based on parameters (i.e., weights) of the convolution layers included in the convolution block 310.
Further, in step S321, the RPN 320 may generate information on the suggestion boxes 305 from the feature map 304. For reference, the suggestion boxes 305 are boxes each of which has a probability of including an object in the input image 301.
Further, in step S331, the processor 220 of the test apparatus 200 selects a specific suggestion box 306 among the plurality of suggestion boxes 305 by referring to at least one of (i) a result of comparing each distance between the reference bounding box 307 of the object in the previous frame and each of the suggestion boxes 305 and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes 305 includes the object; the processor 220 of the test apparatus 200 then sets the specific suggestion box 306 as the starting region of the tracking box. The starting region may be used for the mean shift tracking algorithm. Here, although the reference bounding box 307 of the object is located in the previous frame, FIG. 3B shows the reference bounding box 307 on the current frame for convenience of explanation.
For example, in step S331, the processor 220 of the test apparatus 200 may determine, as the specific suggestion box 306, the suggestion box with the smallest distance (e.g., L2 distance) to the reference bounding box 307 and/or the suggestion box with the highest score among the respective scores of the suggestion boxes 305.
As another example, the processor 220 of the test apparatus 200 may determine, as the specific suggestion box 306, the suggestion box with the smallest ratio of L2 distance to score.
Here, the score may be represented by the ratio of (i) the area of the intersection of the GT bounding box and each of the suggestion boxes 305 to (ii) the area of the union of the GT bounding box and each of the suggestion boxes 305, i.e., the intersection over union (IoU). Thus, the score is a value between 0 and 1, and the score of the specific suggestion box 306 selected among the suggestion boxes 305 may be close to 1.
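The two selection criteria above (center distance and IoU-style score) can be sketched as follows, assuming boxes are given as (x0, y0, x1, y1) corner coordinates; the distance-to-score ratio rule from the second example is included as a hypothetical helper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes: the score
    described above, i.e., intersection area / union area, in [0, 1]."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def l2_distance(box_a, box_b):
    """L2 distance between box centers (see step S331 and claim 4)."""
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def select_specific_proposal(proposals, scores, reference_box):
    """Pick the proposal minimizing the distance/score ratio, as in the
    second selection example above. Hypothetical helper for illustration."""
    ratios = [l2_distance(p, reference_box) / max(s, 1e-8)
              for p, s in zip(proposals, scores)]
    return proposals[ratios.index(min(ratios))]
```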
Meanwhile, the suggestion boxes 305 may all correspond to a single object. As another example, the suggestion boxes 305 may correspond to multiple objects: for instance, if 100 suggestion boxes are generated, 70 of them may be generated for object A and 30 for object B.
Then, in step S341, by using the mean shift tracking algorithm, the processor 220 of the test apparatus 200 may determine a specific region in the current frame as the target region 308 of the tracking box, where the target region 308 has probability information similar to the probability information corresponding to the pixel data of the object in the previous frame.
Here, the tracking module 330 may use a mean shift tracking algorithm, but it is apparent to those skilled in the art that the tracking algorithm should not be limited thereto.
In step S351, after the target region 308 of the tracking box is acquired, a pooled feature map (not shown), i.e., a feature vector, is generated by applying a pooling operation to the region of the feature map 304 corresponding to the target region 308; a bounding box is then generated by applying a regression operation to the feature vector through the regressor 352 in the FC layer 350, and the object class (e.g., vehicle, pedestrian, road, building, etc.) can be identified by applying a classification operation to the feature vector through the classifier 351 in the FC layer 350. A highly accurate bounding box can therefore be acquired via the CNN including the tracking network.
For reference, the test apparatus 200 may allow the tracking module 330 to precisely find the position of the tracking box in the current frame, and may instruct the FC layer 350 to refine the size of the tracking box, thereby obtaining the bounding box closest to the GT bounding box.
Finally, the test apparatus 200 may determine the bounding box as a reference bounding box for the tracking box of the object located in the next frame.
For reference, the configuration of fig. 3A may be referred to as a tracking network.
According to another example embodiment of the present invention, the CNN of the present invention may further comprise a detection network. Details regarding another example embodiment may be described below by way of illustration in fig. 4A and 4B.
Fig. 4A is a block diagram exemplarily showing a configuration of a CNN capable of acquiring a bounding box according to another exemplary embodiment of the present invention.
Referring to fig. 4A, the configuration of the CNN included in the test apparatus 200 may include a trace network and a detection network. The tracking network and the detection network may share the convolution block 310, RPN 320, and FC layer 460 with each other.
As the case may be, and unlike FIG. 4A, the CNN may include a separate FC layer for each network. That is, the CNN may not share adjusted parameters between the individual FC layers. Such a CNN may have a first FC layer for the tracking network and a second FC layer for the detection network.
For reference, in the claims of the present invention, the term "FC layer" is used without distinguishing between a first FC layer and a second FC layer, but this does not mean that the term excludes configurations in which the first FC layer and the second FC layer are separate.
Referring to FIG. 4A, the CNN may receive an input image and send it to the convolution block 310 to obtain a feature map. The feature map may then be relayed to the RPN 320 to generate suggestion boxes. For example, the number of suggestion boxes generated by the RPN 320 may be three hundred, but is not limited thereto.
Next, the tracking module 430 of the tracking network may receive information on the suggestion boxes and may allow trackers within the tracking module 430 to select at least one specific suggestion box among them by referring to the L2 distance and/or the score of each suggestion box, as described above. For example, if the number of specific suggestion boxes selected based on the distance from the reference bounding box of the previous frame is ten, information on the remaining unselected suggestion boxes (i.e., the untracked boxes) may be forwarded to the detection network; that is, information on the two hundred ninety untracked boxes is forwarded to the detection network.
Meanwhile, in the tracking network, a first pooled feature map is generated by the first pooling layer 440 applying a pooling operation to the region of the feature map corresponding to the specific suggestion box, and a first bounding box is generated by the FC layer 460 applying a regression operation to the first pooled feature map.
In the detection network, on the other hand, a second pooled feature map is generated by the second pooling layer 450 applying a pooling operation to the regions of the feature map corresponding to the untracked boxes, and a second bounding box is obtained by the FC layer 460 applying a regression operation to the second pooled feature map.
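The hand-off between the two networks can be sketched as below: proposals near a previous-frame reference box go to the tracking network, and the rest become untracked boxes for the detection network. The center-distance test and the threshold are illustrative assumptions; the patent only states that unselected suggestion boxes are set as untracked boxes.

```python
def box_center(box):
    """Center (x, y) of an (x0, y0, x1, y1) box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def split_proposals(proposals, scores, reference_boxes, dist_thresh=32.0):
    """Split RPN proposals into tracked boxes (near some previous-frame
    reference bounding box) and untracked boxes for the detection network.

    dist_thresh is an assumed hyperparameter, not a value from the patent.
    """
    tracked, untracked = [], []
    for p, s in zip(proposals, scores):
        px, py = box_center(p)
        near = any(((px - rx) ** 2 + (py - ry) ** 2) ** 0.5 < dist_thresh
                   for rx, ry in map(box_center, reference_boxes))
        (tracked if near else untracked).append((p, s))
    return tracked, untracked
```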
Fig. 4B is a flowchart illustrating a process of acquiring a bounding box by using a CNN including a tracking network and a detection network according to another exemplary embodiment of the present invention.
For reference, the processes of transmitting an input image in step S410, generating a feature map in step S420, and generating suggestion boxes in step S430 in FIG. 4B are the same as the processes S311, S312, and S321 described in FIG. 3B, since the CNN embodiments in FIGS. 3A and 4A share the same configuration for the convolution block 310 and the RPN 320.
However, the process in FIG. 4B differs from the process in FIG. 3B in that processing is performed not only on tracked boxes but also on untracked boxes. Here, the suggestion boxes that are not set as tracking boxes are set as untracked boxes, but the invention is not limited thereto; as another example, the untracked boxes may be selected among the suggestion boxes according to one or more particular conditions.
For reference, since the process of determining the tracking box by using the mean shift tracking algorithm has been described above, a detailed description thereof is omitted here.
In step S440, the test apparatus 200 may determine whether each of the suggestion boxes is a tracked box or an untracked box.
If it is determined in step S450 that a suggestion box is a tracking box, the test apparatus 200 may adjust the position of the tracking box by using the mean shift tracking algorithm in step S460. In detail, in step S460, the test apparatus 200 determines a specific region of the current frame, which has probability information similar to the probability information corresponding to the pixel data of the object in the previous frame, as the target region of the tracking box by using the mean shift tracking algorithm. Thereafter, in step S470, the test apparatus 200 generates a first pooled feature map by pooling the specific region through the first pooling layer 440, and then acquires a first bounding box by applying a regression operation to the first pooled feature map through the FC layer 460.
Otherwise, if it is determined in step S450 that a suggestion box is an untracked box, then in step S490 the test apparatus 200 may allow the second pooling layer 450 to generate a second pooled feature map by applying a pooling operation to a region of the feature map corresponding to at least one of the plurality of untracked boxes; and, if the FC layer 460 detects a new object by applying a classification operation to the second pooled feature map, the FC layer 460 is allowed to acquire a second bounding box by applying a regression operation to the second pooled feature map.
As another example, before step S490, in step S480, the test apparatus 200 may select at least one specific untracked box among the plurality of untracked boxes by referring to at least one of (i) each L2 distance between the reference bounding box acquired from the previous frame and each of the plurality of untracked boxes and (ii) each score, which is a probability value indicating whether each of the plurality of untracked boxes includes an object. If step S480 is performed, then in step S490 the test apparatus 200 may allow the second pooling layer 450 to generate the second pooled feature map by applying a pooling operation to the region of the feature map corresponding to the specific untracked box; and, if the FC layer 460 detects a new object by applying a classification operation to the second pooled feature map, the FC layer 460 is allowed to acquire the second bounding box by applying a regression operation to the second pooled feature map.
Here, the test apparatus 200 may determine the second bounding box corresponding to the new object as the reference bounding box NEW_REF for the tracking box of the new object in the next frame, and then set the tracking box in the next frame by referring to each distance between the reference bounding box NEW_REF and each of the plurality of suggestion boxes in the next frame.
For reference, the result of the classification operation provides information on the probabilities of the object having various identities; for example, the classification operation may output the probability that the object is a vehicle, a pedestrian, background, a road, etc.
Fig. 5 is a diagram for explaining a mean shift tracking algorithm used in the present invention.
Referring to FIG. 5, a histogram 520 of an object to be tracked (e.g., a vehicle) may be obtained from a specific region 510. The histogram is probability data obtained by counting, for each color, the number of pixels of that color included in the object region and dividing each count by the total number of pixels.
Once the histogram 520 is acquired, the input image 530 is back-projected to acquire a back-projected image 540. Here, back projection is the process of quantifying how strongly the color value of each pixel in the input image 530 belongs to the object to be tracked. If the model histogram is denoted H_m and the color value of each pixel x of the input image I (530) is denoted I(x), the back-projection value is given by w(x) = H_m(I(x)).
The mean shift tracking algorithm may be applied to the back-projection values. More specifically, since mean shift tracks an object in an image by iteratively moving a window from its current position toward the center of the local data distribution, it is used in the present invention to find the specific region in the current frame, moved from the starting region of the tracking box, that has probability information similar to the probability information corresponding to the pixel data of the object in the previous frame.
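A minimal sketch of this histogram / back-projection / mean-shift step, using OpenCV's standard primitives as a stand-in for the patent's tracker, might look as follows. The hue-only histogram and the termination criteria are assumptions of the sketch, not values given in the patent.

```python
import cv2

def mean_shift_step(prev_frame, start_window, cur_frame):
    """One mean-shift tracking step: build the hue histogram of the start
    region in the previous frame, back-project it onto the current frame,
    then run mean shift toward the region with the most similar color
    probability. Window format is (x, y, w, h); frames are BGR images.
    """
    x, y, w, h = start_window
    hsv_roi = cv2.cvtColor(prev_frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])  # H_m, model histogram
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    hsv = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)  # w(x) = H_m(I(x))
    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, target_window = cv2.meanShift(back_proj, start_window, term_crit)
    return target_window  # the target region of the tracking box
```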
For reference, the probability information corresponding to the pixel data of the object in the previous frame may be a histogram corresponding to the pixel data of the first bounding box and/or the second bounding box in the previous frame.
Meanwhile, at least one parameter of the CNN included in the test apparatus 200 may be adjusted by a learning apparatus (not shown) before the process of the test apparatus 200 is performed.
In detail, the test apparatus 200 may perform the above steps on the condition that the learning apparatus has completed the following processes: (i) allowing the convolutional layers to acquire a feature map for training from a training image including an object for training, (ii) allowing the RPN to acquire one or more suggestion boxes for training corresponding to the object for training in the training image, (iii) allowing the pooling layer to generate a pooled feature map for training corresponding to the suggestion boxes for training by applying a pooling operation, (iv) allowing the FC layer to acquire information on the pixel data of a bounding box for training by applying a regression operation to the pooled feature map for training, and (v) allowing a loss layer to acquire comparison data by comparing the information on the pixel data of the bounding box in the training image with the information on the pixel data of the bounding box in the GT image, thereby adjusting at least one parameter of the CNN during backpropagation by using the comparison data.
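The patent does not specify the form of the loss computed from the comparison data; a common choice for bounding-box regression, shown here purely as an illustration, is the smooth L1 loss over the (dx_c, dy_c, dw, dh) deltas.

```python
import numpy as np

def smooth_l1_loss(pred_deltas, gt_deltas):
    """Smooth L1 loss over (dx_c, dy_c, dw, dh). The patent only says a
    loss is obtained from comparison data, so this exact form is an
    assumption, not the patent's definition."""
    diff = np.abs(np.asarray(pred_deltas) - np.asarray(gt_deltas))
    per_term = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return per_term.sum()

print(smooth_l1_loss((0.1, -0.2, 0.05, 0.3), (0.0, 0.0, 0.0, 0.0)))
```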
For reference, the information on the pixel data of the bounding box for training may, as the case may be, be acquired through both the first FC layer and the second FC layer. As described above, if the tracking network including the first FC layer and the detection network including the second FC layer are configured as one network, it is not necessary to perform the learning process separately for the first FC layer and the second FC layer. In this case, the parameters of the first FC layer may have the same values as the parameters of the second FC layer.
According to the present invention, there is an effect of acquiring a high-precision bounding box corresponding to an object in an image.
According to the present invention, by using the mean shift tracking algorithm, there is an effect of tracking an object more accurately.
According to the present invention, by having the tracking network reuse the classifier and the regressor in the detection network included in the CNN, there is an effect of increasing the reliability of the tracking result and the verification result.
The embodiments of the present invention described above may be implemented in the form of executable program commands recordable to computer-readable media through various computer devices. The computer-readable media may include program commands, data files, and data structures, alone or in combination. The program commands recorded on the media may be components specially designed for the present invention or may be known and available to those skilled in the computer software arts. Computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM and DVD; magneto-optical media such as floptical disks; and hardware devices, such as ROM, RAM, and flash memory, specially configured to store and execute program commands. The program commands include not only machine-language code produced by a compiler but also high-level language code executable by a computing device through an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the technical features of the present invention, and vice versa.
As described above, the present invention has been explained with reference to specific matters such as detailed components, limited embodiments, and drawings. While the invention has been shown and described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.
Therefore, the inventive concept should not be limited to the explained embodiments, and the following patent claims, as well as everything including variations equal or equivalent to the patent claims, fall within the scope of the inventive concept.

Claims (18)

1. A method of acquiring at least one bounding box corresponding to at least one object in a test image by using a convolutional neural network (CNN) comprising a tracking network, comprising the steps of:
(a) a test device acquires a plurality of suggestion boxes, wherein a feature map is generated by applying a convolution operation to the test image as a current frame, and the test device outputs information on the plurality of suggestion boxes from a region proposal network (RPN) by using the feature map;
(b) the test apparatus selects at least one specific suggestion box among the plurality of suggestion boxes by referring to: (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes, wherein the suggestion box having the smallest distance is selected as the at least one specific suggestion box, or (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, wherein the suggestion box having the highest score is selected as the at least one specific suggestion box, and then sets the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm, and wherein the score is represented by the ratio of the area corresponding to the intersection of the region of a GT bounding box and the region of each of the suggestion boxes to the area corresponding to the union of the region of the GT bounding box and the region of each of the suggestion boxes;
(c) by using the mean shift tracking algorithm, the test device determines a specific region of the current frame as a target region of the tracking box, the specific region having probability information most similar to the probability information corresponding to the pixel data of the object in the previous frame;
(d) the test device allows a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allows a fully connected (FC) layer to acquire a bounding box by applying a regression operation to the pooled feature map; and
(e) The test device determines the bounding box as a reference bounding box for a tracking box of the object located in a next frame.
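For illustration only, the selection rule of step (b) above — pick the suggestion box nearest to the previous frame's reference bounding box, or the one with the highest intersection-over-union score — can be sketched in Python as follows. The function names and the (x1, y1, x2, y2) box convention are assumptions made for the sketch, not part of the claim.

    import numpy as np

    def iou(box_a, box_b):
        # Score of step (b): area of intersection over area of union of two
        # (x1, y1, x2, y2) boxes (box_a plays the role of the GT bounding box).
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def center(box):
        return np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

    def select_seed(suggestion_boxes, ref_box=None, scores=None):
        # Criterion (i): smallest L2 distance between center coordinates
        # (the distance defined in claim 4); criterion (ii): highest score.
        if ref_box is not None:
            dists = [np.linalg.norm(center(b) - center(ref_box)) for b in suggestion_boxes]
            return suggestion_boxes[int(np.argmin(dists))]
        return suggestion_boxes[int(np.argmax(scores))]

    # Example: seed the tracking box from three suggestion boxes.
    boxes = [(10, 10, 50, 60), (100, 90, 160, 170), (30, 20, 80, 90)]
    seed = select_seed(boxes, ref_box=(12, 12, 52, 62))  # -> (10, 10, 50, 60)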
2. The method of claim 1, wherein in the step (c), the information on a probability corresponding to pixel data of the object in the previous frame is a histogram corresponding to pixel data of the bounding box in the previous frame.
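To illustrate step (c) and claim 2, the following sketch uses a hue histogram in HSV space as the probability information and OpenCV's built-in mean shift. The choice of color space, the (x, y, w, h) window convention, and all names are assumptions for the sketch; the claims do not prescribe them.

    import cv2

    def track_with_mean_shift(prev_frame, cur_frame, start_region):
        # start_region: the specific suggestion box as (x, y, w, h), from step (b).
        x, y, w, h = start_region
        roi = prev_frame[y:y + h, x:x + w]

        # Histogram of the object's pixel data in the previous frame (claim 2),
        # serving as the "information on a probability".
        hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

        # Back-project the histogram onto the current frame; mean shift then
        # moves the window to the region whose probability map best matches.
        hsv_cur = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv_cur], [0], hist, [0, 180], 1)
        term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
        _, target_region = cv2.meanShift(back_proj, (x, y, w, h), term)
        return target_region  # target region of the tracking box, as (x, y, w, h)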
3. The method of claim 1, wherein in the step (b), if there are a plurality of objects, the test device selects each specific suggestion box among the plurality of suggestion boxes by referring to (i) a result of comparing each of the distances between the reference bounding box of each object in the previous frame and each of the plurality of suggestion boxes, and (ii) a result of comparing each score indicating the probability that each of the suggestion boxes includes each object, and then sets each of the specific suggestion boxes as each starting region of each of the tracking boxes.
4. The method of claim 1, wherein in the step (b), the distance between the reference bounding box of the object located in the previous frame and each of the plurality of suggestion boxes is the L2 distance between the center coordinates of the reference bounding box and the center coordinates of each of the plurality of suggestion boxes.
5. The method of claim 1, wherein the test device performs the steps (a) through (e) on condition that a learning device has completed the following processes: (i) allowing a convolutional layer to acquire a feature map for training from a training image including an object for training, (ii) allowing the RPN to acquire one or more suggestion boxes for training corresponding to the object for training in the training image, (iii) allowing the pooling layer to generate a pooled feature map for training by applying a pooling operation to a region corresponding to the suggestion boxes for training, (iv) allowing the FC layer to acquire information on pixel data of a bounding box for training by applying a regression operation to the pooled feature map for training, and (v) allowing a loss layer to acquire comparison data by comparing the information on the pixel data of the bounding box in the training image with information on pixel data of a bounding box in a GT image, thereby adjusting at least one parameter of the CNN during back propagation by using the comparison data.
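To make the learning process of claim 5 concrete, here is a minimal PyTorch sketch of steps (iv) and (v): an FC regression head, a loss layer comparing predictions with GT targets, and a backward pass adjusting the parameters. The layer sizes, the smooth L1 loss, and the box-delta targets are our assumptions; the claim does not fix them.

    import torch
    import torch.nn as nn

    class BoxRegressionHead(nn.Module):
        # Hypothetical FC layer of step (iv): pooled RoI features -> box deltas.
        def __init__(self, in_features=256 * 7 * 7):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(in_features, 1024), nn.ReLU(),
                nn.Linear(1024, 4),  # (dx, dy, dw, dh) regression output
            )

        def forward(self, pooled):  # pooled: (N, 256, 7, 7) training feature maps
            return self.fc(pooled.flatten(1))

    head = BoxRegressionHead()
    optimizer = torch.optim.SGD(head.parameters(), lr=1e-3)
    loss_layer = nn.SmoothL1Loss()  # compares prediction with GT (step (v))

    pooled_train = torch.randn(8, 256, 7, 7)  # stand-in pooled feature maps for training
    gt_targets = torch.randn(8, 4)            # stand-in GT regression targets

    optimizer.zero_grad()
    pred = head(pooled_train)             # step (iv): regression operation
    loss = loss_layer(pred, gt_targets)   # step (v): comparison data
    loss.backward()                       # back propagation
    optimizer.step()                      # adjusts the parameters of the head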
6. The method of claim 1, wherein in the step (d), the test device acquires the bounding box whose size is adjusted to correspond to the object in the test image through the process of generating the pooled feature map and then applying the regression operation via the FC layer.
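The size adjustment of claim 6 — the regression operation resizing the box to fit the object — is commonly realized by decoding predicted deltas against the proposal, for example with the R-CNN box-delta convention sketched below; the claim itself does not prescribe this parameterization.

    import math

    def apply_deltas(proposal, deltas):
        # proposal: (x1, y1, x2, y2); deltas: (dx, dy, dw, dh) from the FC layer.
        w = proposal[2] - proposal[0]
        h = proposal[3] - proposal[1]
        cx = proposal[0] + 0.5 * w
        cy = proposal[1] + 0.5 * h
        dx, dy, dw, dh = deltas
        # Shift the center, then rescale width and height: the "size adjusted
        # to correspond to the object" of claim 6.
        new_cx, new_cy = cx + dx * w, cy + dy * h
        new_w, new_h = w * math.exp(dw), h * math.exp(dh)
        return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)

    # Example: a proposal grows about 10% in width and shifts slightly right.
    print(apply_deltas((10, 10, 50, 60), (0.05, 0.0, 0.1, 0.0)))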
7. The method of claim 1, wherein the CNN further comprises a detection network, the method further comprising the steps of:
(f) after the step (b), the test device sets, as a plurality of untracked boxes, at least some of the plurality of suggestion boxes that have not been set as tracking boxes, wherein the pooling layer is a first pooling layer, the pooled feature map is a first pooled feature map, and the bounding box is a first bounding box; and
(g) after the step (d), the test device allows a second pooling layer to generate a second pooled feature map by applying a pooling operation to a region on the feature map corresponding to at least one of the plurality of untracked boxes; and, if the FC layer detects a new object by applying a classification operation to the second pooled feature map, the test device allows the FC layer to acquire a second bounding box by applying a regression operation to the second pooled feature map.
8. The method of claim 7, wherein in the step (g), the test device determines the second bounding box corresponding to the new object as a reference bounding box for a tracking box of the new object included in a next frame.
9. The method of claim 7, wherein in the step (f), at least one specific untracked box is selected among the plurality of untracked boxes by referring to at least one of (i) each of the L2 distances between the reference bounding box acquired from the previous frame and each of the plurality of untracked boxes and (ii) each score indicating the probability that each of the plurality of untracked boxes includes the object, and wherein in the step (g), the test device allows the second pooling layer to generate the second pooled feature map by applying a pooling operation to a region on the feature map corresponding to the specific untracked box and, if the FC layer detects the new object by applying a classification operation to the second pooled feature map, allows the FC layer to acquire the second bounding box by applying a regression operation to the second pooled feature map.
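The detection branch of claims 7 through 9 — pooling over an untracked box, classifying it, and regressing a second bounding box when a new object appears — might look like the following sketch. The RoI size, threshold, channel counts, and spatial scale are assumptions, and torchvision's roi_pool stands in for the second pooling layer.

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_pool

    feature_map = torch.randn(1, 256, 38, 50)                # backbone output for the current frame
    untracked = torch.tensor([[120.0, 40.0, 220.0, 180.0]])  # one untracked box (x1, y1, x2, y2)

    # Second pooling layer of step (g): pool the feature-map region of the box.
    rois = torch.cat([torch.zeros(1, 1), untracked], dim=1)  # prepend the batch index
    pooled2 = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)

    cls_head = nn.Linear(256 * 7 * 7, 2)  # classification operation: object vs. background
    reg_head = nn.Linear(256 * 7 * 7, 4)  # regression operation: box deltas

    flat = pooled2.flatten(1)
    object_prob = cls_head(flat).softmax(dim=1)[0, 1]
    if object_prob > 0.5:                 # a new object is detected
        # Decoded against the untracked box, these deltas yield the second
        # bounding box, the reference for the next frame (claim 8).
        second_box_deltas = reg_head(flat)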
10. A test apparatus for acquiring at least one bounding box corresponding to at least one object in a test image by using a convolutional neural network (CNN) including a tracking network, comprising:
a communication section for acquiring the test image or a feature map converted from the test image; and
a processor for performing the following processes: (I) acquiring a plurality of suggestion boxes, wherein the feature map is acquired by applying a convolution operation to the test image as a current frame, and information on the plurality of suggestion boxes is output from a region proposal network (RPN) by using the feature map; (II) selecting at least one specific suggestion box among the plurality of suggestion boxes by referring to (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes, wherein the suggestion box having the smallest distance is selected as the at least one specific suggestion box, or (ii) a result of comparing each score indicating the probability that each of the suggestion boxes includes the object, wherein the suggestion box having the highest score is selected as the at least one specific suggestion box, and then setting the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm, and wherein the score is represented by the ratio of the area corresponding to the intersection of a GT bounding box and each of the suggestion boxes to the area corresponding to the union of the GT bounding box and each of the suggestion boxes; (III) determining, by using the mean shift tracking algorithm, a specific region of the current frame as a target region of the tracking box, the specific region having information on a probability most similar to the information on a probability corresponding to the pixel data of the object in the previous frame; (IV) allowing a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allowing a fully connected (FC) layer to acquire a bounding box by applying a regression operation to the pooled feature map; and (V) determining the bounding box as a reference bounding box for a tracking box of the object located in a next frame.
11. The test apparatus of claim 10, wherein in the process (III), the information on a probability corresponding to the pixel data of the object in the previous frame is a histogram corresponding to pixel data of the bounding box in the previous frame.
12. The test apparatus of claim 10, wherein in the process (II), if there are a plurality of objects, the processor selects each specific suggestion box among the plurality of suggestion boxes by referring to (i) a result of comparing each of the distances between the reference bounding box of each object in the previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score indicating the probability that each of the suggestion boxes includes each object, and then sets each of the specific suggestion boxes as each starting region of each of the tracking boxes.
13. The test apparatus of claim 10, wherein in the process (II), the distance between the reference bounding box of the object located in the previous frame and each of the plurality of suggestion boxes is the L2 distance between the center coordinates of the reference bounding box and the center coordinates of each of the plurality of suggestion boxes.
14. The test apparatus of claim 10, wherein the processor performs the processes (I) through (V) on condition that a learning device has completed the following processes: (i) allowing a convolutional layer to acquire a feature map for training from a training image including an object for training, (ii) allowing the RPN to acquire one or more suggestion boxes for training corresponding to the object for training in the training image, (iii) allowing the pooling layer to generate a pooled feature map for training by applying a pooling operation to a region corresponding to the suggestion boxes for training, (iv) allowing the FC layer to acquire information on pixel data of a bounding box for training by applying the regression operation to the pooled feature map for training, and (v) allowing a loss layer to acquire comparison data by comparing the information on the pixel data of the bounding box in the training image with information on pixel data of a bounding box in a GT image, thereby adjusting at least one parameter of the CNN during back propagation by using the comparison data.
15. The test apparatus according to claim 10, wherein in the process (IV), the processor acquires the bounding box whose size is adjusted to correspond to the object in the test image by a process of generating the pooled feature map and then applying the regression operation through the FC layer.
16. The test apparatus of claim 10, further comprising a detection network, wherein the processor is further configured to perform the following processes: (VI) after the process (II), setting, as a plurality of untracked boxes, at least some of the plurality of suggestion boxes that have not been set as tracking boxes, wherein the pooling layer is a first pooling layer, the pooled feature map is a first pooled feature map, and the bounding box is a first bounding box; and (VII) after the process (IV), allowing a second pooling layer to generate a second pooled feature map by applying a pooling operation to a region on the feature map corresponding to at least one of the plurality of untracked boxes and, if the FC layer detects a new object by applying a classification operation to the second pooled feature map, allowing the FC layer to acquire a second bounding box by applying a regression operation to the second pooled feature map.
17. The test apparatus of claim 16, wherein in the process (VII), the processor determines the second bounding box corresponding to the new object as a reference bounding box for a tracking box of the new object included in a next frame.
18. The test apparatus of claim 16, wherein in the process (VI), at least one specific untracked box is selected among the plurality of untracked boxes by referring to at least one of (i) each of the L2 distances between the reference bounding box acquired from the previous frame and each of the plurality of untracked boxes and (ii) each score indicating the probability that each of the plurality of untracked boxes includes the object, and wherein in the process (VII), the processor allows the second pooling layer to generate the second pooled feature map by applying a pooling operation to a region on the feature map corresponding to the specific untracked box and, if the FC layer detects the new object by applying a classification operation to the second pooled feature map, allows the FC layer to acquire the second bounding box by applying a regression operation to the second pooled feature map.
CN201811191036.7A 2017-10-13 2018-10-12 Method for acquiring bounding box corresponding to object in image by convolution neural network including tracking network and computing device using same Active CN109670523B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/783,442 US9946960B1 (en) 2017-10-13 2017-10-13 Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
US15/783,442 2017-10-13

Publications (2)

Publication Number Publication Date
CN109670523A CN109670523A (en) 2019-04-23
CN109670523B true CN109670523B (en) 2024-01-09

Family

ID=61872587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811191036.7A Active CN109670523B (en) 2017-10-13 2018-10-12 Method for acquiring bounding box corresponding to object in image by convolution neural network including tracking network and computing device using same

Country Status (5)

Country Link
US (1) US9946960B1 (en)
EP (1) EP3471026B1 (en)
JP (1) JP6646124B2 (en)
KR (1) KR102192830B1 (en)
CN (1) CN109670523B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10078794B2 (en) * 2015-11-30 2018-09-18 Pilot Ai Labs, Inc. System and method for improved general object detection using neural networks
DE102018206110A1 (en) * 2018-04-20 2019-10-24 Zf Friedrichshafen Ag training methods
US10269125B1 (en) * 2018-10-05 2019-04-23 StradVision, Inc. Method for tracking object by using convolutional neural network including tracking network and computing device using the same
CN109635842A (en) * 2018-11-14 2019-04-16 平安科技(深圳)有限公司 A kind of image classification method, device and computer readable storage medium
CN109492697B (en) * 2018-11-15 2021-02-02 厦门美图之家科技有限公司 Picture detection network training method and picture detection network training device
US11087170B2 (en) * 2018-12-03 2021-08-10 Advanced Micro Devices, Inc. Deliberate conditional poison training for generative models
US10509987B1 (en) * 2019-01-22 2019-12-17 StradVision, Inc. Learning method and learning device for object detector based on reconfigurable network for optimizing customers' requirements such as key performance index using target object estimating network and target object merging network, and testing method and testing device using the same
US10423860B1 (en) * 2019-01-22 2019-09-24 StradVision, Inc. Learning method and learning device for object detector based on CNN to be used for multi-camera or surround view monitoring using image concatenation and target object merging network, and testing method and testing device using the same
US10387753B1 (en) * 2019-01-23 2019-08-20 StradVision, Inc. Learning method and learning device for convolutional neural network using 1×1 convolution for image recognition to be used for hardware optimization, and testing method and testing device using the same
US10445611B1 (en) * 2019-01-25 2019-10-15 StradVision, Inc. Method for detecting pseudo-3D bounding box to be used for military purpose, smart phone or virtual driving based-on CNN capable of converting modes according to conditions of objects and device using the same
US10402978B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Method for detecting pseudo-3D bounding box based on CNN capable of converting modes according to poses of objects using instance segmentation and device using the same
US10402686B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Learning method and learning device for object detector to be used for surveillance based on convolutional neural network capable of converting modes according to scales of objects, and testing method and testing device using the same
US10372573B1 (en) * 2019-01-28 2019-08-06 StradVision, Inc. Method and device for generating test patterns and selecting optimized test patterns among the test patterns in order to verify integrity of convolution operations to enhance fault tolerance and fluctuation robustness in extreme situations
US10803333B2 (en) 2019-01-30 2020-10-13 StradVision, Inc. Method and device for ego-vehicle localization to update HD map by using V2X information fusion
US10817777B2 (en) * 2019-01-31 2020-10-27 StradVision, Inc. Learning method and learning device for integrating object detection information acquired through V2V communication from other autonomous vehicle with object detection information generated by present autonomous vehicle, and testing method and testing device using the same
US10796206B2 (en) * 2019-01-31 2020-10-06 StradVision, Inc. Method for integrating driving images acquired from vehicles performing cooperative driving and driving image integrating device using same
US11010668B2 (en) * 2019-01-31 2021-05-18 StradVision, Inc. Method and device for attention-driven resource allocation by using reinforcement learning and V2X communication to thereby achieve safety of autonomous driving
CN112307826A (en) * 2019-07-30 2021-02-02 华为技术有限公司 Pedestrian detection method, device, computer-readable storage medium and chip
US11288835B2 (en) 2019-09-20 2022-03-29 Beijing Jingdong Shangke Information Technology Co., Ltd. Lighttrack: system and method for online top-down human pose tracking
KR20210061839A (en) * 2019-11-20 2021-05-28 삼성전자주식회사 Electronic apparatus and method for controlling thereof
EP4055561A4 (en) * 2019-11-20 2023-01-04 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Object detection device, method, and systerm
CN111223125B (en) * 2020-01-06 2023-05-09 江苏大学 Target motion video tracking method based on Python environment
CN111428566B (en) * 2020-02-26 2023-09-01 沈阳大学 Deformation target tracking system and method
CN111428567B (en) * 2020-02-26 2024-02-02 沈阳大学 Pedestrian tracking system and method based on affine multitask regression
KR20210114728A (en) * 2020-03-11 2021-09-24 연세대학교 산학협력단 Pixel Level Video Object Tracking Apparatus Using Box Level Object Position Information
CN111460926B (en) * 2020-03-16 2022-10-14 华中科技大学 Video pedestrian detection method fusing multi-target tracking clues
CN111539991B (en) * 2020-04-28 2023-10-20 北京市商汤科技开发有限公司 Target tracking method and device and storage medium
CN111696136B (en) * 2020-06-09 2023-06-16 电子科技大学 Target tracking method based on coding and decoding structure
KR102436197B1 (en) 2020-06-10 2022-08-25 한국기술교육대학교 산학협력단 Method for detecting objects from image
KR20220052620A (en) * 2020-10-21 2022-04-28 삼성전자주식회사 Object traking method and apparatus performing the same
CN112257810B (en) * 2020-11-03 2023-11-28 大连理工大学人工智能大连研究院 Submarine organism target detection method based on improved FasterR-CNN
CN113011331B (en) * 2021-03-19 2021-11-09 吉林大学 Method and device for detecting whether motor vehicle gives way to pedestrians, electronic equipment and medium
CN113420919B (en) * 2021-06-21 2023-05-05 郑州航空工业管理学院 Engineering anomaly control method based on unmanned aerial vehicle visual perception
CN113780477B (en) * 2021-10-11 2022-07-22 深圳硅基智能科技有限公司 Method and device for measuring fundus image based on deep learning of tight frame mark

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5216902B2 (en) * 2011-09-05 2013-06-19 日本電信電話株式会社 Object tracking device and object tracking method
US9965719B2 (en) * 2015-11-04 2018-05-08 Nec Corporation Subcategory-aware convolutional neural networks for object detection
US9858496B2 (en) * 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815579A (en) * 2017-01-22 2017-06-09 深圳市唯特视科技有限公司 A kind of motion detection method based on multizone double fluid convolutional neural networks model
CN106845430A (en) * 2017-02-06 2017-06-13 东华大学 Pedestrian detection and tracking based on acceleration region convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An object detection and tracking system for unmanned; Jian Yang; 《Target and Background Signatures III》; 2017-10-30; pp. 1-14 *
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks; Shaoqing Ren; 《arXiv:1506.01497v3》; 2016-01-30; pp. 1-9 *

Also Published As

Publication number Publication date
CN109670523A (en) 2019-04-23
EP3471026C0 (en) 2023-11-01
JP6646124B2 (en) 2020-02-14
KR20190041923A (en) 2019-04-23
US9946960B1 (en) 2018-04-17
KR102192830B1 (en) 2020-12-18
JP2019075116A (en) 2019-05-16
EP3471026B1 (en) 2023-11-01
EP3471026A1 (en) 2019-04-17

Similar Documents

Publication Publication Date Title
CN109670523B (en) Method for acquiring bounding box corresponding to object in image by convolution neural network including tracking network and computing device using same
CN109670573B (en) Learning method and learning device for adjusting parameter of CNN using loss increase, and test method and test device using the same
US10803364B2 (en) Control method, non-transitory computer-readable storage medium for storing control program, and control apparatus
US9953437B1 (en) Method and device for constructing a table including information on a pooling type and testing method and testing device using the same
US10269125B1 (en) Method for tracking object by using convolutional neural network including tracking network and computing device using the same
CN109598781B (en) Method for acquiring pseudo 3D frame from 2D bounding frame by regression analysis, learning apparatus and testing apparatus using the same
CN110751099B (en) Unmanned aerial vehicle aerial video track high-precision extraction method based on deep learning
JP2019036009A (en) Control program, control method, and information processing device
JP6700373B2 (en) Apparatus and method for learning object image packaging for artificial intelligence of video animation
US11450023B2 (en) Method and apparatus for detecting anchor-free object based on deep learning
CN110889421A (en) Target detection method and device
KR101991307B1 (en) Electronic device capable of feature vector assignment to a tracklet for multi-object tracking and operating method thereof
JP5777390B2 (en) Information processing method and apparatus, pattern identification method and apparatus
CN113253269B (en) SAR self-focusing method based on image classification
CN116596895A (en) Substation equipment image defect identification method and system
CN115116128A (en) Self-constrained optimization human body posture estimation method and system
US11074507B1 (en) Method for performing adjustable continual learning on deep neural network model by using selective deep generative replay module and device using the same
Fujita et al. Fine-tuned Surface Object Detection Applying Pre-trained Mask R-CNN Models
KR20230078134A (en) Device and Method for Zero Shot Semantic Segmentation
US11881016B2 (en) Method and system for processing an image and performing instance segmentation using affinity graphs
JP6554963B2 (en) Reference image generation method, image recognition program, image recognition method, and image recognition apparatus
JP2005071125A (en) Object detector, object detection method, object data selection program and object position detection program
KR102431425B1 (en) Scene classification method for mobile robot and object classification method
KR102546198B1 (en) Method for learning data classification based physical factor, and computer program recorded on record-medium for executing method thereof
KR102546193B1 (en) Method for learning data classification using color information, and computer program recorded on record-medium for executing method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant