CN109670523B - Method for acquiring bounding box corresponding to object in image by convolutional neural network including tracking network and computing device using same


Info

Publication number
CN109670523B
CN109670523B (application CN201811191036.7A)
Authority
CN
China
Prior art keywords
bounding box
feature map
boxes
box
layer
Prior art date
Legal status
Active
Application number
CN201811191036.7A
Other languages
Chinese (zh)
Other versions
CN109670523A (en)
Inventor
金镕重
南云铉
夫硕焄
成明哲
吕东勋
柳宇宙
张泰雄
郑景中
诸泓模
赵浩辰
Current Assignee
Stradvision Inc
Original Assignee
Stradvision Inc
Priority date
Filing date
Publication date
Application filed by Stradvision Inc filed Critical Stradvision Inc
Publication of CN109670523A publication Critical patent/CN109670523A/en
Application granted granted Critical
Publication of CN109670523B publication Critical patent/CN109670523B/en

Classifications

    • G06V10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Learning methods
    • G06F16/55: Clustering; classification (information retrieval of still image data)
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06F18/24: Classification techniques
    • G06F18/24133: Distances to prototypes
    • G06T7/20: Analysis of motion
    • G06T7/248: Analysis of motion using feature-based methods involving reference images or patches
    • G06V10/50: Feature extraction by performing operations within image blocks or by using histograms, e.g. histogram of oriented gradients [HoG]
    • G06V10/62: Extraction of features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V10/759: Region-based matching
    • G06V10/764: Recognition using classification, e.g. of video objects
    • G06V10/82: Recognition using neural networks
    • G06V20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V2201/07: Target detection
    • G06T2207/10016: Video; image sequence
    • G06T2207/10024: Color image
    • G06T2207/20021: Dividing image into blocks, subimages or windows
    • G06T2207/20081: Training; learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/30232: Surveillance
    • G06T2207/30236: Traffic on road, railway or crossing
    • G06T2210/12: Bounding box


Abstract

A method of acquiring a bounding box corresponding to an object is provided. The method comprises the following steps: (a) acquiring suggestion boxes; (b) selecting a specific suggestion box among the suggestion boxes by referring to (i) a result of comparing the distances between a reference bounding box and the suggestion boxes and/or (ii) a result of comparing scores representing whether the suggestion boxes include the object, and then setting the specific suggestion box as a starting region of a tracking box; (c) determining a specific region of the current frame as a target region of the tracking box by using a mean shift tracking algorithm; and (d) allowing a pooling layer to generate a pooled feature map by applying a pooling operation to a region corresponding to the specific region, and then allowing an FC layer to acquire the bounding box by applying a regression operation to the pooled feature map.

Description

Method for acquiring bounding box corresponding to object in image by convolutional neural network including tracking network and computing device using same
Technical Field
The present invention relates to a method of acquiring a bounding box corresponding to an object in a test image using a convolutional neural network (CNN) including a tracking network, and a test apparatus using the same; and more particularly to a method of acquiring at least one bounding box corresponding to at least one object in a test image by using a CNN including a tracking network, and a test apparatus performing the method, the method comprising the steps of: (a) if a feature map is generated by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, the test apparatus acquires, or supports another apparatus to acquire, the plurality of suggestion boxes; (b) the test apparatus selects, or supports another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of: (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then sets, or supports another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (c) by using the mean shift tracking algorithm, the test apparatus determines, or supports another apparatus to determine, a specific region of the current frame as a target region of the tracking box, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and (d) the test apparatus allows a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allows an FC layer to acquire the bounding box by applying a regression operation to the pooled feature map.
Background
In machine learning, convolutional neural networks (CNN or ConvNet) are a class of deep feed-forward artificial neural networks that have been successfully applied to analyze visual images.
Fig. 1 is a diagram schematically illustrating a learning process of a conventional CNN according to the related art.
Specifically, FIG. 1 illustrates a process of obtaining a loss by comparing a predicted bounding box with a ground truth (GT) bounding box. Here, the loss represents the difference between the predicted bounding box and the GT bounding box, and is denoted by dx_c, dy_c, dw, and dh, as shown in FIG. 1.
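The patent does not define the exact encoding of dx_c, dy_c, dw, and dh. The sketch below assumes the common R-CNN-style parameterization (center offsets normalized by the predicted box size, log-scale size ratios); all names are illustrative, not taken from the patent.

```python
import numpy as np

def bbox_deltas(pred, gt):
    """Compute (dx_c, dy_c, dw, dh) between a predicted box and a GT box.

    Boxes are (x_center, y_center, width, height). This follows the common
    R-CNN parameterization; the patent does not define the exact encoding,
    so treat this as an illustrative assumption.
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    dx_c = (gx - px) / pw   # horizontal center offset, normalized by width
    dy_c = (gy - py) / ph   # vertical center offset, normalized by height
    dw = np.log(gw / pw)    # log-scale width ratio
    dh = np.log(gh / ph)    # log-scale height ratio
    return dx_c, dy_c, dw, dh

print(bbox_deltas((50, 50, 20, 40), (54, 48, 24, 36)))
```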
First, as shown in FIG. 1, the learning device may acquire an RGB image as an input to be fed to a plurality of convolutional layers (i.e., convolution filters) included in a convolution block. As the RGB image passes through the plurality of convolutional layers, the spatial size (i.e., width and height) of the resulting feature maps becomes smaller while the number of channels increases.
As shown in FIG. 1, the learning device allows a region proposal network (RPN) to generate suggestion boxes from the final feature map output by the convolution block, and allows a pooling layer (e.g., an ROI pooling layer) to resize the region of the feature map corresponding to each suggestion box to a predetermined size (e.g., 2×2) by applying a max pooling operation (or an average pooling operation) to the pixel data of that region. A pooled feature map is thereby obtained. For reference, the pooled feature map may also be referred to as a feature vector. Here, the max pooling operation is an operation by which the maximum value in each of the sub-regions divided from the subject region on the feature map is selected as that sub-region's representative value, as shown in the lower right of FIG. 1.
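As a concrete illustration of the max pooling operation just described, the following sketch pools a single-channel ROI of a feature map down to a 2×2 grid by taking the maximum of each sub-region. It is a simplified stand-in for an ROI pooling layer, assuming the ROI is given in feature-map coordinates and is at least 2×2 in size.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the ROI of a single-channel feature map to out_size x out_size.

    feature_map: 2-D numpy array; roi: (x0, y0, x1, y1) in feature-map
    coordinates. Each output cell is the maximum of its sub-region, as in
    the max pooling operation described above. Illustrative sketch only.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)  # sub-region row bounds
    xs = np.linspace(0, w, out_size + 1).astype(int)  # sub-region col bounds
    pooled = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_max_pool(fmap, (1, 1, 7, 7)))  # 6x6 ROI pooled to 2x2
```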
Next, the pooled feature map may be fed to a fully connected (FC) layer.
The learning device may then allow the FC layer to identify the class of the object in the RGB image. In addition, a predicted bounding box in the RGB image may be obtained through the FC layer, and the loss may be obtained by comparing the predicted bounding box with the ground truth (GT) bounding box. Here, the GT bounding box is a bounding box that precisely surrounds the object in the RGB image, and it is generally prepared by a human annotator.
Finally, the learning device in FIG. 1 may adjust at least one of the parameters of the FC layer, the RPN, or the plurality of convolutional layers by using the loss during backpropagation.
Thereafter, a test apparatus (not shown) having a CNN with the adjusted parameters may acquire a bounding box surrounding an object in a test image. However, even with the adjusted parameters, it is difficult to obtain a bounding box that accurately surrounds the object in the test image.
The applicant of the present invention therefore proposes a method for acquiring at least one bounding box corresponding to at least one object in a test image with high accuracy.
Disclosure of Invention
An object of the present invention is to solve the above problems.
It is another object of the present invention to provide a method for acquiring a high-accuracy bounding box corresponding to an object in an image by using a tracking network included in a CNN.
It is a further object of the invention to more accurately track objects by using a mean shift tracking algorithm.
It is another object of the present invention to increase the reliability of the tracking result and to verify the result by having the tracking network reuse a classifier and a regressor of a detection network included in the CNN.
According to one aspect of the present invention, there is provided a method of acquiring at least one bounding box corresponding to at least one object in a test image by using a CNN including a tracking network, comprising the steps of: (a) if a feature map is generated by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, the test apparatus acquires, or supports another apparatus to acquire, the plurality of suggestion boxes; (b) the test apparatus selects, or supports another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then sets, or supports another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (c) by using the mean shift tracking algorithm, the test apparatus determines, or supports another apparatus to determine, a specific region of the current frame as a target region of the tracking box, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and (d) the test apparatus allows a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allows an FC layer to acquire the bounding box by applying a regression operation to the pooled feature map.
According to another aspect of the present invention, there is provided a method of acquiring a bounding box corresponding to an object in a test image by using a CNN including a tracking network and a detection network, comprising the steps of: (a) if a feature map is generated by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, the test apparatus acquires, or supports another apparatus to acquire, the plurality of suggestion boxes; (b) (b-1) the test apparatus selects, or supports another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then sets, or supports another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (b-2) the test apparatus sets, or supports another apparatus to set, at least some of the plurality of suggestion boxes that have not been set as tracking boxes as a plurality of untracked boxes; and (c) (c-1) after step (b-1), the test apparatus determines, or supports another apparatus to determine, a specific region of the current frame as a target region of the tracking box by using the mean shift tracking algorithm, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and allows a first pooling layer to generate a first pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allows an FC layer to obtain a first bounding box by applying a regression operation to the first pooled feature map; (c-2) after step (b-2), the test apparatus allows a second pooling layer to generate a second pooled feature map by applying a pooling operation to a region of the feature map corresponding to at least one of the plurality of untracked boxes; and, if the FC layer detects a new object by applying a classification operation to the second pooled feature map, the test apparatus allows the FC layer to acquire a second bounding box by applying a regression operation to the second pooled feature map.
According to another aspect of the present invention, there is provided a test apparatus for acquiring at least one bounding box corresponding to at least one object in a test image by using a CNN including a tracking network, comprising: a communication section for acquiring the test image or a feature map converted therefrom; and a processor for performing the following processes: (I) if the feature map is acquired by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, acquiring, or supporting another apparatus to acquire, the plurality of suggestion boxes; (II) selecting, or supporting another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then setting, or supporting another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (III) determining, or supporting another apparatus to determine, a specific region of the current frame as a target region of the tracking box by using the mean shift tracking algorithm, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and (IV) allowing a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allowing an FC layer to obtain the bounding box by applying a regression operation to the pooled feature map.
According to still another aspect of the present invention, there is provided a test apparatus for acquiring a bounding box corresponding to an object in a test image by using a CNN including a tracking network and a detection network, comprising: a communication section for acquiring the test image or a feature map converted therefrom; and a processor for performing the following processes: (I) if the feature map is generated by applying a convolution operation to the test image as a current frame and information on a plurality of suggestion boxes, obtained by applying a specific operation to the feature map through a region proposal network (RPN), is then output, acquiring, or supporting another apparatus to acquire, the plurality of suggestion boxes; (II) (II-1) selecting, or supporting another apparatus to select, at least one specific suggestion box among the plurality of suggestion boxes by referring to at least one of (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, and then setting, or supporting another apparatus to set, the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm; (II-2) setting, or supporting another apparatus to set, at least some of the plurality of suggestion boxes that have not been set as tracking boxes as a plurality of untracked boxes; and (III) (III-1) after process (II-1), determining, or supporting another apparatus to determine, a specific region of the current frame as a target region of the tracking box by using the mean shift tracking algorithm, the specific region having probability information similar to the probability information corresponding to the pixel data of the object in the previous frame; and allowing a first pooling layer to generate a first pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allowing an FC layer to obtain a first bounding box by applying a regression operation to the first pooled feature map; (III-2) after process (II-2), allowing a second pooling layer to generate a second pooled feature map by applying a pooling operation to a region of the feature map corresponding to at least one of the plurality of untracked boxes; and, if the FC layer detects a new object by applying a classification operation to the second pooled feature map, allowing the FC layer to obtain a second bounding box by applying a regression operation to the second pooled feature map.
Drawings
The following drawings are included to illustrate exemplary embodiments of the invention and are only a part of its preferred embodiments. Other drawings may be obtained from the drawings herein without inventive work by those skilled in the art. The above and other objects and features of the present invention will become apparent from the following description of the preferred embodiments given in conjunction with the accompanying drawings, in which:
fig. 1 is a diagram exemplarily illustrating a learning process of a conventional CNN according to the related art;
FIG. 2 is a block diagram schematically illustrating a test apparatus according to an example embodiment of the invention;
fig. 3A is a block diagram exemplarily showing a configuration of a CNN capable of acquiring a bounding box according to an exemplary embodiment of the present invention;
FIG. 3B is a flowchart illustrating a process of acquiring a bounding box by using a CNN including a tracking network, according to an example embodiment of the present invention;
fig. 4A is a block diagram exemplarily showing a configuration of a CNN capable of acquiring a bounding box according to another exemplary embodiment of the present invention;
fig. 4B is a flowchart illustrating a process of acquiring a bounding box by using a CNN including a tracking network and a detection network according to another exemplary embodiment of the present invention;
fig. 5 is a diagram illustrating a mean shift tracking algorithm used in the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clear, reference is made to the accompanying drawings, which illustrate, by way of example, more detailed embodiments in which the invention may be practiced. These preferred embodiments are described in sufficient detail to enable those skilled in the art to practice the invention.
It is to be understood that the various embodiments of the invention, although different, are not necessarily mutually exclusive. For example, a particular feature, structure, or characteristic described herein in connection with one embodiment may be implemented within other embodiments without departing from the spirit and scope of the invention. In addition, it is to be understood that the location or arrangement of individual elements within each disclosed embodiment may be modified without departing from the spirit and scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims, appropriately interpreted, along with the full range of equivalents to which the claims are entitled. In the drawings, like numerals refer to the same or similar functionality throughout the several views.
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement the present invention.
Fig. 2 is a block diagram schematically illustrating a test apparatus according to an exemplary embodiment of the present invention.
As shown in FIG. 2, the test apparatus 200 may include a communication section 210 and a processor 220, and may further include a database 230. Optionally, however, the test apparatus 200 may be configured without the database. Here, any digital computing device having at least one processor capable of performing the operations may be employed as the test apparatus 200 of the present invention.
The communication section 210 may be configured to acquire a test image or at least one feature map obtained therefrom.
The processor 220 may be configured to perform the following processes: (i) selecting at least one specific suggestion box among all the suggestion boxes generated by the RPN according to certain criteria, (ii) setting the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a tracking algorithm, e.g., a mean shift tracking algorithm, (iii) determining a specific region as a target region of the tracking box by using the mean shift tracking algorithm, and (iv) allowing a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allowing an FC layer to acquire a bounding box by applying a regression operation to the pooled feature map. Further details of these processes are described below.
Meanwhile, the database 230 may be accessed by the communication section 210 of the test apparatus 200, and information on the scores of the suggestion boxes, information on the reference bounding box of the object in the previous frame, information on the parameters of the CNN, and the like may be stored therein.
Fig. 3A is a block diagram exemplarily showing a configuration of a CNN capable of acquiring a bounding box according to an exemplary embodiment of the present invention, and fig. 3B illustrates a process of acquiring a bounding box by using a CNN including a tracking network according to an exemplary embodiment of the present invention.
For reference, the test apparatus 200 may comprise several separate digital computing devices to perform the functions or processes disclosed herein. Nevertheless, for ease of description and illustration, the present disclosure assumes that the test apparatus 200 is implemented as a single digital computing device.
Referring to FIG. 3A, if the input image 301 is transmitted to the convolution block 310, at least one feature map is generated by applying at least one convolution operation to the input image 301. The feature map is then forwarded to the RPN 320 to generate suggestion boxes.
After the suggestion boxes are sent to the tracking module 330, at least one tracking box is obtained from the tracking module 330 by: (i) selecting a particular suggestion box from the suggestion boxes according to certain criteria, and (ii) setting it as the starting region (i.e., the initial window) of the tracking box, wherein the starting region is used for the mean shift tracking algorithm. The criteria are described in detail later.
Next, the pooling layer 340 may receive (i) the feature map from the convolution block 310 and (ii) the information about the tracking box from the tracking module 330, and may generate a pooled feature map by applying a pooling operation to the region of the feature map corresponding to the tracking box (i.e., the ROI). Thereafter, the fully connected (FC) layer 350 may identify the object class 302 via the classifier 351 and may generate the bounding box 303 via the regressor 352 by using the pooled feature map.
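Putting the modules of FIG. 3A together, the overall forward pass of the tracking network can be sketched as below. The callables (conv_block, rpn, tracking_module, and so on) are placeholders for the patent's components, not an API the patent defines.

```python
def track_forward(frame, conv_block, rpn, tracking_module, pooling_layer,
                  fc_layer, prev_reference_box):
    """One forward pass of the tracking network of FIG. 3A (sketch only).

    Every callable here is an assumed stand-in for the corresponding
    module in the figure; the patent does not specify such an interface.
    """
    feature_map = conv_block(frame)                     # convolutional features
    proposals, scores = rpn(feature_map)                # suggestion boxes + scores
    start_region = tracking_module.select(proposals, scores, prev_reference_box)
    target_region = tracking_module.mean_shift(frame, start_region)
    pooled = pooling_layer(feature_map, target_region)  # ROI pooling on target region
    obj_class = fc_layer.classifier(pooled)
    bounding_box = fc_layer.regressor(pooled)           # becomes next frame's reference
    return obj_class, bounding_box
```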
Unlike the CNN shown in FIG. 3A, which includes the convolution block 310, the RPN 320, the tracking module 330, the pooling layer 340, and the FC layer 350, the CNN in the test apparatus 200 may, as the case may be, include only some of these components.
Referring to FIG. 3B, which follows the configuration shown in FIG. 3A, in step S311 the input image 301 (e.g., a test image with a size of 1280×720 and 3 channels) is transmitted to the convolution block 310; as a result, in step S312, a feature map 304 with a size of 40×23 and 256 channels may be generated by applying convolution operations to the input image 301. Here, the input image 301 may be regarded as the current frame.
For reference, convolution block 310 includes one or more convolution layers. The width and height of the input may be reduced by a specific ratio each time a convolution operation is applied, and the number of channels may be increased by a specific ratio, but is not limited thereto. Here, the specific ratio may be determined based on parameters (i.e., weights) of the convolution layers included in the convolution block 310.
Further, in step S321, the RPN 320 may generate information on the suggestion boxes 305 from the feature map 304. For reference, the suggestion boxes 305 are boxes each of which has a probability of including an object in the input image 301.
Further, in step S331, the processor 220 of the test apparatus 200 selects a specific suggestion box 306 among the plurality of suggestion boxes 305 by referring to at least one of (i) a result of comparing each distance between the reference bounding box 307 of the object in the previous frame and each of the suggestion boxes 305 and (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes 305 includes the object; the processor 220 of the test apparatus 200 then sets the specific suggestion box 306 as the starting region of the tracking box. The starting region may be used for the mean shift tracking algorithm. Here, although the reference bounding box 307 of the object is located in the previous frame, FIG. 3B shows the reference bounding box 307 on the current frame for convenience of explanation.
For example, in step S331, the processor 220 of the test apparatus 200 may determine, as the specific suggestion box 306, the suggestion box with the smallest distance (e.g., L2 distance) to the reference bounding box 307 and/or the suggestion box with the highest score among the respective scores of the suggestion boxes 305.
As another example, the processor 220 of the test apparatus 200 may determine, as the specific suggestion box 306, the suggestion box with the smallest ratio of L2 distance to score.
Here, the score may be represented by the ratio of (i) the area of the intersection of the GT bounding box and each of the suggestion boxes 305 to (ii) the area of the union of the GT bounding box and each of the suggestion boxes 305, i.e., the intersection over union (IoU). Thus, the score is a value between 0 and 1, and the score of the specific suggestion box 306 selected among the suggestion boxes 305 may be close to 1.
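The two selection criteria above (center distance and IoU-style score) can be sketched as follows, assuming boxes are given as (x0, y0, x1, y1) corner coordinates; the distance-to-score ratio rule from the second example is included as a hypothetical helper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes: the score
    described above, i.e., intersection area / union area, in [0, 1]."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def l2_distance(box_a, box_b):
    """L2 distance between box centers (see step S331 and claim 4)."""
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def select_specific_proposal(proposals, scores, reference_box):
    """Pick the proposal minimizing the distance/score ratio, as in the
    second selection example above. Hypothetical helper for illustration."""
    ratios = [l2_distance(p, reference_box) / max(s, 1e-8)
              for p, s in zip(proposals, scores)]
    return proposals[ratios.index(min(ratios))]
```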
Meanwhile, the suggestion boxes 305 may all correspond to a single object. As another example, the suggestion boxes 305 may correspond to multiple objects: for instance, if 100 suggestion boxes are generated, 70 of them may be generated for object A and 30 for object B.
Then, in step S341, by using the mean shift tracking algorithm, the processor 220 of the test apparatus 200 may determine a specific region in the current frame as the target region 308 of the tracking box, where the target region 308 has probability information similar to the probability information corresponding to the pixel data of the object in the previous frame.
Here, the tracking module 330 may use a mean shift tracking algorithm, but it is apparent to those skilled in the art that the tracking algorithm should not be limited thereto.
In step S351, after the target region 308 of the tracking box is acquired, a pooled feature map (not shown), i.e., a feature vector, is generated by applying a pooling operation to the region of the feature map 304 corresponding to the target region 308; a bounding box is then generated by applying a regression operation to the feature vector through the regressor 352 in the FC layer 350, and the object class (e.g., vehicle, pedestrian, road, building, etc.) can be identified by applying a classification operation to the feature vector through the classifier 351 in the FC layer 350. A highly accurate bounding box can therefore be acquired via the CNN including the tracking network.
For reference, the test apparatus 200 may allow the tracking module 330 to precisely find the position of the tracking box in the current frame, and may instruct the FC layer 350 to refine the size of the tracking box, thereby obtaining the bounding box closest to the GT bounding box.
Finally, the test apparatus 200 may determine the bounding box as a reference bounding box for the tracking box of the object located in the next frame.
For reference, the configuration of fig. 3A may be referred to as a tracking network.
According to another example embodiment of the present invention, the CNN of the present invention may further comprise a detection network. Details regarding another example embodiment may be described below by way of illustration in fig. 4A and 4B.
Fig. 4A is a block diagram exemplarily showing a configuration of a CNN capable of acquiring a bounding box according to another exemplary embodiment of the present invention.
Referring to fig. 4A, the configuration of the CNN included in the test apparatus 200 may include a trace network and a detection network. The tracking network and the detection network may share the convolution block 310, RPN 320, and FC layer 460 with each other.
As the case may be, and unlike FIG. 4A, the CNN may include a separate FC layer for each network. That is, the CNN may not share adjusted parameters between the individual FC layers. Such a CNN may have a first FC layer for the tracking network and a second FC layer for the detection network.
For reference, in the claims of the present invention, the term "FC layer" is used without distinguishing between a first FC layer and a second FC layer, but this does not mean that the term excludes configurations in which the first FC layer and the second FC layer are separate.
Referring to FIG. 4A, the CNN may receive an input image and send it to the convolution block 310 to obtain a feature map. The feature map may then be relayed to the RPN 320 to generate suggestion boxes. For example, the number of suggestion boxes generated by the RPN 320 may be three hundred, but is not limited thereto.
Next, the tracking module 430 of the tracking network may receive information on the suggestion boxes and may allow trackers within the tracking module 430 to select at least one specific suggestion box among them by referring to the L2 distance and/or the score of each suggestion box, as described above. For example, if the number of specific suggestion boxes selected based on the distance from the reference bounding box of the previous frame is ten, information on the remaining unselected suggestion boxes (i.e., the untracked boxes) may be forwarded to the detection network; that is, information on the two hundred ninety untracked boxes is forwarded to the detection network.
Meanwhile, in the tracking network, a first pooled feature map is generated by the first pooling layer 440 applying a pooling operation to the region of the feature map corresponding to the specific suggestion box, and a first bounding box is generated by the FC layer 460 applying a regression operation to the first pooled feature map.
In the detection network, on the other hand, a second pooled feature map is generated by the second pooling layer 450 applying a pooling operation to the regions of the feature map corresponding to the untracked boxes, and a second bounding box is obtained by the FC layer 460 applying a regression operation to the second pooled feature map.
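The hand-off between the two networks can be sketched as below: proposals near a previous-frame reference box go to the tracking network, and the rest become untracked boxes for the detection network. The center-distance test and the threshold are illustrative assumptions; the patent only states that unselected suggestion boxes are set as untracked boxes.

```python
def box_center(box):
    """Center (x, y) of an (x0, y0, x1, y1) box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def split_proposals(proposals, scores, reference_boxes, dist_thresh=32.0):
    """Split RPN proposals into tracked boxes (near some previous-frame
    reference bounding box) and untracked boxes for the detection network.

    dist_thresh is an assumed hyperparameter, not a value from the patent.
    """
    tracked, untracked = [], []
    for p, s in zip(proposals, scores):
        px, py = box_center(p)
        near = any(((px - rx) ** 2 + (py - ry) ** 2) ** 0.5 < dist_thresh
                   for rx, ry in map(box_center, reference_boxes))
        (tracked if near else untracked).append((p, s))
    return tracked, untracked
```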
Fig. 4B is a flowchart illustrating a process of acquiring a bounding box by using a CNN including a tracking network and a detection network according to another exemplary embodiment of the present invention.
For reference, the processes of transmitting an input image in step S410, generating a feature map in step S420, and generating suggestion boxes in step S430 in FIG. 4B are the same as the processes S311, S312, and S321 described in FIG. 3B, since the CNN embodiments in FIGS. 3A and 4A share the same configuration for the convolution block 310 and the RPN 320.
However, the process in FIG. 4B differs from the process in FIG. 3B in that processing is performed not only on tracked boxes but also on untracked boxes. Here, the suggestion boxes that are not set as tracking boxes are set as untracked boxes, but the invention is not limited thereto; as another example, the untracked boxes may be selected among the suggestion boxes according to one or more particular conditions.
For reference, since the process of determining the tracking box by using the mean shift tracking algorithm has been described above, a detailed description thereof is omitted here.
In step S440, the test apparatus 200 may determine whether each of the suggestion boxes is a tracked box or an untracked box.
If it is determined in step S450 that a suggestion box is a tracking box, the test apparatus 200 may adjust the position of the tracking box by using the mean shift tracking algorithm in step S460. In detail, in step S460, the test apparatus 200 determines a specific region of the current frame, which has probability information similar to the probability information corresponding to the pixel data of the object in the previous frame, as the target region of the tracking box by using the mean shift tracking algorithm. Thereafter, in step S470, the test apparatus 200 generates a first pooled feature map by pooling the specific region through the first pooling layer 440, and then acquires a first bounding box by applying a regression operation to the first pooled feature map through the FC layer 460.
Otherwise, if it is determined in step S450 that a suggestion box is an untracked box, then in step S490 the test apparatus 200 may allow the second pooling layer 450 to generate a second pooled feature map by applying a pooling operation to a region of the feature map corresponding to at least one of the plurality of untracked boxes; and, if the FC layer 460 detects a new object by applying a classification operation to the second pooled feature map, the FC layer 460 is allowed to acquire a second bounding box by applying a regression operation to the second pooled feature map.
As another example, before step S490, in step S480, the test apparatus 200 may select at least one specific untracked box among the plurality of untracked boxes by referring to at least one of (i) each L2 distance between the reference bounding box acquired from the previous frame and each of the plurality of untracked boxes and (ii) each score, which is a probability value indicating whether each of the plurality of untracked boxes includes an object. If step S480 is performed, then in step S490 the test apparatus 200 may allow the second pooling layer 450 to generate the second pooled feature map by applying a pooling operation to the region of the feature map corresponding to the specific untracked box; and, if the FC layer 460 detects a new object by applying a classification operation to the second pooled feature map, the FC layer 460 is allowed to acquire the second bounding box by applying a regression operation to the second pooled feature map.
Here, the test apparatus 200 may determine the second bounding box corresponding to the new object as the reference bounding box NEW_REF for the tracking box of the new object in the next frame, and then set the tracking box in the next frame by referring to each distance between the reference bounding box NEW_REF and each of the plurality of suggestion boxes in the next frame.
For reference, the result of the classification operation provides information on the probabilities of the object having various identities; for example, the classification operation may output the probability that the object is a vehicle, a pedestrian, background, a road, etc.
Fig. 5 is a diagram for explaining a mean shift tracking algorithm used in the present invention.
Referring to FIG. 5, a histogram 520 of an object to be tracked (e.g., a vehicle) may be obtained from a specific region 510. The histogram is probability data obtained by counting, for each color, the number of pixels of that color included in the object region and dividing each count by the total number of pixels.
Once the histogram 520 is acquired, the input image 530 is back-projected to acquire a back-projected image 540. Here, back projection is the process of quantifying how strongly the color value of each pixel in the input image 530 belongs to the object to be tracked. If the model histogram is denoted H_m and the color value of each pixel x of the input image I (530) is denoted I(x), the back-projection value is given by w(x) = H_m(I(x)).
The mean shift tracking algorithm may be applied to the back-projection values. More specifically, since mean shift tracks an object in an image by iteratively moving a window from its current position toward the center of the local data distribution, it is used in the present invention to find the specific region in the current frame, moved from the starting region of the tracking box, that has probability information similar to the probability information corresponding to the pixel data of the object in the previous frame.
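A minimal sketch of this histogram / back-projection / mean-shift step, using OpenCV's standard primitives as a stand-in for the patent's tracker, might look as follows. The hue-only histogram and the termination criteria are assumptions of the sketch, not values given in the patent.

```python
import cv2

def mean_shift_step(prev_frame, start_window, cur_frame):
    """One mean-shift tracking step: build the hue histogram of the start
    region in the previous frame, back-project it onto the current frame,
    then run mean shift toward the region with the most similar color
    probability. Window format is (x, y, w, h); frames are BGR images.
    """
    x, y, w, h = start_window
    hsv_roi = cv2.cvtColor(prev_frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])  # H_m, model histogram
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    hsv = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)  # w(x) = H_m(I(x))
    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, target_window = cv2.meanShift(back_proj, start_window, term_crit)
    return target_window  # the target region of the tracking box
```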
For reference, the probability information corresponding to the pixel data of the object in the previous frame may be a histogram corresponding to the pixel data of the first bounding box and/or the second bounding box in the previous frame.
Meanwhile, at least one parameter of the CNN included in the test apparatus 200 may be adjusted by a learning apparatus (not shown) before the process of the test apparatus 200 is performed.
In detail, the test apparatus 200 may perform the above steps on the condition that the learning apparatus has completed the following processes: (i) allowing the convolutional layers to acquire a feature map for training from a training image including an object for training, (ii) allowing the RPN to acquire one or more suggestion boxes for training corresponding to the object for training in the training image, (iii) allowing the pooling layer to generate a pooled feature map for training corresponding to the suggestion boxes for training by applying a pooling operation, (iv) allowing the FC layer to acquire information on the pixel data of a bounding box for training by applying a regression operation to the pooled feature map for training, and (v) allowing a loss layer to acquire comparison data by comparing the information on the pixel data of the bounding box in the training image with the information on the pixel data of the bounding box in the GT image, thereby adjusting at least one parameter of the CNN during backpropagation by using the comparison data.
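The patent does not specify the form of the loss computed from the comparison data; a common choice for bounding-box regression, shown here purely as an illustration, is the smooth L1 loss over the (dx_c, dy_c, dw, dh) deltas.

```python
import numpy as np

def smooth_l1_loss(pred_deltas, gt_deltas):
    """Smooth L1 loss over (dx_c, dy_c, dw, dh). The patent only says a
    loss is obtained from comparison data, so this exact form is an
    assumption, not the patent's definition."""
    diff = np.abs(np.asarray(pred_deltas) - np.asarray(gt_deltas))
    per_term = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return per_term.sum()

print(smooth_l1_loss((0.1, -0.2, 0.05, 0.3), (0.0, 0.0, 0.0, 0.0)))
```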
For reference, the information on the pixel data of the bounding box for training may, as the case may be, be acquired through both the first FC layer and the second FC layer. As described above, if the tracking network including the first FC layer and the detection network including the second FC layer are configured as one network, it is not necessary to perform the learning process separately for the first FC layer and the second FC layer. In this case, the parameters of the first FC layer may have the same values as the parameters of the second FC layer.
According to the present invention, there is an effect of acquiring a high-precision bounding box corresponding to an object in an image.
According to the present invention, by using the mean shift tracking algorithm, there is an effect of tracking an object more accurately.
According to the present invention, by having the tracking network reuse the classifier and the regressor in the detection network included in the CNN, there is an effect of increasing the reliability of the tracking result and the verification result.
The embodiments of the present invention described above may be implemented in the form of executable program commands recordable to computer-readable media through various computer devices. The computer-readable media may include program commands, data files, and data structures, alone or in combination. The program commands recorded on the media may be components specially designed for the present invention or may be known and available to those skilled in the computer software arts. Computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM and DVD; magneto-optical media such as floptical disks; and hardware devices, such as ROM, RAM, and flash memory, specially configured to store and execute program commands. The program commands include not only machine-language code produced by a compiler but also high-level language code executable by a computing device through an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the technical features of the present invention, and vice versa.
As described above, the present invention has been explained with reference to specific matters such as detailed components, limited embodiments, and drawings. While the invention has been shown and described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.
Therefore, the inventive concept should not be limited to the explained embodiments, and the following patent claims, as well as everything including variations equal or equivalent to the patent claims, fall within the scope of the inventive concept.

Claims (18)

1. A method of acquiring at least one bounding box corresponding to at least one object in a test image by using a convolutional neural network (CNN) comprising a tracking network, comprising the steps of:
(a) a test device acquires a plurality of suggestion boxes, wherein a feature map is generated by applying a convolution operation to the test image as a current frame, and the test device outputs information on the plurality of suggestion boxes from a region proposal network (RPN) by using the feature map;
(b) the test apparatus selects at least one specific suggestion box among the plurality of suggestion boxes by referring to: (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes, wherein the suggestion box having the smallest distance is selected as the at least one specific suggestion box, or (ii) a result of comparing each score, which is a probability value indicating whether each of the suggestion boxes includes the object, wherein the suggestion box having the highest score is selected as the at least one specific suggestion box, and then sets the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm, and wherein the score is represented by the ratio of the area corresponding to the intersection of the region of a GT bounding box and the region of each of the suggestion boxes to the area corresponding to the union of the region of the GT bounding box and the region of each of the suggestion boxes;
(c) by using the mean shift tracking algorithm, the test device determines a specific region of the current frame as a target region of the tracking box, the specific region having probability information most similar to the probability information corresponding to the pixel data of the object in the previous frame;
(d) the test device allows a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allows a fully connected (FC) layer to acquire a bounding box by applying a regression operation to the pooled feature map; and
(e) The test device determines the bounding box as a reference bounding box for a tracking box of the object located in a next frame.
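For illustration only, the selection rule of step (b) above — pick the suggestion box nearest to the previous frame's reference bounding box, or the one with the highest intersection-over-union score — can be sketched in Python as follows. The function names and the (x1, y1, x2, y2) box convention are assumptions made for the sketch, not part of the claim.

    import numpy as np

    def iou(box_a, box_b):
        # Score of step (b): area of intersection over area of union of two
        # (x1, y1, x2, y2) boxes (box_a plays the role of the GT bounding box).
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def center(box):
        return np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

    def select_seed(suggestion_boxes, ref_box=None, scores=None):
        # Criterion (i): smallest L2 distance between center coordinates
        # (the distance defined in claim 4); criterion (ii): highest score.
        if ref_box is not None:
            dists = [np.linalg.norm(center(b) - center(ref_box)) for b in suggestion_boxes]
            return suggestion_boxes[int(np.argmin(dists))]
        return suggestion_boxes[int(np.argmax(scores))]

    # Example: seed the tracking box from three suggestion boxes.
    boxes = [(10, 10, 50, 60), (100, 90, 160, 170), (30, 20, 80, 90)]
    seed = select_seed(boxes, ref_box=(12, 12, 52, 62))  # -> (10, 10, 50, 60)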
2. The method of claim 1, wherein in the step (c), the information on a probability corresponding to pixel data of the object in the previous frame is a histogram corresponding to pixel data of the bounding box in the previous frame.
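To illustrate step (c) and claim 2, the following sketch uses a hue histogram in HSV space as the probability information and OpenCV's built-in mean shift. The choice of color space, the (x, y, w, h) window convention, and all names are assumptions for the sketch; the claims do not prescribe them.

    import cv2

    def track_with_mean_shift(prev_frame, cur_frame, start_region):
        # start_region: the specific suggestion box as (x, y, w, h), from step (b).
        x, y, w, h = start_region
        roi = prev_frame[y:y + h, x:x + w]

        # Histogram of the object's pixel data in the previous frame (claim 2),
        # serving as the "information on a probability".
        hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

        # Back-project the histogram onto the current frame; mean shift then
        # moves the window to the region whose probability map best matches.
        hsv_cur = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv_cur], [0], hist, [0, 180], 1)
        term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
        _, target_region = cv2.meanShift(back_proj, (x, y, w, h), term)
        return target_region  # target region of the tracking box, as (x, y, w, h)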
3. The method of claim 1, wherein in the step (b), if there are a plurality of objects, the test device selects each specific suggestion box among the plurality of suggestion boxes by referring to (i) a result of comparing each of the distances between the reference bounding box of each object in the previous frame and each of the plurality of suggestion boxes, and (ii) a result of comparing each score indicating the probability that each of the suggestion boxes includes each object, and then sets each of the specific suggestion boxes as each starting region of each of the tracking boxes.
4. The method of claim 1, wherein in the step (b), the distance between the reference bounding box of the object located in the previous frame and each of the plurality of suggestion boxes is the L2 distance between the center coordinates of the reference bounding box and the center coordinates of each of the plurality of suggestion boxes.
5. The method of claim 1, wherein the test device performs the steps (a) through (e) on condition that a learning device has completed the following processes: (i) allowing a convolutional layer to acquire a feature map for training from a training image including an object for training, (ii) allowing the RPN to acquire one or more suggestion boxes for training corresponding to the object for training in the training image, (iii) allowing the pooling layer to generate a pooled feature map for training by applying a pooling operation to a region corresponding to the suggestion boxes for training, (iv) allowing the FC layer to acquire information on pixel data of a bounding box for training by applying a regression operation to the pooled feature map for training, and (v) allowing a loss layer to acquire comparison data by comparing the information on the pixel data of the bounding box in the training image with information on pixel data of a bounding box in a GT image, thereby adjusting at least one parameter of the CNN during back propagation by using the comparison data.
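To make the learning process of claim 5 concrete, here is a minimal PyTorch sketch of steps (iv) and (v): an FC regression head, a loss layer comparing predictions with GT targets, and a backward pass adjusting the parameters. The layer sizes, the smooth L1 loss, and the box-delta targets are our assumptions; the claim does not fix them.

    import torch
    import torch.nn as nn

    class BoxRegressionHead(nn.Module):
        # Hypothetical FC layer of step (iv): pooled RoI features -> box deltas.
        def __init__(self, in_features=256 * 7 * 7):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(in_features, 1024), nn.ReLU(),
                nn.Linear(1024, 4),  # (dx, dy, dw, dh) regression output
            )

        def forward(self, pooled):  # pooled: (N, 256, 7, 7) training feature maps
            return self.fc(pooled.flatten(1))

    head = BoxRegressionHead()
    optimizer = torch.optim.SGD(head.parameters(), lr=1e-3)
    loss_layer = nn.SmoothL1Loss()  # compares prediction with GT (step (v))

    pooled_train = torch.randn(8, 256, 7, 7)  # stand-in pooled feature maps for training
    gt_targets = torch.randn(8, 4)            # stand-in GT regression targets

    optimizer.zero_grad()
    pred = head(pooled_train)             # step (iv): regression operation
    loss = loss_layer(pred, gt_targets)   # step (v): comparison data
    loss.backward()                       # back propagation
    optimizer.step()                      # adjusts the parameters of the head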
6. The method of claim 1, wherein in the step (d), the test device acquires the bounding box whose size is adjusted to correspond to the object in the test image through the process of generating the pooled feature map and then applying the regression operation via the FC layer.
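The size adjustment of claim 6 — the regression operation resizing the box to fit the object — is commonly realized by decoding predicted deltas against the proposal, for example with the R-CNN box-delta convention sketched below; the claim itself does not prescribe this parameterization.

    import math

    def apply_deltas(proposal, deltas):
        # proposal: (x1, y1, x2, y2); deltas: (dx, dy, dw, dh) from the FC layer.
        w = proposal[2] - proposal[0]
        h = proposal[3] - proposal[1]
        cx = proposal[0] + 0.5 * w
        cy = proposal[1] + 0.5 * h
        dx, dy, dw, dh = deltas
        # Shift the center, then rescale width and height: the "size adjusted
        # to correspond to the object" of claim 6.
        new_cx, new_cy = cx + dx * w, cy + dy * h
        new_w, new_h = w * math.exp(dw), h * math.exp(dh)
        return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)

    # Example: a proposal grows about 10% in width and shifts slightly right.
    print(apply_deltas((10, 10, 50, 60), (0.05, 0.0, 0.1, 0.0)))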
7. The method of claim 1, wherein the CNN further comprises a detection network, the method further comprising the steps of:
(f) after the step (b), the test device sets, as a plurality of untracked boxes, at least some of the plurality of suggestion boxes that have not been set as tracking boxes, wherein the pooling layer is a first pooling layer, the pooled feature map is a first pooled feature map, and the bounding box is a first bounding box; and
(g) after the step (d), the test device allows a second pooling layer to generate a second pooled feature map by applying a pooling operation to a region on the feature map corresponding to at least one of the plurality of untracked boxes; and, if the FC layer detects a new object by applying a classification operation to the second pooled feature map, the test device allows the FC layer to acquire a second bounding box by applying a regression operation to the second pooled feature map.
8. The method of claim 7, wherein in the step (g), the test device determines the second bounding box corresponding to the new object as a reference bounding box for a tracking box of the new object included in a next frame.
9. The method of claim 7, wherein in the step (f), at least one specific untracked box is selected among the plurality of untracked boxes by referring to at least one of (i) each of the L2 distances between the reference bounding box acquired from the previous frame and each of the plurality of untracked boxes and (ii) each score indicating the probability that each of the plurality of untracked boxes includes the object, and wherein in the step (g), the test device allows the second pooling layer to generate the second pooled feature map by applying a pooling operation to a region on the feature map corresponding to the specific untracked box and, if the FC layer detects the new object by applying a classification operation to the second pooled feature map, allows the FC layer to acquire the second bounding box by applying a regression operation to the second pooled feature map.
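The detection branch of claims 7 through 9 — pooling over an untracked box, classifying it, and regressing a second bounding box when a new object appears — might look like the following sketch. The RoI size, threshold, channel counts, and spatial scale are assumptions, and torchvision's roi_pool stands in for the second pooling layer.

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_pool

    feature_map = torch.randn(1, 256, 38, 50)                # backbone output for the current frame
    untracked = torch.tensor([[120.0, 40.0, 220.0, 180.0]])  # one untracked box (x1, y1, x2, y2)

    # Second pooling layer of step (g): pool the feature-map region of the box.
    rois = torch.cat([torch.zeros(1, 1), untracked], dim=1)  # prepend the batch index
    pooled2 = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)

    cls_head = nn.Linear(256 * 7 * 7, 2)  # classification operation: object vs. background
    reg_head = nn.Linear(256 * 7 * 7, 4)  # regression operation: box deltas

    flat = pooled2.flatten(1)
    object_prob = cls_head(flat).softmax(dim=1)[0, 1]
    if object_prob > 0.5:                 # a new object is detected
        # Decoded against the untracked box, these deltas yield the second
        # bounding box, the reference for the next frame (claim 8).
        second_box_deltas = reg_head(flat)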
10. A test apparatus for acquiring at least one bounding box corresponding to at least one object in a test image by using a convolutional neural network (CNN) including a tracking network, comprising:
a communication section for acquiring the test image or a feature map converted from the test image; and
a processor for performing the following processes: (I) acquiring a plurality of suggestion boxes, wherein the feature map is acquired by applying a convolution operation to the test image as a current frame, and information on the plurality of suggestion boxes is output from a region proposal network (RPN) by using the feature map; (II) selecting at least one specific suggestion box among the plurality of suggestion boxes by referring to (i) a result of comparing each distance between a reference bounding box of the object in a previous frame and each of the plurality of suggestion boxes, wherein the suggestion box having the smallest distance is selected as the at least one specific suggestion box, or (ii) a result of comparing each score indicating the probability that each of the suggestion boxes includes the object, wherein the suggestion box having the highest score is selected as the at least one specific suggestion box, and then setting the specific suggestion box as a starting region of a tracking box, wherein the starting region is used for a mean shift tracking algorithm, and wherein the score is represented by the ratio of the area corresponding to the intersection of a GT bounding box and each of the suggestion boxes to the area corresponding to the union of the GT bounding box and each of the suggestion boxes; (III) determining, by using the mean shift tracking algorithm, a specific region of the current frame as a target region of the tracking box, the specific region having information on a probability most similar to the information on a probability corresponding to the pixel data of the object in the previous frame; (IV) allowing a pooling layer to generate a pooled feature map by applying a pooling operation to a region of the feature map corresponding to the specific region, and then allowing a fully connected (FC) layer to acquire a bounding box by applying a regression operation to the pooled feature map; and (V) determining the bounding box as a reference bounding box for a tracking box of the object located in a next frame.
11. The test apparatus of claim 10, wherein in the process (III), the information on a probability corresponding to the pixel data of the object in the previous frame is a histogram corresponding to pixel data of the bounding box in the previous frame.
12. The test apparatus of claim 10, wherein in the process (II), if there are a plurality of objects, the processor selects each specific suggestion box among the plurality of suggestion boxes by referring to (i) a result of comparing each of the distances between the reference bounding box of each object in the previous frame and each of the plurality of suggestion boxes and (ii) a result of comparing each score indicating the probability that each of the suggestion boxes includes each object, and then sets each of the specific suggestion boxes as each starting region of each of the tracking boxes.
13. The test apparatus of claim 10, wherein in the process (II), the distance between the reference bounding box of the object located in the previous frame and each of the plurality of suggestion boxes is the L2 distance between the center coordinates of the reference bounding box and the center coordinates of each of the plurality of suggestion boxes.
14. The test apparatus of claim 10, wherein the processor performs the processes (I) through (V) on condition that a learning device has completed the following processes: (i) allowing a convolutional layer to acquire a feature map for training from a training image including an object for training, (ii) allowing the RPN to acquire one or more suggestion boxes for training corresponding to the object for training in the training image, (iii) allowing the pooling layer to generate a pooled feature map for training by applying a pooling operation to a region corresponding to the suggestion boxes for training, (iv) allowing the FC layer to acquire information on pixel data of a bounding box for training by applying the regression operation to the pooled feature map for training, and (v) allowing a loss layer to acquire comparison data by comparing the information on the pixel data of the bounding box in the training image with information on pixel data of a bounding box in a GT image, thereby adjusting at least one parameter of the CNN during back propagation by using the comparison data.
15. The test apparatus according to claim 10, wherein in the process (IV), the processor acquires the bounding box whose size is adjusted to correspond to the object in the test image by a process of generating the pooled feature map and then applying the regression operation through the FC layer.
16. The test apparatus of claim 10, further comprising a detection network, wherein the processor is further configured to perform the following processes: (VI) after the process (II), setting, as a plurality of untracked boxes, at least some of the plurality of suggestion boxes that have not been set as tracking boxes, wherein the pooling layer is a first pooling layer, the pooled feature map is a first pooled feature map, and the bounding box is a first bounding box; and (VII) after the process (IV), allowing a second pooling layer to generate a second pooled feature map by applying a pooling operation to a region on the feature map corresponding to at least one of the plurality of untracked boxes and, if the FC layer detects a new object by applying a classification operation to the second pooled feature map, allowing the FC layer to acquire a second bounding box by applying a regression operation to the second pooled feature map.
17. The test apparatus of claim 16, wherein in the process (VII), the processor determines the second bounding box corresponding to the new object as a reference bounding box for a tracking box of the new object included in a next frame.
18. The test apparatus of claim 16, wherein in the process (VI), at least one specific untracked box is selected among the plurality of untracked boxes by referring to at least one of (i) each of the L2 distances between the reference bounding box acquired from the previous frame and each of the plurality of untracked boxes and (ii) each score indicating the probability that each of the plurality of untracked boxes includes the object, and wherein in the process (VII), the processor allows the second pooling layer to generate the second pooled feature map by applying a pooling operation to a region on the feature map corresponding to the specific untracked box and, if the FC layer detects the new object by applying a classification operation to the second pooled feature map, allows the FC layer to acquire the second bounding box by applying a regression operation to the second pooled feature map.
CN201811191036.7A 2017-10-13 2018-10-12 Method for acquiring bounding box corresponding to object in image by convolution neural network including tracking network and computing device using same Active CN109670523B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/783,442 US9946960B1 (en) 2017-10-13 2017-10-13 Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
US15/783,442 2017-10-13

Publications (2)

Publication Number Publication Date
CN109670523A CN109670523A (en) 2019-04-23
CN109670523B true CN109670523B (en) 2024-01-09

Family

ID=61872587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811191036.7A Active CN109670523B (en) 2017-10-13 2018-10-12 Method for acquiring bounding box corresponding to object in image by convolution neural network including tracking network and computing device using same

Country Status (5)

Country Link
US (1) US9946960B1 (en)
EP (1) EP3471026B1 (en)
JP (1) JP6646124B2 (en)
KR (1) KR102192830B1 (en)
CN (1) CN109670523B (en)

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10078794B2 (en) * 2015-11-30 2018-09-18 Pilot Ai Labs, Inc. System and method for improved general object detection using neural networks
DE102018206110A1 (en) * 2018-04-20 2019-10-24 Zf Friedrichshafen Ag training methods
US10269125B1 (en) * 2018-10-05 2019-04-23 StradVision, Inc. Method for tracking object by using convolutional neural network including tracking network and computing device using the same
CN109635842A (en) * 2018-11-14 2019-04-16 平安科技(深圳)有限公司 A kind of image classification method, device and computer readable storage medium
CN109492697B (en) * 2018-11-15 2021-02-02 厦门美图之家科技有限公司 Picture detection network training method and picture detection network training device
US11087170B2 (en) * 2018-12-03 2021-08-10 Advanced Micro Devices, Inc. Deliberate conditional poison training for generative models
US10509987B1 (en) * 2019-01-22 2019-12-17 StradVision, Inc. Learning method and learning device for object detector based on reconfigurable network for optimizing customers' requirements such as key performance index using target object estimating network and target object merging network, and testing method and testing device using the same
US10423860B1 (en) * 2019-01-22 2019-09-24 StradVision, Inc. Learning method and learning device for object detector based on CNN to be used for multi-camera or surround view monitoring using image concatenation and target object merging network, and testing method and testing device using the same
US10387753B1 (en) * 2019-01-23 2019-08-20 StradVision, Inc. Learning method and learning device for convolutional neural network using 1×1 convolution for image recognition to be used for hardware optimization, and testing method and testing device using the same
US10445611B1 (en) * 2019-01-25 2019-10-15 StradVision, Inc. Method for detecting pseudo-3D bounding box to be used for military purpose, smart phone or virtual driving based-on CNN capable of converting modes according to conditions of objects and device using the same
US10402978B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Method for detecting pseudo-3D bounding box based on CNN capable of converting modes according to poses of objects using instance segmentation and device using the same
US10402686B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Learning method and learning device for object detector to be used for surveillance based on convolutional neural network capable of converting modes according to scales of objects, and testing method and testing device using the same
US10372573B1 (en) * 2019-01-28 2019-08-06 StradVision, Inc. Method and device for generating test patterns and selecting optimized test patterns among the test patterns in order to verify integrity of convolution operations to enhance fault tolerance and fluctuation robustness in extreme situations
US10803333B2 (en) 2019-01-30 2020-10-13 StradVision, Inc. Method and device for ego-vehicle localization to update HD map by using V2X information fusion
US10817777B2 (en) * 2019-01-31 2020-10-27 StradVision, Inc. Learning method and learning device for integrating object detection information acquired through V2V communication from other autonomous vehicle with object detection information generated by present autonomous vehicle, and testing method and testing device using the same
US10796206B2 (en) * 2019-01-31 2020-10-06 StradVision, Inc. Method for integrating driving images acquired from vehicles performing cooperative driving and driving image integrating device using same
US11010668B2 (en) * 2019-01-31 2021-05-18 StradVision, Inc. Method and device for attention-driven resource allocation by using reinforcement learning and V2X communication to thereby achieve safety of autonomous driving
CN112307826A (en) * 2019-07-30 2021-02-02 华为技术有限公司 Pedestrian detection method, device, computer-readable storage medium and chip
US11288835B2 (en) 2019-09-20 2022-03-29 Beijing Jingdong Shangke Information Technology Co., Ltd. Lighttrack: system and method for online top-down human pose tracking
KR20210061839A (en) * 2019-11-20 2021-05-28 삼성전자주식회사 Electronic apparatus and method for controlling thereof
EP4055561A4 (en) * 2019-11-20 2023-01-04 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Object detection device, method, and systerm
CN111223125B (en) * 2020-01-06 2023-05-09 江苏大学 Target motion video tracking method based on Python environment
CN111428566B (en) * 2020-02-26 2023-09-01 沈阳大学 Deformation target tracking system and method
CN111428567B (en) * 2020-02-26 2024-02-02 沈阳大学 Pedestrian tracking system and method based on affine multitask regression
KR20210114728A (en) * 2020-03-11 2021-09-24 연세대학교 산학협력단 Pixel Level Video Object Tracking Apparatus Using Box Level Object Position Information
CN111460926B (en) * 2020-03-16 2022-10-14 华中科技大学 Video pedestrian detection method fusing multi-target tracking clues
CN111539991B (en) * 2020-04-28 2023-10-20 北京市商汤科技开发有限公司 Target tracking method and device and storage medium
CN111696136B (en) * 2020-06-09 2023-06-16 电子科技大学 Target tracking method based on coding and decoding structure
KR102436197B1 (en) 2020-06-10 2022-08-25 한국기술교육대학교 산학협력단 Method for detecting objects from image
KR20220052620A (en) * 2020-10-21 2022-04-28 삼성전자주식회사 Object traking method and apparatus performing the same
CN112257810B (en) * 2020-11-03 2023-11-28 大连理工大学人工智能大连研究院 Submarine organism target detection method based on improved FasterR-CNN
CN113011331B (en) * 2021-03-19 2021-11-09 吉林大学 Method and device for detecting whether motor vehicle gives way to pedestrians, electronic equipment and medium
CN113420919B (en) * 2021-06-21 2023-05-05 郑州航空工业管理学院 Engineering anomaly control method based on unmanned aerial vehicle visual perception
CN113780477B (en) * 2021-10-11 2022-07-22 深圳硅基智能科技有限公司 Method and device for measuring fundus image based on deep learning of tight frame mark

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5216902B2 (en) * 2011-09-05 2013-06-19 日本電信電話株式会社 Object tracking device and object tracking method
US9965719B2 (en) * 2015-11-04 2018-05-08 Nec Corporation Subcategory-aware convolutional neural networks for object detection
US9858496B2 (en) * 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815579A (en) * 2017-01-22 2017-06-09 深圳市唯特视科技有限公司 A kind of motion detection method based on multizone double fluid convolutional neural networks model
CN106845430A (en) * 2017-02-06 2017-06-13 东华大学 Pedestrian detection and tracking based on acceleration region convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An object detection and tracking system for unmanned; Jian Yang; 《Target and Background Signatures III》; 2017-10-30; pp. 1-14 *
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks; Shaoqing Ren; 《arXiv:1506.01497v3》; 2016-01-30; pp. 1-9 *

Also Published As

Publication number Publication date
CN109670523A (en) 2019-04-23
EP3471026C0 (en) 2023-11-01
JP6646124B2 (en) 2020-02-14
KR20190041923A (en) 2019-04-23
US9946960B1 (en) 2018-04-17
KR102192830B1 (en) 2020-12-18
JP2019075116A (en) 2019-05-16
EP3471026B1 (en) 2023-11-01
EP3471026A1 (en) 2019-04-17

Similar Documents

Publication Publication Date Title
CN109670523B (en) Method for acquiring bounding box corresponding to object in image by convolution neural network including tracking network and computing device using same
CN109670573B (en) Learning method and learning device for adjusting parameter of CNN using loss increase, and test method and test device using the same
US10803364B2 (en) Control method, non-transitory computer-readable storage medium for storing control program, and control apparatus
US9953437B1 (en) Method and device for constructing a table including information on a pooling type and testing method and testing device using the same
US10269125B1 (en) Method for tracking object by using convolutional neural network including tracking network and computing device using the same
CN109598781B (en) Method for acquiring pseudo 3D frame from 2D bounding frame by regression analysis, learning apparatus and testing apparatus using the same
CN110751099B (en) Unmanned aerial vehicle aerial video track high-precision extraction method based on deep learning
JP2019036009A (en) Control program, control method, and information processing device
JP6700373B2 (en) Apparatus and method for learning object image packaging for artificial intelligence of video animation
US11450023B2 (en) Method and apparatus for detecting anchor-free object based on deep learning
CN110889421A (en) Target detection method and device
KR101991307B1 (en) Electronic device capable of feature vector assignment to a tracklet for multi-object tracking and operating method thereof
JP5777390B2 (en) Information processing method and apparatus, pattern identification method and apparatus
CN113253269B (en) SAR self-focusing method based on image classification
CN116596895A (en) Substation equipment image defect identification method and system
CN115116128A (en) Self-constrained optimization human body posture estimation method and system
US11074507B1 (en) Method for performing adjustable continual learning on deep neural network model by using selective deep generative replay module and device using the same
Fujita et al. Fine-tuned Surface Object Detection Applying Pre-trained Mask R-CNN Models
KR20230078134A (en) Device and Method for Zero Shot Semantic Segmentation
US11881016B2 (en) Method and system for processing an image and performing instance segmentation using affinity graphs
JP6554963B2 (en) Reference image generation method, image recognition program, image recognition method, and image recognition apparatus
JP2005071125A (en) Object detector, object detection method, object data selection program and object position detection program
KR102431425B1 (en) Scene classification method for mobile robot and object classification method
KR102546198B1 (en) Method for learning data classification based physical factor, and computer program recorded on record-medium for executing method thereof
KR102546193B1 (en) Method for learning data classification using color information, and computer program recorded on record-medium for executing method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant