CN114332583A - Indoor target detection method based on improved yolov3 - Google Patents


Info

Publication number
CN114332583A
Authority
CN
China
Prior art keywords
network
target
module
indoor
width
Prior art date
Legal status: Pending
Application number
CN202111518263.8A
Other languages
Chinese (zh)
Inventor
王养柱
田雨豪
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202111518263.8A priority Critical patent/CN114332583A/en
Publication of CN114332583A publication Critical patent/CN114332583A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides an indoor target detection method based on an improved yolov3, suited to detecting occluded furniture targets that present only a small visible area in indoor environments. The method comprises the following steps: crawl indoor scene pictures, manually annotate the detection targets, and construct a training set; cluster the target sizes in the training set with the K-means method to obtain optimized anchor sizes; improve the Darknet53 network of yolov3 by adding a combined dilated-convolution and depthwise-separable-convolution module (DS module) to the residual stage; after training, carry out indoor target detection with the improved network. The method increases the utilization of image information and improves indoor detection accuracy, particularly for occluded furniture targets with a small visible area.

Description

Indoor target detection method based on improved yolov3
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an indoor target detection method based on an indoor data set.
Background
Target detection has become a widely applied technology in the field of computer vision. Traditional target detection mostly relies on hand-crafted features, its precision is poor, and its development can no longer keep pace with growing demands for precision and efficiency. With the progress of artificial intelligence in computer vision, deep learning networks such as convolutional networks allow image features to be extracted accurately by the computer, improving the speed, accuracy and robustness of the target detection task. Detection methods built on convolutional neural networks have become the mainstream.
Target detection is applied very widely. On military drones it supports reconnaissance and strike tasks; lightweight, multi-drone cooperation saves manpower, raises combat efficiency, and also helps avoid casualties. In civilian use it serves face recognition, autonomous driving and other fields. A mature indoor target detection capability lets a drone carrying a vision module execute tasks autonomously with higher precision, and makes it easy to search for targets through video surveillance in everyday production and living environments. Indoor scenes are mostly living rooms, bedrooms and studies; because of decoration styles, furniture types and illumination, the background environment is complex, and the detection targets frequently occlude and overlap one another. When the yolov3 detection network is applied to indoor objects, target learning is insufficient and detection precision suffers. Further research and improvement on indoor target detection is therefore needed.
Disclosure of Invention
Aiming at the poor indoor detection accuracy of the yolov3 network, caused by insufficient target learning and mutual occlusion of indoor targets, the invention provides an indoor target detection method based on an improved yolov3, improving the yolov3 network to raise indoor detection accuracy.
In the indoor target detection method based on the improved yolov3, the Darknet53 network of the yolov3 algorithm is improved for target detection under indoor occlusion and overlap. The method comprises the following steps:
Step 1: modify the Darknet53 network of the yolov3 algorithm for indoor target detection;
Step 2: acquire an indoor scene picture training data set, in which the acquired pictures include occluded target furniture with a small visible area;
cluster the target sizes in the training data set with the K-means method to optimize the anchor sizes;
Step 3: train the improved Darknet53 network with the training data set, and use the trained network to recognize targets in indoor scene pictures.
In step 1, improving the Darknet53 network means adding a combined dilated-convolution and depthwise-separable-convolution module, the DS module, between the first-layer residual module and the second-layer residual module. The DS module comprises a dilated convolutional layer, a depthwise separable convolutional layer, two ordinary convolutional layers and a summation module. The input feature map is fed to the dilated convolutional layer and, in parallel, to one of the ordinary convolutional layers; the output of the dilated convolutional layer is fed to the depthwise separable convolutional layer, whose output is fed to the other ordinary convolutional layer; the feature maps output by the two ordinary convolutional layers are processed by the summation module to form the output.
In step 2, the acquired indoor scene pictures include occluded target furniture with a small visible area.
In step 2, the indoor scene pictures are crawled from the web and the detection targets are manually annotated; the detection targets comprise four classes: beds, chairs, tables and sofas.
Compared with the prior art, the invention has the following advantages and positive effects. (1) Aiming at the difficulty of detecting occluded furniture targets with a small visible area in indoor environments, the method modifies the Darknet53 network at the residual module, increasing the network's utilization of image information so that it captures more target features and improving indoor detection accuracy. (2) The method adds the DS module between the first and second blocks of the Darknet53 residual stack, placing it in the shallow front part of the network, so the feature map is sent into the dilated convolutional layer early and obtains a larger receptive field. (3) The residual module structure is modified as a whole, improving target recognition precision without greatly increasing the amount of computation. (4) The DS module introduces dilated convolution and depthwise separable convolution; the number of depthwise separable modules is adjustable, so the network can be designed on demand. Beyond simply superimposing the two convolutions' individual capabilities, their combination improves the perception of the whole residual module and the perception capability of the whole channel while also improving speed. (5) The DS module sums the input processed by one convolutional layer with batch normalization against the result of the dilated and depthwise separable convolutions, preventing vanishing gradients. (6) A group of anchors is obtained by clustering the training data set in advance, improving indoor detection accuracy. (7) The method is well suited to target detection under occlusion and overlap.
Drawings
FIG. 1 is a schematic diagram of the structure of the DS module of the method of the present invention;
FIG. 2 shows the result of target recognition on indoor picture 1 using the original yolov3 network;
FIG. 3 shows the result of target recognition on indoor picture 1 using the improved network of the present invention;
FIG. 4 shows the result of target recognition on indoor picture 2 using the original yolov3 network;
FIG. 5 shows the result of target recognition on indoor picture 2 using the improved network of the present invention;
FIG. 6 shows the result of target recognition on indoor picture 3 using the original yolov3 network;
FIG. 7 shows the result of target recognition on indoor picture 3 using the improved network of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood by those skilled in the art, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings and examples. It is to be understood that the drawings and described embodiments are provided as illustrative of some, but not all embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the disclosed embodiments of the invention without making any creative effort, shall fall within the protection scope of the invention.
The experiment of the embodiment of the invention is completed on a desktop configured in a laboratory, and the installation environment of the relevant software is as follows:
CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz;
Graphics card: Intel(R) UHD Graphics 630;
Memory: 64 GB;
Operating system: Win 10;
Programming language: Python;
CUDA (Compute Unified Device Architecture): CUDA 10.0;
CUDNN (GPU acceleration library for deep neural networks): 7.6.0;
Deep learning framework: tensorflow-keras.
The invention improves the yolov3 network for indoor target detection, raising detection precision by adding an integrated dilated-convolution and depthwise-separable-convolution module, the dilated_separable module, DS module for short, into the shallow network. The shallow feature map is first sent into the dilated convolutional layer to obtain a larger receptive field, and then, after batch normalization and an activation function, into the separable convolutional layer to further extract the feature information of complex furniture. One specific implementation of the indoor target detection method based on the improved yolov3 is as follows.
Step 1: acquire an indoor scene picture data set, annotate the indoor targets, and generate a training data set.
The indoor scenes addressed by the invention are mostly living rooms, bedrooms and studies. Because of decoration styles, furniture types, illumination and the like, the background environment is complex, and the detection targets frequently occlude and overlap one another.
The embodiment sets four classes of detection targets, 'bed', 'chair', 'table' and 'sofa', covering common furniture in indoor decoration. Existing indoor data sets such as MIT's indoor_cvpr09 are unsuitable for the invention: the picture resolution is too low, European-style furniture items are over-represented, and the sample distribution is very uneven. The indoor scene data set was therefore collected mainly from the web: indoor scene pictures were downloaded by keyword search from websites such as Baidu, Bing and Google and manually screened, and rectangular bounding boxes were annotated by hand with the labelImg tool. The final data comprise 202 pictures containing beds, 197 containing tables, 212 containing chairs and 197 containing sofas (some pictures contain multiple targets), 689 pictures in total.
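For illustration, annotations produced by labelImg in its Pascal-VOC XML format can be read back with the Python standard library alone. The file layout below is labelImg's usual output shape, but the concrete file name, class names and coordinates are a hypothetical example, not taken from the patent's data set:

```python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_string):
    """Parse a labelImg Pascal-VOC annotation into (class, box) pairs."""
    root = ET.fromstring(xml_string)
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        bb = obj.find("bndbox")
        box = tuple(int(bb.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, box))
    return boxes

# Hypothetical annotation for one image containing a bed and a chair.
SAMPLE = """
<annotation>
  <filename>room_001.jpg</filename>
  <object><name>bed</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>300</xmax><ymax>240</ymax></bndbox>
  </object>
  <object><name>chair</name>
    <bndbox><xmin>320</xmin><ymin>100</ymin><xmax>400</xmax><ymax>230</ymax></bndbox>
  </object>
</annotation>
"""
print(parse_voc_annotation(SAMPLE))
```

Each `(class, (xmin, ymin, xmax, ymax))` pair is what the clustering step below consumes, after converting boxes to width and height.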
Step 2: optimize the anchor sizes from the data set using the K-means method.
The anchor sizes set in the yolov3 network are the defaults, but anchor size directly determines the accuracy of the prediction boxes, so when matching anchors to targets the most appropriately sized anchors should be selected for detection. In the embodiment, the targets annotated in the data set are re-clustered with the K-means method to obtain the optimized anchor sizes shown in Table 1 below.
TABLE 1 optimized Anchor size data
No.  Length × Width (pixels)
 1   51 × 62
 2   104 × 114
 3   148 × 207
 4   210 × 295
 5   298 × 174
 6   316 × 447
 7   481 × 264
 8   633 × 416
 9   803 × 602
In Table 1, 1 to 9 index the nine generated anchors; the values are each anchor's length and width in pixels.
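The clustering step can be sketched as follows. This is a minimal NumPy K-means over (width, height) pairs using the 1 − IoU distance that is customary for YOLO anchor clustering (boxes treated as sharing a top-left corner); the initialization at evenly spaced area quantiles and all names are our own assumptions, not the patent's code:

```python
import numpy as np

def kmeans_anchors(wh, k, n_iter=50):
    """Cluster (width, height) pairs into k anchor sizes, minimizing 1 - IoU."""
    wh = np.asarray(wh, dtype=float)
    # Assumed initialization: centers at evenly spaced area quantiles.
    order = np.argsort(wh[:, 0] * wh[:, 1])
    centers = wh[order[np.linspace(0, len(wh) - 1, k).astype(int)]].copy()
    for _ in range(n_iter):
        # IoU of every box against every center, corners aligned.
        inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centers[None, :, 1]))
        union = (wh[:, 0] * wh[:, 1])[:, None] + centers[:, 0] * centers[:, 1] - inter
        assign = np.argmax(inter / union, axis=1)  # max IoU == min (1 - IoU)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # small to large

# Two obvious size clusters recover two matching anchors.
boxes = [[10, 10]] * 5 + [[100, 100]] * 5
print(kmeans_anchors(boxes, 2))
```

In practice `wh` would hold the width/height of every annotated box in the 689-picture training set and `k = 9`, yielding a table like Table 1.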
Step 3: improve the yolo network.
In a typical convolutional neural network, a backbone stacked from convolution and pooling layers has good feature extraction capability. After an image enters the network, convolutional layers extract features and pooling layers aggregate them, giving the model a degree of translation invariance and reducing the computational burden of subsequent convolutional layers; a fully connected layer finally outputs the classification result. For target detection, however, this structure has a problem: the ability to integrate image features conflicts with the ability to detect small targets. The receptive field is critical: prediction is generally made on the last feature map, so the number of original-image pixels that one feature point maps back to sets the upper limit on the sizes the network can detect, and pooling layers are what push feature integration toward that limit. Yet after pooling operations small targets become hard to detect, a weakness that shows clearly under large-area occlusion. Drawing prediction branches from multi-level feature maps alleviates this, since small targets are more easily represented on earlier feature maps, but the semantic information of early feature maps is not rich enough. Simply removing pooling and stacking more convolutional layers instead would both inflate the network's computation and, by losing pooled aggregation, hurt the final feature extraction.
In an indoor environment the requirement of the invention therefore becomes: obtain full semantic information without sacrificing feature-map resolution, that is, ease the conflict between integrating image features and detecting occluded furniture targets with a small visible area, so that the network extracts more feature information from the complex indoor environment. To this end the yolov3 network is improved as described in steps 3.1 and 3.2.
Step 3.1: introduce dilated convolution; the input picture is first processed by a dilated convolutional layer.
Dilated convolution, also called atrous or hole convolution, injects holes into the standard convolution kernel to enlarge the model's receptive field and thereby capture denser information. Compared with an ordinary convolution, dilated convolution increases the spacing between the sampling points of the kernel. With the same kernel size it therefore covers a larger receptive field and gathers denser information, avoiding the resolution loss caused by pooling-based feature aggregation and extracting more semantic information. When indoor furniture is occluded, adding dilated convolution in the shallow layers of the network extracts the indoor environment information effectively, strengthening the network's feature-extraction and learning capability. The output resolution (feature-map side length) of the dilated convolutional layer is calculated as follows:
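To make the sampling pattern concrete, the following sketch implements a single-channel dilated convolution directly in NumPy (like deep-learning frameworks, it actually computes cross-correlation). It is only an illustration of the operation, not the patent's code:

```python
import numpy as np

def dilated_conv2d(x, w, d=1):
    """'Valid' single-channel 2-D convolution with dilation rate d, stride 1.
    The k x k kernel samples the input with a spacing of d pixels, so it
    covers an effective window of k + (k - 1) * (d - 1) pixels per side."""
    k = w.shape[0]
    k_eff = k + (k - 1) * (d - 1)
    h, wid = x.shape
    out = np.empty((h - k_eff + 1, wid - k_eff + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + k_eff:d, j:j + k_eff:d] * w).sum()
    return out

x = np.arange(25.0).reshape(5, 5)
w = np.zeros((3, 3)); w[1, 1] = 1.0      # kernel that copies its centre pixel
print(dilated_conv2d(x, w).shape)         # ordinary 3x3: output (3, 3)
print(dilated_conv2d(x, w, d=2).shape)    # same kernel now sees a 5x5 window
```

With `d = 2` the same nine weights span a 5 × 5 region, which is exactly the larger receptive field the DS module relies on.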
O = ⌊(i + 2p − k − (k − 1)(d − 1)) / s⌋ + 1
where O is the side length of the feature map output by the dilated convolutional layer; i is the side length of the square input image after preprocessing; p is the zero-padding parameter, i.e., whether one all-zero row/column is added on each of the four sides of the image matrix (p = 1 with padding, p = 0 without); k is the original kernel size; d is the dilation rate; and s is the stride.
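The formula can be checked numerically. The helper below is a direct transcription, with floor division implementing the implicit rounding-down and parameter names matching the symbols in the text:

```python
def dilated_output_side(i, k, d, s, p):
    """Side length O of the dilated convolution's output feature map:
    O = floor((i + 2p - k - (k - 1)(d - 1)) / s) + 1."""
    return (i + 2 * p - k - (k - 1) * (d - 1)) // s + 1

print(dilated_output_side(7, 3, 1, 1, 0))  # d = 1 reduces to the ordinary formula
print(dilated_output_side(7, 3, 2, 1, 0))  # dilation shrinks the 'valid' output more
```

For d = 1 the dilation term vanishes and the expression becomes the familiar (i + 2p − k)/s + 1 of standard convolution.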
Step 3.2: introduce depthwise separable convolution; the feature map output by the dilated convolutional layer is fed into the depthwise separable convolutional layer.
Compared with ordinary convolution, depthwise separable convolution significantly reduces the number of parameters; for the same parameter budget, more depthwise separable layers can be stacked and more features extracted, so the network learns richer features and its ability to extract and learn complex furniture features is enhanced. Depthwise separable convolution splits into a depthwise convolution and a pointwise convolution. In the depthwise convolution, one kernel is responsible for one channel, and each channel is convolved by exactly one kernel; the number of output feature maps equals the number of input channels and cannot be expanded. Moreover, because each input channel is convolved independently, feature information from different channels at the same spatial position is not exploited. A pointwise convolution is therefore needed to combine these feature maps into new ones: an ordinary 1 × 1 convolution applied across the channels of the depthwise output, which effectively gathers the features of every spatial position across channels.
As shown in FIG. 1, the residual stage of the Darknet53 network of the yolov3 method is improved: an integrated DS module is added between the first-layer residual module and the second-layer residual module, yielding a new Darknet network dedicated to indoor detection. The integrated DS module comprises a dilated convolutional layer (Dilated-Conv2D), a depthwise separable convolutional layer (Separable-Conv2D), two ordinary convolutional layers (Conv2D) and a summation module (ADD). The input feature map passes through the dilated convolutional layer and the depthwise separable convolutional layer and is then fed to one convolutional layer; in parallel, the input feature map is fed directly to the other convolutional layer; the summation module adds the two convolutional layers' outputs to produce the final feature map.
The specific function and advantages of adding the DS module are as follows. After a 416 × 416 × 3 image is input, the first-layer residual module turns it into a 208 × 208 × 64 feature map, which is fed into the dilated convolutional layer first; without changing the overall feature-map size, the dilated convolution strengthens image perception, producing a feature map with stronger information continuity and improving the perception capability of the subsequent yolo network as a whole. The feature map then enters the depthwise separable convolution, whose output reduces from 208 × 208 × 64 to 208 × 208 × 1, cutting the total computation on this channel before it is sent to the next residual module and thus reducing the computation of the whole network. Used alone, dilated convolution normally increases computation, but with the DS module inserted into the residual stage the two convolutions are used jointly: through the cooperation within a single DS module, each convolution plays its distinct role without increasing computation, the channel-perception capability of the whole residual network is strengthened, and the parameters of the residual module and of the whole network are reduced.
Step 4: train the yolov3 network with the training data set obtained in step 1, using the anchor sizes optimized in step 2.
Training yields the converged weights of the network with the DS module. To demonstrate the effect of the method, the target recognition results of the improved yolov3 network and the original yolov3 network on the same indoor scene picture data set are compared, as shown in FIGS. 2 to 7. The data from the resulting PR curves, which express the relationship between recognition precision and recall, are summarized in Table 2 below.
Table 2 Per-class detection precision and overall detection precision of the models

Category                   Bed       Sofa      Table     Chair     mAP
Original yolo network AP   76.62%    23.81%    37.22%    54.19%    47.96%
Improved yolo network AP   93.33%    61.22%    52.03%    50.71%    64.33%
AP is the average precision of recognition for a single target class; mAP is the mean of the AP values over all target classes.
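The mAP column can be reproduced from the per-class APs. The sketch below averages the four class APs from Table 2: the original network's figure matches exactly, and the improved network's comes to 64.32, within rounding of the 64.33 reported (the table was presumably computed from unrounded APs):

```python
def mean_average_precision(per_class_ap):
    """mAP: the unweighted mean of the per-class AP values (here in percent)."""
    return sum(per_class_ap.values()) / len(per_class_ap)

original = {"bed": 76.62, "sofa": 23.81, "table": 37.22, "chair": 54.19}
improved = {"bed": 93.33, "sofa": 61.22, "table": 52.03, "chair": 50.71}
print(round(mean_average_precision(original), 2))  # 47.96, as in Table 2
print(round(mean_average_precision(improved), 2))  # 64.32 vs. the 64.33 reported
```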
Analyzing the test data in Table 2, the original yolo network achieves higher detection precision on classes with distinctive target features such as 'bed' and 'chair'. But for smaller targets and wherever occlusion occurs, the original yolo network shows clearly insufficient learning capability: for targets such as 'table' and 'sofa', which are easily occluded and whose dominant large-area feature, the top surface, can vary greatly in appearance or angle, the detection precision is unsatisfactory. The low precision on 'table' and 'sofa' stems mainly from the large variety of such targets: most falsely detected or missed targets are furniture of novel, individualized design whose features differ from the training-set samples, so the features of these minority samples are not fully learned and correct detection fails.
Comparative analysis of the test indexes in Table 2 also shows that the optimized YOLOv3 network improves precision significantly on most indexes. Detailed comparison of the per-class AP values confirms that the measures addressing the earlier shortcomings give the optimized model a clear improvement in mAP. From the per-class APs, YOLOv3 was already accurate on large targets, so the improvement there is modest. Although the precision on 'chair' drops by about three percentage points, for the previously weak classes the prediction boxes fit targets such as 'table' and 'sofa' much better, and the added DS module clearly strengthens the network's ability to learn occlusion and overlap phenomena, i.e., the generalization of the network is enhanced.
As FIGS. 2 and 3 show, both the original and the improved yolov3 networks detect the bed correctly and reasonably, but under occlusion the original network fails to find the rearmost chair and misjudges the seat of one chair as a table, whereas the yolov3 network with the DS module of the invention detects and identifies all furniture correctly, showing that the improved network extracts occlusion information more completely.
As FIGS. 4 and 5 show, the original yolov3 network indeed has weaknesses in judging table and sofa features, i.e., its learning of those features is slightly poor: it misrecognizes the sofa as a bed, mislabels the chairs repeatedly, and misidentifies the sofa information. The yolov3 network with the DS module detects and identifies the sofa correctly with no false detections, demonstrating that the improved network learns features better for objects whose features are hard to extract.
As FIGS. 6 and 7 show, the chair in the original picture is blurred in position and features, its black colour nearly merging with the black background. (Note: the tables annotated in the experimental data set are four-legged tables, including dining tables and desks, so the table in the figure is deliberately left unannotated rather than missed.) The original yolov3 network is deficient in judging the features of the half-tucked-in chairs, while the yolov3 network with the DS module correctly detects and identifies the occluded part of the chair, proving that the improved network learns the features of occluded and overlapping objects better than the original yolov3 network.

Claims (3)

1. An indoor target detection method based on an improved yolov3, characterized by comprising the following steps:
Step 1: modify the Darknet53 network of the yolov3 algorithm for indoor target detection;
the improvement to the Darknet53 network is that a combined dilated-convolution and depthwise-separable-convolution module, the DS module, is added between the first-layer residual module and the second-layer residual module; the DS module comprises a dilated convolutional layer, a depthwise separable convolutional layer, two ordinary convolutional layers and a summation module; the input feature map is fed to the dilated convolutional layer and, in parallel, to one of the ordinary convolutional layers; the output of the dilated convolutional layer is fed to the depthwise separable convolutional layer, whose output is fed to the other ordinary convolutional layer; the feature maps output by the two ordinary convolutional layers are processed by the summation module to form the output;
Step 2: acquire an indoor scene picture training data set, in which the acquired pictures include occluded target furniture with a small visible area;
cluster the target sizes in the training data set with the K-means method to optimize the anchor sizes;
Step 3: train the improved Darknet53 network with the training data set, and use the trained network to recognize targets in indoor scene pictures.
2. The method as claimed in claim 1, wherein in step 2 the indoor scene pictures are crawled from the web and the detection targets are manually annotated; the detection targets comprise four classes: beds, chairs, tables and sofas.
3. The method according to claim 1, wherein in step 2 the anchor sizes are optimized, and the optimized sizes obtained comprise the following nine, given as length and width in pixels: (1) 51 and 62; (2) 104 and 114; (3) 148 and 207; (4) 210 and 295; (5) 298 and 174; (6) 316 and 447; (7) 481 and 264; (8) 633 and 416; (9) 803 and 602.
CN202111518263.8A 2021-12-10 2021-12-10 Indoor target detection method based on improved yolov3 Pending CN114332583A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111518263.8A CN114332583A (en) 2021-12-10 2021-12-10 Indoor target detection method based on improved yolov3


Publications (1)

Publication Number Publication Date
CN114332583A true CN114332583A (en) 2022-04-12

Family

ID=81049790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111518263.8A Pending CN114332583A (en) 2021-12-10 2021-12-10 Indoor target detection method based on improved yolov3

Country Status (1)

Country Link
CN (1) CN114332583A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN112001477A (en) * 2020-06-19 2020-11-27 南京理工大学 Deep learning-based model optimization algorithm for target detection YOLOv3
WO2021244079A1 (en) * 2020-06-02 2021-12-09 苏州科技大学 Method for detecting image target in smart home environment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIN XIAOPING: "Positioning Techniques in Indoor Environments Based on Deep Learning for Intelligent Building Lighting System", IEEE ACCESS, 23 January 2020 (2020-01-23) *
王建林: "Multi-type cooperative target detection with an improved YOLOv2 convolutional neural network", Optics and Precision Engineering, 15 January 2020 (2020-01-15) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination