CN111428739A - High-precision image semantic segmentation method with continuous learning capability - Google Patents

High-precision image semantic segmentation method with continuous learning capability

Info

Publication number
CN111428739A
Authority
CN
China
Prior art keywords
semantic segmentation
neural network
continuous learning
scale prediction
image semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010289285.0A
Other languages
Chinese (zh)
Other versions
CN111428739B (en)
Inventor
Min Fei (闵飞)
Wang Yanran (王嫣然)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tujue Guangzhou Intelligent Technology Co ltd
Original Assignee
Tujue Guangzhou Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tujue Guangzhou Intelligent Technology Co ltd filed Critical Tujue Guangzhou Intelligent Technology Co ltd
Priority to CN202010289285.0A priority Critical patent/CN111428739B/en
Publication of CN111428739A publication Critical patent/CN111428739A/en
Application granted granted Critical
Publication of CN111428739B publication Critical patent/CN111428739B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-precision image semantic segmentation method with continuous learning capability, which comprises the following steps: step S100, extracting picture features by using a fully convolutional neural network unit; step S200, defining a search space and obtaining, by a neural network architecture search method, the multi-scale prediction unit that gives the best segmentation effect; step S300, fusing the features extracted by all the multi-scale prediction units of step S200 to obtain a refined high-resolution feature map, thereby realizing high-precision image semantic segmentation; and step S400, adding a continuous learning mechanism to the feature extraction modules of the fully convolutional neural network unit and the multi-scale prediction units, so that the catastrophic forgetting problem of deep convolutional neural networks is overcome, the whole image semantic segmentation method gains the ability to learn continuously across scenes, and the method adapts better to complex, changing environments.

Description

High-precision image semantic segmentation method with continuous learning capability
Technical Field
The invention relates to the field of machine vision technology and deep learning, in particular to a high-precision image semantic segmentation method with cross-scene continuous learning capability.
Background
With the development of artificial intelligence, robots are becoming increasingly intelligent, and unmanned intelligent systems are gradually entering various social and civil fields to replace humans in tedious, dangerous, and repetitive physical labor. For a robot to perform labor tasks well and coexist with humans, it must be able to understand its environment as a human does, so that autonomous robots can move flexibly and manipulate objects dexterously. Examples include human-robot collaborative robots in smart factories, material delivery robots in hospital environments, household service robots, self-driving cars, next-generation intelligent forklift robots, and search-and-rescue robots. When ground robots of various configurations perform mobile manipulation tasks in complex scenes, the core problem is environmental understanding, and the key technology for solving it is image semantic segmentation. However, existing semantic segmentation methods lack precision in complex environments (e.g., diverse small targets, overlapping or occluded targets), and existing deep neural network segmentation models in particular suffer from severe memory degradation: learning new knowledge erases previously acquired experience. As a result, robot systems lack cross-scene continuous learning ability; a trained system can only handle the scenes it was trained on and cannot incrementally accumulate new experience in new working scenes. A new image semantic segmentation and scene understanding method that breaks through the catastrophic forgetting limitation of existing models, and thereby gives robot systems stronger human-like continuous learning capability, is therefore work of both academic and engineering application value.
Disclosure of Invention
The invention aims to provide a high-precision image semantic segmentation method with continuous learning capability to solve the problem.
In order to achieve the above object, the present invention provides a high-precision image semantic segmentation method with continuous learning capability, which includes the following steps:
Step S100, extracting preliminary image features by using a fully convolutional neural network unit;
Step S200, defining a search space SP, and obtaining by a neural network architecture search method the multi-scale prediction unit (MSPC) that gives the best segmentation effect;
Step S300, fusing the features extracted by the multi-scale prediction units in step S200 to obtain a refined high-resolution feature map, thereby realizing high-precision image semantic segmentation;
Step S400, introducing a continuous learning mechanism into the feature extraction modules of the fully convolutional neural network unit and the multi-scale prediction units, so as to overcome the memory degradation that semantic segmentation networks suffer when data scale grows gradually and image content varies widely; this gives the image semantic segmentation and scene understanding system a more intelligent, human-like continuous learning ability and stronger adaptability to complex, changing scenes.
Further, in step S200, the search space SP comprises a series of 1 × 1 convolutions and 3 × 3 convolutions; since the search space SP is used to determine the multi-scale prediction unit structure, hole (dilated) convolutions with different sampling rates are also introduced into it. The search space SP is specifically defined as follows:
SP0 = convolution with a 1 × 1 kernel;
SP1 = convolution with a 3 × 3 kernel;
SP2 = hole convolution with a 3 × 3 kernel and sampling rate ∈ {2, 4, 6, 8, 12, 18, 24};
SP3 = ReLU activation function;
SP = {SP0, SP1, SP2, SP3}.
Further, in step S200, the neural network architecture search method adopts a mean Intersection-over-Union (mIoU) evaluation method to optimize the objective function f = f(x), where x = {x0, x1, x2, …, xn} denotes the structure combinations of the multi-scale prediction units searched from the search space SP.
Further, in step S200, the input of the multi-scale prediction unit is the output of each layer of the fully convolutional neural network, and the output is an equal-resolution feature map response.
Further, in step S300, the fusion unit performs feature fusion by using a pixel addition method.
Further, in step S400, the continuous learning mechanism modifies the neural network weights only in the direction orthogonal to the input space of old tasks; that is, the final weight increment is determined by ΔW = PΔW0, where P is the orthogonal projection matrix of the input space, ΔW0 is the gradient computed by the neural network on the original input space, and ΔW is the weight increment in the orthogonal direction of the input space.
Further, the input-space orthogonal projection matrix P is obtained recursively by the following formulas:
p_l(i) = p_l(i-1) - k_l(i) r_{l-1}(i) p_l(i-1) / (α + r_{l-1}(i) k_l(i)),
k_l(i) = p_l(i-1) r_{l-1}(i)^T,
where p_l(i) denotes the orthogonal projection matrix of the l-th layer's input space at the i-th training batch, α is a constant not greater than 1, and r_{l-1}(i) denotes the mean of the layer-(l-1) outputs over the i-th training batch.
Compared with the prior art, the high-precision image semantic segmentation method with continuous learning capability has the following beneficial effects: by introducing an orthogonal projection operator, the image semantic segmentation algorithm modifies the neural network weights only in the direction orthogonal to the input space of old tasks when learning a new task. With this operation, the weight increment barely interacts with the inputs of previous tasks, so the solution found by the network while training on the new task still lies within the solution space of the previous tasks.
Furthermore, the multi-scale prediction unit structure is determined by a neural network architecture search method; high-level semantic features and low-level fine-grained features are extracted and fused by pixel-wise addition, yielding a refined high-resolution feature map and further improving the precision of image semantic segmentation.
Drawings
The invention will be further understood from the following description in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. Like reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a flowchart of a high-precision image semantic segmentation method with continuous learning capability according to an embodiment of the present invention;
FIG. 2 is a high-precision image semantic segmentation network architecture diagram with continuous learning capability according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a continuous learning mechanism according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to embodiments thereof; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. Other systems, methods, and/or features of the present embodiments will become apparent to those skilled in the art upon review of the following detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims. Additional features of the disclosed embodiments are described in, and will be apparent from, the detailed description that follows.
The same or similar reference numerals in the drawings of the embodiments of the present invention denote the same or similar components. In the description of the present invention, it should be understood that terms indicating orientation or positional relationship, such as "upper", "lower", "left", and "right", are based on the orientations or positional relationships shown in the drawings, are used only for convenience and simplification of description, and do not indicate or imply that the device or component referred to must have a specific orientation or be constructed and operated in a specific orientation. Such terms are therefore illustrative only and are not to be construed as limiting this patent; the specific meanings of these terms will be understood by those of ordinary skill in the art according to the circumstances.
As shown in fig. 1, a high-precision image semantic segmentation method with continuous learning capability according to an embodiment of the present invention includes the following steps:
Step S100, image features are preliminarily extracted by using a classical VGG16- or ResNet-based fully convolutional neural network unit;
Step S200, a search space is defined, a multi-scale prediction unit is determined by a neural network architecture search method, and the preliminary image features obtained in step S100 are refined to obtain multi-scale convolution features;
Step S300, the features extracted by the multi-scale prediction units in step S200 are fused to obtain a refined high-resolution feature map;
Step S400, a continuous learning mechanism is introduced into the feature extraction modules of the fully convolutional neural network unit and the multi-scale prediction units, so that the catastrophic forgetting problem of deep convolutional neural networks is overcome and the image semantic segmentation method has strong adaptability to varied, complex, changing environments.
A fully convolutional neural network usually comprises multiple convolution layers and multiple pooling layers, so the image resolution attenuates layer by layer while as much context information as possible is gathered, which suits the pixel-level dense prediction problem of image semantic segmentation. However, as the number of convolution layers grows, the network extracts more and more global structural information and less and less detail information.
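As a minimal sketch of step S100 (assuming a PyTorch implementation with illustrative stage boundaries; the patent itself only requires a VGG16 or ResNet fully convolutional backbone), the per-layer feature maps whose resolution attenuates stage by stage can be collected as follows:

```python
import torch
import torchvision

# Step S100 (sketch): extract preliminary multi-stage features with a
# fully convolutional VGG16 backbone. The stage boundaries below are
# illustrative assumptions; any FCN-style backbone (e.g. ResNet) works.
class FCNBackbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        feats = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
        # Split the VGG16 feature stack at its five max-pool layers so
        # each stage yields a feature map at a different resolution.
        self.stages = torch.nn.ModuleList([
            feats[:5], feats[5:10], feats[10:17], feats[17:24], feats[24:31],
        ])

    def forward(self, x):
        outputs = []
        for stage in self.stages:
            x = stage(x)
            outputs.append(x)  # one feature map per backbone stage
        return outputs

backbone = FCNBackbone().eval()
with torch.no_grad():
    maps = backbone(torch.randn(1, 3, 224, 224))
print([m.shape for m in maps])  # resolution halves at every stage
```

These per-stage maps are exactly the inputs the multi-scale prediction units of step S200 consume.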
Therefore, on the basis of the fully convolutional neural network unit, the multi-scale prediction unit is added to refine the feature maps extracted by the fully convolutional network and obtain a refined high-resolution feature map, so that high-precision image semantic segmentation is realized.
Specifically, in step S200, the search space comprises a series of 1 × 1 convolutions and 3 × 3 convolutions; since the search space is used to determine the multi-scale prediction unit structure, hole (dilated) convolutions with different sampling rates are also added to it. The search space (SP) is specifically defined as follows:
SP0 = convolution with a 1 × 1 kernel;
SP1 = convolution with a 3 × 3 kernel;
SP2 = hole convolution with a 3 × 3 kernel and sampling rate ∈ {2, 4, 6, 8, 12, 18, 24};
SP3 = ReLU activation function;
SP = {SP0, SP1, SP2, SP3}.
As can be seen from the definition of the search space, there are 10 search nodes in total. In this embodiment, the hole convolutions are arranged in a parallel structure, with at most 4 hole convolutions of different sampling rates in parallel.
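The search space can be made concrete as candidate operations; the following sketch assumes a PyTorch implementation and a channel count C chosen by the user (both assumptions of the example, not fixed by the patent):

```python
import itertools
import torch.nn as nn

# Candidate operations of the search space SP (sketch). SP2 is
# instantiated once per sampling (dilation) rate, giving the 10 search
# nodes mentioned above: one 1x1 conv, one 3x3 conv, one ReLU, and
# seven dilated 3x3 convs.
def make_op(name, C, rate=1):
    if name == "SP0":
        return nn.Conv2d(C, C, kernel_size=1)
    if name == "SP1":
        return nn.Conv2d(C, C, kernel_size=3, padding=1)
    if name == "SP2":  # hole convolution: padding = rate keeps resolution
        return nn.Conv2d(C, C, kernel_size=3, padding=rate, dilation=rate)
    if name == "SP3":
        return nn.ReLU(inplace=True)
    raise ValueError(name)

RATES = (2, 4, 6, 8, 12, 18, 24)

def enumerate_parallel_branches(max_parallel=4):
    """All parallel combinations of hole convolutions, at most 4 rates."""
    for n in range(1, max_parallel + 1):
        yield from itertools.combinations(RATES, n)
```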
The neural network architecture search method adopts a mean Intersection-over-Union (mIoU) evaluation method to optimize the objective function f = f(x), where x = {x0, x1, x2, …, xn} denotes the structure combinations of the multi-scale prediction units searched from the search space.
The mean Intersection-over-Union mIoU is formulated as:
mIoU = (1 / (k + 1)) · Σ_{i=0}^{k} [ p_ii / ( Σ_{j=0}^{k} p_ij + Σ_{j=0}^{k} p_ji - p_ii ) ],
where k is the number of classes (k + 1 including background) and p_ij denotes the number of pixels that belong to class i but are predicted as class j. Thus p_ii counts true positives, while p_ij and p_ji correspond to false positives and false negatives, respectively.
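For illustration, a minimal Python sketch of this evaluation follows: it computes mIoU from a class confusion matrix and plugs it into a simple random search over candidate structure combinations. The random-search strategy and the helper names `sample_struct` and `evaluate_confusion` are assumptions of the example; the patent does not commit to a particular search algorithm.

```python
import numpy as np

def miou(conf: np.ndarray) -> float:
    """Mean Intersection-over-Union from a (k+1) x (k+1) confusion matrix,
    where conf[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)        # p_ii
    fp = conf.sum(axis=0) - tp              # sum_j p_ji minus p_ii
    fn = conf.sum(axis=1) - tp              # sum_j p_ij minus p_ii
    iou = tp / np.maximum(tp + fp + fn, 1)  # guard against empty classes
    return float(iou.mean())

# Sketch of the architecture search loop: evaluate sampled structure
# combinations x and keep the one maximizing f(x) = mIoU. The helpers
# are hypothetical stand-ins for candidate sampling and validation-set
# inference.
def random_search(sample_struct, evaluate_confusion, trials=100):
    best_x, best_f = None, -1.0
    for _ in range(trials):
        x = sample_struct()
        f = miou(evaluate_confusion(x))
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f
```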
The multi-scale prediction units are connected to the fully convolutional neural network in the manner shown in fig. 2: the input of each multi-scale prediction unit is the output of one layer of the fully convolutional neural network, and its output is an equal-resolution feature map response.
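A minimal sketch of one such unit follows, assuming bilinear upsampling and illustrative sampling rates (the actual branch structure is what the architecture search of step S200 determines):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePredictionUnit(nn.Module):
    """One searched MSPC branch (sketch): parallel hole convolutions over
    a backbone stage's output, resized to a common target resolution."""
    def __init__(self, in_ch, out_ch, rates=(2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )

    def forward(self, x, target_size):
        y = sum(F.relu(b(x)) for b in self.branches)
        # Equal-resolution output: every unit resizes to the same map size.
        return F.interpolate(y, size=target_size, mode="bilinear",
                             align_corners=False)

unit = MultiScalePredictionUnit(in_ch=256, out_ch=64)
x = torch.randn(1, 256, 28, 28)
print(unit(x, target_size=(56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```

Because every unit emits maps of the same size, their outputs are directly ready for the pixel-addition fusion of step S300.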
Further, in step S300, the fusion unit performs feature fusion by using a pixel addition method.
Semantic segmentation methods based on convolutional neural networks mostly rely on training over large datasets and can effectively recognize and segment only recently trained objects and scenes; most existing image semantic segmentation methods therefore find it difficult to overcome catastrophic forgetting.
To this end, the present invention introduces a continuous learning mechanism, as shown in fig. 3, which makes the neural network modify its weights only in the direction orthogonal to the input space of old tasks; the final weight increment is determined by ΔW = PΔW0, where P is the orthogonal projection matrix of the input space, ΔW0 is the gradient computed by the neural network on the original input space, and ΔW is the weight increment in the orthogonal direction of the input space.
The input-space orthogonal projection matrix P in the present invention can be obtained recursively by the following formulas:
p_l(i) = p_l(i-1) - k_l(i) r_{l-1}(i) p_l(i-1) / (α + r_{l-1}(i) k_l(i)),
k_l(i) = p_l(i-1) r_{l-1}(i)^T,
where p_l(i) denotes the orthogonal projection matrix of the l-th layer's input space at the i-th training batch, α is a constant not greater than 1, and r_{l-1}(i) denotes the mean of the layer-(l-1) outputs over the i-th training batch.
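A small numeric sketch of this recursive update follows, assuming the reconstruction above (an orthogonal-weights-modification-style projector update); the matrix shapes and the input-major weight layout are assumptions of the example:

```python
import numpy as np

def update_projector(P, r, alpha=0.9):
    """One recursive update of the input-space projector p_l for batch i.
    P: (n, n) current projector; r: (n,) mean layer input over the batch;
    alpha: constant <= 1. Returns the updated projector."""
    k = P @ r                                 # k_l(i) = p_l(i-1) r^T
    P = P - np.outer(k, r @ P) / (alpha + r @ k)
    return P

def orthogonal_step(W, grad, P, lr=0.1):
    """Continual-learning update, dW = P dW0: project the raw gradient so
    the increment is (approximately) orthogonal to old tasks' inputs.
    Assumes input-major weights, W and grad of shape (n_in, n_out)."""
    return W - lr * (P @ grad)

# Toy check: after many updates on inputs r, P maps those directions
# near zero, so new weight increments barely disturb old responses.
rng = np.random.default_rng(0)
n = 8
P = np.eye(n)
for _ in range(50):
    P = update_projector(P, rng.normal(size=n))
r_old = rng.normal(size=n)
P = update_projector(P, r_old)
print(np.linalg.norm(P @ r_old))  # small: r_old is nearly projected away
```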
In the fusion unit, feature fusion is performed by pixel-wise addition, i.e.
X_{i,j,k} = X^{(1)}_{i,j,k} + X^{(2)}_{i,j,k},
where X is the fused feature matrix, (i, j, k) indexes the matrix dimensions, and X^{(1)}_{i,j,k} and X^{(2)}_{i,j,k} are the pixel values at row i, column j, and channel dimension k of the different feature matrices being fused (and similarly when more maps are added).
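A short sketch of this pixel-addition fusion, assuming the equal-resolution maps produced by the multi-scale prediction units:

```python
import torch

def fuse(feature_maps):
    """Pixel-wise addition fusion: X[i, j, k] is the sum of the
    corresponding pixel values across all equal-resolution maps."""
    fused = feature_maps[0].clone()
    for fm in feature_maps[1:]:
        fused += fm  # element-wise over (row i, column j, channel k)
    return fused

maps = [torch.randn(1, 64, 56, 56) for _ in range(3)]
print(fuse(maps).shape)  # torch.Size([1, 64, 56, 56])
```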
The high-precision image semantic segmentation method with cross-scene continuous learning capability provided by the invention can, after being trained on datasets of different scenes in turn, serve semantic segmentation in all of those scenes simultaneously. The invention effectively suppresses the memory degradation of the visual semantic segmentation system, ensures that the robot does not forget previously accumulated experience while learning new knowledge in a new environment, and continuously strengthens the robot system's understanding of its environment.
Although the invention has been described above with reference to various embodiments, it should be understood that many changes and modifications may be made without departing from the scope of the invention. That is, the methods, systems, and devices discussed above are examples. Various configurations may omit, substitute, or add various procedures or components as appropriate. For example, in alternative configurations, the methods may be performed in an order different than that described, and/or various components may be added, omitted, and/or combined. Moreover, features described with respect to certain configurations may be combined in various other configurations, as different aspects and elements of the configurations may be combined in a similar manner. Further, elements therein may be updated as technology evolves, i.e., many elements are examples and do not limit the scope of the disclosure or claims.
Specific details are given in the description to provide a thorough understanding of the exemplary configurations including implementations. However, configurations may be practiced without these specific details, for example, well-known circuits, processes, algorithms, structures, and techniques have been shown without unnecessary detail in order to avoid obscuring the configurations. This description provides example configurations only, and does not limit the scope, applicability, or configuration of the claims. Rather, the foregoing description of the configurations will provide those skilled in the art with an enabling description for implementing the described techniques. Various changes may be made in the function and arrangement of elements without departing from the spirit or scope of the disclosure.
It is intended that the foregoing detailed description be regarded as illustrative rather than limiting, and that it be understood that it is the following claims, including all equivalents, that are intended to define the spirit and scope of this invention. The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (7)

1. A high-precision image semantic segmentation method with continuous learning capability is characterized by comprising the following steps:
Step S100, extracting preliminary image features by using a fully convolutional neural network unit;
Step S200, defining a search space SP, and obtaining by a neural network architecture search method the multi-scale prediction unit that gives the best segmentation effect;
Step S300, fusing the features extracted by the multi-scale prediction units in step S200 to obtain a refined high-resolution feature map, thereby realizing high-precision image semantic segmentation;
Step S400, introducing a continuous learning mechanism into the feature extraction modules of the fully convolutional neural network unit and the multi-scale prediction units, so as to overcome the memory degradation that semantic segmentation networks suffer when data scale grows gradually and image content varies widely, giving the image semantic segmentation and scene understanding system a more intelligent, human-like continuous learning ability and stronger adaptability to complex, changing scenes.
2. The high-precision image semantic segmentation method with continuous learning capability according to claim 1, characterized in that, in step S200, the search space SP comprises a series of 1 × 1 convolutions and 3 × 3 convolutions; the search space SP is used to determine the multi-scale prediction unit structure, and hole (dilated) convolutions with different sampling rates are introduced into it; the search space SP is specifically defined as follows:
SP0 = convolution with a 1 × 1 kernel;
SP1 = convolution with a 3 × 3 kernel;
SP2 = hole convolution with a 3 × 3 kernel and sampling rate ∈ {2, 4, 6, 8, 12, 18, 24};
SP3 = ReLU activation function;
SP = {SP0, SP1, SP2, SP3}.
3. The method according to claim 1, characterized in that, in step S200, the neural network architecture search method employs a mean Intersection-over-Union evaluation method to optimize an objective function f = f(x), where x = {x0, x1, x2, …, xn} denotes the structure combinations of the multi-scale prediction units searched from the search space SP.
4. The method according to claim 1, wherein in step S200, the input of the multi-scale prediction unit is the output of each layer of the fully convolutional neural network, and the output is an equal-resolution feature map response.
5. The method for semantic segmentation of high-precision images with continuous learning ability according to claim 1, wherein in step S300, the fusion means performs feature fusion by using a pixel addition method.
6. The method according to claim 1, characterized in that, in step S400, the continuous learning mechanism modifies the neural network weights only in the direction orthogonal to the input space of old tasks; that is, the final weight increment is determined by ΔW = PΔW0, where P is the orthogonal projection matrix of the input space, ΔW0 is the gradient computed by the neural network on the original input space, and ΔW is the weight increment in the orthogonal direction of the input space.
7. The method of claim 6, wherein the input-space orthogonal projection matrix P is obtained recursively by the following formulas:
p_l(i) = p_l(i-1) - k_l(i) r_{l-1}(i) p_l(i-1) / (α + r_{l-1}(i) k_l(i)),
k_l(i) = p_l(i-1) r_{l-1}(i)^T,
where p_l(i) denotes the orthogonal projection matrix of the l-th layer's input space at the i-th training batch, α is a constant not greater than 1, and r_{l-1}(i) denotes the mean of the layer-(l-1) outputs over the i-th training batch.
CN202010289285.0A 2020-04-14 2020-04-14 High-precision image semantic segmentation method with continuous learning capability Active CN111428739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010289285.0A CN111428739B (en) 2020-04-14 2020-04-14 High-precision image semantic segmentation method with continuous learning capability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010289285.0A CN111428739B (en) 2020-04-14 2020-04-14 High-precision image semantic segmentation method with continuous learning capability

Publications (2)

Publication Number Publication Date
CN111428739A true CN111428739A (en) 2020-07-17
CN111428739B CN111428739B (en) 2023-08-25

Family

ID=71556325

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010289285.0A Active CN111428739B (en) 2020-04-14 2020-04-14 High-precision image semantic segmentation method with continuous learning capability

Country Status (1)

Country Link
CN (1) CN111428739B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160358024A1 (en) * 2015-06-03 2016-12-08 Hyperverge Inc. Systems and methods for image processing
CN110136141A (en) * 2019-04-24 2019-08-16 佛山科学技术学院 A kind of image, semantic dividing method and device towards complex environment
CN110991513A (en) * 2019-11-27 2020-04-10 佛山科学技术学院 Image target recognition system and method with human-like continuous learning capability

Also Published As

Publication number Publication date
CN111428739B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN111626128B (en) Pedestrian detection method based on improved YOLOv3 in orchard environment
WO2018052875A1 (en) Image depth prediction neural networks
CN113657388B (en) Image semantic segmentation method for super-resolution reconstruction of fused image
WO2022001805A1 (en) Neural network distillation method and device
CN111667535B (en) Six-degree-of-freedom pose estimation method for occlusion scene
CN111462130A (en) Method and apparatus for detecting lane line included in input image using lane mask
CN111376273B (en) Brain-like inspired robot cognitive map construction method
CN111797881B (en) Image classification method and device
CN110991513A (en) Image target recognition system and method with human-like continuous learning capability
JP2010157118A (en) Pattern identification device and learning method for the same and computer program
CN111260660B (en) 3D point cloud semantic segmentation migration method based on meta-learning
CN111461213A (en) Training method of target detection model and target rapid detection method
CN110281949B (en) Unified hierarchical decision-making method for automatic driving
CN111723829A (en) Full-convolution target detection method based on attention mask fusion
CN113838135B (en) Pose estimation method, system and medium based on LSTM double-flow convolutional neural network
CN112001225A (en) Online multi-target tracking method, system and application
CN115147488B (en) Workpiece pose estimation method and grabbing system based on dense prediction
CN112258565A (en) Image processing method and device
CN111428739B (en) High-precision image semantic segmentation method with continuous learning capability
Jo et al. Mixture density-PoseNet and its application to monocular camera-based global localization
CN112561947A (en) Image self-adaptive motion estimation method and application
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN114707611B (en) Mobile robot map construction method, storage medium and equipment based on graph neural network feature extraction and matching
CN115690170A (en) Method and system for self-adaptive optical flow estimation aiming at different-scale targets
KR20230046361A (en) Apparatus for diagnosing the state of production facilities based on deep neural network and method therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant