CN112053439A - Method, device and equipment for determining instance attribute information in image and storage medium

Info

Publication number: CN112053439A (application CN202011042869.4A); granted as CN112053439B
Authority: CN (China)
Prior art keywords: image, detected, pixel, determining, network
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: 单鼎一, 梅树起
Current and original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority application: CN202011042869.4A

Classifications

    • G06T17/05: Geographic models (three-dimensional [3D] modelling)
    • G06F18/2321: Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/253: Fusion techniques of extracted features
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods for neural networks
    • G06T7/13: Edge detection
    • G06T7/194: Segmentation involving foreground-background segmentation
    • G06T7/41: Analysis of texture based on statistical description of texture
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06T2207/10032: Satellite or aerial image; Remote sensing
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06V2201/07: Target detection


Abstract

The application discloses a method, an apparatus, a device, and a storage medium for determining instance attribute information in an image. The method includes: acquiring an image to be detected; down-sampling the image to be detected to obtain a shared feature; performing position offset prediction on the shared feature to obtain the position offset of each pixel in the image to be detected; performing height prediction on the shared feature to obtain the height information of each pixel in the image to be detected; determining the fusion feature of each pixel in the image to be detected from the shared feature; determining the pixel set corresponding to each instance category from the fusion features of the pixels; and determining the attribute information of the instance corresponding to each instance category from the fusion features, position offsets, and height information of the pixels in that pixel set. The method and the apparatus achieve accurate segmentation of the instances in the image and accurate prediction of the height and position offset of each instance.

Description

Method, device and equipment for determining instance attribute information in image and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining instance attribute information in an image.
Background
Building detection in satellite imagery is usually performed with an instance detection algorithm: the roof of a building instance is detected first and then either used directly as the building base, or separate offset and height prediction models are designed for individual buildings. Because of differences in satellite viewing angles, the roofs and bases of many buildings are severely offset from each other in real scenes, so directly using the roof as the base has serious defects: buildings may appear to cover the road network, fall into water, and so on. Using several models to predict separately is time-consuming and labor-intensive, and the technique lags behind.
In the prior art, the industry mostly handles satellite-image detection tasks with the two-stage object detection algorithms of the Mask R-CNN family. The first stage is a coarse detection of the roof position of each target instance and outputs a series of rectangular target boxes. In the second stage, the features extracted at the first-stage position boxes are used as input; a regression-classification network is responsible for classifying positive samples and accurately regressing the circumscribed rectangular box, while a semantic segmentation network is responsible for pixel-level foreground-background segmentation of the single instance. If prediction of building height and building offset is also to be considered, attribute prediction branches for height and offset can be designed in the second stage.
Take the classic object detection algorithm Mask R-CNN as an example, with height and offset prediction branches added in the second stage. Mask R-CNN can detect roof positions well, but the input of the second stage is the first-stage roof target box, and for tall or severely tilted buildings the first-stage box cannot fully contain all elements of the target (such as side-elevation textures), so the height and offset predicted by the second stage are seriously wrong; predicting height and offset in the second stage from only the elements inside the rectangle severely lacks texture information. In addition, candidate-box-based detection algorithms such as Mask R-CNN give poor recall for large, irregularly shaped single buildings and depend heavily on the preset candidate box sizes.
Therefore, it is necessary to provide a method, an apparatus, a device and a storage medium for determining example attribute information in an image, so as to implement accurate segmentation of examples in the image, accurately predict height and position offset of the examples and facilitate accurate drawing of example maps.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for determining example attribute information in an image, which can realize accurate segmentation of an example in the image, accurately predict the height and position offset of the example and facilitate accurate drawing of an example map.
In one aspect, the present application provides a method for determining instance attribute information in an image, where the method includes:
acquiring an image to be detected, wherein the image to be detected comprises a target number of instances;
performing downsampling processing on the image to be detected to obtain shared characteristics;
performing position offset prediction processing on the shared features to obtain the position offset of each pixel in the image to be detected;
performing height prediction processing on the shared features to obtain height information of each pixel in the image to be detected;
determining the fusion characteristic of each pixel in the image to be detected according to the sharing characteristic;
determining a pixel set corresponding to each instance type according to the fusion characteristics of each pixel in the image to be detected;
and determining attribute information of the instance corresponding to each instance type according to the fusion characteristics, the position offset and the height information of the pixels in the pixel set corresponding to each instance type.
Another aspect provides an apparatus for determining attribute information of an instance in an image, the apparatus comprising:
the image-to-be-detected acquisition module is used for acquiring an image to be detected, wherein the image to be detected comprises a target number of instances;
the shared characteristic determining module is used for carrying out downsampling processing on the image to be detected to obtain shared characteristics;
the position offset determining module is used for carrying out position offset prediction processing on the shared features to obtain the position offset of each pixel in the image to be detected;
the height information determining module is used for carrying out height prediction processing on the shared features to obtain the height information of each pixel in the image to be detected;
the fusion characteristic determining module is used for determining the fusion characteristic of each pixel in the image to be detected according to the sharing characteristic;
the pixel set determining module is used for determining a pixel set corresponding to each instance type according to the fusion characteristics of each pixel in the image to be detected;
and the attribute information determining module is used for determining the attribute information of the example corresponding to each example type according to the fusion characteristic, the position offset and the height information of the pixels in the pixel set corresponding to each example type.
Another aspect provides an apparatus for determining attribute information of an instance in an image, the apparatus including a processor and a memory, the memory having at least one instruction or at least one program stored therein, the at least one instruction or the at least one program being loaded and executed by the processor to implement the method for determining attribute information of an instance in an image as described above.
Another aspect provides a computer storage medium storing at least one instruction or at least one program, which is loaded and executed by a processor to implement the method for determining instance attribute information in an image as described above.
The method, the device, the equipment and the storage medium for determining the example attribute information in the image have the following technical effects:
the method comprises the steps of carrying out downsampling processing on an image to be detected comprising a target number example to obtain shared characteristics, then respectively determining the position offset, height information and fusion characteristic information of each pixel in the image to be detected according to the shared characteristics, and finally determining attribute information of each example; the method and the device realize accurate segmentation of the examples in the image, accurately predict the height and the position offset of each example and facilitate accurate drawing of the example map.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic diagram of an example attribute information determination system in an image according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a method for determining example attribute information in an image according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a method for determining a fusion feature of each pixel in the image to be detected according to the shared feature according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating a method for determining a semantic branch network, an example branch network, a first regression branch network, and a second regression branch network according to an embodiment of the present disclosure;
fig. 5 is a flowchart illustrating a method for determining attribute information of an instance corresponding to each instance category according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a network framework provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a predicted result of building roof provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a predicted result of a building base according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of an apparatus for determining example attribute information in an image according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like.
Specifically, the scheme provided by the embodiment of the application relates to the machine learning field of artificial intelligence. Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. In the present application, a large number of instances in an image are automatically segmented by a machine learning model, and the height information and position offset of each instance in the image are accurately obtained.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic diagram of an example attribute information determining system in an image according to an embodiment of the present application, and as shown in fig. 1, the example attribute information determining system in an image may at least include a server 01 and a client 02.
Specifically, in this embodiment of the present disclosure, the server 01 may include an independently operating server, or a distributed server, or a server cluster composed of a plurality of servers, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), and a big data and artificial intelligence platform. The server 01 may comprise a network communication unit, a processor, a memory, etc. Specifically, the server 01 may be configured to determine attribute information of an instance in an image.
Specifically, in the embodiment of the present disclosure, the client 02 may include a type of physical device such as a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, a smart wearable device, and a vehicle-mounted terminal, and may also include software running in the physical device, such as a web page provided by some service providers to a user, and an application provided by the service providers to the user. Specifically, the client 02 may be configured to display an image corresponding to each instance in the image to be detected.
The following describes a method for determining example attribute information in an image according to the present application, and fig. 2 is a schematic flowchart of a method for determining example attribute information in an image according to an embodiment of the present application, where the present specification provides the method operation steps as described in the embodiment or the flowchart, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In practice, the system or server product may be implemented in a sequential or parallel manner (e.g., parallel processor or multi-threaded environment) according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 2, the method may include:
s201: and acquiring an image to be detected, wherein the image to be detected comprises examples of the target number.
In the embodiment of this description, the image to be detected may include a large number of instances; the target number may be greater than 2 and may be set to be greater than a preset number. The instances in the same image are similar in structure, share the same attributes, and belong to the same category; they may be dense, fine-grained objects in the image to be detected, such as buildings or vehicles. When a map is drawn from the image to be detected, the instances in it need to be segmented and characteristics such as the size and position of each instance obtained. When an instance is an object with a certain height, for example a building, the height information and base position of each instance also need to be acquired.
S203: and performing downsampling processing on the image to be detected to obtain the sharing characteristics.
In this embodiment of the present specification, the down-sampling processing on the image to be detected to obtain the shared feature may include:
s2031: extracting an edge texture feature set of an image to be detected;
In this specification, the edge texture feature set may include a plurality of edge texture features. The edge texture features are the bottom-layer features of the deep learning network, that is, the feature maps of the front layers of the network, and their visual representations are usually points, lines, planes, and corners. The edge texture feature set of the image to be detected can be extracted through the bottom convolution layers, that is, through multiple convolution and pooling operations.
S2033: determining an edge texture combination characteristic according to an edge texture characteristic set of an image to be detected;
In the embodiment of this specification, deep learning is a gradual abstraction process: the bottom-layer edge texture features are combined and abstracted into middle-layer features, that is, local instance features, which are further abstracted into global instance features, that is, edge texture combination features, for instance category learning.
In the embodiment of the description, the edge texture features of the image to be detected are combined through the middle and high layer convolution layers to obtain edge texture combination features; the edge texture features can be fused through multiple convolution and pooling operations to obtain edge texture combination features; for example, after the points, lines, planes and angles of the examples in the image to be detected are obtained, the local features of the examples in the image to be detected are determined through the middle layer convolution layer, then the local features of the examples are further convolved and pooled through the high layer convolution layer, the overall features of the examples in the image to be detected, namely the edge texture combination features, are determined, and therefore example category learning is conducted.
S2035: carrying out normalized normal distribution processing on the edge texture combination characteristics to obtain normalized characteristics;
s2037: and carrying out nonlinear mapping processing on the normalized features to obtain shared features.
In the embodiment of the present specification, an edge texture feature set of an image to be detected may be extracted through a bottom convolution layer; combining the edge texture features of the image to be detected through the high-level convolution layer to obtain edge texture combination features; carrying out normalized normal distribution processing on the edge texture combination characteristics through a normalization layer to obtain normalized characteristics; and then carrying out nonlinear mapping processing on the normalized features through the activation layer to obtain shared features.
In the embodiment of the present specification, the shared feature of the image to be detected can be obtained through downsampling processing, the shared feature can be used for performing semantic segmentation and instance analysis processing, and the position offset and the height information of each pixel in the image to be detected can also be determined through the shared feature.
In the embodiment of this specification, down-sampling corresponds to the progressive deepening of the deep learning network: bottom-, middle-, and high-layer information is obtained by performing operations such as convolution and pooling multiple times, and both convolution and pooling shrink the feature size. In the down-sampling process illustrated by 04 in fig. 6, as the multiple convolution and pooling operations proceed from the bottom layer through the middle layer to the top convolution layer, the deep features are gradually scaled down in size while the number of channels gradually increases; for example, the image changes from 256 (length) x 256 (width) x 3 (channels) to 128 (length) x 128 (width) x 20 (channels), so the length and width are reduced and the number of channels is increased.
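For illustration only, the following minimal PyTorch-style sketch shows this kind of down-sampling stack; the layer sizes, channel counts (20, 40, 80), and module names are assumptions for the example and are not the patented network itself.

```python
import torch
import torch.nn as nn

class DownSampleBlock(nn.Module):
    """One convolution + pooling stage: halves spatial size, raises channel count."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),      # normalized normal-distribution processing
            nn.ReLU(inplace=True),       # non-linear mapping
        )
        self.pool = nn.MaxPool2d(2)      # halves height and width

    def forward(self, x):
        return self.pool(self.conv(x))

class SharedFeatureExtractor(nn.Module):
    """Bottom -> middle -> high convolution layers producing the shared feature."""
    def __init__(self):
        super().__init__()
        self.bottom = DownSampleBlock(3, 20)    # edge/point/line/corner features
        self.middle = DownSampleBlock(20, 40)   # local instance features
        self.high   = DownSampleBlock(40, 80)   # global edge-texture combination features

    def forward(self, img):
        return self.high(self.middle(self.bottom(img)))

x = torch.randn(1, 3, 256, 256)        # a 256 x 256 x 3 image to be detected
feat = SharedFeatureExtractor()(x)     # the bottom block alone would give 1 x 20 x 128 x 128
print(feat.shape)                      # torch.Size([1, 80, 32, 32]) shared feature
```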
In a specific embodiment, the image to be detected may be a building image, where an example is a building, the image includes thousands of building examples, and the extracted features of the bottom convolution layer, that is, the edge texture feature set of the image, are edges, points, lines, and corners of the building; the extraction features of the middle layer convolution layer are local features of a building, such as a top feature, a base feature, a height feature and the like of the building; the high-rise convolutional layer extracts features, namely edge texture combination features, and the features are the overall features of the building.
S205: and carrying out position offset prediction processing on the shared features to obtain the position offset of each pixel in the image to be detected.
Specifically, in the embodiment of this specification, an instance in the image to be detected may be an object having a certain height whose bottom contour is the same as its top contour; for example, the instance may be a building whose base contour matches its roof contour. In the actual detection process, only the top contour of an instance may be visible in the image to be detected, while its bottom contour cannot be fully displayed; the bottom contour then needs to be determined from the top contour. The top and bottom contours of an instance in the image are usually offset from each other to some extent, so obtaining the bottom contour of each instance requires not only the size and position of its top contour but also its position offset. In the embodiment of this specification, obtaining the position offset of each pixel in the image to be detected essentially means obtaining the position offset of each pixel in the target region of the image, where the target region is the top-contour region of each instance; the bottom (base) position of each instance can then be determined from the position offset of its top contour.
In this embodiment of the present specification, performing a position offset prediction process on the shared feature, and obtaining a position offset of each pixel in the image to be detected may include:
and performing position offset prediction processing on the shared features through a first regression branch network to obtain the position offset of each pixel in the image to be detected.
Specifically, in the embodiments of this specification, the position offset of each pixel is an offset vector of the pixel in the x and y directions; for non-foreground pixels, the supervision (guide) value of the offset vector is 0. The first regression branch network performs feature fusion using a feature pyramid (FPN) and uses the L2 least-squares loss as the guide function for the regression task.
S207: and performing height prediction processing on the shared features to obtain the height information of each pixel in the image to be detected.
In this embodiment of the present description, performing height prediction processing on the shared feature to obtain height information of each pixel in the image to be detected includes:
and performing height prediction processing on the shared features through a second regression branch network to obtain height information of each pixel in the image to be detected.
Specifically, in the embodiment of the present specification, the height information corresponding to each instance in the image to be detected may be determined by obtaining the height information of each pixel in the image to be detected.
Specifically, in the embodiment of this specification, when the instance is a building, a platform such as a mobile-phone map application needs not only the two-dimensional spatial position relationships of the background data but also a three-dimensional visualization map, so the spatial height of the building is also an essential element. The second regression branch network adopts a feature pyramid (FPN) for feature fusion; this branch is designed as a regression learning network and uses the L1 absolute-value loss as the guide function of the regression task.
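As a hedged illustration of the two regression branches described above (not the actual patented branches; the head design, channel count, and mask handling are assumptions), the sketch below adds per-pixel offset and height heads on top of a shared feature and supervises them with the L2 and L1 guide losses, with non-foreground pixels guided toward 0 through a foreground mask:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegressionHeads(nn.Module):
    """Per-pixel offset (dx, dy) and height heads on the shared feature map."""
    def __init__(self, in_ch=80):
        super().__init__()
        self.offset_head = nn.Conv2d(in_ch, 2, kernel_size=1)  # dx, dy per pixel
        self.height_head = nn.Conv2d(in_ch, 1, kernel_size=1)  # height per pixel

    def forward(self, shared_feat):
        return self.offset_head(shared_feat), self.height_head(shared_feat)

def regression_losses(pred_offset, pred_height, gt_offset, gt_height, fg_mask):
    # fg_mask is 1 for roof (foreground) pixels and 0 elsewhere; the guide value
    # for non-foreground pixels is 0, which the multiplication enforces.
    l2_offset = F.mse_loss(pred_offset * fg_mask, gt_offset * fg_mask)  # L2 guide function
    l1_height = F.l1_loss(pred_height * fg_mask, gt_height * fg_mask)   # L1 guide function
    return l2_offset, l1_height
```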
S209: and determining the fusion characteristic of each pixel in the image to be detected according to the sharing characteristic.
In an embodiment of the present specification, as shown in fig. 3, determining a fusion feature of each pixel in the image to be detected according to the shared feature may include:
s2091: performing semantic segmentation processing on the shared features to obtain semantic features of each pixel in the image to be detected;
specifically, in this embodiment of the present specification, the semantic features may include a background feature and a foreground feature of the image to be detected.
Specifically, in this embodiment of the present specification, performing semantic segmentation processing on the shared feature to obtain a semantic feature of each pixel in the image to be detected may include:
and performing semantic segmentation processing on the shared features by adopting a connected region and regional clustering method to obtain the semantic features of each pixel in the image to be detected, wherein the semantic features are background features or foreground features.
In an embodiment of the present specification, after the step of performing semantic segmentation processing on the shared features by using a connected region and sub-region clustering method to obtain a semantic feature of each pixel in an image to be detected, the method further includes:
a first mask of background features and a second mask of foreground features of an image to be detected are determined.
In the embodiment of the present specification, the background feature and the foreground feature of the image to be detected may be determined by a connected region regional clustering method, and a mask corresponding to the foreground feature and the background feature is generated, for example, the mask of the foreground feature may be set to 1, and the mask of the background feature may be set to 0, so as to distinguish the foreground from the background.
In an embodiment of the present specification, performing semantic segmentation processing on the shared feature to obtain a semantic feature of each pixel in the image to be detected includes:
and performing semantic segmentation processing on the shared features through a semantic branch network to obtain the semantic features of each pixel in the image to be detected.
In this embodiment of the specification, the semantic branch network may include a plurality of up-sampling modules and adopts a Feature Pyramid Networks (FPN) strategy. The up-sampling modules apply deconvolution layers to the shared feature: each up-sampling module performs one deconvolution operation, thereby enlarging the feature size, and provides the feature information needed for up-sampling and fusion at the next higher layer. The input of each up-sampling module comes not only from the output features of the previous up-sampling module but also from the shared feature layer of the same size in the down-sampling process; to better fuse the feature information, the two features are added inside the module and a convolution operation is performed to fuse the information, so that foreground and background prediction is carried out on the image and the semantic features are obtained.
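A minimal sketch of one such up-sampling module, assuming a PyTorch implementation with element-wise addition of the skip feature (an illustration, not the patented module), could look like:

```python
import torch.nn as nn

class UpSampleModule(nn.Module):
    """Deconvolve, add the same-size skip feature from down-sampling, then fuse."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # size x2
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.deconv(x)   # enlarge to the spatial size of the skip feature
        x = x + skip         # element-wise addition of the two feature sources
        return self.fuse(x)  # convolution to fuse the information
```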
In the embodiments of this specification, a binary image has only two luminance states: black (0) and white (255). In practical applications, the analysis of many images is ultimately converted into the analysis of binary images, for example detecting the foreground of an image. The most important method for binary image analysis is connected-region labeling, which is the basis of all binary image analysis: by labeling the white pixels (targets) in a binary image, each individual connected region forms an identified block, from which geometric parameters such as the contour, circumscribed rectangle, centroid, and invariant moments can be further obtained. In an image the smallest unit is a pixel; each pixel has 8 surrounding neighbors, and there are two common adjacency relations: 4-adjacency and 8-adjacency. 4-adjacency covers 4 points, namely up, down, left, and right; 8-adjacency covers 8 points, including the diagonally located ones. Visually, points that are connected to each other form one region, while points that are not connected form different regions; the set of all mutually connected points is called a connected region. The connected-region regional clustering method can segment the foreground and background of the image; when the instance is a building, the foreground is the building.
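As a small illustration of connected-region labeling under 4- and 8-adjacency (using SciPy's standard labeling routine, which the patent does not mandate):

```python
import numpy as np
from scipy import ndimage

mask = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 1, 0],
                 [0, 0, 1, 1]], dtype=np.uint8)   # binary foreground mask (1 = white/255)

four_conn  = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])  # 4-adjacency structure
eight_conn = np.ones((3, 3), dtype=int)                    # 8-adjacency includes diagonals

labels4, n4 = ndimage.label(mask, structure=four_conn)   # 2 regions under 4-adjacency
labels8, n8 = ndimage.label(mask, structure=eight_conn)  # 1 region under 8-adjacency
print(n4, n8)
```

The two diagonally touching blobs are separate regions under 4-adjacency but merge into a single region under 8-adjacency.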
S2093: and carrying out example analysis processing on the shared characteristics to obtain example characteristics of each pixel in the image to be detected.
Specifically, in the embodiment of the present specification, the example feature may include a texture feature of each pixel in the image to be detected.
In the embodiment of the present specification, the texture feature may be characterized by a feature vector of a fixed dimension (e.g., 8 dimensions).
In this embodiment of the present specification, performing example analysis processing on the shared features to obtain an example feature of each pixel in the image to be detected may include:
and carrying out example analysis processing on the shared characteristics through an example branch network to obtain example characteristics.
Specifically, in this embodiment of the present specification, the example features may further include a texture feature and a spatial position feature of each pixel in the image to be detected.
In the embodiment of this specification, a plurality of up-sampling modules may apply deconvolution layers to the shared feature; each up-sampling module performs one deconvolution operation, thereby enlarging the feature size, and provides the feature information needed for up-sampling and fusion at the next higher layer. The input of each up-sampling module comes not only from the output feature of the previous up-sampling module but also from the shared feature layer of the same size (area) in the down-sampling process; to better fuse the feature information, the two features are added inside the module and a convolution operation is performed to fuse the information, so that the pixel features in the image are learned and the instance features are obtained. A clustering loss is used in the training process of the instance branch network, and each pixel involved in the instance loss has a corresponding instance label. The instance features include the spatial position feature of each pixel in the image to be detected, which can be characterized by the pixel's two-dimensional spatial coordinates. The spatial position features reflect the spatial-region differences between pixels, which improves the feature similarity of adjacent pixels and increases the spatial difference between pixels far away from each other, so that pixels of different instances can be prevented from being clustered together.
In the embodiment of this specification, the texture feature of a pixel can be characterized by an eight-dimensional feature vector; the instance feature is then represented by a ten-dimensional feature vector, of which the remaining two dimensions are the spatial coordinates of the pixel. In this way, both the texture differences and the spatial-region differences between different instances can be distinguished.
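The following sketch (hypothetical PyTorch code; the normalization of coordinates to [0, 1] is an assumption) shows how two coordinate channels can be concatenated to an eight-dimensional texture embedding to form the ten-dimensional instance feature described above:

```python
import torch

def add_coordinate_channels(embedding):
    """embedding: (B, 8, H, W) texture features -> (B, 10, H, W) instance features."""
    b, _, h, w = embedding.shape
    ys = torch.linspace(0, 1, h).view(1, 1, h, 1).expand(b, 1, h, w)  # normalized row coordinate
    xs = torch.linspace(0, 1, w).view(1, 1, 1, w).expand(b, 1, h, w)  # normalized column coordinate
    return torch.cat([embedding, ys, xs], dim=1)

feat = add_coordinate_channels(torch.randn(2, 8, 128, 128))
print(feat.shape)   # torch.Size([2, 10, 128, 128])
```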
In this embodiment of the present specification, before the step of performing semantic segmentation processing on the shared feature through a semantic branch network to obtain a semantic feature of each pixel in the image to be detected, as shown in fig. 4, the method may further include:
s401: constructing a cross entropy loss function of the first network;
s403: constructing an intra-class aggregation degree loss function and an inter-class distinction degree loss function of the second network;
s405: constructing a square error loss function of a third network;
s407: constructing an absolute value loss function of a fourth network;
s409: determining the sum of a cross entropy loss function, an intra-class polymerization degree loss function, an inter-class discrimination degree loss function, a squared error loss function and an absolute value loss function as a comprehensive loss function;
s4011: respectively adjusting parameters of a first network, a second network, a third network and a fourth network to obtain a current first network, a current second network, a current third network and a current fourth network;
s4013: calculating the comprehensive loss values corresponding to the current first network, the current second network, the current third network and the current fourth network;
s4015: and when the comprehensive loss value is smaller than a preset threshold value, determining the current first network as a semantic branch network, determining the current second network as an example branch network, determining the current third network as a first regression branch network, and determining the current fourth network as a second regression branch network.
In an embodiment of the present specification, the method may further include:
s4017: when the comprehensive loss value is greater than or equal to a preset threshold value, repeating the following steps: and respectively adjusting parameters of the first network, the second network, the third network and the fourth network to obtain the current first network, the current second network, the current third network and the current fourth network.
In the embodiment of the present specification, the preset threshold may be set according to actual situations. In the training process of the semantic branch network, a semantic label is required to be labeled for each pixel in a training image, wherein the semantic label comprises a foreground label and a background label; in the training process of the example branch network, each pixel in the training image needs to be labeled with a feature label, which may include texture features and spatial location features.
In this specification embodiment, the cross-entropy loss function of the first network may be:
L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]
where p_i is the prediction probability, y_i is the class label (0 or 1), and N is the number of features.
The intra-class cohesion loss function for the second network may be:
L_{intra} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c}\max\left(\lVert \mu_c - x_i \rVert - \delta_v,\ 0\right)^2
wherein C is the number of instances in the training image, N_c is the number of pixel features in class c, \delta_v is the intra-class penalty factor, \mu_c is the average value of the features in class c, and x_i is a pixel feature;
the inter-class discriminative power loss function of the second network may be:
L_{inter} = \frac{1}{C(C-1)}\sum_{c_a=1}^{C}\sum_{c_b \neq c_a}\max\left(\delta_d - \lVert \mu_{c_a} - \mu_{c_b} \rVert,\ 0\right)^2
where C is the number of instances in the training image, \delta_d is the inter-class penalty factor, and \mu_{c_a}, \mu_{c_b} are the averages of the features within classes c_a and c_b.
In this illustrative embodiment, the squared error loss function of the third network may be:
L_{offset} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2
where \hat{y}_i is the predicted position offset of the i-th instance, y_i is the true position offset of the i-th instance, and n is the number of instances in the training image.
In this specification embodiment, the absolute value loss function of the fourth network may be:
L_{height} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|
where \hat{y}_i is the predicted height of the i-th instance, y_i is the true height of the i-th instance, and n is the number of instances in the training image.
In this embodiment of this specification, the first network, the second network, the third network, and the fourth network are all in the same deep learning network, and the method of this embodiment may further include:
and constructing a regularization loss function of the deep learning network.
Specifically, in the embodiment of the present specification, determining the sum of the cross entropy loss function, the intra-class cohesion loss function, the inter-class discriminative power loss function, the square error loss function, and the absolute value loss function as the synthetic loss function may include:
and determining a cross entropy loss function, an intra-class cohesion loss function, an inter-class discrimination loss function, a square error loss function, an absolute value loss function and a regularization loss function as a comprehensive loss function.
In this embodiment of the present specification, the regularization loss function may be an L1 regularization function or an L2 regularization function, and when the comprehensive loss function is calculated, the regularization loss function is introduced, so that overfitting of a model corresponding to a network can be prevented, and the generalization capability of the model is improved.
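A minimal sketch of assembling the comprehensive loss, assuming a plain sum of the task losses plus an L2 regularization term with a hypothetical weight (the patent does not specify any weighting), is:

```python
import torch

def comprehensive_loss(l_ce, l_intra, l_inter, l_offset, l_height, model, weight_decay=1e-4):
    """Sum of the five task losses plus an L2 regularization term over the network weights."""
    l_reg = sum(p.pow(2).sum() for p in model.parameters())   # L2 regularization
    return l_ce + l_intra + l_inter + l_offset + l_height + weight_decay * l_reg
```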
In the embodiment of this specification, the deep learning network can be a U-Net, a classic fully convolutional network (i.e., with no fully connected operation). The input of the network is a picture whose edges have been mirror-padded. The left side of the network is a series of down-sampling operations composed of convolution and max pooling, called the compression path. The compression path consists of 4 blocks; each block uses 3 valid convolutions and 1 max-pooling down-sampling, and the number of feature maps is doubled after each down-sampling. The right part of the network is the expansion path. In each of its blocks, the feature size is doubled by deconvolution, the number of feature maps is halved (the last layer is slightly different), and the result is merged with the feature map of the symmetric block in the left compression path. Because the feature maps of the left compression path and the right expansion path differ in size, U-Net crops the feature map of the compression path to the same size as that of the expansion path before merging. The convolutions in the expansion path still use valid convolution operations.
S2095: and fusing the semantic features and the example features of each pixel in the image to be detected to determine the fusion features of each pixel in the image to be detected.
In an embodiment of the present specification, fusing the semantic features and the example features, and determining a fused feature of each pixel in the image to be detected may include:
s20951: fusing a first mask of the background feature of the image to be detected with the texture feature and the spatial position feature of the pixel corresponding to the background in the image to be detected to obtain a first fusion result;
s20953: fusing a second mask of the foreground characteristic of the image to be detected with the texture characteristic and the spatial position characteristic of the corresponding foreground pixel in the image to be detected to obtain a second fusion result;
s20955: and determining the fusion characteristic of each pixel in the image to be detected according to the first fusion result and the second fusion result.
In the embodiment of this specification, the strategy of connected-region regional clustering combined with spatial position feature fusion ensures, on the one hand, that pixels in different regions are not clustered into the same category and, on the other hand, speeds up clustering.
S2011: and determining a pixel set corresponding to each instance type according to the fusion characteristics of each pixel in the image to be detected.
In an embodiment of the present specification, determining, according to the fusion feature of each pixel in the image to be detected, a pixel set corresponding to each instance category includes:
s20111: determining the instance category of each pixel in the image to be detected according to the fusion characteristic of each pixel in the image to be detected;
s20113: and determining a pixel set corresponding to each instance category through a density clustering algorithm.
In the embodiment of this specification, density-based clustering methods cluster according to the density of the data set in its spatial distribution and do not need the number of clusters to be set in advance, so they are particularly suitable for clustering data sets whose content is unknown. Representative algorithms are DBSCAN and OPTICS. Taking DBSCAN as an example, its objective is to find the largest sets of density-connected objects; the classic Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is a density clustering algorithm based on high-density connected regions.
The basic algorithm flow of DBSCAN is as follows: starting from an arbitrary object P, all objects that are density-reachable from P are extracted by breadth-first search according to the distance threshold and the density parameters, yielding one cluster. If P is a core object, the reachable objects are marked as the current class at once and the cluster is expanded from them. After a complete cluster is obtained, a new object is selected and the above process is repeated. If P is a boundary object, it is marked as noise and discarded.
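For illustration, the foreground pixels can be grouped into instances by running a standard DBSCAN implementation over their fusion features; scikit-learn is used here only as an example, and the eps and min_samples values are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_instances(fused, fg_mask, eps=0.5, min_samples=20):
    """fused: (H, W, 10) fusion features; fg_mask: (H, W) boolean foreground mask."""
    coords = np.argwhere(fg_mask)          # (N, 2) positions of foreground pixels
    feats = fused[fg_mask]                 # (N, 10) fusion features of those pixels
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    # labels[i] is the instance id of the pixel at coords[i]; -1 marks noise
    return coords, labels
```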
Specifically, in the embodiment of the present specification, an example polygon formed by a small number of points can be generated by the density-clustered example through a suitable vectorization algorithm; the image corresponding to the instance may be a polygonal structure; the triggering operation on the display interface may be sliding, clicking or dragging or other operations of the user on the display interface, for example, "image preview" in the display interface may be clicked, an image corresponding to each instance in the image to be detected is constructed and displayed, and the instances in the image are separated from each other.
In this embodiment, after the step of determining, by a density clustering algorithm, a set of pixels corresponding to each instance category, the method may further include:
sending a pixel set corresponding to each instance type to a terminal; and enabling the terminal to respond to the operation on the display interface, and constructing and displaying the image corresponding to each instance in the image to be detected.
Specifically, in the embodiment of the present specification, the examples in the image may be identified by using different colors, so that a user can distinguish different examples in the image conveniently; the terminal can comprise a map application program, and the map application program can respond to the operation on the display interface and construct and display the image corresponding to each instance in the image to be detected; therefore, the map information corresponding to the image to be detected is visually displayed to the user.
S2013: and determining attribute information of the instance corresponding to each instance type according to the fusion characteristics, the position offset and the height information of the pixels in the pixel set corresponding to each instance type.
In this embodiment of the present specification, as shown in fig. 5, determining attribute information of an instance corresponding to each instance category according to the fusion feature, the position offset, and the height information of the pixels in the pixel set corresponding to each instance category includes:
s20131: and determining the fusion characteristic of the example corresponding to each example type according to the fusion characteristic of the pixels in the pixel set corresponding to each example type.
S20133: and determining the position offset of the example corresponding to each example type according to the position offset of the pixels in the pixel set corresponding to each example type.
In the embodiments of the present specification, the attribute information of the instance may include a fusion feature, a position offset, height information, and the like of the instance.
Specifically, in this embodiment of the present specification, determining, according to a position offset of a pixel in a pixel set corresponding to each instance type, a position offset of an instance corresponding to each instance type includes:
sorting the position offset of each pixel in the pixel set corresponding to each instance type from small to large;
and determining the median of the position offset of each pixel as the position offset of the corresponding example of each example category.
S20135: and determining the height information of the example corresponding to each example type according to the height information of the pixels in the pixel set corresponding to each example type.
Specifically, in this embodiment of the present specification, determining the height information of the instance corresponding to each instance category according to the height information of the pixels in the pixel set corresponding to each instance category includes:
sorting the height information of each pixel in the pixel set corresponding to each instance type from small to large;
and determining the median in the height information of each pixel as the height information of the corresponding example of each example category.
In the embodiment of the present specification, the median of the position offset in the pixel set corresponding to each instance may be used as the position offset of the instance, and the median of the height in the pixel set corresponding to each instance may be used as the height of the instance, so as to realize accurate prediction of the position offset and the height information of the instance.
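The per-instance median aggregation described above can be sketched with NumPy as follows (illustrative helper names, not the patented implementation):

```python
import numpy as np

def instance_attributes(labels, offsets, heights):
    """labels: (N,) instance id per pixel; offsets: (N, 2) dx, dy; heights: (N,)."""
    attrs = {}
    for inst_id in np.unique(labels[labels >= 0]):   # skip the noise label -1
        sel = labels == inst_id
        attrs[inst_id] = {
            "offset": np.median(offsets[sel], axis=0),   # median dx, dy of the pixel set
            "height": np.median(heights[sel]),           # median height of the pixel set
        }
    return attrs
```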
In this embodiment of the present specification, after the step of determining attribute information of the corresponding instance of each instance category, the method of this embodiment may further include:
and constructing an image corresponding to each instance in the image to be detected according to the attribute information of the instance corresponding to each instance type.
In this embodiment of the present specification, after the step of determining attribute information of the corresponding instance of each instance category, the method of this embodiment may further include:
sending attribute information of the corresponding instance of each instance type to a terminal; and enabling the terminal to respond to the operation on the display interface, and constructing and displaying the image corresponding to each instance in the image to be detected.
Specifically, in the embodiment of the present specification, the instances in the image may be identified with different colors, so that a user can conveniently distinguish the different instances in the image; the terminal may include a map application program, and the map application program may, in response to an operation on the display interface, construct and display the image corresponding to each instance in the image to be detected, so that the map information corresponding to the image to be detected is visually presented to the user.
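As a hedged sketch of this per-instance visualization (the color scheme, array layout, and function name are assumptions made only for illustration), each instance's pixel set can be painted with its own color into an overlay image:

```python
import numpy as np

def render_instances(image_shape, instance_pixel_sets):
    """Paint each instance's pixels with a distinct color for display.

    image_shape: (H, W) of the image to be detected.
    instance_pixel_sets: dict mapping instance id -> (N, 2) array of (row, col) pixels.
    Returns an (H, W, 3) uint8 overlay in which each instance has its own color.
    """
    rng = np.random.default_rng(0)                # fixed seed -> repeatable colors
    overlay = np.zeros((*image_shape, 3), dtype=np.uint8)
    for instance_id, pixels in instance_pixel_sets.items():
        color = rng.integers(64, 255, size=3, dtype=np.uint8)  # one color per instance
        overlay[pixels[:, 0], pixels[:, 1]] = color
    return overlay
```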
In a specific embodiment, the network framework corresponding to the method of the present application is divided into six parts, namely a feature extraction and downsampling part, a semantic feature extraction branch, an instance feature extraction branch, a position offset prediction branch, a height prediction branch, and an instance clustering part, as shown in fig. 6.
The model corresponding to this network framework is the instance segmentation image determination model; in the application process, the image to be detected is directly input into the instance segmentation image determination model, and the output instance segmentation image can be obtained. Specifically, the downsampling processing network 04 first processes the image 03 to be detected to obtain a shared feature; then, the shared feature is respectively input into an instance branch network 05, a semantic branch network 06, a first regression branch network 07 and a second regression branch network 08 to obtain an instance feature map 09 and a semantic feature map 10; finally, an instance pixel cluster map 11 is obtained according to the instance feature map 09 and the semantic feature map 10, and an instance segmentation image 12 is obtained according to the instance pixel cluster map 11, the position offset output by the first regression branch network 07, and the height information output by the second regression branch network 08.
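The data flow described above can be summarized by the following minimal sketch. The backbone depth, channel widths, branch head designs, and the use of PyTorch are assumptions made only for illustration; the patent does not fix a concrete network architecture, and the clustering of instance embeddings into per-instance pixel sets is a separate post-processing stage not shown here.

```python
import torch
import torch.nn as nn

class InstanceAttributeNet(nn.Module):
    """Sketch of a shared backbone feeding four parallel branches."""

    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        # Downsampling processing producing the shared feature (placeholder depth).
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Branch heads; channel counts are illustrative assumptions.
        self.semantic_branch = nn.Conv2d(feat_channels, 2, 1)  # per-pixel class scores
        self.instance_branch = nn.Conv2d(feat_channels, 8, 1)  # per-pixel instance embedding
        self.offset_branch = nn.Conv2d(feat_channels, 2, 1)    # per-pixel (dx, dy) position offset
        self.height_branch = nn.Conv2d(feat_channels, 1, 1)    # per-pixel height

    def forward(self, image):
        shared = self.backbone(image)  # shared feature used by all four branches
        return {
            "semantic": self.semantic_branch(shared),
            "instance": self.instance_branch(shared),
            "offset": self.offset_branch(shared),
            "height": self.height_branch(shared),
        }

# Hypothetical usage on a single 256x256 RGB image.
outputs = InstanceAttributeNet()(torch.randn(1, 3, 256, 256))
print({name: tuple(t.shape) for name, t in outputs.items()})
```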
In a specific embodiment, a schematic diagram of the prediction result for building roofs is shown in fig. 7, where the areas within the rectangular frames 13 are the predicted roof contours; a schematic diagram of the prediction result for building bases is shown in fig. 8, where the rectangular frames 14 are the predicted base contours. When the method is used to test buildings in the top 50 cities nationwide, the roof prediction accuracy for conventional building regions reaches 97%, and the base prediction accuracy for conventional building regions reaches 95%. Compared with the existing manual labeling method, the efficiency of obtaining map data by the method of this embodiment is improved by a factor of 10.
According to the technical solution provided by the embodiment of the present specification, downsampling processing is performed on the image to be detected, which includes a target number of instances, to obtain a shared feature; the position offset, the height information and the fusion feature of each pixel in the image to be detected are then respectively determined according to the shared feature; and the attribute information of each instance is finally determined. The solution thereby achieves accurate segmentation of the instances in the image, accurately predicts the height and the position offset of each instance, and facilitates accurate drawing of an instance map.
An embodiment of the present application further provides an apparatus for determining instance attribute information in an image, as shown in fig. 9, the apparatus includes:
an image to be detected acquisition module 910, configured to acquire an image to be detected, where the image to be detected includes a target number of instances;
a shared feature determining module 920, configured to perform downsampling on the image to be detected to obtain a shared feature;
a position offset determining module 930, configured to perform position offset prediction processing on the shared feature to obtain a position offset of each pixel in the image to be detected;
a height information determining module 940, configured to perform height prediction processing on the shared feature to obtain height information of each pixel in the image to be detected;
a fusion feature determining module 950, configured to determine a fusion feature of each pixel in the image to be detected according to the shared feature;
a pixel set determining module 960, configured to determine a pixel set corresponding to each instance type according to a fusion feature of each pixel in the image to be detected;
the attribute information determining module 970 is configured to determine attribute information of an instance corresponding to each instance type according to the fusion feature, the position offset, and the height information of the pixels in the pixel set corresponding to each instance type.
In some embodiments, the fused feature determination module may include:
the semantic feature determining unit of the pixel is used for performing semantic segmentation processing on the shared feature to obtain the semantic feature of each pixel in the image to be detected;
the instance feature determining unit of the pixel is used for performing instance analysis processing on the shared feature to obtain the instance feature of each pixel in the image to be detected;
and the fusion feature determining unit of the pixel is used for fusing the semantic feature and the instance feature of each pixel in the image to be detected, and determining the fusion feature of each pixel in the image to be detected.
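The patent does not pin down the exact fusion operation; one simple possibility, shown here purely as an assumed illustration, is channel-wise concatenation of the per-pixel semantic and instance features.

```python
import numpy as np

def fuse_pixel_features(semantic_feat, instance_feat):
    """Fuse per-pixel semantic and instance features by channel-wise concatenation.

    semantic_feat: (H, W, Cs) semantic features; instance_feat: (H, W, Ci) instance features.
    Returns an (H, W, Cs + Ci) fused feature map.
    """
    return np.concatenate([semantic_feat, instance_feat], axis=-1)

# Hypothetical shapes: 2 semantic channels and 8 instance-embedding channels per pixel.
fused = fuse_pixel_features(np.zeros((64, 64, 2)), np.zeros((64, 64, 8)))
print(fused.shape)  # (64, 64, 10)
```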
In some embodiments, the apparatus may further comprise:
and the image construction module is used for constructing an image corresponding to each instance in the image to be detected according to the attribute information of the instance corresponding to each instance type.
In some embodiments, the attribute information determination module may include:
the instance fusion feature determining unit is used for determining the fusion feature of the instance corresponding to each instance type according to the fusion feature of the pixels in the pixel set corresponding to each instance type;
the instance position offset determining unit is used for determining the position offset of the instance corresponding to each instance type according to the position offsets of the pixels in the pixel set corresponding to each instance type;
and the instance height information determining unit is used for determining the height information of the instance corresponding to each instance type according to the height information of the pixels in the pixel set corresponding to each instance type.
In some embodiments, the semantic feature determining unit of the pixel may include:
and the semantic feature determining subunit is used for performing semantic segmentation processing on the shared features through a semantic branch network to obtain the semantic features of each pixel in the image to be detected.
In some embodiments, the instance feature determination unit of the pixel may include:
and the instance feature determining subunit of the pixel is used for performing instance analysis processing on the shared feature through an instance branch network to obtain the instance feature of each pixel in the image to be detected.
In some embodiments, the position offset determination module may include:
the position offset determining unit is used for performing position offset prediction processing on the shared feature through a first regression branch network to obtain the position offset of each pixel in the image to be detected.
In some embodiments, the height information determination module may include:
and the height information determining unit is used for performing height prediction processing on the shared characteristic through a second regression branch network to obtain the height information of each pixel in the image to be detected.
In some embodiments, the apparatus may further comprise:
the first function construction module is used for constructing a cross entropy loss function of the first network;
the second function building module is used for building an intra-class aggregation degree loss function and an inter-class differentiation degree loss function of the second network;
the third function construction module is used for constructing a square error loss function of a third network;
the fourth function construction module is used for constructing an absolute value loss function of a fourth network;
a comprehensive loss function determining module, configured to determine the sum of the cross entropy loss function, the intra-class aggregation degree loss function, the inter-class differentiation degree loss function, the squared error loss function, and the absolute value loss function as a comprehensive loss function;
a parameter adjusting module, configured to adjust parameters of the first network, the second network, the third network, and the fourth network, respectively, to obtain a current first network, a current second network, a current third network, and a current fourth network;
a comprehensive loss value calculating module, configured to calculate a comprehensive loss value corresponding to the current first network, the current second network, the current third network, and the current fourth network;
and the network determining module is used for determining the current first network as the semantic branch network, determining the current second network as the example branch network, determining the current third network as the first regression branch network and determining the current fourth network as the second regression branch network when the comprehensive loss value is smaller than a preset threshold value.
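The composite training objective assembled by the modules above can be sketched as follows. This is only an assumed PyTorch illustration: the patent does not specify the exact form of the intra-class aggregation and inter-class differentiation terms, so a standard discriminative-loss formulation is used here, and the margin values delta_v and delta_d, the function names, and the tensor layouts are all hypothetical.

```python
import torch
import torch.nn.functional as F

def discriminative_terms(embeddings, instance_ids, delta_v=0.5, delta_d=1.5):
    """Intra-class aggregation and inter-class differentiation terms on pixel embeddings.

    embeddings: (N, D) instance-branch embeddings for N pixels.
    instance_ids: (N,) integer instance labels for the same pixels.
    """
    centers, intra = [], embeddings.new_zeros(())
    for k in instance_ids.unique():
        emb_k = embeddings[instance_ids == k]
        center = emb_k.mean(dim=0)
        centers.append(center)
        # Pull pixels of the same instance toward their center (intra-class aggregation).
        intra = intra + F.relu(torch.norm(emb_k - center, dim=1) - delta_v).pow(2).mean()
    centers = torch.stack(centers)
    intra = intra / len(centers)
    inter = embeddings.new_zeros(())
    if len(centers) > 1:
        # Push centers of different instances apart (inter-class differentiation).
        dists = torch.cdist(centers, centers)
        off_diag = dists[~torch.eye(len(centers), dtype=torch.bool, device=dists.device)]
        inter = F.relu(2.0 * delta_d - off_diag).pow(2).mean()
    return intra, inter

def comprehensive_loss(sem_logits, sem_labels, embeddings, instance_ids,
                       offset_pred, offset_gt, height_pred, height_gt):
    """Sum of the five loss terms, one per branch, as the comprehensive loss."""
    ce = F.cross_entropy(sem_logits, sem_labels)                   # first network (semantic)
    intra, inter = discriminative_terms(embeddings, instance_ids)  # second network (instance)
    sq = F.mse_loss(offset_pred, offset_gt)                        # third network (offset, squared error)
    ab = F.l1_loss(height_pred, height_gt)                         # fourth network (height, absolute value)
    return ce + intra + inter + sq + ab
```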
In some embodiments, the pixel set determination module may include:
the instance category determining unit is used for determining the instance category of each pixel in the image to be detected according to the fusion feature of each pixel in the image to be detected;
and the pixel set determining unit is used for determining the pixel set corresponding to each instance category through a density clustering algorithm.
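A hedged sketch of the density clustering step named above, assuming DBSCAN from scikit-learn is applied to the per-pixel fused features of one instance category; the eps and min_samples values are illustrative assumptions, not values from the patent.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_category_pixels(fused_features, category_mask, eps=0.5, min_samples=20):
    """Group the pixels of one instance category into per-instance pixel sets.

    fused_features: (H, W, C) per-pixel fused feature map.
    category_mask:  (H, W) boolean mask of pixels predicted as this category.
    Returns a dict mapping instance label -> (N, 2) array of (row, col) pixel coordinates.
    """
    coords = np.argwhere(category_mask)      # positions of this category's pixels
    feats = fused_features[category_mask]    # their fused feature vectors, shape (N, C)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    # Label -1 marks noise pixels that DBSCAN assigns to no instance.
    return {lab: coords[labels == lab] for lab in set(labels) if lab != -1}
```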
The device embodiments and the method embodiments described above are based on the same inventive concept.
The embodiment of the application provides an apparatus for determining attribute information of an instance in an image, which includes a processor and a memory, where the memory stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the method for determining attribute information of an instance in an image provided by the above method embodiment.
Embodiments of the present application further provide a computer storage medium, where the storage medium may be disposed in a terminal to store at least one instruction or at least one program for implementing a method for determining example attribute information in an image in the method embodiments, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the method for determining example attribute information in an image provided in the method embodiments.
Optionally, in the embodiment of the present specification, the storage medium may be located in at least one network server among a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The memory of the embodiments of the present disclosure may be used to store software programs and modules, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required by functions, and the like, and the data storage area may store data created according to the use of the device, and the like. Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
The embodiment of the method for determining the instance attribute information in the image provided by the embodiment of the application may be executed in a mobile terminal, a computer terminal, a server, or a similar computing device. Taking running on a server as an example, fig. 10 is a hardware structure block diagram of a server for the method for determining the instance attribute information in the image according to the embodiment of the present application. As shown in fig. 10, the server 1000 may vary considerably depending on its configuration or performance, and may include one or more Central Processing Units (CPUs) 1010 (the processor 1010 may include but is not limited to a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 1030 for storing data, and one or more storage media 1020 (e.g., one or more mass storage devices) for storing applications 1023 or data 1022. The memory 1030 and the storage medium 1020 may be transient or persistent storage. The program stored in the storage medium 1020 may include one or more modules, and each module may include a series of instruction operations for the server. Still further, the central processor 1010 may be configured to communicate with the storage medium 1020 and execute, on the server 1000, the series of instruction operations in the storage medium 1020. The server 1000 may also include one or more power supplies 1060, one or more wired or wireless network interfaces 1050, one or more input-output interfaces 1040, and/or one or more operating systems 1021, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
The input-output interface 1040 may be used to receive or transmit data via a network. A specific example of the network described above may include a wireless network provided by a communication provider of the server 1000. In one example, the input-output interface 1040 includes a network interface controller (NIC) that may be connected to other network devices via a base station so as to communicate with the Internet. In another example, the input-output interface 1040 may be a Radio Frequency (RF) module used to communicate with the Internet in a wireless manner.
It will be understood by those skilled in the art that the structure shown in fig. 10 is merely illustrative and is not intended to limit the structure of the electronic device. For example, server 1000 may also include more or fewer components than shown in FIG. 10, or have a different configuration than shown in FIG. 10.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
According to the embodiments of the method, the apparatus, the server, or the storage medium for determining the instance attribute information in the image provided by the present application, downsampling processing is performed on the image to be detected, which includes a target number of instances, to obtain a shared feature; the position offset, the height information and the fusion feature of each pixel in the image to be detected are then respectively determined according to the shared feature; and the attribute information of each instance is finally determined. This achieves accurate segmentation of the instances in the image, accurately predicts the height and the position offset of each instance, and facilitates accurate drawing of an instance map.
It should be noted that the order of the above embodiments of the present application is for description only and does not indicate the relative merits of the embodiments. Specific embodiments have been described above; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus, device, and storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer storage medium, and the above storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method for determining instance attribute information in an image is characterized by comprising the following steps:
acquiring an image to be detected, wherein the image to be detected comprises a target number of instances;
performing downsampling processing on the image to be detected to obtain shared characteristics;
performing position offset prediction processing on the shared features to obtain the position offset of each pixel in the image to be detected;
performing height prediction processing on the shared features to obtain height information of each pixel in the image to be detected;
determining the fusion characteristic of each pixel in the image to be detected according to the sharing characteristic;
determining a pixel set corresponding to each instance type according to the fusion characteristics of each pixel in the image to be detected;
and determining attribute information of the instance corresponding to each instance type according to the fusion characteristics, the position offset and the height information of the pixels in the pixel set corresponding to each instance type.
2. The method according to claim 1, wherein the determining a fusion feature of each pixel in the image to be detected according to the shared feature comprises:
performing semantic segmentation processing on the shared features to obtain semantic features of each pixel in the image to be detected;
carrying out instance analysis processing on the shared features to obtain instance features of each pixel in the image to be detected;
and fusing the semantic features and the instance features of each pixel in the image to be detected, and determining the fusion features of each pixel in the image to be detected.
3. The method of claim 1, wherein after the step of determining attribute information for the corresponding instance of each instance class, the method further comprises:
and constructing an image corresponding to each instance in the image to be detected according to the attribute information of the instance corresponding to each instance type.
4. The method according to claim 1, wherein determining attribute information of the instance corresponding to each instance category according to the fusion feature, the position offset and the height information of the pixels in the pixel set corresponding to each instance category comprises:
determining the fusion feature of the instance corresponding to each instance category according to the fusion features of the pixels in the pixel set corresponding to each instance category;
determining the position offset of the instance corresponding to each instance category according to the position offsets of the pixels in the pixel set corresponding to each instance category;
and determining the height information of the instance corresponding to each instance category according to the height information of the pixels in the pixel set corresponding to each instance category.
5. The method according to claim 2, wherein the semantic segmentation processing on the shared features to obtain the semantic features of each pixel in the image to be detected comprises:
performing semantic segmentation processing on the shared features through a semantic branch network to obtain semantic features of each pixel in the image to be detected;
the instance analysis processing of the shared features to obtain the instance features of each pixel in the image to be detected comprises:
carrying out instance analysis processing on the shared features through an instance branch network to obtain the instance features;
the step of performing position offset prediction processing on the shared features to obtain the position offset of each pixel in the image to be detected comprises the following steps:
performing position offset prediction processing on the shared features through a first regression branch network to obtain the position offset of each pixel in the image to be detected;
the height prediction processing of the shared features to obtain the height information of each pixel in the image to be detected comprises:
and performing height prediction processing on the shared characteristic through a second regression branch network to obtain height information of each pixel in the image to be detected.
6. The method according to claim 5, wherein before the step of performing semantic segmentation processing on the shared feature through a semantic branch network to obtain the semantic feature of each pixel in the image to be detected, the method further comprises:
constructing a cross entropy loss function of the first network;
constructing an intra-class aggregation degree loss function and an inter-class distinction degree loss function of the second network;
constructing a square error loss function of a third network;
constructing an absolute value loss function of a fourth network;
determining the sum of the cross entropy loss function, the intra-class aggregation degree loss function, the inter-class distinction degree loss function, the squared error loss function and the absolute value loss function as a comprehensive loss function;
respectively adjusting parameters of the first network, the second network, the third network and the fourth network to obtain a current first network, a current second network, a current third network and a current fourth network;
calculating the comprehensive loss values corresponding to the current first network, the current second network, the current third network and the current fourth network;
and when the comprehensive loss value is smaller than a preset threshold value, determining the current first network as the semantic branch network, determining the current second network as the example branch network, determining the current third network as the first regression branch network, and determining the current fourth network as the second regression branch network.
7. The method according to claim 1, wherein the determining the pixel set corresponding to each instance category according to the fusion feature of each pixel in the image to be detected comprises:
determining the instance category of each pixel in the image to be detected according to the fusion characteristic of each pixel in the image to be detected;
and determining a pixel set corresponding to each instance category through a density clustering algorithm.
8. An apparatus for determining attribute information of an instance in an image, the apparatus comprising:
the image to be detected acquisition module is used for acquiring an image to be detected, wherein the image to be detected comprises a target number of instances;
the shared characteristic determining module is used for carrying out downsampling processing on the image to be detected to obtain shared characteristics;
the position offset determining module is used for carrying out position offset prediction processing on the shared features to obtain the position offset of each pixel in the image to be detected;
the height information determining module is used for carrying out height prediction processing on the shared features to obtain the height information of each pixel in the image to be detected;
the fusion characteristic determining module is used for determining the fusion characteristic of each pixel in the image to be detected according to the sharing characteristic;
the pixel set determining module is used for determining a pixel set corresponding to each instance type according to the fusion characteristics of each pixel in the image to be detected;
and the attribute information determining module is used for determining the attribute information of the instance corresponding to each instance type according to the fusion feature, the position offset and the height information of the pixels in the pixel set corresponding to each instance type.
9. An apparatus for determining attribute information of an instance in an image, the apparatus comprising a processor and a memory, the memory having at least one instruction or at least one program stored therein, the at least one instruction or the at least one program being loaded and executed by the processor to implement the method for determining attribute information of an instance in an image according to any one of claims 1 to 7.
10. A computer storage medium having at least one instruction or at least one program stored therein, the at least one instruction or the at least one program being loaded and executed by a processor to implement the method for determining instance attribute information in an image according to any one of claims 1 to 7.
CN202011042869.4A 2020-09-28 2020-09-28 Method, device and equipment for determining instance attribute information in image and storage medium Active CN112053439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011042869.4A CN112053439B (en) 2020-09-28 2020-09-28 Method, device and equipment for determining instance attribute information in image and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011042869.4A CN112053439B (en) 2020-09-28 2020-09-28 Method, device and equipment for determining instance attribute information in image and storage medium

Publications (2)

Publication Number Publication Date
CN112053439A true CN112053439A (en) 2020-12-08
CN112053439B CN112053439B (en) 2022-11-25

Family

ID=73605132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011042869.4A Active CN112053439B (en) 2020-09-28 2020-09-28 Method, device and equipment for determining instance attribute information in image and storage medium

Country Status (1)

Country Link
CN (1) CN112053439B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819008A (en) * 2021-01-11 2021-05-18 腾讯科技(深圳)有限公司 Method, device, medium and electronic equipment for optimizing instance detection network
CN116310189A (en) * 2023-05-22 2023-06-23 浙江大华技术股份有限公司 Map model construction method and terminal

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871117A (en) * 2016-09-23 2018-04-03 三星电子株式会社 Apparatus and method for detection object
CN108229504A (en) * 2018-01-29 2018-06-29 深圳市商汤科技有限公司 Method for analyzing image and device
US20190057507A1 (en) * 2017-08-18 2019-02-21 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
CN109559300A (en) * 2018-11-19 2019-04-02 上海商汤智能科技有限公司 Image processing method, electronic equipment and computer readable storage medium
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion
CN110807385A (en) * 2019-10-24 2020-02-18 腾讯科技(深圳)有限公司 Target detection method and device, electronic equipment and storage medium
CN111292340A (en) * 2020-01-23 2020-06-16 北京市商汤科技开发有限公司 Semantic segmentation method, device, equipment and computer readable storage medium
CN111429463A (en) * 2020-03-04 2020-07-17 北京三快在线科技有限公司 Instance splitting method, instance splitting device, electronic equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871117A (en) * 2016-09-23 2018-04-03 三星电子株式会社 Apparatus and method for detection object
US20190057507A1 (en) * 2017-08-18 2019-02-21 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
CN108229504A (en) * 2018-01-29 2018-06-29 深圳市商汤科技有限公司 Method for analyzing image and device
CN109559300A (en) * 2018-11-19 2019-04-02 上海商汤智能科技有限公司 Image processing method, electronic equipment and computer readable storage medium
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion
CN110807385A (en) * 2019-10-24 2020-02-18 腾讯科技(深圳)有限公司 Target detection method and device, electronic equipment and storage medium
CN111292340A (en) * 2020-01-23 2020-06-16 北京市商汤科技开发有限公司 Semantic segmentation method, device, equipment and computer readable storage medium
CN111429463A (en) * 2020-03-04 2020-07-17 北京三快在线科技有限公司 Instance splitting method, instance splitting device, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819008A (en) * 2021-01-11 2021-05-18 腾讯科技(深圳)有限公司 Method, device, medium and electronic equipment for optimizing instance detection network
CN116310189A (en) * 2023-05-22 2023-06-23 浙江大华技术股份有限公司 Map model construction method and terminal
CN116310189B (en) * 2023-05-22 2023-09-01 浙江大华技术股份有限公司 Map model construction method and terminal

Also Published As

Publication number Publication date
CN112053439B (en) 2022-11-25

Similar Documents

Publication Publication Date Title
Dornaika et al. Building detection from orthophotos using a machine learning approach: An empirical study on image segmentation and descriptors
Li et al. Object-oriented classification of high-resolution remote sensing imagery based on an improved colour structure code and a support vector machine
Pietikäinen et al. View-based recognition of real-world textures
CN112053358A (en) Method, device and equipment for determining instance type of pixel in image and storage medium
CN109711416B (en) Target identification method and device, computer equipment and storage medium
CN105528575B (en) Sky detection method based on Context Reasoning
CN109118519A (en) Target Re-ID method, system, terminal and the storage medium of Case-based Reasoning segmentation
CN109918969A (en) Method for detecting human face and device, computer installation and computer readable storage medium
Babahajiani et al. Object recognition in 3D point cloud of urban street scene
CN112053439B (en) Method, device and equipment for determining instance attribute information in image and storage medium
CN108647703B (en) Saliency-based classification image library type judgment method
Feng et al. Urban zoning using higher-order markov random fields on multi-view imagery data
Wang et al. A region-line primitive association framework for object-based remote sensing image analysis
Wang et al. Salient object detection using biogeography-based optimization to combine features
Schoier et al. A methodology for dealing with spatial big data
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
Dornaika et al. A comparative study of image segmentation algorithms and descriptors for building detection
Elashry et al. Feature matching enhancement using the graph neural network (gnn-ransac)
CN112580442B (en) Behavior identification method based on multi-dimensional pyramid hierarchical model
Kalboussi et al. Object proposals for salient object segmentation in videos
Na et al. Extraction of salient objects based on image clustering and saliency
Gao et al. SAMM: surroundedness and absorption Markov model based visual saliency detection in images
CN112215205A (en) Target identification method and device, computer equipment and storage medium
Lu Click-cut: a framework for interactive object selection
CN113569600A (en) Method and device for identifying weight of object, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant