WO2022078216A1 - Target recognition method and device - Google Patents

Target recognition method and device

Info

Publication number
WO2022078216A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
feature image
input image
target
Prior art date
Application number
PCT/CN2021/121680
Other languages
French (fr)
Chinese (zh)
Inventor
李亚婷
邱杰
李青之
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202011479454.3A (CN114429561A)
Application filed by 华为云计算技术有限公司
Publication of WO2022078216A1 publication Critical patent/WO2022078216A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition

Definitions

  • the present application relates to the field of communication technologies, and in particular, to a target identification method and device.
  • image-based target recognition has a wide range of application prospects.
  • image-based target recognition is involved in scenes such as illegal vehicle management, commodity identification, endangered species protection, traffic monitoring and detection.
  • image-based target recognition can identify illegal vehicles from images captured on roads or in other locations, and obtain information about those vehicles, such as license plates and vehicle logos.
  • the present application provides a target recognition method and device to improve the accuracy of target recognition.
  • an embodiment of the present application provides a method for target recognition, which can be performed by a target recognition device.
  • the target recognition device acquires an input image that includes a target to be recognized. To identify the target, the target recognition device can first determine the distinguishing region of the input image; the distinguishing region is a subset of the regions in the input image that can indicate the category to which the target belongs.
  • after the target recognition device has determined the distinguishing region, the first feature image can be obtained by occluding the distinguishing region in the input image; that is, the first feature image is an image in which the distinguishing region of the input image is blocked.
  • the target identification device may identify the target according to the first characteristic image.
  • when the target recognition device performs target recognition, it considers the situation in which the distinguishing region of the input image is occluded and acquires the first feature image for that situation; the distinguishing region is removed from the first feature image.
  • by strengthening the analysis of the regions outside the distinguishing region during target recognition, the target can still be identified accurately, ensuring the accuracy of target recognition.
  • the target recognition apparatus may also leave the distinguishing region of the input image unblocked, for example displaying it normally or highlighting it, to generate a second feature image. That is to say, the second feature image is a feature image in which the distinguishing region of the input image is not blocked; when performing target recognition, the target recognition device can recognize the target according to the first feature image and the second feature image.
  • the target recognition device considers both the situation in which the distinguishing region is blocked and the situation in which it is not blocked, and obtains the first feature image and the second feature image, which respectively correspond to the input image being occluded (or failing to show the real situation because of the shooting environment) and the input image not being occluded (showing the real situation). Performing target recognition based on both the first feature image and the second feature image can reduce the influence of occlusion or the environment, thereby improving the accuracy of target recognition.
  • the distinguishing area may be determined according to the spatial feature of the input image. For example, the regions in the input image whose spatial features are greater than the threshold or are within a certain interval are selected as the distinguishing regions.
  • the embodiments of the present application do not limit the manner of determining the distinguishing region according to the spatial characteristics of the input image.
  • the spatial features of the input image can be used to determine a more discriminative region, one that is better able to characterize how the target's category differs from those of other targets.
  • when determining the distinguishing region according to the spatial features of the input image, the target recognition device may configure scores for the spatial features; for example, an attention model may be used to configure a score for each spatial feature of the input image, and the region whose spatial-feature scores are greater than a threshold is regarded as the discriminative region.
  • a first coefficient value may be configured for each pixel in the input image; for example, the first coefficient values of the pixels belonging to the distinguishing region are configured to be a smaller first value, and the first coefficient values of the remaining pixels are configured to be a larger second value. The map formed by the first coefficient values of the pixels is the first coefficient map.
  • after that, the first coefficient map is applied to the input image (in a specific application, it can be applied to the feature image of the input image or to a processed feature image of the input image, such as Fout or B in the embodiments) to generate the first feature image.
  • applying the coefficient map to the input image can reduce the pixel values of the discriminative regions in the input image, realize the occlusion of the discriminative regions, and then obtain the first feature image more conveniently.
  • when the target recognition device generates the second feature image according to the distinguishing region of the input image, a second coefficient value can be configured for each pixel in the input image.
  • for example, the second coefficient values of the pixels in the distinguishing region can be configured to be a larger first value, and the second coefficient values of the remaining pixels to be a smaller second value.
  • alternatively, the second coefficient value of each pixel belonging to the discriminative region in the input image can be configured as the score of the spatial feature of that pixel.
  • the map formed by the second coefficient values of the pixels is the second coefficient map; after that, the second coefficient map is applied to the input image (in a specific application, it can be applied to the feature image of the input image or to a processed feature image of the input image, such as Fout or B in the embodiments) to generate the second feature image.
  • the target recognition device can change the pixel values of the distinguishing regions in the input image in various ways, so as to highlight the distinguishing regions and thereby obtain the second feature image.
  • the first feature image and the second feature image may be aggregated and dimension-reduced in the channel dimension to generate a third feature image.
  • based on the third feature image, multiple candidate feature images with different receptive fields are determined, where each candidate feature image has the same size; the multiple candidate feature images are then fused into a fourth feature image, and the target is identified according to the fourth feature image.
  • the fourth feature image is formed by fusing multiple candidate feature images with different receptive fields, so the receptive field of the fourth feature image can cover more effective information that is beneficial to target recognition and less invalid information that hinders it, enabling the target recognition device to identify the target more accurately through the fourth feature image.
  • when the target recognition device aggregates the first feature image and the second feature image in the channel dimension to generate the third feature image, it can first aggregate the two images in the channel dimension and reduce the dimension to generate an aggregated image; the aggregated image can have the same size as the first feature image or the second feature image. Weights are then configured for the aggregated image in the channel dimension to generate the third feature image. The configured weights achieve the following effect: when the discriminative region is occluded in the input image, the part of the aggregated image belonging to the first feature image has a greater weight on the channel than the part belonging to the second feature image; when the discriminative region is not occluded, the part belonging to the first feature image has a smaller weight on the channel than the part belonging to the second feature image.
  • in this way, the part of the aggregated image belonging to the first feature image can be emphasized when the discriminative region is occluded, and the part belonging to the second feature image can be emphasized when it is not occluded, so that the weights of the third feature image in the channel dimension better match whether the discriminative region in the input image is occluded.
  • when the target recognition device determines multiple candidate feature images based on the third feature image, it can apply multiple different convolution kernels to the third feature image and obtain the multiple candidate feature images through dilated separation convolution (that is, convolution with kernels expanded by filling in zeros).
  • the target recognition device obtains multiple candidate feature images of the same size by using dilated separation convolution, so as to facilitate subsequent fusion of the multiple candidate feature images.
  • a weight may be configured for each candidate feature image, and the weight may be obtained in advance through learning and training; a fourth feature image is then obtained based on each candidate feature image and its corresponding weight.
  • because a corresponding weight is configured for each candidate feature image, the information in each candidate feature image can be selectively retained when the fourth feature image is generated by fusion, so that the receptive field of the fourth feature image covers more information that is effective for target recognition.
  • an embodiment of the present application further provides a target identification device, the target identification device has the function of implementing the behavior in the method example of the first aspect.
  • the functions can be implemented by hardware, or can be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the structure of the apparatus includes an acquisition unit, an image generation unit, an identification unit, and a determination unit, and these units can perform the corresponding functions in the method examples of the first aspect; for details, refer to the detailed description in the method examples, which is not repeated here.
  • an embodiment of the present application further provides an apparatus, which has a function of implementing the behavior in the method example of the first aspect.
  • the structure of the apparatus includes a processor and a memory, and the processor is configured to support the target identification device in performing the corresponding functions of the method of the first aspect.
  • the memory is coupled to the processor and stores the program instructions and data necessary for the apparatus.
  • the structure of the apparatus further includes a communication interface for communicating with other devices.
  • the present application further provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a computer, they cause the computer to execute the method described in the first aspect and in each possible implementation of the first aspect.
  • the present application further provides a computer program product comprising instructions, which, when run on a computer, cause the computer to execute the method described in the first aspect and various possible implementations of the first aspect.
  • the present application further provides a computer chip; the chip is connected to a memory and is used to read and execute a software program stored in the memory and to perform the method described in the first aspect and in each possible implementation of the first aspect.
  • FIG. 1 is a schematic diagram of a feature image provided by the present application;
  • FIG. 2 is a schematic diagram of the architecture of a system provided by the present application;
  • FIG. 3A is a schematic diagram of a target recognition method provided by the present application;
  • FIG. 3B is a schematic diagram of another target recognition method provided by the present application;
  • FIG. 4 is a schematic diagram of a method for determining a distinguishing region provided by the present application;
  • FIG. 5A is a schematic diagram of a method for configuring scores for spatial features using an attention model provided by the present application;
  • FIG. 5B is a schematic diagram of a method for configuring scores for spatial features and visual features using two attention models provided by the present application;
  • FIG. 6A is a schematic diagram of a method for generating a first feature image provided by the present application;
  • FIG. 6B is a schematic diagram of the effect of a first feature image provided by the present application;
  • FIG. 7A is a schematic diagram of a method for generating a second feature image provided by the present application;
  • FIG. 7B is a schematic diagram of the effect of a second feature image provided by the present application;
  • FIG. 8 is a schematic diagram of converting a third feature image into a fourth feature image provided by the present application;
  • FIG. 9 is a schematic diagram of a method for generating a sixth feature image provided by the present application;
  • FIG. 10A is a schematic structural diagram of a ResNet50 provided by the present application;
  • FIG. 10B is a schematic structural diagram of a CNN provided by the present application;
  • FIG. 11 is a schematic structural diagram of a target identification device provided by the present application;
  • FIG. 12 is a schematic diagram of a device provided by the present application.
  • Image features are used to characterize the attributes of images. There are many types of image features. Image features can be divided into spatial features and visual features. Different image features can characterize images from different angles. Image features can be quantified as numerical values, which are called feature values.
  • the image features of different regions in the image are different, that is, different regions in the image correspond to different feature values, and the image formed by the feature values corresponding to each region of the image is the feature image.
  • a feature image can be abstracted as a cube in space with a length, width and height of C, H, and W, where the direction of C is the channel dimension, and the planes where W and H are located are the space dimension.
  • the length of the feature image in the channel dimension is C, which can be understood as the feature image has C channels.
  • Visual features are features that describe the channel dimension, and one channel can correspond to one visual feature. The number of channels may be different for different feature images. There are many types of visual features, such as color features, texture features.
  • in the spatial dimension, the feature image shows the distances or relationships between the people or things in the image; spatial features describe the feature image in the spatial dimension.
  • a feature value on the spatial feature corresponds to a region (composed of multiple pixels) in the image, and is used to describe the feature of the region in the spatial dimension.
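To make the C*H*W abstraction above concrete, the following minimal PyTorch sketch (illustrative only; the tensor sizes are assumed, not taken from the publication) indexes the channel dimension and the spatial dimensions of a feature image:

```python
import torch

# A feature image abstracted as a C x H x W cube (batch dimension in front).
# Here C = 256 channels over a 14 x 14 spatial plane; the sizes are illustrative.
feat = torch.randn(1, 256, 14, 14)

C, H, W = feat.shape[1], feat.shape[2], feat.shape[3]
one_visual_feature = feat[0, 0]          # one channel: a 14 x 14 map in the spatial dimension
one_spatial_position = feat[0, :, 3, 5]  # one spatial position: a 256-dim vector across channels
print(C, H, W, one_visual_feature.shape, one_spatial_position.shape)
```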
  • the discriminative region is used to distinguish objects of different categories and is a subset of the regions that can represent how the target's category differs from those of other objects; that is, an image contains many regions that can represent such differences, and a part of these regions is selected as the distinguishing region. For example, in an image of a vehicle, the areas where the head, logo, rear-view lights and tires are located can all represent differences from the categories of other objects, and a subset of these areas, such as the area where the logo is located, serves as the distinguishing region.
  • the attention model provided in the embodiments of the present application can be used to determine the distinguishing region; alternatively, the pixels in the feature image can be clustered, and the region in which the value of each pixel is greater than a set value, or the region in which the value of each pixel falls within a preset range, can be used as the distinguishing region.
  • the distinguishing region in the embodiments of the present application is suitable for coarse-grained target recognition (that is, identifying the large category to which the target belongs, such as distinguishing plants, animals, and people) and is also suitable for fine-grained target recognition (that is, identifying the small category to which the target belongs, such as identifying the species of different birds).
  • in fine-grained recognition, the discriminative region is the region that can distinguish the category of the object in the image from other categories within the same large category. For example, to distinguish different birds (parrots, sparrows, orioles, etc.) under the large category of birds: many related birds are very similar in appearance and size, with only extremely small differences, and most of these differences lie in local areas.
  • the areas such as the bird's beak, claws, feather color, eyes and tail in which such differences exist are called discriminative regions; these are the areas by which the bird can be distinguished.
  • multiple feature images may be aggregated in the channel dimension, and the aggregation in the channel dimension refers to superimposing two feature images in the channel dimension.
  • the feature image may also be reduced in dimension in the channel dimension.
  • dimension reduction in the channel dimension refers to reducing the length of the feature image in the channel dimension, so that the length of the dimension-reduced feature image in the channel dimension meets a specific requirement.
  • when multiple feature images are aggregated, the length of the aggregated image in the channel dimension is equal to the sum of the lengths of the multiple feature images in the channel dimension; dimension reduction can then be performed to obtain an image of the same size as each of the multiple feature images.
  • the receptive field refers to the size of the area where the pixels on the feature image are mapped on the original image.
  • Dilated separation convolution refers to the combination of dilated convolution and depthwise separable convolution.
  • dilated convolution, also known as atrous or hole convolution, injects holes (zero filling) into the standard convolution kernel to enlarge the receptive field.
  • compared with ordinary convolution, dilated convolution has one extra parameter: the dilation rate (which can also be abbreviated as rate).
  • the dilation rate determines the number of zeros filled in between adjacent points of the convolution kernel (a rate of r inserts r−1 zeros).
  • dilated convolution not only expands the size of the convolution kernel according to the dilation rate, but also pads the feature map with zeros, so that the image after convolution has the same size as the image before convolution but a larger receptive field.
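As an illustration of the size-preserving property described above, the following PyTorch sketch (tensor sizes assumed) shows that a 3*3 kernel with dilation rate r and padding r produces an output with the same H*W as the input while enlarging the receptive field:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Standard 3x3 convolution vs. a dilated 3x3 convolution with rate r = 2.
# With padding = dilation, the output keeps the input's H x W, while the
# effective kernel size grows to k + (k - 1)(r - 1) = 5.
conv_std = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_dil = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv_std(x).shape)  # torch.Size([1, 64, 32, 32])
print(conv_dil(x).shape)  # torch.Size([1, 64, 32, 32]) -- same size, larger receptive field
```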
  • Depthwise separable convolution is a lightweight convolution operation that can obtain channel and spatial information separately. Compared with standard convolution, the parameter amount and computational cost of this depthwise separable convolution are much lower.
  • depthwise separable convolution is divided into two parts: depthwise convolution and pointwise convolution.
  • in depthwise convolution, unlike standard convolution, each convolution kernel convolves a single channel, and each channel has its own corresponding kernel.
  • Each convolution kernel in pointwise convolution can effectively fuse the information of multiple channels and generate a feature image, so that multiple convolution kernels extract different features to obtain multi-dimensional feature output.
  • depthwise convolution operates on each channel independently and obtains the information of each channel independently, so there is no exchange of information between the same spatial positions of different channels; for this reason, pointwise convolution is required to complete the information exchange between channels.
  • dilated separation convolution applies dilated convolution to the depthwise separable convolution process: first, dilated convolution is used to perform the depthwise convolution on each channel, and then pointwise convolution fuses the information of the channels. This dilated separation convolution operation can reduce the amount of computation and the number of parameters without reducing classification accuracy.
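A minimal PyTorch sketch of a dilated separation convolution, assuming a 3*3 depthwise kernel; the module name and tensor sizes are illustrative, not from the publication:

```python
import torch
import torch.nn as nn

class DilatedSeparableConv(nn.Module):
    """Dilated separation convolution: a dilated depthwise convolution
    (one kernel per channel) followed by a 1x1 pointwise convolution
    that fuses information across channels."""
    def __init__(self, in_ch, out_ch, rate):
        super().__init__()
        # groups=in_ch -> each channel is convolved by its own kernel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=rate, dilation=rate, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 32, 32)
print(DilatedSeparableConv(64, 128, rate=2)(x).shape)  # torch.Size([1, 128, 32, 32])
```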
  • FIG. 2 is a schematic diagram of a system to which this embodiment of the present application is applied, and the system includes an image collection device 200 and a target identification device 100 .
  • the image collection device 200 is used to collect images. After the image collection device 200 collects the images, the collected images are fed back to the target recognition device 100 .
  • depending on the application scenario, the location where the image collection apparatus 200 is deployed and the type of the image collection apparatus 200 will differ.
  • the image collection device 200 may be a camera device deployed on both sides of a road, or a monitoring device deployed at a traffic intersection. The image collection device 200 may capture an image of the road, and send the captured image to the object recognition device 100 .
  • the image collection device 200 may be a camera deployed in a forest or ocean, and the image collection device 200 may capture images of various animals and plants in the forest, or images of various animals and plants in the ocean , and send the captured image to the target recognition device 100 .
  • the target recognition apparatus 100 can receive the image from the image collection apparatus 200, and execute the target recognition method provided by the embodiment of the present application.
  • This embodiment of the present application does not limit the location where the target identification device 100 is deployed.
  • the target identification device 100 may be deployed in an edge data center, for example as an edge computing node (multi-access edge computing, MEC) in the edge data center; it may also be deployed in a cloud data center or on a terminal computing device.
  • the target identification apparatus 100 may also be distributed in some or all of the environments of edge data centers, cloud data centers, and terminal computing devices.
  • the target identification device 100 may be a hardware device, such as a server, a service cluster, or a terminal computing device, or a software device, specifically a software module running on the hardware computing device.
  • the target recognition apparatus 100 when it performs target recognition, it can perform coarse-grained target recognition, and can also perform more fine-grained recognition on the target.
  • the coarse-grained target recognition can be understood as the target recognition device 100 can simply classify the target and identify the large category to which the target belongs.
  • for example, the target recognition device 100 can recognize humans, vehicles, animals, and plants in an image.
  • Fine-grained target recognition can be understood as the target recognition device 100 being able to finely classify the target and identify the small category to which the target belongs.
  • the target recognition device 100 can recognize the model, brand, etc. of the vehicle in the image.
  • the object recognition device 100 can recognize the species to which different birds in the image belong.
  • in addition to the images sent by the image collection device 200, the target recognition device 100 can receive data from other devices. Taking the illegal-vehicle recognition scenario as an example, the target recognition device 100 can also receive data from roadside units and radar measurements.
  • a roadside unit (RSU) can identify a vehicle passing it and obtain the vehicle's information, and can send the obtained vehicle information to the target identification device 100; after performing target identification on the image sent by the image collection device, the target identification device 100 can also identify the target in the image according to the vehicle information sent by the roadside unit.
  • Radar can perform ranging, measuring the distance between vehicles and the distance from a vehicle to an object.
  • the radar can send the measured information to an edge sensing unit, and the edge sensing unit sends the information to the target recognition device; after the target recognition device 100 performs target recognition on the image sent by the image collection device 200, the information measured by the radar can be annotated on the target in the image.
  • the target recognition device 100 sends the recognition results (such as the target information, or the target together with the information from the roadside unit, radar, and so on) to other processing devices. For example, in the illegal-vehicle recognition scenario, the target recognition device can identify the illegal vehicle in the input image, obtain the information of the illegal vehicle, and send that information to the traffic command center system.
  • the method includes:
  • Step 101 the target recognition apparatus 100 acquires an input image from the image collection apparatus 200 , and the input image includes the target to be recognized.
  • Step 102 The target recognition apparatus 100 obtains a first feature image by blocking the distinguishing region in the input image, that is, the first feature image is a feature image in which the distinguishing region in the input image is blocked.
  • the distinguishing region is hidden in the first feature image, and the feature values of the pixels in the distinguishing region of the first feature image may be significantly smaller than the feature values of the pixels in other regions; for example, the feature values of the pixels in the distinguishing region are zero.
  • for the method of determining the distinguishing region, refer to the relevant description of step 202 in the embodiment shown in FIG. 3B; for the method by which the target recognition apparatus 100 acquires the first feature image, refer to the relevant descriptions of steps 203 to 204 in the embodiment shown in FIG. 3B.
  • Step 103 The target recognition device 100 uses the first characteristic image to recognize the target of the input image.
  • a first feature image is generated by using a distinguishing region in the input image, wherein the first feature image is a feature image in which the distinguishing region is occluded.
  • the target is identified according to the first characteristic image.
  • the case where the distinguishing area is occluded is considered, which corresponds to the case where there is occlusion in the recognized image (or the image cannot show the real situation due to the influence of the shooting environment).
  • the target in the input image can be more accurately identified, the influence of occlusion or environmental reasons on the target identification can be reduced, and the identification accuracy can be improved.
  • the target recognition process provided in the embodiments of the present application mainly involves the field of deep learning; the modules or neural networks used can be trained first and then used. That is, a module or neural network is first trained on a training set, and its parameters are continuously adjusted so that it can output more accurate results.
  • after training, the module or neural network can be put into use: it processes the input data (such as input feature images) and outputs results.
  • the processing of input data is the same during training and during use; the difference is that during training the parameters of the module or neural network are adjusted according to each output, whereas in use the input simply passes through the module or neural network to obtain the output. The following takes the use phase of the modules and neural networks involved as an example to introduce the target recognition method provided by the embodiments of the present application.
  • the embodiment of the present application uses the first feature image and the second feature image to identify the target as an example, and the method specifically includes:
  • Step 201 The target recognition apparatus 100 acquires an input image, and the type of the input image is not limited here.
  • the input image may be an image directly sent after being collected by the image collection apparatus 200 , or may be an image processed based on the image collected by the image collection apparatus 200 .
  • Step 202 The object recognition apparatus 100 determines the distinguishing area in the input image.
  • the target recognition apparatus 100 may use the ResNet50 neural network or the VGG16 network to obtain the feature image of the input image; the feature image may be the output of the bottleneck network layers of the ResNet50 neural network, or the feature image output by the middle and high network layers of the VGG16 network, such as conv3_x, conv4_x, conv5_x and conv6.
  • the present application does not limit the method by which the target identification device 100 determines the distinguishing area, and any method capable of determining the distinguishing area is applicable to the embodiment of the present application.
  • a method for determining a distinguishing region provided by an embodiment of the present application is described below. As shown in FIG. 4 , the method includes:
  • Step 301 The object recognition apparatus 100 determines the spatial feature of the input image.
  • after the target recognition apparatus 100 obtains the feature image of the input image, the value of each pixel of the feature image in the spatial dimension represents a spatial feature of the input image, and the value of each pixel is a feature value.
  • Step 302 The target recognition apparatus 100 may configure a score for the spatial feature based on the attention model.
  • Attention models can measure multiple pieces of information from a specific perspective and determine the value of each information.
  • here, the target recognition apparatus 100 may use the attention model to measure the spatial features of the feature image corresponding to the input image and determine the value of each spatial feature, for example by determining the amount of information a spatial feature contains, and configure a score for each spatial feature: a higher score is assigned to a spatial feature that contains rich information, and a lower score to a spatial feature that contains less information.
  • the target recognition apparatus 100 may also configure scores for visual features. That is, in the channel dimension, score each visual feature and configure the score.
  • the manner in which the object recognition apparatus 100 assigns scores for visual features is similar to the manner in which the object identification apparatus 100 assigns scores for spatial features, and an attention model may also be used to assign scores for visual features.
  • the attention model used to configure the scores for spatial features and the attention model used to configure scores for visual features are two independent attention models.
  • FIG. 5A it is a flowchart for the target recognition apparatus 100 to use the attention model to score spatial features and configure the score.
  • as shown in FIG. 5A, for the feature image Fin of the input image, an attention model can be used to configure a score for each spatial feature, and the scores can be applied directly to the feature image of the input image; that is, each score is multiplied by the corresponding feature value in the feature image to obtain feature image B.
  • the feature image B and Fin have the same size, and the feature image B is the feature image obtained by applying the spatial feature score to Fin.
  • the feature image B may be the feature image to which the first coefficient map and the second coefficient map are applied in steps 204 and 206.
  • FIG. 5B it is a flowchart for the target recognition apparatus 100 to use two independent attention models to score spatial features and visual features, and to configure the score values.
  • as shown in FIG. 5B, for the feature image Fin of the input image, two attention models can be used in parallel to configure scores for the spatial features and the visual features of the input image. The scores configured for the spatial features can be applied directly to the feature image of the input image, that is, each score is multiplied by the corresponding feature value in the feature image to obtain feature image A; the scores configured for the visual features can likewise be applied directly to obtain feature image B.
  • the feature image A and the feature image B are aggregated and dimension-reduced to obtain the feature image Fout.
  • Fin and Fout have the same size, and Fout is the feature image obtained after applying the scores of the spatial features and the scores of the visual features to Fin.
  • the feature image Fout may be the feature image to which the first coefficient map and the second coefficient map are applied in steps 204 and 206.
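The FIG. 5A/5B flow could be sketched as follows in PyTorch. The internal structure of the two attention branches is not specified in the text, so the simple convolutional scoring used here is an assumption; only the overall flow (score, multiply onto Fin, aggregate A and B, reduce back to C channels to get Fout) follows the description:

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Sketch of the FIG. 5B flow: one branch scores spatial positions,
    another scores channels (visual features); each score map is multiplied
    onto Fin, and the two results (A and B) are aggregated in the channel
    dimension and reduced back to C channels to give Fout. The internal
    structure of the two attention branches is assumed."""
    def __init__(self, c):
        super().__init__()
        self.spatial_att = nn.Sequential(nn.Conv2d(c, 1, 1), nn.Sigmoid())  # H x W scores
        self.visual_att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(c, c, 1), nn.Sigmoid())   # C scores
        self.reduce = nn.Conv2d(2 * c, c, 1)  # aggregation + dimension reduction

    def forward(self, fin):
        a = fin * self.spatial_att(fin)  # feature image A (spatial scores applied)
        b = fin * self.visual_att(fin)   # feature image B (visual scores applied)
        return self.reduce(torch.cat([a, b], dim=1))  # Fout, same size as Fin

fin = torch.randn(1, 256, 14, 14)
print(DualAttention(256)(fin).shape)  # torch.Size([1, 256, 14, 14])
```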
  • the value of each visual feature or spatial feature lies in its contribution to the classification performed in target recognition (such as fine-grained or coarse-grained target recognition); some visual or spatial features can directly indicate an attribute of the target (such as its category) and contribute a great deal to target recognition, while others cannot highlight the target's attributes and contribute relatively little.
  • using the attention model in the spatial dimension strengthens, through the configured scores, the expression of spatial features that are beneficial to target recognition and weakens the expression of spatial features that contribute little, so that the resulting effective feature expression improves the accuracy of target recognition.
  • Step 303 The target recognition apparatus 100 determines the distinguishing region according to the scores of the spatial features of the input image.
  • the embodiment of the present application does not limit the manner in which the target recognition apparatus 100 performs step 303.
  • for example, the target recognition apparatus 100 may regard the region whose spatial-feature scores are greater than the threshold as the discriminative region.
  • the threshold may be an empirical value, or a value determined by means such as simulation or emulation. It is also possible to use the region whose spatial-feature scores fall within a specific range as the distinguishing region; the bounds of that range can be fixed values or manually set values.
  • after the distinguishing region is determined, the first feature image (see steps 203 to 204) and the second feature image (see steps 205 to 206) can be determined respectively.
  • Step 203 The target recognition device 100 blocks the distinguishing area in the input image, configures first coefficient values for each pixel in the input image, and the first coefficient value of each pixel constitutes a first coefficient map.
  • the target recognition device 100 can configure lower first coefficient values for the pixels in the discriminative region of the input image, and higher first coefficient values for the remaining pixels outside the discriminative region.
  • for example, if the score of the spatial feature of a pixel is greater than or equal to the threshold (that is, the pixel belongs to the discriminative region), the first coefficient value of the pixel can be configured to be 0; if the score of the spatial feature of the pixel is less than the threshold, the first coefficient value of the pixel is set to 1, that is:

    first coefficient value of pixel i = 0, if Att(F_i) ≥ t; 1, if Att(F_i) < t

  • where Att(F_i) is the score of the spatial feature of pixel i determined based on the attention model, and t is the threshold.
  • Step 204 The target recognition apparatus 100 applies the first coefficient map to the feature image of the input image to obtain the first feature image.
  • the object recognition apparatus 100 may also apply the first coefficient map to the feature image B shown in FIG. 5A or the feature image Fout shown in FIG. 5B; the description here only takes applying the first coefficient map to the feature image of the input image as an example.
  • the object recognition apparatus 100 multiplies the value of each pixel on the feature image of the input image by the first coefficient value of the pixel on the first coefficient map to obtain the first feature image.
  • the size of the first feature image is C*H*W, where C is the length of the channel, H is the height of the space, and W is the width of the space.
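Steps 203 to 204 could be sketched as follows (a hedged PyTorch illustration; the function name, tensor sizes and threshold value are hypothetical):

```python
import torch

def first_feature_image(feat, att_scores, t):
    """Build the first coefficient map (0 where the spatial-feature score
    reaches the threshold t, i.e. inside the discriminative region; 1
    elsewhere) and multiply it onto the feature image so that the
    discriminative region is occluded."""
    m1 = (att_scores < t).float()   # first coefficient map, H x W
    return feat * m1.unsqueeze(0)   # broadcast over the C channels

feat = torch.rand(256, 14, 14)  # feature image of the input image, C x H x W
att = torch.rand(14, 14)        # spatial-feature scores from the attention model
f1 = first_feature_image(feat, att, t=0.8)  # first feature image, C x H x W
```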
  • FIG. 6A is a flowchart of the target recognition device 100 generating the first feature image (the part in which the attention model scores the spatial features of the input image and the distinguishing region is determined corresponds to steps 301 to 303, and the part in which the first coefficient map is generated corresponds to step 203). The target recognition device 100 uses the attention model to score the spatial features of the input image and configure the scores (for determining the distinguishing region), generates the first coefficient map based on the score of the spatial feature of each pixel, and then applies the first coefficient map to the feature image of the input image to generate the first feature image.
  • the distinguishing regions in the input image may be rearview mirrors, headlights, and license plates of the vehicle. After these discriminative regions are occluded, they become black in the input image, and other regions are displayed normally.
  • the target recognition apparatus 100 may also obtain a second feature image without blocking the distinguishing region in the input image, and the second feature image is a feature image in which the distinguishing region in the input image is not blocked.
  • the distinguishing region in the second feature image can be displayed normally; for example, the second feature image can be the feature image of the input image itself, in which the distinguishing region is not blocked and is displayed normally.
  • alternatively, to highlight the difference between the distinguishing region and other regions, the second feature image may be a feature image in which the feature values of the pixels in the distinguishing region are significantly higher than the feature values of the pixels in other regions.
  • for the manner in which the target recognition apparatus 100 obtains the second feature image, reference may be made to the relevant descriptions of steps 205 to 206.
  • Step 205 The target recognition device 100 does not block the distinguishing area in the input image, configures second coefficient values for each pixel in the input image, and the second coefficient value of each pixel constitutes a second coefficient map.
  • the target recognition apparatus 100 may configure higher second coefficient values for the pixels in the distinguishing region of the input image, and lower second coefficient values for the remaining pixels outside the distinguishing region.
  • for example, the second coefficient value of a pixel in the distinguishing region can be configured to be 1, and the second coefficient value of any other pixel can be configured to be 0.
  • alternatively, the score of the spatial feature of each pixel can be normalized so that the scores are distributed in [0,1]; after normalization, the score of the spatial feature of each pixel can be used as the second coefficient value of that pixel to form the second coefficient map.
  • the scores of the spatial features of the pixels are normalized in the following way:
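(The publication text does not reproduce the normalization formula here. A standard choice consistent with the stated [0,1] range, given as an assumption rather than the patent's exact formula, is min-max normalization:

    norm(Att(F_i)) = (Att(F_i) − min_j Att(F_j)) / (max_j Att(F_j) − min_j Att(F_j))

where the minimum and maximum are taken over all pixels j of the feature image.)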
  • Step 206 The target recognition apparatus 100 applies the second coefficient map to the feature image of the input image to obtain the second feature image. Similar to step 204, the target recognition apparatus 100 can also apply the second coefficient map to the feature image B shown in FIG. 5A or the feature image Fout shown in FIG. 5B; the description here only takes applying the second coefficient map to the feature image of the input image as an example.
  • specifically, the target identification device 100 multiplies the value of each pixel of the feature image of the input image by the second coefficient value of that pixel in the second coefficient map to obtain the second feature image.
  • the size of the second feature image is C*H*W, where C is the length of the channel, H is the height of the space, and W is the width of the space.
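Correspondingly, steps 205 to 206 could be sketched as follows (again a hedged illustration; the min-max normalization in the second variant is an assumption, as noted above):

```python
import torch

def second_feature_image(feat, att_scores, t=None):
    """Build the second coefficient map and multiply it onto the feature
    image so that the distinguishing region is highlighted. Two variants:
    a binary map (1 inside the distinguishing region, 0 elsewhere), or
    the normalized attention scores themselves (min-max assumed)."""
    if t is not None:
        m2 = (att_scores >= t).float()   # binary variant
    else:                                # normalized-score variant
        m2 = (att_scores - att_scores.min()) / (att_scores.max() - att_scores.min() + 1e-6)
    return feat * m2.unsqueeze(0)

feat = torch.rand(256, 14, 14)
att = torch.rand(14, 14)
f2 = second_feature_image(feat, att)     # second feature image, C x H x W
```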
  • FIG. 7A is a flowchart of the target recognition apparatus 100 generating the second feature image (the part in which the attention model scores the spatial features of the input image and the distinguishing region is determined corresponds to steps 301 to 303, and the part in which the second coefficient map is generated corresponds to step 205). The target recognition device 100 uses the attention model to score the spatial features of the input image and configure the scores, generates the second coefficient map based on the score of the spatial feature of each pixel, and then applies the second coefficient map to the feature image of the input image to generate the second feature image.
  • the distinguishing regions in the input image can be the rearview mirror, headlights, and license plate of the vehicle. These distinguishing regions are not blocked and can further be enhanced, so that they appear brighter in the input image while other regions appear darker.
  • the target recognition apparatus 100 acquires the first characteristic image and the second characteristic image.
  • Step 207 The target recognition apparatus 100 aggregates and reduces the dimension of the first feature image and the second feature image in the channel dimension to generate a third feature image.
  • the size of the first feature image and the second feature image aggregated in the channel dimension is 2C*H*W.
  • the aggregated image can then be dimension-reduced, that is, compressed in the channel dimension, to generate the third feature image, so that the length of the third feature image in the channel dimension is C.
  • the third feature image is thus equivalent to splicing the first feature image and the second feature image in the channel dimension and then compressing the result into one feature image.
  • the third feature image includes the first feature image and the second feature image; that is, in the channel dimension, the part belonging to the first feature image and the part belonging to the second feature image can be distinguished in the third feature image.
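Step 207 in sketch form (the use of a 1*1 convolution as the compression operator is an assumption; the text only specifies aggregation followed by dimension reduction):

```python
import torch
import torch.nn as nn

# Concatenate the first and second feature images along the channel
# dimension (C + C = 2C channels), then compress back to C channels,
# here with a 1x1 convolution.
C = 256
f1 = torch.randn(1, C, 14, 14)   # first feature image
f2 = torch.randn(1, C, 14, 14)   # second feature image

aggregated = torch.cat([f1, f2], dim=1)      # 2C x H x W aggregated image
reduce = nn.Conv2d(2 * C, C, kernel_size=1)  # dimension reduction in the channel dimension
f3 = reduce(aggregated)                      # third feature image, C x H x W
print(f3.shape)                              # torch.Size([1, 256, 14, 14])
```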
  • Step 208 The target recognition apparatus 100 configures weights for the third feature image in the channel dimension to generate a fourth feature image.
  • the target recognition apparatus 100 may firstly configure the weight for the third feature image in the channel dimension, that is, configure the weight on the channel of the third feature image to generate the fourth feature image.
  • the embodiment of the present application does not limit the way of configuring weights for the third feature image in the channel dimension.
  • for example, an efficient channel attention (ECA) model based on channel-relationship modeling can be used to configure weights for the third feature image in the channel dimension.
  • the ECA model is pre-trained and is established on the channel dimension; it can model the channel relationships and learn the relationships between visual features, so as to obtain a more efficient visual feature expression by configuring weights for the third feature image in the channel dimension.
  • during training, the parameters in the ECA model are randomly initialized.
  • the classifier can feed the classification results back to the ECA model, so that the ECA model adaptively adjusts its weight scores according to the classification results: visual features that contribute prominently to classification are given large weights, and visual features that contribute little are given small weights. The weight distribution is continuously learned and adjusted during training until a stable state is reached, yielding the weight distribution most helpful for target recognition.
  • the ECA model is trained through continuous learning on the feature images in the training set, so that it can redistribute the channel weights of a feature image: when the discriminative region of the input image is occluded, the weights on the channels of the part of the feature image in which the discriminative region is occluded are increased, so that the subsequent classifier can learn the distinguishing features of the other regions; otherwise, the weights on the channels of the part in which the discriminative region is not occluded are increased, so that the classifier can identify the features of the discriminative region and make a correct judgment.
  • the weights configured for the third feature image satisfy the following condition: when the discriminative region is occluded in the input image, the part of the third feature image belonging to the first feature image has a greater weight on the channel than the part belonging to the second feature image; when the discriminative region is not occluded in the input image, the part belonging to the first feature image has a smaller weight on the channel than the part belonging to the second feature image.
  • FIG. 8 it is a schematic diagram of converting the third feature image into the fourth feature image by using the ECA model.
  • the part between the third feature image and the fourth feature image is the ECA model, in which only some of the operations included in the ECA model, such as the global average pooling (GAP) operation and the sigmoid activation function, are drawn as examples.
  • in the fourth feature image, the part belonging to the first feature image and the part belonging to the second feature image are configured with different weights, which correspond to the distinguishing region in the input image being occluded or not occluded. On this basis, when the discriminative region in the input image is occluded, a higher weight can be configured for the part belonging to the first feature image and a lower weight for the part belonging to the second feature image, so that in subsequent target recognition more information can be obtained from the regions other than the distinguishing region to assist in identifying the target and determining its category.
  • conversely, when the discriminative region is not occluded, a lower weight can be configured for the part of the third feature image belonging to the first feature image and a higher weight for the part belonging to the second feature image, so that in subsequent target recognition more information can be obtained from the distinguishing region and the distinguishing region can be analyzed more comprehensively to accurately identify the target.
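An ECA-style channel attention block, after Wang et al.'s ECA-Net, which matches the GAP-plus-sigmoid flow of FIG. 8; the 1-D kernel size k = 3 is an assumption:

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """ECA-style channel attention: global average pooling, a 1-D
    convolution across neighboring channels, and a sigmoid produce one
    weight per channel, which rescales the input."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: N x C x H x W
        y = x.mean(dim=(2, 3))                     # GAP -> N x C
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # local cross-channel interaction
        w = self.sigmoid(y).unsqueeze(-1).unsqueeze(-1)  # per-channel weights
        return x * w                               # reweighted channels

f3 = torch.randn(1, 256, 14, 14)   # third feature image
f4 = ECA()(f3)                     # fourth feature image, same size
```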
  • the foregoing steps 202 to 208 may be performed by a double-attention-based discriminative fine-grained feature representation (DMF) device, and the DMF device may be embedded in a neural network, for example after the layers of the neural network used to extract image features.
  • the DMF device can be embedded after each network layer capable of extracting image features. Taking ResNet50 as an example, the DMF device can be embedded in the CNN after each stage.
  • the other network layers can perform some processing on the fourth feature image output by the DMF device.
  • after a series of processing (the specific type of processing is not limited here; for example, it can be a convolution operation, a pooling operation, or a combination of the two), the fifth feature image is obtained, and the fifth feature image can then be passed to the classifier for classification.
  • the target recognition apparatus 100 may further process the fifth characteristic image, and the further processing method of the fifth characteristic image will be described below.
  • Step 209 The target recognition device 100 determines a plurality of candidate feature images based on the fifth feature image; each candidate feature image corresponds to a different receptive field.
  • the receptive field refers to the size of the area mapped by the pixels on the feature image on the input image.
  • the number of candidate feature images is not limited here, and the number of candidate feature images can be determined according to the actual application scenario.
  • the target recognition apparatus 100 may use multiple convolution kernels of different sizes to act on the fifth feature image respectively, and obtain multiple candidate feature images through dilation and separation convolution. Due to the different sizes of the convolution kernels, the receptive fields of multiple candidate feature images obtained by dilated separation convolution are also different.
  • Step 210 The target recognition apparatus 100 fuses the plurality of candidate feature images into a sixth feature image.
  • the receptive field of the sixth feature image is an area that includes less redundant areas, and the redundant area is an area that is not conducive to target recognition, that is, the redundant area includes less or no information representing the target category.
  • the receptive field of the sixth feature image includes a lot of valid information, and the valid information indicates information that can be used for target recognition. For example, the valid information can be extracted by a classifier, and based on the valid information Ability to determine target type.
  • taking bird recognition as an example, the receptive field of the sixth feature image may include fewer non-bird areas.
  • suppose the image shows the bird's head: the color of the head feathers can appear different under different shooting conditions, so the feather color is not conducive to target recognition and belongs to the redundant area.
  • by contrast, areas such as the bird's beak and eyes do not easily change with the shooting conditions and therefore contain more effective information, so the receptive field of the sixth feature image can include areas such as the bird's beak and eyes.
  • the target recognition apparatus 100 may aggregate and reduce the dimension of the plurality of candidate feature images, and configure weights for the parts of the aggregated and dimension-reduced feature images that belong to each candidate feature image, A sixth feature image is obtained, wherein the weights configured for each candidate feature image are acquired through training learning in advance.
  • each of the candidate feature images acquired by the target recognition device 100 corresponds to a receptive field, and the sizes of the receptive fields differ between candidate feature images; some of these receptive fields may cover only part of the target, while others may cover the target but also contain larger non-target areas.
  • FIG. 9 shows the flow of the target recognition apparatus 100 generating the sixth feature image.
  • the target recognition apparatus 100 uses six different convolution kernels to perform convolution operations on the fifth feature image respectively.
  • the six convolution kernels are a 1*1 convolution (conv) kernel, a 3*3 convolution kernel with a dilation rate of 1, a 3*3 convolution kernel with a dilation rate of 2, a 3*3 convolution kernel with a dilation rate of 3, a 3*3 convolution kernel with a dilation rate of 4, and a 3*3 convolution kernel with a dilation rate of 5.
  • the size of the convolution kernel refers to the size of the length X width of the convolution kernel.
  • commonly used sizes are 3X3, 5X5.
  • after the fifth feature image passes through one convolution kernel, a candidate feature image of size C*H*W is output; after the fifth feature image passes through the six convolution kernels, six candidate feature images of size C*H*W are obtained.
  • the target recognition device 100 can aggregate the six candidate feature images of size C*H*W in the channel dimension and reduce the dimension to obtain a feature image of size 6C/N*H*W (where N is the dimension-reduction factor in the channel dimension).
  • weights can be assigned to each candidate feature image in the channel dimension.
  • Several operations for weight redistribution such as global average pooling operation (GAP), 1*1 convolution kernel (ie con1*1), BN+ReLU, sigmoid function, etc.
  • a convolution kernel of size 1*1 realizes the convolution operation, and dimension raising and lowering can be realized by setting the number of 1*1 convolution kernels.
  • GAP sums the feature values of a feature image and averages them to obtain a single value, which can represent the feature information of the entire feature image.
  • BN+ReLU are the normalization and activation functions in a convolutional neural network, mainly realizing the normalization operation and an enhanced nonlinear operation.
  • FC is a fully connected layer, a common layer in neural networks that can play the role of a "classifier" in the overall network.
  • one of the kernels in FIG. 9 is a 1*1 convolution (conv) kernel; this branch effectively retains the original feature information, while GAP is used to obtain global information, which can effectively compensate for the discontinuous information that dilated convolution may cause, so as to obtain a more complete and efficient feature expression.
  • the target recognition apparatus 100 may directly configure weights on the dimension-reduced feature image in the channel dimension to obtain the sixth feature image, or may configure the weights in the channel dimension and then aggregate the result with another feature image and reduce the dimension again to generate the sixth feature image.
  • the other feature image may be a feature image generated by performing an average pooling operation on the fifth feature image, and its size is C*H*W.
  • the purpose of aggregating with another feature image and reducing the dimension is to ensure that the input and output feature dimensions are consistent; appropriate dimension reduction can also effectively improve computational efficiency and recognition accuracy.
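A hedged sketch of this weight-redistribution and aggregation path (GAP, conv1*1, BN+ReLU, sigmoid, then fusion with an average-pooled copy of the fifth feature image); the channel counts and pooling parameters are illustrative assumptions, not taken from the patent:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WeightRedistribution(nn.Module):
        """Learns one weight per channel of the reduced (6C/N) feature image,
        then aggregates the reweighted map with an average-pooled copy of the
        fifth feature image (C*H*W) and reduces back to C channels, giving
        the sixth feature image."""
        def __init__(self, c: int, n: int = 2):
            super().__init__()
            ch = 6 * c // n
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),           # GAP: one value per channel
                nn.Conv2d(ch, ch, kernel_size=1),  # conv1*1
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, kernel_size=1),
                nn.Sigmoid(),                      # channel weights in (0, 1)
            )
            self.fuse = nn.Conv2d(ch + c, c, kernel_size=1)  # aggregate + reduce

        def forward(self, reduced, fifth):
            weighted = reduced * self.gate(reduced)               # reweighting
            pooled = F.avg_pool2d(fifth, 3, stride=1, padding=1)  # C*H*W branch
            return self.fuse(torch.cat([weighted, pooled], dim=1))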
  • Step 211: the target recognition apparatus 100 performs target recognition based on the sixth feature image.
  • when the target recognition apparatus 100 executes step 211, it can do so by means of a classifier, which can be pre-trained to determine the category of the target in a feature image according to that feature image, so as to realize target recognition.
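Such a classifier can be as simple as a pooling layer followed by a fully connected (FC) layer; a minimal sketch whose names and structure are ours, not the patent's:

    import torch.nn as nn

    class Classifier(nn.Module):
        """Pools a feature image to a vector and maps it to class logits;
        the predicted category is the arg-max of the logits."""
        def __init__(self, channels: int, num_classes: int):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(channels, num_classes)  # the FC "classifier"

        def forward(self, x):
            return self.fc(self.pool(x).flatten(1))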
  • the above takes the fifth feature image as the image to be processed; the target recognition device 100 can also process the fourth feature image directly, in the same manner, after acquiring it.
  • steps 209 to 210 may be performed by a multi-scale feature fusion device based on receptive-field adaptive adjustment (RFAM device); the RFAM device may be placed before the classifier to process the feature images to be input to the classifier, so that the classifier can finally output accurate results.
  • FIG. 10A shows a flow chart of the image recognition method as applied in ResNet50.
  • the image recognition device can be split into three devices, called the DFM device, the RFAM device, and the classifier for ease of distinction.
  • the DFM device is used to execute steps 201 to 208 in the embodiment shown in FIG. 3B.
  • the RFAM device is used to execute steps 209 to 210 in the embodiment shown in FIG. 3B.
  • the classifier is used to perform step 211 in the embodiment shown in FIG. 3B.
  • ResNet50 includes a main line comprising a main CNN and a main RFAM device, where the main CNN performs feature extraction on the input image and outputs feature images.
  • DFM devices can be added inside the main CNN; FIG. 10B shows the structure of the main CNN.
  • the main CNN includes four stages (each stage being essentially a convolutional layer for feature extraction), and a DFM device can be added after each stage; each DFM device processes the feature image output by the stage preceding it, for example by performing steps 201 to 208 of the embodiment of the present application.
  • the main RFAM device can process the feature image output by the main CNN.
  • the main CNN can output multiple feature images, one of which covers the entire input image and can be transmitted to the main RFAM device for processing.
  • the multiple feature images also include feature images for different regions of the input image; the more informative of these can be transmitted to branches attached after the main CNN, each branch processing one feature image. Here, four branches connected after the main CNN are taken as an example.
  • Each branch includes a branch CNN and a branch RFAM device.
  • the branch can process one feature image output by the main CNN.
  • the branch CNN can continue to perform feature extraction on the feature image and output a new feature image.
  • a DFM device can be added to the branch CNN.
  • the branch RFAM device can process the feature image output by the branch CNN.
  • the feature images output by the RFAM device in the main line and by the branch RFAM devices in each branch can be input into classifiers; each classifier performs target recognition based on its feature image, and the results of the classifiers are then summarized to output the final result, which can indicate the target in the input image.
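One possible wiring of the main line and branches, sketched under our own naming; the patent fixes the data flow, not this code, and averaging the classifier logits is one assumed way to summarize the results:

    import torch
    import torch.nn as nn

    class RecognitionNet(nn.Module):
        """Main line (main CNN with DFM devices inside, then main RFAM) plus
        several branches (branch CNN + branch RFAM); one classifier per
        feature image, with the classifier results aggregated at the end."""
        def __init__(self, main_cnn, branch_cnns, rfams, heads):
            super().__init__()
            self.main_cnn = main_cnn
            self.branch_cnns = nn.ModuleList(branch_cnns)  # e.g. four branches
            self.rfams = nn.ModuleList(rfams)              # main RFAM first
            self.heads = nn.ModuleList(heads)              # one per RFAM output

        def forward(self, image):
            global_feat, regional_feats = self.main_cnn(image)
            feats = [self.rfams[0](global_feat)] + [
                rfam(cnn(f))
                for rfam, cnn, f in zip(self.rfams[1:], self.branch_cnns,
                                        regional_feats)
            ]
            logits = [head(f) for head, f in zip(self.heads, feats)]
            return torch.stack(logits, dim=0).mean(dim=0)  # summarized result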
  • the embodiment of the present application further provides a target identification device for executing the method performed by the target identification device in the method embodiments shown in FIGS. 3A, 3B, and 4 above.
  • the target recognition apparatus 1100 includes an acquisition unit 1101 , an image generation unit 1102 , and a recognition unit 1103 , and optionally, a determination unit 1104 .
  • the acquiring unit 1101 is configured to acquire an input image, where the input image includes a target to be recognized.
  • the obtaining unit 1101 may perform step 101 in the method embodiment shown in FIG. 3A .
  • the obtaining unit 1101 may perform step 201 in the method embodiment shown in FIG. 3B .
  • the image generating unit 1102 is configured to generate a first feature image according to the distinguishing region of the input image, where the first feature image is a feature image in which the distinguishing region of the input image is occluded, and the distinguishing region of the input image is a subset of the regions in the input image that can indicate the category to which the target belongs.
  • the image generation unit 1102 may perform step 102 in the method embodiment shown in FIG. 3A .
  • the image generating unit 1102 may perform steps 203-204 in the method embodiment shown in FIG. 3B.
  • the identifying unit 1103 is configured to identify the target according to the first characteristic image.
  • the identification unit 1103 may perform step 103 in the method embodiment shown in FIG. 3A .
  • the image generating unit 1102 may also obtain the second feature image without blocking the distinguishing region of the input image; that is, the second feature image is a feature image in which the distinguishing region of the input image is not blocked. The image generation unit 1102 may perform steps 205-206 in the method embodiment shown in FIG. 3B.
  • the recognition unit 1103 may simultaneously consider the first feature image and the second feature image, and recognize the target according to the first feature image and the second feature image.
  • the identification unit 1103 may perform steps 207-211 in the method embodiment shown in FIG. 3B .
  • the determining unit 1104 may further determine the distinguishing region according to the spatial feature of the input image.
  • when determining the distinguishing region according to the spatial features of the input image, the determining unit 1104 may configure scores for the spatial features of the input image and take the regions whose spatial-feature scores are greater than a threshold as the distinguishing region. A region whose scores fall within a preset range can also be used as the distinguishing region.
  • the determining unit 1104 may perform step 202 in the method embodiment shown in FIG. 3B .
  • the determining unit 1104 may execute the method embodiment shown in FIG. 4 .
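The thresholding variant reduces to a one-line mask; a sketch in which the threshold value is illustrative:

    import torch

    def discriminative_mask(scores: torch.Tensor, tau: float = 0.5):
        """scores: an H*W map of spatial-feature scores, e.g. produced by an
        attention model. Pixels scoring above tau form the distinguishing
        region; a band test (lo < score < hi) would give the preset-range
        variant instead."""
        return (scores > tau).float()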
  • when generating the first feature image, a first coefficient value may be configured for each pixel in the input image; the map composed of the first coefficient values of all pixels is the first coefficient map.
  • there are many ways to configure the first coefficient values: for example, the first coefficient values of the pixels belonging to the distinguishing region of the input image can be configured as a smaller first value, and those of the remaining pixels as a larger second value, where the first value is smaller than the second value.
  • after obtaining the first coefficient map, the image generation unit 1102 applies the first coefficient map to the input image to generate the first feature image.
  • similarly, a second coefficient value may be configured for each pixel in the input image; the map formed by the second coefficient values of all pixels is the second coefficient map.
  • there are many ways to configure the second coefficient values: for example, the image generation unit 1102 may configure the second coefficient value of each pixel belonging to the distinguishing region as the score of that pixel's spatial feature; alternatively, it may configure the second coefficient values of the pixels belonging to the distinguishing region as a larger third value and those of the remaining pixels as a smaller fourth value, where the third value is greater than the fourth value.
  • after the second coefficient map is obtained, it is applied to the input image to generate the second feature image.
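Both coefficient maps come down to an element-wise product with the (feature) image; a sketch in which the concrete first/second/third/fourth values are illustrative assumptions:

    import torch

    def coefficient_maps(mask: torch.Tensor):
        """mask: 1.0 on the distinguishing region, 0.0 elsewhere (H*W).
        The first map suppresses the region (small first value, larger
        second value); the second map highlights it (larger third value,
        smaller fourth value)."""
        first_map = mask * 0.0 + (1 - mask) * 1.0    # occludes the region
        second_map = mask * 1.0 + (1 - mask) * 0.1   # highlights the region
        return first_map, second_map

    # Applying a map is a broadcast multiply, e.g.:
    # first_feature = feature * first_map   # feature: C*H*W, map: H*W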
  • the identifying unit 1103 may first aggregate the first feature image and the second feature image in the channel dimension to generate a third feature image; then, based on the third feature image, determine multiple candidate feature images of the same size but with different receptive fields; fuse the multiple candidate feature images into a fourth feature image; and use the fourth feature image for target recognition.
  • when aggregating the first and second feature images in the channel dimension to generate the third feature image, the identifying unit 1103 may aggregate the two images in the channel dimension and reduce the dimension to generate an aggregated image, and then configure weights for the aggregated image in the channel dimension to generate the third feature image. The configured weights can satisfy the following conditions: when the distinguishing region is occluded in the input image, the channel weights of the part of the aggregated image belonging to the first feature image are greater than those of the part belonging to the second feature image; when the distinguishing region is not occluded in the input image, the channel weights of the part belonging to the first feature image are smaller than those of the part belonging to the second feature image.
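A loose reading of this step as SE-style channel gating; a sketch under our own naming, where the occlusion-dependent weighting is expected to emerge from training rather than being hard-coded:

    import torch
    import torch.nn as nn

    class OcclusionAwareAggregation(nn.Module):
        """Concatenates the first and second feature images (C channels
        each), reduces back to C, then learns per-channel weights; training
        should favor channels derived from the first image when the
        distinguishing region is occluded, and from the second otherwise."""
        def __init__(self, c: int):
            super().__init__()
            self.reduce = nn.Conv2d(2 * c, c, kernel_size=1)  # aggregate+reduce
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(c, c, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, first, second):
            agg = self.reduce(torch.cat([first, second], dim=1))
            return agg * self.gate(agg)   # third feature image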
  • when determining the multiple candidate feature images based on the third feature image, the identifying unit 1103 may apply multiple different convolution kernels to the third feature image and obtain the candidate feature images through dilated separable convolution.
  • a corresponding weight may be configured for each candidate feature image, and the fourth feature image is then obtained based on each candidate feature image and its corresponding weight.
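The weighted fusion itself is a small operation; a sketch with weights learned in advance (the softmax normalization is our assumption):

    import torch
    import torch.nn as nn

    class CandidateFusion(nn.Module):
        """Fuses K same-sized candidate feature images with per-candidate
        weights obtained through training."""
        def __init__(self, k: int):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(k))  # learned weights

        def forward(self, candidates):                  # K tensors, C*H*W each
            w = torch.softmax(self.logits, dim=0)
            return sum(wi * ci for wi, ci in zip(w, candidates))  # 4th image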
  • each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions; when the computer instructions are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave).
  • the computer-readable storage medium can be any available medium accessible by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media.
  • the semiconductor medium may be a solid-state drive (SSD).
  • the target identification device in the embodiment shown in FIGS. 3A-3B can take the form shown in FIG. 12 .
  • the apparatus 1200 shown in FIG. 12 includes at least one processor 1201 , a memory 1202 , and optionally, a communication interface 1203 .
  • the memory 1202 may be a volatile memory, such as random access memory; it may also be a non-volatile memory, such as read-only memory, flash memory, a hard disk drive (HDD), or a solid-state drive; the memory 1202 may also be any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, without limitation.
  • the memory 1202 may be a combination of the foregoing memories.
  • the connection medium between the above-mentioned processor 1201 and the memory 1202 is not limited in this embodiment of the present application.
  • the processor 1201 may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, an artificial intelligence chip, a system on a chip, etc.
  • a general-purpose processor may be a microprocessor or any conventional processor or the like; it has the function of sending and receiving data and can communicate with other devices.
  • an independent data transceiver module, such as the communication interface 1203, can also be provided for sending and receiving data; when communicating with other devices, the processor 1201 can transmit data through the communication interface 1203, for example to acquire an input image.
  • the processor 1201 in FIG. 12 can invoke the computer-executable instructions stored in the memory 1202, so that the target identification device can execute the method performed by the target identification device in any of the above method embodiments.
  • the functions/implementation processes of the acquiring unit, the image generating unit, the identifying unit, and the determining unit in FIG. 11 can all be implemented by the processor 1201 in FIG. 12 invoking the computer-executable instructions stored in the memory 1202.
  • alternatively, the functions/implementation processes of the image generating unit, the identifying unit, and the determining unit in FIG. 11 may be implemented by the processor 1201 in FIG. 12 invoking the computer-executable instructions stored in the memory 1202, while the function/implementation process of the acquiring unit may be realized through the communication interface 1203 in FIG. 12.
  • the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

A target recognition method and device. The target recognition device obtains an input image, the input image comprising a target to be recognized; the target recognition device can first determine a discriminative area of the input image, the discriminative area of the input image being a subset of areas in the input image that can indicate the category of the target; after determining the discriminative area, a first feature image is obtained by occluding the discriminative area in the input image, then the target recognition device can perform target recognition according to the first feature image, and determine the category of the target in the input image. The target recognition device takes into account a case in which the discriminative area in the input image is occluded, obtains the first feature image in this case, and can analyze areas in the input image other than the discriminative area to perform target recognition. In the process of target recognition, the effect of accurately recognizing the target can be achieved by strengthening the analysis of the areas other than the discriminative area.

Description

A target recognition method and device

Technical Field

The present application relates to the field of communication technologies, and in particular, to a target recognition method and device.

Background

At present, image-based target recognition has broad application prospects; for example, it is involved in scenarios such as illegal-vehicle management, commodity identification, endangered-species protection, and traffic monitoring and detection.

Taking illegal-vehicle management as an example, image-based target recognition can identify an illegal vehicle from images captured on roads or at certain places and obtain the vehicle information of the illegal vehicle, such as its license plate and logo.

However, due to the complexity of real conditions, such as the lighting when the image was captured, the road conditions, and the environment around the illegal vehicle, the illegal vehicle in the image may be occluded, so that accurate target recognition cannot be performed based on the image; that is, the illegal vehicle cannot be accurately identified, and its vehicle information cannot be accurately obtained.

Summary of the Invention

The present application provides a target recognition method and device to improve the accuracy of target recognition.
In a first aspect, an embodiment of the present application provides a target recognition method, which can be performed by a target recognition device. In this method, the target recognition device acquires an input image that includes a target to be recognized. To identify the target, the target recognition device first determines the distinguishing region of the input image, the distinguishing region being a subset of the regions in the input image that can indicate the category to which the target belongs. After determining the distinguishing region, the device obtains a first feature image by occluding the distinguishing region in the input image; that is, the first feature image is an image in which the distinguishing region of the input image is blocked. Afterwards, the target recognition device can recognize the target according to the first feature image.

Through the above method, when performing target recognition, the target recognition device considers the case where the distinguishing region of the input image is occluded, obtains the first feature image for that case, and can analyze the regions of the first feature image outside the distinguishing region to determine the category to which the target belongs. By strengthening the analysis of the regions outside the distinguishing region, this recognition process achieves accurate target recognition and ensures recognition accuracy.

In a possible implementation, in addition to occluding the distinguishing region of the input image, the target recognition device may leave the distinguishing region of the input image unblocked, for example displaying it normally or highlighting it, to generate a second feature image. That is, the second feature image is a feature image in which the distinguishing region of the input image is not blocked; when performing target recognition, the target recognition device can recognize the target according to the first feature image and the second feature image.

Through the above method, the target recognition device considers both the case where the distinguishing region is occluded and the case where it is not, obtaining the first and second feature images, which correspond respectively to possible occlusion in the input image (or an image that cannot reflect the real situation due to the shooting environment) and to an unoccluded input image (or one that does reflect the real situation). Performing target recognition based on both feature images reduces the influence of occlusion or the environment, thereby improving recognition accuracy.
In a possible implementation, when determining the distinguishing region, the target recognition device may determine it according to the spatial features of the input image, for example by selecting regions of the input image whose spatial features are greater than a threshold or fall within a certain interval. The embodiments of the present application do not limit the manner of determining the distinguishing region according to the spatial features of the input image.

Through the above method, the spatial features of the input image can be used to determine a more discriminative region, one that better characterizes the category that distinguishes the target from other targets.

In a possible implementation, when determining the distinguishing region according to the spatial features of the input image, the target recognition device may configure scores for the spatial features of the input image; for example, an attention model may be used to configure the scores, and the regions whose spatial-feature scores are greater than a threshold are taken as the distinguishing region.

Through the above method, a larger spatial-feature score indicates a larger amount of information, so the determined distinguishing region is more discriminative.
In a possible implementation, when generating the first feature image according to the distinguishing region of the input image, the target recognition device may configure first coefficient values for the pixels of the input image; for example, the first coefficient values of pixels belonging to the distinguishing region may be configured as a smaller first value and those of the remaining pixels as a larger second value. The map formed by the first coefficient values of all pixels is the first coefficient map. The first coefficient map is then applied to the input image (in practice, it may be applied to a feature image of the input image or a processed feature image of the input image, such as Fout or B in the embodiments) to generate the first feature image.

Through the above method, applying the coefficient map to the input image reduces the pixel values of the distinguishing region, realizing occlusion of the distinguishing region, so the first feature image can be obtained conveniently.

In a possible implementation, when generating the second feature image according to the distinguishing region of the input image, the target recognition device may configure second coefficient values for the pixels of the input image; for example, the second coefficient values of pixels belonging to the distinguishing region may be configured as a larger value and those of the remaining pixels as a smaller value, or the second coefficient values of pixels belonging to the distinguishing region may be configured as the scores of the pixels' spatial features. The map formed by the second coefficient values of all pixels is the second coefficient map, which is then applied to the input image (in practice, to a feature image of the input image or a processed feature image of the input image, such as Fout or B in the embodiments) to generate the second feature image.

Through the above method, the target recognition device can change the pixel values of the distinguishing region of the input image in various ways to highlight the distinguishing region, and thereby obtain the second feature image.
In a possible implementation, when recognizing the target according to the first feature image and the second feature image, the target recognition device may aggregate the first and second feature images in the channel dimension and reduce the dimension to generate a third feature image; then, based on the third feature image, determine multiple candidate feature images with different receptive fields, where the candidate feature images have the same size; then fuse the multiple candidate feature images into a fourth feature image; and recognize the target according to the fourth feature image.

Through the above method, the fourth feature image is fused from multiple candidate feature images with different receptive fields, so the receptive field of the fourth feature image covers more valid information that benefits target recognition and less invalid information that hinders it, enabling the target recognition device to recognize the target more accurately.

In a possible implementation, when aggregating the first and second feature images in the channel dimension to generate the third feature image, the target recognition device may first aggregate the two images in the channel dimension and reduce the dimension to generate an aggregated image, which may have the same size as the first or second feature image; it then configures weights for the aggregated image in the channel dimension to generate the third feature image. The configured weights can achieve the following effect: when the distinguishing region is occluded in the input image, the channel weights of the part of the aggregated image belonging to the first feature image are greater than those of the part belonging to the second feature image; when the distinguishing region is not occluded in the input image, the channel weights of the part belonging to the first feature image are smaller than those of the part belonging to the second feature image.

Through the above method, the weight configuration in the channel dimension can highlight the part of the aggregated image belonging to the first feature image when the distinguishing region is occluded, and highlight the part belonging to the second feature image when it is not, so that the channel weights of the third feature image better reflect whether the distinguishing region of the input image is occluded or not.
In a possible implementation, when determining the multiple candidate feature images based on the third feature image, the target recognition device may apply multiple different convolution kernels to the third feature image and obtain the candidate feature images through dilated separable convolution (that is, by filling zeros in the third feature image).

Through the above method, the target recognition device uses dilated separable convolution to obtain multiple candidate feature images of the same size, which facilitates their subsequent fusion.

In a possible implementation, when fusing the multiple candidate feature images into the fourth feature image, the target recognition device may configure a weight for each candidate feature image, the weights being obtained through prior learning and training; the fourth feature image is then obtained based on each candidate feature image and its corresponding weight.

Through the above method, configuring a corresponding weight for each candidate feature image allows the information of the candidate feature images to be retained with different emphasis when they are fused into the fourth feature image, so that the receptive field of the fourth feature image covers more valid information that benefits target recognition.
In a second aspect, an embodiment of the present application further provides a target recognition device. The target recognition device has the function of implementing the behavior in the method example of the first aspect; for the beneficial effects, see the description of the first aspect, which is not repeated here. The function can be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the above function. In a possible design, the structure of the device includes an acquisition unit, an image generation unit, a recognition unit, and a determination unit, which can perform the corresponding functions in the method example of the first aspect; for details, see the detailed description in the method example, which is not repeated here.

In a third aspect, an embodiment of the present application further provides an apparatus that has the function of implementing the behavior in the method example of the first aspect; for the beneficial effects, see the description of the first aspect, which is not repeated here. The structure of the apparatus includes a processor and a memory, the processor being configured to support the target recognition device in performing the corresponding functions of the method of the first aspect. The memory is coupled to the processor and stores the program instructions and data necessary for the apparatus. The structure of the apparatus also includes a communication interface for communicating with other devices.

In a fourth aspect, the present application further provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods described in the first aspect and in each possible implementation of the first aspect.

In a fifth aspect, the present application further provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the methods described in the first aspect and in each possible implementation of the first aspect.

In a sixth aspect, the present application further provides a computer chip connected to a memory, the chip being used to read and execute a software program stored in the memory to perform the methods described in the first aspect and in each possible implementation of the first aspect.
Description of Drawings

FIG. 1 is a schematic diagram of a feature image provided by this application;

FIG. 2 is a schematic diagram of the architecture of a system provided by this application;

FIG. 3A is a schematic diagram of a target recognition method provided by this application;

FIG. 3B is a schematic diagram of another target recognition method provided by this application;

FIG. 4 is a schematic diagram of a method for determining a distinguishing region provided by this application;

FIG. 5A is a schematic diagram of a method for configuring scores for spatial features using an attention model provided by this application;

FIG. 5B is a schematic diagram of a method for configuring scores for spatial features and temporal features using an attention model provided by this application;

FIG. 6A is a schematic diagram of a method for generating a first feature image provided by this application;

FIG. 6B is a schematic diagram of the effect of a first feature image provided by this application;

FIG. 7A is a schematic diagram of a method for generating a second feature image provided by this application;

FIG. 7B is a schematic diagram of the effect of a second feature image provided by this application;

FIG. 8 is a schematic diagram of converting a third feature image into a fourth feature image provided by this application;

FIG. 9 is a schematic diagram of a method for generating a sixth feature image provided by this application;

FIG. 10A is a schematic structural diagram of ResNet50 provided by this application;

FIG. 10B is a schematic structural diagram of a CNN provided by this application;

FIG. 11 is a schematic structural diagram of a target recognition device provided by this application;

FIG. 12 is a schematic diagram of an apparatus provided by this application.
Detailed Description of Embodiments

Before describing the target recognition method and device provided by the embodiments of the present application, some concepts used in the embodiments are first explained:

1. Image features and feature images

Image features are used to characterize the attributes of an image. There are many types of image features; they can be divided into spatial features and visual features, and different image features can characterize an image from different angles. An image feature can be quantified as a numerical value, called a feature value.

The image features of different regions of an image differ; that is, different regions of the image correspond to different feature values, and the image composed of the feature values corresponding to the regions of the image is the feature image.

2. Channel dimension, spatial dimension, visual features, and spatial features

As shown in FIG. 1, a feature image can be abstracted as a cuboid in space whose dimensions are C, H, and W, where the direction of C is the channel dimension and the plane spanned by W and H is the spatial dimension.

As can be seen from FIG. 1, the length of the feature image in the channel dimension is C, which can be understood as the feature image having C channels. Visual features describe features in the channel dimension, and one channel can correspond to one visual feature. The number of channels may differ between feature images. There are many types of visual features, such as color features and texture features.

In the spatial dimension, the feature image shows the distances or relationships between the people or objects in the image; spatial features are features in the spatial dimension. One feature value of a spatial feature corresponds to one region (composed of multiple pixels) of the image and describes the characteristics of that region in the spatial dimension.
3. Distinguishing region

The distinguishing region is used to distinguish targets of different categories and is a subset of the regions that can characterize how a target's category differs from the categories of other targets. That is, an image contains many regions that can characterize such differences, and a part of them, i.e., a subset, is selected as the distinguishing region. For example, in an image of a vehicle, the regions containing the front, the logo, the rear-view lights, and the tires can all characterize differences from the categories of other targets; among these, the regions containing the front, the logo, and the rear-view lights may be taken as the distinguishing region.

There are many methods for determining the distinguishing region. For example, the attention model provided in the embodiments of the present application may be used; alternatively, the pixels of the feature image may be clustered, and regions whose pixel values are greater than a set value, or whose pixel values fall within a preset range, may be taken as the distinguishing region.

The distinguishing region in the embodiments of the present application is suitable for coarse-grained target recognition (identifying the large category to which a target belongs, such as plants, animals, or people) as well as fine-grained target recognition (identifying the small category to which a target belongs, for example identifying the species of different birds). Taking a fine-grained scenario as an example, the distinguishing region is the region that can distinguish the category of the target in the image from others within the same large category. Simply put, consider distinguishing different birds within the large category of birds (parrots, sparrows, orioles, etc.): many birds in the same large category are extremely similar in appearance and size, with only very small differences, and these differences mostly exist in areas such as the bird's beak, claws, feather color, eyes, and tail. These areas are called distinguishing regions; they are the regions by which the bird can be distinguished.
4. Aggregation and dimension reduction

In the embodiments of the present application, multiple feature images can be aggregated in the channel dimension; aggregation in the channel dimension refers to stacking two feature images along the channel dimension.

The embodiments can also perform dimension reduction on a feature image in the channel dimension, which refers to shrinking the feature image's length in the channel dimension so that the reduced feature image meets a specific requirement. When multiple feature images are aggregated in the channel dimension, the length of the aggregated image in the channel dimension equals the sum of the lengths of those feature images in the channel dimension. To ensure that the output image's length in the channel dimension is consistent with that of the feature images before aggregation, dimension reduction can be performed after aggregating the multiple feature images, so as to obtain a feature image of the same size.
5. Receptive field

The receptive field refers to the size of the region of the original image onto which a pixel of a feature image is mapped.
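The patent does not give a formula for this, but as background the standard recurrence for the receptive field r_l after layer l of stacked convolutions is

    r_l = r_{l-1} + (k_l - 1) * \prod_{i=1}^{l-1} s_i

where k_l is the kernel size of layer l, s_i is the stride of layer i, and r_0 = 1. For example, two stacked 3×3 stride-1 convolutions give a receptive field of 1 + 2 + 2 = 5.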
6. Dilated separable convolution and dilation rate

Dilated separable convolution is a combination of dilated convolution and depthwise separable convolution. Dilated convolution (also called atrous convolution) injects holes (zero filling) into a standard convolution kernel to increase the receptive field. Compared with the normal convolution operation, dilated convolution has one extra parameter: the dilation rate (abbreviated as rate), which refers to the number of zeros between the points of the convolution kernel. Dilated convolution not only enlarges the kernel size according to the dilation rate but also pads the feature map with zeros, so that the image after convolution has the same size as before convolution but a larger receptive field.

Depthwise separable convolution is a lightweight convolution operation that can obtain channel and spatial information separately. Compared with standard convolution, its parameter count and computational cost are much lower. It is divided into two operations, depthwise convolution and pointwise convolution. In depthwise convolution, unlike standard convolution, one convolution kernel convolves one channel, and each channel has its corresponding kernel. In pointwise convolution, each kernel effectively fuses the information of multiple channels and generates one feature image, so multiple kernels extract different features to obtain a multi-dimensional feature output. In short, depthwise convolution operates on each channel independently and obtains each channel's information separately, so information at the same spatial positions of different channels does not interact; pointwise convolution is therefore needed to complete the information exchange between channels.

Dilated separable convolution applies dilated convolution within the depthwise separable convolution process: dilated convolution is first used to perform depthwise convolution on each channel, and pointwise convolution then fuses the information of the channels. This dilated separable convolution operation can reduce the amount of computation and the number of parameters without lowering classification accuracy.
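For a k×k kernel with dilation rate r, the effective kernel size is k + (k-1)(r-1), so a 3×3 kernel with rate 2 covers a 5×5 area while keeping nine parameters. A minimal PyTorch sketch of one dilated separable convolution (our construction of the operation described above, not code from the patent):

    import torch.nn as nn

    def dilated_separable(c: int, rate: int) -> nn.Sequential:
        """Dilated depthwise 3*3 conv (groups=c: one kernel per channel),
        followed by a pointwise 1*1 conv that exchanges information across
        channels; padding=rate keeps H*W unchanged."""
        return nn.Sequential(
            nn.Conv2d(c, c, kernel_size=3, padding=rate,
                      dilation=rate, groups=c),
            nn.Conv2d(c, c, kernel_size=1),  # pointwise convolution
        )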
如图2所示为本申请实施例所适用的一种***架构图,该***中包括图像收集装置200、目标识别装置100。FIG. 2 is a schematic diagram of a system to which this embodiment of the present application is applied, and the system includes an image collection device 200 and a target identification device 100 .
图像收集装置200用于收集图像,图像收集装置200在收集到图像之后,将收集到的图像反馈给目标识别装置100。在不同的应用场景中,图像收集装置200部署的位置以及图像收集装置200的类型会不同。例如,在违章车辆识别的场景中,图像收集装置200可以是部 署在道路两侧的摄像装置,也可以是部署在交通路口的监控装置。图像收集装置200可以拍摄道路的图像,将拍摄到的图像发送给目标识别装置100。又例如,在物种类别识别场景中,图像收集装置200可以是部署在森林、或海洋的摄像装置,图像收集装置200可以拍摄森林中各种动植物的图像,或海洋中各种动植物的图像,将拍摄到的图像发送给目标识别装置100。The image collection device 200 is used to collect images. After the image collection device 200 collects the images, the collected images are fed back to the target recognition device 100 . In different application scenarios, the location where the image collection apparatus 200 is deployed and the type of the image collection apparatus 200 will be different. For example, in the scene of illegal vehicle identification, the image collection device 200 may be a camera device deployed on both sides of a road, or a monitoring device deployed at a traffic intersection. The image collection device 200 may capture an image of the road, and send the captured image to the object recognition device 100 . For another example, in a species category recognition scenario, the image collection device 200 may be a camera deployed in a forest or ocean, and the image collection device 200 may capture images of various animals and plants in the forest, or images of various animals and plants in the ocean , and send the captured image to the target recognition device 100 .
目标识别装置100能够接收来自图像收集装置200的图像,执行本申请实施例提供的目标识别方法。本申请实施例并不限定目标识别装置100部署的位置,例如该目标识别装置100可以部署在边缘数据中心,如部署在边缘数据中心的边缘计算节点(multi-access edge computing,MEC),也可以部署在云数据中心,还可以部署在终端计算设备上。目标识别装置100也可以分布式的部署在边缘数据中心、云数据中心以及终端计算设备中的部分或全部环境中。The target recognition apparatus 100 can receive the image from the image collection apparatus 200, and execute the target recognition method provided by the embodiment of the present application. This embodiment of the present application does not limit the location where the target identification device 100 is deployed. For example, the target identification device 100 may be deployed in an edge data center, such as an edge computing node (multi-access edge computing, MEC) deployed in an edge data center, or Deployed in cloud data centers, it can also be deployed on terminal computing devices. The target identification apparatus 100 may also be distributed in some or all of the environments of edge data centers, cloud data centers, and terminal computing devices.
目标识别装置100可以为一个硬件装置,如服务器、服务集群、终端计算设备,也可以为一个软件装置,具体可以为运行在硬件计算设备上的软件模块。The target identification device 100 may be a hardware device, such as a server, a service cluster, or a terminal computing device, or a software device, specifically a software module running on the hardware computing device.
在本申请实施例中,目标识别装置100在进行目标识别时,可以进行粗粒度的目标识别,也可以对目标进行更细粒度的识别。举例来说,粗粒度的目标识别可以理解为目标识别装置100能够对目标进行简单的分类,对目标所属的大类别进行识别,例如,目标识别装置100可以识别图像中的人类、车辆、动物、植物。细粒度的目标识别可以理解为目标识别装置100能够对目标进行精细的分类,对目标所属的小类别进行识别,例如,目标识别装置100可以识别图像中的车辆的车型、品牌等,又例如,目标识别装置100可以识别图像中不同鸟所属的种类。In this embodiment of the present application, when the target recognition apparatus 100 performs target recognition, it can perform coarse-grained target recognition, and can also perform more fine-grained recognition on the target. For example, the coarse-grained target recognition can be understood as the target recognition device 100 can simply classify the target and identify the large category to which the target belongs. For example, the target recognition device 100 can recognize humans, vehicles, animals, plant. Fine-grained target recognition can be understood as the target recognition device 100 being able to finely classify the target and identify the small category to which the target belongs. For example, the target recognition device 100 can recognize the model, brand, etc. of the vehicle in the image. The object recognition device 100 can recognize the species to which different birds in the image belong.
另外,目标识别装置100除了接收图像收集装置200发送的图像,还可以接收来自其他装置的数据,以违章车辆识别的场景为例,目标设备装置100还可以接收来自路侧单元、雷达测量的数据。其中,路侧单元(road side unit,RSU)可以对经过该路侧单元的车辆进行识别,获取该车辆的信息,路侧单元可以将获取的车辆的信息发送给目标识别装置100,目标识别装置100在对图像收集装置发送的图像进行目标识别后,还可以根据路侧单元发送的车辆的信息对图像中的目标进行标识。雷达可以进行测距,测量车辆之间的距离、车辆到某一物体的距离。雷达可以将测量到的信息发送边缘感知单元,之后再由边缘感知单元将该信息发送至目标识别装置,目标识别装置100在对图像收集装置200发送的图像进行目标识别后,还可以将雷达测量到的信息标注在图像中的目标上。In addition, the target recognition device 100 can receive data from other devices in addition to the images sent by the image collection device 200. Taking the scene of illegal vehicle recognition as an example, the target device device 100 can also receive data from roadside units and radar measurements. . The roadside unit (RSU) can identify the vehicle passing through the roadside unit and obtain the information of the vehicle, and the roadside unit can send the obtained vehicle information to the target identification device 100, and the target identification device 100 After the target identification is performed on the image sent by the image collection device, the target in the image can also be identified according to the vehicle information sent by the roadside unit. Radar can perform ranging, measuring the distance between vehicles and the distance from a vehicle to an object. The radar can send the measured information to the edge sensing unit, and then the edge sensing unit sends the information to the target recognition device. After the target recognition device 100 performs target recognition on the image sent by the image collection device 200, the radar measurement The obtained information is annotated on the target in the image.
Besides receiving data (such as data from the image collection apparatus, roadside unit, or radar), the target recognition apparatus 100 can send the recognized information (such as information about the target, or an image annotated with the target and with the information from the roadside unit, radar, and the like) to other devices. For example, in the illegal-vehicle recognition scenario, the target recognition apparatus can recognize an illegal vehicle in the input image, obtain information about the illegal vehicle, and send that information to a traffic command center system.
A target recognition method provided by the embodiments of the present application is described below with reference to the accompanying drawings. Referring to FIG. 3A, the method includes:
Step 101: The target recognition apparatus 100 acquires an input image from the image collection apparatus 200, and the input image includes the target to be recognized.
Step 102: The target recognition apparatus 100 obtains a first feature image by occluding the distinguishing region in the input image; that is, the first feature image is a feature image in which the distinguishing region of the input image is occluded. In the first feature image, the distinguishing region is hidden: the feature values of pixels in the distinguishing region may be significantly smaller than those of pixels in other regions, for example zero.
For how the distinguishing region is determined, refer to the description of step 202 in the embodiment shown in FIG. 3B; for how the target recognition apparatus 100 obtains the first feature image, refer to the descriptions of steps 203 to 204 in the embodiment shown in FIG. 3B.
Step 103: The target recognition apparatus 100 uses the first feature image to recognize the target in the input image.
For the target recognition process performed by the target recognition apparatus 100, refer to the descriptions of steps 207 to 211 in the embodiment shown in FIG. 3B.
The goal is that, in image-based target recognition, even when the target in the image is occluded or the image does not reflect the true scene because of the shooting environment, accurate target recognition can still be performed on that image and information about the target obtained. In the embodiments of the present application, after the input image is acquired, a first feature image is generated from the distinguishing region of the input image, the first feature image being a feature image in which that distinguishing region is occluded. After the first feature image is obtained, the target is recognized according to it. As the above process shows, the embodiments of the present application consider the case where the distinguishing region is occluded, which corresponds to occlusion in the recognized image (or to the image failing to reflect the true scene because of the shooting environment). Based on the first feature image, the target in the input image can be recognized more accurately, the influence of occlusion or environmental factors on target recognition is reduced, and recognition accuracy is improved.
It should be noted that the target recognition process provided by the embodiments of the present application mainly involves the field of deep learning. The modules or neural networks used can be trained before use: a training set is first used to train the module or neural network, and its parameters are adjusted continuously so that it can output fairly accurate results. After training is completed, the module or neural network can be put into use; it can process input data and output results, such as feature images. The way input data is processed is the same during training and during use; the difference is that during training the parameters of the module or neural network are adjusted according to each output, whereas during use the focus is on obtaining the output from the module or neural network. The methods below take the usage phase as an example to introduce the target recognition method provided by the embodiments of the present application.
Referring to FIG. 3B, to ensure the efficiency of target recognition, the embodiments of the present application take recognizing the target using both the first feature image and the second feature image as an example. The method specifically includes:
Step 201: The target recognition apparatus 100 acquires an input image; the type of the input image is not limited here. The input image may be an image sent directly after being captured by the image collection apparatus 200, or an image obtained by processing an image captured by the image collection apparatus 200.
Step 202: The target recognition apparatus 100 determines the distinguishing region in the input image.
Before computing the distinguishing region of the image, feature extraction must first be performed on the image to obtain its feature image. The embodiments of the present application do not limit how the target recognition apparatus 100 obtains the feature image of the input image. For example, the target recognition apparatus 100 may use the ResNet50 neural network or the VGG16 network to obtain the feature image of the input image; the feature image may be one output by the network layers in the bottleneck part of ResNet50 or by the middle and high network layers of VGG16, such as conv3_x, conv4_x, conv5_x, and conv6 of the VGG16 network.
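As an illustration of this kind of feature extraction, the following is a minimal sketch (not the patent's implementation) of pulling an intermediate feature image out of a torchvision ResNet50 with a forward hook; the layer name `layer3` and the input size are assumptions chosen for the example.

```python
import torch
import torchvision.models as models

backbone = models.resnet50(weights=None)  # untrained here; pretrained weights may be loaded instead
backbone.eval()
features = {}

def save_feature(name):
    def hook(module, inputs, output):
        features[name] = output  # feature image, shape [B, C, H, W]
    return hook

# "layer3" is one of ResNet50's later stages, comparable to the
# middle/high network layers mentioned above.
backbone.layer3.register_forward_hook(save_feature("layer3"))

with torch.no_grad():
    _ = backbone(torch.randn(1, 3, 224, 224))  # dummy input image
feat = features["layer3"]                      # e.g. [1, 1024, 14, 14]
```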
The present application does not limit how the target recognition apparatus 100 determines the distinguishing region; any method capable of determining the distinguishing region is applicable to the embodiments of the present application. A method for determining the distinguishing region provided by an embodiment of the present application is described below. As shown in FIG. 4, the method includes:
Step 301: The target recognition apparatus 100 determines the spatial features of the input image.
After the target recognition apparatus 100 obtains the feature image of the input image, the values of the pixels of that feature image in the spatial dimension represent the spatial features of the input image; the value of each pixel is its feature value.
Step 302: The target recognition apparatus 100 may assign scores to the spatial features based on an attention model.
An attention model can weigh multiple pieces of information from a particular perspective and determine the value of each. In the embodiments of the present application, when performing step 302, the target recognition apparatus 100 may use the attention model to weigh the spatial features of the feature image corresponding to the input image and determine the value of each spatial feature, for example the amount of information it carries, and assign a score to each spatial feature. For example, spatial features carrying rich information are assigned higher scores, and spatial features carrying little information are assigned lower scores.
Optionally, in addition to assigning scores to spatial features, the target recognition apparatus 100 may also assign scores to visual features, that is, score each visual feature in the channel dimension.
The way the target recognition apparatus 100 assigns scores to visual features is similar to the way it assigns scores to spatial features: an attention model may likewise be used. The attention model used to score spatial features and the attention model used to score visual features are two mutually independent attention models.
FIG. 5A is a flowchart of the target recognition apparatus 100 using an attention model to score spatial features and assign score values.
In FIG. 5A, for the feature image Fin of the input image, an attention model can be used to assign scores to the spatial features of the input image. The scores can be applied directly to the feature image of the input image, that is, each score is multiplied by the corresponding feature value in the feature image, to obtain feature image B. Feature image B has the same size as Fin; it is the feature image obtained after the spatial-feature scores have been applied to Fin. Feature image B may be the feature image on which the first coefficient map and the second coefficient map act in steps 204 and 206.
FIG. 5B is a flowchart of the target recognition apparatus 100 using two independent attention models to score spatial features and visual features and assign score values.
In FIG. 5B, for the feature image Fin of the input image, two attention models can be used in parallel to assign scores to the spatial features and visual features of the input image. The scores assigned to the spatial features can be applied directly to the feature image of the input image, that is, multiplied by the corresponding feature values, to obtain feature image A; the scores for the visual features of the input image can likewise be applied directly to obtain feature image B. Feature image A and feature image B are then aggregated and dimension-reduced to obtain feature image Fout. Fin and Fout have the same size; Fout is the feature image obtained after both the spatial-feature scores and the visual-feature scores have been applied to Fin. Feature image Fout may be the feature image on which the first coefficient map and the second coefficient map act in steps 204 and 206.
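A hedged sketch of the dual-attention flow of FIG. 5B follows; the two attention sub-networks (a 1*1 convolution producing per-pixel scores, and a pooled 1*1 convolution producing per-channel scores) are illustrative stand-ins for the patent's attention models, not the actual models.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Scores spatial positions and channels of Fin, applies both score
    maps, then aggregates and reduces back to C channels (Fout)."""
    def __init__(self, channels):
        super().__init__()
        self.spatial_att = nn.Conv2d(channels, 1, kernel_size=1)   # per-pixel score
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),          # per-channel score
        )
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, fin):                                        # fin: [B, C, H, W]
        spatial = torch.sigmoid(self.spatial_att(fin))             # [B, 1, H, W]
        channel = torch.sigmoid(self.channel_att(fin))             # [B, C, 1, 1]
        feat_a = fin * spatial          # feature image A (spatial branch)
        feat_b = fin * channel          # feature image B (visual/channel branch)
        fout = self.reduce(torch.cat([feat_a, feat_b], dim=1))     # aggregate + reduce
        return fout, spatial            # spatial scores also feed steps 203/205
```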
It should be noted that the value of each visual or spatial feature lies in its contribution to the classification performed in target recognition (whether fine-grained or coarse-grained). Some visual or spatial features fairly directly indicate attributes of the target (such as its category) and contribute a lot to target recognition; others do not highlight the target's attributes and contribute little. In the embodiments of the present application, taking the determination of the distinguishing region based on spatial features as an example, using an attention model in the spatial dimension to assign scores strengthens the expression of spatial features that benefit target recognition and weakens the expression of spatial features that matter little, so that an effective feature expression is obtained and the accuracy of target recognition is improved.
Step 303: The target recognition apparatus 100 determines the distinguishing region according to the scores of the spatial features in the input image. The embodiments of the present application do not limit how the target recognition apparatus 100 performs step 303. For example, the target recognition apparatus 100 may take the region of the input image whose spatial-feature scores are greater than a threshold as the distinguishing region; the threshold may be an empirical value or a value determined through simulation or modeling. Alternatively, the region whose spatial-feature scores fall within a specific range may be taken as the distinguishing region; that range may be a fixed value or a manually set value.
After the distinguishing region in the input image is determined, the first feature image (see steps 203-204) and the second feature image (see steps 205-206) can be determined separately.
Step 203: The target recognition apparatus 100 occludes the distinguishing region in the input image and assigns a first coefficient value to each pixel of the input image; the first coefficient values of all pixels constitute a first coefficient map.
To occlude the distinguishing region in the input image, a lower first coefficient value can be assigned to the pixels in the distinguishing region, and a higher first coefficient value to the remaining pixels outside the distinguishing region.
For example, for a pixel F_i inside the distinguishing region, which needs to be occluded, the first coefficient value corresponding to that pixel can be set to zero; if the spatial-feature score of a pixel is less than the threshold, its first coefficient value is set to 1, that is:
$$\mathrm{MAP}_{\mathrm{occ}}(i)=\begin{cases}0, & \mathrm{Att}(F_i)\geq t\\ 1, & \mathrm{Att}(F_i)<t\end{cases}$$
where Att(F_i) is the score of the spatial feature of pixel F_i determined based on the attention model, and t is the threshold.
Step 204: The target recognition apparatus 100 applies the first coefficient map to the feature image of the input image to obtain the first feature image. The target recognition apparatus 100 may also apply the first coefficient map to feature image B shown in FIG. 5A or to feature image Fout; applying the first coefficient map to the feature image of the input image is used here only as an example.
The target recognition apparatus 100 multiplies the value of each pixel of the feature image of the input image by the first coefficient value of that pixel in the first coefficient map to obtain the first feature image. The size of the first feature image is C*H*W, where C is the channel length, H is the spatial height, and W is the spatial width.
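A minimal sketch of steps 203-204, under the assumption that `att` is the [B, 1, H, W] spatial score map from the attention model and `t` is the threshold:

```python
import torch

def first_feature_image(feat, att, t):
    """Occlude the distinguishing region: coefficient 0 where the score
    reaches the threshold, 1 elsewhere (the first coefficient map)."""
    coeff1 = (att < t).float()    # [B, 1, H, W]; MAP_occ in the formula above
    return feat * coeff1          # first feature image, [B, C, H, W]
```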
FIG. 6A is a flowchart of the target recognition apparatus 100 generating the first feature image (the part that uses the attention model to score the spatial features of the input image and determine the distinguishing region corresponds to steps 301-303, and the part that generates the first coefficient map corresponds to step 203). The target recognition apparatus 100 uses the attention model to score the spatial features of the input image and assign score values (used to determine the distinguishing region), then generates the first coefficient map based on the spatial-feature score of each pixel, and finally applies the first coefficient map to the feature image of the input image to generate the first feature image.
FIG. 6B shows the effect of the first feature image. The distinguishing regions in the input image may be the vehicle's rearview mirrors, headlights, and license plate. After these distinguishing regions are occluded, they appear black in the input image while the other regions are displayed normally.
Optionally, the target recognition apparatus 100 may also obtain a second feature image without occluding the distinguishing region of the input image; the second feature image is a feature image in which the distinguishing region of the input image is not occluded, and in it the distinguishing region can be displayed normally. The second feature image may simply be the feature image of the input image, that is, the distinguishing region of that feature image is left unoccluded and displayed normally. In a possible implementation, to further highlight the difference between the distinguishing region and other regions, the second feature image may be a feature image in which the feature values of pixels in the distinguishing region are significantly higher than those of pixels in other regions. For how the target recognition apparatus 100 obtains the second feature image, refer to the descriptions of steps 205 to 206.
Step 205: The target recognition apparatus 100 does not occlude the distinguishing region in the input image and assigns a second coefficient value to each pixel of the input image; the second coefficient values of all pixels constitute a second coefficient map.
The target recognition apparatus 100 not occluding the distinguishing region of the input image means keeping the distinguishing region clear, so that it can be displayed normally or even highlighted. When assigning second coefficient values to the pixels of the input image, the target recognition apparatus 100 may assign higher second coefficient values to the pixels in the distinguishing region and lower second coefficient values to the remaining pixels outside the distinguishing region.
For example, for a pixel inside the distinguishing region of the feature image of the input image, which needs to be highlighted, its second coefficient value can be set to 1; for a pixel outside the distinguishing region of the input image, its second coefficient value is set to 0.
For another example, for each pixel of the feature image of the input image, the spatial-feature scores may be normalized so that they are distributed in [0,1]; the normalized spatial-feature score of each pixel can then serve as its second coefficient value, forming the second coefficient map.
The spatial-feature score of each pixel is normalized as follows:
$$\mathrm{MAP}_{\mathrm{disp}}(i)=\sigma\bigl(\mathrm{Att}(F_i)\bigr),\qquad \text{where}\quad \sigma(x)=\frac{1}{1+e^{-x}}$$
Step 206: The target recognition apparatus 100 applies the second coefficient map to the feature image of the input image to obtain the second feature image. Similar to step 204, the target recognition apparatus 100 may also apply the second coefficient map to feature image B shown in FIG. 5A or to feature image Fout; applying the second coefficient map to the feature image of the input image is used here only as an example.
The target recognition apparatus 100 multiplies the value of each pixel of the feature image of the input image by the second coefficient value of that pixel in the second coefficient map to obtain the second feature image. The size of the second feature image is C*H*W, where C is the channel length, H is the spatial height, and W is the spatial width.
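A matching sketch of steps 205-206, using the sigmoid-normalized scores as the second coefficient map (the normalization variant described above); as before, `att` is assumed to be the [B, 1, H, W] spatial score map:

```python
import torch

def second_feature_image(feat, att):
    """Keep/highlight the distinguishing region: normalized scores in
    [0, 1] serve directly as the second coefficient map."""
    coeff2 = torch.sigmoid(att)   # MAP_disp in the formula above
    return feat * coeff2          # second feature image, [B, C, H, W]
```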
FIG. 7A is a flowchart of the target recognition apparatus 100 generating the second feature image (the part that uses the attention model to score the spatial features of the input image and determine the distinguishing region corresponds to steps 301-303, and the part that generates the second coefficient map corresponds to step 205). The target recognition apparatus 100 uses the attention model to score the spatial features of the input image and assign score values, then generates the second coefficient map based on the spatial-feature score of each pixel, and finally applies the second coefficient map to the feature image of the input image to generate the second feature image.
FIG. 7B shows the effect of the second feature image. The distinguishing regions in the input image may be the vehicle's rearview mirrors, headlights, and license plate; these distinguishing regions are not occluded and, further, can be strengthened: they appear brighter in the input image while the other regions are darker.
Through the above steps, the target recognition apparatus 100 has obtained the first feature image and the second feature image.
Step 207: The target recognition apparatus 100 aggregates the first feature image and the second feature image in the channel dimension and reduces the dimension to generate a third feature image.
Typically, the size of the first and second feature images after aggregation in the channel dimension is 2C*H*W. To ensure that the size of the third feature image matches that of the first or second feature image, after the first and second feature images are aggregated, the aggregated image can be dimension-reduced, that is, compressed in the channel dimension, to generate the third feature image, so that the length of the third feature image in the channel dimension is C.
In other words, the third feature image is equivalent to concatenating the first and second feature images in the channel dimension and then compressing the result into a feature image. The third feature image contains both the first and second feature images; that is, in the channel dimension, the part belonging to the first feature image and the part belonging to the second feature image can be distinguished within the third feature image.
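A minimal sketch of step 207, assuming a 1*1 convolution as the channel-dimension compression (the patent does not fix the reduction operator):

```python
import torch
import torch.nn as nn

C = 256  # assumed channel length of the first/second feature images
reduce_conv = nn.Conv2d(2 * C, C, kernel_size=1)  # 2C -> C compression

def third_feature_image(f1, f2):
    stacked = torch.cat([f1, f2], dim=1)   # channel-dimension aggregation, [B, 2C, H, W]
    return reduce_conv(stacked)            # third feature image, [B, C, H, W]
```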
Step 208: The target recognition apparatus 100 assigns weights to the third feature image in the channel dimension to generate a fourth feature image.
When performing step 208, the target recognition apparatus 100 may first assign weights to the third feature image in the channel dimension, that is, apply the weights to the channels of the third feature image, to generate the fourth feature image.
The embodiments of the present application do not limit how weights are assigned to the third feature image in the channel dimension. For example, an efficient channel attention (ECA) model based on channel-relationship modeling may be used to assign weights to the third feature image in the channel dimension.
The ECA model is pre-trained and built on the channel dimension; it can model channel relationships and learn the connections between visual features, so that assigning weights to the third feature image in the channel dimension yields a more efficient visual feature expression.
At the start of training, the parameters of the ECA model are randomly initialized. As training proceeds, the classifier feeds the classification results back to the ECA model, so that the ECA model can adaptively adjust the weights according to those results: visual features that contribute prominently to classification receive large weights, and visual features that contribute little to the classification result receive small weights. The model thus keeps learning and adjusting the weight allocation during training until it reaches a stable state, at which point a weight allocation most helpful for target recognition has been obtained. In short, during the ECA model's training, based on the gradient descent algorithm and continuous learning over the feature images in the training set, the ECA model learns to redistribute all the channel weights of a feature image: when the distinguishing region is occluded, it can increase the channel weights of the part of the feature image in which the distinguishing region is occluded, so that the subsequent classifier can learn the distinguishing features of other regions; otherwise, it increases the channel weights of the part of the feature image in which the distinguishing region is not occluded, so that the features of the distinguishing region are brought out and the classifier can recognize them and make a correct judgment.
The weights assigned to the third feature image satisfy the following condition: when the distinguishing region is occluded in the input image, the channel weights of the part of the third feature image belonging to the first feature image are greater than those of the part belonging to the second feature image; when the distinguishing region is not occluded in the input image, the channel weights of the part of the third feature image belonging to the first feature image are smaller than those of the part belonging to the second feature image.
FIG. 8 is a schematic diagram of converting the third feature image into the fourth feature image using the ECA model. In FIG. 8, the part between the third feature image and the fourth feature image is the ECA model; only some of the operations it includes, such as the global average pooling operation (GAP) and the sigmoid activation function, are drawn as examples.
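A sketch of the ECA-style reweighting in FIG. 8, following the published ECA-Net design (GAP, a small 1-D convolution across channels, sigmoid); the kernel size k=3 here is an assumption, since ECA normally derives it from the channel count:

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                        # x: third feature image, [B, C, H, W]
        y = self.gap(x)                          # [B, C, 1, 1] channel descriptors
        y = self.conv(y.squeeze(-1).transpose(1, 2))        # local cross-channel interaction, [B, 1, C]
        w = torch.sigmoid(y).transpose(1, 2).unsqueeze(-1)  # [B, C, 1, 1] channel weights
        return x * w                             # fourth feature image
```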
In steps 201 to 208, in the fourth feature image finally obtained by the target recognition apparatus 100, the part belonging to the first feature image and the part belonging to the second feature image are assigned different weights, corresponding to whether the distinguishing region of the input image is occluded. On this basis, when the distinguishing region of the input image is occluded, a higher weight can be assigned to the first feature image and a lower weight to the second feature image, so that during subsequent target recognition more information can be obtained from regions other than the distinguishing region to assist in recognizing the target within the distinguishing region and determining its category. When the distinguishing region of the input image is not occluded, a lower weight can be assigned to the part of the fourth feature image belonging to the first feature image and a higher weight to the part belonging to the second feature image, so that during subsequent target recognition more information can be obtained from the distinguishing region and a more comprehensive analysis of it can be performed to accurately recognize the target within it.
In the embodiments of the present application, steps 202 to 208 may be performed by a discriminative fine-grained feature representation method based on double attention (DFM) apparatus. The DFM apparatus can be embedded in a neural network, for example after the network layers of the neural network used to extract image features. The embodiments of the present application do not limit the position or number of DFM apparatuses; for example, a DFM apparatus can be embedded after every network layer of the neural network capable of extracting image features. Taking ResNet50 as an example, DFM apparatuses can be embedded in the CNN, located after each of its stages. Usually, in the neural network, between the network layers that extract image features and the classifier there are other network layers, which can perform some processing on the fourth feature image output by the DFM apparatus. After a series of processing steps (the specific type of processing is not limited here; it may be, for example, a convolution operation, a pooling operation, or a combination of the two), a fifth feature image is obtained, and only then can the fifth feature image be transmitted to the classifier for classification.
To further improve the classification precision of the classifier, that is, the accuracy of target recognition, the target recognition apparatus 100 may further process the fifth feature image. The further processing of the fifth feature image is described below.
Step 209: Based on the fifth feature image, the target recognition apparatus 100 determines multiple candidate feature images, each corresponding to a different receptive field, that is, each candidate feature image corresponds to one receptive field. The receptive field refers to the size of the region of the input image onto which the pixels of a feature image map. The number of candidate feature images is not limited here; it can be determined according to the actual application scenario.
To obtain multiple candidate feature images, the target recognition apparatus 100 may apply multiple convolution kernels of different sizes to the fifth feature image and obtain multiple candidate feature images through dilated separable convolution. Because the convolution kernels differ in size, the receptive fields of the candidate feature images obtained through dilated separable convolution also differ.
Step 210: The target recognition apparatus 100 fuses the multiple candidate feature images into a sixth feature image. The receptive field of the sixth feature image is a region containing little redundant area, where a redundant area is an area that does not benefit target recognition, that is, contains little or no information characterizing the target's category. From another perspective, the receptive field of the sixth feature image contains a large amount of valid information, where valid information is information usable for target recognition, for example information that can be extracted by the classifier and on whose basis the target's category can be determined.
Taking bird species recognition as an example, the receptive field of the sixth feature image may include fewer non-bird areas. Suppose the current image shows a bird's head: the color of the head feathers will appear differently depending on the shooting scene, so head-feather color does not benefit target recognition and belongs to the redundant area. The bird's beak, eyes, and similar areas do not change easily with the shooting scene and contain more valid information, so the receptive field of the sixth feature image may include the bird's beak, eyes, and similar areas.
When fusing the multiple candidate feature images, the target recognition apparatus 100 may aggregate them, reduce the dimension, and assign weights to the parts of the aggregated, dimension-reduced feature image that belong to each candidate feature image, obtaining the sixth feature image, where the weights assigned to the candidate feature images are obtained in advance through training.
Through steps 209 to 210, each of the candidate feature images obtained by the target recognition apparatus 100 corresponds to one receptive field, and the receptive fields differ in size: some may cover only part of the target, and some may cover the target but also contain a large non-target area. By fusing the multiple candidate feature images (aggregation, dimension reduction, and weight assignment), an adaptive matching between receptive field and target can be achieved, yielding a receptive field more conducive to target recognition, namely the receptive field of the sixth feature image. In this way, when target recognition is performed based on the sixth feature image, the target's features can be extracted more accurately and the category to which the target belongs can be determined.
FIG. 9 is a flowchart of the target recognition apparatus 100 generating the sixth feature image. In FIG. 9, the target recognition apparatus 100 uses six different convolution kernels to perform convolution operations on the fifth feature image.
The six kernels are a 1*1 convolution (conv) kernel, a 3*3 kernel with dilation rate 1, a 3*3 kernel with dilation rate 2, a 3*3 kernel with dilation rate 3, a 3*3 kernel with dilation rate 4, and a 3*3 kernel with dilation rate 5.
The size of a convolution kernel refers to its length-by-width dimensions; for example, commonly used sizes are 3X3 and 5X5.
Passing the fifth feature image through one convolution kernel outputs one candidate feature image of size C*H*W; passing it through the six kernels yields six candidate feature images of size C*H*W.
The target recognition apparatus 100 can aggregate the six C*H*W candidate feature images in the channel dimension and reduce the dimension to obtain one 6C/N*H*W feature image (where N is the channel dimension-reduction parameter; FIG. 9 uses N=16 as an example). After the dimension reduction, weights can be redistributed to the candidate feature images in the channel dimension. FIG. 9 only draws as examples some of the operations involved in the weight redistribution, such as the global average pooling operation (GAP), the 1*1 convolution kernel (conv1*1), BN+ReLU, and the sigmoid function. The assigned weights may be obtained through prior training and learning.
A convolution kernel of size 1*1 can implement the convolution operation, and setting the number of 1*1 kernels makes it possible to raise or lower the dimension. GAP sums and averages the feature values of a feature image to obtain a single value that can characterize the feature information of the whole feature image. BN+ReLU are the normalization and activation functions in a convolutional neural network, mainly implementing the normalization operation and enhancing nonlinearity. FC is the fully connected layer, a common layer in neural networks that can play the role of a "classifier" in the overall network.
It should be noted that in FIG. 9 one of the kernels is a 1*1 convolution (conv) kernel; by setting a 1*1 convolution kernel and performing the convolution operation, the original feature information can be effectively preserved, and obtaining global information through GAP effectively compensates for the possible discontinuity of the obtained information caused by dilated convolution, yielding a more complete and efficient feature expression.
The target recognition apparatus 100 may assign the weights to the dimension-reduced feature image directly in the channel dimension to obtain the sixth feature image, or, after assigning the weights in the channel dimension, aggregate it with another feature image and reduce the dimension to generate the sixth feature image. That other feature image may be a feature image generated by performing an average pooling operation on the fifth feature image, with size C*H*W. The purpose of aggregating with another feature image and reducing the dimension is to keep the input and output feature dimensions consistent, and appropriate dimension reduction can effectively improve computational efficiency and recognition accuracy.
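Putting steps 209-210 together, the following is a simplified, hedged sketch of the FIG. 9 flow; plain (rather than separable) convolutions are used for brevity, and the branch-weighting head is a compressed stand-in for the GAP/conv1*1/BN+ReLU/sigmoid chain in the figure:

```python
import torch
import torch.nn as nn

class RFAM(nn.Module):
    def __init__(self, c, n=16):
        super().__init__()
        # One 1*1 branch plus five 3*3 branches with dilation rates 1-5;
        # padding equals the rate so all outputs stay C*H*W.
        self.branches = nn.ModuleList(
            [nn.Conv2d(c, c, kernel_size=1)] +
            [nn.Conv2d(c, c, kernel_size=3, padding=r, dilation=r)
             for r in range(1, 6)]
        )
        self.weight_head = nn.Sequential(
            nn.Conv2d(6 * c, (6 * c) // n, kernel_size=1),  # 6C -> 6C/N reduction
            nn.BatchNorm2d((6 * c) // n),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                        # GAP: global context
            nn.Conv2d((6 * c) // n, 6, kernel_size=1),      # one score per branch
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: fifth feature image, [B, C, H, W]
        cands = [b(x) for b in self.branches]      # six candidate feature images
        w = self.weight_head(torch.cat(cands, dim=1))   # [B, 6, 1, 1] branch weights
        fused = sum(w[:, i:i + 1] * cands[i] for i in range(6))
        return fused                               # sixth feature image, [B, C, H, W]
```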
Step 211: The target recognition apparatus 100 performs target recognition based on the sixth feature image.
When performing step 211, the target recognition apparatus 100 may do so by means of a classifier, which may be pre-trained and able to determine the category of the target in a feature image from that feature image, thereby realizing target recognition.
It should be understood that steps 209 and 210 describe the target recognition apparatus 100 processing the fifth feature image; in an actual application scenario, of course, the target recognition apparatus 100 may also process the fourth feature image directly after obtaining it.
In the embodiments of the present application, steps 209 to 210 may be performed by a multi-scale feature fusion method based on receptive field adaptive adjustment (RFAM) apparatus. The RFAM apparatus can be placed before the classifier to process the feature images to be input to the classifier, so that the classifier can ultimately output accurate results.
The implementation of the target recognition method provided by the embodiments of the present application when applied in ResNet50 is described below from the perspective of the overall application. FIG. 10A is a flowchart of image recognition when the method is applied in ResNet50. In this method, the image recognition apparatus can be split into three apparatuses, referred to for ease of distinction as the DFM apparatus, the RFAM apparatus, and the classifier. The DFM apparatus performs steps 201 to 208 of the embodiment shown in FIG. 3 above; the RFAM apparatus performs steps 209 to 210 of the embodiment shown in FIG. 3 above; and the classifier performs step 211 of the embodiment shown in FIG. 3 above.
ResNet50 includes a main line comprising a main CNN and a main RFAM apparatus, where the main CNN performs feature extraction on the input image and outputs feature images. DFM apparatuses can be added to the main CNN; FIG. 10B is a schematic structural diagram of the main CNN, which includes four stages (each stage is essentially a convolutional layer used for feature extraction). A DFM apparatus can be added after each stage, and each DFM apparatus processes the feature image output by the stage preceding it, for example by performing steps 201 to 208 of the embodiments of the present application. The main RFAM apparatus processes the feature images output by the main CNN. The main CNN can output multiple feature images, one of which covers the entire input image; that feature image can be transmitted to the main RFAM apparatus for processing. The multiple feature images also include feature images for different regions of the input image; those containing larger amounts of information can be transmitted to multiple branches following the main CNN for processing, one feature image per branch. Here, four branches connected after the main CNN are taken as an example. Each branch includes a branch CNN and a branch RFAM apparatus, and any branch can process one feature image output by the main CNN. Specifically, the branch CNN continues feature extraction on that feature image and outputs a new feature image; DFM apparatuses can be added to the branch CNN in the same way as to the main CNN, as described above and not repeated here. The branch RFAM apparatus processes the feature image output by the branch CNN.
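As a coarse sketch only of how the main line composes (the four branches of FIG. 10A are omitted for brevity, and the `dfm`, `rfam`, and `classifier` modules are assumed to be built along the lines of the sketches above):

```python
import torch.nn as nn

class MainLine(nn.Module):
    """Backbone stages interleaved with DFM modules, then RFAM and a classifier."""
    def __init__(self, stages, dfms, rfam, classifier):
        super().__init__()
        self.stages = nn.ModuleList(stages)    # e.g. the four ResNet50 stages
        self.dfms = nn.ModuleList(dfms)        # one DFM after each stage
        self.rfam = rfam
        self.classifier = classifier

    def forward(self, x):
        for stage, dfm in zip(self.stages, self.dfms):
            x = dfm(stage(x))                  # stage output refined by its DFM
        return self.classifier(self.rfam(x))   # recognition result
```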
The feature images output by the RFAM apparatus in the main line and by the branch RFAM apparatuses in the branches can be input into classifiers, which perform target recognition based on the feature images; the results of the classifiers are then aggregated to output the final result, which can indicate the target in the input image.
Based on the same inventive concept as the method embodiments, an embodiment of the present application further provides a target recognition apparatus for performing the methods performed by the target recognition apparatus in the method embodiments shown in FIGS. 3A-3B and 4 above; for related features, refer to the above method embodiments, which are not repeated here. As shown in FIG. 11, a target recognition apparatus provided by an embodiment of the present application, the target recognition apparatus 1100, includes an acquisition unit 1101, an image generation unit 1102, and a recognition unit 1103, and may optionally further include a determination unit 1104.
The acquisition unit 1101 is configured to acquire an input image, the input image including the target to be recognized. The acquisition unit 1101 may perform step 101 of the method embodiment shown in FIG. 3A and step 201 of the method embodiment shown in FIG. 3B.
The image generation unit 1102 is configured to generate a first feature image according to the distinguishing region of the input image, the first feature image being a feature image in which the distinguishing region of the input image is occluded, and the distinguishing region of the input image being a subset of the regions of the input image that can indicate the category to which the target belongs. The image generation unit 1102 may perform step 102 of the method embodiment shown in FIG. 3A and steps 203 to 204 of the method embodiment shown in FIG. 3B.
The recognition unit 1103 is configured to recognize the target according to the first feature image. The recognition unit 1103 may perform step 103 of the method embodiment shown in FIG. 3A.
As a possible implementation, the image generation unit 1102 may also obtain a second feature image without occluding the distinguishing region of the input image; that is, the second feature image is an image in which the distinguishing region of the input image is not occluded. The image generation unit 1102 may perform steps 205 to 206 of the method embodiment shown in FIG. 3B.
When the recognition unit 1103 recognizes the target according to the first feature image, it may consider both the first feature image and the second feature image and recognize the target according to both. The recognition unit 1103 may perform steps 207 to 211 of the method embodiment shown in FIG. 3B.
As a possible implementation, after the image generation unit 1102 generates the first feature image and the second feature image, the determination unit 1104 may further determine the distinguishing region according to the spatial features of the input image.
As a possible implementation, when determining the distinguishing region according to the spatial features of the input image, the determination unit 1104 may assign scores to the spatial features of the input image and take the region whose spatial-feature scores are greater than a threshold as the distinguishing region; a region whose scores fall within a preset range may also be taken as the distinguishing region. The determination unit 1104 may perform step 202 of the method embodiment shown in FIG. 3B and the method embodiment shown in FIG. 4.
As a possible implementation, when generating the first feature image according to the distinguishing region of the input image, the image generation unit 1102 may assign first coefficient values to the pixels of the input image; the first coefficient values of all pixels form the first coefficient map. There are many ways to assign the first coefficient values; for example, the pixels of the input image belonging to the distinguishing region may be assigned a smaller first value and the remaining pixels a larger second value, where the first value is less than the second value, and the map formed by the first coefficient values of all pixels is the first coefficient map. After obtaining the first coefficient map, the image generation unit 1102 applies it to the input image to generate the first feature image.
As a possible implementation, when generating the second feature image according to the distinguishing region of the input image, the image generation unit 1102 may assign second coefficient values to the pixels of the input image; the second coefficient values of all pixels form the second coefficient map. There are many ways to assign the second coefficient values. For example, the image generation unit 1102 may set the second coefficient value of a pixel belonging to the distinguishing region to the score of that pixel's spatial feature; for another example, it may assign the pixels belonging to the distinguishing region a larger third value and the remaining pixels a smaller fourth value, where the third value is greater than the fourth value. After obtaining the second coefficient map, it is applied to the input image to generate the second feature image.
As a possible implementation, when recognizing the target according to the first feature image and the second feature image, the recognition unit 1103 may first aggregate the first feature image and the second feature image in the channel dimension to generate a third feature image; then, based on the third feature image, determine multiple candidate feature images of the same size, where each candidate feature image has a different receptive field; fuse the multiple candidate feature images into a fourth feature image; and use the fourth feature image for target recognition.
As a possible implementation, when aggregating the first feature image and the second feature image in the channel dimension to generate the third feature image, the recognition unit 1103 may aggregate the two feature images in the channel dimension and reduce the dimensionality to generate an aggregated image, and then configure channel-wise weights for the aggregated image to generate the third feature image. The configured weights satisfy the following condition: when the distinguishing region is occluded in the input image, the channel weight of the part of the aggregated image belonging to the first feature image is greater than that of the part belonging to the second feature image; when the distinguishing region is not occluded in the input image, the channel weight of the part belonging to the first feature image is smaller than that of the part belonging to the second feature image.
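One way to realize this, sketched below under stated assumptions: a 1x1 convolution performs the channel aggregation and dimension reduction, and a squeeze-and-excitation-style gate produces the channel weights. The gate design is an assumption; the condition on how the two parts are weighted is a property the weights acquire in training, not something this sketch enforces structurally.

```python
import torch.nn as nn

class ChannelAggregation(nn.Module):
    """Sketch: concatenate the two feature images on the channel dimension,
    reduce dimensionality with a 1x1 convolution, then re-weight channels
    with a learned gate (squeeze-and-excitation style, assumed design)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        aggregated = self.reduce(torch.cat([f1, f2], dim=1))  # aggregate + reduce
        return aggregated * self.gate(aggregated)             # third feature image
```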
As a possible implementation, when determining the multiple candidate feature images of the same size based on the third feature image, the recognition unit 1103 may apply multiple different convolution kernels to the third feature image and obtain the multiple candidate feature images through dilated separable convolution.
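A sketch of one such branch set follows. The dilation rates (1, 2, 3) and the channel count are assumptions; what matters is that padding equal to the dilation keeps every candidate the same spatial size while the receptive fields differ.

```python
def candidate_branch(channels: int, dilation: int) -> nn.Module:
    """One candidate: a 3x3 depthwise convolution whose dilation enlarges the
    receptive field, followed by a 1x1 pointwise convolution (together forming
    the separable convolution)."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=dilation,
                  dilation=dilation, groups=channels),
        nn.Conv2d(channels, channels, kernel_size=1),
    )

# Dilation rates 1, 2, 3 (assumed values) give receptive fields of 3, 5 and 7.
third_feature_image = torch.randn(1, 64, 56, 56)  # placeholder input
candidates = [candidate_branch(64, d)(third_feature_image) for d in (1, 2, 3)]
```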
As a possible implementation, when fusing the multiple candidate feature images into the fourth feature image, the recognition unit 1103 may configure a corresponding weight for each candidate feature image, and then obtain the fourth feature image based on each candidate feature image and its corresponding weight.
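A sketch of one weighting scheme: a scalar weight per candidate computed from its global average descriptor and normalized with a softmax, then a weighted sum. The scoring and normalization choices are assumptions; the embodiment only states that each candidate is weighted and the weighted candidates are fused.

```python
class CandidateFusion(nn.Module):
    """Sketch: per-candidate scalar weights that sum to 1, then a weighted sum."""
    def __init__(self, channels: int, num_candidates: int):
        super().__init__()
        self.scorers = nn.ModuleList(
            nn.Linear(channels, 1) for _ in range(num_candidates))

    def forward(self, candidates: list) -> torch.Tensor:
        # One scalar score per candidate from its global average descriptor.
        scores = torch.stack(
            [s(c.mean(dim=(2, 3))) for s, c in zip(self.scorers, candidates)])
        weights = torch.softmax(scores, dim=0)           # (K, N, 1), sums to 1 over K
        stacked = torch.stack(candidates)                # (K, N, C, H, W)
        w = weights.view(len(candidates), -1, 1, 1, 1)   # broadcast over C, H, W
        return (stacked * w).sum(dim=0)                  # fourth feature image

fourth_feature_image = CandidateFusion(64, 3)(candidates)
```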
It should be noted that the division into units in the embodiments of the present application is illustrative and is merely a division by logical function; other division manners may be used in actual implementation. The functional units in the embodiments of the present application may be integrated into one processing unit, may exist physically as separate units, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive (SSD).
In a simple embodiment, those skilled in the art will appreciate that the target recognition apparatus in the embodiments shown in FIG. 3A and FIG. 3B may take the form shown in FIG. 12.
The apparatus 1200 shown in FIG. 12 includes at least one processor 1201 and a memory 1202, and optionally may further include a communication interface 1203.
The memory 1202 may be a volatile memory such as a random access memory; the memory may also be a non-volatile memory such as a read-only memory, a flash memory, a hard disk drive (HDD), or a solid-state drive; alternatively, the memory 1202 may be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 1202 may also be a combination of the above memories.
The specific connection medium between the processor 1201 and the memory 1202 is not limited in the embodiments of the present application.
The processor 1201 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, an artificial intelligence chip, a system-on-chip, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The apparatus has a data transceiving function and can communicate with other devices; in the apparatus shown in FIG. 12, an independent data transceiving module, such as the communication interface 1203, may also be provided for sending and receiving data. When communicating with other devices, the processor 1201 may transmit data through the communication interface 1203, for example, to acquire the input image.
When the target recognition apparatus takes the form shown in FIG. 12, the processor 1201 in FIG. 12 may invoke the computer-executable instructions stored in the memory 1202, so that the target recognition apparatus can perform the method performed by the target recognition apparatus in any of the above method embodiments.
Specifically, the functions/implementation processes of the acquisition unit, the image generation unit, the recognition unit, and the determining unit in FIG. 11 may all be implemented by the processor 1201 in FIG. 12 invoking the computer-executable instructions stored in the memory 1202. Alternatively, the functions/implementation processes of the image generation unit, the recognition unit, and the determining unit in FIG. 11 may be implemented by the processor 1201 in FIG. 12 invoking the computer-executable instructions stored in the memory 1202, while the functions/implementation processes of the acquisition unit and the sending unit in FIG. 11 may be implemented through the communication interface 1203 in FIG. 12.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its scope. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include these modifications and variations.

Claims (24)

  1. A target recognition method, characterized in that the method comprises:
    acquiring an input image, the input image including a target to be recognized;
    generating a first feature image according to a distinguishing region of the input image, the first feature image being a feature image in which the distinguishing region of the input image is occluded, and the distinguishing region of the input image being a subset of regions in the input image that can indicate the category to which the target belongs;
    recognizing the target according to the first feature image.
  2. The method according to claim 1, characterized in that the method further comprises:
    generating a second feature image according to the distinguishing region of the input image, the second feature image being a feature image in which the distinguishing region of the input image is not occluded;
    wherein the recognizing the target according to the first feature image comprises:
    recognizing the target according to the first feature image and the second feature image.
  3. The method according to claim 1 or 2, characterized in that the method further comprises:
    determining the distinguishing region according to spatial features of the input image.
  4. The method according to claim 3, characterized in that the determining the distinguishing region according to the spatial features of the input image comprises:
    configuring scores for the spatial features of the input image;
    taking a region in which the scores of the spatial features are greater than a threshold as the distinguishing region.
  5. The method according to any one of claims 1 to 4, characterized in that the generating the first feature image according to the distinguishing region of the input image comprises:
    configuring first coefficient values of pixels belonging to the distinguishing region in the input image as a first value, and first coefficient values of the remaining pixels as a second value, wherein the first value is smaller than the second value, and a map formed by the first coefficient values of the pixels is a first coefficient map;
    applying the first coefficient map to the input image to generate the first feature image.
  6. The method according to any one of claims 2 to 5, characterized in that the generating the second feature image according to the distinguishing region of the input image comprises:
    configuring second coefficient values of pixels belonging to the distinguishing region in the input image as the scores of the spatial features of the pixels, to generate a second coefficient map;
    applying the second coefficient map to the input image to generate the second feature image.
  7. The method according to any one of claims 2 to 5, characterized in that the generating the second feature image according to the distinguishing region of the input image comprises:
    configuring second coefficient values of pixels belonging to the distinguishing region in the input image as a third value, and second coefficient values of the remaining pixels as a fourth value, wherein the third value is greater than the fourth value, and a map formed by the second coefficient values of the pixels is a second coefficient map;
    applying the second coefficient map to the input image to generate the second feature image.
  8. The method according to any one of claims 2 to 5, characterized in that the recognizing the target according to the first feature image and the second feature image comprises:
    aggregating the first feature image and the second feature image in the channel dimension to generate a third feature image;
    determining, based on the third feature image, a plurality of candidate feature images, wherein the candidate feature images are of the same size and each candidate feature image has a different receptive field;
    fusing the plurality of candidate feature images into a fourth feature image;
    recognizing the target according to the fourth feature image.
  9. The method according to claim 8, characterized in that the aggregating the first feature image and the second feature image in the channel dimension to generate the third feature image comprises:
    aggregating the first feature image and the second feature image in the channel dimension and reducing the dimensionality to generate an aggregated image;
    configuring weights for the aggregated image in the channel dimension to generate the third feature image, wherein the configured weights satisfy the following condition: when the distinguishing region is occluded in the input image, the channel weight of the part of the aggregated image belonging to the first feature image is greater than the channel weight of the part of the aggregated image belonging to the second feature image; or, when the distinguishing region is not occluded in the input image, the channel weight of the part of the aggregated image belonging to the first feature image is smaller than the channel weight of the part of the aggregated image belonging to the second feature image.
  10. The method according to claim 8 or 9, characterized in that the determining, based on the third feature image, a plurality of candidate feature images comprises:
    applying a plurality of different convolution kernels to the third feature image, and obtaining the plurality of candidate feature images through dilated separable convolution.
  11. The method according to any one of claims 8 to 10, characterized in that the fusing the plurality of candidate feature images into the fourth feature image comprises:
    obtaining the fourth feature image based on each candidate feature image and a weight corresponding to each candidate feature image.
  12. A target recognition apparatus, characterized in that the apparatus comprises:
    an acquisition unit, configured to acquire an input image, the input image including a target to be recognized;
    an image generation unit, configured to generate a first feature image according to a distinguishing region of the input image, the first feature image being a feature image in which the distinguishing region of the input image is occluded, and the distinguishing region of the input image being a subset of regions in the input image that can indicate the category to which the target belongs;
    a recognition unit, configured to recognize the target according to the first feature image.
  13. The apparatus according to claim 12, characterized in that the image generation unit is further configured to:
    generate a second feature image according to the distinguishing region of the input image, the second feature image being a feature image in which the distinguishing region of the input image is not occluded;
    and the recognition unit, when recognizing the target according to the first feature image, is specifically configured to:
    recognize the target according to the first feature image and the second feature image.
  14. The apparatus according to claim 12 or 13, characterized in that the apparatus further comprises a determining unit, the determining unit being configured to:
    determine the distinguishing region according to spatial features of the input image.
  15. The apparatus according to claim 14, characterized in that the determining unit, when determining the distinguishing region according to the spatial features of the input image, is specifically configured to:
    configure scores for the spatial features of the input image;
    take a region in which the scores of the spatial features are greater than a threshold as the distinguishing region.
  16. The apparatus according to any one of claims 12 to 15, characterized in that the image generation unit, when generating the first feature image according to the distinguishing region of the input image, is specifically configured to:
    configure first coefficient values of pixels belonging to the distinguishing region in the input image as a first value, and first coefficient values of the remaining pixels as a second value, wherein the first value is smaller than the second value, and a map formed by the first coefficient values of the pixels is a first coefficient map;
    apply the first coefficient map to the input image to generate the first feature image.
  17. The apparatus according to any one of claims 13 to 16, characterized in that the image generation unit, when generating the second feature image according to the distinguishing region of the input image, is specifically configured to:
    configure second coefficient values of pixels belonging to the distinguishing region in the input image as the scores of the spatial features of the pixels, to generate a second coefficient map;
    apply the second coefficient map to the input image to generate the second feature image.
  18. The apparatus according to any one of claims 13 to 16, characterized in that the image generation unit, when generating the second feature image according to the distinguishing region of the input image, is specifically configured to:
    configure second coefficient values of pixels belonging to the distinguishing region in the input image as a third value, and second coefficient values of the remaining pixels as a fourth value, wherein the third value is greater than the fourth value, and a map formed by the second coefficient values of the pixels is a second coefficient map;
    apply the second coefficient map to the input image to generate the second feature image.
  19. The apparatus according to any one of claims 13 to 16, characterized in that the recognition unit, when recognizing the target according to the first feature image and the second feature image, is specifically configured to:
    aggregate the first feature image and the second feature image in the channel dimension to generate a third feature image;
    determine, based on the third feature image, a plurality of candidate feature images, wherein the candidate feature images are of the same size and each candidate feature image has a different receptive field;
    fuse the plurality of candidate feature images into a fourth feature image;
    recognize the target according to the fourth feature image.
  20. The apparatus according to claim 19, characterized in that the recognition unit, when aggregating the first feature image and the second feature image in the channel dimension to generate the third feature image, is specifically configured to:
    aggregate the first feature image and the second feature image in the channel dimension and reduce the dimensionality to generate an aggregated image;
    configure weights for the aggregated image in the channel dimension to generate the third feature image, wherein the configured weights satisfy the following condition: when the distinguishing region is occluded in the input image, the channel weight of the part of the aggregated image belonging to the first feature image is greater than the channel weight of the part of the aggregated image belonging to the second feature image; or, when the distinguishing region is not occluded in the input image, the channel weight of the part of the aggregated image belonging to the first feature image is smaller than the channel weight of the part of the aggregated image belonging to the second feature image.
  21. The apparatus according to claim 19 or 20, characterized in that the recognition unit, when determining the plurality of candidate feature images based on the third feature image, is specifically configured to:
    apply a plurality of different convolution kernels to the third feature image, and obtain the plurality of candidate feature images through dilated separable convolution.
  22. The apparatus according to any one of claims 19 to 21, characterized in that the recognition unit, when fusing the plurality of candidate feature images into the fourth feature image, is specifically configured to:
    obtain the fourth feature image based on each candidate feature image and a weight corresponding to each candidate feature image.
  23. An apparatus, characterized by comprising a memory and a processor, wherein the memory stores program instructions, and the processor runs the program instructions to perform the method according to any one of claims 1 to 11.
  24. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 11.
PCT/CN2021/121680 2020-10-14 2021-09-29 Target recognition method and device WO2022078216A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011097641 2020-10-14
CN202011097641.5 2020-10-14
CN202011479454.3 2020-12-15
CN202011479454.3A CN114429561A (en) 2020-10-14 2020-12-15 Target identification method and device

Publications (1)

Publication Number Publication Date
WO2022078216A1 true WO2022078216A1 (en) 2022-04-21

Family

ID=81207403

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/121680 WO2022078216A1 (en) 2020-10-14 2021-09-29 Target recognition method and device

Country Status (1)

Country Link
WO (1) WO2022078216A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181593A1 (en) * 2016-12-28 2018-06-28 Shutterstock, Inc. Identification of a salient portion of an image
CN107122701A (en) * 2017-03-03 2017-09-01 华南理工大学 A kind of traffic route sign based on saliency and deep learning
CN109784255A (en) * 2019-01-07 2019-05-21 深圳市商汤科技有限公司 Neural network training method and device and recognition methods and device
CN110348355A (en) * 2019-07-02 2019-10-18 南京信息工程大学 Model recognizing method based on intensified learning

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229518A (en) * 2023-03-17 2023-06-06 百鸟数据科技(北京)有限责任公司 Bird species observation method and system based on machine learning
CN116229518B (en) * 2023-03-17 2024-01-16 百鸟数据科技(北京)有限责任公司 Bird species observation method and system based on machine learning
CN116071709A (en) * 2023-03-31 2023-05-05 南京信息工程大学 Crowd counting method, system and storage medium based on improved VGG16 network
CN116071709B (en) * 2023-03-31 2023-06-16 南京信息工程大学 Crowd counting method, system and storage medium based on improved VGG16 network
CN116524368A (en) * 2023-04-14 2023-08-01 北京卫星信息工程研究所 Remote sensing image target detection method
CN116524368B (en) * 2023-04-14 2023-12-19 北京卫星信息工程研究所 Remote sensing image target detection method
CN117611877A (en) * 2023-10-30 2024-02-27 西安电子科技大学 LS-YOLO network-based remote sensing image landslide detection method
CN117611877B (en) * 2023-10-30 2024-05-14 西安电子科技大学 LS-YOLO network-based remote sensing image landslide detection method

Similar Documents

Publication Publication Date Title
WO2022078216A1 (en) Target recognition method and device
US10740654B2 (en) Failure detection for a neural network object tracker
US10318848B2 (en) Methods for object localization and image classification
US10410096B2 (en) Context-based priors for object detection in images
US11663502B2 (en) Information processing apparatus and rule generation method
US20160321540A1 (en) Filter specificity as training criterion for neural networks
KR20180037192A (en) Detection of unknown classes and initialization of classifiers for unknown classes
US20160321542A1 (en) Incorporating top-down information in deep neural networks via the bias term
CN111709285A (en) Epidemic situation protection monitoring method and device based on unmanned aerial vehicle and storage medium
US11443514B2 (en) Recognizing minutes-long activities in videos
JP2016219004A (en) Multi-object tracking using generic object proposals
JP2016062610A (en) Feature model creation method and feature model creation device
US20170031934A1 (en) Media label propagation in an ad hoc network
Sharma et al. Vehicle identification using modified region based convolution network for intelligent transportation system
CN115280373A (en) Managing occlusions in twin network tracking using structured dropping
JP2020010319A (en) Irreversible data compressor for vehicle control system
CN111951260A (en) Partial feature fusion based convolutional neural network real-time target counting system and method
CN116486392A (en) License plate face intelligent recognition method and system based on FPGA
TWI728655B (en) Convolutional neural network detection method and system for animals
CN114429561A (en) Target identification method and device
EP4058940A1 (en) Permutation invariant convolution (pic) for recognizing long-range activities
Shetty et al. Animal Detection and Classification in Image & Video Frames Using YOLOv5 and YOLOv8
KR102127855B1 (en) Feature selection method with maximum repeatability
KM et al. A Review on Deep Learning Based Helmet Detection
CN112818943A (en) Lane line detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21879262

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21879262

Country of ref document: EP

Kind code of ref document: A1