CN114821554A - Image recognition method, electronic device, and storage medium - Google Patents

Image recognition method, electronic device, and storage medium

Info

Publication number
CN114821554A
CN114821554A
Authority
CN
China
Prior art keywords
image
target
network
image recognition
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210346579.1A
Other languages
Chinese (zh)
Inventor
蔡占川
姜志宏
叶奔
吕沛伦
张雨晗
刘家正
兰霆
白丽萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Macau University of Science and Technology
Original Assignee
Macau University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Macau University of Science and Technology filed Critical Macau University of Science and Technology
Priority to CN202210346579.1A priority Critical patent/CN114821554A/en
Publication of CN114821554A publication Critical patent/CN114821554A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of computers and provides an image recognition method, an electronic device, and a storage medium. The image recognition method comprises the following steps: acquiring a multi-target image, inputting the multi-target image into an image recognition model, and obtaining the recognition result of the target to be recognized output by the image recognition model. The image recognition model is trained on single-target images and comprises a feature extraction network, a region generation network, and a detection network. The feature extraction network is used for obtaining a first image pyramid comprising three levels of feature maps and performing feature fusion on the three levels of feature maps; the region generation network is used for determining the interest region of the fused feature map; the detection network is used for outputting the recognition result of the target to be recognized according to the interest region. Because feature fusion is performed on an image pyramid comprising three levels of feature maps, the interest regions of the fused feature map are extracted, and image recognition is carried out according to the interest regions, the recognition accuracy for the target to be recognized can be improved.

Description

Image recognition method, electronic device, and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image recognition method, an electronic device, and a storage medium.
Background
In the field of image recognition, an image recognition model for recognizing one or more targets in an image can be obtained by constructing training samples and using them to train a classification model. For example, if the images in the training samples are single-target images, each including a single target, the trained model is a single-target image recognition model. If the images in the training samples are multi-target images, each including a plurality of targets, the trained model is a multi-target image recognition model. For a multi-target image to be identified, a multi-target image recognition model may be employed to identify the multiple targets in the image. However, in the training samples used to train a multi-target image recognition model, each image must include a plurality of targets; for domains with many categories, such as traditional Chinese medicine, the workload of constructing such training samples is large, the cost is high, and building a multi-target image recognition model is correspondingly very difficult. If a single-target image recognition model is instead used to recognize the multiple targets in a multi-target image, the accuracy is low.
Disclosure of Invention
In view of this, the embodiment of the present application provides an image recognition method, which can solve the problem of low accuracy when a single-target image recognition model recognizes a multi-target image.
A first aspect of an embodiment of the present application provides an image recognition method, including:
acquiring a multi-target image, wherein the multi-target image comprises at least two targets to be identified;
inputting the multi-target image into an image recognition model to obtain a recognition result of the target to be recognized, which is output by the image recognition model; the image recognition model is obtained by training a classification model by taking a single target image as a training sample; the image recognition model comprises a feature extraction network, a region generation network and a detection network; the feature extraction network is used for performing convolution on the multi-target image to obtain a first image pyramid comprising three levels of feature maps, and performing feature fusion on the three levels of feature maps to obtain a fused feature map; the region generation network is used for determining the interest region of the fused feature map; and the detection network is used for outputting the identification result of the target to be identified according to the interest area.
In a possible implementation manner, the convolving the multi-target image to obtain a first image pyramid comprising three levels of feature maps includes:
convolving the multi-target image by adopting a ResNet50 network to obtain a second image pyramid;
and respectively taking the second-level feature map, the third-level feature map and the fourth-level feature map of the second image pyramid as the first-level feature map, the second-level feature map and the third-level feature map of the first image pyramid.
In a possible implementation manner, the performing feature fusion on the three-level feature map to obtain a fused feature map includes:
performing up-sampling on the Nth-level feature map by adopting a deconvolution method to obtain an up-sampled image, wherein N is 2 and 3;
and performing feature fusion on the up-sampling image and the N-1 level feature map to obtain a fused feature map.
In a possible implementation manner, the image recognition model further includes an interest region pooling network, where the interest region pooling network is configured to adjust the size of the interest region to a preset size, so as to obtain a size-adjusted interest region; correspondingly, the detection network is used for outputting the recognition result of the target to be recognized according to the interest area after the size is adjusted.
In a possible implementation manner, the image recognition model further includes a full convolution network, where the full convolution network is configured to perform convolution processing on the interest region to obtain a convolution-processed interest region; correspondingly, the interest area pooling network is configured to adjust the size of the interest area after the convolution processing to the preset size, so as to obtain the interest area after the size adjustment.
In a possible implementation manner, the outputting the recognition result of the target to be recognized according to the region of interest includes:
mapping each interest region into a corresponding feature vector;
and outputting the confidence score corresponding to each feature vector and the position information of the target to be recognized in the multi-target image, and taking the confidence score and the position information as the recognition result of the target to be recognized.
In one possible implementation, the acquiring the multi-target image includes:
acquiring an original image;
performing image noise reduction on the original image by adopting a median filter to obtain a noise-reduced image;
adopting a Canny algorithm to carry out edge extraction on the denoised image to obtain boundary coordinates;
and obtaining the multi-target image according to the boundary coordinates.
In a possible implementation manner, the training sample includes single-target images and annotation information corresponding to each single-target image, where the annotation information is obtained by inputting the single-target images into an image annotation model, and the image annotation model is obtained by training an initial model by using an image with the annotation information.
A second aspect of an embodiment of the present application provides an image recognition apparatus, including:
the system comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring a multi-target image which comprises at least two targets to be recognized;
the recognition module is used for inputting the multi-target image into an image recognition model to obtain a recognition result of the target to be recognized, which is output by the image recognition model; the image recognition model is obtained by training a classification model by taking a single target image as a training sample; the image recognition model comprises a feature extraction network, a region generation network and a detection network; the feature extraction network is used for performing convolution on the multi-target image to obtain a first image pyramid comprising three levels of feature maps, and performing feature fusion on the three levels of feature maps to obtain a fused feature map; the region generation network is used for determining the interest region of the fused feature map; and the detection network is used for outputting the identification result of the target to be identified according to the interest area.
A third aspect of embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the image recognition method according to the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the image recognition method according to the first aspect.
A fifth aspect of embodiments of the present application provides a computer program product, which, when run on an electronic device, causes the electronic device to execute the image recognition method of any one of the first aspects.
Compared with the prior art, the embodiments of the present application have the following beneficial effects: single-target images are taken as training samples to train the classification model and obtain an image recognition model, namely a single-target image recognition model; the multi-target image is input into the image recognition model, and the recognition result of the target to be recognized output by the image recognition model is obtained. The image recognition model comprises a feature extraction network, a region generation network, and a detection network; the feature extraction network is used for performing convolution on the multi-target image to obtain a first image pyramid comprising three levels of feature maps, and performing feature fusion on the three levels of feature maps to obtain a fused feature map; the region generation network is used for determining the interest region of the fused feature map; the detection network is used for outputting the recognition result of the target to be recognized according to the interest region. Because feature fusion is performed on the first image pyramid comprising three levels of feature maps, the fused feature map retains the texture detail information of the targets, the probability of identifying a plurality of targets as one target is reduced, and the recognition accuracy for the target to be recognized can be improved. Meanwhile, because the interest regions differ across feature maps, extracting the interest regions of the fused feature map and then inputting them into the detection network for image recognition reduces the influence of interference information and further improves the recognition accuracy for the multi-target image.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below.
Fig. 1 is a schematic flow chart illustrating an implementation of an image recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a multi-target image provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a data set provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of edge extraction provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of an image recognition model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of determining a region of interest provided by an embodiment of the present application;
fig. 7 is a schematic diagram of a target recognition result corresponding to each feature extraction network provided in the embodiment of the present application;
FIG. 8 is a schematic diagram of an image recognition apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
When the single-target image recognition model is adopted to recognize the multi-target image, the problem of low accuracy exists. Therefore, the embodiment of the application provides an image identification method, wherein the image identification model is a single-target image identification model, a feature extraction network in the image identification model is used for constructing a first image pyramid comprising three-level feature maps according to a multi-target image, and then performing feature fusion on the first image pyramid, so that the fused feature maps retain texture detail information of targets, the probability of identifying a plurality of targets as one target is reduced, and the identification accuracy of the target to be identified is improved. And then extracting the interest region of the fused characteristic graph by using the region generation network, inputting the interest region into the detection network, and obtaining the identification result of the target to be identified output by the detection network, thereby further improving the identification accuracy of the multi-target image.
The following is an exemplary description of the image recognition method provided in the present application.
Referring to fig. 1, an image recognition method according to an embodiment of the present application includes:
s101: and acquiring a multi-target image, wherein the multi-target image comprises at least two targets to be identified.
As shown in fig. 2, the multi-target image may be an image of traditional Chinese medicine that includes a plurality of traditional Chinese medicine samples.
S102: inputting the multi-target image into an image recognition model to obtain a recognition result of the target to be recognized, which is output by the image recognition model; the image recognition model is obtained by training a classification model by taking a single target image as a training sample; the image recognition model comprises a feature extraction network, a region generation network and a detection network; the feature extraction network is used for performing convolution on the multi-target image to obtain a first image pyramid comprising three levels of feature maps, and performing feature fusion on the three levels of feature maps to obtain a fused feature map; the region generation network is used for determining the interest region of the fused feature map; and the detection network is used for outputting the identification result of the target to be identified according to the interest area.
Specifically, a data set comprising single-target images is constructed in advance, a label is set for each single-target image in the data set, a training sample is generated according to the single-target images and the corresponding labels, and a machine learning algorithm is adopted to train the classification model by using the training sample to obtain the image recognition model.
Taking the single-target image as an image of a traditional Chinese medicine sample as an example, images of traditional Chinese medicine samples are first captured under a preset illumination environment with preset shooting parameters so as to improve image quality. The captured images form a data set. Illustratively, as shown in fig. 3, the data set includes images of 14 categories of traditional Chinese medicine, namely paris polyphylla, American ginseng, uncaria rhynchophylla, thunberg corydalis, cordyceps sinensis (cultivated and wild), scutellaria baicalensis (scutellaria baicalensis and scutellarin), forsythia suspensa (fructus forsythiae), and fritillaria thunbergii (fritillary bulb and pinecone); each category includes a plurality of images, and the data set contains 19038 images in total. Within each category, one part of the images is used as a training set for training the classification model, and another part is used as a validation set for verifying the accuracy of the trained image recognition model.
In one embodiment, after the data set is obtained, a graying process is first performed on each image in the data set to obtain a grayscale image. To reduce image noise, noise suppression is then performed on the grayscale image. In one possible implementation, a median filter may be used for the noise suppression, which reduces salt-and-pepper noise in the image. Afterwards, edge extraction is performed on the denoised image using the Canny algorithm. In the Canny algorithm, if the gradient magnitude of a pixel is greater than a preset high threshold, the pixel is set to 255; if the gradient magnitude of a pixel is less than a preset low threshold, the pixel is set to 0; if the gradient magnitude of a pixel lies between the two thresholds and one or more pixels already set to 255 exist in its eight-neighborhood, the pixel is also set to 255. In this way, the Canny algorithm can perform edge extraction on images of different categories with the same set of parameters, improving the accuracy of edge detection. After edge extraction, the area of the traditional Chinese medicine sample in each image is determined according to the extracted edges. For example, as shown in fig. 4, the minimum rectangle covering the edges is determined; its four vertices are the boundary coordinates of the traditional Chinese medicine sample, and the image of the area where the sample is located, that is, the single-target image, can be cropped from the image according to these boundary coordinates.
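For illustration only, a minimal OpenCV sketch of this preprocessing pipeline might look as follows; the median-filter kernel size, the Canny thresholds, and the function name are assumptions of the sketch, not values fixed by the patent:

    import cv2
    import numpy as np

    def extract_single_target(image_path):
        """Grayscale -> median filter -> Canny -> minimum bounding rectangle crop.

        Kernel size (5) and thresholds (50, 150) are illustrative assumptions.
        """
        image = cv2.imread(image_path)
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # Median filtering suppresses salt-and-pepper noise.
        denoised = cv2.medianBlur(gray, 5)
        # Canny with low/high hysteresis thresholds (assumed values).
        edges = cv2.Canny(denoised, 50, 150)
        # Boundary coordinates: the minimum rectangle covering all edge pixels.
        ys, xs = np.nonzero(edges)
        x0, x1 = xs.min(), xs.max()
        y0, y1 = ys.min(), ys.max()
        # Crop the area where the sample is located: the single-target image.
        return image[y0:y1 + 1, x0:x1 + 1]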
In one embodiment, after a single-target image is obtained, it is input into an image annotation model to obtain its annotation information, and a training sample is generated from the single-target image and the corresponding annotation information. The image annotation model is obtained by training an initial model with images that already carry annotation information, which improves both the accuracy of the annotation information and the annotation efficiency.
After the training samples are obtained, the classification model is trained to optimize its parameters until the optimal parameters are reached, and the image recognition model is obtained from these optimal parameters. The forward computation during training is the same as the image recognition process of the trained model, so the structure of the image recognition model is described below using the image recognition process as an example.
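For orientation only, a skeletal PyTorch training loop of the kind implied here might look as follows; the optimizer choice, learning rate, and epoch count are assumptions, and "model" stands for a recognition network built from the components described below, assumed to return its training loss:

    import torch

    def train(model, loader, epochs=12, lr=1e-3):
        """Skeletal loop: optimize model parameters on (image, target) samples.

        Optimizer and schedule are illustrative assumptions, not the patent's.
        """
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        model.train()
        for _ in range(epochs):
            for images, targets in loader:
                loss = model(images, targets)  # assumed: model returns its loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model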
As shown in fig. 5, the image recognition model includes a feature extraction network, an area generation network, and a detection network.
The feature extraction network is used for performing convolution on a multi-target image (input image) to obtain a first image pyramid comprising three levels of feature maps, and performing feature fusion on the three levels of feature maps to obtain a fused feature map.
Specifically, each level of the image pyramid is a feature map obtained by convolving the input image; the levels differ in resolution, which decreases sequentially from the bottom of the pyramid to the top.
In an embodiment, the feature extraction network includes a ResNet50 network, which convolves the multi-target image to obtain a second image pyramid; the number of levels of the second image pyramid may be set according to actual requirements, for example, five. The second-level, third-level, and fourth-level feature maps of the second image pyramid are respectively taken as the first-level, second-level, and third-level feature maps of the first image pyramid, where the first-level feature map is at the lowest layer of the pyramid and the third-level feature map at the uppermost layer. In this way, feature maps with more texture information are selected for feature fusion, the important feature information is kept in the fused feature map, and the probability of identifying a plurality of targets as one target is reduced.
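As a hedged illustration, the three-level first image pyramid could be taken from the intermediate stage outputs of a torchvision ResNet-50 as sketched below; which ResNet stages correspond to the patent's second- to fourth-level feature maps is an assumption of this sketch:

    import torch
    from torchvision.models import resnet50
    from torchvision.models.feature_extraction import create_feature_extractor

    # Stage outputs layer1/layer2/layer3 serve as the three levels of the
    # first image pyramid (lowest to highest; resolution halves per level).
    backbone = create_feature_extractor(
        resnet50(weights=None),
        return_nodes={"layer1": "p1", "layer2": "p2", "layer3": "p3"},
    )

    image = torch.randn(1, 3, 512, 512)  # a dummy multi-target image
    pyramid = backbone(image)
    # p1: 1/4 resolution, p2: 1/8, p3: 1/16 of the input.
    for name, fmap in pyramid.items():
        print(name, tuple(fmap.shape))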
In other embodiments, three convolution operations with different numbers of layers may be performed on the input image to obtain the first image pyramid.
After the first image pyramid is obtained, each upper-level feature map is up-sampled, and the resulting image is fused with the adjacent lower-level feature map to obtain a fused feature map.
In an embodiment, after the first image pyramid is obtained, the Nth-level feature map is up-sampled by a factor of 2 using a deconvolution method to obtain an up-sampled image, where N is 2 and 3; the up-sampling factor is determined by the ratio of the resolution of the Nth-level feature map to that of the (N-1)th-level feature map. The up-sampled image is then fused with the (N-1)th-level feature map. That is, the third-level feature map is up-sampled by deconvolution and the result is fused with the second-level feature map; the second-level feature map is likewise up-sampled by deconvolution and the result is fused with the first-level feature map, yielding the fused feature map. Illustratively, the up-sampling operation may be implemented as a 3 × 3 deconvolution (transposed convolution) with a stride of 2. Up-sampling by deconvolution further extracts the effective features in the image and reduces the influence of interference information.
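A minimal PyTorch sketch of this deconvolution up-sampling and fusion step is given below. The 3 × 3 kernel with stride 2 follows the description above; the 1 × 1 lateral convolutions and element-wise addition as the fusion operation are assumptions of the sketch:

    import torch
    import torch.nn as nn

    class DeconvFusion(nn.Module):
        """Fuse level N into level N-1 via 2x transposed-convolution up-sampling.

        The 1x1 lateral convolution and addition are assumptions; the 3x3
        kernel with stride 2 follows the description in the text.
        """
        def __init__(self, top_channels, lateral_channels, out_channels=256):
            super().__init__()
            # 3x3 deconvolution with stride 2: doubles the spatial resolution.
            self.upsample = nn.ConvTranspose2d(
                top_channels, out_channels,
                kernel_size=3, stride=2, padding=1, output_padding=1)
            self.lateral = nn.Conv2d(lateral_channels, out_channels, kernel_size=1)

        def forward(self, top, lateral):
            return self.upsample(top) + self.lateral(lateral)

    # Fuse P3 (1024 ch) into P2 (512 ch), then the result into P1 (256 ch);
    # the channel counts match the ResNet-50 stage outputs sketched above.
    p1 = torch.randn(1, 256, 128, 128)
    p2 = torch.randn(1, 512, 64, 64)
    p3 = torch.randn(1, 1024, 32, 32)
    fuse32 = DeconvFusion(1024, 512)
    fuse21 = DeconvFusion(256, 256)
    fused = fuse21(fuse32(p3, p2), p1)  # final fused feature map, 128x128
    print(fused.shape)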
In other embodiments, a linear interpolation method, a bilinear interpolation method, or a bicubic interpolation method may also be adopted to up-sample the Nth-level feature map, and the obtained up-sampled image is fused with the (N-1)th-level feature map to obtain the fused feature map.
The region generation network is used for determining the interest region of the fused feature map. Specifically, a plurality of anchor frames are generated at each pixel of the fused feature map; illustratively, as shown in fig. 6, three sets of anchor frames having the same shape but different sizes are generated at each pixel. After the anchor frames are generated, a classification layer scores each anchor frame, and a regression layer adjusts the shape and position of each anchor frame. A preset number of anchor frames are then screened out according to the adjusted anchor frames and their corresponding scores; the regions where the screened anchor frames are located are the interest regions.
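As an illustrative sketch only (the scoring-head width, three anchors per pixel, and the number of retained proposals are assumptions), the scoring and screening of anchor frames could be organized as follows; decoding the retained indices and regression outputs into final boxes is omitted for brevity:

    import torch
    import torch.nn as nn

    class SimpleRPN(nn.Module):
        """Score anchors on the fused map and keep the top-k as interest regions.

        Three anchors per pixel, objectness scores, and box adjustments follow
        the usual region-proposal pattern; exact values are assumptions.
        """
        def __init__(self, in_channels=256, num_anchors=3, top_k=100):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, 256, 3, padding=1)
            self.cls = nn.Conv2d(256, num_anchors, 1)      # score per anchor
            self.reg = nn.Conv2d(256, num_anchors * 4, 1)  # shape/position deltas
            self.top_k = top_k

        def forward(self, fmap):
            h = torch.relu(self.conv(fmap))
            scores = self.cls(h).flatten(1)   # (B, anchors * H * W)
            deltas = self.reg(h)              # adjustments for every anchor
            # Keep the highest-scoring anchors; their locations are the
            # interest regions (box decoding omitted in this sketch).
            top = scores.topk(min(self.top_k, scores.shape[1]), dim=1)
            return top.indices, deltas

    fused = torch.randn(1, 256, 128, 128)
    roi_indices, deltas = SimpleRPN()(fused)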
The detection network is used for outputting the identification result of the target to be identified according to the interest area. Specifically, the position of the target to be recognized and the category of the target to be recognized are determined according to the image characteristics of the interest region.
In one embodiment, the detection network maps each interest region into a corresponding feature vector, and outputs the confidence score corresponding to each feature vector together with the position information of the target to be recognized in the multi-target image; the confidence score and the position information serve as the recognition result of the target to be recognized. Each confidence score corresponds to a category, so the category of the target to be recognized can be determined from the confidence score, and the position information may be the bounding box of the target to be recognized, the coordinates of the vertices of the bounding box, and the like, so that category information for a plurality of targets can be obtained simultaneously. Illustratively, the detection network is a Fast R-CNN detection head: the interest regions are input into the detection head, each interest region is mapped into a feature vector through several fully connected layers, and the feature vector is then fed into a classification layer and a bounding-box regression layer to obtain, respectively, the confidence score and the bounding box of the target to be recognized.
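A hedged PyTorch sketch of such a detection head is shown below; the hidden-layer width, the 7 × 7 input size, and the class count (14, matching the data set above) are assumptions of the sketch:

    import torch
    import torch.nn as nn

    class FastRCNNHead(nn.Module):
        """Map each resized interest region to a feature vector, then to a
        per-class confidence score and a bounding-box estimate.

        Hidden width (1024), class count, and 7x7 input are assumptions.
        """
        def __init__(self, in_channels=256, roi_size=7, num_classes=14):
            super().__init__()
            flat = in_channels * roi_size * roi_size
            self.fc = nn.Sequential(
                nn.Flatten(), nn.Linear(flat, 1024), nn.ReLU(),
                nn.Linear(1024, 1024), nn.ReLU())
            self.cls_score = nn.Linear(1024, num_classes + 1)  # +1 background
            self.bbox_pred = nn.Linear(1024, 4 * num_classes)

        def forward(self, rois):                 # rois: (R, C, 7, 7)
            vec = self.fc(rois)                  # one feature vector per region
            scores = self.cls_score(vec).softmax(dim=1)  # confidence scores
            boxes = self.bbox_pred(vec)          # position information
            return scores, boxes

    head = FastRCNNHead()
    scores, boxes = head(torch.randn(8, 256, 7, 7))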
In an embodiment, the image recognition model further includes an interest region pooling network, which adjusts the size of each interest region to a preset size to obtain a size-adjusted interest region; correspondingly, the detection network outputs the recognition result of the target to be recognized according to the size-adjusted interest region. Specifically, the interest region pooling network divides each interest region into a grid of sub-regions matching the preset output size and performs one max-pooling operation on each sub-region to obtain the size-adjusted interest region, so that the images input into the detection network have a consistent size while as many features as possible are retained, which improves the accuracy of subsequent target recognition.
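For illustration, the torchvision roi_pool operator implements this grid-and-max-pool resizing; the 7 × 7 preset size and the example coordinates below are assumptions of the sketch, not values fixed by the patent:

    import torch
    from torchvision.ops import roi_pool

    fused = torch.randn(1, 256, 128, 128)
    # Each row: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
    rois = torch.tensor([[0, 10., 10., 60., 50.],
                         [0, 30., 20., 90., 90.]])
    # Divide every interest region into a 7x7 grid and max-pool each cell,
    # so all regions reach the detection network at the same preset size.
    pooled = roi_pool(fused, rois, output_size=(7, 7))
    print(pooled.shape)  # torch.Size([2, 256, 7, 7])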
In an embodiment, the image recognition model further includes a full convolution network, and the full convolution network is configured to perform convolution processing on the region of interest to obtain the convolution-processed region of interest, so as to further extract image features of the region of interest and reduce the influence of interference factors. Correspondingly, the interest area pooling network is configured to adjust the size of the interest area after the convolution processing to the preset size, so as to obtain the interest area after the size adjustment.
In one embodiment, the multi-target image is obtained by processing the original image. Specifically, after an original image is obtained, a median filter is adopted to perform image noise reduction on the original image to obtain a noise-reduced image, a Canny algorithm is adopted to perform edge extraction on the noise-reduced image to obtain boundary coordinates, and a multi-target image is obtained according to the boundary coordinates, so that effective characteristic information in the multi-target image can be determined, the influence of interference factors is reduced, and the calculation efficiency is improved.
In the above embodiments, because feature fusion is performed on the first image pyramid comprising three levels of feature maps, the fused feature map retains the texture detail information of the targets, the probability of identifying a plurality of targets as one target is reduced, and the recognition accuracy for the target to be recognized is improved. Because the interest regions differ across feature maps, extracting the interest regions of the fused feature map before inputting them into the detection network for image recognition further improves the recognition accuracy for the multi-target image.
Under the same conditions, fused feature maps were obtained with different feature extraction networks, and the target recognition result for the targets to be recognized in the multi-target image was determined from each fused feature map. The recognition results corresponding to the feature extraction networks are shown in fig. 7: fig. 7(a) shows the original image, fig. 7(b) the recognition result of the Faster R-CNN network, fig. 7(c) the result of the network combining a three-level image pyramid with nearest-neighbor interpolation, fig. 7(d) the result of the network combining a three-level image pyramid with bicubic-interpolation up-sampling, and fig. 7(e) the result of the network combining a three-level image pyramid with deconvolution up-sampling. It can be seen that when the network combining the three-level feature map pyramid with deconvolution up-sampling is applied in the image recognition model for multi-target recognition, the accuracy of target recognition is higher.
TABLE 1

    Feature extraction network                                      Accuracy
    Faster R-CNN                                                    29.21%
    Four-level image pyramid + nearest-neighbor interpolation       42.01%
    Three-level image pyramid + nearest-neighbor interpolation      43.84%
    Two-level image pyramid + nearest-neighbor interpolation        40.04%
    Three-level image pyramid + bicubic-interpolation up-sampling   45.57%
    Three-level image pyramid + deconvolution up-sampling           50.05%
The image recognition models with the different feature extraction networks were tested multiple times, giving the statistical results shown in table 1. The accuracy of the Faster R-CNN network is 29.21%; of the network combining a four-level image pyramid with nearest-neighbor interpolation, 42.01%; of the network combining a three-level image pyramid with nearest-neighbor interpolation, 43.84%; of the network combining a two-level image pyramid with nearest-neighbor interpolation, 40.04%; of the network combining a three-level image pyramid with bicubic-interpolation up-sampling, 45.57%; and of the network combining a three-level image pyramid with deconvolution up-sampling, 50.05%. It can be seen that when the network combining the three-level feature map pyramid with deconvolution up-sampling is applied in the image recognition model for multi-target recognition, the accuracy of target recognition is higher.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 8 shows a block diagram of an image recognition apparatus provided in an embodiment of the present application, corresponding to the image recognition method described in the above embodiment, and only the relevant parts of the embodiment of the present application are shown for convenience of description.
As shown in fig. 8, the image recognition apparatus includes,
an obtaining module 81, configured to obtain a multi-target image, where the multi-target image includes at least two targets to be identified;
the recognition module 82 is configured to input the multi-target image into an image recognition model, so as to obtain a recognition result of the target to be recognized, which is output by the image recognition model; the image recognition model is obtained by training a classification model by taking a single target image as a training sample; the image recognition model comprises a feature extraction network, a region generation network and a detection network; the feature extraction network is used for performing convolution on the multi-target image to obtain a first image pyramid comprising three levels of feature maps, and performing feature fusion on the three levels of feature maps to obtain a fused feature map; the region generation network is used for determining the interest region of the fused feature map; the detection network is used for outputting the recognition result of the target to be recognized according to the interest area.
In one possible implementation, the identification module 82 is specifically configured to:
convolving the multi-target image by adopting a ResNet50 network to obtain a second image pyramid;
and respectively taking the second-level feature map, the third-level feature map and the fourth-level feature map of the second image pyramid as the first-level feature map, the second-level feature map and the third-level feature map of the first image pyramid.
In one possible implementation, the identification module 82 is specifically configured to:
performing up-sampling on the Nth-level feature map by adopting a deconvolution method to obtain an up-sampled image, wherein N is 2 and 3;
and performing feature fusion on the up-sampling image and the N-1 level feature map to obtain a fused feature map.
In a possible implementation manner, the image recognition model further includes an interest region pooling network, where the interest region pooling network is configured to adjust the size of the interest region to a preset size, so as to obtain a size-adjusted interest region; correspondingly, the detection network is used for outputting the recognition result of the target to be recognized according to the interest area after the size is adjusted.
In a possible implementation manner, the image recognition model further includes a full convolution network, where the full convolution network is configured to perform convolution processing on the interest region to obtain a convolution-processed interest region; correspondingly, the interest area pooling network is configured to adjust the size of the interest area after the convolution processing to the preset size, so as to obtain the interest area after the size adjustment.
In one possible implementation, the identification module 82 is specifically configured to:
mapping each interest region into a corresponding feature vector;
and outputting the confidence score corresponding to each feature vector and the position information of the target to be recognized in the multi-target image, and taking the confidence score and the position information as the recognition result of the target to be recognized.
In a possible implementation manner, the obtaining module 81 is specifically configured to:
acquiring an original image;
performing image noise reduction on the original image by adopting a median filter to obtain a noise-reduced image;
adopting a Canny algorithm to carry out edge extraction on the denoised image to obtain boundary coordinates;
and obtaining the multi-target image according to the boundary coordinates.
In a possible implementation manner, the training sample includes single-target images and annotation information corresponding to each single-target image, where the annotation information is obtained by inputting the single-target images into an image annotation model, and the image annotation model is obtained by training an initial model by using an image with the annotation information.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
Fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing device.
As shown in fig. 9, the electronic apparatus of this embodiment includes: a processor 91, a memory 92 and a computer program 93 stored in said memory 92 and executable on said processor 91. The processor 91, when executing the computer program 93, implements the steps in the above-described embodiment of the image recognition method, such as the steps S101 to S102 shown in fig. 1. Alternatively, the processor 91 executes the computer program 93 to realize the functions of the modules/units in the device embodiments, such as the functions of the acquiring module 81 to the identifying module 82 shown in fig. 8.
Illustratively, the computer program 93 may be divided into one or more modules/units, which are stored in the memory 92 and executed by the processor 91 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 93 in the electronic device.
Those skilled in the art will appreciate that fig. 9 is merely an example of an electronic device and is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or different components, e.g., the electronic device may also include input-output devices, network access devices, buses, etc.
The Processor 91 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 92 may be an internal storage unit of the electronic device, such as a hard disk or a memory of the electronic device. The memory 92 may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device. Further, the memory 92 may also include both an internal storage unit and an external storage device of the electronic device. The memory 92 is used for storing the computer program and other programs and data required by the electronic device. The memory 92 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other ways. For example, the above-described apparatus/electronic device embodiments are merely illustrative, and for example, the division of the modules or units is only one type of logical function division, and other division manners may exist in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, a software distribution medium, and the like.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. An image recognition method, comprising:
acquiring a multi-target image, wherein the multi-target image comprises at least two targets to be identified;
inputting the multi-target image into an image recognition model to obtain a recognition result of the target to be recognized, which is output by the image recognition model; the image recognition model is obtained by training a classification model by taking a single target image as a training sample; the image recognition model comprises a feature extraction network, a region generation network and a detection network; the feature extraction network is used for performing convolution on the multi-target image to obtain a first image pyramid comprising three levels of feature maps, and performing feature fusion on the three levels of feature maps to obtain a fused feature map; the region generation network is used for determining the interest region of the fused feature map; and the detection network is used for outputting the identification result of the target to be identified according to the interest area.
2. The image recognition method of claim 1, wherein the convolving the multi-target image to obtain a first image pyramid comprising three levels of feature maps comprises:
convolving the multi-target image by adopting a ResNet50 network to obtain a second image pyramid;
and respectively taking the second-level feature map, the third-level feature map and the fourth-level feature map of the second image pyramid as the first-level feature map, the second-level feature map and the third-level feature map of the first image pyramid.
3. The image recognition method according to claim 1, wherein the performing feature fusion on the three-level feature maps to obtain a fused feature map comprises:
performing up-sampling on the Nth-level feature map by adopting a deconvolution method to obtain an up-sampled image, wherein N is 2 and 3;
and performing feature fusion on the up-sampling image and the N-1 level feature map to obtain a fused feature map.
4. The image recognition method according to claim 1, wherein the image recognition model further comprises a region of interest pooling network, and the region of interest pooling network is configured to adjust the size of the region of interest to a preset size, so as to obtain a resized region of interest; correspondingly, the detection network is used for outputting the recognition result of the target to be recognized according to the interest area after the size is adjusted.
5. The image recognition method according to claim 4, wherein the image recognition model further comprises a full convolution network, and the full convolution network is configured to perform convolution processing on the region of interest to obtain a convolution-processed region of interest; correspondingly, the interest area pooling network is configured to adjust the size of the interest area subjected to the convolution processing to the preset size, so as to obtain the interest area with the adjusted size.
6. The image recognition method according to claim 1, wherein the outputting the recognition result of the target to be recognized according to the region of interest comprises:
mapping each interest region into a corresponding feature vector;
and outputting the confidence score corresponding to each feature vector and the position information of the target to be recognized in the multi-target image, and taking the confidence score and the position information as the recognition result of the target to be recognized.
7. The image recognition method of claim 1, wherein the acquiring the multiple target images comprises:
acquiring an original image;
performing image noise reduction on the original image by adopting a median filter to obtain a noise-reduced image;
adopting a Canny algorithm to carry out edge extraction on the denoised image to obtain boundary coordinates;
and obtaining the multi-target image according to the boundary coordinates.
8. The image recognition method of claim 1, wherein the training sample comprises single-target images and annotation information corresponding to each single-target image, the annotation information is obtained by inputting the single-target images into an image annotation model, and the image annotation model is obtained by training an initial model by using images with the annotation information.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the image recognition method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out an image recognition method according to any one of claims 1 to 8.
CN202210346579.1A 2022-04-02 2022-04-02 Image recognition method, electronic device, and storage medium Pending CN114821554A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210346579.1A CN114821554A (en) 2022-04-02 2022-04-02 Image recognition method, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210346579.1A CN114821554A (en) 2022-04-02 2022-04-02 Image recognition method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN114821554A true CN114821554A (en) 2022-07-29

Family

ID=82533174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210346579.1A Pending CN114821554A (en) 2022-04-02 2022-04-02 Image recognition method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN114821554A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375997A (en) * 2022-08-23 2022-11-22 黑龙江工程学院 Sea surface target detection method, sea surface target detection device and terminal equipment
CN115375997B (en) * 2022-08-23 2023-10-31 黑龙江工程学院 Sea surface target detection method, target detection device and terminal equipment
CN117058473A (en) * 2023-10-12 2023-11-14 深圳易行机器人有限公司 Warehouse material management method and system based on image recognition
CN117058473B (en) * 2023-10-12 2024-01-16 深圳易行机器人有限公司 Warehouse material management method and system based on image recognition

Similar Documents

Publication Publication Date Title
CN110060237B (en) Fault detection method, device, equipment and system
CN111080660B (en) Image segmentation method, device, terminal equipment and storage medium
CN107358260B (en) Multispectral image classification method based on surface wave CNN
CN112686812B (en) Bank card inclination correction detection method and device, readable storage medium and terminal
CN110991533B (en) Image recognition method, recognition device, terminal device and readable storage medium
CN107480620B (en) Remote sensing image automatic target identification method based on heterogeneous feature fusion
CN114821554A (en) Image recognition method, electronic device, and storage medium
CN109409384A (en) Image-recognizing method, device, medium and equipment based on fine granularity image
CN110472521B (en) Pupil positioning calibration method and system
Ensafi et al. Accurate HEp-2 cell classification based on sparse coding of superpixels
CN109977899B (en) Training, reasoning and new variety adding method and system for article identification
CN110570442A (en) Contour detection method under complex background, terminal device and storage medium
CN110738216A (en) Medicine identification method based on improved SURF algorithm
CN110070545B (en) Method for automatically extracting urban built-up area by urban texture feature density
CN113449784A (en) Image multi-classification method, device, equipment and medium based on prior attribute map
Shihavuddin et al. Automated classification and thematic mapping of bacterial mats in the north sea
CN111027545A (en) Card picture mark detection method and device, computer equipment and storage medium
CN106599891A (en) Remote sensing image region-of-interest rapid extraction method based on scale phase spectrum saliency
CN111161348B (en) Object pose estimation method, device and equipment based on monocular camera
CN108960246B (en) Binarization processing device and method for image recognition
CN110647889B (en) Medical image recognition method, medical image recognition apparatus, terminal device, and medium
CN111199228B (en) License plate positioning method and device
Mani et al. Design of a novel shape signature by farthest point angle for object recognition
US10115195B2 (en) Method and apparatus for processing block to be processed of urine sediment image
CN108960285B (en) Classification model generation method, tongue image classification method and tongue image classification device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination