CN113052159B - Image recognition method, device, equipment and computer storage medium - Google Patents

Image recognition method, device, equipment and computer storage medium

Info

Publication number
CN113052159B
CN113052159B (Application CN202110400954.1A)
Authority
CN
China
Prior art keywords
image
sample
identified
determining
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110400954.1A
Other languages
Chinese (zh)
Other versions
CN113052159A (en)
Inventor
林东青
马军
陈涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Shanxi Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Shanxi Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Shanxi Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202110400954.1A priority Critical patent/CN113052159B/en
Publication of CN113052159A publication Critical patent/CN113052159A/en
Application granted granted Critical
Publication of CN113052159B publication Critical patent/CN113052159B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides an image recognition method, an image recognition device, image recognition equipment and a computer storage medium, relates to the field of image detection, and aims to improve the accuracy of image recognition. The method comprises the following steps: acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified; inputting the image to be identified into a first network in a pre-trained image recognition model, and determining text features of the image to be identified; inputting the image to be identified into a second network in the image recognition model, and determining a pooling feature map and spatial relationship features of the at least one object to be identified; performing feature fusion on the text features of the image to be identified, the pooling feature map of the at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified; and inputting the shared feature image into a third network in the image recognition model, and determining recognition information of the image to be identified, wherein the recognition information comprises category information and position information of each object to be identified.

Description

Image recognition method, device, equipment and computer storage medium
Technical Field
The present application relates to the field of image detection, and in particular, to an image recognition method, apparatus, device, and computer storage medium.
Background
Identifying target objects in images is one of the important research directions in the field of computer vision, and plays an important role in fields such as public safety, road traffic and video monitoring. In the prior art, a target object can be identified by using the spatial relationship features of the target object in the image, and the recognition accuracy of a neural network for the target object can be improved by reasonably weighting the image features within the network.
However, due to the complexity and diversity of the scenes contained in images and the uncertainty of the position of the target to be detected, these prior-art methods cannot adapt to a wider range of scenes, and the accuracy of image recognition therefore cannot be improved.
Disclosure of Invention
The embodiment of the application provides an image recognition method, an image recognition device, image recognition equipment and a computer storage medium, which are used for improving the accuracy of image recognition.
In a first aspect, an embodiment of the present application provides an image recognition method, including:
acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified;
Inputting an image to be identified into a first network in a pre-trained image identification model, and determining text characteristics of the image to be identified;
inputting an image to be identified into a second network in the image identification model, and determining a pooling feature map and spatial relationship features of at least one object to be identified;
carrying out feature fusion on text features of the image to be identified, a pooling feature map of at least one object to be identified and spatial relationship features, and determining a shared feature image corresponding to the image to be identified;
And inputting the shared characteristic image into a third network in the image recognition model, and determining recognition information of the image to be recognized, wherein the recognition information comprises category information and position information of each object to be recognized.
In a second aspect, an embodiment of the present application provides an image recognition apparatus, including:
the first acquisition module is used for acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified;
The first determining module is used for inputting the image to be identified into a first network in the pre-trained image identification model and determining the text characteristics of the image to be identified;
the second determining module is used for inputting the image to be identified into a second network in the image identification model and determining a pooling feature map and spatial relationship features of at least one object to be identified;
The fusion module is used for carrying out feature fusion on the text features of the image to be identified, the pooled feature map of at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified;
And the identification module is used for inputting the shared characteristic image into the third network in the image identification model, and determining identification information of the image to be identified, wherein the identification information comprises category information and position information of each object to be identified.
In a third aspect, an embodiment of the present application provides an image recognition apparatus, including:
a processor and a memory storing computer program instructions; the processor reads and executes the computer program instructions to implement the image recognition method as provided in the first aspect of the embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides a computer storage medium having stored thereon computer program instructions which, when executed by a processor, implement an image recognition method as provided in the first aspect of the embodiment of the present application.
According to the image recognition method provided by the embodiment of the application, the text features of the image to be identified, together with the pooling feature map and the spatial relationship features of at least one object to be identified in the image, are extracted; the three kinds of features are fused, the fused shared feature map is input into a third network in the image recognition model, and the recognition information of the image to be identified is determined, the recognition information comprising the category information and the position information of each object to be identified. Compared with the prior art, feature fusion makes the image information complementary, which overcomes the deficiencies of individual image feature information in detail and scene while avoiding redundant noise; meanwhile, the extracted text features can reflect the differences and commonalities of images in different scenes, so the method can be applied to more complex scenes and improves the accuracy of image recognition.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present application, the drawings that are needed to be used in the embodiments of the present application will be briefly described, and it is possible for a person skilled in the art to obtain other drawings according to these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a training method of an image recognition model according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a multi-modal feature fusion module according to an embodiment of the present application;
fig. 3 is a schematic flow chart of an image recognition method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an image recognition device according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application.
Detailed Description
Features and exemplary embodiments of various aspects of the present application will be described in detail below, and in order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings and the detailed embodiments. It should be understood that the particular embodiments described herein are meant to be illustrative of the application only and not limiting. It will be apparent to one skilled in the art that the present application may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the application by showing examples of the application.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The image recognition algorithm is one of important research directions in the field of computer vision, and has important effects in the fields of public safety, road traffic, video monitoring and the like. In recent years, image recognition is continuously improved in terms of accuracy due to the development of image recognition algorithms based on deep learning.
In the prior art, image recognition is performed in two ways:
1. multi-view image target detection method based on visual saliency
For scenes in which the foreground target is not occluded, saliency maps are computed for images from multiple viewing angles, the saliency maps of the side viewing angles are projected onto the middle target viewing angle by using the spatial relationship between the viewing angles, and the projected saliency maps are fused with the saliency map of the middle viewing angle to obtain a fused saliency map. Regions occluded by the foreground object cannot be truly mapped onto the target viewing angle during projection, so projection holes appear around the foreground object in the projected saliency map, and the projection hole regions are treated as background regions in the fused saliency map. The image is partitioned by means of the multi-view projection holes: the regions between the projection holes and the image edge, and the regions between the projection holes of different foreground objects, are all treated as background regions. In the fused saliency map, the saliency values of the background regions obtained in this way are set to zero, and after binarization a target with clear edges and no background interference is obtained.
2. Small target detection algorithm under complex background
By means of the idea of the feature pyramid algorithm, the features of the Conv4-3 layer are fused with the features of the Conv7 layer and the Conv3-3 layer, and the number of default boxes corresponding to each position of the fused feature map is increased. A squeeze-and-excitation network (SENet) is added to the network structure to assign weights to the feature channels of each layer, boosting useful feature weights and suppressing invalid feature weights. At the same time, in order to enhance the generalization ability of the network, a series of augmentation operations is performed on the training data set.
Both algorithms are common techniques for detecting and identifying target objects in images. However, due to the complexity and diversity of the scenes contained in images and the uncertainty of the position of the target to be detected, conventional target detection methods have poor robustness across different application scenarios. The multi-view image target detection method based on visual saliency only considers the spatial relationship features of the target to be detected in the image, and does not make full use of the various kinds of feature information in the image to supplement the information and thereby improve the accuracy of the final image recognition. The small target detection algorithm under a complex background does not consider the spatial relationship between the context information in the complex background and the target to be detected; its application range is narrow, it mainly improves the detection and identification accuracy of small targets in the image, and it neglects the application of the algorithm to more complex scenes.
Based on the above, the embodiment of the application provides an image recognition method, which realizes complementation of image information through feature fusion, overcomes the defects of image feature information on details and scenes while avoiding redundant noise, and simultaneously extracts text features, which can reflect differences and commonalities of images in different scenes, is suitable for more complex scenes, and improves the accuracy of image recognition.
In the image recognition method provided by the embodiment of the application, the image is required to be recognized by using the pre-trained image recognition model, so that the image recognition model needs to be trained before the image recognition is performed by using the image recognition model. Accordingly, a specific implementation of the training method for an image recognition model according to the embodiment of the present application will be described below with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present application provides a training method for an image recognition model, which includes first obtaining a sample image, fusing information such as a pooled feature map, text features, and spatial relationship features extracted from the sample image to form a shared feature map with more abundant information, and performing iterative training on a preset image recognition model through classification and regression detection algorithms until a training stop condition is satisfied. The method can be realized by the following steps:
1. And obtaining a plurality of images to be annotated.
In some embodiments, the plurality of images to be annotated can be captured by a vehicle-mounted camera, or frames can be extracted from an acquired video to obtain the plurality of images to be annotated.
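As an illustrative, non-limiting sketch (not part of the original disclosure), the frame extraction mentioned above could be implemented with OpenCV roughly as follows; the sampling interval, file names and output directory are assumptions made only for illustration.

```python
import cv2

def extract_frames(video_path: str, out_dir: str, every_n: int = 30) -> int:
    """Save every n-th frame of an in-vehicle video as an image to be annotated."""
    cap = cv2.VideoCapture(video_path)
    saved = 0
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video
            break
        if idx % every_n == 0:
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```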
2. And manually annotating the plurality of images to be annotated, where the content to be annotated is the label identification information of the target objects; the label identification information comprises the classification information and the position information of each target object, the position information being the coordinate values of the bounding box surrounding the target object.
In some embodiments, the images captured by the vehicle-mounted camera mainly take road traffic as the scene, so the annotation objects in the images to be annotated may include target objects such as pedestrians, riders, bicycles, motorcycles, automobiles, trucks, buses, trains, traffic signs and traffic lights, and the annotation result is the category of each target object and the coordinate values of the bounding box surrounding it; at the same time, each image to be annotated is given text annotations from the three perspectives of time, place and weather.
Specifically, for each image to be annotated, from the time perspective the selectable values include daytime, dusk/dawn and night; from the place perspective the selectable values include highway, city street, residential area, parking lot, gas station and tunnel; and from the weather perspective the selectable values include snowy, overcast, sunny, cloudy, rainy and foggy.
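A hypothetical annotation record combining the two kinds of labels described above (object-level boxes with categories, and image-level scene text for time, place and weather) might look as follows; the field names are illustrative only and are not defined by the patent.

```python
annotation = {
    "image": "frame_000120.jpg",
    "objects": [
        # bounding boxes as [x1, y1, x2, y2] pixel coordinates
        {"category": "car", "bbox": [412, 230, 655, 398]},
        {"category": "traffic light", "bbox": [108, 45, 130, 92]},
    ],
    # image-level text annotation used later to build the text features
    "scene": {"time": "daytime", "place": "city street", "weather": "sunny"},
}
```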
3. And integrating the manually annotated images and the annotation information corresponding to each image into a training sample set, wherein the training sample set comprises a plurality of sample image groups.
It should be noted that the image recognition model needs multiple rounds of iterative training to adjust the loss function value until it meets the training stop condition and a trained image recognition model is obtained; if only one sample image were input in each round of iterative training, the sample size would be too small to support the training adjustment of the model. Therefore, the training sample set is divided into a plurality of sample image groups, each sample image group contains a plurality of sample images, and the image recognition model is iteratively trained by using the plurality of sample image groups in the training sample set.
4. And training an image recognition model by using the sample image groups in the training sample set until the training stop condition is met, so as to obtain a trained image recognition model. The method specifically comprises the following steps:
And 4.1, extracting a sample pooling feature map and sample spatial relationship features of the identifiable objects in the sample images by using a second network in the preset image recognition model.
In some embodiments, the second network in the preset image recognition model may be a Faster R-CNN (faster region-based convolutional neural network), which is not limited in this application.
Specifically, the acquisition of the sample pooling feature map and the sample spatial relationship feature of the identifiable object in the sample image can be realized by the following steps:
And 4.1.1, uniformly adjusting the sample images in the training set to a fixed size of 1000 × 600 pixels to obtain resized sample images.
And 4.1.2, inputting the resized sample image group into a depth residual network ResNet, a region proposal network RPN and a fast region convolutional neural network to extract image features and obtain the pooled feature maps.
1) Firstly, inputting the resized sample image into the 7 × 7 × 64 convolution layer conv1, and then sequentially extracting the original feature map of the sample image through the convolution layers conv2_x, conv3_x, conv4_x, conv5_x and a fully connected layer fc;
2) Inputting the original feature map output by conv4_x in the ResNet network structure into the region extraction network RPN, and selecting the 300 anchor frames (anchors) with the highest scores in the prediction results together with the candidate frames corresponding to these anchor frames;
3) According to the original feature map output by conv4_x, inputting the position maps of the 300 candidate frames into the region-of-interest pooling layer ROI Pooling in the fast region convolutional neural network, to obtain fixed-size pooled feature maps of the identifiable objects.
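A minimal PyTorch/torchvision sketch of steps 1) to 3) is given below, assuming a ResNet-50 backbone (layer3 playing the role of conv4_x), an external RPN that already supplies the 300 top-scoring proposals, and a 7 × 7 output size for the region-of-interest pooling; these concrete choices are assumptions for illustration, not requirements of the patent.

```python
import torch
import torchvision
from torchvision.ops import roi_pool

backbone = torchvision.models.resnet50(weights=None)
# conv1 .. conv4_x of the ResNet backbone (layer3 corresponds to conv4_x)
conv4_extractor = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3,
)

def pooled_features(image: torch.Tensor, proposals: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, 600, 1000) resized sample image; proposals: (300, 4) boxes in image coordinates."""
    feature_map = conv4_extractor(image)               # original feature map from conv4_x
    scale = feature_map.shape[-1] / image.shape[-1]    # map image coordinates onto the feature map
    return roi_pool(feature_map, [proposals], output_size=(7, 7), spatial_scale=scale)
```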
4.1.3 Calculating the intersection over union (IoU) between the candidate boxes using the coordinates of the 300 anchors and their corresponding candidate boxes, and calculating the spatial relationship features between the identifiable objects by the following Equation 1:
F_r = f(w, h, area, d_x, d_y, IoU)    (Equation 1)
where w and h represent the width and height of a candidate box, area represents the area of the candidate box, d_x and d_y are the lateral and longitudinal distances between the geometric centers of two candidate boxes, IoU is the intersection over union between the candidate boxes, f(·) represents the activation function, and F_r represents the predicted spatial relationship features between the identifiable objects.
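The quantities entering Equation 1 can be computed per pair of candidate boxes as in the following sketch; the activation f(·) itself is left abstract here because its exact form is not fixed by the description above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def spatial_terms(a, b):
    """Inputs of Equation 1 for one pair of candidate boxes: (w, h, area, d_x, d_y, IoU)."""
    w, h = a[2] - a[0], a[3] - a[1]
    area = w * h
    dx = abs((a[0] + a[2]) / 2.0 - (b[0] + b[2]) / 2.0)   # lateral distance of geometric centers
    dy = abs((a[1] + a[3]) / 2.0 - (b[1] + b[3]) / 2.0)   # longitudinal distance of geometric centers
    return w, h, area, dx, dy, iou(a, b)
```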
And 4.2, inputting the sample image into the first network in the preset image recognition model, determining at least one text vector according to the context information of the sample image, splicing the at least one text vector, and determining the sample text feature F_t corresponding to the sample image.
It should be noted that the first network in the image recognition model may be a pre-trained model such as Word2vec, GloVe or BERT; the text vectors determined according to the context information of the sample image may be word vectors obtained by converting the text annotation information describing the time, place and weather of the sample image, which is not limited herein.
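A sketch of step 4.2 under the assumption of a generic embedding lookup (standing in for Word2vec, GloVe or BERT) is shown below; the function name `embed` and the key names are assumptions for illustration.

```python
import numpy as np

def text_feature(scene: dict, embed) -> np.ndarray:
    """scene: e.g. {"time": "daytime", "place": "city street", "weather": "sunny"};
    embed: any pre-trained lookup returning a fixed-length word vector."""
    vectors = [embed(scene[key]) for key in ("time", "place", "weather")]
    return np.concatenate(vectors)   # spliced (concatenated) sample text feature F_t
```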
And 4.3, constructing a multi-modal feature fusion module, and complementarily fusing the sample text features extracted according to the context information of the sample image, the sample spatial relationship features determined based on the second network of the image recognition model, and the sample pooling feature map, to obtain a sample sharing feature image. The fusion can be calculated by Equation 2 and Equation 3:
F_v = ReLU(F_roi, F_r)    (Equation 2)
F_out = F_v * F_t    (Equation 3)
where F_roi represents the fixed-size feature map output after the region-of-interest pooling layer ROI Pooling, F_v represents the fused visual feature map, and F_out represents the sample sharing feature image obtained after fusing the sample text features, the sample spatial relationship features and the sample pooling feature map.
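The fusion of Equations 2 and 3 could be realized along the lines of the following sketch; concatenating F_roi and F_r before the ReLU and broadcasting F_t in the element-wise product are assumptions, since the description fixes the operators but not the tensor shapes.

```python
import torch
import torch.nn.functional as F

def fuse(f_roi: torch.Tensor, f_r: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """f_roi: pooled feature map per region; f_r: spatial relationship features; f_t: text feature."""
    f_v = F.relu(torch.cat([f_roi.flatten(1), f_r], dim=1))   # Equation 2
    f_out = f_v * f_t                                          # Equation 3 (element-wise product)
    return f_out                                               # sample sharing feature image F_out
```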
And 4.4, inputting the sample sharing characteristic image into a third network in a preset image recognition model, and determining reference recognition information of each recognizable object, wherein the reference recognition information comprises classification information and reference position information of the recognizable object.
And 4.5, performing non-maximum suppression processing on the reference position information of each identifiable object, filtering the reference position information which does not meet the preset requirement, and determining the prediction identification information of each sample image, wherein the prediction identification information comprises the classification information and the prediction position information of all the identifiable objects.
In some embodiments, non-maximum suppression (Non Maximum Suppression, NMS) is performed on the reference position information of each type of identifiable object: the NMS obtains a prediction list sorted by score and iterates over the sorted list, discarding predictions whose IoU with a retained prediction is greater than a predefined threshold; the threshold is set to 0.7, candidate boxes with a high degree of overlap are filtered out, and the position information remaining after suppression is determined as the predicted position information.
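A plain-Python sketch of the per-class non-maximum suppression described above, using the 0.7 IoU threshold from the text, is given below; `iou` is the pairwise function sketched earlier, and the list-based implementation is chosen only for clarity.

```python
def nms(boxes, scores, iou_threshold=0.7):
    """Return the indices of the boxes kept after non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # discard remaining boxes that overlap the kept box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```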
And 4.6, calculating the loss value between the predicted identification information and the labelled identification information, optimizing the image recognition model according to the target loss function shown in Equation 4, reversely updating the network parameters by using a gradient descent algorithm to obtain an updated image recognition model, stopping the optimization training when the loss function value is smaller than a preset value, and determining the trained image recognition model.
L({p_i}, {t_i}) = (1/N_cls) · Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) · Σ_i p_i* · L_reg(t_i, t_i*)    (Equation 4)
where i denotes the index of an anchor, p_i denotes the probability that the i-th anchor is predicted as a target, p_i* is the ground-truth label indicating whether the i-th anchor is a positive sample, λ is a weighting parameter, L_cls is the log loss over the two categories (target and non-target), i.e. the classification loss, t = {t_x, t_y, t_w, t_h} represents the predicted offsets of the anchor in the RPN training phase (of the RoIs in the Fast R-CNN phase), t* represents the actual offsets of the anchor relative to the real label in the RPN training phase (of the RoIs in the Fast R-CNN phase), L_reg represents the regression loss, and N_cls and N_reg are the corresponding normalization terms.
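A sketch of the objective in Equation 4 for one mini-batch of anchors is shown below, assuming binary cross-entropy for L_cls, smooth L1 for L_reg and λ = 1; these concrete choices, and the omission of the N_cls and N_reg normalizations, are assumptions made to keep the sketch short.

```python
import torch
import torch.nn.functional as F

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """p: (N,) predicted target probabilities; p_star: (N,) 0/1 anchor labels;
    t, t_star: (N, 4) predicted and ground-truth box offsets."""
    l_cls = F.binary_cross_entropy(p, p_star.float())            # classification term L_cls
    positive = p_star > 0
    if positive.any():
        l_reg = F.smooth_l1_loss(t[positive], t_star[positive])  # regression term L_reg
    else:
        l_reg = torch.zeros((), device=p.device)
    return l_cls + lam * l_reg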
In order to improve the accuracy of the image recognition model, the image recognition model can be continuously trained by using new training samples in practical application, so that the image recognition model is continuously updated, the accuracy of the image recognition model is improved, and the accuracy of image recognition is further improved.
The above is a specific implementation manner of the image recognition model training method provided in the embodiment of the present application, and the image recognition model obtained through the training may be applied to the image recognition method provided in the following embodiment.
The following describes in detail a specific implementation manner of the image recognition method provided by the present application with reference to fig. 3.
As shown in fig. 3, an embodiment of the present application provides an image recognition method, which includes:
s301, acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified.
In some embodiments, the image to be identified may be captured by a vehicle-mounted camera, or a pre-acquired video may be subjected to frame extraction, so as to determine the image to be identified.
Taking the road traffic scene as an example, the object to be identified in the image to be identified can be a pedestrian, a rider, a bicycle, a motorcycle, an automobile, a truck, a bus, a train, a traffic sign, a traffic light and the like.
S302, inputting the image to be identified into a first network in a pre-trained image identification model, and determining text characteristics of the image to be identified.
In some embodiments, the image to be identified is input to a first network in a pre-trained image identification model, and at least one text vector is determined according to the context information of the image to be identified; and splicing the at least one text vector to determine the text characteristics of the image to be recognized.
It should be noted that the text vectors are word vectors obtained by the first network from the context information of the image to be identified, by converting text annotation information describing its time, place and weather; the text features obtained by splicing several such text vectors can therefore represent the environment information of the image to be identified, and in turn reflect the differences and commonalities of images in different scenes, so as to enhance the distinguishability of the objects to be identified.
S303, inputting the image to be identified into a second network in the image identification model, and determining a pooling feature map and spatial relationship features of at least one object to be identified.
When identifying the object to be identified, a large amount of redundant information exists in the image to be identified, so convolution processing needs to be performed on the image. After the image features are determined through convolution, the extracted image features could be used to train the image recognition model, but the computational cost would be relatively high; pooling is therefore applied to reduce the dimensionality of the image features, reduce the amount of computation and the number of parameters, prevent overfitting, and improve the fault tolerance of the model.
On the other hand, the spatial relationship refers to a relative spatial position and a relative direction relationship between a plurality of target objects segmented from an image, and these relationships may be also classified into a connection relationship, an overlapping/overlapping relationship, and an inclusion/containment relationship. Thus, the extraction of spatial relationship features may enhance the ability to distinguish image content.
In some embodiments, determining the pooled feature map and the spatial relationship feature of at least one of the objects to be identified may be performed by:
1. And adjusting the resolution of each sample image in the sample image group to be a preset resolution, and determining the adjusted sample image group.
In this step, the sample images in the training set can be uniformly adjusted to a fixed size of 1000×600 pixels.
2. And inputting the adjusted sample image group into a depth residual error network, and determining an original image set, wherein images in the original image set correspond to images in the adjusted sample image group one by one.
Specifically, the resized sample image may be input to the convolution layer conv1 of 7×7×64, and then the original feature map of the sample image may be extracted sequentially through the convolution layers conv2_x, conv3_x, conv4_x, conv5_x, and one full connection layer fc.
3. Inputting an original image set into a region extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the region extraction network and surround identifiable objects, and N is an integer greater than 1; and extracting M anchor frames with the confidence coefficient larger than a preset confidence coefficient threshold value from the N anchor frames based on the confidence coefficient of the N anchor frames, wherein M is a positive integer smaller than N.
As an example, the original feature map output by conv4_x in the ResNet network structure may be input into the region extraction network RPN to determine a plurality of anchor frames and their corresponding candidate frames, and the 300 anchor frames with the highest confidence, together with their corresponding candidate frames, are selected based on the confidence of each anchor frame.
4. And inputting the mapping region images of the M anchor frames to a region-of-interest pooling layer of the region convolution neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame.
In this step, the position map of the 300 candidate frame may be input to the region of interest pooling layer in the fast region convolution neural network according to the original feature map output by conv4_x, to obtain a pooled feature map of a fixed size of the identifiable object.
S304, carrying out feature fusion on the text features of the image to be identified, the pooled feature map of at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified.
In the above steps S302 and S303, the text features of the image to be identified, the pooled feature map of the at least one object to be identified, and the spatial relationship features are extracted. Although the spatial relationship features are sensitive to rotation, inversion and scale changes of the image or of the target object in the image, and the pooled feature map can reduce the amount of computation in image recognition, in practical applications the spatial relationship features and/or pooled features alone are insufficient and cannot express scene information effectively and accurately. Therefore, feature fusion needs to be performed on the text features of the image to be identified, the pooled feature map of the at least one object to be identified and the spatial relationship features, making full use of the various kinds of feature information in the image to supplement one another and to reflect the differences and commonalities of images in different scenes, thereby overcoming the deficiencies of the image feature information in detail and scene while avoiding redundant noise.
S305, inputting the shared characteristic image into a third network in the image recognition model, and determining recognition information of the image to be recognized, wherein the recognition information comprises category information and position information of each object to be recognized.
According to the image recognition method provided by the embodiment of the application, the text features of the image to be identified, the pooling feature map of the at least one object to be identified and the spatial relationship features are determined through the image recognition model; the complementary fusion of these multiple kinds of feature information enhances the distinguishability of the objects to be identified in the image, optimizes the final image recognition performance, makes the method suitable for more complex scenes, and improves the accuracy of image recognition.
In order to verify that the image recognition method provided in the above embodiment can improve the accuracy of image recognition compared with the image recognition method in the prior art, the embodiment of the application also provides a test method of image recognition, which tests the image recognition model applied in the image recognition method of the application. Specifically, the method may include the steps of:
1. And inputting the sample image into a trained image recognition model for testing.
Specifically, the average detection precision of all categories of target objects is calculated according to Equation 5 and Equation 6, and the classification and prediction precision of each prediction box are output:
AP = ∫ p(r) dr    (Equation 5)
mAP = (1/n) · Σ_{i=1}^{n} AP_i    (Equation 6)
where n represents the number of target categories to be detected, AP represents the average precision of a single category (the area under its precision-recall curve p(r)), and mAP represents the mean of the average precision over all categories.
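A generic evaluation sketch of Equations 5 and 6 follows, approximating AP by trapezoidal integration of a precision-recall curve and mAP by the mean of the per-class APs; this is an illustration of the metrics, not the exact evaluation protocol used in the experiments.

```python
import numpy as np

def average_precision(recalls, precisions) -> float:
    """Area under the precision-recall curve (Equation 5), trapezoidal rule."""
    order = np.argsort(recalls)
    r, p = np.asarray(recalls)[order], np.asarray(precisions)[order]
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2.0))

def mean_average_precision(per_class_ap) -> float:
    """Mean of the per-class average precisions (Equation 6)."""
    return float(np.mean(per_class_ap))
```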
2. According to the AP and mAP calculation formulas, the detection results are obtained, and the prior-art image recognition algorithm using the Faster R-CNN network is compared with the image recognition model provided by the embodiment of the application, leading to the following conclusion:
The image recognition method provided by the embodiment of the application, when used in a classical image recognition network, brings a remarkable improvement in the image recognition effect; even when the background differences between images are large, the recognition accuracy for target objects in the image is maintained at a stable level, and the recognition effect is better than that of the original algorithm.
Specifically, an embodiment is used below to further describe the method for testing the image recognition model provided by the embodiment of the application by means of the following simulation experiment.
The prior art adopted in the simulation experiment of the application is the faster region convolutional neural network Faster R-CNN; the image recognition model selects a ResNet structure to extract image features, the initial learning rate is set to 0.005, the learning rate decay coefficient is set to 0.1, the number of epochs is set to 15, and SGD is selected as the default optimizer.
1. Simulation conditions: the hardware environment of the simulation is: Intel Core i7-7700 @ 3.60 GHz, 8 GB memory; software environment: Ubuntu.04, Python 3.7, PyCharm 2019.
2. Simulation content and result analysis:
Firstly, the sample image set is used as input, and text feature extraction, spatial relationship feature extraction and pooled feature map acquisition are performed on the basis of the traditional Faster R-CNN algorithm; then, following the basic idea of fusing these three kinds of features for detection, the image recognition model is trained by the method described above, the test sample set is input into the trained improved model, and the average precision of each category and of all categories is evaluated with the AP and mAP indices.
The application uses the BDD100K driving dataset for the experiments. The simulation results are shown in Table 1, which compares the classical Faster R-CNN algorithm and the context-information-based multi-modal feature fusion detection method tested on the same dataset.
Table 1 comparison of the performance of image recognition methods
As can be seen from the experimental results in Table 1, compared with the detection precision of the classical Faster R-CNN algorithm on the test dataset, the image recognition method provided by the embodiment of the application improves the average detection precision of the five categories of targets by nearly 4.3% across tasks in different scenes. Repeated experiments show that the multi-modal feature fusion technique uses the complementarity between different kinds of information to enhance the representation of the input features, can effectively improve the performance of the target detection algorithm, and noticeably improves the average precision for most categories in different image recognition scenes. In real-life scenes, image/video data are difficult to collect and are often incomplete, so traditional target detection methods based only on images and video are not always applicable; the image recognition method provided by the embodiment of the application can enhance the complementarity between different kinds of information, which is of great significance for detection tasks in different scenes.
Based on the same inventive concept of the image recognition method, the embodiment of the application also provides an image recognition device.
As shown in fig. 4, an embodiment of the present application provides an image recognition apparatus, which may include:
A first obtaining module 401, configured to obtain an image to be identified, where at least one object to be identified is in the image to be identified;
A first determining module 402, configured to input an image to be identified into a first network in a pre-trained image identification model, and determine text features of the image to be identified;
a second determining module 403, configured to input the image to be identified into a second network in the image identification model, and determine a pooled feature map and a spatial relationship feature of at least one object to be identified;
The fusion module 404 is configured to perform feature fusion on the text feature of the image to be identified, the pooled feature map of at least one object to be identified, and the spatial relationship feature, and determine a shared feature image corresponding to the image to be identified;
the identifying module 405 is configured to input the shared feature image to the third network in the image identifying model, and determine identifying information of the image to be identified, where the identifying information includes category information and location information of each object to be identified.
In some embodiments, the apparatus may further comprise:
The second acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of sample image groups, each sample image group comprises a sample image and a corresponding label image, label identification information of a target identification object and scene information of the sample image are marked in the label image, and the label identification information comprises category information and position information of the target identification object;
The training module is used for training a preset image recognition model by using the sample image group in the training sample set until the training stopping condition is met, so as to obtain a trained image recognition model.
In some embodiments, the training module may be specifically configured to:
for each sample image group, the following steps are respectively executed:
Inputting a sample image group into a first network in a preset image recognition model, and determining sample text characteristics corresponding to each sample image;
Inputting a sample image group into a second network in a preset image recognition model, and determining a sample pooling feature map and sample space relation features of each recognizable object;
According to the sample text features corresponding to each sample image, the sample pooling feature images of each identifiable object and the sample space relation features, carrying out feature fusion on each sample image, and determining a sample sharing feature image corresponding to each sample image;
Inputting the sample sharing characteristic image into a third network in a preset image recognition model, and determining reference recognition information of each recognizable object, wherein the reference recognition information comprises classification information and reference position information of the recognizable object;
Performing non-maximum suppression processing on the reference position information of each identifiable object, filtering the reference position information which does not meet the preset requirement, and determining the prediction identification information of each sample image, wherein the prediction identification information comprises the classification information and the prediction position information of all the identifiable objects;
determining a loss function value of a preset image recognition model according to the predicted recognition information of the target sample image and the tag recognition information of all target recognition objects on the target sample image, wherein the target sample image is any one of a sample image group;
And under the condition that the loss function value does not meet the training stop condition, adjusting the model parameters of the image recognition model, and training the image recognition model after parameter adjustment by using the sample image group until the loss function value meets the training stop condition, so as to obtain the trained image recognition model.
In some embodiments, the training module may be specifically configured to:
For each sample image, the following steps are performed:
Inputting a sample image into a first network in a preset image recognition model, and determining at least one text vector according to the context information of the sample image;
at least one text vector is stitched to determine sample text features corresponding to the sample images.
In some embodiments, the second network in the pre-set image recognition model comprises at least a depth residual network, a region extraction network and a region convolution neural network,
The training module may be specifically configured to:
adjusting the resolution of each sample image in the sample image group to be a preset resolution, and determining the adjusted sample image group;
inputting the adjusted sample image group into a depth residual error network, and determining an original image set, wherein images in the original image set correspond to images in the adjusted sample image group one by one;
Inputting an original image set into a region extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the region extraction network and surround identifiable objects, and N is an integer greater than 1;
extracting M anchor frames with confidence degrees larger than a preset confidence degree threshold value from the N anchor frames based on the confidence degrees of the N anchor frames, wherein M is a positive integer smaller than N;
Inputting the mapping region images of the M anchor frames into a region-of-interest pooling layer of the region convolution neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame;
and determining the sample space relation characteristic of each identifiable object according to the intersection ratio and the relative position between at least one anchor frame corresponding to each identifiable object.
In some embodiments, the training module may be specifically configured to:
Dividing all the identifiable objects into a plurality of groups based on the classification information of each identifiable object, and determining the reference position information of the identifiable objects of different groups;
filtering the reference position information of each type of identifiable object;
The predicted identification information of each sample image is determined based on the reference position information of the identified object after filtering and the classification information of the identified object after filtering.
In some embodiments, the training module may be specifically configured to:
Calculating the intersection ratio between a target frame and other reference frames in sequence, wherein the target frame is any one of a plurality of reference frames, and the reference frames are boundary frames which are determined in the reference position information and surround the identifiable object;
filtering the reference frames with the cross-over ratio larger than the preset cross-over ratio threshold until the cross-over ratio between any two reference frames is smaller than the preset cross-over ratio threshold;
the reference frame after filtering is determined as predicted position information of the identifiable object.
Other details of the image recognition apparatus according to the embodiment of the present application are similar to those of the image recognition method according to the embodiment of the present application described above in connection with fig. 1, and are not described herein.
Fig. 5 shows a schematic hardware structure of an image recognition device according to an embodiment of the present application.
The image recognition method and apparatus provided according to the embodiments of the present application described in connection with fig. 1 and 4 may be implemented by an image recognition device. Fig. 5 is a schematic diagram showing a hardware configuration 500 of an image recognition apparatus according to an embodiment of the application.
A processor 501 and a memory 502 storing computer program instructions may be included in the image recognition device.
In particular, the processor 501 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits implementing embodiments of the present application.
Memory 502 may include mass storage for data or instructions. By way of example, and not limitation, memory 502 may comprise a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or universal serial bus (USB) drive, or a combination of two or more of these. In one example, memory 502 may include removable or non-removable (or fixed) media, or memory 502 may be a non-volatile solid-state memory. Memory 502 may be internal or external to the integrated gateway disaster recovery device.
In one example, memory 502 may be Read Only Memory (ROM). In one example, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically Erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory, or a combination of two or more of these.
The processor 501 reads and executes the computer program instructions stored in the memory 502 to implement the methods/steps S301 to S305 in the embodiment shown in fig. 3, and achieve the corresponding technical effects achieved by executing the methods/steps in the embodiment shown in fig. 3, which are not described herein for brevity.
In one example, the image recognition device may also include a communication interface 503 and a bus 510. As shown in fig. 5, the processor 501, the memory 502, and the communication interface 503 are connected to each other by a bus 510 and perform communication with each other.
The communication interface 503 is mainly used to implement communication between each module, apparatus, unit and/or device in the embodiments of the present application.
Bus 510 includes hardware, software, or both that couple the components of the online data flow billing device to each other. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a VESA Local Bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 510 may include one or more buses, where appropriate. Although embodiments of the application have been described and illustrated with respect to a particular bus, the application contemplates any suitable bus or interconnect.
The image recognition equipment provided by the embodiment of the application realizes the complementarity of image information through feature fusion, overcomes the deficiencies of image feature information in detail and scene while avoiding redundant noise, and makes full use of the various kinds of feature information in the image to supplement one another; at the same time, the extracted text features can reflect the differences and commonalities of images in different scenes, so the equipment is suitable for more complex scenes and improves the accuracy of image recognition.
In addition, in combination with the image recognition method in the above embodiment, the embodiment of the present application may be implemented by providing a computer storage medium. The computer storage medium has stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the image recognition methods of the above embodiments.
It should be understood that the application is not limited to the particular arrangements and instrumentality described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. The method processes of the present application are not limited to the specific steps described and shown, but various changes, modifications and additions, or the order between steps may be made by those skilled in the art after appreciating the spirit of the present application.
The functional blocks shown in the above structural block diagrams may be implemented in hardware, software, firmware, or a combination thereof. When implemented in hardware, they may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), appropriate firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the application are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber-optic media, radio-frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the Internet or an intranet.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. The present application is not limited to the order of the steps described above; that is, the steps may be performed in the order mentioned in the embodiments, in an order different from that in the embodiments, or several steps may be performed simultaneously.
Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed via the processor of the computer or other programmable data processing apparatus, implement the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to, a general-purpose processor, a special-purpose processor, an application-specific processor, or a field-programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware that performs the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
The foregoing describes only specific embodiments of the present application. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, modules, and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated here. It should be understood that the protection scope of the present application is not limited thereto; any equivalent modification or substitution that those skilled in the art can readily conceive within the technical scope disclosed by the present application shall fall within the protection scope of the present application.

Claims (10)

1. An image recognition method, the method comprising:
acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified;
Inputting the image to be identified into a first network in a pre-trained image identification model, and determining text characteristics of the image to be identified;
Inputting the image to be identified into a second network in the image identification model, and determining a pooling feature map and spatial relationship features of the at least one object to be identified;
Carrying out feature fusion on the text features of the image to be identified, the pooled feature map of the at least one object to be identified, and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified;
inputting the shared characteristic image into a third network in the image recognition model, and determining recognition information of the image to be recognized, wherein the recognition information comprises category information and position information of each object to be recognized;
Inputting the image to be identified into a second network in the image identification model, and determining a pooling feature map and spatial relationship features of the at least one object to be identified, including:
adjusting the resolution of each sample image in the sample image group to be a preset resolution, and determining the adjusted sample image group;
Inputting the adjusted sample image group into the depth residual error network, and determining an original image set, wherein images in the original image set correspond to images in the adjusted sample image group one by one;
Inputting the original image set to the area extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the area extraction network and surround identifiable objects, and N is an integer greater than 1;
Extracting M anchor frames with the confidence coefficient larger than a preset confidence coefficient threshold value from the N anchor frames based on the confidence coefficient of the N anchor frames, wherein M is a positive integer smaller than N;
And inputting the mapping region images of the M anchor frames to a region-of-interest pooling layer of the region convolutional neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame.
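By way of illustration only, the following Python sketch shows the confidence-based anchor filtering described in the final steps of claim 1: of the N anchor frames predicted by the region extraction network, only the M frames whose confidence exceeds a preset threshold are passed on to the region-of-interest pooling layer. The function name `filter_anchors`, the box format, and the 0.7 threshold are assumptions made for the example, not part of the claimed method.

```python
# Minimal sketch of the anchor-filtering step (assumed names and threshold).
import numpy as np

def filter_anchors(boxes: np.ndarray, scores: np.ndarray, conf_threshold: float = 0.7):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns only the M anchor frames whose confidence exceeds the threshold."""
    keep = scores > conf_threshold          # boolean mask of the M retained anchors
    return boxes[keep], scores[keep]

# Example: 4 predicted anchor frames, 2 survive the 0.7 threshold and would be
# forwarded to the region-of-interest pooling layer.
boxes = np.array([[10, 10, 50, 60], [12, 8, 52, 58],
                  [100, 40, 160, 120], [0, 0, 20, 20]], dtype=float)
scores = np.array([0.92, 0.35, 0.81, 0.10])
kept_boxes, kept_scores = filter_anchors(boxes, scores)
print(kept_boxes.shape[0])  # -> 2
```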
2. The method of claim 1, wherein prior to the acquiring the image to be identified, the method further comprises:
Acquiring a training sample set, wherein the training sample set comprises a plurality of sample image groups, each sample image group comprises a sample image and a corresponding label image thereof, label identification information of a target identification object and scene information of the sample image are marked in the label image, and the label identification information comprises category information and position information of the target identification object;
and training a preset image recognition model by using the sample image group in the training sample set until the training stopping condition is met, so as to obtain a trained image recognition model.
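A minimal sketch of how one sample image group from claim 2 might be represented in code, with each label image contributing category information, position information, and scene information. The dataclass names, field names, and file names are illustrative assumptions only.

```python
# Illustrative data structure for a labeled sample (assumed names and fields).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TargetAnnotation:
    category: str                          # category information of the target object
    box_xyxy: Tuple[int, int, int, int]    # position information (x1, y1, x2, y2)

@dataclass
class LabeledSample:
    image_path: str                        # the sample image
    scene: str                             # scene information marked on the label image
    targets: List[TargetAnnotation]        # label identification information

# One sample image group of the training sample set (paths/labels are made up).
sample_image_group = [
    LabeledSample("img_001.jpg", "street", [TargetAnnotation("car", (34, 50, 210, 180))]),
    LabeledSample("img_002.jpg", "indoor", [TargetAnnotation("person", (12, 8, 96, 220))]),
]
```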
3. The method according to claim 2, wherein training the image recognition model using the sample image group in the training sample set until a training stop condition is satisfied, and obtaining a trained image recognition model specifically includes:
For each sample image group, the following steps are respectively executed:
inputting the sample image group into a first network in a preset image recognition model, and determining sample text characteristics corresponding to each sample image;
Inputting the sample image group into a second network in a preset image recognition model, and determining a sample pooling feature map and sample space relation features of each recognizable object;
According to the sample text characteristics corresponding to each sample image, the sample pooling characteristic images of each identifiable object and the sample space relation characteristics, carrying out characteristic fusion on each sample image, and determining a sample sharing characteristic image corresponding to each sample image;
inputting the sample sharing characteristic image into a third network in a preset image recognition model, and determining reference recognition information of each recognizable object, wherein the reference recognition information comprises classification information and reference position information of the recognizable object;
Performing non-maximum suppression processing on the reference position information of each identifiable object, filtering the reference position information which does not meet the preset requirement, and determining the prediction identification information of each sample image, wherein the prediction identification information comprises the classification information and the prediction position information of all identifiable objects;
Determining a loss function value of the preset image recognition model according to the predicted recognition information of the target sample image and the tag recognition information of all target recognition objects on the target sample image, wherein the target sample image is any one of the sample image groups;
And, under the condition that the loss function value does not meet the training stop condition, adjusting the model parameters of the image recognition model and continuing to train the parameter-adjusted image recognition model with the sample image groups, until the loss function value meets the training stop condition, so as to obtain the trained image recognition model.
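The following sketch illustrates, under simplifying assumptions, the loop structure of claim 3: features from the first and second networks are fused (here by concatenation) into a shared feature, the third network produces classification outputs, a loss value is computed against the labels, and the model parameters are adjusted until the loss satisfies a stop condition. The tiny linear stand-ins for the three networks, the feature sizes, the cross-entropy loss, and the loss threshold are assumptions, not the patented architecture.

```python
# Toy training loop: feature fusion + loss-driven stop condition (assumed shapes).
import torch
import torch.nn as nn

torch.manual_seed(0)

text_net   = nn.Linear(32, 16)        # stand-in for the first (text-feature) network
region_net = nn.Linear(64, 16)        # stand-in for the second (pooled/spatial) network
head_net   = nn.Linear(16 + 16, 5)    # stand-in for the third (classification) network

params = list(text_net.parameters()) + list(region_net.parameters()) + list(head_net.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

# Toy "sample image group": random raw features and integer class labels.
raw_text   = torch.randn(8, 32)
raw_region = torch.randn(8, 64)
labels     = torch.randint(0, 5, (8,))

loss_threshold = 0.05                 # assumed training-stop condition
for step in range(500):
    text_feat   = text_net(raw_text)
    region_feat = region_net(raw_region)
    shared_feat = torch.cat([text_feat, region_feat], dim=1)   # feature fusion
    logits = head_net(shared_feat)
    loss = criterion(logits, labels)
    if loss.item() < loss_threshold:  # stop condition satisfied
        break
    optimizer.zero_grad()
    loss.backward()                   # otherwise adjust the model parameters
    optimizer.step()
```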
4. A method according to claim 3, wherein said inputting said set of sample images into a first network in a pre-set image recognition model, determining sample text features corresponding to each of said sample images, comprises:
For each sample image, the following steps are respectively executed:
inputting the sample image into a first network in the preset image recognition model, and determining at least one text vector according to the context information of the sample image;
And splicing the at least one text vector, and determining sample text characteristics corresponding to the sample images.
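A minimal sketch of the "splicing" step in claim 4, in which the per-token context vectors produced by the first network are concatenated into a single sample text feature. The helper name and vector dimensions are illustrative assumptions.

```python
# Splicing text vectors into one sample text feature (assumed dimensions).
import numpy as np

def splice_text_vectors(text_vectors):
    """Concatenate the per-token context vectors into one sample text feature."""
    return np.concatenate(text_vectors, axis=0)

# Example: three 4-dimensional context vectors -> one 12-dimensional text feature.
vectors = [np.random.rand(4) for _ in range(3)]
sample_text_feature = splice_text_vectors(vectors)
print(sample_text_feature.shape)  # (12,)
```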
5. The method of claim 3, wherein the second network in the pre-set image recognition model comprises at least a depth residual network, a region extraction network, and a region convolution neural network,
Inputting the sample image group into a second network in a preset image recognition model, and determining a sample pooling feature map and sample spatial relationship features of each recognizable object, wherein the method comprises the following steps of:
Adjusting the resolution of each sample image in the sample image group to be a preset resolution, and determining an adjusted sample image group;
Inputting the adjusted sample image group into the depth residual error network, and determining an original image set, wherein images in the original image set correspond to images in the adjusted sample image group one by one;
Inputting the original image set to the area extraction network, and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the area extraction network and surround the identifiable object, and N is an integer greater than 1;
extracting M anchor frames with the confidence coefficient larger than a preset confidence coefficient threshold value from the N anchor frames based on the confidence coefficient of the N anchor frames, wherein M is a positive integer smaller than N;
inputting the mapping region images of the M anchor frames into a region-of-interest pooling layer of the region convolutional neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame;
And determining the sample space relation characteristic of each identifiable object according to the intersection ratio and the relative position between at least one anchor frame corresponding to each identifiable object.
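Claim 5 derives the sample spatial-relationship feature of each identifiable object from the intersection ratio and the relative position between its anchor frames. The sketch below shows one plausible formulation (IoU plus a normalized center offset); since the claim does not fix a specific formula, this particular feature vector is an assumption for illustration.

```python
# One possible spatial-relationship feature: IoU + normalized center offset.
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def spatial_relation_feature(anchor_a, anchor_b):
    """Combine the intersection ratio with the relative position of two anchor frames."""
    cax, cay = (anchor_a[0] + anchor_a[2]) / 2, (anchor_a[1] + anchor_a[3]) / 2
    cbx, cby = (anchor_b[0] + anchor_b[2]) / 2, (anchor_b[1] + anchor_b[3]) / 2
    wa, ha = anchor_a[2] - anchor_a[0], anchor_a[3] - anchor_a[1]
    return np.array([iou(anchor_a, anchor_b), (cbx - cax) / wa, (cby - cay) / ha])

print(spatial_relation_feature([0, 0, 10, 10], [5, 5, 15, 15]))  # ~[0.143, 0.5, 0.5]
```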
6. A method according to claim 3, wherein said performing non-maximum suppression processing on the reference position information of each identifiable object, filtering the reference position information that does not meet a preset requirement, and determining the predicted identification information of each sample image includes:
Dividing all the identifiable objects into a plurality of groups based on the classification information of each identifiable object, and determining the reference position information of the identifiable objects of different groups;
filtering the reference position information of each type of identifiable object;
and determining the prediction identification information of each sample image according to the reference position information of the identifiable objects after filtering and the classification information of the identifiable objects after filtering.
7. The method of claim 6, wherein filtering the reference location information for each type of identifiable object comprises:
sequentially calculating the intersection ratio between a target frame and each of the other reference frames, wherein the target frame is any one of a plurality of reference frames, and the reference frames are boundary frames, determined from the reference position information, that surround the identifiable objects;
filtering out the reference frames whose intersection ratio is larger than a preset intersection-ratio threshold, until the intersection ratio between any two remaining reference frames is smaller than the preset intersection-ratio threshold;
and determining the reference frames remaining after filtering as the predicted position information of the identifiable objects.
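Claims 6 and 7 together amount to class-wise non-maximum suppression: reference frames are grouped by classification information, and within each group any frame that overlaps an already-kept frame by more than a preset intersection-ratio threshold is filtered out. The sketch below is a standard greedy NMS used as an approximation of that procedure; the helper names and the 0.5 threshold are assumptions.

```python
# Greedy, class-wise non-maximum suppression (assumed names and threshold).
import numpy as np
from collections import defaultdict

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring frame, drop frames overlapping it above the
    threshold, and repeat until no two kept frames exceed the threshold."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = np.array([iou(boxes[i], boxes[best]) for i in rest])
        order = rest[overlaps <= iou_threshold]
    return keep

def classwise_nms(boxes, scores, classes, iou_threshold=0.5):
    """Group reference frames by classification information, then filter each group."""
    groups = defaultdict(list)
    for idx, cls in enumerate(classes):
        groups[cls].append(idx)
    kept = []
    for idxs in groups.values():
        idxs = np.array(idxs)
        local_keep = greedy_nms(boxes[idxs], scores[idxs], iou_threshold)
        kept.extend(idxs[local_keep].tolist())
    return kept

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
classes = ["car", "car", "person"]
print(classwise_nms(boxes, scores, classes))   # -> [0, 2]
```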
8. An image recognition apparatus, the apparatus comprising:
The first acquisition module is used for acquiring an image to be identified, wherein at least one object to be identified exists in the image to be identified;
The first determining module is used for inputting the image to be identified into a first network in a pre-trained image identification model and determining text characteristics of the image to be identified;
The second determining module is used for inputting the image to be identified into a second network in the image identification model and determining a pooling feature map and spatial relationship features of the at least one object to be identified;
the fusion module is used for carrying out feature fusion on the text features of the image to be identified, the pooled feature map of the at least one object to be identified and the spatial relationship features, and determining a shared feature image corresponding to the image to be identified;
The identification module is used for inputting the shared characteristic image into a third network in the image identification model, and determining identification information of the image to be identified, wherein the identification information comprises category information and position information of each object to be identified;
The second determination module includes:
the first adjusting unit is used for adjusting the resolution of each sample image in the sample image group to be a preset resolution and determining the adjusted sample image group;
The first determining unit is used for inputting the adjusted sample image group into the depth residual error network to determine an original image set, wherein the images in the original image set are in one-to-one correspondence with the images in the adjusted sample image group;
the second determining unit is used for inputting the original image set into the area extraction network and determining N anchor frames and position coordinates corresponding to each anchor frame, wherein the anchor frames are boundary frames which are predicted by the area extraction network and surround the identifiable object, and N is an integer larger than 1;
The extraction unit is used for extracting M anchor frames with the confidence coefficient larger than a preset confidence coefficient threshold value from the N anchor frames based on the confidence coefficient of the N anchor frames, wherein M is a positive integer smaller than N;
And the second adjusting unit is used for inputting the mapping region images of the M anchor frames to a region-of-interest pooling layer of the region convolution neural network, adjusting the resolution of the mapping region images of the M anchor frames, and determining M sample pooling feature maps with the same resolution, wherein each identifiable object corresponds to at least one anchor frame.
9. An image recognition apparatus, characterized in that the apparatus comprises: a processor and a memory storing computer program instructions; the processor reads and executes the computer program instructions to implement the image recognition method according to any one of claims 1-7.
10. A computer storage medium having stored thereon computer program instructions which, when executed by a processor, implement the image recognition method of any of claims 1-7.
CN202110400954.1A 2021-04-14 2021-04-14 Image recognition method, device, equipment and computer storage medium Active CN113052159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110400954.1A CN113052159B (en) 2021-04-14 2021-04-14 Image recognition method, device, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110400954.1A CN113052159B (en) 2021-04-14 2021-04-14 Image recognition method, device, equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN113052159A CN113052159A (en) 2021-06-29
CN113052159B true CN113052159B (en) 2024-06-07

Family

ID=76519713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110400954.1A Active CN113052159B (en) 2021-04-14 2021-04-14 Image recognition method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN113052159B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591967B (en) * 2021-07-27 2024-06-11 南京旭锐软件科技有限公司 Image processing method, device, equipment and computer storage medium
CN114663871A (en) * 2022-03-23 2022-06-24 北京京东乾石科技有限公司 Image recognition method, training method, device, system and storage medium
CN114648478A (en) * 2022-03-29 2022-06-21 北京小米移动软件有限公司 Image processing method, device, chip, electronic equipment and storage medium
CN115861720B (en) * 2023-02-28 2023-06-30 人工智能与数字经济广东省实验室(广州) Small sample subclass image classification and identification method
CN116993963B (en) * 2023-09-21 2024-01-05 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN117710234B (en) * 2024-02-06 2024-05-24 青岛海尔科技有限公司 Picture generation method, device, equipment and medium based on large model

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271967A (en) * 2018-10-16 2019-01-25 腾讯科技(深圳)有限公司 The recognition methods of text and device, electronic equipment, storage medium in image
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
US10198671B1 (en) * 2016-11-10 2019-02-05 Snap Inc. Dense captioning with joint interference and visual context
CN110458165A (en) * 2019-08-14 2019-11-15 贵州大学 A kind of natural scene Method for text detection introducing attention mechanism
CN111028235A (en) * 2019-11-11 2020-04-17 东北大学 Image segmentation method for enhancing edge and detail information by utilizing feature fusion
CN111368893A (en) * 2020-02-27 2020-07-03 Oppo广东移动通信有限公司 Image recognition method and device, electronic equipment and storage medium
CN111598214A (en) * 2020-04-02 2020-08-28 浙江工业大学 Cross-modal retrieval method based on graph convolution neural network
CN111985369A (en) * 2020-08-07 2020-11-24 西北工业大学 Course field multi-modal document classification method based on cross-modal attention convolution neural network
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
CN112070069A (en) * 2020-11-10 2020-12-11 支付宝(杭州)信息技术有限公司 Method and device for identifying remote sensing image
CN112101165A (en) * 2020-09-07 2020-12-18 腾讯科技(深圳)有限公司 Interest point identification method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986891A (en) * 2018-07-24 2018-12-11 北京市商汤科技开发有限公司 Medical imaging processing method and processing device, electronic equipment and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10198671B1 (en) * 2016-11-10 2019-02-05 Snap Inc. Dense captioning with joint interference and visual context
CN109271967A (en) * 2018-10-16 2019-01-25 腾讯科技(深圳)有限公司 The recognition methods of text and device, electronic equipment, storage medium in image
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
CN110458165A (en) * 2019-08-14 2019-11-15 贵州大学 A kind of natural scene Method for text detection introducing attention mechanism
CN111028235A (en) * 2019-11-11 2020-04-17 东北大学 Image segmentation method for enhancing edge and detail information by utilizing feature fusion
CN111368893A (en) * 2020-02-27 2020-07-03 Oppo广东移动通信有限公司 Image recognition method and device, electronic equipment and storage medium
CN111598214A (en) * 2020-04-02 2020-08-28 浙江工业大学 Cross-modal retrieval method based on graph convolution neural network
CN111985369A (en) * 2020-08-07 2020-11-24 西北工业大学 Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN112101165A (en) * 2020-09-07 2020-12-18 腾讯科技(深圳)有限公司 Interest point identification method and device, computer equipment and storage medium
CN112070069A (en) * 2020-11-10 2020-12-11 支付宝(杭州)信息技术有限公司 Method and device for identifying remote sensing image

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Novel Water-Shore-Line Detection Method for USV Autonomous Navigation; Zou, X. et al.; SENSORS; Vol. 20, No. 6; Article No. 1682 *
Image-semantics-based visual privacy behavior recognition and protection system for service robots; Li Zhongyi; Yang Guanci; Li Yang; He Ling; Journal of Computer-Aided Design & Computer Graphics (No. 10); pp. 146-154 *
Cross-modal multi-label biomedical image classification modeling and recognition; Yu Yuhai; Lin Hongfei; Meng Jiana; Guo Hai; Zhao Zhehuan; Journal of Image and Graphics (No. 06); pp. 143-153 *

Also Published As

Publication number Publication date
CN113052159A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN113052159B (en) Image recognition method, device, equipment and computer storage medium
CN111368687B (en) Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation
CN110069986B (en) Traffic signal lamp identification method and system based on hybrid model
WO2022141910A1 (en) Vehicle-road laser radar point cloud dynamic segmentation and fusion method based on driving safety risk field
Negru et al. Image based fog detection and visibility estimation for driving assistance systems
CN110619279B (en) Road traffic sign instance segmentation method based on tracking
Lin et al. A Real‐Time Vehicle Counting, Speed Estimation, and Classification System Based on Virtual Detection Zone and YOLO
CN111274942A (en) Traffic cone identification method and device based on cascade network
CN111666805A (en) Category tagging system for autonomous driving
CN112329623A (en) Early warning method for visibility detection and visibility safety grade division in foggy days
Zhang et al. A graded offline evaluation framework for intelligent vehicle’s cognitive ability
Belaroussi et al. Impact of reduced visibility from fog on traffic sign detection
CN114677507A (en) Street view image segmentation method and system based on bidirectional attention network
CN110599497A (en) Drivable region segmentation method based on deep neural network
Yebes et al. Learning to automatically catch potholes in worldwide road scene images
Wang Vehicle image detection method using deep learning in UAV video
Kühnl et al. Visual ego-vehicle lane assignment using spatial ray features
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
Arora et al. Automatic vehicle detection system in Day and Night Mode: challenges, applications and panoramic review
Coronado et al. Detection and classification of road signs for automatic inventory systems using computer vision
Huu et al. Proposing Lane and Obstacle Detection Algorithm Using YOLO to Control Self‐Driving Cars on Advanced Networks
Appiah et al. Object detection in adverse weather condition for autonomous vehicles
Matsuda et al. A Method for Detecting Street Parking Using Dashboard Camera Videos.
Baruah et al. Analysis of Traffic Sign Recognition for Automated Transportation Systems Using Neural Networks
CN111353481A (en) Road obstacle identification method based on laser point cloud and video image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant