CN112084849A - Image recognition method and device


Info

Publication number
CN112084849A
Authority
CN
China
Prior art keywords
feature
feature map
map
image
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010761239.6A
Other languages
Chinese (zh)
Inventor
车慧敏
李志刚
杨雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010761239.6A priority Critical patent/CN112084849A/en
Publication of CN112084849A publication Critical patent/CN112084849A/en
Priority to PCT/CN2021/109680 priority patent/WO2022022695A1/en
Pending legal-status Critical Current

Classifications

    • G06V 20/10 Scenes; Scene-specific elements; Terrestrial scenes
    • G06F 18/214 Pattern recognition; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Pattern recognition; Matching criteria, e.g. proximity measures
    • G06N 3/045 Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N 3/08 Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image recognition method and device, relating to the field of neural network technology and helping to improve image recognition accuracy. The method comprises the following steps: acquiring an image to be recognized; performing feature extraction on the image to be recognized by using a first neural network to obtain a first feature map; performing feature extraction on the first feature map by using a second neural network to obtain a second feature map, and performing dot multiplication on the second feature map and the first feature map to obtain a third feature map, where the third feature map represents a feature map obtained after the features of the image to be recognized are transformed to the main direction; obtaining a first score map of the image to be recognized based on the third feature map; and recognizing the image to be recognized based on the third feature map and the first score map.

Description

Image recognition method and device
Technical Field
The present application relates to the field of neural network technologies, and in particular, to an image recognition method and apparatus.
Background
With the development of technology, some preschool education robots with a picture-book reading function (referred to as picture-book robots for short) have appeared on the market. A picture-book robot needs to accurately identify the picture book before reading it aloud. Specifically, the robot captures an image of a page of the picture book through its camera, performs local feature detection on the image, matches the detection result against the picture-book image templates pre-stored in a database to obtain the image with the highest matching degree with the detection result, and takes that image as the image to be read. The picture-book robot then reads the image to be read aloud.
This method of identifying the picture book places high requirements on how the picture book is placed. For example, the picture book is required to be spread open on a horizontal plane that is level with the plane on which the picture-book robot stands, and the distance and angle between the picture book and the robot must meet certain requirements. In addition, the picture-book robot is required to remain stationary.
However, in practical applications, it is often difficult for young children to place the picture book according to the above requirements, which may greatly reduce the accuracy with which the picture-book robot recognizes the picture book, or even make recognition impossible.
Disclosure of Invention
The embodiments of the present application provide an image recognition method and apparatus, which help to improve image recognition accuracy.
In order to achieve the above purpose, the present application provides the following technical solutions:
In a first aspect, an image recognition method is provided, including: first, an image to be recognized is acquired. Then, feature extraction is performed on the image to be recognized by using a first neural network to obtain a first feature map. Next, feature extraction is performed on the first feature map by using a second neural network to obtain a second feature map, and dot multiplication is performed on the second feature map and the first feature map to obtain a third feature map; the third feature map represents a feature map obtained after the features of the image to be recognized are transformed to the main direction. A first score map of the image to be recognized is obtained based on the third feature map. Finally, the image to be recognized is recognized based on the third feature map and the first score map. In this technical solution, feature extraction is performed on the first feature map by using the second neural network to obtain the second feature map, and the second feature map and the first feature map are dot-multiplied to obtain the third feature map; this helps to construct a network with a rotation-invariance property, to perform image recognition based on that network, and to improve the accuracy of image recognition.
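Purely as an illustration, the following Python sketch shows the data flow described above; the callables backbone, direction_net, score_head and match are hypothetical stand-ins for the first neural network, the second neural network, the score-map branch and the recognition step, and are not named in the patent.

```python
def recognize(image, backbone, direction_net, score_head, match):
    # first neural network: feature extraction on the image to be recognized
    first_map = backbone(image)
    # second neural network: feature extraction on the first feature map
    second_map = direction_net(first_map)
    # dot (element-wise) multiplication yields the third feature map,
    # i.e. the features transformed towards the main direction
    third_map = second_map * first_map
    # first score map of the image to be recognized
    score_map = score_head(third_map)
    # recognize the image from the third feature map and the first score map
    return match(third_map, score_map)
```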
In one possible design, performing feature extraction on the image to be recognized by using the first neural network to obtain the first feature map includes: performing at least one layer of convolution operation on the image to be recognized by using the first neural network to obtain the first feature map.
In one possible design, performing feature extraction on the first feature map by using the second neural network to obtain the second feature map includes: performing at least one layer of convolution operation on the first feature map by using the second neural network to obtain the second feature map. In this possible design, feature extraction is performed on the first feature map through at least one layer of convolution operation to obtain the second feature map, which is simple to implement.
In one possible design, performing feature extraction on the first feature map by using the second neural network to obtain the second feature map includes: performing at least one layer of convolution operation on the first feature map by using the second neural network; and performing at least one layer of pooling operation and/or fully connected operation on the first feature map after the convolution operation to obtain the second feature map. In this possible design, feature extraction is performed on the first feature map through at least one layer of convolution operation plus at least one layer of pooling and/or fully connected operation, which supports more complex feature extraction, helps make the extraction result more accurate, and thereby helps improve the accuracy of image recognition.
In one possible design, the size of the third feature map is M1 × N1 × P1 and the size of the first score map is M1 × N1, where P1 is the size in the feature-direction dimension, M1 × N1 is the size in the dimensions perpendicular to the feature direction, and M1, N1 and P1 are all positive integers. In this possible design, the third feature map and the first score map are used directly to recognize the image to be recognized, which is simple to implement.
In one possible design, the size of the third feature map is M2 × N2 × P2 and the size of the first score map is M1 × N1, where P2 is the size in the feature-direction dimension and M1, N1, P1, M2, N2 and P2 are all positive integers. Recognizing the image to be recognized based on the third feature map and the first score map includes: performing feature extraction on the third feature map to obtain a fourth feature map, where the size of the fourth feature map is M1 × N1 × P1 and P1, a positive integer, is the size in the feature-direction dimension; and recognizing the image to be recognized based on the fourth feature map and the first score map, where the size of the first score map is M1 × N1. Based on this optional implementation, the first score map and the feature map obtained by performing feature extraction on the third feature map are used to recognize the image to be recognized, which makes it possible to change the size of the feature map: the larger the feature map, the lower the efficiency of the image recognition process, but also the more accurately the feature map can represent the image to be recognized. Changing the size of the feature map therefore helps balance the efficiency and accuracy of the image recognition process and thus improves its overall performance.
In one possible design, M1 × N1 < M2 × N2. In this way, the size of the feature map used for the image recognition process is reduced, so that the processing complexity of the image recognition process is reduced to improve the processing efficiency of the image recognition process.
In one possible design, obtaining the first score map of the image to be recognized based on the third feature map includes: performing a convolution operation on the third feature map by using 1-channel convolution kernels to obtain X fifth feature maps, where the size of each fifth feature map in the feature direction is smaller than that of the third feature map and X is an integer greater than 2; performing a weighted summation over the elements of the X fifth feature maps to obtain a sixth feature map; and performing feature extraction on the sixth feature map to obtain the first score map. In this possible design, only the size of the third feature map in the feature direction is compressed when obtaining the score map, so the design is simple to implement.
In one possible design, obtaining the first score map of the image to be recognized based on the third feature map includes: performing feature extraction on the third feature map to obtain a seventh feature map, where the size of the third feature map perpendicular to the feature direction is larger than that of the seventh feature map perpendicular to the feature direction; performing a convolution operation on the seventh feature map by using 1-channel convolution kernels to obtain X fifth feature maps, where X is an integer greater than 2; performing a weighted summation over the elements of the X fifth feature maps to obtain a sixth feature map; and performing feature extraction on the sixth feature map to obtain the first score map. In this possible design, both the size of the third feature map in the feature direction and its size perpendicular to the feature direction are compressed when obtaining the score map, which reduces the complexity of image processing and improves the processing efficiency of the image recognition process.
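The following PyTorch-style sketch illustrates one way the score-map branch of the first of the two designs above could be realized; the number of fifth feature maps (X = 8), the kernel sizes and the use of a sigmoid are assumptions for illustration, not details taken from the text.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Illustrative sketch of the score-map branch; layer sizes are assumptions."""
    def __init__(self, in_channels=64, num_maps=8):
        super().__init__()
        # X parallel 1-channel convolutions compress the feature-direction dimension
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, 1, kernel_size=3, padding=1) for _ in range(num_maps)]
        )
        self.weights = nn.Parameter(torch.ones(num_maps) / num_maps)  # learned weights
        self.final = nn.Conv2d(1, 1, kernel_size=3, padding=1)        # further feature extraction

    def forward(self, third_map):
        fifth_maps = [conv(third_map) for conv in self.branches]          # X fifth feature maps
        sixth_map = sum(w * f for w, f in zip(self.weights, fifth_maps))  # weighted sum
        return torch.sigmoid(self.final(sixth_map))                       # first score map in [0, 1]
```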
In one possible design, the size of the image to be recognized is larger than the size of the first score map. The size of the first score map used in the image recognition process (assumed to be a × b) represents the number of features in the feature map used in that process. In this possible design, if the features of the image to be recognized are dense features, the feature map corresponding to the first score map contains sparse features; performing image recognition using sparse features helps reduce the complexity of image processing and thus improves the processing efficiency of the image recognition process.
In a second aspect, the present application provides an image recognition apparatus.
In one possible design, the image recognition apparatus is configured to perform any one of the methods provided in the first aspect. The present application may divide the image recognition apparatus into functional modules according to any one of the methods provided in the first aspect. For example, each functional module may be divided according to each function, or two or more functions may be integrated into one processing module. For example, the image recognition apparatus may be divided into an acquisition unit, a feature extraction unit, a recognition unit and the like according to function. For descriptions of the possible technical solutions executed by each divided functional module and of the beneficial effects, reference may be made to the technical solutions provided in the first aspect or its corresponding possible designs; details are not repeated here.
In another possible design, the image recognition apparatus includes a memory and one or more processors, where the memory is coupled to the processors. The memory is configured to store computer instructions, and the processor is configured to invoke the computer instructions to perform any of the methods provided by the first aspect and any of its possible designs.
In a third aspect, the present application provides a computer-readable storage medium, such as a non-transitory computer-readable storage medium, having stored thereon a computer program (or instructions) that, when run on an image recognition apparatus, causes the image recognition apparatus to perform any of the methods provided by any of the possible implementations of the first aspect.
In a fourth aspect, the present application provides a computer program product enabling any of the methods provided in any of the possible implementations of the first aspect to be performed when the computer program product runs on a computer.
In a fifth aspect, the present application provides a chip system, including a processor configured to call and run a computer program stored in a memory, so as to perform any of the methods provided by the implementations of the first aspect.
It is understood that any one of the image recognition apparatuses, computer storage media, computer program products, or chip systems provided above may be applied to the corresponding methods provided above, and therefore, the beneficial effects achieved by the methods may refer to the beneficial effects in the corresponding methods, and are not described herein again.
In the present application, the names of the above-mentioned image recognition apparatuses do not limit the devices or functional modules themselves, and in actual implementation, the devices or functional modules may appear by other names. Insofar as the functions of the respective devices or functional modules are similar to those of the present application, they fall within the scope of the claims of the present application and their equivalents.
These and other aspects of the present application will be more readily apparent from the following description.
Drawings
FIG. 1 is a diagram of a hardware configuration of a computer device that can be applied to an embodiment of the present application;
fig. 2a is a schematic diagram of a deep learning network model according to an embodiment of the present disclosure;
FIG. 2b is a schematic diagram of another deep learning network model provided in the embodiment of the present application;
fig. 3 is a schematic diagram of a logic structure of a first neural network according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating the logical structure of a second neural network according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of dimensions of a feature map provided by an embodiment of the present application;
fig. 6 is a schematic flowchart of a method for acquiring training data according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a reference image and a sample image obtained by homography transformation of the reference image, which can be applied to the embodiment of the present application;
FIG. 8 is a schematic diagram of a relationship between reference data and training data provided by an embodiment of the present application;
FIG. 9 is a schematic diagram illustrating a connection relationship among a front-end network, a countermeasure network and a twin network according to an embodiment of the present application;
fig. 10 is a flowchart illustrating a method for training a front-end network according to an embodiment of the present application;
fig. 11 is a schematic diagram of a logical structure of a countermeasure network according to an embodiment of the present application;
fig. 12 is a schematic diagram of a logical structure of an extraction network according to an embodiment of the present disclosure;
fig. 13 is a schematic diagram illustrating a logical structure of a network according to an embodiment of the present application;
fig. 14 is a schematic flowchart of an image recognition process according to an embodiment of the present application;
FIG. 15 is a schematic flowchart of another image recognition process provided in the embodiments of the present application;
fig. 16 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of a chip system according to an embodiment of the present disclosure;
fig. 18 is a conceptual partial view of a computer program product provided by an embodiment of the present application.
Detailed Description
First, some of the terms and techniques referred to in this application are explained:
Feature: that is, an image feature, which may include color features, texture features, shape features, local feature points, and the like.
Global/local features: a global feature refers to an overall attribute of an image; common global features include color features, texture features, shape features and the like. Global features use all of the information of an image to represent the image, so they carry a large amount of redundant information. Local features refer to local attributes of an image; they use local feature points of the image to represent it. Each local feature point contains only the information of the image block in which it is located and is unaware of the global information of the image.
Feature points (i.e., local feature points): in image processing, when multiple images of the same object or scene are acquired from different angles, if the same part of the object or scene can be identified as being the same across the images, those parts are said to have scale invariance. Pixel points or pixel blocks (i.e., blocks formed by a plurality of pixel points) with scale invariance are feature points. In one example, if a pixel in an image is an extreme point (e.g., the maximum or minimum) within its neighborhood, that pixel is determined to be a feature point.
Image block (image patch): a local square area in an image, such as 4 × 4 pixels or 8 × 8 pixels, where a × a pixels denotes a square region a pixels wide and a pixels high, and a is an integer greater than or equal to 1.
Homography transformation (homography): also known as projective transformation. It maps points (expressed as three-dimensional homogeneous vectors) on one projection plane onto another projection plane, satisfying Y = H × X, where H is a 3 × 3 matrix (also called the homography matrix), X is the position coordinates of a pixel in the source image, and Y is the position coordinates of the corresponding pixel in the mapped target image. In a picture-book recognition scene, the picture-book page can be regarded as a plane; the subset of geometric transformations corresponding to a plane is the homography transformation, and the homography matrix determining the transformation is a matrix (e.g., a 3 × 3 matrix) composed of rotation, translation, scaling and other components. If one image can be transformed into another image by a homography, the two images are considered to have a homography transformation relationship.
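As a small numeric illustration of the relation Y = H × X (the values of H below are arbitrary and chosen only for demonstration):

```python
import numpy as np

# Arbitrary 3 x 3 homography matrix H, for illustration only
H = np.array([[1.2, 0.1, 5.0],
              [0.0, 0.9, 3.0],
              [0.001, 0.0, 1.0]])

X = np.array([40.0, 25.0, 1.0])    # source pixel as a homogeneous vector (x, y, 1)
Y = H @ X                          # Y = H * X
y = Y[:2] / Y[2]                   # normalize by the last component to get target coordinates
print(y)                           # position of the mapped pixel in the target image
```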
Histogram of Oriented Gradients (HOG): a histogram, also called a mass distribution chart, is a statistical chart in which a series of vertical bars or line segments of varying height represent the data distribution; the horizontal axis usually represents the data category and the vertical axis represents the distribution. A histogram of oriented gradients is a statistic used to describe the direction information of local image gradients.
Main direction: for an image or image block, a gradient-direction histogram is built by computing the gradient directions between adjacent pixels (i.e., the unit vectors of the vector differences of adjacent pixels); the gradient direction at which the peak of the histogram is located is the main direction of the image or image block.
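A minimal sketch of estimating a main direction from a gradient-direction histogram is given below; the bin count and the magnitude weighting are assumptions, and the sketch only illustrates the definition rather than the patent's exact procedure.

```python
import numpy as np

def main_direction(patch, bins=36):
    # Build a gradient-direction histogram for the image block and return the
    # direction of its peak bin as the main direction (simplified sketch).
    gy, gx = np.gradient(patch.astype(float))
    angles = np.arctan2(gy, gx)                  # gradient directions, in radians
    magnitudes = np.hypot(gx, gy)                # weight each direction by gradient strength
    hist, edges = np.histogram(angles, bins=bins,
                               range=(-np.pi, np.pi), weights=magnitudes)
    peak = int(np.argmax(hist))
    return 0.5 * (edges[peak] + edges[peak + 1])  # centre of the peak bin
```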
Convolutional Neural Network (CNN): a feedforward neural network whose artificial neurons respond to surrounding units within part of their coverage range; it performs particularly well on large-scale image processing.
Maximum pooling (max pooling): the most direct purpose of a pooling layer is to reduce the amount of data to be processed by the next layer. Max pooling keeps, among the feature values covered by a filter, only the largest one as the retained value and discards all the other feature values; that is, only the strongest response of the feature is retained, and the weaker responses are discarded.
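For example, a 2 × 2 max pooling over a small feature map keeps only the strongest value in each window (illustrative sketch, not part of the patent text):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # 2x2 max pooling: keep only the largest value in each 2x2 window
    h, w = feature_map.shape
    return (feature_map[:h - h % 2, :w - w % 2]
            .reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3)))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))   # [[ 5  7] [13 15]] -- the weaker values are discarded
```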
Rotation invariance: in physics, a physical system has rotation invariance if its properties are independent of its orientation in space. In image processing, if the features extracted by a feature extractor hardly change under any in-plane rotation angle, the feature extractor is said to have rotation invariance. The feature extractor may be a picture-book robot, or a functional module in a picture-book robot, such as a neural network.
Loss function: a loss function measures the degree of disagreement between the predicted value f(x) of a model and the true value Y. It is a non-negative real-valued function, usually denoted L(Y, f(x)); the smaller the loss function, the better the robustness of the model. One goal of an optimization problem is to minimize the loss function. An objective function is usually either the loss function itself or its negative; when the objective function is the negative of the loss function, its value is to be maximized.
Sparse features, dense features: in local feature detection, if the position index of each pixel in an image is recorded and each index may be associated with one feature, sparse features means that most of the indexes in the index set are empty, i.e., most indexes have no associated feature. Dense features means that most of the indexes are not empty, i.e., most indexes have a corresponding feature description.
Local feature detection algorithm: a local feature detection algorithm includes two parts, extraction and representation. The purpose of "extraction" is to determine whether each pixel (or image block) in an image is a feature point. "Representation" means that every detected feature point is represented, according to its neighborhood, as a feature value of the same dimension. Whether two feature points are similar can be judged by computing the distance between their feature values, and the degree of similarity between two images can then be judged from the number or proportion of similar feature points in the two images. Therefore, the evaluation criterion of a local feature detection algorithm is the matching accuracy with which feature points are successfully matched for two images having the same or similar regions.
A high homography transformation scene refers to a scene in which the feature representations before and after the transformation differ greatly (i.e., the feature points determined before and after the transformation differ greatly), such as a picture-book recognition scene.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
In the embodiments of the present application, the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.
The term "at least one" in this application means one or more, and the term "plurality" in this application means two or more, for example, the plurality of second messages means two or more second messages. The terms "system" and "network" are often used interchangeably herein.
It is to be understood that the terminology used in the description of the various described examples herein is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of the various described examples and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present application generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that, in the embodiments of the present application, the size of the serial number of each process does not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should be understood that determining B from a does not mean determining B from a alone, but may also be determined from a and/or other information.
It will be further understood that the terms "comprises," "comprising," "includes," and/or "including," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also understood that the term "if" may be interpreted to mean "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined..." or "if [a stated condition or event] is detected" may be interpreted to mean "upon determining...", "in response to determining...", "upon detecting [the stated condition or event]", or "in response to detecting [the stated condition or event]", depending on the context.
It should be appreciated that reference throughout this specification to "one embodiment," "an embodiment," "one possible implementation" means that a particular feature, structure, or characteristic described in connection with the embodiment or implementation is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" or "one possible implementation" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
At present, the following local detection algorithms are usually adopted for picture-book recognition:
the first method comprises the following steps: the local detection algorithm based on the manual characteristic mode, namely the extraction and the representation of the local characteristic points are all based on rules. If the extreme point is determined, the pixel value of each pixel point needs to be compared with the pixel values of the surrounding neighborhood pixel points one by one. In the judgment of the main direction, it is necessary to construct a gradient direction histogram and the like one by one. In the feature representation, complicated steps such as normalization and direction correction are required. Each of these steps requires fixed parameters to be set experimentally.
When judging whether two images are locally similar, if the geometric shape of a similar region (i.e., the same physical region seen from different shooting angles) shrinks only slightly, the corresponding features change little under manual local feature detection, and matching is relatively easy and correct. In a picture-book recognition scene, however, for the same page of the picture book, the features of the page image scanned by the picture-book robot change greatly when the book is placed in different positions. When the similar region undergoes large geometric deformation, the distribution of extreme points changes substantially; when a local region shrinks or deforms strongly, pixels that were originally extreme points may no longer appear as extreme points under the manual rules and therefore cannot be determined to be feature points. This may bias the representation of some feature points so that they cannot be matched correctly.
The second: a local detection algorithm based on deep learning, in which the input of the neural network is an image and the outputs are a score map (i.e., for each pixel or pixel block, a probability value between 0 and 1 describing how likely it is to be marked as a local feature point) and a feature map giving the feature value corresponding to each pixel (or pixel block). This is a non-end-to-end method. On the one hand, feature extraction in this method still relies on manual feature extraction and therefore has the problems described above. On the other hand, the neural network is usually a convolutional neural network, which has rotation invariance only to a limited extent and, unlike the first method, does not rotate and normalize the feature points; consequently, in a high homography transformation scenario, the feature representations before and after the transformation differ greatly, which results in very low matching accuracy.
Based on this, the embodiments of the present application provide a neural network model training method and an image recognition method, which are applied to high homography transformation scenes (such as picture-book recognition scenes). Specifically: in the model training phase, a neural network with rotation invariance is trained based on a plurality of images; more precisely, a neural network whose degree of rotation invariance is higher than that of prior-art convolutional neural networks. The images include images having a homography transformation relationship. In the image recognition phase, images are recognized based on this rotation-invariant neural network. Compared with the prior art, this helps reduce the difference between the feature representations before and after the transformation, thereby improving matching accuracy.
The neural network model training method and the image recognition method provided by the embodiments of the present application may be applied to the same or different computer devices. For example, the neural network model training method may be executed by a computer device such as a server or a terminal, and the image recognition method may be performed by a terminal (e.g., a picture-book robot). This is not limited in the embodiments of the present application.
Fig. 1 is a schematic diagram of a hardware structure of a computer device 10 that can be applied to the embodiments of the present application.
Referring to fig. 1, a computer device 10 includes a processor 101, a memory 102, an input-output device 103, and a bus 104. The processor 101, the memory 102, and the input/output device 103 may be connected to each other via a bus 104.
The processor 101 is a control center of the computer device 10, and may be a Central Processing Unit (CPU), other general-purpose processors, or the like. Wherein a general purpose processor may be a microprocessor or any conventional processor or the like.
By way of example, processor 101 may include one or more CPUs, such as CPU 0 and CPU 1 shown in FIG. 1.
The memory 102 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that may store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
In one possible implementation, the memory 102 may exist independently of the processor 101. The memory 102 may be connected to the processor 101 through the bus 104 and is used to store data, instructions, or program code. When the processor 101 calls and executes the instructions or program code stored in the memory 102, it can implement the neural network model training method and/or the image recognition method provided by the embodiments of the present application.
In another possible implementation, the memory 102 may also be integrated with the processor 101.
The input/output device 103 is used to input parameter information such as sample images and images to be recognized, so that the processor 101 executes the instructions in the memory 102 according to the input parameter information to perform the neural network model training method and/or the image recognition method provided by the embodiments of the present application. Generally, the input/output device 103 may be an operation panel, a touch screen, or any other device capable of inputting parameter information, which is not limited in the embodiments of the present application.
The bus 104 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 1, but this does not mean that there is only one bus or one type of bus.
It should be noted that the configuration shown in fig. 1 does not constitute a limitation of the computer device 10, and that the computer device 10 may include more or less components than those shown in fig. 1, or combine some components, or a different arrangement of components, in addition to the components shown in fig. 1.
The technical solution provided by the embodiments of the present application is described below with reference to the accompanying drawings:
the model adopted in the embodiment of the present application is a deep learning network model (or neural network model, hereinafter referred to as network model for short). As shown in fig. 2a and fig. 2b, schematic diagrams of two deep learning network models provided in the embodiment of the present application are provided.
The network model shown in fig. 2a comprises: a front-end network 41 and a presentation network 42.
The network model shown in fig. 2b comprises: a front-end network 41, a presentation network 42, and an extraction network 43.
The input of the front-end network 41 is an image, and its output is the third feature map of the image. The third feature map represents a feature map obtained by transforming the features (such as texture features) of the image input to the front-end network 41 into the main direction. In the training phase, the input to the front-end network 41 is a sample image; in the image recognition phase, the input is the image to be recognized.
Alternatively, the front-end network 41 may include a first neural network 411 and a second neural network 412.
The first neural network 411 is used to perform feature extraction on an input image, for example, to perform at least one layer of convolution operation on the input image to obtain the first feature map. The first feature map may be a three-dimensional tensor in which each element corresponds to a region in the input image; this region may also be referred to as the receptive field of the convolutional neural network.
For example, as shown in fig. 3, a schematic diagram of a logic structure of a first neural network 411 is provided in the embodiment of the present application. The input image of the first neural network 411 has a size H × W × 3, and the output first feature map has a size H/4 × W/4 × 64. The first neural network 411 includes 4 convolutional layers (labeled conv1-1, conv1-2, conv2-1, conv2-2, respectively).
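The following PyTorch-style sketch reproduces only the input/output shapes stated for FIG. 3 (H × W × 3 in, H/4 × W/4 × 64 out, four convolutional layers); the intermediate channel counts, kernel sizes and the use of stride-2 convolutions for the 4× downsampling are assumptions, since the figure itself is not reproduced here.

```python
import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    """Hedged sketch of the first neural network 411; internal details are assumed."""
    def __init__(self):
        super().__init__()
        self.conv1_1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv1_2 = nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)   # H/2 x W/2
        self.conv2_1 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)   # H/4 x W/4
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (batch, 3, H, W)
        x = self.act(self.conv1_1(x))
        x = self.act(self.conv1_2(x))
        x = self.act(self.conv2_1(x))
        return self.act(self.conv2_2(x))         # (batch, 64, H/4, W/4) -- the first feature map
```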
The second neural network 412 is used for correcting the first feature map to obtain a third feature map. Optionally, the second neural network 412 is configured to perform feature extraction on the first feature map to obtain a second feature map, and perform point multiplication on the second feature map and the first feature map to obtain a third feature map.
In one implementation, the second neural network 412 is specifically configured to perform at least one layer of convolution operation on the first feature map to obtain a second feature map.
In another implementation, the second neural network 412 is specifically configured to perform at least one layer of convolution operation on the first feature map, and then perform at least one layer of pooling operation and/or fully connected operation on the result to obtain the second feature map. For example, fig. 4 is a schematic diagram of the logical structure of a second neural network 412 provided in an embodiment of the present application; fig. 4 is drawn based on fig. 3. The first feature map input to the second neural network 412 has a size of H/4 × W/4 × 64, and the third feature map it outputs has a size of H/4 × W/4 × 64. The second neural network 412 shown in fig. 4 includes 2 convolutional layers, 1 fully connected layer and one dot-multiplication layer. This implementation enables more complex feature extraction, which makes the feature extraction result more accurate and thus helps improve the accuracy of image recognition.
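A possible reading of FIG. 4 is sketched below in the same PyTorch style (2 convolutional layers, 1 fully connected layer, 1 dot-multiplication layer). How the fully connected output is brought back to H/4 × W/4 × 64 before the dot multiplication is not specified in the text, so a global pooling followed by per-channel weighting is assumed here purely for illustration.

```python
import torch
import torch.nn as nn

class SecondNetwork(nn.Module):
    """Hedged sketch of the second neural network 412; the pooling/reshaping step is assumed."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.fc = nn.Linear(channels, channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, first_map):                     # (batch, 64, H/4, W/4)
        x = self.act(self.conv1(first_map))
        x = self.act(self.conv2(x))
        pooled = x.mean(dim=(2, 3))                   # assumed global pooling before the FC layer
        weights = torch.sigmoid(self.fc(pooled))      # second feature map (one weight per channel)
        second_map = weights[:, :, None, None]
        return second_map * first_map                 # dot multiplication -> third feature map
```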
Based on the network model shown in fig. 2 a:
The representation network 42 is used to obtain the first score map based on the third feature map. The first score map is the score map of the image input to the front-end network. The size of the third feature map is M1 × N1 × P1 and the size of the first score map is M1 × N1, where P1 is the size in the feature-direction dimension, M1 × N1 is the size in the dimensions perpendicular to the feature direction, and M1, N1 and P1 are all positive integers.
Fig. 5 is a schematic diagram of the dimensions of a feature map provided in an embodiment of the present application. Fig. 5 illustrates a feature map of size H/4 × W/4 × 64: the size of its feature-direction dimension is 64, and the size of the dimensions perpendicular to the feature direction is H/4 × W/4. The descriptions of the dimensions of the other feature maps are similar and are not repeated here.
Based on the network model shown in fig. 2a, the third feature map output by the second neural network 412 is used as the feature map for the image recognition process. When this model is applied in the image recognition phase, the first score map and the third feature map are used to recognize the image to be recognized.
Combining the network model shown in fig. 2a with the size of the third feature map output by the second neural network 412 shown in fig. 4, M1 × N1 × P1 is equivalent to H/4 × W/4 × 64; specifically, M1 = H/4, N1 = W/4 and P1 = 64. In this case, the size of the first score map is H/4 × W/4.
Based on the network model shown in fig. 2 b:
The representation network 42 is used to obtain the first score map based on the third feature map. The size of the third feature map is M2 × N2 × P2 and the size of the first score map is M1 × N1, where P2 is the size in the feature-direction dimension and M1, N1, P1, M2, N2 and P2 are all positive integers. Optionally, M1 × N1 < M2 × N2.
The extraction network 43 is configured to perform feature extraction on the third feature map to obtain a fourth feature map. The size of the fourth feature map is M1 × N1 × P1, where P1, a positive integer, is the size in the feature-direction dimension. That is, this feature extraction further reduces the size of the feature map perpendicular to the feature direction, which helps reduce the computational complexity of the image recognition process when the fourth feature map is subsequently used for image recognition, thereby improving recognition efficiency.
When the network model shown in fig. 2b is applied to the image recognition stage, the first score map and the fourth feature map obtained based on the image to be recognized are used for recognizing the image to be recognized. The following specific examples are all described by taking the network model shown in fig. 2b as an example, which are described herein in a unified manner and are not described in detail below.
The technical scheme provided by the embodiment of the application comprises a training stage and an image recognition stage, which are respectively explained as follows:
training phase
The training phase includes a training data acquisition phase and a model training phase, which are respectively explained as follows:
a) stage of obtaining training data
As shown in fig. 6, a flowchart of a method for acquiring training data provided in an embodiment of the present application is shown, where an execution subject of the method may be a computer device, and the method may include the following steps:
s101: acquiring a reference image set, wherein the reference image set comprises a plurality of reference images; then, a score map is obtained for each of the plurality of reference images.
The embodiment of the present application does not limit the reference image set. For example, the reference image set may be an existing data set, such as an HPatches data set, and may specifically be a three-dimensional reconstruction data set or the like.
Because local feature detection methods are mostly used in fields such as three-dimensional modeling and simultaneous localization and mapping (SLAM), and the high homography transformation situations of picture-book recognition scenes rarely occur in those fields, the training data sets used in those fields usually do not contain sample images under high homography transformation. Because constructing such a data set is difficult and costly, in some embodiments of the present application, enhancement is performed based on an existing data set to obtain sample images suitable for the high homography transformation situation. The sample images under high homography transformation include images having a homography transformation relationship. The enhancement process may refer to S102 to S103.
The score map of the image may be characterized by a matrix. For example, the value of the element in the ith row and jth column in the matrix represents the probability that the pixel point (or pixel block) in the ith row and jth column in the image is the feature point. Wherein i and j are each an integer of 0 or more. In one example, if the reference image set is an existing data set, such as an HPatches data set, the score map of the reference image in the reference image set may be the score map of the corresponding image in the existing data set, such as the HPatches data set, so that the score map of the image in the prior art may be directly used without being obtained through calculation, which helps to reduce the calculation complexity.
In other embodiments of the present application, sample images suitable for use in high homography transformations may be obtained in other ways than enhancement based on an existing training data set. Correspondingly, the score map of each reference image in the reference image set may also be obtained in other ways, which is not limited in this embodiment of the application.
S102: a plurality of reference images (e.g., each reference image) are respectively used as sample images, and the score maps of the plurality of reference images are respectively used as the score maps of the corresponding sample images. Then, homography transformation is performed on each of the plurality of reference images (e.g., each reference image) to obtain a plurality of sample images.
Performing homography transformation on a reference image specifically includes: multiplying the reference image by a transformation matrix to obtain a sample image. The homography transformation matrix may be predefined or randomly generated. In S102, for any reference image, the reference image is multiplied by one or more different transformation matrices to obtain one or more sample images. For a given reference image, each transformation matrix corresponds to one sample image obtained based on that reference image.
Fig. 7 is a schematic diagram of a reference image and a sample image obtained by performing homography transformation on the reference image, which are applicable to the embodiment of the present application. Here, H in fig. 7 denotes a transformation matrix used for homography transformation.
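A possible way to implement S102 with OpenCV is sketched below; the corner-perturbation scheme used to generate the random homography matrix is an assumption for illustration, not the patent's prescribed method.

```python
import numpy as np
import cv2

def random_homography_sample(reference, max_shift=0.25, rng=None):
    # Warp a reference image with a randomly generated homography matrix
    # to obtain one sample image; max_shift controls how strongly the corners move.
    rng = rng or np.random.default_rng()
    h, w = reference.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    offsets = rng.uniform(-max_shift, max_shift, src.shape) * [w, h]
    dst = (src + offsets).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)            # 3 x 3 transformation matrix
    sample = cv2.warpPerspective(reference, H, (w, h))   # sample image
    return sample, H
```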
S103: for each sample image obtained based on the homography transformation, a score map of the sample image is obtained based on the score of the reference image corresponding to the sample image (i.e., the reference image used when the sample image is obtained) and the transformation matrix corresponding to the sample image (i.e., the transformation matrix used when the sample image is obtained).
In the following, taking an example of obtaining one sample image by performing homography transformation on one reference image, a method of obtaining a score map of a sample image will be described:
First, the reference image is denoted D, the transformation matrix used when performing the homography transformation on the reference image is denoted H, and the score of pixel d_ij in the reference image (i.e., the pixel in row i and column j, where i and j are integers) is denoted s_ij. Multiplying pixel d_ij in the reference image by the homography transform coefficient H_i yields a pixel denoted d'_ij, whose score is denoted s'_ij. Because d'_ij is obtained by mapping d_ij, the score s'_ij is influenced by the homography transformation matrix H. When the image is locally deformed and the image loss is large (the loss is correlated with the spatial rotation parameter and the scaling parameter of the homography matrix, and this correlation can be calculated in the expression of b), some feature points of the image before the transformation become less likely to be regarded as feature points after they are mapped into the transformed image. If the scores of pixels with this mapping relationship were kept unchanged before and after the mapping, the samples would be severely distorted, making subsequent network training difficult to converge. To this end, an embodiment of the present application provides a method for estimating the scores in the transformed image, which may specifically include the following steps:
Step A): expand s_ij into a matrix [s_ij, 1, 1]. To normalize the data, perform a normalization operation on the matrix [s_ij, 1, 1] to obtain S = [a, b, c].
Step B) based on the reference image and the reference imageMapping, calculating a mapping matrix T ═ λ [ λ ] used when transforming the reference image into the sample image123]。
Specifically, the method comprises the following steps: according to the matching corresponding relation between the image blocks in the reference image and the image blocks in the sample image, taking the scores of the image blocks in the reference image before deformation and the homography transformation matrix H as input, taking the scores of the image blocks obtained after deformation of the images as output, and obtaining the transformation matrix T of the score map through least square fitting. Wherein if the image block in the reference image and the image block in the sample image physically represent the same object, there is a matching correspondence between the two image blocks.
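A least-squares fit of this kind could look like the sketch below; the construction of the rows from [s, 1, 1] follows step A, while the normalization by the vector norm and the NumPy call are assumptions for illustration, and the homography matrix H, which the text also lists as an input to the fit, is omitted here for brevity.

```python
import numpy as np

def fit_score_transform(ref_scores, warped_scores):
    # One row per matched image block: the normalized expansion [s, 1, 1] of its
    # score in the reference image (see step A).
    rows = []
    for s in ref_scores:
        v = np.array([s, 1.0, 1.0])
        rows.append(v / np.linalg.norm(v))
    S = np.stack(rows)
    y = np.asarray(warped_scores)              # scores of the corresponding warped blocks
    # Least-squares fit of the score-map mapping matrix T = [lambda1, lambda2, lambda3]
    T, *_ = np.linalg.lstsq(S, y, rcond=None)
    return T
```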
Step C), obtaining a transformation matrix T based on the score map
Figure BDA0002613141950000101
Specifically, the method comprises the following steps:
if the pixel point P 'on the sample image is obtained by converting the pixel point P on the reference image, the pixel point Q' on the sample image is obtained by converting the pixel point Q on the reference image and the P and Q are superposed in the conversion process, then the image quality of the image is improved
Figure BDA0002613141950000102
The following formula is satisfied:
Figure BDA0002613141950000103
where n is the number of coincident points.
If pixel P' on the sample image is obtained by transforming pixel P on the reference image, and the reference image has a pixel Q in the neighborhood of pixel P, where Q is a fitting estimation point, then s'_ij satisfies a second formula (also presented as an equation image in the original), in which n is the number of neighborhood points, i.e., the number of pixels in the neighborhood.
The neighborhood of pixel point P may be predefined. The size and the position of the neighborhood of the pixel point P are not limited in the embodiment of the application.
It should be noted that each score in the feature-point score map is jointly constrained by the points in its neighborhood, so the receptive field is enlarged by the constraint; this alleviates the sample distortion problem in the data enhancement process and reduces the arbitrariness of feature-point selection.
Thus, training data is obtained. The training data includes: a sample image in the sample image set and a score map for each sample image. The sample image set comprises a reference image and an image obtained by performing homography transformation on the reference image.
Fig. 8 is a schematic diagram of a relationship between reference data and training data provided in an embodiment of the present application. Wherein the reference data comprises a reference image set and a score map of each reference image in the reference image set, and fig. 8 illustrates that the reference image set comprises a reference image 1 and a reference image 2. The training data comprises a sample image set and a score map for each sample image in the sample image set, and fig. 8 illustrates that the sample image set comprises: a sample image 10 (i.e., a reference image 1), a sample image 11 (i.e., an image obtained by multiplying the reference image 1 by a transformation matrix 11), a sample image 12 (i.e., an image obtained by multiplying the reference image 1 by the transformation matrix 12), a sample image 20 (i.e., a reference image 2), and a sample image 21 (i.e., an image obtained by multiplying the reference image 2 by a transformation matrix 21), etc. The double-headed arrows in fig. 8 indicate the correspondence between images and their score maps.
For the training data, two sample images of size H × W are paired as input. Each image corresponds to a set of image blocks, each image block corresponds to a feature matrix (of dimension 1 × W), and the score corresponding to each image block lies in the range [0, 1]. A triplet tri = (D_i, D_j, D_k) is constructed, where D_i, D_j and D_k are all image blocks, (D_i, D_j) is a matching pair of similar image blocks, and (D_i, D_k) is a matching pair of dissimilar image blocks. The dissimilar matching image blocks are randomly selected from the same image or from different images at the same scale.
In order to make the model robust at multiple scales, images at multiple scales are used as training input; according to the training data, the image size may be adjusted to three sizes: (H × 2) × (W × 2), H × W and (H/2) × (W/2). For the corresponding score maps, the score map for the (H × 2) × (W × 2) image is obtained by interpolation, and the score map for the (H/2) × (W/2) image is obtained by down-sampling (max pooling).
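The multi-scale preparation described above might be sketched as follows; the choice of interpolation modes and the assumption that the score map has the same spatial size as the image are illustrative only.

```python
import numpy as np
import cv2

def multi_scale_inputs(image, score_map):
    # Produce (H*2)x(W*2), HxW and (H/2)x(W/2) versions of a training image; the
    # enlarged score map is obtained by interpolation and the reduced one by max pooling.
    h, w = image.shape[:2]
    up_img = cv2.resize(image, (w * 2, h * 2), interpolation=cv2.INTER_LINEAR)
    up_score = cv2.resize(score_map, (w * 2, h * 2), interpolation=cv2.INTER_LINEAR)
    down_img = cv2.resize(image, (w // 2, h // 2), interpolation=cv2.INTER_AREA)
    sh, sw = score_map.shape[:2]
    down_score = (score_map[:sh - sh % 2, :sw - sw % 2]
                  .reshape(sh // 2, 2, sw // 2, 2).max(axis=(1, 3)))
    return [(up_img, up_score), (image, score_map), (down_img, down_score)]
```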
It should be noted that the training data performs score map estimation based on annotation information from natural scenes, so that the estimated samples are closer to real scenes, which helps improve the accuracy of image recognition.
b) Model training phase
Based on the network model shown in fig. 2a, the computer device may first train the front-end network 41 and then train the representation network 42.
Based on the network model shown in fig. 2b, the computer device may first train the front-end network 41 and then train the representation network 42 and the extraction network 43, respectively. The representation network 42 and the extraction network 43 may be trained in parallel, and no particular training order between the two is required.
The process of training each network (including the preceding-stage network 41, the representing network 42, the extracting network 43, and the like) can be regarded as a process of obtaining actual values of parameters of the network (such as values of each element in the convolution kernel). The actual values here refer to the values of the parameters used by the network when applied in the image recognition stage.
Training the front end network 41
Before training the front-end network 41, the following information may be configured in advance:
the operation layers respectively included in the first neural network 411 and the second neural network 412 of the front-end network 41, the size of the input of each operation layer, the size of the parameters of each operation layer, the size of the output of each operation layer, and the association relationship between the operation layers. The operation layers may include one or more of a convolutional layer, a pooling layer, a fully-connected layer, a dot-multiplication layer, or the like. The parameters of an operation layer are the parameters used when performing that layer's operation; for example, the parameters of the convolutional layers include the number of convolutional layers and the size of the convolution kernel used by each convolutional layer. The association relationship between operation layers may also be referred to as the connection relationship between them, for example, which operation layer's output serves as which operation layer's input.
It is understood that the input of the first operation layer in the front-end network 41 is the input of the front-end network 41, and the output of the last operation layer in the front-end network 41 is the output of the front-end network 41.
The input to the front-end network 41 is an image. In one example, the size of the input of the front-end network 41 is denoted H x W x 3, where H denotes the height of the input image, W denotes the width of the input image, and 3 denotes the number of channels. The values of H and W may be predefined.
The output of the front-end network 41 is the third feature map. The third feature map is a feature map obtained by transforming the features of the image input to the front-end network 41 to the main direction.
At this point, the pre-configuration process for the front-end network is completed.
After the pre-configuration of the front-end network 41, the input size, the parameter size and the output size of the different operation layers are mutually adapted. Here, "adapted" means that the sizes satisfy the mathematical operational relationships between matrices/tensors; for example, two matrices A and B can be multiplied only if the number of columns of A equals the number of rows of B. Other examples are not listed one by one.
After the pre-configuration process is completed, the computer device may configure initial values for the parameters of the front-end network 41 (e.g., the parameters of its operation layers); for example, the convolution kernel used in each convolutional layer is given initial values. The embodiment of the present application does not limit the initial value of each parameter, which may, for example, be randomly generated.
The basic principle of training the front-end network 41 is as follows: based on the images in the sample image set and the initial values of the parameters of the front-end network 41, training is performed under the constraints of the countermeasure network 44 and the twin network 45 attached to the front-end network 41, until "the third feature map output by the front-end network 41 is a feature map obtained by transforming the features of the input image to the main direction". The parameter values of the front-end network 41 that achieve this are taken as the training result. The connection relationship among the front-end network 41, the countermeasure network 44, and the twin network 45 may be as shown in fig. 9.
The result of this training process is used as the value (i.e., the actual value) of the parameters of the front-end network 41 when the front-end network 41 is used in the image recognition process.
The following describes a method for training the front-end network 41 according to an embodiment of the present application. The execution subject of the method may be a computer device. As shown in fig. 10, the method may include the steps of:
s201: any one image in the sample image set is input into the first neural network 411 as an input image, and the first neural network 411 performs feature extraction on the input image to obtain a first feature map of the input image.
For example, the first neural network 411 performs feature extraction on the input image using the initial values of the parameters of the first neural network, to obtain a first feature map of the input image.
Optionally, the first neural network performs convolution operation on the input image by a preset number of layers to obtain a first feature map of the input image. For example, based on fig. 3, when S201 is executed, the first neural network 411 performs a 4-layer convolution operation on the input image, so as to obtain a first feature map of the input image.
It should be noted that this is merely an example, and in actual implementation, the first neural network 411 may also perform other operations on the input image to obtain the first feature map, which is not limited in this embodiment of the present application.
S202: the first feature map of the input image is input into the second neural network 412 to perform feature extraction on the first feature map, so as to obtain a third feature map. The third feature map may be understood as a feature map obtained by converting features (e.g., texture features) of the input image into the main direction through the processing of the second neural network 412.
For example, the second neural network 412 sequentially performs a convolution operation and a fully-connected operation on the first feature map of the input image, and performs a dot product operation between the result of the fully-connected operation and the first feature map of the input image to obtain the third feature map.
For example, based on fig. 4, when S202 is executed, the second neural network 412 sequentially performs a 2-layer convolution operation and a 1-layer fully-connected operation on the first feature map obtained in fig. 3, and performs a dot product operation between the result of the fully-connected operation and the first feature map to obtain the third feature map. Specifically, after the 2-layer convolution operation and the 1-layer fully-connected operation, the second neural network 412 may obtain h × w matrices of size 2 × 2; using these h × w matrices of size 2 × 2 as the direction matrices of the main directions of the features of the corresponding channels, the features rotated to the main direction are obtained by point-wise multiplication.
Since the point multiplication is differentiable, gradients can be back-propagated through the second neural network 412 during the training of the front-end network 41. Specifically, the front-end network 41 is trained under the constraints of the countermeasure network 44 and the twin network 45 attached to it, and the actual values of the parameters of the front-end network 41 are thereby obtained. The operation of the countermeasure network 44 is explained below in step S203, and the operation of the twin network 45 in step S204.
By way of example, the second neural network 412 may be referred to as a Local Spatial Transform Network (LSTN). Under the learning of the generative adversarial setup, the LSTN design allows local regions to be rectified into their dominant directions, so that the network can converge when trained on samples with large homography transformations.
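As an illustration only, and not the patented implementation, the following sketch shows one possible reading of the LSTN idea: two convolution layers and a position-wise fully-connected (1 × 1) layer predict an h × w grid of 2 × 2 direction matrices, which are then applied to the first feature map by differentiable point-wise products. The layer hyper-parameters, the pairing of channels into 2-vectors, and all names are assumptions.

```python
import torch
import torch.nn as nn

class LocalSpatialTransform(nn.Module):
    """A possible reading of the LSTN idea; not the patented implementation."""

    def __init__(self, channels=64):
        super().__init__()
        assert channels % 2 == 0  # channel pairs are rotated as 2-vectors below
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        # Position-wise "fully connected" layer producing a 2x2 direction matrix
        # (4 numbers) for every spatial position.
        self.fc = nn.Conv2d(channels, 4, kernel_size=1)

    def forward(self, first_feature_map):
        b, c, h, w = first_feature_map.shape
        theta = self.fc(self.conv(first_feature_map))                # (b, 4, h, w)
        theta = theta.permute(0, 2, 3, 1).reshape(b, h, w, 2, 2)
        # Treat each pair of channels as a 2-vector and rotate it with the predicted
        # 2x2 matrix; the multiplication keeps the whole module differentiable.
        feats = first_feature_map.permute(0, 2, 3, 1).reshape(b, h, w, c // 2, 2)
        rotated = torch.einsum("bhwij,bhwcj->bhwci", theta, feats)
        return rotated.reshape(b, h, w, c).permute(0, 3, 1, 2)       # third feature map
```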
S203: the third feature map is used as an input of the countermeasure network 44, and the countermeasure network 44 performs deconvolution operation on the third feature map to obtain a fifth feature map. The size of the fifth feature map is the same as the size of the input image of the previous stage network 41, and is H × W × 3, for example. The countermeasure network 44 then divides the fifth signature into a plurality of data blocks.
Optionally, the countermeasure network 44 performs two-layer deconvolution on the third feature map to obtain a fifth feature map.
Optionally, the countermeasure network 44 divides the fifth feature map into a plurality of data blocks, which may include: the countermeasure network 44 averages the fifth profile into a plurality of data blocks. The size of each data block is not limited in the embodiment of the present application.
Fig. 11 is a schematic diagram of a logical structure of the countermeasure network 44 according to an embodiment of the present application. Fig. 11 is drawn based on fig. 4. Specifically, based on the third feature map obtained in fig. 4 (of size H/4 × W/4 × 64), the countermeasure network 44 performs a two-layer deconvolution on the third feature map, obtaining a feature map of size H/2 × W/2 × 32 and then a feature map of size H × W × 3 (i.e., the fifth feature map). Then, the elements in each H × W layer of the feature map of size H × W × 3 are equally divided into 16 × 16 data blocks.
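As a sketch only (not from the original filing), the deconvolution-and-split step of the countermeasure network could look like the following. The transposed-convolution hyper-parameters, the reading of "16 × 16 data blocks" as blocks of 16 × 16 pixels, and the class name are assumptions.

```python
import torch
import torch.nn as nn

class CountermeasureSketch(nn.Module):
    """Two transposed convolutions back to H x W x 3, then 16 x 16 blocks."""

    def __init__(self):
        super().__init__()
        # H/4 x W/4 x 64 -> H/2 x W/2 x 32 -> H x W x 3
        self.deconv1 = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1)

    def forward(self, third_feature_map):
        fifth = self.deconv2(torch.relu(self.deconv1(third_feature_map)))
        b, c, h, w = fifth.shape                      # expected (b, 3, H, W)
        # Cut each H x W channel plane into non-overlapping 16 x 16 data blocks.
        blocks = fifth.unfold(2, 16, 16).unfold(3, 16, 16)
        return blocks.contiguous().view(b, c, -1, 16, 16)
```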
S204: the plurality of data blocks generated by the countermeasure network 44 are input to the twin network 45. Whether or not the third feature map is the feature map rotated in the main direction is determined by the twin network 45 using the loss function for the constraint. The basic idea behind the twin network 45 is, among other things, to minimize the characteristic distance between matching pairs of similar data blocks while maximizing the characteristic distance of pairs of dissimilar data blocks.
If so, i.e., the determination result is that the third feature map is a feature map transformed to the main direction, the training process for the front-end network 41 ends. Subsequently, the parameter values used in this execution of S201 and S202 may be used as the parameter values of the front-end network in the recognition stage.
If not, i.e., the determination result is that the third feature map is not a feature map transformed to the main direction, the countermeasure network 44 and the twin network 45 may feed relevant information back to the front-end network 41 to assist in adjusting the values of its parameters. After the parameters of the front-end network 41 are adjusted, S201 may be executed again, and this loop continues until, after one or more executions of S204, the determination result is that the third feature map is a feature map transformed to the main direction.
The embodiment of the present application does not limit the specific implementation manner of the countermeasure network 44 and the twin network 45 to assist in adjusting the front-end network 41. For example, reference may be made to the feedback adjustment process in the training process of the values of the parameters of the front-end network 41 in other application scenarios in the prior art, which is not described in detail herein.
Optionally, whether the third feature map is correct is determined by constructing a loss function constraint of the triple tri. The loss function of the triplet is shown in the following formula 1, and the idea is as follows: and minimizing the characteristic distance of the matching pair of the similar image blocks, and simultaneously maximizing the characteristic distance of the matching pair of the dissimilar image blocks, wherein M is an offset value for ensuring model convergence.
Equation 1: l istri(Di,Dj,Dk)=∑i,j,k∈Pmax(0,dist(Di,Dj)-dist(Di,Dk)+M)。
It should be noted that the triplet loss function is used simultaneously on the feature representation and on the countermeasure network for the main direction, so that the distribution of dissimilar feature points is spread out and the nearest-neighbor features can be found more accurately in subsequent matching.
Training extraction network 43
Before training the extraction network 43, the following information may be preconfigured:
the extraction network 43 includes operation layers, the size of the input to the operation layers, the size of the parameter of the operation layers, the size of the output from the operation layers, and the correlation between the operation layers (i.e., which output from the operation layer is the input to which operation layer, etc.). Wherein, the operation layer can include: convolutional layers, or packet-weighted layers, etc. The parameters of the convolutional layers include the number of layers of the convolutional layers, and the size of the convolutional core used for each convolutional layer.
In one implementation, the extraction network 43 includes a packet weighting layer 432.
Packet weighting layer 432 is used to: performing convolution operation on the third feature map by using a 1-channel convolution kernel to obtain X fifth feature maps; the size of the feature direction of the fifth feature map is smaller than that of the third feature map; x is an integer greater than 2; weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map; and performing feature extraction on the sixth feature map to obtain a first score map. The detailed description of the group weighting layer 432 in this implementation can be inferred based on the following implementation, and is not described here again.
In another implementation, the extraction network 43 includes a convolutional layer 431 and a packet-weighted layer 432.
The convolutional layer 431 is used for: and performing convolution operation on the third characteristic diagram to obtain a seventh characteristic diagram. The embodiment of the application does not limit the number of layers of convolution operation, the size of a convolution kernel and the like. Optionally, the purpose of "extracting features from the third feature map to obtain the seventh feature map" is to reduce the dimension perpendicular to the feature direction.
Packet weighting layer 432 is used to: and performing convolution operation on the seventh feature map by using a 1-channel convolution kernel to obtain X fifth feature maps. The dimension of the feature direction of the fifth feature map is smaller than the dimension of the feature direction of the third feature map. X is an integer greater than 2. And weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map. And performing feature extraction on the sixth feature map to obtain a first score map.
A 1-channel convolution kernel is understood to be a convolution kernel whose dimension perpendicular to the feature direction is 1 and whose dimension in the feature direction is X. The dimensions of the sixth feature map are the same as those of the fifth feature map. The purpose of feature extraction of the sixth feature map is to compress the dimension of the sixth feature map in the feature direction to 1. Specifically, the grouping weighting layer 432 may perform one or more layers of convolution operations on the sixth feature map to obtain the first score map. The first score map is a two-dimensional matrix, that is, the dimension of the characteristic direction is 1.
It should be noted that the design of the packet weighting layer (or called packet weighting network) enables the computation of the score map to utilize both local and global information to find local features.
Fig. 12 is a schematic diagram of a logical structure of the extraction network 43 according to an embodiment of the present application. Fig. 12 is drawn based on fig. 4. Specifically, based on the third feature map of size H/4 × W/4 × 64 obtained in fig. 4, the convolutional layer 431 in the extraction network 43 performs a convolution operation on the third feature map, obtaining a seventh feature map of size H/8 × W/8 × 256. The dimension of the seventh feature map in the feature direction is 256, and its dimensions perpendicular to the feature direction are H/8 × W/8. The packet weighting layer 432 in the extraction network 43 performs a convolution operation on the seventh feature map of size H/8 × W/8 × 256 using convolution kernels of size 1 × 1 × 16, obtaining 16 feature maps, each of size H/8 × W/8 × 16. Then, the elements of these 16 feature maps of size H/8 × W/8 × 16 are weighted and summed to obtain a sixth feature map of size H/8 × W/8 × 16; that is, the elements at the same coordinate position in the 16 different H/8 × W/8 × 16 feature maps are weighted and summed to obtain the element at that coordinate position in the sixth feature map. A convolution operation is then performed on the sixth feature map to obtain a first score map of size H/8 × W/8.
The formula for the weighted summation of the elements of the 16 feature maps of size H/8 × W/8 × 16 is shown in equation 2:
Equation 2: s_k = Σ_ij exp(a_ij * p_ij) / Σ_k Σ_ij exp(a_k,ij * p_k,ij)
where s_k represents the channel-by-channel, element-by-element score. Within one channel, i denotes the i-th group and j denotes the j-th element of the i-th group, and the maximum value of k is the number of elements in a single channel; a_ij denotes the weight corresponding to the j-th element of the i-th group in the first channel (the weights are learned by back propagation); a_k,ij denotes the weight corresponding to the j-th element of the i-th group of the k-th channel; and p is the value of the corresponding element.
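The following sketch, which is not part of the original filing, shows one reading of equation 2: exp(a * p) terms with learnable weights a act as softmax-style weights over the X grouped feature maps before the weighted sum that produces the sixth feature map. The normalisation granularity, the shapes, and the names are assumptions.

```python
import torch
import torch.nn as nn

class GroupWeighting(nn.Module):
    """Softmax-style exp(a*p) weights over the X grouped maps, then a weighted sum."""

    def __init__(self, num_groups=16, channels=16):
        super().__init__()
        # One learnable weight a per group and channel, learned by back propagation.
        self.a = nn.Parameter(torch.ones(num_groups, channels, 1, 1))

    def forward(self, fifth_maps):
        """fifth_maps: (B, X, C, H', W'), the X fifth feature maps stacked on dim 1."""
        logits = self.a * fifth_maps                   # a * p, broadcast over H', W'
        weights = torch.softmax(logits, dim=1)         # exp(.) / sum(exp(.)) over the X maps
        sixth_feature_map = (weights * fifth_maps).sum(dim=1)   # (B, C, H', W')
        return sixth_feature_map
```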
It should be noted that, in practical implementation, a feedback constraint using a loss function (not shown in fig. 12) is required during the training of the extraction network 43. For example, the loss function for local feature extraction (i.e., for the extraction network) is shown in equation 3:
Equation 3: L_score(sx, sy) = log(Σ_{h,w} exp(l(sx_hw, sy_hw)))
where l(sx_hw, sy_hw) is a per-position term whose exact form is given by a formula that appears only as an image (BDA0002613141950000143) in the original publication.
sy is the label. Its value is not taken directly from the corresponding pixel position in the data set; instead, the scores on the score map within the corresponding n × n region are considered (n is user-defined; 9 × 9 is the suggested size): the score of each pixel point is obtained through equation 2, and the maximum of the scores within the n × n region is then taken as the score of the current point.
A further expression (shown only as an image, BDA0002613141950000142, in the original publication) gives the score of the point corresponding to each pixel point in the local region (pixel points without a corresponding score are padded with a score of 0.0). sx is the score obtained by forward inference, and sy is the reference score given by the data set; sx_hw is the computed score at row h, column w of the image, and sy_hw is the reference score at row h, column w of the image given by the data set. Equation 3 is a general neural network loss function: the values computed by the neural network are constrained using the reference data in the data set, and each parameter of the neural network is updated through back propagation.
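As a sketch only (not from the original filing), the n × n maximum described above could be implemented with a max-pooling pass. The function name, the tensor layout, and the use of zero padding for positions without a score are assumptions.

```python
import torch.nn.functional as F

def build_label_scores(per_pixel_scores, n=9):
    """per_pixel_scores: (B, 1, H', W') score map; returns labels of the same shape."""
    # Maximum over the n x n neighbourhood of every position; the zero padding plays
    # the role of the 0.0 score assigned to positions without a corresponding score.
    return F.max_pool2d(per_pixel_scores, kernel_size=n, stride=1, padding=n // 2)
```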
Training representation network 42
Prior to training the representation network 42, the following information may be preconfigured:
the operation layers included in the representation network 42, the size of the input of each operation layer, the size of the parameters of each operation layer, the size of the output of each operation layer, and the association relationship between the operation layers (i.e., which operation layer's output serves as which operation layer's input, etc.). The operation layers may include convolutional layers, etc. The parameters of the convolutional layers include the number of convolutional layers and the size of the convolution kernel used by each convolutional layer.
Fig. 13 is a schematic diagram of a logical structure of the representation network 42 according to an embodiment of the present application. Fig. 13 is drawn based on fig. 4. Specifically, based on the third feature map of size H/4 × W/4 × 64 obtained in fig. 4, one convolutional layer in the representation network 42 performs a convolution operation on the third feature map and passes the result to another convolutional layer for a further convolution operation, thereby obtaining the fourth feature map. The size of the fourth feature map may be H/8 × W/8 × 128. That is, the dimensions perpendicular to the feature direction are reduced after processing by the representation network 42, which helps reduce the computational complexity of the subsequent image recognition process and thus improves image recognition efficiency.
It should be noted that, in practical implementation, a feedback constraint using a loss function (not shown in fig. 13) is required during the training of the representation network 42. For example, the local feature representation stage (i.e., the representation network) uses the triplet loss function: similar and dissimilar matching pairs are constructed, and equation 1 is then used to minimize the distance between similar matching pairs while maximizing the distance between dissimilar matching pairs. At this stage, all channels of a single element of the feature map are taken as the feature, i.e., a matrix of dimension 1 × 128.
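As an illustration only (not from the original filing), the two-convolution structure of fig. 13 could be sketched as follows. The kernel sizes, the stride used to go from H/4 to H/8, and the class name are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class RepresentationSketch(nn.Module):
    """Two convolutions: H/4 x W/4 x 64 -> H/8 x W/8 x 128, a 1 x 128 descriptor per position."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1)

    def forward(self, third_feature_map):
        # The spatial size is halved by the first convolution's stride of 2.
        return self.conv2(F.relu(self.conv1(third_feature_map)))   # fourth feature map
```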
It should be noted that, in practical implementation, the loss function of the entire network can be established as shown in equation 4.
Equation 4: (the formula itself appears only as an image, BDA0002613141950000141, in the original publication)
In equation 4, P represents the set of all matching image points, and p and q are points in P, which may be similar points or dissimilar points. The overall loss function is the sum of the loss of the local feature scores (i.e., of the extraction network) and the loss of the feature representation (i.e., of the representation network).
Two further symbols (shown only as images BDA0002613141950000151 and BDA0002613141950000152 in the original publication) denote the scores of points A and B, respectively, where A and B are extracted from the two images related by the homography transformation. The loss function of equation 4 is a global loss computation, which differs from a simple weighted addition of loss functions: through the joint action of similar and dissimilar matching pairs, the scores of similar pairs are cross-multiplied and their proportion over the global range is computed, which strengthens the constraint so that the overall loss is influenced more effectively.
Image recognition phase
In the image recognition stage, the forward reasoning of the network comprises the network structure shown in fig. 2a or fig. 2b, and does not comprise a countermeasure network, a twin network and the like.
Fig. 14 is a schematic flow chart of an image recognition method according to an embodiment of the present application. The method shown in fig. 14 comprises the following steps:
s301: the image recognition device acquires an image to be recognized. For example, the picture-book robot shoots the picture book to obtain the image to be identified.
S302: the image recognition device uses the first neural network to extract the features of the image to be recognized, and a first feature map is obtained.
S303: the image recognition device uses the second neural network to perform feature extraction on the first feature map to obtain a second feature map, and performs point multiplication on the second feature map and the first feature map to obtain a third feature map. And the third feature map represents a feature map obtained after the features of the image to be recognized are transformed to the main direction.
The first neural network may be any one of the trained first neural networks 411 provided above, and the second neural network may be any one of the trained second neural networks 412 provided above.
In one example, the image recognition device performs at least one layer of convolution operation on the first feature map using a second neural network, resulting in a second feature map. The specific implementation process can refer to the relevant steps executed by the computer device.
In another example, the image recognition device performs at least one layer of convolution operation on the first feature map using a second neural network; and performing at least one layer of pooling operation and/or full connection operation on the first feature map after the convolution operation is performed to obtain a second feature map. The specific implementation process can refer to the relevant steps executed by the computer device.
S304: the image recognition device obtains a first score map of the image to be recognized based on the third feature map.
In one example, the image recognition device performs a convolution operation on the third feature map using a 1-channel convolution kernel to obtain X fifth feature maps; the size of the feature direction of the fifth feature map is smaller than that of the third feature map; x is an integer greater than 2; weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map; and performing feature extraction on the sixth feature map to obtain a first score map. The specific implementation process can refer to the relevant steps executed by the computer device.
In one example, the image recognition device performs feature extraction on the third feature map to obtain a seventh feature map; the dimension of the third characteristic diagram perpendicular to the characteristic direction is larger than that of the seventh characteristic diagram perpendicular to the characteristic direction; x is an integer greater than 2; performing convolution operation on the seventh feature map by using a 1-channel convolution kernel to obtain X fifth feature maps; weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map; and performing feature extraction on the sixth feature map to obtain a first score map. The specific implementation process can refer to the relevant steps executed by the computer device.
Optionally, the size of the image to be recognized is larger than that of the first score map.
S305: and the image recognition device recognizes the image to be recognized based on the third feature map and the first score map.
In one example, the third feature size is M1 × N1 × P1, the first score size is M1 × N1, P1 is a feature direction dimension, M1 × N1 is a dimension perpendicular to the feature direction dimension, and M1, N1, and P1 are all positive integers. In this case, the image recognition apparatus recognizes the image to be recognized directly based on the third feature map and the first score map.
In another example, the third feature size is M2 × N2 × P2, the first score size is M1 × N1, P2 is a feature direction dimension, and M1, N1, P1, M2, N2, and P2 are all positive integers. In this case, the image recognition device performs feature extraction on the third feature map to obtain a fourth feature map; wherein the fourth feature size is M1 × N1 × P1; p1 is the dimension of the characteristic direction dimension, P1 is a positive integer; then, identifying the image to be identified based on the fourth feature map and the first score map; wherein the first score plot is the size M1 × N1. Optionally, M1 × N1 < M2 × N2.
As an example of a specific implementation of S305, reference may be made to the following specific example in step S405.
According to the image recognition method provided in this embodiment of the application, the network described above is used, and this network has rotation invariance. Therefore, the features extracted by the image recognition device from an image hardly change at any in-plane rotation angle, and the requirements on how the image to be recognized is placed and photographed are low. In addition, compared with prior-art technical solutions that perform image recognition with a network lacking rotation invariance, this helps improve the accuracy of image recognition.
The following describes an image recognition process provided by an embodiment of the present application, by way of a specific example.
Fig. 15 is a schematic flow chart of another image recognition method according to an embodiment of the present application. The method shown in fig. 15 comprises the following steps:
s401: the image recognition device acquires two images which need to be matched. One of the two images is an image to be recognized, and the other image is a sample image. For example, the image recognition device may be a painted robot.
For example, when the method is applied to a picture book recognition process, the image to be recognized is an image shot by a drawing robot, and the sample image is a certain page of a picture book stored in a predefined picture book database.
S402: the image recognition device scales an image to be recognized to three scales such as (0.5,1,2), respectively inputs the scaled images to a network, and simultaneously inputs a sample image scaled to the same size to the network. Wherein, the network may be the network trained in the training phase. 0.5,1 and 2 represent the zoom factor, respectively.
It should be noted that scaling the image to be recognized to different sizes and performing image recognition based on different sizes are optional steps. Thus, the accuracy of image recognition can be improved.
S403: the image recognition device uses the network to obtain score maps (S1, S2) and feature maps (F1, F2) under different scales through forward reasoning.
For example, in conjunction with fig. 2b, the score maps S1 and S2 can be considered as the first score map obtained after the image to be recognized in S401 is input to the network, and the first score map obtained after the sample image in S401 is input to the network, respectively. The feature maps F1 and F2 can be considered as a fourth feature map obtained after the image to be recognized in S401 is input to the network, and a fourth feature map obtained after the sample image in S401 is input to the network, respectively.
Regarding the working principle of the network in the image recognition stage, reference may be made to the process of training the network, which is not described herein again. It should be noted that, compared with the operation principle of the network in the training process, the network in the image recognition stage does not include a countermeasure network, a twin network, or a network that uses a loss function for feedback adjustment.
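As a sketch only (not from the original filing), the multi-scale forward inference of S402/S403 could be organised as follows; here `network` is assumed to return a (score map, feature map) pair, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_inference(network, image, scales=(0.5, 1.0, 2.0)):
    """image: (1, 3, H, W) tensor; network(img) is assumed to return (score_map, feature_map)."""
    results = {}
    for s in scales:
        scaled = image if s == 1.0 else F.interpolate(
            image, scale_factor=s, mode="bilinear", align_corners=False)
        with torch.no_grad():                  # forward inference only
            score_map, feature_map = network(scaled)
        results[s] = (score_map, feature_map)
    return results
```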
S404: the image recognition apparatus uses an image retrieval technique (refer to the prior art in particular), and performs the following steps based on score maps (S1, S2) at different scales to determine the number of matching pairs of feature points in F1 and F2: for feature point F1 in F1, it corresponds to score S1 > T in S1, where T is the score threshold below which feature points are not considered. The most similar feature F2 to F1 is searched for in F2, for example, two features closest in euclidean distance are taken as the most similar features, where F2 corresponds to a score S2 > T in S2. f1 and f2 are a matched pair of characteristic points.
S405: if the number of the matched feature point pairs in the F1 and the F2 is greater than or equal to the preset threshold, the image recognition device is used for obtaining the sample image used in the F2 as the recognition result of the image to be recognized. Otherwise, the sample image is updated and S401-S405 are re-executed.
The embodiment provides a specific application example of the image recognition method, and the practical implementation is not limited to this.
The scheme provided by the embodiment of the application is mainly introduced from the perspective of a method. To implement the above functions, it includes hardware structures and/or software modules for performing the respective functions. Those of skill in the art would readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the image recognition apparatus may be divided into the functional modules according to the method example, for example, each functional module may be divided according to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.
As shown in fig. 16, fig. 16 is a schematic structural diagram of an image recognition device 160 according to an embodiment of the present application. The image recognition device 160 is configured to execute the image recognition method described above, for example, the image recognition method shown in fig. 14. For example, the image recognition device 160 may include a first acquiring unit 1601, a feature extracting unit 1602, a second acquiring unit 1603, and an identifying unit 1604.
A first acquiring unit 1601 is used for acquiring an image to be recognized. A feature extraction unit 1602, configured to perform feature extraction on an image to be identified by using a first neural network, so as to obtain a first feature map; performing feature extraction on the first feature map by using a second neural network to obtain a second feature map, and performing point multiplication on the second feature map and the first feature map to obtain a third feature map; and the third feature map represents a feature map obtained after the features of the image to be recognized are transformed to the main direction. A second obtaining unit 1603, configured to obtain a first score map of the image to be recognized based on the third feature map. An identifying unit 1604, configured to identify the image to be identified based on the third feature map and the first score map.
As an example, the first neural network may be the first neural network 411 above and the second neural network may be the second neural network 412 above. In conjunction with fig. 14, the first acquiring unit 1601 may perform S301, the feature extracting unit 1602 may perform S302 and S303, the second acquiring unit 1603 may perform S304, and the identifying unit 1604 may perform S305.
Optionally, the feature extraction unit 1602 is specifically configured to: and performing at least one layer of convolution operation on the first feature map by using a second neural network to obtain a second feature map.
Optionally, the feature extraction unit 1602 is specifically configured to: performing at least one layer of convolution operation on the first feature map using a second neural network; and performing at least one layer of pooling operation and/or full connection operation on the first feature map after the convolution operation is performed to obtain a second feature map.
Optionally, the third feature size is M1 × N1 × P1, the first score size is M1 × N1, P1 is a feature direction dimension, M1 × N1 is a dimension perpendicular to the feature direction dimension, and M1, N1, and P1 are all positive integers.
Optionally, the size of the third feature is M2 × N2 × P2, the size of the first score is M1 × N1, P2 is the size of the feature direction dimension, and M1, N1, P1, M2, N2, and P2 are all positive integers; the identifying unit 1604 is specifically configured to: extracting the features of the third feature map to obtain a fourth feature map; wherein the fourth feature size is M1 × N1 × P1; p1 is the dimension of the characteristic direction dimension, P1 is a positive integer; identifying the image to be identified based on the fourth feature map and the first score map; wherein the first score plot is the size M1 × N1.
Optionally, M1 × N1 < M2 × N2.
Optionally, the second obtaining unit 1603 is specifically configured to: performing convolution operation on the third feature map by using a 1-channel convolution kernel to obtain X fifth feature maps; the size of the feature direction of the fifth feature map is smaller than that of the third feature map; x is an integer greater than 2; weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map; and performing feature extraction on the sixth feature map to obtain a first score map.
Optionally, the second obtaining unit 1603 is specifically configured to: extracting the features of the third feature map to obtain a seventh feature map; the dimension of the third characteristic diagram perpendicular to the characteristic direction is larger than that of the seventh characteristic diagram perpendicular to the characteristic direction; x is an integer greater than 2; performing convolution operation on the seventh feature map by using a 1-channel convolution kernel to obtain X fifth feature maps; weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map; and performing feature extraction on the sixth feature map to obtain a first score map.
Optionally, the size of the image to be recognized is larger than that of the first score map.
For the detailed description of the above alternative modes, reference may be made to the foregoing method embodiments, which are not described herein again. In addition, for any explanation and beneficial effect description of the image recognition apparatus 160 provided above, reference may be made to the corresponding method embodiment described above, and details are not repeated.
As an example, in conjunction with fig. 1, the functions implemented by the first acquiring unit 1601, the feature extracting unit 1602, the second acquiring unit 1603, and the identifying unit 1604 in the image recognition apparatus 160 may be implemented by the processor 101 in fig. 1 executing the program code in the memory 102 in fig. 1.
The embodiment of the present application further provides a chip system, as shown in fig. 17, where the chip system includes at least one processor 111 and at least one interface circuit 112. By way of example, when the chip system 110 includes one processor and one interface circuit, the one processor may be the processor 111 shown in the solid-line block in fig. 17 (or the processor 111 shown in the dashed-line block), and the one interface circuit may be the interface circuit 112 shown in the solid-line block in fig. 17 (or the interface circuit 112 shown in the dashed-line block). When the chip system 110 includes two processors and two interface circuits, the two processors include the processor 111 shown in the solid-line block in fig. 17 and the processor 111 shown in the dashed-line block, and the two interface circuits include the interface circuit 112 shown in the solid-line block in fig. 17 and the interface circuit 112 shown in the dashed-line block. This is not limited.
The processor 111 and the interface circuit 112 may be interconnected by wires. For example, the interface circuit 112 may be used to receive signals (e.g., from a vehicle speed sensor or an edge service unit). As another example, the interface circuit 112 may be used to send signals to other devices, such as the processor 111. Illustratively, interface circuitry 112 may read instructions stored in a memory and send the instructions to processor 111. The instructions, when executed by the processor 111, may cause the image recognition apparatus to perform the various steps in the embodiments described above. Of course, the chip system may further include other discrete devices, which is not specifically limited in this embodiment of the present application.
Another embodiment of the present application further provides a computer-readable storage medium, which stores instructions that, when executed on an image recognition apparatus, cause the image recognition apparatus to perform the steps performed by the image recognition apparatus in the method flow shown in the above method embodiment.
In some embodiments, the disclosed methods may be implemented as computer program instructions encoded on a computer-readable storage medium in a machine-readable format or encoded on other non-transitory media or articles of manufacture.
Fig. 18 schematically illustrates a conceptual partial view of a computer program product comprising a computer program for executing a computer process on a computing device provided by an embodiment of the application.
In one embodiment, the computer program product is provided using a signal bearing medium 120. The signal bearing medium 120 may include one or more program instructions that, when executed by one or more processors, may provide the functions or portions of the functions described above with respect to fig. 14. Thus, for example, one or more features described with reference to S401-S405 in fig. 14 may be undertaken by one or more instructions associated with the signal bearing medium 120. Further, the program instructions in FIG. 18 also describe example instructions.
In some examples, signal bearing medium 120 may comprise a computer readable medium 121, such as, but not limited to, a hard disk drive, a Compact Disc (CD), a Digital Video Disc (DVD), a digital tape, a memory, a read-only memory (ROM), a Random Access Memory (RAM), or the like.
In some embodiments, signal bearing medium 120 may comprise a computer recordable medium 122 such as, but not limited to, a memory, a read/write (R/W) CD, a R/W DVD, and the like.
In some implementations, the signal bearing medium 120 may include a communication medium 123 such as, but not limited to, a digital and/or analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.).
The signal bearing medium 120 may be conveyed by a wireless form of communication medium 123, such as a wireless communication medium that complies with the IEEE 802.11 standard or other transmission protocol. The one or more program instructions may be, for example, computer-executable instructions or logic-implementing instructions.
In some examples, an image recognition device such as that described with respect to fig. 14 may be configured to provide various operations, functions, or actions in response to being programmed by one or more of the program instructions in the computer-readable medium 121, the computer-recordable medium 122, and/or the communication medium 123.
It should be understood that the arrangements described herein are for illustrative purposes only. Thus, those skilled in the art will appreciate that other arrangements and other elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used instead, and that some elements may be omitted altogether depending upon the desired results. In addition, many of the described elements are functional entities that may be implemented as discrete or distributed components or in conjunction with other components, in any suitable combination and location.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented using a software program, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The processes or functions according to the embodiments of the present application are generated in whole or in part when the computer-executable instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). Computer-readable storage media can be any available media that can be accessed by a computer or can comprise one or more data storage devices, such as servers, data centers, and the like, that can be integrated with the media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (20)

1. An image recognition method, comprising:
acquiring an image to be identified;
performing feature extraction on the image to be identified by using a first neural network to obtain a first feature map;
performing feature extraction on the first feature map by using a second neural network to obtain a second feature map, and performing point multiplication on the second feature map and the first feature map to obtain a third feature map; the third feature map represents a feature map obtained after the features of the image to be recognized are transformed to the main direction;
obtaining a first score map of the image to be identified based on the third feature map;
and identifying the image to be identified based on the third feature map and the first score map.
2. The method of claim 1, wherein the feature extracting the first feature map using a second neural network to obtain a second feature map comprises:
and performing at least one layer of convolution operation on the first feature map by using the second neural network to obtain the second feature map.
3. The method of claim 1, wherein the feature extracting the first feature map using a second neural network to obtain a second feature map comprises:
performing at least one layer of convolution operations on the first feature map using the second neural network;
and performing at least one layer of pooling operation and/or full connection operation on the first feature map after the convolution operation is performed to obtain the second feature map.
4. The method according to any one of claims 1 to 3,
the dimension of the third feature map is M1*N1*P1, the size of the first score map is M1*N1, P1 is the size of the feature direction dimension, said M1*N1 is the dimension perpendicular to the feature direction, and M1, N1 and P1 are all positive integers.
5. The method according to any one of claims 1 to 3, wherein the dimension of the third feature map is M2*N2*P2, the size of the first score map is M1*N1, P2 is the size of the feature direction dimension, and M1, N1, P1, M2, N2 and P2 are all positive integers;
the identifying the image to be identified based on the third feature map and the first score map comprises:
performing feature extraction on the third feature map to obtain a fourth feature map; wherein the size of the fourth feature map is M1*N1*P1, P1 is the size of the feature direction dimension, and P1 is a positive integer;
identifying the image to be identified based on the fourth feature map and the first score map; wherein the size of the first score map is M1*N1.
6. The method of claim 5, wherein M1*N1 < M2*N2.
7. The method according to any one of claims 1 to 6, wherein the obtaining a first score map of the image to be recognized based on the third feature map comprises:
performing convolution operation on the third feature map by using a 1-channel convolution kernel to obtain X fifth feature maps; wherein the dimension of the fifth feature map in the feature direction is smaller than the dimension of the third feature map in the feature direction; x is an integer greater than 2;
weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map;
and performing feature extraction on the sixth feature map to obtain the first score map.
8. The method according to any one of claims 1 to 6, wherein the obtaining a first score map of the image to be recognized based on the third feature map comprises:
performing feature extraction on the third feature map to obtain a seventh feature map; wherein the dimension of the third feature map perpendicular to the feature direction is larger than the dimension of the seventh feature map perpendicular to the feature direction; x is an integer greater than 2;
performing convolution operation on the seventh feature map by using a 1-channel convolution kernel to obtain X fifth feature maps;
weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map;
and performing feature extraction on the sixth feature map to obtain the first score map.
9. The method according to any one of claims 1 to 8, characterized in that the size of the image to be recognized is larger than the size of the first score map.
10. An image recognition apparatus, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a control unit, wherein the first acquisition unit is used for acquiring an image to be identified;
the characteristic extraction unit is used for extracting the characteristics of the image to be identified by using a first neural network to obtain a first characteristic diagram; performing feature extraction on the first feature map by using a second neural network to obtain a second feature map, and performing point multiplication on the second feature map and the first feature map to obtain a third feature map; the third feature map represents a feature map obtained after the features of the image to be recognized are transformed to the main direction;
the second obtaining unit is used for obtaining a first score map of the image to be identified based on the third feature map;
and the identification unit is used for identifying the image to be identified based on the third feature map and the first score map.
11. The apparatus of claim 10,
the feature extraction unit is specifically configured to: and performing at least one layer of convolution operation on the first feature map by using the second neural network to obtain the second feature map.
12. The apparatus according to claim 10, wherein the feature extraction unit is specifically configured to:
performing at least one layer of convolution operations on the first feature map using the second neural network;
and performing at least one layer of pooling operation and/or full connection operation on the first feature map after the convolution operation is performed to obtain the second feature map.
13. The apparatus according to any one of claims 10 to 12,
the third feature size is M1 × N1 × P1, the first score size is M1 × N1, P1 is the feature direction dimension, M1 × N1 is the dimension perpendicular to the feature direction dimension, and M1, N1, and P1 are all positive integers.
14. The apparatus of any one of claims 10 to 12, wherein the third feature size is M2 × N2 × P2, the first score size is M1 × N1, P2 is the feature direction dimension size, and M1, N1, P1, M2, N2, and P2 are all positive integers; the identification unit is specifically configured to:
performing feature extraction on the third feature map to obtain a fourth feature map; wherein the fourth feature size is M1 × N1 × P1; P1 is the size of the feature direction dimension, and P1 is a positive integer;
identifying the image to be identified based on the fourth feature map and the first score map; wherein the first score plot is of size M1 × N1.
15. The device of claim 14, wherein M1 × N1 < M2 × N2.
16. The apparatus according to any one of claims 10 to 15, wherein the second obtaining unit is specifically configured to:
performing convolution operation on the third feature map by using a 1-channel convolution kernel to obtain X fifth feature maps; wherein the dimension of the fifth feature map in the feature direction is smaller than the dimension of the third feature map in the feature direction; x is an integer greater than 2;
weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map;
and performing feature extraction on the sixth feature map to obtain the first score map.
17. The apparatus according to any one of claims 10 to 15, wherein the second obtaining unit is specifically configured to:
performing feature extraction on the third feature map to obtain a seventh feature map; wherein the dimension of the third feature map perpendicular to the feature direction is larger than the dimension of the seventh feature map perpendicular to the feature direction; x is an integer greater than 2;
performing convolution operation on the seventh feature map by using a 1-channel convolution kernel to obtain X fifth feature maps;
weighting and summing the elements of the X fifth feature maps to obtain a sixth feature map;
and performing feature extraction on the sixth feature map to obtain the first score map.
18. The apparatus according to any one of claims 10 to 17, wherein the size of the image to be recognized is larger than the size of the first score map.
19. An image recognition apparatus, comprising: a memory for storing a computer program and a processor for invoking the computer program to perform the method of any of claims 1-9.
20. A computer-readable storage medium, in which a computer program is stored which, when run on a computer, causes the computer to carry out the method of any one of claims 1 to 9.
CN202010761239.6A 2020-07-31 2020-07-31 Image recognition method and device Pending CN112084849A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010761239.6A CN112084849A (en) 2020-07-31 2020-07-31 Image recognition method and device
PCT/CN2021/109680 WO2022022695A1 (en) 2020-07-31 2021-07-30 Image recognition method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010761239.6A CN112084849A (en) 2020-07-31 2020-07-31 Image recognition method and device

Publications (1)

Publication Number Publication Date
CN112084849A true CN112084849A (en) 2020-12-15

Family

ID=73735207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010761239.6A Pending CN112084849A (en) 2020-07-31 2020-07-31 Image recognition method and device

Country Status (2)

Country Link
CN (1) CN112084849A (en)
WO (1) WO2022022695A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065585A (en) * 2021-03-23 2021-07-02 北京亮亮视野科技有限公司 Training method and device of image synthesis model and electronic equipment
WO2022022695A1 (en) * 2020-07-31 2022-02-03 华为技术有限公司 Image recognition method and apparatus
CN114155492A (en) * 2021-12-09 2022-03-08 华电宁夏灵武发电有限公司 High-altitude operation safety belt hanging rope high-hanging low-hanging use identification method and device and electronic equipment
CN114185431A (en) * 2021-11-24 2022-03-15 安徽新华传媒股份有限公司 Intelligent media interaction method based on MR technology
CN117765779A (en) * 2024-02-20 2024-03-26 厦门三读教育科技有限公司 Child drawing intelligent guide reading method and system based on twin neural network

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842220B (en) * 2022-03-24 2024-02-27 西北工业大学 Unmanned aerial vehicle visual positioning method based on multi-source image matching
CN117197487B (en) * 2023-09-05 2024-04-12 东莞常安医院有限公司 Immune colloidal gold diagnosis test strip automatic identification system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105678311A (en) * 2016-01-12 2016-06-15 北京环境特性研究所 Spatial target ISAR image processing method for template identification
CN108229497B (en) * 2017-07-28 2021-01-05 北京市商汤科技开发有限公司 Image processing method, image processing apparatus, storage medium, computer program, and electronic device
US10282589B2 (en) * 2017-08-29 2019-05-07 Konica Minolta Laboratory U.S.A., Inc. Method and system for detection and classification of cells using convolutional neural networks
CN109978077B (en) * 2019-04-08 2021-03-12 南京旷云科技有限公司 Visual recognition method, device and system and storage medium
CN112084849A (en) * 2020-07-31 2020-12-15 华为技术有限公司 Image recognition method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022022695A1 (en) * 2020-07-31 2022-02-03 华为技术有限公司 Image recognition method and apparatus
CN113065585A (en) * 2021-03-23 2021-07-02 北京亮亮视野科技有限公司 Training method and device of image synthesis model and electronic equipment
CN114185431A (en) * 2021-11-24 2022-03-15 安徽新华传媒股份有限公司 Intelligent media interaction method based on MR technology
CN114185431B (en) * 2021-11-24 2024-04-02 安徽新华传媒股份有限公司 Intelligent media interaction method based on MR technology
CN114155492A (en) * 2021-12-09 2022-03-08 华电宁夏灵武发电有限公司 High-altitude operation safety belt hanging rope high-hanging low-hanging use identification method and device and electronic equipment
CN117765779A (en) * 2024-02-20 2024-03-26 厦门三读教育科技有限公司 Child drawing intelligent guide reading method and system based on twin neural network
CN117765779B (en) * 2024-02-20 2024-04-30 厦门三读教育科技有限公司 Child drawing intelligent guide reading method and system based on twin neural network

Also Published As

Publication number Publication date
WO2022022695A1 (en) 2022-02-03

Similar Documents

Publication Publication Date Title
CN112084849A (en) Image recognition method and device
Jiang et al. Robust feature matching for remote sensing image registration via linear adaptive filtering
CN109446889B (en) Object tracking method and device based on twin matching network
AU2020104423A4 (en) Multi-View Three-Dimensional Model Retrieval Method Based on Non-Local Graph Convolutional Network
CN110363817B (en) Target pose estimation method, electronic device, and medium
CN112288011B (en) Image matching method based on self-attention deep neural network
CN111627065A (en) Visual positioning method and device and storage medium
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
CN113196289A (en) Human body action recognition method, human body action recognition system and device
CN107329962B (en) Image retrieval database generation method, and method and device for enhancing reality
CN111709980A (en) Multi-scale image registration method and device based on deep learning
JP2011508323A (en) Permanent visual scene and object recognition
CN110852311A (en) Three-dimensional human hand key point positioning method and device
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN110163095B (en) Loop detection method, loop detection device and terminal equipment
CN114005169B (en) Face key point detection method and device, electronic equipment and storage medium
Hutchcroft et al. CoVisPose: Co-visibility pose transformer for wide-baseline relative pose estimation in 360∘ indoor panoramas
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
KR101715782B1 (en) Object recognition system and method the same
CN112149528A (en) Panorama target detection method, system, medium and equipment
CN111414823A (en) Human body feature point detection method and device, electronic equipment and storage medium
CN114820755B (en) Depth map estimation method and system
Ocegueda-Hernandez et al. A lightweight convolutional neural network for pose estimation of a planar model
CN107341151B (en) Image retrieval database generation method, and method and device for enhancing reality
CN113962846A (en) Image alignment method and device, computer readable storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination