CN115840417A - Target identification method, device and storage medium based on artificial intelligence


Info

Publication number
CN115840417A
Authority
CN
China
Prior art keywords
image
commodity
target
feature vector
shelf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011011330.2A
Other languages
Chinese (zh)
Inventor
谭文军
郑思远
邵长东
高倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ecovacs Commercial Robotics Co Ltd
Original Assignee
Ecovacs Commercial Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ecovacs Commercial Robotics Co Ltd filed Critical Ecovacs Commercial Robotics Co Ltd
Priority to CN202011011330.2A priority Critical patent/CN115840417A/en
Publication of CN115840417A publication Critical patent/CN115840417A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a target identification method, device, and storage medium based on artificial intelligence. In an embodiment of the present application, shelf images captured by a robot while it moves along a shelf may be acquired. In the target detection stage, the shelf image is input into a target detection model to obtain spatial information of a detection frame that labels a target in the shelf image; in the target identification stage, a local image corresponding to the detection frame is extracted from the shelf image according to the spatial information, which includes a rotation angle, and the local image is input into a target recognition model to recognize the target object it contains. This realizes automatic recognition of target objects, improves target recognition efficiency, and improves the efficiency of commodity counting based on the recognition results.

Description

Target identification method, device and storage medium based on artificial intelligence
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a storage medium for identifying a target based on artificial intelligence.
Background
In large supermarkets and shopping malls, goods on shelves are usually counted manually. Because such stores cover a large area and carry thousands of product types, counting product types and quantities is time-consuming, labor-intensive, and inefficient.
Disclosure of Invention
Aspects of the present disclosure provide an artificial-intelligence-based target recognition method, device, and storage medium to improve the accuracy of target recognition.
The embodiment of the application provides a target identification method based on artificial intelligence, which comprises the following steps: acquiring a shelf image captured by a robot while it moves along a shelf; inputting the shelf image into a target detection model to obtain spatial information of a first detection frame that labels a target in the shelf image; extracting a local image corresponding to the first detection frame from the shelf image according to the spatial information of the first detection frame; and inputting the local image into a target recognition model to recognize a target object contained in the local image.
An embodiment of the present application further provides a robot, including: a machine body provided with a camera, a memory, and a processor; the memory is used for storing a computer program;
the camera is used for collecting shelf images in the process that the robot moves along the shelf;
the processor is coupled to the memory and executes the computer program for: inputting the shelf image into a target detection model to obtain spatial information of a first detection frame that labels a target in the shelf image; extracting a local image corresponding to the first detection frame from the shelf image according to the spatial information of the first detection frame; and inputting the local image into a target recognition model to recognize a target object contained in the local image.
An embodiment of the present application further provides a computer device, including: a memory and a processor; the memory is used for storing a computer program;
the processor is coupled to the memory for executing the computer program for performing the steps in the artificial intelligence based object recognition method described above.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the artificial intelligence based object recognition method.
In an embodiment of the present application, shelf images captured by a robot while it moves along a shelf may be acquired. In the target detection stage, the shelf image is input into a target detection model to obtain spatial information of a detection frame that labels a target in the shelf image; in the target identification stage, a local image corresponding to the detection frame is extracted from the shelf image according to the spatial information, which includes a rotation angle, and the local image is input into the target recognition model to recognize the target object it contains. This realizes automatic recognition of target objects, improves target recognition efficiency, and improves the efficiency of commodity counting based on the recognition results.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1a is a block diagram of a hardware structure of a robot according to an embodiment of the present disclosure;
fig. 1b is a schematic view of a scene in which a robot collects shelf images according to an embodiment of the present application;
FIG. 1c is a schematic diagram of a shelf image captured by a robot according to an embodiment of the present disclosure;
fig. 1d is a schematic diagram of a target detection effect provided in the embodiment of the present application;
FIG. 1e is a schematic structural diagram of a multi-angle image capturing system according to an embodiment of the present application;
fig. 2a and fig. 2b are schematic flow charts of an artificial intelligence based target identification method provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Aiming at the technical problem that existing target recognition accuracy is low, in some embodiments of the present application, shelf images captured by a robot while it moves along a shelf may be acquired. In the target detection stage, the shelf image is input into a target detection model to obtain spatial information of a detection frame that labels a target in the shelf image; in the target identification stage, a local image corresponding to the detection frame is extracted from the shelf image according to the spatial information, which includes a rotation angle, and the local image is input into the target recognition model to recognize the target object it contains. This realizes automatic recognition of target objects, improves target recognition efficiency, and improves the efficiency of commodity counting based on the recognition results.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be noted that: like reference numerals refer to like objects in the following figures and embodiments, and thus, once an object is defined in one figure or embodiment, further discussion thereof is not required in subsequent figures and embodiments.
Fig. 1a is a block diagram of a hardware structure of a robot according to an exemplary embodiment of the present disclosure. As shown in fig. 1a, the robot 100 includes: a machine body 101; the machine body 101 is provided with a processor 102 and a memory 103 for storing computer instructions. In addition, the machine body 101 is provided with a camera 104.
It is worth noting that there may be one or more processors 102 and one or more memories 103, where "more" means two or more. In this embodiment, the processor 102 and the memory 103 may be disposed inside the machine body 101 or on its surface, while the camera 104 is disposed on the surface of the machine body.
The machine body 101 is the actuator of the robot 100 and can perform operations designated by the processor 102 in a given environment. The machine body 101 also determines, to some extent, the appearance of the robot 100, which is not limited in this embodiment. For example, the robot 100 may be a humanoid robot as shown in fig. 1b, in which case the machine body 101 may include, but is not limited to, mechanical structures such as a head, hands, wrists, arms, a waist, and a base. The robot 100 may also be a non-humanoid robot, in which case the machine body 101 is simply the main body of the robot 100.
It should be noted that some basic components of the robot 100, such as a driving component, an odometer, a power supply component, and an audio component, are also disposed on the machine body 101. Optionally, the driving component may include driving wheels, a driving motor, universal wheels, and the like. The basic components included, and their configurations, differ from robot to robot, and the embodiments of the present application give only some examples.
The memory 103 is mainly used for storing one or more computer instructions, which can be executed by the processor 102, so that the processor 102 controls the robot 100 to implement corresponding functions, and complete corresponding actions or tasks. In addition to storing computer instructions, the memory 103 may also be configured to store other various data to support operations on the robot 100. Examples of such data include instructions for any application or method operating on the robot 100, an environment map corresponding to the environment in which the robot 100 is located, and so forth. The environment map may be one or more maps corresponding to the whole environment stored in advance, or may be a partial map being constructed before.
The memory 103 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
The processor 102, which may be considered a control system of the robot 100, may be configured to execute computer instructions stored in the memory 103 to control the robot 100 to implement corresponding functions and perform corresponding actions or tasks. In this embodiment, the robot 100 may move autonomously, and may complete a certain task based on the autonomous movement. For example, in a shopping scene such as a supermarket, a mall, etc., the robot 100 may count goods on a shelf, etc. As another example, in some warehouse sorting scenarios, a sorting robot may sort goods, and the like.
In this embodiment, whether the robot 100 is counting shelf goods or sorting goods, it first needs to detect and identify the goods. To enable this, as shown in fig. 1b, the processor 102 may control the robot 100 to move along the shelf and control the camera 104 to capture shelf images during that movement; the camera 104 acquires, in real time, shelf images at the robot's current position as the robot 100 moves along the shelf. In this embodiment, a shelf image contains an image of the goods placed on the shelf.
Further, the quality of the shelf image affects the subsequent commodity detection and identification. In the prior art, shelf images are usually captured with a handheld device or a fixed camera: the shooting angle of a handheld device is not fixed, so the captured images tilt easily, while a fixed camera is generally mounted high up, giving a wide viewing angle but a long distance from the goods, so no high-definition picture can be obtained. Both cases reduce the accuracy of commodity identification.
In order to solve the above problem, the processor 102 in the robot 100 may keep the relative position between the acquisition field of view of the camera 104 and the goods on the shelf stable. Optionally, the processor 102 may keep the acquisition field of view of the camera 104 aimed directly at the shelf goods. Further, the processor 102 may control the robot 100 to move in a direction parallel to the shelf and control the camera 104 to capture shelf images during the movement. Keeping the moving direction of the robot 100 parallel to the shelf helps ensure that the shooting angle of the camera 104 does not tilt, which improves the quality of the shelf images collected by the camera 104 and, in turn, the accuracy of subsequent commodity detection and identification in those images.
Alternatively, the processor 102 may determine the location distribution of the shelves in the environment map according to the known environment map; and planning a moving path parallel to the shelf for the robot 100 according to the position distribution of the shelf in the environment map, and controlling the robot 100 to move along the moving path parallel to the shelf.
Further, the robot 100 may automatically adjust its distance from the shelf, ensuring that the camera 104 may capture images of the entire shelf. Optionally, the robot 100 may automatically adjust the height of the camera 104 and the distance between the robot 100 and the shelf so that the camera 104 may capture an image of the entire shelf.
Correspondingly, the processor 102 may obtain a shelf image acquired by the camera 104 and input it into the target detection model to obtain spatial information of the detection frame that labels targets in the shelf image. In practical applications, when performing target detection on an image, a rectangular detection frame is usually used to mark a target object contained in the image. The spatial information of the rectangular detection frame is its spatial information on the image to be processed, and reflects the spatial distribution of the target object in the image. The spatial information of the rectangular detection frame includes the center position and size of the detection frame. Optionally, the center position may be represented by the center coordinates of the rectangular detection frame, and the size by its width and height, so that the spatial information of the detection frame may be expressed as (x, y, w, h), where (x, y) are the center coordinates of the rectangular detection frame, i.e., the coordinates of its center in the image to be detected, and w and h are its width and height, respectively. Alternatively, the spatial information of the rectangular detection frame may include the vertex coordinates of the rectangular detection frame, and so on. Both the center coordinates and the vertex coordinates are coordinates on the image to be processed.
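As a concrete illustration of the two equivalent representations above, the following is a minimal Python sketch (the class and method names are hypothetical, not from the patent):

```python
# Axis-aligned detection box as (x, y, w, h): center coordinates plus
# width and height, with a conversion to diagonal vertex coordinates.
from dataclasses import dataclass

@dataclass
class DetectionBox:
    x: float  # center x on the image to be processed
    y: float  # center y on the image to be processed
    w: float  # width of the rectangular detection frame
    h: float  # height of the rectangular detection frame

    def vertices(self):
        """Return (x1, y1, x2, y2): the top-left and bottom-right vertices."""
        return (self.x - self.w / 2, self.y - self.h / 2,
                self.x + self.w / 2, self.y + self.h / 2)

box = DetectionBox(x=320.0, y=240.0, w=80.0, h=120.0)
print(box.vertices())  # (280.0, 180.0, 360.0, 300.0)
```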
Further, the processor 102 may extract a local image corresponding to the target detection frame from the shelf image according to the spatial information of the target detection frame; and the local image is input into the target recognition model to recognize the target object contained in the local image, so that the automatic recognition of the target object is realized, the target recognition efficiency is improved, and the commodity statistics efficiency based on the target recognition result is improved.
In the embodiment of the present application, it is considered that items on a shelf may be placed at a tilt. For example, in shopping scenes such as malls and supermarkets, a customer who picks up an item may not put it back in its original position or orientation, leaving the item tilted, such as the pencil case A shown in fig. 1b and 1c (fig. 1c is a schematic diagram of one frame of shelf image acquired by the robot 100 while moving along the shelf). However, as shown in the left diagram of fig. 1d, for a tilted target object, the local image extracted from the spatial information of a rectangular detection frame without angle information often contains too much background noise, which degrades the accuracy of target identification based on that local image.
In order to solve the above problem, in this embodiment, rotation-angle detection is introduced into the target detection model, so that the spatial information of the target detection frame output by the model includes a rotation angle. This rotation angle is the angle by which the detection frame is rotated relative to an upright, axis-aligned detection frame. The target detection frame is a rectangular frame, and its spatial information may include the center coordinates, size, and rotation angle of the frame, expressed as (x, y, w, h, θ), where θ is the rotation angle of the target detection frame. Alternatively, the spatial information of the target detection frame may include its vertex coordinates and rotation angle.
Based on this target detection model, the processor 102 may input the acquired shelf image into the target detection model to obtain spatial information of the target detection frame that labels a target in the shelf image, where the spatial information includes either the center coordinates, size, and rotation angle of the target detection frame, or its vertex coordinates and rotation angle. Further, the processor 102 may extract the local image corresponding to the target detection frame from the shelf image according to this spatial information including the rotation angle; the local image then also has a rotation angle, consistent with the tilt angle of the target object it contains. The extracted local image is shown in the right diagram of fig. 1d (the dotted frame is the target detection frame): the background noise around the target object is reduced, which improves the accuracy of target identification based on the local image.
Alternatively, the processor 102 may determine the position of the image region marked by the target detection frame in the shelf image according to the spatial information including the rotation angle. The specific calculation is:

$$x'_i = (x_i - x)\cos\theta - (y_i - y)\sin\theta + x,\qquad y'_i = (x_i - x)\sin\theta + (y_i - y)\cos\theta + y,\quad i = 1, 2 \tag{1}$$

In formula (1), (x, y) are the center coordinates of the target detection frame, i.e., the coordinates of the frame center in the shelf image; (x_1, y_1) and (x_2, y_2) are the positions of the two diagonal vertices when the target detection frame is axis-aligned; and (x'_1, y'_1) and (x'_2, y'_2) are the positions of those two vertices after the target detection frame is rotated by the angle θ.
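The following Python sketch implements formula (1) (the function name is hypothetical); it rotates the two diagonal vertices of the axis-aligned frame around the frame center by θ:

```python
# Formula (1): rotate the diagonal vertices (x1, y1), (x2, y2) of the
# axis-aligned target detection frame around its center (x, y) by theta.
import numpy as np

def rotated_vertices(x, y, w, h, theta):
    """theta in radians; returns the rotated diagonal vertices."""
    pts = np.array([[x - w / 2, y - h / 2],   # (x1, y1)
                    [x + w / 2, y + h / 2]])  # (x2, y2)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s],
                    [s,  c]])                 # 2-D rotation matrix
    return (pts - [x, y]) @ rot.T + [x, y]    # (x'_i, y'_i), i = 1, 2

print(rotated_vertices(320.0, 240.0, 80.0, 120.0, np.deg2rad(15)))
```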
Further, since the partial image has a rotation angle, in the present embodiment, in order to implement target recognition on the partial image with the rotation angle, a target recognition model supporting multi-angle target recognition may also be provided. Accordingly, the processor 102 may input the partial image having the rotation angle into a target recognition model supporting multi-angle target recognition to recognize the target object included in the partial image.
The robot provided by this embodiment can move along a shelf and capture shelf images during the movement. In the target detection stage, the shelf image is input into a target detection model to obtain spatial information, including a rotation angle, of the detection frame that labels targets in the shelf image; extracting the local image corresponding to this rotated detection frame reduces the background noise around the target object in the extracted local image, which improves the accuracy of subsequently identifying that target object. In the target identification stage, the local image corresponding to the detection frame is extracted from the shelf image according to the spatial information including the rotation angle, so the local image itself has a rotation angle; it is then input into a target recognition model that supports multi-angle target recognition, which can recognize the target object without first rectifying the local image, thereby improving target recognition efficiency.
It is worth noting that, in the embodiment of the present application, the processor 102 may also train the target detection model before inputting shelf images into it. Optionally, the processor 102 may acquire multi-angle and multi-distance images of a sample shelf on which multiple commodities are placed, as the sample image set. Multi-angle images of the sample shelf are images acquired from several shooting angles; optionally, they include a top-view image, a head-on (eye-level) image, and a bottom-view image. Multi-distance images are images of the sample shelf acquired at different shooting distances.
Further, the processor 102 may further perform target labeling on each sample image in the sample image set by using the spatial information of the detection frame including the rotation angle, to obtain spatial information of the detection frame for performing target labeling on each sample image, where the spatial information includes the rotation angle. In the embodiment of the application, for convenience of description and distinction, a detection frame for performing target labeling on a sample image is defined as a reference detection frame; and the target detection frame for labeling the shelf image collected by the robot 100 is defined as a first detection frame. The spatial information of the first detection frame and the reference detection frame both comprise rotation angle information. Optionally, a manual labeling mode may be further adopted to perform target labeling on each sample image in the sample image set, so as to obtain spatial information of the reference detection frame, where the spatial information includes the rotation angle.
Further, the processor 102 may perform model training to obtain the target detection model. The initial model used for this training is referred to as the initial detection model; it has the same model architecture as the finally obtained target detection model, differing only in the learned parameter values. Model training in this embodiment mainly means training the parameters of the model using the sample image set so as to minimize a loss function; that is, with minimizing the loss function as the training objective, the sample image set is used to train the model into the target detection model. The loss function can be determined from the spatial information, including the rotation angle, of the detection frame produced during model training and the spatial information of the reference detection frame used to label the sample image set before training. For convenience of description and distinction, the detection frame containing the rotation angle obtained during model training is defined as the second detection frame. Optionally, the loss function may be expressed as:
$$L = L_{center} + \lambda_{scale} L_{scale} + \lambda_{offset} L_{offset} + \lambda_{\theta} L_{\theta} \tag{2}$$
In loss function (2), L_center denotes the center loss, i.e., the loss between the center coordinates of the second detection frame obtained by model training and the center coordinates of the reference detection frame; L_scale denotes the scale loss, i.e., the loss between the width and height of the second detection frame and those of the reference detection frame; L_offset denotes the offset loss, i.e., the loss of the offset of the second detection frame's center coordinates relative to the reference detection frame's center coordinates; and L_θ denotes the rotation-angle loss, i.e., the loss of the second detection frame's rotation angle relative to the reference detection frame's rotation angle. λ_scale, λ_offset, and λ_θ are the weights of the scale loss, offset loss, and rotation-angle loss, respectively, and can be set flexibly according to the actual situation.
The rotation-angle loss L_θ can be expressed as:

$$L_\theta = \frac{1}{K} \sum_{i=1}^{K} \left|\theta_i - \hat{\theta}_i\right|$$

where K is the number of positive samples in the sample image set, and θ_i and θ̂_i are, respectively, the angle of the reference detection frame and the rotation angle output by model training.
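As an illustration, here is a hedged PyTorch sketch of the combined loss in formula (2); the tensor layout, the weight values, and the use of L1-style penalties for every term (in CenterNet-style detectors the center term is normally a focal loss on the keypoint heatmap) are assumptions, not the patent's specification:

```python
# Combined detection loss of formula (2): center + scale + offset +
# rotation-angle terms, each weighted; all terms here use L1-style
# penalties over the K positive-sample locations for brevity.
import torch

def detection_loss(pred, target,
                   lam_scale=0.1, lam_offset=1.0, lam_theta=0.5):
    """pred/target: dicts of tensors ('center', 'scale', 'offset',
    'theta') gathered at the K positive-sample locations."""
    l1 = torch.nn.functional.l1_loss
    l_center = l1(pred["center"], target["center"])
    l_scale  = l1(pred["scale"],  target["scale"])
    l_offset = l1(pred["offset"], target["offset"])
    l_theta  = (pred["theta"] - target["theta"]).abs().mean()  # L_theta
    return (l_center + lam_scale * l_scale
            + lam_offset * l_offset + lam_theta * l_theta)
```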
The rotation-angle prediction branch may be introduced into a known target detection model architecture, which may be, but is not limited to, the Single Shot Detector (SSD) model, the YOLO (You Only Look Once) series of models, or the CenterNet model. For example, in the CenterNet model, the detection head includes three branches: the center coordinates, the center-coordinate offset, and the scale prediction of the second detection frame. The Hourglass-104 backbone parameters are shared, and two convolutional layers cascaded at the head implement the rotation-angle prediction. The Hourglass-104 model uses a large number of convolution and deconvolution layers to fuse multi-scale features, and the finally output feature map is 1/4 of the input size, which reduces the amount of computation.
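A minimal PyTorch sketch of such a head (module names and channel widths are hypothetical): the usual CenterNet branches share the backbone feature map, and a fourth branch of two cascaded convolutional layers predicts the rotation angle:

```python
# CenterNet-style detection head extended with a rotation-angle branch.
import torch.nn as nn

def make_branch(in_ch, out_ch, mid_ch=256):
    # two cascaded convolutional layers, as described above
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1),
    )

class RotatedDetectionHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=1):
        super().__init__()
        self.center = make_branch(in_ch, num_classes)  # center heatmap
        self.offset = make_branch(in_ch, 2)            # center offset
        self.scale  = make_branch(in_ch, 2)            # width and height
        self.theta  = make_branch(in_ch, 1)            # rotation angle

    def forward(self, feat):  # feat: shared backbone map, 1/4 input size
        return {"center": self.center(feat), "offset": self.offset(feat),
                "scale": self.scale(feat), "theta": self.theta(feat)}
```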
The target detection model training stage may be completed before the robot 100 leaves a factory, or a sample image set may be provided by a user of the robot 100 after leaving the factory, and a computer instruction related to the target detection model training is started to complete the target detection model training.
After the training of the target detection model is completed, the processor 102 may detect the target of the input shelf image by using the target detection model, and obtain the spatial information of the first detection frame for performing the target labeling on the input shelf image, where the spatial information includes the rotation angle. And the rotating angle of the first detection frame is consistent with the rotating angle of the target object marked by the first detection frame. Further, a partial image corresponding to the first detection frame may be extracted from the shelf image based on the spatial information including the rotation angle, the partial image also having the rotation angle. Further, the partial image having the rotation angle may be input to a target recognition model supporting multi-angle target recognition to recognize a target object included in the partial image.
In this embodiment, before the partial image with the rotation angle is input into the target recognition model supporting multi-angle target recognition, model training may be further performed on the target recognition model, so that the target recognition model can recognize the multi-angle target. Alternatively, the ReID method may be employed to train the target recognition model. The specific implementation mode is as follows:
in the present embodiment, the robot 100 may acquire multi-angle images of various sample goods. Wherein, a plurality means 2 or more than 2. Preferably, the sample commodity is a single commodity. The multi-angle image of the sample good may include: a top view image, a head-up image, and a bottom view. Wherein, the overlook angle and the upward angle can be flexibly set. Alternatively, the top and bottom viewing angles may be 45 °, or the like. The variety of sample goods includes: the same commodity under different brands, different commodities under the same brand, commodities with different specifications of the same commodity under the same brand and the like. Each sample commodity corresponds to a set of multi-angle images. Herein, the multi-angle means a plurality of angles. The plurality means 2 or more than 2, and the specific number of the angles can be flexibly set.
In the embodiment of the present application, the specific way in which the robot 100 acquires multi-angle images of the sample commodities is not limited. In some embodiments, the robot 100 may flexibly adjust the acquisition angle of the camera 104 and shoot each sample commodity from different acquisition angles, obtaining multi-angle images of the sample commodities. In other embodiments, the multi-angle images may be collected by another device or system and provided to the robot 100. For example, as shown in fig. 1e, a multi-angle image acquisition system may include fixed cameras and a rotatable tray S1, with the cameras mounted as shown in fig. 1e: three cameras (cameras 1-3) at a 45° downward angle, a 0° head-on angle, and a 45° upward angle. Sample commodity A is placed on the rotating tray, and at every set rotation angle of the tray one group of images is shot, comprising an image from the 45° downward camera, an image from the 0° head-on camera, and an image from the 45° upward camera. For example, if a group of images is shot for every 6° of tray rotation, a total of 180 pictures is taken for each sample commodity A. Optionally, each frame acquired by the system may be cropped so that only the sample commodity remains, and the cropped images are used as the multi-angle images.
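A sketch of the corresponding capture loop (all device APIs here are hypothetical): 60 tray positions at 6° per step times three cameras gives the 180 images per sample mentioned above:

```python
# Capture loop for the multi-angle acquisition system of fig. 1e.
def capture_sample(cameras, tray, step_deg=6):
    """cameras: the 45-degree-down, 0-degree, 45-degree-up cameras;
    tray: the rotatable tray S1."""
    images = []
    for _ in range(360 // step_deg):   # 60 tray positions
        for cam in cameras:            # one group of 3 images per position
            images.append(cam.capture())
        tray.rotate(step_deg)
    return images                      # 60 * 3 = 180 images per sample
```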
In practical applications, commodities of different brands differ in appearance and are easy to distinguish; they may be the same type of commodity under different brands, such as brand A cola and brand B cola, or different types of commodities under different brands. Different types of commodities under the same brand also differ considerably in appearance and are easy to distinguish, for example brand A Sprite and brand A cola, or brand B potato chips and brand B bread. However, different specifications of the same commodity under the same brand look extremely similar and are difficult to recognize, for example brand A 500 ml cola versus brand A 250 ml cola.
In the embodiment of the application, in order to improve the accuracy of target identification, the coarse and fine identification models of the target identification model can be trained by distinguishing sample commodities with different granularities. In the present embodiment, for convenience of description and distinction, the coarse recognition model is defined as a first target recognition submodel, and the fine recognition model is defined as a second target recognition submodel. The first target identification submodel and the second target identification submodel both support multi-angle target identification. The fineness of the image characteristic vector extracted by the second target recognition submodel is greater than that of the image characteristic vector extracted by the first target recognition submodel. Optionally, the first target identification submodel may distinguish and identify sample goods of different brands and goods of different types under the same brand; the second target identification submodel can distinguish and identify different specifications of the same type of commodities under the same brand.
Further, the processor 102 may train the first target recognition submodel and the second target recognition submodel separately. Optionally, when training the first target recognition submodel, the processor 102 may take, from the multi-angle images of the sample commodities, images of the same sample commodity under the same brand as a positive sample image pair (denoted the first positive sample image pair), and images of sample commodities under different brands, or of different commodity types under the same brand, as a negative sample image pair (denoted the first negative sample image pair), and form a triplet (denoted the first triplet) from a positive sample image pair and a negative sample image pair that contain the same anchor image. For the anchor image in a first triplet, one of the other two images shows the same sample commodity under the same brand as the anchor image and forms the first positive sample image pair with it; the other shows a sample commodity of a different brand, or of a different type under the same brand, and forms the first negative sample image pair with the anchor image.
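A hedged sketch of assembling such triplets (the label fields and random sampling strategy are assumptions for illustration):

```python
# Build first triplets from the primary identifier brand + commodity name:
# positive = same brand and same commodity name; negative = different
# brand, or a different commodity type under the same brand.
import random

def build_coarse_triplets(samples):
    """samples: list of dicts like {'img': ..., 'brand': str, 'name': str}."""
    triplets = []
    for anchor in samples:
        pos = [s for s in samples if s is not anchor
               and (s["brand"], s["name"]) == (anchor["brand"], anchor["name"])]
        neg = [s for s in samples
               if s["brand"] != anchor["brand"]
               or (s["brand"] == anchor["brand"] and s["name"] != anchor["name"])]
        if pos and neg:
            triplets.append((anchor, random.choice(pos), random.choice(neg)))
    return triplets
```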
Optionally, a primary commodity identification label may be applied in advance to the sample commodity contained in each frame of the multi-angle images, where the primary commodity identifier may be brand name + commodity name, such as brand A cola or brand A Sprite. In this way, the processor 102 can determine, from these primary labels, which images in the multi-angle images show the same sample commodity under the same brand, and which show sample commodities of different brands or of different types under the same brand.
Optionally, a secondary commodity identification label may be performed on the sample commodity contained in each frame of image in the multi-angle images of the multiple sample commodities in advance, where the secondary commodity identification may be brand name + trade name + specification, such as 500ml of brand a cola, 250ml of brand a cola, and the like. Thus, in the following embodiment, in the process of training the second target recognition submodel, the processor 102 may determine, according to the secondary commodity identification label of the sample commodity included in each frame image in the multi-angle images of multiple sample commodities, the images of the sample commodities of the same specification and the same type under the same brand in the multi-angle images of the sample commodities; and images of sample goods of the same type and different specifications under the same brand.
In this embodiment, the training of the first target recognition submodel mainly includes training of a feature extraction layer in the first target recognition submodel. Defining a feature extraction layer in a first target recognition submodel as a first feature extraction layer in order to distinguish the feature extraction layer from a feature extraction layer in a second target recognition submodel; and defining a feature extraction layer in the second target recognition submodel as a second feature extraction layer. And the fineness of the image feature vectors extracted by the trained second feature extraction layer is greater than that of the trained first feature extraction layer.
The initial network model used to train the first feature extraction layer is referred to as a first initial network model. The first initial network model and the model architecture of the first feature extraction layer finally obtained by model training are the same, namely the parameters of the model are the same. The model training in this embodiment mainly refers to: parameters of the first initial network model are trained using the triplets to minimize the loss function.
For the ReID method, the feature extraction layer may be coupled to a classifier, and the classifier determines, according to the image feature vector output by the feature extraction layer, the commodity identifier included in each sample image in the first triplet. In this embodiment, the product identifier is information that uniquely identifies a product. For example, the product identifier may be a product ID, or a brand, a product name, a specification, and the like to which the product belongs. Accordingly, the loss function for training the first feature extraction layer may be a combined loss function composed of a triplet loss function, a center loss function, and a commodity class loss function. For convenience of description and distinction, the loss function in the training of the target detection model is defined as a first loss function; defining a loss function for training the first feature extraction layer as a second loss function; and defining a loss function for training the second target recognition submodel as a third loss function.
The triple loss function and the central loss function in the second loss function can be determined according to the image feature vector output by training the first initial network model each time; the commodity category loss in the second loss function may be determined according to the difference between the commodity categories included in the triplets obtained by each training and the actual difference between the commodity categories included in the triplets.
Further, with minimizing the second loss function as the training objective, the first training model can be trained using the triplets formed from the positive and negative sample image pairs to obtain the first feature extraction layer. The first training model comprises the first initial network model used to train the first feature extraction layer and the classifier. Specifically, the first initial network model at the point where the second loss function is minimized is the first feature extraction layer; of course, its parameters have changed by then compared with the first initial network model at the start of training.
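As an illustration, a hedged PyTorch sketch of the combined second loss function described above (triplet + center + commodity classification); the margin and weights are assumptions:

```python
# Combined ReID training loss: triplet loss + center loss + commodity
# classification (cross-entropy) loss over a batch of first triplets.
import torch
import torch.nn.functional as F

def reid_loss(f_a, f_p, f_n, logits, labels, centers,
              margin=0.3, w_center=5e-4, w_cls=1.0):
    """f_a/f_p/f_n: feature vectors of anchor/positive/negative images;
    logits: classifier outputs for the anchors; labels: commodity ids;
    centers: learnable per-commodity feature centers."""
    triplet = F.triplet_margin_loss(f_a, f_p, f_n, margin=margin)
    center  = ((f_a - centers[labels]) ** 2).sum(dim=1).mean()
    cls     = F.cross_entropy(logits, labels)
    return triplet + w_center * center + w_cls * cls
```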
Further, the processor 102 may input the multi-angle images of the sample commodities into the trained first feature extraction layer to obtain the image feature vectors corresponding to those images, use them as the commodity feature vectors of a commodity feature vector set (denoted the first commodity feature vector set), construct the first commodity feature vector set from them, and establish the correspondence between each commodity feature vector in the set and its commodity identifier. Optionally, the commodity feature vectors in the first set may be 256-dimensional, but this is not limiting; the commodity identifier in this correspondence may be the secondary commodity identifier. Because the negative sample image pairs in the triplets used to train the first feature extraction layer are images of sample commodities of different brands, or of different types under the same brand, the image feature vectors within a negative pair differ greatly, so the granularity of the image feature vectors in the first commodity feature vector set is coarse. And because the sample images used to train the first feature extraction layer are multi-angle images of multiple sample commodities, the trained layer supports feature extraction from multi-angle images, and the image feature vectors in the first set are feature vectors of multi-angle images; the trained first target recognition submodel can therefore support multi-angle target recognition.
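A hedged sketch of building this gallery (function and variable names are hypothetical):

```python
# Build the first commodity feature vector set: run every multi-angle
# sample image through the trained first feature extraction layer and
# record the vector -> commodity identifier correspondence.
import torch

@torch.no_grad()
def build_gallery(extractor, samples):
    """samples: iterable of (image_tensor, commodity_id) pairs;
    extractor: the trained feature extraction layer (256-dim output)."""
    vectors, ids = [], []
    for img, commodity_id in samples:
        vec = extractor(img.unsqueeze(0)).squeeze(0)  # 256-dim feature
        vectors.append(vec / vec.norm())              # unit norm -> cosine
        ids.append(commodity_id)
    return torch.stack(vectors), ids                  # gallery + id map
```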
In an embodiment of the present application, a second target recognition submodel may also be trained. Optionally, when the second target identification submodel is trained, the processor 102 may obtain, from the multi-angle images of multiple sample commodities, images of sample commodities of the same specification and of the same type under the same brand as a positive sample image pair; and acquiring images of sample commodities of the same type and different specifications under the same brand as a negative sample image pair, and forming a triple (marked as a second triple) by using the positive sample image pair and the negative sample image pair containing the same anchor point image. For the anchor point image in the second triple, one frame of the other two frames of images is the image of the sample commodity which belongs to the same brand and has the same type and the same specification as the sample commodity contained in the anchor point image, and the frame of image and the anchor point image form a positive sample image pair; the other frame is an image of a sample commodity of a different specification belonging to the same type and the same brand as the sample commodity contained in the anchor image, and the frame image and the anchor image form a negative sample image pair.
In this embodiment, the training of the second target recognition submodel mainly trains a feature extraction layer (denoted as a second feature extraction layer) in the second target recognition submodel. The initial network model used for training the second feature extraction layer is referred to as a second initial network model, where the second initial network model may be the first feature extraction layer after the training. And the second initial network model and the model architecture of the second feature extraction layer finally obtained by model training are the same, namely the parameters of the model are the same. The model training in this embodiment mainly refers to: parameters of the second initial network model are trained using the second triples to minimize the loss function.
For the ReID method, the feature extraction layer may be coupled to a classifier, and the classifier determines the commodity identifiers included in each sample image in the second triple according to the image feature vectors output by the feature extraction layer. Accordingly, the third loss function for training the second feature extraction layer may be a combined loss function composed of a triple loss function, a center loss function, and a commodity class loss function.
The triple loss function and the central loss function in the third loss function can be determined according to the image feature vector output by the first feature extraction layer during each training; the commodity category loss in the third loss function may be determined according to the difference between the commodity categories included in the triplets obtained by each training and the actual difference between the commodity categories included in the triplets.
Further, with minimizing the third loss function as the training objective, the second training model can be trained using the second triplets formed from the positive and negative sample image pairs to obtain the second feature extraction layer. The second training model comprises the second initial network model and the classifier; optionally, the second initial network model may be the trained first feature extraction layer. Specifically, the second initial network model at the point where the third loss function is minimized is the second feature extraction layer; of course, its parameters have changed by then compared with the second initial network model at the start of training.
Further, the processor 102 may input the multi-angle images of the sample commodities into the trained second feature extraction layer to obtain the image feature vectors corresponding to those images, use them as the commodity feature vectors of a commodity feature vector set (denoted the second commodity feature vector set), construct the second commodity feature vector set from them, and establish the correspondence between each commodity feature vector in the set and its commodity identifier. Optionally, the commodity feature vectors in the second set may be 256-dimensional, but this is not limiting; the commodity identifier in this correspondence may be the secondary commodity identifier. Because the negative sample image pairs in the second triplets used to train the second feature extraction layer are images of sample commodities of the same type but different specifications under the same brand, the image feature vectors within a negative pair differ only slightly, so the granularity of the image feature vectors in the second commodity feature vector set is fine; that is, their fineness is greater than that of the image feature vectors in the first commodity feature vector set. And because the sample images used to train the second feature extraction layer are multi-angle images of multiple sample commodities, the trained layer supports feature extraction from multi-angle images, and the image feature vectors in the second set are feature vectors of multi-angle images; the trained second target recognition submodel can therefore support multi-angle target recognition.
Based on the first and second target recognition submodels, when recognizing a target the processor 102 may first input the local image with the rotation angle into the first target recognition submodel to recognize the target object it contains; if the first target recognition submodel cannot recognize the target object, the local image with the rotation angle is then input into the second target recognition submodel, which recognizes the target object contained in the local image.
Optionally, the first target recognition submodel may include: the multi-angle image recognition system comprises a first feature extraction layer supporting feature extraction of multi-angle images and a first target recognition layer supporting multi-angle target recognition. When identifying a target object included in the local image, the processor 102 may input the local image with the rotation angle into the first feature extraction layer to obtain an image feature vector (denoted as a first image feature vector) of the local image; and inputting the first image feature vector into a first target recognition layer, and recognizing a target object contained in the local image.
Further, in the first target recognition layer, the similarity between the first image feature vector of the local image and the commodity feature vectors in the first commodity feature vector set may be calculated. Optionally, cosine distances between the first image feature vector and the commodity feature vectors in the first set may be computed, where a shorter cosine distance means a greater similarity. The M commodity feature vectors most similar to the first image feature vector are selected from the first commodity feature vector set, and the commodity identifiers corresponding to these M vectors are determined from the correspondence between commodity feature vectors and commodity identifiers in the first set. Further, if the number Q of vectors among the M that share the commodity identifier of the most similar commodity feature vector is greater than or equal to N, that commodity identifier is taken as the identifier of the target object contained in the local image. Here M and N are integers with M ≥ 2 and 1 ≤ N ≤ M - 1; their specific values are not limited in this embodiment, and optionally M = 5 and N = 3, but this is not limiting.
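A hedged sketch of this top-M voting rule (names hypothetical; gallery vectors assumed unit-normalized as in the gallery-building sketch above):

```python
# Coarse recognition: take the M most cosine-similar gallery vectors;
# accept the identifier of the best match only if at least N of the M
# share it (example values above: M = 5, N = 3).
import torch

def coarse_recognize(query, gallery_vecs, gallery_ids, M=5, N=3):
    query = query / query.norm()
    sims = gallery_vecs @ query                 # cosine similarities
    top = sims.topk(M).indices.tolist()
    best_id = gallery_ids[top[0]]               # id of most similar vector
    Q = sum(1 for i in top if gallery_ids[i] == best_id)
    return best_id if Q >= N else None          # None -> use fine submodel
```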
Correspondingly, if Q is less than N, it is determined that the first target recognition submodel cannot recognize the target object contained in the local image. Further, the processor 102 inputs the local image with the rotation angle into the second target recognition submodel, which recognizes the target object contained in the local image.
Optionally, the second target recognition submodel includes: a second feature extraction layer supporting feature extraction of multi-angle images and a second target identification layer supporting multi-angle target identification. And the fineness of the image feature vectors extracted by the second feature extraction layer is greater than that of the image feature vectors extracted by the first feature extraction layer.
Correspondingly, the processor 102 inputs the local image with the rotation angle into the second feature extraction layer under the condition that Q is less than N, so as to obtain a second image feature vector of the local image; and inputting the second image feature vector into a second target recognition layer to recognize the target object contained in the partial image.
Optionally, in the second target recognition layer, according to the correspondence between commodity feature vectors and commodity identifiers in the second commodity feature vector set, the first target commodity feature vectors corresponding to the commodity identifiers of the above M commodity feature vectors may be obtained from the second commodity feature vector set, and the similarity between the second image feature vector and each first target commodity feature vector is calculated. Optionally, the cosine distance between the second image feature vector of the local image and each first target commodity feature vector may be computed, where a shorter cosine distance means a greater similarity.
Further, if among the first target commodity feature vectors there exists a second target commodity feature vector whose similarity to the second image feature vector is greater than or equal to a set similarity threshold, the identifier of the target object contained in the local image is determined from that second target commodity feature vector according to the correspondence between commodity feature vectors and commodity identifiers in the second commodity feature vector set.
Optionally, a commodity feature vector with the largest similarity with the second image feature vector may be determined from the second target commodity feature vector; and determining the commodity identification corresponding to the commodity feature vector with the maximum similarity between the commodity feature vectors and the second image feature vectors according to the corresponding relation between the commodity feature vectors and the commodity identifications in the second commodity feature vector set, wherein the commodity identification is used as the identification of the target object contained in the local image.
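A hedged sketch of this fine-recognition step (names and threshold are hypothetical): the fine gallery is restricted to the candidate identifiers produced by the coarse stage, and the best cosine match is accepted only above the similarity threshold:

```python
# Fine recognition: compare the second image feature vector only against
# the fine-gallery vectors whose commodity ids came from the coarse
# stage, and accept the best match above a set similarity threshold.
import torch

def fine_recognize(query, fine_vecs, fine_ids, candidate_ids, thresh=0.8):
    query = query / query.norm()
    idx = [i for i, cid in enumerate(fine_ids) if cid in candidate_ids]
    if not idx:
        return None
    sims = fine_vecs[idx] @ query               # cosine similarities
    best = int(sims.argmax())
    return fine_ids[idx[best]] if sims[best] >= thresh else None
```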
Alternatively, in the second target recognition layer, the similarity between the second image feature vector of the local image and the commodity feature vectors in the second commodity feature vector set may be calculated directly, where the fineness of the commodity feature vectors in the second set is greater than that of those in the first set. Optionally, the cosine distances between the second image feature vector of the local image and the commodity feature vectors in the second commodity feature vector set may be computed, where a shorter cosine distance means a greater similarity.
Further, if there exists, in the second commodity feature vector set, a target commodity feature vector whose similarity to the second image feature vector is greater than or equal to the set similarity threshold, the identifier of the target object contained in the local image is determined according to the target commodity feature vector and the correspondence between commodity feature vectors and commodity identifications in the second commodity feature vector set.
Optionally, the commodity feature vector with the greatest similarity to the second image feature vector can be determined from the target commodity feature vectors; then, according to the correspondence between commodity feature vectors and commodity identifications in the second commodity feature vector set, the commodity identification corresponding to that commodity feature vector is determined as the identifier of the target object contained in the local image.
The shelf image in the above target identification process is any one frame collected by the robot 100 while moving along the shelf. In practical applications, the robot 100 collects multiple frames of shelf images during its movement along the shelf, and these frames may partially overlap. In this embodiment, the processor 102 may also extract feature points of the multiple frames of shelf images. A feature point is a local expression of an image's characteristics and reflects local specificity of the shelf image.
Optionally, the processor 102 may perform blob detection or corner detection on the multi-frame shelf images. For blob detection, feature points can be obtained with the LoG (Laplacian of Gaussian) method, the DoH (Determinant of Hessian) method, the SIFT algorithm, or the SURF algorithm; for corner detection, the Harris algorithm or the FAST algorithm may be used.
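By way of a non-limiting illustration, the following Python sketch shows how such feature points could be obtained with OpenCV; the detector choice (SIFT as a blob-style detector, FAST or Harris for corners) mirrors the options above, while the parameter values are illustrative assumptions, not taken from the patent text.

```python
import cv2

def extract_feature_points(shelf_image_bgr, method="sift"):
    """Detect feature points in one shelf image (illustrative sketch)."""
    gray = cv2.cvtColor(shelf_image_bgr, cv2.COLOR_BGR2GRAY)
    if method == "sift":   # blob-style detection (SIFT)
        return cv2.SIFT_create().detect(gray, None)
    if method == "fast":   # corner detection (FAST)
        return cv2.FastFeatureDetector_create().detect(gray, None)
    # Harris corner detection via goodFeaturesToTrack; the thresholds are
    # hypothetical and would be tuned to the shelf imagery.
    return cv2.goodFeaturesToTrack(gray, maxCorners=500, qualityLevel=0.01,
                                   minDistance=8, useHarrisDetector=True)
```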
Further, the processor 102 may perform deduplication processing on the multiple frames of shelf images according to feature points of the multiple frames of shelf images. Optionally, calculating similarity between feature points of a first shelf image and a second shelf image, which are adjacent to each other in any acquisition time, in the multi-frame shelf image; calculating a perspective transformation matrix between the first shelf image and the second shelf image according to the similarity between the characteristic points of the first shelf image and the second shelf image; then, performing affine transformation on the first shelf image and the second shelf image according to the perspective transformation matrix to determine an overlapping area of the first shelf image and the second shelf image; and carrying out deduplication processing on the overlapping area of the first shelf image and the second shelf image so as to realize deduplication processing on the first shelf image and the second shelf image.
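The deduplication step can be pictured with the sketch below, which stands in for the description above under the assumption that SIFT descriptors plus a RANSAC-estimated perspective transformation matrix are used; the ratio-test and reprojection thresholds are illustrative assumptions.

```python
import cv2
import numpy as np

def estimate_overlap(img1, img2):
    """Estimate the overlap region between two temporally adjacent shelf
    images (sketch only; assumes enough texture for >= 4 good matches)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY), None)
    kp2, des2 = sift.detectAndCompute(cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY), None)

    # Similarity between feature points: descriptor matching + ratio test.
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # Perspective transformation matrix between the two frames.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Project the first frame's border into the second frame; the projected
    # polygon marks the overlap area, which is then counted only once.
    h, w = img1.shape[:2]
    border = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    return H, cv2.perspectiveTransform(border, H)
```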
Further, the processor 102 may perform deduplication processing on a target object included in the multiple shelf images according to a result of deduplication processing on the multiple shelf images, so as to implement target detection and identification on the entire shelf.
It should be noted that the target recognition method provided in the above embodiments may be executed not only by the robot but also by another computer device in communication with the robot. In that case, the robot provides the acquired shelf images to the other computer device, and the other computer device performs target recognition on them. For the specific implementation, reference may be made to the description of target recognition performed by the robot, which is not repeated here.
In addition to the robot embodiment described above, exemplary embodiments of the present application also provide a target recognition method, which is described in detail below with reference to the accompanying drawings.
Fig. 2a is a schematic flowchart of a target identification method based on artificial intelligence according to an embodiment of the present disclosure. As shown in fig. 2a, the method comprises:
20a, acquiring shelf images acquired by the robot in the process of moving along the shelf.
And 20b, inputting the shelf image into the target detection model to obtain the space information of the first detection frame for carrying out target labeling on the shelf image.
And 20c, extracting a local image corresponding to the first detection frame from the shelf image according to the space information of the first detection frame.
And 20d, inputting the local image into the target recognition model to recognize the target object contained in the local image.
The target recognition method provided by this embodiment may be executed by the autonomous mobile robot, or by another computer device in communication with it. Whichever device executes the method, in step 20a, shelf images acquired by the robot during movement along the shelf may be acquired. The shelf images contain images of the commodities placed on the shelf. For the case where the executing agent is the robot, an alternative implementation of step 20a is: controlling the robot to move along the shelf and controlling a camera on the robot to acquire shelf images during that movement. For the case where the executing agent is another computer device in communication with the robot, another alternative implementation of step 20a is: receiving the shelf images acquired by the robot in the process of moving along the shelf.
Further, the quality of the shelf image affects the subsequent commodity detection and recognition. In the prior art, shelf images are usually collected with handheld devices or fixed cameras: the shooting angle of a handheld device is not fixed, so the collected image is easily tilted; a fixed camera is generally mounted high, giving a wide viewing angle but placing it far from the commodities, so it cannot obtain high-definition pictures. Both reduce the accuracy of commodity recognition.
In order to solve the above problem, in step 20a, the relative positional relationship between the acquisition view angle of the robot's camera and the goods on the shelf can be kept stable. Optionally, the acquisition view angle of the camera can be controlled to face the goods on the shelf. Furthermore, the robot can be controlled to move in a direction parallel to the shelf while the camera collects shelf images. Because the robot's moving direction is parallel to the shelf, the camera's shooting angle does not tilt, which improves the quality of the shelf images collected by the camera and hence the accuracy of subsequent commodity detection and recognition.
Optionally, the position distribution of the shelves in the environment map can be determined according to the known environment map; and planning a moving path parallel to the goods shelf for the robot according to the position distribution condition of the goods shelf in the environment map, and controlling the robot to move along the moving path parallel to the goods shelf.
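For illustration only, such a moving path parallel to a shelf could be derived from the environment map as in the sketch below, assuming each shelf edge is available as a 2-D line segment in map coordinates; the offset distance and waypoint spacing are hypothetical parameters introduced for the example.

```python
import numpy as np

def plan_parallel_path(shelf_start, shelf_end, offset, step=0.25):
    """Waypoints parallel to a shelf edge, offset toward the aisle.

    Sketch under the assumption that the environment map gives the shelf as
    a 2-D segment (shelf_start -> shelf_end); `offset` is the desired
    camera-to-shelf distance and `step` the waypoint spacing, in metres.
    """
    p0, p1 = np.asarray(shelf_start, float), np.asarray(shelf_end, float)
    direction = (p1 - p0) / np.linalg.norm(p1 - p0)
    normal = np.array([-direction[1], direction[0]])  # unit normal; its sign
    n_steps = int(np.linalg.norm(p1 - p0) / step) + 1  # picks the aisle side
    return [p0 + direction * (i * step) + normal * offset
            for i in range(n_steps)]
```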
Furthermore, the distance between the camera and the goods shelf can be automatically adjusted, and the camera can acquire images of the whole goods shelf. Optionally, the robot may automatically adjust the height of the camera and the distance between the robot and the shelf, so that the camera may capture images of the entire shelf.
Further, in step 20b, the shelf image may be input to the target detection model to obtain spatial information of the first detection frame for performing target labeling on the shelf image. For the description of the spatial information of the first detection frame, reference may be made to the related contents of the robot embodiment described above, and details are not repeated here. Next, in step 20c, a partial image corresponding to the first detection frame may be extracted from the shelf image based on the spatial information of the first detection frame; in step 20d, the local image is input into the target recognition model to recognize the target object included in the local image, so that the target object is automatically recognized, the target recognition efficiency is improved, and the commodity counting efficiency based on the target recognition result is improved.
Further, the placement of goods on a shelf may be inclined. For example, in a shopping scene such as a mall or supermarket, a customer examining a commodity may put it back in a position inconsistent with its initial placement, leaving it tilted. For such an inclined target object, a local image extracted according to the spatial information of a rectangular detection frame without angle information often contains excessive background noise, which degrades the accuracy of subsequent target identification based on that local image.
In order to solve the above problem, an embodiment of the present application further provides another target identification method based on artificial intelligence. As shown in fig. 2b, the method comprises:
201. Acquiring shelf images collected by the robot in the process of moving along the shelf.
202. Inputting the shelf image into a target detection model to obtain spatial information of a first detection frame for performing target labeling on the shelf image; wherein, the spatial information of the first detection frame comprises: the center coordinates, the size and the rotation angle of the first detection frame, or the vertex coordinates and the rotation angle of the first detection frame.
203. Extracting a local image corresponding to the first detection frame from the shelf image according to the space information containing the rotation angle; the partial image has a rotation angle.
204. The partial image with the rotation angle is input into a target recognition model supporting multi-angle target recognition to recognize a target object contained in the partial image.
In this embodiment, in order to improve the accuracy of target identification, detection of a rotation angle is introduced to the target detection model, and spatial information of a target detection frame output by the target detection model includes the rotation angle. For the description of the rotation angle, reference may be made to the related contents of the above embodiments, which are not repeated herein.
Based on the target detection model, in step 202, the obtained shelf image may be input into the target detection model to obtain spatial information of a target detection frame for performing target labeling on the shelf image; the spatial information of the target detection frame includes: the angle of rotation. Further, in step 203, a partial image corresponding to the target detection frame may be extracted from the shelf image according to the spatial information including the rotation angle; the partial image also has a rotation angle. The rotation angle is consistent with the inclination angle of the target object in the local image, so that the background noise of the target object is reduced, and the accuracy of target identification based on the local image is improved.
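The extraction of a local image that keeps its rotation angle can be sketched as follows with OpenCV, under the assumption that the spatial information is given as center coordinates, size, and rotation angle; masking the rotated rectangle is one possible way to realize the reduction of background noise described above, not necessarily the patented implementation.

```python
import cv2
import numpy as np

def extract_local_image(shelf_image, cx, cy, w, h, angle_deg):
    """Cut out the region labelled by a detection frame that carries a
    rotation angle, without deskewing the target (illustrative sketch)."""
    rect = ((cx, cy), (w, h), angle_deg)
    corners = cv2.boxPoints(rect).astype(np.int32)  # 4 vertices of the box

    # Zero out pixels outside the rotated rectangle so they contribute no
    # background noise, while the target keeps its rotation angle.
    mask = np.zeros(shelf_image.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, corners, 255)
    masked = cv2.bitwise_and(shelf_image, shelf_image, mask=mask)

    # Tight axis-aligned crop around the rotated box.
    x, y, bw, bh = cv2.boundingRect(corners)
    x, y = max(x, 0), max(y, 0)
    return masked[y:y + bh, x:x + bw]
```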
Further, since the partial image has a rotation angle, in the present embodiment, in order to implement target recognition on the partial image with the rotation angle, a target recognition model supporting multi-angle target recognition may also be provided. Accordingly, in step 204, the partial image having the rotation angle may be input to a target recognition model supporting multi-angle target recognition to recognize the target object included in the partial image.
In this embodiment, shelf images acquired by the robot during movement along the shelf may be acquired. In the stage of carrying out target detection on the shelf image, the shelf image is input into a target detection model to obtain spatial information containing a rotation angle of a detection frame for carrying out target labeling on the shelf image, so that a local image corresponding to the detection frame with the rotation angle is extracted, background noise of a target object contained in the extracted local image can be reduced, and the accuracy of subsequently identifying the target object contained in the local image can be improved; in the target identification stage, extracting a local image corresponding to the detection frame from the shelf image according to the space information containing the rotation angle, wherein the local image also has the rotation angle; and the local image with the rotation angle is input into a target recognition model supporting multi-angle target recognition to recognize the target object in the local image, and the target object in the local image can be recognized without correcting the local image, thereby being beneficial to improving the target recognition efficiency.
It should be noted that, in the embodiment of the present application, before inputting the shelf image into the target detection model, the target detection model may also be subjected to model training. Alternatively, a multi-angle image of a sample shelf on which a plurality of kinds of commodities are placed may be acquired as a sample image set. And performing target labeling on each sample image in the sample image set by adopting the detection frame space information containing the rotation angle to obtain the space information of the detection frame for performing target labeling on each sample image, wherein the space information contains the rotation angle. In the embodiment of the application, for convenience of description and distinction, a detection frame for performing target labeling on a sample image is defined as a reference detection frame; and defining the target detection frame for labeling the shelf image collected by the robot as a first detection frame. The spatial information of the first detection frame and the reference detection frame both comprise rotation angle information. Optionally, a manual labeling mode may be further adopted to perform target labeling on each sample image in the sample image set, so as to obtain spatial information of the reference detection frame, where the spatial information includes the rotation angle.
Further, a multi-angle image and a multi-distance image of a sample shelf on which a plurality of kinds of commodities are placed can be acquired as the first sample image set. For the description of the multi-angle image and the multi-distance image, reference may be made to the related contents of the above embodiments, which are not described herein again. Further, the first loss function can be minimized to be a training target, and model training is carried out by utilizing the first sample image set to obtain a target detection model; the first loss function is determined according to the spatial information of the second detection frame containing the rotation angle obtained by model training and the spatial information of the reference detection frame for performing target labeling on the first sample image set before the model training; the spatial information of the reference detection frame includes a rotation angle. For the first loss function and the specific training process for the target detection model, reference may be made to the relevant contents of the robot embodiment described above, and details are not described here again.
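The patent does not give the first loss function in closed form. Purely as an assumed illustration, a regression term over the five box parameters (center, size, rotation angle) could look like the sketch below; the (cx, cy, w, h, theta) encoding and the angle wrapping are the editor's assumptions, not the patented formula.

```python
import torch
import torch.nn.functional as F

def rotated_box_loss(pred, target):
    """One plausible form of a loss over detection frames with rotation
    angles: smooth-L1 regression on (cx, cy, w, h) plus a wrapped angle
    term so that theta and theta + pi are not over-penalised (assumption).
    """
    loc_loss = F.smooth_l1_loss(pred[..., :4], target[..., :4])
    d_theta = pred[..., 4] - target[..., 4]
    d_theta = torch.atan2(torch.sin(d_theta), torch.cos(d_theta))  # wrap angle
    angle_loss = F.smooth_l1_loss(d_theta, torch.zeros_like(d_theta))
    return loc_loss + angle_loss
```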
After the training of the target detection model is completed, the target detection model can be used for detecting the target of the input shelf image, and the spatial information of the first detection frame for performing target labeling on the input shelf image is obtained, wherein the spatial information comprises the rotation angle. And the rotating angle of the first detection frame is consistent with the rotating angle of the target object marked by the first detection frame. Further, a partial image corresponding to the first detection frame may be extracted from the shelf image based on the spatial information including the rotation angle, the partial image also having the rotation angle. Further, a partial image having a rotation angle may be input to a target recognition model supporting multi-angle target recognition to recognize a target object included in the partial image.
In this embodiment, before the local image with the rotation angle is input into the target recognition model supporting multi-angle target recognition, model training may be further performed on the target recognition model so that it can recognize multi-angle targets. Optionally, a ReID (re-identification) method may be used to train the target recognition model; for a specific implementation, reference may be made to the relevant contents of the foregoing embodiments, which are not repeated here.
In this embodiment, the multi-angle images of the multiple sample commodities can be input into the trained first feature extraction layer to obtain image feature vectors corresponding to the multi-angle images, respectively, and the image feature vectors are used as commodity feature vectors in a commodity feature vector set (marked as a first commodity feature vector set), and a first commodity feature vector set is constructed according to the commodity feature vectors in the first commodity feature vector set; and establishing a corresponding relation between each commodity feature vector in the first commodity feature vector set and the commodity identification. Correspondingly, the multi-angle images of the various sample commodities can be input into a trained second feature extraction layer to obtain image feature vectors corresponding to the multi-angle images respectively, the image feature vectors are used as commodity feature vectors in a commodity feature vector set (marked as a second commodity feature vector set), and a second commodity feature vector set is constructed according to the commodity feature vectors in the second commodity feature vector set; and establishing a corresponding relation between each commodity feature vector in the second commodity feature vector set and the commodity identification. Based on the first target identification submodel and the second target identification submodel, because the fineness of the image feature vector extracted by the second target identification submodel is greater than that of the image feature vector extracted by the first target identification submodel, a local image with a rotation angle can be input into the first target identification submodel to identify a target object contained in the local image; and when the first target identification submodel cannot identify the target object contained in the local image, inputting the local image with the rotation angle into the second target identification submodel to identify the target object contained in the local image.
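The construction of a commodity feature vector set and its vector-to-identification correspondence can be pictured as follows; `feature_extractor` stands for the trained first (or second) feature extraction layer and is a hypothetical callable introduced for the example.

```python
import numpy as np

def build_feature_vector_set(feature_extractor, sample_images):
    """Build a commodity feature vector set plus the vector->SKU mapping.

    Sketch: `sample_images` maps a commodity identification (SKU id) to its
    multi-angle images; `feature_extractor` returns one vector per image.
    """
    vectors, sku_of_row = [], []
    for sku_id, images in sample_images.items():
        for image in images:           # e.g. top, head-up and bottom views
            v = feature_extractor(image)
            vectors.append(v / np.linalg.norm(v))  # L2-normalise
            sku_of_row.append(sku_id)  # correspondence vector -> identification
    return np.stack(vectors), sku_of_row
```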
In this embodiment, the first target recognition submodel may include: the multi-angle image recognition system comprises a first feature extraction layer supporting feature extraction of multi-angle images and a first target recognition layer supporting multi-angle target recognition. Accordingly, when the target object included in the local image is identified based on the first target identification submodel, the local image having the rotation angle may be input to the first feature extraction layer to obtain an image feature vector (referred to as a first image feature vector) of the local image; and inputting the first image feature vector into a first target recognition layer, and recognizing a target object contained in the local image.
Optionally, in the first target recognition layer, the similarity between the first image feature vector of the local image and the commodity feature vectors in the first commodity feature vector set may be calculated; M commodity feature vectors are selected from the first commodity feature vector set in descending order of similarity to the first image feature vector; then, the commodity identifications corresponding to the M commodity feature vectors are determined according to the correspondence between commodity feature vectors and commodity identifications in the first commodity feature vector set. Further, let Q be the number of commodity feature vectors, among the M, whose commodity identification is the same as that of the commodity feature vector with the maximum similarity; if Q ≥ N, that commodity identification is used as the identifier of the target object contained in the local image. In this embodiment, M and N are integers with M ≥ 2 and 1 ≤ N ≤ M-1; the specific values of M and N are not limited, and optionally M=5 and N=3, but this is not limiting.
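A minimal sketch of this first recognition layer is given below, assuming L2-normalised embeddings so that the dot product equals cosine similarity; M=5 and N=3 are the illustrative values mentioned above.

```python
import numpy as np

def first_stage_identify(query_vec, gallery, sku_of_row, m=5, n=3):
    """First recognition layer: top-M retrieval plus the Q >= N vote.

    `gallery` and `sku_of_row` come from build_feature_vector_set above;
    the patent fixes only M >= 2 and 1 <= N <= M-1.
    """
    sims = gallery @ (query_vec / np.linalg.norm(query_vec))
    top_m = np.argsort(-sims)[:m]        # M most similar commodity vectors
    best_sku = sku_of_row[top_m[0]]      # identification of the best match
    q = sum(1 for i in top_m if sku_of_row[i] == best_sku)
    if q >= n:                           # enough agreement: accept the SKU
        return best_sku
    return None  # defer to the second target recognition submodel
```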
Accordingly, if Q < N, it is determined that the first target recognition submodel cannot identify the target object contained in the partial image. Further, the partial image with the rotation angle may be input into the second target recognition submodel to recognize the target object contained in the partial image.
Wherein the second target recognition submodel may include: a second feature extraction layer supporting feature extraction of multi-angle images and a second target identification layer supporting multi-angle target identification. Correspondingly, if Q is less than N, the local image with the rotation angle can be input into a second feature extraction layer to obtain a second image feature vector of the local image; and inputting the second image feature vector into a second target recognition layer to recognize the target object contained in the partial image.
Optionally, in the second target recognition layer, according to the correspondence between commodity feature vectors and commodity identifications in the second commodity feature vector set, the first target commodity feature vectors corresponding to the commodity identifications of the M commodity feature vectors are obtained from the second commodity feature vector set; the similarity between the second image feature vector and each first target commodity feature vector is calculated; and if there exists, among the first target commodity feature vectors, a second target commodity feature vector whose similarity to the second image feature vector is greater than or equal to a set similarity threshold, the identifier of the target object contained in the local image is determined according to the second target commodity feature vector and the correspondence between commodity feature vectors and commodity identifications in the second commodity feature vector set.
Optionally, the commodity feature vector with the maximum similarity to the second image feature vector is determined from the second target commodity feature vectors; then, according to the correspondence between commodity feature vectors and commodity identifications in the second commodity feature vector set, the commodity identification corresponding to that commodity feature vector is determined as the identifier of the target object contained in the local image.
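The second recognition layer can then be sketched as a re-ranking step restricted to the commodity identifications proposed by the first layer; the 0.8 similarity threshold below is an assumed placeholder for the "set similarity threshold", not a value from the patent.

```python
import numpy as np

def second_stage_identify(query_vec2, gallery2, sku_of_row2,
                          candidate_skus, sim_threshold=0.8):
    """Second recognition layer: re-rank only the candidate SKUs from the
    first stage, using the finer-grained feature vectors (sketch only)."""
    q = query_vec2 / np.linalg.norm(query_vec2)
    rows = [i for i, sku in enumerate(sku_of_row2) if sku in candidate_skus]
    sims = gallery2[rows] @ q            # cosine similarity to candidates
    best = int(np.argmax(sims))
    if sims[best] >= sim_threshold:      # set similarity threshold reached
        return sku_of_row2[rows[best]]
    return None  # no sufficiently similar commodity feature vector
```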
For other embodiments of the second target recognition layer for recognizing the target object, reference may be made to the relevant contents of the robot embodiment described above, and details are not described herein again.
The shelf image in the above target identification process is any one frame acquired by the robot while moving along the shelf. In practical applications, the robot acquires multiple frames of shelf images during its movement, and these frames may partially overlap. In this embodiment, feature points of the multiple frames of shelf images can also be extracted. A feature point is a local expression of an image's characteristics and reflects local specificity of the shelf image. For the specific implementation of extracting feature points from multiple shelf images, reference may be made to the relevant contents of the robot embodiment described above, which are not repeated here.
Further, the multi-frame shelf images can be subjected to de-duplication processing according to the feature points of the multi-frame shelf images. Optionally, calculating similarity between feature points of a first shelf image and a second shelf image, which are adjacent to each other in any acquisition time, in the multi-frame shelf image; calculating a perspective transformation matrix between the first shelf image and the second shelf image according to the similarity between the characteristic points of the first shelf image and the second shelf image; then, performing affine transformation on the first shelf image and the second shelf image according to the perspective transformation matrix to determine an overlapping area of the first shelf image and the second shelf image; and carrying out deduplication processing on the overlapping area of the first shelf image and the second shelf image, so as to realize deduplication processing on the first shelf image and the second shelf image.
Furthermore, the target object contained in the multi-frame shelf image can be subjected to de-duplication processing according to the de-duplication processing result of the multi-frame shelf image, so that the target detection and identification of the whole shelf can be realized.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 201 and 202 may be device a; for another example, the execution subject of step 201 may be device a, and the execution subject of step 202 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 201, 202, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
Accordingly, embodiments of the present application also provide a computer-readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the above-mentioned object recognition method.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer equipment can be terminal equipment such as a smart phone and a computer, and can also be server-side equipment. The server device may be a single server device, a cloud server array, or a Virtual Machine (VM) running in the cloud server array. In addition, the server device may also refer to other computing devices with corresponding service capabilities, such as a terminal device (running a service program) such as a computer.
As shown in fig. 3, the computer apparatus includes: a memory 30a and a processor 30b. The memory 30a is used to store a computer program. The processor 30b is coupled to the memory 30a for executing a computer program for: acquiring a shelf image acquired by the robot in the process of moving along the shelf; inputting the goods shelf image into a target detection model to obtain space information of a first detection frame for carrying out target labeling on the goods shelf image; extracting a local image corresponding to the first detection frame from the shelf image according to the spatial information of the first detection frame; the partial image is input to a target recognition model to recognize a target object included in the partial image.
In some embodiments, the spatial information of the first detection box includes: the angle of rotation. The spatial information of the first detection frame may further include: the center coordinates and the size of the first detection frame; alternatively, it comprises: the vertex coordinates of the first detection frame. Accordingly, the processor 30b, when identifying the target object included in the local image, is specifically configured to: the partial image with the rotation angle is input into a target recognition model supporting multi-angle target recognition to recognize a target object contained in the partial image.
In some embodiments, the computer device further comprises: a communication component 30c. Optionally, the processor 30b is specifically configured to, when taking the shelf image acquired by the robot during the movement along the shelf: shelf images captured by the robot during movement in a direction parallel to the shelf are received by the communication component 30c.
In some embodiments, the processor 30b is further configured to: before inputting a shelf image into a target detection model, acquiring a multi-angle image of a sample shelf in which various commodities are placed as a first sample image set; performing model training by using the first sample image set by taking the minimization of the first loss function as a training target to obtain a target detection model; the first loss function is determined according to the spatial information of the second detection frame containing the rotation angle obtained by model training and the spatial information of the reference detection frame for performing target labeling on the first sample image set before the model training; the spatial information of the reference detection frame includes a rotation angle.
Optionally, the processor 30b is further configured to: and performing target labeling on the first sample image set by using the spatial information of the detection frame containing the rotation angle to obtain the spatial information of the reference detection frame containing the rotation angle.
In other embodiments, the target recognition model includes: a first target identification submodel supporting multi-angle target identification and a second target identification submodel supporting multi-angle target identification; the fineness of the image feature vectors extracted by the second target identification submodel is greater than that of the image feature vectors extracted by the first target identification submodel. Accordingly, when identifying the target object included in the local image, the processor 30b is specifically configured to: input the partial image with the rotation angle into the first target identification submodel to identify the target object contained in it; and, in the case that the first target identification submodel cannot identify the target object contained in the local image, input the local image with the rotation angle into the second target identification submodel to identify it.
Optionally, the first target identification submodel comprises: a first feature extraction layer supporting feature extraction of multi-angle images and a first target identification layer supporting multi-angle target identification. Correspondingly, when identifying the target object included in the local image, the processor 30b is specifically configured to: input the local image with the rotation angle into the first feature extraction layer to obtain a first image feature vector of the local image; and input the first image feature vector into the first target identification layer to recognize the target object contained in the local image.
Further, when identifying the target object included in the local image, the processor 30b is specifically configured to: in the first target identification layer, calculate the similarity between the first image feature vector and the commodity feature vectors in the first commodity feature vector set; select M commodity feature vectors from the first commodity feature vector set in descending order of similarity to the first image feature vector; determine the commodity identifications corresponding to the M commodity feature vectors according to the correspondence between commodity feature vectors and commodity identifications in the first commodity feature vector set; and, if Q ≥ N, take the commodity identification corresponding to the commodity feature vector with the maximum similarity as the identifier of the target object contained in the local image. Q is the number of commodity feature vectors, among the M commodity feature vectors, whose commodity identification is the same as that of the commodity feature vector with the maximum similarity; M ≥ 2, 1 ≤ N ≤ M-1, and M and N are integers.
Optionally, the second target recognition submodel includes: a second feature extraction layer supporting feature extraction of multi-angle images and a second target identification layer supporting multi-angle target identification; the fineness of the image feature vectors extracted by the second feature extraction layer is greater than that of the image feature vectors extracted by the first feature extraction layer. Accordingly, the processor 30b is further configured to: if Q < N, input the local image with the rotation angle into the second feature extraction layer to obtain a second image feature vector of the local image, whose fineness is greater than that of the first image feature vector; and input the second image feature vector into the second target identification layer to recognize the target object contained in the local image.
Further, when identifying the target object included in the local image, the processor 30b is specifically configured to:
in the second target recognition layer, according to the corresponding relation between the commodity feature vectors and the commodity identifications in the second commodity feature vector set, acquiring first target commodity feature vectors corresponding to the commodity identifications corresponding to the M commodity feature vectors from the second commodity feature vector set;
calculating the similarity between the second image feature vector and the first target commodity feature vectors, where the fineness of the commodity feature vectors in the second commodity feature vector set is greater than that of the commodity feature vectors in the first commodity feature vector set; and, if there exists among the first target commodity feature vectors a second target commodity feature vector whose similarity to the second image feature vector is greater than or equal to a set similarity threshold, determining the identifier of the target object contained in the local image according to the second target commodity feature vector and the correspondence between commodity feature vectors and commodity identifications in the second commodity feature vector set.
Further, when determining the identifier of the target object included in the local image, the processor 30b is specifically configured to: determine the commodity feature vector with the maximum similarity to the second image feature vector from the second target commodity feature vectors; and determine, according to the correspondence between commodity feature vectors and commodity identifications in the second commodity feature vector set, the commodity identification corresponding to that commodity feature vector, as the identifier of the target object contained in the local image.
Optionally, the processor 30b is further configured to: acquire multi-angle images of various sample commodities before inputting the local image with the rotation angle into the first feature extraction layer; acquire images of the same sample commodity under the same brand from the multi-angle images as first positive sample image pairs; acquire images of sample commodities of different brands, or of different categories under the same brand, as first negative sample image pairs; and train the first training model with first triplets formed from the first positive and negative sample image pairs, taking minimization of the second loss function as the training target, to obtain the first feature extraction layer. The first training model comprises: a first initial network model and a classifier for training the first feature extraction layer. The second loss function consists of a first triplet loss function, a first center loss function, and a first commodity-category loss; the first triplet loss function and the first center loss function are determined from the image feature vectors output by the first initial network model in each training round; the first commodity-category loss is determined from the difference between the commodity categories predicted for each first triplet and the actual commodity categories contained in it.
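As an assumed illustration of such a composite objective, the sketch below combines a triplet loss, a center loss, and a commodity-category (classification) loss in PyTorch; the margin and weighting are hypothetical, and `centers` denotes one learnable center per commodity category maintained during training.

```python
import torch
import torch.nn.functional as F

def composite_loss(anchor, positive, negative, logits, labels,
                   centers, margin=0.3, w_center=0.05):
    """A plausible form of the second loss function: triplet loss +
    center loss + commodity-category loss (sketch; weights are assumed).

    anchor/positive/negative: (B, D) embeddings from the network;
    logits/labels: classifier outputs and category ids for the anchors;
    centers: (num_categories, D) learnable per-category centers.
    """
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    center = ((anchor - centers[labels]) ** 2).sum(dim=1).mean()  # center loss
    category = F.cross_entropy(logits, labels)  # commodity-category loss
    return triplet + w_center * center + category
```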
Optionally, the multi-angle image of each sample commodity comprises: top view, head up view and bottom view of the sample good.
Optionally, the processor 30b is further configured to: inputting the multi-angle images of the sample commodity into the trained first feature extraction layer to obtain image feature vectors corresponding to the multi-angle images of the sample commodity respectively, wherein the image feature vectors are used as commodity feature vectors in a first commodity feature vector set; constructing a first commodity feature vector set according to the commodity feature vectors in the first commodity feature vector set; and establishing a corresponding relation between each commodity feature vector in the first commodity feature vector set and the commodity identification.
Optionally, the processor 30b is further configured to: before inputting the second image feature vector into the second target identification layer, acquire, from the multi-angle images, images of sample commodities of the same specification and same type under the same brand as second positive sample image pairs, and images of sample commodities of different specifications but the same type under the same brand as second negative sample image pairs; and train the second training model with triplets formed from the second positive and negative sample image pairs, taking minimization of the third loss function as the training target, to obtain the second feature extraction layer. The second training model comprises: the first feature extraction layer and a classifier. The third loss function consists of a second triplet loss function, a second center loss function, and a second commodity-category loss; the second triplet loss function and the second center loss function are determined from the image feature vectors output in each training round; the second commodity-category loss is determined from the difference between the commodity categories that the classifier predicts for the second positive and negative sample image pairs and the actual commodity categories they contain.
Accordingly, the processor 30b is further configured to: inputting multi-angle images of various sample commodities into a trained second feature extraction layer to obtain image feature vectors corresponding to the multi-angle images respectively, wherein the image feature vectors are used as commodity feature vectors in a second commodity feature vector set; constructing a second commodity feature vector set according to the commodity feature vectors in the second commodity feature vector set; and establishing a corresponding relation between each commodity feature vector in the second commodity feature vector set and the commodity identification.
In still other embodiments, the number of shelf images is multiple frames; the processor 30b is further configured to: extracting feature points of the multi-frame goods shelf image; according to the characteristic points of the multi-frame shelf images, carrying out duplicate removal processing on the multi-frame shelf images; and according to the result of the duplicate removal processing on the multi-frame shelf images, carrying out the duplicate removal processing on the target objects contained in the multi-frame shelf images.
Further, when the processor 30b performs deduplication processing on multiple frames of shelf images, it is specifically configured to: calculating the similarity between the characteristic points of the first shelf image and the second shelf image aiming at the first shelf image and the second shelf image which are adjacent at any acquisition time in the multi-frame shelf images; calculating a perspective transformation matrix between the first shelf image and the second shelf image according to the similarity between the characteristic points of the first shelf image and the second shelf image; performing affine transformation on the first shelf image and the second shelf image according to the perspective transformation matrix to determine an overlapping area of the first shelf image and the second shelf image; and performing deduplication processing on the overlapping area of the first shelf image and the second shelf image.
In some optional embodiments, as shown in fig. 3, the computer device may further include: optional components such as power component 30d, display 30e, and audio component 30 f. Only some of the components shown in fig. 3 are schematically depicted, and it is not meant that the computer device must include all of the components shown in fig. 3, nor that the computer device only includes the components shown in fig. 3.
The computer equipment provided by the embodiment can acquire the shelf images acquired by the robot in the process of moving along the shelf. In the stage of target detection of the shelf image, inputting the shelf image into a target detection model to obtain spatial information of a detection frame for target labeling of the shelf image, and in the stage of target identification, extracting a local image corresponding to the detection frame from the shelf image according to the spatial information containing a rotation angle; and the local image is input into the target recognition model to recognize the target object in the local image, so that the automatic recognition of the target object is realized, the target recognition efficiency is improved, and the commodity counting efficiency based on the target recognition result is improved.
In embodiments of the present application, the memory is used to store computer programs and may be configured to store various other data to support operations on the device on which it resides. Wherein the processor may execute a computer program stored in the memory to implement the corresponding control logic. For the implementation of the memory, reference may be made to the related contents of the above embodiments, which are not described herein again.
In the embodiments of the present application, the processor may be any hardware processing device capable of executing the above method logic. Optionally, the processor may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a Micro Controller Unit (MCU); it may also be a programmable device such as a Field-Programmable Gate Array (FPGA), a Programmable Array Logic device (PAL), a Generic Array Logic device (GAL), or a Complex Programmable Logic Device (CPLD); or an Advanced RISC Machine (ARM) processor, a System on Chip (SoC), or the like, but is not limited thereto.
In embodiments of the present application, the communication component is configured to facilitate wired or wireless communication between the device in which it is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may also be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In the embodiment of the present application, the display assembly may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display assembly includes a touch panel, the display assembly may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
In embodiments of the present application, a power supply component is configured to provide power to various components of the device in which it is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
In embodiments of the present application, the audio component may be configured to output and/or input audio signals. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals. For example, for devices with language interaction functionality, voice interaction with a user may be enabled through an audio component, and so forth.
It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (16)

1. An artificial intelligence-based target identification method is characterized by comprising the following steps:
acquiring a shelf image acquired by the robot in the process of moving along a shelf;
inputting the shelf image into a target detection model to obtain space information of a first detection frame for carrying out target labeling on the shelf image;
extracting a local image corresponding to the first detection frame from the shelf image according to the spatial information of the first detection frame;
inputting the local image into a target recognition model to recognize a target object contained in the local image.
2. The method of claim 1, wherein the spatial information of the first detection box comprises: the center coordinates, the size and the rotation angle of the first detection frame; or the vertex coordinates and the rotation angle of the first detection frame;
the inputting the local image into a target recognition model to recognize a target object contained in the local image includes:
inputting the partial image with the rotation angle into a target recognition model supporting multi-angle target recognition to recognize a target object included in the partial image.
3. The method of claim 2, wherein the target recognition model comprises: the system comprises a first target identification submodel supporting multi-angle target identification and a second target identification submodel supporting multi-angle target identification; the fineness of the image characteristic vectors extracted by the second target identification submodel is greater than that of the image characteristic vectors extracted by the first target identification submodel;
the inputting the partial image with the rotation angle into a target recognition model supporting multi-angle target recognition to recognize the target object contained in the partial image comprises:
inputting the partial image with the rotation angle into the first target identification submodel to identify a target object contained in the partial image;
and when the first target identification submodel cannot identify the target object contained in the local image, inputting the local image with the rotation angle into the second target identification submodel to identify the target object contained in the local image.
4. The method of claim 3, wherein the first target sub-model comprises: the system comprises a first feature extraction layer supporting feature extraction of multi-angle images and a first target identification layer supporting multi-angle target identification;
the inputting the partial image with the rotation angle into the first target identification submodel to identify the target object contained in the partial image comprises:
inputting the local image with the rotation angle into the first feature extraction layer to obtain a first image feature vector of the local image;
and inputting the first image feature vector into the first target identification layer to identify a target object contained in the local image.
5. The method of claim 4, wherein inputting the first image feature vector into the first target recognition layer to identify a target object contained in the local image comprises:
in the first target recognition layer, calculating the similarity between the first image feature vector and the commodity feature vectors in a first commodity feature vector set;
selecting M commodity feature vectors from the first commodity feature vector set according to the sequence of similarity with the first image feature vector from large to small;
determining commodity identifications corresponding to the M commodity feature vectors according to the corresponding relation between the commodity feature vectors and the commodity identifications in the first commodity feature vector set;
and if Q is greater than or equal to N, taking the commodity identification corresponding to the commodity feature vector with the maximum similarity as the identification of the target object contained in the local image;
wherein Q is the number of commodity feature vectors, among the M commodity feature vectors, whose commodity identification is the same as that of the commodity feature vector with the maximum similarity; M ≥ 2, 1 ≤ N ≤ (M-1), and M and N are integers.
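The retrieval-and-vote logic of claim 5 can be sketched as follows, assuming cosine similarity and an in-memory gallery (both are assumptions; the claim fixes only the top-M selection and the Q ≥ N test):

```python
import numpy as np

def coarse_identify(query_vec, gallery_vecs, gallery_ids, M=5, N=3):
    """Return a commodity identification if the top-M neighbours agree."""
    # Cosine similarity between the query and every gallery vector.
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = g @ q
    top = np.argsort(sims)[::-1][:M]      # indices of the M most similar
    best_id = gallery_ids[top[0]]         # id of the most similar vector
    # Q = how many of the top-M share the best match's commodity id.
    Q = sum(1 for i in top if gallery_ids[i] == best_id)
    return best_id if Q >= N else None    # None: defer to the fine submodel
```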
6. The method of claim 5, wherein the second target recognition submodel comprises: a second feature extraction layer supporting feature extraction of multi-angle images and a second target identification layer supporting multi-angle target identification; the fineness of the image feature vectors extracted by the second feature extraction layer is greater than that of the image feature vectors extracted by the first feature extraction layer;
the method further comprises the following steps:
if Q is less than N, inputting the local image with the rotation angle into the second feature extraction layer to obtain a second image feature vector of the local image; the fineness of the second image feature vector is greater than that of the first image feature vector;
and inputting the second image feature vector into the second target identification layer to identify a target object contained in the local image.
7. The method of claim 6, wherein inputting the second image feature vector into the second target recognition layer to identify a target object contained in the local image comprises:
in the second target recognition layer, acquiring, from the second commodity feature vector set, first target commodity feature vectors corresponding to the commodity identifications of the M commodity feature vectors, according to the correspondence between the commodity feature vectors and the commodity identifications in the second commodity feature vector set;
calculating the similarity between the second image feature vector and each first target commodity feature vector; the fineness of the commodity feature vectors in the second commodity feature vector set is greater than that of the commodity feature vectors in the first commodity feature vector set;
and if there exists, among the first target commodity feature vectors, a second target commodity feature vector whose similarity to the second image feature vector is greater than or equal to a set similarity threshold, determining the identification of the target object contained in the local image according to the second target commodity feature vector and the correspondence between the commodity feature vectors and the commodity identifications in the second commodity feature vector set.
8. The method according to claim 7, wherein the determining the identification of the target object contained in the local image according to the second target commodity feature vector and the correspondence between the commodity feature vectors and the commodity identifications in the second commodity feature vector set comprises:
determining, from the second target commodity feature vectors, the commodity feature vector with the maximum similarity to the second image feature vector;
and determining, according to the correspondence between the commodity feature vectors and the commodity identifications in the second commodity feature vector set, the commodity identification corresponding to that commodity feature vector as the identification of the target object contained in the local image.
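A hedged sketch of the fine-grained re-ranking of claims 7 and 8: only gallery vectors whose commodity identifications appeared among the coarse top-M candidates are compared, and a set similarity threshold gates the decision (the threshold value and the use of cosine similarity are illustrative assumptions):

```python
import numpy as np

def fine_identify(query_vec, fine_vecs, fine_ids, candidate_ids, threshold=0.8):
    """Re-rank coarse candidates using the finer feature vector set."""
    fine_ids = np.asarray(fine_ids)
    # First target commodity feature vectors: the fine vectors whose ids
    # appeared among the coarse top-M candidates.
    mask = np.isin(fine_ids, list(candidate_ids))
    if not mask.any():
        return None
    q = query_vec / np.linalg.norm(query_vec)
    cand = fine_vecs[mask]
    sims = (cand / np.linalg.norm(cand, axis=1, keepdims=True)) @ q
    best = int(np.argmax(sims))
    # Accept only if the best candidate clears the set similarity threshold.
    return fine_ids[mask][best] if sims[best] >= threshold else None
```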
9. The method according to claim 7, further comprising, before inputting the partial image having the rotation angle into the first feature extraction layer:
acquiring multi-angle images of various sample commodities;
inputting the multi-angle images of each sample commodity into the trained first feature extraction layer to obtain the image feature vectors respectively corresponding to the multi-angle images of the sample commodity, as commodity feature vectors;
constructing the first commodity feature vector set from the obtained commodity feature vectors;
and establishing a corresponding relation between each commodity feature vector in the first commodity feature vector set and a commodity identifier.
10. The method of claim 7, wherein before inputting the second image feature vector into the second target identification layer, the method further comprises:
inputting the multi-angle images of each sample commodity into the trained second feature extraction layer to obtain the image feature vectors respectively corresponding to the multi-angle images of the sample commodity, as commodity feature vectors;
constructing the second commodity feature vector set from the obtained commodity feature vectors;
and establishing a corresponding relation between each commodity feature vector in the second commodity feature vector set and the commodity identification.
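Claims 9 and 10 build the two commodity feature vector sets the same way at different fineness. A minimal sketch, with `extract` standing in for the trained feature extraction layer (a hypothetical callable):

```python
import numpy as np

def build_gallery(samples, extract):
    """samples: iterable of (commodity_id, [multi-angle images])."""
    vecs, ids = [], []
    for commodity_id, views in samples:
        for view in views:              # e.g. top, head-up and bottom views
            vecs.append(extract(view))  # one feature vector per view
            ids.append(commodity_id)    # vector <-> identification mapping
    return np.stack(vecs), np.asarray(ids)
```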
11. The method of claim 9, wherein the multi-angle images of each sample commodity comprise: a top view, a head-up (eye-level) view and a bottom view of the sample commodity.
12. The method of any one of claims 1-11, wherein said obtaining shelf images captured by the robot during movement along the shelf comprises:
controlling the robot to move in a direction parallel to the shelf; and controlling the robot to collect the shelf images during the movement.
13. The method of any one of claims 1-11, wherein the shelf images comprise a plurality of frames; the method further comprises:
extracting feature points of the multi-frame shelf image;
according to the feature points of the multi-frame shelf images, carrying out duplicate removal processing on the multi-frame shelf images;
and according to the result of the duplicate removal processing of the multi-frame shelf images, carrying out the duplicate removal processing on the target objects contained in the multi-frame shelf images.
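One plausible realization of the feature-point de-duplication of claim 13, assuming ORB keypoints and brute-force matching from OpenCV (the claim does not name a feature detector, and the match-ratio threshold is an assumption):

```python
import cv2

def frames_overlap(frame_a, frame_b, min_match_ratio=0.25):
    """Treat two shelf frames as duplicates if enough keypoints match."""
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return False
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    # Duplicate if a large share of the smaller keypoint set matches.
    return len(matches) >= min_match_ratio * min(len(kp_a), len(kp_b))
```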
14. A robot, comprising: a machine body; the mechanical body is provided with a camera, a memory and a processor; the memory is used for storing a computer program;
the camera is used for collecting shelf images in the process that the robot moves along the shelf;
the processor is coupled to the memory for executing the computer program for: inputting the shelf image into a target detection model to obtain spatial information of a first detection frame for carrying out target labeling on the shelf image; extracting a local image corresponding to the first detection frame from the shelf image according to the spatial information of the first detection frame; and inputting the local image into a target recognition model to recognize a target object contained in the local image.
15. A computer device, comprising: a memory and a processor; the memory is used for storing a computer program;
the processor is coupled to the memory for executing the computer program for performing the steps in the method of any of claims 1-13.
16. A computer-readable storage medium having stored thereon computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of any one of claims 1-13.
CN202011011330.2A 2020-09-23 2020-09-23 Target identification method, device and storage medium based on artificial intelligence Pending CN115840417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011011330.2A CN115840417A (en) 2020-09-23 2020-09-23 Target identification method, device and storage medium based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011011330.2A CN115840417A (en) 2020-09-23 2020-09-23 Target identification method, device and storage medium based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN115840417A true CN115840417A (en) 2023-03-24

Family

ID=85574059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011011330.2A Pending CN115840417A (en) 2020-09-23 2020-09-23 Target identification method, device and storage medium based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN115840417A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115571A (en) * 2023-10-25 2023-11-24 成都阿加犀智能科技有限公司 Fine-grained intelligent commodity identification method, device, equipment and medium
CN117115571B (en) * 2023-10-25 2024-01-26 成都阿加犀智能科技有限公司 Fine-grained intelligent commodity identification method, device, equipment and medium

Similar Documents

Publication Publication Date Title
US11216971B2 (en) Three-dimensional bounding box from two-dimensional image and point cloud data
CN111328396B (en) Pose estimation and model retrieval for objects in images
CN107833083B (en) Goods order processing method, device, server, shopping terminal and system
Liu et al. A smart unstaffed retail shop based on artificial intelligence and IoT
CN108734162B (en) Method, system, equipment and storage medium for identifying target in commodity image
Zhang et al. Toward new retail: A benchmark dataset for smart unmanned vending machines
CN103514432B (en) Face feature extraction method, equipment and computer program product
CN110188719B (en) Target tracking method and device
CN109784385A (en) A kind of commodity automatic identifying method, system, device and storage medium
US9076062B2 (en) Feature searching along a path of increasing similarity
CN103426172A (en) Vision-based target tracking method and device
US11417001B1 (en) Detecting discrete optical patterns using depth estimation
CN115249356B (en) Identification method, device, equipment and storage medium
CN112749350A (en) Information processing method and device for recommended object, storage medium and electronic equipment
CN111428743B (en) Commodity identification method, commodity processing device and electronic equipment
CN114556425A (en) Positioning method, positioning device, unmanned aerial vehicle and storage medium
CN114419289A (en) Unity-based virtual scene shelf display method and system
CN112232311A (en) Face tracking method and device and electronic equipment
CN115601672A (en) VR intelligent shop patrol method and device based on deep learning
CN114782494A (en) Dynamic target analysis method, device, equipment and storage medium
CN115840417A (en) Target identification method, device and storage medium based on artificial intelligence
CN114511792B (en) Unmanned aerial vehicle ground detection method and system based on frame counting
CN110955243A (en) Travel control method, travel control device, travel control apparatus, readable storage medium, and mobile device
CN109816715A (en) Determine the method, equipment and storage medium of type of vehicle
CN114360057A (en) Data processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination