CN111739060A - Identification method, device and storage medium - Google Patents

Identification method, device and storage medium

Info

Publication number
CN111739060A
CN111739060A (application CN201910706446.9A)
Authority
CN
China
Prior art keywords
image
images
target
objects
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910706446.9A
Other languages
Chinese (zh)
Other versions
CN111739060B (en)
Inventor
刘武
鲍慊
梅涛
阮威健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910706446.9A priority Critical patent/CN111739060B/en
Publication of CN111739060A publication Critical patent/CN111739060A/en
Application granted granted Critical
Publication of CN111739060B publication Critical patent/CN111739060B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application disclose an identification method, a device, and a storage medium. The method includes: obtaining at least two images; obtaining key point information of each object in each image; determining a feature sequence of each object in each image based at least on the key point information of each object in each image; obtaining, based at least on the feature sequences of the objects in the images, a first parameter between objects in the at least two images, the first parameter characterizing the degree of similarity between objects from different images of the at least two images; and determining a target object based at least on the first parameter between objects in the at least two images, the target object being a similar object across the at least two images.

Description

Identification method, device and storage medium
Technical Field
The present application relates to identification technologies, and in particular, to an identification method, device, and storage medium.
Background
Object tracking obtains, by computer vision techniques, the position and tracking identity of each object in a video, such as the position and tracking number of each pedestrian. In the related art, tracking the objects appearing in two adjacent frames of a video generally requires two steps, for example pedestrian recognition followed by association of the tracking data. Taking the current frame and the previous frame as an example, these two steps must at least distinguish pedestrians in the current frame who are identical to pedestrians in the previous frame, pedestrians who newly appear in the current frame, and pedestrians who appeared in the previous frame but have disappeared in the current frame. To ensure the accuracy of the two steps, separate network models are usually used for pedestrian recognition and for the association of tracking data. It can be understood that before these two network models are used, each must be trained separately, and only the trained models are then used for pedestrian recognition and for associating tracking data. Adopting two independent network models to realize object tracking therefore has two drawbacks: on the one hand, two different network models have to be trained, which invisibly increases the amount of computation; on the other hand, the output of the pedestrian-recognition network model is usually the input of the tracking-data association model, and because the two independent models are processed in different ways, the association model may not receive the input it expects, that is, the two network models cannot be effectively connected.
Disclosure of Invention
To solve the above technical problem, embodiments of the present application provide an identification method, an identification device, and a storage medium, which at least avoid the problems in the related art that the amount of computation is large and the models cannot be effectively connected when two independent network models are adopted.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an identification method, which comprises the following steps:
obtaining at least two images;
obtaining key point information of each object in each image;
determining a characteristic sequence of each object in each image at least based on the key point information of each object in each image;
at least obtaining a first parameter between the objects in the at least two images based on the characteristic sequence of the objects in the images, wherein the first parameter is used for representing the similarity degree between the objects of different images in the at least two images;
at least a target object is determined based on a first parameter between objects in the at least two images, the target object being a similar object between the at least two images.
In the above scheme, the method includes:
obtaining at least two global feature maps for respective objects in respective images, wherein different global feature maps of the same object are at least partially different;
correspondingly, the determining the feature sequence of each object in each image based on at least the key point information of each object in each image includes:
and determining the local feature sequence of each object based on the at least two global feature maps and the key point information of each object.
In the foregoing solution, the determining a local feature sequence of each object based on at least two global feature maps and the keypoint information of each object includes:
the key point information of each object is at least represented as position information of at least two key parts of each object;
obtaining each target image of each object from each global feature map of each object, wherein the target image of each object is an image corresponding to the position relation of at least two key parts of each object in the global feature map of each object;
based on the respective target images of the respective objects, a local feature sequence of the respective objects is determined.
In the above solution, the obtaining at least a first parameter between objects in the at least two images based on the feature sequence of each object in each image includes:
combining the feature sequences of the objects in the t_i-th image with the feature sequences of the objects in the t_{i-1}-th image pairwise to obtain feature tensor information, wherein the t_i-th image and the t_{i-1}-th image are two adjacent images of the at least two images;
performing convolution processing on the feature tensor information at least twice to obtain two target matrices, wherein at least some elements of each target matrix represent the degree of similarity between any two objects of the t_i-th image and the t_{i-1}-th image;
and obtaining, based at least on the two target matrices, first parameters between the objects in the first image and the objects in the second image.
In the foregoing solution, the obtaining at least a first parameter between each object in the first image and each object in the second image based on the two target matrices includes:
performing a normalized exponential function (softmax) operation column-wise on a first target matrix of the two target matrices to obtain a first matching probability matrix, wherein at least some elements of the first matching probability matrix represent matching probabilities from objects of the t_i-th image to objects of the t_{i-1}-th image;
performing a softmax operation row-wise on a second target matrix of the two target matrices to obtain a second matching probability matrix, wherein at least some elements of the second matching probability matrix represent matching probabilities from objects of the t_{i-1}-th image to objects of the t_i-th image;
and taking, for each position, the larger of the element values at the same position in the first matching probability matrix and the second matching probability matrix as the first parameter between the corresponding objects of the first image and the second image.
An embodiment of the present application provides an identification device, which includes:
a first obtaining unit configured to obtain at least two images;
a second obtaining unit configured to obtain key point information of each object in each image;
a first determining unit, configured to determine a feature sequence of each object in each image based on at least the key point information of each object in each image;
a third obtaining unit, configured to obtain at least a first parameter between objects in the at least two images based on a feature sequence of each object in each image, where the first parameter is used to characterize a degree of similarity between objects in different images of the at least two images;
a second determining unit, configured to determine at least a target object based on at least the first parameter between the objects in the at least two images, where the target object is a similar object between the at least two images.
In the above solution, the apparatus further includes:
a fourth obtaining unit, configured to obtain at least two global feature maps for each object in each image, where different global feature maps of the same object are at least partially different;
correspondingly, the first determining unit is configured to determine the local feature sequence of each object based on the at least two global feature maps and the keypoint information of each object.
In the foregoing solution, the first determining unit is configured to:
the key point information of each object is at least represented as position information of at least two key parts of each object;
obtaining each target image of each object from each global feature map of each object, wherein the target image of each object is an image corresponding to the position relation of at least two key parts of each object in the global feature map of each object;
based on the respective target images of the respective objects, a local feature sequence of the respective objects is determined.
In the foregoing scheme, the third obtaining unit is configured to:
combine the feature sequences of the objects in the t_i-th image with the feature sequences of the objects in the t_{i-1}-th image pairwise to obtain feature tensor information, wherein the t_i-th image and the t_{i-1}-th image are two adjacent images of the at least two images;
perform convolution processing on the feature tensor information at least twice to obtain two target matrices, wherein at least some elements of each target matrix represent the degree of similarity between any two objects of the t_i-th image and the t_{i-1}-th image;
and obtain, based at least on the two target matrices, first parameters between the objects in the first image and the objects in the second image.
In the foregoing scheme, the third obtaining unit is configured to:
perform a normalized exponential function (softmax) operation column-wise on a first target matrix of the two target matrices to obtain a first matching probability matrix, wherein at least some elements of the first matching probability matrix represent matching probabilities from objects of the t_i-th image to objects of the t_{i-1}-th image;
perform a softmax operation row-wise on a second target matrix of the two target matrices to obtain a second matching probability matrix, wherein at least some elements of the second matching probability matrix represent matching probabilities from objects of the t_{i-1}-th image to objects of the t_i-th image;
and take, for each position, the larger of the element values at the same position in the first matching probability matrix and the second matching probability matrix as the first parameter between the corresponding objects of the first image and the second image.
An embodiment of the application provides an identification device, which comprises a processor and a storage medium for storing a computer program; wherein the processor is configured to execute at least the aforementioned identification method when executing the computer program.
The storage medium stores a computer program, and the computer program performs at least the foregoing identification method when executed.
The identification method, device and storage medium provided by the embodiments of the present application involve: obtaining at least two images; obtaining key point information of each object in each image; determining a feature sequence of each object in each image based at least on the key point information of each object in each image; obtaining, based at least on the feature sequences of the objects in the images, a first parameter between objects in the at least two images, the first parameter characterizing the degree of similarity between objects from different images of the at least two images; and determining a target object based at least on the first parameter between objects in the at least two images, the target object being a similar object across the at least two images.
In the embodiment of the present application, the feature sequence of each object in each image is determined based on the key point information of each object in each image, the degree of similarity between the objects of different images is obtained based at least on these feature sequences, and which objects are similar objects across the at least two images is determined according to the obtained degree of similarity. Compared with the related-art scheme that uses two independent network models to realize object tracking, this method requires no hand-off between models and involves a smaller amount of computation. In addition, because the target object is determined based on the degree of similarity between objects in different images, the identification accuracy of the target object can be ensured.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a first embodiment of an identification method provided in the present application;
fig. 2 is a schematic flow chart illustrating an implementation of a second embodiment of the identification method provided in the present application;
FIG. 3 is a schematic diagram illustrating a network model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating the operation of the network model provided herein;
FIG. 5 is a schematic illustration of the eigen-representation tensors provided herein;
fig. 6 is a schematic structural diagram illustrating a first embodiment of an identification device according to the present application;
fig. 7 is a schematic structural diagram of a second embodiment of an identification device provided in the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application. In the present application, the embodiments and the features of the embodiments may be combined with each other arbitrarily provided there is no conflict. The steps illustrated in the flow charts of the figures may be performed in a computer system such as one executing a set of computer-executable instructions. Also, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one here.
In a first embodiment of the identification method provided in the present application, as shown in fig. 1, the method includes:
step 101: obtaining at least two images;
in this step, the at least two images may be adjacent images or non-adjacent images.
Step 102: obtaining key point information of each object in each image;
step 103: determining a characteristic sequence of each object in each image at least based on the key point information of each object in each image;
step 104: at least obtaining a first parameter between the objects in the at least two images based on the characteristic sequence of the objects in the images, wherein the first parameter is used for representing the similarity degree between the objects of different images in the at least two images;
step 105: at least a target object is determined based on a first parameter between objects in the at least two images, the target object being a similar object between the at least two images.
The entity performing steps 101-105 is any device that can be used to identify a target object.
In the foregoing solution, a feature sequence of each object in each image is determined based on the key point information of each object in each image, at least a similarity degree between each object in different images is obtained based on the feature sequence of the object, and which objects are similar objects between at least two images is determined according to the obtained similarity degree. Compared with the scheme of realizing object tracking by adopting two independent network models in the related technology, the method does not need the connection between the models and has small calculation amount. In addition, the target object is determined based on the similarity degree between the objects in different images, and the identification accuracy of the target object can be ensured.
In a second embodiment of the identification method provided in the present application, as shown in fig. 2, the method includes:
step 201: obtaining at least two images;
step 202: obtaining key point information of each object in each image;
step 203: obtaining at least two global feature maps for respective objects in respective images, wherein different global feature maps of the same object are at least partially different;
step 204: determining a local feature sequence of each object based on the at least two global feature maps and the key point information of each object;
step 205: at least obtaining a first parameter between each object in the at least two images based on the local feature sequence of each object in each image, wherein the first parameter is used for representing the similarity degree between the objects of different images in the at least two images;
step 206: at least a target object is determined based on a first parameter between objects in the at least two images, the target object being a similar object between the at least two images.
The entity performing steps 201-206 is any device that can be used to identify a target object.
In the foregoing solution, a local feature sequence of each object in each image is determined based on at least two global feature maps and keypoint information of each object in each image, at least a degree of similarity between each object in different images is obtained based on the local feature sequences of the objects, and which objects are similar objects between at least two images is determined according to the obtained degree of similarity. Compared with the scheme of realizing object tracking by adopting two independent network models in the related technology, the method does not need the connection between the models and has small calculation amount. In addition, the target object is determined based on the local features of the objects in each image, and the granularity is considered to be finer from the local features of the objects, so that the identification accuracy of the target object can be ensured.
Based on the foregoing second embodiment of the method, the determining a local feature sequence of each object based on at least two global feature maps and the keypoint information of each object includes:
the key point information of each object is at least represented as position information of at least two key parts of each object;
obtaining each target image of each object from each global feature map of each object, wherein the target image of each object is an image corresponding to the position relation of at least two key parts of each object in the global feature map of each object;
based on the respective target images of the respective objects, a local feature sequence of the respective objects is determined.
In this scheme, the local features of an object are determined by combining the object's global feature maps with the position information of its key parts, which ensures the accuracy of the local features; since the target object is determined from the local features of the objects, a finer granularity is considered, so the identification accuracy of the target object can be ensured.
In a first and/or second embodiment of the foregoing method, the obtaining at least a first parameter between the objects in the at least two images based on the feature sequences of the objects in the images includes:
combining the feature sequences of the objects in the t_i-th image with the feature sequences of the objects in the t_{i-1}-th image pairwise to obtain feature tensor information, wherein the t_i-th image and the t_{i-1}-th image are two adjacent images of the at least two images;
performing convolution processing on the feature tensor information at least twice to obtain two target matrices, wherein at least some elements of each target matrix represent the degree of similarity between any two objects of the t_i-th image and the t_{i-1}-th image;
and obtaining, based at least on the two target matrices, first parameters between the objects in the first image and the objects in the second image.
In the foregoing scheme, the first parameter is obtained based on the feature sequences of the objects in the adjacent images, so that the accuracy of obtaining the first parameter can be ensured.
In the foregoing solution, the obtaining at least a first parameter between each object in the first image and each object in the second image based on the two target matrices includes:
performing a normalized exponential function (softmax) operation column-wise on a first target matrix of the two target matrices to obtain a first matching probability matrix, wherein at least some elements of the first matching probability matrix represent matching probabilities from objects of the t_i-th image to objects of the t_{i-1}-th image;
performing a softmax operation row-wise on a second target matrix of the two target matrices to obtain a second matching probability matrix, wherein at least some elements of the second matching probability matrix represent matching probabilities from objects of the t_{i-1}-th image to objects of the t_i-th image;
and taking, for each position, the larger of the element values at the same position in the first matching probability matrix and the second matching probability matrix as the first parameter between the corresponding objects of the first image and the second image.
In the above scheme, some elements of the first matching probability matrix represent matching probabilities from objects of the t_i-th image to objects of the t_{i-1}-th image, so it can be viewed as a backward matching probability matrix. Some elements of the second matching probability matrix represent matching probabilities from objects of the t_{i-1}-th image to objects of the t_i-th image, so it can be viewed as a forward matching probability matrix. The first parameter is therefore obtained from two probability matrices, the forward matching probability matrix and the backward matching probability matrix, rather than from a single probability matrix, which improves the calculation accuracy of the first parameter and further ensures the identification accuracy of the target object.
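As a small illustration of taking the larger element at each position of the two matching probability matrices (the matrix values below are invented purely for this example and are not taken from the application):

```python
import numpy as np

# Hypothetical matching probability matrices between the objects of two images
# (rows: objects of one image, columns: objects of the other image).
A_backward = np.array([[0.3, 0.1],
                       [0.2, 0.7]])
A_forward = np.array([[0.8, 0.05],
                      [0.1, 0.6]])

# The first parameter between two corresponding objects is the larger of the
# element values at the same position in the two matching probability matrices.
first_parameter = np.maximum(A_backward, A_forward)
print(first_parameter)  # [[0.8 0.1] [0.2 0.7]]
```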
The technical solution of the embodiment of the present application will be described with reference to fig. 3 to 5.
As shown in fig. 3, an embodiment of the present application provides a network model for identifying a target object. For objects, such as pedestrians, appearing in two adjacent frames of images, the network model is used to obtain the matching probabilities between the pedestrians in the two frames. Based on these matching probabilities it can be determined which pedestrian(s) in the current frame are the same persons as which pedestrian(s) in the previous frame, which pedestrians newly appear in the current frame, and which pedestrians appeared in the previous frame but have disappeared in the current frame. That is, the network model is used to realize the matching between the pedestrians in two adjacent frames of images.
In the embodiment of the application, the network model at least comprises a feature learning network and a measurement network. It should be understood by those skilled in the art that before the feature learning network and the measurement network are used to identify the target object, the network model, specifically, the feature learning network and the measurement network, needs to be trained (training phase), and the trained network model is used to identify the target object (testing/applying phase).
The training phase and the application (test) phase are described separately below:
Training stage: it should be noted that, in the training phase, two adjacent frames of images, the t_{i-1}-th frame image and the t_i-th frame image, are used to train the network model.
The specific scheme is as follows:
step 500: collecting two adjacent frames of images, and carrying out human body detection on each collected (frame) image so as to identify all pedestrians in each frame of image;
in this step, the t_{i-1}-th frame image and the t_i-th frame image are acquired. As shown in FIG. 4, both the t_{i-1}-th frame image and the t_i-th frame image contain persons together with some background. Human body detection is carried out on each of the two frame images with a target detection method such as SSD (Single Shot MultiBox Detector), Faster R-CNN, or YOLO (You Only Look Once), so that the persons in the two frame images are identified and each pedestrian in each frame image is marked with a detection box.
Considering its high detection accuracy, Faster R-CNN is preferably adopted for human body detection.
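As an illustration only (the application does not prescribe a particular implementation), pedestrian detection with an off-the-shelf Faster R-CNN could be sketched as follows; the torchvision model, the COCO person label id (1), and the score threshold are assumptions of this sketch, not values from the application.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Minimal sketch: detect pedestrians in one frame with a pretrained Faster R-CNN.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_pedestrians(image_path, score_thresh=0.5):
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        output = model([image])[0]
    boxes = []
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        # COCO label 1 corresponds to "person" in torchvision's detection models.
        if label.item() == 1 and score.item() >= score_thresh:
            boxes.append(box.tolist())  # [x1, y1, x2, y2] detection box
    return boxes
```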
Step 501: calculating key point information of each pedestrian in each image of two adjacent frames of images;
in this step, key point information is calculated for each pedestrian in each frame image. Since the objects in this application scenario are pedestrians, the key point information of an object consists of the position coordinates of the pedestrian's body parts within that pedestrian's detection box. Specifically, the key point information of a pedestrian in the embodiment of the present application comprises the position coordinates, within the pedestrian's detection box, of 14 human body key points such as the head, the neck, the left and right shoulders, the left and right hips, the left and right elbows, the left and right knees, and the left and right ankles.
The human body key points can be calculated using a human body key point detection model such as an Hourglass network. On this basis, as those skilled in the art can appreciate, the network model of the embodiment of the present application includes, in addition to the feature learning network and the measurement network, a human body key point detection model. In the embodiment of the present application, the human body key point detection model can be used directly without further training.
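As a small, hedged sketch of how such a key point model is typically used (the concrete model and its output format are assumptions here and are not specified by the application): an Hourglass-style network usually returns one heatmap per key point, and the coordinate of each key point can be taken as the peak of its heatmap.

```python
import torch

def heatmaps_to_keypoints(heatmaps):
    """heatmaps: tensor of shape (14, H, W), one heatmap per body key point
    (head, neck, shoulders, hips, elbows, knees, ankles, ...).
    Returns a (14, 2) tensor of (x, y) coordinates inside the detection box."""
    num_kp, h, w = heatmaps.shape
    flat = heatmaps.reshape(num_kp, -1)
    idx = flat.argmax(dim=1)          # index of the peak of each heatmap
    ys = (idx // w).float()
    xs = (idx % w).float()
    return torch.stack([xs, ys], dim=1)

# Usage (assuming `hourglass` is a trained key point model and `crop` a pedestrian crop):
# heatmaps = hourglass(crop.unsqueeze(0))[0]   # shape (14, H, W), an assumed output format
# keypoints = heatmaps_to_keypoints(heatmaps)
```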
Step 502: inputting the two adjacent frames of images and the detection frames of all pedestrians in each frame of image into a network model, specifically a feature learning network, and obtaining at least two global feature maps for each pedestrian in each frame of image;
step 503: obtaining each target image of each object from each global feature map of each object, wherein the target image of each object is an image corresponding to the position relation of at least two key parts of each object in the global feature map of each object;
step 504: determining a sequence of local features, such as (final) feature representation vectors, of each object from each target image of each object;
the steps 502-504 are implemented based on at least a feature learning network. The feature learning network may be specifically a vgg (visual Geometry group) neural network or a residual error (ResNet) network. With ResNet, considering that the deeper the ResNet network structure (the more residual blocks) can extract relatively more abundant features, in particular, with a ResNet101 network, the ResNet101 includes at least two residual blocks, each of which includes at least two convolutional layers, and those skilled in the art should understand that the convolutional layers are used for performing feature map calculation on images input to the convolutional layers.
As shown in fig. 4, in a specific implementation, each frame image input to the feature learning network, specifically the ResNet101 network, passes through the residual blocks of ResNet101, and the last convolutional layer of each residual block outputs a feature map. In the embodiment of the present application, for each pedestrian in each frame image, the feature maps output by the last convolutional layers of 4 residual blocks are extracted: the feature map conv2 output by the last convolutional layer of the 2nd residual block, the feature map conv3 output by the last convolutional layer of the 3rd residual block, the feature map conv4 output by the last convolutional layer of the 4th residual block, and the feature map conv5 output by the last convolutional layer of the 5th residual block, as the four feature maps of the corresponding pedestrian. These four feature maps can be regarded as global feature representations of the pedestrian at four levels: the convolutional layers of different residual blocks can be regarded as different levels, so the convolutional layers of the four residual blocks correspond to four levels. In general, the feature maps output at different levels carry different feature information; the deeper the level, the richer the image feature information it expresses. It can be understood that as the level deepens, the size of the feature map at that level becomes smaller, but the image feature information it expresses becomes richer. The feature map output by the last convolutional layer of each residual block is typically a global feature map. It can thus be understood that, in the embodiment of the present application, at least two global feature maps are obtained for each pedestrian in each frame image based on the feature learning network. A global feature map contains a feature description of the pedestrian together with its surroundings and is usually not fine enough. Therefore, in the embodiment of the present application, combining the human body key point information obtained in step 501, for each pedestrian and for each level, the image (target image) features at the positions corresponding to the several body parts are cropped from that level's global feature map and used as the local feature vectors at that level.
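A minimal sketch of extracting such multi-level feature maps, under the assumption that torchvision's ResNet-101 is used as the feature learning network (its stage outputs layer1-layer4 correspond to the conv2-conv5 feature maps mentioned above):

```python
import torch
import torchvision

backbone = torchvision.models.resnet101(weights="DEFAULT")
backbone.eval()

def four_level_feature_maps(image_batch):
    """image_batch: float tensor of shape (B, 3, H, W).
    Returns the outputs of the four residual stages (conv2..conv5 levels)."""
    x = backbone.conv1(image_batch)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)   # conv2-level global feature map
    c3 = backbone.layer2(c2)  # conv3-level
    c4 = backbone.layer3(c3)  # conv4-level
    c5 = backbone.layer4(c4)  # conv5-level
    return c2, c3, c4, c5
```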
Taking the aforementioned 14 key parts as an example, 14 local feature vectors are obtained for each pedestrian at each level; each local feature vector is the feature vector at the position of the corresponding body part in that level's global feature map. In total, each pedestrian thus has 14 local feature vectors at each of the four levels. At each level, the 14 local feature vectors of a pedestrian are averaged to obtain the pedestrian's feature representation at that level; the feature representation at each level is then processed by a convolutional layer with a 1 × 1 convolution kernel to obtain a dimension-reduced feature representation, and the dimension-reduced feature representations of all levels are concatenated to obtain the (final) feature representation vector of the pedestrian. The (final) feature representation vector of an object, such as a pedestrian, can be regarded as the local feature sequence of that object in the embodiment of the present application.
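A minimal sketch of this keypoint-guided local feature step, under the assumptions that the key point coordinates have already been mapped into each feature map's resolution and that the channel widths of the 1 × 1 reduction layers are illustrative choices:

```python
import torch
import torch.nn as nn

class LocalFeatureHead(nn.Module):
    """Builds a pedestrian's feature representation vector from several
    global feature maps and 14 key point locations."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), reduced=64):
        super().__init__()
        # One 1x1 convolution per level for dimension reduction.
        self.reduce = nn.ModuleList(nn.Conv2d(c, reduced, kernel_size=1)
                                    for c in in_channels)

    def forward(self, feature_maps, keypoints_per_level):
        level_features = []
        for fmap, kps, reduce in zip(feature_maps, keypoints_per_level, self.reduce):
            # fmap: (C, H, W); kps: (14, 2) integer (x, y) positions in fmap coordinates.
            locals_ = torch.stack([fmap[:, y, x] for x, y in kps])  # (14, C)
            pooled = locals_.mean(dim=0)                            # average of 14 vectors
            reduced_vec = reduce(pooled.view(1, -1, 1, 1)).flatten()  # 1x1 conv reduction
            level_features.append(reduced_vec)
        return torch.cat(level_features)  # final feature representation vector
```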
In this step, the local features of the human body are taken into account, which realizes a fine-grained treatment, so that quantities obtained from these local features, such as the feature tensor information and the first parameter, are more accurate.
In this step, the number of levels is 4 and the number of key parts is 14; it can be understood that both values can be set flexibly according to actual conditions. Similarly, the size of the convolution kernel used for the dimension reduction of the feature representations at each level can take any reasonable value and is not particularly limited.
Step 505: combining the feature sequences of the objects in the t_i-th image with those of the objects in the t_{i-1}-th image pairwise to obtain feature tensor information;
through the scheme of step 504, the (final) feature representation vectors of all pedestrians in the t_{i-1}-th frame image and the t_i-th frame image are obtained. Since the number of pedestrians contained in each frame image is not necessarily the same, in order to obtain a uniform representation the number of pedestrians per frame is expanded to N (for example, N = 60); when the number of actual pedestrians in a frame image is less than N, the feature representation vectors of the missing pedestrians are filled with zero vectors. The pedestrian feature representation vectors of the t_{i-1}-th frame image and the t_i-th frame image are then combined pairwise, in an arbitrary manner, to obtain the feature representation tensor S (the feature tensor information) between the pedestrians of the two consecutive frames.
In a specific implementation, since the two adjacent frame images have a temporal order (the t_i-th frame image is later than the t_{i-1}-th frame image), the meaning of a combination differs depending on whether the feature representation vector of a pedestrian from the t_{i-1}-th frame image is placed in front of or behind the combination result. The arbitrary combination in the embodiment of the present application covers both cases: the feature representation vector of a pedestrian from the t_{i-1}-th frame image may be placed either in front of or behind the combination result. As an example, assume the t_{i-1}-th frame image and the t_i-th frame image each contain three pedestrians. Let A1, A2, A3 be the feature representation vectors of the three persons in the t_{i-1}-th frame image, and B1, B2, B3 the feature representation vectors of the three persons in the t_i-th frame image. The feature representation vectors of the P = 3 pedestrians of the two frame images are combined pairwise, and the resulting feature representation tensor S is shown in fig. 5. The dimension of the feature representation tensor S is P × P × N_c, where N_c denotes the length of the feature representation vector.
Step 506: inputting the feature representation tensor S into a measurement network to obtain a first target matrix (represented by M1) and a second target matrix (represented by M2);
the measurement network in this step is a similarity measurement network and can be realized by at least two convolution operations; for example, the measurement network is realized with 5 convolutional layers whose convolution kernels are 1 × 1. The feature representation tensor S is processed by these convolutional layers to obtain a similarity measurement matrix M whose dimension is N × N. During pedestrian tracking, pedestrians may enter or leave the scene, a situation the similarity measurement matrix M does not account for; therefore, one row or one column is appended to M to indicate, respectively, that a pedestrian enters or leaves the current frame, and the elements of the appended row or column take the value of a hyper-parameter σ set according to experience. The similarity measurement matrix after appending the row is M1 (the first target matrix), with dimension (N+1) × N; the similarity measurement matrix after appending the column is M2 (the second target matrix), with dimension N × (N+1).
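A minimal sketch of such a measurement network, under the assumptions that the input tensor S is laid out as (batch, 2·N_c, N, N) and that the intermediate channel widths and the value of σ are illustrative choices rather than values given by the application:

```python
import torch
import torch.nn as nn

class MetricNetwork(nn.Module):
    """Maps the feature representation tensor S to a similarity matrix M (N x N),
    then pads a row (M1) and a column (M2) with the hyper-parameter sigma."""
    def __init__(self, in_channels=512, sigma=0.5):
        super().__init__()
        chans = [in_channels, 256, 128, 64, 32, 1]   # five 1x1 convolutional layers
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=1), nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers[:-1])     # no ReLU after the last layer
        self.sigma = sigma

    def forward(self, S):
        M = self.convs(S).squeeze(1)                 # (batch, N, N)
        b, n, _ = M.shape
        pad = torch.full((b, 1, n), self.sigma, device=M.device)
        M1 = torch.cat([M, pad], dim=1)              # (batch, N+1, N): appended row
        M2 = torch.cat([M, pad.transpose(1, 2)], dim=2)  # (batch, N, N+1): appended column
        return M1, M2
```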
Step 507: carrying out normalized exponential function (softmax) operation on the first target matrix M1 according to columns, and carrying out softmax operation on the second target matrix M2 according to rows to obtain a first matching probability matrix and a second matching probability matrix;
in this step, a column-wise softmax operation is applied to the similarity measurement matrix M1 and a row-wise softmax operation is applied to M2, thereby obtaining a first matching probability matrix (denoted A_b) and a second matching probability matrix (denoted A_f). A_b encodes the matching probabilities from the t_i-th frame image to the t_{i-1}-th frame image and is the backward matching probability matrix; A_f encodes the matching probabilities from the t_{i-1}-th frame image to the t_i-th frame image and is the forward matching probability matrix. For the concrete implementation of the softmax operation, please refer to the related description.
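Continuing the sketch above (again only illustrative), the two matching probability matrices can be obtained from M1 and M2 as follows:

```python
import torch

def matching_probabilities(M1, M2):
    """M1: (N+1, N) similarity matrix with an appended 'enter/leave' row;
    M2: (N, N+1) similarity matrix with an appended 'enter/leave' column."""
    A_b = torch.softmax(M1, dim=0)  # column-wise softmax: backward matching probabilities
    A_f = torch.softmax(M2, dim=1)  # row-wise softmax: forward matching probabilities
    return A_b, A_f
```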
Step 508: obtaining the true matching incidence matrix G of the t_i-th frame image and the t_{i-1}-th frame image, and obtaining the trained network model based on the true matching incidence matrix G, the first matching probability matrix A_b, and the second matching probability matrix A_f.
In this step, in the training phase, the t-th labeliFrame image and ti-1And the dimension of a real matching incidence matrix G of the frame image is (N +1) × (N +1), and the matrix element is 0 or 1. When a certain pedestrian is at the t-th positioniFrame image and ti-1And when the frame images exist, the value of the element at the corresponding position is 1, otherwise, the value is 0. Wherein the first (N-1) row and the first (N-1) column of the matrix G represent the matching relationship between two consecutive frames of pedestrians. Line N and column N give the current frame, tthiPedestrian indices of presence and departure in the frame image. If a certain pedestrian leaves at the current frame, the corresponding position of the Nth row is 1, otherwise, the corresponding position is 0; and if a certain pedestrian appears in the current frame, setting the corresponding position of the Nth row as 1, otherwise, setting the corresponding position as 0.
The loss function of the network model is realized by the average of the following four loss functions.
Loss function one, loss function two, loss function three, and loss function four are given in the original publication as equation images (not reproduced here). In these formulas, G_f and G_b denote the matrices obtained from the true matching incidence matrix G by removing its last row and its last column, respectively; G_w denotes the matrix obtained by removing both the last row and the last column of G; Â_f denotes A_f with its last row removed, and Â_b denotes A_b with its last column removed. The ⊙ operation is the element-wise product of two matrices, |·| denotes the norm of a matrix, log is the logarithm, Σ_ij sums over all elements of a matrix, and max(·,·) takes the larger of its two arguments. The final loss function of the network model is the average of the four loss functions.
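For readability only, a plausible reconstruction of the four losses, written purely from the textual description above, is given below; the exact formulas in the application appear as images and may differ in detail, and in the third and fourth expressions the matrices are assumed to be restricted to their common N × N part.

```latex
% Hedged reconstruction; symbols follow the description above, not the original images.
\begin{aligned}
\mathcal{L}_1 &= \frac{\sum_{ij} G_f \odot \left(-\log \hat{A}_f\right)}{\sum_{ij} G_f},\qquad
\mathcal{L}_2  = \frac{\sum_{ij} G_b \odot \left(-\log \hat{A}_b\right)}{\sum_{ij} G_b},\\
\mathcal{L}_3 &= \frac{\sum_{ij} G_w \odot \left|\hat{A}_f - \hat{A}_b\right|}{\sum_{ij} G_w},\qquad
\mathcal{L}_4  = \frac{\sum_{ij} G_w \odot \left(-\log \max(\hat{A}_f, \hat{A}_b)\right)}{\sum_{ij} G_w},\\
\mathcal{L} &= \tfrac{1}{4}\left(\mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3 + \mathcal{L}_4\right).
\end{aligned}
```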
Those skilled in the art should understand that the training process of steps 500-508 in fact searches for the values of the convolution weight parameters of the feature learning network (specifically, the convolutional layers of ResNet101) and of the convolutional layers in the measurement network that minimize the final loss function of the network model. The feature learning network and the measurement network obtained when the final loss function reaches its minimum can be used as the trained networks, and the network model comprising at least the feature learning network and the measurement network is then the trained model.
Testing/application stage: the pedestrians of two adjacent frames are matched by using the trained network model.
Specifically, let the two adjacent frames at this stage be the t_i-th frame image and the t_{i+1}-th frame image. Steps 500-507 are executed on the t_i-th frame image and the t_{i+1}-th frame image to obtain their first matching probability matrix A_b and second matching probability matrix A_f. A matrix A* is then obtained by taking, at each position, the larger of the corresponding elements of A_b and A_f. The elements of A* represent the similarity probabilities (matching probabilities) between the pedestrians of the two consecutive frames. Based on these matching similarities it is determined which pedestrian(s) in the t_{i+1}-th frame image are the same pedestrians as which pedestrian(s) in the t_i-th frame image, which pedestrians newly appear in the t_{i+1}-th frame image, and which pedestrian(s) have disappeared in the t_{i+1}-th frame image.

For example, suppose the element in the first row and first column of one matching probability matrix is 0.3, representing that the similarity probability between user A1 in the t_i-th frame image and user B1 in the t_{i+1}-th frame image is 0.3, and the element at the same position in the other matching probability matrix is 0.8, representing that the similarity probability between user B1 in the t_{i+1}-th frame image and user A1 in the t_i-th frame image is 0.8. Then the element of A* in the first row and first column is the maximum of 0.3 and 0.8, namely 0.8, representing that the probability that user A1 in the t_i-th frame image and user B1 in the t_{i+1}-th frame image are the same person is 80%; if this value is greater than or equal to a first preset probability, user A1 and user B1 are the same person appearing in both the t_i-th frame image and the t_{i+1}-th frame image. If, based on A*, a user appearing in the t_{i+1}-th frame image has a similarity probability smaller than a second threshold with every user appearing in the t_i-th frame image, that user is regarded as a user newly appearing in the t_{i+1}-th frame image. If, based on A*, a user appearing in the t_i-th frame image has a similarity probability smaller than a third threshold with every user appearing in the t_{i+1}-th frame image, that user is regarded as a user who appeared in the t_i-th frame image but has disappeared in the t_{i+1}-th frame image.
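A minimal sketch of this test-stage decision logic; the threshold values and the row/column orientation are placeholders chosen for illustration, since the application only names a first preset probability, a second threshold, and a third threshold.

```python
import numpy as np

def match_frames(A_b, A_f, p_same=0.5, p_new=0.3, p_gone=0.3):
    """A_b, A_f: matching probability matrices between pedestrians of frame t_i (rows)
    and frame t_{i+1} (columns), restricted to the actually detected pedestrians."""
    A_star = np.maximum(A_b, A_f)            # element-wise maximum

    same_pairs = [(i, j) for i in range(A_star.shape[0])
                         for j in range(A_star.shape[1])
                         if A_star[i, j] >= p_same]           # same person in both frames
    new_in_next = [j for j in range(A_star.shape[1])
                   if A_star[:, j].max() < p_new]             # newly appearing pedestrians
    gone_from_prev = [i for i in range(A_star.shape[0])
                      if A_star[i, :].max() < p_gone]         # pedestrians that disappeared
    return same_pairs, new_in_next, gone_from_prev
```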
Compared with the related-art scheme in which two independent network models are adopted to realize object tracking and both models require independent training, the scheme of the embodiment of the present application uses a single network model to match the similarity between pedestrians across images, and only this one network model needs to be trained, which greatly reduces the amount of computation. In addition, the technical scheme of the embodiment of the present application is realized in one go and requires no hand-off between models, avoiding the problem in the related art that two independent models cannot be effectively connected. Furthermore, the degree of similarity is obtained based on the local feature sequences of the objects; since local features are considered at a finer granularity, the identification accuracy of the target object can be ensured. The degree of similarity between pedestrians is obtained from two probability matrices, namely the forward matching probability matrix and the backward matching probability matrix, which likewise ensures the identification accuracy of the target object.
An embodiment of the present application further provides an identification device, as shown in fig. 6, the identification device includes: a first obtaining unit 601, a second obtaining unit 602, a first determining unit 603, a third obtaining unit 604, and a second determining unit 605; wherein,
a first obtaining unit 601 for obtaining at least two images;
a second obtaining unit 602, configured to obtain key point information of each object in each image;
a first determining unit 603 configured to determine a feature sequence of each object in each image based on at least the key point information of each object in each image;
a third obtaining unit 604, configured to obtain at least a first parameter between the objects in the at least two images based on the feature sequence of each object in each image, where the first parameter is used to characterize a similarity degree between the objects;
a second determining unit 605, configured to determine at least a target object based on at least the first parameter between the objects in the at least two images, where the target object is a similar object between the at least two images.
In an optional embodiment, the apparatus further comprises:
a fourth obtaining unit, configured to obtain at least two global feature maps for each object in each image, where different global feature maps of the same object are at least partially different;
correspondingly, the first determining unit is configured to determine the local feature sequence of each object based on the at least two global feature maps and the keypoint information of each object.
In an optional embodiment, the first determining unit 603 is configured to:
the key point information of each object is at least represented as position information of at least two key parts of each object;
obtaining each target image of each object from each global feature map of each object, wherein the target image of each object is an image corresponding to the position relation of at least two key parts of each object in the global feature map of each object;
based on the respective target images of the respective objects, a local feature sequence of the respective objects is determined.
In an alternative embodiment, the third obtaining unit 604 is configured to:
combine the feature sequences of the objects in the t_i-th image with the feature sequences of the objects in the t_{i-1}-th image pairwise to obtain the feature tensor information, wherein the t_i-th image and the t_{i-1}-th image are two adjacent images of the at least two images;
perform convolution processing on the feature tensor information at least twice to obtain two target matrices, wherein at least some elements of each target matrix represent the degree of similarity between any two objects of the t_i-th image and the t_{i-1}-th image;
and obtain, based at least on the two target matrices, first parameters between the objects in the first image and the objects in the second image.
In an alternative embodiment, the third obtaining unit 604 is configured to:
perform a normalized exponential function (softmax) operation column-wise on a first target matrix of the two target matrices to obtain a first matching probability matrix, wherein at least some elements of the first matching probability matrix represent matching probabilities from objects of the t_i-th image to objects of the t_{i-1}-th image;
perform a softmax operation row-wise on a second target matrix of the two target matrices to obtain a second matching probability matrix, wherein at least some elements of the second matching probability matrix represent matching probabilities from objects of the t_{i-1}-th image to objects of the t_i-th image;
and take, for each position, the larger of the element values at the same position in the first matching probability matrix and the second matching probability matrix as the first parameter between the corresponding objects of the first image and the second image.
It should be noted that, in the identification device according to the embodiment of the present application, because the principle of solving the problem is similar to that of the identification method, the implementation process and the implementation principle of the identification device can be described by referring to the implementation process and the implementation principle of the identification method, and repeated details are not repeated.
In practical applications, the first obtaining unit 601, the second obtaining unit 602, the first determining unit 603, the third obtaining unit 604, and the second determining unit 605 may be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field Programmable Gate Array (FPGA) in the identification device.
The present application also provides a computer-readable storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, perform at least the steps of the identification method of the foregoing embodiment. The computer readable storage medium may be specifically a memory. The memory may be the memory 72 as shown in fig. 7.
The embodiment of the present application also provides the identification device. Fig. 7 is a schematic diagram of a hardware structure of an identification device according to an embodiment of the present application. As shown in fig. 7, the identification device includes: a communication component 73 for data transmission, at least one processor 71, and a memory 72 for storing a computer program capable of running on the processor 71. The various components in the device are coupled together by a bus system 74. It will be appreciated that the bus system 74 is used to enable communication among the connected components. In addition to a data bus, the bus system 74 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are labeled as bus system 74 in fig. 7.
Wherein the processor 71 performs the steps of the identification method of the previous embodiment.
It will be appreciated that the memory 72 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferroelectric random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 72 described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present application may be applied to, or implemented by, the processor 71. The processor 71 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 71 or by instructions in the form of software. The processor 71 may be a general purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being performed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, the storage medium being located in the memory 72; the processor 71 reads the information in the memory 72 and performs the steps of the foregoing method in combination with its hardware.
In an exemplary embodiment, the identification device may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), FPGAs, general purpose processors, controllers, MCUs, microprocessors, or other electronic components, for performing the aforementioned identification method.
In the several embodiments provided in the present application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other division ways in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may serve separately as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by program instructions and related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
Alternatively, if the integrated units of the present application are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application, in essence, or the portions thereof contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. An identification method, characterized in that the method comprises:
obtaining at least two images;
obtaining key point information of each object in each image;
determining a characteristic sequence of each object in each image at least based on the key point information of each object in each image;
at least obtaining a first parameter between the objects in the at least two images based on the feature sequence of each object in each image, wherein the first parameter is used for representing the degree of similarity between the objects of different images in the at least two images;
determining at least a target object based on at least the first parameter between the objects in the at least two images, wherein the target object is a similar object between the at least two images.
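By way of illustration only, the following Python sketch outlines the flow of claim 1. The helpers detect_keypoints, extract_feature_sequence and pairwise_similarity are hypothetical placeholders for the steps elaborated in claims 2 to 5; this is a minimal sketch of the claimed flow under those assumptions, not the implementation of the application.

    # Minimal sketch of the flow of claim 1; detect_keypoints, extract_feature_sequence and
    # pairwise_similarity are hypothetical placeholders, not modules defined by the application.
    def identify(images, detect_keypoints, extract_feature_sequence, pairwise_similarity):
        # Key point information of each object in each image.
        keypoints = [detect_keypoints(img) for img in images]          # one {object_id: key points} per image
        # Feature sequence of each object, determined at least from its key point information.
        features = [{obj: extract_feature_sequence(img, kps) for obj, kps in kp.items()}
                    for img, kp in zip(images, keypoints)]
        # First parameter: degree of similarity between objects of adjacent images.
        matches = []
        for prev, curr in zip(features[:-1], features[1:]):
            sims = pairwise_similarity(prev, curr)                      # {(prev_obj, curr_obj): similarity}
            if sims:
                # Target object: the most similar object pair between the two images.
                matches.append(max(sims, key=sims.get))
        return matches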
2. The method according to claim 1, characterized in that the method further comprises:
obtaining at least two global feature maps for respective objects in respective images, wherein different global feature maps of the same object are at least partially different;
correspondingly, the determining the feature sequence of each object in each image based on at least the key point information of each object in each image includes:
and determining the local feature sequence of each object based on the at least two global feature maps and the key point information of each object.
3. The method of claim 2, wherein the determining the local feature sequence of each object based on the at least two global feature maps and the keypoint information of each object comprises:
the key point information of each object is at least represented as position information of at least two key parts of each object;
obtaining each target image of each object from each global feature map of each object, wherein the target image of each object is an image corresponding to the position relation of at least two key parts of each object in the global feature map of each object;
based on the respective target images of the respective objects, a local feature sequence of the respective objects is determined.
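As a hedged illustration of claims 2 and 3, the sketch below treats the target image of an object as the region of a global feature map spanned by two of its key parts, pools that region to a fixed size, and concatenates the pooled regions over all global feature maps and key-part pairs into a local feature sequence. The adaptive average pooling and the 4x4 output size are assumptions, not details stated in the application.

    # Assumed sketch of claims 2 and 3; the pooling operator and output size are illustrative choices.
    import torch
    import torch.nn.functional as F

    def local_feature(global_map, part_a, part_b, out_size=(4, 4)):
        # global_map: (C, H, W) feature map; part_a, part_b: (x, y) key-part positions in map coordinates.
        x0, x1 = sorted((int(part_a[0]), int(part_b[0])))
        y0, y1 = sorted((int(part_a[1]), int(part_b[1])))
        region = global_map[:, y0:y1 + 1, x0:x1 + 1]                    # "target image" for this part pair
        pooled = F.adaptive_avg_pool2d(region.unsqueeze(0), out_size)   # (1, C, 4, 4)
        return pooled.flatten()

    def local_feature_sequence(global_maps, part_pairs):
        # Concatenate pooled regions over all global feature maps and key-part pairs of one object.
        return torch.cat([local_feature(g, a, b) for g in global_maps for a, b in part_pairs])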
4. The method according to any one of claims 1 to 3, wherein the obtaining at least a first parameter between each object in the at least two images based on the feature sequence of each object in each image comprises:
combining, pairwise, the feature sequences of the objects in a t_i-th image and the feature sequences of the objects in a t_(i-1)-th image to obtain feature tensor information, wherein the t_i-th image and the t_(i-1)-th image are two adjacent images of the at least two images;
performing convolution processing on the feature tensor information at least twice to obtain two target matrices, wherein at least part of the elements in the target matrices represent the degree of similarity of any two objects in the t_i-th image and the t_(i-1)-th image;
and at least obtaining first parameters between each object in the first image and each object in the second image based on the two target matrices.
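A hedged sketch of claim 4 follows: the feature sequences of the objects in two adjacent images are combined pairwise into a feature tensor, and two small convolutional heads map that tensor to two target matrices whose elements score the similarity of any two objects. The 1x1-convolution heads and the hidden width are assumptions standing in for the "at least two" convolution passes of the claim.

    # Assumed sketch of claim 4; the two-head, 1x1-convolution arrangement is an illustrative choice.
    import torch
    import torch.nn as nn

    class TargetMatrixHeads(nn.Module):
        def __init__(self, feat_dim, hidden=64):
            super().__init__()
            def head():
                return nn.Sequential(nn.Conv2d(2 * feat_dim, hidden, kernel_size=1),
                                     nn.ReLU(),
                                     nn.Conv2d(hidden, 1, kernel_size=1))
            self.head_a, self.head_b = head(), head()

        def forward(self, feats_curr, feats_prev):
            # feats_curr: (N, D) objects of the t_i-th image; feats_prev: (M, D) objects of the t_(i-1)-th image.
            N, D = feats_curr.shape
            M = feats_prev.shape[0]
            pairs = torch.cat([feats_curr[:, None, :].expand(N, M, D),
                               feats_prev[None, :, :].expand(N, M, D)], dim=-1)  # pairwise combination (N, M, 2D)
            tensor = pairs.permute(2, 0, 1).unsqueeze(0)                         # feature tensor (1, 2D, N, M)
            target_a = self.head_a(tensor)[0, 0]                                 # first target matrix (N, M)
            target_b = self.head_b(tensor)[0, 0]                                 # second target matrix (N, M)
            return target_a, target_b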
5. The method of claim 4, wherein obtaining at least a first parameter between each object in the first image and each object in the second image based on the two target matrices comprises:
performing a normalized exponential function (softmax) operation on a first target matrix of the two target matrices according to columns to obtain a first matching probability matrix, wherein at least part of the elements of the first matching probability matrix represent matching probabilities from objects in the t_i-th image to objects in the t_(i-1)-th image;
performing a softmax operation on a second target matrix of the two target matrices according to rows to obtain a second matching probability matrix, wherein at least part of the elements of the second matching probability matrix represent matching probabilities from objects in the t_(i-1)-th image to objects in the t_i-th image;
and taking the larger of the element values at the same position in the first matching probability matrix and the second matching probability matrix as the first parameter between the corresponding objects in the first image and the second image.
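Finally, a minimal sketch of claim 5, assuming the two target matrices index objects of one image along rows and objects of the adjacent image along columns: a column-wise softmax yields the first matching probability matrix, a row-wise softmax yields the second, and the larger value at each position is taken as the first parameter.

    # Sketch of claim 5 under the row/column orientation assumed in the lead-in above.
    import torch

    def first_parameters(target_a, target_b):
        # target_a, target_b: the two (N, M) target matrices obtained as in claim 4.
        p_col = torch.softmax(target_a, dim=0)   # softmax according to columns: first matching probability matrix
        p_row = torch.softmax(target_b, dim=1)   # softmax according to rows: second matching probability matrix
        return torch.max(p_col, p_row)           # element-wise larger value = first parameter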
6. An identification device, characterized in that the device comprises:
a first obtaining unit configured to obtain at least two images;
a second obtaining unit configured to obtain key point information of each object in each image;
a first determining unit, configured to determine a feature sequence of each object in each image based on at least the key point information of each object in each image;
a third obtaining unit, configured to obtain at least a first parameter between objects in the at least two images based on a feature sequence of each object in each image, where the first parameter is used to characterize a degree of similarity between objects in different images of the at least two images;
a second determining unit, configured to determine at least a target object based on at least the first parameter between the objects in the at least two images, where the target object is a similar object between the at least two images.
7. The apparatus of claim 6, further comprising:
a fourth obtaining unit, configured to obtain at least two global feature maps for each object in each image, where different global feature maps of the same object are at least partially different;
correspondingly, the first determining unit is configured to determine the local feature sequence of each object based on the at least two global feature maps and the keypoint information of each object.
8. The apparatus of claim 7, wherein the first determining unit is configured to:
the key point information of each object is at least represented as position information of at least two key parts of each object;
obtaining each target image of each object from each global feature map of each object, wherein the target image of each object is an image corresponding to the position relation of at least two key parts of each object in the global feature map of each object;
based on the respective target images of the respective objects, a local feature sequence of the respective objects is determined.
9. The apparatus according to any one of claims 6 to 8, wherein the third obtaining unit is configured to:
combining, pairwise, the feature sequences of the objects in a t_i-th image and the feature sequences of the objects in a t_(i-1)-th image to obtain feature tensor information, wherein the t_i-th image and the t_(i-1)-th image are two adjacent images of the at least two images;
performing convolution processing on the feature tensor information at least twice to obtain two target matrices, wherein at least part of the elements in the target matrices represent the degree of similarity of any two objects in the t_i-th image and the t_(i-1)-th image;
and at least obtaining first parameters between each object in the first image and each object in the second image based on the two target matrices.
10. The apparatus of claim 9, wherein the third obtaining unit is configured to:
performing a normalized exponential function (softmax) operation on a first target matrix of the two target matrices according to columns to obtain a first matching probability matrix, wherein at least part of the elements of the first matching probability matrix represent matching probabilities from objects in the t_i-th image to objects in the t_(i-1)-th image;
performing a softmax operation on a second target matrix of the two target matrices according to rows to obtain a second matching probability matrix, wherein at least part of the elements of the second matching probability matrix represent matching probabilities from objects in the t_(i-1)-th image to objects in the t_i-th image;
and taking the larger of the element values at the same position in the first matching probability matrix and the second matching probability matrix as the first parameter between the corresponding objects in the first image and the second image.
11. An identification device comprising a processor and a storage medium for storing a computer program; wherein the processor is adapted to perform at least the identification method of any of claims 1 to 5 when executing the computer program.
12. A storage medium storing a computer program which, when executed, performs at least the identification method of any one of claims 1 to 5.
CN201910706446.9A 2019-08-01 2019-08-01 Identification method, equipment and storage medium Active CN111739060B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910706446.9A CN111739060B (en) 2019-08-01 2019-08-01 Identification method, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910706446.9A CN111739060B (en) 2019-08-01 2019-08-01 Identification method, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111739060A true CN111739060A (en) 2020-10-02
CN111739060B CN111739060B (en) 2024-07-19

Family

ID=72646014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910706446.9A Active CN111739060B (en) 2019-08-01 2019-08-01 Identification method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111739060B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040017944A1 (en) * 2002-05-24 2004-01-29 Xiaoging Ding Method for character recognition based on gabor filters
CN109492551A (en) * 2018-10-25 2019-03-19 腾讯科技(深圳)有限公司 The related system of biopsy method, device and application biopsy method
CN109492550A (en) * 2018-10-25 2019-03-19 腾讯科技(深圳)有限公司 The related system of biopsy method, device and application biopsy method
CN109657533A (en) * 2018-10-27 2019-04-19 深圳市华尊科技股份有限公司 Pedestrian recognition methods and Related product again
CN109377539A (en) * 2018-11-06 2019-02-22 北京百度网讯科技有限公司 Method and apparatus for generating animation
CN109697446A (en) * 2018-12-04 2019-04-30 北京字节跳动网络技术有限公司 Image key points extracting method, device, readable storage medium storing program for executing and electronic equipment
CN109840500A (en) * 2019-01-31 2019-06-04 深圳市商汤科技有限公司 A kind of 3 D human body posture information detection method and device
CN110009800A (en) * 2019-03-14 2019-07-12 北京京东尚科信息技术有限公司 A kind of recognition methods and equipment
CN110070010A (en) * 2019-04-10 2019-07-30 武汉大学 A kind of face character correlating method identified again based on pedestrian

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENGHO HSIN et al.: "Affine Invariant Local Features based on Novel Keypoint Detection and Grouping", 2012 Fourth International Conference on Communications and Electronics (ICCE), 31 December 2012 (2012-12-31), pages 296-301 *
CHENG Xianghao et al.: "Coarse-to-Fine 3D Face Feature Point Localization Based on Key Points", Chinese Journal of Scientific Instrument (仪器仪表学报), vol. 39, no. 10, 31 October 2018 (2018-10-31), pages 256-264 *

Also Published As

Publication number Publication date
CN111739060B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
Wu et al. FMD-Yolo: An efficient face mask detection method for COVID-19 prevention and control in public
US11200424B2 (en) Space-time memory network for locating target object in video content
US11151725B2 (en) Image salient object segmentation method and apparatus based on reciprocal attention between foreground and background
WO2021248859A1 (en) Video classification method and apparatus, and device, and computer readable storage medium
US20210158023A1 (en) System and Method for Generating Image Landmarks
US11048948B2 (en) System and method for counting objects
CN108197532A (en) The method, apparatus and computer installation of recognition of face
CN111402294A (en) Target tracking method, target tracking device, computer-readable storage medium and computer equipment
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN112749666B (en) Training and action recognition method of action recognition model and related device
CN112818955A (en) Image segmentation method and device, computer equipment and storage medium
CN112802076A (en) Reflection image generation model and training method of reflection removal model
CN113807361A (en) Neural network, target detection method, neural network training method and related products
CN111027555A (en) License plate recognition method and device and electronic equipment
Li et al. Robust foreground segmentation based on two effective background models
Shi et al. Segmentation quality evaluation based on multi-scale convolutional neural networks
CN112906671B (en) Method and device for identifying false face-examination picture, electronic equipment and storage medium
Wang et al. Non-local attention association scheme for online multi-object tracking
CN111401335B (en) Key point detection method and device and storage medium
CN112348011A (en) Vehicle damage assessment method and device and storage medium
CN111739060B (en) Identification method, equipment and storage medium
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN113177483B (en) Video object segmentation method, device, equipment and storage medium
CN114049608A (en) Track monitoring method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant