CN113807122A - Model training method, object recognition method and device, and storage medium - Google Patents


Info

Publication number
CN113807122A
Authority
CN
China
Prior art keywords
sample
group
cross
samples
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010529203.5A
Other languages
Chinese (zh)
Inventor
王栋
刘梓墨
王双
戚赟炜
邓玉明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010529203.5A priority Critical patent/CN113807122A/en
Publication of CN113807122A publication Critical patent/CN113807122A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A model training method, an object recognition method and apparatus, and a storage medium are provided. The method includes: acquiring images captured by image acquisition devices, and establishing a plurality of groups of sample sets from the acquired images; performing association processing on the acquired images to obtain a plurality of video clips, establishing a single-group sample pair set according to the video clips and the sample sets, and training a single-group distance model with the single-group sample pair set, where the single-group distance model is used for calculating the distance between two samples belonging to the same sample set; and establishing a first cross-group sample pair set and a second cross-group sample pair set according to the single-group distance model and the sample sets, initially training a cross-group distance model with the single-group sample pair set and the first cross-group sample pair set, and continuing to train the cross-group distance model with the second cross-group sample pair set. The method and apparatus realize unsupervised cross-camera model training and object recognition with relatively high recognition accuracy.

Description

Model training method, object recognition method and device, and storage medium
Technical Field
The present application relates to, but not limited to, the field of image recognition technologies, and in particular, to a model training method, an object recognition method and apparatus, and a storage medium.
Background
At present, large numbers of surveillance cameras are deployed in places with dense crowds where safety problems easily arise, such as supermarket anti-theft and campus security. Pedestrian re-identification, also known as person re-identification or re-verification, is a technique for determining whether pedestrians captured by non-overlapping cameras are the same person. It enables retrieval of a specific object from surveillance video and therefore has important practical significance. Pedestrian re-identification algorithms can be classified, according to their dependence on training data, into supervised, semi-supervised, and unsupervised approaches. The former two require collecting and labeling large numbers of training samples for each specific application scenario, which is not only time-consuming but also transfers poorly to new scenes. Unsupervised pedestrian re-identification does not depend on data labeling and has become the development trend of the technology.
Disclosure of Invention
The application provides a model training method, an object recognition method and device and a storage medium, which can realize unsupervised cross-camera pedestrian re-recognition and have relatively high recognition accuracy.
The embodiment of the application provides a model training method, which comprises the following steps: acquiring images acquired by an image acquisition device, and establishing a plurality of groups of sample sets according to the acquired images; performing correlation processing on an acquired image to obtain a plurality of video clips, establishing a single-group sample pair set according to the video clips and the sample set, and training a single-group distance model by using the single-group sample pair set, wherein the single-group distance model is used for calculating the distance between two samples belonging to the same sample set; according to the single-group distance model and the sample set, a first cross-group sample pair set and a second cross-group sample pair set are established, the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, the first cross-group sample pairs comprise first samples and second samples belonging to different sample sets, and the first samples and the second samples are samples which are searched in the sample set where the opposite side is located and are closest to the first samples and the second samples; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, the second cross-group sample pairs comprise third samples and fourth samples belonging to different sample sets, and one of the third samples and the fourth samples is a sample which is searched in the sample set and is closest to the other one of the third samples and the fourth samples; initializing a training cross-group distance model by using a single-group sample pair set and a first cross-group sample pair set, and continuously training the cross-group distance model by using a second cross-group sample pair set until the training of the cross-group distance model is completed, wherein the cross-group distance model is used for calculating the distance between two samples 
belonging to different sample sets.
In some possible implementations, the establishing a plurality of sets of samples from the acquired image includes: scaling the acquired image to a preset pixel size; extracting a multi-dimensional feature vector of the scaled image; reducing the dimensionality of the extracted multi-dimensional feature vector; and storing the dimensionality-reduced feature vectors as samples in the sample set.
In some possible implementations, the images acquired by one of the image acquisition devices correspond to one or more sets of the samples.
In some possible implementations, the single set of sample pairs includes a set of positive sample pairs and a set of negative sample pairs; establishing a single set of sample pair sets from the video segments and sample sets, comprising: taking any two samples belonging to the same video segment as a positive sample pair, wherein a plurality of positive sample pairs corresponding to a group of sample sets form a positive sample pair set corresponding to the group of sample sets; any two samples belonging to different video clips are used as a negative sample pair, and a plurality of negative sample pairs corresponding to a group of sample sets form a negative sample pair set corresponding to the group of sample sets.
In some possible implementations, the single set of distance models and the cross-set distance models are both cross-view quadratic discriminant analysis distance models.
In some possible implementations, when the cross-group distance model is continuously trained by using the second cross-group sample pair set, the loop converges when all the second cross-group sample pairs have been used for training or the number of training iterations reaches a preset maximum.
An embodiment of the present application further provides a model training apparatus, which includes a processor and a memory, where the processor is configured to execute a computer program stored in the memory to implement the steps of the model training method as described in any one of the above.
An embodiment of the present application further provides a storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the model training method according to any one of the above items are implemented.
The embodiment of the application provides an object identification method, which comprises the following steps: acquiring an image to be identified; performing object recognition on an image to be recognized by using a pre-trained cross-group distance model, wherein a training sample set of the cross-group distance model comprises an initial training sample set and a secondary training sample set, the initial training sample set comprises a single-group sample pair set and a first cross-group sample pair set, the secondary training sample set comprises a second cross-group sample pair set, the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, the first cross-group sample pairs comprise first samples and second samples belonging to different sample sets, and the first samples and the second samples are samples which are searched in a sample set where the other side is located and are closest to the first samples and the second samples; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, the second cross-group sample pairs comprise third samples and fourth samples belonging to different sample sets, and one of the third samples and the fourth samples is a sample which is searched in the sample set and is closest to the other one of the third samples and the fourth samples.
The embodiment of the application provides an object identification method, which comprises the following steps: a cloud server receives an image to be recognized input by a user, wherein the image to be recognized comprises a target object; the cloud server performs object recognition on the image to be recognized by using a pre-trained cross-group distance model, and outputs one or more images, wherein the output image contains an object matched with the target object, a training sample set of the cross-group distance model comprises an initial training sample set and a secondary training sample set, the initial training sample set comprises a single-group sample pair set and a first cross-group sample pair set, the secondary training sample set comprises a second cross-group sample pair set, the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, the first cross-group sample pairs comprise first samples and second samples belonging to different sample sets, and the first samples and the second samples are samples which are searched in the sample set where the opposite side is located and are closest to each other; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, the second cross-group sample pairs comprise third samples and fourth samples belonging to different sample sets, and one of the third samples and the fourth samples is a sample which is searched in the sample set and is closest to the other one of the third samples and the fourth samples.
An embodiment of the present application further provides an object recognition apparatus, which includes a processor and a memory, where the processor is configured to execute a computer program stored in the memory to implement the steps of the object recognition method according to any one of the above.
An embodiment of the present application further provides a storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the object identification method according to any one of the above.
The embodiment of the present application further provides a model training device, including a sample establishing module, a single group training module and a cross-group training module, wherein: the sample establishing module is used for acquiring images acquired by the image acquisition device and establishing a plurality of groups of sample sets according to the acquired images; the single group training module is used for performing correlation processing on the acquired image to obtain a plurality of video clips, establishing a single group sample pair set according to the video clips and the sample set, and training a single group distance model by using the single group sample pair set, wherein the single group distance model is used for calculating the distance between two samples belonging to the same sample set; the cross-group training module is used for establishing a first cross-group sample pair set and a second cross-group sample pair set according to the single-group distance model and the sample set, wherein the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, the first cross-group sample pairs comprise first samples and second samples belonging to different sample sets, and the first samples and the second samples are samples which are searched in the sample set where the opposite side is located and are closest to the first samples and the second samples; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, the second cross-group sample pairs comprise third samples and fourth samples belonging to different sample sets, and one of the third samples and the fourth samples is a sample which is searched in the sample set and is closest to the other one of the third samples and the fourth samples; initializing a training cross-group distance model by using a single-group sample pair set and a first cross-group sample pair set, and continuously training the 
cross-group distance model by using a second cross-group sample pair set until the training of the cross-group distance model is completed, wherein the cross-group distance model is used for calculating the distance between two samples belonging to different sample sets.
The embodiment of the present application further provides an object recognition apparatus, including an image acquisition module and an object recognition module, wherein: the image acquisition module is used for acquiring an image to be identified; the object recognition module is used for performing object recognition on an image to be recognized by using a pre-trained cross-group distance model, wherein a training sample set of the cross-group distance model comprises an initial training sample set and a secondary training sample set, the initial training sample set comprises a single-group sample pair set and a first cross-group sample pair set, the secondary training sample set comprises a second cross-group sample pair set, the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, the first cross-group sample pairs comprise first samples and second samples belonging to different sample sets, and the first samples and the second samples are samples which are searched in a sample set where the other side is located and are closest to the first samples and the second samples; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, the second cross-group sample pairs comprise third samples and fourth samples belonging to different sample sets, and one of the third samples and the fourth samples is a sample which is searched in the sample set and is closest to the other one of the third samples and the fourth samples.
According to the model training method, the object recognition method and apparatus, and the storage medium provided above, the single-group sample pair set and the first cross-group sample pair set are used to initialize training of the cross-group distance model, and the second cross-group sample pair set is used to continue training it, thereby achieving unsupervised cross-camera model training and pedestrian re-identification with relatively high recognition accuracy.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. Other advantages of the present application may be realized and attained by the instrumentalities and combinations particularly pointed out in the specification and the drawings.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
FIG. 1 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic processing diagram of a model training method according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of an object identification method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another object identification method according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a result of pedestrian re-identification according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an object recognition apparatus according to an embodiment of the present application.
Detailed Description
The present application describes embodiments, but the description is illustrative rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in this application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
As shown in fig. 1, an embodiment of the present application provides a model training method, which includes steps 101 to 104.
Step 101 comprises: and acquiring images acquired by the image acquisition device, and establishing a plurality of groups of sample sets according to the acquired images.
In an exemplary embodiment, the image capture device may be a camera, an image sensor, or any other type of image capture device.
In an exemplary embodiment, the method further comprises: detecting pedestrians in the acquired images by using the Deformable Parts Model (DPM) algorithm, and cropping out the pedestrian images.
The DPM algorithm is a highly successful object detection algorithm and serves as an important component in many classification, segmentation, human pose estimation, and behavior classification systems. DPM can be viewed as an extension of the Histogram of Oriented Gradients (HOG) approach, with which its general idea is consistent: the gradient orientation histogram is computed, and a Support Vector Machine (SVM) is then trained to obtain a gradient model (Model) of the object.
In an exemplary embodiment, the image acquired by one image acquisition device may correspond to a set of sample sets; or, the image acquired by one image acquisition device corresponds to a plurality of groups of sample sets.
In one exemplary embodiment, establishing a plurality of sets of samples from an acquired image includes:
step 1011, scaling the acquired image to a preset pixel size;
step 1012, extracting a multi-dimensional feature vector of the scaled image;
step 1013, reducing the dimensionality of the extracted multi-dimensional feature vector;
and step 1014, storing the dimensionality-reduced feature vectors as samples in the sample set.
In an exemplary embodiment, the preset pixel size may be 128 × 48 pixels, so that the scaled image occupies less storage space while still clearly preserving the image features.
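As an illustration of this scaling step, here is a minimal nearest-neighbor resize in NumPy that maps an arbitrary image to the preset 128 × 48 size. The interpolation choice and the dummy input are assumptions for the sketch; a real pipeline would typically use an image library's resize routine.

```python
import numpy as np

def resize_nearest(image, out_h=128, out_w=48):
    """Scale an H x W (x C) image to out_h x out_w by nearest-neighbor indexing."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return image[rows][:, cols]

# A dummy 256 x 96 grayscale "pedestrian" crop scaled down to the preset 128 x 48.
img = np.arange(256 * 96).reshape(256, 96)
small = resize_nearest(img)
print(small.shape)  # (128, 48)
```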
In one exemplary embodiment, a 26960-dimensional feature vector may be extracted from each scaled pedestrian image using the Local Maximal Occurrence (LOMO) algorithm. Compared with deep features, the feature vectors extracted by the hand-crafted LOMO operator are more efficient and stable, suitable for embedded deployment, and less susceptible to adversarial attack.
The LOMO algorithm mainly addresses illumination and viewpoint variation. Before feature extraction, the Retinex image enhancement algorithm is applied. The feature representation uses HSV (Hue, Saturation, Value) color histograms and Scale Invariant Local Ternary Pattern (SILTP) histograms; HSV is the color space most commonly used for histograms, its three components representing hue, saturation, and value respectively.
In an exemplary embodiment, the extracted multi-dimensional feature vector is subjected to dimensionality reduction using a Principal Component Analysis (PCA) algorithm, and the feature vector after dimensionality reduction may be 600 dimensions, for example.
The PCA algorithm is one of the most important dimension reduction methods, and has wide application in the fields of data compression, redundancy elimination, data noise elimination and the like. The PCA algorithm extracts the principal linear components of the data by transforming the raw data into a set of representations that are linearly independent of each dimension through a linear transformation.
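The PCA projection described above can be sketched with a plain NumPy eigendecomposition of the feature covariance. The dimensions below (50 in, 10 out) are toy stand-ins for the 26960-dimensional LOMO features and the 600-dimensional output mentioned in the text.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top n_components principal components."""
    Xc = X - X.mean(axis=0)                   # center each feature
    cov = np.cov(Xc, rowvar=False)            # feature covariance matrix
    _, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
    top = eigvecs[:, ::-1][:, :n_components]  # take the leading components
    return Xc @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # toy stand-in for 26960-dim LOMO features
Z = pca_reduce(X, 10)            # the patent reduces to e.g. 600 dimensions
print(Z.shape)  # (200, 10)
```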
Step 102 comprises: the method comprises the steps of performing correlation processing on an acquired image to obtain a plurality of video clips, establishing a single-group sample pair set according to the video clips and a sample set, and training a single-group distance model by using the single-group sample pair set, wherein the single-group distance model is used for calculating the distance between two samples belonging to the same sample set.
In an exemplary embodiment, the acquired images are associated using the Generalized Maximum Multi Clique Problem (GMMCP) tracking algorithm to obtain a plurality of video clips.
The GMMCP tracker formulates data association in a global framework: an objective function is first formulated as a binary integer program, dummy (virtual) nodes are added to handle missed targets, and an acceleration method, aggregating the dummy nodes, is proposed on this basis. Finally, the undirected graph is solved by maximum binary integer programming, yielding multiple cliques simultaneously and thus completing multi-target tracking. In short, GMMCP casts multi-target tracking as a graph problem: a weighted graph is constructed, and the final trajectories are obtained by solving it.
In one exemplary embodiment, the single set of sample pairs includes a set of positive sample pairs and a set of negative sample pairs. Establishing a single group of sample pair set according to the video clip and the sample set, wherein the method comprises the following steps:
taking any two samples belonging to the same video segment as a positive sample pair, wherein a plurality of positive sample pairs corresponding to a group of sample sets form a positive sample pair set corresponding to the group of sample sets;
any two samples belonging to different video clips are used as a negative sample pair, and a plurality of negative sample pairs corresponding to a group of sample sets form a negative sample pair set corresponding to the group of sample sets.
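The pairing rule above can be sketched as follows. The `clip_of_sample` mapping and the tracklet-to-identity assumption (same clip implies same person) are illustrative choices, not details from the patent.

```python
from itertools import combinations

def build_single_group_pairs(clip_of_sample):
    """clip_of_sample maps sample id -> video-clip (tracklet) id within one group.
    Same clip -> positive pair; different clips -> negative pair."""
    positives, negatives = [], []
    for a, b in combinations(sorted(clip_of_sample), 2):
        if clip_of_sample[a] == clip_of_sample[b]:
            positives.append((a, b))
        else:
            negatives.append((a, b))
    return positives, negatives

# Samples 0 and 1 come from clip "A"; samples 2 and 3 from clip "B".
pos, neg = build_single_group_pairs({0: "A", 1: "A", 2: "B", 3: "B"})
print(pos)  # [(0, 1), (2, 3)]
print(neg)  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```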
In an exemplary embodiment, the single-group distance model and the Cross-group distance model established in the embodiment of the present application are Cross-view Quadratic Discriminant Analysis (XQDA) distance models.
The XQDA distance model of the embodiment of the present application includes a single-group distance model and a cross-group distance model. In an exemplary embodiment, the single-group distance model corresponding to the ith group of samples is denoted d^i(·,·), the single-group distance model corresponding to the jth group of samples is denoted d^j(·,·), the cross-group distance model from the ith group of samples to the jth group is denoted d^{i,j}(·,·), and the cross-group distance model from the jth group of samples to the ith group is denoted d^{j,i}(·,·).
The XQDA distance model builds on the KISS metric ("Keep It Simple and Straightforward MEtric", KISSME) and the Bayesian face method. It fits the intra-class and inter-class feature-difference distributions with Gaussian models, and derives a Mahalanobis-like distance from the log-likelihood ratio of the two Gaussian distributions.
In the embodiment of the present application, for each group of samples, any two samples in the same video clip may be used as a positive sample pair, and any two samples in different video clips may be used as a negative sample pair, to establish a single-group sample pair set S; for example, for the ith group of samples, the single-group sample pair set is S^i. A single-group distance model d(x, y) is trained on the single-group sample pair set S using XQDA. For example, the single-group distance model corresponding to the ith group of samples is d^i(x, y), where the superscript i denotes the ith group of samples and x and y denote different samples.
The expression of the XQDA distance model in the embodiment of the present application is d(x, y) = (W^T(x − y))^T M (W^T(x − y)), where W is a model matrix parameter (the projection matrix), M is a model covariance parameter, and both W and M can be obtained by training on samples with the XQDA learning algorithm.
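A minimal numeric sketch of this distance form follows. The W and M below are random/identity stand-ins rather than parameters learned by XQDA; the function itself mirrors the expression above.

```python
import numpy as np

def xqda_distance(x, y, W, M):
    """d(x, y) = (W^T (x - y))^T  M  (W^T (x - y))."""
    diff = W.T @ (x - y)          # project the difference into the learned subspace
    return float(diff.T @ M @ diff)

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 3))       # stand-in projection matrix (learned in practice)
M = np.eye(3)                     # stand-in covariance metric (learned in practice)
x, y = rng.normal(size=6), rng.normal(size=6)
print(xqda_distance(x, x, W, M))  # 0.0: identical samples are at zero distance
print(xqda_distance(x, y, W, M) >= 0.0)  # True when M is positive semi-definite
```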
Step 103 comprises: according to the single-group distance model and the sample set, a first cross-group sample pair set and a second cross-group sample pair set are established, the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, the first cross-group sample pairs comprise first samples and second samples belonging to different sample sets, and the first samples and the second samples are samples which are searched in the sample set where the first samples and the second samples are located and are closest to the first samples and the second samples; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, the second cross-group sample pairs comprise third samples and fourth samples belonging to different sample sets, and one of the third samples and the fourth samples is a sample which is searched in the sample set and is closest to the other one of the third samples and the fourth samples.
In this embodiment of the application, saying that the first sample and the second sample are each the sample found in the other's sample set that is closest to the other means: the second sample is the sample in its own sample set that is closest to the first sample, and the first sample is the sample in its own sample set that is closest to the second sample.
One of the third sample and the fourth sample is a sample which is searched in the sample set where the sample is located and is closest to the other one of the third sample and the fourth sample, and the method includes: the fourth sample is a sample which is searched in the sample set where the fourth sample is located and is closest to the third sample, or the third sample is a sample which is searched in the sample set where the third sample is located and is closest to the fourth sample.
In the present method, the first cross-group sample pair set collects first cross-group sample pairs of high reliability, so the pairs in the first cross-group sample pair set can effectively initialize a cross-group distance model. Then, by a cross-view collaborative learning method, second cross-group sample pairs are labeled iteratively and mutually between the two groups of sample sets, so that training of the cross-group distance model can continue.
In one exemplary embodiment, the first cross-group sample pair set includes a plurality of first cross-group sample pairs. The first cross-group sample pair from the i-th group of sample sets to the j-th group of sample sets may be represented as (x^i, y^j), where i and j are natural numbers and i ≠ j, x^i is a sample in the i-th group of sample sets, y^j = argmin_{y ∈ Ω^j} d^j(x^i, y) is the sample in the j-th group of sample sets closest to x^i, and the sample argmin_{x ∈ Ω^i} d^i(x, y^j) in the i-th group of sample sets closest to y^j is x^i itself; Ω^j is the j-th group of sample sets and Ω^i is the i-th group of sample sets.
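The mutual nearest-neighbor mining described above can be sketched as follows; the function name, the toy 1-D samples, and the use of absolute difference in place of the learned single-group distance models d^i and d^j are all illustrative assumptions:

```python
def first_cross_group_pairs(A, B, d_i, d_j):
    """Mine first cross-group sample pairs (mutual nearest neighbours):
    y is the sample in B closest to x under d_j, and x is in turn the
    sample in A closest to that y under d_i."""
    pairs = []
    for idx, x in enumerate(A):
        fwd = min(range(len(B)), key=lambda n: d_j(x, B[n]))       # forward search
        back = min(range(len(A)), key=lambda n: d_i(A[n], B[fwd]))  # backward check
        if back == idx:                     # keep only mutual nearest neighbours
            pairs.append((idx, fwd))
    return pairs

# Usage with 1-D toy samples; absolute difference stands in for both models:
dist = lambda a, b: abs(a - b)
pairs = first_cross_group_pairs([0.0, 10.0], [0.1, 9.0], dist, dist)
```

Because of the backward check, a sample in A that is not its partner's nearest neighbour in return produces no pair, which is what makes these pairs high-reliability.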
In one exemplary embodiment, the second cross-group sample pair set includes a plurality of second cross-group sample pairs. The second cross-group sample pair from the i-th group of sample sets to the j-th group of sample sets is (x^i, y^j), where i and j are natural numbers and i ≠ j, x^i is a sample in the i-th group of sample sets, y^j = argmin_{y ∈ Ω^j} d^i(x^i, y) is the sample in the j-th group of sample sets closest to x^i, and Ω^j is the j-th group of sample sets.
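In contrast to the first cross-group pairs, the one-directional search above drops the backward check. A minimal sketch (illustrative names, absolute difference standing in for the single-group model d^i):

```python
def second_cross_group_pairs(A, B, d_i):
    """Mine second cross-group sample pairs: y is simply the sample in B
    closest to x under the single-group model d_i; no backward check, so
    these pairs are noisier and are used only to continue training an
    already-initialized cross-group model."""
    return [(idx, min(range(len(B)), key=lambda n: d_i(A[idx], B[n])))
            for idx in range(len(A))]

# Same toy data as for the first cross-group pairs:
dist = lambda a, b: abs(a - b)
pairs = second_cross_group_pairs([0.0, 10.0], [0.1, 9.0], dist)
```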
Step 104 comprises: initially training a cross-group distance model using the single-group sample pair set and the first cross-group sample pair set, and continuing to train the cross-group distance model using the second cross-group sample pair set until training of the cross-group distance model is completed, wherein the cross-group distance model is used for calculating the distance between two samples belonging to different sample sets.
In this embodiment of the present application, an initial training sample pair set is first established for each cross-group distance model. For example, all first cross-group sample pairs from the i-th group of sample sets to the j-th group of sample sets are added to the single-group sample pair set corresponding to the i-th group of sample sets, to obtain the initial training sample pair set of the cross-group distance model from the i-th group to the j-th group: S^{i,j} = S^i ∪ S^s, where S^i is the single-group sample pair set corresponding to the i-th group of sample sets (comprising a positive sample pair set and a negative sample pair set) and S^s is the first cross-group sample pair set from the i-th group to the j-th group. Likewise, all first cross-group sample pairs from the j-th group to the i-th group are added to the single-group sample pair set corresponding to the j-th group of sample sets, to obtain the initial training sample pair set S^{j,i} = S^j ∪ S^{s'}, where S^j is the single-group sample pair set corresponding to the j-th group of sample sets (comprising a positive sample pair set and a negative sample pair set) and S^{s'} is the first cross-group sample pair set from the j-th group to the i-th group.
Then, each of the plurality of cross-group distance models is initially trained on its established initial training sample pair set. For example, the cross-group distance model d^{i,j}(·,·) from the i-th group of sample sets to the j-th group of sample sets is obtained by initial training on S^{i,j}, and the cross-group distance model d^{j,i}(·,·) from the j-th group to the i-th group is obtained by initial training on S^{j,i}.
Next, all second cross-group sample pairs satisfying the conditions are added to the initial training sample pair set to obtain an updated training sample pair set, and the cross-group distance model continues to be trained on the updated set. For example, all second cross-group sample pairs from the i-th group to the j-th group are added to S^{i,j} to obtain an updated training sample pair set, on which training of d^{i,j}(·,·) continues; the set S^{j,i} is updated in the same way to continue training d^{j,i}(·,·).
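The initialize-then-continue scheme above can be sketched as the following loop; the function name and the stand-in `train` / mining callbacks are hypothetical, not the patent's implementation:

```python
def co_train(init_ij, init_ji, train, mine_ij, mine_ji, max_iters=20):
    """Sketch of the scheme: each cross-group model starts from its initial
    training pair set (single-group pairs plus first cross-group pairs);
    second cross-group pairs mined with the current models are then
    repeatedly added and the models retrained."""
    T_ij, T_ji = set(init_ij), set(init_ji)
    d_ij, d_ji = train(T_ij), train(T_ji)        # initialization training
    for _ in range(max_iters):
        new_ij = mine_ij(d_ij, d_ji) - T_ij      # second cross-group pairs, i -> j
        new_ji = mine_ji(d_ij, d_ji) - T_ji      # second cross-group pairs, j -> i
        if not new_ij and not new_ji:            # loop convergence condition
            break
        T_ij |= new_ij
        T_ji |= new_ji
        d_ij, d_ji = train(T_ij), train(T_ji)    # continue training
    return d_ij, d_ji

# Dummy usage: "training" just returns the pair-set size, mining returns
# a fixed pair, so the loop absorbs it once and then converges.
models = co_train({(1, 1)}, {(2, 2)}, len,
                  lambda a, b: {(1, 2)}, lambda a, b: {(2, 1)})
```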
In an exemplary embodiment, when the cross-group distance model continues to be trained using the second cross-group sample pair set, the loop convergence condition may be that all second cross-group sample pairs have been used for training, or that the number of training iterations reaches a preset limit. For example, the preset number of iterations may be set to 20.
In an exemplary embodiment, as shown in fig. 2, a user acquires images and sends them to a server. The server establishes a plurality of groups of sample sets by the above model training method, establishes single-group sample pair sets from the sample sets, trains the single-group distance models using the single-group sample pair sets, establishes the first and second cross-group sample pair sets according to the single-group distance models and the sample sets, initially trains the cross-group distance model using the first cross-group sample pair set, continues to train it using the second cross-group sample pair set, and outputs the trained cross-group distance model to the user.
In another exemplary embodiment, a user inputs an image set together with its original image recognition model. The server establishes a single-group distance model and a cross-group distance model for the user's image set by the above model training method, trains the established models, and optimizes the user's image recognition model according to the trained cross-group distance model.
An embodiment of the present application further provides an object identification method, as shown in fig. 3, the object identification method includes steps 301 to 302.
Wherein step 301 comprises: acquiring an image to be identified.
Step 302 includes: performing object recognition on the image to be recognized using a pre-trained cross-group distance model, wherein the training sample set of the cross-group distance model comprises an initial training sample set and a secondary training sample set, the initial training sample set comprises a single-group sample pair set and a first cross-group sample pair set, and the secondary training sample set comprises a second cross-group sample pair set. The first cross-group sample pair set comprises a plurality of first cross-group sample pairs; each first cross-group sample pair comprises a first sample and a second sample belonging to different sample sets, and the first sample and the second sample are each the sample closest to the other as searched in the sample set where the other is located. The second cross-group sample pair set comprises a plurality of second cross-group sample pairs; each second cross-group sample pair comprises a third sample and a fourth sample belonging to different sample sets, and one of the third sample and the fourth sample is the sample closest to the other as searched in the sample set where it is located.
In one exemplary embodiment, the single set of sample pairs includes a set of positive sample pairs and a set of negative sample pairs. The set of positive sample pairs includes a plurality of positive sample pairs, each positive sample pair including two samples belonging to the same video segment. The set of negative example pairs comprises a plurality of negative example pairs, each negative example pair comprising two examples belonging to different video segments.
In one exemplary embodiment, the cross-set distance model is a cross-view quadratic discriminant analysis distance model.
An embodiment of the present application further provides an object identification method, as shown in fig. 4, the object identification method includes steps 401 to 402.
Wherein step 401 comprises: the cloud server receives an image to be recognized input by a user, wherein the image to be recognized comprises a target object;
step 402 comprises: the cloud server performs object recognition on the image to be recognized using a pre-trained cross-group distance model and outputs one or more images, the output images containing objects matching the target object. The training sample set of the cross-group distance model comprises an initial training sample set and a secondary training sample set; the initial training sample set comprises a single-group sample pair set and a first cross-group sample pair set, and the secondary training sample set comprises a second cross-group sample pair set. The first cross-group sample pair set comprises a plurality of first cross-group sample pairs; each first cross-group sample pair comprises a first sample and a second sample belonging to different sample sets, and the first sample and the second sample are each the sample closest to the other as searched in the sample set where the other is located. The second cross-group sample pair set comprises a plurality of second cross-group sample pairs; each second cross-group sample pair comprises a third sample and a fourth sample belonging to different sample sets, and one of the third sample and the fourth sample is the sample closest to the other as searched in the sample set where it is located.
In an exemplary embodiment, a user sends a training image set to a cloud server, the cloud server establishes an initial training sample set and a secondary training sample set for the training image set sent by the user by using the model training method, and trains a pre-established single-group distance model and a pre-established cross-group distance model by using the initial training sample set and the secondary training sample set; and then, inputting a test image by the user, carrying out object recognition on the test image input by the user according to the trained group-spanning distance model by the cloud server, and outputting a recognition result which is closest to the test image input by the user to the user.
In an exemplary embodiment, as shown in fig. 5, a user inputs a prototype image segment of a pedestrian to be identified into a pre-trained cross-group distance model, and one or more image segments of the pedestrian close to the identified image segment are obtained.
Pedestrian Re-Identification (ReID) refers to the process of establishing correspondences between images of the same pedestrian captured by different cameras whose fields of view do not overlap. When the shooting ranges of the cameras do not overlap, there is no continuous information, which increases the difficulty of retrieval. Pedestrian re-identification therefore emphasizes retrieving a particular pedestrian in cross-camera video. It is mainly applied in image retrieval and similar settings: given one or more pictures of a person (query images), the pictures belonging to the same person are found among a large set of candidate pictures (gallery images). It is a person-matching technique realized through the overall characteristics of pedestrians.
The effectiveness of the pedestrian re-identification method of the embodiment of the present application was tested on the PRID 2011 standard pedestrian re-identification database. This data set was collected by the Austrian Institute of Technology to advance research on pedestrian re-identification. It consists of images extracted from multiple pedestrian trajectories recorded by two different static surveillance cameras. The images differ in camera viewpoint as well as illumination, pedestrian background and camera characteristics. Since the images are extracted from trajectories, each person appears in several different poses. The data set records the trajectories of 475 people from one view and 856 people from the other, 245 of whom appear in both views. Severely occluded pedestrians, pedestrians with fewer than five reliable images under one camera, and images damaged by labeling errors have been filtered out of the data set.
Fig. 5 shows the test results of the embodiment of the present application. The left side shows prototype image segments of pedestrians to be identified, and the right side shows the top ten results returned by the embodiment. For the first pedestrian prototype image segment, six identification results are correct and four are wrong; for the second, all ten results are correct. As can be seen from fig. 5, the embodiment of the present application successfully identifies the same pedestrian and visually similar pedestrians.
The re-identification accuracy of the method of the embodiment of the present application and of existing pedestrian re-identification methods was quantitatively compared using the top-k ranking accuracy (Rank-k accuracy) metric, with k = 1, 5, 10 and 20 in the embodiment of the present application. Rank-k denotes the proportion of queries for which an image with the same ID as the query image appears among the top k gallery images sorted by similarity.
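As an illustrative sketch of the metric (not part of the patent), Rank-k accuracy can be computed from per-query identity rankings as:

```python
def rank_k_accuracy(rankings, true_ids, k):
    """Rank-k accuracy: the fraction of queries whose true identity
    appears among the top k gallery identities sorted by similarity."""
    hits = sum(true in ranking[:k] for ranking, true in zip(rankings, true_ids))
    return hits / len(true_ids)

# Two queries: the first is matched at rank 2, the second at rank 1.
rankings = [["b", "a", "c"], ["a", "c", "b"]]
rank1 = rank_k_accuracy(rankings, ["a", "a"], k=1)
rank2 = rank_k_accuracy(rankings, ["a", "a"], k=2)
```

By construction the metric is non-decreasing in k, which is why the reported Rank-20 figures are always at least as high as Rank-1.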
Experimental results show that, on the PRID 2011 video pedestrian re-identification database, the accuracy of the embodiment of the present application has a clear advantage over other unsupervised methods and is close to that of supervised learning methods. Table 1 reports the comparison of recognition accuracy on the PRID 2011 database. In Table 1, Euclidean (LOMO) denotes an unsupervised pedestrian re-identification algorithm based on LOMO features and Euclidean distance, Salience denotes an unsupervised algorithm based on saliency features, CNN+XQDA denotes a supervised algorithm based on convolutional neural network features and XQDA distance learning, and XQDA (LOMO) denotes a supervised algorithm based on LOMO features and XQDA distance learning. As can be seen from Table 1, the embodiment of the present application achieves good pedestrian re-identification accuracy: it is clearly superior to the unsupervised algorithms and only slightly below the supervised ones, while not depending on data labeling, which is a significant advantage in practice.
Table 1 (recognition accuracy comparison results on the PRID 2011 database; rendered as an image in the original document)
According to the object identification method of the embodiment of the present application, the cross-group distance model is initially trained using the single-group sample pair set and the first cross-group sample pair set, and training continues using the second cross-group sample pair set, thereby realizing unsupervised cross-camera object identification; experimental results show that the identification accuracy is relatively high.
An embodiment of the present application further provides a model training apparatus, which includes a processor and a memory, where the processor is configured to execute a computer program stored in the memory to implement the steps of the model training method as described in any one of the above.
An embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the model training method according to any one of the above.
An embodiment of the present application further provides an object recognition apparatus, which includes a processor and a memory, where the processor is configured to execute a computer program stored in the memory to implement the steps of the object recognition method according to any one of the above.
An embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the object identification method according to any one of the above.
As shown in fig. 6, an embodiment of the present application further provides a model training apparatus, which includes a sample building module 601, a single-group training module 602, and a cross-group training module 603.
The sample establishing module 601 is configured to acquire an image acquired by an image acquisition device, and establish a plurality of sets of sample sets according to the acquired image.
The single-group training module 602 is configured to perform correlation processing on the acquired image to obtain a plurality of video clips, establish a single-group sample pair set according to the video clips and the sample set, and train a single-group distance model using the single-group sample pair set, where the single-group distance model is used to calculate a distance between two samples belonging to the same sample set.
A cross-group training module 603 is configured to establish a first cross-group sample pair set and a second cross-group sample pair set according to the single-group distance model and the sample sets, wherein the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, each first cross-group sample pair comprises a first sample and a second sample belonging to different sample sets, and the first sample and the second sample are each the sample closest to the other as searched in the sample set where the other is located; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, each second cross-group sample pair comprises a third sample and a fourth sample belonging to different sample sets, and one of the third sample and the fourth sample is the sample closest to the other as searched in the sample set where it is located. The module is further configured to initially train a cross-group distance model using the single-group sample pair set and the first cross-group sample pair set, and to continue training the cross-group distance model using the second cross-group sample pair set until training is completed, the cross-group distance model being used for calculating the distance between two samples belonging to different sample sets.
In an exemplary embodiment, the sample creating module 601 creates a plurality of sets of samples from the acquired image, including: zooming the acquired image to a preset pixel value; extracting a multidimensional characteristic vector of the zoomed image; reducing the dimension of the extracted multi-dimensional feature vector; and storing the feature vectors after dimensionality reduction as samples in the sample set.
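The scale / extract / reduce-dimension / store pipeline above can be sketched as follows; raw pixels stand in for the real multidimensional feature descriptor, PCA via SVD stands in for whichever dimensionality-reduction step the embodiment uses, and all names and sizes are illustrative:

```python
import numpy as np

def build_sample_set(images, target_shape=(128, 64), out_dim=8):
    """Sketch of the sample-establishing steps: scale each image to a
    preset size, extract a feature vector (here trivially the raw pixels),
    and reduce its dimensionality with PCA before storing it as a sample."""
    h, w = target_shape
    feats = []
    for img in images:
        rows = np.linspace(0, img.shape[0] - 1, h).astype(int)
        cols = np.linspace(0, img.shape[1] - 1, w).astype(int)
        scaled = img[np.ix_(rows, cols)]       # nearest-neighbour "zoom"
        feats.append(scaled.ravel().astype(float))
    X = np.stack(feats)
    X -= X.mean(axis=0)                        # centre features before PCA
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:out_dim].T                  # reduced vectors, one row per sample

# Toy usage: five random grayscale "images", reduced to 3-D samples.
rng = np.random.default_rng(0)
samples = build_sample_set([rng.random((200, 100)) for _ in range(5)], out_dim=3)
```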
In one exemplary embodiment, the single-group sample pair set includes a positive sample pair set and a negative sample pair set. The single-group training module 602 establishes the single-group sample pair set from the video segments and the sample sets by: taking any two samples belonging to the same video segment as a positive sample pair, a plurality of positive sample pairs corresponding to a group of sample sets forming the positive sample pair set corresponding to that sample set; and taking any two samples belonging to different video segments as a negative sample pair, a plurality of negative sample pairs corresponding to a group of sample sets forming the negative sample pair set corresponding to that sample set.
In one exemplary embodiment, the single-set distance model and the cross-set distance model are both cross-view quadratic discriminant analysis distance models.
In one exemplary embodiment, the first cross-group sample pair set includes a plurality of first cross-group sample pairs. The first cross-group sample pair from the i-th group of sample sets to the j-th group of sample sets is (x^i, y^j), where i and j are natural numbers and i ≠ j, x^i is a sample in the i-th group of sample sets, y^j = argmin_{y ∈ Ω^j} d^j(x^i, y) is the sample in the j-th group of sample sets closest to x^i, and the sample argmin_{x ∈ Ω^i} d^i(x, y^j) in the i-th group of sample sets closest to y^j is x^i itself; Ω^j is the j-th group of sample sets, Ω^i is the i-th group of sample sets, d^j(·,·) is the single-group distance model corresponding to the j-th group of sample sets, and d^i(·,·) is the single-group distance model corresponding to the i-th group of sample sets.
In one exemplary embodiment, the second cross-group sample pair set includes a plurality of second cross-group sample pairs. The second cross-group sample pair from the i-th group of sample sets to the j-th group of sample sets is (x^i, y^j), where i and j are natural numbers and i ≠ j, x^i is a sample in the i-th group of sample sets, y^j = argmin_{y ∈ Ω^j} d^i(x^i, y) is the sample in the j-th group of sample sets closest to x^i, Ω^j is the j-th group of sample sets, and d^i(·,·) is the single-group distance model corresponding to the i-th group of sample sets.
In an exemplary embodiment, when the cross-group distance model continues to be trained using the second cross-group sample pair set, the loop convergence condition is that all second cross-group sample pairs have been used for training or that the number of training iterations reaches a preset limit.
As shown in fig. 7, an embodiment of the present application further provides an object recognition apparatus, which includes an image acquisition module 701 and an object recognition module 702.
The image obtaining module 701 is configured to obtain an image to be identified.
The object recognition module 702 is configured to perform object recognition on the image to be recognized using a pre-trained cross-group distance model, wherein the training sample set of the cross-group distance model comprises an initial training sample set and a secondary training sample set, the initial training sample set comprises a single-group sample pair set and a first cross-group sample pair set, and the secondary training sample set comprises a second cross-group sample pair set. The first cross-group sample pair set comprises a plurality of first cross-group sample pairs; each first cross-group sample pair comprises a first sample and a second sample belonging to different sample sets, and the first sample and the second sample are each the sample closest to the other as searched in the sample set where the other is located. The second cross-group sample pair set comprises a plurality of second cross-group sample pairs; each second cross-group sample pair comprises a third sample and a fourth sample belonging to different sample sets, and one of the third sample and the fourth sample is the sample closest to the other as searched in the sample set where it is located.
In one exemplary embodiment, the single set of sample pairs includes a set of positive sample pairs and a set of negative sample pairs. The set of positive sample pairs includes a plurality of positive sample pairs, each positive sample pair including two samples belonging to the same video segment. The set of negative example pairs comprises a plurality of negative example pairs, each negative example pair comprising two examples belonging to different video segments.
In one exemplary embodiment, the cross-set distance model is a cross-view quadratic discriminant analysis distance model.
The object recognition device provided by the embodiment of the present application initially trains the cross-group distance model using the single-group sample pair set and the first cross-group sample pair set and continues to train it using the second cross-group sample pair set, realizing unsupervised cross-camera object recognition; experimental results show that the recognition accuracy is relatively high.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, as is well known to those skilled in the art, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.

Claims (13)

1. A method of model training, comprising:
acquiring images acquired by an image acquisition device, and establishing a plurality of groups of sample sets according to the acquired images;
performing correlation processing on an acquired image to obtain a plurality of video clips, establishing a single-group sample pair set according to the video clips and the sample set, and training a single-group distance model by using the single-group sample pair set, wherein the single-group distance model is used for calculating the distance between two samples belonging to the same sample set;
according to the single-group distance model and the sample set, a first cross-group sample pair set and a second cross-group sample pair set are established, the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, the first cross-group sample pairs comprise first samples and second samples belonging to different sample sets, and the first samples and the second samples are samples which are searched in the sample set where the opposite side is located and are closest to the first samples and the second samples; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, the second cross-group sample pairs comprise third samples and fourth samples belonging to different sample sets, and one of the third samples and the fourth samples is a sample which is searched in the sample set and is closest to the other one of the third samples and the fourth samples;
initially training a cross-group distance model using the single-group sample pair set and the first cross-group sample pair set, and continuing to train the cross-group distance model using the second cross-group sample pair set until training of the cross-group distance model is completed, wherein the cross-group distance model is used for calculating the distance between two samples belonging to different sample sets.
2. The model training method of claim 1, wherein the establishing sets of samples from the acquired images comprises:
zooming the acquired image to a preset pixel value;
extracting a multidimensional characteristic vector of the zoomed image;
reducing the dimension of the extracted multi-dimensional feature vector;
and storing the feature vectors after dimensionality reduction as samples in the sample set.
3. The model training method of claim 1, wherein the images acquired by one of the image acquisition devices correspond to one or more sets of the samples.
4. The model training method of claim 1, wherein the single set of sample pairs comprises a set of positive sample pairs and a set of negative sample pairs; establishing a single set of sample pair sets from the video segments and sample sets, comprising:
taking any two samples belonging to the same video segment as a positive sample pair, wherein a plurality of positive sample pairs corresponding to a group of sample sets form a positive sample pair set corresponding to the group of sample sets;
any two samples belonging to different video clips are used as a negative sample pair, and a plurality of negative sample pairs corresponding to a group of sample sets form a negative sample pair set corresponding to the group of sample sets.
5. The model training method of claim 1, wherein the single-set distance model and the cross-set distance model are both cross-view quadratic discriminant analysis distance models.
6. A model training apparatus comprising a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the steps of the model training method of any one of claims 1 to 5.
7. A storage medium, characterized in that the storage medium stores a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the model training method according to any one of claims 1 to 5.
8. An object recognition method, comprising:
acquiring an image to be identified;
performing object recognition on the image to be recognized by using a pre-trained cross-group distance model, wherein a training sample set of the cross-group distance model comprises an initial training sample set and a secondary training sample set; the initial training sample set comprises a single-group sample pair set and a first cross-group sample pair set, and the secondary training sample set comprises a second cross-group sample pair set; the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, each first cross-group sample pair comprising a first sample and a second sample that belong to different sample sets and are each other's nearest neighbor, each being the sample closest to the other within its own sample set; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, each second cross-group sample pair comprising a third sample and a fourth sample that belong to different sample sets, one of the third sample and the fourth sample being the sample closest to the other within its sample set.
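The two kinds of cross-group pairs in claim 8 can be mined as in the sketch below. Plain Euclidean distance stands in for the trained single-group distance model that the patent uses to score candidates; the function name and index-pair representation are illustrative.

```python
import numpy as np

def mine_cross_group_pairs(set_a, set_b):
    # Pairwise distances between every sample in set_a and set_b
    # (Euclidean here; the patent would use the single-group model).
    d = np.linalg.norm(set_a[:, None, :] - set_b[None, :, :], axis=-1)
    nn_ab = d.argmin(axis=1)   # nearest sample in set_b for each sample of set_a
    nn_ba = d.argmin(axis=0)   # nearest sample in set_a for each sample of set_b
    first, second = [], []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] == i:
            first.append((i, int(j)))    # mutual nearest neighbours -> first pairs
        else:
            second.append((i, int(j)))   # one-directional only -> second pairs
    return first, second
```

A pair lands in the first set only when the nearest-neighbor relation holds in both directions, matching the claim's distinction between the two pair types.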
9. An object recognition method, comprising:
a cloud server receives an image to be recognized input by a user, wherein the image to be recognized contains a target object;
the cloud server performs object recognition on the image to be recognized by using a pre-trained cross-group distance model and outputs one or more images, each output image containing an object matched with the target object, wherein a training sample set of the cross-group distance model comprises an initial training sample set and a secondary training sample set; the initial training sample set comprises a single-group sample pair set and a first cross-group sample pair set, and the secondary training sample set comprises a second cross-group sample pair set; the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, each first cross-group sample pair comprising a first sample and a second sample that belong to different sample sets and are each other's nearest neighbor, each being the sample closest to the other within its own sample set; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, each second cross-group sample pair comprising a third sample and a fourth sample that belong to different sample sets, one of the third sample and the fourth sample being the sample closest to the other within its sample set.
10. An object recognition apparatus, comprising a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the steps of the object recognition method of claim 8 or 9.
11. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the steps of the object recognition method of claim 8 or 9.
12. A model training apparatus, characterized by comprising a sample establishing module, a single-group training module, and a cross-group training module, wherein:
the sample establishing module is used for acquiring images collected by the image acquisition devices and establishing a plurality of sample sets from the acquired images;
the single-group training module is used for performing correlation processing on the acquired images to obtain a plurality of video segments, establishing a single-group sample pair set from the video segments and the sample sets, and training a single-group distance model with the single-group sample pair set, wherein the single-group distance model is used for calculating the distance between two samples belonging to the same sample set;
the cross-group training module is used for establishing a first cross-group sample pair set and a second cross-group sample pair set according to the single-group distance model and the sample sets, wherein the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, each first cross-group sample pair comprising a first sample and a second sample that belong to different sample sets and are each other's nearest neighbor, each being the sample closest to the other within its own sample set; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, each second cross-group sample pair comprising a third sample and a fourth sample that belong to different sample sets, one of the third sample and the fourth sample being the sample closest to the other within its sample set; and for initially training the cross-group distance model with the single-group sample pair set and the first cross-group sample pair set, and then continuing to train the cross-group distance model with the second cross-group sample pair set until training of the cross-group distance model is complete, wherein the cross-group distance model is used for calculating the distance between two samples belonging to different sample sets.
13. An object recognition apparatus comprising an image acquisition module and an object recognition module, wherein:
the image acquisition module is used for acquiring an image to be identified;
the object recognition module is used for performing object recognition on the image to be recognized by using a pre-trained cross-group distance model, wherein a training sample set of the cross-group distance model comprises an initial training sample set and a secondary training sample set; the initial training sample set comprises a single-group sample pair set and a first cross-group sample pair set, and the secondary training sample set comprises a second cross-group sample pair set; the first cross-group sample pair set comprises a plurality of first cross-group sample pairs, each first cross-group sample pair comprising a first sample and a second sample that belong to different sample sets and are each other's nearest neighbor, each being the sample closest to the other within its own sample set; the second cross-group sample pair set comprises a plurality of second cross-group sample pairs, each second cross-group sample pair comprising a third sample and a fourth sample that belong to different sample sets, one of the third sample and the fourth sample being the sample closest to the other within its sample set.
CN202010529203.5A 2020-06-11 2020-06-11 Model training method, object recognition method and device, and storage medium Pending CN113807122A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010529203.5A CN113807122A (en) 2020-06-11 2020-06-11 Model training method, object recognition method and device, and storage medium

Publications (1)

Publication Number Publication Date
CN113807122A true CN113807122A (en) 2021-12-17

Family

ID=78943854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010529203.5A Pending CN113807122A (en) 2020-06-11 2020-06-11 Model training method, object recognition method and device, and storage medium

Country Status (1)

Country Link
CN (1) CN113807122A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201711541D0 (en) * 2017-07-18 2017-08-30 Vision Semantics Ltd Target re-identification
CN108960013A (en) * 2017-05-23 2018-12-07 上海荆虹电子科技有限公司 A kind of pedestrian recognition methods and device again
WO2019010950A1 (en) * 2017-07-13 2019-01-17 北京大学深圳研究生院 Depth discrimination network model method for pedestrian re-recognition in image or video
CN109492702A (en) * 2018-11-21 2019-03-19 中国科学院自动化研究所 Pedestrian based on sorting measure function recognition methods, system, device again
CN109583332A (en) * 2018-11-15 2019-04-05 北京三快在线科技有限公司 Face identification method, face identification system, medium and electronic equipment
WO2019128367A1 (en) * 2017-12-26 2019-07-04 广州广电运通金融电子股份有限公司 Face verification method and apparatus based on triplet loss, and computer device and storage medium
WO2019237657A1 (en) * 2018-06-15 2019-12-19 北京字节跳动网络技术有限公司 Method and device for generating model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550241A (en) * 2022-01-28 2022-05-27 智慧眼科技股份有限公司 Face recognition method and device, computer equipment and storage medium
CN114550241B (en) * 2022-01-28 2023-01-31 智慧眼科技股份有限公司 Face recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination