AU2018100321A4

AU2018100321A4 - Person ReID method based on metric learning with hard mining

Info

Publication number: AU2018100321A4
Application number: AU2018100321A
Authority: AU
Inventors: Jinghan Chen; Guangyao LI; Ming Zhang
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-03-15
Filing date: 2018-03-15
Publication date: 2018-04-26
Anticipated expiration: 2026-03-15

Abstract

Abstract Based on the construction of Resnet-50 model, related DNN conception, Triplet Loss and hard mining algorithm, this invention puts forward a method for re-identifying people through images or even videos, named person re-identification(ReID). Compare to other algorithms in ReID filed, The utilizing of the algorithm enable the invention to achieve a higher accuracy and efficiency, which will be articulated in the description. The invention mainly involves several steps: For the preparation of training model, images are collected from database market1501 on the internet and processed and the resnet-50 model will be constructed; then image features will be extracted as the input in the training phase, where, in the same time, the weights of the model will be consistently optimized by adopting Triplet Loss and hard mining algorithm; finally, the trained model will be saved and it is proved in the final test that it can reach a high accuracy in ReID tasks. training datasets epoch=O Input (contend of start trainin NOoch<150 epoch+1 a updat save load module pictures F NO YESbackipropagation re in one batch which eautdaacosist of pc per I featur ti let ls Lons each_ k pcur Figure 1

Description

DESCRIPTION

TITLE

Person RelD method based on metric learning with hard mining

FIELD OF THE INVENTION

This invention can be well-applied in the filed of security, say, a security system. The identity of employees will be checked and their tracks will be monitored automatically in that the security of companies will be improved.

This invention can also be applied in the filed of detection. Police no longer need to identify suspects in the video or pictures with the naked eye or other comparatively low-efficient techniques. To conclude, the invention will help police re-identify the criminal in a short time.

BACKGROUND OF THE INVENTION

Person Re-Identification has been viewed as one of the key problems of the generic object recognition task. Prior to the flourish of computer and even deep learning technologies, without further supports, the only way for people to re-indentify others is by searching with the naked eye and matching certain traits of those people from memory. For instance, searching criminals through surveillance cameras used to be extreme time-costing and mind-demanding for police. They have to search for criminals in images or videos with the naked eye for an unbearably long time but may find nothing in the end.

Then, several approaches for person RelD were put forward. Methods for re-identification in still images setting have been extensively investigated, including feature representation learning, distance metric learning and CNN-based schemes. The previous algorithm in this area including hand-crafted algorithms and a small-scale evaluation. The Adjacency constrained search is mainly applied in the search process, yet the progress has been slow recently.

Because of all the convenience that technology in the filed of person RelD has brought to us, increasing popularity of person RelD has been gained in recent years. Thus there are more and more innovative ideas and methods that had been proposed to be applied in the person re-ID area.

In recent years, with the rapid development of computer technology, the application of person Re-ID based on deep learning have emerged and is widely utilized in all walks of life, especially in the filed of security. By using the image captured technology, with the utilization of person re-ID technology, police are able to re-recognize criminals more efficiently and more accurately. The main reason is that this technology is able to efficiently extract the relevant features of a specific object and then quickly re-recognize the same person by comparing the similarities in other images or videos.

Compare to previous approaches, people now have supports from larger data sets like marketl501, cuhk and so on, containing a variety of images that researchers needs. Besides, algorithms are constantly improved to promote efficiency. And most import thing is the application of deep-learning techniques, which has its own advantages over traditional approaches in RelD filed, and it is the reason that people believe deep-learning based technology is sure to hold its predominance in the re-ID community over the next few years.

But despite all the advantages of deep-learning based technologies, there are still some important unfinished issues in the field of person re-ID. For example, the re-ranking method, which is of great significance to improve the accuracy of retrieval process, and how to annotate large-scale data sets is still an open-ended question.

[1] Liang Zheng,Yi Yang, and Alexander G.Hauptmann:person reid-past, present and future. arXiv:1610.02984vl [cs.CV] 10 Oct 2016 [2] Srikrishna Karanam*, Student member, IEEE, Mengran Gou*, Student Member, IEEE, Ziyan Wu, Member, IEEE, Angels Rates-Borras, Octavia Campus, Menber, IEEE, and Richard J.Radke, Senior Member, IEEE: A Systematic Evaluation and Benchmark for Person

Re-Identification:Features, Metrics, and Datasets.arXiv: 1605.09653v4 [cs.CV] 18 Aug 2017 [3]Alexander Hermans*, Lucas Beyer* and Bastian Leibe Visual Computing Institute RWTH Aachen University: In Defense of the Triplet Loss for Person Re-Identification. arXiv:1703.07737V3 [cs.CV] 17 May 2017.

SUMMARY OF THE INVENTION

The invention is based on the support from a large data set, named market-1501, and resnet50 model, which is well trained in the training section. Metric learning method such as Triplet loss algorithm and hard mining conception were also adopted for achieving a higher accuracy.

This project uses a large data set, which has more than 10 thousand of images, to apply to RelD tasks. As one of large data sets, Market-1501 have been chosen. Market-1501 data set is collected by cameras in front of a supermarket in Tsinghua University. A total of six cameras are used, including 5 high-resolution cameras, and one low-resolution camera. Overall, this data set contains 32,668 annotated bounding boxes of 1,501 identities. In this open system, images of each identity are captured by at most six cameras. It is one of the largest data sets so far and it provides a lot of supports for research in the field of person Re-ID.

In the invention, the resnet-50 model, which is a deep residual network of 50 layers, will help improve the process of extracting features and return a satisfied outcome with higher accuracy. The core module of resnet is residual block, which is shown in Fig.2. Formally, denoting the desired underlying mapping as X(x), we let the stacked nonlinear layers fit another mapping of F(x) = !i(x) - x, then the desired mapping is recast into X(x) = F(x) + x, which is consistent with the description in figure.

The invention also adopts metric learning method which has been largely put into the image retrieval field. The traditional method often use standard distance, yet when a passenger with the same ID cross those surveillance cameras with no overlap, the influences factors such as the perspective and light differ from various appearance, thus some of those features can be hard to distinguish. Therefore, researchers try to build a new measurement space that can put the distance of images from the same person closer than that from different person, which leads to the emergency of this project.

The Metric learning can learn the similarity degree between two pictures through network learning. There are many metric methods, Triplet loss algorithm and hard mining conception are adopted in this invention. A triplet consists of images from three categories, which are anchor, positive and negative. Anchor is a sample chosen from training set and positive is of the same kind of Anchor, while negative is of the different kind of Anchor. Then the distance between each two images in the data sets should be calculated and stored. By using d to represent the distance between anchor and positive, d to represent the distance between anchor and negative, a to represent margin.

Triplet loss can be expressed as blow: i — (^o,p da,n “b ^) +

The plus Outside the parentheses represents a relu function which is showed in Fig.4. The outcomes will be itself if its larger than 0. But if it smaller or equal to 0, the outcome will still be 0. In other words, what Triplet loss algorithm tries to do here is to minimize the distance between the feature vectors of anchor and positive and enlarge that of anchor and negative until the differences between da and du,, is larger than a.

And the weights of restnet model will be updated every time for minimizing or enlarging the distance.

As for the person RelD topic, it can be specified as follow: The Triplet loss algorithm can pull the positive pair(images of the same identities) closer and push negative pair(images that are come from the different identities) away from each other, then eventually, through the loss function of the whole network, the images of the same identity will form a cluster in the feature space, so that the person re-ID can be achieved.

Another conception is also adopted, named hard mining. Its main principle is to find the hardest sample to train the network. It requires several steps: First, pairwise distance for the features of all images are calculated; then each category(each person as a category) should be traversed to find the max intra-class distance and min inter-class distance; Finally, by adding up all classes' max intra-class distance and min inter-class distance, loss function is the outcome of average distance.

[4] Tetsu Matsukawa, Einoshin Suzuki. Person re-identification using cnn features learned from combination of attributes[C]//Pattem Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016:2428 -2433.

[5] Rahul Rama Varior, Mrinal Haloi, Gang Wang. Gated Siamese convolutional neural network architecture for human re-identification[C]//European Conference on Computer Vision. Springer, 2016:791-808.

[6] Florian Schroff, Dmitry Kalenichenko, James Philbin. Facenet: A unified embedding for face recognition and clustering [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:815-823.

[7]Deep Residual Learning for Image Recognition.

DESCRIPTION OF DRAWING

The following drawings are only for the purpose of description and explanation but not for limitation, where in:

Fig.l is the general flow diagram of the invention. It has listed the main steps.

Fig.2 articulates the principle of triplet loss

Fig.3 illustrates the construction of resnet model.

Fig.4 shows a graph of the function. The outcomes will be itself if its larger than 0. But if it smaller or equal to 0, the outcome will still be 0. Fig.5 is part of the final result of the project. It shows MAP(Mean Average Precision) of the RelD project. And while using market-1501 as our image sets, top-1, top-5 and top-10 accuracy are 81.1%, 91.9% and 94.7% respectively.

DESCRIPTION OF PREFERRED EMBODIMENTS

Embodiment of the invention will be analyzed in details for the purpose of making this invention easier to be understood. And the general process is shown in Fig.l.

Different from the traditional Convolutional Neural Network(CNN) model such as alexnet, resnet-50 model is composed of residual blocks as shown in Fig.3. Compared with the traditional CNN model, it is a deeper residual network of 50 layers with shortcut connections, which make it more accurate in solving the degradation problem.

Triplet loss and hard mining are also applied to the project. Fig.2 shows its main steps.

Step A: Image Collection.

RelD project must work on a large data set, which should contain thousands of images of people, for training the model and improving its accuracy. Thus, data sets like market-1501 was found. A variety of images(over 10 thousand) of pedestrians are downloaded to the university server starting training, and 7,000 of them are used to train the model while the rest are used in model testing. Through establishing the connection between computer and server, the program is able to utilize images that had been downloaded.

Step B: Build a Resnet Model A resnet-50 model will be built for constantly extracting features layer by layer. Because of the total number of 50 layers, which makes the model much deeper than general model, resnet model has a better performance in the field of RelD. It consists of blocks which is shown on fig-3.

Step C: Start Training.

Images will be loaded to the program and re-sized. In the program, features of images will be extracted and stored as a tensor (data type), through using deep residual network named “resnet50”. These data will be used in the next step.

[] He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition^]. 2015:770-778.

Step D: Hard Mining and Triplet Loss

In this step, the features that have been extracted will be regarded as the input to calculate the output of triplet loss and hard mining, for training and the model updates the weights in the end.

Specifically, the program will make pairs of all the image features, and calculate Euclidean distance. The calculation can be expressed by

The distance represent the differences between images. Large distance means two images have a number of differences while small means a number of similarities. Then based on the principle of hard mining, each category(each person as a category) should be traversed to find the max intra-class distance and min inter-class distance and by adding up all classes' max intra-class distance and min inter-class distance, loss function is the outcome of average distance.

To identify the similar images more accurately, another conception will be introduced here, called triplet loss. It is involved in a function which can be expressed by

Where the D represents Euclidean distance,

represents all the anchors,

represent all positives, and

represents all negatives. The loss will be used to evaluate the difference between the target and the output from the network.

Step E: Update the model weights

The purpose of training the model is finding the most fitting weights in order to make the loss as lower as possible. The weights could be simply expressed by weight = weight - learning-rate x gradient. So back propagation will be applied to solve the partial derivative of all variables of this multi-layer complex function. Weights of the model will be updated constantly. And Step C and Step D will keep running in a loop until the epoch reach 150, in order to have the best weights in the model.

Step F: Save and Test

The model will be saved once it has been well trained after the several processes above. Then it will be loaded when it comes to evaluation and final test. In the test, MAP(Mean Average Precision) will be showed as a outcome(Fig.5).

Claims

CLAIM

1. A method based on a widely used model resnet-50, which compute the residual between layers, the model has 50 layers, consisting of residual blocks, different from common resnet-50 model, the model in the invention has been through adjustments to fit another mapping, for consistent with the our project.
2. A method based on a widely used model resnet-50 as claim 1, which is triplet loss, the core method of metric learning, is also applied in the invention, the main steps are as follow: First, samples, named anchor, are picked from the training data sets; then samples that are of the same kind and the different kind of Anchor are also chosen to form a triplet; the next step is to minimize the distance between the feature vectors of anchor and positive and enlarge that of anchor and negative.
3. A method based on a widely used model resnet-50 as claim 2, which is also adopted a significant step to train model, which is called hard mining, this approach focus on finding the hardest image that computer is able to identify and then makes it much more easier for computer through updating the weights of the model, which will further optimize the algorithm.
4. A method based on a widely used model resnet-50 as claim 3, which is the main steps are as follow: First, pairwise distance for the features of all images are calculated; then each category should be traversed to find the max intra-class distance and min inter-class distance; Finally, by adding up all classes' max intra-class distance and min inter-class distance, loss function is the outcome of average distance.