CN114387612A - Human body re-identification method and device based on bimodal feature fusion network - Google Patents

Human body re-identification method and device based on bimodal feature fusion network

Info

Publication number
CN114387612A
Authority
CN
China
Prior art keywords
human body
feature fusion
fusion network
bimodal
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111407271.5A
Other languages
Chinese (zh)
Inventor
Wang Wen (王文)
Hu Shunda (胡顺达)
Zhu Shiqiang (朱世强)
Song Wei (宋伟)
Lin Zheyuan (林哲远)
Jin Tianlei (金天磊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202111407271.5A priority Critical patent/CN114387612A/en
Publication of CN114387612A publication Critical patent/CN114387612A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body re-identification method and device based on a bimodal feature fusion network. The method comprises the following steps: acquiring a color image of the human body to be identified and a corresponding image of another modality; inputting the color image and the corresponding other-modality image into a trained bimodal feature fusion network, and extracting the features of the human body to be identified; and comparing the features of the human body to be identified with the features of a human body image library to obtain the identification result. For the person re-identification problem, inputting both the color image of the human body to be identified and the corresponding other-modality image into the trained bimodal feature fusion network yields features carrying richer information than features extracted from a single-modality image, so the re-identification accuracy is higher than that of re-identification based on a single-modality image.

Description

Human body re-identification method and device based on bimodal feature fusion network
Technical Field
The present application relates to the field of person re-identification in computer vision, and in particular to a human body re-identification method and device based on a bimodal feature fusion network.
Background
Person re-identification is a key technology in the field of computer vision, with broad application prospects and high application value. It plays a key role in practical scenarios such as autonomous driving, intelligent surveillance, human-computer interaction, and intelligent robots. With person re-identification, an autonomous vehicle can predict a pedestrian's trajectory and take evasive action in advance; in intelligent surveillance, suspects, lost children, and the like can be quickly retrieved from large volumes of video; in human-computer interaction, more intelligent interaction can be provided; and an intelligent robot can follow a target person.
In recent years, with the popularization of deep learning, person re-identification technology has developed rapidly.
In the course of implementing the invention, the inventors found that the prior art has at least the following problems:
Existing person re-identification models rely on the color image alone and learn features such as color and texture from it. The information carried by these features is limited, and it cannot meet the accuracy requirements of complex scenes, for example a campus where students wear school uniforms of the same color. Furthermore, a fleeing suspect often changes clothes as a disguise, and existing re-identification models cannot recognize a target who has changed clothes. Therefore, beyond single color-image features, how to incorporate features from images of other modalities, so as to enrich the information content of the finally extracted features, is an urgent problem for person re-identification models.
Disclosure of Invention
The embodiments of the present application aim to provide a human body re-identification method and device based on a bimodal feature fusion network, so as to solve the technical problem in the related art that the extracted features carry only limited, single-modality information.
According to a first aspect of the embodiments of the present application, a human body re-identification method based on a bimodal feature fusion network is provided, comprising:
acquiring a color image of the human body to be identified and a corresponding image of another modality;
inputting the color image and the corresponding other-modality image into a trained bimodal feature fusion network, and extracting the features of the human body to be identified;
and comparing the features of the human body to be identified with the features of a human body image library to obtain the identification result of the human body to be identified.
Further, the bimodal feature fusion network comprises:
a color-image feature extraction backbone network, used to extract first features from the color image of the human body to be identified;
an other-modality feature extraction backbone network, used to extract second features from the other-modality image of the human body to be identified; and
a bimodal feature fusion module, used to fuse the first features and the second features into the features of the human body to be identified.
Further, the training process of the bimodal feature fusion network comprises:
acquiring a training set, wherein the training set is divided into a plurality of subsets, each subset comprising color images of a number of persons and the corresponding other-modality images;
inputting one subset into the bimodal feature fusion network and extracting the features of the subset;
classifying the persons according to the features of the subset to obtain the cross-entropy loss;
dividing the features of the subset into triplets to obtain the triplet loss;
performing a weighted summation of the cross-entropy loss and the triplet loss to obtain the loss of the subset;
updating the parameters of the bimodal feature fusion network according to the loss of the subset to obtain an updated bimodal feature fusion network;
and for the remaining subsets, sequentially inputting each subset into the bimodal feature fusion network, extracting its features, and updating the parameters of the network according to its loss, until the loss of the subsets converges.
Further, the features of the human body image library are obtained by inputting each pair of images in the library (a human body's color image and its corresponding other-modality image) into the bimodal feature fusion network, wherein the library comprises the color images and corresponding other-modality images of a plurality of human bodies.
Further, comparing the features of the human body to be identified with the features of the human body image library to obtain the identification result comprises:
calculating the feature distance between the features of the human body to be identified and the features of the human body image library;
and taking the human body image corresponding to the smallest feature distance as the identification result of the human body to be identified.
According to a second aspect of the embodiments of the present application, a human body re-identification device based on a bimodal feature fusion network is provided, comprising:
an acquisition module, used to acquire a color image of the human body to be identified and a corresponding image of another modality;
a feature extraction module, used to input the color image and the corresponding other-modality image into a trained bimodal feature fusion network and extract the features of the human body to be identified;
and a comparison module, used to compare the features of the human body to be identified with the features of a human body image library to obtain the identification result of the human body to be identified.
Further, the training process of the bimodal feature fusion network comprises:
acquiring a training set, wherein the training set is divided into a plurality of subsets, each subset comprising color images of a number of persons and the corresponding other-modality images;
inputting one subset into the bimodal feature fusion network and extracting the features of the subset;
classifying the persons according to the features of the subset to obtain the cross-entropy loss;
dividing the features of the subset into triplets to obtain the triplet loss;
performing a weighted summation of the cross-entropy loss and the triplet loss to obtain the loss of the subset;
updating the parameters of the bimodal feature fusion network according to the loss of the subset to obtain an updated bimodal feature fusion network;
and for the remaining subsets, sequentially inputting each subset into the bimodal feature fusion network, extracting its features, and updating the parameters of the network according to its loss, until the loss of the subsets converges.
Further, the comparison module comprises:
a calculation submodule, used to calculate the feature distance between the features of the human body to be identified and the features of the human body image library;
and a setting submodule, used to take the human body image corresponding to the smallest feature distance as the identification result of the human body to be identified.
According to a third aspect of embodiments of the present application, there is provided an electronic apparatus, including:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in the first aspect.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which computer instructions are stored, wherein the instructions, when executed by a processor, implement the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
according to the embodiment, aiming at the problem of human body weight recognition, the color image of the human body to be recognized and other corresponding modal images are input into the trained bimodal feature fusion network for feature extraction, and the extracted feature information quantity is richer than the features extracted according to a single modal image; and comparing the extracted features with the features of the human body image library to obtain the recognition result of the human body to be recognized, wherein the accuracy of the human body re-recognition is higher than that of the human body re-recognition performed according to the single-mode image.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart illustrating a human body re-identification method based on a bimodal feature fusion network according to an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating a structure of a bimodal feature fusion network in accordance with an exemplary embodiment.
FIG. 3 is a flowchart illustrating a training process for a bimodal feature fusion network in accordance with an exemplary embodiment.
Fig. 4 is a flowchart illustrating step S13 according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a human body re-identification device based on a bimodal feature fusion network according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of devices and methods consistent with certain aspects of the application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination", depending on the context.
Fig. 1 is a flowchart illustrating a human body re-identification method based on a bimodal feature fusion network according to an exemplary embodiment. As shown in Fig. 1, the method may include the following steps:
Step S11: acquiring a color image of the human body to be identified and a corresponding image of another modality;
Step S12: inputting the color image and the corresponding other-modality image into a trained bimodal feature fusion network, and extracting the features of the human body to be identified;
Step S13: comparing the features of the human body to be identified with the features of the human body image library to obtain the identification result of the human body to be identified.
According to this embodiment, for the person re-identification problem, the color image of the human body to be identified and the corresponding other-modality image are input into the trained bimodal feature fusion network for feature extraction, and the extracted features carry richer information than features extracted from a single-modality image; the extracted features are then compared with the features of the human body image library to obtain the identification result, so the re-identification accuracy is higher than that of re-identification based on a single-modality image.
In step S11, a color image of the human body to be identified and a corresponding image of another modality are acquired.
Specifically, the color image of the human body to be identified is a color RGB image of that person; preferably, it may be captured by a mobile phone, an ordinary color camera, an industrial surveillance camera, or the like. The other-modality image is any image of a modality other than color; it can be acquired by hardware, or generated from the color image in software.
In a preferred example, a depth camera that also captures color can acquire the color image and a depth image simultaneously, with the depth image serving as the other-modality image; alternatively, a color camera with an infrared function can capture the color image and an infrared image simultaneously, with the infrared image serving as the other-modality image.
In another preferred example, a color image is acquired, and a human body contour map or a grayscale map is generated from it in software as the other-modality image.
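As an illustration of this software route, the following sketch derives a grayscale map or a rough contour map from the color image; OpenCV is assumed, and the Canny edge detector stands in for the unspecified contour-extraction step:

```python
import cv2

def derive_other_modality(bgr_image, mode="gray"):
    """Generate a second-modality image from a color frame in software.

    mode="gray" returns a grayscale map; mode="contour" returns an edge
    map as a stand-in for a human body contour map (the text does not
    name a specific contour algorithm, so Canny is an assumption).
    """
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    if mode == "gray":
        return gray
    return cv2.Canny(gray, threshold1=100, threshold2=200)
```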
In step S12, the color image and the corresponding other-modality image are input into the trained bimodal feature fusion network, and the features of the human body to be identified are extracted.
Specifically, the bimodal feature fusion network includes a color-image feature extraction backbone network, an other-modality feature extraction backbone network, and a bimodal feature fusion module. The color-image backbone extracts first features from the color image of the human body to be identified; the other-modality backbone extracts second features from the other-modality image; and the fusion module fuses the first and second features into the features of the human body to be identified.
Preferably, in the embodiment of the present application, as shown in Fig. 2, both backbones use the ResNet50 model, whose input is an image with a resolution of 256 × 128 pixels and whose output is a 2048-dimensional feature vector. The rectangular blocks in Fig. 2 represent the convolutional layers of the backbones. In a preferred example, the output of each convolutional layer of the color-image backbone and of the other-modality backbone can be fused by the bimodal feature fusion module, and the fused features fed into the next convolutional layer of the color-image backbone; semantic information at different levels of abstraction is thus fully exploited, giving the final human body features more expressive and discriminative power.
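A minimal PyTorch sketch of this architecture follows. The two ResNet50 backbones and the 256 × 128 input / 2048-dimensional output follow the text; the fusion operator itself is not specified, so a 1 × 1 convolution over the concatenated feature maps, applied once per residual stage and fed back into the color branch, is an assumption:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class BimodalFusionNet(nn.Module):
    """Two ResNet50 backbones, one per modality; the stage outputs are
    fused and passed down the color branch. Single-channel modalities
    (depth, grayscale) are assumed replicated to 3 channels."""

    def __init__(self):
        super().__init__()
        self.rgb = resnet50()
        self.aux = resnet50()
        # One 1x1-conv fusion block per residual stage (assumed design).
        stage_channels = [256, 512, 1024, 2048]
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for c in stage_channels
        )

    def _stem(self, net, x):
        return net.maxpool(net.relu(net.bn1(net.conv1(x))))

    def forward(self, rgb, aux):                  # each: (B, 3, 256, 128)
        r, a = self._stem(self.rgb, rgb), self._stem(self.aux, aux)
        for i, name in enumerate(["layer1", "layer2", "layer3", "layer4"]):
            r = getattr(self.rgb, name)(r)
            a = getattr(self.aux, name)(a)
            # Fuse the two modalities; the fused map continues down the
            # color branch, as described for Fig. 2.
            r = self.fuse[i](torch.cat([r, a], dim=1))
        return torch.flatten(self.rgb.avgpool(r), 1)  # (B, 2048) feature
```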
Specifically, as shown in Fig. 3, the training process of the bimodal feature fusion network includes:
Step S21: acquiring a training set, wherein the training set is divided into a plurality of subsets, each subset comprising color images of a number of persons and the corresponding other-modality images;
Specifically, the training set is acquired in the same way as the color image and corresponding other-modality image of the human body to be identified in step S11. Dividing the training set into subsets means evenly partitioning all color images and corresponding other-modality images into subsets of a certain size, where each subset consists of a number of anchor images, positive-sample images, and negative-sample images: an anchor image is an image randomly selected from the training set, a positive-sample image is an image of the same person as the anchor image, and a negative-sample image is an image of a different person. The subset size depends on circumstances. Generally, the larger the subset, the shorter the network training time but the higher the hardware requirements, especially on data storage; the smaller the subset, the longer the training but the lower the hardware requirements. The advantage of dividing the training set into subsets is that the subset size can be set flexibly according to the limitations of the available hardware.
Step S22: inputting one subset into the bimodal feature fusion network and extracting the features of the subset;
Specifically, before this step, data augmentation may be applied to the images in the subset, and the augmented subset input into the bimodal feature fusion network for feature extraction; this reduces overfitting to the training set and improves re-identification accuracy. Preferably, the image data are augmented with techniques such as random flipping and random cropping.
Step S23: classifying the persons according to the features of the subset to obtain the cross-entropy loss;
Specifically, the cross-entropy loss function $\mathcal{L}_{ce}$ is:

$$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N} g_i^{\top}\,\log\mathrm{softmax}\!\left(W f_i + b\right)$$

where $N$ is the number of image pairs (a color image and its corresponding other-modality image) in the subset, $g_i$ is the one-hot encoded label of the $i$-th image sample, $W$ and $b$ are the weight and bias parameters of the last fully connected layer of the bimodal feature fusion network, and $f_i$ is the feature vector extracted by the network.
Step S24: dividing the features of the subset into triplets to obtain the triplet loss;
Specifically, the triplet loss function $\mathcal{L}_{tri}$ is:

$$\mathcal{L}_{tri} = \sum_{(a,p,n)} \max\bigl(d(f_a, f_p) - d(f_a, f_n) + m,\; 0\bigr)$$

where $a$, $p$, and $n$ denote the anchor image, positive-sample image, and negative-sample image of a triplet; $f_a$, $f_p$, $f_n$ are their respective features; $d(\cdot,\cdot)$ computes the distance between two feature vectors; and $m$ is the distance-margin parameter of the triplet loss.
Step S25: performing a weighted summation of the cross-entropy loss and the triplet loss to obtain the loss of the subset;
Specifically, the loss $\mathcal{L}$ of the subset is:

$$\mathcal{L} = \lambda\,\mathcal{L}_{tri} + (1-\lambda)\,\mathcal{L}_{ce}$$

where $\lambda$ is the weighting coefficient of the summation, $0 \le \lambda \le 1$. The larger $\lambda$ is, the more the network attends to the triplet loss; conversely, the smaller it is, the more the network attends to the cross-entropy loss. Preferably, $\lambda = 0.5$.
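As an illustration of steps S23-S25 together, a PyTorch sketch of the subset loss; batch-hard triplet mining within the subset and the margin value 0.3 are assumptions the text does not fix:

```python
import torch
import torch.nn.functional as F

def subset_loss(features, logits, targets, margin=0.3, lam=0.5):
    """L = lam * L_tri + (1 - lam) * L_ce, as in the weighted sum above.

    `targets` are integer person labels (equivalent to the one-hot g);
    the hardest positive/negative per anchor is an assumed mining rule.
    """
    ce = F.cross_entropy(logits, targets)                    # L_ce
    dist = torch.cdist(features, features)                   # pairwise d(.,.)
    same = targets.unsqueeze(0) == targets.unsqueeze(1)
    hardest_pos = (dist * same).max(dim=1).values            # farthest positive
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    tri = F.relu(hardest_pos - hardest_neg + margin).mean()  # L_tri
    return lam * tri + (1 - lam) * ce
```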
Step S26: updating the parameters of the bimodal feature fusion network according to the loss of the subset to obtain an updated bimodal feature fusion network;
Step S27: for the remaining subsets, sequentially inputting each subset into the bimodal feature fusion network, extracting its features, and updating the parameters of the network according to its loss, until the loss of the subsets converges.
In the specific implementation of steps S26 and S27, the parameters of the bimodal feature fusion network may be updated with optimizers such as Adam or SGD. The model trained in step S27, i.e., the model whose loss has converged over the input subsets, may be accelerated by techniques such as model pruning and quantization, and finally deployed in a production environment.
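A sketch of the corresponding update loop, reusing the `BimodalFusionNet` and `subset_loss` sketches above; Adam, the learning rate, and a separate classifier head standing in for the last fully connected layer (W, b) are assumptions consistent with, but not fixed by, the text:

```python
import torch

def train(model, classifier, loader, epochs=60, lr=3e-4):
    """Assumed training loop: Adam updates the whole network from the
    subset loss until it converges. `loader` is assumed to yield
    (rgb, aux, targets) subsets of paired images."""
    params = list(model.parameters()) + list(classifier.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for rgb, aux, targets in loader:
            feats = model(rgb, aux)             # (B, 2048) fused features
            logits = classifier(feats)          # last FC layer (W, b)
            loss = subset_loss(feats, logits, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                    # step S26: update parameters
```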
In step S13, the features of the human body to be identified are compared with the features of the human body image library to obtain the identification result.
Specifically, the features of the human body image library are obtained by inputting each pair of images in the library (a human body's color image and its corresponding other-modality image) into the bimodal feature fusion network, where the library comprises the color images and corresponding other-modality images of a plurality of human bodies.
In a specific implementation, the human body image library consists of color images of human bodies and the corresponding other-modality images; the number of persons in the library is not limited and may be one or many. In a preferred example, such as screening for suspects in surveillance video, the library may consist of a single suspect or of multiple suspects, i.e., multiple suspects can be screened simultaneously.
Specifically, as shown in Fig. 4, this step includes the following sub-steps:
Step S31: calculating the feature distance between the features of the human body to be identified and the features of the human body image library;
Step S32: taking the human body image corresponding to the smallest feature distance as the identification result of the human body to be identified.
In the specific implementation of steps S31 and S32, the feature distance expresses the similarity between two feature vectors, i.e., between two human bodies: the smaller the distance, the greater the similarity and the more likely the two images show the same person. Preferably, the feature distance may be computed as a Euclidean distance or a cosine distance.
Corresponding to the embodiments of the human body re-identification method based on a bimodal feature fusion network, the present application also provides embodiments of a human body re-identification device based on a bimodal feature fusion network.
Fig. 5 is a block diagram illustrating a human body re-identification device based on a bimodal feature fusion network according to an exemplary embodiment. Referring to Fig. 5, the device may include:
an acquisition module 21, configured to acquire a color image of the human body to be identified and a corresponding image of another modality;
a feature extraction module 22, configured to input the color image and the corresponding other-modality image into a trained bimodal feature fusion network and extract the features of the human body to be identified;
and a comparison module 23, configured to compare the features of the human body to be identified with the features of the human body image library to obtain the identification result of the human body to be identified.
Specifically, this module may include the following submodules:
a calculation submodule, configured to calculate the feature distance between the features of the human body to be identified and the features of the human body image library;
and a setting submodule, configured to take the human body image corresponding to the smallest feature distance as the identification result of the human body to be identified.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.
Correspondingly, the present application also provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the human body re-identification method based on a bimodal feature fusion network as described above.
Correspondingly, the present application also provides a computer-readable storage medium on which computer instructions are stored, the instructions, when executed by a processor, implementing the human body re-identification method based on a bimodal feature fusion network as described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations that follow its general principles and include such departures from the present disclosure as come within known or customary practice in the art. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.

Claims (10)

1. A human body re-identification method based on a bimodal feature fusion network, characterized by comprising:
acquiring a color image of the human body to be identified and a corresponding image of another modality;
inputting the color image and the corresponding other-modality image into a trained bimodal feature fusion network, and extracting the features of the human body to be identified;
and comparing the features of the human body to be identified with the features of a human body image library to obtain the identification result of the human body to be identified.
2. The method of claim 1, wherein the bimodal feature fusion network comprises:
a color-image feature extraction backbone network, used to extract first features from the color image of the human body to be identified;
an other-modality feature extraction backbone network, used to extract second features from the other-modality image of the human body to be identified; and
a bimodal feature fusion module, used to fuse the first features and the second features into the features of the human body to be identified.
3. The method of claim 1, wherein the training process of the bimodal feature fusion network comprises:
acquiring a training set, wherein the training set is divided into a plurality of subsets, each subset comprising color images of a number of persons and the corresponding other-modality images;
inputting one subset into the bimodal feature fusion network and extracting the features of the subset;
classifying the persons according to the features of the subset to obtain the cross-entropy loss;
dividing the features of the subset into triplets to obtain the triplet loss;
performing a weighted summation of the cross-entropy loss and the triplet loss to obtain the loss of the subset;
updating the parameters of the bimodal feature fusion network according to the loss of the subset to obtain an updated bimodal feature fusion network;
and for the remaining subsets, sequentially inputting each subset into the bimodal feature fusion network, extracting its features, and updating the parameters of the network according to its loss, until the loss of the subsets converges.
4. The method according to claim 1, wherein the features of the human body image library are obtained by inputting each pair of images in the library (a human body's color image and its corresponding other-modality image) into the bimodal feature fusion network, and wherein the library comprises the color images and corresponding other-modality images of a plurality of human bodies.
5. The method according to claim 1, wherein comparing the features of the human body to be identified with the features of the human body image library to obtain the identification result comprises:
calculating the feature distance between the features of the human body to be identified and the features of the human body image library;
and taking the human body image corresponding to the smallest feature distance as the identification result of the human body to be identified.
6. A human body re-identification device based on a bimodal feature fusion network, characterized by comprising:
an acquisition module, used to acquire a color image of the human body to be identified and a corresponding image of another modality;
a feature extraction module, used to input the color image and the corresponding other-modality image into a trained bimodal feature fusion network and extract the features of the human body to be identified;
and a comparison module, used to compare the features of the human body to be identified with the features of a human body image library to obtain the identification result of the human body to be identified.
7. The device of claim 6, wherein the training process of the bimodal feature fusion network comprises:
acquiring a training set, wherein the training set is divided into a plurality of subsets, each subset comprising color images of a number of persons and the corresponding other-modality images;
inputting one subset into the bimodal feature fusion network and extracting the features of the subset;
classifying the persons according to the features of the subset to obtain the cross-entropy loss;
dividing the features of the subset into triplets to obtain the triplet loss;
performing a weighted summation of the cross-entropy loss and the triplet loss to obtain the loss of the subset;
updating the parameters of the bimodal feature fusion network according to the loss of the subset to obtain an updated bimodal feature fusion network;
and for the remaining subsets, sequentially inputting each subset into the bimodal feature fusion network, extracting its features, and updating the parameters of the network according to its loss, until the loss of the subsets converges.
8. The device of claim 6, wherein the comparison module comprises:
a calculation submodule, used to calculate the feature distance between the features of the human body to be identified and the features of the human body image library;
and a setting submodule, used to take the human body image corresponding to the smallest feature distance as the identification result of the human body to be identified.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
10. A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-5.
CN202111407271.5A 2021-11-24 2021-11-24 Human body re-identification method and device based on bimodal feature fusion network Pending CN114387612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111407271.5A CN114387612A (en) 2021-11-24 2021-11-24 Human body re-identification method and device based on bimodal feature fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111407271.5A CN114387612A (en) 2021-11-24 2021-11-24 Human body re-identification method and device based on bimodal feature fusion network

Publications (1)

Publication Number Publication Date
CN114387612A (en) 2022-04-22

Family

ID=81195492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111407271.5A Pending CN114387612A (en) 2021-11-24 2021-11-24 Human body re-identification method and device based on bimodal feature fusion network

Country Status (1)

Country Link
CN (1) CN114387612A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115223018A * 2022-06-08 2022-10-21 Northeast Petroleum University (东北石油大学) Cooperative detection method and device for disguised object, electronic device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination