TWI727548B - Method for face recognition and device thereof - Google Patents


Info

Publication number
TWI727548B
TWI727548B TW108145586A
Authority
TW
Taiwan
Prior art keywords
image set
images
feature extraction
network
modal
Prior art date
Application number
TW108145586A
Other languages
Chinese (zh)
Other versions
TW202036367A (en)
Inventor
于志鵬
Original Assignee
大陸商北京市商湯科技開發有限公司
Priority date
Filing date
Publication date
Application filed by 大陸商北京市商湯科技開發有限公司
Publication of TW202036367A
Application granted
Publication of TWI727548B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193 Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The disclosure provides a face recognition method and device. The method includes: obtaining an image to be recognized; and recognizing the image to be recognized based on a cross-modal face recognition network to obtain a recognition result, where the cross-modal face recognition network is trained on face image data of different modalities. The disclosure further provides a corresponding device. In the embodiments of the disclosure, a neural network is trained on image sets divided by category to obtain the cross-modal face recognition network, and the network is used to recognize whether the objects of each category are the same person, which improves recognition accuracy.

Description

Face recognition method, electronic device, and computer-readable storage medium

The embodiments of the present disclosure relate to the field of image processing technology, and in particular to a face recognition method and device.

Fields such as security, social insurance, and communications need to determine whether the person objects in different images are the same person, in order to support operations such as face tracking, real-name authentication, and mobile phone unlocking. At present, performing face recognition separately on the person objects in different images with a face recognition algorithm can determine whether they are the same person, but the recognition accuracy is low.

The present disclosure provides a face recognition method for recognizing whether the person objects in different images are the same person.

In a first aspect, a face recognition method is provided, including: obtaining an image to be recognized; and recognizing the image to be recognized based on a cross-modal face recognition network to obtain a recognition result of the image to be recognized, where the cross-modal face recognition network is trained on face image data of different modalities.
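Purely as an illustrative sketch of the first-aspect flow (none of this is the patent's implementation; the network is a stand-in function and every name here is hypothetical), the two steps — obtain an image, then score it with the trained cross-modal network — look like:

```python
# Hypothetical sketch: the cross-modal network is represented as a plain
# scoring function; a real system would load trained model weights instead.
from typing import Callable, Sequence

def recognize(image: Sequence[float],
              cross_modal_net: Callable[[Sequence[float]], float],
              threshold: float = 0.5) -> bool:
    """Return True if the network's score says the face matches the enrolled identity."""
    score = cross_modal_net(image)  # score produced by the trained network
    return score >= threshold

# Toy stand-in network: "score" is just the mean pixel value.
toy_net = lambda img: sum(img) / len(img)
print(recognize([0.9, 0.8, 0.7], toy_net))  # True: mean 0.8 clears the threshold
```

The threshold comparison stands in for whatever decision rule the deployed system applies to the network's output.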

In a possible implementation, training the cross-modal face recognition network on face image data of different modalities includes: training based on a first modal network and a second modal network to obtain the cross-modal face recognition network.

In another possible implementation, before the cross-modal face recognition network is obtained by training based on the first modal network and the second modal network, the method further includes: training the first modal network based on a first image set and a second image set, where the objects in the first image set belong to a first category and the objects in the second image set belong to a second category.

In yet another possible implementation, training the first modal network based on the first image set and the second image set includes: training the first modal network based on the first image set and the second image set to obtain the second modal network; selecting a first number of images from the first image set and a second number of images from the second image set according to a preset condition, and obtaining a third image set from the first number of images and the second number of images; and training the second modal network based on the third image set to obtain the cross-modal face recognition network.

In yet another possible implementation, the preset condition includes any one of the following: the first number is the same as the second number; the ratio of the first number to the second number equals the ratio of the number of images in the first image set to the number of images in the second image set; or the ratio of the first number to the second number equals the ratio of the number of people in the first image set to the number of people in the second image set.

In yet another possible implementation, the first modal network includes a first feature extraction branch, a second feature extraction branch, and a third feature extraction branch. Training the first modal network based on the first image set and the second image set to obtain the second modal network includes: inputting the first image set into the first feature extraction branch, the second image set into the second feature extraction branch, and a fourth image set into the third feature extraction branch to train the first modal network, where the images in the fourth image set are images collected in the same scene or by the same collection method; and taking the trained first feature extraction branch, the trained second feature extraction branch, or the trained third feature extraction branch as the second modal network.

In yet another possible implementation, inputting the first image set into the first feature extraction branch, the second image set into the second feature extraction branch, and the fourth image set into the third feature extraction branch to train the first modal network includes: inputting the first image set, the second image set, and the fourth image set into the first feature extraction branch, the second feature extraction branch, and the third feature extraction branch respectively, to obtain a first recognition result, a second recognition result, and a third recognition result; obtaining a first loss function of the first feature extraction branch, a second loss function of the second feature extraction branch, and a third loss function of the third feature extraction branch; and adjusting the parameters of the first modal network according to the first image set, the first recognition result, and the first loss function, the second image set, the second recognition result, and the second loss function, and the fourth image set, the third recognition result, and the third loss function, to obtain an adjusted first modal network, where the parameters of the first modal network include the parameters of the first, second, and third feature extraction branches, and the branch parameters of the adjusted first modal network are the same.

In yet another possible implementation, the images in the first image set carry first annotation information, the images in the second image set carry second annotation information, and the images in the fourth image set carry third annotation information. Adjusting the parameters of the first modal network according to the first image set, the first recognition result, and the first loss function, the second image set, the second recognition result, and the second loss function, and the fourth image set, the third recognition result, and the third loss function, to obtain the adjusted first modal network, includes: obtaining a first gradient according to the first annotation information, the first recognition result, the first loss function, and the initial parameters of the first feature extraction branch; obtaining a second gradient according to the second annotation information, the second recognition result, the second loss function, and the initial parameters of the second feature extraction branch; obtaining a third gradient according to the third annotation information, the third recognition result, the third loss function, and the initial parameters of the third feature extraction branch; and taking the average of the first gradient, the second gradient, and the third gradient as the back-propagation gradient of the first modal network, and adjusting the parameters of the first modal network through the back-propagation gradient so that the parameters of the first, second, and third feature extraction branches are the same.
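The gradient-averaging step described above can be illustrated with a minimal numerical sketch. This is not the patent's actual implementation: the three per-branch gradients are given as plain lists of toy numbers, and the "parameters" are a single vector shared by all branches.

```python
# Minimal sketch of the described update: average the three per-branch gradients
# into one back-propagation gradient, then apply it to the shared parameters,
# so all three branches keep identical weights. All values are toy numbers.

def averaged_backprop(params, grad_b1, grad_b2, grad_b3, lr=0.1):
    """One update step: shared gradient = mean of the three branch gradients."""
    shared_grad = [(g1 + g2 + g3) / 3.0
                   for g1, g2, g3 in zip(grad_b1, grad_b2, grad_b3)]
    # Because the branches share parameters, one update keeps them equal.
    return [p - lr * g for p, g in zip(params, shared_grad)]

params = [1.0, 1.0]
updated = averaged_backprop(params, [0.3, 0.6], [0.6, 0.0], [0.0, 0.3])
print(updated)  # both components move by lr * mean gradient, i.e. to about 0.97
```

The mean-of-gradients choice is what ties the three branches to a single set of weights; per-branch updates would let the branch parameters diverge.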

In yet another possible implementation, selecting a first number of images from the first image set and a second number of images from the second image set according to the preset condition to obtain the third image set includes: selecting f images from each of the first image set and the second image set such that the number of people in the f images equals a threshold, to obtain the third image set; or selecting m images from the first image set and n images from the second image set such that the ratio of m to n equals the ratio of the number of images in the first image set to the number of images in the second image set, and the number of people in the m images and in the n images both equal the threshold, to obtain the third image set; or selecting s images from the first image set and t images from the second image set such that the ratio of s to t equals the ratio of the number of people in the first image set to the number of people in the second image set, and the number of people in the s images and in the t images both equal the threshold, to obtain the third image set.
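As an illustration only, the three alternative selection rules can be written out in code. The sketch below simplifies the people-counting and threshold logic, uses integer division (so the ratios are only approximately preserved), and every name in it is hypothetical rather than taken from the patent:

```python
# Hypothetical sketch of the three preset conditions for sizing the third image
# set: equal counts, counts proportional to image counts, or counts
# proportional to people counts.
import random

def pick_counts(rule, imgs1, imgs2, people1, people2, total=100):
    """Return (first_number, second_number) for a budget of `total` images."""
    if rule == "equal":          # first number == second number
        half = total // 2
        return half, total - half
    if rule == "image_ratio":    # ratio equals ratio of image counts
        m = total * imgs1 // (imgs1 + imgs2)
        return m, total - m
    if rule == "people_ratio":   # ratio equals ratio of people counts
        s = total * people1 // (people1 + people2)
        return s, total - s
    raise ValueError(rule)

# Toy image sets: (person_id, image_id) records, 30 and 20 people respectively.
first_set = [("person_%d" % (i % 30), i) for i in range(600)]
second_set = [("person_%d" % (i % 20 + 30), i) for i in range(400)]

m, n = pick_counts("image_ratio", len(first_set), len(second_set), 30, 20)
third_set = random.sample(first_set, m) + random.sample(second_set, n)
print(m, n, len(third_set))  # 60 40 100
```

With 600 and 400 images, the "image_ratio" rule yields 60 and 40 picks, matching the 3:2 ratio of the source sets.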

In yet another possible implementation, training the second modal network based on the third image set to obtain the cross-modal face recognition network includes: sequentially performing feature extraction, a linear transformation, and a nonlinear transformation on the images in the third image set to obtain a fourth recognition result; and adjusting the parameters of the second modal network according to the images in the third image set, the fourth recognition result, and a fourth loss function of the second modal network, to obtain the cross-modal face recognition network.
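The forward pass described here (feature extraction, then a linear transformation, then a nonlinear transformation) has the shape of a standard classification head. The following is a minimal sketch with toy weights, purely illustrative of the order of operations and not the patent's network; a softmax stands in for the unspecified nonlinear transformation:

```python
# Hypothetical sketch of the stated pipeline: features -> linear -> nonlinear.
import math

def extract_features(image):
    """Toy feature extraction: mean and max of the pixel values."""
    return [sum(image) / len(image), max(image)]

def linear(features, weights, bias):
    """Linear transformation: one output per weight row."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Nonlinear transformation producing per-class scores."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2-class head over a 3-pixel "image".
W, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
scores = softmax(linear(extract_features([0.2, 0.4, 0.6]), W, b))
print(sum(scores))  # the class scores sum to 1
```

In training, the scores (the "fourth recognition result") would feed the fourth loss function, whose gradient adjusts the second modal network's parameters.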

In yet another possible implementation, the first category and the second category correspond to different races.

In a second aspect, a face recognition device is provided, including: an acquisition unit configured to obtain an image to be recognized; and a recognition unit configured to recognize the image to be recognized based on a cross-modal face recognition network to obtain a recognition result of the image to be recognized, where the cross-modal face recognition network is trained on face image data of different modalities.

In a possible implementation, the recognition unit includes a training subunit configured to perform training based on a first modal network and a second modal network to obtain the cross-modal face recognition network.

In another possible implementation, the training subunit is further configured to train the first modal network based on a first image set and a second image set, where the objects in the first image set belong to a first category and the objects in the second image set belong to a second category.

In yet another possible implementation, the training subunit is further configured to: train the first modal network based on the first image set and the second image set to obtain the second modal network; select a first number of images from the first image set and a second number of images from the second image set according to a preset condition, and obtain a third image set from the first number of images and the second number of images; and train the second modal network based on the third image set to obtain the cross-modal face recognition network.

In yet another possible implementation, the preset condition includes any one of the following: the first number is the same as the second number; the ratio of the first number to the second number equals the ratio of the number of images in the first image set to the number of images in the second image set; or the ratio of the first number to the second number equals the ratio of the number of people in the first image set to the number of people in the second image set.

In yet another possible implementation, the first modal network includes a first feature extraction branch, a second feature extraction branch, and a third feature extraction branch, and the training subunit is further configured to: input the first image set into the first feature extraction branch, the second image set into the second feature extraction branch, and a fourth image set into the third feature extraction branch to train the first modal network, where the images in the fourth image set are images collected in the same scene or by the same collection method; and take the trained first feature extraction branch, the trained second feature extraction branch, or the trained third feature extraction branch as the second modal network.

In yet another possible implementation, the training subunit is further configured to: input the first image set, the second image set, and the fourth image set into the first feature extraction branch, the second feature extraction branch, and the third feature extraction branch respectively, to obtain a first recognition result, a second recognition result, and a third recognition result; obtain a first loss function of the first feature extraction branch, a second loss function of the second feature extraction branch, and a third loss function of the third feature extraction branch; and adjust the parameters of the first modal network according to the first image set, the first recognition result, and the first loss function, the second image set, the second recognition result, and the second loss function, and the fourth image set, the third recognition result, and the third loss function, to obtain an adjusted first modal network, where the parameters of the first modal network include the parameters of the first, second, and third feature extraction branches, and the branch parameters of the adjusted first modal network are the same.

In yet another possible implementation, the images in the first image set carry first annotation information, the images in the second image set carry second annotation information, and the images in the fourth image set carry third annotation information, and the training subunit is further configured to: obtain a first gradient according to the first annotation information, the first recognition result, the first loss function, and the initial parameters of the first feature extraction branch; obtain a second gradient according to the second annotation information, the second recognition result, the second loss function, and the initial parameters of the second feature extraction branch; obtain a third gradient according to the third annotation information, the third recognition result, the third loss function, and the initial parameters of the third feature extraction branch; and take the average of the first, second, and third gradients as the back-propagation gradient of the first modal network, and adjust the parameters of the first modal network through the back-propagation gradient so that the parameters of the first, second, and third feature extraction branches are the same.

In yet another possible implementation, the training subunit is further configured to: select f images from each of the first image set and the second image set such that the number of people in the f images equals a threshold, to obtain the third image set; or select m images from the first image set and n images from the second image set such that the ratio of m to n equals the ratio of the number of images in the first image set to the number of images in the second image set, and the number of people in the m images and in the n images both equal the threshold, to obtain the third image set; or select s images from the first image set and t images from the second image set such that the ratio of s to t equals the ratio of the number of people in the first image set to the number of people in the second image set, and the number of people in the s images and in the t images both equal the threshold, to obtain the third image set.

In yet another possible implementation, the training subunit is further configured to: sequentially perform feature extraction, a linear transformation, and a nonlinear transformation on the images in the third image set to obtain a fourth recognition result; and adjust the parameters of the second modal network according to the images in the third image set, the fourth recognition result, and a fourth loss function of the second modal network, to obtain the cross-modal face recognition network.

In yet another possible implementation, the first category and the second category correspond to different races.

In a third aspect, an electronic device is provided, including a processor and a memory. The processor is configured to support the device in performing the corresponding functions of the method of the first aspect and any one of its possible implementations. The memory is coupled to the processor and stores the programs (instructions) and data necessary for the device. Optionally, the device may further include an input/output interface for supporting communication between the device and other devices.

In a fourth aspect, a computer-readable storage medium is provided, storing instructions that, when run on a computer, cause the computer to perform the method of the first aspect and any one of its possible implementations.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.

1: face recognition device

11: acquisition unit

12: recognition unit

121: training subunit

2: face recognition device

21: processor

22: input device

23: output device

24: memory

In order to more clearly describe the technical solutions in the embodiments of the present disclosure or in the background art, the drawings required for the embodiments or the background art are described below.

The drawings here are incorporated into and constitute a part of this specification; they illustrate embodiments conforming to the present disclosure and, together with the specification, serve to explain the technical solutions of the present disclosure.

FIG. 1 is a schematic flowchart of a face recognition method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of training a first modal network based on a first image set and a second image set, provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of another training method for a face recognition neural network provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of another training method for a face recognition neural network provided by an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of training a neural network based on image sets classified by race, provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a face recognition device provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of the hardware structure of a face recognition device provided by an embodiment of the present disclosure.

To enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below in conjunction with the accompanying drawings of those embodiments. The described embodiments are obviously only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.

The terms "first", "second", and the like in the specification, the claims, and the above-mentioned drawings of the present disclosure are used to distinguish different objects, not to describe a specific order. In addition, the terms "including" and "having", and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor do they refer to independent or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

In the embodiments of the present disclosure, the number of people is not equal to the number of person objects. For example, suppose image A contains two objects, Zhang San and Li Si; image B contains one object, Zhang San; and image C contains two objects, Zhang San and Li Si. The number of people included in images A, B, and C is then 2 (Zhang San and Li Si), while the number of objects included in images A, B, and C is 2 + 1 + 2 = 5, that is, the number of objects is 5.
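The counting convention above can be stated as a short sketch. The helper name and the list-of-lists representation (one inner list of person names per image) are illustrative and not part of the disclosure:

```python
def count_people_and_objects(images):
    """images: list of images, each a list of the person names appearing in it.

    Returns (number of people, number of objects): every appearance counts as
    one object, while people are counted as distinct identities.
    """
    objects = sum(len(people) for people in images)
    people = len({name for people in images for name in people})
    return people, objects

# The example from the text: A = {Zhang San, Li Si}, B = {Zhang San}, C = {Zhang San, Li Si}.
imgs = [["Zhang San", "Li Si"], ["Zhang San"], ["Zhang San", "Li Si"]]
print(count_people_and_objects(imgs))  # → (2, 5)
```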


The embodiments of the present disclosure are described below in conjunction with the accompanying drawings of those embodiments.

Please refer to FIG. 1, which is a schematic flowchart of a face recognition method provided by an embodiment of the present disclosure.

101. Obtain an image to be recognized. In the embodiments of the present disclosure, the image to be recognized may come from an image collection stored on a local terminal (such as a mobile phone, a tablet computer, or a notebook computer); any frame of a video may also be used as the image to be recognized; alternatively, a face region image may be detected in any frame of a video and used as the image to be recognized.

102. Recognize the image to be recognized based on a cross-modal face recognition network to obtain a recognition result of the image to be recognized, where the cross-modal face recognition network is trained on face image data of different modalities. In the embodiments of the present disclosure, the cross-modal face recognition network can recognize images containing objects of different categories; for example, it can recognize whether the objects in two images are the same person. The categories may be divided by age, by race, or by region. For example, people aged 0-3 may be placed in a first category, people aged 4-10 in a second category, people aged 11-20 in a third category, and so on. Alternatively, people of the yellow race may be placed in a first category, people of the white race in a second category, people of the black race in a third category, and people of the brown race in a fourth category. As a further alternative, people in the China region may be placed in a first category, people in the Thailand region in a second category, people in the India region in a third category, people in the Cairo region in a fourth category, people in the Africa region in a fifth category, and people in the Europe region in a sixth category. The embodiments of the present disclosure do not limit how the categories are divided.

In some possible implementations, a face region image of an object captured by a mobile phone camera and a previously stored face region image are input, as a set of images to be recognized, to the face recognition neural network, which recognizes whether the objects contained in the set are the same person. In other possible implementations, camera A captures a first image to be recognized at a first moment and camera B captures a second image to be recognized at a second moment; the first image to be recognized and the second image to be recognized are input, as a set of images to be recognized, to the face recognition neural network, which recognizes whether the objects contained in the two images are the same person. In the embodiments of the present disclosure, face image data of different modalities refers to image sets containing objects of different categories. The cross-modal face recognition network is obtained by pre-training with face image sets of different modalities as training sets. The cross-modal face recognition network may be any neural network capable of extracting features from images: it may be stacked or composed in a certain manner from network units such as convolutional layers, nonlinear layers, and fully connected layers, or an existing neural network structure may be used; the present disclosure does not specifically limit the structure of the cross-modal face recognition network.

In one possible implementation, two images to be recognized are input to the cross-modal face recognition network, which performs feature extraction on each image to obtain their respective features and then compares the extracted features to obtain a feature matching score. If the matching score reaches a matching threshold, the objects in the two images are recognized as the same person; conversely, if the matching score does not reach the threshold, the objects in the two images are recognized as not the same person. In this embodiment, a cross-modal face recognition network is obtained by training a neural network on image sets divided by category; using the cross-modal face recognition network to recognize whether objects of each category are the same person improves recognition accuracy.
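As a concrete illustration of the matching step, the sketch below scores two feature vectors with cosine similarity and compares the score against a threshold. The `cosine_similarity` and `same_person` helpers, the toy vectors, and the 0.7 threshold are assumptions for illustration only; the disclosure does not fix a particular similarity measure or threshold value.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_person(feat1, feat2, threshold=0.7):
    """Declare a match when the feature matching score reaches the threshold."""
    return cosine_similarity(feat1, feat2) >= threshold

# Two nearly parallel feature vectors score high and are declared a match;
# a dissimilar vector falls below the threshold.
f1 = [0.9, 0.1, 0.4]
f2 = [0.8, 0.15, 0.45]
print(same_person(f1, f2))                 # → True
print(same_person(f1, [-0.9, 0.5, -0.1]))  # → False
```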

The following embodiments describe some possible implementations of step 102 of the face recognition method provided by the present disclosure.

A cross-modal face recognition network is obtained by training based on a first modal network and a second modal network. The first modal network and the second modal network may each be any neural network capable of extracting features from images: they may be stacked or composed in a certain manner from network units such as convolutional layers, nonlinear layers, and fully connected layers, or existing neural network structures may be used; the present disclosure does not specifically limit the structure of the cross-modal face recognition network. In some possible implementations, the first modal network and the second modal network are trained with different image sets as training sets, so that the networks learn the features of objects of different categories; the features learned by the first modal network and the second modal network are then combined to obtain the cross-modal network, enabling the cross-modal network to recognize objects of different categories. Optionally, before the cross-modal face recognition network is obtained by training based on the first modal network and the second modal network, the first modal network is trained based on a first image set and a second image set. The objects in the first image set and the second image set may include only human faces, or may include faces together with other parts such as the torso; the present disclosure does not specifically limit this. In some possible implementations, the first modal network is trained with the first image set as the training set to obtain the second modal network, so that the second modal network can recognize whether the objects in multiple images containing objects of the first category are the same person. The second modal network is then trained with the second image set as the training set to obtain the cross-modal face recognition network, which can recognize whether the objects in multiple images containing objects of the first category are the same person, as well as whether the objects in multiple images containing objects of the second category are the same person. In this way, the cross-modal face recognition network has a high recognition rate both when recognizing objects of the first category and when recognizing objects of the second category.

In other possible implementations, all images in the first image set and the second image set are used together as a training set to train the first modal network, yielding a cross-modal face recognition network that can recognize whether the objects in multiple images containing objects of the first category or the second category are the same person. In still other possible implementations, a images are selected from the first image set and b images are selected from the second image set to obtain a training set, where a:b satisfies a preset ratio; the first modal network is then trained with this training set to obtain a cross-modal face recognition network with high accuracy in recognizing whether the person objects in multiple images containing objects of the first category or the second category are the same person.

The cross-modal face recognition network determines whether the objects in different images are the same person based on a feature matching score, and the facial features of people of different categories differ considerably. The feature matching threshold (the score at or above which two objects are recognized as the same person) therefore differs between categories of people. By training on image sets that contain objects of different categories together, the training method provided in this embodiment reduces the differences between the feature matching scores with which the cross-modal face recognition network recognizes person objects of different categories.

In this embodiment, the neural networks (the first modal network and the second modal network) are trained on image sets divided by category, so that the networks learn the facial features of objects of different categories simultaneously. The cross-modal face recognition network obtained by training can then recognize whether objects of each category are the same person, improving recognition accuracy; training the neural networks with image sets of different categories simultaneously also reduces the differences between the recognition criteria the networks apply to person objects of different categories.

Please refer to FIG. 2, which is a schematic flowchart of some possible implementations, provided by an embodiment of the present disclosure, of training the first modal network based on the first image set and the second image set.

201. Train the first modal network based on the first image set and the second image set to obtain the second modal network, where the objects in the first image set belong to the first category and the objects in the second image set belong to the second category. In the embodiments of the present disclosure, the first modal network can be obtained in a variety of ways. In some possible implementations, the first modal network is obtained from another device, for example by receiving a first modal network sent by a terminal device. In other possible implementations, the first modal network is stored on the local terminal and can be loaded from the local terminal. As described above, the first category included in the first image set differs from the second category included in the second image set. Training the first modal network with the first image set and the second image set as training sets enables the first modal network to learn the features of the first category and the second category, improving the accuracy of recognizing whether objects of the first category and the second category are the same person. In some possible implementations, the objects included in the first image set are people aged 11-20 and the objects included in the second image set are people aged 20-30. Training the first modal network with the first image set and the second image set as training sets then yields a second modal network with high recognition accuracy for objects aged 11-20 and for objects aged 20-30.

202. Select a first number of images from the first image set and a second number of images from the second image set according to a preset condition, and obtain a third image set from the first number of images and the second number of images. Because the features of the first category differ considerably from the features of the second category, the criterion the neural network uses to recognize whether objects of the first category are the same person also differs from the criterion it uses to recognize whether objects of the second category are the same person. The recognition criterion may be the matching score between features extracted from different objects. For example, because the facial features and facial contours of people aged 0-3 are less distinctive than those of people aged 20-30, the neural network learns more features for objects aged 20-30 than for objects aged 0-3 during training; the trained network therefore needs a higher feature matching score to recognize whether objects aged 0-3 are the same person. For example, when recognizing whether objects aged 0-3 are the same person, two objects with a feature matching score greater than or equal to 0.8 are determined to be the same person, and two objects with a score below 0.8 are determined not to be the same person; when recognizing whether objects aged 20-30 are the same person, two objects with a feature matching score greater than or equal to 0.65 are determined to be the same person, and two objects with a score below 0.65 are determined not to be the same person. In this situation, applying the recognition criterion for objects aged 0-3 to objects aged 20-30 tends to cause two objects that are in fact the same person to be recognized as different people; conversely, applying the recognition criterion for objects aged 20-30 to objects aged 0-3 tends to cause two objects that are in fact different people to be recognized as the same person.
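The per-category thresholds in the example above can be sketched as a simple lookup. The category names and the helper are hypothetical; the 0.8 and 0.65 values are the ones used in the example:

```python
# Hypothetical per-category matching thresholds, using the example values from
# the text: ages 0-3 need a stricter threshold (0.8) than ages 20-30 (0.65).
CATEGORY_THRESHOLDS = {"age_0_3": 0.80, "age_20_30": 0.65}

def same_person_for_category(match_score, category):
    """Apply the feature-matching threshold belonging to the given category."""
    return match_score >= CATEGORY_THRESHOLDS[category]

# A score of 0.70 is a match for a 20-30-year-old pair but not for a
# 0-3-year-old pair: a single shared threshold would misclassify one group.
print(same_person_for_category(0.70, "age_20_30"))  # → True
print(same_person_for_category(0.70, "age_0_3"))    # → False
```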

In this embodiment, a first number of images is selected from the first image set and a second number of images is selected from the second image set according to a preset condition, and the first number of images and the second number of images are used as the training set. This makes the proportions in which the second modal network learns the features of different categories more balanced during training, reducing the differences between the recognition criteria for objects of different categories. In some possible implementations, if the number of people covered by the first number of images selected from the first image set and the number of people covered by the second number of images selected from the second image set are both X, it is only necessary that the images selected from each of the two sets cover X people; the number of images selected from the first image set and from the second image set is not limited.
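One minimal way to realize this preset condition — an equal number of people drawn from each set, with the image count per person left free — is sketched below. The dict-of-lists representation (person id mapped to that person's images), the `build_third_set` helper, and the sorted-order selection rule are illustrative assumptions, not details fixed by the disclosure:

```python
def build_third_set(first_set, second_set, num_people):
    """Build the third image set so that each source set contributes images
    covering exactly `num_people` distinct identities; the number of images
    contributed per set may differ."""
    third = []
    for image_set in (first_set, second_set):
        chosen_ids = sorted(image_set)[:num_people]  # arbitrary deterministic pick
        for pid in chosen_ids:
            third.extend(image_set[pid])
    return third

# Toy sets: person id -> list of image names.
first = {"a": ["a1", "a2"], "b": ["b1"], "c": ["c1"]}
second = {"x": ["x1"], "y": ["y1", "y2", "y3"]}
# Both sets contribute 2 identities each, even though the image counts differ.
print(build_third_set(first, second, 2))  # → ['a1', 'a2', 'b1', 'x1', 'y1', 'y2', 'y3']
```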

203. Train the second modal network based on the third image set to obtain the cross-modal face recognition network. The third image set includes both the first category and the second category, and the numbers of people of the first category and of the second category are selected according to the preset condition; this is what distinguishes the third image set from a randomly selected image set. Training the second modal network with the third image set as the training set therefore makes the second modal network's learning of first-category features and second-category features more balanced. In addition, if the second modal network is trained with supervision, a softmax function can be used during training to classify the category to which the object in each image belongs, and the parameters of the second modal network can be adjusted based on the supervision labels, the classification results, and a loss function. In some possible implementations, each image in the third image set corresponds to a label; for example, the same object in image A and image B has the label 1, while another object in image C has the label 2. The softmax function is:

S_j = \frac{e^{P_j}}{\sum_{k=1}^{t} e^{P_k}}

Here, t is the number of people included in the third image set, S_j is the probability that the object belongs to class j, P_j is the j-th value of the feature vector input to the softmax layer, and P_k is the k-th value of that feature vector. A loss function layer containing the loss function is connected after the softmax layer. From the probability values output by the softmax layer, the labels of the third image set, and the loss function, the backpropagation gradient of the second neural network to be trained is obtained; performing gradient backpropagation on the second neural network to be trained according to this gradient yields the cross-modal face recognition network. Because the third image set contains objects of both the first category and the second category, and the numbers of people of the two categories satisfy the preset condition, training the second modal network with the third image set as the training set balances the proportions in which the network learns first-category and second-category facial features. As a result, the final cross-modal face recognition network has a high recognition rate when recognizing whether objects of the first category are the same person, and likewise a high recognition rate when recognizing whether objects of the second category are the same person. In some possible implementations, the loss function is:

L = -\sum_{j=1}^{t} y_j \ln S_j

Here, t is the number of people included in the third image set, S_j is the probability that the person object belongs to class j, and y_j is the label indicating whether the person object belongs to class j. For example, if the third image set includes an image of Zhang San with the label 1, then for that object the label for class 1 is 1 and the label for every other class is 0. By training the first modal network with the category-divided first image set and second image set as training sets, the embodiments of the present disclosure improve the first modal network's recognition accuracy for the first category and the second category; by training the second modal network with the third image set as the training set, the second modal network balances the proportions in which it learns first-category and second-category facial features, so that the trained cross-modal face recognition network has high accuracy both in recognizing whether objects of the first category are the same person and in recognizing whether objects of the second category are the same person.
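The softmax and loss formulas above can be checked with a small pure-Python sketch. The max-shift inside `softmax` is a standard numerical-stability trick, not part of the disclosure:

```python
import math

def softmax(p):
    """S_j = exp(P_j) / sum_k exp(P_k), computed with a max-shift for stability."""
    m = max(p)
    exps = [math.exp(v - m) for v in p]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, labels):
    """L = -sum_j y_j * ln(S_j) for a one-hot label vector."""
    return -sum(y * math.log(s) for y, s in zip(labels, probs))

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(probs)                            # probabilities summing to 1
print(cross_entropy(probs, [1, 0, 0]))  # loss for true class 0
```

As expected, the loss shrinks as the network grows more confident in the correct class, which is what drives the backpropagation gradient described above.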

Please refer to FIG. 3, which is a schematic flowchart of a possible implementation of step 201 provided by an embodiment of the present disclosure.

301、將第一圖像集輸入至第一特徵提取分支,並將第二圖像集輸入至第二特徵提取分支,並將第四圖像集輸入至第三特徵提取分支,對第一模態網路進行訓練,其中,第四圖像集包括的圖像為同一場景下採集的圖像或同一採集方式採集的圖像。在本公開實施例中,第四圖像集包 括的圖像為同一場景下採集的圖像或同一採集方式採集的圖像,例如:第四圖像集包括的圖像均是用手機拍攝的圖像;再例如:第四圖像集包括的圖像均是室內拍攝的圖像;又例如:第四圖像集包括的圖像均是在港口拍攝的圖像,本公開實施例對第四圖像集中的圖像的場景和採集方式不做限定。在本公開實施例中,第一模態網路包括第一特徵提取分支、第二特徵提取分支以及第三特徵提取分支,其中,第一特徵提取分支、第二特徵提取分支以及第三特徵提取分支均可以是任意具備從圖像中提取特徵中功能的神經網路結構,如:可以基於卷積層、非線性層、全連接層等網路單元按照一定方式堆疊或組成,也可以採用現有的神經網路的結構,本公開對第一特徵提取分支、第二特徵提取分支以及第三特徵提取分支的結構不做具體限定。在本實施例中,第一圖像集、第二圖像集以及第四圖像集中的圖像分別包括第一標注資訊、第二標注資訊以及第三標注資訊,其中,標注資訊包括圖像中包含的對象的編號,例如:第一圖像集、第二圖像集以及第四圖像集中包含的人數均為Y(Y為大於1的整數),對第一圖像集、第二圖像集以及第四圖像集中的任意一張圖像均包含對象對應的編號均為1~Y之間任意一個數字。需要理解的是,同一個人的對象在不同圖像中的編號相同,例如:圖像A中的對象為張三,圖像B中的對象也為張三,則圖像A中的對象與圖像B中的對象的編號相同,反之,圖像C中的對象為李四,則圖像C中的對象的編號與圖像A中的對象的編號不同。為使各圖像集包含的對象的人臉 特徵可起到對應該類別人臉特徵的代表性的作用,可選地,每個圖像集包含的人數均在5000人以上,需要理解的是,本公開實施例對圖像集中圖像的數量不做限定。在本公開實施例中,第一特徵提取分支的初始參數、第二特徵提取分支的初始參數以及第三特徵提取分支的初始參數分別指未調整參數前的第一特徵提取分支的參數、未調整參數前的第二特徵提取分支的參數以及未調整參數前的第三特徵提取分支的參數。第一模態網路的各分支包括第一特徵提取分支、第二特徵提取分支以及第三特徵提取分支。將第一圖像集輸入至第一特徵提取分支,並將第二圖像集輸入至第二特徵提取分支,並將第四圖像集輸入至第三特徵提取分支,即用第一特徵提取分支去學習第一圖像集包含的對象的人臉特徵,用第二特徵提取分支去學習第二圖像集包含的對象的人臉特徵,用第三特徵提取分支去學習第四圖像集包含的對象的人臉特徵,並根據各個特徵提取分支的softmax函數以及損失函數確定各個特徵提取分支的反向傳播梯度,最後根據各個特徵提取分支的反向傳播梯度確定第一模態網路的反向傳播梯度,並對第一模態網路的參數進行調整。需要理解的是,對第一模態網路的參數進行調整即對所有特徵提取分支的初始參數進行調整,由於每個特徵提取分支的反向傳播梯度均相同,最終調整後的參數也都相同,每個分支的反向傳播梯度代表每個特徵提取分支參數的調整方向,即通過特徵提取分支的反向傳播梯度調整分支的參數,可提高特徵提取分支識別對應類別(與輸入的圖像集包含的類別相同)的 對象的準確率。通過第一特徵提取分支和第二特徵提取分支的反向傳播梯度調整神經網路的參數,可綜合各個分支參數的調整方向,得到一個平衡的調整方向,由於第四圖像集包含特定場景下或特定拍攝方式採集得到的圖像,通過第三特徵提取分支的反向傳播梯度調整第一模態網路的參數可提高第一模態網路的魯棒性(即對圖像採集場景和圖像採集方式的魯棒性高)。通過三個特徵提取分支的反向傳播梯度得到的反向傳播梯度來調整第一模態網路的參數可使任意一個特徵提取分支識別對應類別(第一圖像集以及第二圖像集包含的類別中的任意一個)的對象都有較高的準確率,且可提高任意一個特徵提取分支在圖像採集場景和圖像採集方式方面的魯棒性。 301. Input the first image set to the first feature extraction branch, and input the second image set to the second feature extraction branch, and input the fourth image set to the third feature extraction branch. 
The state network is trained, where the images included in the fourth image set are images collected in the same scene or images collected in the same collection method. In the embodiment of the present disclosure, the fourth image collection package The enclosed images are the images collected in the same scene or the images collected in the same collection method, for example: the images included in the fourth image set are all images taken with a mobile phone; another example: the fourth image set includes The images in are all images taken indoors; for another example: the images included in the fourth image set are all images taken at the port, and the embodiments of the present disclosure compare the scenes and collection methods of the images in the fourth image set Not limited. In the embodiment of the present disclosure, the first modal network includes a first feature extraction branch, a second feature extraction branch, and a third feature extraction branch, where the first feature extraction branch, the second feature extraction branch, and the third feature extraction branch The branches can be any neural network structure that has the function of extracting features from the image. For example, it can be stacked or composed in a certain way based on network units such as convolutional layer, nonlinear layer, and fully connected layer, or it can use existing ones. For the structure of the neural network, the present disclosure does not specifically limit the structures of the first feature extraction branch, the second feature extraction branch, and the third feature extraction branch. 
In this embodiment, the images in the first image set, the second image set, and the fourth image set carry the first annotation information, the second annotation information, and the third annotation information, respectively, where the annotation information includes the number assigned to the object contained in the image. For example, if the number of people included in the first, second, and fourth image sets is Y (Y is an integer greater than 1), then the object in any image of these sets is assigned a number between 1 and Y. It should be understood that the same person carries the same number in different images: if the object in image A is Zhang San and the object in image B is also Zhang San, then the two objects have the same number; conversely, if the object in image C is Li Si, the number of the object in image C differs from that of the object in image A. So that the face features of the objects in each image set are representative of the face features of the corresponding category, optionally, each image set contains more than 5,000 people. It should be understood that the embodiment of the present disclosure does not limit the number of images in each image set. In the embodiments of the present disclosure, the initial parameters of the first, second, and third feature extraction branches refer to the parameters of the respective branches before any parameter adjustment.

The first modal network comprises the first feature extraction branch, the second feature extraction branch, and the third feature extraction branch. The first image set is input to the first feature extraction branch, the second image set is input to the second feature extraction branch, and the fourth image set is input to the third feature extraction branch; that is, the first branch learns the face features of the objects contained in the first image set, the second branch learns the face features of the objects contained in the second image set, and the third branch learns the face features of the objects contained in the fourth image set. The back propagation gradient of each branch is determined according to the softmax function and the loss function of that branch, the back propagation gradient of the first modal network is then determined from the gradients of the three branches, and the parameters of the first modal network are adjusted accordingly. It should be understood that adjusting the parameters of the first modal network means adjusting the initial parameters of all feature extraction branches; since the back propagation gradient applied to each branch is the same, the adjusted parameters of the branches are also the same. The back propagation gradient of each branch represents the adjustment direction of that branch's parameters, so adjusting a branch's parameters through its back propagation gradient improves the accuracy with which the branch identifies objects of the corresponding category (the category contained in its input image set).
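The numbering scheme described above (the same person carries the same number in every image in which that person appears) can be sketched as follows; the helper and the identity names are illustrative only, not part of the disclosed method:

```python
# Hypothetical annotation step: each person is assigned one number,
# shared across all images in which that person appears.
identity_to_number = {}

def annotate(images):
    """Assign each (image_id, person) pair the number of the person it
    contains; the same person receives the same number in different
    images, and a new person receives the next unused number."""
    annotations = []
    for img_id, person in images:
        if person not in identity_to_number:
            identity_to_number[person] = len(identity_to_number) + 1
        annotations.append((img_id, identity_to_number[person]))
    return annotations

# Images A and B both contain Zhang San; image C contains Li Si.
anns = annotate([("A", "Zhang San"), ("B", "Zhang San"), ("C", "Li Si")])
print(anns)  # [('A', 1), ('B', 1), ('C', 2)]
```

Zhang San keeps number 1 in both images A and B, while Li Si in image C receives a different number, matching the example in the text.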
Adjusting the parameters through the back propagation gradients of the first and second feature extraction branches integrates the adjustment directions of the two branches into a balanced adjustment direction. Because the fourth image set contains images acquired in a specific scene or by a specific shooting method, adjusting the parameters of the first modal network through the back propagation gradient of the third feature extraction branch improves the robustness of the first modal network with respect to the image acquisition scene and acquisition method. Adjusting the parameters of the first modal network through the gradient obtained from the back propagation gradients of all three feature extraction branches allows any one branch to identify objects of the corresponding categories (any of the categories contained in the first and second image sets) with a higher accuracy rate, while improving the robustness of any branch to the image acquisition scene and acquisition method.
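The way the three branch gradients are combined into one shared update, keeping all branch parameters identical, can be sketched as follows (a minimal NumPy illustration; the gradient values are stand-ins, not outputs of a real network):

```python
import numpy as np

def fuse_branch_gradients(grad1, grad2, grad3):
    """Average the back propagation gradients of the three feature
    extraction branches to obtain the gradient used to update the
    shared parameters of the first modal network."""
    return (grad1 + grad2 + grad3) / 3.0

# Stand-in gradients for a single shared weight tensor.
g1 = np.array([0.3, -0.6])
g2 = np.array([0.0,  0.3])
g3 = np.array([0.3,  0.3])

fused = fuse_branch_gradients(g1, g2, g3)

# Since every branch is updated with the same fused gradient, the
# branch parameters stay identical after each update step.
lr = 0.1
w = np.zeros(2)
w_branch1 = w - lr * fused
w_branch2 = w - lr * fused
print(fused)                                  # [0.2 0. ]
print(np.array_equal(w_branch1, w_branch2))   # True
```

The fused gradient balances the three adjustment directions, which is why the branches end the training of 301 with the same parameters.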

In some possible implementations, the first image set is input to the first feature extraction branch, the second image set is input to the second feature extraction branch, and the fourth image set is input to the third feature extraction branch; each branch performs, in turn, feature extraction processing, fully connected layer processing, and softmax layer processing, yielding the first recognition result, the second recognition result, and the third recognition result, respectively. The softmax layer contains the softmax function; see formula (1), which will not be repeated here. The first, second, and third recognition results each include, for every object, the probability that the object's number is each candidate number. For example, if the number of people included in the first, second, and fourth image sets is Y (Y is an integer greater than 1), and the object in any image of these sets carries a number between 1 and Y, then the first recognition result includes the probabilities that the number of a person object in the first image set is 1 through Y; that is, the first recognition result for each object contains Y probabilities. Similarly, the second recognition result includes the probabilities that the numbers of the objects in the second image set are 1 through Y, and the third recognition result includes the probabilities that the numbers of the objects in the fourth image set are 1 through Y. In each branch, the softmax layer is followed by a loss function layer containing a loss function; the first loss function of the first branch, the second loss function of the second branch, and the third loss function of the third branch are obtained. The first loss is obtained from the first annotation information of the first image set, the first recognition result, and the first loss function; the second loss is obtained from the second annotation information of the second image set, the second recognition result, and the second loss function; and the third loss is obtained from the third annotation information of the fourth image set, the third recognition result, and the third loss function. For the first, second, and third loss functions, see formula (2), which will not be repeated here. The parameters of the three feature extraction branches are then obtained: the first gradient is computed from the parameters of the first branch and the first loss, the second gradient from the parameters of the second branch and the second loss, and the third gradient from the parameters of the third branch and the third loss, where the first, second, and third gradients are the back propagation gradients of the first, second, and third feature extraction branches, respectively. The back propagation gradient of the first modal network is obtained from the first, second, and third gradients, and the parameters of the first modal network are adjusted by gradient back propagation so that the parameters of the first, second, and third feature extraction branches become the same. In some possible implementations, the average of the first, second, and third gradients is taken as the back propagation gradient of the first neural network to be trained, and the first modal network is back-propagated with this gradient, adjusting the parameters of the three feature extraction branches so that, after the adjustment, their parameters are the same.
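A minimal sketch of how one branch could turn its fully connected output into a recognition result and a loss; the softmax follows formula (1), and writing the loss as cross-entropy is an assumption about the form of formula (2):

```python
import numpy as np

def softmax(logits):
    """Softmax over the Y candidate object numbers (formula (1)):
    exponentiate and normalize to probabilities."""
    e = np.exp(logits - logits.max())  # subtract max for stability
    return e / e.sum()

def branch_loss(logits, label):
    """Cross-entropy between the recognition result and the annotated
    object number (an assumed form of the loss in formula (2))."""
    probs = softmax(logits)
    return -np.log(probs[label])

# Fully connected output for one image, with Y = 4 candidate numbers;
# the annotated number of the object is candidate 0.
logits = np.array([2.0, 0.5, 0.1, -1.0])
probs = softmax(logits)       # the recognition result: Y probabilities
loss = branch_loss(logits, label=0)
```

The Y probabilities sum to 1, and the loss shrinks as the probability assigned to the annotated number grows; each of the three branches computes its own loss this way before the gradients are averaged.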

302. Use the trained first feature extraction branch, the trained second feature extraction branch, or the trained third feature extraction branch as the second modal network. Through the processing of 301, the parameters of the trained first, second, and third feature extraction branches are the same; that is, each branch has high recognition accuracy for objects of the first category (the category contained in the first image set) and the second category (the category contained in the second image set), and is robust when recognizing images collected in different scenes or by different collection methods. Therefore, the trained first, second, or third feature extraction branch is used as the network for the next stage of training, namely the second modal network. In the embodiment of the present disclosure, the first and second image sets are image sets selected by category, while the fourth image set is selected by scene and shooting method. Training the first feature extraction branch with the first image set lets it focus on learning the face features of the first category; training the second branch with the second image set lets it focus on the face features of the second category; and training the third branch with the fourth image set lets it focus on the face features of the objects in the fourth image set, improving the robustness of the third branch. Obtaining the back propagation gradient of the first modal network from the back propagation gradients of the three feature extraction branches and back-propagating that gradient through the first modal network takes into account the parameter adjustment directions of all three branches at once, so that the adjusted first modal network is robust and recognizes person objects of the first and second categories with high accuracy.

The following embodiments are some possible implementations of step 202. To make the second modal network learn the features of the first and second categories more evenly when trained on the third image set, the preset condition may be that the first number is the same as the second number. In one possible implementation, f images are selected from each of the first and second image sets such that the number of people included in the f images equals the threshold, yielding the third image set. In some possible implementations, the threshold is 1,000: f images are selected from each of the first and second image sets such that each selection contains 1,000 people, where f may be any positive integer; the f images selected from the first set and the f images selected from the second set together form the third image set.

To make the second modal network learn the features of the first and second categories in a more targeted way when trained on the third image set, the preset condition may instead be that the ratio of the first number to the second number equals the ratio of the number of images contained in the first image set to the number of images contained in the second image set, or that it equals the ratio of the number of people contained in the first image set to the number of people contained in the second image set. In this way, the ratio at which the second modal network learns first-category features to second-category features is fixed, which can compensate for the difference between the recognition standards of the two categories. In one possible implementation, m images and n images are selected from the first and second image sets, respectively, such that the ratio of m to n equals the ratio of the number of images in the first set to the number of images in the second set, and the m images and the n images each contain the threshold number of people, yielding the third image set. In some possible implementations, the first image set contains 7,000 images, the second image set contains 8,000 images, and the threshold is 1,000: the m images selected from the first set and the n images selected from the second set each contain 1,000 people, with m:n = 7:8 (m and n may be any positive integers); the m images and the n images together form the third image set.

In another possible implementation, s images and t images are selected from the first and second image sets, respectively, such that the ratio of s to t equals the ratio of the number of people contained in the first set to the number of people contained in the second set, and the s images and the t images each contain the threshold number of people, yielding the third image set. In some possible implementations, the first image set contains 6,000 people, the second image set contains 7,000 people, and the threshold is 1,000: the s images selected from the first set and the t images selected from the second set each contain 1,000 people, with s:t = 6:7 (s and t may be any positive integers); the s images and the t images together form the third image set.
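The ratio-preserving selection above (e.g. m:n = 7:8 for image sets of 7,000 and 8,000 images) can be sketched with a small helper; the function name and the image budget parameter are illustrative, not part of the disclosure:

```python
from fractions import Fraction

def select_counts(size_first, size_second, max_total):
    """Choose how many images (m, n) to draw from the first and second
    image sets so that m:n equals the ratio of the two set sizes,
    without exceeding max_total images overall."""
    ratio = Fraction(size_first, size_second)  # 7000:8000 -> 7:8
    m_unit, n_unit = ratio.numerator, ratio.denominator
    k = max_total // (m_unit + n_unit)         # largest whole multiple
    return m_unit * k, n_unit * k

# First set: 7000 images, second set: 8000 images, budget 3000 images.
m, n = select_counts(7000, 8000, 3000)
print(m, n)   # 1400 1600  -> m:n == 7:8
```

The same helper covers the second variant: calling it with the people counts (e.g. `select_counts(6000, 7000, ...)`) yields selections in the ratio 6:7.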

This embodiment provides several ways of selecting images from the first and second image sets; different selection methods yield different third image sets, and the selection method can be chosen according to the specific training effect and requirements.

Please refer to FIG. 4, which is a schematic flowchart of a possible implementation of step 203 provided by an embodiment of the present disclosure.

401. Perform feature extraction processing, linear transformation, and nonlinear transformation on the images in the third image set in sequence to obtain a fourth recognition result. First, the second modal network performs feature extraction processing on the images in the third image set. The feature extraction processing can be implemented in a variety of ways, such as convolution and pooling, which are not specifically limited in the embodiment of the present disclosure. In some possible implementations, the second modal network includes multiple convolutional layers, which convolve the images in the third image set layer by layer to complete the feature extraction. The feature content and semantic information extracted by each convolutional layer differ: the feature extraction abstracts the image's features step by step while gradually discarding relatively minor features, so the deeper the layer, the smaller the extracted feature maps and the more condensed their content and semantic information. Convolving the images in the third image set stage by stage through the multiple convolutional layers and extracting the corresponding features finally yields a feature image of fixed size; in this way, the main content information of the images to be processed (that is, the feature images of the images in the third image set) is obtained while the image size is reduced, lowering the computational load of the system and increasing its speed. In one possible implementation, the convolution processing proceeds as follows: the convolutional layer convolves the image to be processed by sliding the convolution kernel over the image in the third image set; each pixel covered by the kernel is multiplied by the corresponding kernel value, all the products are summed to give the output pixel value at the position of the kernel's central element, and the sliding continues until every pixel of the image has been processed, extracting the corresponding feature image. The convolutional layers are followed by a fully connected layer, which applies a linear transformation to the feature images extracted by the convolutional layers, mapping the features in the feature image to the sample (that is, object number) label space. The fully connected layer is followed by a softmax layer, which processes the extracted feature images to obtain the fourth recognition result; for the composition of the softmax layer and its processing of the feature images, see 301, which will not be repeated here. The fourth recognition result includes the probabilities that the numbers of the objects in the third image set are 1 through Z (where Z is the number of people included in the third image set); that is, the fourth recognition result for each object contains Z probabilities.
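The sliding-kernel convolution described above can be sketched as a naive NumPy implementation (stride 1, no padding; purely illustrative, not the disclosed network):

```python
import numpy as np

def conv2d_single(image, kernel):
    """Naive sliding-window convolution: the kernel slides over the
    image, each covered pixel is multiplied by the matching kernel
    value, and the products are summed into the output pixel at the
    kernel's centre position (valid padding, stride 1)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3)) / 9.0          # simple averaging kernel
feat = conv2d_single(img, k)
print(feat.shape)   # (2, 2) -- the feature map is smaller than the input
```

The shrinking output shape illustrates the point in the text that successive convolutions reduce the image size while condensing its content.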

402. Adjust the parameters of the second modal network according to the images in the third image set, the fourth recognition result, and the fourth loss function of the second modal network to obtain a cross-modal face recognition network. The softmax layer is followed by a loss function layer containing the fourth loss function; for the expression of the fourth loss function, see formula (2). Since the third image set input to the second neural network to be trained contains objects of different categories, the face features of objects of different categories are compared together in the process of obtaining the fourth recognition result through the softmax function, normalizing the recognition standards of the different categories; that is, objects of different categories are recognized with the same recognition standard. Finally, the parameters of the second modal network are adjusted through the fourth recognition result and the fourth loss function, so that the adjusted second modal network recognizes objects of different categories with the same recognition standard, improving the recognition accuracy for objects of different categories. In some possible implementations, the recognition standard of the first category is 0.8 and that of the second category is 0.65; through the training of 402, the parameters of the second modal network and the recognition standard are adjusted, and the recognition standard is finally determined to be 0.72. Since the parameters of the second modal network are adjusted along with the recognition standard, the cross-modal face recognition network obtained after the adjustment reduces the difference between the recognition standard of the first category and that of the second category.
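One way to read the "recognition standard" in the example above is as a similarity threshold for deciding whether two face images show the same person; this interpretation, and the helper below, are assumptions for illustration only:

```python
def same_person(similarity, threshold=0.72):
    """After the training of 402, a single unified recognition standard
    (here modelled as a similarity threshold of 0.72) replaces the
    per-category standards (0.8 and 0.65 in the example), so objects
    of both categories are judged by the same rule."""
    return similarity >= threshold

print(same_person(0.75))  # True  -- above the unified standard
print(same_person(0.70))  # False -- below the unified standard
```

Under the old per-category standards, a score of 0.70 would pass for the second category (0.65) but fail for the first (0.8); the unified standard removes that discrepancy.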

In the embodiment of the present disclosure, training the second modal network with the third image set as the training set allows the face features of objects of different categories to be compared together, normalizing the recognition standards of the different categories. By adjusting the parameters of the second modal network, the resulting cross-modal face recognition network not only has high accuracy in recognizing whether objects of the first category are the same person, but also has high accuracy in recognizing whether objects of the second category are the same person, reducing the difference between the recognition standards used to determine whether objects of different categories are the same person. As described above, the categories of the person objects contained in the training image sets may be divided by age, by race, or by region. The present disclosure provides a method of training a neural network based on image sets classified by race, that is, the first and second categories correspond to different races, which can improve the accuracy of the neural network in recognizing objects of different races.

Please refer to FIG. 5, which is a flowchart of a method, provided by the present disclosure, for training a neural network based on image sets obtained by classification according to race.

501. Obtain a base image set, race image sets, and a third modal network. In the embodiment of the present disclosure, the base image set may include one or more image sets. Specifically, the images in the eleventh image set are all collected indoors; the images in the twelfth image set are all collected at a port; the images in the thirteenth image set are all collected in the field; the images in the fourteenth image set are all collected in crowds; the images in the fifteenth image set are all certificate images; the images in the sixteenth image set are all taken with mobile phones; the images in the seventeenth image set are all captured by cameras; the images in the eighteenth image set are all frames taken from videos; the images in the nineteenth image set are all downloaded from the Internet; and the images in the twentieth image set are all obtained by processing celebrity images. It should be understood that any image set in the base image set contains only images collected in the same scene or by the same collection method; that is, the image sets in the base image set correspond to the fourth image set in 301. People from China are classified as the first race, people from Thailand as the second race, people from India as the third race, people from Cairo as the fourth race, people from Africa as the fifth race, and people from Europe as the sixth race. Correspondingly, there are six race image sets containing the above six races: specifically, the fifth image set contains the first race, the sixth image set contains the second race, ..., and the tenth image set contains the sixth race. It should be understood that the objects in any one of the race image sets all belong to the same race (that is, the same category); that is, the image sets in the race image sets correspond to the first image set or the second image set in 101.

So that the face features of the objects in each image set are representative of the face features of the corresponding category, optionally, each image set contains more than 5,000 people; it should be understood that the embodiment of the present disclosure does not limit the number of images in an image set. It should also be understood that races may be divided in other ways: for example, dividing by skin color yields four races (yellow, white, black, and brown), and this embodiment does not limit the manner of race division. The objects in the base image set and the race image sets may include only human faces, or may also include the torso and other parts, which is not specifically limited in the present disclosure. In this embodiment, the third modal network may be any neural network capable of extracting features from an image: for example, it may be stacked or composed in a certain way of network units such as convolutional layers, nonlinear layers, and fully connected layers, or may adopt an existing neural network structure. The present disclosure does not specifically limit the structure of the third modal network.

502. Train the third modal network based on the base image set and the race image sets to obtain a fourth modal network. For details of this step, see 201 and 301–302, which will not be repeated here. It should be understood that since the base image set includes 10 image sets and the race image sets include 6 image sets, the third modal network correspondingly includes 16 feature extraction branches, one for each image set. Through the processing of 502, the fourth modal network's accuracy in recognizing whether objects of different races are the same person can be improved, that is, the recognition accuracy within each race is improved. Specifically, the fourth modal network recognizes with high accuracy whether objects of the first, second, third, fourth, fifth, or sixth race are the same person, and is robust when recognizing images collected in different scenes or by different collection methods.

503. Train the fourth modal network based on the ethnic image set to obtain the cross-ethnic face recognition network. For details of this step, refer to 202~203 and 401~402, which will not be repeated here. Through the processing of 503, the difference in recognition standards when the cross-ethnic face recognition network judges whether objects of different races are the same person is reduced, and the network's recognition accuracy for objects of different races is improved. Specifically, the network's accuracy in recognizing whether objects belonging to the first race in different images are the same person, whether objects belonging to the second race are the same person, and so on through the sixth race, are all above a preset value. It should be understood that the preset value indicates that the cross-ethnic face recognition network achieves high recognition accuracy for every race; the present disclosure does not limit its specific size. Optionally, the preset value is 98%.

Optionally, to simultaneously improve within-race recognition accuracy and reduce the difference in recognition standards between races, 502 and 503 may be repeated multiple times. In some possible implementations, the third modal network is first trained for 100,000 iterations in the training mode of 502; in the next 100,000 to 150,000 iterations, the proportion of the 502 mode gradually decreases to 0 while that of the 503 mode gradually increases to 1; iterations 150,000 to 250,000 are completed entirely in the 503 mode; in iterations 250,000 to 300,000, the proportion of the 503 mode gradually decreases to 0 while that of the 502 mode gradually increases to 1; finally, in iterations 300,000 to 400,000, the 502 mode and the 503 mode each account for half. It should be understood that the embodiments of the present disclosure do not limit the specific iteration counts of each stage or the proportions of the 502 and 503 training modes.

The cross-ethnic face recognition network obtained by this embodiment can determine whether objects of multiple races are the same person with high accuracy. For example, the same network can recognize people in China, in the Cairo area, and in Europe, with high recognition accuracy for each race. This addresses the problem that a face recognition algorithm may be highly accurate for one race but inaccurate for others. In addition, applying this embodiment improves the robustness of the cross-ethnic face recognition network when recognizing images collected in different scenarios or by different collection methods. Those skilled in the art can understand that, in the above methods of the specific implementations, the writing order of the steps does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
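The alternating schedule described above can be expressed as a piecewise weight on the 502-style training mode (a sketch only; the iteration boundaries follow the example figures in the text, and the function name is hypothetical):

```python
def mode_502_weight(iteration):
    """Proportion of training done in the step-502 mode at a given iteration.

    Piecewise-linear schedule mirroring the example in the text:
      0    - 100k : only 502                 -> weight 1.0
      100k - 150k : 502 fades out linearly   -> 1.0 down to 0.0
      150k - 250k : only 503                 -> weight 0.0
      250k - 300k : 502 fades back in        -> 0.0 up to 1.0
      300k - 400k : half 502, half 503       -> weight 0.5
    The 503 weight is 1 minus this value.
    """
    k = 1000
    if iteration < 100 * k:
        return 1.0
    if iteration < 150 * k:
        return 1.0 - (iteration - 100 * k) / (50 * k)
    if iteration < 250 * k:
        return 0.0
    if iteration < 300 * k:
        return (iteration - 250 * k) / (50 * k)
    return 0.5
```

A training loop would draw a batch for the 502 or 503 objective with probability given by this weight at each iteration.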

The method of the embodiments of the present disclosure has been described in detail above; the device of the embodiments of the present disclosure is described below.

Please refer to FIG. 6, which is a schematic structural diagram of a face recognition device provided by an embodiment of the present disclosure. The recognition device 1 includes an acquiring unit 11 and a recognition unit 12. The acquiring unit 11 is configured to acquire an image to be recognized. The recognition unit 12 is configured to recognize the image to be recognized based on a cross-modal face recognition network to obtain a recognition result of the image to be recognized, where the cross-modal face recognition network is trained based on face image data of different modalities.

Further, the recognition unit 12 includes a training subunit 121 configured to perform training based on a first modal network and a second modal network to obtain the cross-modal face recognition network.

Further, the training subunit 121 is further configured to train the first modal network based on a first image set and a second image set, where the objects in the first image set belong to a first category and the objects in the second image set belong to a second category.

Further, the training subunit 121 is further configured to: train the first modal network based on the first image set and the second image set to obtain the second modal network; select a first number of images from the first image set and a second number of images from the second image set according to a preset condition, and obtain a third image set from the first number of images and the second number of images; and train the second modal network based on the third image set to obtain the cross-modal face recognition network.

Further, the preset condition includes any one of: the first number being the same as the second number; the ratio of the first number to the second number being equal to the ratio of the number of images in the first image set to the number of images in the second image set; and the ratio of the first number to the second number being equal to the ratio of the number of people in the first image set to the number of people in the second image set.

Further, the first modal network includes a first feature extraction branch, a second feature extraction branch, and a third feature extraction branch. The training subunit 121 is further configured to: input the first image set to the first feature extraction branch, the second image set to the second feature extraction branch, and a fourth image set to the third feature extraction branch to train the first modal network, where the images in the fourth image set are images collected in the same scene or by the same collection method; and take the trained first feature extraction branch, the trained second feature extraction branch, or the trained third feature extraction branch as the second modal network.

Further, the training subunit 121 is further configured to: input the first image set, the second image set, and the fourth image set to the first, second, and third feature extraction branches to obtain a first recognition result, a second recognition result, and a third recognition result, respectively; obtain a first loss function of the first feature extraction branch, a second loss function of the second feature extraction branch, and a third loss function of the third feature extraction branch; and adjust the parameters of the first modal network according to the first image set, the first recognition result, and the first loss function, the second image set, the second recognition result, and the second loss function, and the fourth image set, the third recognition result, and the third loss function, to obtain an adjusted first modal network, where the parameters of the first modal network include first, second, and third feature extraction branch parameters, and the branch parameters of the adjusted first modal network are the same.

Further, the images in the first image set include first annotation information, the images in the second image set include second annotation information, and the images in the fourth image set include third annotation information. The training subunit 121 is further configured to: obtain a first gradient according to the first annotation information, the first recognition result, the first loss function, and the initial parameters of the first feature extraction branch; obtain a second gradient according to the second annotation information, the second recognition result, the second loss function, and the initial parameters of the second feature extraction branch; obtain a third gradient according to the third annotation information, the third recognition result, the third loss function, and the initial parameters of the third feature extraction branch; and take the average of the first, second, and third gradients as the back-propagation gradient of the first modal network and adjust the parameters of the first modal network through the back-propagation gradient, so that the parameters of the first, second, and third feature extraction branches are the same.

Further, the training subunit 121 is further configured to: select f images from each of the first image set and the second image set such that the number of people in the f images equals a threshold, to obtain the third image set; or select m images and n images from the first image set and the second image set respectively, such that the ratio of m to n equals the ratio of the number of images in the first image set to the number of images in the second image set and the numbers of people in the m images and in the n images both equal the threshold, to obtain the third image set; or select s images and t images from the first image set and the second image set respectively, such that the ratio of s to t equals the ratio of the number of people in the first image set to the number of people in the second image set and the numbers of people in the s images and in the t images both equal the threshold, to obtain the third image set.
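The averaged-gradient update described for the training subunit can be sketched as follows (illustrative numpy code; the function name and learning rate are assumptions, and real branch parameters would be full network weights rather than single arrays):

```python
import numpy as np

def averaged_backprop_step(params, grads, lr=0.1):
    """Apply one update using the mean of the per-branch gradients.

    Sketch of the scheme in which the first, second, and third gradients
    are averaged to form the back-propagation gradient, so that branches
    starting from identical initial parameters receive the same update
    and remain identical. `params` and `grads` are lists of equal-shape
    arrays, one per branch.
    """
    mean_grad = np.mean(grads, axis=0)
    # Every branch gets the same averaged gradient, keeping the
    # parameters of all feature extraction branches the same.
    return [p - lr * mean_grad for p in params]
```

Because each branch sees the same averaged gradient at every step, parameter equality is preserved throughout training, which is what allows any single trained branch to serve as the second modal network.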
Further, the training subunit 121 is further configured to: sequentially perform feature extraction, a linear transformation, and a nonlinear transformation on the images in the third image set to obtain a fourth recognition result; and adjust the parameters of the second modal network according to the images in the third image set, the fourth recognition result, and a fourth loss function of the second modal network, to obtain the cross-modal face recognition network. Further, the first category and the second category correspond to different races, respectively. In some embodiments, the functions or modules of the device provided in the embodiments of the present disclosure can be used to execute the methods described in the method embodiments above; for their specific implementation, refer to the descriptions of those embodiments, which are not repeated here for brevity.
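The preset conditions for drawing images from the two sets can be sketched as a small helper (names and the rounding choice are illustrative; the embodiment additionally constrains the number of people in each draw to a threshold, which is omitted here for brevity):

```python
def split_counts(total, n_images_1, n_images_2, mode):
    """How many images to draw from each set when forming the third image set.

    Mirrors two of the preset conditions described for the training subunit:
      'equal'       - the first number equals the second number
      'image_ratio' - first:second equals the ratio of the image counts
    (The third option, a ratio of person counts, works the same way with
    person counts in place of image counts.) All names are illustrative.
    """
    if mode == "equal":
        half = total // 2
        return half, total - half
    if mode == "image_ratio":
        n1 = round(total * n_images_1 / (n_images_1 + n_images_2))
        return n1, total - n1
    raise ValueError(mode)
```

For example, with 300 images in the first set and 100 in the second, a proportional draw of 12 images takes 9 from the first set and 3 from the second.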

FIG. 7 is a schematic diagram of the hardware structure of a face recognition device provided by an embodiment of the present disclosure. The recognition device 2 includes a processor 21, and may further include an input device 22, an output device 23, and a memory 24, which are interconnected through a bus.

The memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and is used for related instructions and data. The input device is used to input data and/or signals, and the output device is used to output data and/or signals; they may be independent devices or a single integrated device. The processor may include one or more processors, for example one or more central processing units (CPUs); when the processor is a CPU, it may be a single-core or multi-core CPU. The memory is used to store the program code and data of the network device, and the processor is used to call that code and data to execute the steps of the method embodiments above. For details, refer to the descriptions in the method embodiments, which are not repeated here.

It is understandable that FIG. 7 shows only a simplified design of a face recognition device. In practical applications, the face recognition device may also contain other necessary elements, including but not limited to any number of input/output devices, processors, controllers, and memories, and all face recognition devices that can implement the embodiments of the present disclosure fall within the protection scope of the present disclosure.

A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present disclosure.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. They can also clearly understand that the descriptions of the embodiments of the present disclosure each have their own focus; the same or similar parts may not be repeated in different embodiments, so for parts not described or not described in detail in one embodiment, refer to the descriptions of other embodiments.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of the units is only a logical functional division, and other divisions are possible in actual implementation; multiple units or elements may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions; when the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted through such a medium; they may be transmitted from one website, computer, server, or data center to another by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)).

A person of ordinary skill in the art can understand that all or part of the processes in the method embodiments above may be completed by a computer program instructing the related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments above. The aforementioned storage medium includes various media that can store program code, such as read-only memory (ROM), random access memory (RAM), magnetic disks, or optical discs.

To make the purpose, technical solutions, and advantages of the embodiments of the present disclosure clearer, the specific technical solutions of the invention are described in further detail in conjunction with the accompanying drawings of the embodiments. The embodiments are intended to illustrate the present disclosure, not to limit its scope.

FIG. 1, the representative drawing, is a flowchart and is briefly described without reference numerals.

Claims (11)

1. A face recognition method, comprising: acquiring an image to be recognized; and recognizing the image to be recognized based on a cross-modal face recognition network to obtain a recognition result of the image to be recognized, wherein the cross-modal face recognition network is trained based on face image data of different modalities; wherein the process of training the cross-modal face recognition network based on face image data of different modalities comprises: performing training based on a first modal network and a second modal network to obtain the cross-modal face recognition network; and, before performing training based on the first modal network and the second modal network to obtain the cross-modal face recognition network, further comprising: training the first modal network based on a first image set and a second image set, wherein objects in the first image set belong to a first category and objects in the second image set belong to a second category.

2. The method according to claim 1, wherein training the first modal network based on the first image set and the second image set comprises: training the first modal network based on the first image set and the second image set to obtain the second modal network; selecting a first number of images from the first image set and a second number of images from the second image set according to a preset condition, and obtaining a third image set from the first number of images and the second number of images; and training the second modal network based on the third image set to obtain the cross-modal face recognition network.

3. The method according to claim 2, wherein the preset condition comprises any one of: the first number being the same as the second number; the ratio of the first number to the second number being equal to the ratio of the number of images in the first image set to the number of images in the second image set; and the ratio of the first number to the second number being equal to the ratio of the number of people in the first image set to the number of people in the second image set.

4. The method according to claim 2, wherein the first modal network comprises a first feature extraction branch, a second feature extraction branch, and a third feature extraction branch; and training the first modal network based on the first image set and the second image set to obtain the second modal network comprises: inputting the first image set to the first feature extraction branch, the second image set to the second feature extraction branch, and a fourth image set to the third feature extraction branch to train the first modal network, wherein the images in the fourth image set are images collected in the same scene or by the same collection method; and taking the trained first feature extraction branch, the trained second feature extraction branch, or the trained third feature extraction branch as the second modal network.

5. The method according to claim 4, wherein inputting the first image set to the first feature extraction branch, the second image set to the second feature extraction branch, and the fourth image set to the third feature extraction branch to train the first modal network comprises: inputting the first image set, the second image set, and the fourth image set to the first feature extraction branch, the second feature extraction branch, and the third feature extraction branch to obtain a first recognition result, a second recognition result, and a third recognition result, respectively; obtaining a first loss function of the first feature extraction branch, a second loss function of the second feature extraction branch, and a third loss function of the third feature extraction branch; and adjusting parameters of the first modal network according to the first image set, the first recognition result, and the first loss function, the second image set, the second recognition result, and the second loss function, and the fourth image set, the third recognition result, and the third loss function, to obtain an adjusted first modal network, wherein the parameters of the first modal network comprise first, second, and third feature extraction branch parameters, and the branch parameters of the adjusted first modal network are the same.

6. The method according to claim 5, wherein the images in the first image set comprise first annotation information, the images in the second image set comprise second annotation information, and the images in the fourth image set comprise third annotation information; and adjusting the parameters of the first modal network to obtain the adjusted first modal network comprises: obtaining a first gradient according to the first annotation information, the first recognition result, the first loss function, and initial parameters of the first feature extraction branch; obtaining a second gradient according to the second annotation information, the second recognition result, the second loss function, and initial parameters of the second feature extraction branch; obtaining a third gradient according to the third annotation information, the third recognition result, the third loss function, and initial parameters of the third feature extraction branch; and taking the average of the first gradient, the second gradient, and the third gradient as the back-propagation gradient of the first modal network and adjusting the parameters of the first modal network through the back-propagation gradient, so that the parameters of the first, second, and third feature extraction branches are the same.

7. The method according to claim 2 or 3, wherein selecting the first number of images from the first image set and the second number of images from the second image set according to the preset condition to obtain the third image set comprises: selecting f images from each of the first image set and the second image set such that the number of people in the f images equals a threshold, to obtain the third image set; or selecting m images and n images from the first image set and the second image set respectively, such that the ratio of m to n equals the ratio of the number of images in the first image set to the number of images in the second image set and the numbers of people in the m images and in the n images both equal the threshold, to obtain the third image set; or selecting s images and t images from the first image set and the second image set respectively, such that the ratio of s to t equals the ratio of the number of people in the first image set to the number of people in the second image set and the numbers of people in the s images and in the t images both equal the threshold, to obtain the third image set.

8. The method according to claim 2, wherein training the second modal network based on the third image set to obtain the cross-modal face recognition network comprises: sequentially performing feature extraction, a linear transformation, and a nonlinear transformation on the images in the third image set to obtain a fourth recognition result; and adjusting parameters of the second modal network according to the images in the third image set, the fourth recognition result, and a fourth loss function of the second modal network, to obtain the cross-modal face recognition network.

9. The method according to any one of claims 1 to 3, 5, 6, and 8, wherein the first category and the second category correspond to different races, respectively.

10. An electronic device, comprising a memory and a processor, wherein computer-executable instructions are stored on the memory, and the processor implements the method according to any one of claims 1 to 9 when running the computer-executable instructions on the memory.

11. A computer-readable storage medium on which a computer program is stored, wherein the method according to any one of claims 1 to 9 is implemented when the computer program is executed by a processor.
TW108145586A 2019-03-22 2019-12-12 Method for face recognition and device thereof TWI727548B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910220321.5 2019-03-22
CN201910220321.5A CN109934198B (en) 2019-03-22 2019-03-22 Face recognition method and device

Publications (2)

Publication Number Publication Date
TW202036367A TW202036367A (en) 2020-10-01
TWI727548B true TWI727548B (en) 2021-05-11

Family

ID=66988039

Family Applications (1)

Application Number Title Priority Date Filing Date
TW108145586A TWI727548B (en) 2019-03-22 2019-12-12 Method for face recognition and device thereof

Country Status (6)

Country Link
US (1) US20210334604A1 (en)
JP (1) JP7038867B2 (en)
CN (1) CN109934198B (en)
SG (1) SG11202107826QA (en)
TW (1) TWI727548B (en)
WO (1) WO2020192112A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934198B (en) * 2019-03-22 2021-05-14 北京市商汤科技开发有限公司 Face recognition method and device
KR20190098108A (en) * 2019-08-02 2019-08-21 엘지전자 주식회사 Control system to control intelligent robot device
CN110633698A (en) * 2019-09-30 2019-12-31 上海依图网络科技有限公司 Infrared picture identification method, equipment and medium based on loop generation countermeasure network
CN110781856B (en) * 2019-11-04 2023-12-19 浙江大华技术股份有限公司 Heterogeneous face recognition model training method, face recognition method and related device
KR20210067442A (en) * 2019-11-29 2021-06-08 엘지전자 주식회사 Automatic labeling apparatus and method for object recognition
CN111539287B (en) * 2020-04-16 2023-04-07 北京百度网讯科技有限公司 Method and device for training face image generation model
CN112052792B (en) * 2020-09-04 2022-04-26 恒睿(重庆)人工智能技术研究院有限公司 Cross-model face recognition method, device, equipment and medium
CN112183480B (en) * 2020-10-29 2024-06-04 奥比中光科技集团股份有限公司 Face recognition method, device, terminal equipment and storage medium
CN112614199A (en) * 2020-11-23 2021-04-06 上海眼控科技股份有限公司 Semantic segmentation image conversion method and device, computer equipment and storage medium
CN115761833B (en) * 2022-10-10 2023-10-24 荣耀终端有限公司 Face recognition method, electronic equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138973A (en) * 2015-08-11 2015-12-09 北京天诚盛业科技有限公司 Face authentication method and device
US20160379043A1 (en) * 2013-11-25 2016-12-29 Ehsan FAZL ERSI System and method for face recognition
TW201832134A (en) * 2017-06-02 2018-09-01 大陸商騰訊科技(深圳)有限公司 Method and device for training human face recognition, electronic device, computer readable storage medium, and computer program product

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1967561A (en) * 2005-11-14 2007-05-23 株式会社日立制作所 Method for constructing a gender recognition classifier, and gender recognition method and device
US9660995B2 (en) * 2013-02-20 2017-05-23 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for combating device theft with user notarization
CN104143079B (en) * 2013-05-10 2016-08-17 腾讯科技(深圳)有限公司 Method and system for face attribute recognition
JP6476589B2 (en) 2014-05-15 2019-03-06 カシオ計算機株式会社 AGE ESTIMATION DEVICE, IMAGING DEVICE, AGE ESTIMATION METHOD, AND PROGRAM
JP2017102671A (en) 2015-12-01 2017-06-08 キヤノン株式会社 Identification device, adjusting device, information processing method, and program
CN105426860B (en) * 2015-12-01 2019-09-27 北京眼神智能科技有限公司 Face recognition method and apparatus
CN105608450B (en) * 2016-03-01 2018-11-27 天津中科智能识别产业技术研究院有限公司 Heterogeneous face recognition method based on deep convolutional neural networks
WO2017174982A1 (en) 2016-04-06 2017-10-12 Queen Mary University Of London Method of matching a sketch image to a face image
CN106056083B (en) * 2016-05-31 2019-08-13 腾讯科技(深圳)有限公司 Information processing method and terminal
US9971958B2 (en) 2016-06-01 2018-05-15 Mitsubishi Electric Research Laboratories, Inc. Method and system for generating multimodal digital images
CN106022317A (en) * 2016-06-27 2016-10-12 北京小米移动软件有限公司 Face identification method and apparatus
CN106650573B (en) * 2016-09-13 2019-07-16 华南理工大学 Cross-age face verification method and system
CN106909905B (en) * 2017-03-02 2020-02-14 中科视拓(北京)科技有限公司 Multi-mode face recognition method based on deep learning
US10565433B2 (en) 2017-03-30 2020-02-18 George Mason University Age invariant face recognition using convolutional neural networks and set distances
CN107679451A (en) * 2017-08-25 2018-02-09 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer storage medium for establishing a face recognition model
CN108596138A (en) * 2018-05-03 2018-09-28 南京大学 Face recognition method based on a transfer hierarchical network
CN109934198B (en) * 2019-03-22 2021-05-14 北京市商汤科技开发有限公司 Face recognition method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160379043A1 (en) * 2013-11-25 2016-12-29 Ehsan FAZL ERSI System and method for face recognition
CN105138973A (en) * 2015-08-11 2015-12-09 北京天诚盛业科技有限公司 Face authentication method and device
TW201832134A (en) * 2017-06-02 2018-09-01 大陸商騰訊科技(深圳)有限公司 Method and device for training human face recognition, electronic device, computer readable storage medium, and computer program product

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Zihao, Research on Face Image Recognition Based on Convolutional Neural Networks, China Masters' Theses Full-text Database, Information Science and Technology series, 2018/05, http://gb.oversea.cnki.net/KCMS/detail/detail.aspx?filename=1018119250.nh&dbcode=CMFD&dbname=CMFDREF *

Also Published As

Publication number Publication date
US20210334604A1 (en) 2021-10-28
SG11202107826QA (en) 2021-08-30
JP2021530045A (en) 2021-11-04
JP7038867B2 (en) 2022-03-18
TW202036367A (en) 2020-10-01
CN109934198A (en) 2019-06-25
WO2020192112A1 (en) 2020-10-01
CN109934198B (en) 2021-05-14

Similar Documents

Publication Publication Date Title
TWI727548B (en) Method for face recognition and device thereof
TWI753327B (en) Image processing method, processor, electronic device and computer-readable storage medium
CN112052789B (en) Face recognition method and device, electronic equipment and storage medium
CN108491805B (en) Identity authentication method and device
WO2021012526A1 (en) Face recognition model training method, face recognition method and apparatus, device, and storage medium
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
WO2020024484A1 (en) Method and device for outputting data
US11429809B2 (en) Image processing method, image processing device, and storage medium
CN106056083B (en) A kind of information processing method and terminal
WO2023173646A1 (en) Expression recognition method and apparatus
CN111108508B (en) Face emotion recognition method, intelligent device and computer readable storage medium
CN113642639B (en) Living body detection method, living body detection device, living body detection equipment and storage medium
WO2023124278A1 (en) Image processing model training method and apparatus, and image classification method and apparatus
CN112614110A (en) Method and device for evaluating image quality and terminal equipment
CN110839242B (en) Abnormal number identification method and device
WO2022028425A1 (en) Object recognition method and apparatus, electronic device and storage medium
WO2021027555A1 (en) Face retrieval method and apparatus
CN116758590B (en) Palm feature processing method, device, equipment and medium for identity authentication
WO2018137226A1 (en) Fingerprint extraction method and device
CN111814811A (en) Image information extraction method, training method and device, medium and electronic equipment
CN111507289A (en) Video matching method, computer device and storage medium
CN113743160A (en) Method, apparatus and storage medium for liveness detection
TWM586599U (en) System for analyzing skin texture and skin lesion using artificial intelligence cloud based platform
WO2023024424A1 (en) Segmentation network training method, using method, apparatus, device, and storage medium
CN108229263B (en) Target object identification method and device and robot