TW202143100A - Image processing method, electronic device and computer-readable storage media - Google Patents


Info

Publication number
TW202143100A
Authority
TW
Taiwan
Prior art keywords
target object
target
image
depth
reference node
Prior art date
Application number
TW110115664A
Other languages
Chinese (zh)
Other versions
TWI777538B (en)
Inventor
王燦
李杰鋒
劉文韜
錢晨
Original Assignee
大陸商北京市商湯科技開發有限公司
Priority date
Filing date
Publication date
Application filed by 大陸商北京市商湯科技開發有限公司
Publication of TW202143100A
Application granted granted Critical
Publication of TWI777538B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The present disclosure provides an image processing method, an electronic device, and a computer-readable storage medium. The method includes: identifying a target area where a target object in a first image is located; based on the target area corresponding to the target object, determining first two-dimensional position information of multiple key points representing the pose of the target object in the first image, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in the camera coordinate system; and, based on the first two-dimensional position information, the relative depths, and the absolute depth, determining three-dimensional position information of the multiple key points of the target object in the camera coordinate system.

Description

Image processing method, electronic device and computer-readable storage medium

The present disclosure relates to the field of image processing technology, and in particular to an image processing method, an electronic device, and a computer-readable storage medium.

Three-dimensional human pose detection is widely used in security, gaming, entertainment, and other fields. Current three-dimensional human pose detection methods typically identify the first two-dimensional position information of human-body key points in an image, and then convert that two-dimensional position information into three-dimensional position information according to predetermined positional relationships between the key points.

The human poses obtained by current three-dimensional human pose detection methods contain large errors.

Embodiments of the present disclosure provide at least an image processing method, an electronic device, and a computer-readable storage medium.

In a first aspect, an embodiment of the present disclosure provides an image processing method, including: identifying a target area where a target object in a first image is located; based on the target area where the target object is located, determining first two-dimensional position information of multiple key points of the target object in the first image, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in the camera coordinate system; and, based on the first two-dimensional position information and relative depth corresponding to each of the multiple key points, together with the absolute depth corresponding to the reference node, determining three-dimensional position information of the multiple key points of the target object in the camera coordinate system.
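The combination in the final step of this first aspect amounts to pinhole back-projection. The sketch below is an assumption about that step (the function name, argument layout, and use of a standard 3×3 intrinsic matrix `K` are illustrative, not taken from the patent): each key point's absolute depth is the reference node's absolute depth plus that key point's relative depth, and its 2D pixel position is then lifted into camera coordinates with the intrinsics.

```python
import numpy as np

def keypoints_to_camera(uv, rel_depth, root_abs_depth, K):
    """Back-project 2D key points to 3D camera coordinates.

    uv             : (N, 2) pixel coordinates of the key points
    rel_depth      : (N,) depth of each key point relative to the reference node
    root_abs_depth : scalar absolute depth of the reference node
    K              : (3, 3) camera intrinsic matrix
    """
    z = root_abs_depth + rel_depth          # absolute depth per key point
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) * z / fx            # pinhole back-projection
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)      # (N, 3) camera-frame positions
```

For example, a key point imaged at the principal point with zero relative depth maps to (0, 0, root_abs_depth).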

In this way, the embodiment of the present disclosure can obtain more accurately the three-dimensional position information of the multiple key points of the target object in the camera coordinate system. This three-dimensional position information characterizes the three-dimensional pose of the target object, so the higher its accuracy, the higher the accuracy of the resulting pose.

In a possible implementation, the method further includes: obtaining the pose of the target object based on the three-dimensional position information of its multiple key points in the camera coordinate system.

In this way, since the three-dimensional position information obtained by the embodiment has higher accuracy, the pose of the target object determined from that information is also more accurate.

In a possible implementation, identifying the target area where the target object in the first image is located includes: performing feature extraction on the first image to obtain a feature map of the first image; determining multiple target bounding boxes from multiple pre-generated candidate bounding boxes based on the feature map; and determining the target area where the target object is located based on the multiple target bounding boxes.

In this way, determining the target area in two steps allows the position of each target object in the first image to be detected accurately, improving both the completeness of the human-body information and the detection accuracy in the subsequent key-point detection stage.

In a possible implementation, determining the target area where the target object is located based on the multiple target bounding boxes includes: determining a feature sub-map for each target bounding box based on the target bounding boxes and the feature map; and performing bounding-box regression on the feature sub-maps corresponding to the target bounding boxes to obtain the target area where the target object is located.

In this way, performing bounding-box regression on the feature sub-maps corresponding to the multiple target bounding boxes allows the position of each target object to be detected accurately from the first image.
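The bounding-box regression step above can be illustrated with the standard R-CNN delta parameterization; this is a generic sketch of how a regression head's output refines a box, not the patent's specific regression network:

```python
import math

def apply_box_deltas(box, deltas):
    """Refine one bounding box (x1, y1, x2, y2) with regression deltas
    (dx, dy, dw, dh) in the standard R-CNN parameterization: dx/dy shift
    the box center by fractions of its size, dw/dh scale width and height
    through an exponential."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h       # shift the center
    w, h = w * math.exp(dw), h * math.exp(dh)  # rescale the extent
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```

With all-zero deltas the box is returned unchanged, which is the regression head's fixed point for a perfectly placed candidate box.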

In a possible implementation, determining the absolute depth of the reference node of the target object in the camera coordinate system based on the target area where the target object is located includes: determining a target feature map of the target object based on the target area and the first image; performing depth recognition on the target feature map to obtain a normalized absolute depth of the reference node of the target object; and obtaining the absolute depth of the reference node in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.

In this way, it is possible to avoid, as far as possible, the situation in which predicting the absolute depth of the reference node directly from the target feature map yields different absolute depths for first images captured by different cameras from the same viewing angle and position, owing to differences in the cameras' intrinsic parameters.

In a possible implementation, performing depth recognition on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node includes: determining an initial depth image based on the first image, where the pixel value of any first pixel in the initial depth image is the initial depth value, in the camera coordinate system, of the second pixel in the first image whose position corresponds to that first pixel; determining second two-dimensional position information of the reference node of the target object in the first image based on the target feature map; determining an initial depth value of the reference node based on the second two-dimensional position information and the initial depth image; and determining the normalized absolute depth of the reference node based on the initial depth value and the target feature map.
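The lookup of the reference node's initial depth value from the initial depth image, given its second 2D position, might look like the following sketch. Nearest-pixel sampling and the function name are assumptions; an implementation could equally use bilinear interpolation:

```python
import numpy as np

def reference_node_initial_depth(initial_depth_image, ref_node_uv):
    """Read the initial depth of the reference node from the per-pixel
    initial depth image, at the node's predicted 2D position (u, v) in
    the first image. u is the horizontal (column) coordinate, v the
    vertical (row) coordinate."""
    u, v = ref_node_uv
    col = int(round(u))   # nearest column index
    row = int(round(v))   # nearest row index
    return float(initial_depth_image[row, col])
```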

In this way, the normalized absolute depth of the reference node obtained through this process is more accurate.

In a possible implementation, determining the normalized absolute depth of the reference node based on its initial depth value and the target feature map includes: performing at least one stage of first convolution on the target feature map corresponding to the target object to obtain a feature vector of the target object; concatenating the feature vector with the initial depth value to obtain a concatenated vector, and performing at least one stage of second convolution on the concatenated vector to obtain a correction value for the initial depth value; and obtaining the normalized absolute depth based on the correction value and the initial depth value.
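A minimal numeric sketch of this refinement step, with each "convolution stage" stood in for by a single linear map (an illustrative simplification; `w_conv` and `w_fuse` are hypothetical learned weights, not parameters named in the patent):

```python
import numpy as np

def refine_depth(target_feature_map, z_init, w_conv, w_fuse):
    """Sketch of the refinement: the target feature map is reduced to a
    feature vector, concatenated with the initial depth value, and a
    correction is regressed from the concatenated vector; the normalized
    absolute depth is the initial depth plus that correction."""
    feat = target_feature_map.reshape(-1) @ w_conv   # stand-in for the first convolution stages
    fused = np.concatenate([feat, [z_init]])         # concatenation with the initial depth
    correction = float(fused @ w_fuse)               # stand-in for the second convolution stages
    return z_init + correction                       # normalized absolute depth
```

With zero weights the correction vanishes and the initial depth is returned unchanged, which makes the residual structure of the step explicit.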

In a possible implementation, the parameter matrix includes the focal length of the camera, and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera includes: obtaining the absolute depth of the reference node in the camera coordinate system based on the normalized absolute depth, the focal length, the area of the target region, and the area of the target bounding box.
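This excerpt names the inputs to the de-normalization but not the exact formula. A plausible form, consistent with distance-aware root-depth estimation approaches such as RootNet, scales the normalized depth by the focal length and the square root of an area ratio; the specific expression below is an assumption for illustration only:

```python
import math

def denormalize_root_depth(z_norm, focal, area_region, area_bbox):
    """Hypothetical de-normalization of the reference node's depth.

    Scaling by the focal length makes the result depend explicitly on the
    camera intrinsics; the square-root area ratio relates the regressed
    target region to the original target bounding box. The exact formula
    used in the patent is not given in this excerpt.
    """
    return z_norm * focal * math.sqrt(area_bbox / area_region)
```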

In a possible implementation, the image processing method is applied in a pre-trained neural network comprising three branch networks: a target detection network, a key-point detection network, and a depth prediction network. The target detection network is used to obtain the target area where the target object is located; the key-point detection network is used to obtain the first two-dimensional position information of the multiple key points of the target object in the first image and the relative depth of each key point with respect to the reference node of the target object; and the depth prediction network is used to obtain the absolute depth of the reference node in the camera coordinate system.

In this way, the three branch networks form an end-to-end pose detection framework for target objects. Processing the first image with this framework to obtain the three-dimensional position information of the multiple key points of each target object in the camera coordinate system gives faster processing and higher recognition accuracy.
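The end-to-end flow of the three-branch framework can be sketched as follows. The three `*_net` arguments stand in for the trained branch networks and their call signatures are hypothetical; the final combination step is the pinhole back-projection described in the first aspect:

```python
def pose_pipeline(first_image, detect_net, keypoint_net, depth_net, intrinsics):
    """Orchestrate the three branches for every detected target object:
      1. target detection   -> target regions
      2. key-point branch   -> 2D key points + per-key-point relative depth
      3. depth branch       -> absolute depth of the reference node
    and combine the outputs into 3D key points in camera coordinates.
    intrinsics is (fx, fy, cx, cy)."""
    fx, fy, cx, cy = intrinsics
    results = []
    for region in detect_net(first_image):
        uv, rel_z = keypoint_net(first_image, region)
        root_z = depth_net(first_image, region)
        kp3d = [((u - cx) * (root_z + dz) / fx,
                 (v - cy) * (root_z + dz) / fy,
                 root_z + dz)
                for (u, v), dz in zip(uv, rel_z)]
        results.append(kp3d)
    return results
```

Stub callables are enough to exercise the data flow, which is what makes the framework end-to-end: one forward pass per image yields camera-frame poses for all detected objects.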

In a second aspect, an embodiment of the present disclosure further provides an image processing apparatus, including: a recognition module configured to identify the target area where a target object in a first image is located; a first detection module configured to determine, based on the target area, the first two-dimensional position information of multiple key points of the target object in the first image, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node in the camera coordinate system; and a second detection module configured to determine, based on the first two-dimensional position information and relative depth corresponding to each of the multiple key points and the absolute depth corresponding to the reference node, the three-dimensional position information of the multiple key points in the camera coordinate system.

In a possible implementation, the second detection module is further configured to obtain the pose of the target object based on the three-dimensional position information of its multiple key points in the camera coordinate system.

In a possible implementation, when identifying the target area where the target object in the first image is located, the recognition module is configured to: perform feature extraction on the first image to obtain a feature map of the first image; determine multiple target bounding boxes from multiple pre-generated candidate bounding boxes based on the feature map; and determine the target area where the target object is located based on the multiple target bounding boxes.

In a possible implementation, when determining the target area where the target object is located based on the multiple target bounding boxes, the recognition module is configured to: determine a feature sub-map for each target bounding box based on the target bounding boxes and the feature map; and perform bounding-box regression on the feature sub-maps corresponding to the target bounding boxes to obtain the target area where the target object is located.

In a possible implementation, when determining the absolute depth of the reference node of the target object in the camera coordinate system based on the target area where the target object is located, the first detection module is configured to: determine a target feature map of the target object based on the target area and the first image; perform depth recognition on the target feature map to obtain a normalized absolute depth of the reference node; and obtain the absolute depth of the reference node in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.

In a possible implementation, when performing depth recognition on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node, the first detection module is configured to: determine an initial depth image based on the first image, where the pixel value of any first pixel in the initial depth image is the initial depth value, in the camera coordinate system, of the second pixel in the first image whose position corresponds to that first pixel; determine second two-dimensional position information of the reference node in the first image based on the target feature map; determine an initial depth value of the reference node based on the second two-dimensional position information and the initial depth image; and determine the normalized absolute depth of the reference node based on the initial depth value and the target feature map.

In a possible implementation, when determining the normalized absolute depth of the reference node based on its initial depth value and the target feature map, the first detection module is configured to: perform at least one stage of first convolution on the target feature map to obtain a feature vector of the target object; concatenate the feature vector with the initial depth value to obtain a concatenated vector, and perform at least one stage of second convolution on the concatenated vector to obtain a correction value for the initial depth value; and obtain the normalized absolute depth based on the correction value and the initial depth value.

In a possible implementation, the parameter matrix includes the focal length of the camera, and when obtaining the absolute depth of the reference node in the camera coordinate system based on the normalized absolute depth and the parameter matrix, the first detection module is configured to: obtain the absolute depth of the reference node in the camera coordinate system based on the normalized absolute depth, the focal length, the area of the target region, and the area of the target bounding box.

In a possible implementation, the image processing apparatus implements image processing with a pre-trained neural network comprising three branch networks: a target detection network, a key-point detection network, and a depth prediction network. The target detection network is used to obtain the target area where the target object is located; the key-point detection network is used to obtain the first two-dimensional position information of the multiple key points of the target object in the first image and the relative depth of each key point with respect to the reference node; and the depth prediction network is used to obtain the absolute depth of the reference node in the camera coordinate system.

In a third aspect, an embodiment of the present disclosure further provides a computer device, including a processor and a memory connected to each other, the memory storing machine-readable instructions executable by the processor. When the computer device runs, the machine-readable instructions are executed by the processor to implement the steps of the image processing method of the first aspect, or of any possible implementation of the first aspect.

In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored. When run by a processor, the computer program executes the steps of the image processing method of the first aspect, or of any possible implementation of the first aspect.

To make the above objects, features, and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below in conjunction with the accompanying drawings.

To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described below clearly and completely in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the protection scope of the present disclosure.

Three-dimensional human pose detection methods usually use a neural network to identify the first two-dimensional position information of human-body key points in the image to be recognized, and then convert that information into three-dimensional position information according to the mutual positional relationships between the key points (such as the connection relationships between different key points and the distance ranges between adjacent key points). However, human body shapes are complex and variable, and the positional relationships between the key points of different human bodies also differ, so the three-dimensional human poses obtained by this method contain large errors.

In addition, the accuracy of current three-dimensional human pose detection methods rests on accurate estimation of the human-body key points. Because of occlusion by clothing, limbs, and the like, in many cases the key points cannot be identified accurately from the image, which further enlarges the three-dimensional pose error of the above methods.

The defects of the above solutions are all results obtained by the inventors after practice and careful study. Therefore, the process of discovering the above problems, and the solutions proposed below by the present disclosure for them, should all be regarded as the inventors' contributions to the present invention made in the course of the invention.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.

Based on the above research, the present disclosure provides an image processing method and apparatus that identify the target area where the target object in a first image is located and, based on that area, determine the first two-dimensional position information in the first image of multiple key points characterizing the target object's pose, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node in the camera coordinate system. From the first two-dimensional position information, the relative depths, and the absolute depth, the three-dimensional position information of the multiple key points in the camera coordinate system is then obtained more accurately.

To facilitate understanding of this embodiment, an image processing method disclosed in an embodiment of the present disclosure is first introduced in detail. The execution subject of the image processing method provided by the embodiments is generally a computer device with a certain computing capability, for example a terminal device, a server, or another processing device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, and so on. In some possible implementations, the image processing method may be implemented by a processor calling computer-readable instructions stored in a memory.

The image processing method provided by the embodiments of the present disclosure is described below.

Referring to FIG. 1, a flowchart of the image processing method provided by an embodiment of the present disclosure, the method includes steps S101~S103:
S101: identify the target area where the target object in the first image is located;
S102: based on the target area where the target object is located, determine the first two-dimensional position information of multiple key points of the target object in the first image, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the reference node of the target object in the camera coordinate system;
S103: based on the first two-dimensional position information and relative depth corresponding to each of the multiple key points, and the absolute depth corresponding to the reference node, determine the three-dimensional position information of the multiple key points of the target object in the camera coordinate system.

S101 to S103 above are described in detail below.

I: In S101 above, the first image includes at least one target object. The target object is an object whose pose is to be determined, for example a person, an animal, a robot, or a vehicle.

In a possible implementation, when the first image includes more than one target object, the categories of the different target objects may be the same or different. For example, the multiple target objects may all be people, or may all be vehicles. As another example, the target objects in the first image may include a person and an animal, or a person and a vehicle; the target object category is determined according to the needs of the actual application scenario.

The target area where the target object is located refers to the area of the first image that includes the target object.

Exemplarily, referring to FIG. 2, an embodiment of the present invention provides a specific method for identifying the target area where the target object in the first image is located, including: S201: Perform feature extraction on the first image to obtain a feature map of the first image. Here, for example, a neural network may be used to perform feature extraction on the first image to obtain the feature map of the first image. S202: Based on the feature map, determine multiple target bounding boxes from multiple pre-generated candidate bounding boxes, and determine the target area corresponding to the target object based on the target bounding boxes.

In a specific implementation, a bounding box prediction algorithm may be used to obtain the multiple target bounding boxes. Bounding box prediction algorithms include, for example, RoIAlign and ROI-Pooling. Taking RoIAlign as an example, RoIAlign can traverse the multiple pre-generated candidate bounding boxes and determine, for each candidate bounding box, a region-of-interest (region of interest, ROI) value indicating that the sub-image corresponding to that box belongs to a target object in the first image; the higher the ROI value, the greater the probability that the sub-image corresponding to the candidate bounding box belongs to a certain target object. After the ROI value corresponding to each candidate bounding box is determined, the multiple target bounding boxes are selected from the candidate bounding boxes in descending order of their ROI values.
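The selection step above (score each pre-generated candidate box, then keep boxes in descending order of ROI value) can be sketched as follows; the function name and the toy boxes and values are illustrative, not taken from the patent:

```python
import numpy as np

def select_target_boxes(candidate_boxes, roi_values, num_targets):
    """Keep the candidate bounding boxes with the highest ROI values,
    in descending order of ROI value."""
    order = np.argsort(roi_values)[::-1]      # indices, highest ROI first
    keep = order[:num_targets]
    return candidate_boxes[keep], roi_values[keep]

# toy candidates in [x, y, w, h] form
boxes = np.array([[0, 0, 50, 80], [10, 5, 60, 90], [200, 40, 55, 85]])
rois = np.array([0.30, 0.95, 0.80])
top_boxes, top_rois = select_target_boxes(boxes, rois, num_targets=2)
```

In a real detector the ROI values would come from the network head; here they are hand-picked only to show the ordering.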

The target bounding box is, for example, a rectangle. The information of a target bounding box includes, for example, the coordinates of one of its vertices in the first image together with the height and width values of the box; alternatively, it includes the coordinates of one of its vertices in the feature map of the first image together with the height and width values of the box.

After the multiple target bounding boxes are obtained, the target areas corresponding to all the target objects in the first image are determined based on the multiple target bounding boxes.

Referring to FIG. 3, an embodiment of the present invention provides a specific example of determining the target area corresponding to the target object based on the target bounding boxes, including the following steps.

S301: Determine a feature sub-map of each target bounding box based on the multiple target bounding boxes and the feature map.

In a specific implementation, when the information of a target bounding box includes the coordinates of one of its vertices in the first image together with the height and width values of the box, the feature points in the feature map and the pixel points in the first image have a certain positional mapping relationship; according to the information of the target bounding box and the mapping relationship between the feature map and the first image, the feature sub-map corresponding to each target bounding box is determined from the feature map of the first image.

When the information of a target bounding box includes the coordinates of one of its vertices in the feature map of the first image together with the height and width values of the box, the feature sub-map corresponding to each target bounding box can be determined from the feature map of the first image directly based on the target bounding box.
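When the box is stored in image coordinates, the positional mapping between image pixels and feature-map points typically reduces to the network's downsampling stride; a minimal sketch, assuming a uniform stride (the stride value is an assumption, not specified by the patent):

```python
def box_image_to_feature(box, stride):
    """Map an [x, y, w, h] box from first-image coordinates to feature-map
    coordinates, assuming the feature map downsamples the image by `stride`."""
    x, y, w, h = box
    return (x / stride, y / stride, w / stride, h / stride)

# a 128x96 box at (64, 32) in the image, with a stride-16 backbone
feat_box = box_image_to_feature((64, 32, 128, 96), stride=16)
```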

S302: Perform bounding box regression processing on the feature sub-maps corresponding to the multiple target bounding boxes to obtain the target area where the target object is located.

Here, for example, a bounding box regression (Bounding-Box Regression) algorithm may be used to perform bounding box regression processing on the feature sub-map corresponding to each target bounding box, so as to obtain multiple bounding boxes that each include a complete target object.

By using the bounding box regression algorithm, the target object can be accurately delineated by its corresponding target area, so that the target object is separated from the image background, thereby reducing the influence of the image background on the subsequent image processing.

Each of the multiple bounding boxes corresponds to one target object, and the area determined based on the bounding box corresponding to a target object is the target area where that target object is located.

At this point, the number of target areas obtained is the same as the number of target objects in the first image, and each target object corresponds to one target area. If different target objects occlude each other, the target areas corresponding to the mutually occluding target objects overlap to a certain degree.

In another embodiment of the present invention, other target detection algorithms may also be used to identify the target area where the target object in the first image is located. For example, a semantic segmentation algorithm may be used to determine the semantic segmentation result of each pixel point in the first image; then, according to the semantic segmentation result, the positions in the first image of the pixel points belonging to different target objects are determined; finally, the minimum bounding box of the pixel points belonging to the same target object is computed, and the area corresponding to this minimum bounding box is determined as the target area where that target object is located.
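The alternative above (segment, then take the minimum enclosing box of each object's pixels) can be sketched as follows; the mask contents are a toy assumption:

```python
import numpy as np

def min_bounding_box(mask):
    """Smallest axis-aligned [x, y, w, h] box enclosing all pixels of one
    object in a binary segmentation mask of shape (H, W)."""
    ys, xs = np.nonzero(mask)
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
    return int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)

mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True            # pixels of one object: rows 2-4, cols 3-6
target_area = min_bounding_box(mask)
```

In practice one such mask would be produced per object by the semantic segmentation result, giving one target area per target object.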

II: In S102 above, the image coordinate system refers to the two-dimensional coordinate system established along the length and width directions of the first image; the camera coordinate system refers to the three-dimensional coordinate system established from the direction of the camera's optical axis and two directions in the plane that contains the camera's optical center and is perpendicular to the optical axis.

The key points of the target object are, for example, pixel points located on the target object that are related to one another and that, when connected according to those relations, can represent the pose of the target object. For example, when the target object is a human body, the key points include the key points of the joints of the human body. A key point is expressed as a two-dimensional coordinate value in the image coordinate system, and as a three-dimensional coordinate value in the camera coordinate system.

In a specific implementation, for example, a key point detection network may be used to perform key point detection processing based on the target feature map of the target object, to obtain the two-dimensional position information of each of the multiple key points of the target object in the first image and the relative depth of each key point with respect to the reference node of the target object. For the way the target feature map is obtained, refer to the description of S401 below, which is not repeated here.

The reference node is, for example, any pixel point of a predetermined part of the target object. Exemplarily, the reference node may be predetermined according to actual needs; for example, when the target object is a human body, a pixel point on the human pelvis may be determined as the reference node, or any pixel point on the human body may be determined as the reference node, or a pixel point at the center of the chest and abdomen of the human body may be determined as the reference node. It may be set as needed.

The relative depth of each key point with respect to the reference node of the target object is, for example, the difference between the coordinate value of the key point along the depth direction of the camera coordinate system and the coordinate value of the reference node along the depth direction of the camera coordinate system. The absolute depth of a key point is, for example, the coordinate value of the key point along the depth direction of the camera coordinate system.

Referring to FIG. 4, an embodiment of the present invention provides a specific method for determining the absolute depth of the reference node of the target object in the camera coordinate system based on the target area corresponding to the target object, including the following steps.

S401: Determine the target feature map of the target object based on the target area where the target object is located and the first image.

Here, for example, based on the feature map of the first image obtained by performing feature extraction on the first image, and on the target area, the target feature map of the target object may be determined from the feature map.

Here, the feature points in the feature map extracted for the first image and the pixel points in the first image have a certain positional mapping relationship. After the target area where each target object is located is obtained, the position of each target object in the feature map of the first image can be determined according to this positional mapping relationship, and the target feature map of each target object is then cropped from the feature map of the first image.

S402: Perform depth recognition processing on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object.

Here, since different cameras have different intrinsic parameters, the imaging of the target object differs between cameras; directly determining the absolute depth of the reference node of the target object would therefore introduce errors caused by the camera intrinsic parameters. In this embodiment of the present invention, to reduce the influence on the absolute depth of the image differences caused by different camera intrinsic parameters, the normalized absolute depth of the reference node of the target object may first be obtained based on the target feature map, and the absolute depth of the reference node is then obtained using the normalized absolute depth and the camera intrinsic parameters. The normalized absolute depth is the absolute depth obtained after normalizing the reference node with the camera's parameter matrix; after the normalized absolute depth is obtained, the camera's parameter matrix can be used to recover the absolute depth of the reference node.

In a possible implementation, for example, a pre-trained depth prediction network may be used to perform depth detection processing on the target feature map to obtain the normalized absolute depth of the reference node of the target object.

In another embodiment of the present invention, referring to FIG. 5, another specific method for obtaining the normalized absolute depth of the reference node is provided, including the following steps.

S501: Determine an initial depth image based on the first image; wherein the pixel value of any first pixel point in the initial depth image is the initial depth value, in the camera coordinate system, of the second pixel point in the first image whose position corresponds to that first pixel point.

In a specific implementation, the first pixel points in the initial depth image are in one-to-one correspondence with the second pixel points in the first image; that is, the coordinate value of a first pixel point in the initial depth image is the same as the coordinate value, in the first image, of the corresponding second pixel point.

Exemplarily, a depth prediction network may be used to determine the initial depth value of each pixel point (each second pixel point) in the first image; the initial depth values of these pixel points constitute the initial depth image of the first image. The pixel value of any pixel point (first pixel point) in the initial depth image is then the initial depth value of the pixel point (second pixel point) at the corresponding position in the first image.

S502: Based on the target feature map corresponding to the target object, determine second two-dimensional position information of the reference node corresponding to the target object in the first image, and, based on the second two-dimensional position information and the initial depth image, determine the initial depth value of the reference node corresponding to the target object.

Here, the target feature map corresponding to the target object may be, for example, the target feature map determined for each target object from the feature map of the first image, based on the target area corresponding to that target object.

After the target feature map corresponding to each target object is obtained, for example, a pre-trained reference node detection network may be used to determine, based on the target feature map, the second two-dimensional position information of the reference node of the target object in the first image. The second two-dimensional position information is then used to determine the pixel point corresponding to the reference node in the initial depth image, and the pixel value of that pixel point in the initial depth image is determined as the initial depth value of the reference node.
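Because the initial depth image is aligned 1:1 with the first image, looking up the reference node's initial depth reduces to indexing the depth map at the node's 2D position; a minimal sketch, assuming sub-pixel positions are rounded to the nearest pixel (the rounding choice is an assumption):

```python
import numpy as np

def reference_node_initial_depth(initial_depth_image, ref_xy):
    """Read the reference node's initial depth from the per-pixel initial
    depth image, using its second 2D position (x, y) in the first image."""
    x, y = ref_xy
    # the depth image is aligned 1:1 with the first image, so index directly
    return float(initial_depth_image[int(round(y)), int(round(x))])

depth_image = np.arange(12, dtype=float).reshape(3, 4)  # toy 3x4 depth map
d0 = reference_node_initial_depth(depth_image, (2.2, 1.4))
```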

S503: Determine the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node and the target feature map.

Exemplarily, at least one stage of first convolution processing may be performed on the target feature map corresponding to the target object to obtain a feature vector of the target object; the feature vector and the initial depth value are concatenated to obtain a concatenated vector, and at least one stage of second convolution processing is performed on the concatenated vector to obtain a correction value for the initial depth value; the normalized absolute depth is then obtained based on the correction value and the initial depth value.

Here, for example, a neural network for adjusting the initial depth value may be used; the neural network includes multiple convolutional layers, some of which perform the at least one stage of first convolution processing on the target feature map, while the other convolutional layers perform the at least one stage of second convolution processing on the concatenated vector to obtain the correction value. The initial depth value is then adjusted according to the correction value to obtain the normalized absolute depth of the reference node of the target object.
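The adjustment network described above can be sketched with stand-ins for the two convolution stages (a pooled feature vector for the first stage, a single linear map for the second); all weights and sizes here are hypothetical, chosen only to show the dataflow:

```python
import numpy as np

def conv_stack(feature_map):
    """Stand-in for the first convolution stage plus global pooling:
    reduces a (C, H, W) target feature map to a 1-D feature vector."""
    return np.maximum(feature_map, 0.0).mean(axis=(1, 2))  # -> shape (C,)

def refine_depth(feature_map, initial_depth, head_weights):
    """Concatenate the feature vector with the initial depth value, regress
    a correction with a stand-in second stage, and return the corrected
    (normalized absolute) depth = initial depth + correction."""
    feat = conv_stack(feature_map)
    joint = np.concatenate([feat, [initial_depth]])        # splice depth in
    correction = float(joint @ head_weights)               # linear stand-in
    return initial_depth + correction

fmap = np.ones((4, 2, 2))                  # toy (C, H, W) target feature map
w = np.array([0.1, 0.1, 0.1, 0.1, -0.05])  # hypothetical head weights
z_norm = refine_depth(fmap, initial_depth=2.0, head_weights=w)
```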

Continuing from S402 above, the specific method for determining the absolute depth of the reference node of the target object in the camera coordinate system provided by the embodiment of the present invention further includes: S403: Obtain the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.

In a specific implementation, during image processing of different first images, the different first images may be captured by different cameras, and different cameras may have different intrinsic parameters. Here, the camera intrinsic parameters include, for example, the focal length of the camera on the x-axis, the focal length of the camera on the y-axis, and the x-axis and y-axis coordinates of the camera's optical center in the camera coordinate system.

When the camera intrinsic parameters differ, even first images acquired from the same viewing angle and the same position will differ. If the absolute depth of the reference node were predicted directly from the target feature map, different absolute depths would be obtained from first images acquired by different cameras at the same viewing angle and the same position.

To avoid this, the embodiment of the present invention directly predicts the normalized depth of the reference node; the normalized absolute depth is obtained without considering the camera intrinsic parameters, and the absolute depth of the reference node is then recovered according to the camera intrinsic parameters and the normalized absolute depth.

When recovering the absolute depth of the reference node from the normalized absolute depth, the absolute depth of the reference node of the target object in the camera coordinate system may be obtained, for example, based on the normalized absolute depth, the focal length, the area of the target area, and the area of the target bounding box.

Exemplarily, the normalized absolute depth and the absolute depth of the reference node of any target object satisfy the following formula (1):

$$\tilde{Z} = \frac{Z}{f}\sqrt{\frac{S_{\mathrm{area}}}{S_{\mathrm{box}}}} \tag{1}$$

where $\tilde{Z}$ denotes the normalized absolute depth of the reference node; $Z$ denotes the absolute depth of the reference node; $S_{\mathrm{area}}$ denotes the area of the target area; $S_{\mathrm{box}}$ denotes the area of the target bounding box; and $f$ denotes the camera focal length. Exemplarily, the camera coordinate system is a three-dimensional coordinate system with three coordinate axes x, y, and z; the origin of the camera coordinate system is the camera's optical center; the camera's optical axis is the z-axis of the camera coordinate system; the plane that contains the optical center and is perpendicular to the z-axis is the plane in which the x-axis and y-axis lie; $f_x$ is the focal length of the camera on the x-axis and $f_y$ is the focal length of the camera on the y-axis (so that $f$ may be taken as $\sqrt{f_x f_y}$).

Note here that, as described in S202 above, there are multiple target bounding boxes determined by RoIAlign, and the areas of the multiple target bounding boxes are all equal.

Since the camera focal length is already fixed when the camera acquires the first image, and the target area and the target bounding box are already determined when the target area is identified, once the normalized absolute depth of the reference node has been obtained, the absolute depth of the reference node of the target object is obtained according to formula (1) above.
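Assuming formula (1) has the form $\tilde{Z} = (Z/f)\sqrt{S_{\mathrm{area}}/S_{\mathrm{box}}}$ with $f=\sqrt{f_x f_y}$ (a reconstruction, since the published formula image is not reproduced in this text), the recovery of the absolute depth can be sketched as:

```python
import math

def absolute_depth(z_norm, fx, fy, area_region, area_box):
    """Recover the reference node's absolute depth from its normalized
    absolute depth, the focal lengths, the target-region area, and the
    (fixed) target-bounding-box area, by inverting the assumed form of (1)."""
    f = math.sqrt(fx * fy)                        # combined focal length
    return z_norm * f * math.sqrt(area_box / area_region)

# toy values: f = 1000 px, region a quarter of the pooled box area
z = absolute_depth(z_norm=2.0, fx=1000.0, fy=1000.0,
                   area_region=10000.0, area_box=40000.0)
```

The key property is that the prediction $\tilde{Z}$ is camera-independent, while the recovered $Z$ scales with the focal length and shrinks as the object's image area grows.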

III: In S103 above, assume that each target object includes J key points and that there are N target objects in the first image. The three-dimensional poses of the N target objects are denoted $\{P^{m}\}_{m=1}^{N}$.

The three-dimensional pose of the m-th target object is expressed as $P^{m} = \{(X_{j}^{m}, Y_{j}^{m}, Z_{j}^{m})\}_{j=1}^{J}$, where $X_{j}^{m}$, $Y_{j}^{m}$, and $Z_{j}^{m}$ denote the coordinate values of the j-th key point of the m-th target object along the x-axis, y-axis, and z-axis of the camera coordinate system, respectively.

The target areas where the N target objects are located are denoted $\{B^{m}\}_{m=1}^{N}$. The target area of the m-th target object is expressed as $B^{m} = (x^{m}, y^{m}, w^{m}, h^{m})$, where $x^{m}$ and $y^{m}$ are the coordinate values of the vertex at the top-left corner of the target area, and $w^{m}$ and $h^{m}$ are the width and height values of the target area, respectively.

The three-dimensional poses of the N target objects relative to their reference nodes are denoted $\{\tilde{P}^{m}\}_{m=1}^{N}$, where the pose of the m-th target object relative to its reference node is expressed as $\tilde{P}^{m} = \{(u_{j}^{m}, v_{j}^{m}, d_{j}^{m})\}_{j=1}^{J}$. Here, $u_{j}^{m}$ and $v_{j}^{m}$ denote the coordinate values of the j-th key point of the m-th target object on the x-axis and y-axis of the image coordinate system; that is, $(u_{j}^{m}, v_{j}^{m})$ is the two-dimensional coordinate value of the j-th key point of the m-th target object in the image coordinate system. $d_{j}^{m}$ denotes the relative depth of the j-th key point of the m-th target object with respect to the reference node of the m-th target object.

Using the camera's intrinsic parameter matrix K, the three-dimensional pose of the m-th target object is obtained by back-projection, where the three-dimensional coordinate information of the j-th key point of the m-th target object satisfies the following formula (2):

$$\begin{pmatrix} X_{j}^{m} \\ Y_{j}^{m} \\ Z_{j}^{m} \end{pmatrix} = \left(Z_{R}^{m} + d_{j}^{m}\right) K^{-1} \begin{pmatrix} u_{j}^{m} \\ v_{j}^{m} \\ 1 \end{pmatrix} \tag{2}$$

where $Z_{R}^{m}$ denotes the absolute depth value of the reference node of the m-th target object in the camera coordinate system. Note that $Z_{R}^{m}$ is obtained based on the embodiment corresponding to formula (1) above.

The intrinsic parameter matrix K is, for example:

$$K = \begin{pmatrix} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \end{pmatrix}$$

where $f_{x}$ is the focal length of the camera on the x-axis of the camera coordinate system; $f_{y}$ is the focal length of the camera on the y-axis of the camera coordinate system; $c_{x}$ is the coordinate value of the camera's optical center on the x-axis of the camera coordinate system; and $c_{y}$ is the coordinate value of the camera's optical center on the y-axis of the camera coordinate system.
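Formula (2) is a standard pinhole back-projection with the matrix K above; a key point's 3D position in the camera coordinate system can be computed as below (the intrinsics and coordinates are toy values for illustration):

```python
import numpy as np

def back_project(uv, rel_depth, root_depth, K):
    """Back-project one key point per formula (2): scale the homogeneous
    image point by its absolute depth (root depth + relative depth) and
    multiply by the inverse of the intrinsic matrix K."""
    z = root_depth + rel_depth                  # absolute depth of the key point
    uv1 = np.array([uv[0], uv[1], 1.0])         # homogeneous image coordinates
    return z * (np.linalg.inv(K) @ uv1)         # (X, Y, Z) in camera coordinates

K = np.array([[1000.0,    0.0, 320.0],
              [   0.0, 1000.0, 240.0],
              [   0.0,    0.0,   1.0]])
P = back_project((420.0, 340.0), rel_depth=-50.0, root_depth=2050.0, K=K)
```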

Through the above process, the three-dimensional position information of each of the multiple key points of the target object in the camera coordinate system can be obtained; for the m-th target object, the three-dimensional position information corresponding to its J key points represents the three-dimensional pose of the m-th target object.

In the embodiment of the present invention, by identifying the target area where the target object in the first image is located, and determining, based on the target area, the first two-dimensional position information in the first image of the multiple key points that represent the pose of the target object, the relative depth of each key point with respect to the reference node of the target object, and the absolute depth of the reference node of the target object in the camera coordinate system, the three-dimensional position information of the multiple key points of the target object in the camera coordinate system can be obtained more accurately based on the first two-dimensional position information, the relative depths, and the absolute depth.

In another embodiment of the present invention, another image processing method is further provided, wherein the image processing method is applied to a pre-trained neural network.

The neural network includes three branch networks: a target detection network, a key point detection network, and a depth prediction network. The target detection network is used to obtain the target area where the target object is located; the key point detection network is used to obtain the first two-dimensional position information in the first image of each of the multiple key points of the target object, and the relative depth of each key point with respect to the reference node of the target object; the depth prediction network is used to obtain the absolute depth of the reference node in the camera coordinate system.

For the specific working processes of the above three branch networks, refer to the foregoing embodiments; they are not repeated here.

In the embodiment of the present invention, the three branch networks (the target detection network, the key point detection network, and the depth prediction network) constitute an end-to-end target object pose detection framework. Processing the first image based on this framework to obtain the three-dimensional position information in the camera coordinate system of the multiple key points of each target object in the first image gives faster processing speed and higher recognition accuracy.

Referring to FIG. 6, an embodiment of the present invention further provides a specific example of a target object posture detection framework, which includes three network branches: a target detection network, a key point detection network, and a depth prediction network. The target detection network performs feature extraction on the first image to obtain a feature map of the first image; then, based on the feature map, RoIAlign is applied to determine multiple target bounding boxes from multiple pre-generated candidate bounding boxes, and bounding box regression is performed on the target bounding boxes to obtain the target region corresponding to each target object. The target feature map corresponding to each target region is then passed to the key point detection network and the depth prediction network.
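The patent does not specify how the target bounding boxes are selected from the pre-generated candidates. A common approach in detection pipelines of this kind is score thresholding followed by greedy non-maximum suppression; the following NumPy sketch illustrates that step under this assumption (the function names and thresholds are illustrative, not from the patent):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def select_target_boxes(candidates, scores, score_thr=0.5, iou_thr=0.5):
    """Pick target boxes from candidates: drop low scores, then greedy NMS.
    Returns the indices of the kept candidate boxes."""
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= score_thr]
    kept = []
    for i in order:
        if all(iou(candidates[i], candidates[j]) < iou_thr for j in kept):
            kept.append(i)
    return kept
```

In an actual framework the kept boxes would then be refined by bounding box regression, as described above.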

Based on the target feature map, the key point detection network determines the first two-dimensional position information, in the first image, of the multiple key points characterizing the target object's posture, and the relative depth of each key point with respect to the target object's reference node. For each target feature map, the first two-dimensional position information and relative depths of its key points constitute the three-dimensional posture of the target object in that feature map. This three-dimensional posture is expressed relative to the object itself.
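The patent does not fix how the keypoint branch represents its outputs. A common choice is one heatmap plus one relative-depth map per keypoint; the NumPy sketch below shows one plausible decoding under that assumption (array shapes and names are illustrative):

```python
import numpy as np

def decode_keypoints(heatmaps, rel_depth_maps):
    """heatmaps, rel_depth_maps: (K, H, W) arrays for K keypoints.
    Returns per-keypoint 2D positions (x, y) in map coordinates and the
    relative depth of each keypoint w.r.t. the reference node."""
    K, H, W = heatmaps.shape
    coords, rel_depths = [], []
    for k in range(K):
        # 2D position: location of the heatmap maximum.
        idx = np.argmax(heatmaps[k])
        y, x = divmod(idx, W)
        coords.append((x, y))
        # Relative depth: value of the depth map sampled at that location.
        rel_depths.append(rel_depth_maps[k, y, x])
    return coords, rel_depths
```

The decoded coordinates would still need to be mapped from feature-map coordinates back to first-image coordinates using the target region's position and scale.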

Based on the target feature map, the depth prediction network determines the absolute depth of the target object's reference node in the camera coordinate system.

Finally, the three-dimensional position information of the target object's multiple key points in the camera coordinate system is determined from the object's first two-dimensional position information, the relative depths, and the absolute depth of the reference node. For each target object, the three-dimensional positions of its key points in the camera coordinate system constitute the object's three-dimensional posture in that coordinate system. This three-dimensional posture is expressed relative to the camera.
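The final combination step is standard pinhole back-projection: each keypoint's absolute depth is the reference node's absolute depth plus that keypoint's relative depth, and the image coordinates are lifted into the camera frame with the camera intrinsics. A minimal sketch (the intrinsic parameter values in the test are illustrative):

```python
import numpy as np

def keypoints_to_camera(kp_2d, rel_depth, root_depth, fx, fy, cx, cy):
    """Lift keypoints into the camera coordinate system.
    kp_2d:      (K, 2) pixel coordinates (u, v) of the keypoints.
    rel_depth:  (K,)   depth of each keypoint relative to the reference node.
    root_depth: absolute depth of the reference node in the camera frame."""
    kp_2d = np.asarray(kp_2d, dtype=float)
    z = root_depth + np.asarray(rel_depth, dtype=float)  # absolute depth per keypoint
    x = (kp_2d[:, 0] - cx) * z / fx                      # pinhole back-projection
    y = (kp_2d[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)                   # (K, 3) camera-frame points
```

For example, with fx = fy = 500 and principal point (320, 240), a keypoint at pixel (420, 240) whose absolute depth is 2 maps to (0.4, 0, 2) in the camera coordinate system.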

Referring to FIG. 7, an embodiment of the present invention further provides another specific example of a target object posture detection framework, which includes a target detection network, a key point detection network, and a depth prediction network. The target detection network performs feature extraction on the first image to obtain a feature map of the first image; then, based on the feature map, RoIAlign is applied to determine multiple target bounding boxes from multiple pre-generated candidate bounding boxes, and bounding box regression is performed on the target bounding boxes to obtain the target region corresponding to each target object. The target feature map corresponding to each target region is then passed to the key point detection network and the depth prediction network.

Based on the target feature map, the key point detection network determines the first two-dimensional position information, in the first image, of the multiple key points characterizing the target object's posture, and the relative depth of each key point with respect to the target object's reference node. For each target feature map, the first two-dimensional position information and relative depths of its key points constitute the three-dimensional posture of the target object in that feature map. This three-dimensional posture is expressed relative to the object itself.

Based on the first image, the depth prediction network obtains an initial depth image. Based on the target feature map corresponding to the target object, it determines the second two-dimensional position information of the object's reference node in the first image, and from the second two-dimensional position information and the initial depth image it determines the initial depth value of that reference node. It also performs at least one stage of first convolution processing on the target feature map to obtain a feature vector of the target object, concatenates the feature vector with the reference node's initial depth value to form a spliced vector, and performs at least one stage of second convolution processing on the spliced vector to obtain a correction value for the initial depth value. Adding the correction value to the reference node's initial depth value yields the normalized absolute depth value of the reference node.
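The correction step in the depth prediction network can be sketched as follows. The patent specifies convolution stages; here each stage is stood in for by a single linear map purely to show the data flow (the shapes and the linear stand-ins are assumptions, not the patent's architecture):

```python
import numpy as np

def refine_root_depth(target_feat, init_depth, w1, w2):
    """Regress a correction to the reference node's initial depth value.
    target_feat: (C, H, W) target feature map of one detected object.
    w1: (D, C*H*W) stand-in for the first convolution stage (-> feature vector).
    w2: (D + 1,)   stand-in for the second stage (-> scalar correction)."""
    feat_vec = w1 @ target_feat.ravel()               # feature vector of the object
    joint = np.concatenate([feat_vec, [init_depth]])  # splice vector + initial depth
    correction = w2 @ joint                           # correction value
    return init_depth + correction                    # normalized absolute depth
```

The key design point this preserves is that the regressed quantity is a residual correction to a depth value read out of the initial depth image, rather than a depth predicted from scratch.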

Then, the absolute depth value of the reference node is recovered through the above formula (1), and the three-dimensional position information of the target object's multiple key points in the camera coordinate system is determined from the object's first two-dimensional position information, the relative depths, and the absolute depth of the reference node. For each target object, the three-dimensional positions of its key points in the camera coordinate system constitute the object's three-dimensional posture in that coordinate system. This three-dimensional posture is expressed relative to the camera.
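Formula (1) itself is not reproduced in this excerpt. Based on claim 8, it depends on the normalized absolute depth, the camera's focal length, the area of the target region, and the area of the target bounding box; one plausible form of such a denormalization is sketched below, and should be read as an assumption rather than as the patent's actual formula:

```python
import math

def recover_absolute_depth(z_norm, focal, target_area, bbox_area):
    """Denormalize the reference node's depth (ASSUMED form, not formula (1)):
    scale by the focal length and by the square root of the ratio between the
    target-region area and the detected bounding-box area."""
    return z_norm * focal * math.sqrt(target_area / bbox_area)
```

With z_norm = 0.01, focal = 500, and a bounding box four times the target-region area, this yields a depth of 2.5 in camera units; the real formula (1) may differ in form and constants.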

With either of the above two target object posture detection frameworks, the three-dimensional position information of the multiple key points of each target object in the first image can be obtained in the camera coordinate system, with faster processing and higher recognition accuracy.

Those skilled in the art will understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.

Based on the same inventive concept, an embodiment of the present invention further provides an image processing device corresponding to the image processing method. Since the principle by which the device in the embodiment solves the problem is similar to that of the above image processing method, the implementation of the device may refer to the implementation of the method, and repeated descriptions are omitted.

Referring to FIG. 8, a schematic diagram of an image processing device provided by an embodiment of the present invention, the device includes a recognition module 81, a first detection module 82, and a second detection module 83. The recognition module 81 is configured to identify the target region where the target object in the first image is located. The first detection module 82 is configured to determine, based on the target region corresponding to the target object, the first two-dimensional position information of multiple key points characterizing the target object's posture in the first image, the relative depth of each key point with respect to the target object's reference node, and the absolute depth of that reference node in the camera coordinate system. The second detection module 83 is configured to determine the three-dimensional position information of the target object's multiple key points in the camera coordinate system based on the first two-dimensional position information, the relative depths, and the absolute depth.

In a possible implementation, when identifying the target region where the target object in the first image is located, the recognition module 81 is configured to: perform feature extraction on the first image to obtain a feature map of the first image; determine multiple target bounding boxes from multiple pre-generated candidate bounding boxes based on the feature map; and determine the target region corresponding to the target object based on the target bounding boxes.

In a possible implementation, when determining the target region corresponding to the target object based on the target bounding boxes, the recognition module 81 is configured to: determine, based on the multiple target bounding boxes and the feature map, the feature sub-map corresponding to each target bounding box; and perform bounding box regression on the feature sub-maps corresponding to the multiple target bounding boxes to obtain the target region corresponding to the target object.

In a possible implementation, when determining the absolute depth of the target object's reference node in the camera coordinate system based on the target region corresponding to the target object, the first detection module 82 is configured to: determine the target feature map corresponding to the target object based on that target region and the first image; perform depth recognition processing based on the target feature map to obtain the normalized absolute depth of the target object's reference node; and obtain the absolute depth of the reference node in the camera coordinate system based on the normalized absolute depth and the camera's parameter matrix.

In a possible implementation, when performing depth recognition processing based on the target feature map corresponding to the target object to obtain the normalized absolute depth of the target object's reference node, the first detection module 82 is configured to: obtain an initial depth image based on the first image, where the pixel value of any first pixel in the initial depth image represents the initial depth value, in the camera coordinate system, of the second pixel at the corresponding position in the first image; determine, based on the target feature map corresponding to the target object, the second two-dimensional position information of the reference node corresponding to the target object in the first image, and determine the initial depth value of that reference node based on the second two-dimensional position information and the initial depth image; and determine the normalized absolute depth of the target object's reference node based on the reference node's initial depth value and the target feature map corresponding to the target object.

In a possible implementation, when determining the normalized absolute depth of the target object's reference node based on the initial depth value of the reference node and the target feature map corresponding to the target object, the first detection module 82 is configured to: perform at least one stage of first convolution processing on the target feature map to obtain a feature vector of the target object; concatenate the feature vector with the initial depth value to form a spliced vector, and perform at least one stage of second convolution processing on the spliced vector to obtain a correction value for the initial depth value; and obtain the normalized absolute depth based on the correction value and the initial depth value.

In a possible implementation, a pre-trained neural network is deployed in the image processing device. The neural network includes three branch networks: a target detection network, a key point detection network, and a depth prediction network. The target detection network is used to obtain the target region where the target object is located; the key point detection network is used to obtain the first two-dimensional position information of the target object's multiple key points in the first image and the relative depth of each key point with respect to the target object's reference node; the depth prediction network is used to obtain the absolute depth of the reference node in the camera coordinate system.

By identifying the target region in which the target object is located in the first image, and determining, based on that target region, the first two-dimensional position information of multiple key points characterizing the target object's posture in the first image, the relative depth of each key point with respect to the target object's reference node, and the absolute depth of that reference node in the camera coordinate system, embodiments of the present invention obtain the three-dimensional position information of the target object's multiple key points in the camera coordinate system more accurately from the first two-dimensional position information, the relative depths, and the absolute depth.

In addition, in this embodiment of the present invention, the target detection network, the key point detection network, and the depth prediction network together form an end-to-end target object posture detection framework. Processing the first image with this framework yields the three-dimensional position information, in the camera coordinate system, of the multiple key points of each target object in the first image, with faster processing and higher recognition accuracy.

For the processing flow of each module in the device and the interaction flow between the modules, reference may be made to the relevant descriptions in the above method embodiments, which will not be detailed here.

An embodiment of the present invention further provides a computer device 10. As shown in FIG. 9, a schematic structural diagram of the computer device 10, it includes a processor 11 and a memory 12. The memory 12 stores machine-readable instructions executable by the processor 11; when the computer device runs, the machine-readable instructions are executed by the processor to implement the following steps: identifying the target region where the target object in the first image is located; determining, based on the target region corresponding to the target object, the first two-dimensional position information of multiple key points characterizing the target object's posture in the first image, the relative depth of each key point with respect to the target object's reference node, and the absolute depth of that reference node in the camera coordinate system; and determining, based on the first two-dimensional position information, the relative depths, and the absolute depth, the three-dimensional position information of the target object's multiple key points in the camera coordinate system.

For the specific execution process of the above instructions, reference may be made to the steps of the image processing method described in the embodiments of the present invention, which will not be repeated here.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, it executes the steps of the image processing method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.

An embodiment of the present invention further provides a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the image processing method described in the above method embodiments. For details, refer to the above method embodiments, which will not be repeated here.

The above computer program product may be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a software development kit (SDK).

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the system and device described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here. In the several embodiments provided by the present invention, it should be understood that the disclosed system, device, and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in actual implementation there may be other ways of division. For another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are merely specific implementations of the present invention, used to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with this technical field may, within the technical scope disclosed by the present invention, still modify the technical solutions recorded in the foregoing embodiments, readily conceive of changes to them, or make equivalent replacements of some of their technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

10: computer device
11: processor
12: memory
81: recognition module
82: first detection module
83: second detection module
S101~S103, S201~S202, S301~S302, S401~S403, S501~S503: steps

To describe the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they illustrate embodiments consistent with the present invention and, together with the specification, serve to explain its technical solutions. It should be understood that the following drawings show only certain embodiments of the present invention and therefore should not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings may be obtained from these drawings without creative effort.
FIG. 1 shows a flowchart of an image processing method provided by an embodiment of the present invention;
FIG. 2 shows a flowchart of a specific method, provided by an embodiment of the present invention, for identifying the target region where the target object in the first image is located;
FIG. 3 shows a specific example, provided by an embodiment of the present invention, of determining the target region corresponding to the target object based on target bounding boxes;
FIG. 4 shows a flowchart of a specific method, provided by an embodiment of the present invention, for determining the absolute depth of the target object's reference node in the camera coordinate system;
FIG. 5 shows a flowchart of another specific method, provided by an embodiment of the present invention, for obtaining the normalized absolute depth of the reference node;
FIG. 6 shows a specific example of a target object posture detection framework provided by an embodiment of the present invention;
FIG. 7 shows a specific example of another target object posture detection framework provided by an embodiment of the present invention;
FIG. 8 shows a schematic diagram of an image processing device provided by an embodiment of the present invention;
FIG. 9 shows a schematic diagram of a computer device provided by an embodiment of the present invention.

S101~S103: steps

Claims (11)

An image processing method, comprising: identifying the target region where a target object in a first image is located; determining, based on the target region where the target object is located, the first two-dimensional position information of multiple key points of the target object in the first image, the relative depth of each key point with respect to a reference node of the target object, and the absolute depth of the target object's reference node in a camera coordinate system; and determining, based on the first two-dimensional position information and the relative depth corresponding to each of the target object's multiple key points, and the absolute depth corresponding to the reference node, the three-dimensional position information of the target object's multiple key points in the camera coordinate system.

The image processing method according to claim 1, further comprising: obtaining the posture of the target object based on the three-dimensional position information of the target object's multiple key points in the camera coordinate system.
The image processing method according to claim 1 or 2, wherein identifying the target region where the target object in the first image is located comprises: performing feature extraction on the first image to obtain a feature map of the first image; determining multiple target bounding boxes from multiple pre-generated candidate bounding boxes based on the feature map; and determining the target region where the target object is located based on the multiple target bounding boxes.

The image processing method according to claim 3, wherein determining the target region where the target object is located based on the multiple target bounding boxes comprises: determining a feature sub-map of each target bounding box based on the multiple target bounding boxes and the feature map; and performing bounding box regression on the feature sub-maps corresponding to the multiple target bounding boxes to obtain the target region where the target object is located.
5. The image processing method according to claim 1 or 2, wherein determining, based on the target area in which the target object is located, the absolute depth of the reference node of the target object in the camera coordinate system comprises: determining a target feature map of the target object based on the target area in which the target object is located and the first image; performing depth recognition processing on the target feature map corresponding to the target object to obtain a normalized absolute depth of the reference node of the target object; and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and a parameter matrix of the camera.

6. The image processing method according to claim 5, wherein performing depth recognition processing on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object comprises: determining an initial depth image based on the first image, wherein the pixel value of any first pixel in the initial depth image is the initial depth value, in the camera coordinate system, of the second pixel in the first image whose position corresponds to that of the first pixel; determining, based on the target feature map corresponding to the target object, second two-dimensional position information of the reference node corresponding to the target object in the first image; determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image; and determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node and the target feature map.

7. The image processing method according to claim 6, wherein determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node and the target feature map comprises: performing at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object; concatenating the feature vector and the initial depth value to obtain a concatenated vector, and performing at least one stage of second convolution processing on the concatenated vector to obtain a correction value for the initial depth value; and obtaining the normalized absolute depth based on the correction value and the initial depth value.
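Claim 7 describes a residual-style refinement: a feature vector extracted from the target feature map is concatenated with the initial depth value, and a correction to that depth is regressed from the concatenation. The minimal numeric sketch below stands in for the convolution stages with pooling and linear layers; the layer shapes, weights, and the additive composition of correction and initial depth are all assumptions, as the claims do not fix them:

```python
import numpy as np

def depth_correction_head(target_feat, initial_depth, w1, w2):
    """Sketch of the claim-7 refinement for one detected object.

    target_feat:   (C, H, W) target feature map.
    initial_depth: scalar depth read from the initial depth image.
    w1: (C, D) weights standing in for the 'first convolution' stages.
    w2: (D + 1,) weights standing in for the 'second convolution' stages.
    All weights are hypothetical; the patent does not specify layer sizes.
    """
    pooled = target_feat.mean(axis=(1, 2))                 # (C,) global pooling
    feat_vec = np.maximum(pooled @ w1, 0.0)                # (D,) feature vector
    spliced = np.concatenate([feat_vec, [initial_depth]])  # concat depth value
    correction = float(spliced @ w2)                       # scalar correction
    return initial_depth + correction                      # refined depth
```

Note that with zero correction weights the head simply passes the initial depth through, which matches the intuition that the branch learns a residual on top of the depth-image estimate.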
8. The image processing method according to claim 5, wherein the parameter matrix comprises the focal length of the camera; and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera comprises: obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth, the focal length, the area of the target area, and the area of the target bounding box.

9. The image processing method according to claim 1 or 2, wherein the image processing method is applied to a pre-trained neural network comprising three branch networks: a target detection network, a keypoint detection network, and a depth prediction network; the target detection network is configured to obtain the target area in which the target object is located; the keypoint detection network is configured to obtain first two-dimensional position information of a plurality of keypoints of the target object in the first image, and the relative depth of each keypoint with respect to the reference node of the target object; and the depth prediction network is configured to obtain the absolute depth of the reference node in the camera coordinate system.

10. A computer device, comprising a processor and a memory connected to each other, the memory storing machine-readable instructions executable by the processor, wherein, when the computer device runs, the machine-readable instructions are executed by the processor to implement the steps of the image processing method according to any one of claims 1 to 9.

11. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when run by a processor, executes the steps of the image processing method according to any one of claims 1 to 9.
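The three branches of claim 9 combine naturally into a 3D reconstruction: the keypoint branch supplies 2D positions and per-keypoint depths relative to the reference node, and the depth branch supplies the reference node's absolute depth. The sketch below shows one plausible way to combine these outputs; the additive depth composition and the pinhole intrinsics `(fx, fy, cx, cy)` are assumptions for illustration, not details fixed by the claims:

```python
import numpy as np

def compose_keypoint_depths(ref_abs_depth, relative_depths):
    """Combine the depth-prediction branch output (absolute depth of the
    reference node) with the keypoint branch output (per-keypoint depth
    relative to that node) to get each keypoint's absolute depth.
    Additive composition is assumed from the claim wording."""
    return ref_abs_depth + np.asarray(relative_depths, dtype=float)

def backproject(points_2d, depths, fx, fy, cx, cy):
    """Standard pinhole back-projection: lift 2D keypoints with absolute
    depths into the camera coordinate system. fx, fy, cx, cy are the
    focal lengths and principal point from the camera parameter matrix."""
    u, v = np.asarray(points_2d, dtype=float).T
    z = np.asarray(depths, dtype=float)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

For example, a keypoint projected exactly at the principal point back-projects to `(0, 0, z)` in camera coordinates, for any depth `z`.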
TW110115664A 2020-05-13 2021-04-29 Image processing method, electronic device and computer-readable storage media TWI777538B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010403620.5A CN111582207B (en) 2020-05-13 2020-05-13 Image processing method, device, electronic equipment and storage medium
CN202010403620.5 2020-05-13

Publications (2)

Publication Number Publication Date
TW202143100A true TW202143100A (en) 2021-11-16
TWI777538B TWI777538B (en) 2022-09-11

Family

ID=72110786

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110115664A TWI777538B (en) 2020-05-13 2021-04-29 Image processing method, electronic device and computer-readable storage media

Country Status (3)

Country Link
CN (1) CN111582207B (en)
TW (1) TWI777538B (en)
WO (1) WO2021227694A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582207B (en) * 2020-05-13 2023-08-15 北京市商汤科技开发有限公司 Image processing method, device, electronic equipment and storage medium
GB2598452B (en) * 2020-06-22 2024-01-10 Snap Inc 3D object model reconstruction from 2D images
GB202009515D0 (en) 2020-06-22 2020-08-05 Ariel Ai Ltd 3D object model reconstruction from 2D images
EP3965071A3 (en) * 2020-09-08 2022-06-01 Samsung Electronics Co., Ltd. Method and apparatus for pose identification
CN112163480B (en) * 2020-09-16 2022-09-13 北京邮电大学 Behavior identification method and device
CN112528831B (en) * 2020-12-07 2023-11-24 深圳市优必选科技股份有限公司 Multi-target attitude estimation method, multi-target attitude estimation device and terminal equipment
CN112907517A (en) * 2021-01-28 2021-06-04 上海商汤智能科技有限公司 Image processing method and device, computer equipment and storage medium
CN113344998B (en) * 2021-06-25 2022-04-29 北京市商汤科技开发有限公司 Depth detection method and device, computer equipment and storage medium
CN113610967B (en) * 2021-08-13 2024-03-26 北京市商汤科技开发有限公司 Three-dimensional point detection method, three-dimensional point detection device, electronic equipment and storage medium
CN113610966A (en) * 2021-08-13 2021-11-05 北京市商汤科技开发有限公司 Three-dimensional attitude adjustment method and device, electronic equipment and storage medium
CN114354618A (en) * 2021-12-16 2022-04-15 浙江大华技术股份有限公司 Method and device for detecting welding seam
CN115063789B (en) * 2022-05-24 2023-08-04 中国科学院自动化研究所 3D target detection method and device based on key point matching
WO2023236008A1 (en) * 2022-06-06 2023-12-14 Intel Corporation Methods and apparatus for small object detection in images and videos
CN114972958B (en) * 2022-07-27 2022-10-04 北京百度网讯科技有限公司 Key point detection method, neural network training method, device and equipment
CN115018918B (en) * 2022-08-04 2022-11-04 南昌虚拟现实研究院股份有限公司 Three-dimensional coordinate determination method and device, electronic equipment and storage medium
CN116386016B (en) * 2023-05-22 2023-10-10 杭州睿影科技有限公司 Foreign matter treatment method and device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101681538B1 (en) * 2010-10-20 2016-12-01 삼성전자주식회사 Image processing apparatus and method
CN106874826A (en) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 Face key point-tracking method and device
CN107871134A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN108460338B (en) * 2018-02-02 2020-12-11 北京市商汤科技开发有限公司 Human body posture estimation method and apparatus, electronic device, storage medium, and program
CN109472753B (en) * 2018-10-30 2021-09-07 北京市商汤科技开发有限公司 Image processing method and device, computer equipment and computer storage medium
CN110378308B (en) * 2019-07-25 2021-07-20 电子科技大学 Improved port SAR image near-shore ship detection method based on fast R-CNN
CN111582207B (en) * 2020-05-13 2023-08-15 北京市商汤科技开发有限公司 Image processing method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021227694A1 (en) 2021-11-18
TWI777538B (en) 2022-09-11
CN111582207B (en) 2023-08-15
CN111582207A (en) 2020-08-25

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent