WO2022246720A1 - Training method of surgical action identification model, medium and device - Google Patents

Training method of surgical action identification model, medium and device

Info

Publication number
WO2022246720A1
Authority
WO
WIPO (PCT)
Prior art keywords
fusion
pyramid
column
layers
unit
Prior art date
Application number
PCT/CN2021/096244
Other languages
French (fr)
Chinese (zh)
Inventor
贾富仓 (Jia Fucang)
徐文廷 (Xu Wenting)
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Priority date
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Publication of WO2022246720A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Definitions

  • The invention belongs to the technical field of image processing, and in particular relates to a training method for a surgical action recognition model, a computer-readable storage medium, and a computer device.
  • A surgical robot system is an intelligent computer-aided system that assists surgeons in completing operations.
  • The processing results of image algorithms drive the assisting surgical robot to perform the corresponding surgical actions, helping the attending surgeon complete the procedure.
  • A surgical robot system not only offers the advantages of minimally invasive surgery (small trauma, quick recovery, little pain for the patient); because the intelligent assisting robot system registers the patient's image data with the patient's anatomy during the actual operation, the operation can be tracked in real time.
  • Real-time recognition of instruments and surgical actions gives surgeons a clearer understanding of real-time changes in the anatomy, making minimally invasive surgery safer, more stable and more reliable.
  • Real-time detection of surgical actions through an assisting robotic arm can largely take over the tasks of the assisting surgeon, reducing the number of surgeons needed during an operation and the mis-operations caused by poor coordination among multiple doctors.
  • Target recognition of surgical actions is the most basic and most critical technology in a surgery-assisting robot system.
  • Real-time detection of surgical actions based on deep learning implements the core low-level algorithms of a surgical robot system and provides key technical support for the future development of semi-autonomous or fully autonomous surgical robots.
  • Static methods use only spatial information (image data), without any temporal context for the current frame.
  • Dynamic activity detection methods use video data, which provides temporal context for the motion in the video.
  • The above methods are all applied to natural scenes or simulated surgical scenes, which differ greatly from surgical action detection in real scenes.
  • The scene captured by an endoscopic camera is too close to show a complete organ and its surroundings, so there is little contextual information.
  • Dynamic behavior detection methods therefore have difficulty exploiting the temporal and spatial information between consecutive frames of a surgical video, and such methods cannot meet the requirements of the surgical action detection task.
  • The movement and orientation of the endoscope at close range make organs look very different from different angles, and such highly variable conditions also cause traditional object detection algorithms to fail.
  • A training method for a surgical action recognition model, the model comprising a backbone network, a pyramid feature aggregation network and a prediction network, wherein the pyramid feature aggregation network comprises a feature map collection module and a feature map divergence module, with a skip-connection fusion path between the input unit of the feature map collection module and the output unit of the feature map divergence module; the training method includes:
  • inputting the hierarchical feature maps into the pyramid feature aggregation network, where they undergo fusion processing by the feature map collection module and then the feature map divergence module, yielding several fused feature maps of different scales;
  • updating the loss function according to the predicted target values and the acquired real target values, and adjusting the model parameters of the surgical action recognition model according to the updated loss function.
  • The feature map collection module includes a first column of pyramid layers, a second column and a third column whose numbers of fusion units decrease in that order.
  • The feature map divergence module includes the third column, a fourth column and a fifth column of pyramid layers whose numbers of fusion units increase in that order, wherein the first column of pyramid layers is the input unit of the feature map collection module and the fifth column is the output unit of the feature map divergence module.
  • The first and fifth columns contain the same number of fusion units, as do the second and fourth columns.
  • The fusion units exchange information over a predetermined network of fusion paths.
  • The first and fifth columns of pyramid layers each include five fusion units of different feature scales.
  • The second and fourth columns of pyramid layers each include three fusion units of different feature scales; the third column has one fusion unit.
  • The predetermined network of fusion paths includes:
  • a first fusion path, which points bottom-up within a pyramid column from small-scale fusion units to large-scale fusion units;
  • a second fusion path, which diagonally connects fusion units of two adjacent layers and fuses feature map information of different scales between adjacent layers by downsampling;
  • a third fusion path, which diagonally connects fusion units of two adjacent layers and fuses feature map information of different scales between adjacent layers by upsampling;
  • a fourth fusion path, which horizontally connects fusion units of the same layer to fuse feature map information of the same scale;
  • a fifth fusion path, which points top-down within the first pyramid column from large-scale fusion units to small-scale fusion units;
  • a skip-connection fusion path, which connects fusion units of the same scale in the first and fifth columns of pyramid layers.
  • The hierarchical feature maps produced by the backbone network have three scales; the five fusion units of the first pyramid column are, in order of increasing scale from bottom to top, a first, second, third, fourth and fifth fusion unit, and the hierarchical feature maps of the three scales are input to the first, second and third fusion units respectively.
  • The first, second and third fusion units are connected by the fifth fusion path; the third, fourth and fifth fusion units are connected by the first fusion path.
  • The formula of the loss function is as follows: $$L\big(\{p_{x,y}\},\{t_{x,y}\}\big) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\big(p_{x,y},\, c^{*}_{x,y}\big) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y} > 0\}}\, L_{reg}\big(t_{x,y},\, t^{*}_{x,y}\big)$$
  • $L_{cls}$ is the Focal loss function.
  • $L_{reg}$ is the IOU loss function.
  • $N_{pos}$ represents the number of positive samples.
  • $\lambda$ is the balance weight of $L_{reg}$ and takes the value 1.
  • $c^{*}_{x,y}$ is the ground-truth category at point $(x, y)$.
  • $p_{x,y}$ is the predicted category at point $(x, y)$.
  • $t^{*}_{x,y}$ is the ground-truth target box at point $(x, y)$.
  • $t_{x,y}$ is the predicted target box at point $(x, y)$.
  • $\mathbb{1}_{\{c^{*}_{x,y} > 0\}}$ is an indicator function that equals 1 when $c^{*}_{x,y} > 0$ and 0 otherwise.
  • The present invention also discloses a computer-readable storage medium storing a training program of the surgical action recognition model; when the training program is executed by a processor, the above training method of the surgical action recognition model is implemented.
  • The present invention also discloses a computer device comprising a computer-readable storage medium, a processor and a training program of the surgical action recognition model stored in the computer-readable storage medium; when the training program is executed by the processor, the above training method of the surgical action recognition model is implemented.
  • The invention discloses a training method for a surgical action recognition model which, compared with traditional training methods, has the following technical effect:
  • the resulting fused feature maps can predict the surgery category and the bounding box position more accurately, solving the problem that the video features of surgical actions are not distinctive.
  • Fig. 1 is a flowchart of the training method of the surgical action recognition model of Embodiment 1 of the present invention.
  • Fig. 2 is a framework diagram of the surgical action recognition model trained in Embodiment 1 of the present invention.
  • Fig. 3 is a schematic structural diagram of the pyramid feature aggregation network of Embodiment 1 of the present invention.
  • Fig. 4 is a schematic structural diagram of the training device for the surgical action recognition model of Embodiment 2 of the present invention.
  • Fig. 5 is a functional block diagram of a computer device according to an embodiment of the present invention.
  • Existing deep-learning-based detection methods depend on sufficient context information, but in real surgical scenes the camera is so close to the scene that effective context information is hard to extract and classification accuracy cannot be improved.
  • This application provides a training method for a surgical action recognition model: hierarchical feature maps of different scales are first extracted by a backbone network; the hierarchical feature maps are then fused by a pyramid feature aggregation network, which fully fuses feature map information across scales to obtain fused feature maps of different scales; finally, a prediction network performs prediction and the updated loss function is used to adjust the model parameters of the surgical action recognition model.
  • This training method makes full use of the spatial information in the video, improves the ability of existing models to fuse multi-scale spatial information, and thereby improves the recognition accuracy and detection speed of the model.
  • The surgical action recognition model of Embodiment 1 includes a backbone network, a pyramid feature aggregation network and a prediction network, wherein the pyramid feature aggregation network includes a feature map collection module and a feature map divergence module, with a skip-connection fusion path between the input unit of the feature map collection module and the output unit of the feature map divergence module; the training method of the surgical action recognition model includes the following steps:
  • Step S10: input the acquired original surgical action images into the backbone network to obtain several hierarchical feature maps of different scales.
  • Step S20: input the hierarchical feature maps into the pyramid feature aggregation network, where they undergo fusion processing by the feature map collection module and then the feature map divergence module, yielding several fused feature maps of different scales.
  • Step S30: input the fused feature maps of different scales into the prediction network to obtain predicted target values.
  • Step S40: update the loss function according to the predicted target values and the acquired real target values, and adjust the model parameters of the surgical action recognition model according to the updated loss function.
  • In step S10 the backbone network processes the original surgical action images to obtain hierarchical feature maps at three scales, C3, C4 and C5; in step S20 each hierarchical feature map is then fed into the fusion unit of the corresponding scale, where the feature map information is fused.
  • The feature map collection module includes a first column of pyramid layers P1, a second column P2 and a third column P3 whose numbers of fusion units decrease in that order, and the feature map divergence module includes the third column P3, a fourth column P4 and a fifth column P5 whose numbers of fusion units increase in that order.
  • The first and fifth columns contain the same number of fusion units, as do the second and fourth columns, and the fusion units exchange information over a predetermined network of fusion paths.
  • The whole pyramid feature aggregation network is butterfly-shaped, and the fusion units fully fuse feature map information of different scales.
  • The first and fifth columns of pyramid layers each include five fusion units of different feature scales.
  • The second and fourth columns of pyramid layers each include three fusion units of different feature scales.
  • The third column of pyramid layers has one fusion unit.
  • Fusion units in the same row have the same scale and are also called fusion units of the same layer; within each column, the scale of the fusion units decreases from top to bottom.
  • The predetermined network of fusion paths includes a first fusion path 11, a second fusion path 12, a third fusion path 13, a fourth fusion path 14, a fifth fusion path 15 and a skip-connection fusion path 16.
  • The first fusion path 11 points bottom-up within a pyramid column from small-scale fusion units to large-scale fusion units.
  • The second fusion path 12 diagonally connects fusion units of two adjacent layers and fuses feature map information of different scales between adjacent layers by downsampling.
  • The third fusion path 13 diagonally connects fusion units of two adjacent layers and fuses feature map information of different scales between adjacent layers by upsampling.
  • The fourth fusion path 14 horizontally connects fusion units of the same layer to fuse feature map information of the same scale.
  • The fifth fusion path 15 points top-down within the first pyramid column from large-scale fusion units to small-scale fusion units.
  • The skip-connection fusion path 16 connects fusion units of the same scale in the first and fifth columns of pyramid layers, i.e. it fuses the feature map information of the input unit and the output unit of the same layer so as to retain more of the original information.
  • The five fusion units of the first pyramid column P1 are, in order of increasing scale from bottom to top, a first, second, third, fourth and fifth fusion unit.
  • The hierarchical feature maps of the three scales C5, C4 and C3 are input to the first, second and third fusion units respectively.
  • The first, second and third fusion units are connected by the fifth fusion path, i.e. feature map information is passed by upsampling; the third, fourth and fifth fusion units are connected by the first fusion path, i.e. feature map information is passed by downsampling, so that the feature map information can be fused further.
  • The prediction network includes two branch networks, used for the classification task and the regression task respectively. After the branch networks process the fused feature maps, the predicted target values are obtained; finally the loss function is updated according to the predicted target values, and the model parameters of the surgical action recognition model are adjusted according to the updated loss function.
  • The procedure for adjusting model parameters is prior art and is not repeated here.
  • In step S40, the formula of the loss function is as follows: $$L\big(\{p_{x,y}\},\{t_{x,y}\}\big) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\big(p_{x,y},\, c^{*}_{x,y}\big) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y} > 0\}}\, L_{reg}\big(t_{x,y},\, t^{*}_{x,y}\big)$$
  • $L_{cls}$ is the Focal loss function.
  • $L_{reg}$ is the IOU loss function.
  • $N_{pos}$ represents the number of positive samples.
  • $\lambda$ is the balance weight of $L_{reg}$ and takes the value 1.
  • $c^{*}_{x,y}$ is the ground-truth category at point $(x, y)$.
  • $p_{x,y}$ is the predicted category at point $(x, y)$.
  • $t^{*}_{x,y}$ is the ground-truth target box at point $(x, y)$.
  • $t_{x,y}$ is the predicted target box at point $(x, y)$.
  • $\mathbb{1}_{\{c^{*}_{x,y} > 0\}}$ is an indicator function that equals 1 when $c^{*}_{x,y} > 0$ and 0 otherwise.
  • The general form of the Focal loss function is $L_{cls}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$; the parameter $\alpha$ addresses the imbalance between positive and negative samples.
  • The confidence $p_t$ makes the model focus mainly on hard-to-classify samples, thereby solving the problem of imbalanced sample categories.
  • The training method of the surgical action recognition model disclosed in Embodiment 1 fully fuses high-level and low-level semantic information through the improved pyramid feature aggregation network; the resulting fused feature maps can predict the surgery category and the bounding box position more accurately, solving the problem that the video features of surgical actions are not distinctive.
  • Embodiment 2 also discloses a training device for a surgical action recognition model.
  • The training device includes a first input unit 100, a second input unit 200, a third input unit 300 and a model training unit 400.
  • The first input unit 100 is configured to input the acquired original surgical action images into the backbone network to obtain several hierarchical feature maps of different scales.
  • The second input unit 200 is configured to input the hierarchical feature maps into the pyramid feature aggregation network, where they undergo fusion processing by the feature map collection module and then the feature map divergence module, yielding several fused feature maps of different scales.
  • The third input unit 300 is configured to input the fused feature maps of different scales into the prediction network to obtain predicted target values.
  • The model training unit 400 is configured to update the loss function according to the predicted target values and the acquired real target values, and to adjust the model parameters of the surgical action recognition model according to the updated loss function.
  • Embodiment 3 also discloses a computer-readable storage medium storing a training program of the surgical action recognition model; when the training program is executed by a processor, the above training method of the surgical action recognition model is implemented.
  • The present application also discloses a computer device.
  • The computer device includes a processor 20, an internal bus 30, a network interface 40 and a computer-readable storage medium 50.
  • The processor 20 reads the corresponding computer program from the computer-readable storage medium and runs it, forming a request processing apparatus at the logical level.
  • Besides software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or combinations of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices.
  • The computer-readable storage medium 50 stores a training program of the surgical action recognition model; when the training program is executed by the processor, the above training method of the surgical action recognition model is implemented.
  • Computer-readable storage media include permanent and non-permanent, removable and non-removable media implemented by any method or technology for storing information.
  • Information may be computer-readable instructions, data structures, program modules or other data.
  • Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media, or other magnetic storage devices or any other non-transmission media that can be used to store information accessible by a computing device.
  • PRAM: phase-change memory
  • SRAM: static random-access memory
  • DRAM: dynamic random-access memory
  • RAM: random-access memory
  • ROM: read-only memory
  • EEPROM: electrically erasable programmable read-only memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A training method of a surgical action identification model, a medium and a device. The surgical action identification model comprises a backbone network, a pyramid feature aggregation network and a prediction network. The pyramid feature aggregation network comprises a feature map convergence module and a feature map divergence module. The training method comprises: inputting an acquired original surgical action image into the backbone network to obtain a plurality of hierarchical feature maps of different scales (S10); inputting the hierarchical feature maps into the pyramid feature aggregation network, where they sequentially undergo fusion processing by the feature map convergence module and the feature map divergence module to obtain a plurality of fused feature maps of different scales (S20); inputting the plurality of fused feature maps of different scales into the prediction network to obtain a predicted target value (S30); and updating a loss function according to the predicted target value and an acquired real target value, and adjusting a model parameter of the surgical action identification model according to the updated loss function (S40). The method makes full use of spatial information, fuses features at more scales, and trains a high-precision prediction model.

Description

Training method of surgical action identification model, medium and device

Technical Field
The invention belongs to the technical field of image processing, and in particular relates to a training method for a surgical action recognition model, a computer-readable storage medium, and a computer device.
Background Art
A surgical robot system is an intelligent computer-aided system that assists surgeons in completing operations. In minimally invasive surgery, the processing results of image algorithms drive the assisting surgical robot to perform the corresponding surgical actions, helping the attending surgeon complete the procedure. A surgical robot system not only offers the advantages of minimally invasive surgery, namely small trauma, quick recovery and little pain for the patient; because the intelligent assisting robot system registers the patient's image data with the patient's anatomy during the actual operation, instruments can be tracked and surgical actions recognized in real time, giving the surgeon a clearer picture of real-time changes in the anatomy and making minimally invasive surgery safer, more stable and more reliable. At the same time, real-time detection of surgical actions through an assisting robotic arm can largely take over the tasks of the assisting surgeon, reducing the number of surgeons needed during an operation and the mis-operations caused by poor coordination among multiple doctors. Among these capabilities, target recognition of surgical actions is the most basic and most critical technology in a surgery-assisting robot system. Real-time detection of surgical actions based on deep learning implements the core low-level algorithms of a surgical robot system and provides key technical support for the future development of semi-autonomous or fully autonomous surgical robots.
Existing deep-learning-based detection methods fall into two types: static behavior detection and dynamic behavior detection. Static methods use only spatial information (image data), without any temporal context for the current frame. Dynamic activity detection methods use video data, which provides temporal context for the motion in the video. However, the above methods are all applied to natural scenes or simulated surgical scenes, which differ greatly from surgical action detection in real scenes. First, human tissues and organs undergo non-rigid deformation, and the differences in boundary, shape and color between two different organs are very small, so methods based on spatial information struggle to extract effective feature information from the image, resulting in poor classifier accuracy. Second, the scene captured by an endoscopic camera is so close that it cannot show a complete organ and its surroundings, so there is almost no contextual information; dynamic behavior detection methods therefore have difficulty exploiting the temporal and spatial information between consecutive frames of a surgical video and cannot meet the requirements of the surgical action detection task. Finally, the movement and orientation of the endoscope at close range make organs look very different from different angles, and such highly variable conditions also cause traditional object detection algorithms to fail.
Summary of the Invention
(1) Technical problem to be solved by the present invention

When there is little temporal context in a surgical action detection scene, how can spatial information be fully exploited and features at more scales be fused so that a high-precision prediction model can be trained?
(2) Technical solution adopted by the present invention

A training method for a surgical action recognition model, the model comprising a backbone network, a pyramid feature aggregation network and a prediction network, wherein the pyramid feature aggregation network comprises a feature map collection module and a feature map divergence module, with a skip-connection fusion path between the input unit of the feature map collection module and the output unit of the feature map divergence module. The training method comprises:

inputting the acquired original surgical action images into the backbone network to obtain several hierarchical feature maps of different scales;

inputting the hierarchical feature maps into the pyramid feature aggregation network, where they undergo fusion processing by the feature map collection module and then the feature map divergence module, yielding several fused feature maps of different scales;

inputting the fused feature maps of different scales into the prediction network to obtain predicted target values;

updating the loss function according to the predicted target values and the acquired real target values, and adjusting the model parameters of the surgical action recognition model according to the updated loss function.
Optionally, the feature map collection module comprises a first column of pyramid layers, a second column and a third column whose numbers of fusion units decrease in that order, and the feature map divergence module comprises the third column, a fourth column and a fifth column of pyramid layers whose numbers of fusion units increase in that order. The first column of pyramid layers is the input unit of the feature map collection module, and the fifth column is the output unit of the feature map divergence module; the first and fifth columns contain the same number of fusion units, as do the second and fourth columns, and the fusion units exchange information over a predetermined network of fusion paths.

Optionally, the first and fifth columns of pyramid layers each comprise five fusion units of different feature scales, the second and fourth columns each comprise three fusion units of different feature scales, and the third column has one fusion unit.
Optionally, the predetermined network of fusion paths comprises:

a first fusion path, which points bottom-up within a pyramid column from small-scale fusion units to large-scale fusion units;

a second fusion path, which diagonally connects fusion units of two adjacent layers and fuses feature map information of different scales between adjacent layers by downsampling;

a third fusion path, which diagonally connects fusion units of two adjacent layers and fuses feature map information of different scales between adjacent layers by upsampling;

a fourth fusion path, which horizontally connects fusion units of the same layer to fuse feature map information of the same scale;

a fifth fusion path, which points top-down within the first pyramid column from large-scale fusion units to small-scale fusion units;

a skip-connection fusion path, which connects fusion units of the same scale in the first and fifth columns of pyramid layers.
Optionally, the hierarchical feature maps produced by the backbone network have three scales, and the five fusion units of the first pyramid column are, in order of increasing scale from bottom to top, a first, second, third, fourth and fifth fusion unit; the hierarchical feature maps of the three scales are input to the first, second and third fusion units respectively. The first, second and third fusion units are connected by the fifth fusion path, and the third, fourth and fifth fusion units are connected by the first fusion path.
Optionally, the formula of the loss function is as follows:

$$L\big(\{p_{x,y}\},\{t_{x,y}\}\big) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\big(p_{x,y},\, c^{*}_{x,y}\big) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y} > 0\}}\, L_{reg}\big(t_{x,y},\, t^{*}_{x,y}\big)$$

where $L_{cls}$ is the Focal loss function, $L_{reg}$ is the IOU loss function, $N_{pos}$ is the number of positive samples, $\lambda$ is the balance weight of $L_{reg}$ and takes the value 1, $\sum_{x,y}$ sums the losses corresponding to all points $(x, y)$ on the feature map, $c^{*}_{x,y}$ is the ground-truth category at point $(x, y)$, $p_{x,y}$ is the predicted category at point $(x, y)$, $t^{*}_{x,y}$ is the ground-truth target box at point $(x, y)$, $t_{x,y}$ is the predicted target box at point $(x, y)$, and $\mathbb{1}_{\{c^{*}_{x,y} > 0\}}$ is an indicator function that equals 1 when $c^{*}_{x,y} > 0$ and 0 otherwise.
The present invention also discloses a computer-readable storage medium storing a training program of the surgical action recognition model; when the training program is executed by a processor, the above training method of the surgical action recognition model is implemented.

The present invention also discloses a computer device comprising a computer-readable storage medium, a processor and a training program of the surgical action recognition model stored in the computer-readable storage medium; when the training program is executed by the processor, the above training method of the surgical action recognition model is implemented.
(3) Beneficial effects

The invention discloses a training method for a surgical action recognition model which, compared with traditional training methods, has the following technical effect:

the improved pyramid feature aggregation network fully fuses high-level and low-level semantic information, so the resulting fused feature maps can predict the surgery category and the bounding box position more accurately, solving the problem that the video features of surgical actions are not distinctive.
Description of Drawings

Fig. 1 is a flowchart of the training method of the surgical action recognition model of Embodiment 1 of the present invention;

Fig. 2 is a framework diagram of the surgical action recognition model trained in Embodiment 1 of the present invention;

Fig. 3 is a schematic structural diagram of the pyramid feature aggregation network of Embodiment 1 of the present invention;

Fig. 4 is a schematic structural diagram of the training device for the surgical action recognition model of Embodiment 2 of the present invention;

Fig. 5 is a functional block diagram of a computer device according to an embodiment of the present invention.
Detailed Description of the Embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it.
Before the embodiments of the present application are described in detail, its technical concept is briefly stated. Existing deep-learning-based detection methods depend on sufficient context information, but in real surgical scenes the camera is so close to the scene that effective context information is hard to extract and classification accuracy cannot be improved. The present application provides a training method for a surgical action recognition model: hierarchical feature maps of different scales are first extracted by a backbone network; the hierarchical feature maps are then fused by a pyramid feature aggregation network, which fully fuses feature map information across scales to obtain fused feature maps of different scales; finally, a prediction network performs prediction and the updated loss function is used to adjust the model parameters of the surgical action recognition model. This training method makes full use of the spatial information in the video, improves the ability of existing models to fuse multi-scale spatial information, and thereby improves the recognition accuracy and detection speed of the model.
Specifically, as shown in Fig. 1 and Fig. 2, the surgical action recognition model of Embodiment 1 comprises a backbone network, a pyramid feature aggregation network and a prediction network, wherein the pyramid feature aggregation network comprises a feature map collection module and a feature map divergence module, with a skip-connection fusion path between the input unit of the feature map collection module and the output unit of the feature map divergence module. The training method of the surgical action recognition model comprises the following steps:

Step S10: input the acquired original surgical action images into the backbone network to obtain several hierarchical feature maps of different scales;

Step S20: input the hierarchical feature maps into the pyramid feature aggregation network, where they undergo fusion processing by the feature map collection module and then the feature map divergence module, yielding several fused feature maps of different scales;

Step S30: input the fused feature maps of different scales into the prediction network to obtain predicted target values;

Step S40: update the loss function according to the predicted target values and the acquired real target values, and adjust the model parameters of the surgical action recognition model according to the updated loss function.
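By way of illustration only, the four steps can be expressed as a single training iteration. The following Python (PyTorch-style) sketch is not the patent's implementation; the names backbone, pyramid_agg, pred_head and detection_loss are hypothetical placeholders for the components described above, and targets stands for the acquired real target values.

```python
# Illustrative single training iteration covering steps S10-S40.
# All names are hypothetical placeholders, not taken from the patent.
import torch

def train_step(backbone, pyramid_agg, pred_head, detection_loss,
               optimizer, images, targets):
    c3, c4, c5 = backbone(images)                  # S10: hierarchical feature maps
    fused_maps = pyramid_agg([c3, c4, c5])         # S20: fused feature maps
    cls_preds, reg_preds = pred_head(fused_maps)   # S30: predicted target values
    loss = detection_loss(cls_preds, reg_preds, targets)  # S40: update the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # S40: adjust model parameters
    return loss.item()
```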
By way of example, in step S10 the backbone network processes the original surgical action images to obtain hierarchical feature maps at three scales, C3, C4 and C5; in step S20 each hierarchical feature map is then fed into the fusion unit of the corresponding scale, where the feature map information is fused.
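The excerpt does not name a specific backbone. Purely as an illustration, a ResNet-style network yields three such scales (strides 8, 16 and 32); the sketch below assumes torchvision's ResNet-50 and is not part of the disclosure.

```python
import torch
from torchvision.models import resnet50

class ResNetBackbone(torch.nn.Module):
    """Illustrative backbone returning C3, C4, C5 (strides 8, 16, 32)."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.layer1(self.stem(x))
        c3 = self.layer2(x)    # stride 8,  512 channels
        c4 = self.layer3(c3)   # stride 16, 1024 channels
        c5 = self.layer4(c4)   # stride 32, 2048 channels
        return c3, c4, c5
```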
Specifically, as shown in Fig. 3, the feature map collection module comprises a first column of pyramid layers P1, a second column P2 and a third column P3 whose numbers of fusion units decrease in that order, and the feature map divergence module comprises the third column P3, a fourth column P4 and a fifth column P5 whose numbers of fusion units increase in that order. The first column of pyramid layers is the input unit of the feature map collection module, and the fifth column is the output unit of the feature map divergence module; the first and fifth columns contain the same number of fusion units, as do the second and fourth columns, and the fusion units exchange information over a predetermined network of fusion paths. The whole pyramid feature aggregation network is butterfly-shaped, and the fusion units fully fuse feature map information of different scales.

By way of example, the first and fifth columns of pyramid layers each comprise five fusion units of different feature scales, the second and fourth columns each comprise three fusion units of different feature scales, and the third column has one fusion unit. Note that fusion units in the same row have the same scale and are also called fusion units of the same layer, and within each column the scale of the fusion units decreases from top to bottom.

Further, as shown in Fig. 2, the predetermined network of fusion paths comprises a first fusion path 11, a second fusion path 12, a third fusion path 13, a fourth fusion path 14, a fifth fusion path 15 and a skip-connection fusion path 16. The first fusion path 11 points bottom-up within a pyramid column from small-scale fusion units to large-scale fusion units; the second fusion path 12 diagonally connects fusion units of two adjacent layers, fusing feature map information of different scales between adjacent layers by downsampling; the third fusion path 13 diagonally connects fusion units of two adjacent layers, fusing feature map information of different scales between adjacent layers by upsampling; the fourth fusion path 14 horizontally connects fusion units of the same layer to fuse feature map information of the same scale; the fifth fusion path 15 points top-down within the first pyramid column from large-scale fusion units to small-scale fusion units; the skip-connection fusion path 16 connects fusion units of the same scale in the first and fifth columns of pyramid layers, i.e. it fuses the feature map information of the input unit and the output unit of the same layer so as to retain more of the original information.

By way of example, the five fusion units of the first pyramid column P1 are, in order of increasing scale from bottom to top, a first, second, third, fourth and fifth fusion unit, and the hierarchical feature maps of the three scales C5, C4 and C3 are input to the first, second and third fusion units respectively. The first, second and third fusion units are connected by the fifth fusion path, i.e. feature map information is passed by upsampling; the third, fourth and fifth fusion units are connected by the first fusion path, i.e. feature map information is passed by downsampling, so that the feature map information can be fused further.
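The excerpt specifies which fusion paths connect which units, but not the fusion arithmetic inside a unit. The sketch below assumes one common choice: every feature map arriving at a unit, over any of the six path types, is resized to the unit's own scale (nearest-neighbor upsampling for coarser maps, adaptive max pooling for finer maps), the resized maps are summed, and the sum is smoothed with a 3x3 convolution. The uniform channel width of 256 is likewise an assumption.

```python
import torch
import torch.nn.functional as F
from torch import nn

class FusionUnit(nn.Module):
    """One node of the butterfly pyramid. Only the connection pattern comes
    from the text; the fusion arithmetic here is an assumption. Expects at
    least one input feature map."""
    def __init__(self, channels=256):
        super().__init__()
        self.smooth = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, inputs, out_size):
        out = 0
        for f in inputs:
            if f.shape[-2:] == out_size:              # horizontal / skip paths
                out = out + f
            elif f.shape[-2] < out_size[0]:           # coarser map: upsampling paths
                out = out + F.interpolate(f, size=out_size, mode="nearest")
            else:                                     # finer map: downsampling paths
                out = out + F.adaptive_max_pool2d(f, out_size)
        return self.smooth(out)

# Usage: a unit at C4's scale fusing a coarser C5-level map and a same-scale map.
unit = FusionUnit(256)
c5_level = torch.randn(1, 256, 20, 20)
c4_level = torch.randn(1, 256, 40, 40)
fused = unit([c5_level, c4_level], out_size=(40, 40))
```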
Through the pyramid feature aggregation module, the multi-scale information of the feature maps is fully fused: the feature map information is collected in the initial columns and then diverges toward the output column, while the skip connections between the input and output columns retain the original information of the feature maps, yielding fused feature maps that carry richer information. The prediction network comprises two branch networks, used for the classification task and the regression task respectively. After the branch networks process the fused feature maps, the predicted target values are obtained; finally the loss function is updated according to the predicted target values, and the model parameters of the surgical action recognition model are adjusted according to the updated loss function. The procedure for adjusting model parameters is prior art and is not repeated here.
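The excerpt describes the two branches but not their internal layout. A conventional convolution-tower design is assumed below, producing per-location class scores $p_{x,y}$ and four box-regression values $t_{x,y}$ for every fused feature map, as consumed by the loss function described next; tower depth, channel width and the class count are assumptions.

```python
import torch
from torch import nn

class PredictionHead(nn.Module):
    """Illustrative two-branch head shared across all fused feature maps.
    Depth, width and num_classes are assumptions, not from the patent."""
    def __init__(self, channels=256, num_classes=10, tower_depth=4):
        super().__init__()
        def tower():
            layers = []
            for _ in range(tower_depth):
                layers += [nn.Conv2d(channels, channels, 3, padding=1),
                           nn.GroupNorm(32, channels),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.cls_tower, self.reg_tower = tower(), tower()
        self.cls_out = nn.Conv2d(channels, num_classes, 3, padding=1)  # p_{x,y}
        self.reg_out = nn.Conv2d(channels, 4, 3, padding=1)            # t_{x,y}

    def forward(self, fused_maps):
        cls_preds = [self.cls_out(self.cls_tower(f)) for f in fused_maps]
        reg_preds = [self.reg_out(self.reg_tower(f)) for f in fused_maps]
        return cls_preds, reg_preds
```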
By way of example, in step S40 the formula of the loss function is as follows:

$$L\big(\{p_{x,y}\},\{t_{x,y}\}\big) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\big(p_{x,y},\, c^{*}_{x,y}\big) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y} > 0\}}\, L_{reg}\big(t_{x,y},\, t^{*}_{x,y}\big)$$

where $L_{cls}$ is the Focal loss function, $L_{reg}$ is the IOU loss function, $N_{pos}$ is the number of positive samples, $\lambda$ is the balance weight of $L_{reg}$ and takes the value 1, $\sum_{x,y}$ sums the losses corresponding to all points $(x, y)$ on the feature map, $c^{*}_{x,y}$ is the ground-truth category at point $(x, y)$, $p_{x,y}$ is the predicted category at point $(x, y)$, $t^{*}_{x,y}$ is the ground-truth target box at point $(x, y)$, $t_{x,y}$ is the predicted target box at point $(x, y)$, and $\mathbb{1}_{\{c^{*}_{x,y} > 0\}}$ is an indicator function that equals 1 when $c^{*}_{x,y} > 0$ and 0 otherwise.

The general form of the Focal loss function in the above formula is:

$$L_{cls}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where the parameter $\alpha$ addresses the imbalance between positive and negative samples, and the confidence $p_t$ makes the model focus mainly on hard-to-classify samples, thereby solving the problem of imbalanced sample categories.
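In code, the loss can be sketched as follows, assuming predictions and ground truth have already been flattened over all points $(x, y)$ of the fused feature maps. The $-\log(\mathrm{IoU})$ form of the IOU loss on (left, top, right, bottom) distances and the defaults $\alpha = 0.25$, $\gamma = 2$ are common choices and are not taken from the excerpt.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, gt_classes, num_classes, alpha=0.25, gamma=2.0):
    """Sum of -alpha_t * (1 - p_t)^gamma * log(p_t); class 0 is background."""
    targets = F.one_hot(gt_classes, num_classes + 1)[:, 1:].float()
    p = logits.sigmoid()
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()

def iou_loss(pred, gt):
    """-log(IoU) on boxes encoded as (l, t, r, b) distances; positives only."""
    pred_area = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    gt_area = (gt[:, 0] + gt[:, 2]) * (gt[:, 1] + gt[:, 3])
    w_i = torch.min(pred[:, 0], gt[:, 0]) + torch.min(pred[:, 2], gt[:, 2])
    h_i = torch.min(pred[:, 1], gt[:, 1]) + torch.min(pred[:, 3], gt[:, 3])
    inter = w_i * h_i
    iou = inter / (pred_area + gt_area - inter + 1e-6)
    return -torch.log(iou + 1e-6).sum()

def detection_loss(cls_logits, reg_preds, gt_classes, gt_boxes, lam=1.0):
    """L = (1/N_pos) * sum L_cls + (lam/N_pos) * sum 1{c* > 0} * L_reg."""
    pos = gt_classes > 0                      # indicator 1{c*_{x,y} > 0}
    n_pos = pos.sum().clamp(min=1).float()    # N_pos
    loss_cls = focal_loss(cls_logits, gt_classes, cls_logits.shape[1])
    if pos.any():
        loss_reg = iou_loss(reg_preds[pos], gt_boxes[pos])
    else:
        loss_reg = reg_preds.sum() * 0.0      # keeps the graph when no positives
    return (loss_cls + lam * loss_reg) / n_pos
```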
The training method of the surgical action recognition model disclosed in Embodiment 1 fully fuses high-level and low-level semantic information through the improved pyramid feature aggregation network; the resulting fused feature maps can predict the surgery category and the bounding box position more accurately, solving the problem that the video features of surgical actions are not distinctive.
Embodiment 2 further discloses a training device for a surgical action recognition model. The training device comprises a first input unit 100, a second input unit 200, a third input unit 300 and a model training unit 400. The first input unit 100 is configured to input the acquired original surgical action images into the backbone network to obtain several hierarchical feature maps of different scales; the second input unit 200 is configured to input the hierarchical feature maps into the pyramid feature aggregation network, where they undergo fusion processing by the feature map collection module and then the feature map divergence module, yielding several fused feature maps of different scales; the third input unit 300 is configured to input the fused feature maps of different scales into the prediction network to obtain predicted target values; the model training unit 400 is configured to update the loss function according to the predicted target values and the acquired real target values, and to adjust the model parameters of the surgical action recognition model according to the updated loss function.
Further, Embodiment 3 discloses a computer-readable storage medium storing a training program of the surgical action recognition model; when the training program is executed by a processor, the above training method of the surgical action recognition model is implemented.
Further, the present application also discloses a computer device. At the hardware level, as shown in Fig. 5, the computer device comprises a processor 20, an internal bus 30, a network interface 40 and a computer-readable storage medium 50. The processor 20 reads the corresponding computer program from the computer-readable storage medium and runs it, forming a request processing apparatus at the logical level. Of course, besides software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or combinations of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units and may also be hardware or logic devices. The computer-readable storage medium 50 stores a training program of the surgical action recognition model; when the training program is executed by the processor, the above training method of the surgical action recognition model is implemented.
Computer-readable storage media include permanent and non-permanent, removable and non-removable media implemented by any method or technology for storing information. Information may be computer-readable instructions, data structures, program modules or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media, or other magnetic storage devices or any other non-transmission media that can be used to store information accessible by a computing device.
The specific embodiments of the present invention have been described in detail above. Although some embodiments have been shown and described, those skilled in the art should understand that these embodiments may be modified and refined without departing from the principle and spirit of the present invention, whose scope is defined by the claims and their equivalents; such modifications and refinements also fall within the protection scope of the present invention.

Claims (13)

1. A training method for a surgical action recognition model, wherein the surgical action recognition model includes a backbone network, a pyramid feature aggregation network, and a prediction network, the pyramid feature aggregation network includes a feature map collection module and a feature map divergence module, and a skip-connection fusion path runs between the input unit of the feature map collection module and the output unit of the feature map divergence module, the training method comprising:
    inputting the acquired original surgical action image into the backbone network to obtain several hierarchical feature maps of different scales;
    inputting the hierarchical feature maps into the pyramid feature aggregation network, where they pass in turn through the fusion processing of the feature map collection module and the feature map divergence module, to obtain several fused feature maps of different scales;
    inputting the several fused feature maps of different scales into the prediction network to obtain predicted target values; and
    updating a loss function according to the predicted target values and the acquired ground-truth target values, and adjusting the model parameters of the surgical action recognition model according to the updated loss function.
2. The training method for a surgical action recognition model according to claim 1, wherein the feature map collection module includes a first column of pyramid layers, a second column of pyramid layers, and a third column of pyramid layers with decreasing numbers of fusion units, and the feature map divergence module includes the third column of pyramid layers, a fourth column of pyramid layers, and a fifth column of pyramid layers with increasing numbers of fusion units; the first column of pyramid layers is the input unit of the feature map collection module, the fifth column of pyramid layers is the output unit of the feature map divergence module, the first and fifth columns have the same number of fusion units, the second and fourth columns have the same number of fusion units, and the fusion units pass information over a predetermined fusion path network.
3. The training method for a surgical action recognition model according to claim 2, wherein the first column of pyramid layers and the fifth column of pyramid layers each include five fusion units of different feature scales, the second column of pyramid layers and the fourth column of pyramid layers each include three fusion units of different feature scales, and the third column of pyramid layers has a single fusion unit.
4. The training method for a surgical action recognition model according to claim 3, wherein the predetermined fusion path network includes:
    a first fusion path, running bottom-up in the pyramid layers from small-scale fusion units to large-scale fusion units;
    a second fusion path, diagonally connecting fusion units in two adjacent layers and fusing feature map information of different scales between the adjacent layers by downsampling;
    a third fusion path, diagonally connecting fusion units in two adjacent layers and fusing feature map information of different scales between the adjacent layers by upsampling;
    a fourth fusion path, horizontally connecting fusion units in the same layer to fuse feature map information of the same scale;
    a fifth fusion path, running top-down in the first column of pyramid layers from large-scale fusion units to small-scale fusion units; and
    a skip-connection fusion path, connecting fusion units of the same scale in the first column of pyramid layers and the fifth column of pyramid layers.
5. The training method for a surgical action recognition model according to claim 4, wherein the hierarchical feature maps obtained by the backbone network have three scales, the five fusion units of the first column of pyramid layers are, in order of increasing scale from bottom to top, a first fusion unit, a second fusion unit, a third fusion unit, a fourth fusion unit, and a fifth fusion unit, and the hierarchical feature maps of the three scales are input to the first fusion unit, the second fusion unit, and the third fusion unit, respectively; the first fusion unit, the second fusion unit, and the third fusion unit are connected by the fifth fusion path, and the third fusion unit, the fourth fusion unit, and the fifth fusion unit are connected by the first fusion path.
6. The training method for a surgical action recognition model according to claim 4, wherein the loss function is:

$$L\left(\{P_{x,y}\},\{t_{x,y}\}\right)=\frac{1}{N_{pos}}\sum_{x,y}L_{cls}\left(P_{x,y},c_{x,y}^{*}\right)+\frac{\lambda}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{reg}\left(t_{x,y},t_{x,y}^{*}\right)$$

where $L_{cls}$ is the focal loss function; $L_{reg}$ is the IOU loss function; $N_{pos}$ denotes the number of positive samples; $\lambda$ is the balance weight of $L_{reg}$ and takes the value 1; $\sum_{x,y}$ sums the losses corresponding to all points $(x,y)$ on the feature map; $c_{x,y}^{*}$ is the ground-truth class corresponding to point $(x,y)$, and $P_{x,y}$ is the predicted class corresponding to point $(x,y)$; $t_{x,y}^{*}$ is the ground-truth target box corresponding to point $(x,y)$, and $t_{x,y}$ is the predicted target box corresponding to point $(x,y)$; and $\mathbb{1}_{\{c_{x,y}^{*}>0\}}$ is the indicator function, which equals 1 when $c_{x,y}^{*}>0$ and equals 0 otherwise.
7. A computer-readable storage medium, wherein the computer-readable storage medium stores a training program for a surgical action recognition model, and when the training program is executed by a processor, it implements the training method for a surgical action recognition model according to claim 1.
8. The computer-readable storage medium according to claim 7, wherein the feature map collection module includes a first column of pyramid layers, a second column of pyramid layers, and a third column of pyramid layers with decreasing numbers of fusion units, and the feature map divergence module includes the third column of pyramid layers, a fourth column of pyramid layers, and a fifth column of pyramid layers with increasing numbers of fusion units; the first column of pyramid layers is the input unit of the feature map collection module, the fifth column of pyramid layers is the output unit of the feature map divergence module, the first and fifth columns have the same number of fusion units, the second and fourth columns have the same number of fusion units, and the fusion units pass information over a predetermined fusion path network.
9. The computer-readable storage medium according to claim 8, wherein the first column of pyramid layers and the fifth column of pyramid layers each include five fusion units of different feature scales, the second column of pyramid layers and the fourth column of pyramid layers each include three fusion units of different feature scales, and the third column of pyramid layers has a single fusion unit.
10. The computer-readable storage medium according to claim 9, wherein the predetermined fusion path network includes:
    a first fusion path, running bottom-up in the pyramid layers from small-scale fusion units to large-scale fusion units;
    a second fusion path, diagonally connecting fusion units in two adjacent layers and fusing feature map information of different scales between the adjacent layers by downsampling;
    a third fusion path, diagonally connecting fusion units in two adjacent layers and fusing feature map information of different scales between the adjacent layers by upsampling;
    a fourth fusion path, horizontally connecting fusion units in the same layer to fuse feature map information of the same scale;
    a fifth fusion path, running top-down in the first column of pyramid layers from large-scale fusion units to small-scale fusion units; and
    a skip-connection fusion path, connecting fusion units of the same scale in the first column of pyramid layers and the fifth column of pyramid layers.
11. The computer-readable storage medium according to claim 10, wherein the hierarchical feature maps obtained by the backbone network have three scales, the five fusion units of the first column of pyramid layers are, in order of increasing scale from bottom to top, a first fusion unit, a second fusion unit, a third fusion unit, a fourth fusion unit, and a fifth fusion unit, and the hierarchical feature maps of the three scales are input to the first fusion unit, the second fusion unit, and the third fusion unit, respectively; the first fusion unit, the second fusion unit, and the third fusion unit are connected by the fifth fusion path, and the third fusion unit, the fourth fusion unit, and the fifth fusion unit are connected by the first fusion path.
12. The computer-readable storage medium according to claim 10, wherein the loss function is:

$$L\left(\{P_{x,y}\},\{t_{x,y}\}\right)=\frac{1}{N_{pos}}\sum_{x,y}L_{cls}\left(P_{x,y},c_{x,y}^{*}\right)+\frac{\lambda}{N_{pos}}\sum_{x,y}\mathbb{1}_{\{c_{x,y}^{*}>0\}}L_{reg}\left(t_{x,y},t_{x,y}^{*}\right)$$

where $L_{cls}$ is the focal loss function; $L_{reg}$ is the IOU loss function; $N_{pos}$ denotes the number of positive samples; $\lambda$ is the balance weight of $L_{reg}$ and takes the value 1; $\sum_{x,y}$ sums the losses corresponding to all points $(x,y)$ on the feature map; $c_{x,y}^{*}$ is the ground-truth class corresponding to point $(x,y)$, and $P_{x,y}$ is the predicted class corresponding to point $(x,y)$; $t_{x,y}^{*}$ is the ground-truth target box corresponding to point $(x,y)$, and $t_{x,y}$ is the predicted target box corresponding to point $(x,y)$; and $\mathbb{1}_{\{c_{x,y}^{*}>0\}}$ is the indicator function, which equals 1 when $c_{x,y}^{*}>0$ and equals 0 otherwise.
13. A computer device, wherein the computer device includes a computer-readable storage medium, a processor, and a training program for a surgical action recognition model stored in the computer-readable storage medium, and when the training program is executed by the processor, it implements the training method for a surgical action recognition model according to claim 1.
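The gather-and-scatter pyramid recited in claims 2-5 can be made concrete with a short sketch. The PyTorch code below is a much-simplified reading of those claims, not the patented design: the channel width `C`, the use of summed 3×3 convolutions as fusion units, the pooling and interpolation operators standing in for the diagonal fusion paths, the assignment of backbone scales to units, and any wiring between columns beyond what the claims recite are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 64  # assumed channel width shared by every fusion unit


class Fuse(nn.Module):
    """One fusion unit: sums its same-shape inputs, then applies a 3x3 conv."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(C, C, 3, padding=1)

    def forward(self, *xs):
        return self.conv(sum(xs))


def resize(x, ref):
    """Rescale x to ref's spatial size (the diagonal up/down-sampling paths)."""
    return F.interpolate(x, size=ref.shape[-2:], mode="nearest")


class PyramidAggregation(nn.Module):
    """Gather (5 -> 3 -> 1 fusion units), then scatter (1 -> 3 -> 5 units),
    with skip connections joining columns 1 and 5 at matching scales."""
    def __init__(self):
        super().__init__()
        self.col1 = nn.ModuleList([Fuse() for _ in range(5)])
        self.col2 = nn.ModuleList([Fuse() for _ in range(3)])
        self.col3 = Fuse()
        self.col4 = nn.ModuleList([Fuse() for _ in range(3)])
        self.col5 = nn.ModuleList([Fuse() for _ in range(5)])

    def forward(self, feats):          # three backbone maps, fine -> coarse
        f1, f2, f3 = feats
        # Column 1: backbone maps enter units 1-3 (claim 5); units 4 and 5
        # grow along the bottom-up first path, units 2 and 1 are refined
        # along the fifth path.  Scale orientation here is a guess.
        u3 = self.col1[2](f3)
        u4 = self.col1[3](F.max_pool2d(u3, 2))
        u5 = self.col1[4](F.max_pool2d(u4, 2))
        u2 = self.col1[1](f2, resize(u3, f2))
        u1 = self.col1[0](f1, resize(u2, f1))
        c1 = [u1, u2, u3, u4, u5]
        # Column 2: each of three units fuses three neighbouring scales of
        # column 1 via the diagonal second/third fusion paths.
        c2 = [self.col2[i](resize(c1[i], c1[i + 1]), c1[i + 1],
                           resize(c1[i + 2], c1[i + 1])) for i in range(3)]
        # Column 3: the single unit gathers all of column 2.
        c3 = self.col3(resize(c2[0], c2[1]), c2[1], resize(c2[2], c2[1]))
        # Column 4: scatter back to three scales (horizontal fourth path
        # from column 2 plus a diagonal path from column 3).
        c4 = [self.col4[i](c2[i], resize(c3, c2[i])) for i in range(3)]
        # Column 5: back to five scales; the skip-connection fusion path
        # re-injects the matching-scale unit of column 1.
        return [self.col5[j](c1[j], resize(c4[min(max(j - 1, 0), 2)], c1[j]))
                for j in range(5)]
```

A quick shape check: calling `PyramidAggregation()` on three maps of spatial sizes 64, 32, and 16 returns five fused maps from 64×64 down to 4×4, matching the 5-3-1-3-5 column layout of claim 3.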
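The loss of claims 6 and 12 has the shape of the FCOS detection objective: a focal classification term summed over every feature-map location, plus an IoU regression term gated by the indicator over positive samples, both normalised by N_pos. Below is a minimal sketch assuming predictions and targets have been flattened to one row per location (x, y) and that class 0 denotes background; `sigmoid_focal_loss` is torchvision's focal-loss implementation, and the −log(IoU) form of `iou_loss` is one common choice of IOU loss, not necessarily the applicants'.

```python
import torch
from torchvision.ops import sigmoid_focal_loss


def iou_loss(pred, target, eps=1e-7):
    """-log(IoU) between axis-aligned boxes given as (x1, y1, x2, y2)."""
    lt = torch.max(pred[:, :2], target[:, :2])           # intersection corners
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).clamp(min=0).prod(dim=1)
    iou = inter / (area_p + area_t - inter + eps)
    return -torch.log(iou + eps)


def detection_loss(cls_logits, boxes, gt_classes, gt_boxes, lam=1.0):
    """Focal loss over all locations + IoU loss over positives, / N_pos."""
    pos = gt_classes > 0                                  # indicator 1{c* > 0}
    n_pos = pos.sum().clamp(min=1)                        # N_pos
    onehot = torch.zeros_like(cls_logits)
    onehot[pos, gt_classes[pos] - 1] = 1.0                # class 0 = background
    l_cls = sigmoid_focal_loss(cls_logits, onehot, reduction="sum")
    l_reg = iou_loss(boxes[pos], gt_boxes[pos]).sum()     # gated regression
    return l_cls / n_pos + lam * l_reg / n_pos
```

With `lam=1.0` the sketch matches the claims' statement that the balance weight λ of L_reg takes the value 1; given random logits and well-formed (x1, y1, x2, y2) boxes it returns a scalar that can be backpropagated directly.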
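Finally, the four steps of claim 1 map onto a standard supervised training loop. The sketch below reuses `C`, `PyramidAggregation`, and `detection_loss` from the two sketches above; the toy convolutional backbone, the 1×1-conv prediction heads, the scoring of a single fused scale, and the per-location label tensors are illustrative simplifications, not the claimed networks.

```python
import torch
import torch.nn as nn


class ToyDetector(nn.Module):
    """Backbone -> pyramid aggregation -> prediction heads (claim 1)."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.stem = nn.Conv2d(3, C, 3, stride=2, padding=1)          # /2
        self.downs = nn.ModuleList([nn.Conv2d(C, C, 3, stride=2, padding=1)
                                    for _ in range(2)])              # /4, /8
        self.neck = PyramidAggregation()
        self.cls_head = nn.Conv2d(C, num_classes, 1)
        self.box_head = nn.Conv2d(C, 4, 1)

    def forward(self, images):
        f = self.stem(images)
        feats = [f]
        for d in self.downs:                  # hierarchical maps, 3 scales
            f = d(f)
            feats.append(f)
        fused = self.neck(feats)              # fused maps, 5 scales
        return [(self.cls_head(p), self.box_head(p)) for p in fused]


def train_step(model, opt, images, gt_classes, gt_boxes):
    cls_logits, boxes = model(images)[0]      # simplification: one scale only
    n = cls_logits.shape[1]
    loss = detection_loss(                    # one row per location (x, y)
        cls_logits.permute(0, 2, 3, 1).reshape(-1, n),
        boxes.permute(0, 2, 3, 1).reshape(-1, 4),
        gt_classes.reshape(-1),
        gt_boxes.reshape(-1, 4))
    opt.zero_grad()
    loss.backward()                           # adjust model parameters
    opt.step()
    return loss.item()
```

With `model = ToyDetector()` and `opt = torch.optim.SGD(model.parameters(), lr=0.01)`, each call to `train_step` performs the forward pass through backbone, pyramid aggregation, and prediction network, then updates the model parameters from the loss, as the final step of claim 1 requires.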
PCT/CN2021/096244 2021-05-24 2021-05-27 Training method of surgical action identification model, medium and device WO2022246720A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110565266.0 2021-05-24
CN202110565266.0A CN113705320A (en) 2021-05-24 2021-05-24 Training method, medium, and apparatus for surgical motion recognition model

Publications (1)

Publication Number Publication Date
WO2022246720A1

Family

ID=78648021

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096244 WO2022246720A1 (en) 2021-05-24 2021-05-27 Training method of surgical action identification model, medium and device

Country Status (2)

Country Link
CN (1) CN113705320A (en)
WO (1) WO2022246720A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472298A (en) * 2018-10-19 2019-03-15 天津大学 Depth binary feature pyramid for the detection of small scaled target enhances network
CN110766632A (en) * 2019-10-22 2020-02-07 广东启迪图卫科技股份有限公司 Image denoising method based on channel attention mechanism and characteristic pyramid
CN111291739A (en) * 2020-05-09 2020-06-16 腾讯科技(深圳)有限公司 Face detection and image detection neural network training method, device and equipment
CN111401517A (en) * 2020-02-21 2020-07-10 华为技术有限公司 Method and device for searching perception network structure
WO2020221990A1 (en) * 2019-04-30 2020-11-05 Huawei Technologies Co., Ltd. Facial localisation in images

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754403A (en) * 2018-11-29 2019-05-14 中国科学院深圳先进技术研究院 Tumour automatic division method and system in a kind of CT image
CN112614571B (en) * 2020-12-24 2023-08-18 中国科学院深圳先进技术研究院 Training method and device for neural network model, image classification method and medium

Also Published As

Publication number Publication date
CN113705320A (en) 2021-11-26


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21942300

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21942300

Country of ref document: EP

Kind code of ref document: A1