CN116472565A - Multi-view medical activity recognition system and method - Google Patents

Multi-view medical activity recognition system and method

Info

Publication number
CN116472565A
Authority
CN
China
Prior art keywords
data stream
data
scene
activity
classification
Prior art date
Legal status
Pending
Application number
CN202180076403.2A
Other languages
Chinese (zh)
Inventor
O. Mohareri
A. T. Schmidt
A. Sharghi Karganroodi
Current Assignee
Intuitive Surgical Operations Inc
Original Assignee
Intuitive Surgical Operations Inc
Priority date
Filing date
Publication date
Application filed by Intuitive Surgical Operations Inc filed Critical Intuitive Surgical Operations Inc
Priority claimed from PCT/US2021/059228 (WO2022104129A1)
Publication of CN116472565A

Landscapes

  • Image Analysis (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Endoscopes (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

Multi-view medical activity recognition systems and methods are described herein. In some illustrative examples, a system accesses a plurality of data streams representing images of a scene of a medical session captured by a plurality of sensors from a plurality of viewpoints. The system temporally aligns the plurality of data streams and uses a viewpoint-agnostic machine learning model to determine an activity within the scene based on the plurality of data streams.

Description

Multi-view medical activity recognition system and method
RELATED APPLICATIONS
The present application claims priority to U.S. Provisional Patent Application No. 63/141,830, filed January 26, 2021, U.S. Provisional Patent Application No. 63/141,853, filed January 26, 2021, and U.S. Provisional Patent Application No. 63/113,685, filed November 13, 2020, the contents of which are incorporated herein by reference in their entireties.
Background
Computer-implemented activity recognition typically involves capturing and processing images of a scene to determine characteristics of the scene. Conventional activity recognition may lack a desired level of accuracy and/or reliability for dynamic and/or complex environments. For example, some objects in dynamic and complex environments (such as those associated with surgical procedures) may be occluded from the view of an imaging device.
Disclosure of Invention
The following description presents a simplified summary of one or more aspects of the systems and methods described herein. This summary is not an extensive overview of all contemplated aspects and is intended neither to identify key or critical elements of all aspects nor to delineate the scope of any or all aspects. Its sole purpose is to present one or more aspects of the systems and methods described herein as a prelude to the more detailed description that is presented later.
An illustrative system includes a memory storing instructions and a processor communicatively coupled to the memory and configured to execute the instructions to access a plurality of data streams representing images of a scene of a medical session captured by a plurality of sensors from a plurality of viewpoints; temporally align the plurality of data streams; and determine an activity within the scene using a viewpoint-agnostic machine learning model and based on the plurality of data streams.
An illustrative method includes accessing, by a processor, a plurality of data streams representing images of a scene of a medical session captured by a plurality of sensors from a plurality of viewpoints; temporally aligning, by the processor, the plurality of data streams; and determining, by the processor, an activity within the scene using a viewpoint-agnostic machine learning model and based on the plurality of data streams.
An illustrative non-transitory computer-readable medium stores instructions executable by a processor to access a plurality of data streams representing images of a scene of a medical session captured by a plurality of sensors from a plurality of viewpoints; temporally align the plurality of data streams; and determine an activity within the scene using a viewpoint-agnostic machine learning model and based on the plurality of data streams.
Drawings
The accompanying drawings illustrate various embodiments and are a part of the specification. The illustrated embodiments are merely examples and do not limit the scope of the disclosure. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements.
Fig. 1 depicts an illustrative multi-view medical activity recognition system in accordance with principles described herein.
FIG. 2 depicts an illustrative processing system in accordance with the principles described herein.
Fig. 3-5 depict an illustrative multi-view medical activity recognition system in accordance with the principles described herein.
Fig. 6 depicts an illustrative computer-aided robotic surgical system in accordance with principles described herein.
Fig. 7 depicts an illustrative configuration of an imaging device attached to a robotic surgical system in accordance with principles described herein.
Fig. 8 depicts an illustrative method in accordance with the principles described herein.
FIG. 9 depicts an illustrative computing device in accordance with the principles described herein.
Detailed Description
Systems and methods for multi-view medical activity recognition are described herein. An activity recognition system may include a plurality of sensors, including at least two imaging devices configured to capture imagery of a scene from different, arbitrary viewpoints. The activity recognition system may determine, based on the captured imagery, an activity within the scene captured in the imagery. The activity may be determined using a viewpoint-agnostic machine learning model trained to fuse data based on the imagery and the activity. The viewpoint-agnostic model and/or system may be configured to receive any number of data streams from any locations and/or viewpoints, to fuse data from the data streams, and to determine the activity within the scene based on the fused data. As described herein, the machine learning model may be configured to fuse data and determine the activity within the scene in various ways.
In some examples, the scene may be of a medical session, such as a surgical session, and the activity may include a phase of the surgical session. Because the systems and methods described herein are viewpoint agnostic, they may be implemented in any suitable environment. Any suitable number and/or configuration of sensors may be deployed and used to capture data provided as input to the system, which may then determine the activity based on the data streams provided by the sensors.
The systems and methods described herein may provide various advantages and benefits. For example, the systems and methods described herein may provide accurate, dynamic, and/or flexible activity recognition in a variety of environments with a variety of sensor configurations. The illustrative examples of activity recognition described herein may be more accurate and/or flexible than conventional approaches based on a single sensor or on a fixed multi-sensor configuration. Illustrative examples of the systems and methods described herein may be well suited for activity recognition of dynamic and/or complex scenes, such as scenes associated with medical sessions.
Various illustrative embodiments will now be described in more detail. The disclosed systems and methods may provide one or more of the above benefits and/or various additional and/or alternative benefits that will be apparent herein.
Fig. 1 depicts an illustrative multi-view medical activity recognition system 100 (system 100). As shown, the system 100 may include a plurality of sensors positioned relative to a scene 104, such as imaging devices 102-1 and 102-2 (collectively, "imaging devices 102"), which may be configured to image the scene 104 by concurrently capturing images of the scene 104.
The scene 104 may include any environment and/or elements of an environment that may be imaged by the imaging devices 102. For example, the scene 104 may include a tangible, real-world scene of physical elements. In some illustrative examples, the scene 104 is associated with a medical session, such as a surgical session. For example, the scene 104 may include a surgical scene at a surgical facility (such as an operating room or the like). For instance, the scene 104 may include all or a portion of an operating room in which a surgical procedure may be performed on a patient. In some implementations, the scene 104 includes an area of an operating room proximate to a robotic surgical system used to perform a surgical procedure. In some implementations, the scene 104 includes an area within the patient. While certain illustrative examples described herein are directed to the scene 104 including a scene at a surgical facility, one or more principles described herein may be applied to other suitable scenes in other implementations.
The imaging devices 102 may include any imaging devices configured to capture images of the scene 104. For example, the imaging devices 102 may include video imaging devices, infrared imaging devices, visible light imaging devices, non-visible light imaging devices, intensity imaging devices (e.g., color, grayscale, black and white imaging devices), depth imaging devices (e.g., stereoscopic imaging devices, time-of-flight imaging devices, infrared imaging devices, etc.), endoscopic imaging devices, any other imaging devices, or any combination or sub-combination of such imaging devices. The imaging devices 102 may be configured to capture images of the scene 104 at any suitable capture rates. The imaging devices 102 may be synchronized in any suitable manner, where synchronization may include synchronizing the operation of the imaging devices and/or synchronizing the data sets output by the imaging devices by matching the data sets to common points in time.
Fig. 1 shows a simple configuration in which two imaging devices 102 are positioned to capture images of the scene 104 from two different viewpoints. This configuration is illustrative. It will be appreciated that a multi-sensor architecture, such as a multi-view architecture, may include two or more imaging devices 102 positioned to capture images of the scene 104 from two or more different viewpoints. For example, the system 100 may include any number of imaging devices 102 up to a predefined maximum that the system 100 is configured to receive. The predefined maximum may be based on the number of input ports for imaging devices 102, the maximum processing power of the system 100, the maximum communication bandwidth of the system 100, or any other such characteristic. The imaging devices 102 may be positioned at any locations that allow each respective imaging device 102 to capture images of the scene 104 from a particular viewpoint or viewpoints. Any suitable location of a sensor may be considered an arbitrary location, which may include a fixed location, a random location, and/or a dynamic location that is not dictated by the system 100. The viewpoint of an imaging device 102 (i.e., the position, orientation, and view settings, such as zoom, of the imaging device 102) determines the content of the images captured by that imaging device 102. The multi-sensor architecture may also include additional sensors positioned to capture data of the scene 104 from additional locations. Such additional sensors may include any suitable sensors configured to capture data, such as microphones, kinematic sensors (e.g., accelerometers, gyroscopes, sensors associated with robotic surgical systems, etc.), force sensors (e.g., sensors associated with surgical instruments, etc.), temperature sensors, motion sensors, non-imaging devices, additional imaging devices, other types of imaging devices, and the like.
The system 100 may include a processing system 106 communicatively coupled to the imaging device 102. The processing system 106 may be configured to access imagery captured by the imaging device 102 and determine activity of the scene 104, as further described herein.
Fig. 2 illustrates an example configuration of the processing system 106 of a multi-view medical activity recognition system (e.g., system 100). The processing system 106 may include, without limitation, a storage device 202 and a processing device 204 selectively and communicatively coupled to each other. The devices 202 and 204 may each include or be implemented by one or more physical computing devices including hardware and/or software components, such as processors, memory, storage drives, communication interfaces, instructions stored in memory for execution by the processors, and the like. Although the devices 202 and 204 are shown as separate devices in Fig. 2, the devices 202 and 204 may be combined into fewer devices, such as a single device, or divided into more devices, as may serve a particular implementation. In some examples, each of the devices 202 and 204 may be distributed among multiple devices and/or multiple locations, as may serve a particular implementation.
The storage device 202 may maintain (e.g., store) executable data used by the processing device 204 to perform any of the functions described herein. For example, the storage device 202 may store instructions 206 that may be executed by the processing device 204 to perform one or more of the operations described herein. The instructions 206 may be implemented by any suitable application, software, code, and/or other executable data instance. The storage device 202 may also maintain any data received, generated, managed, used, and/or transmitted by the processing device 204.
The processing device 204 may be configured to perform (e.g., execute the instructions 206 stored in the storage device 202 to perform) various operations associated with activity recognition, such as activity recognition of a scene of a medical session performed and/or facilitated by a computer-assisted surgical system.
These and other illustrative operations that may be performed by the processing system 106 (e.g., by the processing device 204 of the processing system 106) are described herein. In the following description, any reference to functions performed by the processing system 106 may be understood as being performed by the processing device 204 based on the instructions 206 stored in the storage device 202.
Fig. 3 illustrates an example configuration 300 of the processing system 106. As shown, the processing system 106 accesses imagery 302 (e.g., imagery 302-1 through 302-N) of a scene (e.g., scene 104) captured by imaging devices (e.g., imaging devices 102) of an activity recognition system (e.g., system 100). The processing system 106 includes an image alignment module 304 configured to temporally align the imagery 302. The processing system 106 also includes a machine learning model 306 configured to determine an activity within the scene based on the temporally aligned imagery 302.
For example, the processing system 106 may receive imagery 302-1 from the imaging device 102-1. The imagery 302-1 may include and/or be represented by any image data representative of a plurality of images, or one or more aspects of images, of the scene 104 captured by the imaging device 102-1. For example, the plurality of images may be a stream of images in the form of one or more video clips. Each video clip may include a time-ordered series of images captured over a period of time and may include any suitable number (e.g., 16, 32, etc.) of frames (e.g., images). A video clip may capture one or more activities performed in the scene 104. An activity may be any action performed by a person or system in the scene 104. In some examples, the scene 104 may depict a medical session, and the activity may be specific to an action performed in association with the medical session of the scene 104, such as a predefined phase of the medical session. For example, a particular surgical session may include 10-20 (or any other suitable number of) different predefined phases, such as sterile preparation, patient roll-in, surgery, etc., which may define the set of activities from which the system 100 classifies the activity of the scene 104 as captured in a particular video clip.
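For illustration only, the following Python sketch shows one way such per-view video clips and a predefined phase set might be represented; the class, function, and phase names (and the 16-frame default) are assumptions, not details from this disclosure.

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical phase set; an actual surgical session may define 10-20 phases.
SURGICAL_PHASES = ["sterile_preparation", "patient_roll_in", "surgery", "patient_roll_out"]

@dataclass
class VideoClip:
    view_id: int          # index of the imaging device / viewpoint that captured the clip
    frames: List[Any]     # time-ordered images, e.g., 16 or 32 frames per clip
    end_timestamp: float  # capture time of the last frame in the clip

def segment_stream(frames: List[Any], timestamps: List[float], view_id: int,
                   clip_len: int = 16) -> List[VideoClip]:
    """Split one image stream into consecutive fixed-length clips."""
    clips = []
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        clips.append(VideoClip(view_id=view_id,
                               frames=frames[start:start + clip_len],
                               end_timestamp=timestamps[start + clip_len - 1]))
    return clips
```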
The processing system 106 may access the image 302-1 (e.g., one or more video clips) in any suitable manner. For example, the processing system 106 may receive the image 302-1 from the imaging device 102-1, retrieve the image 302-1 from the imaging device 102-1, receive and/or retrieve the image 302-1 from a storage device and/or any other suitable device communicatively coupled to the imaging device 102-1, and the like.
The image alignment module 304 may access the imagery 302-1 and the imagery 302-2 through 302-N and temporally align the imagery 302. For example, the imagery 302-1 may include images of the scene 104 captured from a first viewpoint associated with the imaging device 102-1. The imagery 302-2 may include images of the scene 104 captured from a second viewpoint associated with the imaging device 102-2, and so on for each instance of imagery 302 (which may be captured by additional imaging devices not shown in Fig. 1). The image alignment module 304 may temporally align the imagery 302 such that aligned images (e.g., temporally aligned video frames) of the imagery 302 depict the same or substantially the same point in time of the scene 104 captured from the different viewpoints.
The image alignment module 304 may temporally align the imagery 302 in any suitable manner. For example, some or all of the images of the imagery 302 may include timestamps or other time information associated with the images, and the image alignment module 304 may use this information to align the imagery 302. For instance, one of the image streams of the imagery 302 (e.g., imagery 302-1) may be used as a primary image stream, and each of the other image streams (e.g., imagery 302-2 through imagery 302-N) may be aligned to the primary image stream by selecting, for each image of the primary image stream, the image with the nearest prior timestamp from each of the other image streams. In this manner, the image alignment module 304 may temporally align the imagery 302 in real time even if the image streams of the imagery 302 include different numbers of images, different frame rates, dropped images, and the like.
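As a concrete illustration of this nearest-prior-timestamp strategy, the sketch below aligns each image of a primary stream with the latest preceding image of a secondary stream; the function and variable names are assumed for illustration and do not appear in the disclosure.

```python
import bisect

def temporally_align(primary_timestamps, secondary_timestamps):
    """For each primary-stream timestamp, return the index of the secondary-stream
    image with the nearest prior (or equal) timestamp."""
    aligned = []
    for t in primary_timestamps:
        # bisect_right gives the insertion point; the element before it is the
        # latest secondary timestamp that is <= t.
        idx = bisect.bisect_right(secondary_timestamps, t) - 1
        aligned.append(max(idx, 0))  # fall back to the first image if none precedes t
    return aligned

# Example: a 30 fps primary stream aligned to a 15 fps secondary stream.
primary = [0.000, 0.033, 0.066, 0.100]
secondary = [0.000, 0.066]
print(temporally_align(primary, secondary))  # [0, 0, 1, 1]
```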
The machine learning model 306 may determine the activity of the scene 104 captured by the imagery 302 based on the temporally aligned imagery 302. The machine learning model 306 may determine the activity in any suitable manner, as further described herein. For example, the machine learning model 306 may be a viewpoint-agnostic machine learning model trained to determine the activity of the scene 104 based on the imagery 302, where the imagery 302 comprises any number of image streams captured from any viewpoints. Thus, the configuration of imaging devices 102 is not limited by the model to a fixed number of imaging devices 102 or to imaging devices 102 located only at certain fixed or relative positions; rather, the processing system 106 may be configured to receive input from any configuration of imaging devices 102 in any suitable medical setting and/or environment. For example, the system 100 may be or include a dynamic system, such as a system in which one or more imaging devices 102 have viewpoints that may be dynamically changed during a medical session (e.g., during any stage of the medical session, such as during pre-operative activities (e.g., setup activities), intra-operative activities, and/or post-operative activities). The viewpoint of an imaging device 102 may be dynamically changed in any manner that changes the field of view of the imaging device 102, such as by changing a position, pose, orientation, zoom setting, or other parameter of the imaging device 102. Further, while configuration 300 illustrates the imagery 302 as including image streams, the machine learning model 306 (and the processing system 106) may be configured to access any suitable data streams (e.g., audio data, kinematic data, etc.) captured from the scene 104 by any suitable sensors as described herein. The machine learning model 306 may be trained to further determine the activity of the scene 104 based on such data streams.
FIG. 4 illustrates an example configuration 400 of the processing system 106, which shows an example implementation of the machine learning model 306. As in configuration 300, configuration 400 shows the processing system 106 accessing the imagery 302 and the image alignment module 304 temporally aligning the imagery 302. Further, the processing system 106 is configured to use the machine learning model 306 to determine the activity of the scene 104 captured by the imagery 302. As shown, the machine learning model 306 includes activity recognition algorithms 402 (e.g., activity recognition algorithms 402-1 through 402-N), recurrent neural network (RNN) algorithms 404 (e.g., RNN algorithms 404-1 through 404-N), and a data fusion module 406.
As described, each instance of the imagery 302 may be an image stream including video clips. The machine learning model 306 uses the activity recognition algorithms 402 to extract features of the video clips of the respective image streams for use in determining the activity within the scene captured in the video clips. For example, the activity recognition algorithm 402-1 may extract features of a video clip of the imagery 302-1, the activity recognition algorithm 402-2 may extract features of a video clip of the imagery 302-2, and so on. The activity recognition algorithms 402 may be implemented by any suitable algorithm or algorithms, such as a fine-tuned I3D model or any other neural network or other algorithm. The activity recognition algorithms 402 may be instances of the same algorithm and/or may be implemented with different algorithms.
The activity recognition algorithms 402 each provide an output to a corresponding RNN algorithm 404. The RNN algorithms 404 may use the features extracted by the activity recognition algorithms 402 to determine corresponding classifications of the activity of the scene 104. For example, the RNN algorithm 404-1 may receive features extracted from the imagery 302-1 by the activity recognition algorithm 402-1 and determine a first classification of the activity of the scene 104 captured from a first viewpoint associated with the imaging device 102-1. Similarly, the RNN algorithm 404-2 may determine a second classification of the activity of the scene 104 captured from a second viewpoint associated with the imaging device 102-2 based on features extracted from the imagery 302-2 by the activity recognition algorithm 402-2, and so on through the RNN algorithm 404-N.
The RNN algorithms 404 may each provide their classification to the data fusion module 406, which may generate fused data for determining the activity of the scene 104. For example, the data fusion module 406 may receive a respective classification of the activity of the scene 104 from each of the RNN algorithms 404 and determine a final classification of the activity of the scene 104 based on the respective classifications. The data fusion module 406 may generate the fused data in any suitable manner to determine the final classification. For example, the data fusion module 406 may weight the classifications from the RNN algorithms 404 to determine the final classification.
Additionally, in some examples, the data fusion module 406 may receive additional information with each classification to generate fusion data to determine the activity of the scene 104. For example, the data fusion module 406 may also receive an activity visibility metric for each video clip or image stream that ranks the visibility of the activity of the scene 104 in the corresponding imagery. The activity visibility metric may include a score or any other metric that represents a ranking of the visibility of the activity of the scene 104 in the imagery. For example, the activity visibility metric may be based on general visibility of the image 302 and/or specific visibility of activity in the image 302. The general visibility may correspond to an overall level of visibility of any content of the image 302 in the image 302, while the particular visibility of the activity may be based on the level of visibility of the activity of the scene 104 in the image 302, which may be separate from the general visibility. Based on such activity visibility metrics, the data fusion module 406 may weight the classification determined from the imagery higher for relatively high activity visibility metrics and/or lower for relatively low activity visibility metrics.
Additionally or alternatively, the data fusion module 406 may receive confidence measures of the classifications generated by the RNN algorithm 404. The data fusion module 406 may further weight the classification based on the confidence measure. Additionally or alternatively, the data fusion module 406 may base the generation of fusion data and/or the determination of the activity of the scene 104 on any other such suitable information associated with the classification and/or the imagery.
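One simple way of combining per-view classifications with such visibility and confidence metrics is sketched below; the particular weighting scheme (the product of the two metrics, normalized across views) is an assumption chosen for illustration, as the disclosure leaves the exact weighting open.

```python
import numpy as np

def fuse_classifications(class_probs, visibility, confidence):
    """Weighted fusion of per-view classification vectors.

    class_probs: (num_views, num_classes) per-view class probabilities
    visibility:  (num_views,) activity visibility metric per view
    confidence:  (num_views,) confidence measure per view
    """
    class_probs = np.asarray(class_probs, dtype=float)
    weights = np.asarray(visibility, dtype=float) * np.asarray(confidence, dtype=float)
    weights = weights / weights.sum()
    fused = weights @ class_probs          # weighted average over views
    return int(np.argmax(fused)), fused    # final activity index and fused scores

# Two views: the second view sees the activity more clearly and is trusted more.
label, scores = fuse_classifications(
    class_probs=[[0.6, 0.4], [0.2, 0.8]],
    visibility=[0.3, 0.9],
    confidence=[0.5, 0.9],
)
```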
Further, the machine learning model 306 as shown includes multiple layers (e.g., stages) of algorithms. Such a layer may refer to an algorithm or process (e.g., the activity recognition algorithms 402, the RNN algorithms 404), represented as a "vertical" layer in configuration 400, and/or to a channel of data processing (e.g., the imagery 302-1 processed by the activity recognition algorithm 402-1, the RNN algorithm 404-1, etc.), represented as a "horizontal" layer in configuration 400. Other implementations of the machine learning model 306 may include additional, fewer, or different layers (e.g., different layer configurations). Further, the layers (horizontal and/or vertical) of the machine learning model 306 may be connected in any suitable manner such that connected layers may communicate and/or share data between or among the layers.
As one example implementation of configuration 400, each video clip of the imagery 302 may be represented as a synchronized clip $s_t^{(i,j)}$ of size $l_{clip}$ ending at time $t$, where $i$ denotes the viewpoint of the primary image stream and $j$ denotes the viewpoint of a secondary image stream aligned to the primary image stream.

The activity recognition algorithm 402 may be implemented with an I3D algorithm, which may be trained as a set of weights for an I3D model $f$ configured to receive video clips and output classifications. The video clips are transformed with the I3D model to generate a set of latent vectors $z$:

$$z_t^{(i,j)} = f\!\left(s_t^{(i,j)}\right).$$

These latent vectors may be input into an implementation of the RNN algorithm 404, denoted $g$, which uses the latent vectors, fully connected layers, and an RNN to estimate an output classification:

$$\hat{y}_t^{(i)} = \mathrm{fc}\!\left(g\!\left(z_t^{(i,j)}\right)\right),$$

where $\hat{y}_t^{(i)}$ is the estimated logit probability for the clip $s$ from viewpoint $i$, $g$ is the RNN model, and $\mathrm{fc}$ is a final fully connected layer that outputs logits of a size equal to the number of classes. The model $g$ generates a corresponding classification for each image stream (using a single-view version of the model, $g_{single}$) and adaptively fuses the classifications.

For example, each $g_{single}$ may be configured to produce a $d_{latent}$-dimensional output:

$$h_t^{(i)} = g_{single}\!\left(z_{\le t}^{(i)}\right),$$

where $g$ receives as input all previous frames of a single view $i$ and outputs a feature $h_t^{(i)}$, which is converted into a logit probability with fully connected layers. A fully connected layer may be used to obtain an estimated classification vector:

$$\hat{y}_t^{(i)} = \mathrm{fc}\!\left(h_t^{(i)}\right).$$

The data fusion module 406 may be implemented to generate

$$g_{multi} = \mathrm{mix}\!\left(g_{single}\!\left(z^{(i,0)}\right), \ldots, g_{single}\!\left(z^{(i,N)}\right)\right),$$

where $\mathrm{mix}$ receives a set of $d_{latent}$-sized vectors and fuses the vectors as a weighted sum:

$$\mathrm{mix}\!\left(h^{(0)}, \ldots, h^{(N)}\right) = \sum_{j=0}^{N} w_j\, h^{(j)}.$$

A fully connected layer may then output the final classification:

$$\hat{y}_t = \mathrm{fc}\!\left(g_{multi}\right).$$

The mixing weights $w$ may be predefined, such as $w_j = 1/N$, which results in an average pooling of the image streams. Additionally or alternatively, any other such predefined function may be used, such as a maximum function (e.g., selecting the most confident classification), and so forth.

Alternatively, the weights $w$ may be based on the input as described herein. For example, the weights may be determined using an attention algorithm, such as a weight vector defined by

$$w = \mathrm{softmax}\!\left(\frac{q\,k^{T}}{\sqrt{d_k}}\right),$$

where $q$ is a query vector estimated globally using average pooling of the latent vectors, $k$ is a matrix of per-view latent feature vectors, and $d_k$ is the dimension of the mixer module of the data fusion module 406. Thus, the example machine learning model 306 may be represented as

$$\hat{y}_t = \mathrm{fc}\!\left(\mathrm{mix}\!\left(g_{single}\!\left(f\!\left(s_t^{(i,0)}\right)\right), \ldots, g_{single}\!\left(f\!\left(s_t^{(i,N)}\right)\right)\right)\right).$$
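A compact PyTorch-style sketch of this attention-based fusion follows; it assumes the I3D backbone f has already produced the per-clip latent vectors, and the GRU choice, layer sizes, and class count are illustrative assumptions rather than values taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewActivityModel(nn.Module):
    """Per-view latent features (e.g., from an I3D backbone, assumed precomputed)
    are processed by a shared single-view recurrent model g_single, mixed with
    attention weights, and classified by a fully connected layer."""

    def __init__(self, d_feat=1024, d_latent=256, num_classes=15):
        super().__init__()
        self.g_single = nn.GRU(d_feat, d_latent, batch_first=True)
        self.query_proj = nn.Linear(d_latent, d_latent)
        self.key_proj = nn.Linear(d_latent, d_latent)
        self.fc = nn.Linear(d_latent, num_classes)

    def forward(self, z):
        # z: (batch, num_views, num_clips, d_feat) latent vectors per view and clip
        b, v, t, d = z.shape
        h, _ = self.g_single(z.reshape(b * v, t, d))      # shared single-view RNN
        h = h[:, -1, :].reshape(b, v, -1)                 # last hidden state per view
        q = self.query_proj(h.mean(dim=1, keepdim=True))  # global query via average pooling
        k = self.key_proj(h)                              # per-view keys
        attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (b, 1, v)
        fused = (attn @ h).squeeze(1)                     # weighted sum of view features
        return self.fc(fused)                             # class logits

# Works for any number of views, e.g., 3 views, 8 clips, 1024-dim features.
logits = MultiViewActivityModel()(torch.randn(2, 3, 8, 1024))
```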
FIG. 5 illustrates an example configuration 500 showing another example implementation of the machine learning model 306. Configuration 500 may be similar to configuration 300, including the processing system 106 and the image alignment module 304, which are not shown in FIG. 5. While configuration 400 illustrates the machine learning model 306 configured to generate fused data based on classifications determined from each instance (e.g., each data stream) of the imagery 302, configuration 500 illustrates the machine learning model 306 configured to generate fused data more directly based on the imagery 302 and features extracted from the imagery 302.
For example, as shown, the machine learning model 306 includes a data fusion module 502 (e.g., data fusion modules 502-1 through 502-4). The machine learning model 306 also includes feature processing modules 504 (e.g., feature processing modules 504-1 and 504-2), feature processing modules 506 (e.g., feature processing modules 506-1 and 506-2), and feature processing modules 508 (e.g., feature processing modules 508-1 and 508-2). Each of the data fusion modules 502 may be configured to receive data (e.g., images, features extracted from images, and/or other features), combine the data, and provide the data to one or more subsequent modules.
For example, the data fusion module 502-1 may access the imagery 302 (e.g., the imagery 302-1 and the imagery 302-2). The data fusion module 502-1 may generate fused data based on the imagery 302 and provide the fused data to the feature processing modules 504 and the data fusion module 502-2. The feature processing modules 504 may be configured to extract features from the imagery 302 based on the fused data received from the data fusion module 502-1. The data fusion module 502-2 may receive the fused data from the data fusion module 502-1 and the features extracted by the feature processing modules 504 and generate fused data based on some or all of these inputs. Further, the data fusion module 502-2 may output its fused data to the feature processing modules 506 and the data fusion module 502-3. The feature processing modules 506 may be configured to extract features (e.g., by dimensionality reduction, etc.) from the features extracted by the feature processing modules 504, based on the fused data generated by the data fusion module 502-2. Additionally or alternatively, the feature processing modules 506 (and the feature processing modules 504 and 508) may be configured to otherwise process features (e.g., by concatenation, addition, pooling, regression, etc.) based on the fused data.
Each data fusion module 502 may be configured to fuse data in any suitable manner. For example, the data fusion module 502 may include a machine learning algorithm trained to weight inputs based on the activity of the imagery 302 and the scene 104 captured by the imagery 302. The data fusion module 502 may be trained end-to-end to learn these configurations based on training data as described herein.
The machine learning model 306 also includes video long short-term memory (LSTM) modules 510 (e.g., video LSTMs 510-1 and 510-2) configured to determine classifications of the activity of the scene 104 captured by the imagery 302. For example, the video LSTM 510-1 may determine a first classification of the activity based on the imagery 302-1 and the features extracted and/or processed by the feature processing modules 504-1, 506-1, and 508-1. The video LSTM 510-2 may determine a second classification of the activity based on the imagery 302-2 and the features extracted and/or processed by the feature processing modules 504-2, 506-2, and 508-2. As shown, while the classifications of the video LSTMs 510 may each be based on a respective image stream of the imagery 302 (e.g., the video LSTM 510-1 on the imagery 302-1 and the video LSTM 510-2 on the imagery 302-2), because the feature processing modules 504-508 share the fused data generated by the data fusion modules 502, each respective classification may result in a more accurate determination of the activity of the scene 104 than a classification based solely on an individual image stream.
The machine learning model 306 also includes a global LSTM 512, the global LSTM 512 configured to determine a global classification of the activity of the scene 104 based on the fused data generated by the data fusion module 502-4. Since the global classification is based on the fused data, the global classification may be based on a determination of the activity of the scene 104 for both the image 302-1 and the image 302-2.
The machine learning model 306 also includes a data fusion module 514, the data fusion module 514 configured to receive the classification of the video LSTM 510 and the global classification of the global LSTM 512. Based on these classifications, the data fusion module 514 may determine a final classification to determine the activity of the scene 104, and the data fusion module 514 may determine the final classification in any suitable manner as described herein.
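The following heavily reduced PyTorch-style sketch illustrates the general shape of configuration 500 — fused data shared across per-view feature stages, per-view LSTMs, a global LSTM, and a final fusion. All dimensions, the use of a single fused stage, and the simple mean-based fusion are assumptions made for brevity, not details from the disclosure.

```python
import torch
import torch.nn as nn

class HierarchicalFusionSketch(nn.Module):
    """Per-view stages conditioned on fused data, per-view LSTM classifications,
    a global LSTM classification, and a final fusion of all classifications."""

    def __init__(self, d_feat=512, d_hidden=256, num_classes=15):
        super().__init__()
        self.stage = nn.Linear(d_feat + d_feat, d_feat)   # per-view stage conditioned on fused data
        self.video_lstm = nn.LSTM(d_feat, d_hidden, batch_first=True)
        self.global_lstm = nn.LSTM(d_feat, d_hidden, batch_first=True)
        self.video_fc = nn.Linear(d_hidden, num_classes)
        self.global_fc = nn.Linear(d_hidden, num_classes)

    def forward(self, feats):
        # feats: (batch, num_views, num_clips, d_feat) per-view clip features
        fused = feats.mean(dim=1)                                   # fused data shared across views
        per_view = []
        for v in range(feats.shape[1]):
            x = torch.relu(self.stage(torch.cat([feats[:, v], fused], dim=-1)))
            h, _ = self.video_lstm(x)                               # per-view temporal model
            per_view.append(self.video_fc(h[:, -1]))                # per-view classification
        g, _ = self.global_lstm(fused)
        global_logits = self.global_fc(g[:, -1])                    # global classification
        return torch.stack(per_view + [global_logits]).mean(dim=0)  # final fused classification

logits = HierarchicalFusionSketch()(torch.randn(2, 2, 8, 512))
```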
Although configuration 500 illustrates two image streams of imagery 302, machine learning model 306 may be configured to receive and utilize any arbitrary number of image streams and/or other data streams from any point of view to determine activity of scene 104. Further, while configuration 500 shows three stages of feature processing and four stages of data fusion module 502 between feature processing modules 504-508, machine learning model 306 may include any suitable number of feature processing modules and data fusion modules. For example, in some examples, the fused data may be generated on a subset of features and/or data (e.g., only on the image 302, only after the feature processing module 508, or any other suitable combination). Further, while configuration 500 includes video LSTM 510, in some examples, machine learning model 306 may omit video LSTM 510 (and data fusion module 514) and base the final classification on the global classification determined by global LSTM 512.
To determine the weights applied to the inputs to generate fused data, the machine learning model 306 may be trained based on training data. Once trained, the machine learning model 306 is configured to determine the weights applied to the inputs. For example, for configuration 400, the inputs may include the classifications, and the weights may be based on one or more of the classifications, the imagery, and/or the activity within the scene. For configuration 500, the inputs may include the imagery 302, features of the imagery 302, and/or the activity within the scene.
Machine learning model 306 may be trained end-to-end based on the labeled image set. Additionally or alternatively, particular modules and/or sets of modules (e.g., RNN algorithm 404 and/or data fusion module 406, any of data fusion module 502, video LSTM 510, and/or global LSTM 512) may be trained on the labeled image set to predict an activity classification based on image 302.
The training data sets may include imagery of medical sessions captured by imaging devices, such as imagery similar to the imagery 302. A training data set may further include a subset of the imagery of a medical session captured by the imaging devices. For example, a particular medical session may be captured by four imaging devices, and labeled video clips of the four image streams may form one training data set. A subset of the video clips comprising three of the four image streams may be used as another training data set. Thus, multiple training data sets may be generated from the same set of image streams. Additionally or alternatively, training data sets may be generated based on the image streams. For example, video clips from two or more image streams may be interpolated and/or otherwise processed to generate additional video clips that may be included in additional training data sets. In this way, the machine learning model 306 may be trained to be viewpoint agnostic, enabling the activity of a scene to be determined based on any number of image streams from any viewpoints. In some implementations, viewpoint agnostic may mean that any number of imaging devices capture imagery from predetermined viewpoints. In some implementations, viewpoint agnostic may mean that a predetermined number of imaging devices capture imagery from any location, orientation, and/or setting of the imaging devices 102. In some implementations, viewpoint agnostic may mean that any number of imaging devices capture imagery from any viewpoints of the imaging devices. Thus, a viewpoint-agnostic model may be agnostic to the number of imaging devices 102 and/or the viewpoints of those imaging devices 102.
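As a minimal sketch of this subset-based training-set construction, the following Python snippet enumerates view subsets of a four-camera session; the function name and the choice to enumerate all subsets down to a minimum view count are illustrative assumptions.

```python
from itertools import combinations

def view_subset_training_sets(stream_ids, min_views=1):
    """Enumerate subsets of the available image streams so that training examples
    cover varying numbers of views, one way of encouraging viewpoint-agnostic
    behavior."""
    subsets = []
    for k in range(min_views, len(stream_ids) + 1):
        subsets.extend(combinations(stream_ids, k))
    return subsets

# Four imaging devices: the full set, all 3-view subsets, all 2-view subsets, etc.
print(view_subset_training_sets(["OP", "USM1", "USM4", "BASE"], min_views=2))
```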
The system 100 may be associated with a computer-assisted robotic surgical system such as that shown in fig. 6. Fig. 6 shows an illustrative computer-aided robotic surgical system 600 ("surgical system 600"). The system 100 may be implemented by the surgical system 600, connected to the surgical system 600, and/or otherwise used in conjunction with the surgical system 600. For example, system 100 may be implemented by one or more components of surgical system 600 (such as a manipulation system, a user control system, or an auxiliary system). As another example, system 100 may be implemented by a stand-alone computing system communicatively coupled to a computer-assisted surgery system.
As shown, surgical system 600 may include a manipulation system 602, a user control system 604, and an auxiliary system 606 communicatively coupled to each other. The surgical team may utilize the surgical system 600 to perform a computer-assisted surgical procedure on the patient 608. As shown, the surgical team may include a surgeon 610-1, an assistant 610-2, a nurse 610-3, and an anesthesiologist 610-4, all of which may be collectively referred to as a "surgical team member 610". Additional or alternative surgical team members may be present during the surgical session.
While fig. 6 illustrates an ongoing minimally invasive surgical procedure, it should be appreciated that the surgical system 600 may similarly be used to perform open surgical procedures or other types of procedures that may similarly benefit from the accuracy and convenience of the surgical system 600. In addition, it will be appreciated that a medical session, such as a surgical session throughout which the surgical system 600 may be employed, may include not only an intraoperative phase of a surgical procedure (as shown in fig. 6), but may also include preoperative (which may include setup of the surgical system 600), postoperative, and/or other suitable phases of the surgical session.
As shown in fig. 6, the manipulation system 602 may include a plurality of manipulator arms 612 (e.g., manipulator arms 612-1 through 612-4) to which a plurality of surgical instruments may be coupled. Each surgical instrument may be implemented by any suitable surgical tool (e.g., a tool having tissue-interaction functions), medical tool, imaging device (e.g., an endoscope, an ultrasound tool, etc.), sensing instrument (e.g., a force-sensing surgical instrument), diagnostic instrument, or other instrument that may be used for a computer-assisted surgical procedure on the patient 608 (e.g., by being at least partially inserted into the patient 608 and manipulated to perform a computer-assisted surgical procedure on the patient 608). Although the manipulation system 602 is depicted and described herein as including four manipulator arms 612, it will be appreciated that the manipulation system 602 may include only a single manipulator arm 612 or any other number of manipulator arms, as may serve a particular implementation.
The manipulator arm 612 and/or a surgical instrument attached to the manipulator arm 612 may include one or more displacement transducers, orientation sensors, and/or position sensors for generating raw (i.e., uncorrected) kinematic information. One or more components of the surgical system 600 may be configured to utilize kinematic information to track (e.g., determine pose) and/or control a surgical instrument, as well as anything connected to the instrument and/or arm. As described herein, the system 100 can utilize kinematic information to track components of the surgical system 600 (e.g., the manipulator arm 612 and/or a surgical instrument attached to the manipulator arm 612).
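For illustration, a pose for such a tracked component (e.g., an imaging device mounted on a manipulator arm, as described with respect to fig. 7) might be obtained by composing the kinematic chain's link transforms; the 4x4 homogeneous-transform representation and function name below are assumptions, not details from the disclosure.

```python
import numpy as np

def imaging_device_pose(link_transforms):
    """Compose raw kinematic link transforms (base -> link1 -> ... -> mount point)
    into a single 4x4 pose for a component such as an arm-mounted imaging device."""
    pose = np.eye(4)
    for T in link_transforms:
        pose = pose @ T
    # The resulting pose could serve as the (possibly changing) viewpoint of the
    # imaging device for downstream processing.
    return pose
```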
The user control system 604 may be configured to facilitate control of the manipulator arm 612 and the surgical instrument attached to the manipulator arm 612 by the surgeon 610-1. For example, the surgeon 610-1 may interact with the user control system 604 to remotely move or manipulate the manipulator arm 612 and surgical instrument. To this end, the user control system 604 may provide an image (e.g., a high definition 3D image) of the surgical site associated with the patient 608 captured by the imaging system (e.g., endoscope) to the surgeon 610-1. In some examples, user control system 604 may include a stereoscopic viewer having two displays, wherein a surgical site associated with patient 608 and a stereoscopic image generated by the stereoscopic imaging system may be viewed by surgeon 610-1. The surgeon 610-1 may utilize the images displayed by the user control system 604 to perform one or more procedures with one or more surgical instruments attached to the manipulator arm 612.
To facilitate control of the surgical instrument, the user control system 604 may include a set of master controllers. These master controllers may be manipulated by the surgeon 610-1 to control movement of the surgical instrument (e.g., by utilizing robotic and/or teleoperational techniques). The master controller may be configured to detect a wide variety of hand, wrist, and finger movements of the surgeon 610-1. In this manner, surgeon 610-1 may intuitively perform the procedure using one or more surgical instruments.
The auxiliary system 606 may include one or more computing devices configured to perform the processing operations of the surgical system 600. In such a configuration, one or more computing devices included in auxiliary system 606 may control and/or coordinate operations performed by various other components of surgical system 600 (e.g., steering system 602 and user control system 604). For example, computing devices included in user control system 604 may communicate instructions to manipulation system 602 via one or more computing devices included in auxiliary system 606. As another example, the assistance system 606 may receive and process image data representing imagery captured by one or more imaging devices attached to the manipulation system 602.
In some examples, the assistance system 606 may be configured to present visual content to a surgical team member 610 that may not have access to the images provided to the surgeon 610-1 at the user control system 604. To this end, the assistance system 606 may include a display monitor 614, the display monitor 614 configured to display one or more user interfaces, such as images of the surgical site, information associated with the patient 608 and/or surgical procedure, and/or any other visual content that may serve a particular implementation. For example, the display monitor 614 may display an image of the surgical site, as well as additional content (e.g., graphical content, contextual information, etc.) that is displayed concurrently with the image. In some implementations, display monitor 614 is implemented by a touch screen display that surgical team member 610 may interact with (e.g., via touch gestures) to provide user input to surgical system 600.
The manipulation system 602, the user control system 604, and the auxiliary system 606 may be communicatively coupled to one another in any suitable manner. For example, as shown in fig. 6, the manipulation system 602, the user control system 604, and the auxiliary system 606 may be communicatively coupled via control lines 616, which may represent any wired or wireless communication links, as may serve a particular implementation. To this end, the manipulation system 602, the user control system 604, and the auxiliary system 606 may each include one or more wired or wireless communication interfaces, such as one or more local area network interfaces, Wi-Fi network interfaces, cellular interfaces, and the like.
In some examples, imaging devices, such as the imaging devices 102, may be attached to components of the surgical system 600 and/or to components of a surgical facility in which the surgical system 600 is located. For example, an imaging device may be attached to a component of the manipulation system 602.
Fig. 7 depicts an illustrative configuration 700 of imaging devices 102 (imaging devices 102-1 through 102-4) attached to components of the manipulation system 602. As shown, the imaging device 102-1 may be attached to an orientation platform (OP) 702 of the manipulation system 602, the imaging device 102-2 may be attached to the manipulator arm 612-1 of the manipulation system 602, the imaging device 102-3 may be attached to the manipulator arm 612-4 of the manipulation system 602, and the imaging device 102-4 may be attached to a base 704 of the manipulation system 602. The imaging device 102-1 attached to the OP 702 may be referred to as the OP imaging device, the imaging device 102-2 attached to the manipulator arm 612-1 may be referred to as the universal setup manipulator 1 (USM1) imaging device, the imaging device 102-3 attached to the manipulator arm 612-4 may be referred to as the universal setup manipulator 4 (USM4) imaging device, and the imaging device 102-4 attached to the base 704 may be referred to as the BASE imaging device. In implementations where the manipulation system 602 is positioned proximate to the patient (e.g., as a patient-side cart), placement of the imaging devices 102 at strategic locations on the manipulation system 602 provides advantageous imaging viewpoints proximate to the patient and a surgical procedure performed on the patient.
In some implementations, components of manipulation system 602 (or other robotic systems in other examples) may have redundant degrees of freedom that allow multiple configurations of components to reach the same output position of an end effector (e.g., instrument connected to manipulator arm 612) attached to the components. Thus, the processing system 106 can direct movement of the components of the manipulation system 602 without affecting the position of the end effector attached to the components. This may allow repositioning of the component to perform activity recognition without changing the position of the end effector attached to the component.
The illustrative placements of the imaging devices 102 on components of the manipulation system 602 are exemplary. Additional and/or alternative positioning of any suitable number of imaging devices 102 on the manipulation system 602, other components of the surgical system 600, and/or other components at a surgical facility may be used in other implementations. The imaging devices 102 may be attached to components of the manipulation system 602, other components of the surgical system 600, and/or other components at a surgical facility in any suitable manner.
Fig. 8 illustrates an exemplary method 800 of a multi-view medical activity recognition system. While fig. 8 illustrates exemplary operations according to one embodiment, other embodiments may omit, add to, reorder, combine, and/or modify any of the operations shown in fig. 8. One or more of the operations shown in fig. 8 may be performed by an activity recognition system, such as the system 100, any components included therein, and/or any implementation thereof.
In operation 802, the activity recognition system may access a plurality of data streams representing images of a scene of a medical session captured by a plurality of sensors from a plurality of viewpoints. Operation 802 may be performed in any of the ways described herein.
In operation 804, the activity recognition system may temporally align the plurality of data streams. Operation 804 may be performed in any of the ways described herein.
In operation 806, the activity recognition system may determine an activity within the scene using a viewpoint-agnostic machine learning model and based on the plurality of data streams. Operation 806 may be performed in any of the ways described herein.
The multi-view medical activity recognition principles, systems, and methods described herein may be used in a variety of applications. As an example, one or more activity recognition aspects described herein may be used to conduct surgical workflow analysis in real time or retrospectively. As another example, one or more activity recognition aspects described herein may be used for automatic transcription of surgical sessions (e.g., for documentation, further planning, and/or resource allocation purposes). As another example, one or more of the activity recognition aspects described herein may be used for automation of surgical subtasks. As another example, one or more of the activity recognition aspects described herein may be used for computer-aided setup of a surgical system and/or surgical facility (e.g., one or more operations of a robotic surgical system may be automated based on perception of a surgical scene and automatic movement of the robotic surgical system). These examples of applications of the activity recognition principles, systems, and methods described herein are exemplary. The activity recognition principles, systems, and methods described herein may be implemented for other suitable applications.
Further, while the activity recognition principles, systems, and methods described herein have focused on classification of activities of a scene captured by a sensor, similar principles, systems, and methods may be applied to any suitable scene-aware application (e.g., scene segmentation, object recognition, etc.).
Additionally, while the activity recognition principles, systems, and methods described herein generally have included machine learning models, similar principles, systems, and methods may be implemented using any suitable algorithm including any artificial intelligence algorithm and/or non-machine learning algorithm.
In some examples, a non-transitory computer-readable medium storing computer-readable instructions may be provided in accordance with the principles described herein. The instructions, when executed by a processor of a computing device, may direct the processor and/or the computing device to perform one or more operations, including one or more operations described herein. Such instructions may be stored and/or transmitted using any of a variety of known computer-readable media.
A non-transitory computer-readable medium as referred to herein may include any non-transitory storage medium that participates in providing data (e.g., instructions) that may be read and/or executed by a computing device (e.g., by a processor of the computing device). For example, a non-transitory computer readable medium may include, but is not limited to, any combination of non-volatile storage media and/or volatile storage media. Exemplary non-volatile storage media include, but are not limited to, read-only memory, flash memory, solid state drives, magnetic storage devices (e.g., hard disks, floppy disks, tape, etc.), ferroelectric random access memory ("RAM"), and optical disks (e.g., compact disks, digital video disks, blu-ray disks, etc.). Exemplary volatile storage media include, but are not limited to, RAM (e.g., dynamic RAM).
Fig. 9 illustrates an example computing device 900 that can be specifically configured to perform one or more of the processes described herein. Any of the systems, units, computing devices, and/or other components described herein may be implemented or realized by computing device 900.
As shown in fig. 9, computing device 900 may include a communication interface 902, a processor 904, a storage 906, and an input/output ("I/O") module 908 communicatively connected to each other via a communication infrastructure 910. While an exemplary computing device 900 is illustrated in fig. 9, the components illustrated in fig. 9 are not intended to be limiting. Additional or alternative components may be utilized in other embodiments. The components of computing device 900 shown in fig. 9 will now be described in more detail.
The communication interface 902 may be configured to communicate with one or more computing devices. Examples of communication interface 902 include, but are not limited to, a wired network interface (such as a network interface card), a wireless network interface (such as a wireless network interface card), a modem, an audio/video connection, and any other suitable interface.
Processor 904 generally represents any type or form of processing unit capable of processing data and/or interpreting, executing, and/or directing the execution of one or more of the instructions, processes, and/or operations described herein. The processor 904 may perform operations by executing computer-executable instructions 912 (e.g., applications, software, code, and/or other executable data examples) stored in the storage 906.
Storage 906 may include one or more data storage media, devices, or configurations, and may include any type, form, and combination of data storage media and/or devices. For example, storage 906 may include, but is not limited to, any combination of the non-volatile media and/or volatile media described herein. Electronic data, including data described herein, may be temporarily and/or permanently stored in storage 906. For example, data representing computer-executable instructions 912 configured to direct processor 904 to perform any of the operations described herein may be stored within storage 906. In some examples, the data may be arranged in one or more databases residing within storage 906.
The I/O modules 908 may include one or more I/O modules configured to receive user input and provide user output. The I/O module 908 may include any hardware, firmware, software, or combination thereof that supports input and output capabilities. For example, the I/O module 908 may include hardware and/or software for capturing user input, including but not limited to a keyboard or keypad, a touch screen component (e.g., a touch screen display), a receiver (e.g., an RF or infrared receiver), a motion sensor, and/or one or more input buttons.
The I/O module 908 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., a display driver), one or more audio speakers, and one or more audio drivers. In some implementations, the I/O module 908 is configured to provide graphical data to a display for presentation to a user. The graphical data may represent one or more graphical user interfaces and/or any other graphical content that may serve a particular implementation.
In some examples, any of the systems, modules, and/or apparatuses described herein may be implemented by or within one or more components of computing device 900. For example, one or more applications 912 residing within the storage 906 may be configured to direct the processor 904 to perform one or more operations or functions associated with the processing system 108 of the system 100.
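By way of a minimal sketch (Python; the class, function, and key names below are illustrative assumptions rather than identifiers from this disclosure), the relationship described above can be pictured as instructions held in storage that the processor looks up and executes in order to carry out an operation of the processing system:

from typing import Callable, Dict, Sequence

class ComputingDevice:
    """Loose analogue of the device described above: storage holds executable
    instructions, and the processor performs operations by executing them."""

    def __init__(self, storage: Dict[str, Callable]):
        self.storage = storage  # analogue of instructions 912 held in storage 906

    def execute(self, operation: str, *args):
        # The "processor" performs an operation by executing the stored instruction.
        return self.storage[operation](*args)

def recognize_activity(data_streams: Sequence) -> str:
    # Placeholder for processing-system operations (access, align, classify).
    return "activity-label"

device = ComputingDevice({"recognize_activity": recognize_activity})
label = device.execute("recognize_activity", [])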
As mentioned, one or more operations described herein may be performed during a medical session, e.g., dynamically, in real time, and/or in near real time. As used herein, an operation described as occurring "in real time" will be understood to be performed immediately and without excessive delay, even though absolute zero delay is not possible.
Any of the systems, devices, and/or components thereof may be implemented in any suitable combination or sub-combination. For example, any of the systems, devices, and/or components thereof may be implemented as an apparatus configured to perform one or more of the operations described herein.
In the preceding description, various exemplary embodiments have been described. It will, however, be evident that various modifications and changes may be made thereto, and additional embodiments may be implemented, without departing from the scope of the invention as set forth in the appended claims. For example, certain features of one embodiment described herein may be combined with or substituted for features of another embodiment described herein. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (26)

1. A system, comprising:
a memory storing instructions;
a processor communicatively coupled to the memory and configured to execute the instructions to:
access a plurality of data streams representing images of a scene of a medical session captured by a plurality of sensors from a plurality of viewpoints;
temporally align the plurality of data streams; and
determine, using a viewpoint-agnostic machine learning model and based on the plurality of data streams, an activity within the scene.
2. The system of claim 1, wherein:
the machine learning model is configured to generate fused data based on the plurality of data streams; and
the determining of the activity within the scene is based on the fused data.
3. The system of claim 2, wherein:
the plurality of data streams includes a first data stream and a second data stream;
the machine learning model is further configured to:
determining a first classification of the activity within the scene based on the first data stream, and
determining a second classification of the activity within the scene based on the second data stream; and
the generating of the fused data includes combining the first classification and the second classification with weights determined based on the first data stream, the second data stream, and the activity within the scene.
4. The system of claim 2, wherein:
the plurality of data streams includes a first data stream and a second data stream; and
the generating of the fused data includes:
determining a global classification of the activity within the scene based on the first data stream and the second data stream,
determining a first classification of the activity within the scene based on the first data stream and the global classification,
determining a second classification of the activity within the scene based on the second data stream and the global classification, and
combining the first classification, the second classification, and the global classification with weights determined based on the first data stream, the second data stream, and the activity within the scene.
5. The system of claim 4, wherein the determining a global classification comprises combining, for a point in time, respective temporally aligned data from the first data stream and the second data stream corresponding to the point in time using weights determined based on the first data stream, the second data stream, and the activity within the scene.
6. The system of claim 4, wherein the determining a global classification comprises:
extracting a first feature from data of the first data stream;
extracting a second feature from data of the second data stream; and
combining the first feature and the second feature with weights determined based on the first data stream, the second data stream, and the activity within the scene.
7. The system of claim 1, wherein the determining the activity within the scene is performed during the activity within the scene.
8. The system of claim 1, wherein the plurality of data streams further comprises a data stream representing data captured by a non-imaging sensor.
9. The system of claim 1, wherein the viewpoint-agnostic machine learning model is agnostic to a number of the plurality of sensors.
10. The system of claim 1, wherein the viewpoint-agnostic machine learning model is agnostic to locations of the plurality of sensors.
11. A method, comprising:
accessing, by a processor, a plurality of data streams representing images of a scene of a medical session captured by a plurality of sensors from a plurality of viewpoints;
temporally aligning, by the processor, the plurality of data streams; and
determining, by the processor, using a viewpoint-agnostic machine learning model and based on the plurality of data streams, an activity within the scene.
12. The method according to claim 11, wherein:
the machine learning model is configured to generate fused data based on the plurality of data streams; and
the determining of the activity within the scene is based on the fused data.
13. The method according to claim 12, wherein:
the plurality of data streams includes a first data stream and a second data stream;
the machine learning model is further configured to:
determining a first classification of the activity within the scene based on the first data stream, and
determining a second classification of the activity within the scene based on the second data stream; and
the generating of the fused data includes combining the first classification and the second classification with weights determined based on the first data stream, the second data stream, and the activity within the scene.
14. The method according to claim 12, wherein:
the plurality of data streams includes a first data stream and a second data stream; and
the generating of the fused data includes:
determining a global classification of the activity within the scene based on the first data stream and the second data stream,
determining a first classification of the activity within the scene based on the first data stream and the global classification,
determining a second classification of the activity within the scene based on the second data stream and the global classification, and
combining the first classification, the second classification, and the global classification with weights determined based on the first data stream, the second data stream, and the activity within the scene.
15. The method of claim 14, wherein the determining a global classification comprises combining, for a point in time, respective temporally aligned data from the first data stream and the second data stream corresponding to the point in time with weights determined based on the first data stream, the second data stream, and the activity within the scene.
16. The method of claim 14, wherein the determining a global classification comprises:
extracting a first feature from data of the first data stream;
extracting a second feature from data of the second data stream; and
combining the first feature and the second feature with weights determined based on the first data stream, the second data stream, and the activity within the scene.
17. The method of claim 11, wherein the determining the activity within the scene is performed during the activity within the scene.
18. The method of claim 11, wherein the plurality of data streams further comprises a data stream representing data captured by a non-imaging sensor.
19. A non-transitory computer-readable medium storing instructions executable by a processor to:
access a plurality of data streams representing images of a scene of a medical session captured by a plurality of sensors from a plurality of viewpoints;
temporally align the plurality of data streams; and
determine, using a viewpoint-agnostic machine learning model and based on the plurality of data streams, an activity within the scene.
20. The non-transitory computer-readable medium of claim 19, wherein:
the machine learning model is configured to generate fused data based on the plurality of data streams; and
the determining of the activity within the scene is based on the fused data.
21. The non-transitory computer-readable medium of claim 20, wherein:
the plurality of data streams includes a first data stream and a second data stream;
the machine learning model is further configured to:
determining a first classification of the activity within the scene based on the first data stream, and
determining a second classification of the activity within the scene based on the second data stream; and
the generating of the fused data includes combining the first classification and the second classification with weights determined based on the first data stream, the second data stream, and the activity within the scene.
22. The non-transitory computer-readable medium of claim 20, wherein:
the plurality of data streams includes a first data stream and a second data stream; and
the generating of the fused data includes:
determining a global classification of the activity within the scene based on the first data stream and the second data stream,
determining a first classification of the activity within the scene based on the first data stream and the global classification,
determining a second classification of the activity within the scene based on the second data stream and the global classification, and
combining the first classification, the second classification, and the global classification with weights determined based on the first data stream, the second data stream, and the activity within the scene.
23. The non-transitory computer-readable medium of claim 22, wherein the determining a global classification comprises combining, for a point in time, respective temporally aligned data from the first data stream and the second data stream corresponding to the point in time with weights determined based on the first data stream, the second data stream, and the activity within the scene.
24. The non-transitory computer-readable medium of claim 22, wherein the determining a global classification comprises:
extracting a first feature from data of the first data stream;
extracting a second feature from data of the second data stream; and
combining the first feature and the second feature with weights determined based on the first data stream, the second data stream, and the activity within the scene.
25. The non-transitory computer-readable medium of claim 19, wherein the determining the activity within the scene is performed during the activity within the scene.
26. The non-transitory computer-readable medium of claim 19, wherein the plurality of data streams further comprises a data stream representing data captured by a non-imaging sensor.
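The hierarchical fusion recited in claims 4 through 6 may be easier to follow alongside a minimal sketch (Python with NumPy; every function name, array shape, and the toy weighting scheme below are assumptions made for illustration and are not taken from the claims or the disclosure): frames from the streams are temporally aligned, a global classification is computed from combined per-view features, per-view classifications are conditioned on that global classification, and all classifications are combined with weights derived from the stream data and the current activity estimate. Because the per-view path is shared and the number of views is simply the length of the input list, the sketch is also indifferent to how many sensors contribute (compare claims 9 and 10).

import numpy as np

rng = np.random.default_rng(0)
N_CLASSES = 4   # illustrative number of activity classes
FEAT_DIM = 32   # illustrative feature dimensionality

def temporally_align(streams, timestamps, t):
    """For each stream, select the frame whose timestamp is closest to time t."""
    return [s[int(np.argmin(np.abs(ts - t)))] for s, ts in zip(streams, timestamps)]

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def extract_features(frame):
    """Stand-in for a per-view feature extractor (e.g., a CNN backbone)."""
    v = np.asarray(frame, dtype=float).reshape(-1)[:FEAT_DIM]
    return v / (np.linalg.norm(v) + 1e-8)

def recognize_activity(frames, params):
    """Global classification, per-view classifications conditioned on it, and a
    weighted combination of all classifications."""
    feats = [extract_features(f) for f in frames]

    # Global classification from the combined (here: averaged) per-view features.
    global_feat = np.mean(feats, axis=0)
    global_cls = softmax(params["w_global"] @ global_feat)

    # Per-view classifications conditioned on the global classification.
    view_cls = [softmax(params["w_view"] @ np.concatenate([f, global_cls]))
                for f in feats]

    # Weights derived from each view's data and the current activity estimate.
    inputs = [np.concatenate([f, global_cls]) for f in feats]
    inputs.append(np.concatenate([global_feat, global_cls]))
    weights = softmax(np.array([params["w_score"] @ x for x in inputs]))

    all_cls = view_cls + [global_cls]
    fused = sum(w * c for w, c in zip(weights, all_cls))
    return int(np.argmax(fused)), fused

# Toy usage with two simulated camera streams at different frame rates.
params = {
    "w_global": rng.normal(size=(N_CLASSES, FEAT_DIM)),
    "w_view":   rng.normal(size=(N_CLASSES, FEAT_DIM + N_CLASSES)),
    "w_score":  rng.normal(size=FEAT_DIM + N_CLASSES),
}
streams = [rng.random((30, 8, 8)), rng.random((15, 8, 8))]        # frames per view
timestamps = [np.linspace(0.0, 1.0, 30), np.linspace(0.0, 1.0, 15)]
frames_at_t = temporally_align(streams, timestamps, t=0.5)
label, probs = recognize_activity(frames_at_t, params)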
CN202180076403.2A 2020-11-13 2021-11-12 Multi-view medical activity recognition system and method Pending CN116472565A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US63/113,685 2020-11-13
US202163141853P 2021-01-26 2021-01-26
US63/141,853 2021-01-26
US63/141,830 2021-01-26
PCT/US2021/059228 WO2022104129A1 (en) 2020-11-13 2021-11-12 Multi-view medical activity recognition systems and methods

Publications (1)

Publication Number Publication Date
CN116472565A true CN116472565A (en) 2023-07-21

Family

ID=87177483

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202180076376.9A Pending CN116508070A (en) 2020-11-13 2021-11-12 Visibility metrics in multi-view medical activity recognition systems and methods
CN202180076403.2A Pending CN116472565A (en) 2020-11-13 2021-11-12 Multi-view medical activity recognition system and method

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202180076376.9A Pending CN116508070A (en) 2020-11-13 2021-11-12 Visibility metrics in multi-view medical activity recognition systems and methods

Country Status (1)

Country Link
CN (2) CN116508070A (en)

Also Published As

Publication number Publication date
CN116508070A (en) 2023-07-28

Similar Documents

Publication Publication Date Title
KR102013806B1 (en) Method and apparatus for generating artificial data
US11730545B2 (en) System and method for multi-client deployment of augmented reality instrument tracking
US20210290317A1 (en) Systems and methods for tracking a position of a robotically-manipulated surgical instrument
KR101926123B1 (en) Device and method for segmenting surgical image
US11896441B2 (en) Systems and methods for measuring a distance using a stereoscopic endoscope
US10078906B2 (en) Device and method for image registration, and non-transitory recording medium
US20230114385A1 (en) Mri-based augmented reality assisted real-time surgery simulation and navigation
US20220215539A1 (en) Composite medical imaging systems and methods
US20220392084A1 (en) Scene perception systems and methods
US20230410491A1 (en) Multi-view medical activity recognition systems and methods
US10854005B2 (en) Visualization of ultrasound images in physical space
CN116472565A (en) Multi-view medical activity recognition system and method
US20210287434A1 (en) System and methods for updating an anatomical 3d model
WO2020205829A1 (en) Using model data to generate an enhanced depth map in a computer-assisted surgical system
US20230145531A1 (en) Systems and methods for registering visual representations of a surgical space
WO2024058965A1 (en) Determination of a contour physical distance within a subject based on a deformable three-dimensional model
US11488382B2 (en) User presence/absence recognition during robotic surgeries using deep learning
Inácio et al. Augmented Reality in Surgery: A New Approach to Enhance the Surgeon's Experience
WO2024072689A1 (en) Systems and methods for determining a force applied to an anatomical object within a subject based on a deformable three-dimensional model
JP2022118706A (en) Real-time biometric image recognition method and device
WO2024098058A1 (en) Apparatus and method for interactive three-dimensional surgical guidance
EP4125683A1 (en) Systems and methods for optimizing configurations of a computer-assisted surgical system for reachability of target objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination