WO2024036367A1 - Object detection and point localization in medical images using structured prediction - Google Patents

Object detection and point localization in medical images using structured prediction

Info

Publication number
WO2024036367A1
WO2024036367A1 · PCT/AU2023/050770
Authority
WO
WIPO (PCT)
Prior art keywords
image
prediction model
structured prediction
medical image
dimensional
Prior art date
Application number
PCT/AU2023/050770
Other languages
French (fr)
Inventor
Xavier Holt
Jarrel Seah
Cyril Tang
Original Assignee
Annalise-Ai Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2022902321A external-priority patent/AU2022902321A0/en
Application filed by Annalise-Ai Pty Ltd filed Critical Annalise-Ai Pty Ltd
Publication of WO2024036367A1 publication Critical patent/WO2024036367A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/12Edge-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00ICT specially adapted for the handling or processing of medical images
    • G16H30/40ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10072Tomographic images
    • G06T2207/10081Computed x-ray tomography [CT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10072Tomographic images
    • G06T2207/10088Magnetic resonance imaging [MRI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10116X-ray image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • G06T2207/30021Catheter; Guide wire
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03Recognition of patterns in medical or anatomical images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03Recognition of patterns in medical or anatomical images
    • G06V2201/034Recognition of patterns in medical or anatomical images of medical instruments

Definitions

  • Medical images, such as chest X-rays, often show not only the patient’s tissues, but also other paraphernalia such as intravenous (IV) lines, drainage tubes, electrocardiogram (EKG) wires, etc., which are generically referred to as polylines.
  • IV: intravenous
  • EKG: electrocardiogram
  • Visual analysis of a medical image can be streamlined and enhanced by automatic detection and labelling of polylines and other visual artifacts, e.g., so that a human screener does not have to spend time on this task, and also to avoid human error.
  • a method for object detection and point localization in images using structured prediction includes receiving an image comprising a plurality of points, inputting the image into a structured prediction model for identifying objects within the image, wherein the structured prediction model comprises a transformer model, and receiving, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object.
  • a system for object detection and point localization in images using structured prediction includes a memory and at least one processor communicatively coupled to the memory.
  • the at least one processor is configured to receive an image comprising a plurality of points, input the image into a structured prediction model for identifying objects within the image, and receive, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object and the structured prediction model comprises a transformer model.
  • a method for object detection and point localization in images using structured prediction includes receiving an image comprising a plurality of points, inputting the image into a structured prediction model for identifying objects within the image, and receiving, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object, wherein inputting the image into the structured prediction model comprises inputting the image into a recurrent neural network (RNN) that processes the image to generate the set of one or more object descriptors.
  • RNN: recurrent neural network
  • FIG. 1 is a system diagram of a system for object detection and point localization in medical images using structured prediction, according to aspects of the disclosure.
  • FIG. 2 illustrates one of the differences between conventional object detection systems and the techniques of the present disclosure.
  • FIG. 3 is a block diagram of an example transformer, according to aspects of the disclosure.
  • FIG. 4 illustrates an example encoder in more detail, according to aspects of the disclosure.
  • FIG. 5 illustrates an example decoder in more detail, according to aspects of the disclosure.
  • FIG. 6 illustrates an example of an application of a transformer as a classifier or predictor, according to aspects of the disclosure.
  • FIG. 7 is a flowchart of an example process associated with object detection and point localization in medical images using structured prediction, according to aspects of the disclosure.
  • FIG. 8 compares a chest X-ray that was annotated by a human and the same X-ray that was annotated by the structured prediction model, according to aspects of the disclosure.
  • a system may receive an image comprising a plurality of points.
  • the system may input the image into a structured prediction model for identifying objects within the image.
  • the structured prediction model comprises a transformer model.
  • the system may receive, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object.
  • the input image is displayed to a user, showing the identified objects within the image.
  • DSP: digital signal processor
  • ASIC: application-specific integrated circuit
  • FPGA: field-programmable gate array
  • a general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
  • FIG. 1 is a system diagram of system 100 for object detection and point localization in medical images using structured prediction, according to aspects of the disclosure.
  • system 100 includes a structured prediction model 102 and one or more processor(s) 104.
  • the structured prediction model 102 receives as input an image 106, which may be a medical image, such as an X-ray image, and identifies objects within the image 106.
  • the processor(s) 104 combines the original image 106 with a graphic representation of the objects identified within the image 106 by the structured prediction model 102.
  • the processor(s) 104 provides the combined image, e.g., to a display 108 for viewing by the user.
  • the combined image may also be referred to as an annotated image.
  • the system 100 includes a memory 110 communicatively coupled to the structured prediction model 102 and/or the processor(s) 104.
  • in FIG. 1, an example of an unannotated medical image that is processed by the structured prediction model 102 is shown at the top left, and an example of an annotated medical image, showing the identified features, that may be shown on the display 108 is shown at the top right of FIG. 1.
  • the example system 100 illustrated in FIG. 1 is intended to be illustrative and not limiting.
  • the image 106 may come from an image database that is communicatively coupled to the system 100.
  • the output of the structured prediction model may be provided as a data file separately from the annotated image.
  • the structured prediction model 102 comprises a trained transformer model.
  • the image is a medical image that is serialized and provided to the transformer model as a token stream.
  • the transformer processes the serialized image and produces a set of predictions.
  • a convolutional neural network (CNN) is used to preprocess the serialized image before the serialized image is provided to the transformer.
  • the CNN may create a less spatially fine-grained representation of the image with more features, or a representation of the image that has been transformed in some way.
  • the CNN may comprise an additional layer at the end of the model architecture that converts the output of the CNN such that the output is suitable as input to the transformer, as illustrated in the sketch below.
  • This additional layer may be a reshape layer, for example.
  • the structured prediction model comprises the CNN.
  • the CNN may be based on a ResNet or EfficientNet type architecture, for example.
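  • For illustration only, the following PyTorch sketch shows one way such a CNN front end and reshape layer could be arranged; the class name CNNToTokens, the layer sizes, and the two-stage striding are assumptions rather than details taken from the disclosure.

```python
# Illustrative sketch (not the patented implementation): a small CNN backbone
# produces a spatially coarser, feature-rich map, and a final "reshape layer"
# converts that map into a token sequence suitable for a transformer.
import torch
import torch.nn as nn

class CNNToTokens(nn.Module):  # hypothetical name
    def __init__(self, in_channels: int = 1, d_model: int = 256):
        super().__init__()
        # Two strided conv stages: each halves the spatial resolution
        # while increasing the number of feature channels.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, channels, H, W) -> features: (batch, d_model, H/4, W/4)
        features = self.backbone(image)
        # Reshape step: flatten the spatial grid into a sequence of h*w tokens,
        # each a d_model-dimensional feature vector.
        return features.flatten(2).transpose(1, 2)  # (batch, h*w, d_model)

tokens = CNNToTokens()(torch.randn(1, 1, 256, 256))
print(tokens.shape)  # torch.Size([1, 4096, 256])
```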
  • the structured prediction model 102 comprises a recurrent neural network (RNN), which analyzes the image and produces the set of predictions.
  • RNN: recurrent neural network
  • the image may be a 2D image, such as a chest X-ray, having a height of H pixels and a width of W pixels.
  • the image may first be downsized to an image of H/2 by W/2.
  • a 512 x 512 chest X-ray may be downsized to a 256 x 256 image, then serialized into a stream of tokens representing pixel values.
  • the image is serialized in a left-to-right, top-to-bottom order, as shown in the sketch below.
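  • A minimal sketch of this downsize-and-serialize step, assuming simple 2x2 average pooling for the downsizing and 8-bit pixel-value tokens; neither choice is specified by the disclosure.

```python
# Minimal sketch: a 512 x 512 image is halved to 256 x 256 and then flattened
# into a one-dimensional stream of pixel-value tokens in left-to-right,
# top-to-bottom order.
import numpy as np

def serialize_image(image: np.ndarray) -> np.ndarray:
    """Downsize an HxW image to (H/2)x(W/2) and serialize it into tokens."""
    h, w = image.shape
    # 2x2 average pooling as the downsizing step (one of many possibilities).
    pooled = image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    # Row-major (C-order) flattening is exactly left-to-right, top-to-bottom.
    tokens = pooled.flatten(order="C")
    return np.clip(tokens, 0, 255).astype(np.uint8)

xray = np.random.randint(0, 256, size=(512, 512)).astype(np.float32)
stream = serialize_image(xray)
print(stream.shape)  # (65536,) -> 256 * 256 tokens
```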
  • the structured prediction model 102 processes the serialized image and produces a set of predictions.
  • the set of predictions comprises a set of one or more object descriptors, where each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object.
  • the input image may be a chest X-ray of a patient having IV lines, tubes, wires, or other objects visible in the X-ray image.
  • the structured prediction model may process this image and produce a list of all lines, tubes, and/or wires that were identified, along with a set of pixels within the image that the identified object occupies.
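  • For illustration, one possible data shape for such an object descriptor is sketched below; the field names and object-type strings are hypothetical, not taken from the disclosure.

```python
# Illustrative-only data shape for an object descriptor: an object type plus
# the set of image locations the identified object occupies.
from dataclasses import dataclass

@dataclass
class ObjectDescriptor:  # hypothetical name
    object_type: str                  # e.g. "iv_line", "drainage_tube"
    locations: list[tuple[int, int]]  # (x, y) pixel coordinates

descriptors = [
    ObjectDescriptor("iv_line", [(120, 44), (131, 58), (140, 75)]),
    ObjectDescriptor("ekg_wire", [(301, 210), (310, 240)]),
]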
  • FIG. 2 illustrates one of the differences between conventional object detection systems and the techniques of the present disclosure.
  • an X-ray image having two different polylines is analyzed by the conventional object detection system (left) and by the structured prediction model (right).
  • the structured prediction model 102 produces a constellation of points in 2D or 3D space that represent locations of at least portions of the object within that 2D or 3D space, as shown in image 202.
  • the output of the structured prediction model 102 is a set of points within the 2D image, and each point may be described as an ⁇ x,y ⁇ coordinate pair, a position index within the serialized image, or other format.
  • the raw data produced by the structured prediction model 102 may then be converted by the processor(s) 104 into a graphic format for viewing by an operator of the system.
  • one type of polyline is shown as a yellow line and another type of line is shown as a purple line, where the colored circles indicate the location of the points in the X-ray that have been identified by the structured prediction model 102 as being part of that particular polyline.
  • the structured prediction model 102 need not identify every point in the image that represents a portion of a particular object; it may instead generate a set of points at some maximum spacing and interpolate the location of the object between adjacent points.
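  • A hedged sketch of that interpolation idea: given the sparse points the model emits for a polyline, a denser path can be recovered by linear interpolation between adjacent points. The function name and step size are assumptions.

```python
# Recover a dense polyline from sparse predicted points by linearly
# interpolating between each pair of adjacent points.
import numpy as np

def densify_polyline(points: np.ndarray, step: float = 1.0) -> np.ndarray:
    """points: (N, 2) array of (x, y) in path order; returns a denser path."""
    dense = []
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        # Number of interpolated samples scales with segment length.
        n = max(2, int(np.hypot(x1 - x0, y1 - y0) / step))
        dense.append(np.column_stack([np.linspace(x0, x1, n, endpoint=False),
                                      np.linspace(y0, y1, n, endpoint=False)]))
    dense.append(points[-1:])  # keep the final point exactly
    return np.vstack(dense)

path = densify_polyline(np.array([[120, 44], [131, 58], [140, 75]], dtype=float))
```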
  • FIG. 3 is a block diagram of an example transformer 300, according to aspects of the disclosure, in which the structured prediction model 102 comprises a transformer model.
  • the transformer 300 includes a plurality of encoders 302 and a plurality of decoders 304.
  • Each encoder 302 may also be referred to herein as an encoder layer, and the plurality of encoders 302 may be collectively referred to herein as an encoder stack.
  • Each decoder 304 may also be referred to herein as a decoder layer, and the plurality of decoders 304 may be collectively referred to herein as a decoder stack.
  • an embedding and encoding block 306 receives a stream of objects, which may also be referred to herein as tokens, from an input stream 308.
  • the embedding and encoding block 306 performs two functions: embedding, which determines and encodes the meaning of a token; and position encoding, which encodes the position of the token.
  • Transformers gained popularity as a tool for natural language processing (NLP) and translation; in these applications, the input stream is a sentence and each word in the sentence is a token.
  • the embedding step assigns an identifier to each word - e.g., the word “a” may have identifier 10, the word “an” may have identifier 11, the word “the” may have identifier 12, and so on.
  • the encoding step generates a position code that is assigned to the word to indicate the word’s position within the sentence, which can be used to determine each word’s relative position to other words.
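  • The following sketch illustrates the two functions of such an embedding and encoding block, assuming a learned token embedding and the standard sinusoidal position encoding; the disclosure does not mandate either choice.

```python
# Sketch: a learned embedding maps each token id to a vector, and a position
# code is added so each token's position relative to others is recoverable.
import torch
import torch.nn as nn

class EmbedAndEncode(nn.Module):  # hypothetical name
    def __init__(self, vocab_size: int = 256, d_model: int = 128, max_len: int = 65536):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Standard sinusoidal position encoding (one common choice).
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2)
                        * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> (batch, seq_len, d_model)
        return self.embed(token_ids) + self.pe[: token_ids.shape[1]]
```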
  • the output of the first encoder 302 is input to the second encoder 302
  • the output of the second encoder 302 is input to the third encoder 302
  • the output of the third encoder 302 is provided as an input into each of the decoders 304.
  • an embedding and encoding block 310 provides an input to the first decoder 304, the output of the first decoder 304 is input into the second decoder 304, the output of the second decoder 304 is input into the third decoder 304, and the output of the third decoder 304 is provided to an output generator 312, which generates the output 314.
  • the output 314 may be a vector, where the elements of the vector are numbers which correspond to the set of one or more locations within the medical image occupied by an identified object. The elements of the vector may also correspond to a predicted object type for the identified object in the medical image.
  • a training input 308 is paired with a target 316 input that comprises the output that the transformer 300 should produce as its output 314 after processing the input 308.
  • the target 316 is provided as input to the embedding and encoding block 310.
  • the transformer 300 may be used for natural language translation, in which case, during training, the input 308 may comprise a sentence in one language while the target 316 may comprise that sentence as properly translated into the target language. Training may then continue until the output 314 of the transformer correctly matches the target 316, at which time training may be considered complete.
  • no target 316 is provided and the embedding and encoding block 310 is initialized with a null or empty value prior to receiving the input 308.
  • FIG. 4 illustrates an example encoder 302 in more detail, according to aspects of the disclosure.
  • each encoder 302 includes a self-attention layer 400 that receives the input and sends its output to a normalization layer 402, then to a feed-forward layer 404, and then to a second normalization layer 406.
  • the output of the second normalization layer 406 is the output of the encoder 302.
  • the self-attention layer 400 relates each token in the input stream to every other token in the input stream and determines which tokens are related to each other.
  • each token is given a score that indicates its relevance to each of the other tokens. Relevance between tokens may be determined based on the tokens’ embeddings and positions, for example.
  • an attention layer may perform its computations multiple times in parallel, using what are referred to as multiple attention heads. The parallel attention calculations can be combined to produce a final attention score. This parallelism gives a transformer greater power to encode multiple relationships and nuances for each token. Any of the attention layers, not just the self-attention layer 400, may be implemented with multiple attention heads.
  • the normalization layer 402 normalizes the output of the self-attention layer 400, e.g., by scaling or adjusting the range of relevance scores.
  • the feed-forward layer 404 applies weights that are determined during training of the encoder 302 to each token, and the normalization layer 406 normalizes the output of the feed-forward layer 404.
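  • A minimal PyTorch sketch of an encoder layer with the ordering described above (self-attention, normalization, feed-forward, second normalization); the residual connections and layer sizes are standard assumptions, not details from the disclosure.

```python
# Sketch of one encoder layer: self-attention relates every token to every
# other token; each sub-layer's output is normalized ("post-norm" ordering).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):  # hypothetical name
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connections are assumed here, as is standard practice.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))
```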
  • FIG. 5 illustrates an example decoder 304 in more detail, according to aspects of the disclosure.
  • a decoder 304 receives the encoder stack’s encoded representation of the input stream to produce an encoded representation of the target sequence that the output generator 312 will convert into probabilities to produce an output sequence or set of output predictions.
  • each decoder 304 includes a self-attention layer 500 that receives the input and sends its output through first normalization layer 502 to an encoder-decoder attention layer 504.
  • the encoder-decoder attention layer 504 sends its output through a second normalization layer 506 to a feedforward layer 508.
  • the output of the feed-forward layer 508 is sent to a third normalization layer 510.
  • the output of the third normalization layer 510 is the output of the decoder 304.
  • the self-attention layer 500 relates each token in the generated output stream to every other token in the generated output stream - which starts out empty and increases in size with every token received from the encoders - and determines which tokens are related to each other.
  • the normalization layer 502 normalizes the output of the self-attention layer 500.
  • the encoder-decoder attention layer 504 determines how tokens in the input stream are related to tokens in the output stream, and its output is normalized via the normalization layer 506 before being provided to the feed-forward layer 508.
  • the feed-forward layer 508 applies weights that were determined during training of the decoder 304 to each output token, and the normalization layer 510 normalizes the output of the feed-forward layer 508.
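  • A matching sketch of the decoder layer as described (self-attention over the generated output, encoder-decoder attention, then feed-forward, each followed by normalization); again, residual connections and sizes are assumptions.

```python
# Sketch of one decoder layer: self-attention over the generated output,
# cross-attention relating output tokens to encoded input tokens, then a
# feed-forward layer, with normalization after each sub-layer.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):  # hypothetical name
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        a, _ = self.self_attn(y, y, y)               # relate output tokens to each other
        y = self.norm1(y + a)
        a, _ = self.cross_attn(y, enc_out, enc_out)  # relate output to input tokens
        y = self.norm2(y + a)
        return self.norm3(y + self.ff(y))
```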
  • not every transformer includes a decoder 304.
  • the transformer might be used to translate from one language to another language, in which case the decoder stack would be needed to produce the translated sentence. That is, the decoder may process words in the source sentence to predict what words should be in the destination sentence.
  • the transformer may need only to tag or identify certain features of the sentence, such as identifying the subject, verb, and object of the sentence.
  • FIG. 6 illustrates an example of an application of transformer 300 as a classifier or predictor, according to aspects of the disclosure.
  • the transformer 300 analyzes an input 308 and, with the assistance of a classification head 600, reports one or more classifications, which may be positive classifications, negative classifications, or both.
  • a medical image may be the input and the classification head 600 may classify the image as containing or not containing an object of a certain type, such as a polyline, stent, staple, screw, filling, etc. This may be done in addition to providing a set of points that identify the location of the identified object within the medical image.
  • the structured prediction model may comprise the transformer 300 and the classification head 600.
  • an output of the transformer 300 may be input into the classification head 600 and the classification head 600 may classify the image as containing or not containing an object of a certain type, such as a polyline, stent, staple, screw, filling, etc.
  • the transformer model 300 may determine a set of one or more locations within the medical image occupied by an identified object, while the classification head 600 determines an object type for the identified object. Therefore, an object descriptor can be determined using the structured prediction model.
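  • One possible shape for such a classification head is sketched below, assuming mean pooling over the transformer’s output tokens and independent per-type sigmoid scores; these choices are illustrative only and not taken from the disclosure.

```python
# Sketch of a classification head: pooled transformer features are mapped to
# per-object-type contains/does-not-contain scores, alongside point outputs.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):  # hypothetical name
    def __init__(self, d_model: int = 128, n_object_types: int = 6):
        super().__init__()
        self.fc = nn.Linear(d_model, n_object_types)

    def forward(self, transformer_out: torch.Tensor) -> torch.Tensor:
        # transformer_out: (batch, seq_len, d_model); mean-pool over tokens,
        # then emit an independent presence score per object type.
        pooled = transformer_out.mean(dim=1)
        return torch.sigmoid(self.fc(pooled))
```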
  • FIG. 7 is a flowchart of an example process 700 associated with object detection and point localization in medical images using structured prediction, according to aspects of the disclosure.
  • one or more process blocks of FIG. 7 may be performed by a system (e.g., system 100).
  • one or more process blocks of FIG. 7 may be performed by another device or a group of devices separate from or including the system.
  • one or more process blocks of FIG. 7 may be performed by one or more components of an apparatus, such as the structured prediction model 102, processor(s) 104, or memory 110 of the system 100, any or all of which may be means for performing the operations of process 700.
  • process 700 may include receiving, at block 702, an image comprising a plurality of points.
  • Means for performing the operation of block 702 may include the structured prediction model 102, processor(s) 104, or memory 110 of the system 100.
  • the system 100 may receive an image comprising a plurality of points and store the image in the memory 110.
  • the image may come from a database of images, e.g., from a database internal to the system 100 or an external database to which the system 100 is communicably coupled.
  • process 700 may include, at block 704, inputting the image into a structured prediction model for identifying objects within the image.
  • Means for performing the operation of block 704 may include the structured prediction model 102, processor(s) 104, or memory 110 of the system 100.
  • the processor(s) 104 may transfer the image from the memory 110 into the structured prediction model 102.
  • process 700 may include, at block 706, receiving, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object.
  • Means for performing the operation of block 706 may include the structured prediction model 102, processor(s) 104, or memory 110 of the system 100.
  • the structured prediction model 102 may output the set of object descriptors to the processor(s) 104, to the memory 110, or to both.
  • process 700 includes outputting the image, and a graphical representation of the set of one or more locations within the image occupied by at least one identified object, to a display.
  • the processor(s) 104 may combine the original image 106 with a graphical representation of the set of identified objects to create an annotated image, which is then sent to the display 108.
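  • A sketch of how this annotation step might be rendered, using matplotlib purely for illustration and the hypothetical ObjectDescriptor shape from the earlier sketch; the disclosure does not specify a rendering library.

```python
# Draw the predicted point locations over the original image to produce the
# combined (annotated) image that is sent to the display.
import matplotlib.pyplot as plt
import numpy as np

def annotate(image: np.ndarray, descriptors: list) -> None:
    plt.imshow(image, cmap="gray")
    for d in descriptors:
        xs, ys = zip(*d.locations)  # hypothetical descriptor from earlier sketch
        plt.plot(xs, ys, marker="o", label=d.object_type)
    plt.legend()
    plt.savefig("annotated.png")  # or route the combined image to a display
```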
  • receiving the image comprising the plurality of points comprises receiving a two-dimensional image comprising a plurality of pixels or receiving a three-dimensional image comprising a plurality of voxels.
  • receiving the image comprises receiving a medical image.
  • receiving the medical image comprises receiving an X-ray image.
  • inputting the image into the structured prediction model comprises serializing a two-dimensional or multi-dimensional image into a one-dimensional stream of data that is input into the structured prediction model.
  • inputting the image into the structured prediction model comprises downsizing a two-dimensional image to reduce the number of pixels or downsizing a three-dimensional image to reduce the number of voxels.
  • inputting the image into the structured prediction model comprises inputting the image into a transformer model that processes the image and generates the set of one or more object descriptors.
  • inputting the image into the structured prediction model comprises inputting the image into a convolutional neural network (CNN) that preprocesses the image to produce an encoded image, and a transformer model that processes the encoded image and generates the set of one or more object descriptors.
  • inputting the image into the structured prediction model comprises inputting the image into a recurrent neural network (RNN) that processes the image and generates the set of one or more object descriptors.
  • RNN: recurrent neural network
  • At least one of the one or more identified objects comprises a polyline.
  • In some aspects, the set of one or more locations within the image occupied by the polyline defines a path of the polyline.
  • In some aspects, at least one of the one or more identified objects comprises a volumetric object.
  • In some aspects, the set of one or more locations within the image occupied by the volumetric object defines a perimeter or outline of the object in a two-dimensional plane.
  • process 700 includes, prior to inputting the image into a structured prediction model for identifying objects within the image, training the structured prediction model.
  • the structured prediction model can be trained using a set of images containing identifiable objects and having labels to indicate the type and location of the objects.
  • the structured prediction model 102 is trained on chest X-ray images containing polylines, but these examples are illustrative and not limiting; other types of images, including medical or non-medical, 2D or 3D, etc., may be analyzed, and the structured prediction model 102 may be trained to identify other types of objects, not just polylines.
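  • A hedged sketch of such supervised training, assuming the model is trained to emit a token sequence that encodes object types and locations, and assuming cross-entropy loss with the Adam optimizer; none of these choices are specified by the disclosure.

```python
# Illustrative training loop over labeled images: each label is assumed to be
# a target token sequence encoding the type and location of the objects.
import torch

def train(model, loader, epochs: int = 10):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for image, target_tokens in loader:   # (batch, ...), (batch, seq_len)
            opt.zero_grad()
            pred = model(image)               # assumed (batch, seq_len, vocab)
            # CrossEntropyLoss expects (batch, vocab, seq_len).
            loss = loss_fn(pred.transpose(1, 2), target_tokens)
            loss.backward()
            opt.step()
```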
  • the structured prediction model 102 may be trained using a training apparatus that is separate from the system 100, after which the trained model is transferred into, loaded into, or configured within the structured prediction model 102 of the system 100. In some aspects, the structured prediction model 102 may be trained in situ within the system 100, e.g., during a training session.
  • the structured prediction model 102 may be locked, i.e., the internal parameters of the model are protected from future changes.
  • the structured prediction model 102 may be subject to continual, ongoing training even during production use. For example, if the structured prediction model 102 incorrectly identifies an object, the user of the system can flag the result, in which case the system 100 may store that image and the result for later training or retraining of the structured prediction model 102, and/or send that information to the developer of the structured prediction model 102 as feedback regarding the performance and accuracy of the system 100.
  • Process 700 may include additional implementations, such as any single implementation or any combination of implementations described below and/or in connection with one or more other processes described elsewhere herein. Although FIG. 7 shows example blocks of process 700, in some implementations, process 700 may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in FIG. 7. Additionally, or alternatively, two or more of the blocks of process 700 may be performed in parallel.
  • the system 100 may be a component within another system and provide additional functionality to the other system.
  • the components of system 100 may be incorporated into medical computers that connect to various imagers, such as computed tomography (CT) imagers, magnetic resonance imagers (MRI), X-ray imagers, positron emission tomography (PET) imagers, ultrasound (US) or infrared (IR) imagers, etc.
  • CT: computed tomography
  • MRI: magnetic resonance imaging
  • PET: positron emission tomography
  • US: ultrasound
  • IR: infrared
  • the medical computers may be co-located with such imagers or may be co-located with doctors (including remotely) such that the connection between the imagers and medical computers may be over the internet, ethernet, a local area network, fiber optics, wireless, or the like.
  • the medical images from the imagers may first be received by the one or more medical computers and inputted into the system 100.
  • FIG. 8 compares a chest X-ray 800 that was annotated by a human (left) and the same X-ray 802 that was annotated by the structured prediction model 102 (right), according to aspects of the disclosure.
  • the structured prediction model correctly identified the same three polylines that the human annotator identified, including two lines that overlapped each other.
  • example clauses can also include a combination of the dependent clause aspect(s) with the subject matter of any other dependent clause or independent clause or a combination of any feature with other dependent and independent clauses.
  • the various aspects disclosed herein expressly include these combinations, unless it is explicitly expressed or can be readily inferred that a specific combination is not intended (e.g., contradictory aspects, such as defining an element as both an electrical insulator and an electrical conductor).
  • aspects of a clause can be included in any other independent clause, even if the clause is not directly dependent on the independent clause.
  • Clause 1 A method for object detection and point localization in images using structured prediction comprising: receiving an image comprising a plurality of points; inputting the image into a structured prediction model for identifying objects within the image; and receiving, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object.
  • Clause 2 The method of clause 1, further comprising: outputting the image, and a graphical representation of the set of one or more locations within the image occupied by at least one identified object, to a display.
  • Clause 3 The method of any of clauses 1 to 2, wherein receiving the image comprising the plurality of points comprises receiving a two-dimensional image comprising a plurality of pixels or receiving a three-dimensional image comprising a plurality of voxels.
  • Clause 4 The method of any of clauses 1 to 3, wherein receiving the image comprises receiving a medical image.
  • Clause 5 The method of clause 4, wherein receiving the medical image comprises receiving an X-ray image.
  • Clause 6 The method of any of clauses 1 to 5, wherein inputting the image into the structured prediction model comprises serializing a two-dimensional or multidimensional image into a one-dimensional stream of data that is input into the structured prediction model.
  • Clause 7 The method of any of clauses 1 to 6, wherein inputting the image into the structured prediction model comprises downsizing a two-dimensional image to reduce the number of pixels or downsizing a three-dimensional image to reduce the number of voxels.
  • Clause 8 The method of any of clauses 1 to 7, wherein inputting the image into the structured prediction model comprises inputting the image into a transformer model that processes the image and generates the set of one or more object descriptors.
  • Clause 9 The method of any of clauses 1 to 8, wherein inputting the image into the structured prediction model comprises inputting the image into a convolutional neural network (CNN) that preprocesses the image to produce an encoded image, and a transformer model that processes the encoded image and generates the set of one or more object descriptors.
  • Clause 10 The method of any of clauses 1 to 9, wherein inputting the image into the structured prediction model comprises inputting the image into a recurrent neural network (RNN) that processes the image and generates the set of one or more object descriptors.
  • Clause 11 The method of any of clauses 1 to 10, wherein at least one of the one or more identified objects comprises a polyline.
  • Clause 12 The method of clause 11, wherein the set of one or more locations within the image occupied by the polyline defines a path of the polyline.
  • Clause 13 The method of any of clauses 1 to 12, wherein at least one of the one or more identified objects comprises a volumetric object.
  • Clause 14 The method of clause 13, wherein the set of one or more locations within the image occupied by the volumetric object defines a perimeter or outline of the object in a two-dimensional plane.
  • Clause 15 The method of any of clauses 1 to 14, further comprising, prior to inputting the image into a structured prediction model for identifying objects within the image, training the structured prediction model.
  • Clause 16 A system for object detection and point localization in images using structured prediction comprising: a memory; and at least one processor communicatively coupled to the memory, the at least one processor configured to: receive an image comprising a plurality of points; input the image into a structured prediction model for identifying objects within the image; and receive, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object.
  • Clause 17. The system of clause 16, wherein the at least one processor is further configured to: output the image, and a graphical representation of the set of one or more locations within the image occupied by at least one identified object, to a display.
  • Clause 18 The system of any of clauses 16 to 17, wherein the at least one processor configured to receive the image comprising the plurality of points comprises the at least one processor configured to receive a two-dimensional image comprising a plurality of pixels or receive a three-dimensional image comprising a plurality of voxels.
  • Clause 19 The system of any of clauses 16 to 18, wherein the at least one processor configured to receive the image comprises the at least one processor configured to receive a medical image.
  • Clause 20 The system of clause 19, wherein the at least one processor configured to receive the medical image comprises the at least one processor configured to receive an X-ray image.
  • Clause 21 The system of any of clauses 16 to 20, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to serialize a two-dimensional or multidimensional image into a one-dimensional stream of data that is input into the structured prediction model.
  • Clause 22 The system of any of clauses 16 to 21, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to downsize a two-dimensional image to reduce the number of pixels or downsize a three-dimensional image to reduce the number of voxels.
  • Clause 23 The system of any of clauses 16 to 22, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to input the image into a transformer model that processes the image and generates the set of one or more object descriptors.
  • Clause 24 The system of any of clauses 16 to 23, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to input the image into a convolutional neural network (CNN) that preprocesses the image to produce an encoded image, and a transformer model that processes the encoded image and generates the set of one or more object descriptors.
  • Clause 25 The system of any of clauses 16 to 24, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to input the image into a recurrent neural network (RNN) that processes the image and generates the set of one or more object descriptors.
  • Clause 26 The system of any of clauses 16 to 25, wherein at least one of the one or more identified objects comprises a polyline.
  • Clause 27 The system of clause 26, wherein the set of one or more locations within the image occupied by the polyline defines a path of the polyline.
  • Clause 28 The system of any of clauses 16 to 27, wherein at least one of the one or more identified objects comprises a volumetric object.
  • Clause 29 The system of clause 28, wherein the set of one or more locations within the image occupied by the volumetric object defines a perimeter or outline of the object in a two-dimensional plane.
  • Clause 30 The system of any of clauses 16 to 29, wherein the at least one processor is further configured to train the structured prediction model prior to inputting the image into the structured prediction model for identifying objects within the image.
  • Clause 31 An apparatus comprising a memory, a transceiver, and a processor communicatively coupled to the memory and the transceiver, the memory, the transceiver, and the processor configured to perform a method according to any of clauses 1 to 15.
  • Clause 32 An apparatus comprising means for performing a method according to any of clauses 1 to 15.
  • Clause 33 A non-transitory computer-readable medium storing computer-executable instructions, the computer-executable instructions comprising at least one instruction for causing a computer or processor to perform a method according to any of clauses 1 to 15.
  • a software module may reside in RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of non-transitory computer-readable medium known in the art.
  • An exemplary non-transitory computer-readable medium may be coupled to the processor such that the processor can read information from, and write information to, the non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may be integral to the processor.
  • the processor and the non-transitory computer-readable medium may reside in an ASIC.
  • the ASIC may reside in an IoT device.
  • the processor and the non-transitory computer-readable medium may be discrete components in a user terminal.
  • the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a non-transitory computer-readable medium.
  • Computer- readable media may include storage media and/or communication media including any non-transitory medium that may facilitate transferring a computer program from one place to another.
  • a storage media may be any available media that can be accessed by a computer.
  • such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • any connection is properly termed a computer-readable medium.
  • the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of a medium.
  • disk and disc, which may be used interchangeably herein, include CD, laser disc, optical disc, DVD, floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Radiology & Medical Imaging (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

Methods and systems for object detection and point localization in medical images using structured prediction are provided. In an aspect, a system may receive an image comprising a plurality of points. The system may input the image into a structured prediction model for identifying objects within the image. In an aspect, the structured prediction model comprises a transformer model. The system may receive, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object. In an aspect, the input image is displayed to a user, showing the identified objects within the image.

Description

OBJECT DETECTION AND POINT LOCALIZATION IN MEDICAL IMAGES USING STRUCTURED PREDICTION
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority from Australian Provisional Patent Application No 2022902321 filed on 16 August 2022, the contents of which are incorporated herein by reference in their entirety.
TECHNICAL FIELD
[0002] The various aspects and embodiments described herein generally relate to object detection and point localization in medical images.
BACKGROUND
[0003] Medical images, such as chest X-rays, often show not only the patient’s tissues, but also other paraphernalia such as intravenous (IV) lines, drainage tubes, electrocardiogram (EKG) wires, etc., which are generically referred to as polylines. Visual analysis of a medical image can be streamlined and enhanced by automatic detection and labelling of polylines and other visual artifacts, e.g., so that a human screener does not have to spend time on this task, and also to avoid human error.
[0004] It is an object of this application to provide faster and more automatic object detection and point localization in medical and non-medical images to address these problems.
[0005] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each of the appended claims.
SUMMARY
[0006] The following presents a simplified summary relating to one or more aspects disclosed herein. Thus, the following summary should not be considered an extensive overview relating to all contemplated aspects, nor should the following summary be considered to identify key or critical elements relating to all contemplated aspects or to delineate the scope associated with any particular aspect. Accordingly, the following summary has the sole purpose to present certain concepts relating to one or more aspects relating to the mechanisms disclosed herein in a simplified form to precede the detailed description presented below.
[0007] In an aspect, a method for object detection and point localization in images using structured prediction includes receiving an image comprising a plurality of points, inputting the image into a structured prediction model for identifying objects within the image, wherein the structured prediction model comprises a transformer model, and receiving, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object.
[0008] In an aspect, a system for object detection and point localization in images using structured prediction includes a memory and at least one processor communicatively coupled to the memory. The at least one processor is configured to receive an image comprising a plurality of points, input the image into a structured prediction model for identifying objects within the image, and receive, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object and the structured prediction model comprises a transformer model.
[0009] In an aspect, a method for object detection and point localization in images using structured prediction includes receiving an image comprising a plurality of points, inputting the image into a structured prediction model for identifying objects within the image, and receiving, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object, wherein inputting the image into the structured prediction model comprises inputting the image into a recurrent neural network (RNN) that processes the image to generate the set of one or more object descriptors.

[0010] Other objects and advantages associated with the aspects disclosed herein will be apparent to those skilled in the art based on the accompanying drawings and detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] A more complete appreciation of the various aspects and embodiments described herein and many attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings which are presented solely for illustration and not limitation, and in which:
[0012] FIG. 1 is a system diagram of system for object detection and point localization in medical images using structured prediction, according to aspects of the disclosure;
[0013] FIG. 2 illustrates one of the differences between conventional object detection systems and the techniques of the present disclosure;
[0014] FIG. 3 is a block diagram of an example transformer, according to aspects of the disclosure;
[0015] FIG. 4 illustrates an example encoder in more detail, according to aspects of the disclosure;
[0016] FIG. 5 illustrates an example decoder in more detail, according to aspects of the disclosure;
[0017] FIG. 6 illustrates an example of an application of transformer as a classifier or predictor, according to aspects of the disclosure;
[0018] FIG. 7 is a flowchart of an example process associated with object detection and point localization in medical images using structured prediction, according to aspects of the disclosure; and

[0019] FIG. 8 compares a chest X-ray that was annotated by a human and the same X-ray that was annotated by the structured prediction model, according to aspects of the disclosure.
DETAILED DESCRIPTION
[0020] Methods and systems for object detection and point localization in medical images using structured prediction are provided. In an aspect, a system may receive an image comprising a plurality of points. The system may input the image into a structured prediction model for identifying objects within the image. In an aspect, the structured prediction model comprises a transformer model. The system may receive, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object. In an aspect, the input image is displayed to a user, showing the identified objects within the image.
[0021] Various aspects and embodiments are disclosed in the following description and related drawings to show specific examples relating to exemplary aspects and embodiments. Alternate aspects and embodiments will be apparent to those skilled in the pertinent art upon reading this disclosure, and may be constructed and practiced without departing from the scope or spirit of the disclosure. Additionally, well-known elements will not be described in detail or may be omitted so as to not obscure the relevant details of the aspects and embodiments disclosed herein.
[0022] The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Likewise, the term “embodiments” does not require that all embodiments include the discussed feature, advantage, or mode of operation.
[0023] The terminology used herein describes particular embodiments only and should not be construed to limit any embodiments disclosed herein. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Those skilled in the art will further understand that the terms “comprises,” “comprising,” “includes,” and/or “including,” as used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0024] Further, various aspects and/or embodiments may be described in terms of sequences of actions to be performed by, for example, elements of a computing device. Those skilled in the art will recognize that various actions described herein can be performed by specific circuits (e.g., an application specific integrated circuit (ASIC)), by program instructions being executed by one or more processors, or by a combination of both. Additionally, these sequences of actions described herein can be considered to be embodied entirely within any form of non-transitory computer-readable medium having stored thereon a corresponding set of computer instructions that upon execution would cause an associated processor to perform the functionality described herein. Thus, the various aspects described herein may be embodied in a number of different forms, all of which have been contemplated to be within the scope of the claimed subject matter. In addition, for each of the aspects described herein, the corresponding form of any such aspects may be described herein as, for example, “logic configured to” and/or other structural components configured to perform the described action.
[0025] Those skilled in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, transmissions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
[0026] Further, those skilled in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted to depart from the scope of the various aspects and embodiments described herein.
[0027] The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
[0028] FIG. 1 is a system diagram of system 100 for object detection and point localization in medical images using structured prediction, according to aspects of the disclosure. In the example illustrated in FIG. 1, system 100 includes a structured prediction model 102 and one or more processor(s) 104. The structured prediction model 102 receives as input an image 106, which may be a medical image, such as an X-ray image, and identifies objects within the image 106. In the example illustrated in FIG. 1, the processor(s) 104 combines the original image 106 with a graphic representation of the objects identified within the image 106 by the structured prediction model 102. In the example illustrated in FIG. 1, the processor(s) 104 provides the combined image, e.g., to a display 108 for viewing by the user. The combined image may also be referred to as an annotated image. In the example illustrated in FIG. 1, the system 100 includes a memory 110 communicatively coupled to the structured prediction model 102 and/or the processor(s) 104.
[0029] In the example illustrated in FIG. 1, an example of an unannotated medical image that is processed by the structured prediction model 102 is shown at the top left of FIG. 1, and an example of an annotated medical image, showing the identified features, that may be shown on the display 108 is shown at the top right of FIG. 1.
[0030] The example system 100 illustrated in FIG. 1 is intended to be illustrative and not limiting. For example, in some aspects, the image 106 may come from an image database that is communicatively coupled to the system 100. In some aspects, the output of the structured prediction model may be provided as a data file separately from the annotated image.
[0031] In some aspects, the structured prediction model 102 comprises a trained transformer model. In some aspects, the image is a medical image that is serialized and provided to the transformer model as a token stream. The transformer processes the serialized image and produces a set of predictions. In some aspects, a convolutional neural network (CNN) is used to preprocess the serialized image before the serialized image is provided to the transformer. For example, the CNN may create a less spatially fine-grained representation of the image with more features, or a representation of the image that has been transformed in some way.
[0032] In some examples, the CNN may comprise an additional layer at the end of the model architecture that converts the output of the CNN such that the output is suitable as input into the transformer. This additional layer may be a reshape layer, for example. In some aspects, the structured prediction model comprises the CNN. The CNN may be based on a ResNet or EfficientNet type architecture, for example. In some aspects, the structured prediction model 102 comprises a recurrent neural network (RNN), which analyzes the image and produces the set of predictions.
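By way of non-limiting illustration, the following Python sketch (using PyTorch) shows one possible arrangement of such a CNN front end: a small convolutional stack produces a less spatially fine-grained, higher-feature representation, and a final reshape converts the feature map into a token sequence suitable as transformer input. The class name, layer sizes, and model width are assumptions made for illustration only, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class CNNFrontEnd(nn.Module):
    """Illustrative CNN preprocessor with a trailing reshape layer."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        # Two strided convolutions: coarser spatial grid, more features.
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        f = self.features(image)             # (B, d_model, H/4, W/4)
        # Reshape layer: flatten the spatial grid into a token sequence.
        return f.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, d_model)

frontend = CNNFrontEnd()
tokens = frontend(torch.randn(1, 1, 256, 256))  # tokens.shape == (1, 4096, 128)
```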
[0033] For example, the image may be a 2D image, such as a chest X-ray, having a height of H number of pixels and a width of W number of pixels. In some aspects, the image may first be downsized to an image of H/2 by W/2. For example, a 512 x 512 chest X-ray may be downsized to a 256 x 256 image, then serialized into a stream of tokens representing pixel values. In some aspects, the image is serialized in a left-to-right, top-to-bottom order. The structured prediction model 102 processes the serialized image and produces a set of predictions. In some aspects, the set of predictions comprises a set of one or more object descriptors, where each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object. For example, the input image may be a chest X-ray of a patient having IV lines, tubes, wires, or other objects visible in the X-ray image. The structured prediction model may process this image and produce a list of all lines, tubes, and/or wires that were identified, along with a set of pixels within the image that each identified object occupies.
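A minimal sketch of the downsizing and serialization steps just described, assuming block-average downsizing and 8-bit pixel-value tokens (both of which are illustrative choices rather than requirements of the disclosure):

```python
import numpy as np

def serialize_image(image: np.ndarray, factor: int = 2) -> np.ndarray:
    """Downsize an H x W image by `factor`, then flatten it row by row."""
    h, w = image.shape
    image = image[: h - h % factor, : w - w % factor]
    # Block averaging: a 512 x 512 chest X-ray becomes a 256 x 256 image.
    small = image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    # Quantize pixel values to integer tokens in [0, 255].
    tokens = np.clip(small, 0, 255).astype(np.int64)
    # Row-major flattening yields left-to-right, top-to-bottom order.
    return tokens.reshape(-1)

xray = np.random.rand(512, 512) * 255   # stand-in for a chest X-ray
stream = serialize_image(xray)          # 65,536 tokens (256 * 256)
```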
[0034] FIG. 2 illustrates one of the differences between conventional object detection systems and the techniques of the present disclosure. In the example shown in FIG. 2, an X-ray image having two different polylines is analyzed by a conventional object detection system (left) and by the structured prediction model 102 (right). Unlike conventional object detection systems that output bounding boxes for each object identified, as shown in image 200, the structured prediction model 102 produces a constellation of points in 2D or 3D space that represent locations of at least portions of the object within that 2D or 3D space, as shown in image 202.
[0035] Conventional object detection systems may have difficulty distinguishing between overlapping objects and, even when successful, produce bounding boxes that make the exact spatial relationship between the overlapping objects difficult or impossible to determine. In contrast, the structured prediction model 102 produces an output that makes it easier to distinguish between objects that overlap, and whose relative spatial relationship is easy to understand.
[0036] In the example illustrated in FIG. 2, the output of the structured prediction model 102 is a set of points within the 2D image, and each point may be described as an {x,y} coordinate pair, a position index within the serialized image, or other format. The raw data produced by the structured prediction model 102 may then be converted by the processor(s) 104 into a graphic format for viewing by an operator of the system. In the example illustrated in FIG. 2, one type of polyline is shown as a yellow line and another type of line is shown as a purple line, where the colored circles indicate the location of the points in the X-ray that have been identified by the structured prediction model 102 as being part of that particular polyline. As can be seen in the example shown in FIG. 2, the structured prediction model 102 need not identify every point in the image that represents a portion of a particular object; it may instead generate a set of points at some maximum spacing and interpolate the location of the object between adjacent points.
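The conversion between a position index in the serialized image and an {x, y} coordinate pair, and the interpolation of an object's location between adjacent predicted points, might be sketched as follows; the function names and the fixed image width are illustrative assumptions:

```python
import numpy as np

WIDTH = 256  # assumed width of the downsized, serialized image

def index_to_xy(index: int) -> tuple[int, int]:
    """Position index in the row-major token stream to (x, y) pair."""
    return index % WIDTH, index // WIDTH

def xy_to_index(x: int, y: int) -> int:
    """(x, y) coordinate pair back to a position index."""
    return y * WIDTH + x

def interpolate_polyline(points: list[tuple[int, int]], step: float = 1.0):
    """Linearly fill in the object's path between adjacent points."""
    path = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        n = max(int(np.hypot(x1 - x0, y1 - y0) / step), 1)
        for t in np.linspace(0.0, 1.0, n, endpoint=False):
            path.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    path.append(points[-1])
    return path

points = [index_to_xy(i) for i in (1000, 3100, 5230)]  # model output indices
dense_path = interpolate_polyline(points)
```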
[0037] FIG. 3 is a block diagram of an example transformer 300, according to aspects of the disclosure in which the structured prediction model 102 comprises a transformer model. In the example shown in FIG. 3, the transformer 300 includes a plurality of encoders 302 and a plurality of decoders 304. Each encoder 302 may also be referred to herein as an encoder layer, and the plurality of encoders 302 may be collectively referred to herein as an encoder stack. Each decoder 304 may also be referred to herein as a decoder layer, and the plurality of decoders 304 may be collectively referred to herein as a decoder stack.
[0038] In the example shown in FIG. 3, an embedding and encoding block 306 receives a stream of objects, which may also be referred to herein as tokens, from an input stream 308. The embedding and encoding block 306 performs two functions: embedding, which determines and encodes the meaning of a token; and position encoding, which encodes the position of the token. Transformers gained popularity as a tool for natural language processing (NLP) and translation; in these applications, the input stream is a sentence and each word in the sentence is a token. In NLP applications, for example, the embedding step assigns an identifier to each word - e.g., the word “a” may have identifier 10, the word “an” may have identifier 11, the word “the” may have identifier 12, and so on. In NLP applications, the encoding step generates a position code that is assigned to the word to indicate the word’s position within the sentence, which can be used to determine each word’s relative position to other words.
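As a hedged illustration of the two functions of the embedding and encoding block 306, the following sketch combines a learned token embedding with the sinusoidal position codes popularized in NLP transformers; the vocabulary size (8-bit pixel values), model width, and maximum sequence length are assumptions, not the disclosed design:

```python
import math
import torch
import torch.nn as nn

class EmbedAndEncode(nn.Module):
    """Illustrative embedding (token meaning) plus position encoding."""
    def __init__(self, vocab_size: int = 256, d_model: int = 128,
                 max_len: int = 65536):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # token meaning
        pe = torch.zeros(max_len, d_model)              # token position
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer pixel-value tokens
        return self.embed(tokens) + self.pe[: tokens.size(1)]

block = EmbedAndEncode()
out = block(torch.randint(0, 256, (1, 1024)))  # (1, 1024, 128)
```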
[0039] In the example shown in FIG. 3, the output of the first encoder 302 is input to the second encoder 302, the output of the second encoder 302 is input to the third encoder 302, and the output of the third encoder 302 is provided as an input into each of the decoders 304.
[0040] In the example shown in FIG. 3, an embedding and encoding block 310 provides an input to the first decoder 304, the output of the first decoder 304 is input into the second decoder 304, the output of the second decoder 304 is input into the third decoder 304, and the output of the third decoder 304 is provided to an output generator 312, which generates the output 314. In some examples, the output 314 may be a vector, where the elements of the vector are numbers which correspond to the set of one or more locations within the medical image occupied by an identified object. The elements of the vector may also correspond to a predicted object type for the identified object in the medical image.
[0041] During training of the transformer 300, a training input 308 is paired with a target 316 that comprises the output that the transformer 300 should produce as its output 314 after processing the input 308. During this training, the target 316 is provided as input to the embedding and encoding block 310. In an NLP application, the transformer 300 may be used for natural language translation, in which case, during training, the input 308 may comprise a sentence in one language while the target 316 may comprise that sentence as properly translated into the target language. Training may then continue until the output 314 of the transformer correctly matches the target 316, at which time training may be considered complete. After training, during operation of the transformer 300, no target 316 is provided and the embedding and encoding block 310 is initialized with a null or empty value prior to receiving the input 308.
[0042] FIG. 4 illustrates an example encoder 302 in more detail, according to aspects of the disclosure. In the example illustrated in FIG. 4, each encoder 302 includes a self-attention layer 400 that receives the input and sends its output to a normalization layer 402, then to a feed-forward layer 404, and then to a second normalization layer 406. The output of the second normalization layer 406 is the output of the encoder 302.
[0043] The self-attention layer 400 relates each token in the input stream to every other token in the input stream and determines which tokens are related to each other. In some aspects, each token is given a score that indicates its relevance to each of the other tokens. Relevance between tokens may be determined based on the tokens' embeddings and positions, for example. In some aspects, an attention layer may perform its computations multiple times in parallel, using what are referred to as multiple attention heads. The parallel attention calculations can be combined together to produce a final attention score. This parallelism gives a transformer greater power to encode multiple relationships and nuances for each token. Any of the attention layers, not just the self-attention layer 400, may be implemented with multiple attention heads.

[0044] The normalization layer 402 normalizes the output of the self-attention layer 400, e.g., by scaling or adjusting the range of relevance scores. The feed-forward layer 404 applies weights that are determined during training of the encoder 302 to each token, and the normalization layer 406 normalizes the output of the feed-forward layer 404.
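The encoder structure of FIG. 4 (self-attention, normalization, feed-forward, normalization), with the self-attention computed over multiple heads in parallel, might be sketched as follows using PyTorch's built-in multi-head attention module; the dimensions and residual arrangement are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative encoder layer: attention, norm, feed-forward, norm."""
    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        # Multi-head self-attention: several attention heads computed in
        # parallel, then combined into a single attention output.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x)          # self-attention: Q = K = V = x
        x = self.norm1(x + a)              # normalize the attention output
        return self.norm2(x + self.ff(x))  # feed-forward, then normalize

layer = EncoderLayer()
encoded = layer(torch.randn(1, 1024, 128))  # (batch, seq_len, d_model)
```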
[0045] FIG. 5 illustrates an example decoder 304 in more detail, according to aspects of the disclosure. A decoder 304 receives the encoder stack's encoded representation of the input stream to produce an encoded representation of the target sequence that the output generator 312 will convert into probabilities to produce an output sequence or set of output predictions. In the example illustrated in FIG. 5, each decoder 304 includes a self-attention layer 500 that receives the input and sends its output through a first normalization layer 502 to an encoder-decoder attention layer 504. The encoder-decoder attention layer 504 sends its output through a second normalization layer 506 to a feed-forward layer 508. The output of the feed-forward layer 508 is sent to a third normalization layer 510. The output of the third normalization layer 510 is the output of the decoder 304.
[0046] The self-attention layer 500 relates each token in the generated output stream to every other token in the generated output stream - which starts out empty and increases in size with every token that is generated - and determines which tokens are related to each other. The normalization layer 502 normalizes the output of the self-attention layer 500. The encoder-decoder attention layer 504 determines how tokens in the input stream are related to tokens in the output stream, and its output is normalized via the normalization layer 506 before being provided to the feed-forward layer 508. The feed-forward layer 508 applies weights that were determined during training of the decoder 304 to each output token, and the normalization layer 510 normalizes the output of the feed-forward layer 508.
[0047] Not every transformer includes a decoder 304. For example, in an NLP application the transformer might be used to translate from one language to another language, in which case the decoder stack would be needed to produce the translated sentence. That is, the decoder may process words in the source sentence to predict what words should be in the destination sentence. However, in another NLP application, e.g., a classification application, the transformer may need only to tag or identify certain features of the sentence, such as identifying the subject, verb, and object of the sentence.
[0048] FIG. 6 illustrates an example of an application of transformer 300 as a classifier or predictor, according to aspects of the disclosure. In this example, the transformer 300 analyzes an input 308 and, with the assistance of a classification head 600, reports one or more classifications, which may be positive classifications, negative classifications, or both. For example, a medical image may be the input and the classification head 600 may classify the image as containing or not containing an object of a certain type, such as a polyline, stent, staple, screw, filling, etc. This may be done in addition to providing a set of points that identify the location of the identified object within the medical image.
[0049] In some aspects, the structured prediction model may comprise the transformer 300 and the classification head 600. In some examples, an output of the transformer 300 may be input into the classification head 600 and the classification head 600 may classify the image as containing or not containing an object of a certain type, such as a polyline, stent, staple, screw, filling, etc. As such, the transformer model 300 may determine a set of one or more locations within the medical image occupied by an identified object, while the classification head 600 determines an object type for the identified object. Therefore, an object descriptor can be determined using the structured prediction model.
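One possible arrangement of the classification head 600 on top of pooled transformer features is sketched below. The object-type list, pooling choice, and layer sizes are illustrative assumptions; the actual head may be implemented differently.

```python
import torch
import torch.nn as nn

OBJECT_TYPES = ["polyline", "stent", "staple", "screw", "filling"]  # assumed

class ClassificationHead(nn.Module):
    """Illustrative head: per-type presence probabilities for the image."""
    def __init__(self, d_model: int = 128, n_types: int = len(OBJECT_TYPES)):
        super().__init__()
        self.fc = nn.Linear(d_model, n_types)

    def forward(self, transformer_out: torch.Tensor) -> torch.Tensor:
        pooled = transformer_out.mean(dim=1)   # pool over the token sequence
        return torch.sigmoid(self.fc(pooled))  # one probability per type

head = ClassificationHead()
probs = head(torch.randn(1, 1024, 128))        # transformer output (assumed)
present = [t for t, p in zip(OBJECT_TYPES, probs[0]) if p > 0.5]
```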
[0050] FIG. 7 is a flowchart of an example process 700 associated with object detection and point localization in medical images using structured prediction, according to aspects of the disclosure. In some implementations, one or more process blocks of FIG. 7 may be performed by a system (e.g., system 100). In some implementations, one or more process blocks of FIG. 7 may be performed by another device or a group of devices separate from or including the system. Additionally, or alternatively, one or more process blocks of FIG. 7 may be performed by one or more components of an apparatus, such as the structured prediction model 102, processor(s) 104, or memory 110 of the system 100, any or all of which may be means for performing the operations of process 700.

[0051] As shown in FIG. 7, process 700 may include receiving, at block 702, an image comprising a plurality of points. Means for performing the operation of block 702 may include the structured prediction model 102, processor(s) 104, or memory 110 of the system 100. For example, the system 100 may receive an image comprising a plurality of points and store the image in the memory 110. In some aspects, the image may come from a database of images, e.g., from a database internal to the system 100 or an external database to which the system 100 is communicably coupled.
[0052] As further shown in FIG. 7, process 700 may include, at block 704, inputting the image into a structured prediction model for identifying objects within the image. Means for performing the operation of block 704 may include the structured prediction model 102, processor(s) 104, or memory 110 of the system 100. For example, the processor(s) 104 may transfer the image from the memory 110 into the structured prediction model 102.
[0053] As further shown in FIG. 7, process 700 may include, at block 706, receiving, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object. Means for performing the operation of block 706 may include the structured prediction model 102, processor(s) 104, or memory 110 of the system 100. For example, the structured prediction model 102 may output the set of object descriptors to the processor(s) 104, to the memory 110, or to both.
[0054] In some aspects, process 700 includes outputting the image, and a graphical representation of the set of one or more locations within the image occupied by at least one identified object, to a display. For example, the processor(s) 104 may combine the original image 106 with a graphical representation of the set of identified objects to create an annotated image, which is then sent to the display 108.
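A minimal sketch of this overlay step, using the Pillow imaging library and an assumed descriptor format (a dictionary carrying an object type and a list of (x, y) points), is shown below; the color mapping and descriptor keys are purely illustrative:

```python
from PIL import Image, ImageDraw

def annotate(image: Image.Image, descriptors: list[dict]) -> Image.Image:
    """Combine the original image with graphics for identified objects."""
    colors = {"polyline_type_a": "yellow", "polyline_type_b": "purple"}
    out = image.convert("RGB")
    draw = ImageDraw.Draw(out)
    for d in descriptors:
        color = colors.get(d["object_type"], "red")
        draw.line(d["points"], fill=color, width=2)   # interpolated path
        for x, y in d["points"]:                      # predicted points
            draw.ellipse((x - 3, y - 3, x + 3, y + 3), outline=color)
    return out

xray = Image.new("L", (512, 512))                     # stand-in X-ray
annotated = annotate(xray, [{"object_type": "polyline_type_a",
                             "points": [(100, 50), (140, 200), (180, 400)]}])
```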
[0055] In some aspects, receiving the image comprising the plurality of points comprises receiving a two-dimensional image comprising a plurality of pixels or receiving a three-dimensional image comprising a plurality of voxels. In some aspects, receiving the image comprises receiving a medical image. In some aspects, receiving the medical image comprises receiving an X-ray image. In some aspects, inputting the image into the structured prediction model comprises serializing a two-dimensional or multi-dimensional image into a one-dimensional stream of data that is input into the structured prediction model. In some aspects, inputting the image into the structured prediction model comprises downsizing a two-dimensional image to reduce the number of pixels or downsizing a three-dimensional image to reduce the number of voxels.
[0056] In some aspects, inputting the image into the structured prediction model comprises inputting the image into a transformer model that processes the image and generates the set of one or more object descriptors. In some aspects, inputting the image into the structured prediction model comprises inputting the image into a convolutional neural network (CNN) that preprocesses the image to produce an encoded image, and a transformer model that processes the encoded image and generates the set of one or more object descriptors. In some aspects, inputting the image into the structured prediction model comprises inputting the image into a recurrent neural network (RNN) that processes the image and generates the set of one or more object descriptors.
[0057] In some aspects, at least one of the one or more identified objects comprises a polyline. In some aspects, the set of one or more locations within the image occupied by the polyline defines a path of the polyline. In some aspects, at least one of the one or more identified objects comprises a volumetric object. In some aspects, the set of one or more locations within the image occupied by the volumetric object defines a perimeter or outline of the object in a two-dimensional plane.
[0058] In some aspects, process 700 includes, prior to inputting the image into a structured prediction model for identifying objects within the image, training the structured prediction model. For example, in some aspects, the structured prediction model can be trained using a set of images containing identifiable objects and having labels to indicate the type and location of the objects. In the examples shown herein, the structured prediction model 102 is trained on chest X-ray images containing polylines, but these examples are illustrative and not limiting; other types of images, including medical or non-medical, 2D or 3D, etc., may be analyzed, and the structured prediction model 102 may be trained to identify other types of objects, not just polylines.

[0059] In some aspects, the structured prediction model 102 may be trained using a training apparatus that is separate from the system 100, after which the trained model is transferred into, loaded into, or configured within the structured prediction model 102 of the system 100. In some aspects, the structured prediction model 102 may be trained in situ within the system 100, e.g., during a training session.
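A single supervised training step under these assumptions might be sketched as follows, pairing each serialized training image with a labeled target sequence that encodes object types and point locations (teacher forcing, as described for the transformer 300 above). The model, vocabulary size, and tensors are placeholders, not the disclosed training procedure.

```python
import torch
import torch.nn as nn

# Stand-in model and output projection; the disclosed model may differ.
model = nn.Transformer(d_model=128, batch_first=True)
to_logits = nn.Linear(128, 300)  # assumed 300-token output vocabulary
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(model.parameters()) + list(to_logits.parameters()),
                       lr=1e-4)

def train_step(src_emb, tgt_emb, tgt_tokens):
    """One teacher-forced step: train until output matches the target."""
    out = model(src_emb, tgt_emb)                    # (B, T, d_model)
    loss = loss_fn(to_logits(out).transpose(1, 2),   # (B, vocab, T)
                   tgt_tokens)                       # (B, T) labeled tokens
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

src = torch.randn(2, 1024, 128)        # embedded serialized training images
tgt = torch.randn(2, 32, 128)          # embedded labeled target sequences
labels = torch.randint(0, 300, (2, 32))
print(train_step(src, tgt, labels))
```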
[0060] In some aspects, once the structured prediction model 102 is considered trained, e.g., it correctly detects the objects in the training images, the structured prediction model 102 may be locked, i.e., the internal parameters of the model are protected from future changes. In some aspects, the structured prediction model 102 may be subject to continual, ongoing training even during production use. For example, if the structured prediction model 102 incorrectly identifies an object, the user of the system can flag the result, in which case the system 100 may store that image and the result for later training or retraining of the structured prediction model 102, and/or send that information to the developer of the structured prediction model 102 as feedback regarding the performance and accuracy of the system 100.
[0061] Process 700 may include additional implementations, such as any single implementation or any combination of implementations described below and/or in connection with one or more other processes described elsewhere herein. Although FIG. 7 shows example blocks of process 700, in some implementations, process 700 may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in FIG. 7. Additionally, or alternatively, two or more of the blocks of process 700 may be performed in parallel.
[0062] In some aspects, the system 100 may be a component within another system and provide additional functionality to the other system. For example, the components of system 100 may be incorporated into medical computers that connect to various imagers, such as computed tomography (CT) imagers, magnetic resonance imagers (MRI), X-ray imagers, positron emission tomography (PET) imagers, ultrasound (US) or infrared (IR) imagers, etc. The medical computers may be co-located with such imagers or may be located with doctors (including remotely), such that the connection between the imagers and medical computers may be over the internet, ethernet, a local area network, fiber optics, wireless, or the like. The medical images from the imagers may first be received by the one or more medical computers and inputted into the system 100.
[0063] FIG. 8 compares a chest X-ray 800 that was annotated by a human (left) and the same X-ray 802 that was annotated by the structured prediction model 102 (right), according to aspects of the disclosure. As can be seen in FIG. 8, the structured prediction model correctly identified the same three polylines that the human annotator identified, including two lines that overlapped each other.
[0064] In the detailed description above it can be seen that different features are grouped together in examples. This manner of disclosure should not be understood as an intention that the example clauses have more features than are explicitly mentioned in each clause. Rather, the various aspects of the disclosure may include fewer than all features of an individual example clause disclosed. Therefore, the following clauses should hereby be deemed to be incorporated in the description, wherein each clause by itself can stand as a separate example. Although each dependent clause can refer in the clauses to a specific combination with one of the other clauses, the aspect(s) of that dependent clause are not limited to the specific combination. It will be appreciated that other example clauses can also include a combination of the dependent clause aspect(s) with the subject matter of any other dependent clause or independent clause or a combination of any feature with other dependent and independent clauses. The various aspects disclosed herein expressly include these combinations, unless it is explicitly expressed or can be readily inferred that a specific combination is not intended (e.g., contradictory aspects, such as defining an element as both an electrical insulator and an electrical conductor). Furthermore, it is also intended that aspects of a clause can be included in any other independent clause, even if the clause is not directly dependent on the independent clause.
[0065] Implementation examples are described in the following numbered clauses:
[0066] Clause 1. A method for object detection and point localization in images using structured prediction, the method comprising: receiving an image comprising a plurality of points; inputting the image into a structured prediction model for identifying objects within the image; and receiving, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object.
[0067] Clause 2. The method of clause 1, further comprising: outputting the image, and a graphical representation of the set of one or more locations within the image occupied by at least one identified object, to a display.
[0068] Clause 3. The method of any of clauses 1 to 2, wherein receiving the image comprising the plurality of points comprises receiving a two-dimensional image comprising a plurality of pixels or receiving a three-dimensional image comprising a plurality of voxels.
[0069] Clause 4. The method of any of clauses 1 to 3, wherein receiving the image comprises receiving a medical image.
[0070] Clause 5. The method of clause 4, wherein receiving the medical image comprises receiving an X-ray image.
[0071] Clause 6. The method of any of clauses 1 to 5, wherein inputting the image into the structured prediction model comprises serializing a two-dimensional or multi-dimensional image into a one-dimensional stream of data that is input into the structured prediction model.
[0072] Clause 7. The method of any of clauses 1 to 6, wherein inputting the image into the structured prediction model comprises downsizing a two-dimensional image to reduce the number of pixels or downsizing a three-dimensional image to reduce the number of voxels.
[0073] Clause 8. The method of any of clauses 1 to 7, wherein inputting the image into the structured prediction model comprises inputting the image into a transformer model that processes the image and generates the set of one or more object descriptors.
[0074] Clause 9. The method of any of clauses 1 to 8, wherein inputting the image into the structured prediction model comprises inputting the image into a convolutional neural network (CNN) that preprocesses the image to produce an encoded image, and a transformer model that processes the encoded image and generates the set of one or more object descriptors.
[0075] Clause 10. The method of any of clauses 1 to 9, wherein inputting the image into the structured prediction model comprises inputting the image into a recurrent neural network (RNN) that processes the image and generates the set of one or more object descriptors.
[0076] Clause 11. The method of any of clauses 1 to 10, wherein at least one of the one or more identified objects comprises a polyline.
[0077] Clause 12. The method of clause 11, wherein the set of one or more locations within the image occupied by the polyline defines a path of the polyline.
[0078] Clause 13. The method of any of clauses 1 to 12, wherein at least one of the one or more identified objects comprises a volumetric object.
[0079] Clause 14. The method of clause 13, wherein the set of one or more locations within the image occupied by the volumetric object defines a perimeter or outline of the object in a two-dimensional plane.
[0080] Clause 15. The method of any of clauses 1 to 14, further comprising, prior to inputting the image into a structured prediction model for identifying objects within the image, training the structured prediction model.
[0081] Clause 16. A system for object detection and point localization in images using structured prediction, the system comprising: a memory; and at least one processor communicatively coupled to the memory, the at least one processor configured to: receive an image comprising a plurality of points; input the image into a structured prediction model for identifying objects within the image; and receive, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the image occupied by that identified object.

[0082] Clause 17. The system of clause 16, wherein the at least one processor is further configured to: output the image, and a graphical representation of the set of one or more locations within the image occupied by at least one identified object, to a display.
[0083] Clause 18. The system of any of clauses 16 to 17, wherein the at least one processor configured to receive the image comprising the plurality of points comprises the at least one processor configured to receive a two-dimensional image comprising a plurality of pixels or receive a three-dimensional image comprising a plurality of voxels.
[0084] Clause 19. The system of any of clauses 16 to 18, wherein the at least one processor configured to receive the image comprises the at least one processor configured to receive a medical image.
[0085] Clause 20. The system of clause 19, wherein the at least one processor configured to receive the medical image comprises the at least one processor configured to receive an X-ray image.
[0086] Clause 21. The system of any of clauses 16 to 20, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to serialize a two-dimensional or multidimensional image into a one-dimensional stream of data that is input into the structured prediction model.
[0087] Clause 22. The system of any of clauses 16 to 21, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to downsize a two-dimensional image to reduce the number of pixels or downsize a three-dimensional image to reduce the number of voxels.
[0088] Clause 23. The system of any of clauses 16 to 22, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to input the image into a transformer model that processes the image and generates the set of one or more object descriptors.
[0089] Clause 24. The system of any of clauses 16 to 23, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to input the image into a convolutional neural network (CNN) that preprocesses the image to produce an encoded image, and a transformer model that processes the encoded image and generates the set of one or more object descriptors.
[0090] Clause 25. The system of any of clauses 16 to 24, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to input the image into a recurrent neural network (RNN) that processes the image and generates the set of one or more object descriptors.
[0091] Clause 26. The system of any of clauses 16 to 25, wherein at least one of the one or more identified objects comprises a polyline.
[0092] Clause 27. The system of clause 26, wherein the set of one or more locations within the image occupied by the polyline defines a path of the polyline.
[0093] Clause 28. The system of any of clauses 16 to 27, wherein at least one of the one or more identified objects comprises a volumetric object.
[0094] Clause 29. The system of clause 28, wherein the set of one or more locations within the image occupied by the volumetric object defines a perimeter or outline of the object in a two-dimensional plane.
[0095] Clause 30. The system of any of clauses 16 to 29, wherein the at least one processor is further configured to train the structured prediction model prior to inputting the image into the structured prediction model for identifying objects within the image.
[0096] Clause 31. An apparatus comprising a memory, a transceiver, and a processor communicatively coupled to the memory and the transceiver, the memory, the transceiver, and the processor configured to perform a method according to any of clauses 1 to 15.
[0097] Clause 32. An apparatus comprising means for performing a method according to any of clauses 1 to 15.
[0098] Clause 33. A non-transitory computer-readable medium storing computer-executable instructions, the computer-executable instructions comprising at least one instruction for causing a computer or processor to perform a method according to any of clauses 1 to 15.
[0099] The methods, sequences, logic and/or algorithms described in connection with the aspects disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of non-transitory computer-readable medium known in the art. An exemplary non-transitory computer-readable medium may be coupled to the processor such that the processor can read information from, and write information to, the non-transitory computer-readable medium. In the alternative, the non-transitory computer-readable medium may be integral to the processor. The processor and the non-transitory computer-readable medium may reside in an ASIC. The ASIC may reside in an IoT device. In the alternative, the processor and the non-transitory computer-readable medium may be discrete components in a user terminal.
[0100] In one or more exemplary aspects, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a non-transitory computer-readable medium. Computer-readable media may include storage media and/or communication media including any non-transitory medium that may facilitate transferring a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of a medium. The terms disk and disc, as used herein, include CD, laser disc, optical disc, DVD, floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
[0101] While the foregoing disclosure shows illustrative aspects and embodiments, those skilled in the art will appreciate that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. Furthermore, in accordance with the various illustrative aspects and embodiments described herein, those skilled in the art will appreciate that the functions, steps, and/or actions in any methods described above and/or recited in any method claims appended hereto need not be performed in any particular order. Further still, to the extent that any elements are described above or recited in the appended claims in a singular form, those skilled in the art will appreciate that singular form(s) contemplate the plural as well unless limitation to the singular form(s) is explicitly stated.

Claims

1. A method for object detection and point localization in medical images using structured prediction, the method comprising: receiving a medical image comprising a plurality of points; inputting the medical image into a structured prediction model for identifying objects within the image, wherein the structured prediction model comprises a transformer model; and receiving, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the medical image occupied by that identified object.
2. The method of claim 1, further comprising: outputting the medical image, and a graphical representation of the set of one or more locations within the medical image occupied by at least one identified object, to a display.
3. The method of claim 1 or 2, wherein receiving the medical image comprising the plurality of points comprises receiving a two-dimensional medical image comprising a plurality of pixels or receiving a three-dimensional image comprising a plurality of voxels.
4. The method of claim 3, wherein receiving the medical image comprises receiving an X-ray image.
5. The method of any one of the preceding claims, wherein inputting the medical image into the structured prediction model comprises serializing a two-dimensional or multi-dimensional image into a one-dimensional stream of data that is input into the structured prediction model.
6. The method of any one of the preceding claims, wherein inputting the medical image into the structured prediction model comprises downsizing a two-dimensional image to reduce the number of pixels or downsizing a three-dimensional image to reduce the number of voxels.
7. The method of any one of the preceding claims, wherein inputting the medical image into the structured prediction model comprises inputting the medical image into a convolutional neural network (CNN) that preprocesses the medical image to produce an encoded image, and the transformer model that processes the encoded image.
8. The method of any one of the preceding claims, wherein at least one of the one or more identified objects comprises a polyline.
9. The method of claim 8, wherein the set of one or more locations within the image occupied by the polyline defines a path of the polyline.
10. The method of any one of the preceding claims, wherein at least one of the one or more identified objects comprises a volumetric object.
11. The method of claim 10, wherein the set of one or more locations within the medical image occupied by the volumetric object defines a perimeter or outline of the object in a two-dimensional plane.
12. The method of any one of the preceding claims, further comprising, prior to inputting the image into a structured prediction model for identifying objects within the medical image, training the structured prediction model.
13. Software that, when executed by a computer, causes the computer to perform the method of any one of the preceding claims.
14. A system for object detection and point localization in medical images using structured prediction, the system comprising: a memory; and at least one processor communicatively coupled to the memory, the at least one processor configured to: receive a medical image comprising a plurality of points; input the medical image into a structured prediction model for identifying objects within the image; and receive, from the structured prediction model, a set of one or more object descriptors, wherein each object descriptor specifies an object type for an identified object and a set of one or more locations within the medical image occupied by that identified object, and the structured prediction model comprises a transformer model.
15. The system of claim 14, wherein the at least one processor is further configured to output the medical image, and a graphical representation of the set of one or more locations within the image occupied by at least one identified object, to a display.
16. The system of claim 14 or 15, wherein the at least one processor configured to receive the medical image comprising the plurality of points comprises the at least one processor configured to receive a two-dimensional image comprising a plurality of pixels or receive a three-dimensional image comprising a plurality of voxels.
17. The system of any one of claims 14 to 16, wherein the at least one processor configured to receive the medical image comprises the at least one processor configured to receive an X-ray image.
18. The system of any one of claims 14 to 17, wherein the at least one processor configured to input the medical image into the structured prediction model comprises the at least one processor configured to serialize a two-dimensional or multi-dimensional image into a one-dimensional stream of data that is input into the structured prediction model.
19. The system of any one of claims 14 to 18, wherein the at least one processor configured to input the medical image into the structured prediction model comprises the at least one processor configured to downsize a two-dimensional image to reduce the number of pixels or downsizing a three-dimensional image to reduce the number of voxels.
20. The system of any one of claims 14 to 19, wherein the at least one processor configured to input the image into the structured prediction model comprises the at least one processor configured to input the medical image into a convolutional neural network (CNN) that preprocesses the image to produce an encoded image, and the transformer model that processes the encoded image.
21. The system of any one of claims 14 to 20, wherein at least one of the one or more identified objects comprises a polyline.
22. The system of claim 21, wherein the set of one or more locations within the medical image occupied by the polyline defines a path of the polyline.
23. The system of any one of claims 14 to 22, wherein at least one of the one or more identified objects comprises a volumetric object.
24. The system of claim 23, wherein the set of one or more locations within the medical image occupied by the volumetric object defines a perimeter or outline of the object in a two-dimensional plane.
25. The system of any one of claims 14 to 24, wherein the at least one processor is further configured to train the structured prediction model prior to inputting the medical image into the structured prediction model for identifying objects within the medical image.


