CN112734740A - Method for training target detection model, method for detecting target and device thereof - Google Patents

Method for training target detection model, method for detecting target and device thereof

Info

Publication number
CN112734740A
Authority
CN
China
Prior art keywords
loss value
prediction
value
prediction box
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110065815.8A
Other languages
Chinese (zh)
Other versions
CN112734740B (en)
Inventor
贾玉杰
李玉才
余航
王瑜
李新阳
王少康
陈宽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Infervision Medical Technology Co Ltd
Original Assignee
Infervision Medical Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Infervision Medical Technology Co Ltd filed Critical Infervision Medical Technology Co Ltd
Priority to CN202110065815.8A priority Critical patent/CN112734740B/en
Publication of CN112734740A publication Critical patent/CN112734740A/en
Application granted granted Critical
Publication of CN112734740B publication Critical patent/CN112734740B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/0002: Inspection of images, e.g. flaw detection
    • G06T 7/0012: Biomedical image inspection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/136: Segmentation; Edge detection involving thresholding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10072: Tomographic images
    • G06T 2207/10081: Computed x-ray tomography [CT]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10072: Tomographic images
    • G06T 2207/10088: Magnetic resonance imaging [MRI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30004: Biomedical image processing
    • G06T 2207/30008: Bone
    • G06T 2207/30012: Spine; Backbone
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

The invention provides a method for training a target detection model, which comprises the following steps: acquiring a feature map of a training image; determining a prediction box in the feature map, wherein the prediction box is used for indicating a target area in the training image; determining a candidate loss value of the prediction box and a weight corresponding to the candidate loss value, wherein the weight is determined according to the difference between the prediction box and the true value; determining a loss value according to the candidate loss value and the weight; and training the target detection model according to the loss value. The method in the embodiment of the invention can solve the problem in the prior art that a target detection model does not converge easily during training.

Description

Method for training target detection model, method for detecting target and device thereof
Technical Field
The invention relates to the technical field of medical equipment, in particular to a method for training a target detection model, a method for detecting a target and a device thereof.
Background
The skeleton is an important tissue and organ of the human body: it is the body's store of calcium, an important hematopoietic organ, and serves to support the body, protect internal organs, enable movement, and participate in metabolism. Vertebra segmentation operates on image data to segment the bone contours, positions, and the like of the vertebral bodies, intervertebral discs, and so on, and plays a key auxiliary role in many application fields such as treatment planning and surgical assistance.
A CT image is produced by scanning cross sections of the human body one by one with a precisely collimated X-ray beam, gamma rays, ultrasonic waves, or the like, together with a highly sensitive detector; it features fast scanning times and clear images. CT images are widely applied in clinical practice and may be used in the examination of various diseases.
At present, a common approach is to segment the vertebrae with a target detection model obtained by deep learning. However, due to the morphological characteristics of the vertebral bodies and intervertebral discs, a vertebral body generally also contains an intervertebral disc in a CT image, so the target detection model does not converge easily during training and the intervertebral disc segmentation result is not ideal.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method for training a target detection model, a method for target detection, and an apparatus thereof, so as to solve the problem in the prior art that a target detection model is not easy to converge in a training process.
In a first aspect, the present invention provides a method for training a target detection model, the method comprising:
acquiring a feature map of a training image; determining a prediction box in the feature map, wherein the prediction box is used for indicating a target area in the training image; determining a candidate loss value of the prediction box and a weight corresponding to the candidate loss value, wherein the weight is determined according to the difference between the prediction box and the true value; determining a loss value according to the candidate loss value and the weight; and training the target detection model according to the loss value.
In the embodiment of the present invention, the weight is determined according to the difference between the prediction box and the true value; that is, different weights are determined for the candidate loss values depending on the prediction result (i.e., the difference between the prediction box and the true value). The loss value is then determined according to these weights, and the target detection model is trained according to the loss value, which solves the problem in the prior art that a target detection model does not converge easily during training.
Furthermore, the target detection model obtained after training by using the method in the embodiment of the invention can also improve the accuracy of intervertebral disc and vertebral body prediction.
In one embodiment, the candidate loss values include a candidate loss value for a category of the prediction box and a candidate loss value for an offset of the prediction box.
In one embodiment, the prediction box comprises at least one of a first prediction box corresponding to a vertebral body and a second prediction box corresponding to an intervertebral disc, and the true value comprises at least one of a vertebral body and an intervertebral disc.
In one embodiment, the determining the candidate loss value of the prediction box and the weight corresponding to the candidate loss value includes:
determining the candidate loss value according to the true value; when the true value is the bounding box of an intervertebral disc and the class of the second prediction box is not an intervertebral disc, increasing the weight corresponding to the candidate loss value of the second prediction box and/or decreasing the weight corresponding to the candidate loss value of the first prediction box; setting the weights corresponding to the candidate loss value of the first prediction box and the candidate loss value of the second prediction box to 0 in the case where the true value is a vertebral body and the category of the first prediction box is a vertebral body; in other cases, the weight corresponding to the candidate loss value of the first prediction box and the weight corresponding to the candidate loss value of the second prediction box are both preset values.
In one embodiment, the determining a loss value from the candidate loss values and the weights comprises:
determining the loss value according to the increased weight corresponding to the candidate loss value of the second prediction box and/or the decreased weight corresponding to the candidate loss value of the first prediction box, together with the candidate loss values, in the case where the true value is an intervertebral disc and the category of the second prediction box is not an intervertebral disc; in the case where the true value is a vertebral body and the category of the first prediction box is a vertebral body, the loss value is 0; in other cases, the loss value is determined according to the preset value and the candidate loss value.
In one embodiment, the determining a loss value from the candidate loss values and the weights comprises determining a loss value for the class of the prediction box according to the following formula:
ClassLoss = -((1-p)^2 + log10(p))
where ClassLoss represents the loss value of the class of the prediction box, and p represents the probability that the class of the prediction box is the target type;
in the case where the true value is an intervertebral disc and the class of the second prediction box is not an intervertebral disc, a loss value for the offset of the prediction box is determined according to the following formula:
OffsetLoss = 2 * L_fl-disk + 0 * L_fl-centrum
in the case where the true value is a vertebral body and the class of the first prediction box is a vertebral body, a loss value for the offset of the prediction box is determined according to the following formula:
OffsetLoss = 0 * L_fl-disk + 0 * L_fl-centrum
in other cases, a loss value for the offset of the prediction box is determined according to the following formula:
OffsetLoss = 1 * L_fl-disk + 1 * L_fl-centrum
where OffsetLoss represents the loss value of the offset of the prediction box, L_fl-disk represents the candidate loss value of the offset of the second prediction box, and L_fl-centrum represents the candidate loss value of the offset of the first prediction box.
Alternatively, L_fl-disk and L_fl-centrum can be calculated by Focal Loss.
For example, L_fl-disk and L_fl-centrum can be calculated by the following formula (the standard Focal Loss):
L = -a * (1 - y')^γ * log(y')         if y = 1
L = -(1 - a) * (y')^γ * log(1 - y')   if y = 0
where L represents the loss value, a and γ represent the weights, y represents the true value, and y' represents the predicted value (i.e., the prediction box).
In one embodiment, the determining the prediction box in the feature map comprises:
determining a plurality of anchor frames of the feature map;
and determining, among the plurality of anchor boxes, the anchor box with the largest intersection over union (IoU) with the bounding box of the true value as the prediction box.
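For illustration, this anchor-selection step can be sketched in Python as follows; the (x1, y1, x2, y2) box format and the helper names are illustrative assumptions, not part of the claimed method:

```python
import numpy as np

def iou(box_a, box_b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_prediction_box(anchors, truth_box):
    # Pick the anchor with the largest IoU against the true-value bounding box.
    ious = [iou(a, truth_box) for a in anchors]
    return anchors[int(np.argmax(ious))]
```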
In a second aspect, a method of target detection is provided, the method comprising:
acquiring a feature map of an input image; determining a prediction box in the feature map by using a target detection model, wherein the prediction box is used for indicating a target area in the input image; wherein the target detection model is obtained by training through the method in the first aspect or any possible implementation manner of the first aspect.
Further, the target detection model obtained by training by using the method in the first aspect can improve the accuracy of intervertebral disc and vertebral body prediction.
In a third aspect, the present invention provides an apparatus for training a target detection model, where the apparatus is configured to perform the method of the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, the present invention provides an apparatus for object detection, which is configured to perform the method of the second aspect or any possible implementation manner of the second aspect.
In a fifth aspect, an apparatus for training an object detection model is provided, where the apparatus includes a storage medium, which may be a non-volatile storage medium, and a central processing unit, which stores a computer-executable program therein, and the central processing unit is connected to the non-volatile storage medium and executes the computer-executable program to implement the first aspect or the method in any possible implementation manner of the first aspect.
In a sixth aspect, an apparatus for object detection is provided, where the apparatus includes a storage medium, which may be a non-volatile storage medium, and a central processing unit, which stores a computer-executable program therein, and is connected to the non-volatile storage medium, and executes the computer-executable program to implement the method in the second aspect or any possible implementation manner of the second aspect.
In a seventh aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the first aspect or the method in any possible implementation manner of the first aspect.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the first aspect or the method in any possible implementation manner of the first aspect.
In an eighth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to perform the method of the second aspect or any possible implementation manner of the second aspect.
Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in the second aspect or any possible implementation manner of the second aspect.
A ninth aspect provides a computer readable storage medium storing program code for execution by a device, the program code comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
A tenth aspect provides a computer readable storage medium storing program code for execution by a device, the program code comprising instructions for performing the method of the second aspect or any possible implementation of the second aspect.
In the embodiment of the present invention, the weight is determined according to the difference between the prediction box and the true value; that is, different weights are determined for the candidate loss values depending on the prediction result (i.e., the difference between the prediction box and the true value). The loss value is then determined according to these weights, and the target detection model is trained according to the loss value, which solves the problem in the prior art that a target detection model does not converge easily during training.
Furthermore, the target detection model obtained after training by using the method in the embodiment of the invention can also improve the accuracy of intervertebral disc and vertebral body prediction.
Drawings
Fig. 1 is a diagram of an application scenario applicable to an embodiment of the present invention.
FIG. 2 is a schematic block diagram of a method of training a target detection model in one embodiment of the invention.
FIG. 3 is a schematic block diagram of a method of object detection in one embodiment of the present invention.
Fig. 4 is a schematic block diagram of a method of object detection in another embodiment of the present invention.
FIG. 5 is a schematic block diagram of a method of building a training data set in one embodiment of the invention.
FIG. 6 is a schematic block diagram of a cone coordinate frame in one embodiment of the invention.
FIG. 7 is a schematic block diagram of a method of training a target detection model in another embodiment of the invention.
FIG. 8 is a schematic block diagram of a method of post-processing a prediction box in one embodiment of the invention.
Fig. 9 is a schematic block diagram of an apparatus for training a target detection model according to an embodiment of the present invention.
Fig. 10 is a schematic block diagram of an apparatus for object detection according to an embodiment of the present invention.
Fig. 11 is a schematic block diagram of an apparatus for object detection according to another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method in the embodiment of the present invention may be applied to various scenes for performing image segmentation based on medical images, which is not limited in the embodiment of the present invention. For example, the method in the embodiments of the present invention may be applied to a scenario where bone segmentation is performed based on a medical image.
The medical image in the embodiment of the present invention may be a Computed Tomography (CT) image, a Magnetic Resonance Imaging (MRI) image, and the like, and the type of the medical image is not limited in the embodiment of the present invention.
Fig. 1 is a diagram of an application scenario applicable to an embodiment of the present invention.
The application scenario 100 in fig. 1 may include an image acquisition device 110 and an object detection device 120.
It should be noted that the application scenario shown in fig. 1 is only an example and is not limited, and more or fewer devices or apparatuses may be included in the application scenario shown in fig. 1, which is not limited in the embodiment of the present invention.
The image acquiring apparatus 110 may be a Computed Tomography (CT) scanner, a Magnetic Resonance Imaging (MRI) or other apparatuses or devices for capturing medical images, and the target detecting apparatus 120 may be a computer device, a server (e.g., a cloud server) or other apparatuses or devices capable of performing target detection (or image segmentation, etc.) on images.
It can be seen that the medical image according to the embodiment of the present invention is not limited, and the medical image according to the embodiment of the present invention may include a CT image, an MRI image, or other images applied in the medical field.
For example, the image acquisition device 110 may be a CT scanner, and the CT scanner may be used for performing an X-ray scan on a human tissue to obtain a CT image sequence of the human tissue.
In one embodiment, a sequence of cross-sectional CT images including bone may be obtained by scanning the bone with a CT scanner. The bone may be, for example, a spine, a tibia, a femur, a rib, a patella, or other bone tissue of an animal or human, which is not limited in the embodiments of the present invention.
For example, the object detection device 120 may be a computer device, and the computer device may be a general-purpose computer or a computer device composed of an application-specific integrated circuit, and the like, which is not limited in the embodiment of the present invention.
Those skilled in the art will appreciate that there may be one or more of the above-described computer devices, and that the types of multiple computer devices may be the same or different. The number and type of computer devices are not limited in the embodiment of the present invention.
A neural network model may be deployed in the computer device, and is used for performing target detection (or image segmentation, etc.) processing on an image to be processed (for example, a medical image obtained by the image acquisition apparatus), so as to obtain a prediction frame for indicating a target region in the image to be processed.
For example, the computer device may perform target detection on the CT image through a neural network model deployed therein (e.g., the neural network model may be a target detection model), resulting in a prediction box indicating the vertebral bodies and intervertebral discs in the image to be processed.
The computer device may be a server or a cloud server, and directly performs target detection (or image segmentation, etc.) and other processing on the image to be processed.
Alternatively, the computer device may be connected to a server (not shown in fig. 1) via a communication network. The computer device may send the CT image and the like acquired from the CT scanner to the server, perform target detection on the CT image by using the neural network model in the server, and store the target detection result as a sample image to train the neural network model in the server, so as to obtain the neural network model for target detection.
The computer device may further obtain a CT image from the server, and further perform target detection on the CT image through the neural network model to generate a prediction frame for indicating a target region in the CT image.
FIG. 2 is a schematic block diagram of a method 200 of training a target detection model according to one embodiment of the invention. The method 200 may be performed by the object detection apparatus 120 of fig. 1, and the object detection apparatus 120 may include an object detection model.
It should be understood that fig. 2 shows the steps or operations of method 200, but these steps or operations are only examples, and that other operations or variations of the individual operations of method 200 in fig. 2 may be performed by embodiments of the present invention, or that not all of the steps need be performed, or that the steps may be performed in other orders.
And S210, acquiring a characteristic diagram of the training image.
The training image may be generated from original vertebra segmentation data, which may be obtained by image segmentation of a medical image such as a Computed Tomography (CT) image.
The training image may include a true value corresponding thereto, and the true value may include at least one of a vertebral body and an intervertebral disc.
For example, the true value may be a vertebral body, or the true value may also be an intervertebral disc, or the true value may also include both a vertebral body and an intervertebral disc, which is not limited in the embodiment of the present invention.
The vertebral body may refer to a part of the vertebral column, and the specific definition may refer to the description in the prior art, which is not repeated in the embodiments of the present invention.
Optionally, the training image may be obtained from a preset training data set; for the specific method of establishing the training data set, refer to the detailed description of the embodiment in fig. 4 below, which is not repeated here.
S220, determining a prediction frame in the feature map.
Wherein the prediction box may be used to indicate a target region in the training image.
Optionally, the prediction block may include at least one of a first prediction block and a second prediction block.
For example, the first prediction box may refer to a prediction box corresponding to a vertebral body portion in the training image, and the second prediction box may refer to a prediction box corresponding to an intervertebral disc portion in the training image.
Optionally, the determining a prediction box in the feature map may include:
determining a plurality of anchor boxes of the feature map; and determining, among the plurality of anchor boxes, the anchor box with the largest intersection over union (IoU) with the bounding box of the true value as the prediction box.
The specific method for generating a plurality of anchor frames and determining a prediction frame from the plurality of anchor frames may refer to the following detailed description of the embodiment in fig. 4, and will not be described herein again.
It should be noted that the foregoing embodiment is only an example and is not limited, and in the embodiment of the present invention, other methods in the prior art may also be used to determine the prediction block in the feature map, which is not limited in the embodiment of the present invention.
And S230, determining candidate loss values of the prediction frame and weights corresponding to the candidate loss values.
Wherein the weight may be determined according to a difference between the prediction box and a true value.
The difference between the prediction box and the true value may include a difference between a class of the prediction box and a class of the true value, and an offset of a position of the prediction box with respect to a bounding box of the true value.
Optionally, the candidate loss values include a candidate loss value of a category of the prediction box and a candidate loss value of an offset of the prediction box.
For example, in the case where the training image has corresponding true values including a vertebral body and an intervertebral disc, the candidate loss values may include a loss value for the class of the first prediction box, a loss value for the class of the second prediction box, a loss value for the offset of the first prediction box with respect to the bounding box of the vertebral body in the true value, and a loss value for the offset of the second prediction box with respect to the bounding box of the intervertebral disc in the true value.
Wherein the loss value of the category of the first prediction box may refer to: a loss value determined according to the difference between the probability that the first prediction box is predicted as a vertebral body and the true value.
For example, if the probability that the first prediction box is predicted as a vertebral body is P, the difference between this probability and the true value is (1-P), and the loss value of the category of the first prediction box may then be determined according to (1-P).
The loss value of the category of the second prediction frame is similar to the determination method of the loss value of the category of the first prediction frame, and is not repeated here.
Further, the determining the candidate loss value of the prediction box and the weight corresponding to the candidate loss value may include:
determining the candidate loss value according to the true value;
when the true value is the bounding box of an intervertebral disc and the class of the second prediction box is not an intervertebral disc, increasing the weight corresponding to the candidate loss value of the second prediction box and/or decreasing the weight corresponding to the candidate loss value of the first prediction box;
setting the weights corresponding to the candidate loss value of the first prediction box and the candidate loss value of the second prediction box to 0 in the case where the true value is a vertebral body and the category of the first prediction box is a vertebral body;
in other cases, the weight corresponding to the candidate loss value of the first prediction box and the weight corresponding to the candidate loss value of the second prediction box are both preset values.
For example, in the case where the true value is an intervertebral disc, the following two cases can be classified:
the first condition is as follows:
and if the type of the second prediction frame is the intervertebral disc, the weight corresponding to the candidate loss value of the first prediction frame and the weight corresponding to the candidate loss value of the second prediction frame are preset values.
Case two:
and if the category of the second prediction frame is a cone, increasing the weight corresponding to the candidate loss value of the second prediction frame and/or decreasing the weight corresponding to the candidate loss value of the first prediction frame.
The reason for this is: the morphology of the intervertebral disc is obvious and should not be predicted incorrectly, in short: the intervertebral disc is easy to identify, and the identification of the error is penalized by increasing the weight corresponding to the loss candidate value of the second prediction box and/or decreasing the weight corresponding to the loss candidate value of the first prediction box.
For another example, in the case where the true value is a vertebral body, the following two cases can be classified:
case three:
and if the category of the first prediction frame is the intervertebral disc, the weight corresponding to the candidate loss value of the first prediction frame and the weight corresponding to the candidate loss value of the second prediction frame are both preset values.
Case four:
if the category of the first prediction frame is a pyramid, no loss is returned (that is, the weights corresponding to the candidate loss value of the first prediction frame and the candidate loss value of the second prediction frame are both set to 0).
The reason for this is: the vertebral bodies in CT images contain intervertebral discs, which are not easily distinguishable, so no loss is returned, in short: the vertebral bodies are not easy to identify and are not easy to identify, so that no punishment is required.
Optionally, the preset value may be a, where a is a positive integer.
For example, a may be 1.
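The four cases above can be condensed into a minimal Python sketch; the string labels, function name, and the default preset value of 1 are illustrative assumptions consistent with the description:

```python
def loss_weights(truth, first_box_class, second_box_class, preset=1.0):
    # Return (weight for the first/vertebral body box, weight for the second/disc box).
    if truth == 'disc' and second_box_class != 'disc':
        # The disc is easy to recognize, so misclassifying it is penalized:
        # the disc weight is increased and the vertebral body weight decreased.
        return 0.0, 2.0
    if truth == 'centrum' and first_box_class == 'centrum':
        # The vertebral body contains the disc and is hard to separate from it,
        # so no loss is returned in this case.
        return 0.0, 0.0
    # All other cases use the preset weight for both candidate loss values.
    return preset, preset
```

With these weights, the loss value is the weighted sum of the two candidate loss values, matching the offset-loss formulas given in the next step.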
S240, determining a loss value according to the candidate loss value and the weight.
Wherein the loss value can be understood as a total loss value obtained from the training image.
For example, in the case where the training image includes an intervertebral disc and a vertebral body, the loss value of the training image may be classified into a loss value of a type of a prediction frame of the training image and a loss value of an offset amount of the prediction frame of the training image.
The loss value of the class of the prediction frame of the training image may include a loss value of the class of the first prediction frame and a loss value of the class of the second prediction frame, and the loss value of the offset of the prediction frame of the training image may include a loss value of the offset of the first prediction frame and a loss value of the offset of the second prediction frame.
Optionally, the determining a loss value according to the candidate loss value and the weight may include:
determining the loss value according to the increased weight corresponding to the candidate loss value of the second prediction box and/or the decreased weight corresponding to the candidate loss value of the first prediction box, together with the candidate loss values, in the case where the true value is an intervertebral disc and the category of the second prediction box is not an intervertebral disc;
in the case where the true value is a vertebral body and the category of the first prediction box is a vertebral body, the loss value is 0;
in other cases, the loss value is determined according to the preset value and the candidate loss value.
Further, the determining a loss value from the candidate loss values and the weights may include determining a loss value for the class of the prediction box according to the following formula:
ClassLoss = -((1-p)^2 + log10(p))
where ClassLoss represents the loss value of the class of the prediction box, and p represents the probability that the class of the prediction box is the target type;
in the case where the true value is an intervertebral disc and the class of the second prediction box is not an intervertebral disc, the loss value of the offset of the prediction box may be determined according to the following formula:
OffsetLoss = 2 * L_fl-disk + 0 * L_fl-centrum
in the case where the true value is a vertebral body and the class of the first prediction box is a vertebral body, the loss value of the offset of the prediction box is determined according to the following formula:
OffsetLoss = 0 * L_fl-disk + 0 * L_fl-centrum
in other cases, the loss value of the offset of the prediction box may be determined according to the following formula:
OffsetLoss = 1 * L_fl-disk + 1 * L_fl-centrum
where OffsetLoss represents the loss value of the offset of the prediction box, L_fl-disk represents the candidate loss value of the offset of the second prediction box, and L_fl-centrum represents the candidate loss value of the offset of the first prediction box.
Alternatively, L_fl-disk and L_fl-centrum can be calculated by Focal Loss.
For example, L_fl-disk and L_fl-centrum can be calculated by the following formula (the standard Focal Loss):
L = -a * (1 - y')^γ * log(y')         if y = 1
L = -(1 - a) * (y')^γ * log(1 - y')   if y = 0
where L represents the loss value, a and γ represent the weights, y represents the true value, and y' represents the predicted value (i.e., the prediction box).
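For reference, a minimal sketch of the Focal Loss with which L_fl-disk and L_fl-centrum may be computed, assuming binary 0/1 true values; the function name and the defaults a = 0.25, γ = 2 are illustrative assumptions:

```python
import numpy as np

def focal_loss(y_true, y_pred, a=0.25, gamma=2.0, eps=1e-7):
    # Binary Focal Loss averaged over all elements.
    # y_true: 0/1 true values y; y_pred: predicted probabilities y'.
    y_true = np.asarray(y_true)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    pos = -a * (1.0 - y_pred) ** gamma * np.log(y_pred)        # branch for y = 1
    neg = -(1.0 - a) * y_pred ** gamma * np.log(1.0 - y_pred)  # branch for y = 0
    return float(np.mean(np.where(y_true == 1, pos, neg)))
```

The offset loss is then a weighted sum of two such terms, e.g. 2 * focal_loss(...) for the disc plus 0 * focal_loss(...) for the vertebral body in the first case above.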
And S250, training the target detection model according to the loss value.
After the object detection model has been trained by the method 200 described above, the object detection model can be used to detect vertebral bodies and intervertebral discs in an input image (e.g., a CT image), as shown below in the method 300 of fig. 3.
FIG. 3 is a schematic block diagram of a method 300 of object detection in accordance with one embodiment of the present invention.
The method 300 may be executed by the object detection apparatus 120 in fig. 1, or the method 300 may also be executed by a server or a cloud server (not shown in fig. 1), which is not limited in the embodiment of the present invention.
It should be understood that fig. 3 shows the steps or operations of method 300, but these steps or operations are only examples, and that other operations or variations of the individual operations of method 300 in fig. 3 may be performed by embodiments of the present invention, or that not all of the steps need be performed, or that the steps may be performed in other orders.
And S310, acquiring a characteristic diagram of the input image.
S320, determining a prediction box in the feature map by using a target detection model, wherein the prediction box is used for indicating a target area in the input image;
the target detection model may be obtained by training based on any embodiment of the method 200 shown in fig. 2.
The prediction boxes output by the target detection model may include a two-dimensional (2D) prediction box of the vertebral body and a 2D prediction box of the intervertebral disc; these may further be converted into a three-dimensional (3D) prediction box of the vertebral body and a 3D prediction box of the intervertebral disc. For the specific conversion method, refer to the embodiment in fig. 8 below, which is not repeated here.
FIG. 4 is a schematic block diagram of a method 400 of object detection in accordance with one embodiment of the present invention.
The method 400 may be executed by the object detection apparatus 120 in fig. 1, or the method 400 may also be executed by a server or a cloud server (not shown in fig. 1), which is not limited in the embodiment of the present invention.
It should be understood that fig. 4 shows the steps or operations of method 400, but these steps or operations are only examples, and other operations or variations of the individual operations of method 400 in fig. 4 may be performed by embodiments of the present invention, or not all of the steps need be performed, or the steps may be performed in other orders.
And S410, establishing a training data set.
The training data set may include training images and their corresponding truth values, which may include at least one of vertebral bodies and intervertebral discs.
A vertebra may refer to a portion of the spine, and the vertebral body may refer to a portion of a vertebra.
The training data set may be generated from original vertebra segmentation data, which may be obtained by image segmentation of medical images such as Computed Tomography (CT) images.
The training data set may be established, for example, by the method 500 shown in FIG. 5.
As shown in fig. 5, the original vertebra segmentation data may include vertebra segmentation data and intervertebral disc segmentation data; at this stage the two are not distinguished and are processed uniformly.
As shown in fig. 5, the method 500 may include steps 510, 520, 530, 540, 550 and 560, which are as follows:
and S510, adjusting the size (resize).
For example, the raw data may include vertebra segmentation data (i.e., vertebra images) and intervertebral disc segmentation data (i.e., disc images). The raw data may be resized: each image is resized to 512 × 512, and the slice thickness is resampled to 1.5 millimeters (mm).
And S520, converting the window width and the window level.
For example, window width/window level conversion may be performed to convert the window level (center) of each image to 300 and the window width (width) of each image to 1500.
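A minimal sketch of this window width/window level conversion, assuming the input is a NumPy array in Hounsfield units; scaling the windowed values to the 0-255 range is an illustrative choice:

```python
import numpy as np

def apply_window(hu_image, center=300.0, width=1500.0):
    # Clip a CT image (in Hounsfield units) to the window and scale to [0, 255].
    low = center - width / 2.0
    high = center + width / 2.0
    windowed = np.clip(hu_image, low, high)
    return ((windowed - low) / (high - low) * 255.0).astype(np.uint8)
```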
S530, acquiring a coordinate frame (box).
For example, for the intervertebral disc segmentation data, based on the results of S510 and S520, the minimum enclosing rectangle of the disc segmentation in each image is obtained, and these rectangles are used as the final intervertebral disc coordinate frames.
For another example, for the vertebra segmentation data, the minimum enclosing rectangle of the vertebra segmentation in each image may be obtained based on the results of S510 and S520, and these rectangular frames are then cropped to leave only the enclosing rectangle of the vertebral body.
And S540, cutting the cervical vertebral body.
Because the cervical vertebral bodies are smaller and differ in shape from the other vertebral bodies, they are treated separately.
For example, as shown in the left diagram of fig. 6, the middle 1/3 in the x direction and the front 1/3 in the y direction of each cervical vertebra frame can be cut out according to the empirical values as the final cervical vertebra coordinate frame.
And S550, cutting other vertebral bodies.
For other vertebral bodies, as shown in the right drawing of fig. 6, the middle 1/3 in the x direction and the front 1/2 in the y direction can be cut according to empirical values as the final vertebral body coordinate frame.
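The cropping in S540 and S550 can be sketched as follows, assuming a coordinate frame given as (x1, y1, x2, y2) with y increasing from front to back; the fractions are the empirical values quoted above, while the box format and function name are assumptions:

```python
def crop_vertebral_frame(box, is_cervical):
    # Keep the middle 1/3 in x and the front 1/3 (cervical) or 1/2 (other) in y.
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    y_fraction = 1.0 / 3.0 if is_cervical else 1.0 / 2.0
    return (x1 + w / 3.0, y1, x2 - w / 3.0, y1 + h * y_fraction)
```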
And S560, outputting the coordinate frames of the intervertebral disc and the vertebral body.
Through the processing of the above steps, the coordinate frames of the intervertebral discs and of the vertebral bodies (including the cervical vertebral bodies and the other vertebral bodies) are generated; the intervertebral disc class with its corresponding coordinate frames and the vertebral body class with its corresponding coordinate frames can then be set respectively, and the final training data set is generated.
And S420, training a target detection model.
At this time, the target detection model may be trained using the method 200 in fig. 2 or the method 700 in fig. 7 described below based on the training data set established at S410 described above.
The process of training the target detection model is described in detail below in conjunction with method 700 of FIG. 7.
As shown in fig. 7, the method 700 may include steps 710, 720, 730, 740, 750, and 760, as follows:
s710, feature extraction
Optionally, a feature map of a training image may be extracted; the training image may be acquired from the training data set established in S410 and may be a 3D image (e.g., a 3D CT image).
Optionally, the feature map of the training image may be extracted by a feature extraction module.
The feature extraction module may be composed of a plurality of sets of convolutional layers, pooling layers, activation layers, and feature extraction layers connected in series, and may extract feature information that can be used for detecting or segmenting vertebral bodies and intervertebral discs from the training image, where the feature information may be referred to as a feature map, that is, a feature map of the training image.
S720, up-sampling
Optionally, the feature map extracted in S710 may be up-sampled to extract features of objects with different scales.
Optionally, the feature map may be upsampled by an upsampling module.
In general, the feature map becomes smaller and smaller during feature extraction, which is not beneficial for segmenting small targets; the up-sampling module can therefore enlarge the feature map and add it to the feature map of the corresponding size from the extraction process, which is beneficial for the subsequent detection (or segmentation).
S730, generating an anchor frame
Alternatively, a plurality of anchor boxes of different sizes and aspect ratios may be generated from the feature map. The specific method for generating the anchor frame may refer to the prior art, and is not described herein again.
Optionally, a plurality of anchor frames may be generated from the feature map by an anchor frame generation module.
The anchor frame generation module may set a plurality of anchor frames having different scales or aspect ratios for each cell (here, a cell may refer to a cell region of a preset size in the feature map) centered on each pixel in the feature map, and the boundaries of the prediction frame are based on these anchor frames. This process may reduce the training difficulty to some extent. Typically, multiple anchor boxes may be provided for each cell, with the multiple anchor boxes varying in size and aspect ratio.
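A minimal sketch of such per-cell anchor generation; the particular scales and aspect ratios are illustrative assumptions, and the mapping from feature-map cells back to image coordinates via the stride is omitted for brevity:

```python
def generate_anchors(feat_h, feat_w, scales=(16, 32, 64), ratios=(0.5, 1.0, 2.0)):
    # Generate (x1, y1, x2, y2) anchors centered on every feature-map cell.
    anchors = []
    for cy in range(feat_h):
        for cx in range(feat_w):
            for s in scales:
                for r in ratios:
                    w = s * (r ** 0.5)  # width grows with the aspect ratio
                    h = s / (r ** 0.5)  # height shrinks correspondingly
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```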
S740, distributing anchor frame
Optionally, the targets to be detected may be assigned to the generated anchor boxes (e.g., each anchor box may be given an annotated category and offset).
In the training dataset, each anchor frame may be regarded as a training sample, and in order to train the target detection model, two types of labels may be labeled for each anchor frame: the first type is the class of objects contained by the anchor frame (e.g., vertebral bodies or intervertebral discs); the second type is the offset of the real bounding box relative to the anchor box.
Based on the multiple anchor boxes generated in S730, the similarity between two bounding boxes can be measured by the intersection over union (IoU); a category and an offset are then predicted for each anchor box, the anchor box position is adjusted according to the predicted offset to obtain a prediction box, and finally the prediction boxes to be output are screened.
S750, coordinate frame classification
Optionally, the anchor frame may be classified.
Optionally, the anchor frame may be classified by a coordinate frame classification module.
For example, the anchor boxes may be classified by the coordinate frame classification module, and the true classification of each anchor box may be determined according to the IoU between the anchor box and the annotation result (the bounding box of the true value). The classification error of the class is then calculated for the gradient optimization of the model in the current training round.
The classification error formula can be as follows:
ClassLoss = -((1-p)^2 + log10(p))
wherein, ClassLoss represents the loss value of the classification error, and p represents the probability that the category of the anchor frame is the target classification. The closer the value of p is to 1, the smaller the classification error can be considered.
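As a short sketch of this classification error; clipping p away from 0 is an added numerical safeguard, not part of the formula:

```python
import math

def class_loss(p, eps=1e-7):
    # ClassLoss = -((1 - p)^2 + log10(p)); p is the probability of the target class.
    p = max(p, eps)
    return -((1.0 - p) ** 2 + math.log10(p))
```

As p approaches 1, both terms approach 0, so a confident correct classification incurs almost no loss.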
S760, coordinate frame regression
Optionally, the anchor frame may be regressed by a coordinate frame regression module.
For example, the offset of each anchor box may be generated by the coordinate frame regression module, and the offset error of the model may be calculated by comparing the offset of the anchor box with the position of the real bounding box, for the gradient optimization of the current training round of the model.
The offset error of the anchor box is calculated by the Bias Focal Loss, and the specific calculation method is as follows:
The offset of the anchor box is the distance between the anchor box and the real target area. Because the numbers of intervertebral disc and vertebral body samples are far smaller than the number of negative samples, the regression error is calculated using Focal Loss, which can alleviate the severe imbalance between the proportions of positive and negative samples in target detection.
The specific calculation formula of Focal Loss may be as follows:
L = -a * (1 - y')^γ * log(y')         if y = 1
L = -(1 - a) * (y')^γ * log(1 - y')   if y = 0
where L represents the loss value, a and γ represent the weights, y represents the true value, and y' represents the predicted value (i.e., the prediction box).
At this time, the calculation formula of the total loss may be as follows:
Loss = ω1 * L_fl-disk + ω2 * L_fl-centrum
where Loss denotes the total loss, ω1 and ω2 are weight coefficients, L_fl-disk denotes the loss of the intervertebral disc, and L_fl-centrum denotes the loss of the vertebral body.
Optionally, the intervertebral disc loss and the vertebral body loss can both be calculated by the Focal Loss formula above.
By default, the weight coefficients ω1 and ω2 may both be 1 (here, 1 may be regarded as the preset value), that is, the total loss is the loss of the vertebral body plus the loss of the intervertebral disc. In this case, the calculation formula of the total loss may be as follows:
Loss = 1 * L_fl-disk + 1 * L_fl-centrum
where Loss represents the total loss, L_fl-disk represents the loss of the intervertebral disc, and L_fl-centrum represents the loss of the vertebral body.
However, for the detection of intervertebral discs and vertebral bodies, the Focal Loss may not converge easily during training: for example, when the intervertebral disc loss is optimized, the vertebral body loss may increase, and when the vertebral body loss is optimized, the intervertebral disc loss may increase.
Therefore, based on the characteristics that the vertebral body in a CT image contains the intervertebral disc, while the shape of the intervertebral disc is distinctive and can be distinguished from the vertebral body, the embodiment of the present invention adopts the Bias Focal Loss to calculate the regression error. The specific calculation formulas may be as follows:
OffsetLoss = 2 * L_fl-disk + 0 * L_fl-centrum    (a)
OffsetLoss = 0 * L_fl-disk + 0 * L_fl-centrum    (b)
OffsetLoss = 1 * L_fl-disk + 1 * L_fl-centrum    (c)
where OffsetLoss represents the regression error of the anchor box (i.e., the prediction box), L_fl-disk represents the loss of the intervertebral disc, and L_fl-centrum represents the loss of the vertebral body.
As can be seen from the above formulas, in the case where the detection target is an intervertebral disc but it is not predicted as an intervertebral disc, the loss of the intervertebral disc is increased (the weight coefficient corresponding to the intervertebral disc loss is raised from the default value 1 to 2 in formula (a)) and the loss of the vertebral body is decreased (the weight coefficient corresponding to the vertebral body loss is lowered from the default value 1 to 0 in formula (a)). The reason for this is: the morphology of the intervertebral disc is distinctive and it should not be predicted incorrectly.
In the case where the prediction target is a vertebral body and it is also predicted as a vertebral body, no loss may be returned (as in formula (b) above, each weight coefficient is 0).
In other cases, the default value may be used for each weight coefficient (each weight coefficient is 1 in formula (c) above). The reason for this is: in a CT image the vertebral body contains the intervertebral disc, and the two are not easy to distinguish.
In addition, since the CT image itself contains three-dimensional information, in order to use the three-dimensional image information, each input of the training data is a CT image of 2N +1 successive layers, N being an integer greater than or equal to 1.
In this case, when the feature extraction module extracts features, the middle layer of the CT image may be used as a key frame, and the other upper and lower N layers may be used as adjacent related information.
And S430, performing post-processing on the prediction frame output by the target detection model.
At this time, a prediction frame (e.g., a 2D coordinate frame) output by the object detection model may be post-processed, and the process of the post-processing will be described in detail below with reference to the method 800 of fig. 8.
As shown in fig. 8, the method 800 may include steps 810, 820, 830, 840, 850, 860, and 870, as follows:
and S810, removing the false positive.
Optionally, false positives may be removed by computing IoU; this calculation need not distinguish between vertebral bodies and intervertebral discs.
For example, the computational logic may be as follows. The prediction boxes are first grouped according to IoU: the first coordinate frame is added to a new group by default; from the second coordinate frame on, if its IoU with the coordinate frames in all current groups is smaller than a preset threshold, a new group is created for it, otherwise it is added to an existing group; this continues until the last coordinate frame has been grouped.
At this point N groups are obtained, and the group with the largest number of coordinate frames is selected as the coordinate frame set of the intervertebral discs and vertebral bodies.
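This grouping logic can be sketched as follows under the same (x1, y1, x2, y2) box assumption; the threshold is a configurable, illustrative value:

```python
def group_boxes_by_iou(boxes, threshold=0.5):
    # Group prediction boxes by IoU and return the largest group,
    # which is taken as the coordinate frame set of discs and vertebral bodies.
    def _iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    groups = []
    for box in boxes:
        for group in groups:
            if any(_iou(box, member) >= threshold for member in group):
                group.append(box)  # joins the first group it overlaps with
                break
        else:
            groups.append([box])   # overlaps no existing group: start a new one
    return max(groups, key=len) if groups else []
```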
And S820, merging the coordinate frames of the intervertebral discs.
Based on the processing result of S810, the coordinate frames classified as intervertebral discs are selected and merged in 3D: the coordinate frames are traversed layer by layer from head to foot; if the layer difference from the previous layer does not exceed a preset threshold, the coordinate frame is merged into the same intervertebral disc, otherwise it is regarded as a new intervertebral disc. This continues until the 3D prediction boxes of all intervertebral discs are obtained.
And S830, interpolating the missing layer in the coordinate frame of the cone.
Based on the processing result of S810, the coordinate frames classified as vertebral bodies are selected and merged in 3D; the basic logic is the same as that of S820, and the missing layers of the vertebral bodies are detected and interpolated.
S840, smoothing the interpolated coordinate frames.
Based on the processing result of S830, a weighted average can be used to smooth the coordinates across the layers, ensuring that the prediction boxes of the vertebral bodies are relatively coherent and do not jump back and forth laterally.
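A minimal sketch of such weighted-average smoothing, assuming one box per layer stored as (x1, y1, x2, y2) and a three-layer window; the weights are illustrative:

```python
import numpy as np

def smooth_boxes(layer_boxes, weights=(0.25, 0.5, 0.25)):
    # Smooth box coordinates across adjacent layers with a weighted moving average.
    boxes = np.asarray(layer_boxes, dtype=float)
    smoothed = boxes.copy()
    for i in range(1, len(boxes) - 1):
        smoothed[i] = (weights[0] * boxes[i - 1]
                       + weights[1] * boxes[i]
                       + weights[2] * boxes[i + 1])
    return smoothed
```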
S850, cutting the vertebral body through the intervertebral disc.
And S860, acquiring the key points of the vertebral body.
Based on the processing result of S850, the center point of each vertebral body 3D prediction box is selected as the key point of that vertebral body, yielding a vertebral body key point list. At this point, the 3D prediction boxes of the intervertebral discs and vertebral bodies and the vertebral body key point list have all been generated.
And S870, outputting the 3D prediction frame and the key point list of the vertebral body.
Through the processing of the steps of method 800 in fig. 8, a 3D prediction frame of the vertebral bodies and discs and a list of key points of the vertebral bodies may be obtained.
And S440, carrying out online detection on the vertebral body and the intervertebral disc by using the trained target detection model.
At the moment, the trained target detection model can be used for carrying out online detection on the vertebral body and the intervertebral disc.
During the online detection of the intervertebral discs and vertebral bodies, the images of a CT sequence can be detected one by one. For a CT image with index i, the N consecutive images adjacent to it on each side are selected to form 2N+1 channels of data, which are input into the model as the data to be predicted.
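A sketch of assembling the 2N+1-channel input for image i; clamping the indices at the ends of the sequence is an assumption, since the boundary handling is not specified:

```python
import numpy as np

def build_input(ct_sequence, i, n=1):
    # Stack image i with its N neighbors on each side into a (2N+1, H, W) array.
    last = len(ct_sequence) - 1
    idx = [min(max(j, 0), last) for j in range(i - n, i + n + 1)]
    return np.stack([ct_sequence[j] for j in idx], axis=0)
```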
The prediction model outputs the coordinates of the prediction boxes, the corresponding classification categories and confidence values, and a vertebral body key point list derived from the vertebral body prediction boxes. By setting a confidence threshold, low-confidence detection results are screened out, so that most false-positive detections can be excluded.
To exclude the adverse effect of objects outside the body on the model, the body can be segmented by threshold segmentation during prediction. For example, the segmentation result may include only muscle and bone tissue, excluding the internal organs of the human body, clothing, and the like. If a prediction result lies within the body segmentation result, it is retained; otherwise it is judged a false positive and filtered from the output.
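A combined sketch of the confidence screening and the body-mask filtering; testing the box center against the mask and the example threshold of 0.5 are assumptions for the sketch:

def filter_predictions(boxes, scores, body_mask, score_threshold=0.5):
    # boxes: iterable of (x1, y1, x2, y2); body_mask: boolean array of shape (H, W).
    kept = []
    for box, score in zip(boxes, scores):
        if score < score_threshold:
            continue                      # low-confidence detections are screened out
        cx = int((box[0] + box[2]) / 2)
        cy = int((box[1] + box[3]) / 2)
        if body_mask[cy, cx]:             # retained only inside the segmented body
            kept.append((box, score))
    return kept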
Fig. 9 is a schematic block diagram of an apparatus 900 for training a target detection model according to an embodiment of the present invention. It should be understood that the apparatus 900 for training the target detection model shown in fig. 9 is only an example, and the apparatus 900 of the embodiment of the present invention may further include other modules or units.
It should be understood that the apparatus 900 is capable of performing the various steps in the methods of fig. 2 and 4, and will not be described in detail herein to avoid repetition.
In one possible implementation manner of the present invention, the apparatus includes:
an obtaining unit 910, configured to obtain a feature map of a training image;
a first determining unit 920, configured to determine a prediction box in the feature map, where the prediction box is used to indicate a target region in the training image;
a second determining unit 930, configured to determine a candidate loss value of the prediction box and a weight corresponding to the candidate loss value, where the weight is determined according to a difference between the prediction box and the true value;
a loss value determining unit 940 for determining a loss value according to the candidate loss value and the weight;
a training unit 950, configured to train the target detection model according to the loss value.
It should be appreciated that the apparatus 900 for training an object detection model herein is embodied in the form of a functional module. The term "module" herein may be implemented in software and/or hardware, and is not particularly limited thereto. For example, a "module" may be a software program, a hardware circuit, or a combination of both that implements the functionality described above. The hardware circuitry may include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared processor, a dedicated processor, or a group of processors) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality.
As an example, the apparatus 900 for training a target detection model according to an embodiment of the present invention may be a processor or a chip, and is configured to perform the method according to an embodiment of the present invention.
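As an illustration of how the loss value determining unit 940 might combine the candidate loss values with the weights produced by the second determining unit 930, the following Python sketch follows the three offset-loss weighting cases described in this embodiment; the function signature and the boolean flags are assumptions, not a prescribed implementation:

def weighted_offset_loss(l_fl_disk, l_fl_centrum,
                         truth_is_disc, truth_is_body,
                         second_box_is_disc, first_box_is_body):
    # l_fl_disk / l_fl_centrum: candidate offset losses of the second (disc)
    # and first (vertebral body) prediction boxes.
    if truth_is_disc and not second_box_is_disc:
        w_disk, w_centrum = 2.0, 0.0   # up-weight the disc box, drop the body box
    elif truth_is_body and first_box_is_body:
        w_disk, w_centrum = 0.0, 0.0   # correct vertebral-body prediction: loss is 0
    else:
        w_disk, w_centrum = 1.0, 1.0   # default preset weights
    return w_disk * l_fl_disk + w_centrum * l_fl_centrum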
Fig. 10 is a schematic block diagram of an apparatus 1000 for object detection according to an embodiment of the present invention. It should be understood that the apparatus 1000 for object detection illustrated in fig. 10 is only an example, and the apparatus 1000 of the embodiment of the present invention may further include other modules or units.
It should be understood that the apparatus 1000 is capable of performing the various steps in the methods of fig. 3 and 4, and will not be described in detail herein to avoid repetition.
In one possible implementation manner of the present invention, the apparatus includes:
an acquisition unit 1010 configured to acquire a feature map of an input image;
a determining unit 1020, configured to determine a prediction box in the feature map by using a target detection model, where the prediction box is used to indicate a target region in the input image;
for a detailed training process of the apparatus 1000, reference may be made to the embodiments of the method 200 and the method 400, which are not described herein again.
It should be understood that the apparatus 1000 for object detection herein is embodied in the form of functional modules. The term "module" herein may be implemented in software and/or hardware, and is not particularly limited thereto. For example, a "module" may be a software program, a hardware circuit, or a combination of both that implements the functionality described above. The hardware circuitry may include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared processor, a dedicated processor, or a group of processors) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality.
As an example, the apparatus 1000 for object detection provided by the embodiment of the present invention may be a processor or a chip, and is configured to perform the method according to the embodiment of the present invention.
FIG. 11 is a schematic block diagram of an apparatus 400 for object detection in accordance with one embodiment of the present invention. The apparatus 400 shown in fig. 11 includes a memory 401, a processor 402, a communication interface 403, and a bus 404. The memory 401, the processor 402 and the communication interface 403 are connected to each other by a bus 404.
The memory 401 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 401 may store a program, and when the program stored in the memory 401 is executed by the processor 402, the processor 402 is configured to perform the steps of the method for training the object detection model and the method for object detection according to the embodiments of the present invention, for example, the steps of the embodiments shown in fig. 2, fig. 3 and fig. 4 may be performed.
The processor 402 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the method for training a target detection model and the method for target detection according to the embodiment of the present invention.
The processor 402 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the method for training the target detection model and the method for target detection according to the embodiment of the present invention may be implemented by hardware integrated logic circuits in the processor 402 or instructions in the form of software.
The processor 402 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 401; the processor 402 reads the information in the memory 401 and, in combination with its hardware, completes the functions that need to be executed by the units included in the apparatus for object detection in the embodiment of the present invention, or executes the method for training an object detection model and the method for object detection in the method embodiments of the present invention; for example, the steps/functions of the embodiments shown in fig. 2, fig. 3, and fig. 4 may be executed.
The communication interface 403 may use transceiver means, such as, but not limited to, a transceiver, to enable communication between the apparatus 400 and other devices or communication networks.
Bus 404 may include a path that transfers information between various components of apparatus 400 (e.g., memory 401, processor 402, communication interface 403).
It should be understood that the apparatus 400 shown in the embodiment of the present invention may be a processor or a chip for executing the method for training the target detection model and the method for target detection described in the embodiment of the present invention.
It should be understood that the processor in the embodiments of the present invention may be a Central Processing Unit (CPU), and the processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should be understood that the term "and/or" herein is merely one type of association relationship that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. In addition, the "/" in this document generally indicates that the former and latter associated objects are in an "or" relationship, but may also indicate an "and/or" relationship, which may be understood with particular reference to the former and latter text.
In the present invention, "at least one" means one or more, "a plurality" means two or more. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and the like that are within the spirit and principle of the present invention are included in the present invention.

Claims (13)

1. A method of training a target detection model, comprising:
acquiring a feature map of a training image;
determining a prediction box in the feature map, wherein the prediction box is used for indicating a target area in the training image;
determining a candidate loss value of the prediction box and a weight corresponding to the candidate loss value, wherein the weight is determined according to the difference between the prediction box and the true value;
determining a loss value according to the candidate loss value and the weight;
and training the target detection model according to the loss value.
2. The method of claim 1, wherein the candidate loss values comprise candidate loss values for a class of the prediction box and candidate loss values for an offset of the prediction box.
3. The method of claim 2, wherein the prediction box comprises at least one of a first prediction box corresponding to a vertebral body and a second prediction box corresponding to an intervertebral disc, and wherein the true value comprises at least one of a vertebral body and an intervertebral disc.
4. The method of claim 3, wherein determining the candidate loss value of the prediction box and the weight corresponding to the candidate loss value comprises:
determining the candidate loss value according to the true value;
when the true value is an intervertebral disc and the category of the second prediction box is not intervertebral disc, increasing the weight corresponding to the candidate loss value of the second prediction box and/or decreasing the weight corresponding to the candidate loss value of the first prediction box;

when the true value is a vertebral body and the category of the first prediction box is vertebral body, setting the weights corresponding to the candidate loss value of the first prediction box and the candidate loss value of the second prediction box to 0; and

in other cases, setting both the weight corresponding to the candidate loss value of the first prediction box and the weight corresponding to the candidate loss value of the second prediction box to a preset value.
5. The method of claim 4, wherein determining a loss value according to the candidate loss value and the weight comprises:
when the true value is an intervertebral disc and the category of the second prediction box is not intervertebral disc, determining the loss value according to the increased weight corresponding to the candidate loss value of the second prediction box and/or the decreased weight corresponding to the candidate loss value of the first prediction box, together with the candidate loss values;

when the true value is a vertebral body and the category of the first prediction box is vertebral body, the loss value is 0; and

in other cases, determining the loss value according to the preset value and the candidate loss values.
6. The method of claim 5, wherein determining a loss value according to the candidate loss value and the weight comprises: determining the loss value of the class of the prediction box according to the following formula:
ClassLoss = -((1 - p)^2 + log10(p))
wherein ClassLoss represents the loss value of the class of the prediction box, and p represents the probability that the class of the prediction box is the target type;
when the true value is an intervertebral disc and the category of the second prediction box is not intervertebral disc, determining the loss value of the offset of the prediction box according to the following formula:
OffsetLoss = 2 * L_fl-disk + 0 * L_fl-centrum
when the true value is a vertebral body and the category of the first prediction box is vertebral body, determining the loss value of the offset of the prediction box according to the following formula:
OffsetLoss = 0 * L_fl-disk + 0 * L_fl-centrum
in other cases, determining the loss value of the offset of the prediction box according to the following formula:
OffsetLoss = 1 * L_fl-disk + 1 * L_fl-centrum
wherein OffsetLoss represents the loss value of the offset of the prediction box, L_fl-disk represents the candidate loss value of the offset of the second prediction box, and L_fl-centrum represents the candidate loss value of the offset of the first prediction box.
7. The method of claim 6, wherein determining the prediction box in the feature map comprises:
determining a plurality of anchor frames of the feature map;
and determining, as the prediction box, the anchor box among the plurality of anchor boxes having the largest intersection-over-union (IoU) with the bounding box of the true value.
8. A method of target detection, comprising:
acquiring a feature map of an input image;
determining a prediction box in the feature map by using a target detection model, wherein the prediction box is used for indicating a target region in the input image;
wherein the target detection model is obtained after training by the method of any one of claims 1 to 7.
9. An apparatus for training an object detection model, the apparatus being configured to perform the method of any one of claims 1 to 7.
10. An apparatus for object detection, characterized in that the apparatus is adapted to perform the method of claim 8.
11. An apparatus for training an object detection model, comprising a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions to perform the method of any of claims 1-7.
12. An apparatus for object detection comprising a processor and a memory, the memory for storing program instructions, the processor for invoking the program instructions to perform the method of claim 8.
13. A computer-readable storage medium, in which program instructions are stored, which, when executed by a processor, implement the method of any one of claims 1 to 8.
CN202110065815.8A 2021-01-18 2021-01-18 Method for training target detection model, target detection method and device Active CN112734740B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110065815.8A CN112734740B (en) 2021-01-18 2021-01-18 Method for training target detection model, target detection method and device


Publications (2)

Publication Number Publication Date
CN112734740A true CN112734740A (en) 2021-04-30
CN112734740B CN112734740B (en) 2024-02-23

Family

ID=75592353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110065815.8A Active CN112734740B (en) 2021-01-18 2021-01-18 Method for training target detection model, target detection method and device

Country Status (1)

Country Link
CN (1) CN112734740B (en)


Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596904A (en) * 2018-05-07 2018-09-28 北京长木谷医疗科技有限公司 The method for generating the method for location model and spinal sagittal bit image being handled
US20190228658A1 (en) * 2018-01-24 2019-07-25 National Chung Cheng University Method for establishing a parking space detection model and method of detecting parking spaces for a parking lot
CN110084253A (en) * 2019-05-05 2019-08-02 厦门美图之家科技有限公司 A method of generating object detection model
CN110246216A (en) * 2019-05-28 2019-09-17 中国科学院深圳先进技术研究院 Spine model generation method, spine model generate system and terminal
CN110288615A (en) * 2019-06-28 2019-09-27 浙江明峰智能医疗科技有限公司 A kind of sloped position frame localization method based on deep learning
CN110490177A (en) * 2017-06-02 2019-11-22 腾讯科技(深圳)有限公司 A kind of human-face detector training method and device
CN110648762A (en) * 2019-09-06 2020-01-03 苏州浪潮智能科技有限公司 Method and device for generating lesion area identification model and method and device for identifying lesion area
CN110969245A (en) * 2020-02-28 2020-04-07 北京深睿博联科技有限责任公司 Target detection model training method and device for medical image
US20200126297A1 (en) * 2018-10-17 2020-04-23 Midea Group Co., Ltd. System and method for generating acupuncture points on reconstructed 3d human body model for physical therapy
CN111063424A (en) * 2019-12-25 2020-04-24 上海联影医疗科技有限公司 Intervertebral disc data processing method and device, electronic equipment and storage medium
KR102112859B1 (en) * 2020-01-02 2020-05-19 셀렉트스타 주식회사 Method for training a deep learning model for a labeling task and apparatus using the same
CN111626295A (en) * 2020-07-27 2020-09-04 杭州雄迈集成电路技术股份有限公司 Training method and device for license plate detection model
CN111626350A (en) * 2020-05-25 2020-09-04 腾讯科技(深圳)有限公司 Target detection model training method, target detection method and device
CN112001428A (en) * 2020-08-05 2020-11-27 中国科学院大学 Anchor frame-free target detection network training method based on feature matching optimization
CN112102282A (en) * 2020-09-11 2020-12-18 中北大学 Automatic identification method for lumbar vertebrae with different joint numbers in medical image based on Mask RCNN
US20200410252A1 (en) * 2019-06-28 2020-12-31 Baidu Usa Llc Method for determining anchor boxes for training neural network object detection models for autonomous driving


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Xuan; Li Jing; Wang Haiyan: "Research on Object Detection Algorithms for Dense Traffic Scenes", 计算机技术与发展 (Computer Technology and Development), no. 07 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240696A (en) * 2021-05-20 2021-08-10 推想医疗科技股份有限公司 Image processing method and device, model training method and device, and electronic equipment
CN113240696B (en) * 2021-05-20 2022-02-08 推想医疗科技股份有限公司 Image processing method and device, model training method and device, and electronic equipment

Also Published As

Publication number Publication date
CN112734740B (en) 2024-02-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant