CN112767532B - Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning - Google Patents

Info

Publication number
CN112767532B
CN112767532B
Authority
CN
China
Prior art keywords
image
network
convolution
loss
tensor
Prior art date
Legal status
Active
Application number
CN202011621411.4A
Other languages
Chinese (zh)
Other versions
CN112767532A (en)
Inventor
全红艳
钱笑笑
Current Assignee
East China Normal University
Original Assignee
East China Normal University
Priority date
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202011621411.4A
Publication of CN112767532A
Application granted
Publication of CN112767532B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional reconstruction method for ultrasonic or CT medical images based on transfer learning. The method can effectively realize three-dimensional reconstruction of ultrasonic or CT images, give full play to the role of artificial intelligence in auxiliary diagnosis, and improve the efficiency of auxiliary diagnosis through 3D visualization of the reconstruction results.

Description

Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
Technical Field
The invention belongs to the field of computer technology and relates to three-dimensional reconstruction of ultrasonic or CT images in medical auxiliary diagnosis. It discloses a method for three-dimensionally reconstructing ultrasonic or CT images by means of the imaging laws of natural images, an artificial intelligence transfer learning strategy, and an encoding and decoding network structure.
Background
In recent years, artificial intelligence technology has developed rapidly, and research on key technologies for medical auxiliary diagnosis is of great significance. In current research on three-dimensional reconstruction of ultrasonic or CT images, recovering the camera parameters is difficult, which in turn makes the reconstruction itself difficult; in particular, the reconstruction of complex models brings a serious problem of high time complexity, which is unfavourable for the application of clinical medical auxiliary diagnosis. How to establish an effective deep learning coding model and effectively solve the problem of fast three-dimensional reconstruction of ultrasonic or CT images is a practical problem to be solved urgently.
Disclosure of Invention
The invention provides an ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning, that is, a three-dimensional reconstruction method for ultrasonic or CT images based on deep learning.
The specific technical scheme for realizing the purpose of the invention is as follows:
a three-dimensional reconstruction method of ultrasonic or CT medical images based on transfer learning is disclosed, the method inputs an ultrasonic or CT image sequence, the image resolution is MxN, M is more than or equal to 100 and less than or equal to 1500, N is more than or equal to 100 and less than or equal to 1500, the three-dimensional reconstruction process specifically comprises the following steps:
step 1: building a data set
(a) Constructing a natural image dataset D
Select a natural image website that provides image sequences together with the corresponding camera internal parameters, and download a image sequences and their internal parameters from the website, where 1 ≤ a ≤ 20. For each image sequence, record every adjacent 3 frames as image b, image c and image d, splice image b and image d according to the color channel to obtain image tau, and form a data element from image c and image tau, where image c is the natural target image and its sampling viewpoint serves as the target viewpoint. The internal parameters of image b, image c and image d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates. If fewer than 3 frames remain at the end of a sequence, they are discarded. Construct the data set D from all the sequences; D contains f elements, where 3000 ≤ f ≤ 20000;
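By way of illustration only, the grouping and colour-channel splicing described above can be sketched as follows; the function names are illustrative, non-overlapping triples are assumed, and a trailing remainder of fewer than 3 frames is dropped.

```python
# Minimal sketch of building data elements from one image sequence
# (assumption: non-overlapping triples; remainder of fewer than 3 frames dropped).
import numpy as np

def make_element(frame_b, frame_c, frame_d):
    """Return (target image c, spliced image tau) for one data element."""
    tau = np.concatenate([frame_b, frame_d], axis=-1)   # H x W x 6 colour-channel splice
    return frame_c, tau

def build_elements(frames):
    """frames: list of H x W x 3 arrays from one sequence."""
    elements = []
    for s in range(0, len(frames) - 2, 3):              # adjacent triples b, c, d
        b, c, d = frames[s], frames[s + 1], frames[s + 2]
        elements.append(make_element(b, c, d))
    return elements                                     # trailing 1-2 frames are discarded
```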
(b) constructing an ultrasound image dataset E
Sampling g ultrasonic image sequences, wherein g is more than or equal to 1 and less than or equal to 20, recording every adjacent 3 frames of images of each sequence as an image i, an image j and an image k, splicing the image i and the image k according to a color channel to obtain an image pi, forming a data element by the image j and the image pi, wherein the image j is an ultrasonic target image, and a sampling viewpoint of the image j is used as a target viewpoint;
(c) construction of a CT image dataset G
Sampling h CT image sequences, wherein h is more than or equal to 1 and less than or equal to 20, recording every adjacent 3 frames of each sequence as an image l, an image m and an image n, splicing the image l and the image n according to a color channel to obtain an image sigma, forming a data element by the image m and the image sigma, taking the image m as a CT target image, taking a sampling viewpoint of the image m as a target viewpoint, if the last residual image in the same image sequence is less than 3 frames, abandoning, and constructing a data set G by using all the sequences, wherein the data set G has xi elements, and xi is more than or equal to 1000 and less than or equal to 20000;
step 2: constructing neural networks
The resolution of the images or videos processed by the neural networks is p×o, where p is the width and o is the height, with 100 ≤ p ≤ 2000 and 100 ≤ o ≤ 2000, in pixels;
(1) structure of network A
Taking tensor H as input, the scale is alpha multiplied by o multiplied by p multiplied by 3, taking tensor I as output, the scale is alpha multiplied by o multiplied by p multiplied by 1, and alpha is the number of batches;
the network A consists of an encoder and a decoder, and for the tensor H, the output tensor I is obtained after encoding and decoding processing is carried out in sequence;
the encoder consists of 5 residual error units, the 1 st to 5 th units respectively comprise 2, 3, 4, 6 and 3 residual error modules, each residual error module performs convolution for 3 times, the shapes of convolution kernels are 3 multiplied by 3, the number of the convolution kernels is 64, 64, 128, 256 and 512, and a maximum pooling layer is included behind the first residual error unit;
the decoder is composed of 6 decoding units, each decoding unit comprises two steps of deconvolution and convolution, the shapes and the numbers of convolution kernels of the deconvolution and convolution are the same, the shapes of convolution kernels of the 1 st to 6 th decoding units are all 3x3, the numbers of the convolution kernels are 512, 256, 128, 64, 32 and 16 respectively, cross-layer connection is carried out between network layers of the encoder and the decoder, and the corresponding relation of the cross-layer connection is as follows: 1 and 4, 2 and 3, 3 and 2, 4 and 1;
(2) structure of network B
Tensor J and tensor K are taken as input, with scales alpha multiplied by o multiplied by p multiplied by 3 and alpha multiplied by o multiplied by p multiplied by 6 respectively; tensor L and tensor O are taken as output, with scales alpha multiplied by 2 multiplied by 6 and alpha multiplied by 4 multiplied by 1 respectively, and alpha is the number of batches;
The network B is composed of a module P and a module Q, with 11 layers of convolution units in total; firstly, the tensor J and the tensor K are spliced according to the last channel to obtain a tensor with the scale of alpha multiplied by o multiplied by p multiplied by 9, and the output tensor L and tensor O are respectively obtained after this tensor is processed by the module P and the module Q;
the module Q and the module P share a front 4-layer convolution unit, and the front 4-layer convolution unit has the structure that the convolution kernel scales in the front two-layer unit are respectively 7 multiplied by 7 and 5 multiplied by 5, the convolution kernel scales from the 3 rd layer to the 4 th layer are all 3 multiplied by 3, and the number of convolution kernels from 1 layer to 4 layers is 16, 32, 64 and 128 in sequence;
for the module P, except for sharing 4 layers, the module P occupies convolution units from the 5 th layer to the 7 th layer of the network B, the scale of convolution kernels is 3 multiplied by 3, the number of the convolution kernels is 256, after the convolution processing is carried out on the processing result of the 7 th layer by using 12 convolution kernels of 3 multiplied by 3, the 12 results are sequentially arranged into 2 rows, and the result of the tensor L is obtained;
for the module Q, except for 1 to 4 layers of the shared network B, 8 th to 11 th layers of convolution units of the network B are occupied, 2 nd layer output of the network B is used as 8 th layer input of the network B, the shapes of convolution kernels in the 8 th to 11 th layers of convolution units are all 3 multiplied by 3, the number of the convolution kernels is all 256, and after convolution processing is carried out on the 11 th layer result by using 4 convolution kernels of 3 multiplied by 3, tensor O results are obtained from 4 channels;
(3) structure of network C
Taking tensor R and tensor S as network input, wherein the scales are both alpha multiplied by o multiplied by p multiplied by 3, taking tensor T as network output, the scales are alpha multiplied by o multiplied by p multiplied by 2, and alpha is the number of batches;
the network C is designed into a coding and decoding structure, firstly, a tensor R and a tensor S are spliced according to a last channel to obtain a tensor with the scale of alpha multiplied by o multiplied by p multiplied by 6, and an output tensor T is obtained after the tensor is subjected to coding and decoding processing;
for the coding structure, the coding structure is composed of 6 layers of coding units, each layer of coding unit comprises 1 convolution processing, 1 batch normalization processing and 1 activation processing, wherein the 1 st layer of coding unit adopts 7x7 convolution kernels, other layer of coding units all adopt 3x3 convolution kernels, the convolution step length of the 1 st and 3 rd layer of coding units is 1, the convolution step length of other layer of coding units is 2, for each layer of coding unit, the coding units are all activated by Relu function, and the number of the convolution kernels of the 1-6 layer of coding units is respectively 16, 32, 64, 128, 256 and 512;
for a decoding structure, the decoding structure comprises 6 layers of decoding units, each layer of decoding unit comprises a deconvolution unit, a connection processing unit and a convolution unit, wherein the deconvolution unit comprises deconvolution processing and Relu activation processing, the sizes of 1-6 layers of deconvolution kernels are all 3x3, for the 1 st-2 layers of decoding units, the deconvolution step length is 1, the deconvolution step length of the 3-6 layers of decoding units is 2, the number of 1-6 layers of deconvolution kernels is 512, 256, 128, 64, 32 and 16 in sequence, the connection processing unit connects the deconvolution results of the coding unit and the corresponding decoding units and inputs the results into the convolution units, the convolution kernel size of the 1-5 layers of convolution units is 3x3, the convolution kernel size of the 6 th layer of convolution unit is 7x7, the convolution step lengths of the 1-6 layers of convolution units are all 2, and the convolution results of the 6 th layer are processed by 2 convolution units 3x3, obtaining a result T;
(4) Structure of network mu
Tensor omega is taken as the network input, with scale α × o × p × 3; the output tensor of network mu has scale α × o × p × 1, and α is the batch number;
The network mu consists of an encoder and a decoder; for the tensor omega, the output tensor is obtained after encoding and decoding processing are carried out in sequence;
The encoder consists of 14 layers of encoding units, each encoding unit comprises 1 convolution processing, 1 batch normalization processing and 1 activation processing, the encoding units of the 1 st and the 2 nd layers adopt a 7x7 convolutional kernel structure, the encoding units of the 3 rd and the 4 th layers adopt a 5 x 5 convolutional kernel structure, the rest encoding units are set as 3x3 convolutional kernels, and the convolution step size of each encoding unit is designed as follows: setting the step sizes of the 1 st, 3 rd, 5 th, 7 th, 9 th, 11 th and 13 th layers to be 2, setting the step sizes of other layers to be 1, adopting Relu function activation processing for each coding unit, respectively setting the numbers of convolution kernels of the 1 st to 8 th layers to be 32, 64, 128, 256 and 256 in a coding structure, and setting the numbers of convolution kernels of the other layers to be 512;
the decoder is composed of 7 layers of decoding units, each layer of decoding unit is composed of a deconvolution unit, a connection processing unit and a convolution unit, the deconvolution unit comprises deconvolution processing and Relu activation processing, the sizes of deconvolution kernels of all layers are 3x3, the step lengths are 2, the number of deconvolution kernels of 1-7 layers is 512, 256, 128, 64, 32 and 16 respectively, the connection processing unit connects the coding unit with deconvolution characteristics of corresponding layers and inputs the connection processing unit to the next convolution unit for convolution and Relu activation processing, the numbers of convolution kernels of 1-7 layers are 512, 256, 128, 64, 32 and 16 respectively in the convolution units, the sizes of convolution kernels are 3x3, the step lengths are 1, 4-7 layers of decoding units and the output of the decoding units is multiplied by weight respectively to obtain output result tensors which are multiplied by weights
Figure BDA0002872386870000043
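The following sketch illustrates one possible reading of network mu. The per-layer filter counts of the encoder, the pairing of encoder features with decoder layers, the seventh decoder width and the way the weighted outputs of decoding layers 4-7 are combined are not fully specified above; the choices made here (including the weight values) are assumptions for illustration only.

```python
import tensorflow as tf

# (kernel size, stride, filters) for the 14 encoding units, one possible reading of the text.
ENCODER_CFG = [(7, 2, 32), (7, 1, 32), (5, 2, 64), (5, 1, 64),
               (3, 2, 128), (3, 1, 128), (3, 2, 256), (3, 1, 256),
               (3, 2, 512), (3, 1, 512), (3, 2, 512), (3, 1, 512),
               (3, 2, 512), (3, 1, 512)]

def network_mu(omega, weights=(0.1, 0.2, 0.3, 0.4)):   # omega: [alpha, o, p, 3]; weights assumed
    x, skips = omega, []
    for ksize, stride, width in ENCODER_CFG:
        x = tf.layers.conv2d(x, width, ksize, strides=stride, padding='same')
        x = tf.layers.batch_normalization(x)
        x = tf.nn.relu(x)
        if stride == 2:
            skips.append(x)                            # one feature map per scale
    scale_outputs = []
    for layer, width in enumerate([512, 256, 128, 64, 32, 16, 16]):   # 7 decoding units
        x = tf.layers.conv2d_transpose(x, width, 3, strides=2, padding='same',
                                       activation=tf.nn.relu)
        if layer < 6:                                  # connect an encoder feature of similar scale
            skip = tf.image.resize_bilinear(skips[5 - layer], tf.shape(x)[1:3])
            x = tf.concat([x, skip], axis=-1)
        x = tf.layers.conv2d(x, width, 3, padding='same', activation=tf.nn.relu)
        if layer >= 3:                                 # decoding units 4-7 contribute to the output
            d = tf.layers.conv2d(x, 1, 3, padding='same')
            scale_outputs.append(tf.image.resize_bilinear(d, tf.shape(omega)[1:3]))
    # Assumed combination: weighted sum of the four per-layer depth maps.
    return tf.add_n([w * d for w, d in zip(weights, scale_outputs)])
```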
Step 3: Training of neural networks
The samples in data set D, data set E and data set G are each divided into a training set and a test set at a ratio of 9:1; the data in the training sets are used for training and the data in the test sets for testing. When each of the following steps is trained, the training data are taken from the corresponding data set, uniformly scaled to resolution p×o and input into the corresponding network, and iterative optimization is carried out so that the loss of each batch is minimized by continuously modifying the network model parameters;
in the training process, the calculation method of each loss is as follows:
internal parameter supervision synthesis loss: in the network model parameter training of the natural image, the output tensor I of the network A is taken as the depth, and the output result L of the network B and the internal parameter label e of the training data are taken as the deptht(t ═ 1, 2, 3, 4) as pose parameters and camera internal parameters, respectively, and according to a computer vision algorithm, two images at the viewpoint of image c are synthesized using image b and image d, respectivelyCalculating the image c and the two images according to the sum of the intensity differences of the pixel-by-pixel and color-by-color channels;
unsupervised synthesis loss: in the network model parameter training of ultrasonic or CT image, the output tensor of the network mu
Figure BDA0002872386870000051
As the depth, the output tensor L and the tensor O of the network B are respectively used as pose parameters and camera internal parameters, images at the view point of a target image are respectively synthesized by using two adjacent images of the target image according to a computer vision algorithm, and the target image and the images at the view point of the target image are respectively used for calculation according to the sum of the intensity differences of pixel-by-pixel and color-by-color channels;
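As an illustration of the view synthesis used by the synthesis losses, the following NumPy sketch back-projects each target pixel with the predicted depth, moves it to the source camera with the pose, projects it with the intrinsics and samples the source image. Nearest-neighbour sampling and the matrix and array names are simplifying assumptions (in training this step is implemented with differentiable bilinear sampling), and images are assumed to be floating-point arrays.

```python
import numpy as np

def synthesize_view(src, depth_tgt, K, T_tgt_to_src):
    """Illustrative inverse warping: back-project each target pixel with its depth,
    transform it with the 4x4 pose T_tgt_to_src, project with intrinsics K,
    and sample the source image (nearest neighbour for brevity)."""
    h, w = depth_tgt.shape
    out = np.zeros_like(src)
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T   # 3 x N homogeneous pixels
    cam = np.linalg.inv(K) @ pix * depth_tgt.reshape(1, -1)                # 3 x N target-camera points
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])                   # 4 x N
    src_cam = (T_tgt_to_src @ cam_h)[:3]                                   # points in the source camera
    proj = K @ src_cam
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (proj[2] > 0)
    out.reshape(-1, src.shape[-1])[valid] = src[v[valid], u[valid]]
    return out

def photometric_loss(target, synth_a, synth_b):
    # Sum of pixel-by-pixel, colour-channel-by-colour-channel absolute intensity differences.
    t = target.astype(np.float64)
    return np.abs(t - synth_a).sum() + np.abs(t - synth_b).sum()
```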
internal parameter error loss: utilizing output result O of network B and internal parameter label e of training datat(t is 1, 2, 3, 4) calculated as the sum of the absolute values of the differences of the components;
spatial structure error loss: in the network model parameter training of ultrasonic or CT image, the output tensor of the network mu
Figure BDA0002872386870000053
As the depth, the output tensor L and the tensor O of the network B are respectively used as pose parameters and camera internal parameters, the target image is reconstructed by taking the viewpoint of the target image as the origin of a camera coordinate system according to a computer vision algorithm, a RANSAC algorithm is adopted to fit the spatial structure of reconstruction points, and the Euclidean distance between each reconstruction point of the target image and the spatial geometric structure is calculated;
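A sketch of this loss is given below. Fitting a plane is assumed here as the RANSAC-fitted spatial structure, which is not stated explicitly above, and all names, iteration counts and thresholds are illustrative.

```python
import numpy as np

def backproject(depth, K):
    """Reconstruct the target image in its own camera frame (viewpoint as origin)."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    return (np.linalg.inv(K) @ pix * depth.reshape(1, -1)).T           # N x 3 points

def ransac_plane(points, iters=100, threshold=0.01, seed=0):
    """Fit a plane (assumed spatial structure) to the reconstructed points by RANSAC."""
    rng = np.random.default_rng(seed)
    best_plane, best_inliers = None, -1
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-8:
            continue
        normal = normal / norm
        d = -normal @ sample[0]
        inliers = (np.abs(points @ normal + d) < threshold).sum()
        if inliers > best_inliers:
            best_plane, best_inliers = (normal, d), inliers
    return best_plane

def spatial_structure_loss(depth, K):
    pts = backproject(depth, K)
    normal, d = ransac_plane(pts)
    return np.abs(pts @ normal + d).sum()        # Euclidean point-to-structure distances
```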
transform synthesis loss: in the network model parameter training of ultrasonic or CT images, the output tensor of the network mu is expressed
Figure BDA0002872386870000052
Using the output tensor L and the output tensor O of the network B as the position and pose parameters and the internal parameters of the camera, respectively, and synthesizing the viewpoint of the target image by using two adjacent images of the target image according to a computer vision algorithmIn the two image processes, after each pixel position is obtained for each image in the two synthesized images, the coordinate of each pixel is added with each pixel displacement result output by the network C to obtain a new position of each pixel to form a synthesized result image, and the synthesized result image is obtained by calculating the sum of the pixel-by-pixel and color-by-color channel intensity differences between the synthesized result image and the image j;
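The displacement step of this loss can be sketched as follows; the assignment of the two displacement channels to the x and y directions, the rounding to integer positions and the use of a separate displacement map per synthesized image are assumptions made for illustration.

```python
import numpy as np

def apply_displacement(synth, disp):
    """Move each pixel of a synthesized image by the 2-channel displacement
    predicted by network C (forward mapping; out-of-range pixels are dropped)."""
    h, w, _ = synth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    new_x = np.round(xs + disp[..., 0]).astype(int)      # channel 0 assumed horizontal
    new_y = np.round(ys + disp[..., 1]).astype(int)      # channel 1 assumed vertical
    valid = (new_x >= 0) & (new_x < w) & (new_y >= 0) & (new_y < h)
    result = np.zeros_like(synth)
    result[new_y[valid], new_x[valid]] = synth[ys[valid], xs[valid]]
    return result

def transform_synthesis_loss(image_j, synth_a, synth_b, disp_a, disp_b):
    # Sum of pixel-by-pixel, colour-channel-by-colour-channel intensity differences against image j.
    ra, rb = apply_displacement(synth_a, disp_a), apply_displacement(synth_b, disp_b)
    j = image_j.astype(np.float64)
    return np.abs(j - ra).sum() + np.abs(j - rb).sum()
```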
(1) On the data set D, network A and the module P of network B are trained 80000 times
Training data are taken from the data set D each time and uniformly scaled to resolution p×o; image c is input into network A, image c and image tau are input into network B, and the module P of network B is trained; the training loss of each batch is calculated from the internal parameter supervision synthesis loss;
(2) On the data set D, the module Q of network B is trained 80000 times
Training data are taken from the data set D each time and uniformly scaled to resolution p×o; image c is input into network A, image c and image tau are input into network B, and the module Q of network B is trained; the training loss of each batch is calculated as the sum of the internal parameter supervision synthesis loss and the internal parameter error loss;
(3) On the data set E, network mu and the module Q of network B are trained 80000 times for feature transfer
Each time, taking out the ultrasonic training data from the data set E, uniformly scaling the ultrasonic training data to the resolution p × o, inputting the image j into the network μ, inputting the image j and the image pi into the network B, and training the module Q of the network B, wherein the training loss of each batch is calculated as follows:
z=v+W+χ (1)
wherein v is the unsupervised synthesis loss, W is the spatial structure error loss, and the constant depth loss χ is calculated using the mean square error of the output result of network mu;
(4) On the data set E, the two modules of network B are trained 80000 times according to the following steps
Taking out ultrasonic training data from a data set E every time, uniformly scaling the ultrasonic training data to a resolution ratio p x o, inputting an image j into a network mu, inputting the image j and the image pi into a network B, adjusting two module parameters of the network B in the training process, and performing iterative optimization to minimize the loss of each image in each batch, wherein the training loss in each batch is composed of the sum of unsupervised synthesis loss, spatial structure error loss and constant depth loss, and the constant depth loss is calculated by using the mean square error of an output result of the network mu;
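As an illustration of how the batch loss of stages (3) and (4) could be assembled, a sketch is given below. Reading the constant depth loss as the mean squared deviation of the predicted depth from its per-image mean is an assumption; the text above only states that it is calculated from the mean square error of the output of network mu.

```python
import tensorflow as tf

def constant_depth_loss(depth):
    """Assumed reading of the constant depth loss: mean squared deviation of the
    depth map predicted by network mu from its per-image mean value."""
    mean = tf.reduce_mean(depth, axis=[1, 2, 3], keepdims=True)
    return tf.reduce_mean(tf.square(depth - mean))

def batch_loss(unsupervised_synth, spatial_structure, depth):
    # Formula (1): z = v + W + chi
    return unsupervised_synth + spatial_structure + constant_depth_loss(depth)
```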
(5) On the data set E, network C and the two modules of network B are trained 80000 times
Taking out ultrasonic image training data from a data set E every time, uniformly scaling to a resolution ratio p x o, inputting an image j into a network mu, inputting an image j and an image pi into a network B, taking the output of the network mu as a depth, taking the output of the network B as a pose parameter and a camera internal parameter, synthesizing two images at a visual point of the image j according to an image i and an image k respectively, inputting the two images into a network C, and carrying out iterative optimization by continuously modifying the parameters of the network C and the network B so that the loss of each image in each batch is minimum, wherein the loss in each batch is calculated as the sum of transformation synthesis loss, spatial structure error loss and constant depth loss, and the constant depth loss is calculated by using the mean square error of the output result of the network mu;
(6) On the data set E, network C and the two modules of network B are trained 50000 times to obtain the model rho
Ultrasonic image training data are taken from the data set E each time and uniformly scaled to resolution p×o; image j is input into network mu, and image j and image pi are input into network B; the output of network mu is taken as the depth, and the output of network B as the pose parameters and camera internal parameters; two images at the viewpoint of image j are synthesized from image i and image k respectively and input into network C; iterative optimization is carried out by continuously modifying the parameters of network C and network B so that the loss of each image in each batch is minimized, and the optimal network model parameters rho are obtained after iteration; the loss of each batch is calculated as the sum of the transformation synthesis loss and the spatial structure error loss;
(7) On the data set G, network C and network B are trained 80000 times
Taking CT image training data out of a data set G each time, uniformly scaling to a resolution p x o, inputting an image m into a network mu, inputting the image m and an image sigma into a network B, taking the output of the network mu as a depth, taking the output of the network B as a pose parameter and a camera internal parameter, respectively synthesizing two images at a viewpoint of the image m according to an image l and an image n, inputting the two images into a network C, the loss of each image of each batch is minimized by continuously modifying the parameters of the network C and the network B and carrying out iterative optimization, the loss of each batch is calculated as the sum of transformation synthesis loss, spatial structure error loss, constant depth loss and translational motion loss Y of the camera, the constant depth loss is calculated by utilizing the mean square error of the output result of the network mu, and Y is calculated by the output pose parameter of the network B according to the constraint of the translational motion of the camera;
(8) On the data set G, network C and network B are trained 50000 times to obtain the model rho'
CT image training data are taken from the data set G each time and uniformly scaled to resolution p×o; image m is input into network mu, and image m and image sigma are input into network B; the output of network mu is taken as the depth, and the output of network B as the pose parameters and camera internal parameters; two images at the viewpoint of image m are synthesized from image l and image n respectively and input into network C; the parameters of network C and network B are continuously modified and iterative optimization is carried out so that the loss of each image in each batch is minimized, and the optimal network model parameters rho' are obtained after iteration; the loss of each batch is calculated as the sum of the transformation synthesis loss, the spatial structure error loss and the camera translational motion loss Y, where Y is calculated from the pose parameters output by network B according to the constraint of camera translational motion;
Step 4: Three-dimensional reconstruction of ultrasound or CT images
Using a self-sampled ultrasonic or CT sequence of images, each frame is uniformly scaled to resolution p×o and prediction is carried out with the model parameters rho or rho': for an ultrasonic sequence, image j is input into network mu and image j and image pi are input into network B; for a CT sequence, image m is input into network mu and image m and image sigma are input into network B; the output of network mu is taken as the depth, and the output of network B as the pose parameters and camera internal parameters. Key frames are then selected as follows: the first frame of the sequence is taken as the current key frame, and each frame of the sequence is taken in turn as the target frame; an image at the viewpoint of the target frame is synthesized from the current key frame using the camera pose parameters and internal parameters, and the error λ is calculated from the sum of the pixel-by-pixel, color-channel-by-color-channel intensity differences between the synthesized image and the target frame; an image at the viewpoint of the target frame is likewise synthesized from the adjacent frames of the target frame, and the error γ is calculated from the sum of the pixel-by-pixel, color-channel-by-color-channel intensity differences between the synthesized image and the target frame; the synthesis error ratio Z is then calculated by formula (2), and when Z is greater than a threshold η, where 1 < η < 2, the current key frame is updated to the current target frame;
Z = λ / γ    (2)
Further, according to the reconstruction algorithm of computer vision, the three-dimensional coordinates of each pixel of each frame in the camera coordinate system are obtained from the camera internal parameters; then, taking the viewpoint of the first frame as the origin of the world coordinate system and combining the pose parameters of all the key frames, the three-dimensional coordinates of each pixel of each frame of the sequence in the world coordinate system are obtained by three-dimensional spatial geometric transformation.
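The key-frame selection of step 4 can be sketched as follows; the view-synthesis routine is a placeholder, synthesis from the single previous frame stands in for the adjacent frames, and formula (2) is taken to be the ratio of the two errors.

```python
import numpy as np

def synthesis_error(target, synthesized):
    # Sum of pixel-by-pixel, colour-channel-by-colour-channel intensity differences.
    return np.abs(target.astype(np.float64) - synthesized.astype(np.float64)).sum()

def select_key_frames(frames, synth_from, eta=1.2):
    """Key-frame selection sketch. `synth_from(src_index, tgt_index)` is a placeholder
    for view synthesis with the predicted depth, pose and camera internal parameters."""
    key_frames = [0]                        # the first frame is the initial key frame
    for t in range(1, len(frames)):
        lam = synthesis_error(frames[t], synth_from(key_frames[-1], t))
        gamma = synthesis_error(frames[t], synth_from(t - 1, t))
        z = lam / max(gamma, 1e-8)          # formula (2), read as the error ratio
        if z > eta:
            key_frames.append(t)            # update the current key frame
    return key_frames
```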
The method can effectively realize three-dimensional reconstruction of ultrasonic or CT images and present slice images with a 3D visual effect in artificial-intelligence-assisted diagnosis, thereby improving the efficiency of auxiliary diagnosis.
Drawings
FIG. 1 is a three-dimensional reconstruction result diagram of an ultrasound image of the present invention;
FIG. 2 is a three-dimensional reconstruction result of the CT image of the present invention.
Detailed Description
Examples
The invention is further described below with reference to the accompanying drawings.
This embodiment is implemented on a PC under the Windows 10 64-bit operating system, with hardware configuration CPU i7-9700F, 16 GB memory and an NVIDIA GeForce GTX 2070 8 GB GPU; the deep learning library is TensorFlow 1.14, and the programming language is Python.
A three-dimensional reconstruction method for ultrasonic or CT medical images based on transfer learning. The method takes an ultrasonic or CT image sequence as input, with resolution M×N, where M = 450 and N = 300 for the ultrasonic images and M = N = 512 for the CT images. The three-dimensional reconstruction process specifically comprises the following steps:
step 1: building a data set
(a) Constructing a natural image dataset D
Select a natural image website that provides image sequences together with the corresponding camera internal parameters, and download 19 image sequences and their internal parameters from the website. For each image sequence, record every adjacent 3 frames as image b, image c and image d, splice image b and image d according to the color channel to obtain image tau, and form a data element from image c and image tau, where image c is the natural target image and its sampling viewpoint serves as the target viewpoint. The internal parameters of image b, image c and image d are all e_t (t = 1, 2, 3, 4), where e_1 is the horizontal focal length, e_2 is the vertical focal length, and e_3 and e_4 are the two components of the principal point coordinates. If fewer than 3 frames remain at the end of a sequence, they are discarded. Construct the data set D from all the sequences; D contains 3600 elements;
(b) constructing an ultrasound image dataset E
Sampling 10 ultrasonic image sequences, recording 3 adjacent images of each sequence as an image i, an image j and an image k, splicing the image i and the image k according to a color channel to obtain an image pi, forming a data element by the image j and the image pi, wherein the image j is an ultrasonic target image, a sampling viewpoint of the image j is used as a target viewpoint, if the last residual image in the same image sequence is less than 3 frames, discarding the image j, and constructing a data set E by using all the sequences, wherein the data set E has 1600 elements;
(c) construction of a CT image dataset G
Sampling 1 CT image sequence; for the sequence, every adjacent 3 frames are recorded as image l, image m and image n, image l and image n are spliced according to the color channel to obtain image sigma, and image m and image sigma form a data element, where image m is the CT target image and its sampling viewpoint serves as the target viewpoint; if fewer than 3 frames remain at the end of a sequence, they are discarded; the data set G is constructed from all the sequences and has 2000 elements;
Step 2: constructing neural networks
The resolution of the image or video processed by the neural network is 416 × 128, 416 is the width, 128 is the height, and the pixel is taken as the unit;
(1) structure of network A
Taking tensor H as input, the scale is 16 multiplied by 128 multiplied by 416 multiplied by 3, taking tensor I as output, and the scale is 16 multiplied by 128 multiplied by 416 multiplied by 1;
the network A consists of an encoder and a decoder, and for the tensor H, the output tensor I is obtained after the encoding and decoding processing is carried out in sequence;
the encoder consists of 5 residual error units, the 1 st to 5 th units respectively comprise 2, 3, 4, 6 and 3 residual error modules, each residual error module performs convolution for 3 times, the shapes of convolution kernels are 3 multiplied by 3, the number of the convolution kernels is 64, 64, 128, 256 and 512, and a maximum pooling layer is included behind the first residual error unit;
the decoder is composed of 6 decoding units, each decoding unit comprises two steps of deconvolution and convolution, the shapes and the numbers of convolution kernels of the deconvolution and convolution are the same, the shapes of convolution kernels of the 1 st to 6 th decoding units are all 3x3, the numbers of the convolution kernels are 512, 256, 128, 64, 32 and 16 respectively, cross-layer connection is carried out between network layers of the encoder and the decoder, and the corresponding relation of the cross-layer connection is as follows: 1 and 4, 2 and 3, 3 and 2, 4 and 1;
(2) structure of network B
Tensor J and tensor K are used as inputs, the scales are respectively 16 × 128 × 416 × 3 and 16 × 128 × 416 × 6, tensor L and tensor O are used as outputs, and the scales are respectively 16 × 2 × 6 and 16 × 4 × 1;
the network B is composed of a module P and a module Q, 11 layers of convolution units are shared, firstly, a tensor J and a tensor K are spliced according to a last channel to obtain a tensor with the dimension of 16 multiplied by 128 multiplied by 416 multiplied by 9, and an output tensor L and a tensor O are respectively obtained after the tensor is processed by the module P and the module Q;
the module Q and the module P share a front 4-layer convolution unit, and the front 4-layer convolution unit has the structure that the convolution kernel scales in the front two-layer unit are respectively 7 multiplied by 7 and 5 multiplied by 5, the convolution kernel scales from the 3 rd layer to the 4 th layer are all 3 multiplied by 3, and the number of convolution kernels from 1 layer to 4 layers is 16, 32, 64 and 128 in sequence;
for the module P, except for sharing 4 layers, the module P occupies convolution units from the 5 th layer to the 7 th layer of the network B, the scale of convolution kernels is 3 multiplied by 3, the number of the convolution kernels is 256, after the convolution processing is carried out on the processing result of the 7 th layer by using 12 convolution kernels of 3 multiplied by 3, the 12 results are sequentially arranged into 2 rows, and the result of the tensor L is obtained;
for the module Q, except for 1 to 4 layers of the shared network B, 8 th to 11 th layers of convolution units of the network B are occupied, 2 nd layer output of the network B is used as 8 th layer input of the network B, the shapes of convolution kernels in the 8 th to 11 th layers of convolution units are all 3 multiplied by 3, the number of the convolution kernels is all 256, and after convolution processing is carried out on the 11 th layer result by using 4 convolution kernels of 3 multiplied by 3, tensor O results are obtained from 4 channels;
(3) structure of network C
Taking tensor R and tensor S as network input, wherein the scales of the tensor R and the tensor S are both 16 multiplied by 128 multiplied by 416 multiplied by 3, taking tensor T as network output, and the scales of the tensor R and the tensor S are 16 multiplied by 128 multiplied by 416 multiplied by 2;
the network C is designed into a coding and decoding structure, firstly, a tensor R and a tensor S are spliced according to a last channel to obtain a tensor with the dimension of 16 multiplied by 128 multiplied by 416 multiplied by 6, and an output tensor T is obtained after the tensor is subjected to coding and decoding processing;
for the coding structure, the coding structure is composed of 6 layers of coding units, each layer of coding unit comprises 1 convolution processing, 1 batch normalization processing and 1 activation processing, wherein the 1 st layer of coding unit adopts 7x7 convolution kernels, other layer of coding units all adopt 3x3 convolution kernels, the convolution step length of the 1 st and 3 rd layer of coding units is 1, the convolution step length of other layer of coding units is 2, for each layer of coding unit, the Relu function activation is adopted, and the number of the convolution kernels of the 1-6 layer of coding units is respectively 16, 32, 64, 128, 256 and 512;
for a decoding structure, the decoding structure comprises 6 layers of decoding units, each layer of decoding unit comprises a deconvolution unit, a connection processing unit and a convolution unit, wherein the deconvolution unit comprises deconvolution processing and Relu activation processing, the sizes of 1-6 layers of deconvolution kernels are all 3x3, for the 1 st-2 layers of decoding units, the deconvolution step length is 1, the deconvolution step length of the 3-6 layers of decoding units is 2, the number of the 1-6 layers of deconvolution kernels is 512, 256, 128, 64, 32 and 16 in sequence, the connection processing unit connects the deconvolution results of the coding unit and the corresponding decoding units and inputs the results into the convolution units, the convolution kernel size of the 1-5 layers of convolution units is 3x3, the convolution kernel size of the 6 th layer of convolution unit is 7x7, the convolution step lengths of the 1-6 layers of convolution units are all 2, and after 2 layers of convolution results of the 6 th layer are processed by 3x3, obtaining a result T;
(4) Structure of network mu
Tensor omega is taken as the network input, with scale 16 × 128 × 416 × 3; the output tensor of network mu has scale 16 × 128 × 416 × 1;
The network mu consists of an encoder and a decoder; for the tensor omega, the output tensor is obtained after encoding and decoding processing are carried out in sequence;
The encoder consists of 14 layers of encoding units, each encoding unit comprises 1 convolution processing, 1 batch normalization processing and 1 activation processing, the encoding units of the 1 st and the 2 nd layers adopt a 7x7 convolutional kernel structure, the encoding units of the 3 rd and the 4 th layers adopt a 5 x 5 convolutional kernel structure, the rest encoding units are set as 3x3 convolutional kernels, and the convolution step size of each encoding unit is designed as follows: setting the step sizes of the 1 st, 3 rd, 5 th, 7 th, 9 th, 11 th and 13 th layers to be 2, setting the step sizes of other layers to be 1, adopting Relu function activation processing for each coding unit, respectively setting the number of convolution kernels of the 1 st to 8 th layers to be 32, 64, 128, 256 and 256 in a coding structure, and setting the number of convolution kernels of the other layers to be 512;
the decoder consists of 7 layers of decoding units, each layer of decoding unit consists of a deconvolution unit, a connection processing unit and a convolution unit, wherein in any layer of decoding unit, the deconvolution unit comprises deconvolution processing and Relu activation processing, the deconvolution kernels of each layer have the size of 3x3, the step length is 2, the number of the deconvolution kernels from 1 to 7 layers is 512, 256, 128, 64, 32 and 16 respectively, the connection processing unit connects the coding unit with the deconvolution characteristics of the corresponding layer and inputs the connected coding unit to the next convolution unit for convolution and Relu activation processing, in the convolution units, the number of the convolution kernels from 1 to 7 layers is 512, 256, 128, 64, 32 and 16 respectively, in each convolution unit, the convolution kernels have the size of 3x3, the step length is 1, and the output of the decoding units from 4 to 7 layers are multiplied by the weight respectively to obtain the tensor of the output result multiplied by the convolution unit
Figure BDA0002872386870000111
Step 3: Training of neural networks
Respectively dividing samples in a data set D, a data set E and a data set G into a training set and a testing set according to a ratio of 9:1, wherein data in the training set is used for training, data in the testing set is used for testing, training data are respectively obtained from corresponding data sets when the following steps are trained, the training data are uniformly scaled to a resolution of 416 x 128 and input into corresponding networks, iterative optimization is carried out, and loss of each batch is minimized by continuously modifying network model parameters;
in the training process, the calculation method of each loss is as follows:
internal parameter supervision synthesis loss: in the network model parameter training of the natural image, the output tensor I of the network A is taken as the depth, and the output result L of the network B and the internal parameter label e of the training data are taken as the deptht(t is 1, 2, 3, 4) respectively as a pose parameter and a camera internal parameter, respectively synthesizing two images at the viewpoint of the image c by using the image b and the image d according to a computer vision algorithm, respectively synthesizing the two images at the viewpoint of the image c by using the image c and the two images, respectively, and performing image-by-image processingCalculating the sum of intensity differences of the pixel channel and the color channel;
unsupervised synthesis loss: in the network model parameter training of ultrasonic or CT image, the output tensor of the network mu
Figure BDA0002872386870000113
As the depth, the output tensor L and the output tensor O of the network B are respectively used as a pose parameter and a camera internal parameter, images at the viewpoint of a target image are respectively synthesized by using two adjacent images of the target image according to a computer vision algorithm, and the target image and the images at the viewpoint of the target image are respectively used for calculation according to the sum of the intensity differences of pixel-by-pixel and color-by-color channels;
internal parameter error loss: utilizing output result O of network B and internal parameter label e of training datat(t is 1, 2, 3, 4) calculated as the sum of the absolute values of the differences of the components;
spatial structure error loss: in the network model parameter training of ultrasonic or CT image, the output tensor of the network mu
Figure BDA0002872386870000112
As the depth, the output tensor L and the tensor O of the network B are respectively used as pose parameters and camera internal parameters, the target image is reconstructed by taking the viewpoint of the target image as the origin of a camera coordinate system according to a computer vision algorithm, a RANSAC algorithm is adopted to fit the spatial structure of reconstruction points, and the Euclidean distance between each reconstruction point of the target image and the spatial geometric structure is calculated;
transform synthesis loss: in the network model parameter training of ultrasonic or CT image, the output tensor of the network mu
Figure BDA0002872386870000121
And as the depth, respectively using the output tensor L and the tensor O of the network B as a pose parameter and an internal parameter of the camera, and synthesizing two images at the viewpoint of the target image by using two adjacent images of the target image according to a computer vision algorithmAfter obtaining the position of each pixel, each image in the system is obtained by adding the coordinate of each pixel to the displacement result of each pixel output by the network C to obtain the new position of each pixel to form a composite result image, and the composite result image is obtained by calculating the sum of the channel intensity differences of each pixel and each color between the composite result image and the image j;
(1) On the data set D, network A and the module P of network B are trained 80000 times
Taking out training data from the data set D each time, uniformly scaling the training data to a resolution of 416 multiplied by 128, inputting the image c into the network A, inputting the image c and the image tau into the network B, and training the module P of the network B, wherein the training loss of each batch is obtained by calculating the internal parameter supervision synthesis loss;
(2) On the data set D, the module Q of network B is trained 80000 times
Taking out training data from the data set D each time, uniformly scaling the training data to a resolution of 416 multiplied by 128, inputting the image c into the network A, inputting the image c and the image tau into the network B, and training the module Q of the network B, wherein the training loss of each batch is calculated by the sum of the supervised synthesis loss of internal parameters and the error loss of the internal parameters;
(3) On the data set E, network mu and the module Q of network B are trained 80000 times for feature transfer
Each time, taking out the ultrasonic training data from the data set E, uniformly scaling to the resolution of 416 multiplied by 128, inputting the image j into the network mu, inputting the image j and the image pi into the network B, and training the module Q of the network B, wherein the training loss of each batch is calculated as follows:
z=v+W+χ (1)
wherein v is unsupervised synthesis loss, W is space structure error loss, and constant depth loss x is calculated by using the mean square error of the output result of the network mu;
(4) On the data set E, the two modules of network B are trained 80000 times according to the following steps
Taking out ultrasonic training data from a data set E every time, uniformly scaling the ultrasonic training data to a resolution of 416 x 128, inputting an image j into a network mu, inputting the image j and the image pi into a network B, adjusting two module parameters of the network B in the training process, and performing iterative optimization to minimize the loss of each image in each batch, wherein the training loss in each batch is composed of the sum of unsupervised synthesis loss, spatial structure error loss and constant depth loss, and the constant depth loss is calculated by using the mean square error of an output result of the network mu;
(5) On the data set E, network C and the two modules of network B are trained 80000 times
Taking out ultrasonic image training data from a data set E every time, uniformly zooming to a resolution of 416 x 128, inputting an image j into a network mu, inputting an image j and an image pi into a network B, taking the output of the network mu as a depth, taking the output of the network B as a pose parameter and a camera internal parameter, synthesizing two images at a visual point of the image j according to an image i and an image k respectively, inputting the two images into a network C, and carrying out iterative optimization by continuously modifying the parameters of the network C and the network B so that the loss of each image in each batch is minimum, wherein the loss in each batch is calculated as the sum of transformation synthesis loss, spatial structure error loss and constant depth loss, and the constant depth loss is calculated by using the mean square error of the output result of the network mu;
(6) On the data set E, network C and the two modules of network B are trained 50000 times to obtain the model rho
Taking out ultrasonic image training data from a data set E every time, uniformly zooming to a resolution of 416 x 128, inputting an image j into a network mu, inputting an image j and an image pi into a network B, taking the output of the network mu as a depth, taking the output of the network B as a pose parameter and a camera internal parameter, synthesizing two images at the viewpoint of the image j according to an image i and an image k respectively, inputting the two images into a network C, and carrying out iterative optimization by continuously modifying the parameters of the network C and the network B so as to minimize the loss of each image in each batch, obtaining an optimal network model parameter rho after iteration, wherein the loss in each batch is calculated as the sum of transformation synthesis loss and spatial structure error loss;
(7) On the data set G, network C and network B are trained 80000 times
Taking CT image training data out of a data set G each time, uniformly zooming to a resolution of 416 x 128, inputting an image m into a network mu, inputting the image m and an image sigma into a network B, taking the output of the network mu as a depth, taking the output of the network B as a pose parameter and a camera internal parameter, respectively synthesizing two images at the viewpoint of the image m according to an image l and an image n, inputting the two images into a network C, the loss of each image of each batch is minimized by continuously modifying the parameters of the network C and the network B and carrying out iterative optimization, the loss of each batch is calculated as the sum of transformation synthesis loss, spatial structure error loss, constant depth loss and translational motion loss Y of the camera, the constant depth loss is calculated by utilizing the mean square error of the output result of the network mu, and Y is calculated by the output pose parameter of the network B according to the constraint of the translational motion of the camera;
(8) On the data set G, network C and network B are trained 50000 times to obtain the model rho'
CT image training data are taken from the data set G each time and uniformly scaled to a resolution of 416 x 128; image m is input into network mu, and image m and image sigma are input into network B; the output of network mu is taken as the depth, and the output of network B as the pose parameters and camera internal parameters; two images at the viewpoint of image m are synthesized from image l and image n respectively and input into network C; the parameters of network C and network B are continuously modified and iterative optimization is carried out so that the loss of each image in each batch is minimized, and the optimal network model parameters rho' are obtained after iteration; the loss of each batch is calculated as the sum of the transformation synthesis loss, the spatial structure error loss and the camera translational motion loss Y, where Y is calculated from the pose parameters output by network B according to the constraint of camera translational motion;
Step 4: three-dimensional reconstruction of ultrasound or CT images
Using a self-sampled ultrasonic or CT sequence of images, each frame is uniformly scaled to a resolution of 416 x 128 and prediction is carried out with the model parameters rho or rho': for an ultrasonic sequence, image j is input into network mu and image j and image pi are input into network B; for a CT sequence, image m is input into network mu and image m and image sigma are input into network B; the output of network mu is taken as the depth, and the output of network B as the pose parameters and camera internal parameters. Key frames are then selected as follows: the first frame of the sequence is taken as the current key frame, and each frame of the sequence is taken in turn as the target frame; an image at the viewpoint of the target frame is synthesized from the current key frame using the camera pose parameters and internal parameters, and the error λ is calculated from the sum of the pixel-by-pixel, color-channel-by-color-channel intensity differences between the synthesized image and the target frame; an image at the viewpoint of the target frame is likewise synthesized from the adjacent frames of the target frame, and the error γ is calculated from the sum of the pixel-by-pixel, color-channel-by-color-channel intensity differences between the synthesized image and the target frame; the synthesis error ratio Z is then calculated by formula (2), and when Z is greater than the threshold 1.2, the current key frame is updated to the current target frame;
Z = λ / γ    (2)
Any target frame is scaled back to resolution M×N, where M = 450 and N = 300 for ultrasonic images and M = N = 512 for CT images; according to the camera internal parameters and the reconstruction algorithm of computer vision, the three-dimensional coordinates of each pixel of each frame in the camera coordinate system are calculated; further, taking the viewpoint of the first frame as the origin of the world coordinate system and combining the pose parameters of all the key frames, the three-dimensional coordinates of each pixel of each frame of the sequence in the world coordinate system are calculated by three-dimensional spatial geometric transformation.
In the examples, the experimental hyper-parameters are as follows: the optimizer adopts an Adam optimizer, the network learning rate is 0.0002, and the momentum coefficient is 0.9;
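A minimal sketch of this optimizer configuration follows; reading the momentum coefficient 0.9 as Adam's first-moment decay rate beta1 is an assumption.

```python
import tensorflow as tf

def build_train_op(total_loss):
    """Optimizer settings of this embodiment (TensorFlow 1.14 style);
    total_loss stands for the batch loss of whichever training stage is running."""
    optimizer = tf.train.AdamOptimizer(learning_rate=0.0002, beta1=0.9)
    return optimizer.minimize(total_loss)
```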
in this embodiment, network training is performed on training sets of data sets D, E and G, and respective tests are performed on test sets of data sets E and G, where table 1 is an error result graph of ultrasound image synthesis, which is obtained by calculation using formula (1), and the ultrasound image is segmented by using DenseNet to generate a 3D reconstruction result, where fig. 1 shows a three-dimensional reconstruction result of the ultrasound image; table 2 shows an error result graph of CT image synthesis, which is calculated by using formula (1), and in order to be able to view a three-dimensional reconstruction result, a 3D reconstruction result is generated by segmenting the CT image by using DenseNet, and fig. 2 shows the three-dimensional reconstruction result of the CT image; from these results, the effectiveness of the present invention can be seen.
TABLE 1
Serial number    Error
1 0.2427311078258662
2 0.21282879311343286
3 0.25152883686238026
4 0.16034263004408522
5 0.12900625223315293
6 0.1624275462222541
7 0.1094218437271453
8 0.16473407273370247
9 0.18821995626807592
10 0.10771235561024707
TABLE 2
Serial number    Error
1 0.18082489431244353
2 0.21449472955666968
3 0.21681998801393787
4 0.2115840231923817
5 0.23295494001592154
6 0.21729439551527013
7 0.2595851311112236
8 0.31779933626372536
9 0.2547147372097174
10 0.2614993640731656

Claims (1)

1. A three-dimensional reconstruction method for ultrasonic or CT medical images based on transfer learning, characterized in that the method takes an ultrasonic or CT image sequence as input, with image resolution M×N, where 100 ≤ M ≤ 1500 and 100 ≤ N ≤ 1500, and that the three-dimensional reconstruction process specifically comprises the following steps:
step 1: building a data set
(a) Constructing a natural image dataset D
Selecting a natural image website that provides image sequences together with the corresponding camera internal parameters, and downloading a image sequences and their internal parameters from the website, where 1 ≤ a ≤ 20; for each image sequence, every 3 adjacent frames are recorded as image b, image c and image d, image b and image d are spliced along the color channel to obtain image τ, and image c and image τ form a data element, where image c is the natural target image, the sampling viewpoint of image c serves as the target viewpoint, and the internal parameters of image b, image c and image d are all et (t = 1, 2, 3, 4), in which e1 is the horizontal focal length, e2 is the vertical focal length, and e3 and e4 are the two components of the principal point coordinates; if fewer than 3 frames remain at the end of a sequence, they are discarded; all sequences are used to construct the data set D, which has f elements with 3000 ≤ f ≤ 20000;
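The grouping and splicing step can be illustrated with a small NumPy sketch, assuming frames are H x W x 3 arrays and that the triplets are non-overlapping groups of three consecutive frames (an assumption; the claim only says every 3 adjacent frames).

import numpy as np

def build_elements(sequence):
    # sequence: list of H x W x 3 frames from one image sequence
    elements = []
    usable = len(sequence) - len(sequence) % 3        # frames beyond a multiple of 3 are discarded
    for s in range(0, usable, 3):
        b, c, d = sequence[s], sequence[s + 1], sequence[s + 2]
        tau = np.concatenate([b, d], axis=-1)         # splice b and d along the color channel -> H x W x 6
        elements.append((c, tau))                     # data element: (target image c, spliced image tau)
    return elements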
(b) constructing an ultrasound image dataset E
Sampling g ultrasound image sequences, where 1 ≤ g ≤ 20; every 3 adjacent frames of each sequence are recorded as image i, image j and image k, image i and image k are spliced along the color channel to obtain image π, and image j and image π form a data element, where image j is the ultrasound target image and the sampling viewpoint of image j serves as the target viewpoint;
(c) construction of a CT image dataset G
Sampling h CT image sequences, where 1 ≤ h ≤ 20; every 3 adjacent frames of each sequence are recorded as image l, image m and image n, image l and image n are spliced along the color channel to obtain image σ, and image m and image σ form a data element, where image m is the CT target image and the sampling viewpoint of image m serves as the target viewpoint; if fewer than 3 frames remain at the end of a sequence, they are discarded; all sequences are used to construct the data set G, which has ξ elements with 1000 ≤ ξ ≤ 20000;
Step 2: constructing neural networks
The resolution of the image or video processed by the neural networks is p x o, where p is the width and o is the height, with 100 ≤ p ≤ 2000 and 100 ≤ o ≤ 2000;
(1) structure of network A
Tensor H, of scale α x o x p x 3, is taken as input and tensor I, of scale α x o x p x 1, as output, where α is the batch size;
the network A consists of an encoder and a decoder; the tensor H is encoded and then decoded in sequence to obtain the output tensor I;
the encoder consists of 5 residual units, the 1st to 5th units containing 2, 3, 4, 6 and 3 residual modules respectively; each residual module performs 3 convolutions, the convolution kernel shapes are all 3 x 3, the numbers of convolution kernels are 64, 64, 128, 256 and 512, and a maximum pooling layer follows the first residual unit;
the decoder consists of 6 decoding units, each comprising a deconvolution step and a convolution step whose kernel shapes and numbers are the same; the convolution kernel shapes of the 1st to 6th decoding units are all 3 x 3 and the numbers of convolution kernels are 512, 256, 128, 64, 32 and 16 respectively; cross-layer connections are made between the encoder and decoder layers with the correspondence: 1 and 4, 2 and 3, 3 and 2, 4 and 1;
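A minimal PyTorch sketch of the two building blocks described above follows (a residual module with three 3 x 3 convolutions, and a decoding unit of deconvolution followed by convolution); the stage-to-stage channel changes, the max-pooling layer and the encoder-decoder cross-layer connections are omitted, and the activation placement and strides are assumptions.

import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    # one residual module: three 3x3 convolutions with an identity shortcut
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class DecodingUnit(nn.Module):
    # one decoding unit: a deconvolution and a convolution with the same
    # kernel shape (3x3) and the same number of kernels
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_channels, out_channels, 3,
                                         stride=2, padding=1, output_padding=1)
        self.conv = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(torch.relu(self.deconv(x))))

# e.g. the five encoder stages use 2, 3, 4, 6 and 3 ResidualModule blocks with
# 64, 64, 128, 256 and 512 channels; the six DecodingUnits use 512, 256, 128,
# 64, 32 and 16 kernels.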
(2) structure of network B
Tensor J and tensor K are taken as input, with scales α x o x p x 3 and α x o x p x 6 respectively, and tensor L and tensor O as output, with scales α x 2 x 6 and α x 4 x 1 respectively, where α is the batch size;
the network B consists of a module P and a module Q and has 11 convolution layers in total; tensor J and tensor K are first spliced along the last channel to obtain a tensor of scale α x o x p x 9, which is processed by module P and module Q to obtain the output tensor L and tensor O respectively;
module Q and module P share the first 4 convolution layers, whose structure is: the convolution kernel scales of the first two layers are 7 x 7 and 5 x 5 respectively, those of the 3rd and 4th layers are 3 x 3, and the numbers of convolution kernels of layers 1 to 4 are 16, 32, 64 and 128 in sequence;
module P, apart from the 4 shared layers, occupies the 5th to 7th convolution layers of network B, with convolution kernel scale 3 x 3 and 256 kernels per layer; the result of the 7th layer is convolved with 12 kernels of 3 x 3, and the 12 results are arranged in sequence into 2 rows to obtain the tensor L;
module Q, apart from the shared layers 1 to 4 of network B, occupies the 8th to 11th convolution layers of network B, with the output of the 2nd layer of network B used as the input of the 8th layer; the convolution kernel shapes in the 8th to 11th layers are all 3 x 3 and the number of kernels in each is 256; the result of the 11th layer is convolved with 4 kernels of 3 x 3, and the tensor O is obtained from the 4 channels;
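The two-branch layout can be sketched in PyTorch as follows; the strides, padding and the global averaging used to collapse the spatial dimensions into the 2 x 6 pose tensor and the 4 x 1 intrinsics tensor are assumptions, and only the layer counts and kernel numbers follow the text.

import torch
import torch.nn as nn

class PoseIntrinsicsNet(nn.Module):
    def __init__(self):
        super().__init__()
        # shared layers 1-4: kernels 7x7, 5x5, 3x3, 3x3 with 16, 32, 64, 128 filters
        specs = [(9, 16, 7), (16, 32, 5), (32, 64, 3), (64, 128, 3)]
        self.trunk = nn.ModuleList([
            nn.Sequential(nn.Conv2d(i, o, k, stride=2, padding=k // 2),
                          nn.ReLU(inplace=True)) for i, o, k in specs])
        # module P: layers 5-7 (256 kernels each) plus a 12-kernel head -> tensor L
        self.p_branch = nn.Sequential(
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 12, 3, padding=1))
        # module Q: layers 8-11 (256 kernels each) plus a 4-kernel head -> tensor O,
        # fed from the layer-2 features of the shared trunk
        self.q_branch = nn.Sequential(
            nn.Conv2d(32, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 4, 3, padding=1))

    def forward(self, j, pi):
        x = torch.cat([j, pi], dim=1)            # splice to 9 channels
        feats = []
        for layer in self.trunk:
            x = layer(x)
            feats.append(x)
        pose = self.p_branch(feats[-1]).mean(dim=(2, 3)).view(-1, 2, 6)   # tensor L
        intr = self.q_branch(feats[1]).mean(dim=(2, 3)).view(-1, 4, 1)    # tensor O
        return pose, intr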
(3) structure of network C
Tensor R and tensor S are taken as network input, both of scale α x o x p x 3, and tensor T as network output, of scale α x o x p x 2, where α is the batch size;
the network C is designed as an encoder-decoder structure; tensor R and tensor S are first spliced along the last channel to obtain a tensor of scale α x o x p x 6, which is encoded and then decoded to obtain the output tensor T;
the encoding structure consists of 6 encoding layers, each comprising 1 convolution, 1 batch normalization and 1 activation; the 1st encoding layer uses 7 x 7 convolution kernels and the other layers use 3 x 3 kernels; the convolution stride of the 1st and 3rd encoding layers is 1 and that of the other layers is 2; every encoding layer is activated with the Relu function, and the numbers of convolution kernels of layers 1 to 6 are 16, 32, 64, 128, 256 and 512 respectively;
the decoding structure consists of 6 decoding layers, each comprising a deconvolution unit, a connection unit and a convolution unit; the deconvolution unit performs deconvolution and Relu activation, the deconvolution kernel sizes of layers 1 to 6 are all 3 x 3, the deconvolution stride is 1 for the 1st and 2nd decoding layers and 2 for the 3rd to 6th layers, and the numbers of deconvolution kernels of layers 1 to 6 are 512, 256, 128, 64, 32 and 16 in sequence; the connection unit concatenates the features of the corresponding encoding layer with the deconvolution result and feeds them into the convolution unit; the convolution kernel size of the 1st to 5th convolution units is 3 x 3, that of the 6th is 7 x 7, the convolution strides of layers 1 to 6 are all 2, and the result of the 6th layer is processed by 2 convolution kernels of 3 x 3 to obtain the result T;
(4) structure of network mu
Tensor Ω is taken as network input, of scale α x o x p x 3, and the output tensor of network μ, of scale α x o x p x 1, is taken as network output, where α is the batch size;
the network μ consists of an encoder and a decoder, and the tensor Ω is encoded and then decoded in sequence to obtain the output tensor;
The encoder consists of 14 encoding layers, each comprising 1 convolution, 1 batch normalization and 1 activation; the 1st and 2nd encoding layers use a 7 x 7 convolution kernel structure, the 3rd and 4th use a 5 x 5 structure, and the remaining encoding layers use 3 x 3 kernels; the convolution stride is 2 for the 1st, 3rd, 5th, 7th, 9th, 11th and 13th layers and 1 for the other layers; every encoding layer uses Relu activation; in the encoding structure the numbers of convolution kernels of the 1st to 8th layers are set to 32, 64, 128, 256 and 256 respectively, and the remaining layers use 512 kernels;
the decoder consists of 7 decoding layers, each comprising a deconvolution unit, a connection unit and a convolution unit; the deconvolution unit performs deconvolution and Relu activation, all deconvolution kernel sizes are 3 x 3 with stride 2, and the numbers of deconvolution kernels of layers 1 to 7 are 512, 256, 128, 64, 32 and 16 respectively; the connection unit concatenates the encoding features with the deconvolution features of the corresponding layer and feeds them into the next convolution unit for convolution and Relu activation; in the convolution units the numbers of convolution kernels of layers 1 to 7 are 512, 256, 128, 64, 32 and 16 respectively, with kernel size 3 x 3 and stride 1; the outputs of the 4th to 7th decoding units are each multiplied by a weight to obtain the output tensor of network μ;
Step 3: training of neural networks
The samples in data set D, data set E and data set G are each divided into a training set and a test set at a ratio of 9:1; the training sets are used for training and the test sets for testing. In each of the following training steps, training data are taken from the corresponding data set, uniformly scaled to the resolution p x o and input into the corresponding network, and iterative optimization is performed by continuously modifying the network model parameters so that the loss of each batch is minimized;
in the training process, the calculation method of each loss is as follows:
Internal parameter supervised synthesis loss: in the network model parameter training on natural images, the output tensor I of network A is taken as the depth, and the output result L of network B and the internal parameter label et (t = 1, 2, 3, 4) of the training data are taken as the pose parameters and the camera internal parameters respectively; according to the computer vision algorithm, two images at the viewpoint of image c are synthesized from image b and image d respectively, and the loss is calculated from image c and the two synthesized images as the sum of the pixel-by-pixel, channel-by-channel intensity differences;
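The synthesis "according to a computer vision algorithm" used in these losses is the standard inverse warp of a source image into the target viewpoint using the predicted depth, pose and intrinsics. The sketch below is a generic PyTorch illustration of that operation and of the photometric error, not the patent's exact implementation; K is the 3 x 3 intrinsics matrix assembled from the four components and T the target-to-source pose, both assumed layouts.

import torch
import torch.nn.functional as F

def warp_to_target(src, depth, K, T):
    # src:   (N, 3, H, W) source image (e.g. image b or image d)
    # depth: (N, 1, H, W) depth predicted for the target image
    # K:     (N, 3, 3) camera intrinsics, T: (N, 4, 4) target-to-source pose
    N, _, H, W = src.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=src.dtype),
                          torch.arange(W, dtype=src.dtype), indexing='ij')
    ones = torch.ones_like(u)
    pix = torch.stack([u, v, ones], dim=0).reshape(1, 3, -1).expand(N, 3, H * W)
    cam = torch.inverse(K) @ pix * depth.reshape(N, 1, -1)        # back-project target pixels
    cam_h = torch.cat([cam, torch.ones(N, 1, H * W, dtype=src.dtype)], dim=1)
    src_cam = (T @ cam_h)[:, :3]                                  # move points into the source camera
    proj = K @ src_cam
    xy = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                # perspective division
    grid = torch.stack([2 * xy[:, 0] / (W - 1) - 1,               # normalise to [-1, 1] for grid_sample
                        2 * xy[:, 1] / (H - 1) - 1], dim=-1).reshape(N, H, W, 2)
    return F.grid_sample(src, grid, align_corners=True)

def synthesis_loss(synth, target):
    # sum of pixel-by-pixel, channel-by-channel intensity differences
    return (synth - target).abs().sum()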
Unsupervised synthesis loss: in the network model parameter training on ultrasound or CT images, the output tensor of network μ is taken as the depth, and the output tensor L and output tensor O of network B are taken as the pose parameters and camera internal parameters respectively; according to the computer vision algorithm, images at the viewpoint of the target image are synthesized from its two adjacent images respectively, and the loss is calculated from the target image and the synthesized images as the sum of the pixel-by-pixel, channel-by-channel intensity differences;
Internal parameter error loss: calculated as the sum of the absolute values of the differences between the components of the output result O of network B and the internal parameter label et (t = 1, 2, 3, 4) of the training data;
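As an illustrative sketch, with the predicted and labelled intrinsics held as length-4 vectors (an assumed layout), this loss reduces to:

import torch

def intrinsic_error_loss(pred, label):
    # pred, label: (N, 4) tensors holding the four internal parameters;
    # sum of absolute differences of the components, averaged over the batch
    return (pred - label).abs().sum(dim=1).mean()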
Spatial structure error loss: in the network model parameter training on ultrasound or CT images, the output tensor of network μ is taken as the depth, and the output tensor L and tensor O of network B are taken as the pose parameters and camera internal parameters respectively; according to the computer vision algorithm, the target image is reconstructed with the viewpoint of the target image as the origin of the camera coordinate system, the RANSAC algorithm is used to fit the spatial structure of the reconstructed points, and the loss is calculated from the Euclidean distance between each reconstructed point of the target image and the fitted spatial geometric structure;
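The claim does not fix the form of the fitted spatial structure; the sketch below assumes a plane and a plain RANSAC loop over the reconstructed points, returning the mean point-to-structure Euclidean distance.

import numpy as np

def spatial_structure_error(points, iters=200, tol=1e-3):
    # points: (M, 3) target-image pixels reconstructed in the camera frame
    best_inliers, best_plane = 0, None
    rng = np.random.default_rng(0)
    for _ in range(iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)
        if np.linalg.norm(n) < 1e-9:
            continue                              # degenerate sample, skip
        n = n / np.linalg.norm(n)
        d = -n @ p1
        inliers = (np.abs(points @ n + d) < tol).sum()
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (n, d)
    n, d = best_plane
    return np.abs(points @ n + d).mean()          # mean Euclidean distance to the fitted plane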
Transform synthesis loss: in the network model parameter training on ultrasound or CT images, the output tensor of network μ is taken as the depth, and the output tensor L and tensor O of network B are taken as the pose parameters and camera internal parameters respectively; according to the computer vision algorithm, in the process of synthesizing two images at the viewpoint of the target image from its two adjacent images, for each of the two synthesized images, once the position of each pixel has been obtained, the coordinate of each pixel is added to the per-pixel displacement output by network C to obtain a new position for each pixel, forming a synthesized result image; the loss is calculated as the sum of the pixel-by-pixel, channel-by-channel intensity differences between the synthesized result image and image j;
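A minimal PyTorch sketch of this refinement step is given below; it assumes the sampling positions are kept in the normalized [-1, 1] grid convention of grid_sample and that network C's 2-channel output is expressed in the same units, both of which are assumptions.

import torch
import torch.nn.functional as F

def refine_with_displacement(src, base_grid, displacement, target):
    # base_grid:    (N, H, W, 2) pixel sampling positions from the initial synthesis
    # displacement: (N, 2, H, W) per-pixel displacement predicted by network C
    grid = base_grid + displacement.permute(0, 2, 3, 1)   # new position of each pixel
    refined = F.grid_sample(src, grid, align_corners=True)
    loss = (refined - target).abs().sum()                 # transform synthesis loss vs. the target image
    return refined, loss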
(1) On data set D, network A and module P of network B are trained 80000 times
Training data are taken from data set D each time and uniformly scaled to the resolution p x o; image c is input into network A, image c and image τ are input into network B, and module P of network B is trained; the training loss of each batch is calculated from the internal parameter supervised synthesis loss;
(2) On data set D, module Q of network B is trained 80000 times
Training data are taken from data set D each time and uniformly scaled to the resolution p x o; image c is input into network A, image c and image τ are input into network B, and module Q of network B is trained; the training loss of each batch is calculated as the sum of the internal parameter supervised synthesis loss and the internal parameter error loss;
(3) On data set E, network μ and module Q of network B are trained 80000 times for feature migration
Ultrasound training data are taken from data set E each time and uniformly scaled to the resolution p x o; image j is input into network μ, image j and image π are input into network B, and module Q of network B is trained; the training loss of each batch is calculated as:
z = v + W + χ    (1)
where v is the unsupervised synthesis loss, W is the spatial structure error loss, and the constant depth loss χ is calculated using the mean square error of the output result of network μ;
(4) On data set E, the two modules of network B are trained 80000 times according to the following steps
Ultrasound training data are taken from data set E each time and uniformly scaled to the resolution p x o; image j is input into network μ, image j and image π are input into network B, and the parameters of both modules of network B are adjusted during training; iterative optimization is performed so that the loss of each image in each batch is minimized, the training loss of each batch being the sum of the unsupervised synthesis loss, the spatial structure error loss and the constant depth loss, where the constant depth loss is calculated using the mean square error of the output result of network μ;
(5) On data set E, network C and the two modules of network B are trained 80000 times
Ultrasound image training data are taken from data set E each time and uniformly scaled to the resolution p x o; image j is input into network μ, image j and image π are input into network B, the output of network μ is taken as the depth, and the outputs of network B are taken as the pose parameters and camera internal parameters; two images at the viewpoint of image j are synthesized from image i and image k respectively and input into network C; iterative optimization is performed by continuously modifying the parameters of network C and network B so that the loss of each image in each batch is minimized, the loss of each batch being calculated as the sum of the transform synthesis loss, the spatial structure error loss and the constant depth loss, where the constant depth loss is calculated using the mean square error of the output result of network μ;
(6) On data set E, network C and the two modules of network B are trained 50000 times to obtain the model ρ
Ultrasound image training data are taken from data set E each time and uniformly scaled to the resolution p x o; image j is input into network μ, image j and image π are input into network B, the output of network μ is taken as the depth, and the outputs of network B are taken as the pose parameters and camera internal parameters; two images at the viewpoint of image j are synthesized from image i and image k respectively and input into network C; iterative optimization is performed by continuously modifying the parameters of network C and network B so that the loss of each image in each batch is minimized, and after iteration the optimal network model parameters ρ are obtained, the loss of each batch being calculated as the sum of the transform synthesis loss and the spatial structure error loss;
(7) On data set G, network C and network B are trained 80000 times
CT image training data are taken from data set G each time and uniformly scaled to the resolution p x o; image m is input into network μ, image m and image σ are input into network B, the output of network μ is taken as the depth, and the outputs of network B are taken as the pose parameters and camera internal parameters; two images at the viewpoint of image m are synthesized from image l and image n respectively and input into network C; the loss of each image of each batch is minimized by continuously modifying the parameters of network C and network B and performing iterative optimization, the loss of each batch being calculated as the sum of the transform synthesis loss, the spatial structure error loss, the constant depth loss and the camera translational motion loss Y, where the constant depth loss is calculated using the mean square error of the output result of network μ and Y is calculated from the pose parameters output by network B according to the constraint of camera translational motion;
(8) On data set G, network C and network B are trained 50000 times to obtain the model ρ'
CT image training data are taken from data set G each time and uniformly scaled to the resolution p x o; image m is input into network μ, image m and image σ are input into network B, the output of network μ is taken as the depth, and the outputs of network B are taken as the pose parameters and camera internal parameters; two images at the viewpoint of image m are synthesized from image l and image n respectively and input into network C; iterative optimization is performed by continuously modifying the parameters of network C and network B so that the loss of each image in each batch is minimized, and after iteration the optimal network model parameters ρ' are obtained, the loss of each batch being calculated as the sum of the transform synthesis loss, the spatial structure error loss and the camera translational motion loss Y, where Y is calculated from the pose parameters output by network B according to the constraint of camera translational motion;
Step 4: three-dimensional reconstruction of ultrasound or CT images
Using a self-sampled ultrasound or CT sequence, each frame is uniformly scaled to the resolution p x o and predicted with the model parameters ρ or ρ': for an ultrasound sequence, image j is input to network μ and image j together with image π is input to network B; for a CT sequence, image m is input to network μ and image m together with image σ is input to network B; the output of network μ is taken as the depth and the outputs of network B as the pose parameters and camera internal parameters; key frames are selected as follows: the first frame of the sequence is taken as the current key frame, and each frame of the sequence is taken in turn as the target frame; an image at the viewpoint of the target frame is synthesized from the current key frame using the camera pose parameters and internal parameters, and the error λ is calculated as the sum of the pixel-by-pixel, channel-by-channel intensity differences between the synthesized image and the target frame; an image at the viewpoint of the target frame is likewise synthesized from the frame adjacent to the target frame, and the error γ is calculated in the same way; the synthesis error ratio Z is further calculated by formula (2), and when Z is greater than a threshold η, where 1 < η < 2, the current key frame is updated to the current target frame;
Z = λ / γ    (2)
For any target frame, the resolution is scaled back to M x N; using the camera internal parameters, the three-dimensional coordinates of every pixel of every frame in the camera coordinate system are computed with the computer-vision reconstruction algorithm; then, taking the viewpoint of the first frame as the origin of the world coordinate system and combining the pose parameters of all key frames, the three-dimensional coordinates of every pixel of every frame of the sequence in the world coordinate system are obtained by three-dimensional geometric transformation.
CN202011621411.4A 2020-12-30 2020-12-30 Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning Active CN112767532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011621411.4A CN112767532B (en) 2020-12-30 2020-12-30 Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning

Publications (2)

Publication Number Publication Date
CN112767532A CN112767532A (en) 2021-05-07
CN112767532B true CN112767532B (en) 2022-07-08

Family

ID=75698230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011621411.4A Active CN112767532B (en) 2020-12-30 2020-12-30 Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning

Country Status (1)

Country Link
CN (1) CN112767532B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689543B (en) * 2021-08-02 2023-06-27 华东师范大学 Epipolar constrained sparse attention mechanism medical image three-dimensional reconstruction method
CN113689548B (en) * 2021-08-02 2023-06-23 华东师范大学 Medical image three-dimensional reconstruction method based on mutual attention transducer
CN113689542B (en) * 2021-08-02 2023-06-23 华东师范大学 Ultrasonic or CT medical image three-dimensional reconstruction method based on self-attention transducer
CN113689547B (en) * 2021-08-02 2023-06-23 华东师范大学 Ultrasonic or CT medical image three-dimensional reconstruction method of cross-view visual transducer
CN113689545B (en) * 2021-08-02 2023-06-27 华东师范大学 2D-to-3D end-to-end ultrasound or CT medical image cross-modal reconstruction method
CN113689544B (en) * 2021-08-02 2023-06-27 华东师范大学 Cross-view geometric constraint medical image three-dimensional reconstruction method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020015167A1 (en) * 2018-07-17 2020-01-23 西安交通大学 Image super-resolution and non-uniform blur removal method based on fusion network
CN109584164A (en) * 2018-12-18 2019-04-05 华中科技大学 Medical image super-resolution three-dimensional rebuilding method based on bidimensional image transfer learning
CN110428887A (en) * 2019-08-05 2019-11-08 河南省三门峡市中心医院(三门峡市儿童医院、三门峡市妇幼保健院) A kind of brain tumor medical image three-dimensional reconstruction shows exchange method and system
CN110458950A (en) * 2019-08-14 2019-11-15 首都医科大学附属北京天坛医院 A kind of method for reconstructing three-dimensional model, mobile terminal, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CNN based dense underwater 3D scene reconstruction by transfer learning using bubble database; Kazuto Ichimaru et al.; 2019 IEEE Winter Conference on Applications of Computer Vision; 20191231; entire document *
Express carton volume measurement algorithm based on single-view 3D reconstruction; Feng Xiangru et al.; Computer Systems & Applications; 20200930; Vol. 29, No. 10 *

Also Published As

Publication number Publication date
CN112767532A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN112767532B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on transfer learning
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
CN113689545B (en) 2D-to-3D end-to-end ultrasound or CT medical image cross-modal reconstruction method
CN111310707A (en) Skeleton-based method and system for recognizing attention network actions
CN110264526B (en) Scene depth and camera position and posture solving method based on deep learning
CN113177882A (en) Single-frame image super-resolution processing method based on diffusion model
CN111583285A (en) Liver image semantic segmentation method based on edge attention strategy
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN112700534B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on feature migration
CN112734906B (en) Three-dimensional reconstruction method of ultrasonic or CT medical image based on knowledge distillation
CN112734907B (en) Ultrasonic or CT medical image three-dimensional reconstruction method
CN112700535B (en) Ultrasonic image three-dimensional reconstruction method for intelligent medical auxiliary diagnosis
CN113689548B (en) Medical image three-dimensional reconstruction method based on mutual attention transducer
CN113689542B (en) Ultrasonic or CT medical image three-dimensional reconstruction method based on self-attention transducer
CN117036162B (en) Residual feature attention fusion method for super-resolution of lightweight chest CT image
CN113689544B (en) Cross-view geometric constraint medical image three-dimensional reconstruction method
CN113689546B (en) Cross-modal three-dimensional reconstruction method for ultrasound or CT image of two-view twin transducer
Bazrafkan et al. Deep neural network assisted iterative reconstruction method for low dose ct
CN113689547B (en) Ultrasonic or CT medical image three-dimensional reconstruction method of cross-view visual transducer
CN113379863B (en) Dynamic double-tracing PET image joint reconstruction and segmentation method based on deep learning
CN113689543B (en) Epipolar constrained sparse attention mechanism medical image three-dimensional reconstruction method
CN112419283A (en) Neural network for estimating thickness and method thereof
CN113743411A (en) Unsupervised video consistent part segmentation method based on deep convolutional network
CN114283216A (en) Image artifact removing method, device and equipment and storage medium
Arbane et al. DRSU-net: Depth-Residual Separable U-net model for Semantic Segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant