CN114998647A - Breast cancer full-size pathological image classification method based on attention multi-instance learning - Google Patents
- Publication number
- CN114998647A (Application CN202210526657.6A)
- Authority
- CN
- China
- Prior art keywords
- network
- stage
- full
- attention
- size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The breast cancer full-size pathological image classification method based on attention multi-instance learning comprises the following steps: step 1: acquiring a data set and labels; step 2: preprocessing the data set; step 3: constructing a two-stage full-size pathological image (WSI) classification network; step 4: saving the optimal weights of the two-stage network; step 5: calculating the accuracy of the network on the test set. The SAMIL of the present invention introduces a lightweight and efficient SA module that fuses spatial attention and channel attention, which capture pixel-level pairwise relationships and channel dependencies, respectively. SAMIL stacks MHA with LSTM to adaptively highlight the most distinctive instance features and better compute the correlation between selected instances, improving classification accuracy.
Description
Technical Field
The invention relates to the technical field of image classification methods, in particular to a breast cancer full-size pathological image classification method based on attention multi-instance learning.
Background
According to recent global cancer statistics, with 2.3 million newly diagnosed cases in women in 2020, breast cancer has surpassed lung cancer as the most common cancer in the world. Meanwhile, the digitization of full-size images (WSI), i.e., hematoxylin and eosin (H&E)-stained biopsy tissue specimens, provides an exact reference for breast cancer diagnosis.
In recent years, with the breakthrough success of deep learning in various computer vision tasks, computer-assisted WSI classification for cancer diagnosis has received increasing attention. In particular, some researchers have cast WSI classification as a weakly supervised task and introduced multi-instance learning (MIL) to address the problems caused by the enormous scale of WSIs and the difficulty of pixel-level labeling required for fully supervised learning. MIL solutions mainly focus on two key links: constructing an instance-level selection module, which calculates the positive probability of each slice-level image from extracted depth features and takes the K slices with the highest probability as candidate instances; and designing an aggregation operator, which generates bag embeddings used to calculate a score for each bag.
Although multi-instance learning has made great progress in full-slide pathological image classification, it has the following disadvantages: the feature correlations of the sub-features are rarely described in the spatial or channel dimensions, which hinders finding the cancer cells of minimal breast cancer lymph node metastases, and there are limitations in capturing the dependencies between different instances that help classify a WSI.
Disclosure of Invention
The invention aims to provide a breast cancer full-size pathological image classification method based on attention multi-instance learning, which obtains more discriminative patch-level representations and improves the accuracy of classifying pathological images of breast cancer lymph node metastases.
A breast cancer full-size pathological image classification method based on attention multi-instance learning comprises the following steps:
step 1: acquiring a data set and a label: acquiring a data set and a label of a breast cancer histopathology image, and randomly dividing the breast cancer histopathology image into a training set, a verification set and a test set according to a proportion;
step 2: preprocessing the data set: the divided data sets are preprocessed based on an inverse binarization thresholding operation; a background/tissue mask is generated for each WSI picture, the tissue region is cut into slices of size a × a, and the coordinate set of the slices is stored. To further reduce the amount of computation, a probability p is added, and the coordinates of a slice are saved only when the fraction of tissue region in the slice is greater than p. The processed WSI image X'_i can be represented as X'_i = {x_{i,1}, x_{i,2}, …, x_{i,m}}, where m is the number of slices in each full-size breast cancer pathological image;
step 3: constructing a two-stage full-size pathological image (WSI) classification network: the first stage is used for selecting instances: the SA-ResNet50 network extracts features from the slices, and the first K instances with the highest probability in each WSI are selected based on a multi-instance learning method; the second stage performs prediction at the full-size level: an aggregator constructed by stacking a multi-head attention (MHA) network with a long short-term memory (LSTM) network makes a reliable prediction for the whole WSI image;
step 31: in stage one, the SA-ResNet50 network extracts features from the slices: a slice X' ∈ R^{C×H×W} serves as input to the pre-trained SA-ResNet50 network; after the residual structure of ResNet50, a feature matrix X ∈ R^{c×h×w} is obtained. Shuffle attention first divides X into G groups along the channel dimension, i.e., X = [X_1, …, X_G], X_k ∈ R^{c/G×h×w}. Each X_k is further divided into two branches X_{k1}, X_{k2} ∈ R^{c/2G×h×w}: one branch exploits the interrelationship between channels to output a channel attention map, while the other exploits the spatial relationships among features to generate a spatial attention map. The results of the two branches are concatenated so that X'_k has the same number of channels as X_k, and all feature matrices X'_k then undergo an aggregation operation, the final output of the SA module being X_out ∈ R^{c×h×w}. X_out generates the feature vector X_gap of the slice through global average pooling.
Step 32: training the SA-ResNet50 network with selected patches: after the feature vector of each slice is obtained, the probability of each slice is obtained through a Softmax function, the probabilities of the slices in each full-size image are sorted, and the T patches with the highest probability ranking in each full-size image are used to train the SA-ResNet50 network.
Step 33: obtaining the input V for full-size-level prediction: the slices in each WSI are predicted using the optimal weight file pre-trained in stage one, the predicted probabilities are sorted, and the first K instances with the highest probability in each full-size image are taken as the input V = [v_1, …, v_K] ∈ R^{K×C} for full-size-level prediction.
Step 34: aggregating the first K instances with the highest probability: using MHA and LSTM, for the i-th head attention unit H_i in the MHA, the calculation formula is as follows:
wherein V = [v_1, …, v_K] ∈ R^{K×C}, V denotes the features of the first K selected instances, K denotes the number of instances, v_1, …, v_K represent single-instance features, v_j, v_k ∈ V, C is the instance-feature embedding dimension, the convolution kernels are W ∈ R^{D×1} and Z ∈ R^{D×C}, and D is the feature embedding dimension. The hyperbolic tangent tanh is the activation function. After the element-wise multiplication, for the MHA, all outputs of the head units are concatenated and another convolution is performed to project back to the original dimension:
wherein the projected result denotes the first K instances after feature enhancement, V = [v_1, …, v_K] ∈ R^{K×C} denotes the features of the first K selected instances, K denotes the number of instances, v_1, …, v_K represent single-instance features, W_pro ∈ R^{(H×D)×C} represents a convolution kernel, T denotes the matrix transpose, H_1, …, H_h denote the head attention units, h denotes the number of heads, and C and D are feature embedding dimensions.
Step 35: further modeling the dependencies between the selected Top-K instances: the LSTM is further used to construct interactions and fuse the interacting instances to obtain a discriminative image-level representation. The LSTM can capture short-term and long-term dependencies; given an input feature sequence (v_1, …, v_K), the hidden layer of the LSTM is recursively calculated from t = 1 to t = K using the following formula:
wherein f_t, i_t, o_t denote the forget gate, input gate and output gate, respectively; W_{f,i,o,c} and U_{f,i,o,c} denote the weight matrices to be learned, b_{f,i,o,c} denotes the bias vector, h_{t-1} is the hidden vector, c_t denotes the memory cell, and Sigmoid and tanh denote activation functions. The output of the last LSTM step serves as the final bag-level representation vector for prediction.
Step 4: saving the optimal weights of the two-stage network: the data set is input into the two-stage classification network; the stage-one network is trained with the training set, the network parameters are updated in each iteration, the verification set is evaluated once every three iterations, and the optimal stage-one weights are saved according to the best verification-set accuracy; the data set is then processed with the optimal stage-one weights, the K instances with the highest probability ranking in each WSI are selected as the stage-two input, the stage-two network is initialized with the optimal stage-one weights, one verification is performed after each training iteration, and the optimal stage-two weights are saved according to the best verification-set accuracy;
and 5: and calculating the accuracy of the network on the test set: and initializing the network by using the two-stage optimal weight, inputting the test set into the network to obtain a prediction result of each WSI, comparing the prediction result with the real label data, counting the number of the WSIs with correct prediction and wrong prediction, and calculating the accuracy of the network on the test set.
Compared with the prior art, the invention has the following beneficial effects:
(1) SAMIL introduces a lightweight and efficient SA module that fuses spatial attention and channel attention, which capture pixel-level pairwise relationships and channel dependencies, respectively.
(2) SAMIL stacks MHA with LSTM to adaptively highlight the most distinctive instance features to better compute the correlation between selected instances, improving classification accuracy.
Drawings
FIG. 1 is an overall block diagram of the SAMIL model.
Detailed Description
The experimental data used in the present invention come from the lymph node metastasis data set of the 2016 Camelyon Grand Challenge (Camelyon16). The data set contains 399 full-size images, including both normal and metastatic cases, for the detection of metastases in HE-stained histological sections of sentinel axillary lymph nodes of breast cancer patients.
As illustrated in the schematic diagram of the invention, the two-stage breast cancer full-size pathological image classification method based on attention multi-instance learning comprises the following steps:
Step 1: acquiring the data set and labels: the lymph node metastasis data set is randomly divided into a training set, a verification set and a test set at a ratio of roughly 2:1:1, giving 204 training, 95 verification and 100 test images.
Step 2: preprocessing the data set: the divided data sets are preprocessed based on an inverse binarization thresholding operation; a background/tissue mask is generated for each WSI picture, the tissue region is cut into slices of size 512 × 512, and the coordinate set of the slices is stored. To further reduce the amount of computation, a probability value of 0.4 is added, and the coordinates of a slice are saved only when the fraction of tissue region in the slice is greater than 0.4. The processed WSI image X'_i can be represented as X'_i = {x_{i,1}, x_{i,2}, …, x_{i,m}}, where m is the number of slices in each full-size breast cancer pathological image;
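The preprocessing step above can be sketched in a few lines (a minimal NumPy illustration; the function name, the fixed grayscale threshold of 200, and the non-overlapping tiling are assumptions for illustration, not details from the patent):

```python
import numpy as np

def extract_patch_coords(gray, patch=512, p=0.4, thresh=200):
    """Sketch of the preprocessing step: build a tissue mask by inverse
    binary thresholding (pixels darker than `thresh` count as tissue),
    tile the image into patch x patch slices, and keep the coordinates
    of slices whose tissue fraction exceeds p."""
    # Inverse binarization: dark (stained) tissue -> 1, bright background -> 0.
    mask = (gray < thresh).astype(np.uint8)
    h, w = mask.shape
    coords = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tile = mask[y:y + patch, x:x + patch]
            # Save the slice only if its tissue fraction is above p.
            if tile.mean() > p:
                coords.append((y, x))
    return coords
```

On a real WSI the mask would be computed at a low magnification level and the coordinates mapped back to full resolution.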
Step 3: constructing a two-stage full-size pathological image (WSI) classification network: the first stage is used for selecting instances: the SA-ResNet50 network extracts features from the slices, and the 10 instances with the highest probability in each WSI are selected based on a multi-instance learning method; the second stage performs prediction at the full-size level: an aggregator constructed by stacking a multi-head attention (MHA) network with a long short-term memory (LSTM) network makes a reliable prediction for the whole WSI image;
Step 31: in stage one, the SA-ResNet50 network extracts features from the slices: each slice x_{i,j} ∈ R^{3×512×512} is scaled to 224 × 224 × 3 pixels as input to the pre-trained SA-ResNet50 network. An SA module is inserted into each residual stage (e.g., Conv2_x) of ResNet-50. The input to the SA module is a feature matrix X ∈ R^{256×56×56}. The SA module first divides X into 64 groups along the channel dimension, i.e., X = [X_1, …, X_k, …, X_64], X_k ∈ R^{4×56×56}. Each X_k is further divided into two branches X_{k1}, X_{k2} ∈ R^{2×56×56}: one branch exploits the interrelationship between channels to output a channel attention map X'_{k1} ∈ R^{2×56×56}, while the other exploits the spatial relationships among features to generate a spatial attention map X'_{k2} ∈ R^{2×56×56}. The two branches are concatenated to obtain X'_k ∈ R^{4×56×56}, and all feature matrices X'_k then undergo an aggregation operation, the final output of the SA module being X_out ∈ R^{256×56×56}. The SA blocks in the Conv3_x, Conv4_x and Conv5_x residual stages are similar. X_out generates the feature vector X_gap ∈ R^{2048×1×1} through global average pooling.
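The group-wise split of the SA module can be illustrated as follows (a simplified NumPy sketch: the channel and spatial gates here are parameter-free sigmoid stand-ins for the learned attention branches, and the channel-shuffle aggregation is omitted, so this conveys only the data flow, not the patent's exact module):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sa_module(x, groups=64):
    """Illustrative SA block: the (c, h, w) feature map is split into
    `groups` groups along the channel axis; each group is halved into a
    channel-attention branch and a spatial-attention branch, and the two
    gated results are concatenated back, preserving the input shape."""
    c, h, w = x.shape
    out = np.empty_like(x)
    gc = c // groups                      # channels per group
    for g in range(groups):
        grp = x[g * gc:(g + 1) * gc]
        x1, x2 = grp[:gc // 2], grp[gc // 2:]
        # Channel branch: gate each channel by its global average response.
        s = x1.mean(axis=(1, 2), keepdims=True)
        x1 = x1 * sigmoid(s)
        # Spatial branch: gate each pixel by a normalized response map.
        m = (x2 - x2.mean()) / (x2.std() + 1e-5)
        x2 = x2 * sigmoid(m)
        out[g * gc:(g + 1) * gc] = np.concatenate([x1, x2], axis=0)
    return out
```

Because both gates lie in (0, 1), the block re-weights features without changing the tensor shape, which is why it can be dropped into each residual stage of ResNet-50.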
Step 32: training the SA-ResNet50 network with the selected patches: after the feature vector of each slice is obtained, the probability of each slice is obtained through a Softmax function, the probabilities of the slices in each full-size image are sorted, and the 2 patches with the highest probability ranking in each full-size image are taken to train the SA-ResNet50 network.
Step 33: obtaining the input V for full-size-level prediction: the slices in each WSI are predicted using the optimal weight file pre-trained in stage one, the predicted probabilities are sorted, and the first 10 instances with the highest probability in each full-size image are taken as the input V = [v_1, …, v_{10}] ∈ R^{10×2048} for the stage-two full-size-level prediction.
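The instance-selection step reduces to ranking the stage-one slice probabilities and keeping the top ten; a tiny illustrative helper (the function name is ours, not the patent's):

```python
import numpy as np

def select_top_k(probs, k=10):
    """Given the stage-one positive probability of every slice in a WSI,
    return the indices of the k highest-probability slices, i.e. the
    instances forwarded to the stage-two aggregator."""
    order = np.argsort(probs)[::-1]      # sort indices by descending probability
    return order[:k].tolist()
```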
Step 34: aggregating the first 10 instances with the highest probability: with MHA and LSTM, for the i-th head attention unit in the multi-head attention, the calculation formula is as follows:
wherein V = [v_1, …, v_{10}] ∈ R^{10×2048}, V denotes the first 10 selected instance features, v_1, …, v_{10} represent single-instance features, v_j, v_k ∈ V, and the convolution kernels are W ∈ R^{512×1} and Z ∈ R^{512×2048}. The hyperbolic tangent tanh is the activation function. After the element-wise multiplications, key instances are highlighted according to the relationships between them. For the MHA, the invention concatenates all outputs of the head units and performs another convolution to project back to the original dimensions:
wherein the projected result denotes the first 10 instances after feature enhancement, V = [v_1, …, v_{10}] ∈ R^{10×2048} denotes the first 10 selected instance features, v_1, …, v_{10} represent single-instance features, W_pro ∈ R^{(3×512)×2048} represents a convolution kernel, T denotes the matrix transpose, H_1, …, H_h denote the head attention units, and h denotes the number of heads; in this study, h = 3. Multi-head attention recalibrates all instance features from different representation subspaces, enriching the originally selected instances V.
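Because the formula images are not reproduced in this text, the multi-head aggregation can only be sketched under assumptions. The NumPy version below follows the classic attention-based MIL form suggested by the stated symbol shapes (each head i scores the K instances with softmax(tanh(V Z_i^T) w_i), weights the transformed features element-wise, and the concatenated heads are projected back by W_pro); it should be read as an illustrative reconstruction, not the patent's exact formula:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mha_aggregate(V, Zs, ws, W_pro):
    """Illustrative multi-head instance aggregation.
    V: (K, C) selected instance features; Zs: list of h matrices (D, C);
    ws: list of h vectors (D,); W_pro: (h*D, C) output projection."""
    heads = []
    for Z, w in zip(Zs, ws):
        T = np.tanh(V @ Z.T)             # (K, D) transformed instances
        a = softmax(T @ w)               # (K,) attention over instances
        heads.append(a[:, None] * T)     # (K, D) element-wise weighted head
    H = np.concatenate(heads, axis=1)    # (K, h*D) concatenated heads
    return H @ W_pro                     # (K, C) enhanced instances
```

The output keeps the (K, C) shape of V, so the enhanced instances can feed the LSTM of step 35 directly.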
Step 35: further modeling the dependencies between the first 10 selected instances: the LSTM is further used to construct interactions and fuse the interacting instances to obtain a discriminative image-level representation. The LSTM can capture short-term and long-term dependencies; given the input feature sequence (v_1, …, v_{10}), the hidden layer of the LSTM is recursively calculated from t = 1 to t = 10 using the following formula:
wherein f_t, i_t, o_t denote the forget gate, input gate and output gate, respectively; W_{f,i,o,c} and U_{f,i,o,c} denote the weight matrices to be learned, b_{f,i,o,c} denotes the bias vector, h_t is the hidden vector, c_t is the memory cell, and Sigmoid and tanh denote activation functions. In the feature fusion module, the invention stacks two layers of LSTM so that the enhanced instances can interact more fully. The output of the last LSTM serves as the final bag-level representation vector for prediction.
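The standard LSTM gate recurrence referenced above can be written out explicitly (a single-layer NumPy sketch; the patent stacks two layers, and the parameter layout here is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_aggregate(V, params):
    """Run the K enhanced instance features through the standard LSTM
    gate equations (forget f, input i, output o, candidate cell g) and
    return the final hidden state as the bag-level representation.
    `params` holds the (W, U, b) triples keyed by gate name."""
    K, C = V.shape
    D = params["bf"].shape[0]
    h = np.zeros(D)                      # hidden vector h_t
    c = np.zeros(D)                      # memory cell c_t
    for t in range(K):
        v = V[t]
        f = sigmoid(params["Wf"] @ v + params["Uf"] @ h + params["bf"])
        i = sigmoid(params["Wi"] @ v + params["Ui"] @ h + params["bi"])
        o = sigmoid(params["Wo"] @ v + params["Uo"] @ h + params["bo"])
        g = np.tanh(params["Wc"] @ v + params["Uc"] @ h + params["bc"])
        c = f * c + i * g                # gated memory update
        h = o * np.tanh(c)               # new hidden state
    return h                             # final bag-level vector
```

A second stacked layer would simply consume the sequence of hidden states produced by the first.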
Step 4: saving the optimal weights of the two-stage network: the data set is input into the two-stage classification network, and the stage-one network is trained with the training set, updating the network parameters in each iteration. The verification set is evaluated once every three iterations, and the optimal stage-one weights are saved according to the best verification-set accuracy; during training, an Adam optimizer is used to relieve the gradient oscillation problem, with the learning rate set to 1e-4 and the weight decay set to 1e-5. The data set is then processed with the optimal stage-one weights, the 10 instances with the highest probability ranking in each WSI are selected as the stage-two input, and the stage-two network is initialized with the optimal stage-one weights. In the stage-two training process, an Adam optimizer is used with the learning rate set to 1e-4 and the weight decay set to 1e-4; one verification is performed after each training iteration, and the optimal stage-two weights are saved according to the best verification-set accuracy;
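The checkpointing scheme of step 4 (validate every few iterations, keep the weights with the best validation accuracy) can be sketched generically; `train_step` and `validate` are placeholder callbacks standing in for the actual optimizer and evaluation code, not APIs from the patent:

```python
def train_stage_one(num_iters, train_step, validate, val_every=3):
    """Run `num_iters` parameter updates; every `val_every` iterations,
    evaluate on the verification set and keep a copy of the weights
    whose validation accuracy is the best seen so far."""
    best_acc, best_weights = -1.0, None
    weights = {}
    for it in range(1, num_iters + 1):
        weights = train_step(weights)            # one parameter update
        if it % val_every == 0:
            acc = validate(weights)
            if acc > best_acc:                   # keep the best checkpoint
                best_acc, best_weights = acc, dict(weights)
    return best_acc, best_weights
```

Stage two follows the same pattern with `val_every=1`, i.e. one verification after every iteration.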
and 5: calculating the accuracy of the network on a test set: and initializing the network by using two-stage optimal weight, inputting the test set into the network to obtain a prediction result of each WSI, comparing the prediction result with the real label data of 100 test sets, and counting the number of the WSIs with correct prediction and wrong prediction so as to calculate the accuracy of SAMIL on the test sets.
Following the above steps, the invention provides a novel SAMIL model for the breast cancer WSI classification task. SAMIL uses a shuffle attention (SA) module to select discriminative instances and performs bag-level prediction using multi-head attention (MHA) stacked with an LSTM, thereby exploiting the advantages of the attention mechanism to solve the MIL problem. In addition, experimental results show that, compared with state-of-the-art MIL methods, the method achieves superior performance on the Camelyon16 data set, with an accuracy of up to 96.56%.
Claims (2)
1. The breast cancer full-size pathological image classification method based on attention multi-instance learning is characterized by comprising the following steps: step 1: acquiring a data set and labels: acquiring a data set and labels of breast cancer histopathology images, and randomly dividing them into a training set, a verification set and a test set in proportion; step 2: preprocessing the data set: preprocessing the divided data set based on an inverse binarization thresholding operation, generating a background/tissue mask for each WSI picture, cutting the tissue region into slices of size a × a, and storing the coordinate set of the slices; to further reduce the amount of computation, a probability p is added, and the coordinates of a slice are saved only when the fraction of tissue region in the slice is greater than p; the processed WSI image X'_i can be represented as X'_i = {x_{i,1}, x_{i,2}, …, x_{i,m}}, where m is the number of slices in each full-size breast cancer pathological image; step 3: constructing a two-stage full-size pathological image (WSI) classification network: the first stage is used for selecting instances, the SA-ResNet50 network extracts features from the slices, and the first K instances with the highest probability in each WSI are selected by a multi-instance-learning-based method; the second stage performs prediction at the full-size level, and an aggregator constructed by stacking multi-head attention (MHA) with a long short-term memory (LSTM) network makes a reliable prediction for the whole WSI image; step 4: saving the optimal weights of the two-stage network: inputting the data set into the two-stage classification network, training the stage-one network with the training set, updating the network parameters in each iteration, performing one verification on the verification set every three iterations, and saving the optimal stage-one weights according to the best verification-set accuracy; processing the data set with the optimal stage-one weights, selecting the K instances with the highest probability ranking in each WSI as the stage-two input, initializing the stage-two network with the optimal stage-one weights, performing one verification after each training iteration, and saving the optimal stage-two weights according to the best verification-set accuracy; step 5: calculating the accuracy of the network on the test set: initializing the network with the stage-two optimal weights, inputting the test set into the network to obtain a prediction result for each WSI, comparing the predictions with the real label data, counting the numbers of correctly and incorrectly predicted WSIs, and calculating the accuracy of the network on the test set.
2. The breast cancer full-size pathological image classification method based on attention multi-instance learning according to claim 1, characterized in that step 3 comprises: step 31: in stage one, the SA-ResNet50 network extracts features from the slices: a slice X' ∈ R^{C×H×W} serves as input to the pre-trained SA-ResNet50 network; after the residual structure of ResNet50, a feature matrix X ∈ R^{c×h×w} is obtained; shuffle attention first divides X into G groups along the channel dimension, i.e., X = [X_1, …, X_G], X_k ∈ R^{c/G×h×w}; each X_k is further divided into two branches X_{k1}, X_{k2} ∈ R^{c/2G×h×w}; one branch exploits the interrelationship between channels to output a channel attention map, the other exploits the spatial relationships among features to generate a spatial attention map, and the results of the two branches are concatenated so that X'_k has the same number of channels as X_k; all feature matrices X'_k then undergo an aggregation operation, the final output of the SA module being X_out ∈ R^{c×h×w}; X_out generates the feature vector X_gap of the slice through global average pooling; step 32: training the SA-ResNet50 network with selected patches: after obtaining the feature vector of each slice, the probability of each slice is obtained through a Softmax function, the probabilities of the slices in each full-size image are sorted, and the T patches with the highest probability ranking in each full-size image are used to train the SA-ResNet50 network; step 33: obtaining the input V for full-size-level prediction: the slices in each WSI are predicted using the optimal weight file pre-trained in stage one, the predicted probabilities are sorted, and the first K instances with the highest probability in each full-size image are taken as the input V = [v_1, …, v_K] ∈ R^{K×C} for full-size-level prediction; step 34: aggregating the first K instances with the highest probability: using MHA and LSTM, for the i-th head attention unit H_i in the MHA, the calculation formula is as follows:
wherein V = [v_1, …, v_K] ∈ R^{K×C}, V denotes the features of the first K selected instances, K denotes the number of instances, v_1, …, v_K represent single-instance features, v_j, v_k ∈ V, C is the instance-feature embedding dimension, the convolution kernels are W ∈ R^{D×1} and Z ∈ R^{D×C}, D is the feature embedding dimension, and tanh is the activation function; after the element-wise multiplication, for the MHA, all outputs of the head units are concatenated and another convolution is performed to project back to the original dimensions:
wherein the projected result denotes the first K instances after feature enhancement, V = [v_1, …, v_K] ∈ R^{K×C} denotes the features of the first K selected instances, K denotes the number of instances, v_1, …, v_K represent single-instance features, W_pro ∈ R^{(H×D)×C} represents a convolution kernel, T denotes the matrix transpose, H_1, …, H_h denote the head attention units, h denotes the number of heads, and C and D are feature embedding dimensions; step 35: further modeling the dependencies between the selected Top-K instances: the LSTM is further used to construct interactions and fuse the interacting instances to obtain a discriminative image-level representation; the LSTM can capture short-term and long-term dependencies; given an input feature sequence (v_1, …, v_K), the hidden layer of the LSTM is recursively calculated from t = 1 to t = K using the following formula:
wherein f_t, i_t, o_t denote the forget gate, input gate and output gate, respectively; W_{f,i,o,c} and U_{f,i,o,c} denote the weight matrices to be learned, b_{f,i,o,c} denotes the bias vector, h_{t-1} is the hidden vector, c_t denotes the memory cell, Sigmoid and tanh denote activation functions, and the output of the last LSTM serves as the final bag-level representation vector for prediction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210526657.6A CN114998647B (en) | 2022-05-16 | 2022-05-16 | Breast cancer full-size pathological image classification method based on attention multi-instance learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114998647A true CN114998647A (en) | 2022-09-02 |
CN114998647B CN114998647B (en) | 2024-05-07 |
Family
ID=83027208
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210526657.6A Active CN114998647B (en) | 2022-05-16 | 2022-05-16 | Breast cancer full-size pathological image classification method based on attention multi-instance learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114998647B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110415212A (en) * | 2019-06-18 | 2019-11-05 | 平安科技(深圳)有限公司 | Abnormal cell detection method, device and computer readable storage medium |
US20200356724A1 (en) * | 2019-05-06 | 2020-11-12 | University Of Electronic Science And Technology Of China | Multi-hop attention and depth model, method, storage medium and terminal for classification of target sentiments |
CN114238577A (en) * | 2021-12-17 | 2022-03-25 | 中国计量大学上虞高等研究院有限公司 | Multi-task learning emotion classification method integrated with multi-head attention mechanism |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117237781A (en) * | 2023-11-16 | 2023-12-15 | 哈尔滨工业大学(威海) | Attention mechanism-based double-element fusion space-time prediction method |
CN117237781B (en) * | 2023-11-16 | 2024-03-19 | 哈尔滨工业大学(威海) | Attention mechanism-based double-element fusion space-time prediction method |
CN117830227A (en) * | 2023-12-11 | 2024-04-05 | 皖南医学院 | Oral cancer tumor staging method based on pathology and CT multi-modal model |
Similar Documents
Publication | Title | |
---|---|---|
CN107292256B (en) | Auxiliary task-based deep convolution wavelet neural network expression recognition method | |
CN105512289B (en) | Image search method based on deep learning and Hash | |
CN111126488B (en) | Dual-attention-based image recognition method | |
CN108875076B (en) | Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network | |
CN108765383B (en) | Video description method based on deep migration learning | |
CN114998647A (en) | Breast cancer full-size pathological image classification method based on attention multi-instance learning | |
CN111612008A (en) | Image segmentation method based on convolution network | |
CN109033978B (en) | Error correction strategy-based CNN-SVM hybrid model gesture recognition method | |
CN111898703B (en) | Multi-label video classification method, model training method, device and medium | |
CN110110602A (en) | A kind of dynamic sign Language Recognition Method based on three-dimensional residual error neural network and video sequence | |
CN104077742B (en) | Human face sketch synthetic method and system based on Gabor characteristic | |
CN111276240A (en) | Multi-label multi-mode holographic pulse condition identification method based on graph convolution network | |
CN113688894A (en) | Fine-grained image classification method fusing multi-grained features | |
CN115966010A (en) | Expression recognition method based on attention and multi-scale feature fusion | |
CN114783034A (en) | Facial expression recognition method based on fusion of local sensitive features and global features | |
CN111582506A (en) | Multi-label learning method based on global and local label relation | |
CN111325237A (en) | Image identification method based on attention interaction mechanism | |
CN114530222A (en) | Cancer patient classification system based on multiomics and image data fusion | |
CN115797929A (en) | Small farmland image segmentation method and device based on double-attention machine system | |
CN116229179A (en) | Dual-relaxation image classification method based on width learning system | |
CN115310589A (en) | Group identification method and system based on depth map self-supervision learning | |
CN111259264A (en) | Time sequence scoring prediction method based on generation countermeasure network | |
Afzal et al. | Discriminative feature abstraction by deep L2 hypersphere embedding for 3D mesh CNNs | |
CN111242102B (en) | Fine-grained image recognition algorithm of Gaussian mixture model based on discriminant feature guide | |
CN109583406B (en) | Facial expression recognition method based on feature attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |