CN117496276B - Lung cancer cell morphology analysis and identification method and computer readable storage medium - Google Patents
Lung cancer cell morphology analysis and identification method and computer readable storage medium
- Publication number
- CN117496276B CN117496276B CN202311857682.3A CN202311857682A CN117496276B CN 117496276 B CN117496276 B CN 117496276B CN 202311857682 A CN202311857682 A CN 202311857682A CN 117496276 B CN117496276 B CN 117496276B
- Authority
- CN
- China
- Prior art keywords
- cell
- model
- data
- self
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V10/764 — Image or video recognition using pattern recognition or machine learning: classification, e.g. of video objects
- G06V10/82 — Image or video recognition using pattern recognition or machine learning: using neural networks
- G06V2201/03 — Recognition of patterns in medical or anatomical images
- G06N3/0455 — Neural network architectures: auto-encoder networks; encoder-decoder networks
- G06N3/0464 — Neural network architectures: convolutional networks [CNN, ConvNet]
- G06N3/0895 — Learning methods: weakly supervised learning, e.g. semi-supervised or self-supervised learning
- G06N5/01 — Knowledge-based models: dynamic search techniques; heuristics; dynamic trees; branch-and-bound
Abstract
The invention discloses a lung cancer cell morphology analysis and identification method and a computer readable storage medium. The method comprises the following steps: unlabeled data are divided into data to be self-supervised and data to be auxiliary-labeled; the data to be auxiliary-labeled are preprocessed and then auxiliary-labeled, producing fully labeled data; the data to be self-supervised are preprocessed and used for self-supervised pre-training, producing a self-supervised pre-trained model; the fully labeled data are fed into a cell detection model for training; after training, the cell detection model detects whether the cell sample contains suspicious positive cells; if so, the cells are passed to a cell classification model, which outputs the category of each cell in the data; the final diagnosis category is then determined by a decision tree. The invention accelerates the process from labeling cell data to cell detection, reduces manual labeling time, and improves the speed and efficiency of data labeling in the auxiliary diagnosis process, thereby improving working efficiency.
Description
Technical Field
The invention relates to the field of digital pathology, in particular to a lung cancer cell morphological analysis and identification method and a computer readable storage medium.
Background
Digital pathology is a branch of pathology that uses digitizing techniques to acquire, manage and interpret biological and clinical information. Digital pathology may provide more accurate, rapid and efficient pathology analysis than traditional pathology.
In digital pathology, cell detection technology is one of the core technologies. It relates to identifying and classifying cells on tissue sections, including determining whether cells are malignant. This technique requires a high resolution scanner, image analysis software and a large amount of memory space.
In the prior art, tissue samples are examined mainly under a traditional microscope, relying heavily on the subjective judgment of the pathologist, so speed, efficiency and accuracy are limited: each smear contains tens of thousands of cells, a cytopathologist must identify canceration one by one under the microscope and determine the canceration type, and a cytopathologist can diagnose at most 200 smears per day; moreover, primary hospitals may have no pathologist at all, so respiratory exfoliative cytology screening cannot reach the primary level, which affects the quality of primary medical care.
Cytological diagnosis of lung cancer examines exfoliated cells obtained by bronchofiberscope brushing, alveolar lavage fluid or sputum; a pathologist observes cell morphology and judges the disease type. Its value in lung cancer diagnosis includes the following aspects: 1. Early lung cancer often shows no obvious nodule clinically, but abnormal squamous epithelial cells that slough off can be expelled with sputum, so sputum exfoliative cytology or bronchofiberscope brushing is a simple and effective non-invasive method for diagnosing early lung squamous cell carcinoma; in addition, alveolar lavage fluid can collect adenocarcinoma cells and thus detect early lung adenocarcinoma. 2. The common pathological types of lung cancer are small cell lung cancer and non-small cell lung cancer, the latter divided into lung squamous carcinoma and lung adenocarcinoma; different pathological types call for different treatment schemes and carry different prognoses, and most cytological specimens can be correctly typed by morphological observation, which is of great value for treatment and prognosis evaluation. 3. For patients with advanced lung cancer, who cannot undergo histological biopsy or surgical resection, cytology is an ideal method for lung cancer diagnosis, pathological typing and guiding treatment regimens.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provide a morphological analysis and identification method for lung cancer cells.
The aim of the invention is achieved by the following technical scheme:
the morphological analysis and identification method of lung cancer cells comprises the following steps:
S1, dividing unlabeled data into data to be self-supervised and data to be auxiliary-labeled according to a preset proportion;
s2, performing auxiliary labeling after preprocessing the auxiliary labeling data, and outputting the data of the complete label;
meanwhile, preprocessing the data to be self-supervised, and then executing self-supervision pre-training to output a self-supervision training model;
s3, inputting the data with the complete label, which is marked in an auxiliary way, into a cell detection model, and carrying out model training; the initialization parameters of the model training are parameters of a self-supervision training model;
s4, detecting whether the cell sample is a suspicious positive cell or not by using the cell detection model after training;
S5, inputting the suspicious lung cancer positive cells into a cell classification model for classification, and outputting the category of each cell in the data, with 9 output categories in total: Adeno, SCC, SCC3, SCLC, SC, Columar, Garbage, Trash, WN;
the category labels denote: single adenocarcinoma cells, adenocarcinoma cell clusters, non-keratinized single squamous carcinoma cells, non-keratinized squamous carcinoma cell clusters, keratinized squamous carcinoma cells, small cell carcinoma, normal squamous epithelial cells, ciliated columnar epithelial cells, mixed cell clusters with detected abnormality, non-cellular objects, and alveolar cells.
S6, the cell classification model outputs the confidences of the 9 categories for each cell picture; the variance of these confidences is calculated for every picture, and after the variances of all detected cell pictures of a case have been calculated, the pictures with the largest variance vote on their categories; the winning category is the final diagnosis category, which may be: suspected adenocarcinoma, suspected squamous carcinoma, suspected small cell carcinoma, suspected atypical cells, or negative.
In step S2, the specific process of the auxiliary labeling is as follows:
S201, preprocessing the labeling data, wherein the preprocessing comprises the steps of eliminating digital pathological images with unqualified quality, and the situation of unqualified quality comprises the following steps: blank content, cell number less than a first preset value, imaging blurring, exposure not in a first preset range, and color deviation degree exceeding a second preset value;
S202, submitting a small amount of preprocessed data to manual labeling, wherein the manual labeling requires labeling positive cells in a complete digital pathological image;
s203, providing a small amount of data of the manual annotation for the auxiliary annotation model for training;
S204, using the auxiliary labeling model to generate candidate boxes that predict suspicious cells to be labeled; before prediction, the digital pathological image needs to be segmented into patches;
s205, performing Patch pretreatment on all the patches after segmentation, including blank Patch elimination and color normalization;
s206, predicting the preprocessed Patches by using an auxiliary labeling model, and generating a cell candidate frame to be labeled in each Patch;
s207, manually checking candidates to be marked in the marking tool, and screening candidate frames to remove candidate frames with sizes not in a second preset range;
S208, manually marking the appointed cell category in the candidate frame marked with the suspicious, thereby completing the marking operation of the candidate frame;
s209, generating a data set from all manually marked candidate frames and candidate frames which are not marked, and finishing marking.
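The size screening of step S207 can be sketched as a simple filter; the pixel bounds below are illustrative assumptions standing in for the "second preset range", which the patent does not quantify:

```python
def screen_boxes(boxes, min_side=12, max_side=300):
    """Drop candidate boxes whose width or height falls outside the
    preset size range. Boxes are (x1, y1, x2, y2) pixel tuples; the
    min/max side lengths here are illustrative assumptions."""
    kept = []
    for x1, y1, x2, y2 in boxes:
        w, h = x2 - x1, y2 - y1
        if min_side <= w <= max_side and min_side <= h <= max_side:
            kept.append((x1, y1, x2, y2))
    return kept

# A 5x5 speck and a 400x400 region are both rejected; a 50x50 cell box survives.
filtered = screen_boxes([(0, 0, 50, 50), (0, 0, 5, 5), (0, 0, 400, 400)])
```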
The auxiliary labeling model adopts Swin Transformer V2 + RetinaNet, with the Swin Transformer V2 and RetinaNet connected through a pyramid structure.
In step S2, the training process of the self-supervision pre-training model is as follows:
(1) Unified scaling is carried out on the non-marked picture data, and the non-marked picture data are segmented into a plurality of grids;
(2) Carrying out random masking on the divided grids according to the proportion of 75%, wherein the mask filling value is 0;
(3) Arranging each grid data containing the masked data in a one-dimensional vector mode, and merging the position information of the grid data on the image in a cosine coding mode;
(4) Embedding the masked data and the unmasked data into class marks, wherein the class marks comprise masked grids and unmasked grids;
(5) Inputting the encoded one-dimensional vector into an encoder and a decoder of the self-supervision training model;
(6) Carrying out hierarchical standardization on the characteristics output by the self-supervision training model, outputting the predicted pixel value of each masked grid, and restoring the complete image by utilizing the principle of three channels of RGB images;
(7) Calculating pixel difference value loss between the restored image predicted by the self-supervision training model and the original image, and optimizing model parameters according to the loss;
(8) After the steps (1) to (7) are finished, judging whether T iterations are finished, and if so, outputting a self-supervision training model; if not, performing the next iteration; t is the total iteration number set at the beginning of training.
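The masking of steps (1)–(2) can be sketched as follows; the 75% mask ratio and zero fill value come from the text, while the grid count and image size are illustrative assumptions:

```python
import numpy as np

def mask_patches(image, grid=16, mask_ratio=0.75, seed=0):
    """Split a square HxWx3 image into grid x grid cells and zero-fill a
    random mask_ratio fraction of them; returns the masked image and a
    boolean mask (True = masked cell)."""
    h, w, _ = image.shape
    ph, pw = h // grid, w // grid
    n = grid * grid
    rng = np.random.default_rng(seed)
    masked_idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    out = image.copy()
    for i in masked_idx:
        r, c = divmod(i, grid)
        out[r*ph:(r+1)*ph, c*pw:(c+1)*pw, :] = 0  # mask filling value is 0
    return out, mask

img = np.ones((64, 64, 3), dtype=np.float32)
masked, mask = mask_patches(img)  # 256 cells, 192 of them zeroed
```

The pre-training objective then asks the encoder–decoder to predict the pixel values of the zeroed cells and scores the reconstruction against the original image.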
The self-supervision training model is Swin Transformer V2.
In step S4, the cell detection model adopts Swin Transformer V2 + RetinaNet, the two connected through a pyramid structure; the cell detection model is based on the Swin Transformer self-attention mechanism and comprises a feature extractor and a detection head; the feature extractor comprises an encoder and a decoder, and the detection head maps the features output by the decoder into categories and the size and position of a target box; the categories are negative cells and suspicious positive cells, and each category is assigned a confidence ranging between 0 and 1.
In step S5, Swin Transformer V2 is adopted as the cell classification model; the cell classification model is based on the Swin Transformer self-attention mechanism and comprises a feature extractor and a classification head; the feature extractor includes an encoder and a decoder, and the classification head maps the features into a number of cell categories, assigning each cell category a confidence in the range of 0 to 1.
The Swin Transformer V2 includes an encoder and a decoder, both composed of multi-head attention blocks alternating with same-layer normalization, and the multi-head attention is expressed as follows:

Attention(Q, K, V) = SoftMax(QKᵀ / λ + B) V

wherein Q, K and V are the mapping matrices of the 3-channel cell image after the patch-partition (blocking) operation, namely the query matrix, key matrix and value matrix; Kᵀ is the transpose of the matrix K; B is the relative position offset term of each matrix; λ is a learnable scaling factor; a learnable class-balancing weight is additionally learned for the classification loss; SoftMax is the activation function for the multi-class problem; Attention is the multi-head self-attention output, which is followed by same-layer normalization and a fully connected layer for feature extraction or classification.
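A minimal NumPy sketch of the single-head case of the expression above (multi-head splitting and the learnable class-balancing weight are omitted, and the scaling factor λ is fixed to √d for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable SoftMax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, B, lam):
    """Single attention head: SoftMax(Q K^T / lam + B) V,
    with B the relative position bias and lam the scaling factor."""
    scores = Q @ K.T / lam + B
    return softmax(scores, axis=-1) @ V

n, d = 4, 8                       # 4 tokens (patches), 8-dim embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
B = np.zeros((n, n))              # zero bias for the sketch
out = attention(Q, K, V, B, lam=np.sqrt(d))
```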
RetinaNet uses Focal Loss for target detection scenarios in which there is an extreme imbalance between foreground and background classes during training.

The focal loss is introduced starting from the binary-classification cross entropy (CE) loss, which is calculated as:

CE(p, y) = −log(p) if y = 1; −log(1 − p) otherwise

wherein p is the estimated class probability of the candidate box; y is the true label value, taking −1 or 1, with y = 1 for a correct classification and −1 for an incorrect one. Define

p_t = p if y = 1; 1 − p otherwise

which represents the final output class probability of the model combined with the positive/negative label of the class. Then

CE(p_t) = −α_t log(p_t)

is the final cross entropy, where α_t denotes, for an input sample t, the weighting parameter of each cell category of the cross entropy in the neural network; by learning the α_t weight values, the model can handle the unbalanced distribution of different cell types in digital pathological sections.
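RetinaNet's published focal loss additionally multiplies the α-balanced cross entropy above by a modulating factor (1 − p_t)^γ, which down-weights easy, well-classified examples. A minimal sketch, using the RetinaNet default α = 0.25 and γ = 2 as assumptions rather than values from the patent:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one binary prediction.
    p: predicted foreground probability in (0, 1); y: true label in {-1, 1}."""
    p_t = p if y == 1 else 1.0 - p          # p_t as defined above
    a_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct example contributes almost nothing, which is how
# the loss copes with extreme foreground/background imbalance.
hard = focal_loss(0.1, 1)    # misclassified foreground: large loss
easy = focal_loss(0.99, 1)   # confident correct: near-zero loss
```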
In step S6, the threshold of the confidence of the 9 categories is obtained by the following formula:

P_c = (1/N) Σᵢ₌₁ᴺ [ max(p(xᵢ)) − Var(p(xᵢ)) ]

wherein P_c is the category confidence of the specified category, N is the number of cells the model predicted as this category, xᵢ is the i-th cell picture under the category, max(p(xᵢ)) is the maximum class probability of that cell, and Var(p(xᵢ)) is the variance of the model prediction.
Meanwhile, the invention provides:
A server comprising a processor and a memory, wherein at least one program is stored in the memory, the program being loaded and executed by the processor to implement the above lung cancer cell morphology analysis and identification method.
A computer readable storage medium having stored therein at least one program, the program being loaded and executed by a processor to implement the above lung cancer cell morphology analysis and identification method.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention reduces subjectivity in cell detection by introducing advanced image analysis technology and artificial intelligence algorithms, thereby improving the accuracy and consistency of diagnosis.
2. The invention accelerates the process from labeling cell data to cell detection, reduces manual labeling effort and time, and improves the speed and efficiency of data labeling in the auxiliary diagnosis process, thereby improving working efficiency.
3. The invention achieves a higher degree of automation, making cell detection more intelligent and reducing the workload of pathologists.
Drawings
FIG. 1 is a flow chart of a method for morphological analysis and identification of lung cancer cells according to the present invention.
FIG. 2 is a flow chart of the auxiliary labeling according to the present invention.
FIG. 3 is a flow chart of the self-supervised model training according to the present invention.
FIG. 4 is a schematic structural diagram of a cell detection model according to the present invention.
FIG. 5 is a schematic diagram of the structure of the cell classification model according to the present invention.
FIG. 6 is a schematic diagram of a decision tree corresponding to the final diagnostic category of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
As shown in fig. 1 to 6, the whole flow of the lung cancer cell morphological analysis and identification method is completed by an auxiliary labeling model (semi-automatic labeling) + self-supervised learning + cell detection model + cell classification model + diagnosis decision tree. Wherein:
Auxiliary labeling model: Swin Transformer V2 + RetinaNet is adopted, the two connected using a pyramid structure.
Self-supervised learning: Swin Transformer V2 is used.
Cell detection model: Swin Transformer V2 + RetinaNet is used, the two connected using a pyramid structure.
Cell classification model: swin Transformer V2 was used.
Diagnostic decision tree: a decision tree is employed.
The Swin Transformer V2 includes an encoder and a decoder, both composed of multi-head attention blocks alternating with same-layer normalization, and the multi-head attention is expressed as follows:

Attention(Q, K, V) = SoftMax(QKᵀ / λ + B) V

wherein Q, K and V are the mapping matrices of the 3-channel cell image after the patch-partition (blocking) operation, namely the query matrix, key matrix and value matrix; Kᵀ is the transpose of the matrix K; B is the relative position offset term of each matrix; λ is a learnable scaling factor; a learnable class-balancing weight is additionally learned for the classification loss; SoftMax is the activation function for the multi-class problem; Attention is the multi-head self-attention output, which is followed by same-layer normalization and a fully connected layer for feature extraction or classification.
RetinaNet uses Focal Loss for target detection scenarios in which there is an extreme imbalance (e.g., 1:1000) between foreground and background classes during training.

The focal loss is introduced starting from the binary-classification cross entropy (CE) loss, which is calculated as:

CE(p, y) = −log(p) if y = 1; −log(1 − p) otherwise

wherein p is the estimated class probability of the candidate box; y is the true label value, taking −1 or 1, with y = 1 for a correct classification and −1 for an incorrect one. Define

p_t = p if y = 1; 1 − p otherwise

which represents the final output class probability of the model combined with the positive/negative label of the class. Then

CE(p_t) = −α_t log(p_t)

is the final cross entropy, where α_t denotes, for an input sample t, the weighting parameter of each cell category of the cross entropy in the neural network; by learning the α_t weight values, the model can handle the unbalanced distribution of different cell types in digital pathological sections.
The final decision tree is shown in fig. 6, where the confidence threshold is obtained by:

P_c = (1/N) Σᵢ₌₁ᴺ [ max(p(xᵢ)) − Var(p(xᵢ)) ]

wherein P_c is the category confidence of the specified category, N is the number of cells the model predicted as this category, xᵢ is the i-th cell picture under the category, max(p(xᵢ)) is the maximum class probability of that cell, and Var(p(xᵢ)) is the variance of the model prediction.
Specifically, as shown in fig. 1, the morphological analysis and identification method of lung cancer cells comprises the following steps:
S1, dividing unlabeled data into data to be self-supervised and data to be auxiliary-labeled according to a 9:1 ratio;
s2, performing auxiliary labeling after preprocessing the auxiliary labeling data, and outputting the data of the complete label;
meanwhile, preprocessing the data to be self-supervised, and then executing self-supervision pre-training to output a self-supervision training model;
s3, inputting the data with the complete label, which is marked in an auxiliary way, into a cell detection model, and carrying out model training; the initialization parameters of the model training are parameters of a self-supervision training model;
s4, detecting whether the cell sample is a suspicious positive cell or not by using the cell detection model after training;
S5, inputting the suspicious lung cancer positive cells into a cell classification model for classification, and outputting the category of each cell in the data, with 9 output categories in total: Adeno, SCC, SCC3, SCLC, SC, Columar, Garbage, Trash, WN;
the category labels denote: single adenocarcinoma cells, adenocarcinoma cell clusters, non-keratinized single squamous carcinoma cells, non-keratinized squamous carcinoma cell clusters, keratinized squamous carcinoma cells, small cell carcinoma, normal squamous epithelial cells, ciliated columnar epithelial cells, mixed cell clusters with detected abnormality, non-cellular objects, and alveolar cells.
S6, the cell classification model outputs the confidences of the 9 categories for each cell picture; the variance of these confidences is calculated for every picture, and after the variances of all detected cell pictures of a case have been calculated, the 16 pictures with the largest variance vote on their categories; the winning category is the final diagnosis category, which may be: suspected adenocarcinoma, suspected squamous carcinoma, suspected small cell carcinoma, suspected atypical cells, or negative.
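A minimal sketch of the variance-based voting in step S6, assuming each detected cell comes with a per-category confidence dictionary (only 3 of the 9 categories are shown for brevity, and population variance is used as an assumption):

```python
import statistics
from collections import Counter

def diagnose(cell_confidences, top_k=16):
    """Case-level diagnosis by variance-weighted voting: the top_k cell
    pictures whose confidence vectors have the largest variance each vote
    for their highest-confidence category; the majority category wins."""
    scored = []
    for conf in cell_confidences:
        var = statistics.pvariance(conf.values())   # variance of one picture's confidences
        label = max(conf, key=conf.get)             # that picture's predicted category
        scored.append((var, label))
    scored.sort(key=lambda t: t[0], reverse=True)   # most "decisive" pictures first
    votes = Counter(label for _, label in scored[:top_k])
    return votes.most_common(1)[0][0]

# Three confident Adeno cells outvote twenty near-uniform (low-variance) ones.
example = ([{"Adeno": 0.9, "SCC": 0.05, "WN": 0.05}] * 3
           + [{"Adeno": 0.34, "SCC": 0.33, "WN": 0.33}] * 20)
```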
In this way, complete auxiliary diagnosis of bronchial cytology digital pathological images is realized, the amount of manually labeled data is reduced, and the cost of labeling cell data is lowered. Meanwhile, unlabeled data is used for model pre-training, which shortens model training time and improves accuracy on the verification data set. The result is an auxiliary diagnosis system for bronchial cytology digital pathological images that includes semi-automatic data collection and training.
Referring to fig. 2, in step S2, the specific process of the auxiliary labeling is as follows:
S201, preprocessing the labeling data, wherein the preprocessing comprises the steps of eliminating digital pathological images with unqualified quality, and the situation of unqualified quality comprises the following steps: blank content, cell number less than a first preset value, imaging blurring, exposure not in a first preset range, and color deviation degree exceeding a second preset value;
blank content: refers to an area observed under a scanner or microscope without any cells or substances. This may be due to improper sample preparation or incorrect microscope setup.
The number of cells is less than a first predetermined value: meaning that the number of cells observed under a scanner or microscope is less than expected. This may be due to improper sample preparation, incorrect microscope setup, or insufficient viewing area. It is generally set up that the number of cells in a digital slice should be greater than 1000.
Imaging blur: refers to blurring of an image observed under a scanner or microscope. This may be due to improper sample preparation, incorrect microscope setup, lens contamination, or incorrect focal length.
The exposure is not within the first preset range, which manifests as overexposure or underexposure.
Overexposure: meaning that the image observed under a scanner or microscope is too bright and details of the cell or substance cannot be clearly observed. This may be due to incorrect microscope settings or excessively long exposure times. The degree of overexposure is based on the inability of the picture to identify cell boundaries or nuclear contours. Typically appearing as a bright white picture.
Underexposure: the image observed under a scanner or microscope is too dark, and the details of the cells or substances cannot be clearly observed. This may be due to incorrect microscope settings or too short an exposure time. Underexposure is judged by the inability to identify cell boundaries or nuclear contours in the picture. It usually appears as a dark picture.
The degree of color deviation exceeds a second preset value: the color of the image observed under a scanner or microscope does not correspond to the actual color. This may be due to incorrect microscope settings or an incorrect color temperature of the light source. Color deviation is judged by the red channel mean of the cell picture exceeding 220, the blue channel mean exceeding 200, or the green channel mean exceeding 200.
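The quality checks of step S201 can be collected into a single filter. In this minimal NumPy sketch, only the cell-count threshold (>1000) and channel-mean thresholds (R>220, G>200, B>200) come from the text; the blank and exposure cutoffs are assumed values, and a sharpness (blur) test is omitted for brevity.

```python
import numpy as np

def passes_quality_checks(img, cell_count):
    """Reject unusable digital pathological images per step S201.

    img: (H, W, 3) uint8 RGB array; cell_count: detected cell count.
    """
    if cell_count <= 1000:                        # too few cells
        return False
    if img.std() < 5:                             # near-uniform image: blank content (assumed cutoff)
        return False
    if img.mean() > 240 or img.mean() < 15:       # over-/under-exposed (assumed bounds)
        return False
    r, g, b = (img[..., c].mean() for c in range(3))
    if r > 220 or g > 200 or b > 200:             # color deviation beyond the stated channel means
        return False
    return True
```

A production filter would add a blur measure (e.g. variance of a Laplacian response) for the imaging-blur criterion.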
S202, submitting a small amount of preprocessed data to manual labeling, wherein the manual labeling requires labeling positive cells in a complete digital pathological image;
s203, providing a small amount of data of the manual annotation for the auxiliary annotation model for training;
S204, generating suspicious cell candidate frames to be labeled with the auxiliary labeling model; for this, the digital pathological image needs to be segmented into patches of 1024x1024, 2048x2048 or 4096x4096. Different sizes are selected according to different scanner magnifications.
S205, performing Patch pretreatment on all the patches after segmentation, including blank Patch elimination and color normalization;
s206, predicting the preprocessed Patches by using an auxiliary labeling model, and generating a cell candidate frame to be labeled in each Patch;
s207, manually checking candidates to be marked in the marking tool, and screening candidate frames to remove candidate frames with sizes not in a second preset range;
The size is not within the second preset range, manifested as oversized or undersized: the digitally observed cell or substance size does not correspond to the actual size. This may be due to incorrect microscope settings or an incorrect magnification. The default scan magnification is 20x; cell pictures scanned at more than 20x appear too large, and cell pictures scanned at less than 20x appear too small.
S208, manually marking the appointed cell category in the candidate frame marked with the suspicious, thereby completing the marking operation of the candidate frame;
s209, generating a data set from all manually marked candidate frames and candidate frames which are not marked, and finishing marking.
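Steps S204-S205 (patch segmentation and Patch preprocessing) can be sketched as below. The tiling logic follows the text; the blank-tile threshold and the per-channel mean/std normalization are illustrative assumptions, not the patent's exact scheme.

```python
import numpy as np

def split_into_patches(slide, patch=1024):
    """Tile a slide (H, W, 3) into non-overlapping patch x patch tiles,
    dropping near-blank tiles and applying a simple per-channel
    mean/std color normalization. Patch sizes of 1024/2048/4096 are
    chosen according to the scanner magnification.
    """
    h, w, _ = slide.shape
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            t = slide[y:y + patch, x:x + patch].astype(np.float32)
            if t.std() < 1.0:                      # near-uniform tile: treat as blank (assumed)
                continue
            mu = t.mean(axis=(0, 1))               # per-channel mean
            sd = t.std(axis=(0, 1)) + 1e-6         # per-channel std
            tiles.append((t - mu) / sd)            # crude color normalization
    return tiles
```

Stain-aware normalization (e.g. Macenko or a learned StainNet-style mapping) would be a drop-in replacement for the mean/std step.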
The auxiliary labeling model adopts Swin Transformer V2 + RetinaNet, with Swin Transformer V2 and RetinaNet connected through a feature pyramid structure.
The auxiliary labeling of the invention has the following advantages:
1. Efficiency is improved: the semi-automatic labeling method can remarkably improve the labeling efficiency of the cell target detection task. Compared to fully manual labeling, the labeling personnel only need to participate in part of the work, e.g. selecting the region of interest or labeling some key points, which can save a lot of time and human resources.
2. Reducing annotation errors: manual labeling may have labeling errors, while semi-automatic labeling methods may reduce these errors by using computer vision algorithms, and may more accurately detect cells or objects, thereby reducing the error rate of labeling.
3. Consistency and accuracy: the semi-automatic labeling method helps to improve the consistency and accuracy of labeling because the computer algorithm maintains consistent labeling rules between different images. This helps to ensure consistent labeling between different samples in the dataset, improving reliability of training and performance assessment of the model.
4. The cost is saved: the semiautomatic labeling method can reduce labor cost of labeling. Especially in the case of large-scale datasets, the use of semi-automatic labeling can significantly reduce the economic cost of labeling.
5. Acceleration model training: labeling is one of the key steps in training a deep learning model. By the semi-automatic labeling method, a large-scale labeled data set can be generated more quickly, so that the training process of the model is accelerated, and the cell target detection model can be developed and optimized more quickly.
6. Handling large-scale data: in a cell target detection task, it is often necessary to process large-scale image data. Semi-automatic labeling makes processing large-scale data feasible, because it can generate labeling data more quickly without excessive manpower and time.
The semi-automatic labeling method has remarkable advantages in a cell target detection task, can improve efficiency and accuracy, reduce cost, accelerate a model training process and is beneficial to better meeting the requirement of large-scale data labeling. However, it should be noted that semi-automatic labeling methods typically require careful design and verification to ensure that the labels that are generated remain of high quality.
As shown in fig. 3, in step S2, the training process of the self-supervised pre-training model is as follows:
(1) Uniformly scaling the unlabeled picture data to 448x448 and dividing it into NxN grids, where N may be, for example, 14 or 19, adjusted according to the model training effect.
(2) Carrying out random masking on the divided grids according to the proportion of 75%, wherein the mask filling value is 0;
(3) Arranging each grid data containing the masked data in a one-dimensional vector mode, and merging the position information of the grid data on the image in a cosine coding mode;
(4) Embedding the masked data and the unmasked data into class marks, wherein the class marks comprise masked grids and unmasked grids;
(5) Inputting the encoded one-dimensional vector into an encoder and a decoder of the self-supervision training model;
(6) Carrying out hierarchical standardization on the characteristics output by the self-supervision training model, outputting the predicted pixel value of each masked grid, and restoring the complete image by utilizing the principle of three channels of RGB images;
(7) Calculating pixel difference value loss between the restored image predicted by the self-supervision training model and the original image, and optimizing model parameters according to the loss;
(8) After the steps (1) to (7) are finished, judging whether T iterations are finished, and if so, outputting a self-supervision training model; if not, performing the next iteration; t is the total iteration number set at the beginning of training.
The self-supervision training model is Swin Transformer V2.
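Steps (1)-(2) of the pre-training pipeline above can be sketched in NumPy. This shows only the grid masking (75% ratio, fill value 0); the flattening to one-dimensional vectors, cosine position coding, Swin Transformer V2 encoder/decoder, and pixel loss are omitted.

```python
import numpy as np

def mask_grid(image, n=14, mask_ratio=0.75, seed=0):
    """Split a square image (S, S, 3) into an n x n grid and randomly
    mask 75% of the grid cells with the fill value 0. Returns the
    masked image and the boolean grid mask.
    """
    s = image.shape[0] // n                            # grid cell side length
    rng = np.random.default_rng(seed)
    n_masked = int(round(n * n * mask_ratio))
    mask = np.zeros(n * n, dtype=bool)
    mask[rng.choice(n * n, size=n_masked, replace=False)] = True
    out = image.copy()
    for k in np.flatnonzero(mask):
        r, c = divmod(k, n)
        out[r * s:(r + 1) * s, c * s:(c + 1) * s] = 0  # mask fill value 0
    return out, mask.reshape(n, n)
```

With the default 448x448 input and N = 14, each grid cell is 32x32 pixels and 147 of the 196 cells are masked.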
The self-supervised learning of the present invention is an unsupervised learning method in which the model learns the characterization from unlabeled data without the need for external tags.
1. Data efficiency: self-supervised learning maximizes the utilization of data by allowing pre-training with large-scale unlabeled data. In the biomedical field, cell images and data are often expensive and time-consuming to collect, so existing unlabeled data can be fully exploited using self-supervised learning, thereby reducing data collection costs.
2. And (3) feature learning: self-supervised model pre-training helps learn rich and generic feature representations. These representations may capture key information in the image, such as cell shape, texture, color, etc., to aid in cell detection and classification tasks. The model may learn sensitivity to different cellular features in the image during the pre-training phase, which helps to improve performance of subsequent tasks.
3. Data enhancement: self-supervised learning typically involves a variety of data enhancement techniques, which help make the model more robust and increase its ability to adapt to changes in different illumination, scale, noise, etc. This is particularly useful for cell detection and classification tasks, as biological images may be affected by various interfering factors.
4. Multitask learning: the self-supervised model may be used for multitask learning, handling multiple related tasks such as cell detection and classification simultaneously. This helps the model learn more comprehensive knowledge and thus perform well across multiple tasks.
Overall, self-supervised model pre-training provides versatile benefits for cell detection and classification tasks, including better feature learning, data efficiency, transfer learning, and data enhancement. These benefits can improve the performance of the model, reduce reliance on labeled data, and help address challenges in biomedical image analysis.
As shown in fig. 4, in step S4, the cell detection model adopts Swin Transformer V2 + RetinaNet, with Swin Transformer V2 and RetinaNet connected through a feature pyramid structure; the cell detection model is based on the Swin Transformer self-attention mechanism, and comprises a feature extractor and a detection head; the feature extractor comprises an encoder and a decoder, and the detection head maps the features output by the decoder into the category, size and position of a target frame; the categories are negative cells and suspicious positive cells, and each category is assigned a confidence level ranging between 0 and 1.
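For reference, the focal loss conventionally used to train RetinaNet-style detection heads (and detailed in claim 5 of this document) can be sketched as follows. The alpha/gamma defaults are the common literature values, an assumption rather than the patent's settings.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for binary labels y in {-1, 1}, where p is the
    predicted foreground probability: FL = -alpha_t (1 - p_t)^gamma log(p_t).
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class balancing weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))
```

The (1 - p_t)^gamma factor down-weights easy (well-classified) examples, which is what lets the detector cope with the extreme foreground/background imbalance.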
Referring to fig. 5, in step S5, Swin Transformer V2 is used as the cell classification model; the cell classification model is based on the Swin Transformer self-attention mechanism, and comprises a feature extractor and a classification head; the feature extractor comprises an encoder and a decoder, and the classification head maps the features into a number of cell categories and assigns each cell category a confidence level in the range of 0 to 1.
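The self-attention mechanism shared by both models can be sketched in single-head form. This is a didactic sketch with an optional relative positional bias and temperature, not the patented multi-head Swin Transformer V2 implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v, bias=None, tau=None):
    """Single-head scaled dot-product attention with an optional
    relative positional bias B and a temperature tau, as used in
    Swin-style blocks.
    """
    tau = float(np.sqrt(q.shape[-1])) if tau is None else tau
    scores = q @ k.T / tau               # similarity of queries and keys
    if bias is not None:
        scores = scores + bias           # relative positional offset term
    return softmax(scores, axis=-1) @ v  # weighted sum of values
```

In Swin Transformer V2 the dot product is replaced by a cosine similarity with a learnable tau; the structure of the computation is otherwise the same.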
Meanwhile, the invention provides:
A server, comprising a processor and a memory, wherein at least one program is stored in the memory, and the program is loaded and executed by the processor to implement the above lung cancer cell morphology analysis and identification method.
A computer readable storage medium, having stored therein at least one program that is loaded and executed by a processor to implement the above-described lung cancer cell morphology analysis and identification method.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.
Claims (8)
1. The morphological analysis and identification method of lung cancer cells is characterized by comprising the following steps:
S1, dividing unlabeled data without labels into data to be self-supervised and data to be auxiliary labeled according to a preset proportion;
s2, performing auxiliary labeling after preprocessing the auxiliary labeling data, and outputting the data of the complete label;
The specific process of the auxiliary labeling is as follows:
S201, preprocessing the labeling data, wherein the preprocessing comprises the steps of eliminating digital pathological images with unqualified quality, and the situation of unqualified quality comprises the following steps: blank content, cell number less than a first preset value, imaging blurring, exposure not in a first preset range, and color deviation degree exceeding a second preset value;
S202, submitting a small amount of preprocessed data to manual labeling, wherein the manual labeling requires labeling positive cells in a complete digital pathological image;
s203, providing a small amount of data of the manual annotation for the auxiliary annotation model for training;
s204, generating suspicious cell candidate frames to be labeled with the auxiliary labeling model, wherein the digital pathological image is required to be segmented into patches;
s205, performing Patch pretreatment on all the patches after segmentation, including blank Patch elimination and color normalization;
s206, predicting the preprocessed Patches by using an auxiliary labeling model, and generating a cell candidate frame to be labeled in each Patch;
s207, manually checking candidates to be marked in the marking tool, and screening candidate frames to remove candidate frames with sizes not in a second preset range;
S208, manually marking the appointed cell category in the candidate frame marked with the suspicious, thereby completing the marking operation of the candidate frame;
s209, generating a data set from all manually marked candidate frames and candidate frames which are not marked, and finishing marking;
meanwhile, preprocessing the data to be self-supervised, and then executing self-supervision pre-training to output a self-supervision training model;
The training process of the self-supervision pre-training model is as follows:
(1) Unified scaling is carried out on the non-marked picture data, and the non-marked picture data are segmented into a plurality of grids;
(2) Carrying out random masking on the divided grids according to the proportion of 75%, wherein the mask filling value is 0;
(3) Arranging each grid data containing the masked data in a one-dimensional vector mode, and merging the position information of the grid data on the image in a cosine coding mode;
(4) Embedding the masked data and the unmasked data into class marks, wherein the class marks comprise masked grids and unmasked grids;
(5) Inputting the encoded one-dimensional vector into an encoder and a decoder of the self-supervision training model;
(6) Carrying out hierarchical standardization on the characteristics output by the self-supervision training model, outputting the predicted pixel value of each masked grid, and restoring the complete image by utilizing the principle of three channels of RGB images;
(7) Calculating pixel difference value loss between the restored image predicted by the self-supervision training model and the original image, and optimizing model parameters according to the loss;
(8) After the steps (1) to (7) are finished, judging whether T iterations are finished, and if so, outputting a self-supervision training model; if not, performing the next iteration; t is the total iteration number set at the beginning of training;
s3, inputting the data with the complete label, which is marked in an auxiliary way, into a cell detection model, and carrying out model training; the initialization parameters of the model training are parameters of a self-supervision training model;
s4, detecting whether the cell sample is a suspicious positive cell or not by using the cell detection model after training;
S5, inputting the suspicious lung cancer positive cells into a cell classification model for classification, and outputting the category of each cell in the data, wherein the total output number of the categories is 9: adeno, SCC, SCC3, SCLC, SC, columar, garbage, trash, WN;
S6, the cell classification model outputs the confidence of each of the 9 categories for every cell picture; the variance of each picture's confidence vector is calculated; after the variances of all detected cell pictures of a case have been computed, the plurality of pictures with the largest variance vote on the category, and the winning category is the final diagnosis category; the final diagnosis categories comprise: suspected adenocarcinoma, suspected squamous carcinoma, suspected small cell carcinoma, suspected atypical cells, negative.
2. The method according to claim 1, wherein in step S4, the cell detection model adopts Swin Transformer V2 + RetinaNet, with Swin Transformer V2 and RetinaNet connected through a feature pyramid structure; the cell detection model is based on the Swin Transformer self-attention mechanism, and comprises a feature extractor and a detection head; the feature extractor comprises an encoder and a decoder, and the detection head maps the features output by the decoder into the category, size and position of a target frame; the categories are negative cells and suspicious positive cells, and each category is assigned a confidence level ranging between 0 and 1.
3. The method according to claim 1, wherein in step S5, Swin Transformer V2 is used as the cell classification model; the cell classification model is based on the Swin Transformer self-attention mechanism, and comprises a feature extractor and a classification head; the feature extractor comprises an encoder and a decoder, and the classification head maps the features into a number of cell categories and assigns each cell category a confidence level in the range of 0 to 1.
4. The method for morphological analysis and identification of lung cancer cells according to claim 2 or 3, wherein Swin Transformer V2 comprises an encoder and a decoder, both of which are composed of multi-head attention mechanism blocks alternately connected with same-layer normalization; the expression of the multi-head attention mechanism model is as follows:

Attention(Q, K, V) = α · SoftMax(QK^T / λ + B) · V;

wherein Q, K, V are the mapping matrices obtained after the blocking operation on the 3-channel cell image, meaning respectively the query matrix, the key matrix and the value matrix; K^T is the transpose of the matrix K; B is the relative positional offset term of each matrix; λ is a learnable scaling factor; α is a learnable class balancing weight; SoftMax is the activation function of the multi-classification problem; Attention is the output of the multi-head self-attention parameters, which is followed by same-layer normalization and fully connected layers for feature extraction or classification.
5. The method of claim 2, wherein the Focal Loss of RetinaNet is used for the target detection scenario in which there is an extreme imbalance between foreground and background categories during training;

the focal loss is introduced starting from the cross entropy CE of binary classification, which is calculated as:

CE(p, y) = -log(p) if y = 1; CE(p, y) = -log(1 - p) if y = -1;

wherein p is the estimated class probability of the candidate frame; y is the true label value, taking the value -1 or 1, where y = 1 denotes a correct classification and y = -1 an incorrect one;

p_t = p if y = 1; p_t = 1 - p otherwise;

p_t represents the final output class probability of the model combined with the positive and negative labels of the class;

CE(p_t) = -α_t · log(p_t);

CE(p_t) is the final cross entropy, and the meaning of α_t is: for a sample t of the input model, the weighting parameter of each cell category of the cross entropy in the neural network; by learning the weight value of α_t, the problem of unbalanced distribution of different types of cells in the digital pathological section can be solved.
6. The method for morphological analysis and identification of lung cancer cells according to claim 1, wherein in step S6, the threshold value of the confidence levels of the 9 categories is obtained by the following formula:

C = Var(max(p(x_1)), ..., max(p(x_N)));

wherein C is the category confidence of the specified category; N is the number of all cells of this class predicted by the model; x_i is the i-th cell picture under the category; max(p(x_i)) represents the maximum probability value of the cell class; Var(·) represents the variance predicted in the cell classification model.
7. A server comprising a processor and a memory, wherein the memory stores at least one program, and the program is loaded and executed by the processor to implement the lung cancer cell morphology analysis and identification method of any one of claims 1 to 6.
8. A computer readable storage medium, wherein at least one program is stored in the storage medium, and the program is loaded and executed by a processor to implement the method for morphological analysis and identification of lung cancer cells according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311857682.3A CN117496276B (en) | 2023-12-29 | 2023-12-29 | Lung cancer cell morphology analysis and identification method and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117496276A CN117496276A (en) | 2024-02-02 |
CN117496276B true CN117496276B (en) | 2024-04-19 |
Family
ID=89681465
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311857682.3A Active CN117496276B (en) | 2023-12-29 | 2023-12-29 | Lung cancer cell morphology analysis and identification method and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117496276B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020081504A1 (en) * | 2018-10-15 | 2020-04-23 | Upmc | Systems and methods for specimen interpretation |
WO2022233916A1 (en) * | 2021-05-05 | 2022-11-10 | The Institute Of Cancer Research: Royal Cancer Hospital | Analysis of histopathology samples |
CN115761342A (en) * | 2022-11-21 | 2023-03-07 | 中国科学院微电子研究所 | Lung CT image pneumonia classification method, device and equipment |
WO2023051377A1 (en) * | 2021-09-30 | 2023-04-06 | 北京地平线信息技术有限公司 | Desensitization method and apparatus for image data |
CN116612351A (en) * | 2023-05-24 | 2023-08-18 | 西南交通大学 | Urban rail vehicle bottom anomaly detection method based on multi-scale mask feature self-encoder |
CN116883994A (en) * | 2023-05-31 | 2023-10-13 | 温州医科大学 | Method, device and storage medium for identifying non-small cell lung cancer peripheral tissue pathological types based on self-supervision learning |
CN117173232A (en) * | 2023-07-27 | 2023-12-05 | 北京邮电大学 | Depth image acquisition method, device and equipment |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008005426A2 (en) * | 2006-06-30 | 2008-01-10 | University Of South Florida | Computer-aided pathological diagnosis system |
US11545237B2 (en) * | 2017-09-26 | 2023-01-03 | Visiongate, Inc. | Morphometric genotyping of cells in liquid biopsy using optical tomography |
US11893482B2 (en) * | 2019-11-14 | 2024-02-06 | Microsoft Technology Licensing, Llc | Image restoration for through-display imaging |
- 2023-12-29: CN application CN202311857682.3A granted as patent CN117496276B (status: Active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11756318B2 (en) | Convolutional neural networks for locating objects of interest in images of biological samples | |
CN108364288B (en) | Segmentation method and device for breast cancer pathological image | |
CN109903284B (en) | HER2 immunohistochemical image automatic discrimination method and system | |
Kang et al. | Stainnet: a fast and robust stain normalization network | |
CN110245657B (en) | Pathological image similarity detection method and detection device | |
Pan et al. | Cell detection in pathology and microscopy images with multi-scale fully convolutional neural networks | |
CN115088022A (en) | Federal learning system for training machine learning algorithms and maintaining patient privacy | |
US20210214765A1 (en) | Methods and systems for automated counting and classifying microorganisms | |
CN110796661B (en) | Fungal microscopic image segmentation detection method and system based on convolutional neural network | |
CN112990214A (en) | Medical image feature recognition prediction model | |
CN115526834A (en) | Immunofluorescence image detection method and device, equipment and storage medium | |
CN115909006A (en) | Mammary tissue image classification method and system based on convolution Transformer | |
CN115170518A (en) | Cell detection method and system based on deep learning and machine vision | |
CN111583226A (en) | Cytopathological infection evaluation method, electronic device, and storage medium | |
CN111047559A (en) | Method for rapidly detecting abnormal area of digital pathological section | |
CN114387596A (en) | Automatic interpretation system for cytopathology smear | |
CN113470041B (en) | Immunohistochemical cell image cell nucleus segmentation and counting method and system | |
CN116912240B (en) | Mutation TP53 immunology detection method based on semi-supervised learning | |
CN116468690B (en) | Subtype analysis system of invasive non-mucous lung adenocarcinoma based on deep learning | |
CN117036288A (en) | Tumor subtype diagnosis method for full-slice pathological image | |
CN117496276B (en) | Lung cancer cell morphology analysis and identification method and computer readable storage medium | |
CN116309333A (en) | WSI image weak supervision pathological analysis method and device based on deep learning | |
Galton et al. | Ontological Levels in Histological Imaging. | |
Taher et al. | Morphology analysis of sputum color images for early lung cancer diagnosis | |
CN114782948A (en) | Global interpretation method and system for cervical liquid-based cytology smear |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |